If necessary, install the mosaic package.
install.packages("mosaic")
Enter data into a data frame using the do() command, which requires the mosaic package. For example, suppose we want to enter the following contingency table, which shows survival status on the Titanic.
| Status | Men | Women | Boys | Girls |
|---|---|---|---|---|
| Survived | 334 | 318 | 29 | 27 |
| Died | 1360 | 104 | 35 | 18 |
Let’s call the column variable Person and the row variable Status.
library(mosaic)
Titanic_df <- rbind(
do(334)*data.frame(Person="Men",Status="Survived"),
do(318)*data.frame(Person="Women",Status="Survived"),
do(29)*data.frame(Person="Boys",Status="Survived"),
do(27)*data.frame(Person="Girls",Status="Survived"),
do(1360)*data.frame(Person="Men",Status="Died"),
do(104)*data.frame(Person="Women",Status="Died"),
do(35)*data.frame(Person="Boys",Status="Died"),
do(18)*data.frame(Person="Girls",Status="Died")
)
tally(~Status+Person,data=Titanic_df) #Build contingency table from data frame
## Person
## Status Boys Girls Men Women
## Died 35 18 1360 104
## Survived 29 27 334 318
Now, let’s determine the frequency marginal distribution and the relative frequency marginal distribution. Use the tally command on the data frame with margins=TRUE; add format="proportion" to get relative frequencies.
tally(~Status+Person,margins=TRUE,data=Titanic_df) #Frequency marginal distribution
## Person
## Status Boys Girls Men Women Total
## Died 35 18 1360 104 1517
## Survived 29 27 334 318 708
## Total 64 45 1694 422 2225
tally(~Status+Person,margins=TRUE,format="proportion",data=Titanic_df) #Relative frequency marginal distribution.
## Person
## Status Boys Girls Men Women Total
## Died 0.015730337 0.008089888 0.611235955 0.046741573 0.681797753
## Survived 0.013033708 0.012134831 0.150112360 0.142921348 0.318202247
## Total 0.028764045 0.020224719 0.761348315 0.189662921 1.000000000
From the frequency marginal distribution, we see that 1517 passengers died. The proportion of passengers who died is 0.682 (or 68.2%).
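As a cross-check that does not depend on do(), the same counts can be expanded in base R. This is a sketch of an alternative, not the approach used above; the names `counts`, `titanic_rows`, and `with_margins` are illustrative. Note that base R's addmargins labels the margins "Sum" rather than "Total".

```r
# Cell counts copied from the Titanic contingency table above
counts <- data.frame(
  Person = rep(c("Men", "Women", "Boys", "Girls"), times = 2),
  Status = rep(c("Survived", "Died"), each = 4),
  Freq   = c(334, 318, 29, 27, 1360, 104, 35, 18)
)

# Repeat each row Freq times so that there is one row per individual
titanic_rows <- counts[rep(seq_len(nrow(counts)), times = counts$Freq),
                       c("Person", "Status")]

# Build the contingency table and append the marginal totals
tab <- table(Status = titanic_rows$Status, Person = titanic_rows$Person)
with_margins <- addmargins(tab)

died_total <- with_margins["Died", "Sum"]               # marginal frequency
died_prop  <- died_total / with_margins["Sum", "Sum"]   # marginal relative frequency

died_total            # 1517
round(died_prop, 3)   # 0.682
```

Both values agree with the tally output above.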
When there is a large number of observations in each cell of a contingency table, the do command should not be used (because it requires too many iterations). Instead, we enter the data as a matrix. We will work with the contingency table in Section 4.4, Table 9.
The matrix command takes the cell counts as a vector built with c(). In addition, you must specify the number of rows (nrow) and the number of columns (ncol). Finally, you name the rows and columns using dimnames along with list.
Notice how the cells are entered into the matrix (all entries in first column, then second column, and so on). With dimnames, name the row values first, then the column values.
Table9 <- matrix(
  c(9607, 570, 11662,      # column 1: Did Not Finish High School
    34625, 1274, 26426,    # column 2: High School Graduate
    36370, 1170, 19861,    # column 3: Some College
    57102, 1305, 20841),   # column 4: Bachelor's Degree or Higher
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate", "Some College", "Bachelor's Degree or Higher")
  )
)
Table9
## Did Not Finish High School High School Graduate
## Employed 9607 34625
## Unemployed 570 1274
## Not in the Labor Force 11662 26426
## Some College Bachelor's Degree or Higher
## Employed 36370 57102
## Unemployed 1170 1305
## Not in the Labor Force 19861 20841
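The column-by-column fill order is easy to confirm on a small example. The sketch below uses two throwaway matrices (`m_col` and `m_row` are illustrative names) and also shows the byrow = TRUE argument, which may be more convenient when the counts are typed in row by row.

```r
# Default behavior: values fill the first column, then the second, and so on
m_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE fills the first row, then the second
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

m_col[1, ]   # 1 3 5
m_row[1, ]   # 1 2 3
```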
Freq_Employed <- sum(Table9[1,]) #Frequency of "Employed"
Freq_Unemployed <- sum(Table9[2,]) #Frequency of "Unemployed"
Freq_NoLaborForce <- sum(Table9[3,]) #Frequency of "Not in the Labor Force"
Freq_Employed
## [1] 137704
Freq_Unemployed
## [1] 4319
Freq_NoLaborForce
## [1] 78790
Freq_NoHS <- sum(Table9[,1]) #Frequency of "Did Not Finish High School"
Freq_HS <- sum(Table9[,2]) #Frequency of "High School Graduate"
Freq_SC <- sum(Table9[,3]) #Frequency of "Some College"
Freq_Bachelors <- sum(Table9[,4]) #Frequency of "Bachelor's Degree or Higher"
Freq_NoHS
## [1] 21839
Freq_HS
## [1] 62325
Freq_SC
## [1] 57401
Freq_Bachelors
## [1] 79248
n <- sum(Table9) #Total number of observations in the contingency table
RelFreq_Employed <- Freq_Employed/n #Relative Frequency of "Employed"
RelFreq_Employed
## [1] 0.6236227
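Rather than summing one row or column at a time, all of the marginal frequencies can be computed at once with the base R functions rowSums and colSums. The following is a minimal sketch that rebuilds Table9 so it runs on its own; `row_rel` and `col_rel` are illustrative names.

```r
# Same counts as Table9 above, entered column by column
Table9 <- matrix(
  c(9607, 570, 11662, 34625, 1274, 26426,
    36370, 1170, 19861, 57102, 1305, 20841),
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate",
      "Some College", "Bachelor's Degree or Higher")
  )
)

n <- sum(Table9)                  # total number of observations

row_rel <- rowSums(Table9) / n    # relative frequency of each employment status
col_rel <- colSums(Table9) / n    # relative frequency of each education level

round(row_rel, 4)
round(col_rel, 4)
```

Each vector sums to 1, and the "Employed" entry of `row_rel` matches RelFreq_Employed above.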
We can also obtain the marginal distribution using the addmargins command.
Table9_WithMargins <- addmargins(Table9) # addmargins appends the row and column sums
Table9_WithMargins
## Did Not Finish High School High School Graduate
## Employed 9607 34625
## Unemployed 570 1274
## Not in the Labor Force 11662 26426
## Sum 21839 62325
## Some College Bachelor's Degree or Higher Sum
## Employed 36370 57102 137704
## Unemployed 1170 1305 4319
## Not in the Labor Force 19861 20841 78790
## Sum 57401 79248 220813
Notice in Table9_WithMargins that there is a “Sum” for the row variable and a “Sum” for the column variable. For example, we can see that there are 137,704 thousand individuals who are employed and 21,839 thousand individuals who did not finish high school.
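A relative frequency version of the same table can be sketched by converting the counts to proportions with the base R function prop.table before adding the margins. The name `Table9_RelFreq` is illustrative, and Table9 is rebuilt here only so the example runs on its own.

```r
# Same counts as Table9 above, entered column by column
Table9 <- matrix(
  c(9607, 570, 11662, 34625, 1274, 26426,
    36370, 1170, 19861, 57102, 1305, 20841),
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate",
      "Some College", "Bachelor's Degree or Higher")
  )
)

# prop.table divides every cell by the grand total;
# addmargins then appends the row and column sums of those proportions
Table9_RelFreq <- addmargins(prop.table(Table9))

round(Table9_RelFreq, 4)
```

The "Sum" column then holds the relative frequency marginal distribution of employment status, and the "Sum" row holds that of education level.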
Now, let’s learn how to create a contingency table and obtain the marginal distributions from raw data. To do so, we use the tally command in the mosaic package.
Load the HomeRun_2014 data.
HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)
## Date Hitter HitterTeam Pitcher PitcherTeam INN
## 1 9/28/2014 Rizzo, Anthony CHC Fiers, Mike MIL 1
## 2 9/28/2014 Bernadina, Roger LAD Scahill, Rob COL 6
## 3 9/28/2014 Duvall, Adam SF Stauffer, Tim SD 4
## 4 9/28/2014 Duda, Lucas NYM Foltynewicz, Mike HOU 8
## Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1 Miller Park 441 109.1 22.7 86.7 81 PL
## 2 Dodger Stadi... 424 113.2 27.7 62.3 98 ND
## 3 AT&T Park 423 103.6 31.9 112.9 98 ND
## 4 Citi Field 417 106.3 26.5 73.0 83 PL
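The general form of the tally command for building a marginal distribution is
tally(~ response_variable + explanatory_variable, margins=TRUE, data = data_file)
For example, use inning (INN) as the row variable and home run type (Type) as the column variable.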
library(mosaic)
tally(~INN+Type,margins=TRUE,data=HomeRun)
## Type
## INN ITP JE ND PL Total
## 1 0 168 72 263 503
## 2 0 152 86 247 485
## 3 0 143 60 200 403
## 4 1 161 78 284 524
## 5 1 160 87 251 499
## 6 0 134 92 261 487
## 7 1 146 62 237 446
## 8 4 129 66 195 394
## 9 1 119 52 175 347
## 10 0 14 5 19 38
## 11 0 7 3 16 26
## 12 0 6 3 8 17
## 13 0 1 1 6 8
## 14 0 1 0 4 5
## 15 0 0 0 1 1
## 16 0 1 0 0 1
## 19 0 1 0 0 1
## Total 8 1343 667 2167 4185
Notice that 503 home runs were hit in the first inning (the Total column). A total of 8 home runs were inside-the-park home runs (ITP, in the Total row). There were 72 “No Doubt” (ND) home runs hit in the first inning.
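Individual values such as these can also be pulled out of a table by row and column name instead of being read off the printout. The sketch below uses base R only and copies just the counts for innings 1 through 3 from the output above, so its column sums cover those three innings rather than the full season; `hr` and `hr_margins` are illustrative names. Base R's addmargins labels the margins "Sum" where tally prints "Total".

```r
# Counts for innings 1-3 copied from the tally output above (a subset, for brevity)
hr <- matrix(
  c(0, 168, 72, 263,
    0, 152, 86, 247,
    0, 143, 60, 200),
  nrow = 3, byrow = TRUE,   # byrow = TRUE because the counts are typed row by row
  dimnames = list(INN = c("1", "2", "3"),
                  Type = c("ITP", "JE", "ND", "PL"))
)

hr_margins <- addmargins(hr)

hr_margins["1", "ND"]    # "No Doubt" home runs in the first inning: 72
hr_margins["1", "Sum"]   # all home runs in the first inning: 503
```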
If you want a relative frequency marginal distribution, use the format = "proportion" option.
library(mosaic)
tally(~INN+Type,margins=TRUE,format="proportion",data=HomeRun)
## Type
## INN ITP JE ND PL Total
## 1 0.0000000000 0.0401433692 0.0172043011 0.0628434886 0.1201911589
## 2 0.0000000000 0.0363201912 0.0205495818 0.0590203106 0.1158900836
## 3 0.0000000000 0.0341696535 0.0143369176 0.0477897252 0.0962962963
## 4 0.0002389486 0.0384707288 0.0186379928 0.0678614098 0.1252090800
## 5 0.0002389486 0.0382317802 0.0207885305 0.0599761051 0.1192353644
## 6 0.0000000000 0.0320191159 0.0219832736 0.0623655914 0.1163679809
## 7 0.0002389486 0.0348864994 0.0148148148 0.0566308244 0.1065710872
## 8 0.0009557945 0.0308243728 0.0157706093 0.0465949821 0.0941457587
## 9 0.0002389486 0.0284348865 0.0124253286 0.0418160096 0.0829151732
## 10 0.0000000000 0.0033452808 0.0011947431 0.0045400239 0.0090800478
## 11 0.0000000000 0.0016726404 0.0007168459 0.0038231780 0.0062126643
## 12 0.0000000000 0.0014336918 0.0007168459 0.0019115890 0.0040621266
## 13 0.0000000000 0.0002389486 0.0002389486 0.0014336918 0.0019115890
## 14 0.0000000000 0.0002389486 0.0000000000 0.0009557945 0.0011947431
## 15 0.0000000000 0.0000000000 0.0000000000 0.0002389486 0.0002389486
## 16 0.0000000000 0.0002389486 0.0000000000 0.0000000000 0.0002389486
## 19 0.0000000000 0.0002389486 0.0000000000 0.0000000000 0.0002389486
## Total 0.0019115890 0.3209080048 0.1593787336 0.5178016726 1.0000000000
The highest proportion of home runs (0.125) was hit in the 4th inning.
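With this many rows, the largest marginal proportion can be located programmatically rather than by eye. The sketch below copies the inning totals from the Total column of the frequency table above and uses the base R function which.max; `inning_totals` and `inning_prop` are illustrative names.

```r
# Inning totals copied from the "Total" column of the frequency table above
inning_totals <- c(503, 485, 403, 524, 499, 487, 446, 394, 347,
                   38, 26, 17, 8, 5, 1, 1, 1)
names(inning_totals) <- c(1:16, 19)   # innings 17 and 18 do not appear in the data

# Marginal relative frequency of each inning
inning_prop <- inning_totals / sum(inning_totals)

names(which.max(inning_prop))   # inning with the highest proportion: "4"
round(max(inning_prop), 3)      # 0.125
```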