If necessary, install the Mosaic package.
install.packages("mosaic")
Enter the data into a data frame using the do() command from the mosaic package. For example, suppose we want to enter the following contingency table, which shows survival status on the Titanic.
Status | Men | Women | Boys | Girls |
---|---|---|---|---|
Survived | 334 | 318 | 29 | 27 |
Died | 1360 | 104 | 35 | 18 |
Let’s call the column variable Person and the row variable Status.
library(mosaic)
Titanic_df <- rbind(
do(334)*data.frame(Person="Men",Status="Survived"),
do(318)*data.frame(Person="Women",Status="Survived"),
do(29)*data.frame(Person="Boys",Status="Survived"),
do(27)*data.frame(Person="Girls",Status="Survived"),
do(1360)*data.frame(Person="Men",Status="Died"),
do(104)*data.frame(Person="Women",Status="Died"),
do(35)*data.frame(Person="Boys",Status="Died"),
do(18)*data.frame(Person="Girls",Status="Died")
)
tally(~Status+Person,data=Titanic_df) #Build contingency table from data frame
##           Person
## Status     Boys Girls  Men Women
##   Died       35    18 1360   104
##   Survived   29    27  334   318
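As a cross-check on the do() approach, the same data frame can be sketched in base R by repeating each row of a small table of counts. This is an illustrative alternative, not the method used in the text; the names counts, titanic_base, and titanic_tab are made up for this example.

```r
# Counts from the Titanic table above, one row per Person/Status combination.
counts <- data.frame(
  Person = rep(c("Men", "Women", "Boys", "Girls"), times = 2),
  Status = rep(c("Survived", "Died"), each = 4),
  n      = c(334, 318, 29, 27, 1360, 104, 35, 18)
)

# Repeat each row index n times to get one row per passenger.
titanic_base <- counts[rep(seq_len(nrow(counts)), times = counts$n),
                       c("Person", "Status")]
rownames(titanic_base) <- NULL

# Cross-tabulate with base R; the levels are sorted alphabetically.
titanic_tab <- table(Status = titanic_base$Status, Person = titanic_base$Person)
titanic_tab
```

The resulting data frame has one row for each of the 2225 people in the table, so the counts printed by table() should agree with the contingency table entered above.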
Now, let’s determine the conditional distribution using the tally command on the data in the data frame.
tally(~ response_variable | explanatory_variable, margins = TRUE, format = "proportion", data = data_file)
So, to construct a conditional distribution of survival status by person, use the following command. Note that margins is set to FALSE.
tally(~Status|Person,margins=FALSE,format="proportion",data=Titanic_df) #Conditional distribution by type of person
##           Person
## Status          Boys     Girls       Men     Women
##   Died     0.5468750 0.4000000 0.8028335 0.2464455
##   Survived 0.4531250 0.6000000 0.1971665 0.7535545
Among the men on the Titanic, 80.3% died; among the women, 24.6% died.
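Each conditional proportion is a cell count divided by the total for that group. The short sketch below redoes the arithmetic for men and women in base R; the vector names died, survived, and prop_died are chosen only for this illustration.

```r
# Counts for men and women from the Titanic table.
died     <- c(Men = 1360, Women = 104)
survived <- c(Men = 334,  Women = 318)

# Proportion who died within each group: deaths / group total.
prop_died <- died / (died + survived)

# Express as percentages rounded to one decimal place.
round(100 * prop_died, 1)
##   Men Women 
##  80.3  24.6
```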
Now, let’s construct a conditional distribution from a contingency table. We will work with the contingency table in Section 4.4, Table 9.
To create a conditional distribution in R, use the following command:
variable <- prop.table(table, 1 or 2)
Note: Use 1 to condition by the row variable; use 2 to condition by the column variable.
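To see what the 1 or 2 changes, here is a minimal sketch on a made-up 2 x 2 table (the counts and the names toy, by_row, and by_col are hypothetical). With 1, each row of proportions sums to 1; with 2, each column does.

```r
# Hypothetical 2 x 2 table of counts (illustration only, not from the text).
toy <- matrix(c(10, 30, 20, 40), nrow = 2,
              dimnames = list(Row = c("r1", "r2"), Col = c("c1", "c2")))

by_row <- prop.table(toy, 1)  # condition on the row variable
by_col <- prop.table(toy, 2)  # condition on the column variable

rowSums(by_row)  # every row sums to 1
colSums(by_col)  # every column sums to 1
```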
To find the conditional distribution from data summarized in a contingency table, we must enter the data as a matrix.
The matrix command takes the cell counts as a single vector built with c(). In addition, you must specify the number of rows (nrow) and the number of columns (ncol). Finally, you name the rows and columns using dimnames along with list.
Notice how the cells are entered into the matrix (all entries in first column, then second column, and so on). With dimnames, name the row values first, then the column values.
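The column-by-column fill order is easy to get wrong, so here is a small sketch using the Titanic counts from earlier. It also shows the byrow = TRUE argument of matrix, which is not used in the text but lets you type the counts row by row, as they read in a printed table. The object names by_column and by_row are illustrative.

```r
rows <- c("Survived", "Died")
cols <- c("Men", "Women", "Boys", "Girls")

# Default: fill column by column (first column, then second, and so on).
by_column <- matrix(c(334, 1360, 318, 104, 29, 35, 27, 18),
                    nrow = 2, ncol = 4, dimnames = list(rows, cols))

# byrow = TRUE: fill row by row instead.
by_row <- matrix(c(334, 318, 29, 27, 1360, 104, 35, 18),
                 nrow = 2, ncol = 4, byrow = TRUE, dimnames = list(rows, cols))

# Both calls build the same matrix.
identical(by_column, by_row)
```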
Table9 <- matrix(
  c(9607, 570, 11662,     # Did Not Finish High School
    34625, 1274, 26426,   # High School Graduate
    36370, 1170, 19861,   # Some College
    57102, 1305, 20841),  # Bachelor's Degree or Higher
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate", "Some College", "Bachelor's Degree or Higher")
  )
)
Does a higher level of education play a role in employment status? Let’s condition by level of education to find out. Because level of education is the column variable, use 2 in the prop.table command.
Table9_Condition <- prop.table(Table9, 2)
Table9_Condition
##                        Did Not Finish High School High School Graduate
## Employed                                0.4399011           0.55555556
## Unemployed                              0.0261001           0.02044124
## Not in the Labor Force                  0.5339988           0.42400321
##                        Some College Bachelor's Degree or Higher
## Employed                 0.63361265                  0.72054815
## Unemployed               0.02038292                  0.01646729
## Not in the Labor Force   0.34600443                  0.26298455
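Notice in Table9_Condition that as the level of education increases, the proportion employed also increases, from 0.440 for those who did not finish high school to 0.721 for those with a bachelor's degree or higher.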
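Two quick checks can help confirm that the table was conditioned on the intended variable: each column of proportions should sum to 1, and the Employed row can be pulled out on its own to compare across education levels. The sketch below rebuilds Table9 so that it runs by itself; the name employed_pct is introduced only for this example.

```r
# Rebuild Table9 (same counts as above) so this block is self-contained.
Table9 <- matrix(
  c(9607, 570, 11662, 34625, 1274, 26426,
    36370, 1170, 19861, 57102, 1305, 20841),
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate",
      "Some College", "Bachelor's Degree or Higher")
  )
)
Table9_Condition <- prop.table(Table9, 2)

# Conditioning by column means every column sums to 1.
colSums(Table9_Condition)

# Percent employed at each education level, rounded to one decimal place.
employed_pct <- round(100 * Table9_Condition["Employed", ], 1)
employed_pct
```

If the column sums are not all 1, the table was most likely conditioned by row (1) instead of by column (2).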
Now, let’s learn how to create a conditional distribution from raw data. To do so, we use the tally command in the Mosaic package.
Load the HomeRun_2014 data.
HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)
##        Date           Hitter HitterTeam           Pitcher PitcherTeam INN
## 1 9/28/2014   Rizzo, Anthony        CHC       Fiers, Mike         MIL   1
## 2 9/28/2014 Bernadina, Roger        LAD      Scahill, Rob         COL   6
## 3 9/28/2014     Duvall, Adam         SF     Stauffer, Tim          SD   4
## 4 9/28/2014      Duda, Lucas        NYM Foltynewicz, Mike         HOU   8
##          Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1     Miller Park      441       109.1       22.7        86.7   81   PL
## 2 Dodger Stadi...      424       113.2       27.7        62.3   98   ND
## 3       AT&T Park      423       103.6       31.9       112.9   98   ND
## 4      Citi Field      417       106.3       26.5        73.0   83   PL
The general form of the tally command for a conditional distribution is:
tally(~ response_variable | explanatory_variable, margins = TRUE, format = "proportion", data = data_file)
For the HomeRun_2014 data, suppose we want to determine whether the inning plays a role in the type of home run hit. To answer this, we find the conditional distribution of type by inning, so inning (INN) is the explanatory variable. Set margins to FALSE.
library(mosaic)
tally(~Type|INN,margins=FALSE,format="proportion",data=HomeRun)
##      INN
## Type            1           2           3           4           5           6
##   ITP 0.000000000 0.000000000 0.000000000 0.001908397 0.002004008 0.000000000
##   JE  0.333996024 0.313402062 0.354838710 0.307251908 0.320641283 0.275154004
##   ND  0.143141153 0.177319588 0.148883375 0.148854962 0.174348697 0.188911704
##   PL  0.522862823 0.509278351 0.496277916 0.541984733 0.503006012 0.535934292
##      INN
## Type            7           8           9          10          11          12
##   ITP 0.002242152 0.010152284 0.002881844 0.000000000 0.000000000 0.000000000
##   JE  0.327354260 0.327411168 0.342939481 0.368421053 0.269230769 0.352941176
##   ND  0.139013453 0.167512690 0.149855908 0.131578947 0.115384615 0.176470588
##   PL  0.531390135 0.494923858 0.504322767 0.500000000 0.615384615 0.470588235
##      INN
## Type           13          14          15          16          19
##   ITP 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   JE  0.125000000 0.200000000 0.000000000 1.000000000 1.000000000
##   ND  0.125000000 0.000000000 0.000000000 0.000000000 0.000000000
##   PL  0.750000000 0.800000000 1.000000000 0.000000000 0.000000000
Among all home runs hit in the 2nd inning, 31.3% were just enough (JE).
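With raw data, a conditional distribution like this one can also be sketched in base R by combining table and prop.table, which should give the same kind of proportions as tally with format="proportion". The example below uses a tiny made-up data frame (toy_hr, 8 invented rows) in place of the HomeRun data so that it runs without a download; only the column names INN and Type follow the real file.

```r
# Hypothetical stand-in for the HomeRun data (8 made-up rows, illustration only).
toy_hr <- data.frame(
  INN  = c(1, 1, 1, 1, 2, 2, 2, 2),
  Type = c("PL", "PL", "JE", "ND", "JE", "JE", "PL", "ND")
)

# Rows are the response (Type), columns the explanatory variable (INN);
# margin = 2 conditions on the column variable, INN.
type_by_inn <- prop.table(table(Type = toy_hr$Type, INN = toy_hr$INN), margin = 2)
type_by_inn

# Pull out a single conditional distribution, here the 2nd inning.
type_by_inn[, "2"]
```

Indexing a single column this way is convenient when the full table is wide, as it is for the real data with its many innings.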