Conditional Distribution

If necessary, install the Mosaic package.

install.packages("mosaic")

Data in a Contingency Table (Using “do” Command)

Enter data into a data frame using the do() command. This requires use of the Mosaic package. For example, let’s say we want to enter the following contingency table, which represents survival status on the Titanic.

Status	Men	Women	Boys	Girls
Survived	334	318	29	27
Died	1360	104	35	18

Let’s call the column variable Person and the row variable Status.

library(mosaic)
Titanic_df <- rbind(
do(334)*data.frame(Person="Men",Status="Survived"),
do(318)*data.frame(Person="Women",Status="Survived"),
do(29)*data.frame(Person="Boys",Status="Survived"),
do(27)*data.frame(Person="Girls",Status="Survived"),
do(1360)*data.frame(Person="Men",Status="Died"),
do(104)*data.frame(Person="Women",Status="Died"),
do(35)*data.frame(Person="Boys",Status="Died"),
do(18)*data.frame(Person="Girls",Status="Died")
)
tally(~Status+Person,data=Titanic_df)    #Build contingency table from data frame

##           Person
## Status     Boys Girls  Men Women
##   Died       35    18 1360   104
##   Survived   29    27  334   318
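
If the Mosaic package is not available, the same data frame can be built in base R. The following is only a sketch of one alternative, not the method used in this guide; the names counts and titanic_long are hypothetical and chosen for illustration.

```r
# One row per cell of the contingency table, with its count n
counts <- data.frame(
  Person = rep(c("Men", "Women", "Boys", "Girls"), times = 2),
  Status = rep(c("Survived", "Died"), each = 4),
  n      = c(334, 318, 29, 27, 1360, 104, 35, 18)
)

# Repeat each row n times so that every passenger becomes one row
titanic_long <- counts[rep(seq_len(nrow(counts)), times = counts$n),
                       c("Person", "Status")]
rownames(titanic_long) <- NULL

# Rebuild the contingency table from the expanded data frame
titanic_table <- table(Status = titanic_long$Status, Person = titanic_long$Person)
titanic_table
```

The resulting table should match the one produced by tally() above, because both start from the same cell counts.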

Now, let’s determine the conditional distribution using the tally command on the data in the data frame.

tally(~ response variable | explanatory variable,margins=TRUE,format="proportion",data = data_file)

So, to construct a conditional distribution of survival status by person, use the following command. Note that margins is set to FALSE.

tally(~Status|Person,margins=FALSE,format="proportion",data=Titanic_df)    #Conditional distribution by type of person

##           Person
## Status          Boys     Girls       Men     Women
##   Died     0.5468750 0.4000000 0.8028335 0.2464455
##   Survived 0.4531250 0.6000000 0.1971665 0.7535545

Among the men on the Titanic, 80.3% died; among the women, 24.6% died.
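
To see where these proportions come from, each count can be divided by its column total. The sketch below does this in base R with the counts from the table above; the object names titanic_counts, col_totals, and cond_by_person are hypothetical.

```r
# Counts from the Titanic table, entered column by column
# (rows = Status, columns = Person)
titanic_counts <- matrix(
  c(334, 1360,   # Men:   Survived, Died
    318, 104,    # Women: Survived, Died
    29, 35,      # Boys:  Survived, Died
    27, 18),     # Girls: Survived, Died
  nrow = 2,
  dimnames = list(Status = c("Survived", "Died"),
                  Person = c("Men", "Women", "Boys", "Girls"))
)

# Total number of people in each column (each type of person)
col_totals <- colSums(titanic_counts)

# Divide every column by its own total to condition on Person
cond_by_person <- sweep(titanic_counts, 2, col_totals, "/")
round(cond_by_person, 4)
```

For example, 1360 of the 1694 men died, and 1360/1694 is about 0.803, which agrees with the tally() output.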

Data in a Contingency Table (Matrix)

Now, let’s construct a conditional distribution from a contingency table. We will work with the contingency table in Section 4.4, Table 9.

To create a conditional distribution in R, use the following command:

variable <- prop.table(table, 1 or 2)

Note: Use 1 to condition by the row variable (each row then sums to 1); use 2 to condition by the column variable (each column then sums to 1).

To find the conditional distribution from data summarized in a contingency table, we must enter the data as a matrix.

The matrix command takes the cell counts as a single vector, which is built using the c( ) command. In addition, you must specify the number of rows (nrow) and the number of columns (ncol). Finally, you name the rows and columns using dimnames along with list.

Notice how the cells are entered into the matrix (all entries in first column, then second column, and so on). With dimnames, name the row values first, then the column values.

Table9 <- matrix(
  c(9607, 570, 11662,      # Did Not Finish High School
    34625, 1274, 26426,    # High School Graduate
    36370, 1170, 19861,    # Some College
    57102, 1305, 20841),   # Bachelor's Degree or Higher
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate", "Some College", "Bachelor's Degree or Higher")
  )
)
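
The column-by-column fill order is easy to check on a small matrix. The sketch below uses made-up values 1 through 6 and hypothetical names (m, m_byrow) purely for illustration; byrow = TRUE is the standard matrix argument for filling row by row instead.

```r
# Default behavior: values fill the first column, then the second, and so on
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
m

# With byrow = TRUE, values fill the first row, then the second row
m_byrow <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE,
                  dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
m_byrow
```

In m, the values 1 and 2 land in column c1; in m_byrow, the values 1, 2, and 3 land in row r1.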

Does a higher level of education play a role in employment status? Let’s condition by level of education to find out. Because level of education is the column variable, use 2 in the prop.table command.

Table9_Condition <- prop.table(Table9, 2)
Table9_Condition

##                        Did Not Finish High School High School Graduate
## Employed                                0.4399011           0.55555556
## Unemployed                              0.0261001           0.02044124
## Not in the Labor Force                  0.5339988           0.42400321
##                        Some College Bachelor's Degree or Higher
## Employed                 0.63361265                  0.72054815
## Unemployed               0.02038292                  0.01646729
## Not in the Labor Force   0.34600443                  0.26298455

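Notice in Table9_Condition that as the level of education increases, the proportion employed also increases, from about 44.0% of those who did not finish high school to about 72.1% of those with a bachelor's degree or higher.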
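
A quick way to confirm which variable was conditioned on is to check which sums equal 1. The sketch below re-enters Table9 so that it runs on its own; the names by_education and by_status are hypothetical.

```r
# Same counts as Table9, entered column by column
Table9 <- matrix(
  c(9607, 570, 11662, 34625, 1274, 26426,
    36370, 1170, 19861, 57102, 1305, 20841),
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate",
      "Some College", "Bachelor's Degree or Higher")
  )
)

# Condition by the column variable (level of education)
by_education <- prop.table(Table9, 2)
colSums(by_education)   # every column sums to 1

# Condition by the row variable (employment status)
by_status <- prop.table(Table9, 1)
rowSums(by_status)      # every row sums to 1
```

Conditioning by row answers a different question: among the employed, for instance, what proportion falls into each level of education.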

Raw Data

Now, let’s learn how to create a conditional distribution from raw data. To do so, we use the tally command in the Mosaic package.

Load the HomeRun_2014 data.

HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)

##        Date           Hitter HitterTeam           Pitcher PitcherTeam INN
## 1 9/28/2014   Rizzo, Anthony        CHC       Fiers, Mike         MIL   1
## 2 9/28/2014 Bernadina, Roger        LAD      Scahill, Rob         COL   6
## 3 9/28/2014     Duvall, Adam         SF     Stauffer, Tim          SD   4
## 4 9/28/2014      Duda, Lucas        NYM Foltynewicz, Mike         HOU   8
##          Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1     Miller Park      441       109.1       22.7        86.7   81   PL
## 2 Dodger Stadi...      424       113.2       27.7        62.3   98   ND
## 3       AT&T Park      423       103.6       31.9       112.9   98   ND
## 4      Citi Field      417       106.3       26.5        73.0   83   PL

The tally command builds a conditional distribution from raw data using the following general form.

tally(~ response variable | explanatory variable,margins=TRUE,format="proportion",data = data_file)

For the HomeRun_2014 data, let’s say we want to determine if inning plays a role in the type of home run hit. That is, we want to find the conditional distribution of type by inning, so inning (INN) is the explanatory variable and Type is the response variable. Set margins to FALSE.

library(mosaic)
tally(~Type|INN,margins=FALSE,format="proportion",data=HomeRun)

##      INN
## Type            1           2           3           4           5           6
##   ITP 0.000000000 0.000000000 0.000000000 0.001908397 0.002004008 0.000000000
##   JE  0.333996024 0.313402062 0.354838710 0.307251908 0.320641283 0.275154004
##   ND  0.143141153 0.177319588 0.148883375 0.148854962 0.174348697 0.188911704
##   PL  0.522862823 0.509278351 0.496277916 0.541984733 0.503006012 0.535934292
##      INN
## Type            7           8           9          10          11          12
##   ITP 0.002242152 0.010152284 0.002881844 0.000000000 0.000000000 0.000000000
##   JE  0.327354260 0.327411168 0.342939481 0.368421053 0.269230769 0.352941176
##   ND  0.139013453 0.167512690 0.149855908 0.131578947 0.115384615 0.176470588
##   PL  0.531390135 0.494923858 0.504322767 0.500000000 0.615384615 0.470588235
##      INN
## Type           13          14          15          16          19
##   ITP 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
##   JE  0.125000000 0.200000000 0.000000000 1.000000000 1.000000000
##   ND  0.125000000 0.000000000 0.000000000 0.000000000 0.000000000
##   PL  0.750000000 0.800000000 1.000000000 0.000000000 0.000000000

Among all home runs hit in the 2nd inning, 31.3% were just enough (JE).
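
The same kind of conditional distribution can be sketched in base R by combining table() with prop.table(). The data frame toy below is a small made-up example, not the HomeRun_2014 data, and the names toy, counts, and cond are hypothetical.

```r
# Hypothetical raw data: one row per home run, with inning and type
toy <- data.frame(
  INN  = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  Type = c("PL", "PL", "JE", "ND", "JE", "JE", "PL", "ND", "PL")
)

# Count the home runs for each combination of Type (rows) and INN (columns)
counts <- table(Type = toy$Type, INN = toy$INN)

# Condition by the column variable (INN) so each inning's column sums to 1
cond <- prop.table(counts, margin = 2)
cond
```

With the real data, replacing toy with HomeRun should reproduce the proportions from tally(), since both divide each cell count by its inning total.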