Marginal Distributions

If necessary, install the mosaic package.

install.packages("mosaic")

Data in a Contingency Table (Using “do” Command)

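Enter data into a data frame using the do() command from the mosaic package. The do() command repeats an expression a given number of times, so do(334)*data.frame(...) produces 334 identical rows, one for each individual in that cell. For example, suppose we want to enter the following contingency table, which represents survival status on the Titanic.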

Status	Men	Women	Boys	Girls
Survived	334	318	29	27
Died	1360	104	35	18

Let’s call the column variable Person and the row variable Status.

library(mosaic)
Titanic_df <- rbind(
do(334)*data.frame(Person="Men",Status="Survived"),
do(318)*data.frame(Person="Women",Status="Survived"),
do(29)*data.frame(Person="Boys",Status="Survived"),
do(27)*data.frame(Person="Girls",Status="Survived"),
do(1360)*data.frame(Person="Men",Status="Died"),
do(104)*data.frame(Person="Women",Status="Died"),
do(35)*data.frame(Person="Boys",Status="Died"),
do(18)*data.frame(Person="Girls",Status="Died")
)
tally(~Status+Person,data=Titanic_df)    #Build contingency table from data frame

##           Person
## Status     Boys Girls  Men Women
##   Died       35    18 1360   104
##   Survived   29    27  334   318
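The same data frame can also be built without mosaic. The following is a minimal base R sketch, offered as an alternative rather than as the method used above: it stores the eight cell counts in a small data frame and uses rep() to repeat each row as many times as its count. The names cells and Titanic_base are illustrative.

```r
# Cell counts from the Titanic contingency table above
cells <- data.frame(
  Person = rep(c("Men", "Women", "Boys", "Girls"), times = 2),
  Status = rep(c("Survived", "Died"), each = 4),
  Count  = c(334, 318, 29, 27, 1360, 104, 35, 18)
)

# Repeat each row index Count times to get one row per individual
Titanic_base <- cells[rep(seq_len(nrow(cells)), times = cells$Count),
                      c("Person", "Status")]
rownames(Titanic_base) <- NULL

nrow(Titanic_base)                               # total number of individuals
table(Titanic_base$Status, Titanic_base$Person)  # counts by Status and Person
```

The table() call at the end should show the same eight counts as the tally() output, which is a quick check that the data were entered correctly.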

Now, let’s determine the frequency marginal distribution and relative frequency marginal distribution. Use the tally command on the data in the data frame.

tally(~Status+Person,margins=TRUE,data=Titanic_df)    #Frequency marginal distribution

##           Person
## Status     Boys Girls  Men Women Total
##   Died       35    18 1360   104  1517
##   Survived   29    27  334   318   708
##   Total      64    45 1694   422  2225

tally(~Status+Person,margins=TRUE,format="proportion",data=Titanic_df)  #Relative frequency marginal distribution.

##           Person
## Status            Boys       Girls         Men       Women       Total
##   Died     0.015730337 0.008089888 0.611235955 0.046741573 0.681797753
##   Survived 0.013033708 0.012134831 0.150112360 0.142921348 0.318202247
##   Total    0.028764045 0.020224719 0.761348315 0.189662921 1.000000000

From the frequency marginal distribution, we see that 1517 passengers died. The proportion of passengers who died is 0.682 (or 68.2%).
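To see where these two numbers come from, the arithmetic can be done by hand. The sketch below copies the cell counts from the table above into two vectors (the names died, survived, n_died, and n_all are illustrative) and adds them up.

```r
# Cell counts copied from the contingency table above
died     <- c(Boys = 35, Girls = 18, Men = 1360, Women = 104)
survived <- c(Boys = 29, Girls = 27, Men = 334,  Women = 318)

n_died <- sum(died)                   # marginal frequency of "Died"
n_all  <- sum(died) + sum(survived)   # total number of individuals

n_died                     # 1517
round(n_died / n_all, 3)   # 0.682
```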

Data in a Contingency Table (Using Matrix)

When there is a large number of observations in each cell of a contingency table, the do command should not be used (because it requires too many iterations). Instead, we enter the data as a matrix. We will work with the contingency table in Section 4.4, Table 9.

The matrix command takes the cell counts as a single vector, built with the c( ) command. In addition, you must specify the number of rows (nrow) and the number of columns (ncol). Finally, you name the rows and columns using dimnames along with list.

Notice how the cells are entered into the matrix (all entries in first column, then second column, and so on). With dimnames, name the row values first, then the column values.

Table9 <- matrix(
  c(9607, 570, 11662,      # Did Not Finish High School
    34625, 1274, 26426,    # High School Graduate
    36370, 1170, 19861,    # Some College
    57102, 1305, 20841),   # Bachelor's Degree or Higher
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate", "Some College", "Bachelor's Degree or Higher")
  )
)
Table9

##                        Did Not Finish High School High School Graduate
## Employed                                     9607                34625
## Unemployed                                    570                 1274
## Not in the Labor Force                      11662                26426
##                        Some College Bachelor's Degree or Higher
## Employed                      36370                       57102
## Unemployed                     1170                        1305
## Not in the Labor Force        19861                       20841
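The fill order is easy to check on a small matrix. The sketch below uses the numbers 1 through 6 (the names m_bycol and m_byrow are illustrative) to show the default column-by-column fill, along with the byrow = TRUE option of matrix, which fills row by row instead.

```r
# Default: values fill down the first column, then the second, and so on
m_bycol <- matrix(1:6, nrow = 2, ncol = 3)
m_bycol
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# byrow = TRUE: values fill across the first row, then the second
m_byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_byrow
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```

If a printed table is easier to read across the rows, byrow = TRUE lets you type the counts in that order.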

Freq_Employed <- sum(Table9[1,])  #Frequency of "Employed"
Freq_Unemployed <- sum(Table9[2,])  #Frequency of "Unemployed"
Freq_NoLaborForce <- sum(Table9[3,])  #Frequency of "Not in the Labor Force"
Freq_Employed

## [1] 137704

Freq_Unemployed

## [1] 4319

Freq_NoLaborForce

## [1] 78790

Freq_NoHS <- sum(Table9[,1])  #Frequency of "Did Not Finish High School"
Freq_HS <- sum(Table9[,2])  #Frequency of "High School Graduate"
Freq_SC <- sum(Table9[,3])  #Frequency of "Some College"
Freq_Bachelors <- sum(Table9[,4])  #Frequency of "Bachelor's Degree or Higher"
Freq_NoHS

## [1] 21839

Freq_HS

## [1] 62325

Freq_SC

## [1] 57401

Freq_Bachelors

## [1] 79248

n <- sum(Table9)   #Total number of observations in the contingency table
RelFreq_Employed <- Freq_Employed/n   #Relative Frequency of "Employed"
RelFreq_Employed

## [1] 0.6236227
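Summing one row or column at a time works, but base R can also compute every marginal total in one step. The sketch below re-enters Table9 so that it runs on its own, then uses rowSums and colSums; the names row_freq, col_freq, row_relfreq, and col_relfreq are illustrative.

```r
Table9 <- matrix(
  c(9607, 570, 11662, 34625, 1274, 26426,
    36370, 1170, 19861, 57102, 1305, 20841),
  nrow = 3, ncol = 4,
  dimnames = list(
    c("Employed", "Unemployed", "Not in the Labor Force"),
    c("Did Not Finish High School", "High School Graduate",
      "Some College", "Bachelor's Degree or Higher")
  )
)

row_freq <- rowSums(Table9)   # marginal frequencies of employment status
col_freq <- colSums(Table9)   # marginal frequencies of education level

row_relfreq <- row_freq / sum(Table9)   # marginal relative frequencies (rows)
col_relfreq <- col_freq / sum(Table9)   # marginal relative frequencies (columns)

row_freq
round(row_relfreq, 4)
```

Each result is a named vector, so a single value can be pulled out by name, for example row_freq["Employed"].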

We can also obtain the marginal distribution using the addmargins command.

Table9_WithMargins <- addmargins(Table9)   # The addmargins command appends the row and column totals
Table9_WithMargins

##                        Did Not Finish High School High School Graduate
## Employed                                     9607                34625
## Unemployed                                    570                 1274
## Not in the Labor Force                      11662                26426
## Sum                                         21839                62325
##                        Some College Bachelor's Degree or Higher    Sum
## Employed                      36370                       57102 137704
## Unemployed                     1170                        1305   4319
## Not in the Labor Force        19861                       20841  78790
## Sum                           57401                       79248 220813

Notice in Table9_WithMargins that there is a “Sum” for the row variable and a “Sum” for the column variable. For example, we can see that there are 137,704 thousand individuals who are employed and 21,839 thousand individuals who did not finish high school.
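The addmargins command can also produce a relative frequency marginal distribution when it is combined with prop.table, which divides every cell by the grand total. As a sketch, the Titanic counts from the first section are entered here as a matrix (the names Titanic_mat and Titanic_rel are illustrative), so the result can be compared with the earlier tally output.

```r
Titanic_mat <- matrix(
  c(35, 29, 18, 27, 1360, 334, 104, 318),
  nrow = 2, ncol = 4,
  dimnames = list(c("Died", "Survived"),
                  c("Boys", "Girls", "Men", "Women"))
)

# Convert counts to proportions of the grand total, then append the margins
Titanic_rel <- addmargins(prop.table(Titanic_mat))
round(Titanic_rel, 3)

# Individual entries can be pulled out by row and column name
Titanic_rel["Died", "Sum"]   # marginal proportion who died
```

The margins are labeled Sum here rather than Total, but the values should agree with the relative frequency table produced by tally.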

Building a Contingency Table from Raw Data

Now, let’s learn how to create a contingency table and obtain the marginal distributions from raw data. To do so, we use the tally command in the mosaic package.

Load the HomeRun_2014 data.

HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)

##        Date           Hitter HitterTeam           Pitcher PitcherTeam INN
## 1 9/28/2014   Rizzo, Anthony        CHC       Fiers, Mike         MIL   1
## 2 9/28/2014 Bernadina, Roger        LAD      Scahill, Rob         COL   6
## 3 9/28/2014     Duvall, Adam         SF     Stauffer, Tim          SD   4
## 4 9/28/2014      Duda, Lucas        NYM Foltynewicz, Mike         HOU   8
##          Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1     Miller Park      441       109.1       22.7        86.7   81   PL
## 2 Dodger Stadi...      424       113.2       27.7        62.3   98   ND
## 3       AT&T Park      423       103.6       31.9       112.9   98   ND
## 4      Citi Field      417       106.3       26.5        73.0   83   PL

The general form of the tally command for building a marginal distribution from raw data is

tally(~ response_variable + explanatory_variable, margins=TRUE, data = data_file)

For example, to tabulate home runs by inning (INN) and type (Type):

library(mosaic)
tally(~INN+Type,margins=TRUE,data=HomeRun)

##        Type
## INN      ITP   JE   ND   PL Total
##   1        0  168   72  263   503
##   2        0  152   86  247   485
##   3        0  143   60  200   403
##   4        1  161   78  284   524
##   5        1  160   87  251   499
##   6        0  134   92  261   487
##   7        1  146   62  237   446
##   8        4  129   66  195   394
##   9        1  119   52  175   347
##   10       0   14    5   19    38
##   11       0    7    3   16    26
##   12       0    6    3    8    17
##   13       0    1    1    6     8
##   14       0    1    0    4     5
##   15       0    0    0    1     1
##   16       0    1    0    0     1
##   19       0    1    0    0     1
##   Total    8 1343  667 2167  4185

Notice that there were 503 home runs hit in the first inning. There were a total of 8 inside-the-park (ITP) home runs. There were 72 “No Doubt” (ND) home runs hit in the first inning.
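For readers without mosaic, base R's table and addmargins give a comparable result. The sketch below is an assumed alternative, not the approach used above, and it runs on a tiny made-up data frame (toy) rather than the home run file; only the column names INN and Type are borrowed from the real data.

```r
# Hypothetical toy data, made up for illustration (not from HomeRun_2014.csv)
toy <- data.frame(
  INN  = c(1, 1, 2, 2, 2, 3),
  Type = c("PL", "ND", "PL", "PL", "JE", "ND")
)

# Rows come from the first variable, columns from the second
toy_table <- addmargins(table(INN = toy$INN, Type = toy$Type))
toy_table

toy_table["2", "Sum"]    # number of toy observations in inning 2
toy_table["Sum", "ND"]   # number of toy "ND" observations
```

As with a matrix, the margins from addmargins are labeled Sum rather than Total.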

If you want a relative frequency marginal distribution, use the format="proportion" option.

library(mosaic)
tally(~INN+Type,margins=TRUE,format="proportion",data=HomeRun)

##        Type
## INN              ITP           JE           ND           PL        Total
##   1     0.0000000000 0.0401433692 0.0172043011 0.0628434886 0.1201911589
##   2     0.0000000000 0.0363201912 0.0205495818 0.0590203106 0.1158900836
##   3     0.0000000000 0.0341696535 0.0143369176 0.0477897252 0.0962962963
##   4     0.0002389486 0.0384707288 0.0186379928 0.0678614098 0.1252090800
##   5     0.0002389486 0.0382317802 0.0207885305 0.0599761051 0.1192353644
##   6     0.0000000000 0.0320191159 0.0219832736 0.0623655914 0.1163679809
##   7     0.0002389486 0.0348864994 0.0148148148 0.0566308244 0.1065710872
##   8     0.0009557945 0.0308243728 0.0157706093 0.0465949821 0.0941457587
##   9     0.0002389486 0.0284348865 0.0124253286 0.0418160096 0.0829151732
##   10    0.0000000000 0.0033452808 0.0011947431 0.0045400239 0.0090800478
##   11    0.0000000000 0.0016726404 0.0007168459 0.0038231780 0.0062126643
##   12    0.0000000000 0.0014336918 0.0007168459 0.0019115890 0.0040621266
##   13    0.0000000000 0.0002389486 0.0002389486 0.0014336918 0.0019115890
##   14    0.0000000000 0.0002389486 0.0000000000 0.0009557945 0.0011947431
##   15    0.0000000000 0.0000000000 0.0000000000 0.0002389486 0.0002389486
##   16    0.0000000000 0.0002389486 0.0000000000 0.0000000000 0.0002389486
##   19    0.0000000000 0.0002389486 0.0000000000 0.0000000000 0.0002389486
##   Total 0.0019115890 0.3209080048 0.1593787336 0.5178016726 1.0000000000

The highest proportion of home runs was hit in the 4th inning (0.125).
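With this many rows, it can be easier to let R find the largest marginal proportion than to scan the table. The sketch below copies the inning totals from the Total column of the frequency table above into a named vector (the names inning_totals, inning_props, and top_inning are illustrative) and uses which.max.

```r
# Inning totals copied from the Total column of the frequency table above
inning_totals <- c("1" = 503, "2" = 485, "3" = 403, "4" = 524, "5" = 499,
                   "6" = 487, "7" = 446, "8" = 394, "9" = 347, "10" = 38,
                   "11" = 26, "12" = 17, "13" = 8, "14" = 5, "15" = 1,
                   "16" = 1, "19" = 1)

inning_props <- inning_totals / sum(inning_totals)   # marginal relative frequencies

top_inning <- names(which.max(inning_props))   # inning with the largest proportion
top_inning                                     # "4"
round(max(inning_props), 3)                    # 0.125
```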