install.packages("mosaic")
Enter data into a data frame using the do() command, which requires the mosaic package. For example, suppose we want to enter the following contingency table, which shows survival status on the Titanic.
Status | Men | Women | Boys | Girls |
---|---|---|---|---|
Survived | 334 | 318 | 29 | 27 |
Died | 1360 | 104 | 35 | 18 |
Let’s call the column variable Person and the row variable Status.
library(mosaic)
Titanic_df <- rbind(
do(334)*data.frame(Person="Men",Status="Survived"),
do(318)*data.frame(Person="Women",Status="Survived"),
do(29)*data.frame(Person="Boys",Status="Survived"),
do(27)*data.frame(Person="Girls",Status="Survived"),
do(1360)*data.frame(Person="Men",Status="Died"),
do(104)*data.frame(Person="Women",Status="Died"),
do(35)*data.frame(Person="Boys",Status="Died"),
do(18)*data.frame(Person="Girls",Status="Died")
)
tally(~Status+Person,data=Titanic_df) #Build contingency table from data frame
##           Person
## Status     Boys Girls  Men Women
##   Died       35    18 1360   104
##   Survived   29    27  334   318
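If the mosaic package is not available, the same one-row-per-person data frame can be sketched in base R. This is an alternative to do(), not the method used above; the names counts, cells, and Titanic_base are introduced here only for illustration.

```r
# Cell counts from the Titanic table above, entered column by column
counts <- matrix(c(334, 1360, 318, 104, 29, 35, 27, 18),
                 nrow = 2,
                 dimnames = list(Status = c("Survived", "Died"),
                                 Person = c("Men", "Women", "Boys", "Girls")))

# Long form: one row per (Status, Person) cell, with the count in Freq
cells <- as.data.frame(as.table(counts))

# Repeat each cell's row Freq times to get one row per passenger
Titanic_base <- cells[rep(seq_len(nrow(cells)), times = cells$Freq),
                      c("Person", "Status")]
rownames(Titanic_base) <- NULL

# Cross-tabulate to confirm the counts match the original table
table(Titanic_base$Status, Titanic_base$Person)
```

The final table() call is a quick check that no cell was mistyped before moving on to the conditional distribution.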
Now, let’s draw a barplot of the conditional distribution. Be sure to set margins to FALSE.
Titanic_Condition <- tally(~Status|Person,margins=FALSE,format="proportion",data=Titanic_df) #Conditional distribution by type of person
barplot(Titanic_Condition, beside = TRUE, cex.names = .7,legend=TRUE, ylim=c(0,1.2),main="Survival Status on the Titanic", xlab = "Type of Passenger", ylab = "Relative Frequency", col = c('#6897bb', '#c06723', '#baebae'))
Now, let’s construct a conditional distribution from a contingency table. We will work with the contingency table in Section 4.4, Table 9.
To create a conditional distribution in R, use the following command:
variable <- prop.table(table, 1 or 2)
Note: Use 1 to condition by the row variable; use 2 to condition by the column variable.
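A small sketch may make the difference between the two margins concrete. The 2 x 2 table toy and its labels are hypothetical and chosen only so the arithmetic is easy to check by hand.

```r
# Hypothetical 2 x 2 table of counts, used only for illustration
toy <- matrix(c(10, 30, 20, 40), nrow = 2,
              dimnames = list(c("Row A", "Row B"), c("Col X", "Col Y")))

by_row <- prop.table(toy, 1)  # condition by the row variable
by_col <- prop.table(toy, 2)  # condition by the column variable

rowSums(by_row)  # each row of by_row totals 1
colSums(by_col)  # each column of by_col totals 1
```

Whichever margin is chosen, the proportions along that margin total 1, which is a convenient way to confirm that the intended variable was conditioned on.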
To enter a contingency table directly, use the matrix command. It takes the cell counts wrapped in c( ), along with the number of rows (nrow) and the number of columns (ncol). Finally, name the rows and columns using dimnames along with list.
Notice how the cells are entered into the matrix (all entries in first column, then second column, and so on). With dimnames, name the row values first, then the column values.
Table9 <- matrix(c(9607, 570, 11662, 34625, 1274, 26426, 36370, 1170, 19861, 57102, 1305, 20841), nrow = 3, ncol = 4, dimnames = list(c("Employed", "Unemployed", "Not in the Labor Force"), c("Did Not Finish High School", "High School Graduate", "Some College", "Bachelor's Degree or Higher")))
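The column-by-column fill order is easier to see on a tiny matrix. The values 1 through 6 and the labels below are hypothetical; byrow = TRUE is a standard matrix argument that can be used when it is more natural to type the table one row at a time.

```r
# Default fill: values go down the first column, then the second, and so on
by_column <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
by_column

# byrow = TRUE fills across each row instead
by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE,
                 dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
by_row
```

Either approach produces a valid matrix; what matters is that the order of the values inside c( ) matches the fill order being used.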
Does a higher level of education play a role in employment status? Let’s condition by level of education to find out. Because level of education is the column variable, use 2 in the prop.table command.
Table9_Condition <- prop.table(Table9, 2)
Table9_Condition
##                        Did Not Finish High School High School Graduate
## Employed                                0.4399011           0.55555556
## Unemployed                              0.0261001           0.02044124
## Not in the Labor Force                  0.5339988           0.42400321
##                        Some College Bachelor's Degree or Higher
## Employed                 0.63361265                  0.72054815
## Unemployed               0.02038292                  0.01646729
## Not in the Labor Force   0.34600443                  0.26298455
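As a sanity check on the output above, each column of a distribution conditioned by column should total 1. The sketch below rebuilds Table9 with the same values so it can be run on its own, then pulls out the Employed row to compare education levels directly.

```r
Table9 <- matrix(c(9607, 570, 11662, 34625, 1274, 26426,
                   36370, 1170, 19861, 57102, 1305, 20841),
                 nrow = 3, ncol = 4,
                 dimnames = list(c("Employed", "Unemployed", "Not in the Labor Force"),
                                 c("Did Not Finish High School", "High School Graduate",
                                   "Some College", "Bachelor's Degree or Higher")))
Table9_Condition <- prop.table(Table9, 2)

# Each column is a complete distribution, so it should total 1
colSums(Table9_Condition)

# Proportion employed within each education level, rounded for reading
round(Table9_Condition["Employed", ], 3)
```

Reading across the Employed row shows the proportion employed rising with each level of education, which is the pattern the bar graph is meant to display.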
Now that we have the conditional distribution, use the barplot command. The syntax is as follows:
barplot(df_name,beside=TRUE)
Note: cex.names = .7 shrinks the bar labels to 70% of their default size. legend = TRUE adds a legend. ylim = c(0,1.2) extends the y-axis so the legend does not overlay the bars. Experiment with the limits until you are happy with the graph.
barplot(Table9_Condition, beside = TRUE, cex.names = .7,legend=TRUE, ylim=c(0,1.2),main="Employment Status by Level of Education", xlab = "Level of Education", ylab = "Relative Frequency", col = c('#6897bb', '#c06723', '#baebae'))
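Besides stretching ylim, the legend can also be repositioned. The sketch below uses barplot's args.legend argument, which passes a list of settings on to legend(). The matrix toy_cond is hypothetical, and this is one possible approach rather than the one used in the examples above.

```r
# Hypothetical conditional distribution; each column sums to 1
toy_cond <- matrix(c(0.6, 0.4, 0.3, 0.7, 0.5, 0.5), nrow = 2,
                   dimnames = list(c("Yes", "No"),
                                   c("Group A", "Group B", "Group C")))

# legend.text = TRUE labels the legend with the row names;
# args.legend places it along the top, laid out horizontally, without a box
bar_positions <- barplot(toy_cond, beside = TRUE, ylim = c(0, 1),
                         legend.text = TRUE,
                         args.legend = list(x = "top", horiz = TRUE,
                                            cex = 0.7, bty = "n"),
                         main = "Toy conditional distribution",
                         ylab = "Relative Frequency")
```

barplot() also returns the x-positions of the bars (stored here in bar_positions), which is useful if you later want to add text or other annotations above them.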
Now, let’s learn how to create a conditional distribution bar graph from raw data. First, obtain the conditional distribution. To do so, we use the tally command in the mosaic package.
Load the HomeRun_2014 data.
HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)
##        Date           Hitter HitterTeam           Pitcher PitcherTeam INN
## 1 9/28/2014   Rizzo, Anthony        CHC       Fiers, Mike         MIL   1
## 2 9/28/2014 Bernadina, Roger        LAD      Scahill, Rob         COL   6
## 3 9/28/2014     Duvall, Adam         SF     Stauffer, Tim          SD   4
## 4 9/28/2014      Duda, Lucas        NYM Foltynewicz, Mike         HOU   8
##          Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1     Miller Park      441       109.1       22.7        86.7   81   PL
## 2 Dodger Stadi...      424       113.2       27.7        62.3   98   ND
## 3       AT&T Park      423       103.6       31.9       112.9   98   ND
## 4      Citi Field      417       106.3       26.5        73.0   83   PL
Use the tally command to build a conditional distribution.
tally(~ response variable | explanatory variable,margins=FALSE,format="proportion",data = data_file)
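To see what this pattern computes, here is a base R sketch on a tiny, hypothetical data frame (toy_df and its values are invented for illustration). It uses table and prop.table rather than tally, so it runs without mosaic. The layout should match tally's: the response variable in rows, the explanatory variable in columns, and each column totaling 1.

```r
# Hypothetical raw data: one row per observation
toy_df <- data.frame(
  inning = c(1, 1, 1, 2, 2, 2, 2, 3),
  type   = c("ND", "PL", "PL", "JE", "ND", "ND", "PL", "JE")
)

# Counts with the response in rows and the explanatory variable in columns
counts <- table(Type = toy_df$type, Inning = toy_df$inning)

# Divide each column by its total to condition on the explanatory variable
cond <- prop.table(counts, 2)
cond
```

Because the table is conditioned on the explanatory variable, the result can be passed straight to barplot with beside = TRUE to get one cluster of bars per column.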
For the HomeRun_2014 data, let’s say we want to determine whether inning plays a role in the type of home run hit. That is, we want to find the conditional distribution of type by inning, so inning (INN) is the explanatory variable and Type is the response variable.
Don’t forget to set margins to FALSE and use ylim to adjust the limits on the y-axis so the legend is not blocking the bars.
library(mosaic)
HomeRun_Condition <- tally(~Type|INN,margins=FALSE,format="proportion",data=HomeRun)
barplot(HomeRun_Condition, beside = TRUE, cex.names = .7,legend=TRUE, ylim=c(0,1.6),main="Type of Home Run by Inning", xlab = "Inning", ylab = "Relative Frequency", col = c('#6897bb', '#c06723', '#baebae')) #Bars are grouped by inning, so the x-axis is labeled Inning