In R, creating a side-by-side bar graph from summarized data requires reshaping the dataset into the format that the graphing command expects.
Option 1: Loading a csv file into R.
Step 1: Load Table4 into R. We use Table 4 from Section 2.1 to illustrate the process.
Table4 <- read.csv("https://sullystats.github.io/Statistics6e/Data/Chapter2/Table4.csv")
Table4
## Education X1990 X2017
## 1 No_HS 39344 26582
## 2 HS_Diploma 47643 60032
## 3 Some_College 29780 45110
## 4 Associates 9792 18761
## 5 Bachelors 20833 43585
## 6 Grad_Prof 11478 27181
Notice that R names the columns “X1990” and “X2017”. Column names that begin with a digit are not syntactically valid in R, so read.csv adds the prefix “X”.
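The prefix comes from the check.names argument of read.csv, which defaults to TRUE and passes the column heads through make.names. The following is a minimal sketch of that behavior. It uses a two-row inline CSV (an assumption made here so the example runs without the download), and the names csv_text, default_names, and raw_names are illustrative only.

```r
# Two-row inline CSV standing in for the downloaded file (illustrative only)
csv_text <- "Education,1990,2017\nNo_HS,39344,26582\nHS_Diploma,47643,60032"

# Default behavior: check.names = TRUE makes the column heads syntactically valid
default_names <- names(read.csv(text = csv_text))
default_names

# check.names = FALSE keeps the column heads exactly as they appear in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
raw_names
```

If the heads are kept as "1990" and "2017", the columns must be referred to with backticks, for example Table4$`1990`, which is one reason to leave the default in place.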
Step 2: When drawing side-by-side bar graphs, we want to use relative frequencies. Let’s create the relative frequencies for both 1990 and 2017 in R.
prop_1990 <- Table4$X1990/sum(Table4$X1990)
prop_2017 <- Table4$X2017/sum(Table4$X2017)
Now, we will replace the X1990 and X2017 frequencies with the relative frequencies.
Table4$X1990 <- prop_1990
Table4$X2017 <- prop_2017
Table4
## Education X1990 X2017
## 1 No_HS 0.24764902 0.1201441
## 2 HS_Diploma 0.29988670 0.2713298
## 3 Some_College 0.18744886 0.2038861
## 4 Associates 0.06163530 0.0847951
## 5 Bachelors 0.13113237 0.1969935
## 6 Grad_Prof 0.07224775 0.1228514
Note: You could also have used cbind(…) to add the prop_1990 and prop_2017 columns to the current Table 4.
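As a rough sketch of the cbind(…) alternative mentioned in the note, the code below appends the relative frequencies as new columns and keeps the original counts. The counts are typed in so the example runs on its own, and the name Table4_both is hypothetical.

```r
# Counts from Table 4, typed in so the sketch runs on its own
Table4 <- data.frame(
  Education = c("No_HS", "HS_Diploma", "Some_College",
                "Associates", "Bachelors", "Grad_Prof"),
  X1990 = c(39344, 47643, 29780, 9792, 20833, 11478),
  X2017 = c(26582, 60032, 45110, 18761, 43585, 27181)
)

prop_1990 <- Table4$X1990 / sum(Table4$X1990)
prop_2017 <- Table4$X2017 / sum(Table4$X2017)

# cbind() appends the new columns and leaves the counts in place
Table4_both <- cbind(Table4, prop_1990, prop_2017)
names(Table4_both)

# Each set of relative frequencies should sum to 1
sum(Table4_both$prop_1990)
sum(Table4_both$prop_2017)
```

With this approach, the gather step that follows would need to select prop_1990:prop_2017 rather than X1990:X2017.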
Step 3: Now, we use the tidyr and ggplot2 packages in R.
If you have not done so already, install these packages.
install.packages("tidyr")
install.packages("ggplot2")
library(tidyr)
library(ggplot2)
In the tidyr package, the gather command stacks two columns into a single column of values and adds a second column that records which original column each value came from.
Table4_new <- gather(Table4,date,number,X1990:X2017)
Table4_new
## Education date number
## 1 No_HS X1990 0.24764902
## 2 HS_Diploma X1990 0.29988670
## 3 Some_College X1990 0.18744886
## 4 Associates X1990 0.06163530
## 5 Bachelors X1990 0.13113237
## 6 Grad_Prof X1990 0.07224775
## 7 No_HS X2017 0.12014409
## 8 HS_Diploma X2017 0.27132985
## 9 Some_College X2017 0.20388608
## 10 Associates X2017 0.08479510
## 11 Bachelors X2017 0.19699346
## 12 Grad_Prof X2017 0.12285142
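The tidyr documentation marks gather as superseded and recommends pivot_longer for new code. The sketch below shows the same reshaping with pivot_longer, assuming tidyr version 1.0.0 or later. The relative frequencies are typed in as printed above so the example runs on its own, and the name Table4_long is hypothetical.

```r
library(tidyr)

# Relative frequencies from Step 2, as printed above (typed in for a standalone run)
Table4 <- data.frame(
  Education = c("No_HS", "HS_Diploma", "Some_College",
                "Associates", "Bachelors", "Grad_Prof"),
  X1990 = c(0.24764902, 0.29988670, 0.18744886,
            0.06163530, 0.13113237, 0.07224775),
  X2017 = c(0.12014409, 0.27132985, 0.20388608,
            0.08479510, 0.19699346, 0.12285142)
)

# pivot_longer() is the newer tidyr interface for the same reshaping
Table4_long <- pivot_longer(Table4,
                            cols = c(X1990, X2017),
                            names_to = "date",
                            values_to = "number")
Table4_long
```

The result is a tibble, and its rows are ordered by education level rather than by year. Neither difference affects the graph.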
Step 4: Now use ggplot2 to draw the graph.
ggplot(Table4_new, aes(Education, number, fill = date)) +
  geom_col(position = "dodge") +
  ylab("Relative Frequency") +
  labs(title = "Educational Attainment in 1990 versus 2017")
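This command has a few features. aes(Education, number, fill = date) puts Education on the horizontal axis and number on the vertical axis, and colors the bars by date. geom_col(position = "dodge") draws the bars for each year side by side rather than stacked. ylab and labs set the vertical-axis label and the title.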
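Two adjustments may be worth considering. ggplot2 orders a character variable alphabetically on the axis, so the bars will not follow the order of Table 4 unless Education is converted to a factor with explicit levels. The legend will also show "X1990" and "X2017" unless the prefix is removed. The sketch below is one way to handle both. The data are typed in so it runs on its own, and education_levels and p are hypothetical names.

```r
library(ggplot2)

# Long-format relative frequencies, as in Table4_new (typed in for a standalone run)
education_levels <- c("No_HS", "HS_Diploma", "Some_College",
                      "Associates", "Bachelors", "Grad_Prof")
Table4_new <- data.frame(
  Education = rep(education_levels, times = 2),
  date = rep(c("X1990", "X2017"), each = 6),
  number = c(0.24764902, 0.29988670, 0.18744886,
             0.06163530, 0.13113237, 0.07224775,
             0.12014409, 0.27132985, 0.20388608,
             0.08479510, 0.19699346, 0.12285142)
)

# Fix the bar order to the table order instead of alphabetical order
Table4_new$Education <- factor(Table4_new$Education, levels = education_levels)

# Drop the leading "X" so the legend reads 1990 and 2017
Table4_new$date <- sub("^X", "", Table4_new$date)

p <- ggplot(Table4_new, aes(Education, number, fill = date)) +
  geom_col(position = "dodge") +
  ylab("Relative Frequency") +
  labs(title = "Educational Attainment in 1990 versus 2017", fill = "Year")
p
```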
Option 2: Typing Data Directly into R
Step 1: Type the data directly into R. First, create a column of data for each of 1990 and 2017 using c(…). Note that for the 1990 data, we attach the level of education to each value so that the levels become the row names of the data.
X1990 <- c(No_HS=39344,HS_Diploma=47643,Some_College=29780,Associates=9792,Bachelors=20833,Grad_Prof=11478)
X2017 <- c(26582,60032,45110,18761,43585,27181)
Now, combine the columns into a matrix using cbind(…).
Table4 <- cbind(X1990,X2017)
Table4
## X1990 X2017
## No_HS 39344 26582
## HS_Diploma 47643 60032
## Some_College 29780 45110
## Associates 9792 18761
## Bachelors 20833 43585
## Grad_Prof 11478 27181
Note: To find out the format of the data, use the class command.
class(Table4)
## [1] "matrix" "array"
Notice that Table4 is classified as a matrix. To convert Table 4 from a matrix to a contingency table, use the as.table command.
Table4 <- as.table(Table4)
Now, we want to view the names of the columns and rows in the table. The command dimnames reports the names of the rows and columns in a contingency table.
dimnames(Table4)
## [[1]]
## [1] "No_HS" "HS_Diploma" "Some_College" "Associates" "Bachelors"
## [6] "Grad_Prof"
##
## [[2]]
## [1] "X1990" "X2017"
We want to label the row variable and the column variable using the names command.
names(dimnames(Table4)) <- c('Education','Year')
Table4
## Year
## Education X1990 X2017
## No_HS 39344 26582
## HS_Diploma 47643 60032
## Some_College 29780 45110
## Associates 9792 18761
## Bachelors 20833 43585
## Grad_Prof 11478 27181
Now, let’s view the data as a data frame.
head(as.data.frame(Table4))
## Education Year Freq
## 1 No_HS X1990 39344
## 2 HS_Diploma X1990 47643
## 3 Some_College X1990 29780
## 4 Associates X1990 9792
## 5 Bachelors X1990 20833
## 6 Grad_Prof X1990 11478
Notice the column heads: Education, Year, and Freq. These are the variable names used in the graphing command.
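Converting a table with as.data.frame produces one row per cell, which is the same long layout that gather produced in Option 1. The sketch below rebuilds the table from the typed-in counts and inspects the result. The names counts, tab, long, and long_renamed are hypothetical, and the responseName argument is shown only as an optional way to rename the Freq column.

```r
# Counts from Table 4 as a matrix, then as a labeled contingency table
counts <- cbind(
  X1990 = c(No_HS = 39344, HS_Diploma = 47643, Some_College = 29780,
            Associates = 9792, Bachelors = 20833, Grad_Prof = 11478),
  X2017 = c(26582, 60032, 45110, 18761, 43585, 27181)
)
tab <- as.table(counts)
names(dimnames(tab)) <- c("Education", "Year")

# One row per cell: 6 education levels x 2 years = 12 rows
long <- as.data.frame(tab)
names(long)
nrow(long)

# Education is stored as a factor whose levels keep the table order
levels(long$Education)

# Optional: choose a different name for the frequency column
long_renamed <- as.data.frame(tab, responseName = "Count")
names(long_renamed)
```

Because Education is a factor with levels in table order, a graph drawn from this data frame keeps the categories in the order of Table 4.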
When drawing side-by-side bar graphs, we want to use relative frequencies. Let’s create the relative frequencies for both 1990 and 2017 in R using the prop.table command. The syntax for the command is
prop.table(table_name, 1) computes proportions by the row variable
prop.table(table_name, 2) computes proportions by the column variable
Table4 <- prop.table(Table4,2)
Table4
## Year
## Education X1990 X2017
## No_HS 0.24764902 0.12014409
## HS_Diploma 0.29988670 0.27132985
## Some_College 0.18744886 0.20388608
## Associates 0.06163530 0.08479510
## Bachelors 0.13113237 0.19699346
## Grad_Prof 0.07224775 0.12285142
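To see what the second argument of prop.table does, the sketch below applies both options to the typed-in counts and checks which totals equal 1. The names counts, by_column, and by_row are hypothetical.

```r
# Counts from Table 4 as a matrix with named rows and columns
counts <- cbind(
  X1990 = c(No_HS = 39344, HS_Diploma = 47643, Some_College = 29780,
            Associates = 9792, Bachelors = 20833, Grad_Prof = 11478),
  X2017 = c(26582, 60032, 45110, 18761, 43585, 27181)
)

by_column <- prop.table(counts, 2)  # each year's column sums to 1
by_row    <- prop.table(counts, 1)  # each education level's row sums to 1

colSums(by_column)
rowSums(by_row)
```

The side-by-side bar graph compares the distribution of education within each year, so the column version (second argument equal to 2) is the one used here.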
Now we can use barchart (a lattice command that becomes available when the mosaic package is loaded) to create a side-by-side bar graph.
library(mosaic)
barchart(Freq ~ Education, groups = Year, data = as.data.frame(Table4),
         format = "proportion",
         main = "Educational Attainment in 1990 versus 2017",
         auto.key = list(space = 'right'),
         scales = list(x = list(rot = 90)))
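If neither ggplot2 nor mosaic is installed, a similar graph can be drawn with the base R barplot command. This is a different technique from the barchart call above, offered only as a sketch. The counts are typed in so it runs on its own, and the names counts, rel_freq, and midpoints are hypothetical.

```r
# Counts from Table 4 as a matrix with named rows and columns
counts <- cbind(
  X1990 = c(No_HS = 39344, HS_Diploma = 47643, Some_College = 29780,
            Associates = 9792, Bachelors = 20833, Grad_Prof = 11478),
  X2017 = c(26582, 60032, 45110, 18761, 43585, 27181)
)
rel_freq <- prop.table(counts, 2)

# barplot() draws one group of bars per column of its input, so transpose
# to get one group per education level with one bar per year in each group
midpoints <- barplot(t(rel_freq),
                     beside = TRUE,        # side by side instead of stacked
                     legend.text = TRUE,   # legend labels come from the row names
                     las = 2,              # rotate the category labels
                     ylab = "Relative Frequency",
                     main = "Educational Attainment in 1990 versus 2017")
```

The value returned by barplot holds the horizontal position of each bar, which can be useful for adding labels later.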