First, be sure the package mosaic is installed.
install.packages('mosaic')
Now, we will load population data. First, let’s consider the distance of all home runs hit during the 2014 baseball season.
HomeRun <- read.csv("https://sullystats.github.io/Statistics6e/Data/HomeRun_2014.csv")
head(HomeRun,n=4)
## Date Hitter HitterTeam Pitcher PitcherTeam INN
## 1 9/28/2014 Rizzo, Anthony CHC Fiers, Mike MIL 1
## 2 9/28/2014 Bernadina, Roger LAD Scahill, Rob COL 6
## 3 9/28/2014 Duvall, Adam SF Stauffer, Tim SD 4
## 4 9/28/2014 Duda, Lucas NYM Foltynewicz, Mike HOU 8
## Ballpark TrueDist SpeedOffBat Elev.Angle Horiz.Angle Apex Type
## 1 Miller Park 441 109.1 22.7 86.7 81 PL
## 2 Dodger Stadi... 424 113.2 27.7 62.3 98 ND
## 3 AT&T Park 423 103.6 31.9 112.9 98 ND
## 4 Citi Field 417 106.3 26.5 73.0 83 PL
We are going to focus on the variable “TrueDist”, which is the distance (in feet) the home run traveled.
Let’s look at the distribution of this variable and get some summary statistics.
library(mosaic)
gf_histogram(~TrueDist,data=HomeRun,binwidth=10,color="black",fill="blue",xlab="Distance (in feet)",ylab="Frequency",title="Distance of a Home Run in 2014")
favstats(~TrueDist,data=HomeRun)
## min Q1 median Q3 max mean sd n missing
## 304 378 396 413 489 395.2172 24.81088 4185 0
Notice the distribution is approximately normal with \(\mu\) = 395.2 feet and \(\sigma\) = 24.8 feet.
Now, let’s take a random sample of n = 9 home run distances from this data set and determine the sample mean of the home run distance.
mean(~TrueDist,data=sample(HomeRun,9)) # Find the mean of a sample of size 9
## [1] 389.4444
Let’s take another random sample of n = 9 home run distances and determine the sample mean.
mean(~TrueDist,data=sample(HomeRun,9)) # Find the mean of a sample of size 9
## [1] 396.3333
Notice that the sample mean changes from sample to sample because we have different home runs in the random sample.
To get a sense of the shape, center, and spread of the sampling distribution of \(\bar{x}\), we need to obtain many random samples of size n = 9. Here we draw 5000 of them and record each sample mean.
SamplingDist <- bind_rows(do(5000) * c(mean = mean(~TrueDist, data = sample(HomeRun,9))))
head(SamplingDist,n=4)
## mean
## 1 401.5556
## 2 407.6667
## 3 384.2222
## 4 407.1111
You can see the sample mean for the first four random samples. Now, let’s look at the shape, center, and spread of the sampling distribution of \(\bar{x}\).
gf_histogram(~mean,data=SamplingDist,binwidth=5,color="black",fill="blue",xlab="Mean Distance (in feet)",ylab="Frequency",title="Distribution of Sample Mean Distance of a Home Run in 2014 with n = 9")
mean(~mean,data=SamplingDist)
## [1] 395.1058
sd(~mean,data=SamplingDist)
## [1] 8.231234
Notice the shape of the distribution of the sample mean is approximately normal. The mean of the simulated sample means, about 395.1 feet, is very close to the population mean of 395.2 feet, which is consistent with \(\mu_{\bar{x}} = \mu\). The standard deviation of the sampling distribution of \(\bar{x}\) is much smaller than \(\sigma\), which is consistent with \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\).
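As a rough check of the standard deviation formula, we can plug in the rounded population value from the summary statistics (\(\sigma \approx 24.81\) feet) with n = 9, treating the 5000 simulated means as a stand-in for the true sampling distribution:

\[
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{24.81}{\sqrt{9}} = \frac{24.81}{3} = 8.27 \text{ feet}
\]

The simulated standard deviation of about 8.23 feet is close to this value. The samples are random, so rerunning the simulation will give slightly different numbers.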
Let’s repeat this for n = 16 to see the role sample size plays.
SamplingDist_16 <- bind_rows(do(5000) * c(mean = mean(~TrueDist, data = sample(HomeRun,16))))
head(SamplingDist_16,n=4)
## mean
## 1 383.2500
## 2 404.3125
## 3 401.1250
## 4 393.6250
You can see the sample mean for the first four random samples. Now, let’s look at the shape, center, and spread of the sampling distribution of \(\bar{x}\).
gf_histogram(~mean,data=SamplingDist_16,binwidth=5,color="black",fill="blue",xlab="Mean Distance (in feet)",ylab="Frequency",title="Distribution of Sample Mean Distance of a Home Run in 2014 with n = 16")
mean(~mean,data=SamplingDist_16)
## [1] 395.0118
sd(~mean,data=SamplingDist_16)
## [1] 6.168464
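The shape of the distribution is still approximately normal, and the mean of the simulated sample means, about 395.0 feet, is again very close to the population mean, consistent with \(\mu_{\bar{x}} = \mu\). Notice the standard deviation of the sampling distribution of \(\bar{x}\) is now smaller (about 6.17 feet versus 8.23 feet for n = 9) because the sample size has increased. This follows from \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\): as n grows, the denominator grows and \(\sigma_{\bar{x}}\) shrinks.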
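The same rough check can be repeated for n = 16, again using the rounded value \(\sigma \approx 24.81\) feet from the summary statistics:

\[
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{24.81}{\sqrt{16}} = \frac{24.81}{4} = 6.20 \text{ feet}
\]

The formula also predicts how the two spreads compare, without needing \(\sigma\) at all:

\[
\frac{\sigma / \sqrt{16}}{\sigma / \sqrt{9}} = \frac{\sqrt{9}}{\sqrt{16}} = \frac{3}{4} = 0.75
\]

The simulated values give \(6.17 / 8.23 \approx 0.75\), in line with this prediction. These comparisons are approximate because they rely on one particular set of 5000 random samples for each sample size.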