Load the data from Table 1 in Section 4.1 into R.
Table1 <- read.csv("https://sullystats.github.io/Statistics6e/Data/Chapter4/Table1.csv")
head(Table1,n=4)
## Speed Distance
## 1 100 257
## 2 102 264
## 3 103 274
## 4 101 266
Notice that Speed is in the first column and Distance is in the second column. Now, we want to add the observation corresponding to Justin Thomas (Speed 120, Distance 305) to the data set. To do so, use the rbind( ) command (short for row bind), listing the new values in the same order as the columns.
Example6 <- rbind(Table1,c(120,305))
Example6
## Speed Distance
## 1 100 257
## 2 102 264
## 3 103 274
## 4 101 266
## 5 105 277
## 6 100 263
## 7 99 258
## 8 105 275
## 9 120 305
Rather than reading the data from GitHub, we could manually enter the data into R.
Example6a <- data.frame("Speed"=c(100, 102, 103, 101, 105, 100, 99, 105, 120), "Distance"=c(257, 264, 274, 266, 277, 263, 258, 275, 305))
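As a quick sanity check, the sketch below confirms that the two approaches give the same values. To stay self-contained it does not download the file; Table1_local is a hypothetical stand-in for Table1, typed in from the eight original observations shown in the output above.

```r
# Hypothetical local stand-in for Table1 (the eight original observations)
Table1_local <- data.frame(
  Speed    = c(100, 102, 103, 101, 105, 100, 99, 105),
  Distance = c(257, 264, 274, 266, 277, 263, 258, 275)
)

# Approach 1: append the new observation with rbind()
Example6_local <- rbind(Table1_local, c(120, 305))

# Approach 2: enter all nine observations by hand
Example6a <- data.frame(
  Speed    = c(100, 102, 103, 101, 105, 100, 99, 105, 120),
  Distance = c(257, 264, 274, 266, 277, 263, 258, 275, 305)
)

# Compare the two data frames cell by cell
same_values <- all(Example6_local == Example6a)
same_values
```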
Base R has an influence.measures command that identifies influential observations.
golf_model <- lm(Distance ~ Speed, data=Example6) # Find and name the regression model
influence.measures(golf_model)
## Influence measures of
## lm(formula = Distance ~ Speed, data = Example6) :
##
## dfb.1_ dfb.Sped dffit cov.r cook.d hat inf
## 1 -0.4987 0.4581 -0.847 0.599 0.25486 0.157
## 2 -0.1091 0.0921 -0.309 1.248 0.04992 0.122
## 3 0.1230 -0.0883 0.607 0.701 0.14526 0.114
## 4 0.0794 -0.0709 0.165 1.490 0.01535 0.136
## 5 -0.0479 0.0703 0.389 1.079 0.07378 0.115
## 6 0.0489 -0.0449 0.083 1.595 0.00399 0.157
## 7 -0.2025 0.1892 -0.301 1.465 0.04950 0.184
## 8 -0.0193 0.0282 0.156 1.446 0.01380 0.115
## 9 5.7588 -5.8973 -6.299 4.553 13.36257 0.900 *
The influence.measures command outputs a table with 7 columns and 9 rows, one row per observation. Justin Thomas’s swing is in the 9th row because it is the 9th observation in the data set. Each of the first six columns is a separate influence measure. We will focus on the 5th and 7th columns. The 5th column (cook.d) is Cook’s distance, a common measure of how much a single observation influences the fitted regression. A common rule of thumb is that any observation with a Cook’s distance greater than 1 is considered influential. Justin Thomas’s swing has a Cook’s distance of 13.36, far above that cutoff. The 7th column (inf) marks an observation with a star when at least one of the six measures flags it as influential. The star in the inf column of row 9 confirms that Justin Thomas’s swing is an influential observation.
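If only Cook’s distance is of interest, it can also be pulled out on its own. The following is a minimal sketch using the base R cooks.distance( ) function and the cutoff of 1 described above; the data are entered manually so the block runs without a download, and the names cooks_d and influential are illustrative.

```r
# Enter the nine observations manually (same values as Example6a)
Example6a <- data.frame(
  Speed    = c(100, 102, 103, 101, 105, 100, 99, 105, 120),
  Distance = c(257, 264, 274, 266, 277, 263, 258, 275, 305)
)

# Fit the regression model
golf_model <- lm(Distance ~ Speed, data = Example6a)

# Cook's distance for each observation (the cook.d column of influence.measures)
cooks_d <- cooks.distance(golf_model)
round(cooks_d, 5)

# Row numbers of observations whose Cook's distance exceeds the cutoff of 1
influential <- which(cooks_d > 1)
influential
```

The values in cooks_d should match the cook.d column of the table above, and only observation 9 should exceed the cutoff.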
The mosaic package has a graphical version of an influence test. It is part of the mplot( ) command.
library(mosaic)
golf_model <- lm(Distance ~ Speed, data=Example6) # Find and name the regression model
mplot(golf_model,which = 4) # the "which" option can take on a value from 1 to 7.
Any observation with a Cook’s d in excess of 1 is considered influential. Clearly, the observation corresponding to Justin Thomas is influential.
Note:
- which = 1 draws a residual plot (residuals versus fits)
- which = 2 draws a QQ plot of the residuals
- which = 3 draws a residual plot (standardized residuals versus fits)
- which = 4 is Cook’s d
- which = 5 is residuals versus leverage
- which = 6 is Cook’s d versus leverage
- which = 7 is confidence intervals of estimates (Chapter 14)
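To see what "influential" means in practice, one option is to fit the regression with and without Justin Thomas’s observation and compare the coefficients. This is only an illustrative sketch, not part of the original example; the names golf_data, model_all, model_without, and slope_change are assumptions introduced here.

```r
# Enter the nine observations manually (same values as Example6a)
golf_data <- data.frame(
  Speed    = c(100, 102, 103, 101, 105, 100, 99, 105, 120),
  Distance = c(257, 264, 274, 266, 277, 263, 258, 275, 305)
)

# Model using all nine observations
model_all <- lm(Distance ~ Speed, data = golf_data)

# Model with observation 9 (Justin Thomas) removed
model_without <- lm(Distance ~ Speed, data = golf_data[-9, ])

# Show the two sets of coefficients side by side
rbind(all_nine = coef(model_all), without_obs9 = coef(model_without))

# Change in slope caused by including observation 9
slope_change <- unname(coef(model_all)["Speed"] - coef(model_without)["Speed"])
slope_change
```

The two slopes should differ noticeably, which is consistent with the large dfb.Sped value for observation 9 in the influence.measures output.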