The auto data set (auto-mpg.txt
on Canvas) concerns city-cycle fuel consumption in miles per gallon (mpg) and other attributes collected for 398 vehicle instances. The variables are: mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin and car name.
Perform PCA on this dataset including exploratory data analysis and write a report summarizing your results. In particular:
(a) \(\mathbf{Solution.}\qquad\) Below we see that we have multiple predictors. The predictors weight and displacement are strongly correlated. Some variables, such as horsepower, have significant outliers. We also observe that there are two categorical variables: origin and car_name.
# Load required packages and the data
library(dplyr)    # %>%, filter, select
library(ggplot2)  # ggplot, qplot
library(GGally)   # ggpairs
auto = read.delim("./Data/auto-mpg.txt", header = FALSE, sep = "")
colnames(auto) = c("mpg", "cylinders", "displacement", "horsepower", "weight",
"acceleration", "model_year", "origin", "car_name")
auto$origin = factor(auto$origin, labels = c("American","European","Japanese"))
auto$horsepower = as.numeric(auto$horsepower)  # "?" entries become NA
auto = auto[complete.cases(auto), ]            # drop rows with missing values
summary(auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
## acceleration model_year origin car_name
## Min. : 8.00 Min. :70.00 American:245 Length:392
## 1st Qu.:13.78 1st Qu.:73.00 European: 68 Class :character
## Median :15.50 Median :76.00 Japanese: 79 Mode :character
## Mean :15.54 Mean :75.98
## 3rd Qu.:17.02 3rd Qu.:79.00
## Max. :24.80 Max. :82.00
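The variable types can be checked with str(); a call like the following produces the listing below:
# Inspect variable types
str(auto)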
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model_year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : Factor w/ 3 levels "American","European",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ car_name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## Construct pairs plot of the numeric variables, colored by origin
auto %>%
  filter(horsepower != 1) %>%
  select(-c(car_name, model_year)) %>%
  ggpairs(legend = 1,
          mapping = aes(colour = origin),
          lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.1)))
(b) \(\mathbf{Solution.}\qquad\) We first exclude the last two variables, since both are categorical (origin has only three discrete levels, and car_name is essentially a label rather than a useful predictor). These two variables could instead serve as potential response variables that we might try to classify based on the remaining variables.
## PCA using the covariance matrix (unstandardized variables)
pca_data = as.matrix(auto[ , 1:7])
pca_results_cov = princomp(pca_data, cor = FALSE)
summary(pca_results_cov)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 854.5910457 38.865860028 1.615443e+01 4.8154369111
## Proportion of Variance 0.9975368 0.002063236 3.564477e-04 0.0000316726
## Cumulative Proportion 0.9975368 0.999600082 9.999565e-01 0.9999882027
## Comp.5 Comp.6 Comp.7
## Standard deviation 2.348466e+00 1.688113e+00 5.216331e-01
## Proportion of Variance 7.533223e-06 3.892381e-06 3.716571e-07
## Cumulative Proportion 9.999957e-01 9.999996e-01 1.000000e+00
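The component variances listed next are the squared standard deviations of the components (i.e., the eigenvalues on the covariance scale); they can be obtained with, e.g.:
# Component variances from the covariance-based fit
pca_results_cov$sdev^2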
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## 7.303259e+05 1.510555e+03 2.609658e+02 2.318843e+01 5.515293e+00 2.849726e+00
## Comp.7
## 2.721010e-01
From the results above we see that PCA on the covariance matrix concentrates the variance far more unevenly across components than PCA on the correlation matrix, likely because the predictors are not on the same scale. The first component assigns the largest weight to the predictor 'weight', which has the largest mean and variance among the predictors. Thus it is reasonable to standardize the dataset. Below we use the correlation matrix as the input instead.
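A chunk along the following lines, with cor = TRUE so that princomp works with the correlation matrix, produces the summary and component variances shown below:
# PCA using the correlation matrix (standardized variables)
pca_results_cor = princomp(pca_data, cor = TRUE)
summary(pca_results_cor)
pca_results_cor$sdev^2  # component variances (eigenvalues of the correlation matrix)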
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.2384450 0.9303716 0.8534599 0.42885323 0.34916518
## Proportion of Variance 0.7158051 0.1236559 0.1040563 0.02627358 0.01741662
## Cumulative Proportion 0.7158051 0.8394610 0.9435173 0.96979087 0.98720749
## Comp.6 Comp.7
## Standard deviation 0.232931666 0.18785747
## Proportion of Variance 0.007751023 0.00504149
## Cumulative Proportion 0.994958510 1.00000000
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 5.01063582 0.86559140 0.72839377 0.18391509 0.12191632 0.05425716 0.03529043
(c) \(\mathbf{Solution.}\qquad\) From the cumulative proportions in the PCA output we see that the first component explains the largest share of the variance (71.6%), and the first four components together explain most of the total variance (97.0%). The scree plots also indicate that the 'elbow' appears at Comp.4. Thus we retain the first four components.
par(mfrow=c(1,2))
plot(pca_results_cor)
plot(eigen(cor(pca_data))$values, type = "l", xlab='Components', ylab = 'Variances')
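The cumulative proportions quoted below can be recomputed directly from the component variances, e.g.:
# Cumulative proportion of variance explained by the components
cumsum(pca_results_cor$sdev^2) / sum(pca_results_cor$sdev^2)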
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 0.7158051 0.8394610 0.9435173 0.9697909 0.9872075 0.9949585 1.0000000
(d) \(\mathbf{Solution.}\qquad\) For Comp.1, four of the variables (cylinders, displacement, horsepower, and weight) have negative loadings, while the rest are positive. Model_year has the largest loading on Comp.2, whereas displacement has only small loadings on Comps. 2 and 3. The loading of cylinders on Comp.3 is likewise small, while acceleration has the largest (negative) loading. For Comp.4, all loadings except that of mpg are negative.
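The loadings discussed here come from the correlation-based fit; printing them blanks entries smaller than 0.1 in absolute value by default:
# Loadings of the correlation-based PCA (default print cutoff is 0.1)
pca_results_cor$loadings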
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## mpg 0.398 0.207 0.257 0.751 0.341 0.210
## cylinders -0.416 0.199 -0.139 0.477 -0.493 -0.333 0.432
## displacement -0.429 0.180 -0.100 0.298 0.143 -0.813
## horsepower -0.423 0.170 0.711 -0.523
## weight -0.414 0.225 -0.276 -0.108 0.265 0.697 0.367
## acceleration 0.285 -0.893 0.121 0.231 -0.224
## model_year 0.230 0.910 -0.302 -0.128
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143 0.143
## Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857 1.000
(e) \(\mathbf{Solution.}\qquad\) The points are colored by origin. We can see that the second principal component does not separate the data well, but the first principal component does a decent job of separating the "American" origin from the other two groups. However, the first component does not separate the "Japanese" and "European" origins well, as most of the green and blue points overlap. In addition, point 365 might be a potential outlier.
# Project the standardized data (scale() centers and scales each column) onto
# the first two principal components
pca_projections <- as.data.frame(scale(pca_data) %*%
                                   loadings(pca_results_cor)[, 1:2])
colnames(pca_projections) <- c("V1", "V2")
# Add back the categorical variables
pca_projections <- cbind(pca_projections,
                         auto[ , c("origin", "car_name")])
# Plot the projections in two dimensions, colored by origin and labeled by row index
ggplot(data = pca_projections) +
  geom_point(aes(x = V1, y = V2, col = origin), size = 4.5) +
  geom_text(aes(x = V1, y = V2, label = rownames(pca_projections)), size = 3)
(f) \(\mathbf{Solution.}\qquad\) The directions for cylinders/weight and for acceleration/mpg almost coincide with respect to components 1 and 2. This corresponds to the strong pairwise correlations between these variables seen in the ggpairs plot in 1(a).
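The directions referred to here are the variable arrows in a biplot of the first two components; a biplot of the correlation-based fit can be drawn with, e.g.:
# Biplot of scores and variable loadings for the first two components
biplot(pca_results_cor, cex = 0.6)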
The table above contains airline distances among 10 US cities. Perform multidimensional scaling to produce a 2-dimensional map for these cities, with names and locations of the cities.
\(\mathbf{Solution.}\qquad\)
# Upper-triangle rows of the airline-distance matrix (cities in the order given below)
data = list(list(587, 1212, 701, 1936, 604, 748, 2139, 2182, 543),
list(920, 940, 1745, 1188, 713, 1858, 1737, 597),
list(879, 831, 1726, 1631, 949, 1021, 1494),
list(1374, 968, 1420, 1645, 1891, 1220),
list(2339, 2451, 347, 959, 2300),
list(1092, 2594, 2734, 923),
list(2571, 2408, 205),
list(678, 2442),
list(2329))
# Assemble the full 10 x 10 symmetric distance matrix
A = matrix(0, nrow = 10, ncol = 10)
for (i in 1:length(data)) {
  A[i, ] = c(rep(0, i), unlist(data[i]))  # upper triangle of row i
}
A[lower.tri(A)] = t(A)[lower.tri(A)]      # mirror into the lower triangle
cities = c("Atlanta", "Chicago", "Denver", "Houston", "Los Angeles", "Miami",
"New York", "San Francisco", "Seattle", "Washington DC")
rownames(A) = cities
colnames(A) = cities
# Perform MDS
cities.mds = cmdscale(A, k = 2)
qplot(x = cities.mds[, 1], y = cities.mds[, 2],
label = row.names(A),
geom = "text",
xlab = "First Dimension",
ylab = "Second Dimension",
main = "Two-Dimensional Map of Cities")
Consider the USArrests
data (in the ISLR package). We will now perform hierarchical clustering on the states.
(a) \(\mathbf{Solution.}\qquad\)
data(USArrests)
hc.complete = hclust(dist(USArrests), method = "complete")
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.6)
(b) \(\mathbf{Solution.}\qquad\)
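The cluster memberships listed below are obtained by cutting the dendrogram into three clusters; a chunk along these lines produces them:
# Cut the tree into three clusters and list the states in each
num_clusters = 3
cluster = cutree(hc.complete, num_clusters)
for (i in 1:num_clusters) {
  print(paste("cluster", i))
  print(names(cluster)[cluster == i])
}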
## [1] "cluster 1"
## [1] "Alabama" "Alaska" "Arizona" "California"
## [5] "Delaware" "Florida" "Illinois" "Louisiana"
## [9] "Maryland" "Michigan" "Mississippi" "Nevada"
## [13] "New Mexico" "New York" "North Carolina" "South Carolina"
## [1] "cluster 2"
## [1] "Arkansas" "Colorado" "Georgia" "Massachusetts"
## [5] "Missouri" "New Jersey" "Oklahoma" "Oregon"
## [9] "Rhode Island" "Tennessee" "Texas" "Virginia"
## [13] "Washington" "Wyoming"
## [1] "cluster 3"
## [1] "Connecticut" "Hawaii" "Idaho" "Indiana"
## [5] "Iowa" "Kansas" "Kentucky" "Maine"
## [9] "Minnesota" "Montana" "Nebraska" "New Hampshire"
## [13] "North Dakota" "Ohio" "Pennsylvania" "South Dakota"
## [17] "Utah" "Vermont" "West Virginia" "Wisconsin"
(c) \(\mathbf{Solution.}\qquad\)
scaled_USArrests = scale(USArrests, center = FALSE, scale = TRUE)
hc.complete_scale = hclust(dist(scaled_USArrests), method = "complete")
cluster_scale = cutree(hc.complete_scale, num_clusters)
for (i in 1:num_clusters) {
  print(paste("cluster", i))
  print(names(cluster_scale)[cluster_scale == i])
}
## [1] "cluster 1"
## [1] "Alabama" "Georgia" "Louisiana" "Mississippi"
## [5] "North Carolina" "South Carolina"
## [1] "Alaska" "Arizona" "California" "Colorado" "Florida"
## [6] "Illinois" "Maryland" "Michigan" "Missouri" "Nevada"
## [11] "New Mexico" "New York" "Tennessee" "Texas"
## [1] "cluster 3"
## [1] "Arkansas" "Connecticut" "Delaware" "Hawaii"
## [5] "Idaho" "Indiana" "Iowa" "Kansas"
## [9] "Kentucky" "Maine" "Massachusetts" "Minnesota"
## [13] "Montana" "Nebraska" "New Hampshire" "New Jersey"
## [17] "North Dakota" "Ohio" "Oklahoma" "Oregon"
## [21] "Pennsylvania" "Rhode Island" "South Dakota" "Utah"
## [25] "Vermont" "Virginia" "Washington" "West Virginia"
## [29] "Wisconsin" "Wyoming"
(d) \(\mathbf{Solution.}\qquad\) The plot below shows the result after scaling the variables. We can see that scaling affects both the clustering result and the height of the tree. For instance, Texas is closest to Colorado without scaling but to Tennessee after scaling. In addition, the height of the tree is around 300 without scaling but shrinks to around 2 after scaling. In our case, scaling is more appropriate because the four variables are measured in different units (e.g., UrbanPop is a percentage while the others are rates per 100,000), so standardizing them lets the hierarchical clustering treat the variables on an equal footing.
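The dendrogram of the scaled data referred to above can be drawn in the same way as in part (a), e.g.:
# Dendrogram after scaling the variables
plot(hc.complete_scale, main = "Complete Linkage (Scaled Data)", xlab = "", sub = "", cex = 0.6)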