STATS 503: Homework 6

Question 1

The auto data set (auto-mpg.txt on Canvas) concerns city-cycle fuel consumption in miles per gallon (mpg) and other attributes collected for 398 vehicle instances. The variables are: mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin and car name.

Perform PCA on this dataset including exploratory data analysis and write a report summarizing your results. In particular:

Describe the data and present some initial pictorial and numerical summaries, such as scatterplots, histograms, etc.
Consider which variables should or should not be included in PCA on this dataset. Compare PCA on covariances and correlations (you should choose one of them to proceed with for the subsequent questions).
Comment on the percentage of variance explained and number of principal components to retain. Include a scree plot.
Comment on variable loadings and their potential interpretations.
Make a plot of the data projected on the first two PCs. Comment on any interesting features, including potential outliers, if any.
Make a PCA biplot and comment on any interesting features.

(a) \(\mathbf{Solution.}\qquad\) Below we see that we have multiple predictors. The predictors weight and displacement have a storng correlation. Some variables have siginificant outliers, like horsepower. We also oberserve that there are two categorical variables: origin and carname.

# Load the data
auto = read.delim("./Data/auto-mpg.txt", header = FALSE, sep = "")
colnames(auto) = c("mpg", "cylinders", "displacement", "horsepower", "weight",
                   "acceleration", "model_year", "origin", "car_name")
auto$origin = factor(auto$origin, labels = c("American","European","Japanese"))
auto$horsepower=as.numeric(auto$horsepower)
auto = auto[complete.cases(auto), ]
summary(auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration     model_year         origin      car_name        
##  Min.   : 8.00   Min.   :70.00   American:245   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   European: 68   Class :character  
##  Median :15.50   Median :76.00   Japanese: 79   Mode  :character  
##  Mean   :15.54   Mean   :75.98                                    
##  3rd Qu.:17.02   3rd Qu.:79.00                                    
##  Max.   :24.80   Max.   :82.00

str(auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : Factor w/ 3 levels "American","European",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ car_name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

## Construct Pairs Plot
auto %>%
  filter(horsepower != 1) %>%
  select(-c(car_name,model_year)) %>%
  ggpairs(.,
          legend = 1,
          mapping = aes(colour= origin),
          lower = list(continuous = wrap("smooth",
                                         alpha = 0.3,
                                         size=0.1)))

(b) \(\mathbf{Solution.}\qquad\) We first exclude the last two variable since they are both factor predictors(origin is too discrete, and carname is simply not a good predictor). These two predictors could also be the potential response variables that we hope we could classify them into based on the remaining variables.

## pca with covariance
pca_data = as.matrix(auto[ , c(1:7)])
pca_results_cov = princomp(pca_data, cor = F)
summary(pca_results_cov)

## Importance of components:
##                             Comp.1       Comp.2       Comp.3       Comp.4
## Standard deviation     854.5910457 38.865860028 1.615443e+01 4.8154369111
## Proportion of Variance   0.9975368  0.002063236 3.564477e-04 0.0000316726
## Cumulative Proportion    0.9975368  0.999600082 9.999565e-01 0.9999882027
##                              Comp.5       Comp.6       Comp.7
## Standard deviation     2.348466e+00 1.688113e+00 5.216331e-01
## Proportion of Variance 7.533223e-06 3.892381e-06 3.716571e-07
## Cumulative Proportion  9.999957e-01 9.999996e-01 1.000000e+00

pca_results_cov$sdev^2

##       Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6 
## 7.303259e+05 1.510555e+03 2.609658e+02 2.318843e+01 5.515293e+00 2.849726e+00 
##       Comp.7 
## 2.721010e-01

From the results above we see that PCA with covariance matrix assigns more uneven weights on the predictors than the correlation matrix, likely because the predictors are not on the same scale. The first component assigns the largest weight to the predictor ’weight’, which has largest mean and variance among the predictors. Thus, it is reasonable to standardize the dataset. Below we use the correlation matrix as the input instead.

## Correlation
pca_results_cor = princomp(pca_data, cor = T)
summary(pca_results_cor)

## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.2384450 0.9303716 0.8534599 0.42885323 0.34916518
## Proportion of Variance 0.7158051 0.1236559 0.1040563 0.02627358 0.01741662
## Cumulative Proportion  0.7158051 0.8394610 0.9435173 0.96979087 0.98720749
##                             Comp.6     Comp.7
## Standard deviation     0.232931666 0.18785747
## Proportion of Variance 0.007751023 0.00504149
## Cumulative Proportion  0.994958510 1.00000000

pca_results_cor$sdev^2

##     Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7 
## 5.01063582 0.86559140 0.72839377 0.18391509 0.12191632 0.05425716 0.03529043

(c) \(\mathbf{Solution.}\qquad\) From the cumulative proportion of the PCA result we can see that the first component explains the largest proportion of variance (71.6%), the first 4 components explains most part of the total variance (96.0%). The scree plots also indicate that the ‘elbow’ appears at Comp. 4. Thus we take the first 4 components.

par(mfrow=c(1,2))
plot(pca_results_cor)
plot(eigen(cor(pca_data))$values, type = "l", xlab='Components', ylab = 'Variances')

cumsum(pca_results_cor$sdev^2 / sum(pca_results_cor$sdev^2))

##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
## 0.7158051 0.8394610 0.9435173 0.9697909 0.9872075 0.9949585 1.0000000

(d) \(\mathbf{Solution.}\qquad\) For Comp.1, four of the variables have negative effects, while the rest have positive effect. Model_year has the largest effect on Comp.2, while displace has the largest negative effect. Displacement has small effect on Comps.2 and 3. The loads of cylinder for Comp.3 is also insignificant, while acceleration has the largest (negative) effect. All loadings of variables except mpg for Comp.4 are negative, indicating negative effects.

loadings(pca_results_cor)

## 
## Loadings:
##              Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## mpg           0.398  0.207  0.257  0.751  0.341  0.210       
## cylinders    -0.416  0.199 -0.139  0.477 -0.493 -0.333  0.432
## displacement -0.429  0.180 -0.100  0.298         0.143 -0.813
## horsepower   -0.423         0.170         0.711 -0.523       
## weight       -0.414  0.225 -0.276 -0.108  0.265  0.697  0.367
## acceleration  0.285        -0.893  0.121  0.231 -0.224       
## model_year    0.230  0.910        -0.302        -0.128       
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.143  0.143  0.143  0.143  0.143  0.143  0.143
## Cumulative Var  0.143  0.286  0.429  0.571  0.714  0.857  1.000

(e) \(\mathbf{Solution.}\qquad\) All the data are colored by origin. We can see that second principal component cannot seperate the data well, but the first principle component does a decent job of seperating “American” origin from the other two groups. However the first componen does not do a good job of seperating origin “Japanese” and “European” that most of the green and blue points are overlapping. In addition, point 365 might be a potential outlier.

# Inspect the data visually
pca_projections <- as.data.frame(standardize(pca_data) %*%
                                   loadings(pca_results_cor)[,1:2])
colnames(pca_projections) <- c("V1", "V2")

# Add back in categorical variables
pca_projections <- cbind(pca_projections,
auto[ , c("origin", "car_name")])

# Plot projections in 2 dimensions
# Randomize data in case data is sorted
ggplot(data = pca_projections) +
  geom_point(aes(x = V1, y = V2, col = origin), size=4.5) +
  geom_text(aes(x = V1, y = V2, label=rownames(pca_projections)), size=3)

(f) \(\mathbf{Solution.}\qquad\) The directions are almost coincide for cylinder/weight and acceleration/mpg with respect to component 1 and 2. They correspond to the strong correlation of each other as we could see from the ggpairs plot in 1(a).

autoplot(pca_results_cor, data = auto, colour = 'origin',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

Question 2

The table above contains airline distances among 10 US cities. Perform multidimensional scaling to produce a 2-dimensional map for these cities, with names and locations of the cities.

\(\mathbf{Solution.}\qquad\)

# Construct Data
data = list(list(587, 1212, 701, 1936, 604, 748, 2139, 2182, 543),
            list(920, 940, 1745, 1188, 713, 1858, 1737, 597),
            list(879, 831, 1726, 1631, 949, 1021, 1494),
            list(1374, 968, 1420, 1645, 1891, 1220),
            list(2339, 2451, 347, 959, 2300),
            list(1092, 2594, 2734, 923),
            list(2571, 2408, 205),
            list(678, 2442),
            list(2329))

A = matrix(0, nrow = 10, ncol = 10)
for (i in 1:length(data)) {
  A[i,] = c(rep(0, i), unlist(data[i]))
}
A[lower.tri(A)] = t(A)[lower.tri(A)]

cities = c("Atlanta", "Chicago", "Denver", "Houston", "Los Angeles", "Miami", 
           "New York", "San Francisco", "Seattle", "Washington DC")
rownames(A) = cities
colnames(A) = cities

# Perform MDS
cities.mds = cmdscale(A, k = 2)
qplot(x = cities.mds[, 1], y = cities.mds[, 2], 
      label = row.names(A), 
      geom = "text",
      xlab = "First Dimension", 
      ylab = "Second Dimension", 
      main = "Two-Dimensional Map of Cities")

Question 3

Consider the USArrests data (in the ISLR package). We will now perform hierarchical clustering on the states.

Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one. Now which states belong to which clusters?
What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

(a) \(\mathbf{Solution.}\qquad\)

data(USArrests)
hc.complete = hclust(dist(USArrests), method = "complete")
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.6)

(b) \(\mathbf{Solution.}\qquad\)

num_clusters = 3
clusters = cutree(hc.complete, num_clusters)
print('cluster 1')

## [1] "cluster 1"

print(rownames(USArrests)[clusters == 1])

##  [1] "Alabama"        "Alaska"         "Arizona"        "California"    
##  [5] "Delaware"       "Florida"        "Illinois"       "Louisiana"     
##  [9] "Maryland"       "Michigan"       "Mississippi"    "Nevada"        
## [13] "New Mexico"     "New York"       "North Carolina" "South Carolina"

print('cluster 2')

## [1] "cluster 2"

print(rownames(USArrests)[clusters == 2])

##  [1] "Arkansas"      "Colorado"      "Georgia"       "Massachusetts"
##  [5] "Missouri"      "New Jersey"    "Oklahoma"      "Oregon"       
##  [9] "Rhode Island"  "Tennessee"     "Texas"         "Virginia"     
## [13] "Washington"    "Wyoming"

print('cluster 3')

## [1] "cluster 3"

print(rownames(USArrests)[clusters == 3])

##  [1] "Connecticut"   "Hawaii"        "Idaho"         "Indiana"      
##  [5] "Iowa"          "Kansas"        "Kentucky"      "Maine"        
##  [9] "Minnesota"     "Montana"       "Nebraska"      "New Hampshire"
## [13] "North Dakota"  "Ohio"          "Pennsylvania"  "South Dakota" 
## [17] "Utah"          "Vermont"       "West Virginia" "Wisconsin"

(c) \(\mathbf{Solution.}\qquad\)

scaled_USArrests = scale(USArrests, center = FALSE, scale = TRUE)
hc.complete_scale = hclust(dist(scaled_USArrests), method = "complete")
cluster_scale = cutree(hc.complete_scale, num_clusters)
print('cluster 1')

## [1] "cluster 1"

print(rownames(scaled_USArrests)[cluster_scale == 1])

## [1] "Alabama"        "Georgia"        "Louisiana"      "Mississippi"   
## [5] "North Carolina" "South Carolina"

print(rownames(scaled_USArrests)[cluster_scale == 2])

##  [1] "Alaska"     "Arizona"    "California" "Colorado"   "Florida"   
##  [6] "Illinois"   "Maryland"   "Michigan"   "Missouri"   "Nevada"    
## [11] "New Mexico" "New York"   "Tennessee"  "Texas"

print('cluster 3')

## [1] "cluster 3"

print(rownames(scaled_USArrests)[cluster_scale == 3])

##  [1] "Arkansas"      "Connecticut"   "Delaware"      "Hawaii"       
##  [5] "Idaho"         "Indiana"       "Iowa"          "Kansas"       
##  [9] "Kentucky"      "Maine"         "Massachusetts" "Minnesota"    
## [13] "Montana"       "Nebraska"      "New Hampshire" "New Jersey"   
## [17] "North Dakota"  "Ohio"          "Oklahoma"      "Oregon"       
## [21] "Pennsylvania"  "Rhode Island"  "South Dakota"  "Utah"         
## [25] "Vermont"       "Virginia"      "Washington"    "West Virginia"
## [29] "Wisconsin"     "Wyoming"

(d) \(\mathbf{Solution.}\qquad\) The plot below shows the result after scaling the variables. We can see that scaling impacts the clustering result as well as the height of the tree. For instance,Texas is most “closed” to Colorado without scaling while to Tennessee after scaling. In addition, the height of the tree without scaling is around 300 while is narrowed to around 2 after scaling. In our case, scaling is more appropriate because our 4 variables have different data units (e.g.UrbanPop and Rape). Therefore, it is more reasonable to do scaling so the hierarchical clustering algorithm can work well.

plot(hc.complete_scale, main="Complete Linkage (After Scaling)", 
     xlab = "", sub = "", cex = 0.6)

STATS 503: Homework 6

Israel Diego

4/19/2020

Question 1

Question 2

Question 3