In this problem, you will develop models to predict the wine type based on the Wine data set. Examine the relationship between Type and the other features. Which of the other features seem most likely to be useful in predicting Type? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
(a) \(\mathbf{Solution.}\qquad\) Below we show box plots of Type against every other predictor. The features that differentiate Type most clearly are Alcohol, Phenols, Flavanoids, and Proline, since their distributions show the least overlap across types. Color and Hue are also potentially useful, since each appears to separate two of the types from the third.
# QUESTION 1a -----------------------------------------------------------------
wine_train = read.csv("./Data/wine_train.csv", header = TRUE)
wine_test = read.csv("./Data/wine_test.csv", header = TRUE)
# Set Type variable as factor
wine_train$Type = factor(wine_train$Type)
wine_test$Type = factor(wine_test$Type)
cols = colnames(wine_train)[2:ncol(wine_train)]
# Create box plots of Type vs each predictor
par(mfrow = c(2, 3))
for (col in cols) {
  plot(wine_train$Type, wine_train[, col], col = "steelblue",
       ylab = col, main = paste0("Wine Type vs ", col))
}
(b) \(\mathbf{Solution.}\qquad\) Below we compute the test errors for LDA, QDA, and naive Bayes.
# QUESTION 1b ----------
library(MASS)   # provides lda() and qda()
library(e1071)  # provides naiveBayes()
# LDA
lda_fit = lda(Type ~ ., data = wine_train)
lda_pred = predict(lda_fit, wine_test)$class
lda_test_err = mean(lda_pred != wine_test$Type)
# QDA
qda_fit = qda(Type ~ ., data = wine_train)
qda_pred = predict(qda_fit, wine_test)$class
qda_test_err = mean(qda_pred != wine_test$Type)
# Naive Bayes
naiveB = naiveBayes(Type ~ ., data = wine_train)
naiveB_pred = predict(naiveB, newdata = wine_test)
naiveB_test_err = mean(naiveB_pred != wine_test$Type)
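For reference, the three test errors can be printed together; this is just a convenience line over the objects computed above.
round(c(LDA = lda_test_err, QDA = qda_test_err, NB = naiveB_test_err), 4)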
We find that the test errors for LDA, QDA, and naive Bayes are 0.0182, 0.0364, and 0.0364, respectively, so LDA performs best on this test set.
Use the \(k\)-nearest neighbor classifier on the Theft dataset. Use cross-validation to select the best \(k\) and use the test data to evaluate the performance of the selected model. Show the training, cross-validation, and test errors for each choice of \(k\) and report your findings.
\(\mathbf{Solution.}\qquad\)
# QUESTION 2 ------------------------------------------------------------------
library(class)    # provides knn()
library(ggplot2)  # used for the error-rate plot below
theft_train = read.csv("./Data/theft_train.csv", header = TRUE)
theft_test = read.csv("./Data/theft_test.csv", header = TRUE)
## Kfold_CV_knn function from Lab 06
Kfold_CV_knn <- function(K, K_knn, train, train_label) {
  fold_size = floor(nrow(train) / K)
  cv_error = rep(0, K)
  for (i in 1:K) {
    # iteratively hold out one fold as validation data; the last fold also
    # absorbs any leftover rows
    if (i != K) {
      CV_test_id = ((i - 1) * fold_size + 1):(i * fold_size)
    } else {
      CV_test_id = ((i - 1) * fold_size + 1):nrow(train)
    }
    CV_train = train[-CV_test_id, ]
    CV_test = train[CV_test_id, ]
    # calculate the mean and standard deviation using CV_train only,
    # so no information leaks from the held-out fold
    mean_CV_train = colMeans(CV_train)
    sd_CV_train = apply(CV_train, 2, sd)
    # normalize CV_train and CV_test using the above mean and sd
    CV_train = scale(CV_train, center = mean_CV_train, scale = sd_CV_train)
    CV_test = scale(CV_test, center = mean_CV_train, scale = sd_CV_train)
    # fit knn and record the misclassification rate on the held-out fold
    pred_CV_test = knn(CV_train, CV_test, train_label[-CV_test_id], k = K_knn)
    cv_error[i] = mean(pred_CV_test != train_label[CV_test_id])
  }
  # average the K fold errors to obtain the CV estimate
  return(mean(cv_error))
}
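The returned quantity is the standard \(K\)-fold estimate: writing \(\mathrm{Err}_k\) for the misclassification rate on held-out fold \(k\),
\[
\mathrm{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Err}_k.
\]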
set.seed(2020)
theft = 3 # column index of the response variable "theft"
K_fold = 10
K_knn = seq(from = 2, to = 50, by = 2)
cv_error = rep(0, length(K_knn))
train_err = rep(0, length(K_knn))
test_err = rep(0, length(K_knn))
# data normalization for training and test data using means and stds from
# training data.
mean_train = colMeans(theft_train[, -theft])
sd_train = apply(theft_train[, -theft], 2, sd)
K_train = scale(theft_train[, -theft], center = mean_train, scale = sd_train)
K_test = scale(theft_test[, -theft], center = mean_train, scale = sd_train)
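# Note: predictors are standardized as (x - mean_train) / sd_train so that KNN
# distances are not dominated by variables with large scales; the test set is
# scaled with the training means/sds to avoid information leakage.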
for (i in 1:length(K_knn)) {
  cv_error[i] = Kfold_CV_knn(K_fold, K_knn[i], theft_train[, -theft],
                             theft_train$theft)
  train_pred = knn(K_train, K_train, cl = theft_train$theft, k = K_knn[i])
  test_pred = knn(K_train, K_test, cl = theft_train$theft, k = K_knn[i])
  train_err[i] = mean(train_pred != theft_train$theft)
  test_err[i] = mean(test_pred != theft_test$theft)
}
best_k = K_knn[which(cv_error == min(cv_error))]
c(min(cv_error), best_k)
## [1] 0.3705714 32.0000000
# CV Error, Train error, and Test errors based on K from CV
pred_train = knn(train = K_train, test = K_train, cl = theft_train$theft,
k = best_k)
pred_test = knn(train = K_train, test = K_test, cl = theft_train$theft,
k = best_k)
train_err_cv = mean(pred_train != theft_train$theft)
test_err_cv = mean(pred_test != theft_test$theft)
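# For convenience, print the CV, train, and test errors at the selected k
# together (a summary over the objects computed above):
round(c(CV = min(cv_error), Train = train_err_cv, Test = test_err_cv), 4)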
# Show training, CV, and test errors
df = data.frame(K_knn, cv_error, train_err, test_err)
ggplot(df, aes(x = K_knn)) +
  geom_line(aes(y = cv_error, colour = "CV Error"), linetype = "twodash") +
  geom_line(aes(y = train_err, colour = "Train Error"), linetype = "solid") +
  geom_line(aes(y = test_err, colour = "Test Error"), linetype = "longdash") +
  labs(colour = "Error Types") +
  ylab("error rate") +
  xlab("K") +
  ggtitle("Error Rates with respect to K in KNN") +
  scale_color_manual(values = c("darkred", "steelblue", "black"))
We find that our CV approach picks \(k=\) 32, and the corresponding CV, train, and test errors are 0.3706, 0.3414, and 0.3840, respectively, which are quite close to one another.
The textbook (“An Introduction to Statistical Learning with Applications in R”) describes that the cv.glm() function can be used in order to compute the LOOCV error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a logistic regression model on the Weekly data set (in the ISLR package).
(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2. Report and comment on the result.
(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2, using all but the first observation. Report and comment on the result.
(c) Use the model from (b) to predict the direction of the first observation from its Lag1 and Lag2 values.
(a) \(\mathbf{Solution.}\qquad\) The logistic regression shows that Lag1 is not significant while Lag2 is significant.
# QUESTION 3a -----------------------------------------------------------------
library(ISLR)  # provides the Weekly data set
data(Weekly)
log_mod = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(log_mod)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22122405 0.06146572 3.599145 0.0003192652
## Lag1 -0.03872222 0.02621658 -1.477013 0.1396722362
## Lag2 0.06024830 0.02654589 2.269590 0.0232324586
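In terms of the fitted coefficients, the estimated model is
\[
\log\frac{\hat{p}(\mathrm{Up})}{1-\hat{p}(\mathrm{Up})} = 0.2212 - 0.0387\,\mathrm{Lag1} + 0.0602\,\mathrm{Lag2}.
\]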
(b) \(\mathbf{Solution.}\qquad\) As in part (a), we find that Lag1 is not significant while Lag2 is significant.
# QUESTION 3b ----------
Weekly2 = Weekly[-1,]
log_mod2 <- glm(Direction ~ Lag1 + Lag2, data = Weekly2, family = binomial)
summary(log_mod2)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22324305 0.06149894 3.630031 0.0002833875
## Lag1 -0.03843317 0.02621860 -1.465874 0.1426825151
## Lag2 0.06084763 0.02656088 2.290874 0.0219707105
(c) \(\mathbf{Solution.}\qquad\) Below we use the model from (b) to predict the first observation. By default predict() returns the linear predictor (log-odds), so we apply the inverse logit to obtain a probability.
# QUESTION 3c ----------
pred_mod = predict(log_mod2, Weekly[1, c("Lag1", "Lag2")]) # log-odds
pred = binomial()$linkinv(pred_mod) # inverse logit gives P(Direction = "Up")
We find that the predicted probability of Up is 0.571, so the model predicts Up; however, the true direction of the first observation is Down, so it is misclassified.
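As a sanity check, plugging the first observation's values (Lag1 = 0.816, Lag2 = 1.572) into the fitted model from (b) gives
\[
\hat{\eta} = 0.2232 - 0.0384 \times 0.816 + 0.0608 \times 1.572 \approx 0.287,
\qquad
\hat{p}(\mathrm{Up}) = \frac{1}{1+e^{-0.287}} \approx 0.571.
\]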
(d) \(\mathbf{Solution.}\qquad\)
# QUESTION 3d ----------
n = nrow(Weekly)
prob = rep(0, n)
pred = rep("", n)
error = rep(0, n)
for (i in 1:n) {
  # (i) fit the model on all observations except the i-th
  Weekly_LOOCV = Weekly[-i, ]
  log_LOOCV = glm(Direction ~ Lag1 + Lag2, data = Weekly_LOOCV,
                  family = binomial)
  # (ii) compute the posterior probability of Up for the held-out observation
  pred_LOOCV = predict(log_LOOCV, Weekly[i, c("Lag1", "Lag2")])
  prob[i] = binomial()$linkinv(pred_LOOCV)
  # (iii) predict Up if that probability exceeds 0.5
  pred[i] = ifelse(prob[i] > 0.5, "Up", "Down")
  # (iv) record whether the held-out observation was misclassified
  error[i] = ifelse(pred[i] != Weekly[i, "Direction"], 1, 0)
}
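The LOOCV estimate of the test error is then the average of these \(n\) indicator errors,
\[
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}_i = \frac{1}{n}\sum_{i=1}^{n} I\!\left(\hat{y}_i \neq y_i\right),
\]
which is computed in part (e) below.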
(e) \(\mathbf{Solution.}\qquad\) Below we average the \(n\) indicator errors from part (d).
# QUESTION 3e ----------
mean(error)
## [1] 0.4499541
We find that the LOOCV estimate of the test error is about 0.45, which means 45% of the observations were misclassified. We could potentially improve this classification model by adding other predictors and excluding Lag1.