In this problem, you will develop models to predict the wine type based on the Wine data set. Examine the relationship between Type and the other features. Which of the other features seem most likely to be useful in predicting Type? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
(a) \(\mathbf{Solution.}\qquad\) Below we show box plots of Type against every other predictor. The features that differentiate Type most clearly are Alcohol, Phenols, Flavanoids, and Proline, since their distributions show the least overlap across types. Color and Hue are also potentially useful, since each appears to separate two of the types from the third.
# QUESTION 1a -----------------------------------------------------------------
wine_train = read.csv("./Data/wine_train.csv", header = TRUE)
wine_test = read.csv("./Data/wine_test.csv", header = TRUE)
# Set Type variable as factor
wine_train$Type = factor(wine_train$Type)
wine_test$Type = factor(wine_test$Type)
cols = colnames(wine_train)[2:ncol(wine_train)]
# Create box plots of Type vs each predictor
par(mfrow = c(2, 3))
for (col in cols) {
  plot(wine_train$Type, wine_train[, col], col = "steelblue",
       ylab = col, main = paste0("Wine Type vs ", col))
}
(b) \(\mathbf{Solution.}\qquad\) Below we compute the test errors for LDA, QDA, and naive Bayes.
# QUESTION 1b ----------
library(MASS)   # provides lda() and qda()
library(e1071)  # provides naiveBayes()
# LDA
lda_fit = lda(Type ~ ., data = wine_train)
lda_pred = predict(lda_fit, wine_test)$class
lda_test_err = mean(lda_pred != wine_test$Type)
# QDA
qda_fit = qda(Type ~ ., data = wine_train)
qda_pred = predict(qda_fit, wine_test)$class
qda_test_err = mean(qda_pred != wine_test$Type)
# Naive Bayes
naiveB = naiveBayes(Type ~ ., data = wine_train)
naiveB_pred = predict(naiveB, newdata = wine_test)
naiveB_test_err = mean(naiveB_pred != wine_test$Type)
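For reference, the three test errors can be printed together; this is just a convenience line over the objects computed above.
round(c(LDA = lda_test_err, QDA = qda_test_err, NB = naiveB_test_err), 4)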
We find that the test errors for LDA, QDA, and naive Bayes are 0.0182, 0.0364, and 0.0364, respectively, so LDA performs best on this test set.
Use the \(k\)-nearest neighbor classifier on the Theft dataset. Use cross-validation to select the best \(k\) and use the test data to evaluate the performance of the selected model. Show the training, cross-validation, and test errors for each choice of \(k\) and report your findings.
\(\mathbf{Solution.}\qquad\)
# QUESTION 2 ------------------------------------------------------------------
library(class)    # provides knn()
library(ggplot2)  # used for the error-rate plot below
theft_train = read.csv("./Data/theft_train.csv", header = TRUE)
theft_test = read.csv("./Data/theft_test.csv", header = TRUE)
## Kfold_CV_knn function from Lab 06
Kfold_CV_knn <- function(K, K_knn, train, train_label) {
  fold_size = floor(nrow(train) / K)
  cv_error = rep(0, K)
  for (i in 1:K) {
    # iteratively hold out one fold as validation data; the last fold also
    # absorbs any leftover rows
    if (i != K) {
      CV_test_id = ((i - 1) * fold_size + 1):(i * fold_size)
    } else {
      CV_test_id = ((i - 1) * fold_size + 1):nrow(train)
    }
    CV_train = train[-CV_test_id, ]
    CV_test = train[CV_test_id, ]
    # calculate the mean and standard deviation using CV_train only,
    # so no information leaks from the held-out fold
    mean_CV_train = colMeans(CV_train)
    sd_CV_train = apply(CV_train, 2, sd)
    # normalize CV_train and CV_test using the above mean and sd
    CV_train = scale(CV_train, center = mean_CV_train, scale = sd_CV_train)
    CV_test = scale(CV_test, center = mean_CV_train, scale = sd_CV_train)
    # fit knn and record the misclassification rate on the held-out fold
    pred_CV_test = knn(CV_train, CV_test, train_label[-CV_test_id], k = K_knn)
    cv_error[i] = mean(pred_CV_test != train_label[CV_test_id])
  }
  # average the K fold errors to obtain the CV estimate
  return(mean(cv_error))
}
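The returned quantity is the standard \(K\)-fold estimate: writing \(\mathrm{Err}_k\) for the misclassification rate on held-out fold \(k\),
\[
\mathrm{CV}_{(K)} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Err}_k.
\]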
set.seed(2020)
theft = 3 # column index of the response variable "theft"
K_fold = 10
K_knn = seq(from = 2, to = 50, by = 2)
cv_error = rep(0, length(K_knn))
train_err = rep(0, length(K_knn))
test_err = rep(0, length(K_knn))
# data normalization for training and test data using means and stds from
# training data.
mean_train = colMeans(theft_train[, -theft])
sd_train = apply(theft_train[, -theft], 2, sd)
K_train = scale(theft_train[, -theft], center = mean_train, scale = sd_train)
K_test = scale(theft_test[, -theft], center = mean_train, scale = sd_train)
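# Note: predictors are standardized as (x - mean_train) / sd_train so that KNN
# distances are not dominated by variables with large scales; the test set is
# scaled with the training means/sds to avoid information leakage.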
for (i in 1:length(K_knn)) {
  cv_error[i] = Kfold_CV_knn(K_fold, K_knn[i], theft_train[, -theft],
                             theft_train$theft)
  train_pred = knn(K_train, K_train, cl = theft_train$theft, k = K_knn[i])
  test_pred = knn(K_train, K_test, cl = theft_train$theft, k = K_knn[i])
  train_err[i] = mean(train_pred != theft_train$theft)
  test_err[i] = mean(test_pred != theft_test$theft)
}
best_k = K_knn[which(cv_error == min(cv_error))]
c(min(cv_error), best_k)
## [1] 0.3705714 32.0000000
# CV Error, Train error, and Test errors based on K from CV
pred_train = knn(train = K_train, test = K_train, cl = theft_train$theft,
k = best_k)
pred_test = knn(train = K_train, test = K_test, cl = theft_train$theft,
k = best_k)
train_err_cv = mean(pred_train != theft_train$theft)
test_err_cv = mean(pred_test != theft_test$theft)
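# For convenience, print the CV, train, and test errors at the selected k
# together (a summary over the objects computed above):
round(c(CV = min(cv_error), Train = train_err_cv, Test = test_err_cv), 4)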
# Show training, CV, and test errors
df = data.frame(K_knn, cv_error, train_err, test_err)
ggplot(df, aes(x = K_knn)) +
  geom_line(aes(y = cv_error, colour = "CV Error"), linetype = "twodash") +
  geom_line(aes(y = train_err, colour = "Train Error"), linetype = "solid") +
  geom_line(aes(y = test_err, colour = "Test Error"), linetype = "longdash") +
  labs(colour = "Error Types") +
  ylab("error rate") +
  xlab("K") +
  ggtitle("Error Rates with respect to K in KNN") +
  scale_color_manual(values = c("darkred", "steelblue", "black"))
We find that our CV approach picks \(k=\) 32, and the corresponding CV, train, and test errors are 0.3706, 0.3414, and 0.3840, respectively, which are quite close to one another.
The textbook (“An Introduction to Statistical Learning with Applications in R”) describes that the cv.glm() function can be used in order to compute the LOOCV error estimate. Alternatively, one could compute those quantities using just the glm() and predict.glm() functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a logistic regression model on the Weekly data set (in the ISLR package).
(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2. Report and comment on the result.
(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2, using all but the first observation. Report and comment on the result.
(c) Use the model from (b) to predict the direction of the first observation from its Lag1 and Lag2 values.
(a) \(\mathbf{Solution.}\qquad\) The logistic regression shows that Lag1 is not significant while Lag2 is significant.
# QUESTION 3a -----------------------------------------------------------------
library(ISLR)  # provides the Weekly data set
data(Weekly)
log_mod = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(log_mod)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22122405 0.06146572 3.599145 0.0003192652
## Lag1 -0.03872222 0.02621658 -1.477013 0.1396722362
## Lag2 0.06024830 0.02654589 2.269590 0.0232324586
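In terms of the fitted coefficients, the estimated model is
\[
\log\frac{\hat{p}(\mathrm{Up})}{1-\hat{p}(\mathrm{Up})} = 0.2212 - 0.0387\,\mathrm{Lag1} + 0.0602\,\mathrm{Lag2}.
\]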
(b) \(\mathbf{Solution.}\qquad\) As in part (a), we find that Lag1 is not significant while Lag2 is significant.
# QUESTION 3b ----------
Weekly2 = Weekly[-1,]
log_mod2 <- glm(Direction ~ Lag1 + Lag2, data = Weekly2, family = binomial)
summary(log_mod2)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22324305 0.06149894 3.630031 0.0002833875
## Lag1 -0.03843317 0.02621860 -1.465874 0.1426825151
## Lag2 0.06084763 0.02656088 2.290874 0.0219707105
(c) \(\mathbf{Solution.}\qquad\) Below we use the model from (b) to predict the first observation. By default predict() returns the linear predictor (log-odds), so we apply the inverse logit to obtain a probability.
# QUESTION 3c ----------
pred_mod = predict(log_mod2, Weekly[1, c("Lag1", "Lag2")]) # log-odds
pred = binomial()$linkinv(pred_mod) # inverse logit gives P(Direction = "Up")
We find that the predicted probability of Up is 0.571, so the model predicts Up; however, the true direction of the first observation is Down, so it is misclassified.
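As a sanity check, plugging the first observation's values (Lag1 = 0.816, Lag2 = 1.572) into the fitted model from (b) gives
\[
\hat{\eta} = 0.2232 - 0.0384 \times 0.816 + 0.0608 \times 1.572 \approx 0.287,
\qquad
\hat{p}(\mathrm{Up}) = \frac{1}{1+e^{-0.287}} \approx 0.571.
\]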
(d) \(\mathbf{Solution.}\qquad\)
# QUESTION 3d ----------
n = nrow(Weekly)
prob = rep(0, n)
pred = rep("", n)
error = rep(0, n)
for (i in 1:n) {
  # (i) fit the model on all observations except the i-th
  Weekly_LOOCV = Weekly[-i, ]
  log_LOOCV = glm(Direction ~ Lag1 + Lag2, data = Weekly_LOOCV,
                  family = binomial)
  # (ii) compute the posterior probability of Up for the held-out observation
  pred_LOOCV = predict(log_LOOCV, Weekly[i, c("Lag1", "Lag2")])
  prob[i] = binomial()$linkinv(pred_LOOCV)
  # (iii) predict Up if that probability exceeds 0.5
  pred[i] = ifelse(prob[i] > 0.5, "Up", "Down")
  # (iv) record whether the held-out observation was misclassified
  error[i] = ifelse(pred[i] != Weekly[i, "Direction"], 1, 0)
}
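The LOOCV estimate of the test error is then the average of these \(n\) indicator errors,
\[
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}_i = \frac{1}{n}\sum_{i=1}^{n} I\!\left(\hat{y}_i \neq y_i\right),
\]
which is computed in part (e) below.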
(e) \(\mathbf{Solution.}\qquad\) Below we average the \(n\) indicator errors from part (d).
# QUESTION 3e ----------
mean(error)
## [1] 0.4499541
We find that the LOOCV estimate of the test error is about 0.45, which means 45% of the observations were misclassified. We could potentially improve this classification model by adding other predictors and excluding Lag1.