For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) \(\mathbf{Solution.}\qquad\) The flexible model would perform better. Because \(n\) is extremely large, a flexible method can estimate its many parameters with little risk of overfitting, and the small number of predictors further limits the variance of the fit.
(b) \(\mathbf{Solution.}\qquad\) The flexible model would perform worse: with so few observations relative to the number of predictors, it is likely to overfit the training data.
(c) \(\mathbf{Solution.}\qquad\) The flexible model would perform better, since a highly non-linear relationship between the predictors and the response can be captured by a flexible method but not by an inflexible one.
(d) \(\mathbf{Solution.}\qquad\) The flexible model would perform worse. A high variance of the error terms means the data are very noisy, and a flexible model is likely to fit this noise, whereas an inflexible model is less prone to overfitting it.
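To make (b) and (d) concrete, here is a minimal simulation sketch (all names and settings below are illustrative, not part of the problem): we generate a simple linear signal buried in high-variance noise and compare a rigid linear fit against a very flexible smoothing spline on fresh test data. On most seeds the flexible fit, having chased the noise, produces the larger test error.
# Simulation sketch: noisy linear signal, rigid vs flexible fit
set.seed(1)
n = 100
x = runif(n, 0, 10)
y = 2 + 0.5 * x + rnorm(n, sd = 5)   # linear signal, very noisy errors
x_new = runif(n, 0, 10)              # fresh test data from the same model
y_new = 2 + 0.5 * x_new + rnorm(n, sd = 5)
fit_rigid = lm(y ~ x)                    # inflexible: simple linear fit
fit_flex = smooth.spline(x, y, df = 25)  # flexible: many degrees of freedom
mse = function(truth, pred) mean((truth - pred)^2)
mse(y_new, predict(fit_rigid, data.frame(x = x_new)))  # test MSE, rigid
mse(y_new, predict(fit_flex, x_new)$y)                 # test MSE, flexible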
Use the \(k\)-nearest neighbor classifier on the diabetes dataset. In particular, consider \(k = 1, 2, \ldots, 20\). Show both the training and test errors for each choice and report your findings.
Hint: Note that the predictor/input variables have different units and scales. Therefore, standardization is necessary before applying the KNN method. Please refer to the lab notes for details.
Limit your solutions to at most 5 pages (including code and figures).
\(\mathbf{Solution.}\qquad\) First we load the required packages and read in the training and test data so that we can build a KNN classifier.
# Load required packages
library(dplyr)    # data manipulation (%>%, filter, select, pull)
library(class)    # knn()
library(ggplot2)  # plotting
# Read in the training and test data
data_train = read.csv("./Data/diabetes_train.csv")
data_test = read.csv("./Data/diabetes_test.csv")
# Convert Outcome Variables to factors ----------------------------------------
data_train$Outcome = factor(data_train$Outcome)
data_test$Outcome = factor(data_test$Outcome)
levels(data_train$Outcome) = c("Diabetes", "No_Diabetes")
levels(data_test$Outcome) = c("Diabetes", "No_Diabetes")
summary(data_train)
##   Pregnancies        Glucose      BloodPressure    SkinThickness
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
##  1st Qu.: 1.000   1st Qu.:103.0   1st Qu.: 64.00   1st Qu.: 0.00
##  Median : 3.000   Median :123.0   Median : 72.00   Median :22.50
##  Mean   : 4.054   Mean   :124.8   Mean   : 69.67   Mean   :20.07
##  3rd Qu.: 7.000   3rd Qu.:145.0   3rd Qu.: 80.00   3rd Qu.:32.00
##  Max.   :17.000   Max.   :199.0   Max.   :114.00   Max.   :99.00
##     Insulin           BMI        DiabetesPedigreeFunction      Age
##  Min.   :  0.00   Min.   : 0.00   Min.   :0.0780           Min.   :21.00
##  1st Qu.:  0.00   1st Qu.:27.88   1st Qu.:0.2537           1st Qu.:25.00
##  Median :  0.00   Median :32.50   Median :0.4025           Median :31.00
##  Mean   : 84.07   Mean   :32.55   Mean   :0.5023           Mean   :34.33
##  3rd Qu.:130.00   3rd Qu.:36.80   3rd Qu.:0.6750           3rd Qu.:41.25
##  Max.   :846.00   Max.   :59.40   Max.   :2.4200           Max.   :81.00
##         Outcome
##  Diabetes   :223
##  No_Diabetes:205
The summary table above reveals that some minimum values equal 0, namely for BloodPressure, Glucose, SkinThickness, and BMI. Since a value of 0 is not physically possible for these variables, we exclude such observations from both the training and test sets.
data_train = data_train %>%
  filter(BloodPressure != 0 & BMI != 0 & Glucose != 0 & SkinThickness != 0)
data_test = data_test %>%
  filter(BloodPressure != 0 & BMI != 0 & Glucose != 0 & SkinThickness != 0)
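As a quick sanity check (optional; the exact counts depend on the data files), we can look at how many observations remain after filtering:
# Check how many observations survive the zero-value filter
nrow(data_train)
nrow(data_test)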
Next we separate the outcome variable from the design matrix and standardize the predictors. Both sets are centered and scaled with the training-set means and standard deviations, so no information from the test set leaks into the preprocessing.
# Pull out the outcome variable and the design matrix -------------------------
train_label = data_train %>% pull(Outcome)
train_x = data_train %>% select(-Outcome)
test_label = data_test %>% pull(Outcome)
test_x = data_test %>% select(-Outcome)
# Scale the Data --------------------------------------------------------------
mean_train = colMeans(train_x)
std_train = sqrt(diag(var(train_x)))
# training data
train_x = scale(train_x, center = mean_train, scale = std_train)
# test data
test_x = scale(test_x, center = mean_train, scale = std_train)
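As a quick check (a sketch, not required by the problem), the standardized training columns should now have mean 0 and standard deviation 1; the test columns will be close to, but not exactly, these values since they reuse the training statistics. This step matters because unstandardized variables on large scales, such as Insulin, would otherwise dominate the Euclidean distances that KNN uses.
# Verify the standardization of the training design matrix
round(colMeans(train_x), 3)       # should be 0 in every column
round(apply(train_x, 2, sd), 3)   # should be 1 in every column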
Finally we can make predictions for values of \(k=1,\ldots,20\) and store the training and test errors as vectors.
# Make diabetes predictions ----------------------------------------------------
set.seed(1)
# Test values of k = 1:20
k_range = 1:20
train_error = numeric(length(k_range))
test_error = numeric(length(k_range))
for (i in seq_along(k_range)) {
  # Training error: predict the training labels from the training data
  pred_train = knn(train_x, train_x, train_label, k = k_range[i])
  train_error[i] = mean(pred_train != train_label)
  # Test error: predict the test labels from the training data
  pred_test = knn(train_x, test_x, train_label, k = k_range[i])
  test_error[i] = mean(pred_test != test_label)
}
Below we show the error plots as a function of \(1/k\).
# Create the error plot
errors = data.frame(train_error, test_error, k_range)
ggplot(errors, aes(x = 1 / k_range)) +
  geom_line(aes(y = train_error), col = "darkred") +
  geom_point(aes(y = train_error), col = "darkred") +
  geom_line(aes(y = test_error), col = "steelblue") +
  geom_point(aes(y = test_error), col = "steelblue") +
  ylab("Error Rate") + xlab("1/k") +
  ggtitle("Training and test error rate for KNN") +
  theme_minimal()
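Because the colours above are hard-coded, the plot has no legend. An alternative sketch (assuming the tidyr package is available) reshapes the errors to long format so that ggplot generates a legend automatically:
library(tidyr)  # for pivot_longer()
errors_long = errors %>%
  pivot_longer(c(train_error, test_error),
               names_to = "set", values_to = "error")
ggplot(errors_long, aes(x = 1 / k_range, y = error, colour = set)) +
  geom_line() +
  geom_point() +
  ylab("Error Rate") + xlab("1/k") +
  ggtitle("Training and test error rate for KNN") +
  theme_minimal()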
The optimal value of \(k\) is the one that minimizes the prediction error on the test set. Below we refit the classifier at this optimal \(k\) and display the confusion matrix and accuracy for both the training and test sets.
# Refit at the value of k that minimizes the test error
best_k = k_range[which.min(test_error)]
pred_train = knn(train_x, train_x, train_label, k = best_k)
pred_test = knn(train_x, test_x, train_label, k = best_k)
table(pred_train, train_label)
##              train_label
## pred_train    Diabetes No_Diabetes
##   Diabetes         150           0
##   No_Diabetes        0         133
mean(pred_train == train_label)
## [1] 1
table(pred_test, test_label)
##              test_label
## pred_test     Diabetes No_Diabetes
##   Diabetes          23          12
##   No_Diabetes        7          32
mean(pred_test == test_label)
## [1] 0.7432432
Based on the training and test error plots, the test error is minimized at \(\frac{1}{k} = 1\), i.e., at \(k = 1\) nearest neighbor. For this value of \(k\) the training accuracy is 1, as expected, since with \(k = 1\) every training point is its own nearest neighbor, and the test accuracy is 0.743, the highest over the range considered. Note: the selected value of \(k\) can differ slightly under a different random seed, because when several neighbors are tied in distance, the knn classifier breaks the tie at random.
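To illustrate this sensitivity, a short sketch (not required by the problem) reruns the test-error computation under ten different seeds and tabulates which \(k\) minimizes the test error each time:
# How stable is the selected k across random seeds?
chosen_k = sapply(1:10, function(s) {
  set.seed(s)
  errs = sapply(k_range, function(k)
    mean(knn(train_x, test_x, train_label, k = k) != test_label))
  k_range[which.min(errs)]
})
table(chosen_k)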