For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) \(\mathbf{Solution.}\qquad\) The flexible model would perform better. Because \(n\) is extremely large, a flexible method can estimate its many parameters with little risk of overfitting, and the small number of predictors further limits the variance of the fit.
(b) \(\mathbf{Solution.}\qquad\) The flexible model would perform worse: with so few observations relative to the number of predictors, it is likely to overfit the training data.
(c) \(\mathbf{Solution.}\qquad\) The flexible model would perform better, since a highly non-linear relationship between the predictors and the response can be captured by a flexible method but not by an inflexible one.
(d) \(\mathbf{Solution.}\qquad\) The flexible model would perform worse. A high variance of the error terms means the data are very noisy, and a flexible model is likely to fit this noise, whereas an inflexible model is less prone to overfitting it.
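To make (b) and (d) concrete, here is a minimal simulation sketch (all names and settings below are illustrative, not part of the problem): we generate a simple linear signal buried in high-variance noise and compare a rigid linear fit against a very flexible smoothing spline on fresh test data. On most seeds the flexible fit, having chased the noise, produces the larger test error.
# Simulation sketch: noisy linear signal, rigid vs flexible fit
set.seed(1)
n = 100
x = runif(n, 0, 10)
y = 2 + 0.5 * x + rnorm(n, sd = 5)   # linear signal, very noisy errors
x_new = runif(n, 0, 10)              # fresh test data from the same model
y_new = 2 + 0.5 * x_new + rnorm(n, sd = 5)
fit_rigid = lm(y ~ x)                    # inflexible: simple linear fit
fit_flex = smooth.spline(x, y, df = 25)  # flexible: many degrees of freedom
mse = function(truth, pred) mean((truth - pred)^2)
mse(y_new, predict(fit_rigid, data.frame(x = x_new)))  # test MSE, rigid
mse(y_new, predict(fit_flex, x_new)$y)                 # test MSE, flexible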
Use the \(k\)-nearest neighbor classifier on the diabetes dataset. In particular, consider \(k = 1, 2, \ldots, 20\). Show both the training and test errors for each choice and report your findings.
Hint: Note that the predictor/input variables have different units and scales. Therefore, standardization is necessary before applying the KNN method. Please refer to the lab notes for details.
Limit your solutions to at most 5 pages (including code and figures).
\(\mathbf{Solution.}\qquad\) First we load the required packages and read in the training and test data so that we can build a KNN classifier.
# Load required packages
library(dplyr)    # data manipulation (%>%, filter, select, pull)
library(class)    # knn()
library(ggplot2)  # plotting
# Read in the training and test data
data_train = read.csv("./Data/diabetes_train.csv")
data_test = read.csv("./Data/diabetes_test.csv")
# Convert Outcome Variables to factors ----------------------------------------
data_train$Outcome = factor(data_train$Outcome)
data_test$Outcome = factor(data_test$Outcome)
levels(data_train$Outcome) = c("Diabetes", "No_Diabetes")
levels(data_test$Outcome) = c("Diabetes", "No_Diabetes")
summary(data_train)
##   Pregnancies        Glucose      BloodPressure    SkinThickness
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
##  1st Qu.: 1.000   1st Qu.:103.0   1st Qu.: 64.00   1st Qu.: 0.00
##  Median : 3.000   Median :123.0   Median : 72.00   Median :22.50
##  Mean   : 4.054   Mean   :124.8   Mean   : 69.67   Mean   :20.07
##  3rd Qu.: 7.000   3rd Qu.:145.0   3rd Qu.: 80.00   3rd Qu.:32.00
##  Max.   :17.000   Max.   :199.0   Max.   :114.00   Max.   :99.00
##     Insulin           BMI        DiabetesPedigreeFunction      Age
##  Min.   :  0.00   Min.   : 0.00   Min.   :0.0780           Min.   :21.00
##  1st Qu.:  0.00   1st Qu.:27.88   1st Qu.:0.2537           1st Qu.:25.00
##  Median :  0.00   Median :32.50   Median :0.4025           Median :31.00
##  Mean   : 84.07   Mean   :32.55   Mean   :0.5023           Mean   :34.33
##  3rd Qu.:130.00   3rd Qu.:36.80   3rd Qu.:0.6750           3rd Qu.:41.25
##  Max.   :846.00   Max.   :59.40   Max.   :2.4200           Max.   :81.00
##         Outcome
##  Diabetes   :223
##  No_Diabetes:205
The summary table above reveals that some minimum values equal 0, namely for BloodPressure, Glucose, SkinThickness, and BMI. Since a value of 0 is not physically possible for these variables, we exclude such observations from both the training and test sets.
data_train = data_train %>%
  filter(BloodPressure != 0 & BMI != 0 & Glucose != 0 & SkinThickness != 0)
data_test = data_test %>%
  filter(BloodPressure != 0 & BMI != 0 & Glucose != 0 & SkinThickness != 0)
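As a quick sanity check (optional; the exact counts depend on the data files), we can look at how many observations remain after filtering:
# Check how many observations survive the zero-value filter
nrow(data_train)
nrow(data_test)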
Next we separate the outcome variable from the design matrix and standardize the predictors. Both sets are centered and scaled with the training-set means and standard deviations, so no information from the test set leaks into the preprocessing.
# Pull out the outcome variable and the design matrix -------------------------
train_label = data_train %>% pull(Outcome)
train_x = data_train %>% select(-Outcome)
test_label = data_test %>% pull(Outcome)
test_x = data_test %>% select(-Outcome)
# Scale the Data --------------------------------------------------------------
mean_train = colMeans(train_x)
std_train = sqrt(diag(var(train_x)))
# training data
train_x = scale(train_x, center = mean_train, scale = std_train)
# test data
test_x = scale(test_x, center = mean_train, scale = std_train)
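As a quick check (a sketch, not required by the problem), the standardized training columns should now have mean 0 and standard deviation 1; the test columns will be close to, but not exactly, these values since they reuse the training statistics. This step matters because unstandardized variables on large scales, such as Insulin, would otherwise dominate the Euclidean distances that KNN uses.
# Verify the standardization of the training design matrix
round(colMeans(train_x), 3)       # should be 0 in every column
round(apply(train_x, 2, sd), 3)   # should be 1 in every column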
Finally we can make predictions for values of \(k=1,\ldots,20\) and store the training and test errors as vectors.
# Make diabetes predictions ----------------------------------------------------
set.seed(1)
# Test values of k = 1:20
k_range = 1:20
train_error = numeric(length(k_range))
test_error = numeric(length(k_range))
for (i in seq_along(k_range)) {
  # Training error: predict the training labels from the training data
  pred_train = knn(train_x, train_x, train_label, k = k_range[i])
  train_error[i] = mean(pred_train != train_label)
  # Test error: predict the test labels from the training data
  pred_test = knn(train_x, test_x, train_label, k = k_range[i])
  test_error[i] = mean(pred_test != test_label)
}
Below we show the error plots as a function of \(1/k\).
# Create the error plot
errors = data.frame(train_error, test_error, k_range)
ggplot(errors, aes(x = 1 / k_range)) +
  geom_line(aes(y = train_error), col = "darkred") +
  geom_point(aes(y = train_error), col = "darkred") +
  geom_line(aes(y = test_error), col = "steelblue") +
  geom_point(aes(y = test_error), col = "steelblue") +
  ylab("Error Rate") + xlab("1/k") +
  ggtitle("Training and test error rate for KNN") +
  theme_minimal()
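Because the colours above are hard-coded, the plot has no legend. An alternative sketch (assuming the tidyr package is available) reshapes the errors to long format so that ggplot generates a legend automatically:
library(tidyr)  # for pivot_longer()
errors_long = errors %>%
  pivot_longer(c(train_error, test_error),
               names_to = "set", values_to = "error")
ggplot(errors_long, aes(x = 1 / k_range, y = error, colour = set)) +
  geom_line() +
  geom_point() +
  ylab("Error Rate") + xlab("1/k") +
  ggtitle("Training and test error rate for KNN") +
  theme_minimal()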
The optimal value of \(k\) is the one that minimizes the prediction error on the test set. Below we refit the classifier at this optimal \(k\) and display the confusion matrix and accuracy for both the training and test sets.
# Refit at the value of k that minimizes the test error
best_k = k_range[which.min(test_error)]
pred_train = knn(train_x, train_x, train_label, k = best_k)
pred_test = knn(train_x, test_x, train_label, k = best_k)
table(pred_train, train_label)
##              train_label
## pred_train    Diabetes No_Diabetes
##   Diabetes         150           0
##   No_Diabetes        0         133
mean(pred_train == train_label)
## [1] 1
table(pred_test, test_label)
##              test_label
## pred_test     Diabetes No_Diabetes
##   Diabetes          23          12
##   No_Diabetes        7          32
mean(pred_test == test_label)
## [1] 0.7432432
Based on the training and test error plots, the test error is minimized at \(\frac{1}{k} = 1\), i.e., at \(k = 1\) nearest neighbor. For this value of \(k\) the training accuracy is 1, as expected, since with \(k = 1\) every training point is its own nearest neighbor, and the test accuracy is 0.743, the highest over the range considered. Note: the selected value of \(k\) can differ slightly under a different random seed, because when several neighbors are tied in distance, the knn classifier breaks the tie at random.
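To illustrate this sensitivity, a short sketch (not required by the problem) reruns the test-error computation under ten different seeds and tabulates which \(k\) minimizes the test error each time:
# How stable is the selected k across random seeds?
chosen_k = sapply(1:10, function(s) {
  set.seed(s)
  errs = sapply(k_range, function(k)
    mean(knn(train_x, test_x, train_label, k = k) != test_label))
  k_range[which.min(errs)]
})
table(chosen_k)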