With \(f(x)=\beta_{0}+x^{T} \beta,\) consider the optimization problem \[ \min _{\beta_{0}, \beta} \sum_{i=1}^{n}\left[1-y_{i} f\left(x_{i}\right)\right]_{+}+\frac{\lambda}{2}\|\beta\|^{2} \]
where the subscript “+” indicates positive part. This has the form “loss + penalty”, which is a familiar paradigm in statistical function estimation. Show that the solution to this problem, with \(\lambda=1 / C\), is the same as that for the support vector machine, i.e. \[ \begin{aligned} \min _{\beta_{0}, \beta} & \frac{1}{2}\|\beta\|^{2}+C \sum_{i=1}^{n} \xi_{i} \\ \text { subject to } & \xi_{i} \geq 0, y_{i}\left(\beta_{0}+x_{i}^{T} \beta\right) \geq 1-\xi_{i}, \forall i \end{aligned} \]
where \(C\) is the “cost” tuning parameter.
\(\mathbf{Solution.}\qquad\) We begin with the optimization problem \[ \min_{\beta_{0},\beta}\sum_{i=1}^{n}\left[1-y_{i}f\left(x_{i}\right)\right]_{+}+\frac{\lambda}{2}\|\beta\|^{2}. \] Dividing the objective by \(\lambda>0\), which does not change the minimizers, gives \[ \min_{\beta_{0},\beta}\frac{1}{\lambda}\sum_{i=1}^{n}\left[1-y_{i}f\left(x_{i}\right)\right]_{+}+\frac{1}{2}\|\beta\|^{2}. \] Letting \(\lambda=1/C\), we have \[ \min_{\beta_{0},\beta}C\sum_{i=1}^{n}\left[1-y_{i}f\left(x_{i}\right)\right]_{+}+\frac{1}{2}\|\beta\|^{2}. \] Finally, we define \(\xi_{i}:=\left[1-y_{i}f\left(x_{i}\right)\right]_{+}\), and the optimization problem becomes \[ \begin{aligned}\min_{\beta_{0},\beta}\ & C\sum_{i=1}^{n}\xi_{i}+\frac{1}{2}\|\beta\|^{2}\\ \text{ subject to }\ & \xi_{i}\geq0,\ y_{i}\left(\beta_{0}+x_{i}^{T}\beta\right)\geq1-\xi_{i},\ \forall i. \end{aligned} \]
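To justify the final step, note that for fixed \((\beta_{0},\beta)\) the objective is increasing in each \(\xi_{i}\), so every slack variable is driven down to the smallest value its constraints allow: \[ \xi_{i}^{*}=\min\left\{ \xi_{i}:\xi_{i}\geq0,\ \xi_{i}\geq1-y_{i}f\left(x_{i}\right)\right\} =\max\left\{ 0,\,1-y_{i}f\left(x_{i}\right)\right\} =\left[1-y_{i}f\left(x_{i}\right)\right]_{+}. \] Hence the constrained problem and the hinge-loss problem have the same optimal \((\beta_{0},\beta)\), and with \(\lambda=1/C\) the two formulations coincide.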
In this problem, you will use support vector machines and neural networks to predict handwritten digits based on the MNIST dataset. See the file mnist_info for how to load and preprocess the data. (a) Train support vector machines with radial and polynomial kernels, with varying values of gamma, degree, and cost. Use cross-validation to select the best model, apply the selected model to the test data, and report the results.

(a) \(\mathbf{Solution.}\qquad\) First, to make the training process faster, we swap the roles of the training and test sets used in parts (a) and (b), which gives us 10,000 observations for the training set and 60,000 observations for the test set.
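As a minimal sketch of this swap, assuming the keras R package's dataset_mnist() loader (the mnist_info file may load the data differently):

library(keras)  # dataset_mnist() downloads and caches MNIST
mnist = dataset_mnist()
# swap the roles of the splits: train on the 10,000-image test split,
# evaluate on the 60,000-image training split
train_images = mnist$test$x;  train_labels = mnist$test$y
test_images  = mnist$train$x; test_labels  = mnist$train$y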
Next, we plot some training images in Figure 1.
Next we scale the pixel values of the training and test sets to lie between 0 and 1. For the SVM we also flatten each \(28 \times 28\) image into a 784-dimensional vector, giving a \(10{,}000 \times 784\) training matrix and a \(60{,}000 \times 784\) test matrix.
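A sketch of this preprocessing, assuming array_reshape() from the keras package; the names train_dat and test_images.1 are chosen to match the code used later:

# scale pixels to [0, 1] and flatten each 28x28 image to a 784-vector
train_images.1 = array_reshape(train_images, c(nrow(train_images), 784)) / 255
test_images.1  = array_reshape(test_images, c(nrow(test_images), 784)) / 255
# svm() treats a factor response as a classification problem
train_dat = data.frame(labels = factor(train_labels), train_images.1)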
Below we train SVM models with radial and polynomial kernels and display the cross-validated classification error of each candidate model.
library(e1071)  # provides svm() and tune()

## No PCA: 10-fold CV over cost, gamma, and kernel
tune.out = tune(svm, labels~., data = train_dat,
                ranges = list(cost = c(0.1, 1, 10, 100),
                              gamma = 10^(-3:0),
                              kernel = c("radial", "polynomial")))
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma kernel
## 10 0.01 radial
##
## - best performance: 0.081
##
## - Detailed performance results:
## cost gamma kernel error dispersion
## 1 0.1 0.001 radial 0.871 0.03142893
## 2 1.0 0.001 radial 0.181 0.05342700
## 3 10.0 0.001 radial 0.107 0.02311805
## 4 100.0 0.001 radial 0.100 0.02666667
## 5 0.1 0.010 radial 0.326 0.05834762
## 6 1.0 0.010 radial 0.102 0.02740641
## 7 10.0 0.010 radial 0.081 0.02846050
## 8 100.0 0.010 radial 0.081 0.02846050
## 9 0.1 0.100 radial 0.797 0.04922736
## 10 1.0 0.100 radial 0.463 0.04785394
## 11 10.0 0.100 radial 0.418 0.04315347
## 12 100.0 0.100 radial 0.418 0.04315347
## 13 0.1 1.000 radial 0.874 0.03373096
## 14 1.0 1.000 radial 0.873 0.03497618
## 15 10.0 1.000 radial 0.873 0.03497618
## 16 100.0 1.000 radial 0.873 0.03497618
## 17 0.1 0.001 polynomial 0.874 0.03373096
## 18 1.0 0.001 polynomial 0.874 0.03373096
## 19 10.0 0.001 polynomial 0.874 0.03373096
## 20 100.0 0.001 polynomial 0.635 0.04743416
## 21 0.1 0.010 polynomial 0.635 0.04743416
## 22 1.0 0.010 polynomial 0.208 0.05553777
## 23 10.0 0.010 polynomial 0.130 0.04268749
## 24 100.0 0.010 polynomial 0.126 0.04452215
## 25 0.1 0.100 polynomial 0.126 0.04452215
## 26 1.0 0.100 polynomial 0.126 0.04452215
## 27 10.0 0.100 polynomial 0.126 0.04452215
## 28 100.0 0.100 polynomial 0.126 0.04452215
## 29 0.1 1.000 polynomial 0.126 0.04452215
## 30 1.0 1.000 polynomial 0.126 0.04452215
## 31 10.0 1.000 polynomial 0.126 0.04452215
## 32 100.0 1.000 polynomial 0.126 0.04452215
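The polynomial rows above vary \(\gamma\) rather than the degree; the degree selection discussed below presumably came from a separate tuning run along these lines (a hedged sketch, since the original call is not shown):

tune.poly = tune(svm, labels~., data = train_dat, kernel = "polynomial",
                 ranges = list(cost = c(0.1, 1, 10, 100), degree = 1:4))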
For the SVM with radial kernel, we see that either a large \(\gamma\) or a small cost degrades performance. From the tune() function we find that the best radial model has \(\gamma = 0.01\) and \(\textrm{cost} = 10\). For the SVM with polynomial kernel, a higher cost and a lower degree work better; from the separate tuning run over degree (sketched above) we find that the best model has \(d = 1\) and \(\textrm{cost} = 1\), which corresponds to a linear SVM. Below we report the classification error rate of the selected radial-kernel SVM on the test data.
test_dat = data.frame(test_labels, test_images.1)
colnames(test_dat)[1] = "labels"
test_pred = predict(tune.out$best.model, newdata = test_dat)
mean(test_pred != test_dat$labels)  # classification error rate
## [1] 0.09921667
(b) \(\mathbf{Solution.}\qquad\) First we fit MLP models with various hyperparameter settings. The table below shows the cross-validation accuracy of each configuration. The best-performing configuration has batch size 8 and learning rate 0.5, with a validation accuracy of 0.9730.
batch_size | learn_rate | val_accuracy |
---|---|---|
8 | 0.01 | 0.9660 |
8 | 0.10 | 0.9700 |
8 | 0.50 | 0.9730 |
16 | 0.01 | 0.9615 |
16 | 0.10 | 0.9650 |
16 | 0.50 | 0.9645 |
32 | 0.01 | 0.9655 |
32 | 0.10 | 0.9630 |
32 | 0.50 | 0.9645 |
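As an illustration of the MLP fits summarized above, here is a minimal sketch using the keras R interface; the hidden-layer width and epoch count are assumptions, since the original architecture is not shown:

library(keras)
# one hidden layer; 784 flattened inputs, 10 output classes
mlp = keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = 784) %>%
  layer_dense(units = 10, activation = "softmax")
mlp %>% compile(optimizer = optimizer_sgd(learning_rate = 0.5),  # selected learn_rate
                loss = "sparse_categorical_crossentropy",
                metrics = "accuracy")
history = mlp %>% fit(train_images.1, train_labels, batch_size = 8,
                      epochs = 10, validation_split = 0.2)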
Below we show a plot of the best MLP model we chose.
Below we compute the classification error rate for the MLP.
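A sketch of how the numbers below can be computed with keras's evaluate() (the original chunk is not shown):

score = mlp %>% evaluate(test_images.1, test_labels, verbose = 0)
cat("Test loss:", score[["loss"]], "\n")
cat("Test error:", 1 - score[["accuracy"]], "\n")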
## Test loss: 0.2973463
## Test error: 0.04553336
Next we fit CNN models with various hyperparameter settings. The table below shows the cross-validation accuracy of each configuration. The best-performing configuration has batch size 32 and learning rate 0.01, with a validation accuracy of 0.9935.
batch_size | learn_rate | val_accuracy |
---|---|---|
8 | 0.01 | 0.9910 |
8 | 0.10 | 0.9925 |
8 | 0.50 | 0.9880 |
16 | 0.01 | 0.9925 |
16 | 0.10 | 0.9910 |
16 | 0.50 | 0.9900 |
32 | 0.01 | 0.9935 |
32 | 0.10 | 0.9915 |
32 | 0.50 | 0.9905 |
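A corresponding sketch for the CNN; the convolutional architecture is an assumption, and the images are kept as \(28 \times 28 \times 1\) arrays rather than flattened:

train_x = array_reshape(train_images, c(nrow(train_images), 28, 28, 1)) / 255
cnn = keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 10, activation = "softmax")
cnn %>% compile(optimizer = optimizer_sgd(learning_rate = 0.01),  # selected learn_rate
                loss = "sparse_categorical_crossentropy",
                metrics = "accuracy")
history = cnn %>% fit(train_x, train_labels, batch_size = 32,
                      epochs = 10, validation_split = 0.2)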
Below we plot the performance of the best CNN model we trained.
Below we compute the classification error rate for the CNN.
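The test error is computed as for the MLP, after reshaping the test images (again a sketch):

test_x = array_reshape(test_images, c(nrow(test_images), 28, 28, 1)) / 255
score = cnn %>% evaluate(test_x, test_labels, verbose = 0)
cat("Test loss:", score[["loss"]], "\n")
cat("Test error:", 1 - score[["accuracy"]], "\n")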
## Test loss: 0.08424857
## Test error: 0.01740003
We see that both the MLP and the CNN perform well and outperform the SVM models. The CNN attains the lowest test error, about 1.7%, versus roughly 4.6% for the MLP and 9.9% for the radial-kernel SVM.