With \(f(x)=\beta_{0}+x^{T} \beta,\) consider the optimization problem \[ \min _{\beta_{0}, \beta} \sum_{i=1}^{n}\left[1-y_{i} f\left(x_{i}\right)\right]_{+}+\frac{\lambda}{2}\|\beta\|^{2} \]
where the subscript “+” indicates positive part. This has the form “loss + penalty”, which is a familiar paradigm in statistical function estimation. Show that the solution to this problem, with \(\lambda=1 / C\), is the same as that for the support vector machine, i.e. \[ \begin{aligned} \min _{\beta_{0}, \beta} & \frac{1}{2}\|\beta\|^{2}+C \sum_{i=1}^{n} \xi_{i} \\ \text { subject to } & \xi_{i} \geq 0, y_{i}\left(\beta_{0}+x_{i}^{T} \beta\right) \geq 1-\xi_{i}, \forall i \end{aligned} \]
where \(C\) is the “cost” tuning parameter.
\(\mathbf{Solution.}\qquad\) We begin with the optimization problem \[ \min_{\beta_{0},\beta}\sum_{i=1}^{n}\left[1-y_{i}f\left(x_{i}\right)\right]_{+}+\frac{\lambda}{2}\|\beta\|^{2}. \] Dividing the objective by \(\lambda>0\), which does not change the minimizers, gives \[ \min_{\beta_{0},\beta}\frac{1}{\lambda}\sum_{i=1}^{n}\left[1-y_{i}f\left(x_{i}\right)\right]_{+}+\frac{1}{2}\|\beta\|^{2}. \] Letting \(\lambda=1/C\), we have \[ \min_{\beta_{0},\beta}C\sum_{i=1}^{n}\left[1-y_{i}f\left(x_{i}\right)\right]_{+}+\frac{1}{2}\|\beta\|^{2}. \] Finally, we define \(\xi_{i}:=\left[1-y_{i}f\left(x_{i}\right)\right]_{+}\), and the optimization problem becomes \[ \begin{aligned}\min_{\beta_{0},\beta}\ & C\sum_{i=1}^{n}\xi_{i}+\frac{1}{2}\|\beta\|^{2}\\ \text{ subject to }\ & \xi_{i}\geq0,\ y_{i}\left(\beta_{0}+x_{i}^{T}\beta\right)\geq1-\xi_{i},\ \forall i. \end{aligned} \]
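To justify the final step, note that for fixed \((\beta_{0},\beta)\) the objective is increasing in each \(\xi_{i}\), so every slack variable is driven down to the smallest value its constraints allow: \[ \xi_{i}^{*}=\min\left\{ \xi_{i}:\xi_{i}\geq0,\ \xi_{i}\geq1-y_{i}f\left(x_{i}\right)\right\} =\max\left\{ 0,\,1-y_{i}f\left(x_{i}\right)\right\} =\left[1-y_{i}f\left(x_{i}\right)\right]_{+}. \] Hence the constrained problem and the hinge-loss problem have the same optimal \((\beta_{0},\beta)\), and with \(\lambda=1/C\) the two formulations coincide.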
In this problem, you will use support vector machines and neural networks to predict handwritten digits based on the MNIST dataset. See the file mnist_info for how to load and preprocess the data. (a) Train support vector machines with radial and polynomial kernels, with varying values of gamma, degree, and cost. Use cross-validation to select the best model, apply the selected model to the test data, and report the results.

(a) \(\mathbf{Solution.}\qquad\) First, to make the training process faster, we swap the roles of the training and test sets used in parts (a) and (b), which gives us 10,000 observations for the training set and 60,000 observations for the test set.
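As a minimal sketch of this swap, assuming the keras R package's dataset_mnist() loader (the mnist_info file may load the data differently):

library(keras)  # dataset_mnist() downloads and caches MNIST
mnist = dataset_mnist()
# swap the roles of the splits: train on the 10,000-image test split,
# evaluate on the 60,000-image training split
train_images = mnist$test$x;  train_labels = mnist$test$y
test_images  = mnist$train$x; test_labels  = mnist$train$y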
Next, we plot some training images in Figure 1.
Next we scale the pixel values of the training and test sets to lie between 0 and 1. For the SVM we also flatten each \(28 \times 28\) image into a 784-dimensional vector, giving a \(10{,}000 \times 784\) training matrix and a \(60{,}000 \times 784\) test matrix.
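A sketch of this preprocessing, assuming array_reshape() from the keras package; the names train_dat and test_images.1 are chosen to match the code used later:

# scale pixels to [0, 1] and flatten each 28x28 image to a 784-vector
train_images.1 = array_reshape(train_images, c(nrow(train_images), 784)) / 255
test_images.1  = array_reshape(test_images, c(nrow(test_images), 784)) / 255
# svm() treats a factor response as a classification problem
train_dat = data.frame(labels = factor(train_labels), train_images.1)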
Below we train SVM models with radial and polynomial kernels and display the cross-validated classification error of each candidate model.
library(e1071)  # provides svm() and tune()

## No PCA: 10-fold CV over cost, gamma, and kernel
tune.out = tune(svm, labels~., data = train_dat,
                ranges = list(cost = c(0.1, 1, 10, 100),
                              gamma = 10^(-3:0),
                              kernel = c("radial", "polynomial")))
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma kernel
## 10 0.01 radial
##
## - best performance: 0.081
##
## - Detailed performance results:
## cost gamma kernel error dispersion
## 1 0.1 0.001 radial 0.871 0.03142893
## 2 1.0 0.001 radial 0.181 0.05342700
## 3 10.0 0.001 radial 0.107 0.02311805
## 4 100.0 0.001 radial 0.100 0.02666667
## 5 0.1 0.010 radial 0.326 0.05834762
## 6 1.0 0.010 radial 0.102 0.02740641
## 7 10.0 0.010 radial 0.081 0.02846050
## 8 100.0 0.010 radial 0.081 0.02846050
## 9 0.1 0.100 radial 0.797 0.04922736
## 10 1.0 0.100 radial 0.463 0.04785394
## 11 10.0 0.100 radial 0.418 0.04315347
## 12 100.0 0.100 radial 0.418 0.04315347
## 13 0.1 1.000 radial 0.874 0.03373096
## 14 1.0 1.000 radial 0.873 0.03497618
## 15 10.0 1.000 radial 0.873 0.03497618
## 16 100.0 1.000 radial 0.873 0.03497618
## 17 0.1 0.001 polynomial 0.874 0.03373096
## 18 1.0 0.001 polynomial 0.874 0.03373096
## 19 10.0 0.001 polynomial 0.874 0.03373096
## 20 100.0 0.001 polynomial 0.635 0.04743416
## 21 0.1 0.010 polynomial 0.635 0.04743416
## 22 1.0 0.010 polynomial 0.208 0.05553777
## 23 10.0 0.010 polynomial 0.130 0.04268749
## 24 100.0 0.010 polynomial 0.126 0.04452215
## 25 0.1 0.100 polynomial 0.126 0.04452215
## 26 1.0 0.100 polynomial 0.126 0.04452215
## 27 10.0 0.100 polynomial 0.126 0.04452215
## 28 100.0 0.100 polynomial 0.126 0.04452215
## 29 0.1 1.000 polynomial 0.126 0.04452215
## 30 1.0 1.000 polynomial 0.126 0.04452215
## 31 10.0 1.000 polynomial 0.126 0.04452215
## 32 100.0 1.000 polynomial 0.126 0.04452215
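The polynomial rows above vary \(\gamma\) rather than the degree; the degree selection discussed below presumably came from a separate tuning run along these lines (a hedged sketch, since the original call is not shown):

tune.poly = tune(svm, labels~., data = train_dat, kernel = "polynomial",
                 ranges = list(cost = c(0.1, 1, 10, 100), degree = 1:4))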
For the SVM with radial kernel, we see that either a large \(\gamma\) or a small cost degrades performance. From the tune() function we find that the best radial model has \(\gamma = 0.01\) and \(\textrm{cost} = 10\). For the SVM with polynomial kernel, a higher cost and a lower degree work better; from the separate tuning run over degree (sketched above) we find that the best model has \(d = 1\) and \(\textrm{cost} = 1\), which corresponds to a linear SVM. Below we report the classification error rate of the selected radial-kernel SVM on the test data.
test_dat = data.frame(test_labels, test_images.1)
colnames(test_dat)[1] = "labels"
test_pred = predict(tune.out$best.model, newdata = test_dat)
mean(test_pred != test_dat$labels)  # classification error rate
## [1] 0.09921667
(b) \(\mathbf{Solution.}\qquad\) First we fit MLP models with various hyperparameter settings. The table below shows the cross-validation accuracy of each configuration. The best-performing configuration has batch size 8 and learning rate 0.5, with a validation accuracy of 0.9730.
batch_size | learn_rate | val_accuracy |
---|---|---|
8 | 0.01 | 0.9660 |
8 | 0.10 | 0.9700 |
8 | 0.50 | 0.9730 |
16 | 0.01 | 0.9615 |
16 | 0.10 | 0.9650 |
16 | 0.50 | 0.9645 |
32 | 0.01 | 0.9655 |
32 | 0.10 | 0.9630 |
32 | 0.50 | 0.9645 |
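As an illustration of the MLP fits summarized above, here is a minimal sketch using the keras R interface; the hidden-layer width and epoch count are assumptions, since the original architecture is not shown:

library(keras)
# one hidden layer; 784 flattened inputs, 10 output classes
mlp = keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = 784) %>%
  layer_dense(units = 10, activation = "softmax")
mlp %>% compile(optimizer = optimizer_sgd(learning_rate = 0.5),  # selected learn_rate
                loss = "sparse_categorical_crossentropy",
                metrics = "accuracy")
history = mlp %>% fit(train_images.1, train_labels, batch_size = 8,
                      epochs = 10, validation_split = 0.2)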
Below we show a plot of the best MLP model we chose.
Below we compute the classification error rate for the MLP.
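A sketch of how the numbers below can be computed with keras's evaluate() (the original chunk is not shown):

score = mlp %>% evaluate(test_images.1, test_labels, verbose = 0)
cat("Test loss:", score[["loss"]], "\n")
cat("Test error:", 1 - score[["accuracy"]], "\n")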
## Test loss: 0.2973463
## Test error: 0.04553336
Next we fit CNN models with various hyperparameter settings. The table below shows the cross-validation accuracy of each configuration. The best-performing configuration has batch size 32 and learning rate 0.01, with a validation accuracy of 0.9935.
batch_size | learn_rate | val_accuracy |
---|---|---|
8 | 0.01 | 0.9910 |
8 | 0.10 | 0.9925 |
8 | 0.50 | 0.9880 |
16 | 0.01 | 0.9925 |
16 | 0.10 | 0.9910 |
16 | 0.50 | 0.9900 |
32 | 0.01 | 0.9935 |
32 | 0.10 | 0.9915 |
32 | 0.50 | 0.9905 |
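A corresponding sketch for the CNN; the convolutional architecture is an assumption, and the images are kept as \(28 \times 28 \times 1\) arrays rather than flattened:

train_x = array_reshape(train_images, c(nrow(train_images), 28, 28, 1)) / 255
cnn = keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 10, activation = "softmax")
cnn %>% compile(optimizer = optimizer_sgd(learning_rate = 0.01),  # selected learn_rate
                loss = "sparse_categorical_crossentropy",
                metrics = "accuracy")
history = cnn %>% fit(train_x, train_labels, batch_size = 32,
                      epochs = 10, validation_split = 0.2)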
Below we plot the performance of the best CNN model we trained.
Below we compute the classification error rate for the CNN.
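The test error is computed as for the MLP, after reshaping the test images (again a sketch):

test_x = array_reshape(test_images, c(nrow(test_images), 28, 28, 1)) / 255
score = cnn %>% evaluate(test_x, test_labels, verbose = 0)
cat("Test loss:", score[["loss"]], "\n")
cat("Test error:", 1 - score[["accuracy"]], "\n")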
## Test loss: 0.08424857
## Test error: 0.01740003
We see that both the MLP and the CNN perform well and outperform the SVM models. The CNN attains the lowest test error, about 1.7%, versus roughly 4.6% for the MLP and 9.9% for the radial-kernel SVM.