(5 pts) We will use the 1st variable (CMEDV) of the 14 variables as our response y and the remaining variables as the feature variables. Sphere the train-validate feature variables, i.e., transform the feature matrix so that its rows have zero sample mean and sample variance equal to one (see Slide 17 of Lec 3). Display the vector of sample means and the vector of sample standard deviations. These vectors should be 13-dimensional and you will use them to likewise transform the test data.

% Loading train data

clear

clc

boston = readtable('boston-corrected.csv');

y = boston.CMEDV;

X = transpose(table2array(boston(:,2:end)));

[d, n] = size(X);

subset = 460;

[X_sph, mu, sigma] = sphere_X(X(:, 1:subset));

disp(mu)

disp(sigma)
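The helper sphere_X is called above but not listed in this excerpt. Below is a minimal sketch consistent with its call signature, assuming row-wise standardization (zero mean, unit standard deviation per feature); the actual implementation may differ.

```matlab
function [X_sph, mu, sigma] = sphere_X(X)
% Standardize each row (feature) of the d-by-n matrix X.
mu = mean(X, 2);         % d-by-1 vector of sample means
sigma = std(X, 0, 2);    % d-by-1 vector of sample standard deviations
X_sph = (X - mu) ./ sigma;
end
```

The returned mu and sigma are the 13-dimensional vectors displayed above, and they are reused later to transform the hold-out samples.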

(5 pts) Here you will be using the sphered train-validate data produced in part (a). Using all the data except for the hold-out sample, compute the effective degrees of freedom over the range λ ∈ [0, 20] with a step size of 0.1 and plot the result. For this same range of λ, find the ridge regression coefficients that minimize the PRSS over the train-validate sample instances and plot the mean squared prediction error as a function of λ. Now apply your learned predictor without regularization (λ = 0) to the test samples in the hold-out set and report its mean squared prediction error. Note: to be applied to a novel instance x in the test set, you need to transform the novel sample the same way you transformed the training samples when learning the predictor coefficients. In particular, the prediction on a novel sample has the form ŷ = w₀ + wᵀ((x − μ) ./ σ), where μ and σ were found in part (a).

step = 0.1;

N = 20 / step;

lambdas = linspace(0, 20, N + 1);

MSE = zeros(N + 1, 1);

PRSS = zeros(N + 1,1);

EDD = zeros(N + 1, 1);

w_train = zeros(d + 1, N + 1); % one weight vector per lambda value

X_til = X_sph(:, 1:subset);

y_T = y(1:subset);

for i = 1:(N + 1)

% Effective degrees of freedom and ridge fit for this lambda

lambda = lambdas(i);

EDD(i) = compute_EDD(X_til, lambda);

[w_train(:, i), ~, MSE(i)] = compute_ridge(y_T, X_til, lambda);

end
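The helpers compute_EDD and compute_ridge are called above but not listed in this excerpt. A minimal sketch consistent with the call sites, assuming the ridge hat-matrix definition of effective degrees of freedom and an unpenalized intercept (the actual implementations may differ):

```matlab
function EDD = compute_EDD(X, lambda)
% Effective degrees of freedom of the ridge fit: trace of the hat matrix,
% df(lambda) = tr[X'(X*X' + lambda*I)^(-1) X] for a d-by-n feature matrix X.
[d, ~] = size(X);
G = X * X';
EDD = trace(G / (G + lambda * eye(d)));
end

function [w, y_hat, MSE] = compute_ridge(y, X, lambda)
% Ridge coefficients minimizing the PRSS; the intercept is not shrunk.
[d, n] = size(X);
X_til = [ones(1, n); X];        % prepend the intercept row
D = diag([0; ones(d, 1)]);      % zero penalty on the intercept
w = (X_til * X_til' + lambda * D) \ (X_til * y);
y_hat = X_til' * w;
MSE = mean((y - y_hat).^2);
end
```

With lambda = 0 the ridge system reduces to ordinary least squares, which is why w_train(:, 1) serves as the unregularized predictor in part (b).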

Figure 1: Effective degrees of freedom over the range λ ∈ [0, 20].

figure(1)

plot(lambdas, EDD, "Linewidth", 1.5)

title("Effective degrees of freedom")

ylabel("EDD", "FontSize", 12)

xlabel("\lambda", "FontSize", 12)

Figure 2: Mean squared prediction error over the range λ ∈ [0, 20].

figure(2)

plot(lambdas, MSE, "Linewidth", 1.5)

title("Mean-squared Prediction Error")

ylabel("MSPE", "FontSize", 12)

xlabel("\lambda", "FontSize", 12)

% MSPE on hold out set with lambda = 0, i.e. w_train(:, 1)

X_hold = [ones(1, 46); (X(:, subset+1:end) - mu) ./ sigma]; % 46 hold-out samples, transformed with mu, sigma from part (a)

y_hold = y(subset+1:end);

y_hat2 = X_hold' * w_train(:, 1);

MSE_hold_1b = compute_MSE(y_hold, y_hat2);

fprintf(['MSPE on hold-out set with ', char(955), ' = 0: ', num2str(MSE_hold_1b), '\n'])
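The helper compute_MSE is also not shown; presumably it is just the sample mean of the squared residuals, sketched here under that assumption:

```matlab
function MSE = compute_MSE(y, y_hat)
% Mean squared prediction error between targets and predictions
MSE = mean((y(:) - y_hat(:)).^2);
end
```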

(10 pts) Now you will use 10-fold cross-validation on the sphered train-validate samples (see Lec 7 slides) to determine the optimal cross-validated value λ* of λ. Plot the average of the 10 mean squared prediction errors that you found over the cross-validation folds as a function of λ. Find the λ* that minimizes this error. What is the corresponding effective degrees of freedom associated with λ*? Now compute the cross-validated ridge weights w_ridge that minimize the PRSS over the entire train+validate set using this value of λ*. Report the average mean squared prediction error of the CV ridge weights when applied to the hold-out set and compare to the error on the hold-out set for the unregularized weights learned in part (b).

MSE_i = zeros(N + 1, 1);

for i = 1:N + 1

% Parameters

lambda = lambdas(i);

MSE_k = zeros(10, 1);

for k = 0:9

X_T = X_sph(:,1:subset);

% Validation fold k (46 samples)

X_valid = [ones(1, 46); X_T(:,46*k + 1:46*(k + 1))];

y_valid = y(46*k + 1:46*(k + 1));

% Remove fold k to form the training split

X_T(:,46*k + 1:46*(k + 1)) = [];

y_T = y(1:subset);

y_T(46*k + 1:46*(k + 1)) = [];

% Compute w_hat, y_hat, MSE

[w_train, ~, ~] = compute_ridge(y_T, X_T, lambda);

% fprintf("The MSE is: " + round(MSE, 3) + "\n")

MSE_k(k+1) = compute_MSE(y_valid, X_valid' * w_train);

end

MSE_i(i) = mean(MSE_k);

end
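Each entry MSE_i(i) computed above is the 10-fold cross-validation estimate of the prediction error at λ = lambdas(i):

$$\widehat{\mathrm{MSPE}}_{\mathrm{CV}}(\lambda)\;=\;\frac{1}{10}\sum_{k=1}^{10}\;\frac{1}{46}\sum_{i\in\mathcal F_k}\Big(y_i-\tilde x_i^\top \hat w^{(-k)}(\lambda)\Big)^2,$$

where $\mathcal F_k$ is the $k$-th fold of 46 samples and $\hat w^{(-k)}(\lambda)$ is the ridge fit trained on the remaining nine folds.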

Figure 3: MSPE 10-fold cross validation

figure(3)

plot(lambdas, MSE_i, "Linewidth", 1.5)

title("MSPE 10-fold cross validation")

ylabel("MSPE", "FontSize", 12)

xlabel("\lambda", "FontSize", 12)

Find λ* and the EDD associated with this λ*:

[~,indx] = min(MSE_i);

lambda_st = (indx - 1) * step;

display(lambda_st)

lambda_st = 4.8000

Display EDD:

EDD_st = compute_EDD(X_sph(:,1:subset), lambda_st);

display(EDD_st)

EDD_st = 12.5544

Compute the cross-validated ridge weights that minimize the PRSS over the entire train+validate set using this value λ*:

[w_ridge, ~, ~] = compute_ridge(y(1:subset), X_sph(:,1:subset), lambda_st);

disp(w_ridge)

Compare to the error on the hold-out set:

MSE_hold_1c = compute_MSE(y_hold, X_hold' * w_ridge);

display(MSE_hold_1c)

MSE_hold_1c = 27.7251

fprintf(['The MSPE on hold-out set with ', char(955), ' = ', num2str(lambda_st), ...
' and CV ridge weights is ', num2str(MSE_hold_1b - MSE_hold_1c), ' smaller than the ', ...
'MSPE without regularization.\n'])

(7 pts) Implement the (batch) gradient descent (GD) update rule to train your regression model on the training data. Set the initial weight to w = 0, the step size parameter to η = 0.05, and the number of GD iterations to K = 500. Plot the training error as a function of iteration and report the learned weight vector obtained after K iterations. Evaluate the mean squared prediction error of the learned weight vector when reapplied to the entire training sample and when applied to the test sample. Compare the learned GD weights to the directly computed weights learned from the training data.

K = 500;

eta = 0.05;

w_k = zeros(d + 1, 1);

train_error = zeros(K, 1);

% Perform GD, cycling through fixed blocks of 46 training samples

for j = 1:K

k = mod(j - 1, 10); % block index, cycles through 0..9

% Get batch of X_j and y_j

X_j = [ones(1, 46); X_sph(:,46*k + 1:46*(k + 1))];

y_j = y(46*k + 1:46*(k + 1));

g = (2 / n) * X_j * (X_j' * w_k - y_j);

w_k = w_k - eta * g;

y_hat = X_j' * w_k;

train_error(j) = compute_MSE(y_j, y_hat);

end
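Each pass of the loop above applies the squared-error gradient step on the current block $\tilde X_j$ (intercept row included):

$$g \;=\; \frac{2}{n}\,\tilde X_j\big(\tilde X_j^\top w_k - y_j\big),\qquad w_{k+1} \;=\; w_k - \eta\, g.$$

Note that the code normalizes by the full sample size n rather than the block size 46, which simply rescales the effective step size.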

Figure 4: Training error as a function of iteration (batch GD).

figure(4)

plot(train_error, "Linewidth", 1.5)

title("Train error as function of iteration, Batch GD")

ylabel("MSPE", "FontSize", 12)

xlabel("kth iteration", "FontSize", 12)

Report final learned weight vector

w_GD = w_k;

disp(w_GD)

Evaluate the mean squared prediction error of the learned weight vector when reapplied to the entire training sample and when applied to the test sample.

y_hat = [ones(1, subset); X_sph(:, 1:subset)]' * w_GD;

MSE_train_GD = compute_MSE(y(1:subset), y_hat);

display(MSE_train_GD)

MSE_train_GD = 21.9025

MSE_test_GD = compute_MSE(y_hold, X_hold' * w_GD);

display(MSE_test_GD)

MSE_test_GD = 28.3228

Compare the learned GD weights to the directly computed weights learned from the training data. We show the difference between the least-squares estimate w_LS and the GD estimate w_GD:

[w_LS, ~, ~] = compute_ridge(y(1:subset), X_sph(:, 1:subset), 0);

compare_w = w_LS - w_GD;

disp(compare_w)

(pts) Implement stochastic gradient descent (SGD) to train your regression model. Set the initial weight to w = 0, the step size parameter to η = 0.005, and the number of SGD epochs to K = 500. As before, plot the training error as a function of epoch and report the learned weight vector obtained after K epochs. Evaluate the mean squared prediction error of the learned weight vector when reapplied to the entire training sample and when applied to the test sample. Compare the learned SGD weights to the GD weights, and to the directly computed weights learned from the training data.

eta = 0.005;

w_k = zeros(d + 1, 1);

train_error = zeros(K, 1);

% Perform SGD: each epoch is one random pass through the 10 blocks

for epoch = 1:K

rand_j = randperm(10) - 1;

for i = 1:10

k = rand_j(i);

% Get batch of X_j and y_j

X_j = [ones(1, 46); X_sph(:,46*k + 1:46*(k + 1))];

y_j = y(46*k + 1:46*(k + 1));

g = (2 / n) * X_j * (X_j' * w_k - y_j);

w_k = w_k - eta * g;

y_hat = X_j' * w_k;

end

train_error(epoch) = compute_MSE(y_j, y_hat); % error on the last block of the epoch

end

Figure 5: Training error as a function of epoch (SGD).

% Plot the training error as a function of epoch

figure(5)

plot(train_error, "Linewidth", 1.5)

title('Train error as function of K epochs')

ylabel("MSPE", "FontSize", 12)

xlabel("kth epoch", "FontSize", 12)

Report the learned weight vector obtained after K epochs.

w_SGD = w_k;

disp(w_SGD)

Evaluate the mean squared prediction error of the learned weight vector when reapplied to the entire training sample and when applied to the test sample.

y_hat = [ones(1, subset); X_sph(:, 1:subset)]' * w_SGD;

MSE_train_SGD = compute_MSE(y(1:subset), y_hat);

display(MSE_train_SGD)

MSE_train_SGD = 21.9126

MSE_test_SGD = compute_MSE(y_hold, X_hold' * w_SGD);

display(MSE_test_SGD)

MSE_test_SGD = 28.3397

Compare the learned SGD weights to the GD weights by taking differences of the estimates.

compare_w_SGD_GD = w_SGD - w_GD;

disp(compare_w_SGD_GD)

Compare SGD weights to the directly computed weights learned from the training data by taking differences of the estimates.

compare_w_SGD = w_LS - w_SGD;


disp(compare_w_SGD)

Implement the subgradient method for minimizing J and apply it to the nuclear data. Submit two figures: one showing the data and the learned line, the other showing J as a function of iteration number. Also report the estimated hyperplane parameters and the minimum achieved value of the objective function.
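The exact form of J is not restated in this excerpt; the subgradient used in the code corresponds to the standard regularized hinge-loss objective with an unpenalized offset θ₀:

$$J(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n}\max\big(0,\,1 - y_i\,\theta^\top x_i\big) \;+\; \frac{\lambda}{2}\,\theta^\top D\,\theta,$$

with subgradient

$$g \;=\; -\frac{1}{n}\sum_{i:\,y_i \theta^\top x_i < 1} y_i\,x_i \;+\; \lambda D\,\theta,$$

where D is the diagonal matrix that zeroes the offset component.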

clear

clc


load nuclear.mat;

rng(0);

% Initialize parameters

K = 100;

lambda = 0.001;

[d, n] = size(x);

x = [ones(1, n); x];

theta_k = ones(d + 1, 1);

D = diag([0; ones(d, 1)]);

J = zeros(K, 1);

% f = waitbar(0, 'Please wait...');

% Iterate K times to update theta_k

tic;

for j = 1:K

% Compute a subgradient of J at theta_k

margins = (theta_k' * x) .* y; % 1-by-n margins y_i * theta' * x_i

g = -(1/n) * x * (y' .* (margins < 1)') + lambda * D * theta_k;

alpha_j = K / j; % diminishing step size

theta_k = theta_k - alpha_j * g;

J(j) = compute_J(y, x, theta_k, lambda);

% waitbar(j / K, f, sprintf('Trial %d of %d', j, K));

end

time_GD = toc;
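The helper compute_J is not listed; a sketch consistent with the subgradient used above, assuming the λ/2 factor on the quadratic penalty:

```matlab
function J = compute_J(y, x, theta, lambda)
% Regularized empirical hinge loss; theta(1) is the unpenalized offset.
[p, ~] = size(x);                      % p = d + 1 rows (intercept included)
D = diag([0; ones(p - 1, 1)]);
margins = y .* (theta' * x);           % 1-by-n: y_i * theta' * x_i
J = mean(max(0, 1 - margins)) + (lambda / 2) * (theta' * D * theta);
end
```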

We obtain the learned line from θ₀ + θ₁x₁ + θ₂x₂ = 0, i.e., x₂ = −(θ₀ + θ₁x₁)/θ₂.

data = [ones(n, 1), linspace(0, 7, n)'];

pred_line = -(1/theta_k(3)) * (data * theta_k(1:2));

Figure 6: Showing the data and the learned line

figure(6)

scatter(x(2, y == 1), x(3, y == 1))

hold on;

scatter(x(2, y == -1), x(3, y == -1))

plot(linspace(0, 7, n)', pred_line, 'LineWidth', 2)

title('Data and Learned line')

xlabel('Total energy', "FontSize",12)

ylabel('Tail energy',"FontSize",12)

hold off

Figure 7: Showing J as a function of iteration number.

figure(7)

plot(J, 'Linewidth', 1.5)

xlabel("kth iteration", "FontSize",12)

ylabel("$J(w,w_0)$", "FontSize", 12, "Interpreter","latex")

title('J as function of iteration with subgradient method')

Report the estimated hyperplane parameters and the minimum achieved value of the objective function.

hyper_p = theta_k;

display(hyper_p)

min_obj = min(J);

display(min_obj)

min_obj = 0.3105

(6 pts) Now implement the stochastic subgradient method, which is like the subgradient method, except that your step direction is a subgradient computed on a randomly selected portion of the data. Be sure to cycle through all data points before starting a new loop through the data. Report/hand in the same items as for the previous part.

rng(0);

% Cycle K times through the data; theta_k, K, lambda, and D carry over from the previous part

% f = waitbar(0, 'Please wait...');

cycle_size = 10;

J = zeros(cycle_size * K, 1);

tic;

for j = 1:K

% Set up permutations

rand_j = randperm(cycle_size) - 1;

% Step-size

alpha_j = K / j;

% Begin Cycling through data

for i = 1:cycle_size

k = rand_j(i);

subset = n / cycle_size;

x_j = x(:,subset*k + 1:subset*(k + 1));

y_j = y(subset*k + 1:subset*(k + 1));

% Compute a subgradient on block k

margins = (theta_k' * x_j) .* y_j; % margins y_i * theta' * x_i for this block

g = -(1/n) * x_j * (y_j' .* (margins < 1)') + lambda * D * theta_k;

theta_k = theta_k - alpha_j * g;

J(cycle_size * (j - 1) + i) = compute_J(y_j, x_j, theta_k, lambda);

end

% waitbar(j / K, f, sprintf('Trial %d of %d', j, K));

end

time_SGD = toc;

We obtain the learned line from θ₀ + θ₁x₁ + θ₂x₂ = 0, i.e., x₂ = −(θ₀ + θ₁x₁)/θ₂.

data = [ones(n, 1), linspace(0, 7, n)'];

pred_line = -(1/theta_k(3)) * (data * theta_k(1:2));

Figure 8: Showing the data and the learned line

figure(8)

scatter(x(2, y == 1), x(3, y == 1))

hold on;

scatter(x(2, y == -1), x(3, y == -1))

plot(linspace(0, 7, n)', pred_line, 'LineWidth', 2)

title('Data and Learned line')

xlabel('Total energy', "FontSize", 12)

ylabel('Tail energy', "FontSize", 12)

hold off