EECS 545: Homework 3

Author: Israel Diego (UMID: 1750 6241)

1) Cross-validated (batch) ridge regression (20pts)

a)

(5 pts) We will use the 1st variable (CMEDV) of the 14 variables as our response y and the remaining 13 variables as the feature variables. Sphere the train-validate feature variables, i.e., transform the feature matrix so that its rows have zero sample mean and sample variance equal to one (see Slide 17 of Lec 3). Display the vector of sample means and the vector of sample standard deviations. These vectors should be 13-dimensional, and you will use them to likewise transform the test data.
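For reference, the sphering applied by sphere_X below is the standard per-feature standardization; $\hat\mu_i$ and $\hat\sigma_i$ in this sketch correspond to the entries of mu and sigma returned by the function:
$$\tilde{x}_{ij} = \frac{x_{ij}-\hat\mu_i}{\hat\sigma_i}, \qquad \hat\mu_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}, \qquad \hat\sigma_i^2 = \frac{1}{n-1}\sum_{j=1}^{n}\bigl(x_{ij}-\hat\mu_i\bigr)^2,$$
applied to each of the d = 13 feature rows of the train-validate matrix.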
% Loading train data
clear
clc
boston = readtable('boston-corrected.csv');
y = boston.CMEDV;
X = transpose(table2array(boston(:,2:end)));
[d, n] = size(X);
subset = 460;
[X_sph, mu, sigma] = sphere_X(X(:, 1:subset));
disp(mu)
3.6731 10.9359 11.1597 0.0696 0.5547 6.2818 68.6800 3.7950 9.6674 409.4587 18.4759 357.2528 12.6795
disp(sigma)
8.7132 22.8004 6.8092 0.2547 0.1147 0.7053 27.8636 2.0960 8.7768 169.6645 2.1781 91.5471 7.1469

b)

(5 pts) Here you will use the sphered train-validate data produced in part (a). Using all the data except for the hold-out sample, compute the effective degrees of freedom (see the lecture slides) over the range λ ∈ [0, 20] with a step size of 0.1 and plot the result. For this same range of λ, find the ridge regression coefficients that minimize the PRSS over the train-validate sample and plot the mean squared prediction error as a function of λ. Now apply your learned predictor without regularization (λ = 0) to the test samples in the hold-out set and report its mean squared prediction error. Note: to be applied to a novel instance x in the test set, you need to transform the novel sample the same way you transformed the training samples when learning the predictor coefficients. In particular, the predictor applied to a novel sample x has the form ŷ = w₀ + wᵀ((x − μ̂)/σ̂), with the division applied elementwise, where μ̂ and σ̂ were found in part (a).
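For reference, the quantities computed below are the penalized residual sum of squares, its closed-form minimizer, and the effective degrees of freedom; in this sketch $\tilde{X}$ is the sphered feature matrix with a row of ones prepended (as built inside compute_ridge), $D = \mathrm{diag}(0,1,\dots,1)$ excludes the intercept from the penalty, and $d_i$ are the singular values of the sphered $X$:
$$\mathrm{PRSS}(w;\lambda) = \sum_{j=1}^{n}\bigl(y_j - w^\top\tilde{x}_j\bigr)^2 + \lambda\, w^\top D\, w, \qquad \hat{w}_\lambda = \bigl(\tilde{X}\tilde{X}^\top + \lambda D\bigr)^{-1}\tilde{X}y,$$
$$\mathrm{df}(\lambda) = \mathrm{tr}\!\left[X^\top\bigl(XX^\top + \lambda I\bigr)^{-1}X\right] = \sum_{i=1}^{d}\frac{d_i^2}{d_i^2+\lambda},$$
which is what compute_EDD evaluates.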
step = 0.1;
N = 20 / step;
lambdas = linspace(0, 20, N + 1);
MSE = zeros(N + 1, 1);
PRSS = zeros(N + 1,1);
EDD = zeros(N + 1, 1);
w_train = zeros(d + 1, N + 1);
for i = 1:(N + 1)
% Parameters
lambda = lambdas(i);
X_til = X_sph(:,1:subset);
y_T = y(1:subset);
[d, ~] = size(X_til);
% Compute w_hat, y_hat, MSE
EDD(i) = compute_EDD(X_til, lambda);
[w_train(:, i), ~, MSE(i)] = compute_ridge(y_T, X_til, lambda);
end
Figure 1: Plot of the effective degrees of freedom over the range λ ∈ [0, 20].
figure(1)
plot(lambdas, EDD, "Linewidth", 1.5)
title("Effective degrees of freedom")
ylabel("EDD", "FontSize", 12)
xlabel("\lambda", "FontSize", 12)
Figure 2: Plot of the mean squared prediction error over the range λ ∈ [0, 20].
figure(2)
plot(lambdas, MSE, "Linewidth", 1.5)
title("Mean-squared Prediction Error")
ylabel("MSPE", "FontSize", 12)
xlabel("\lambda", "FontSize", 12)
% MSPE on hold out set with lambda = 0, i.e. w_train(:, 1)
X_hold = [ones(1, 46); (X(:, subset+1:end) - mu) ./ sigma];
y_hold = y(subset+1:end);
y_hat2 = X_hold' * w_train(:, 1);
MSE_hold_1b = compute_MSE(y_hold, y_hat2);
fprintf(['MSPE on hold-out set with ', num2str(char(955)),' = 0: ',num2str(MSE_hold_1b), '\n'])
MSPE on hold-out set with λ = 0: 27.8722

c)

(10 pts) Now you will use 10-fold cross-validation on the sphered train-validate samples (see Lec 7 slides) to determine the optimal cross-validated value of λ. Plot the average of the 10 mean squared prediction errors that you found over the cross-validation folds as a function of λ. Find the λ* that minimizes this error. What is the effective degrees of freedom associated with λ*? Now compute the cross-validated ridge weights w_ridge that minimize the PRSS over the entire train+validate set using this value of λ*. Report the mean squared prediction error of the CV ridge weights when applied to the hold-out set and compare to the error on the hold-out set for the unregularized weights learned in part (b).
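For reference, the score minimized below is the 10-fold cross-validation estimate of the prediction error; in this sketch $\mathcal{F}_k$ is the index set of fold $k$ (46 samples each) and $\hat{w}^{(-k)}_\lambda$ denotes the ridge weights fit with that fold held out:
$$\mathrm{CV}(\lambda) = \frac{1}{10}\sum_{k=1}^{10}\frac{1}{|\mathcal{F}_k|}\sum_{j\in\mathcal{F}_k}\bigl(y_j - \hat{w}^{(-k)\top}_\lambda\tilde{x}_j\bigr)^2, \qquad \lambda^\star = \arg\min_{\lambda}\,\mathrm{CV}(\lambda).$$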
MSE_i = zeros(N + 1, 1);
for i = 1:N + 1
% Parameters
lambda = lambdas(i);
MSE_k = zeros(10, 1);
for k = 0:9
X_T = X_sph(:,1:subset);
% Hold out set
X_valid = [ones(1, 46); X_T(:,46*k + 1:46*(k + 1))];
y_valid = y(46*k + 1:46*(k + 1));
% Split set
X_T(:,46*k + 1:46*(k + 1)) = [];
y_T = y(1:subset);
y_T(46*k + 1:46*(k + 1)) = [];
% Compute w_hat, y_hat, MSE
[w_train, ~, ~] = compute_ridge(y_T, X_T, lambda);
% fprintf("The MSE is: " + round(MSE, 3) + "\n")
MSE_k(k+1) = compute_MSE(y_valid, X_valid' * w_train);
end
MSE_i(i) = mean(MSE_k);
end
Figure 3: MSPE 10-fold cross validation
figure(3)
plot(lambdas, MSE_i, "Linewidth", 1.5)
title("MSPE 10-fold cross validation")
ylabel("MSPE", "FontSize", 12)
xlabel("\lambda", "FontSize", 12)
Find λ* and the EDD associated with this λ*:
[~,indx] = min(MSE_i);
lambda_st = (indx - 1) * step;
display(lambda_st)
lambda_st = 4.8000
Display EDD:
EDD_st = compute_EDD(X_sph(:,1:subset), lambda_st);
display(EDD_st)
EDD_st = 12.5544
Compute the cross-validated ridge weights that minimize the PRSS over the entire train+validate set using this value of λ*.
[w_ridge, ~, ~] = compute_ridge(y(1:subset), X_sph(:,1:subset), lambda_st);
disp(w_ridge)
22.5113 -1.0208 1.0195 0.0546 0.5551 -2.0109 2.7969 0.0560 -3.0587 2.3084 -1.8326 -2.0018 0.8795 -3.7053
Compare on the hold-out set:
MSE_hold_1c = compute_MSE(y_hold, X_hold' * w_ridge);
display(MSE_hold_1c)
MSE_hold_1c = 27.7251
fprintf(['The MSPE on hold-out set with ', num2str(char(955)),' = ',num2str(lambda_st), ...
' and CV ridge weights is ', num2str(MSE_hold_1b - MSE_hold_1c), ' smaller than ', ...
'MSPE without regularization.'])
The MSPE on hold-out set with λ = 4.8 and CV ridge weights is 0.14703 smaller than MSPE without regularization.

2) Descent methods for regression (15pts)

a)

(7 pts) Implement the (batch) gradient descent (GD) update rule to train your regression model on the training data. Set the initial weight to w^(0) = 0, the step size parameter to η = 0.05, and the number of GD iterations to K = 500. Plot the training error as a function of iteration and report the learned weight vector obtained after K iterations. Evaluate the mean squared prediction error of the learned weight vector when reapplied to the entire training sample and when applied to the test sample. Compare the learned GD weights to the directly computed weights learned from the training data.
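For reference, the update rule implemented below is the gradient step on the squared-error objective; note that, as coded, each iteration applies the step to one of ten 46-sample blocks of the training data (cycled deterministically) while keeping the $2/n$ scaling:
$$w^{(k+1)} = w^{(k)} - \eta\,\nabla J\bigl(w^{(k)}\bigr), \qquad \nabla J(w) = \frac{2}{n}\,\tilde{X}\bigl(\tilde{X}^\top w - y\bigr),$$
with $\eta = 0.05$ and $K = 500$ iterations.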
K = 500;
eta = 0.05;
w_k = zeros(d + 1, 1);
train_error = zeros(K, 1);
% Perform Batch GD
for j = 1:K
k = mod((j - 1), 10);
% Get batch of X_j and y_j
X_j = [ones(1, 46); X_sph(:,46*k + 1:46*(k + 1))];
y_j = y(46*k + 1:46*(k + 1));
g = (2 / n) * X_j * (X_j' * w_k - y_j);
w_k = w_k - eta * g;
y_hat = X_j' * w_k;
train_error(j) = compute_MSE(y_j, y_hat);
end
Figure 4: Plot the train error as a function of iteration
figure(4)
plot(train_error, "Linewidth", 1.5)
title("Train error as function of iteration, Batch GD")
ylabel("MSPE", "FontSize", 12)
xlabel("kth iteration", "FontSize", 12)
Report final learned weight vector
w_GD = w_k;
disp(w_GD)
22.2902 -0.7859 0.5658 -0.3961 0.6295 -1.0096 3.2500 -0.0843 -2.1139 0.7817 -0.4815 -1.8390 0.9278 -3.4638
Evaluate the mean squared prediction error of the learned weight vector when reapplied to the entire training sample and when applied to the test sample.
y_hat = [ones(1, subset); X_sph(:, 1:subset)]' * w_GD;
MSE_train_GD = compute_MSE(y(1:subset), y_hat);
display(MSE_train_GD)
MSE_train_GD = 21.9025
MSE_test_GD = compute_MSE(y_hold, X_hold' * w_GD);
display(MSE_test_GD)
MSE_test_GD = 28.3228
Compare the learned GD weights to the directly computed weights learned from the training data. We show the difference between the two estimates, w_LS − w_GD:
[w_LS, ~, ~] = compute_ridge(y(1:subset), X_sph(:, 1:subset), 0);
compare_w = w_LS - w_GD;
disp(compare_w)
0.2211 -0.2787 0.5350 0.5775 -0.0902 -1.1463 -0.4896 0.1689 -1.0938 1.8941 -1.7053 -0.1985 -0.0466 -0.3065

b)

(8 pts) Implement stochastic gradient descent (SGD) to train your regression model. Set the initial weight to w^(0) = 0, the step size parameter to η = 0.005, and the number of SGD epochs to K = 500. As in part (a), plot the training error as a function of epoch and report the learned weight vector obtained after K epochs. Evaluate the mean squared prediction error of the learned weight vector when reapplied to the entire training sample and when applied to the test sample. Compare the learned SGD weights to the GD weights, and to the directly computed weights learned from the training data.
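For reference, each SGD step below applies the same gradient formula to a single block of the data; in this sketch $\tilde{X}_B$ and $y_B$ denote the 46-sample block visited at that step, and each epoch visits all ten blocks once in a random order:
$$w \leftarrow w - \eta\,\frac{2}{n}\,\tilde{X}_B\bigl(\tilde{X}_B^\top w - y_B\bigr), \qquad \eta = 0.005.$$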
eta = 0.005;
w_k = zeros(d + 1, 1);
train_error = zeros(K, 1);
% Perform SGD by epochs
for epoch = 1:K
rand_j = randperm(10) - 1;
for i = 1:10
k = rand_j(i);
% Get batch of X_j and y_j
X_j = [ones(1, 46); X_sph(:,46*k + 1:46*(k + 1))];
y_j = y(46*k + 1:46*(k + 1));
g = (2 / n) * X_j * (X_j' * w_k - y_j);
w_k = w_k - eta * g;
y_hat = X_j' * w_k;
end
train_error(epoch) = compute_MSE(y_j, y_hat);
end
Figure 5: Plot the train error as a function of K epochs
% Plot the training error as a function of epoch
figure(5)
plot(train_error, "Linewidth", 1.5)
title('Train error as function of K epochs')
ylabel("MSPE", "FontSize", 12)
xlabel("kth epoch", "FontSize", 12)
Report the learned weight vector obtained after K epochs.
w_SGD = w_k;
disp(w_SGD)
22.2731 -0.7861 0.5606 -0.4056 0.6228 -1.0098 3.2291 -0.0858 -2.1063 0.7797 -0.4893 -1.8375 0.9284 -3.4522
Evaluate the mean squared prediction error of the learned weight vector when reapplied to the entire training sample and when applied to the test sample.
y_hat = [ones(1, subset); X_sph(:, 1:subset)]' * w_SGD;
MSE_train_SGD = compute_MSE(y(1:subset), y_hat);
display(MSE_train_SGD)
MSE_train_SGD = 21.9126
MSE_test_SGD = compute_MSE(y_hold, X_hold' * w_SGD);
display(MSE_test_SGD)
MSE_test_SGD = 28.3397
Compare the learned SGD weights to the GD weights by taking differences of the estimates.
compare_w_SGD_GD = w_SGD - w_GD;
disp(compare_w_SGD_GD)
-0.0171 -0.0003 -0.0052 -0.0094 -0.0067 -0.0002 -0.0209 -0.0015 0.0075 -0.0019 -0.0078 0.0015 0.0006 0.0117
Compare SGD weights to the directly computed weights learned from the training data by taking differences of the estimates.
compare_w_SGD = w_LS - w_SGD;
disp(compare_w_SGD)
0.2382 -0.2784 0.5402 0.5869 -0.0836 -1.1460 -0.4687 0.1704 -1.1013 1.8961 -1.6975 -0.2000 -0.0472 -0.3182

3) Subgradient methods for the optimal soft margin hyperplane (25 pts)

b)

Implement the subgradient method for minimizing J and apply it to the nuclear data. Submit two figures: One showing the data and the learned line, the other showing J as a function of iteration number. Also report the estimated hyperplane parameters and the minimum achieved value of the objective function.
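For reference, the objective and the subgradient used below are, in the notation $\theta = (w_0, w^\top)^\top$, $\tilde{x}_i = (1, x_i^\top)^\top$, with $D = \mathrm{diag}(0,1,\dots,1)$ excluding the offset from the penalty (matching compute_J):
$$J(\theta) = \frac{\lambda}{2}\,\theta^\top D\,\theta + \frac{1}{n}\sum_{i=1}^{n}\max\bigl(0,\,1-y_i\,\theta^\top\tilde{x}_i\bigr),$$
$$g = \lambda D\,\theta - \frac{1}{n}\sum_{i:\,y_i\theta^\top\tilde{x}_i<1} y_i\,\tilde{x}_i \;\in\; \partial J(\theta),$$
which is the step direction computed inside the loop.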
clear
clc
% Implement the subgradient method for minimizing J and apply it to the nuclear data.
% Submit two figures.
load nuclear.mat;
rng(0);
% Initialize parameters
K = 100;
lambda = 0.001;
[d, n] = size(x);
x = [ones(1, n); x];
theta_k = ones(d + 1, 1);
D = diag([0; ones(d, 1)]);
J = zeros(K, 1);
% f = waitbar(0, 'Please wait...');
% Iterate K times to update theta_k
tic;
for j = 1:K
% Compute sub-gradient
indic = theta_k' * x * diag(y);
g = -(1/n) * x * (y' .* (indic < 1)') + lambda * D * theta_k;
alpha_j = K / j;
theta_k = theta_k - alpha_j * g;
J(j) = compute_J(y, x, theta_k, lambda);
% waitbar(j / K, f, sprintf('Trial %d of %d', j, K));
end
time_GD = toc;
We obtain the learned line from θᵀ(1, x₁, x₂)ᵀ = 0, i.e., x₂ = −(θ₁ + θ₂x₁)/θ₃.
data = [ones(n, 1), linspace(0, 7, n)'];
pred_line = -(1/theta_k(3)) * (data * theta_k(1:2));
Figure 6: Showing the data and the learned line
figure(6)
scatter(x(2, y == 1), x(3, y == 1))
hold on;
scatter(x(2, y == -1), x(3, y == -1))
plot(linspace(0, 7, n)', pred_line, 'LineWidth', 2)
title('Data and Learned line')
xlabel('Total energy', "FontSize",12)
ylabel('Tail energy',"FontSize",12)
hold off
Figure 7: Showing J as a function of iteration number.
figure(7)
plot(J, 'Linewidth', 1.5)
xlabel("kth iteration", "FontSize",12)
ylabel("$J(w,w_0)$", "FontSize", 12, "Interpreter","latex")
title('J as function of iteration with subgradient method')
Report the estimated hyperplane parameters and the minimum achieved value of the objective function.
hyper_p = theta_k;
display(hyper_p)
hyper_p = 3×1
-0.9927 -3.1202 15.4776
min_obj = min(J);
display(min_obj)
min_obj = 0.3105

c)

(6 pts) Now implement the stochastic subgradient method, which is like the subgradient method except that each step direction is a subgradient of a randomly chosen term of the objective. Be sure to cycle through all data points before starting a new loop through the data. Report/hand in the same items as for part (b).
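For reference, each stochastic step below uses the subgradient formula from part (b) restricted to one block of the data; in this sketch $B$ is the block visited at that step and $\alpha_j = K/j$ is the step size used in the code:
$$\theta \leftarrow \theta - \alpha_j\Bigl(\lambda D\,\theta - \frac{1}{n}\sum_{i\in B:\,y_i\theta^\top\tilde{x}_i<1} y_i\,\tilde{x}_i\Bigr).$$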
rng(0);
% cycle K times through theta_k
% f = waitbar(0, 'Please wait...');
cycle_size = 10;
J = zeros(cycle_size * K, 1);
tic;
for j = 1:K
% Set up permutations
rand_j = randperm(cycle_size) - 1;
% Step-size
alpha_j = K / j;
% Begin Cycling through data
for i = 1:cycle_size
k = rand_j(i);
subset = n / cycle_size;
x_j = x(:,subset*k + 1:subset*(k + 1));
y_j = y(subset*k + 1:subset*(k + 1));
% Compute sub-gradient
indic = theta_k' * x_j * diag(y_j);
g = -(1/n) * x_j * (y_j' .* (indic < 1)') + lambda * D * theta_k;
theta_k = theta_k - alpha_j * g;
J(cycle_size * (j - 1) + i) = compute_J(y_j, x_j, theta_k, lambda);
end
% waitbar(j / K, f, sprintf('Trial %d of %d', j, K));
end
time_SGD = toc;
We obtain the learned line from θᵀ(1, x₁, x₂)ᵀ = 0, i.e., x₂ = −(θ₁ + θ₂x₁)/θ₃.
data = [ones(n, 1), linspace(0, 7, n)'];
pred_line = -(1/theta_k(3)) * (data * theta_k(1:2));
Figure 8: Showing the data and the learned line
figure(8)
scatter(x(2, y == 1), x(3, y == 1))
hold on;
scatter(x(2, y == -1), x(3, y == -1))
plot(linspace(0, 7, n)', pred_line, 'LineWidth', 2)
title('Data and Learned line')
xlabel('Total energy', "FontSize",12)
hold off
ylabel('Tail energy',"FontSize",12)
Figure 9: Showing J as a function of iteration with Stochastic subgradient method.
figure(9)
plot(J)
xlabel("kth iteration", "FontSize",12)
ylabel("$J(w,w_0)$", "FontSize", 12, "Interpreter","latex")
title('J as function of iteration with Stochastic subgradient method')
Report the estimated hyperplane parameters and the minimum achieved value of the objective function.
hyper_p = theta_k;
display(hyper_p)
hyper_p = 3×1
-0.9340 -1.0288 5.8822
min_obj = min(J);
display(min_obj)
min_obj = 0.3004

d)

Comment on the (empirical) rate of convergence of the stochastic subgradient method relative to the subgradient method. Explain your finding.
Timing both processes, we find that the stochastic subgradient approach takes:
display(time_SGD)
time_SGD = 14.1284
While the full subgradient method takes:
display(time_GD)
time_GD = 92.4966
Each full subgradient step uses all n data points, while each stochastic step uses only a tenth of them, so the stochastic method is much cheaper per update. Empirically it therefore converges much faster in wall-clock time, and here it also reaches a slightly smaller objective value.

e)

Submit your code.

4) Coordinate Descent Ridge Regression (15pts)

d)

(5 pts) Here you will implement the coordinate descent (CD) ridge regression algorithm and apply it to the sphered train-validate Boston Housing data you treated in Problem 1. Make sure you include the y-intercept weight w_0 in the data-dependent term but exclude it from the penalty term in the PRSS. Using the ridge regularization parameter λ = 100 and initializing your w to be a vector with all elements equal to 1, run your CD ridge regression for 700 iterations (about 54 cycles) and plot the evolution of the weights (a single plot of all weights except w_0, over iteration), and also plot the learning curve with respect to mean squared prediction error. The label on the x-axis should be number of iterations. Your weights should converge to the exact ridge weights computed with your Problem 1 implementation for λ = 100.
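For reference, the coordinate update implemented below comes from setting the partial derivative of the PRSS with respect to $w_i$ to zero; $\hat{y}^{(-i)}_j$ denotes the prediction for sample $j$ with $w_i$ set to zero, and $a_i$, $c_i$ match the variables of the same name in the code:
$$a_i = 2\sum_{j=1}^{n}x_{ij}^2, \qquad c_i = 2\sum_{j=1}^{n}x_{ij}\bigl(y_j - \hat{y}^{(-i)}_j\bigr), \qquad w_i \leftarrow \frac{c_i}{a_i + 2\lambda}.$$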
clear
clc
lambda = 100;
K = 700;
boston = readtable('boston-corrected.csv');
data_y = boston.CMEDV;
data_X = transpose(table2array(boston(:,2:end)));
[d, n] = size(data_X);
subset = 460;
% Using the ridge regularization parameter lambda = 100 and initializing
% your w to be a vector with all elements equal to 1, run your CD ridge
% regression for 700 iterations (about 54 cycles)
y = data_y(1:subset);
[X, ~, ~] = sphere_X(data_X(:, 1:subset));
y_hat = zeros(subset, 1);
w = ones(d, 1);
w_0 = mean(y);
a_i = 2 * sum(X.^2, 2);
w_evol = zeros(d, K);
MSE = zeros(K, 1);
% Cycle through w by CD
for j = 1:K
i = mod((j - 1), d) + 1;
% Compute y_hat
w_copy = w;
w_copy(i) = 0;
y_hat = w_0 + X' * w_copy;
% Compute c_i
c_i = 2 * dot(X(i,:), (y - y_hat));
w(i) = c_i / (a_i(i) + 2*lambda);
w_evol(:, j) = w;
MSE(j) = compute_MSE(y, w_0 + X' * w);
end
Figure 10: Plot the evolution of the weights (a single plot of all weights except w_0, over iteration)
labels = [];
for i = 1:d
element = "w_" + string(i);
labels = [labels,element];
end
figure(10)
plot(w_evol', "LineWidth", 1.5)
title("Evolution of $w$ over K iterations", "Interpreter","latex")
ylabel("Estimate of $w_i$", "Interpreter", "latex")
xlabel('kth iteration', "FontSize", 12)
legend(labels)
Figure 11: Plot the learning curve with respect to mean squared prediction error
figure(11)
plot(MSE, "LineWidth", 1.5)
xlabel("kth iteration", 'FontSize',12)
ylabel("MSPE", "FontSize",12)
title('Learning Curve w.r.t. MSPE')

5) Coordinate Descent Lasso Regression (15pts)

b)

(5 pts) Here you will implement the CD Lasso algorithm and apply it to the Boston Housing data you downloaded for Problem 1. Run the CD Lasso regression optimization on the sphered train-validate set you created in Problem 1 for 700 iterations with λ = 100 and with w initialized to be a vector of all 1's. As in Problem 4, plot the weight trajectories over iteration and the learning curve. Are there any values of the weight vector that have been thresholded to zero after 700 iterations?
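For reference, the CD Lasso coordinate update replaces the ridge shrinkage of Problem 4 with soft-thresholding; $a_i$ and $c_i$ are defined as in Problem 4, and this matches the soft helper function listed in the Functions section:
$$w_i \leftarrow \mathrm{soft}(a_i, c_i, \lambda) = \begin{cases}(c_i-\lambda)/a_i, & c_i > \lambda,\\ 0, & |c_i|\le\lambda,\\ (c_i+\lambda)/a_i, & c_i < -\lambda.\end{cases}$$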
clear
clc
lambda = 100;
K = 700;
boston = readtable('boston-corrected.csv');
data_y = boston.CMEDV;
data_X = transpose(table2array(boston(:,2:end)));
[d, n] = size(data_X);
subset = 460;
% Run the CD Lasso regression optimization on the sphered train-validate
% set you created in Problem 1 for 700 iterations for lambda = 100 and with
% w initialized to be a vector of all 1's.
y = data_y(1:subset);
[X, mu, sigma] = sphere_X(data_X(:, 1:subset));
[w, w_evol, MSE] = CD_Lasso(y, X, lambda, K);
% Hold out set
X_hold = (data_X(:, subset+1:end) - mu) ./ sigma;
y_hold = data_y(subset+1:end);
Figure 12: Plot the weight trajectories over iteration and the learning curve.
labels = [];
for i = 1:d
element = "w_" + string(i);
labels = [labels, element];
end
figure(12)
plot(w_evol', "LineWidth", 1.5)
title("Evolution of $w$ over K iterations", "Interpreter","latex")
ylabel("Estimate of $w_i$", "Interpreter","latex")
xlabel('kth iteration', "FontSize", 12)
legend(labels)
Figure 13: Plot the learning curve with respect to mean squared prediction error
figure(13)
plot(MSE, "LineWidth", 1.5)
xlabel("kth iteration", 'FontSize',12)
ylabel("MSPE", "FontSize",12)
title('Learning Curve w.r.t. MSPE')
Next we find which estimate was thresholded to 0.
disp(w)
-0.7460 0.6640 -0.0227 0.5045 -1.6029 2.9289 0 -2.4542 0.9549 -0.6757 -1.9084 0.7943 -3.7114
We see that the seventh weight, w_7, was thresholded to 0.

c)

(5 pts) Using the same cross-validation procedure as you implemented in Problem 1, find the lasso CV error curve on the train-validate data and determine the value of λ that minimizes the CV error. To reduce computation time you may constrain your search region to λ ∈ [0, 100], i.e., partition [0, 100] into 1000 equal subintervals. Using this value of λ and the entire sphered train-validate dataset, learn the optimal lasso weights and apply these weights in the same manner as in part (c) of Problem 1 to predict over the test data. Report the mean squared prediction error of this lasso predictor on the test data and compare to that of the CV ridge estimator determined in Problem 1.c.
step = 0.1;
N = 100 / step;
lambdas = linspace(0, 100, N + 1);
MSE_i = zeros(N + 1, 1);
% f = waitbar(0, 'Please wait...');
for i = 1:(N + 1)
% Parameters
lambda = lambdas(i);
MSE_k = zeros(10, 1);
for k = 0:9
X_T = X;
% Hold out set
X_valid = X_T(:, 46*k + 1:46*(k + 1));
y_valid = y(46*k + 1:46*(k + 1));
% Split set
X_T(:,46*k + 1:46*(k + 1)) = [];
y_T = y(1:subset);
y_T(46*k + 1:46*(k + 1)) = [];
% Compute w_hat, y_hat, MSE
[w, ~, MSE] = CD_Lasso(y_T, X_T, lambda, K);
w_0 = mean(y_T);
% fprintf("The MSE is: " + round(MSE, 3) + "\n")
MSE_k(k+1) = compute_MSE(y_valid, w_0 + X_valid' * w);
end
MSE_i(i) = mean(MSE_k);
% waitbar(i / (N + 1), f, sprintf('Trial %d of %d', i, N + 1));
end
Find the value λ* that minimizes the CV error.
[~,indx] = min(MSE_i);
lambda_st = (indx - 1) * step;
display(lambda_st)
lambda_st = 23.7000
% Compute the CV lasso weights that minimize the lasso objective over the
% entire train+validate set using this value of lambda star
[w_test, ~, ~] = CD_Lasso(y, X, lambda_st, K);
Report the mean squared prediction errors of this CV lasso predictor on the test data and compare to that of the CV ridge estimator determined in Problem 1.c.
MSE_hold_5c = compute_MSE(y_hold, w_0 + X_hold' * w_test);
display(MSE_hold_5c)
MSE_hold_5c = 27.6157
% Compare MSE_hold_1c and Compare MSE_hold_5c
fprintf(['The MSPE on hold-out set with lasso CV is ', ...
num2str(27.7251 - MSE_hold_5c), ' smaller than ', ...
'MSPE from Problem 1.c.'])
The MSPE on hold-out set with lasso CV is 0.10935 smaller than MSPE from Problem 1.c.

Functions

function [w_hat, y_hat, MSE] = compute_ridge(y, X, lambda)
% Compute Ridge Regression Parameters
[d, n] = size(X);
X = [ones(1, n); X];
D = diag([0; ones(d, 1)]);
w_hat = (X * X' + lambda * D) \ X * y;
y_hat = X' * w_hat;
MSE = compute_MSE(y, y_hat);
end
function MSE = compute_MSE(y, y_hat)
% Compute MSPE
[n, ~] = size(y);
MSE = (1 / n) * (y - y_hat)' * (y - y_hat);
end
function [X_sph, mu, sigma] = sphere_X(X)
% Sphere the data
[~, n] = size(X);
Xc = X * (eye(n) - 1/n * ones(n, 1) * ones(1, n));
mu = 1/n * (X * ones(n, 1));
sigma = diag(sqrt(1 / (n-1) * (Xc*Xc')));
X_sph = diag(1 ./ sigma) * Xc;
end
function EDD = compute_EDD(X, lambda)
% Compute Effective Degrees of Freedom
[d, ~] = size(X);
% Compute w_hat, y_hat, MSE
EDD = trace(X' * ((X * X' + lambda * eye(d)) \ X));
end
function J = compute_J(y, X, theta, lambda)
% Compute Objective Function from Optimal soft margin hyperplane
[d, n] = size(X);
D = diag([0; ones(d - 1, 1)]);
J = (lambda / 2) * (theta' * D * theta);
for i = 1:n
J = J + (1/n) * max(0, 1 - y(i) * dot(theta, X(:, i)));
end
end
function w = soft(a, c, lambda)
% Soft-thresholding operator used by CD Lasso
if c > lambda
w = (c - lambda) / a;
elseif c < -lambda
w = (c + lambda) / a;
else
w = 0;
end
end
function [w, w_evol, MSE] = CD_Lasso(y, X, lambda, K)
% Perform CD Lasso
[d, ~] = size(X);
w = ones(d, 1);
w_0 = mean(y);
a_i = 2 * sum(X.^2, 2);
w_evol = zeros(d, K);
MSE = zeros(K, 1);
% Cycle through w by CD
for j = 1:K
i = mod((j - 1), d) + 1;
% Compute y_hat
w_copy = w;
w_copy(i) = 0;
y_hat = w_0 + X' * w_copy;
% Compute c_i
c_i = 2 * dot(X(i,:), (y - y_hat));
w(i) = soft(a_i(i), c_i, lambda);
w_evol(:, j) = w;
MSE(j) = compute_MSE(y, w_0 + X' * w);
end
end