Let \(Y\) denote the class label in the two-class classification problem, and use \(1\) and \(-1\) to denote the two possible values that \(Y\) can take respectively. Prove the following result: \[ f^{*}(x)=\underset{f(x)}{\arg \min }\, \mathrm{E}_{Y | x}\left(e^{-Y f(x)}\right)=\frac{1}{2} \log \frac{\operatorname{Pr}(Y=1 | x)}{\operatorname{Pr}(Y=-1 | x)}, \] that is, the minimizer of the population version of the AdaBoost criterion is one-half of the log odds of \(\operatorname{Pr}(Y=1 | x)\). Thus, the additive expansion produced by AdaBoost is estimating one-half of the log odds. Since the sign of the log odds is the same as the output of the Bayes optimal classifier, this justifies using the sign of the additive expansion produced by AdaBoost as the classification rule.
\(\mathbf{Solution.}\qquad\) We can write the conditional expectation as \[\mathrm{E}_{Y|x}\left(e^{-Yf(x)}\right)=P(Y=1\mid x)\,e^{-(1)f(x)}+P(Y=-1\mid x)\,e^{-(-1)f(x)}.\] Taking the derivative of \(\mathrm{E}_{Y|x}\left(e^{-Yf(x)}\right)\) with respect to \(f(x)\) and setting it equal to \(0\) yields \[\begin{align*} 0 & =\frac{\partial}{\partial f(x)}\left[\mathrm{E}_{Y\mid x}\left(e^{-Yf(x)}\right)\right]\\ 0 & =-P(Y=1\mid x)e^{-f(x)}+P(Y=-1\mid x)e^{f(x)}\\ P(Y=-1\mid x)e^{f(x)} & =P(Y=1\mid x)e^{-f(x)}\\ e^{2f(x)} & =\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}\\ 2f(x) & =\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}\\ f(x) & =\frac{1}{2}\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)} \end{align*}\] Since the second derivative, \(P(Y=1\mid x)e^{-f(x)}+P(Y=-1\mid x)e^{f(x)}\), is strictly positive, this stationary point is indeed the minimizer.
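As a quick numerical sanity check (a minimal sketch, with a hypothetical value \(P(Y=1\mid x)=0.7\) chosen purely for illustration), we can minimize the expected exponential loss over \(f(x)\) with optimize() and compare the result with one-half of the log odds.

# Numerical check of the population minimizer (p = 0.7 is a hypothetical value)
p = 0.7
exp_loss = function(f) p * exp(-f) + (1 - p) * exp(f)   # E_{Y|x}[exp(-Y f(x))]
optimize(exp_loss, interval = c(-10, 10))$minimum        # numerical minimizer of the loss
0.5 * log(p / (1 - p))                                   # one-half of the log odds

Both lines return approximately the same value, matching the closed-form solution above.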
This exercise relates to the bank marketing data set, which can be found in the files under the Assignments/Data directory. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). See the file bank_marketing_info for more details and the full list of variables. The classification goal is to predict whether the client will subscribe to a term deposit (yes/no), recorded in the variable deposit. For the random forest fit, use the importance() function to determine which variables are most important, and describe the effect of mtry, nodesize, and ntree on the error rate obtained. For the boosted model, use the summary() function to determine which variables are most important, and describe the effect of interaction.depth, shrinkage, and n.trees on the error rate obtained.

(a) \(\mathbf{Solution.}\qquad\) We fit a classification tree using the default parameters and the Gini index as the measure of impurity.
# QUESTION 2a -----------------------------------------------------------------
library(rpart)          # classification trees
library(rattle)         # fancyRpartPlot()
library(randomForest)   # random forests (used in part (c))
library(gbm)            # boosting (used in part (d))

train = read.csv("./Data/bank_marketing_train.csv", header = TRUE)
test = read.csv("./Data/bank_marketing_test.csv", header = TRUE)

# Convert the categorical variables (including the response) to factors
cols <- c("deposit", "job", "marital", "education", "default", "housing",
          "loan", "contact", "month", "poutcome")
train[cols] <- lapply(train[cols], factor)
test[cols] <- lapply(test[cols], factor)

# Fit Classification Tree
tree = rpart(deposit ~ ., data = train, parms = list(split = "gini"),
             method = "class")
fancyRpartPlot(tree)
Next we calculate the following proportions,
# Get predictions on Test set
test.pred = predict(tree, test, type = "class")
cf_mat = table(test.pred, test$deposit)
## Percentage of clients in the test set that were misclassified
sum(test.pred != test$deposit) / nrow(test)
## [1] 0.1812481
## Percentage of 'no' clients that were misclassified
cf_mat["yes", "no"] / sum(test$deposit == "no")
## [1] 0.2378223
## Percentage of 'yes' clients that were misclassified
cf_mat["no", "yes"] / sum(test$deposit == "yes")
## [1] 0.1197007
We see that the percentage of ‘no’ clients that were misclassified is about twice as large as the percentage of ‘yes’ clients that were misclassified.
(b) \(\mathbf{Solution.}\qquad\) Below we plot a subtree with seven terminal nodes. In this smaller tree, duration is the strongest splitting variable. The other splitting variables are contact, month, poutcome, and housing.
# QUESTION 2b ---------------
tree_2 = rpart(deposit ~ ., data = train, parms = list(split = "gini"),
               method = "class", cp = 0.019)
fancyRpartPlot(tree_2)
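The subtree above was obtained by refitting with cp = 0.019. An equivalent route (a sketch using the tree object from part (a)) is to inspect the complexity-parameter table of the full tree and prune it at the desired threshold, which shows how a cp value yielding seven terminal nodes can be located.

# Complexity-parameter table of the full tree: the 'nsplit' column shows how
# many splits (terminal nodes minus one) survive each cp threshold
printcp(tree)

# Prune the existing tree rather than refitting; equivalent to cp = 0.019 above
tree_2_alt = prune(tree, cp = 0.019)
fancyRpartPlot(tree_2_alt)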
(c) \(\mathbf{Solution.}\qquad\) Below we fit our random forest and compute the following misclassification errors.
# QUESTION 2c ---------------
# Fit a random forest with mtry = floor(sqrt(p)) predictors tried at each split
rf_tree = randomForest(deposit ~ ., data = train,
                       mtry = floor(sqrt(length(train) - 1)),
                       importance = TRUE)

# Get predictions on the test set
rf_test_pred = predict(rf_tree, newdata = test)
cf_mat = table(rf_test_pred, test$deposit)
## Percentage of clients in the test set that were misclassified
mean(rf_test_pred != test$deposit)
## [1] 0.143625
## Percentage of 'no' clients that were misclassified
cf_mat["yes", "no"] / sum(test$deposit == "no")
## [1] 0.1787966
## Percentage of 'yes' clients that were misclassified
cf_mat["no", "yes"] / sum(test$deposit == "yes")
## [1] 0.1053616
Comparing the test error from CART to the random forest, we see that the test error has dropped from 0.181 to 0.144, and the misclassification rates of both ‘no’ and ‘yes’ clients have also decreased. Next we determine the most important variables using the importance() function. We see that the five most important variables are the five variables with the highest MeanDecreaseAccuracy, which are duration, month, contact, day, and housing.
## no yes MeanDecreaseAccuracy MeanDecreaseGini
## duration 195.152237 255.346756 275.08075738 1334.477155
## month 100.996490 42.135797 115.02493666 498.805985
## contact 46.238153 20.765853 51.13855718 139.299654
## day 51.800002 10.224557 49.89891790 257.117275
## housing 34.977807 29.301953 43.28862753 104.558066
## poutcome 56.363534 7.931119 42.62086556 206.331849
## age 33.755245 15.953448 35.25637341 264.894416
## pdays 21.749930 21.171218 28.54565271 145.431647
## campaign 13.654154 16.772335 21.81525970 111.427321
## job 25.577985 3.915455 20.81001516 228.872166
## previous 16.590239 11.158511 17.75687164 85.619120
## balance 13.761276 10.981707 16.75805976 287.226490
## education 14.756626 2.938072 12.94192070 78.520358
## loan 4.895538 12.080747 12.75896002 29.217710
## marital 3.536912 9.765847 9.99378527 62.003280
## default -2.889794 2.138818 -0.03424955 4.211927
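The table above is consistent with sorting the output of importance(rf_tree) by MeanDecreaseAccuracy. A short sketch of how it can be reproduced and visualized with the rf_tree object fitted above (varImpPlot() is the built-in dot-chart in randomForest):

# Importance matrix sorted by MeanDecreaseAccuracy (largest first)
imp = importance(rf_tree)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]

# Dot-chart of both importance measures
varImpPlot(rf_tree)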
mtry is the number of variables randomly sampled as candidates at each split. As mtry increases, the test error will initially decrease. If we use all of the variables, the procedure is equivalent to a bagging classifier and we lose the property of the random forest that decorrelates the trees; thus, if we continue to increase mtry, the test error will begin to increase.

nodesize is the minimum number of observations in a terminal node. As nodesize increases, the individual trees become shallower; the test error may initially decrease because each tree overfits the training set less, but if nodesize becomes too large the trees underfit and the test error increases.

ntree is the number of trees to grow. As ntree increases, the test error will decrease, but eventually it will taper off and we will not see much further improvement.
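A minimal sketch of how the effect of mtry could be checked empirically, assuming the train/test objects from above (the grid of mtry values and ntree = 200 are arbitrary choices to keep the run time manageable):

# Test error as a function of mtry (ntree kept small only to speed up the loop)
mtry_grid = 1:(length(train) - 1)
mtry_err = sapply(mtry_grid, function(m) {
  fit = randomForest(deposit ~ ., data = train, mtry = m, ntree = 200)
  mean(predict(fit, newdata = test) != test$deposit)
})
plot(mtry_grid, mtry_err, type = "b", xlab = "mtry", ylab = "Test error")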
(d) \(\mathbf{Solution.}\qquad\) Below we use Boosting and compute the following misclassification errors.
# QUESTION 2d ---------------
ntrees = 3000

# gbm with the adaboost distribution expects a 0/1 response
train$deposit = ifelse(train$deposit == "yes", 1, 0)
test$deposit = ifelse(test$deposit == "yes", 1, 0)

# Fit Boosting Tree
ada_tree = gbm(deposit ~ ., data = train, distribution = "adaboost",
               n.trees = ntrees, shrinkage = 0.05, interaction.depth = 3)

## Get Confusion Matrix of our test predictions
ada_pred = predict(ada_tree, newdata = test, n.trees = ntrees, type = "response")
ada_pred = ifelse(ada_pred > 0.5, 1, 0)
cf_mat = table(ada_pred, test$deposit)
## Percentage of clients in the test set that were misclassified
mean(ada_pred != test$deposit)
## [1] 0.1388474
## Percentage of 'no' clients that were misclassified
cf_mat["1", "0"] / sum(test$deposit == 0)
## [1] 0.1512894
## Percentage of 'yes' clients that were misclassified
cf_mat["0", "1"] / sum(test$deposit == 1)
## [1] 0.1253117
Comparing the test error from the random forest to boosting, we see that the test error has dropped from 0.144 to 0.139. For the boosting method, the misclassification rate of ‘no’ clients dropped and the misclassification rate of ‘yes’ clients increased a bit compared to the random forest. Next we determine the most important variables using the summary() function. We see that the five most important variables are duration, month, job, balance, and poutcome.
## var rel.inf
## duration duration 29.78680445
## month month 22.43817412
## job job 8.90332839
## balance balance 7.65645646
## poutcome poutcome 6.11600790
## age age 5.51333977
## day day 5.29151237
## contact contact 3.94368105
## pdays pdays 3.54959206
## housing housing 2.12094579
## campaign campaign 1.36517409
## education education 1.24172230
## previous previous 0.94044915
## marital marital 0.82837517
## loan loan 0.29966043
## default default 0.00477652
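The relative-influence table above comes from the summary() method for gbm objects; a sketch of the call, using the ada_tree model fitted above (plotit = FALSE suppresses the accompanying bar chart):

# Relative influence of each predictor, averaged over all ntrees trees
summary(ada_tree, n.trees = ntrees, plotit = FALSE)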
interaction.depth specifies the maximum depth of each tree (the highest level of variable interactions allowed). As interaction.depth increases, the test error will decrease at first and then increase due to overfitting.

shrinkage is also known as the learning rate. As shrinkage increases, the test error will decrease at first and then increase, because a learning rate that is too high causes the model to overfit quickly.

n.trees specifies the total number of trees to fit. As n.trees increases, the test error will continue to decrease, but eventually it will taper off and we will not see much further improvement.
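A sketch of how the effect of n.trees can be traced without refitting, assuming the ada_tree model and the 0/1-coded test set from above: predict.gbm accepts a vector of n.trees values and returns one column of predictions per value, so the test error can be plotted along the boosting iterations (gbm.perf() offers a similar built-in diagnostic on the training data).

# Test error as a function of the number of boosting iterations
iters = seq(100, ntrees, by = 100)
preds = predict(ada_tree, newdata = test, n.trees = iters, type = "response")
err = colMeans((preds > 0.5) != test$deposit)   # one error rate per n.trees value
plot(iters, err, type = "l", xlab = "n.trees", ylab = "Test error")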