Let \(Y\) denote the class label in the two-class classification problem, and use \(1\) and \(-1\) to denote the two possible values that \(Y\) can take respectively. Prove the following result: \[ f^{*}(x)=\underset{f(x)}{\arg \min }\, \mathrm{E}_{Y | x}\left(e^{-Y f(x)}\right)=\frac{1}{2} \log \frac{\operatorname{Pr}(Y=1 | x)}{\operatorname{Pr}(Y=-1 | x)}, \] that is, the minimizer of the population version of the AdaBoost criterion is one-half of the log odds of \(\operatorname{Pr}(Y=1 | x)\). Thus, the additive expansion produced by AdaBoost is estimating one-half of the log odds. Since the sign of the log odds is the same as the output of the Bayes optimal classifier, this justifies using the sign of the additive expansion produced by AdaBoost as the classification rule.
\(\mathbf{Solution.}\qquad\) We can write the conditional expectation as \[\mathrm{E}_{Y|x}\left(e^{-Yf(x)}\right)=P(Y=1\mid x)\,e^{-(1)f(x)}+P(Y=-1\mid x)\,e^{-(-1)f(x)}.\] Taking the derivative of \(\mathrm{E}_{Y|x}\left(e^{-Yf(x)}\right)\) with respect to \(f(x)\) and setting it equal to \(0\) yields \[\begin{align*} 0 & =\frac{\partial}{\partial f(x)}\left[\mathrm{E}_{Y\mid x}\left(e^{-Yf(x)}\right)\right]\\ 0 & =-P(Y=1\mid x)e^{-f(x)}+P(Y=-1\mid x)e^{f(x)}\\ P(Y=-1\mid x)e^{f(x)} & =P(Y=1\mid x)e^{-f(x)}\\ e^{2f(x)} & =\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}\\ 2f(x) & =\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)}\\ f(x) & =\frac{1}{2}\log\frac{P(Y=1\mid x)}{P(Y=-1\mid x)} \end{align*}\] Since the second derivative, \(P(Y=1\mid x)e^{-f(x)}+P(Y=-1\mid x)e^{f(x)}\), is strictly positive, this stationary point is indeed the minimizer.
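As a quick numerical sanity check (a minimal sketch, with a hypothetical value \(P(Y=1\mid x)=0.7\) chosen purely for illustration), we can minimize the expected exponential loss over \(f(x)\) with optimize() and compare the result with one-half of the log odds.

# Numerical check of the population minimizer (p = 0.7 is a hypothetical value)
p = 0.7
exp_loss = function(f) p * exp(-f) + (1 - p) * exp(f)   # E_{Y|x}[exp(-Y f(x))]
optimize(exp_loss, interval = c(-10, 10))$minimum        # numerical minimizer of the loss
0.5 * log(p / (1 - p))                                   # one-half of the log odds

Both lines return approximately the same value, matching the closed-form solution above.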
This exercise relates to the bank marketing data set, which can be found in the files under the Assignments/Data directory. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). See the file bank_marketing_info for more details and the full list of variables. The classification goal is to predict whether the client will subscribe to a term deposit (yes/no), recorded in the variable deposit. For the random forest fit, use the importance() function to determine which variables are most important, and describe the effect of mtry, nodesize, and ntree on the error rate obtained. For the boosted model, use the summary() function to determine which variables are most important, and describe the effect of interaction.depth, shrinkage, and n.trees on the error rate obtained.

(a) \(\mathbf{Solution.}\qquad\) We fit a classification tree using the default parameters and the Gini index as the measure of impurity.
# QUESTION 2a -----------------------------------------------------------------
library(rpart)          # classification trees
library(rattle)         # fancyRpartPlot()
library(randomForest)   # random forests (used in part (c))
library(gbm)            # boosting (used in part (d))

train = read.csv("./Data/bank_marketing_train.csv", header = TRUE)
test = read.csv("./Data/bank_marketing_test.csv", header = TRUE)

# Convert the categorical variables (including the response) to factors
cols <- c("deposit", "job", "marital", "education", "default", "housing",
          "loan", "contact", "month", "poutcome")
train[cols] <- lapply(train[cols], factor)
test[cols] <- lapply(test[cols], factor)

# Fit Classification Tree
tree = rpart(deposit ~ ., data = train, parms = list(split = "gini"),
             method = "class")
fancyRpartPlot(tree)
Next we calculate the following proportions,
# Get predictions on Test set
test.pred = predict(tree, test, type = "class")
cf_mat = table(test.pred, test$deposit)
## Percentage of clients in the test set that were misclassified
sum(test.pred != test$deposit) / nrow(test)
## [1] 0.1812481
## Percentage of 'no' clients that were misclassified
cf_mat["yes", "no"] / sum(test$deposit == "no")
## [1] 0.2378223
## Percentage of 'yes' clients that were misclassified
cf_mat["no", "yes"] / sum(test$deposit == "yes")
## [1] 0.1197007
We see that the percentage of ‘no’ clients that were misclassified is about twice as large as the percentage of ‘yes’ clients that were misclassified.
(b) \(\mathbf{Solution.}\qquad\) Below we plot a subtree with seven terminal nodes. In this smaller tree, duration is the strongest splitting variable. The other splitting variables are contact, month, poutcome, and housing.
# QUESTION 2b ---------------
tree_2 = rpart(deposit ~ ., data = train, parms = list(split = "gini"),
               method = "class", cp = 0.019)
fancyRpartPlot(tree_2)
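The subtree above was obtained by refitting with cp = 0.019. An equivalent route (a sketch using the tree object from part (a)) is to inspect the complexity-parameter table of the full tree and prune it at the desired threshold, which shows how a cp value yielding seven terminal nodes can be located.

# Complexity-parameter table of the full tree: the 'nsplit' column shows how
# many splits (terminal nodes minus one) survive each cp threshold
printcp(tree)

# Prune the existing tree rather than refitting; equivalent to cp = 0.019 above
tree_2_alt = prune(tree, cp = 0.019)
fancyRpartPlot(tree_2_alt)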
(c) \(\mathbf{Solution.}\qquad\) Below we fit our random forest and compute the following misclassification errors.
# QUESTION 2c ---------------
# Fit a random forest with mtry = floor(sqrt(p)) predictors tried at each split
rf_tree = randomForest(deposit ~ ., data = train,
                       mtry = floor(sqrt(length(train) - 1)),
                       importance = TRUE)

# Get predictions on the test set
rf_test_pred = predict(rf_tree, newdata = test)
cf_mat = table(rf_test_pred, test$deposit)
## Percentage of clients in the test set that were misclassified
mean(rf_test_pred != test$deposit)
## [1] 0.143625
## Percentage of 'no' clients that were misclassified
cf_mat["yes", "no"] / sum(test$deposit == "no")
## [1] 0.1787966
## Percentage of 'yes' clients that were misclassified
cf_mat["no", "yes"] / sum(test$deposit == "yes")
## [1] 0.1053616
Comparing the test error from CART to the random forest, we see that the test error has dropped from 0.181 to 0.144, and the misclassification rates of both ‘no’ and ‘yes’ clients have also decreased. Next we determine the most important variables using the importance() function. We see that the five most important variables are the five variables with the highest MeanDecreaseAccuracy, which are duration, month, contact, day, and housing.
## no yes MeanDecreaseAccuracy MeanDecreaseGini
## duration 195.152237 255.346756 275.08075738 1334.477155
## month 100.996490 42.135797 115.02493666 498.805985
## contact 46.238153 20.765853 51.13855718 139.299654
## day 51.800002 10.224557 49.89891790 257.117275
## housing 34.977807 29.301953 43.28862753 104.558066
## poutcome 56.363534 7.931119 42.62086556 206.331849
## age 33.755245 15.953448 35.25637341 264.894416
## pdays 21.749930 21.171218 28.54565271 145.431647
## campaign 13.654154 16.772335 21.81525970 111.427321
## job 25.577985 3.915455 20.81001516 228.872166
## previous 16.590239 11.158511 17.75687164 85.619120
## balance 13.761276 10.981707 16.75805976 287.226490
## education 14.756626 2.938072 12.94192070 78.520358
## loan 4.895538 12.080747 12.75896002 29.217710
## marital 3.536912 9.765847 9.99378527 62.003280
## default -2.889794 2.138818 -0.03424955 4.211927
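The table above is consistent with sorting the output of importance(rf_tree) by MeanDecreaseAccuracy. A short sketch of how it can be reproduced and visualized with the rf_tree object fitted above (varImpPlot() is the built-in dot-chart in randomForest):

# Importance matrix sorted by MeanDecreaseAccuracy (largest first)
imp = importance(rf_tree)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]

# Dot-chart of both importance measures
varImpPlot(rf_tree)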
mtry is the number of variables randomly sampled as candidates at each split. As mtry increases, the test error will initially decrease. If we use all of the variables, the procedure is equivalent to a bagging classifier and we lose the property of the random forest that decorrelates the trees; thus, if we continue to increase mtry, the test error will begin to increase.

nodesize is the minimum number of observations in a terminal node. As nodesize increases, the individual trees become shallower; the test error may initially decrease because each tree overfits the training set less, but if nodesize becomes too large the trees underfit and the test error increases.

ntree is the number of trees to grow. As ntree increases, the test error will decrease, but eventually it will taper off and we will not see much further improvement.
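A minimal sketch of how the effect of mtry could be checked empirically, assuming the train/test objects from above (the grid of mtry values and ntree = 200 are arbitrary choices to keep the run time manageable):

# Test error as a function of mtry (ntree kept small only to speed up the loop)
mtry_grid = 1:(length(train) - 1)
mtry_err = sapply(mtry_grid, function(m) {
  fit = randomForest(deposit ~ ., data = train, mtry = m, ntree = 200)
  mean(predict(fit, newdata = test) != test$deposit)
})
plot(mtry_grid, mtry_err, type = "b", xlab = "mtry", ylab = "Test error")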
(d) \(\mathbf{Solution.}\qquad\) Below we use Boosting and compute the following misclassification errors.
# QUESTION 2d ---------------
ntrees = 3000

# gbm with the adaboost distribution expects a 0/1 response
train$deposit = ifelse(train$deposit == "yes", 1, 0)
test$deposit = ifelse(test$deposit == "yes", 1, 0)

# Fit Boosting Tree
ada_tree = gbm(deposit ~ ., data = train, distribution = "adaboost",
               n.trees = ntrees, shrinkage = 0.05, interaction.depth = 3)

## Get Confusion Matrix of our test predictions
ada_pred = predict(ada_tree, newdata = test, n.trees = ntrees, type = "response")
ada_pred = ifelse(ada_pred > 0.5, 1, 0)
cf_mat = table(ada_pred, test$deposit)
## Percentage of clients in the test set that were misclassified
mean(ada_pred != test$deposit)
## [1] 0.1388474
## Percentage of 'no' clients that were misclassified
cf_mat["1", "0"] / sum(test$deposit == 0)
## [1] 0.1512894
## Percentage of 'yes' clients that were misclassified
cf_mat["0", "1"] / sum(test$deposit == 1)
## [1] 0.1253117
Comparing the test error from the random forest to boosting, we see that the test error has dropped from 0.144 to 0.139. For the boosting method, the misclassification rate of ‘no’ clients dropped and the misclassification rate of ‘yes’ clients increased a bit compared to the random forest. Next we determine the most important variables using the summary() function. We see that the five most important variables are duration, month, job, balance, and poutcome.
## var rel.inf
## duration duration 29.78680445
## month month 22.43817412
## job job 8.90332839
## balance balance 7.65645646
## poutcome poutcome 6.11600790
## age age 5.51333977
## day day 5.29151237
## contact contact 3.94368105
## pdays pdays 3.54959206
## housing housing 2.12094579
## campaign campaign 1.36517409
## education education 1.24172230
## previous previous 0.94044915
## marital marital 0.82837517
## loan loan 0.29966043
## default default 0.00477652
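The relative-influence table above comes from the summary() method for gbm objects; a sketch of the call, using the ada_tree model fitted above (plotit = FALSE suppresses the accompanying bar chart):

# Relative influence of each predictor, averaged over all ntrees trees
summary(ada_tree, n.trees = ntrees, plotit = FALSE)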
interaction.depth specifies the maximum depth of each tree (the highest level of variable interactions allowed). As interaction.depth increases, the test error will decrease at first and then increase due to overfitting.

shrinkage is also known as the learning rate. As shrinkage increases, the test error will decrease at first and then increase, because a learning rate that is too high causes the model to overfit quickly.

n.trees specifies the total number of trees to fit. As n.trees increases, the test error will continue to decrease, but eventually it will taper off and we will not see much further improvement.
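A sketch of how the effect of n.trees can be traced without refitting, assuming the ada_tree model and the 0/1-coded test set from above: predict.gbm accepts a vector of n.trees values and returns one column of predictions per value, so the test error can be plotted along the boosting iterations (gbm.perf() offers a similar built-in diagnostic on the training data).

# Test error as a function of the number of boosting iterations
iters = seq(100, ntrees, by = 100)
preds = predict(ada_tree, newdata = test, n.trees = iters, type = "response")
err = colMeans((preds > 0.5) != test$deposit)   # one error rate per n.trees value
plot(iters, err, type = "l", xlab = "n.trees", ylab = "Test error")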