Exercise 1

You can run the code below to repeat the model fitting with a new random seed and compare the results.

set.seed(2)

# Stratified 90/10 train-test split on the Class variable
trainIndex <- createDataPartition(date_fruit$Class, p = .9, list = FALSE)
date_train <- as.data.frame(date_fruit[trainIndex, ])
date_test <- as.data.frame(date_fruit[-trainIndex, ])

# 5-fold cross-validation
cv_spec <- trainControl(method = 'cv', number = 5)

# Random forest with 500 trees, tuning mtry from 1 to 10
date_rf_train <- train(
  form = Class ~ .,
  data = date_train,
  method = 'rf',
  preProcess = c('center', 'scale'),
  tuneGrid = data.frame(mtry = 1:10),
  metric = 'Accuracy',
  trControl = cv_spec,
  ntree = 500,
  importance = TRUE
)

# Cross-validated accuracy across mtry values, and the CV confusion matrix
plot(date_rf_train)
confusionMatrix(date_rf_train)

# Evaluate on the held-out test set
date_predict_test <- predict(date_rf_train, newdata = date_test)

confusionMatrix(date_predict_test, date_test$Class)

# Permutation-based variable importance (type = 1), unscaled
varImp(date_rf_train, type = 1, scale = FALSE)

The results are slightly different quantitatively. For example, the maximum accuracy is now achieved with mtry = 3, the accuracy values on the training and test sets are slightly different, a few of the confusion matrix entries are different, and the exact variable importance values are different.

The results differ because randomness enters the model-fitting process in several places, in particular:

- the split of the data into training and test sets made by createDataPartition,
- the assignment of training observations to the five cross-validation folds, and
- the random forest algorithm itself, which bootstraps the rows of the training data and selects a random subset of predictors to consider at each split.
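
As a quick illustration of the first source of randomness, here is a minimal sketch showing that different seeds select different training rows (reusing the date_fruit data loaded earlier):

# Different seeds produce different stratified splits of the same data
set.seed(1)
head(createDataPartition(date_fruit$Class, p = .9, list = FALSE))

set.seed(2)
head(createDataPartition(date_fruit$Class, p = .9, list = FALSE))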

However, it’s important to note that the difference in overall model performance between the two random seeds is negligible, so we can be fairly confident that our assessment of how well the model performs doesn’t depend much on how the random number generator happened to be initialized.

Exercise 2

Here is an example using a simpler model than random forest. It uses the rpart method to fit a single classification tree to the data instead of growing a whole forest of them. The tuning parameter is cp, the complexity parameter, which governs how closely the tree fits the data: a split is kept only if it improves the fit by at least a factor of cp, so smaller values produce larger, more flexible trees. I used trial and error to determine a reasonable range for the tuning grid of that parameter.

cv_spec <- trainControl(method = 'cv', number = 5)

# Single classification tree, tuning the complexity parameter cp
date_rpart_train <- train(
  form = Class ~ .,
  data = date_train,
  method = 'rpart',
  preProcess = c('center', 'scale'),
  tuneGrid = data.frame(cp = seq(0.0001, 0.01, by = 0.0001)),
  metric = 'Accuracy',
  trControl = cv_spec
)

plot(date_rpart_train)
confusionMatrix(date_rpart_train)

date_rpart_predict_test <- predict(date_rpart_train, newdata = date_test)

confusionMatrix(date_rpart_predict_test, date_test$Class)
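
One advantage of fitting a single tree is that you can look at the tree directly. Here is a minimal sketch of how you could do that, assuming the rpart.plot package is installed (it is not used anywhere else in this document):

library(rpart.plot)

# Draw the final tree selected by cross-validation
rpart.plot(date_rpart_train$finalModel)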

The model performance is quite a bit worse than random forest and support vector machine, with only about 80% accuracy on the training and test sets. This highlights how an ensemble method like random forest typically outperforms a single decision tree.

Exercise 3

Run the following code to fit a ridge regression to the sugarcane yield dataset. The code is identical to the lasso code except that alpha = 0 instead of alpha = 1.

set.seed(4)

# Ridge regression: alpha = 0 gives a pure L2 penalty
tsh_ridge_fit <- train(
  x = sugarcane %>% select(all_of(sugarcane_predictor_variables)) %>% as.data.frame,
  y = sugarcane$TSH,
  method = 'glmnet',
  preProcess = c('center', 'scale'),
  tuneGrid = expand.grid(
    alpha = 0,
    lambda = c(0.00001, 0.0001, 0.001, 0.01, 0.1)
  ),
  metric = 'RMSE',
  trControl = cv_spec_sugarcane
)

tsh_ridge_fit
plot(tsh_ridge_fit)

tsh_ridge_fit$bestTune

# glmnet fits a whole path of lambda values that doesn't necessarily
# include the cross-validated best lambda exactly, so take the smallest
# lambda on the path that is at least as large as the best one
lambda_use <- min(tsh_ridge_fit$finalModel$lambda[tsh_ridge_fit$finalModel$lambda >= tsh_ridge_fit$bestTune$lambda])
position <- which(tsh_ridge_fit$finalModel$lambda == lambda_use)

# Extract the coefficient estimates at that lambda
best_coefs <- coef(tsh_ridge_fit$finalModel)[, position]

data.frame(coefficient = round(best_coefs, 3))

The results are not too different in terms of R-squared and RMSE. However, the shrinkage is more severe than with the lasso; the coefficient estimates are closer to zero. This may be desirable in some situations. In fact, it is possible to take a hybrid approach between ridge and lasso called elastic net regression, where you also tune the model to find an appropriate alpha value somewhere between 0 and 1, as sketched below.
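
Here is a minimal sketch of what that elastic net tuning could look like with caret's glmnet method. The object name tsh_enet_fit and the seed are my own arbitrary choices; everything else reuses objects defined earlier.

set.seed(6)

tsh_enet_fit <- train(
  x = sugarcane %>% select(all_of(sugarcane_predictor_variables)) %>% as.data.frame,
  y = sugarcane$TSH,
  method = 'glmnet',
  preProcess = c('center', 'scale'),
  # Tune the ridge-lasso mixing parameter alpha along with lambda
  tuneGrid = expand.grid(
    alpha = seq(0, 1, by = 0.25),
    lambda = c(0.00001, 0.0001, 0.001, 0.01, 0.1)
  ),
  metric = 'RMSE',
  trControl = cv_spec_sugarcane
)

tsh_enet_fit$bestTune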

Exercise 4

Here I am trying a model called a Bayesian regularized neural network by specifying method = 'brnn'. The only tuning parameter is the number of neurons; I am testing each value from 1 to 10 using the argument tuneGrid = data.frame(neurons = 1:10). R will prompt you to install the brnn package if you don't already have it in your package library.

By the way, I know nothing about Bayesian regularized neural networks and I picked it basically at random from the list of supported models. That is just to emphasize that it is easy, sometimes almost too easy, to fit a machine learning model to a dataset without knowing anything about what the model is actually doing inside the “black box.”

set.seed(5)

tsh_brnn_fit <- train(
  x = sugarcane %>% select(all_of(sugarcane_predictor_variables)) %>% as.data.frame,
  y = sugarcane$TSH,
  method = 'brnn',
  preProcess = c('center', 'scale'),
  tuneGrid = data.frame(neurons = 1:10),
  metric = 'RMSE',
  trControl = cv_spec_sugarcane
)

tsh_brnn_fit
plot(tsh_brnn_fit)

The RMSE and R-squared values are basically the same as what we got for the lasso and ridge regressions (RMSE is ~6 and R-squared is ~0.25), if not slightly worse. Thus, in terms of performance there is no good reason to prefer this model over lasso or ridge. In fact, lasso and ridge provide much more interpretable parameter estimates, so in this case I would stick with either one of them.
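
If you want a more formal side-by-side comparison, caret's resamples() function collects the cross-validation metrics from several fitted models trained with the same resampling specification. Here is a minimal sketch, assuming the earlier lasso fit is stored in an object called tsh_lasso_fit (the name may differ in your code) and that all three models used cv_spec_sugarcane:

# Compare cross-validated performance across models
model_comparison <- resamples(list(
  lasso = tsh_lasso_fit,
  ridge = tsh_ridge_fit,
  brnn = tsh_brnn_fit
))

summary(model_comparison)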