You may run the code below to reproduce the model results with a new random seed.
set.seed(2)
# Stratified split: hold out 10% of rows, balanced across the date fruit classes
trainIndex <- createDataPartition(date_fruit$Class, p = .9, list = FALSE)
date_train <- as.data.frame(date_fruit[trainIndex, ])
date_test <- as.data.frame(date_fruit[-trainIndex, ])
# Five-fold cross-validation on the training set
cv_spec <- trainControl(method = 'cv', number = 5)
date_rf_train <- train(
form = Class ~ .,
data = date_train,
method = 'rf',
preProcess = c('center', 'scale'),
tuneGrid = data.frame(mtry = 1:10),
metric = 'Accuracy',
trControl = cv_spec,
ntree = 500,
importance = TRUE
)
plot(date_rf_train)  # CV accuracy across the mtry grid
confusionMatrix(date_rf_train)  # CV confusion matrix (training set)
date_predict_test <- predict(date_rf_train, newdata = date_test)
confusionMatrix(date_predict_test, date_test$Class)  # held-out test set performance
varImp(date_rf_train, type = 1, scale = FALSE)  # permutation importance (mean decrease in accuracy)
The results are quantitatively slightly different. For example, the maximum accuracy is now achieved with mtry = 3, the accuracy values on the training and test sets differ slightly, a few of the confusion matrix entries change, and the exact variable importance values change.
The reason the results are different is that randomness comes into play at several points in the model fitting process, in particular:

- the random split of the data into training and test sets
- the random assignment of observations to the cross-validation folds
- the fitting of the random forest itself, which draws bootstrap samples of the rows and random subsets of the predictors (once for each mtry value we are testing)

However, it's important to note that the difference in overall model performance between the two random seeds is negligible, so we can be fairly confident that our assessment of how well the model does isn't very dependent on how the random number generator happened to be initialized.
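To see this directly, here is a minimal sketch (the object names are my own; it reuses date_train and cv_spec from above) that refits the forest under a few different seeds and reports the best cross-validation accuracy from each run. It holds the train/test split fixed, so it only exercises the fold-assignment and forest-fitting randomness, and refitting three forests takes a while.

# Sketch: best CV accuracy under three different seeds (not part of the original analysis)
sapply(1:3, function(s) {
  set.seed(s)
  fit <- train(
    form = Class ~ .,
    data = date_train,
    method = 'rf',
    tuneGrid = data.frame(mtry = 1:10),
    trControl = cv_spec,
    ntree = 500
  )
  max(fit$results$Accuracy)  # highest CV accuracy across the mtry grid
})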
Here is an example using a simpler model than random forest. It uses the rpart method to fit single classification trees to the data instead of growing a whole forest of them. The tuning parameter cp, the complexity parameter, governs how closely the tree fits the data. I used trial and error to determine a reasonable range for the tuning grid of that parameter.
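As a sketch of what that trial and error might look like (coarse_fit is a hypothetical name, not part of the analysis below), one could first scan a coarse logarithmic grid and then narrow the grid around the best value:

coarse_fit <- train(
  form = Class ~ .,
  data = date_train,
  method = 'rpart',
  tuneGrid = data.frame(cp = 10^seq(-5, -1)),  # 0.00001 to 0.1 by powers of ten
  trControl = trainControl(method = 'cv', number = 5)
)
coarse_fit$results  # accuracy at each cp; refine the grid around the peak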
cv_spec <- trainControl(method = 'cv', number = 5)
date_rpart_train <- train(
form = Class ~ .,
data = date_train,
method = 'rpart',
preProcess = c('center', 'scale'),
tuneGrid = data.frame(cp = seq(0.0001, 0.01, by = 0.0001)),
metric = 'Accuracy',
trControl = cv_spec
)
plot(date_rpart_train)
confusionMatrix(date_rpart_train)
date_rpart_predict_test <- predict(date_rpart_train, newdata = date_test)
confusionMatrix(date_rpart_predict_test, date_test$Class)
The model performance is quite a bit worse than the random forest and support vector machine, with only about 80% accuracy on the training and test sets. This highlights how a random forest typically outperforms the single decision trees it is built from.
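One thing a single tree does offer is interpretability. As a sketch (assuming the rpart.plot package is available, which is not used elsewhere in this document), you can draw the final fitted tree; the split points refer to the centered and scaled predictors because of the preProcess step.

library(rpart.plot)  # assumed installed
rpart.plot(date_rpart_train$finalModel)  # splits are on the centered/scaled predictors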
Run the following code to do ridge regression on the sugarcane yield dataset. The code is identical to the lasso code except that alpha = 0 instead of alpha = 1.
set.seed(4)
tsh_ridge_fit <- train(
x = sugarcane %>% select(all_of(sugarcane_predictor_variables)) %>% as.data.frame,
y = sugarcane$TSH,
method = 'glmnet',
preProcess = c('center', 'scale'),
tuneGrid = expand.grid(
alpha = 0,
lambda = c(0.00001, 0.0001, 0.001, 0.01, 0.1)
),
metric = 'RMSE',
trControl = cv_spec_sugarcane
)
tsh_ridge_fit
plot(tsh_ridge_fit)
tsh_ridge_fit$bestTune
# glmnet fits its own internal sequence of lambda values, so find the value in
# that sequence closest to (but not below) the cross-validated best lambda
lambda_use <- min(tsh_ridge_fit$finalModel$lambda[tsh_ridge_fit$finalModel$lambda >= tsh_ridge_fit$bestTune$lambda])
# Pull the coefficients at that position in the regularization path
position <- which(tsh_ridge_fit$finalModel$lambda == lambda_use)
best_coefs <- coef(tsh_ridge_fit$finalModel)[, position]
data.frame(coefficient = round(best_coefs, 3))
The results are not too different in terms of R-squared and RMSE. However, the shrinkage is more severe than with the lasso; the coefficient estimates are closer to zero. This may be desirable in some situations. In fact, it is possible to take a hybrid approach between ridge and lasso called elastic net regression, where you also tune the model to find an appropriate alpha value somewhere between 0 and 1.
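For instance, here is a minimal sketch of an elastic net fit (tsh_enet_fit is a hypothetical name; everything else reuses objects defined above) that swaps the fixed alpha for a grid of candidate values:

set.seed(6)
tsh_enet_fit <- train(
  x = sugarcane %>% select(all_of(sugarcane_predictor_variables)) %>% as.data.frame,
  y = sugarcane$TSH,
  method = 'glmnet',
  preProcess = c('center', 'scale'),
  tuneGrid = expand.grid(
    alpha = seq(0, 1, by = 0.25),  # now tuned rather than fixed at 0 or 1
    lambda = c(0.00001, 0.0001, 0.001, 0.01, 0.1)
  ),
  metric = 'RMSE',
  trControl = cv_spec_sugarcane
)
tsh_enet_fit$bestTune  # the (alpha, lambda) pair chosen by cross-validation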
Here I am trying a model called a Bayesian regularized neural network by specifying method = 'brnn'. The only tuning parameter is the number of neurons; I am testing each value from 1 to 10 using the argument tuneGrid = data.frame(neurons = 1:10). This will prompt you to install the brnn package if you don't have it installed already in your R package library.
By the way, I know nothing about Bayesian regularized neural networks and I picked it basically at random from the list of supported models. That is just to emphasize that it is easy, sometimes almost too easy, to fit a machine learning model to a dataset without knowing anything about what the model is actually doing inside the “black box.”
set.seed(5)
tsh_brnn_fit <- train(
x = sugarcane %>% select(all_of(sugarcane_predictor_variables)) %>% as.data.frame,
y = sugarcane$TSH,
method = 'brnn',
preProcess = c('center', 'scale'),
tuneGrid = data.frame(neurons = 1:10),
metric = 'RMSE',
trControl = cv_spec_sugarcane
)
tsh_brnn_fit
plot(tsh_brnn_fit)
The RMSE and R-squared values are basically the same as what we got for the lasso and ridge regression (RMSE is ~6 and R-squared is ~0.25), if not slightly worse. Thus in terms of performance there is no good reason to prefer this model over lasso or ridge. In fact, lasso and ridge provide much more interpretable parameter estimates, so in this case I would stick with either one of them.
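As a final check, caret's resamples() function can put the cross-validation results side by side. Here is a minimal sketch comparing the ridge and neural network fits from above; note that the folds were generated under different seeds, so treat it as an approximate comparison.

model_comp <- resamples(list(ridge = tsh_ridge_fit, brnn = tsh_brnn_fit))
summary(model_comp)  # distribution of RMSE and R-squared across the CV folds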