...
- At the end of this course, you will understand …
- At the end of this course, you will be able to …

The date fruit dataset:

- The `Class` column indicates variety
- We convert `Class` to a factor variable

Boxplots of each predictor variable, grouped by `Class`:
```r
# All columns except Class are predictors
predictor_variables <- names(date_fruit)[!names(date_fruit) %in% 'Class']

# Reshape to long form: one row per observation x predictor combination
date_long <- pivot_longer(date_fruit, cols = all_of(predictor_variables), names_to = 'variable')

# Boxplots of each predictor by Class, one panel per predictor
ggplot(date_long, aes(x = Class, y = value)) +
  geom_boxplot(aes(fill = Class)) +
  facet_wrap(~ variable, scales = 'free_y') +
  theme_bw() +
  theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1))
```
- `findCorrelation()` in the caret package identifies any variables correlated with at least one other variable above a certain threshold
- `createDataPartition()` uses stratified random sampling to balance the training and test sets
- `train()` is the main workhorse function
- `trainControl()` creates the specification for how `train()` carries out cross-validation (see the sketch below)
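As a rough sketch of how these functions fit together on the date fruit data (the 0.9 cutoff, the 80/20 split, the fold count, and all object names here are illustrative assumptions, not taken from the lesson):

```r
library(caret)

# Identify predictors correlated above 0.9 with at least one other predictor
cor_matrix <- cor(date_fruit[, predictor_variables])
too_correlated <- findCorrelation(cor_matrix, cutoff = 0.9)

# Stratified 80/20 split, balanced on Class
set.seed(1)
train_rows <- as.integer(createDataPartition(date_fruit$Class, p = 0.8, list = FALSE))
date_train <- date_fruit[train_rows, ]
date_test <- date_fruit[-train_rows, ]

# Specification for 5-fold cross-validation, passed to train() later
cv_spec <- trainControl(method = 'cv', number = 5)
```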
Arguments to `train()`:

- `form`: the model formula; `Class ~ .` means use all the other variables to predict `Class`
- `data`: the data frame containing the variables
- `method = 'rf'`: fit a random forest model, using the R package randomForest in the back end
- `preProcess = c('center', 'scale')`: z-transform all variables
Arguments to `train()`, continued:

- `tuneGrid`: a data frame with all combinations of tuning parameters
- Here the tuning parameter is `mtry`, the number of variables randomly selected each time a new split point is determined; a higher `mtry` means a closer fit to the data, at the risk of overfitting
- `metric = 'Accuracy'`: the metric of model performance (i.e. the loss function) is 'Accuracy', the overall proportion of correct classifications
- `trControl`: pass the model training specification we created using `trainControl()`
- `ntree` is an argument to the `randomForest()` function: the number of trees generated for each random forest model
- `importance = TRUE`: variable importance values will be calculated for us to look at later (a sketch of the full call follows)
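Putting the arguments above together, a sketch of what the full call could look like (the `mtry` grid and `ntree` value are illustrative; `date_train` and `cv_spec` continue the assumptions from the earlier sketch):

```r
set.seed(2)
class_rf_fit <- train(
  form = Class ~ .,
  data = date_train,
  method = 'rf',
  preProcess = c('center', 'scale'),
  tuneGrid = expand.grid(mtry = c(2, 4, 8, 16)),
  metric = 'Accuracy',
  trControl = cv_spec,
  ntree = 500,         # passed through to randomForest()
  importance = TRUE    # store variable importance for varImp() later
)
```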
What happens when we run `train()`:

- A model is fit at each value of `mtry` we provide
- Cross-validation identifies the value of `mtry` that maximizes prediction accuracy of the hold-out folds
- A final model is fit with that `mtry` on the full training set!
- `plot()` shows the accuracy for each value of `mtry`. What do you think?
- `confusionMatrix()` returns one with the model's predicted classes down the rows and the true classes across the columns
- The `predict()` function has a method for caret models; the `newdata` argument must have a column for all predictor variables that are in the model
- `varImp()` calculates variable importance scores for many different types of models; `type = 1` indicates to give overall scores, not separately by class, and `scale = FALSE` gives us the raw importance values, otherwise they are scaled to a maximum of 100 (these steps are sketched below)
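For example, evaluating the fitted model might look like this (continuing the assumed objects from the sketches above):

```r
plot(class_rf_fit)   # cross-validated accuracy for each mtry

# Predict classes for the held-out test set and cross-tabulate against the truth
test_predictions <- predict(class_rf_fit, newdata = date_test)
confusionMatrix(test_predictions, date_test$Class)

# Raw overall importance scores (not scaled to a 0-100 range)
varImp(class_rf_fit, type = 1, scale = FALSE)
```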
- `train()` can be used to fit other ML models with very similar arguments; we only change the `method` and `tuneGrid` arguments
- `C` is the name of the parameter that determines the balance between overfitting and underfitting
- Lower `C` values do not fit the data quite as closely, protecting against overfitting

Use `train()` to cross-validate the SVM model:
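A sketch with a linear SVM (`method = 'svmLinear'` and the `C` grid are assumptions; the lesson's actual kernel and grid may differ):

```r
set.seed(4)
class_svm_fit <- train(
  form = Class ~ .,
  data = date_train,
  method = 'svmLinear',    # kernlab's ksvm() in the back end
  preProcess = c('center', 'scale'),
  tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2, 4)),
  metric = 'Accuracy',
  trControl = cv_spec
)
```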
A new dataset, this time for regression:

- The response variable is `TSH`; this is regression, not classification this time
- The `Rep` column indicates which experimental block each row of data comes from

```r
# Read the sugarcane yield data
sugarcane <- read_csv('https://usda-ree-ars.github.io/SEAStats/machine_learning_demystified/datasets/sugarcaneyield.csv')

# All columns other than the response (TSH) and the block ID (Rep) are predictors
sugarcane_predictor_variables <- setdiff(names(sugarcane), c('TSH', 'Rep'))

# Scatterplot of TSH against each predictor, with a GAM smoother
# (wrap_plots() is from the patchwork package)
scatter_plots <- map(sugarcane_predictor_variables, ~ ggplot(sugarcane, aes(x = !!sym(.), y = TSH)) +
  geom_point() +
  geom_smooth(method = 'gam') +
  theme_bw())

wrap_plots(scatter_plots, ncol = 4)
```
- The `map()` function creates a list of vectors of the row numbers that belong to each block (see the sketch below)
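A sketch of how such a specification could be built (the construction here is an assumption; `cv_spec_sugarcane` is the name used in the fitting code below):

```r
# Row numbers belonging to each experimental block become the hold-out sets,
# so each resample leaves one whole block out
block_rows <- map(unique(sugarcane$Rep), ~ which(sugarcane$Rep == .x))

cv_spec_sugarcane <- trainControl(
  method = 'cv',
  index = map(block_rows, ~ setdiff(seq_len(nrow(sugarcane)), .x)),  # training rows
  indexOut = block_rows                                              # held-out rows
)
```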
- `train()` is used as before, but instead of the `form` argument, we supply `x` (a data frame of predictors) and `y` (a vector of responses)
- `method = 'glmnet'`: fit the lasso using the glmnet package in the back end
- `tuneGrid` includes a range of `lambda` values increasing by factors of ten, and sets `alpha = 1`
- `metric = 'RMSE'`, or root mean squared error; lower is better
- The cross-validation specification is passed to the `trControl` argument

Use `train()` to cross-validate the lasso model:
```r
set.seed(3)
tsh_lasso_fit <- train(
  x = sugarcane %>% select(all_of(sugarcane_predictor_variables)) %>% as.data.frame,
  y = sugarcane$TSH,
  method = 'glmnet',
  preProcess = c('center', 'scale'),
  tuneGrid = expand.grid(
    alpha = 1,                                     # alpha = 1 is the lasso penalty
    lambda = c(0.00001, 0.0001, 0.001, 0.01, 0.1)  # candidate penalty strengths
  ),
  metric = 'RMSE',
  trControl = cv_spec_sugarcane
)
```
- The `bestTune` element of the fitted model object shows the optimal tuning parameters
- `tsh_lasso_fit` contains an element called `finalModel`: the final model fit to the entire training set
- `coef(tsh_lasso_fit$finalModel)` returns a large matrix of coefficients
- The code below finds the `lambda` value in the final model that is closest to the optimal `lambda`, then extracts that column index from the matrix

```r
# Smallest lambda on the fitted path that is >= the cross-validated optimum
lambda_use <- min(tsh_lasso_fit$finalModel$lambda[tsh_lasso_fit$finalModel$lambda >= tsh_lasso_fit$bestTune$lambda])

# Column of the coefficient matrix corresponding to that lambda
position <- which(tsh_lasso_fit$finalModel$lambda == lambda_use)
best_coefs <- coef(tsh_lasso_fit$finalModel)[, position]

# Unpenalized OLS coefficients, for comparison with the shrunken lasso values
lm_coefs <- lm(TSH ~ ., data = sugarcane %>% select(all_of(c('TSH', sugarcane_predictor_variables))))$coefficients

data.frame(lasso_coefficient = round(best_coefs, 3),
           unshrunk_coefficient = round(lm_coefs, 3))
```
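As with the classification model, caret's `predict()` method also works on the fitted lasso; a minimal sketch (the five-row slice is arbitrary):

```r
# predict() applies the stored preprocessing and the optimal lambda automatically
predict(tsh_lasso_fit, newdata = sugarcane[1:5, sugarcane_predictor_variables])
```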