At the end of this course, you will understand …
At the end of this course, you will be able to …
Image credit Shutterstock
original by sandserif comics
Image credit Silicon Angle
Regression and classification
Supervised and unsupervised classification
Spurious correlation?
Regularization balances between underfitting and overfitting
Image credit Cold Spring Harbor Lab
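A hedged sketch of that balance on simulated data (not a course dataset), using the ridge penalty from the recommended MASS package rather than caret: the penalty shrinks coefficients, trading a little bias for protection against overfitting. The basis size and lambda value are illustrative.

```r
library(MASS)

set.seed(42)
x <- seq(0, 1, length.out = 30)
y <- sin(2 * pi * x) + rnorm(30, sd = 0.3)
X <- poly(x, degree = 8)                  # flexible basis that can overfit

fit_ols   <- lm(y ~ X)                    # no penalty
fit_ridge <- lm.ridge(y ~ X, lambda = 5)  # penalized fit

sum(abs(coef(fit_ols)[-1]))    # total coefficient magnitude, unpenalized
sum(abs(coef(fit_ridge)[-1]))  # smaller: the penalty shrinks the fit
```

Larger lambda shrinks the coefficients more (risking underfitting); lambda = 0 reproduces the unpenalized fit.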
Training and testing split
Image credit Math is in the Air
Image credit Britannica
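A minimal base-R sketch of the training/testing split idea; the 80/20 ratio and the number of rows are illustrative (the course later uses caret's createDataPartition() for a stratified version).

```r
# Hold out a random subset of rows for testing; train only on the rest.
set.seed(1)
n <- 150                                       # illustrative number of rows
train_rows <- sample(n, size = 0.8 * n)        # 80% of rows for training
test_rows  <- setdiff(seq_len(n), train_rows)  # remaining 20% for testing
c(training = length(train_rows), testing = length(test_rows))  # 120 and 30
```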
Class column indicates variety
Convert Class to a factor variable
predictor_variables <- names(date_fruit)[!names(date_fruit) %in% 'Class']
date_long <- pivot_longer(date_fruit, cols = all_of(predictor_variables), names_to = 'variable')
ggplot(date_long, aes(x = Class, y = value)) +
geom_boxplot(aes(fill = Class)) +
facet_wrap(~ variable, scales = 'free_y') +
theme_bw() +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1))
findCorrelation() in the caret package identifies any variables correlated with at least one other variable above a certain threshold
createDataPartition() uses stratified random sampling to balance training and test sets
Decision tree
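A hedged base-R sketch of roughly what those two caret helpers accomplish, using the built-in iris data as a stand-in for date_fruit; the 0.9 cutoff and 80/20 split are illustrative, and caret's versions are more careful (e.g. about which member of a correlated pair to drop).

```r
# Roughly what findCorrelation() does: flag variables whose correlation
# with some other variable exceeds a cutoff.
cor_matrix <- cor(iris[, 1:4])
high_cor <- which(apply(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), 2, any))
names(high_cor)  # candidate variables to drop before modeling

# Roughly what createDataPartition() does: a stratified split that samples
# row numbers within each class so the strata stay balanced.
set.seed(1)
train_rows <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                            function(rows) sample(rows, size = 0.8 * length(rows))))
table(iris$Species[train_rows])  # 40 of each species: strata stay balanced
```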
train() is the main workhorse function
trainControl() specifies model training options that are passed to train()
Arguments to train():
form: the model formula
Class ~ . means to use all other variables to predict Class
data: data frame containing the variables
method = 'rf': fit a random forest model using the R package randomForest in the back end
preProcess = c('center', 'scale'): z-transform all variables
train(), continued
tuneGrid: data frame with all combinations of tuning parameters
mtry: the number of variables randomly selected each time a new split point is determined
Higher mtry means a closer fit to the data, at the risk of overfitting
metric = 'Accuracy': the metric of model performance (i.e. the loss function) is 'Accuracy', the overall proportion of correct classifications
trControl: pass the model training specification we created using trainControl()
ntree is an argument to the randomForest() function: the number of trees generated for each random forest model
importance = TRUE: variable importance values will be calculated for us to look at later
What happens inside train():
A model is fit for each value of mtry we provide
Cross-validation identifies the mtry that maximizes prediction accuracy on the hold-out folds
The final model is refit with that mtry on the full training set!
plot() shows the accuracy for each value of mtry. What do you think?
confusionMatrix() returns one with the model's predicted classes down the rows and true classes across the columns
The predict() function has a method for caret models
Its newdata argument must have a column for all predictor variables that are in the model
varImp() calculates variable importance scores for many different types of models
type = 1 indicates to give overall scores, not separately by class
scale = FALSE gives us the raw importance values; otherwise they are scaled to a maximum of 100
train() can be used to fit other ML models with very similar arguments; only the method and tuneGrid arguments change
C is the name of the tuning parameter that determines the balance between overfitting and underfitting
Lower C values do not fit the data quite as closely, protecting against overfitting
Use train() to cross-validate the SVM model
Image credit Hannah Penn
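Returning to the classification metrics above: a base-R sketch of what confusionMatrix() tabulates, with made-up two-class predictions for illustration (caret's version adds accuracy, kappa, and per-class statistics on top).

```r
# Cross-tabulate predicted vs. true classes: predicted down the rows,
# true classes across the columns, matching confusionMatrix()'s layout.
predicted <- factor(c('A', 'A', 'B', 'B', 'B'))
observed  <- factor(c('A', 'B', 'B', 'B', 'A'))
table(Predicted = predicted, True = observed)
mean(predicted == observed)  # overall accuracy: 3 of 5 correct, 0.6
```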
Response variable is TSH; regression, not classification, this time
Rep column indicates which experimental block each row of data comes from
sugarcane <- read_csv('https://usda-ree-ars.github.io/SEAStats/machine_learning_demystified/datasets/sugarcaneyield.csv')
sugarcane_predictor_variables <- setdiff(names(sugarcane), c('TSH', 'Rep'))
scatter_plots <- map(sugarcane_predictor_variables, ~ ggplot(sugarcane, aes(x = !!sym(.), y = TSH)) +
geom_point() +
geom_smooth(method = 'gam') +
theme_bw())
wrap_plots(scatter_plots, ncol = 4)
The map() function creates a list of vectors of the row numbers that belong to each block
train() is used as before, but instead of the form argument, supply x (data frame) and y (vector)
method = 'glmnet'
tuneGrid includes a range of lambda values increasing by factors of ten, and sets alpha = 1
metric = 'RMSE', or root mean squared error; lower is better
Pass the model training specification to the trControl argument
Use train() to cross-validate the lasso model
set.seed(3)
tsh_lasso_fit <- train(
x = sugarcane %>% select(all_of(sugarcane_predictor_variables)) %>% as.data.frame,
y = sugarcane$TSH,
method = 'glmnet',
preProcess = c('center', 'scale'),
tuneGrid = expand.grid(
alpha = 1,
lambda = c(0.00001, 0.0001, 0.001, 0.01, 0.1)
),
metric = 'RMSE',
trControl = cv_spec_sugarcane
)
The bestTune element of the fitted model object shows the optimal tuning parameters
tsh_lasso_fit contains an element called finalModel, the final model fit to the entire training set
coef(tsh_lasso_fit$finalModel) returns a large matrix of coefficients
This code finds the lambda value in the final model that is closest to the optimal lambda, then extracts that column index from the matrix
lambda_use <- min(tsh_lasso_fit$finalModel$lambda[tsh_lasso_fit$finalModel$lambda >= tsh_lasso_fit$bestTune$lambda])
position <- which(tsh_lasso_fit$finalModel$lambda == lambda_use)
best_coefs <- coef(tsh_lasso_fit$finalModel)[, position]
lm_coefs <- lm(TSH ~ ., data = data.frame(sugarcane[, 'TSH'], scale(sugarcane[, sugarcane_predictor_variables])))$coefficients
data.frame(lasso_coefficient = round(best_coefs, 3),
unshrunk_coefficient = round(lm_coefs, 3))
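Why some lasso coefficients in that comparison can be exactly zero while the unshrunk lm coefficients never are: with an orthonormal predictor matrix, the lasso solution is the soft-thresholded least-squares estimate. A self-contained sketch (the function name is ours for illustration, not part of glmnet):

```r
# Soft thresholding: shrink each coefficient toward zero by lambda,
# and set it exactly to zero if its magnitude is below lambda.
soft_threshold <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)
soft_threshold(c(-2.0, 0.05, 1.3), lambda = 0.1)
# approximately c(-1.9, 0, 1.2): the small coefficient is zeroed,
# the large ones are shrunk toward zero
```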