Machine Learning Demystified

What is this course?

  • Cut through mystery and hype surrounding machine learning (ML)
  • Introduce you to the basic principles behind machine learning
  • “Bust” some of the common myths about machine learning
  • Walk through two different machine learning tasks in R

Minimal prerequisites

  • Some experience with statistical analysis
  • A little bit of exposure to R

Advanced prerequisites

  • Ability to write model formulas in R
  • Familiarity with tidyverse for R data manipulation and plotting

How to follow the course

  • Slides and text versions of the lessons are online
  • Fill in code in the worksheet (replace ... with code)
  • You can always copy and paste code from the text version of the lesson if you fall behind

Conceptual learning objectives

At the end of this course, you will understand …

  • What machine learning is and how it is similar to and different from statistics
  • Why the common myths about machine learning are not true
  • How prediction and inference differ as goals
  • What training and testing data are
  • What a loss function is
  • What supervised and unsupervised learning are
  • What the basic steps of a machine learning workflow are

Practical skills

At the end of this course, you will be able to …

  • Explore and pre-process data for machine learning models
  • Fit a random forest model in R for a classification task
  • Fit a lasso regression in R for a regression task
  • Cross-validate ML models
  • Use the R package caret to fit many kinds of ML models using the same code syntax

What is machine learning?

Image credit Shutterstock

Machine learning myths: Busted!

  • Myth 1. Machine learning is way more powerful than statistics
  • Myth 2. Machine learning is too fancy/new/different for me to learn
  • Myth 3. My data aren’t big enough to do machine learning
  • Myth 4. I don’t need machine learning

Myth 1. Machine learning is way more powerful than statistics — BUSTED!

  • Machine learning is the (over)hyped hot new thing
  • Statistics has been around a long time and its limitations are well known
  • But ML models are essentially statistical models, and all models are only as good as the assumptions they make and the data you put into them

original by sandserif comics

Myth 2. Machine learning is too fancy/new/different for me to learn — BUSTED!

  • Yes, some ML models are very complex and require a lot of theoretical background to understand
  • But a lot of software tools exist to make it easy to fit ML models … almost too easy in fact
  • Having a modest level of understanding of the models you work with is a good thing

Myth 3. My data aren’t big enough to do machine learning — BUSTED!

  • Big Data: an even more overhyped buzzword than machine learning?
  • Yes, some ARS researchers have truly huge datasets that need ML techniques
  • But ML can be a useful tool even with relatively small “hand-collected” datasets

Image credit Silicon Angle

Myth 4. I don’t need machine learning — BUSTED!

  • ML is not just hot air and hype
  • ML techniques are useful across all scientific fields
  • The focus on predictive modeling is changing the way all statistics and all research are done
  • All scientists should have basic ML literacy

But what is machine learning, anyway?

  • Machine learning: any job you give a machine (computer) to do that it gets better at doing the more data it gets
  • ML models, including statistical models, are machines that take data as input and spit out predictions as output
  • Ideally, the more data you have, the better your model will be at its job of making predictions

What jobs do machine learning models do?

Regression and classification

  • Classification: predicting which discrete category something belongs to
  • Regression: predicting a continuous outcome variable

Supervised versus unsupervised classification

  • Supervised: you know which category each of your data points belongs to, and try to find patterns in the data that predict which category new data points belong to
  • Unsupervised: you are trying to find natural groups or clusters in your data without knowing beforehand which group each data point belongs to
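
The contrast can be sketched in base R. This is a minimal illustration on simulated data, not the course's own example: the two-group dataset, the variable names, and the choice of logistic regression (`glm`) and k-means (`kmeans`) are all assumptions made for the sketch.

```r
set.seed(11)
n <- 100

# Hypothetical data: two overlapping groups measured on two variables
dat <- data.frame(
  x1 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
  x2 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
  group = rep(c(0, 1), each = n)
)

# Supervised: the known group labels are used to fit the model
fit <- glm(group ~ x1 + x2, data = dat, family = binomial)
pred_group <- as.numeric(predict(fit, type = "response") > 0.5)
# Accuracy on the same data used for fitting, so it is optimistic
supervised_accuracy <- mean(pred_group == dat$group)

# Unsupervised: the labels are never shown to the algorithm
clusters <- kmeans(dat[, c("x1", "x2")], centers = 2, nstart = 10)

# Cluster numbers are arbitrary, so check agreement under both labelings
agreement <- mean((clusters$cluster - 1) == dat$group)
cluster_agreement <- max(agreement, 1 - agreement)
```

The labels appear in the `glm` formula but only in the after-the-fact check of the `kmeans` result, which is the essential difference between the two approaches.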

Supervised versus unsupervised learning: examples

  • Supervised: given a dataset of some patients with heart disease and some without, try to predict heart disease from clinical measurements on the patients
  • Unsupervised: Google News grouping together news articles that are about the same news event
  • Supervised learning is great but requires labeled, higher-quality data, so large-scale projects may need an unsupervised approach

What kind of machine learning is going on at ARS?

Image credit 2016 CIAT/Neil Palmer

  • Genomic selection for plant and livestock breeding
  • Identifying weeds from drone imagery
  • Exploring the effect of diet on gut bacterial microbiome composition

What’s the big deal with prediction?

  • Inference: finding patterns in data and using them to understand processes going on in the world
  • Prediction: finding patterns in data and correctly reproducing them with a model
  • For prediction, how is more important than why

The importance of causal associations

Spurious correlation?

  • An ML model that incorporates true causal relationships between variables, and not just chance associations, is more robust.
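  • Take shark attacks and ice cream sales as an example: both rise in hot weather, so they are correlated even though neither causes the other
  • A model predicting one from the other might be predictive as long as the confounding variable (here, temperature) affects both, but if that changes, it will fail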
  • Better to “use your noggin” and try to model true causal relationships
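
A small simulation can make the point concrete. The sketch below assumes a made-up data-generating process in which temperature drives both ice cream sales and shark attacks; every number and variable name is hypothetical.

```r
set.seed(1)
n <- 200

# Hypothetical confounder: temperature drives both variables
temperature <- runif(n, 10, 35)
ice_cream   <- 20 + 3 * temperature + rnorm(n, sd = 5)
sharks      <- 0.2 * temperature + rnorm(n, sd = 1)

# Ice cream sales "predict" shark attacks only through temperature
fit <- lm(sharks ~ ice_cream)

# New context (assumed): beaches are closed, so shark attacks no longer
# depend on temperature, while ice cream sales still do
temperature_new <- runif(n, 10, 35)
ice_cream_new   <- 20 + 3 * temperature_new + rnorm(n, sd = 5)
sharks_new      <- rnorm(n, mean = 0, sd = 1)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_old <- rmse(sharks, predict(fit))
rmse_new <- rmse(sharks_new,
                 predict(fit, newdata = data.frame(ice_cream = ice_cream_new)))
```

In this setup `rmse_new` comes out much larger than `rmse_old`: the association the model learned was never causal, so it stops working once the confounder's influence changes.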

What’s a loss function?

  • Loss function: A function that usually represents some form of prediction error; when you fit an ML model you want to minimize it
  • In a linear regression it’s the sum of squared residuals (the vertical distances from the data points to the regression line)
  • ML models sometimes use mean squared error (MSE) as the loss function; minimizing it is equivalent to minimizing the sum of squared residuals in traditional linear regression
  • Sometimes they use mean absolute error (MAE), the mean of the absolute values of the distances from the data points to the fitted line, which is less sensitive to outliers
  • Classification tasks use different loss functions
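
As a minimal worked example, the two regression loss functions above can be computed by hand in base R. The observed and predicted values are invented for illustration; the last observation is a deliberate outlier.

```r
# Hypothetical observed values and model predictions
observed  <- c(2.0, 3.5, 5.0, 7.5, 30.0)  # last value is an outlier
predicted <- c(2.5, 3.0, 5.5, 7.0, 9.0)

errors <- observed - predicted

mse <- mean(errors^2)     # mean squared error: 88.4
mae <- mean(abs(errors))  # mean absolute error: 4.6

# The same losses with the outlier left out
mse_no_outlier <- mean(errors[-5]^2)     # 0.25
mae_no_outlier <- mean(abs(errors[-5]))  # 0.5
```

The single outlier multiplies the MSE by more than 350 but the MAE by only about 9, which is what "less sensitive to outliers" means in practice.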

What is regularization?

Regularization balances between underfitting and overfitting

  • Overfitting: Model fits too closely to data and its predictions reproduce random noise
  • Underfitting: Model doesn’t fit data closely enough and its predictions miss true patterns
  • Regularization: Anything that makes a model less complex and more general
    • Includes things like variable selection, model selection, random effects

Balance overfitting and underfitting

  • Choose ideal amount of regularization to balance between overfitting and underfitting
  • Fit the training data a little less closely to get better performance making predictions on new data
  • Many ML models have “hyperparameters,” often the weight of a penalty term added to the loss function, that set the level of regularization
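
To show how a hyperparameter sets the level of regularization, here is a sketch using ridge regression, chosen only because it has a closed-form solution that fits in a few lines of base R. It is not the lasso used later in the course, and the simulated data and the helper `ridge_coef` are hypothetical.

```r
set.seed(42)
n <- 50
p <- 10

# Hypothetical data: 10 standardized features, only the first one matters
X <- scale(matrix(rnorm(n * p), n, p))
y <- 2 * X[, 1] + rnorm(n)
y <- y - mean(y)  # center the outcome so no intercept is needed

# Ridge estimate: minimizes sum of squared residuals + lambda * sum(beta^2)
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# lambda is the hyperparameter; lambda = 0 is ordinary least squares
lambdas <- c(0, 1, 10, 100)
coef_size <- sapply(lambdas, function(l) sqrt(sum(ridge_coef(X, y, l)^2)))
```

`coef_size` shrinks as `lambda` grows: a larger penalty gives a simpler model that follows the training data less closely.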

How to fit a machine learning model

Image credit Cold Spring Harbor Lab

Training and test sets

  • Training set: data used to fit the model
  • Test set: data used to evaluate the model
  • Ideally they are as independent from one another as possible
  • Data leakage: Occurs when the training and test sets are not independent; random noise in the training set is correlated with random noise in the test set
  • There is a limit to how different the training and test set should be
  • Transferability: The ability of a model to produce accurate predictions in new contexts
  • Get data to test your model that is independent from your training data but still within the domain you want to make predictions in

Training and testing split

Internal validation

  • If no independent test set is available, you have to set aside a portion of your data as the test set
  • The model never “sees” the test data when estimating parameters; the test data are used only to evaluate its predictions
  • Even though it “wastes” data, it is good to reserve 10% to 20% of your data to test the model
  • Select test sample to be representative and to minimize data leakage
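
A simple random split can be sketched in base R as follows. The data frame `dat` is simulated, and the 20% test fraction is one value from the range suggested above. A purely random split assumes the rows are independent; it would not be appropriate for blocked or otherwise grouped data.

```r
set.seed(123)

# Hypothetical dataset with 100 rows
dat <- data.frame(x = rnorm(100), y = rnorm(100))
n <- nrow(dat)

# Randomly choose 20% of the row indices for the test set
test_idx  <- sample(n, size = round(0.2 * n))
test_set  <- dat[test_idx, ]
train_set <- dat[-test_idx, ]
```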

Data visualization and pre-processing

  • All data analysis and modeling should start with looking at the data!
  • Erroneous or invalid data points should be removed
  • Don’t necessarily remove “outliers” because you won’t be able to do that later when making predictions with the model

Removing useless features

  • Feature: a term used in the ML literature for predictor or x variable
  • Before fitting a model, remove features that carry little usable information
    • Features with no variance or very little variance (for example, a column with 998 zeros and 2 ones)
    • Features that are highly correlated with other features, and so are largely redundant
  • Those features contribute little or nothing to prediction performance and just slow down computation
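
Both kinds of filtering can be sketched in base R. The feature names and the two cutoffs (0.95 for the share of the most common value, 0.9 for absolute correlation) are arbitrary assumptions made for this illustration. The caret package provides `nearZeroVar()` and `findCorrelation()` for the same purpose.

```r
set.seed(1)
n <- 1000

# Hypothetical feature table
features <- data.frame(
  a = rnorm(n),
  near_constant = c(rep(0, 998), 1, 1),  # 998 zeros and 2 ones
  b = rnorm(n)
)
features$a_copy <- features$a + rnorm(n, sd = 0.01)  # nearly duplicates a

# 1. Flag features dominated by a single value
most_common_share <- sapply(features, function(x) max(table(x)) / length(x))
low_variance <- names(most_common_share)[most_common_share > 0.95]

# 2. Among the rest, flag features highly correlated with an earlier feature
keep <- setdiff(names(features), low_variance)
cors <- abs(cor(features[keep]))
cors[lower.tri(cors, diag = TRUE)] <- 0  # compare each pair only once
redundant <- colnames(cors)[apply(cors, 2, function(col) any(col > 0.9))]

selected <- features[setdiff(keep, redundant)]
```

Here `low_variance` is `"near_constant"`, `redundant` is `"a_copy"`, and `selected` keeps columns `a` and `b`.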

Standardizing variables

  • Variables/features usually need to be transformed to a common scale
    • Often a z-transformation is used
  • Otherwise the model may give more weight to variables that happen to have larger numeric values because of their units
  • To avoid data leakage, estimate the standardization parameters (such as mean and standard deviation) from the training set only, then apply that same transformation to the test set
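
One common way to keep test-set information out of the fitting process is to compute the z-transformation parameters on the training features and reuse them for the test features. The sketch below assumes two hypothetical features on very different scales.

```r
set.seed(10)

# Hypothetical features measured in very different units
train_x <- data.frame(height_cm = rnorm(80, 170, 10),
                      mass_g    = rnorm(80, 70000, 12000))
test_x  <- data.frame(height_cm = rnorm(20, 170, 10),
                      mass_g    = rnorm(20, 70000, 12000))

# Means and standard deviations come from the training set only
train_means <- colMeans(train_x)
train_sds   <- apply(train_x, 2, sd)

# Apply the same transformation to both sets
train_z <- scale(train_x, center = train_means, scale = train_sds)
test_z  <- scale(test_x,  center = train_means, scale = train_sds)
```

After this step the training columns have mean 0 and standard deviation 1. The test columns are close to that but not exact, which is expected because their own means and standard deviations were never used.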

Tuning the model

  • We want to estimate parameters on our training set that will maximize prediction performance on the test set
  • Some parameters capture the effects of the different predictor variables on the outcome variable
  • Tuning parameters, or hyperparameters: special parameters that do things like set the balance between overfitting and underfitting

Image credit Rebecca Wilson

Cross-validation

  • Cross-validation: Repeatedly splitting a dataset into different training and test sets, and fitting the model on a different training set each time, until every data point has been included in a test set once
  • Split the dataset into k equally sized folds; each fold takes one turn as the test set while the model is fit on the remaining k - 1 folds
  • Repeat the cross-validation process for many other sets of tuning parameters
  • The set of tuning parameters that minimizes your loss function (has the best prediction performance) is the one you will use to fit the final model
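
The procedure can be sketched as a pair of nested loops in base R. To keep the example self-contained, the polynomial degree of a linear model stands in for the tuning parameter; the simulated data, the choice of k = 5, and the candidate degrees are all assumptions.

```r
set.seed(2024)
n <- 100

# Hypothetical data with a nonlinear relationship
x <- runif(n, -2, 2)
dat <- data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))

# Randomly assign each row to one of k equally sized folds
k <- 5
fold <- sample(rep(1:k, length.out = n))

degrees <- 1:6  # candidate values of the tuning parameter

cv_mse <- sapply(degrees, function(d) {
  fold_mse <- sapply(1:k, function(i) {
    train    <- dat[fold != i, ]  # fit on the other k - 1 folds
    held_out <- dat[fold == i, ]  # evaluate on the fold left out
    fit <- lm(y ~ poly(x, d), data = train)
    mean((held_out$y - predict(fit, newdata = held_out))^2)
  })
  mean(fold_mse)  # average loss across the k folds
})

# The candidate with the lowest cross-validated loss
best_degree <- degrees[which.min(cv_mse)]
```

`best_degree` is the value that would then be used to fit the final model on the whole training set.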

Cross-validation, continued

  • It is a good idea to repeat cross-validation multiple times because results may depend on how the folds are split
  • If you have a blocking structure in your data, make sure that each block is kept within a single fold, or you will get inflated estimates of model performance
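
One way to respect a blocking structure is to assign folds to whole blocks rather than to individual rows. In this sketch the blocks are hypothetical fields with five observations each, and the number of folds is arbitrary.

```r
set.seed(7)

# Hypothetical blocked dataset: 12 fields with 5 observations each
dat <- data.frame(field = rep(paste0("field", 1:12), each = 5),
                  y = rnorm(60),
                  stringsAsFactors = FALSE)

k <- 4
blocks <- unique(dat$field)

# Assign each block, not each row, to a fold
block_fold <- sample(rep(1:k, length.out = length(blocks)))
names(block_fold) <- blocks

# Every row inherits the fold of its block
dat$fold <- unname(block_fold[dat$field])

# Check: each block should appear in exactly one fold
folds_per_block <- tapply(dat$fold, dat$field, function(f) length(unique(f)))
```

The `dat$fold` column can then be used in a cross-validation loop in place of a purely random fold assignment.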

Fitting and validating the final model

  • Fit a final model to the entire training set using optimal tuning parameters selected by cross-validation
  • Now we can see how well the model does on data it’s never seen before!
  • Make predictions on the test set that was originally split off from the training set back at the start of this process, and evaluate performance
  • There is no universal threshold for whether performance is good enough; it is a practical question
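
The final evaluation step might look like the sketch below. A plain linear model stands in for whatever model and tuning parameters cross-validation selected, and the data are simulated. Comparing against a naive baseline that always predicts the training-set mean is one assumed way to put the test error in context; it is not a universal standard.

```r
set.seed(99)
n <- 200

# Hypothetical dataset
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Test set split off at the start and not used until now
test_idx  <- sample(n, 40)
train_set <- dat[-test_idx, ]
test_set  <- dat[test_idx, ]

# Fit the final model to the entire training set
final_fit <- lm(y ~ x1 + x2, data = train_set)
test_pred <- predict(final_fit, newdata = test_set)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
test_rmse     <- rmse(test_set$y, test_pred)
baseline_rmse <- rmse(test_set$y, mean(train_set$y))  # naive baseline
```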

Model tuning and validation workflow

Assessing variable importance

  • ML models can be used for inference as well
  • Different methods exist to quantify how important each feature is for predicting the outcome
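
Permutation importance is one such method: shuffle one feature at a time in held-out data and measure how much the prediction error increases. The sketch below is a minimal base R version on simulated data with hypothetical feature names; it is not necessarily the method used in the course.

```r
set.seed(5)
n <- 300

# Hypothetical data: one strong predictor, one weak, one pure noise
dat <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
dat$y <- 3 * dat$strong + 0.5 * dat$weak + rnorm(n)

train <- dat[1:200, ]
test  <- dat[201:300, ]
fit <- lm(y ~ strong + weak + noise, data = train)

mse <- function(obs, pred) mean((obs - pred)^2)
base_mse <- mse(test$y, predict(fit, newdata = test))

# Increase in test error after shuffling each feature in turn
perm_importance <- sapply(c("strong", "weak", "noise"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and y
  mse(test$y, predict(fit, newdata = shuffled)) - base_mse
})
```

A larger increase means the model relies more heavily on that feature. In this setup `strong` has by far the largest value.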

Do it all over again

  • You can fit different ML models on the same training set and use the one that performs best on the test set, as long as you use regularization to prevent overfitting
  • More complicated is not always better!
  • The “best” model is one that gives acceptable prediction performance and whose behavior you can understand

Why don’t people use this kind of workflow in conventional statistics?

  • They should!
  • Overfitting is more of a problem if your goal is prediction instead of understanding a natural process, but it’s always a problem
  • Lots of publications are based on mining data and drawing conclusions from patterns that are really random noise
  • Cross-validation and external validation are excellent tests of a model’s performance and can help with both prediction and inference

A note about software