Introduction

What is this course? It’s my attempt to cut through some of the mystery and hype surrounding machine learning (ML). For some time now, machine learning has been a big buzzword, not only in our scientific research fields but across all of popular culture. But what exactly is machine learning? I’ll spend a little bit of time introducing you to the basic principles behind machine learning, “bust” some of the common myths about machine learning, and show you how you can use machine learning in your research. After that gentle introduction, we’ll walk through two different machine learning tasks using the R statistical programming language.

Download the worksheet for this lesson here.

What should you know coming into this course?

This course is designed for practicing researchers who have some experience with statistical analysis. It would be great if you already have a little bit of exposure to R. That’s because in the walkthroughs in this lesson, we will make use of some common R packages for manipulating data (dplyr) and plotting data (ggplot2), and we will write model formulas using the basic R model formula syntax. But if you don’t know any of that, don’t sweat it because you can just follow along by copying the code.
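If you have never seen these tools, here is a rough taste of what they look like. This is only a sketch: the built-in mtcars dataset and the names cars, heavy, p, and fit are my own choices for illustration, not something the walkthroughs depend on.

```r
# Illustrative only: mtcars ships with R and is not the data used in this lesson.
library(dplyr)
library(ggplot2)

# dplyr: keep a few columns and add a derived one
cars <- mtcars %>%
  select(mpg, wt, hp) %>%
  mutate(heavy = wt > 3)

# ggplot2: scatterplot of fuel efficiency against weight
p <- ggplot(cars, aes(x = wt, y = mpg, color = heavy)) +
  geom_point()

# Model formula syntax: response on the left of ~, predictors on the right
fit <- lm(mpg ~ wt + hp, data = cars)
coef(fit)
```

The formula mpg ~ wt + hp reads as "model mpg as a function of wt and hp." The same response ~ predictors pattern is what we will write when specifying models later on.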

This course is also designed for people with little to no knowledge about machine learning, who are just interested in learning what it is. If you came here expecting to learn how to tweak the tuning parameters of a multilayer perceptron for image semantic segmentation, this is probably too basic for you.

What will you learn from this course?

Conceptual learning objectives

At the end of this course, you will understand …

  • What machine learning is and how it is similar to and different from statistics
  • Why the common myths about machine learning are not true
  • What the difference between prediction and inference is as a goal
  • What training and testing data are
  • What a loss function is
  • What supervised and unsupervised learning are
  • What the basic steps of a machine learning workflow are

Practical skills

At the end of this course, you will be able to …

  • Explore and pre-process data for machine learning models
  • Fit a random forest model in R for a classification task
  • Fit a lasso regression in R for a regression task
  • Cross-validate ML models
  • Use the R package caret to fit many kinds of ML models using the same code syntax

What is machine learning?

Machine learning myths

Before we get into too much technical detail, I want to start by discussing some common myths about machine learning, both good and bad.

Myth 1. Machine learning is way more powerful than statistics. Machine learning has been hyped, some might say overhyped, for some years now, as the hot new thing. In contrast, statistics has been around for a while, and its limitations are well known. In our culture with its short collective attention span, we tend to gravitate toward shiny new things and think “old = bad, new = good.” So you might think that machine learning and artificial intelligence are making classic inferential statistics obsolete. The counterargument is that machine learning models are essentially statistical models. They suffer from the same limitations as all statistical models: they are only as good as the data that you put into them. If there are biases or confounding influences in your data that you don’t account for, they will be reflected in poor model performance. If you try to extend or generalize a model too far away from the data you used to fit it, you’re going to get poor model performance. That’s true for any kind of model.
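To make the point about generalizing too far from your data concrete, here is a minimal sketch in base R. Everything in it is an assumption made for illustration: the data are simulated, the true relationship (a curve, y = x^2 plus noise) is invented, and a plain linear model stands in for "any kind of model."

```r
# Hypothetical simulated data: the true relationship is curved (y = x^2 plus noise).
set.seed(1)
train <- data.frame(x = runif(200, min = 0, max = 3))
train$y <- train$x^2 + rnorm(200, sd = 0.5)

# Fit a straight-line model using only the training range (x between 0 and 3)
fit <- lm(y ~ x, data = train)

# Root mean squared error of a fitted model on a new data set
rmse <- function(model, newdata) {
  sqrt(mean((newdata$y - predict(model, newdata = newdata))^2))
}

# New data drawn from the same range as the training data ...
near <- data.frame(x = runif(200, min = 0, max = 3))
near$y <- near$x^2 + rnorm(200, sd = 0.5)

# ... and new data far outside that range
far <- data.frame(x = runif(200, min = 6, max = 9))
far$y <- far$x^2 + rnorm(200, sd = 0.5)

rmse_near <- rmse(fit, near)
rmse_far <- rmse(fit, far)
c(near = rmse_near, far = rmse_far)
```

The error on the far data should come out much larger than the error on the near data, even though the model and the underlying relationship are the same in both cases. The model only ever saw x values between 0 and 3, so it has no information about what happens beyond them.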