In a previous blog post, we saw how machine learning (ML) is becoming foundational across many industries and how supervised learning is currently enjoying the lion’s share of ML. Recall that supervised learning is concerned with prediction, such as whether a customer will churn or not, and classification of data, such as whether an email is spam or whether a tumor in a diagnostic image is benign. These are just a couple of examples. To keep your competitive advantage, you’ll need to leverage ML in one form or another. So what do you, as a business leader, need to know about ML to leverage it as effectively as possible?
1. PARADIGM SHIFT: ML MODELS LEARN FROM DATA
The first thing you need to know about ML is that it brings with it a new paradigm of software development, Software 2.0, which is less about giving computers specific instructions on what to do in every case they encounter, and more about specifying high-level models that learn from data, known as training data. Consider the ML challenge of classifying a tumor as benign or malignant, given a diagnostic image.
In classical Software 1.0, you would write code that explicitly specifies that a tumor above a given size and of a certain texture, among other conditions, is classified as malignant, while one below a certain size is classified as benign. In the world of Software 2.0, you specify the types of algorithms you want to use and feed them labeled data, that is, images that have already been classified as benign or malignant, and the algorithm discovers patterns in these in order to generalize the classification to new, unlabeled data. A lot of the time, this training data is hand-labeled by humans and, for this reason, researchers such as David Donoho prefer the term “recycled intelligence” to “artificial intelligence” because the machine is merely recycling the human intelligence contained in the labeled examples, and not creating any new form of intelligence.
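To make the contrast concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the single feature (tumor size in millimeters), the 20 mm cutoff, and the tiny labeled dataset are invented for illustration, and a real system would learn from images with a far richer model than a single threshold.

```python
# Software 1.0: a human writes the rule explicitly.
def classify_by_rule(size_mm):
    # Hypothetical hand-picked cutoff, not a clinical value.
    return "malignant" if size_mm > 20.0 else "benign"


# Software 2.0: the cutoff is learned from labeled examples.
def learn_threshold(examples):
    """Return the size cutoff that misclassifies the fewest training examples.

    `examples` is a list of (size_mm, label) pairs.
    """
    sizes = sorted(size for size, _ in examples)
    # Candidate cutoffs: midpoints between neighbouring sizes.
    candidates = [(a + b) / 2 for a, b in zip(sizes, sizes[1:])]

    def errors(cutoff):
        # Count training examples the cutoff would label incorrectly.
        return sum(
            ("malignant" if size > cutoff else "benign") != label
            for size, label in examples
        )

    return min(candidates, key=errors)


# Invented, hand-labeled training data: (size in mm, label).
training_data = [
    (5.0, "benign"), (8.0, "benign"), (12.0, "benign"),
    (18.0, "malignant"), (25.0, "malignant"), (30.0, "malignant"),
]

learned_cutoff = learn_threshold(training_data)  # 15.0 for this data


def classify_learned(size_mm):
    return "malignant" if size_mm > learned_cutoff else "benign"
```

With this toy data the learned cutoff is 15.0, so a 16 mm tumor is labeled malignant by the learned model and benign by the hand-written rule. The point is not which answer is right, but that the second behavior came from the labeled examples rather than from an instruction a programmer typed in.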
2. YOUR OPTIMIZATION FUNCTION IS KEY
In addition to labeled training data, you need to supply an ML model with an optimization function, also known as an objective function or loss function. This tells the algorithm what you’re optimizing for. One all-too-commonly used optimization metric is accuracy, that is, the percentage of your data for which your model makes the correct prediction. This may seem like a great choice: who would want a model that isn’t the most accurate? But there are many cases where you’d want to be more careful rather than blindly optimizing for accuracy, the most prevalent being when your data has imbalanced classes.
Let’s use an illustrative example. Let’s say you’re building a spam filter to classify emails as spam or not, and only 1% of emails are actually spam (this is what is meant by imbalanced classes: 1% of the data is spam, 99% is not). Then a model that classifies all emails as non-spam has an accuracy of 99%, which sounds great, but is a meaningless model. Similarly, if you have more men than women applying for a job, merely optimizing for accuracy can introduce gender bias.
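The arithmetic behind the spam example can be checked in a few lines of Python. This is only a sketch: the 1,000 emails are made up to match the 1% / 99% split described above, and the "model" is simply a rule that never flags anything.

```python
# 1,000 hypothetical emails: 1% spam, 99% not spam.
labels = ["spam"] * 10 + ["not spam"] * 990

# A "model" that classifies every email as non-spam.
predictions = ["not spam"] * len(labels)

# Accuracy: share of emails where prediction and true label agree.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# How many of the real spam emails did the model actually catch?
spam_caught = sum(
    p == "spam" and y == "spam" for p, y in zip(predictions, labels)
)

print(f"accuracy: {accuracy:.0%}")          # accuracy: 99%
print(f"spam caught: {spam_caught} of 10")  # spam caught: 0 of 10
```

The accuracy figure looks excellent while the model does none of the work it was built for, which is why accuracy alone is a poor target on imbalanced data.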
Thankfully, there are alternative metrics, such as an F1-score, that account for such class imbalances. It is key that you speak with your data scientists about what they’re optimizing for and how it relates to your business question. A good place to start these discussions is not by focusing on a single metric but by looking at what’s called the confusion matrix of the model, which contains the following numbers:
- False negatives (real spam incorrectly classified as non-spam)
- False positives (non-spam incorrectly classified as spam)
- True negatives (non-spam correctly classified)
- True positives (spam correctly classified).
[Figure: confusion matrix. Source: Glass Box Medicine]
The confusion matrix is information-rich and a great place to dissect what your model actually does. Moreover, you can use it to calculate many metrics of interest, such as accuracy and F1-score, along with two other important metrics: precision and recall. You don’t necessarily need to know what these are, but your data scientists will be impressed if you do!
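If you are curious how those metrics fall out of the four numbers in the confusion matrix, the following Python sketch uses the standard definitions. The function name and the example counts are hypothetical, and returning 0.0 when a metric is undefined (for example, precision when nothing was flagged as spam) is a convention chosen here for simplicity.

```python
def metrics_from_confusion_matrix(tp, fp, tn, fn):
    """Compute common metrics from the four confusion-matrix counts.

    tp: true positives  (spam correctly classified)
    fp: false positives (non-spam incorrectly classified as spam)
    tn: true negatives  (non-spam correctly classified)
    fn: false negatives (real spam incorrectly classified as non-spam)
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Precision: of everything flagged as spam, how much really was spam?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real spam, how much did the model catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1-score: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented counts for a filter that catches 6 of 10 spam emails
# and wrongly flags 2 of 990 legitimate ones.
realistic = metrics_from_confusion_matrix(tp=6, fp=2, tn=988, fn=4)
# accuracy 0.994, precision 0.75, recall 0.6, f1 about 0.667

# The "everything is non-spam" model from the example above.
do_nothing = metrics_from_confusion_matrix(tp=0, fp=0, tn=990, fn=10)
# accuracy 0.99, precision 0.0, recall 0.0, f1 0.0
```

Both models score about 99% on accuracy, but the F1-score separates them clearly: roughly 0.67 for the filter that catches some spam versus 0.0 for the one that catches none.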
A lot of attention is currently focused on the importance of the data you feed your ML models, and in this climate, it’s easy to forget the importance of your optimization function. But look no further than YouTube, which no longer optimizes solely for view-time (how long people stay glued to the videos) because doing so resulted in more violent and incendiary content being recommended, along with more conspiracy videos and fake news.
An interesting lesson here is that optimizing for revenue (as viewing time is correlated with the number of ads YouTube can serve you, and thus, revenue) may not be aligned with other goals, such as showing truthful content. This is an algorithmic version of Goodhart’s Law, which states: “When a measure becomes a target, it ceases to be a good measure.” The most well-known example is a Soviet nail factory, in which the workers were first given a target of a number of nails and produced many small nails. To counter this, the target was altered to the total weight of the nails, so they then made a few giant nails. But algorithms fall prey to Goodhart’s Law too, like the YouTube recommendation system. Rachel Thomas, Director at the USF Center for Applied Data Ethics & founder of fast.ai, details several other relevant examples here.
3. SPLIT YOUR DATA
The attentive reader may be asking: “How do we calculate the accuracy of the model if we’ve already used all our labeled data to train it?” This is a crucial point: a model will usually perform better on the data it was trained on than on new data, so a score computed on the training data is overly optimistic. To get around this, before even training the model, you split the labeled data into a training set and a test set. This procedure is called a train-test split. We train the model on the training set and compute the accuracy (or any other metric) on the test set, which we can do because it is also labeled. Because the model hasn’t seen the test set, the computed scores give a better estimate of how it will perform on new data.
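As a rough illustration, a split can be written in a few lines of standard-library Python. The function name, the 80/20 ratio, the fixed seed, and the placeholder dataset are all assumptions made for this sketch; libraries such as scikit-learn ship a ready-made `train_test_split` that your data scientists are more likely to use in practice.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into (train, test) lists."""
    shuffled = list(examples)              # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 100 placeholder labeled examples: (email id, label).
labeled_data = [
    (i, "spam" if i % 10 == 0 else "not spam") for i in range(100)
]

train_set, test_set = split_train_test(labeled_data)

print(len(train_set), len(test_set))  # 80 20
```

The model is then fitted on `train_set` only, and metrics are computed on `test_set`. Shuffling before splitting matters when the data arrives in some order (for example, by date), since otherwise the test set may not resemble the training set.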
4. THE IMPORTANCE OF SOLID DATA FOUNDATIONS AND TOOLING
Having good-quality data is a huge challenge in itself! This is why, when executives ask me how they can make their companies ML-driven, I respond by showing them Monica Rogati’s AI Hierarchy of Needs, which places ML close to the top as one of the final pieces of the puzzle. This illustrates that before ML can happen, you need solid data foundations and tools for extracting, transforming, and loading data (ETL), as well as tools for cleaning and aggregating data from disparate sources.
5. BIASED DATA PRODUCES BIASED ALGORITHMS
This is key: ML can only be as good as the data you feed it. If your data is biased, your model will be too. For example, if you’re building an ML recruiting tool to predict the success of applicants based on resumes, and your training data is biased against women, then your ML tool will also be biased against women. This exact scenario recently happened at Amazon. As Cassie Kozyrkov has analogized, a teacher is only as good as the books they’re reading to teach the students. If the books are biased, their lessons will be too.
To recap, the five things you need to know about machine learning are:
- ML is a new paradigm of software development in which your models learn from data.
- Your choice of optimization function is key.
- Splitting your data into train and test sets is essential.
- ML relies on solid data foundations and tooling.
- Biased data produces biased algorithms.