Imputation
An interactive tutorial
In this chapter, you’ll learn how to handle NaN values in your dataset. You’ll learn a method to “fill in the blanks” called iterative imputation. But you’ll also learn the dangers of imputation, and some simple alternatives to imputation, including listwise deletion and feature deletion.
Your job at Congo.com is to keep the package delivery workers happy. When they get stressed, more packages get damaged and lost. Here are one worker’s stress levels during a week in July, as measured by her smartwatch:
Yes, and it’s a big problem! We can’t make decisions to lower stress if we don’t know what workers’ stress levels are. You can see the missing values in the raw data:
In the second row, NaN stands for “Not a Number”. A normal person would write “N/A”.
Datasets can have missing rows (due to sampling), missing columns (due to not measuring everything), and missing cells — that’s what a NaN value is.
But before we even try to handle NaN values, there are more ways for values to be “missing”! Look at the four other stress values listed above. Does one of them stand out as suspicious?
No, that looks reasonable. The stress values are between 0 and 100, so the odd one out is -1.
Yeah, there are a bunch of these strange -1 values, despite the stress values ranging between 0 and 100.
-1 actually means something like “sensor unreadable”. What should we do with those values?
Yeah, interpreting -1 as a real stress value could mess up our analysis — e.g. “What was their minimum stress value this week?” would return -1, which seems unreasonable.
I wouldn’t do that. Imagine you do some analysis on this data — say, “What was their minimum stress value this week?” If we leave these -1 values in, we’d get the answer -1. I don’t think that’s a sensible answer.
I think a more reasonable approach is to replace it with NaN, meaning “missing value”.
This dataset also contains strange -2 values. Suppose -2 means: “Too high to even display! Off the charts!” Then what could we do with those values?
I wouldn’t do that, for the same reason: it’s not a true stress value.
I think either NaN or 100 is reasonable. This “off the charts” error is an example of what statisticians call censoring. It tells us something about the value, but it’s not an exact reading. We could replace it with a large stress value, but for this chapter, we’ll play it safe and replace it with NaN.
Imputation: prediction or sampling?
Our task in this chapter is to somehow deal with those NaNs. A common approach is to “fill in the blanks” with guesses. This is called imputation. (This strange word rhymes with “computation”.)
Here are two possible ways we could impute the missing values. This first method uses mean-fill, which replaces every NaN with 23.1, the mean of the known values:
This second one uses norm-fill, which fits a normal distribution to the known values, then replaces every NaN with a random sample:
Which method do you prefer?
Okay! There’s no right or wrong answer here. But here are some ways to consider which is better.
Imagine you were given both stress timeseries above, and told that one of them was real, and one was fake. Which seems more likely to be “real”?
Yeah, those perfectly flat lines in the mean-fill would look very suspicious!
Those perfectly flat lines?! No, that would look very suspicious to me! You rarely see that in real-world data, especially in messy health statistics. I think the norm-fill looks more believable.
This is the idea behind one philosophy, which says that imputation is sampling. That is, we try to infer the distribution of possible true values, and randomly pick one as the imputation. So if we believe that stress comes from a normal distribution, the norm-fill method is better.
But here’s a different question. Imagine you had to bet on each missing value, and you were rewarded by how close your guess is. Which imputation would earn you the most money?
No, the mean-fill will actually earn you more. We’ll see why that is in a minute.
Yes! And in a minute, we’ll see why that is.
This is the idea behind the second philosophy, which says that imputation is prediction. That is, we again try to infer the distribution of the possible true values, but then we fill in the blanks with expected values. So if we imagine that stress comes from a normal distribution, it’s better to just impute with the mean every time. The norm-fill, by adding random deviations, just makes your prediction worse.
I think it depends on what you’re doing with your data. Your manager at Congo.com has said your task is to predict workers’ stress, so let’s run with the view that imputation is prediction. In a few minutes, you’ll see a contextual example of where sampling might make more sense.
Why are these values missing?
So far, our best solution to the missing stress values is to fill them all in with the observed mean, 23.1. Can we improve on that method?
For example, why not fill them in with some other constant value? Say, 50?
This suggestion might seem stupid, until we realize we’ve made an important assumption: that the observed values are representative of the missing values. That might not be true!
We need to ask: why are these values missing? Here are three possible theories.
Theory 1: whenever the worker is feeling too stressed, she takes off her watch. How would this affect your belief about our estimate, 23.1?
No, this theory should make us suspect that 23.1 is too low. The theory suggests that a stress value being missing is caused by the true stress value being high. So all those missing values are likely to have been higher than average, bringing up the mean.
Yeah. If missingness is caused by the true value being high, then all those missing values were probably higher than average, bringing up the mean.
Now for Theory 2: whenever the watch measures a stress value below 10, it reports NaN instead of the measurement. How would this affect your belief about our estimate, 23.1?
Right, it’s the inverse: if missingness is caused by the true stress value being low, then the true mean is probably lower.
No, this should make us think our estimate is too high. The theory suggests that missingness is caused by the true stress value being low. So all those missing values are likely to have been lower than average, bringing down the mean.
Theory 3: sensor failures happen when a Plutonium-238 atom decays in the smartwatch’s battery. How does this affect your belief about our estimate, 23.1?
Yes, in this case our estimate is just right. (Although if your stress is resulting in increased radioactivity, please call a doctor.)
Well, it’s widely agreed that atomic decay is a random event. (If your stress is resulting in radioactivity, call a doctor.) This means the measurements we are able to see are a fully random sample of the true values.
As you can see, the ideal imputation depends on why the data is missing.
When we decided to use mean-fill, we made an implicit assumption: the missing values are Missing Completely At Random (MCAR). But in the real world, your data is usually Missing Not At Random (MNAR), as in Theories 1 and 2. So mean-fill is probably going to bias our dataset.
A test set
So far, we just have thorny questions with no answers. And in general, that’s all you’ll have: no way to know why the values are missing, or what the “best” imputation method is.
But fortunately, you’re able to investigate. You hire a sports analyst to monitor the worker’s vitals for another three days, and you get this report back:
If we assume that this is a representative test set, then we can use it to answer some of our earlier questions. For example, here’s how mean-fill performs on this test set:
The true measurements are solid purple, and the predictions are outlined purple. The gray lines show the error for each measurement.
Eyeball it: does the mean 23.1 work well for this test set?
Yeah, the mean in the test set is 23.9, which is pretty close to our guess.
It’s not perfect, but I would say it works reasonably well. The mean in the test set is 23.9, which is pretty close to our guess.
This suggests (but certainly doesn’t prove!) that the stress data is Missing Completely At Random.
We can also use this test set to see how norm-fill compares to mean-fill, when viewing this as a prediction task. Below is the test set again, with the predictions being one norm-fill sample:
Which one, on average, has the smaller error lines?
Right — and here’s how we can quantify that.
I grant that it’s hard to eyeball this one! It’s actually mean-fill. We can quantify that as follows.
To measure our prediction quality, we’ll use mean squared error (MSE). We take the errors for each known value, square them, then take the mean. Mean-fill’s error is 400, but norm-fill’s error is 971.
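As a concrete sketch (with made-up numbers, not the actual test set), the calculation looks like this:

```python
import numpy as np

# Hypothetical true test-set values and the mean-fill predictions for them.
y_true = np.array([24.0, 31.0, 12.0, 28.0])
y_pred = np.full_like(y_true, 23.1)  # mean-fill predicts 23.1 everywhere

# Mean squared error: square each error, then average.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)
```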
Using more features to improve prediction
So after checking our methods and theories with a test set, our best method is still to impute with the mean, 23.1. Can we do better?
To improve our prediction, what’s the data scientist’s knee-jerk response?
Alright, let’s try that. Luckily, we have two other timeseries that might help: the worker’s heart rate and movement. Here’s the full “training set” for that week in July:
Welcome to the real world! It seems different sensors were working at different times. But the data still tells us something about the relationship between these variables, and we know something about most data points, so maybe we can solve this like a sudoku ...
Now, to do any prediction task, we need to choose a model. The first thing we’ll do is discard the “date” field of the dataset, and view the timeseries as a scatter plot:
We can now view the dataset as points in a 3D cube. We show three scatter-plots, which are different views of the cube. You can imagine the first plot as a top-down view, and the second and third plots as “side-on” views. A point’s color is its stress level (where pink is high).
Looking at the scatter plots, which feature do you think is more useful for predicting stress?
Hmm, the middle chart looks more promising to me. I could draw a straight line through those points. But the relationship between movement and stress seems less clear.
Yeah, the middle chart suggests a fairly linear relationship between heart rate and stress. The relationship between movement and stress seems less clear.
We now need a model to fit to these data points. The scatter plot suggests that linear regression could work well. Linear regression works by trying to draw a straight line (or plane) through all the points.
When we run linear regression, it performs much better than the constant mean-fill, with an error of just 42:
Listwise deletion
Here’s a reason to suspect we can do even better. We started this chapter with 54 datapoints about stress. We then added more data: the heart rate and movement features. How many datapoints are used in the scatter plot?
Right, there are just 15 data points in the scatter plot!
No, there are actually just 15 data points in the scatter plot!
The thing is, scatter plots (and thus linear regression) can’t handle missing values. There’s no sensible way to plot a data point with a NaN value, and so all of these points are discarded entirely:
This approach is called listwise deletion. It results in a trade-off between the number of features and the number of data points. Imagine a dataset with 10 features, and 10% missing values. If you apply listwise deletion, what’s the least number of data points you could end up with?
Right — for example, all the missing values could be in the 10th feature, meaning every data point must be discarded.
No, in the worst case it can actually delete all of your data points! For example, all the missing values could be in the 10th feature. Then every data point has a NaN at feature 10, and must be discarded.
So here’s why we think we could do better: linear regression throws away most of our data set. Can we improve it by finding a way to plot those partial data points?
Imputing all features
Here’s an idea: let’s impute the movement and heart rate data too! That should help us keep those partial data points, and make every value in the training set available to help prediction.
Here’s what we get by using mean-fill on all three features:
Now we can run linear regression on the full training set, without throwing anything away! Have a guess — do you think it will perform better or worse than listwise deletion?
You must have good intuition!
That’s a very natural guess, but it actually performs worse!
We run linear regression on the mean-filled dataset. Then we use the test set again (which includes heart rate and movement data for prediction). We get this prediction:
The error is back up to 252! That’s only slightly better than plain mean-fill, and far worse than the 42 we got with listwise deletion. Somehow, imputation has made things worse!
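Here’s a minimal sketch of this approach, reusing the hypothetical train DataFrame from the earlier sketch (scikit-learn’s SimpleImputer does the mean-fill):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Mean-fill every feature separately, then train on the full (imputed) training set.
# `train` is the hypothetical DataFrame defined in the earlier sketch.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

model = LinearRegression()
model.fit(filled[["heart_rate", "movement"]], filled["stress"])
```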
The clues are in the scatter plots:
There are more data points in the scatter plot now, which is great! But now look at that middle plot. How has imputation affected the correlation between heart rate and stress?
I think it’s less clear! It was previously a nice straight line. The plot is now a mess, with an additional horizontal stripe and vertical stripe.
Indeed, what was previously a nice straight line is now a mess. Every plot has one long horizontal stripe and one long vertical stripe.
Perhaps using mean-fill on each feature is too naive? And indeed, there are many more sophisticated ways to impute a feature, like “join the dots”, or advanced timeseries analysis (like we did in the previous chapter).
But all these methods carry the same problem: if we impute each feature separately, we ignore the relationships between features.
Iterative imputation
So here’s where we’re at. Our best method to predict stress is still just to listwise-delete, then use linear regression. Imputation, despite making every measurement in the training set available, has only made things worse. This is because it destroys the relationships between features.
But ... perhaps our imputation method is not advanced enough? Can we impute each feature in a way that respects the relationships between features? One way is called iterative imputation.
Iterative imputation starts by imputing each feature separately with mean-fill, just like we’ve already done. But then it iteratively tries to replace those naive imputations with better ones.
First, it uses the imputed heart rate and movement features to predict stress, and replaces the naive stress imputations with those predictions.
Then it uses the imputed stress and movement features to predict heart rate, and replaces the naive heart rate imputations with those predictions.
Then ... do you see where this is going? What does it predict next?
No, it predicts each feature in turn. It’s done stress, then heart rate, and the next one is movement.
That describes one iteration. It will then repeat the last three steps, predicting stress, then heart rate, then movement. Then it will repeat them again, as many times as you like.
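Here’s a minimal sketch of that procedure in scikit-learn, again reusing the hypothetical train DataFrame from the earlier sketches:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (opts in to the experimental API)
from sklearn.impute import IterativeImputer

# Start from a naive fill, then repeatedly re-predict each feature from the others.
# `train` is the hypothetical DataFrame defined in the earlier sketch.
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
```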
If we do this using IterativeImputer from scikit-learn, we get a scatter plot like this:
Looking great! We now have the best of both worlds! By using imputation, we can use our entire dataset for training, and by using iterative imputation, it still respects the relationships between features.
So have a guess — will this perform better or worse than listwise deletion?
That was my guess, too!
You have better intuition than me! I would have guessed it would perform better.
But when we compute the mean squared error on our test set, we get 67. Dang! It’s still significantly worse than just listwise-deleting most of our dataset!
Here’s what I believe is going on ... Look at the last scatter plot between stress and movement. Earlier, our scatter plots suggested a weak relationship between stress and movement. But iterative imputation has hallucinated a strong relationship. Then our linear regression model uses this false relationship for prediction. Perhaps it would be better off just ignoring it ...
Feature deletion
Let’s rethink our initial problem. In data science, we’re very used to thinking: more data can never hurt! But when using listwise deletion, adding a new feature can hurt, by reducing the number of data points. And advanced methods like iterative imputation don’t necessarily help, because they suggest relationships that don’t exist.
So what if, instead of deleting data points, we delete features instead? Which feature would you try deleting?
Yeah — its relationship to stress seems unclear, so let’s try getting rid of it.
Well, it turns out deleting the heart rate isn’t so bad. But deleting the movement feature works out much better.
No, stress is what we’re trying to predict! If we delete that, our model has nothing to go on. No, I recommend deleting the movement feature, because its relationship to stress seems unclear.
We’ll predict stress using just the heart rate. No imputation, just listwise deletion of points that are missing stress or heart rate.
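As a minimal sketch (still using the hypothetical train DataFrame from earlier), this is just a one-feature regression:

```python
from sklearn.linear_model import LinearRegression

# Feature deletion: ignore movement entirely, then listwise-delete what's left.
# `train` is the hypothetical DataFrame defined in the earlier sketch.
pairs = train[["heart_rate", "stress"]].dropna()

model = LinearRegression()
model.fit(pairs[["heart_rate"]], pairs["stress"])
```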
As always, have a guess — will this perform better or worse than listwise deletion with both features?
I would have thought so! But incredibly, this simple model gives us an error of just 32:
Yeah! Incredibly, this simple model gives us an error of just 32:
Conclusion
Imputation, despite being theoretically dodgy, is often a necessary evil: in the worst case, listwise deletion can remove your entire dataset! This is why AutoML systems will often impute by default.
But if you’re using imputation, you should be aware of the problems it can cause: destroyed variances, destroyed correlations, and hallucinated correlations. In this chapter, we explored alternatives to imputation, and got lucky: with careful feature selection, we got a good prediction score with just listwise deletion.
End notes
If you want to play around with these results yourself, here’s the Colab notebook.
In the next chapter, we’ll see the power of vectors: how to embed words and images in space, with the spooky ability to do arithmetic like (king−man)+woman=queen.
This chapter is free this week, but Everyday Data Science is a paid course. The following little section is for course buyers, who also get access to all premium chapters of the course.
💓 Premium section: Heart rate variability!
So far, we’ve seen that heart rate correlates positively with stress. But there’s another way that your heart indicates stress, called heart rate variability (HRV). It measures the variance (or standard deviation) in your heart rate.
If your heart rate is more variable, what would you guess that indicates?
Counter-intuitively, it’s a sign of lower stress! Aren’t our bodies bizarre?!
So, considering that heart rate variability predicts stress, which of the following imputations of our heart rate might be better for predicting stress?
Yeah, prediction techniques like mean-fill have the annoying habit of reducing the variability (standard deviation). So sampling techniques like norm-fill could be better here, because they would try to preserve the variability in the data.
Well, prediction techniques like mean-fill have the annoying habit of reducing the variability (standard deviation). So if we tried to use variability to predict stress, it would suggest that the user is very stressed during times without heart rate data.
So sampling techniques like norm-fill could be better here, because they would try to preserve the variability in the data.
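To make that concrete, here’s a tiny sketch (with made-up heart rate values) comparing the standard deviation after each fill:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
hr = pd.Series([62.0, np.nan, 75.0, np.nan, 58.0, 95.0])  # hypothetical heart rates

mean_filled = hr.fillna(hr.mean())

mu, sigma = hr.mean(), hr.std()
norm_filled = hr.copy()
norm_filled[hr.isna()] = rng.normal(mu, sigma, size=hr.isna().sum())

# Mean-fill shrinks the standard deviation; norm-fill roughly preserves it.
print(mean_filled.std(), norm_filled.std())
```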
As it happens, if we use our model and dataset above, sampling doesn’t actually help our prediction. That’s partly because we’re using linear regression, which isn’t smart enough to use the standard deviation. And it’s partly because our data points are hourly summaries, whereas HRV is defined using the gap between every heart beat.
But the principle stands: if your data’s distribution is important, and not just its expected value, then consider thinking about imputation as sampling rather than prediction.
See you next time! 👋