Linear Regression vs. Decision Trees: Handling Outliers

In regression tasks, it’s often assumed that decision trees are more robust to outliers than linear regression. See this Quora question for a typical example. I believe this is also mentioned in the book “Introduction to Statistical Learning”, which may be the source of the notion. Predictions from a decision tree are based on the average of the training instances in the leaf. This averaging should dampen the effect of outliers, or so the argument goes.

However, in a notebook, I demonstrated that decision trees can sometimes react more poorly to outliers. Specifically, an outlier increased the sum of squared errors more in a decision tree than in a linear regression. Depending on the specifics, a tree can put an outlier in its own leaf, which can lead to some spectacular failures in prediction: any test point routed to that leaf inherits the outlier's value directly, with no averaging to dampen it.
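The experiment can be sketched along these lines with scikit-learn. Note this is a minimal illustration, not the notebook's actual code: the data-generating line, the outlier value, and the use of an unpruned tree are all assumptions chosen to make the failure mode easy to reproduce.

```python
# Sketch: compare how one extreme outlier affects linear regression
# vs. an unpruned decision tree. All constants here are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Noisy linear training data...
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 3.0 * X_train.ravel() + rng.normal(0, 1, size=200)

# ...plus one extreme outlier.
X_train = np.vstack([X_train, [[5.0]]])
y_train = np.append(y_train, 300.0)

# Clean test data from the same underlying line.
X_test = rng.uniform(0, 10, size=(200, 1))
y_test = 3.0 * X_test.ravel() + rng.normal(0, 1, size=200)

lin = LinearRegression().fit(X_train, y_train)
# An unpruned tree will isolate the outlier in its own leaf; any test
# point that lands in that leaf is predicted as 300 with no averaging.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

sse_lin = float(((lin.predict(X_test) - y_test) ** 2).sum())
sse_tree = float(((tree.predict(X_test) - y_test) ** 2).sum())
print(f"linear regression SSE: {sse_lin:.1f}")
print(f"decision tree SSE:     {sse_tree:.1f}")
```

Which model comes out worse depends on the details: how extreme the outlier is, where the tree places its splits, and whether any test points fall in the contaminated leaf. That sensitivity to specifics is exactly the point of the post.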

This is not to say that linear regression is always, or even generally, better at handling outliers. Rather, it is best not to assume that one technique or the other will outperform in all cases. As usual, the specifics of a given application should dictate the techniques used.

