
Homers & Hits: Probabilistic Classification, part 1

Saturday, April 4th, 2020

Note: this is the sixth post of the multi-part series Statcast Data Science. If you'd like to start at the beginning, you can check out the introduction here.

If you'd like to follow along at home, I've published the code for this post on GitHub.

After my last post exploring the biases in MLB's Statcast data, we've now got a relatively clean dataset to do some analysis. One of the first, most obvious applications for Statcast's batted ball trajectory data is to create new metrics evaluating the skill of hitters and pitchers, independent of things like the park and defense.

Today, in place of batting average and RBIs, batters are measured using weighted runs created plus, or wRC+. Runs, as it turns out, are the currency of baseball; the more you score and the fewer you allow over a season, the more likely you are to win any given game that year. Weighted runs created (wRC) is a simple way of estimating the number of runs a batter contributed to his team. Each of a player's plate appearances is given a run value based on the outcome: 0.5 runs for a single, 1.4 runs for a home run, -0.3 runs for an out, and so on (check out this three-part series for info on how those values are derived). wRC+ adjusts this for park and league, then divides by the average number of runs per plate appearance.

The result is a metric where 100 is league average and higher is better, i.e. more runs produced. For context, in the season he set the single-season home run record, Barry Bonds had a 235 wRC+, meaning he produced 135% more runs than average. At the bottom end, in 2006, Clint Barmes recorded a wRC+ of 38, the worst for a player with at least 500 plate appearances since at least 2002.

The trouble with measuring runs this way is that the outcomes (single, double, triple, home run, walk, or out) are partially outside of the batter's control. A deep fly ball to Triples Alley in San Francisco's AT&T Park might well be a second-decker home run at Yankee Stadium. A sharp ground ball fielded by the all-time great shortstop Andrelton Simmons might be a single against a lesser fielder.

That's where using Statcast data comes into play. Instead of relying on the result of the play, we can estimate the run value using the batted ball trajectory, i.e. the exit velocity, launch angle, and potentially the spray angle. One simple shortcut would be to use that data to predict the likelihood of each of the possible outcomes, then weight each run value by its corresponding probability. This is an example of probabilistic classification. In ordinary classification, an estimator predicts which of a set of categories a new observation belongs to. For example, if given an image of an animal, I might ask you if the pictured cute furry quadruped is a dog or a cat. Probabilistic classification simply turns the problem from a simple guess ("I think it's a dog") into an estimate of the probability of each category ("There's a 75% chance it's a dog, and a 25% chance it's a cat").
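To make the "weight each run value by its probability" idea concrete, here's a minimal sketch. The three run values are the ones quoted earlier in this post; the function name and the predicted probabilities are made up purely for illustration, and a real version would cover doubles, triples, and walks too.

```python
# Run values quoted earlier in the post (only the three that were listed).
RUN_VALUES = {"out": -0.3, "single": 0.5, "home_run": 1.4}


def expected_run_value(probs, run_values=RUN_VALUES):
    """Weight each outcome's run value by its predicted probability."""
    total = sum(probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * run_values[outcome] for outcome, p in probs.items())


# Hypothetical prediction for a single hard-hit fly ball (numbers invented).
fly_ball = {"out": 0.5, "single": 0.1, "home_run": 0.4}
print(round(expected_run_value(fly_ball), 2))  # 0.46
```

The batter gets credit for 0.46 runs whether or not the ball actually cleared the fence, which is the whole point of scoring the trajectory rather than the result.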

This is the first post of a series (within a series) exploring different probabilistic classification methods applied to the challenge of predicting home runs and hits from the 2016 season. The first task, predicting home runs, is an example of binary classification, like the dog/cat example. In this case, the output will be a single probability, estimating the likelihood of a home run. The second, predicting hits, will estimate the probability of five outcomes (out, single, double, triple, and home run) summing to 100%.

For both of these tasks, we'll make predictions using 10-fold cross-validation. As an example, say we have 1,000 total batted balls. If we train the chosen algorithm on all 1,000 data points, then use that same dataset to check for accuracy, we'll get a biased result, with a lower estimated error than we'd see on new data. What we need is two datasets: one to train our model, and another to test it. One simple way to achieve this is to leave one data point out, train the model on the other 999 examples, then predict the outcome of the left-out batted ball. Do this for each batted ball, and we now have an unbiased set of predictions for each one.

That process of iteratively slicing up a dataset for separate training and testing is cross-validation. When algorithms slow or datasets grow, in place of the simple leave-one-out approach, we can save time by splitting our data into a handful of equally sized groups, e.g. 10 groups of 100 observations, and leaving one group out at a time. The groups are commonly known as folds, as in "10-fold cross-validation."
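Here's a rough sketch of how the folds could be built with nothing but the standard library. The function name and the fixed seed are my own choices for illustration; it is not the code from the repository.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every index is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled indices gives near-equal sized folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(1000, n_folds=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 900 100
```

Setting `n_folds` equal to `n_samples` recovers the leave-one-out scheme described above, at the cost of fitting the model once per batted ball.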

So, with a full set of actual and predicted (via 10-fold cross-validation) batted ball outcomes, we need a scoring rule to evaluate the accuracy. For our purposes, we want to use the log-loss, taken here as the logarithm of the probability estimate for the actual outcome (many references define log-loss as the negative of this quantity, so that lower is better; with the sign used here, higher is better). For example, say we give a particular batted ball an 80% chance of being a home run. If it's indeed a home run, then the score will be log(0.8) = -0.22, otherwise it'll be log(0.2) = -1.61. The use of logarithms might seem unnecessarily complex, but it actually makes the computation easy. Summing the log-losses is equivalent to taking the logarithm of the product of the probabilities, i.e. the likelihood of the actual results under our predicted probabilities, which for a large number of observations would be a number very close to zero. Thus, log-loss allows us to score the different models based on the probability they assign to the actual results.
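As a quick sanity check on that arithmetic, here's a small sketch of the scoring rule for the binary home run case. The function name and the three example predictions are invented for illustration.

```python
import math


def log_score(prob_home_run, was_home_run):
    """Log of the probability assigned to the outcome that actually happened."""
    p_actual = prob_home_run if was_home_run else 1.0 - prob_home_run
    return math.log(p_actual)


print(round(log_score(0.8, True), 2))   # -0.22
print(round(log_score(0.8, False), 2))  # -1.61

# Made-up (predicted probability, actual outcome) pairs for three batted balls.
preds = [(0.8, True), (0.3, False), (0.1, False)]
total = sum(log_score(p, hr) for p, hr in preds)

# The summed score equals the log of the product of the probabilities
# assigned to what actually happened: 0.8 * 0.7 * 0.9.
print(math.isclose(total, math.log(0.8 * 0.7 * 0.9)))  # True
```

Working with the sum keeps the numbers manageable; the raw product over a full season of batted balls would underflow to zero long before we finished multiplying.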

The first algorithm out of the gate is a simple generalized linear model. I'm currently working on covering GLMs more thoroughly in the next release of A Hitchhiker's Guide to Linear Modeling, but for the purposes of this post, they're simply a convenient extension of a linear model for binary response data. Predicting the probability of an event falling within a set of categories is notably different from a traditional linear model, e.g. probabilities are always between 0% and 100%, and sum to 100%. To deal with this, GLMs use a link function, like the logit or probit, that transforms probabilities onto the real line, i.e. negative infinity to infinity. This enables the use of convenient, well-studied, tried and true linear methods.
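To give a feel for what a link function does, here's a small standard-library sketch of the logit and probit links and their inverses. The function names are mine, and this only illustrates the transforms themselves, not how a GLM is fit.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def logit(p):
    """Map a probability in (0, 1) to the real line via the log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a real number back to a probability (fine for moderate x)."""
    return 1.0 / (1.0 + math.exp(-x))


def probit(p):
    """Map a probability in (0, 1) to the real line via the normal quantile."""
    return _STD_NORMAL.inv_cdf(p)


def inv_probit(x):
    """Map a real number back to a probability via the normal CDF."""
    return _STD_NORMAL.cdf(x)


# Any value of the linear predictor lands strictly between 0 and 1.
for eta in (-3.0, 0.0, 3.0):
    print(eta, round(inv_logit(eta), 3), round(inv_probit(eta), 3))
```

The two links behave similarly near 50%, but the probit's tails approach 0 and 1 faster than the logit's.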

To implement this, I used Statsmodels' generalized linear models module for the home run predictions using a probit link function, and their multinomial logit regression model for the hit predictions since they have no multinomial probit.

To visualize the results, I've created a precision-recall curve. Precision is the percentage of predicted positives (in this case home runs) that were correct. Recall is the percentage of actual positive events (again, home runs) that were correctly predicted. For example, if given 100 batted balls, we predict 10 of them to be home runs, with only 8 correct, then the precision would be 8/10, or 80%. However, if there were in fact 12 actual home runs in the original 100 batted balls, then the recall would be only 8/12, or 67%.
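The worked example above is easy to verify in a few lines. This is just a sketch with a made-up helper name, using boolean lists laid out to match the 100 batted balls described.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for parallel lists of boolean labels."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    n_predicted = sum(1 for p in predicted if p)
    n_actual = sum(1 for a in actual if a)
    precision = true_pos / n_predicted if n_predicted else float("nan")
    recall = true_pos / n_actual if n_actual else float("nan")
    return precision, recall


# 100 batted balls: 8 correct home run calls, 2 wrong calls,
# 4 home runs we missed, and 86 correctly ignored balls.
predicted = [True] * 8 + [True] * 2 + [False] * 4 + [False] * 86
actual = [True] * 8 + [False] * 2 + [True] * 4 + [False] * 86

precision, recall = precision_recall(predicted, actual)
print(round(precision, 2), round(recall, 2))  # 0.8 0.67
```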

Since our model produces probabilities instead of predictions, I've left out the key step of translating the former to the latter. In the case of predicting a home run, a binary example, we can simply choose a threshold at which to predict the event, typically 50%. However, we can sweep this threshold from 100% to 0%, in effect being less and less selective about what we predict as a home run. This generates the precision-recall curve, with recall on the x-axis. With a prediction threshold of 100%, our recall is 0%, since we predict 0 home runs. Similarly, with a prediction threshold of 0%, we predict every event as a home run, thus giving a recall of 100%. Thus, sweeping the threshold at which we "predict" a home run from 100% to 0% traces out the trade-off between precision and recall.
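Here's one way the threshold sweep could be sketched. Sorting by predicted probability and admitting one more batted ball at each step is equivalent to lowering the threshold. The function name and the five example predictions are invented, and the sketch assumes no tied probabilities and at least one actual home run.

```python
def precision_recall_curve(probs, actual):
    """Return (recall, precision) points as the threshold sweeps high to low."""
    n_actual = sum(actual)
    # Most confident predictions first; each step lowers the threshold
    # just enough to call one more batted ball a home run.
    ranked = sorted(zip(probs, actual), key=lambda pair: pair[0], reverse=True)
    points, true_pos = [], 0
    for n_predicted, (_, was_home_run) in enumerate(ranked, start=1):
        true_pos += was_home_run
        points.append((true_pos / n_actual, true_pos / n_predicted))
    return points


# Made-up home run probabilities and outcomes (1 = home run) for five balls.
probs = [0.9, 0.8, 0.6, 0.3, 0.1]
actual = [1, 1, 0, 1, 0]

curve = precision_recall_curve(probs, actual)
for recall, precision in curve:
    print(round(recall, 2), round(precision, 2))
```

The final point always sits at 100% recall, where the precision is simply the share of home runs in the data (3 of 5 in this toy example).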

Let's see what that looks like for predicting home runs from MLB's 2016 Statcast data. Below, I've plotted the precision-recall curve for a GLM using exit velocity (EV) and launch angle (LA), along with a second curve for a GLM with EV, LA, and spray angle (SA).

It's immediately clear that the results are nearly identical, with or without the spray angle to further inform the algorithm. The L=90% in the legend shows our score, the average likelihood of each batted ball, as predicted by each model, where higher is better.

When reading a precision-recall curve, the first thing I do is look at the precision at 100% recall on the far right. This is the base rate, telling us how often the event we're trying to predict actually occurs. In this case, the odds of a batted ball ending in a home run are about 4%.

On the left, we can see that, at our most selective, the GLM can achieve a precision of only 60%, i.e. only 60% of its home run predictions are actually home runs. This seems pretty low, especially given the phenomenon of the no-doubter: a ball hit so hard and high that some of the players won't even bother to turn to watch it leave the park. For all the hits when the announcer says "If it's fair, it's gone," we'd expect our algorithm to have a near 100% precision, so this disappointing result may mean a linear model isn't best.

Before we investigate that further, let's examine the precision-recall curve for the hit classifier. I've plotted it below for the algorithm with spray angle (SA) included, but it's similar without it.

For a given recall, we want the precision to be as high as possible. Given that, it's clear that predicting outs and home runs is much easier than singles, doubles, and triples. Triples are rare and appear off the bat like a double (hard hit ball past the outfielders that stays in the park), which itself is hard to predict. For non-home runs, the batter's speed and the fielder's arm can mean the difference between a single and double, or double and triple. Without that data also captured, we should expect making those predictions to be more difficult than the simpler task of predicting a home run. That said, the curves show the model isn't doing much better than our priors, particularly for singles and triples, which are relatively flat from right to left.

Upon reflection, it's obvious that a linear model is ill-suited to this problem. Sure, all else equal, higher exit velocity will consistently lead to more home runs, but launch angle has a Goldilocks zone, not too high, not too low, that leads to hitting the ball out of the park. We can see this by plotting the batted ball classifications Statcast provides for each ball in play.

Next up, I'll explore using Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) to improve our classifier.

Questions | Comments | Suggestions

If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at matt.e.fay@gmail.com.