Imputing Missing Statcast Data
Monday, July 17th, 2017
So far, I've covered what Statcast data is, how to scrape and store that data, and how to process that data in the cloud. Next on the agenda is cleaning the data, i.e. finding and fixing any corrupted or inaccurate records. This might seem overly careful, but, as anyone who's worked with real world data can tell you, it is completely necessary. Somewhat disappointingly, it turns out that the sensors responsible for reporting batted-ball trajectories aren't perfect, and occasionally fail to track the ball through its entire flight path. The team at MLB Advanced Media is slowly improving their algorithms to deal with this issue, but for the first two years, they were simply publishing null values for 10-15% of all events. Last year, Rob Arthur at FiveThirtyEight showed that these missing data points were not random. It turns out the radar and visions systems mounted behind home plate struggle with balls hit directly into the dirt (steep groundballs), or straight up in the air (pop-ups). Unfortunately, that means our dataset is now biased, e.g. players who hit lots of weak infield flyballs will often have their mishit balls ignored. To mitigate this issue, we can impute the missing data, i.e. replace the null values with our best guess.
Rob later showed on his personal website that he could successfully impute the missing data using the other pieces of information available for every play, e.g. the type of hit (groundball, line drive, fly ball) and its result (out, single, double, triple, homerun). I decided to reproduce his work, using a random forest regression model from scikit-learn, an open-source machine learning package in Python.
Before discussing random forests, a quick review! Machine learning algorithms make predictions, either through classification, e.g. "is this a picture of a hot dog?", or regression, e.g. "how many hot dogs are in this picture?" In order to function, they must first learn from a set of examples aka training data, e.g. a database of images of different foods, some of which are hot dogs (of course). By learning, we mean that parameters in the model are being optimized to fit the data.
Random forest models are an example of a bagging algorithm, in which the model is fit to many often overlapping subsets of the data. The result, a collection of models, is then used as an ensemble to make predictions. For example, a simple way to combine those predictions into one is through a simple average. Random forest models happen to be made up of decision trees. Get it? Forest. Trees. Yup!
Decision trees are a simple model that intelligently split or branch the input data into separate regions or leaves with similar outputs. Lots of puns! I couldn't recommend this beautiful interactive, a visual introduction to machine learning using decision trees, more highly.
Lucky for me, random forests are known for their simplicity and flexibility, making implementation easy. I only had three decisions to make.
- How many trees should my algorithm have? In this case, more trees improves accuracy, but decreases speed while increasing storage space requirements if I want to save the model for later.
- What features (machine learning parlance for inputs) should I use? My database has many potential attributes to choose from — pitch velocity, the pitcher's release point, the number of balls and strikes in the count, etc.
- How many datapoints do I need to train the model with? I could train my model on the complete dataset, but that would be quite large and slow. Similar to the number of trees, this is a tradeoff between accuracy and model size.
First, let's find out how many trees we need. The number of trees in a random forest model is an example of a hyperparameter of an algorithm. Unlike most parameters in a machine learning model, hyperparameters are set before learning, i.e. they are not found in the process of fitting the training data. In our case, before we can fit the random forest model to our statcast data, we have to select how many trees our random forest will be made up of. Machine learning folks will talk about tuning hyperparameters, i.e. evaluating the capabilities of a model given different values of a hyperparameter (or set of hyperparameters). We're interested in tuning the number of trees hyperparameter of our random forest.
This is a relatively benign example, for a couple of reasons. First, we know that more trees is always better, or in more precise language, the expected score of the model is monotonically increasing (never decreasing) with the number of trees. This is not the case for all machine learning hyperparameters, so we lucked out this time. Second, while more trees is always better, we can't simply keep adding trees until we have a 100% accurate model. Each additional tree has a diminishing return on model accuracy, so at some point, the added benefit is negligible. To find the right number of trees for our problem is to decide exactly how small of an added benefit can be considered negligible.
One way to go about this would be to use cross-validation (explained about 2/3 of the way through this post) on our statcast data to score a few differently-sized random forests. That would work, but very slowly, taking 3-10x as long. Luckily, bagging algorithms have a convenient shortcut for estimating model performance, called an out-of-bag estimate. Since each tree is fit to a subset of the data, given enough trees, each datapoint in our training data is missing from at least one tree. For each data point, we can compare the actual outcome to the predicted one from each tree that didn't include that point in its training data. Neat, right? This is a fast, downwardly biased estimate — the out of bag score will be a bit lower than the expected score on fresh data since each point is modeled with fewer than the total set of trees — of the model's predictive capabilities,
We can also streamline this process by iteratively adding trees to an existing random forest, instead of building a new one each time. These two techniques in combination — out-of-bag scoring and re-using the existing model — mean we can, at almost no additional cost of computation, add trees to a random forest until the score stops improving. Specifically, I coded up a Tree Selecting Random Forest Regressor built off of scikit-learn's Random Forest Regressor that scales up the number of trees by 20% iteratively, until that increase results in an improvement in score by <1% — by default, but tunable as a hyperparameter. What we've done here is replaced number of trees as a hyperparamer with the tree improvement threshold.
Below, I've plotted the out of bag score as the model increased trees in predicting statcast exit velocity, launch angle, and batted ball distance. In this case, R^2, known as the coefficient of determination, was used to evaluate the model's predictive capabilities. A score of 0 means the model has no clue, while 1 means it is perfectly accurate.
At first, the gains from adding a couple trees are massive, but the added benefit of increasing from 23 trees to 28 was <1%, thus 28 was enough for this problem. Other prediction tasks might require more, or be permissible with fewer, but instead of manually checking this each time, we've simply made that optimization a part of the algorithm.
Next, we'll tackle the problem of "feature selection." This is machine learning lingo for figuring out which data points we ought to use as inputs for our algorithm. Some features are useful in making predictions, while others lend no further insight. As an example, in predicting the distance a batted ball traveled, the result of the play (out, single, double, triple, or homerun) could be useful (homeruns travel a lot further than the typical single), but the handedness of the pitcher might not. While using that sort of intuitive reasoning to decide on features is useful, its not rigorous or scalable.
Instead, I implemented something called recursive feature elimination. In this scheme, we start with every feature available, and recursively eliminate the least important features until some stopping criterion. You might be asking yourself "how do we know which features are least important?" This is where the structure of the random forest algorithm helps us out again. As I previously mentioned, each tree in a random forest intelligently splits the input data up into similar regions. It turns out that some features are used more often than others in splitting up the data. If we add up all the times each feature is used to make a split, we can come up with a percentage of splits attributable to each feature, i.e. a sorted list of feature importances.
Unlike other machine learning algorithms, random forests are generally robust to overfitting. Translation: while some models' performance will decrease when given random, unrelated features, random forests will largely ignore useless data. This means that feature selection for our random forest is the same problem as determining the number of trees: a tradeoff between model accuracy and size/speed. Low ranking features might have a small amount of signal, but are mostly noise, and thus can be discarded. To handle this, I created a function similar to scikit-learn's RFECV (Recursive Feature Elimination using Cross-Validation), but slightly altered for this specific need. It recursively eliminates one feature at a time until the cross-validation score goes below 98% — by default, but customizable — of the peak score observed.
Below, I've plotted the cross-validation score as the model reduced features in predicting statcast exit velocity, launch angle, and batted ball distance. In this case, rms error was used to evaluate the model's predictive capabilities. A score of 0 means the model is perfectly accurate, while higher values represent roughly the average error of each prediction.
The model starts off with 21 features in total, and in order, removes:
- strikes - the number of strikes in the count before the ball was put in play
- balls - the number of balls in the count before the ball was put in play
- hit_location - the fielder that fielded the ball, in position numbering from 1 to 9
- vy0 - the velocity of the pitch in the y-direction, measured at the initial point
- start_speed - the initial speed of the pitch
- zone - the strike zone is split up into zones, with numbers representing each subregion
- hc_x - "stringers," aka real people watching the game, give an x & y estimate of where the ball was fielded. Not sure what the hc stands for, but this is used to superimpose a marker on a graphic of the field for MLB's gameday app.
- vz0 - the velocity of the pitch in the z-direction, measured at the initial point
- vx0 - the velocity of the pitch in the x-direction, measured at the initial point
- effective_speed - the effective speed of the pitch, taking into account the extension of the pitcher towards homeplate in releasing the pitch. Taller, long-armed pitchers that release the ball closer to the plate have higher effective velocities.
The 11th feature removed, events, which represents the outcome of the play (out, single, double, triple, homerun), resulted in a noticeable increase in RMS error. Thus, the final set of features, in order of importance, was:
- bb_type - 42% of splits - the type of batted ball, e.g. groundball, flyball, linedrive, etc.
- hitDistanceGD - 37% of splits - thanks to a known transformation, hc_x and hc_y can be used to estimate the distance and horizontal angle of the ball's trajectory. This is the distance, in feet, from gameday coordinates.
- hc_y - 6% of splits - the y-coordinate from stringers.
- px - 2% of splits - the x-coordinate of the pitch as it crosses the plate
- pz - 2% of splits - the z-coordinate of the pitch as it crosses the plate
- sprayAngle - 2% of splits - the horizontal angle of the ball's trajectory estimated from hc_x and hc_y
- z0 - 2% of splits - the z-coordinate of the pitch, measured at the initial point
- pfx_x - 2% of splits - the horizontal movement of the pitch attributed to spin
- pfx_z - 2% of splits - the vertical movement of the pitch attributed to spin
- x0 - 2% of splits - the x-coordinate of the pitch, measured at the initial point
- events - 1% of splits - the outcome of the play, e.g. out, single, double, triple, homerun
For a more thorough documentation of these parameters, Mike Fast (now Director of R&D for the Houston Astro's) put together a glossary.
The last problem, determining how much of the data to train our random forest with, is a third example of the accuracy versus space/time tradeoff. More training data will increase accuracy, but slow down the fitting process, and any future predictions. To evaluate this tradeoff, machine learning practitioners will plot what's called a learning curve. I've plotted one below for predicting statcast exit velocity, launch angle, and batted ball distance.
The x-axis shows the number of datapoints used for training, from about 100 to 100,000, while the y-axis uses cross-validated rms errors to evaluate the model's predictive capabilities. In this case, we're able to get away with using only 10,000, or 1% of the data, with negligible loss of accuracy. Two orders of magnitude is a big win! To automate this step as well, I ended up building a function to iteratively reduce the training data size until the loss in accuracy exceeds a desired threshold.
So, now that we've built a pipeline for building a random forest model to impute missing statcast exit velocities, launch angles, and batted ball distances, let's see how well it works. Before we go imputing the values, let's check to see how accurate it is. Since the model is only trained on about ~1% of the data, we can use the other ~99% to evaluate. First, the exit velocities from the 2016 season:
At the bottom right of the figure above, you'll see the root mean squared error (RMSE), coefficient of determination (R2), and the mean average error (MAE). Interestingly, despite all of our fancy machine learning work, exit velocity is pretty difficult to predict. Still, this algorithm is performing much better than simply throwing out the data, or using a simple imputing technique of assuming the average for each missing point.
Next up, launch angle (vertical angle of the trajectory of the ball off the bat):
While the rms and mean average error are roughly similar for launch angle, we can see that the R squared term is much higher. This is because the range of values — roughly -90 to 90 degrees — is much larger than it was for exit velocity — about 40 to 120 mph. Thus, the model is capturing much more of the variance between data points. Interestingly, the four clusters seem to correspond with bb_type, the most important feature. From lowest to highest angle, that would be ground ball, line drive, fly ball, and pop up.
Finally, the hit distances:
This time, the rms and mean average error are actually substantially higher, but again because of the larger spread of values, the R squared term is even better, at an impressive 0.9. Other than a significant number of mis-predictions close to zero feet, which may actually represent errors in the data itself, the model does extremely well, forming a tight line.
After all the hard work, we're left with a model that is somewhat capable of imputing exit velocity, pretty good with launch angle, and excellent at hit distance.
Now that we've got a working imputing model, we can put it to good use, filling in the missing values. The final thing to do is compare the captured data to the newly imputed, no longer missing data. While histograms are a great tool for this task, when estimating the probability distribution of a population given a random sample, I prefer to use Kernel Density Estimation. You can find my implementation, based off of scikit-learn's but with added confidence bands and cross-validation for bandwidth selection, here along with a convenient plotting function that automates the entire histogram process.
First, let's tackle the exit velocity (again, 2016 season):
For comparison, I've plotted not only the test data and missing data imputed, but also that original test data modeled using the imputing random forest. Since the imputing model isn't perfect, it distorts the distribution of the original data, decreasing its range. The confidence bands represent the 95% interval within which the actual distribution should be.
Clearly, the missing data is composed of lower velocity batted balls. It appears to be bimodal (having two peaks, one at 60mph, and one at 75mph), but that may simply be an artifact of the imputing model.
Next up, launch angle (vertical angle of the trajectory of the ball off the bat):
Here, we see quite a lot of distortion coming from the random forest imputer, with clear clustering. But again, the missing data have a clear divergence from the existing data, occurring much more frequently at steep angles of descent and elevation (far left and right), while rarely occurring at angles closer to zero.
Since these variables are related, this next plot is easy to predict. If missing data is usually lower exit velocity, and either straight up or down in angle, then those balls must surely carry only a short distance.
Sure enough, missing data is characteristically short in hit distance. But if we inspect the histogram of the existing data, we can make out the traditional positioning of fielders, since the hit distance records where the ball is caught. At zero, we see the catcher, then a small bump for the pitcher at around 50 — pitchers aren't known for their fielding skill. Infielders make up the largest peak, from 100 to 150 feet, then a gap between the infielders and outfielders.
Questions | Comments | Suggestions
If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at firstname.lastname@example.org.