Exploring Biases in Statcast Data
Wednesday, September 20th, 2017
Regular readers of this series know that I've covered what Statcast data is, how to scrape and store that data, how to process that data in the cloud, and most recently, how to impute the missing data. Now that I have a complete dataset, before I start building predictive models, it's time to explore the data for any biases. So what does it mean for data to be biased?
Many statistical tests require a dataset to be independent and identically distributed, or iid for short. In this case, independence implies that the result of any observation doesn't affect the outcome of another. For example, say our dataset is a collection of numbers drawn from a hat. If we don't replace each number after drawing it from the hat, then each subsequent draw will be influenced by every previous one, i.e. our data won't be independent. Identically distributed simply means the odds of any outcome are the same across trials. In our example, if we alternated between two hats that contained different sets of numbers, each pull would be independent (as long as the numbers were replaced, of course), but the pulls would not be identically distributed.
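To make the hat example concrete, here is a minimal sketch using only Python's standard library. The function names and the five-number hat are made up for illustration.

```python
import random


def draw_without_replacement(hat, n_draws, rng):
    """Each draw removes a number, so later draws depend on earlier ones."""
    remaining = list(hat)
    draws = []
    for _ in range(n_draws):
        # Pick a random position among the numbers still in the hat.
        pick = remaining.pop(rng.randrange(len(remaining)))
        draws.append(pick)
    return draws


def draw_with_replacement(hat, n_draws, rng):
    """Each draw sees the full hat, so the draws are independent."""
    return [rng.choice(hat) for _ in range(n_draws)]


rng = random.Random(0)  # seeded so the example is repeatable
hat = [1, 2, 3, 4, 5]

dependent = draw_without_replacement(hat, 5, rng)
independent = draw_with_replacement(hat, 5, rng)

print("without replacement:", dependent)
print("with replacement:   ", independent)
```

Without replacement, the five draws are always some ordering of the five numbers, so once four are known the fifth is determined. With replacement, repeats are possible and no draw tells you anything about the next.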
MLB data is anything but independent and identically distributed. The resulting batted-ball trajectory of an at-bat is obviously affected by non-iid factors like pitcher and batter, but perhaps also by things like the field or the weather. Luckily, Jonathan Judge, Nick Wheatley-Schaller and Sean O'Rourke at Baseball Prospectus, a website devoted to the sabermetric analysis of baseball, studied this very problem back in 2016 and found that those very factors (pitcher, batter, field, and temperature) were indeed impacting the measured speed and launch angle of hits off the bat.
With an extra season's worth of data available, and a complete dataset thanks to imputing the biased missing data, I thought I'd reproduce and extend that analysis.
First off, a bit of background on the methodology Jonathan has used for everything from catcher defense to field-independent pitching: linear mixed effects modeling. I just finished up the first edition of A Hitchhiker's Guide to Linear Modeling, which I began for the express purpose of explaining mixed effects models, so if you're interested in a precise but concise explanation of the math, go for it! That said, for a social science perspective, I'd recommend Bodo Winter's two tutorials using R's lme4 package, the same tool used by Jonathan. While Python's statsmodels package also has a linear mixed effects model, it didn't have all of the features I needed. Instead, I used rpy2 to interface with R via Python, giving me access to the full feature set of lme4 without having to learn a new language.
For the purposes of this post, linear mixed effects models allow us to simultaneously fit a model with factors for everything from the temperature and venue to the individual batter and pitcher. This kind of analysis is critical for real-world data that isn't cleanly sampled. For example, the Astros hit a lot of home runs in 2017. Was that because they had great hitters, because their home park, where they played half their games, had short fences, or because the pitchers they faced happened to be particularly homer-prone? Mixed models can answer this type of question.
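One way to build intuition for how a mixed model estimates a group-level effect is the textbook one-way random-intercept case, where the estimated effect is the group's raw deviation from the overall mean, shrunk toward zero according to how much data the group has. The sketch below assumes a single grouping factor and known variance components, which is far simpler than the crossed batter, pitcher, and venue model described here. All numbers are hypothetical.

```python
def shrunken_effect(group_mean, n_obs, grand_mean, var_group, var_resid):
    """Best linear unbiased prediction of a random intercept.

    Assumes a one-way random-intercept model with known variances:
    var_group is the between-group variance and var_resid is the
    within-group (residual) variance.
    """
    # Weight approaches 1 as n_obs grows, and 0 when data are scarce.
    weight = var_group / (var_group + var_resid / n_obs)
    return weight * (group_mean - grand_mean)


# Hypothetical numbers: two groups with the same raw average exit
# velocity (92 mph against an overall mean of 89 mph), but very
# different sample sizes.
few = shrunken_effect(92.0, n_obs=5, grand_mean=89.0,
                      var_group=1.0, var_resid=100.0)
many = shrunken_effect(92.0, n_obs=2000, grand_mean=89.0,
                       var_group=1.0, var_resid=100.0)

print(f"estimated effect with 5 observations:    {few:.2f} mph")
print(f"estimated effect with 2000 observations: {many:.2f} mph")
```

The group with only five observations gets an estimate close to zero, while the group with 2,000 keeps most of its 3 mph raw difference. This is why a hot streak in a small sample doesn't get mistaken for a large true effect.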
It's pretty obvious that players have a large impact, but I was most interested in the impact of the venues. Did each baseball field have an effect on the measured exit velocities, launch angles, or hit distances? At least for hit distance, we know that air density, and thus temperature and elevation, changes the flight of the ball, but for the other two, it seems counter-intuitive that venue would play a role.
To investigate, I generated a linear mixed effects model predicting the exit velocity, launch angle, and hit distance of the ball off the bat using the identity of the batter & pitcher along with the temperature, venue, and whether or not the value was imputed. I did this for both the 2015 and 2016 seasons, creating two separate estimations of these factors.
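For readers who want to try something similar, the sketch below builds lme4-style formula strings for the three responses. The column names are hypothetical. Treating batter, pitcher, and venue as random intercepts with temperature and the imputed flag as fixed effects is an assumption, not a specification taken from this analysis. In particular, it does not attempt the separate per-venue effect for imputed points discussed later.

```python
def build_lmer_formula(response, fixed, random_intercepts):
    """Build a formula string in lme4's syntax.

    Fixed effects are listed by name; each grouping factor gets a
    random intercept written as "(1 | group)".
    """
    terms = list(fixed) or ["1"]  # intercept-only if no fixed effects
    terms += [f"(1 | {group})" for group in random_intercepts]
    return f"{response} ~ " + " + ".join(terms)


# Hypothetical column names for a batted-ball table.
RESPONSES = ["exit_velocity", "launch_angle", "hit_distance"]
FIXED = ["temperature", "imputed"]
GROUPS = ["batter", "pitcher", "venue"]

formulas = {r: build_lmer_formula(r, FIXED, GROUPS) for r in RESPONSES}

for formula in formulas.values():
    print(formula)
```

Strings like these could then be handed to lme4's lmer function, for example through rpy2, once per response and per season.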
Below, I've plotted the estimated impact of each venue on observed exit velocities for 2015 and 2016:
As you can see from the high correlation, we're clearly picking up on a real effect here. Apparently, we should be expecting any hits in the Arizona Diamondbacks park to be ~1.5 mph faster than average! And the Mets, Reds, and Astros parks feature hits ~1 mph slower. What could be causing this? Since we've controlled for the players, temperature, and which observations were imputed, we can confidently rule those out. In their article exploring these effects, Jonathan, Nick, & Sean claimed this was the result of consistent calibration errors, but I see no evidence for that sort of confident assessment when so many other factors could be in play. For example, since humidity reduces the coefficient of restitution (COR), or bounciness of the ball, Arizona's dry weather could explain its outlier status.
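The year-to-year comparison boils down to a Pearson correlation between two sets of per-venue estimates. Here is a minimal standard-library sketch. The venue labels and effect sizes are placeholders invented for illustration, not values from the fitted models.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Placeholder venue effects in mph, one dict per season.
effects_2015 = {"park_a": 1.4, "park_b": -0.9, "park_c": 0.2, "park_d": -0.3}
effects_2016 = {"park_a": 1.6, "park_b": -1.1, "park_c": 0.1, "park_d": -0.2}

# Only compare venues that were estimated in both seasons.
venues = sorted(effects_2015.keys() & effects_2016.keys())
r = pearson_r([effects_2015[v] for v in venues],
              [effects_2016[v] for v in venues])

print(f"year-to-year correlation across {len(venues)} venues: {r:.3f}")
```

Because the two seasons are fit separately, a high correlation means the venue estimates are stable from one year to the next. On its own, it says nothing about what causes them.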
Moving on, let's look next at the launch angle:
Again, we see a strong year-to-year correlation, suggesting stable estimates of the park effects, this time with a much larger spread. Interestingly, the Rockies' Coors Field has one of the lowest expected launch angles, despite being known as a home run park. Given the thin air, you'd expect batters to adjust their strategy to hit the ball in the air more often, but don't forget that the pitchers are also aware of this, and may similarly be adjusting their strategy by keeping the ball low in the strike zone. Finally, the thin air also causes breaking pitches to break less, and "rising fastballs" to rise less. If hitters expect a pitch to stay up longer than it does, they'll swing over those pitches, causing more groundballs and lower launch angles.
These are all just theories to explain my skepticism towards immediately attributing these effects to measurement error.
Finally, plotting the hit distances from 2015 & 2016:
The correlation here is lower, but still significant. Notably, Coors Field, despite its thin air, does not feature the longest hits on average; it is outdone by the Diamondbacks' park and the climate-controlled Tropicana Field of the Tampa Bay Rays.
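To put a rough number on "thin air," the sketch below estimates dry-air density from elevation and temperature using the ideal gas law and the standard-atmosphere pressure formula. It ignores humidity and day-to-day weather, and the 1,609 m figure is simply Denver's approximate mile-high elevation, so treat the output as a ballpark figure only.

```python
# Standard-atmosphere constants (SI units).
P0 = 101325.0    # sea-level pressure, Pa
T0 = 288.15      # sea-level standard temperature, K
LAPSE = 0.0065   # temperature lapse rate, K/m
G = 9.80665      # gravitational acceleration, m/s^2
M = 0.0289644    # molar mass of dry air, kg/mol
R = 8.31446      # universal gas constant, J/(mol*K)


def air_density(elevation_m, temp_c):
    """Approximate dry-air density in kg/m^3.

    Pressure comes from the standard-atmosphere barometric formula;
    density then follows from the ideal gas law at the given
    temperature. Humidity is ignored.
    """
    pressure = P0 * (1.0 - LAPSE * elevation_m / T0) ** (G * M / (R * LAPSE))
    return pressure * M / (R * (temp_c + 273.15))


sea_level = air_density(0.0, 15.0)
mile_high = air_density(1609.0, 15.0)

print(f"sea level, 15 C: {sea_level:.3f} kg/m^3")
print(f"1609 m, 15 C:    {mile_high:.3f} kg/m^3")
print(f"ratio:           {mile_high / sea_level:.2f}")
```

Under these assumptions, the air a mile up is noticeably less dense than at sea level, and warmer air is less dense than cooler air at the same elevation. That is the direction needed for less drag on a batted ball.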
Baseball venues clearly impact the batted ball trajectory in a predictable, significant way. Attributing those differences from stadium to stadium solely to measurement noise is, however, not supported by the evidence, especially given what we know about the effects of weather. For that reason, I wouldn't recommend using this sort of analysis to "correct" the data. What's measured may well be "real," the result of altered strategies, weather, or something we haven't even considered.
One place where I do feel confident in attributing error to the sensors responsible for Statcast is in the hits they fail to measure. Since my model took into account whether each data point was real or imputed, it also estimated a separate effect for each park on missing data points. In my previous post on Imputing Missing Statcast Data, I showed that those missing hits are not random, but in fact have a characteristic signature: mishit balls directly up in the air or into the ground. Could it be that some sensors are more prone than others to missing balls hit at particular launch angles?
To find out, let's plot the venue biases for the estimates of those missing values:
Sure enough, we see a very strong year-to-year correlation, and the largest spread between venues. A batted ball missed at Wrigley Field is expected to be 15 degrees lower than average. Missing hits at the Rogers Centre in Toronto were more than 20 degrees higher than expected in 2015! Apparently, different sensors are more prone to missing hits at particular angles.
It's important to note that the imputed data was estimated without using the venue, so this bias couldn't be coming from the algorithm used for filling in the missing values.
Questions | Comments | Suggestions
If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at email@example.com.