Statcast Data Science

Monday, June 5th, 2017

In my past life as an electric powertrain engineer, I specialized in systems modeling and simulation. The systems were things like motorcycles, cars, or even industrial equipment, and it was my role to answer questions like “how far”, “how fast”, or “how much energy consumed”. How was I qualified to do this? Well, it turns out that the stuff that makes up an electric powertrain (things like motors and gearboxes) is pretty well understood thanks to physics, and I just happened to have a degree with that in the title. For a variety of (very good) reasons, MATLAB and the larger MathWorks ecosystem of products are popular with engineers for this sort of analysis, so I became a bit of a MATLAB savant.

Little-known fact: MATLAB has objects, reference data types, and even multiple inheritance!

As much fun as it was to learn from brilliant peers while working on challenging technical problems, after four years I had a bit of a quarter-life crisis and quit my job in search of more meaningful opportunities to make the world a better place. I was still excited by math, science, and engineering, but was looking for more directly philanthropic employment. Unfortunately for me, some of the most pressing global issues don’t need hardware engineering skills, and many nonprofits can’t afford expensive, proprietary software like MATLAB. I set out to broaden my technical skills, preferably using open-source tools, which led me to data science.

For the uninitiated, data science is, simply put, the use of scientific practices in drawing conclusions from data. As an engineer, I was guided by a well-developed set of first principles, i.e., physics, which provide an underlying theory of action for how stuff works. For example, when predicting the temperature of a motor given some driving profile, say a drive down the coast from San Francisco to Los Angeles, thermodynamics provides some convenient laws and governing equations, along with big words like “convection” and “enthalpy” to put on your slides.

Life outside the design of hardware is not so convenient. For better or worse (ok, worse), social scientists haven’t had as much success as physicists in establishing a rich, thoroughly proven-out foundation for their field. That means we can put a woman on the moon, but can’t accurately predict how long her marriage might last once she returns. That’s where data science comes into the picture: when working in a field without strong first principles from which to derive our results, data science gives us another way to reach conclusions, one that doesn’t require subject-matter expertise and might even uncover some underlying theory in the process!

Now that I was leaving the comfortable world of electric vehicles, data science seemed like a great skill to have.

When I was twelve, the local baseball team won the World Series in dramatic fashion, so needless to say, I’m a bit of a baseball fan. Yes, I know it’s just rooting for laundry, public financing for new stadiums is straight swindling of taxpayers, and race and gender issues abound, but sports make for a pretty good moral equivalent of war. At least it’s better than football, right?

If you made it this far, you’re keenly aware that I’m also a nerd, so it should come as no surprise that I follow the game through the lens of sabermetrics, or (roughly) baseball science, made famous by Michael Lewis’s best-selling book Moneyball. If you’re unfamiliar with the story, the Oakland A’s, armed with the latest sabermetric research at the time, built an incredibly successful team despite the loss of three of their best players and a razor-thin budget. Even with an abundance of data thanks to a rich tradition of statistical record-keeping, baseball decision-makers (the men and women, well, mostly white men, who draft, sign, and trade players) were just beginning to use analytics to sort out the stars from the scrubs. Since then, sabermetrics has flourished, both in the public sphere, with the ascendance of websites like Baseball-Reference, FanGraphs, and Baseball Prospectus hosting sortable leaderboards and analytical articles, and in front offices, where Ivy League grads have largely displaced ex-jocks.

The Oakland A's (top left) had the second-best record in baseball despite the third-smallest salary.

With the success of early adopters like the Athletics, Red Sox, and Rays, the industry has invested heavily in both statheads and the numbers they crunch. In 2006, Major League Baseball installed a camera-based tracking system called PITCHf/x to record and publish the velocity and trajectory of every pitch, and in 2014, it announced plans to also capture player movements and batted-ball trajectories using a new system called Statcast. While not all of that data (roughly 7 TB per game) is publicly available, analysts have been churning through what has been published, and MLB Advanced Media, the organization responsible for Statcast, has begun posting basic leaderboards. The race is on to discover the next big thing in sabermetrics.

As an aspiring data scientist, this seemed like the perfect project to learn how to scrape, store, transform, model, and visualize large datasets using open-source tools. Over the next few posts, I’ll describe what I learned in the process. Hopefully you'll enjoy reading my findings as much as I did discovering them!

Questions | Comments | Suggestions

If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at matt.e.fay@gmail.com.