Statcast Data Science: A Retrospective

Sunday, April 5th, 2020

Note: this is the seventh and final post of the multi-part series Statcast Data Science. If you'd like to start at the beginning, you can check out the introduction here.

One of my roommates and I both came down with a slight sore throat Tuesday, March 3rd. Luckily, neither of us developed anything resembling COVID-19 symptoms, but we both stayed home just to be safe. Outside of a single day in the office a week later, I’ve been staying inside ever since, working from the desk in my bedroom, so needless to say, I’ve had a bit of a head start on shelter-in-place here in San Francisco.

At that point, I had some sense that COVID-19 would be coming to the US, but it was only a month prior when my friend, an epidemiologist at Johns Hopkins, had told me to worry more about the flu. Sports were still going strong, Joe Biden had just more-or-less locked up the nomination on Super Tuesday, and it would be another week before anyone ran out of toilet paper. My parents were already talking about stockpiling food and settling in for the long haul, but it wasn’t until I came back from my one day at work — expectantly turning on the tv to watch Zion dunk on the next unlucky NBA team to find the season had been cancelled — that I realized things were really serious.

When I started at Zipline back in November of 2017, I was thrilled to have found a job I could throw myself into, one that was both intellectually stimulating and tangibly making the world a better place. I had abruptly quit my previous career as an electric vehicle engineer, spending what would become 15 months learning to code and working on freelance data science projects while in search of a fulfilling career. As I said at the time: "No, not Silicon Valley “change the world” doublespeak: wake-up in the morning with a sense of urgency, with the intimate knowledge that what I do will make a difference."

That thrill was accompanied by an intense drive to work hard. At this point, Zipline was a start-up of ~60 people, delivering blood transfusions to ~12 rural hospitals in Rwanda with an aging fleet of drones held together by zip ties and crazy glue. In only my second week on the job, one of our software engineers discovered a critical bug that might cause the aircraft to dive too low while delivering packages; we were at risk of a ground collision at one of the very hospitals we were serving. I ended up analyzing the past ~1,000 flights on my laptop on the way home for Thanksgiving. I finished de-risking the problem, which would otherwise have shut down our operations for days, while my parents drove me home from the airport.

It was gripping work, and I purposefully dropped all of my hobbies at the time, devoting most of my free time to work projects. That included this blog, and the Statcast Data Science project I was in the midst of. That was interesting work, but I now had a chance to learn those skills on the job while improving the health outcomes of pregnant mothers and their babies in rural Rwanda.

Flash forward two and a half years, and I’m not here to tell you that I’ve burned out. Anything but! In normal situations, when it comes to technical projects, I’d still much rather devote my time to Zipline’s mission to expand access to essential medicines to everyone in the world. But with this deluge of working from home, truth is, I need a little separation of work and play to keep myself sane in these trying times.

And what could be a better indoor, socially distanced hobby than getting back to writing on my blog? I have some grander plans for personal projects coming up, but to start, I wanted to close the book on my last one from 2017: Statcast Data Science. Now armed with the perspective of two and a half years working as an honest-to-goodness data scientist, data engineer, and now data team lead, what did I get right, and what would I have done differently?

What did I get right?

At the start of my project, I needed to decide which language to learn, and it came down to R or Python. I documented my thought process for choosing Python in the first half of this post, and it largely holds up. As shown in GitHub’s State of the Octoverse and the Stack Overflow Developer Survey, Python is already immensely popular and is still one of the fastest growing languages, with a vast developer community and seemingly unlimited online resources to learn everything from web development to deep learning. Similarly, learning to prefer Pandas DataFrames over NumPy arrays was another great idea, one I’ve kept.

I ended up largely using machine learning models from scikit-learn, an open-source Python package with simple, performant implementations of a wide variety of supervised and unsupervised learning algorithms. Unfortunately, it has a few nuisances in how it handles data, largely using NumPy arrays and all their flaws. I ended up spending a few weeks diving into the deep end of Python programming, learning about metaprogramming to extend the package to use Pandas DataFrames. This may not have been the best use of my time for this specific project, but I ended up learning a lot about how Python actually works, while making my workflows much simpler and more readable.

A few months in, I had managed to collect a decent dataset: batted-ball trajectory data going back to 2015 along with pitch trajectory, play-by-play, and weather data going all the way back to 2008. At this point, I was starting to analyze it, looking at things like imputing missing values using machine learning. Unfortunately, my little 12” MacBook, with only two cores and zero fans, was not up to the task. At the time, the decision seemed to be between writing some of the ML algorithms in C (a language I’d need to first learn) to speed them up, or scaling out on bigger hardware using AWS. In retrospect, the answer seems obvious, and I’m glad I came to the same conclusion back then: set up an AWS account and rent some bigger computers!

What would I do differently?

As you might guess, there’s quite a lot that I’d do differently. To start, I ended up saving my models largely using joblib, as suggested by scikit-learn. This, it turns out, is a really bad idea. joblib is built on Python’s pickle, so loading a model file can execute arbitrary code, and a model saved with one version of scikit-learn isn’t guaranteed to load correctly with another. I didn’t realize this at the time, but given that I had spent quite a bit of time improving my own version of scikit-learn, I probably should have also improved the way those models serialized, to avoid such a low-performance, insecure data model.
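To make the "insecure" part concrete, here is a minimal sketch using only the standard library's pickle, which joblib builds on. The class name NotReallyAModel is made up for illustration, and eval of a harmless expression stands in for whatever a tampered file might actually run.

```python
import pickle


class NotReallyAModel:
    """Hypothetical stand-in for a tampered model file."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # A malicious file could name any importable callable here.
        return (eval, ("6 * 7",))


payload = pickle.dumps(NotReallyAModel())

# Loading never constructs NotReallyAModel; it just runs the callable.
loaded = pickle.loads(payload)
print(loaded)  # prints 42
```

The takeaway is that a pickle-based model file should only ever be loaded if you trust where it came from.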

Storing those models in Git LFS seemed like a really nice idea at the time. This allowed me to store the models with the code, making it very easy to distribute them to the cloud compute each time I’d install and run my software. Unfortunately, Git LFS just isn’t a very good technology, and you actually have to pay for the storage. After shutting down my project, I ended up needing to “rewrite history” in Git to remove these models so I wouldn’t have to continue paying for them! A much simpler solution would have been to store them in AWS S3, which it turns out even has a fancy versioning option if you really need it (you probably don’t).

As for the data I scraped from Baseball Savant, I initially stored it in a local SQLite database, and eventually migrated it to an AWS RDS instance of PostgreSQL in the cloud. These are all great technologies, and I’m glad I got the chance to experiment with them using the AWS free tier, but it turns out I mostly didn’t need them, and if I were to do this again, I’d have avoided SQL altogether. An RDS instance costs money for every hour it stays on, and I didn’t need a database that was always available. Instead, I’d have gone with S3 again, storing my data in Apache Parquet format. It turns out that Parquet is an excellent format for storing and loading tables that make their way into Pandas, which can read it directly with read_parquet.

On the data scraping side of things, I eventually learned that what I was building was a simple extract, transform, load pipeline, often abbreviated ETL. Whenever I wanted to run my models, I’d first call an “update” method on one of my database objects before loading. This update method was “incremental,” meaning it only searched for and added new data instead of reloading the entire dataset, which is an important attribute of any scalable ETL. But it still took a good while, and I’d often skip it to save time, sometimes forgetting I didn’t have fresh data.
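For anyone unfamiliar with the idea, an incremental update can be sketched roughly as below. This is not my original code: the pitches table, its two columns, and fake_fetch_day (a stand-in for a real scraper request) are all assumptions made up for the example.

```python
import sqlite3
from datetime import date, timedelta


def latest_loaded_date(conn):
    """Return the most recent game_date already stored, or None if empty."""
    row = conn.execute("SELECT MAX(game_date) FROM pitches").fetchone()
    return date.fromisoformat(row[0]) if row[0] else None


def incremental_update(conn, fetch_day, start, today):
    """Fetch only the days after the last one stored, up to yesterday."""
    last = latest_loaded_date(conn)
    day = start if last is None else last + timedelta(days=1)
    added = 0
    while day < today:
        rows = fetch_day(day)  # hypothetical scraper call for a single day
        conn.executemany("INSERT INTO pitches VALUES (?, ?)", rows)
        added += len(rows)
        day += timedelta(days=1)
    conn.commit()
    return added


def fake_fetch_day(day):
    """Stand-in for a real request; returns two rows per day."""
    return [(day.isoformat(), f"{day.isoformat()}-{i}") for i in range(2)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pitches (game_date TEXT, pitch_id TEXT)")

# First run loads April 2-4; the second run only fetches April 5.
first_run = incremental_update(conn, fake_fetch_day, date(2017, 4, 2), date(2017, 4, 5))
second_run = incremental_update(conn, fake_fetch_day, date(2017, 4, 2), date(2017, 4, 6))
```

One caveat of this simple version: a day with no games never advances the stored maximum date, so it gets re-requested on the next run. That is harmless here, but a real pipeline might track the last checked date separately.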

The simplest solution to this problem is to decouple the updating from the loading, having them run in separate processes, since they’re needed at different times. In my case, I didn’t need to “stream” live data from Baseball Savant, but instead periodically load data in daily batches. Folks often schedule ETL jobs using cron, a simple program with a rich syntax for scheduling. In my case, the simplest and cheapest solution would likely be to schedule an AWS CloudWatch Event to kick off my update job. Conveniently, those events are more or less free, and don’t require having a personal computer or cloud instance running.

For even more cost savings, I’d likely end up running that ETL job using AWS Lambda. While AWS EC2 (Elastic Compute Cloud) is more or less a way to rent a computer in the cloud, Lambda lets you rent a fraction of a computer for a few minutes at a time (at most 15 per invocation). EC2 instances require some time to boot, but a Lambda function starts more or less instantaneously. Most importantly, Lambda invocations are VERY cheap.

So in sum: at 3am each morning, an AWS CloudWatch Event would kick off an AWS Lambda function to run my simple ETL job. That job would query Baseball Savant for data from the previous day’s games and upload it to AWS S3, taking advantage of the Parquet format’s partitioning functionality to save a separate file for each day, so I wouldn’t have to append data to one giant file.
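A rough sketch of what that job could look like is below. Everything named here is an assumption for illustration: the statcast prefix, the data.parquet file name, and the fetch_day and upload callables, which stand in for the real scraper and a real S3 upload. The schedule string uses the CloudWatch Events cron format, which runs in UTC, so the hour would need shifting for a local 3am.

```python
from datetime import date, datetime, timedelta, timezone

# CloudWatch Events schedule expression for 03:00 UTC every day.
SCHEDULE_EXPRESSION = "cron(0 3 * * ? *)"


def partition_key(game_date, prefix="statcast"):
    """Build a Hive-style key so each day lands in its own Parquet file."""
    return f"{prefix}/game_date={game_date.isoformat()}/data.parquet"


def run_daily_etl(today, fetch_day, upload):
    """Fetch yesterday's games and upload them under that day's partition."""
    yesterday = today - timedelta(days=1)
    payload = fetch_day(yesterday)  # hypothetical: returns the day's file as bytes
    key = partition_key(yesterday)
    upload(key, payload)            # hypothetical: would wrap an S3 upload
    return key


def handler(event, context):
    """Lambda entry point; the real fetch and upload are left as stubs here."""
    today = datetime.now(timezone.utc).date()
    return run_daily_etl(today, fetch_day=lambda d: b"", upload=lambda k, b: None)


# Local dry run with an in-memory dict standing in for the S3 bucket.
uploaded = {}
key = run_daily_etl(
    date(2020, 4, 5),
    fetch_day=lambda d: f"rows for {d}".encode(),
    upload=uploaded.__setitem__,
)
print(key)  # prints statcast/game_date=2020-04-04/data.parquet
```

Passing the fetch and upload steps in as arguments keeps the date and key logic testable on a laptop without touching AWS at all.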

Last but not least, I’d have skipped using Apache Spark altogether. To be clear, I think Spark is great, and use it at work every day! But setting up a Spark cluster using AWS EMR is a pain. An improvement is Databricks, a company run by the folks who invented Spark in a lab at Berkeley, which offers a convenient, managed Spark cluster solution. For my use case though, I didn’t realize at the time that my data simply wasn’t large enough to necessitate a cluster. Sure, I needed to scale up from my fanless 12” MacBook, but AWS EC2 is full of massive instances that are much easier and cheaper to work with. It turns out setting up a Jupyter notebook to run over ssh on a remote instance is quite simple, and there are even convenient plug-ins to make this work with VS Code if you’d like a fancy IDE as well.

All in all, spending a large portion of my 15 months unemployed learning data science in Python through Statcast data was a great choice! I learned a ton, and in the process made a small contribution to the online sabermetric community. There are still a lot of questions left to explore in this space that I’ve frankly been surprised more folks haven’t investigated, but it’s time for me to move on to new projects. If you’re looking to work with this data, I wouldn’t recommend building off of my code, since the schema of the data available from Baseball Savant has changed a few times since I last worked on this. Luckily, there are lots of articles on doing this, but oddly most of them are in R, which frankly isn't the right language for this sort of work in my opinion.

Happy data sciencing!

Questions | Comments | Suggestions

If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at matt.e.fay@gmail.com.