Processing Statcast Data

Saturday, July 1st, 2017

Note: this is the third post of the multi-part series Statcast Data Science. If you'd like to start at the beginning, you can check out the introduction here.

If you'd like to follow along at home, I've published the code for this post on Github.

With baseball databases for batted-ball & pitch trajectory, play-by-play, and even weather information stored on my machine, it was time to start analyzing the data. At only 5 GB in total, the data was small enough to fit on the SSD of my little fanless MacBook, and a single season's worth of data loaded into memory was on the order of megabytes, so playing around with the data locally wasn't a problem. However, as I started using more and more complex algorithms, my machine quickly got bogged down, taking hours or even days for some tasks.

One way to solve the problem is to buy a bigger, faster machine, but that's an expensive and inflexible solution. Luckily, in scaling their internet services into the "big data" era starting in the mid-2000s, tech companies like Google, Yahoo, and Facebook pivoted away from centralized, on-premises, custom mainframes to large remote warehouses of networked commodity hardware, aka "the cloud." Today, cloud services from Amazon, Google, Microsoft, IBM, and others allow everyone from little guys like me to giant corporations to flexibly and cheaply rent computing resources. For example, Amazon Web Services, which sprang from Amazon's internal infrastructure for hosting its own website, is used by Netflix, Yelp, and, notably, MLB Advanced Media.

So, following the industry trend, I decided to move my data science project to the cloud.

For all its flexibility and cost-effectiveness, working in the cloud presents some challenges. For example, to save money, instead of renting machines permanently, it often makes sense to spin up instances (jargon for renting a few computers) only as they're needed. Since those cloud instances aren't permanent, any files you want to keep must be saved elsewhere, e.g. in a cloud storage service. However, "spinning up" an instance, which involves installing your desired operating system along with any software and files onto your newly rented computer, can take a long time if large files, e.g. my baseball databases, have to be loaded onto each new instance. To get around this, I decided to move the databases from a local implementation to one hosted on Amazon RDS, a relational database service. Instead of storing and loading data locally, I would now send queries to a server in the cloud hosting my databases.

SQLite, the simple, lightweight engine I was using, isn't compatible with Amazon RDS, so I decided to switch to PostgreSQL, the most popular open-source option. Luckily, this was a minor code change thanks to my object-oriented implementation and SQLAlchemy, the Python package I was using to connect Pandas to SQL.
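To give a sense of how small that change is, here's a sketch: with SQLAlchemy, swapping SQLite for a Postgres instance on RDS mostly comes down to the connection URL. The table, columns, credentials, and hostname below are all placeholders, and the demo uses an in-memory SQLite database so it's self-contained.

```python
import pandas as pd
from sqlalchemy import create_engine

# Local SQLite (what I started with); ":memory:" keeps this demo self-contained.
engine = create_engine("sqlite:///:memory:")

# For Amazon RDS, only this URL changes (placeholder credentials/host):
# engine = create_engine(
#     "postgresql://user:password@mydb.xxxx.us-east-1.rds.amazonaws.com:5432/statcast"
# )

# Write a tiny placeholder table, then query it back through the engine.
pitches = pd.DataFrame({"pitch_type": ["FF", "SL"], "release_speed": [95.2, 86.4]})
pitches.to_sql("pitches", engine, index=False)

fastballs = pd.read_sql("SELECT * FROM pitches WHERE pitch_type = 'FF'", engine)
print(len(fastballs))  # 1
```

Because Pandas talks to the engine rather than to a specific database, the rest of the query code is untouched by the move.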

With the databases figured out, the next step was figuring out how to load my code and any dependencies onto each machine. This can be done using a bootstrap action, which is essentially a set of commands each instance runs when it starts up, after basics like the operating system (Linux in my case) have been installed. Those commands are generally written as shell scripts, in a simple scripting language common to Unix-based systems (Linux & macOS) and accessed through a command-line interface (think black background with a monospaced font and a blinking cursor).

The simplest way to install Python code onto a machine is with a program called pip, a recursive acronym for "pip installs packages." It allowed me to quickly and reliably install the packages I needed onto each cloud instance. This was so headache-free that I ended up packaging up my own code in order to use pip for installation. Conveniently, in addition to the Python Package Index (where popular packages like Pandas & SQLAlchemy live), pip can also install packages from other sources like GitHub, where my code resides.
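Putting the last two ideas together, a bootstrap action can be a very short shell script. This is a hypothetical sketch (the repository URL is a placeholder, not my actual repo):

```shell
#!/bin/bash
# Hypothetical bootstrap action: each instance runs this at startup,
# after the OS is installed. Exit immediately if any command fails.
set -e

# Install my packaged code, plus its dependencies (e.g. Pandas & SQLAlchemy),
# straight from GitHub with pip:
sudo pip install git+https://github.com/username/statcast-project.git
```

Since pip resolves dependencies automatically, one line is enough to get an instance from a bare operating system to ready-to-run.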

In addition to code, I had some saved models that I wanted to load onto each machine. It turns out that Python packages can also include data files, so I simply added the models to my package. The only hang-up was that their file size was too large for Git, a version control system intended for small files like source code. To get around this, I used Git LFS (Large File Storage), a Git extension that stores the large files separately, solving the problem. The one downside is that, unlike hosting ordinary repositories on GitHub, hosting Git LFS files costs money.
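Setting that up takes just a few commands. A sketch, assuming the models live as pickle files under a models/ directory (the path and file pattern are placeholders):

```shell
# One-time setup in the repository (requires the git-lfs extension).
git lfs install                # enable the LFS hooks on this machine
git lfs track "models/*.pkl"   # route the large model files through LFS
git add .gitattributes "models/*.pkl"
git commit -m "Store trained models via Git LFS"
```

After that, `git push` and `pip install` work exactly as before; LFS quietly swaps the large files in and out behind the scenes.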

At this point, I was able to run code on a single cloud instance. While that itself is extremely useful, letting me rent computers at remote data centers, my computing tasks were becoming too large for a single machine. Renting larger and larger hardware was an option (36 virtual CPUs, anyone?), but the flexible, modern solution is to split jobs between multiple instances, known as distributed computing. One popular way to do this is Apache Spark, an open-source framework with APIs in Java, Scala, Python, and R.

While I had plenty of experience with MATLAB's parallel computing toolbox, as a good and lazy coder I wanted to avoid any and all unnecessary work, including re-writing my algorithms to work with Spark or using Spark's own MLlib (machine learning library). Because most of my models were built on the scikit-learn Python package, and my computing tasks were mostly cross-validation (explained about 2/3 of the way through this post), I was able to adapt spark-sklearn, a package developed by Databricks (the company behind Spark) that parallelizes cross-validation jobs across a Spark cluster (jargon for a group of networked computers). I updated the code for the latest release of scikit-learn (unfortunately, it hadn't been kept up to date) and added a few convenience features, all of which you can find here. The end result was so streamlined that the only code I had to change was the import statements.
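A sketch of that import swap: spark-sklearn exposes a GridSearchCV with the same name as scikit-learn's, so downstream code stays the same (in the Databricks package, the distributed version also takes a SparkContext as its first argument). The single-machine version below is runnable as-is; the dataset and parameter grid are just illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Single-machine version:
from sklearn.model_selection import GridSearchCV

# Distributed version -- same class name, so the rest of the code is unchanged:
# from spark_sklearn import GridSearchCV
# gs = GridSearchCV(sc, SVC(), param_grid)   # sc is the SparkContext

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0]}

# Cross-validate over the grid; with spark-sklearn, each fold/parameter
# combination would run on a different cluster node instead of locally.
gs = GridSearchCV(SVC(), param_grid, cv=3)
gs.fit(X, y)
print(gs.best_params_["C"] in param_grid["C"])  # True
```

This is exactly the kind of embarrassingly parallel workload that distributes well: each (fold, parameter) pair is independent, so the speedup is close to linear in the number of nodes.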

All I had left to do was acquire a cluster to run my Spark jobs. Luckily, Amazon EMR (Elastic MapReduce) provides a scalable, pre-configured cluster of cloud instances for running distributed computing frameworks like Hadoop, Spark, and others. There were lots of little gotchas along the way (failing bootstrap actions, security and permissions issues, configuration settings), but I found this post extremely helpful in the process, and I'd love to offer advice to folks going through similar challenges.


Questions | Comments | Suggestions

If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at
