Scraping & Storing Statcast Data

Thursday, June 22nd, 2017

Note: this is the second post of the multi-part series Statcast Data Science. If you'd like to start at the beginning, you can check out the introduction here.

If you'd like to follow along at home, I've published the code for this post on Github.

Newly unemployed, I had resolved to broaden my skills by learning data science techniques using open-source tools, and had even found an exciting project worth pursuing: scouring MLB's newly published Statcast data for new baseball wisdom. The next question was which tools were right for the job. As far as programming languages went, I quickly narrowed my list to two options: R and Python. Both are open-source languages, actively developed by robust communities and capable of a wide range of data science tasks. Python is a general-purpose programming language (it even ships with every Mac), considered by many to be an excellent first language for an aspiring coder, while R is explicitly intended for statistical computing and is thus less universal.

My choice was made easier after asking my friend Michael Gethers, a data scientist and former R evangelist who had recently switched to Python. Looking back on my choice to learn Python, I think I made the right decision, but I wouldn't disagree too strongly with an R proponent. My fundamental computer science skills are stronger for it (decorators, MRO, and metaclasses, anyone?), and of course, choosing the more general-purpose language was in the original spirit of broadening my technical skill-set. Perhaps most importantly, while R is widespread in the statistics community, Python is much more common for machine learning and artificial intelligence work.

After spending a few weeks working through some O'Reilly courses, puzzling through the Python Challenge, and paging through the Python Tutorial, I felt comfortable enough with the language to try scraping data from the web. Like any good (lazy) coder, I did some thorough googling in search of existing solutions, but I found plenty of folks using R and nothing in Python, so I set out to write a scraper myself. Thanks to Daren Willman's Twitter, I learned of an API for MLB's mobile app Gameday that contained launch angle, exit velocity, and hit distance for each batted ball, but because Daren had already done the work of collecting and hosting that data at his website baseballsavant, all I had to do was systematically query the Statcast Search page for all tracked data going back to 2015. With a little help from Chrome's Developer Tools, I was able to parse the URL used for downloading a CSV file, with fields for everything from the obvious (game date, venue) to the specific (pitch type, base-out state). Requests for large CSV files would time out, so I found it easiest to request data one game at a time by stepping through dates and venues.
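To make that concrete, here's a minimal sketch of the kind of query I was making. The CSV endpoint is the one behind baseballsavant's download button, but the parameter names below are approximations of what Developer Tools showed me at the time, not a definitive interface.

```python
import requests

# Endpoint behind baseballsavant's "Download CSV" button. The parameters
# below are approximations; the real URL carries many more (mostly empty)
# filter fields.
SEARCH_URL = "https://baseballsavant.mlb.com/statcast_search/csv"

def download_statcast_day(date, team):
    """Request one team's games on a single date as raw CSV text."""
    params = {
        "all": "true",
        "type": "details",     # one row per pitch, rather than aggregates
        "game_date_gt": date,  # e.g. "2016-06-22"
        "game_date_lt": date,
        "team": team,          # hypothetical venue/team filter
    }
    response = requests.get(SEARCH_URL, params=params, timeout=60)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    csv_text = download_statcast_day("2016-06-22", "SF")
    print(csv_text.splitlines()[0])  # header row: pitch type, launch angle, ...
```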


At this point, I was able to programmatically download Statcast data from baseballsavant, but the next question was how to store it all. Rather than keep every CSV as a separate local file, which would be unwieldy to process, I knew I wanted a single local SQL database that I could query and keep consistent. After going through Codecademy's three SQL classes, reading about data management best practices, and playing around with in-memory databases using Python's built-in sqlite3 module, I discovered Pandas.

Pandas is a Python package "providing high-performance, easy-to-use data structures and data analysis tools" that's commonly used by data scientists. Its two main datatypes, the 1-D Series and the 2-D DataFrame, are labeled, vector- and array-like containers modeled on R's data.frame, only with lots of useful built-in methods for data manipulation. After initially working with NumPy arrays, which are similar to MATLAB's arrays, I came to prefer Pandas datatypes for their ease of labeling, which kept my data organized and consistent. Pandas also has some convenient I/O functions for loading and saving various file formats, including CSV files and SQL databases, which I used extensively. All I had to do was wrap things up in a few for loops, catch (well, technically except) any temporary networking issues, and add some logging to keep track of the script, since it would take about a day to run through every date-venue combination.
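Putting the pieces together, the core loop looked roughly like the sketch below: read each download straight into a DataFrame, append it to a SQLite table with to_sql, log progress, and shrug off transient network errors. The table name, file path, and retry behavior here are illustrative stand-ins rather than the exact code in the repo.

```python
import io
import logging
import sqlite3
import time

import pandas as pd
import requests

logging.basicConfig(level=logging.INFO)

def store_statcast_range(dates, teams, fetch_csv, db_path="statcast.db"):
    """Loop over date/team combinations, appending each download to SQLite.

    fetch_csv(date, team) should return raw CSV text, e.g. the download
    function from the earlier sketch.
    """
    conn = sqlite3.connect(db_path)
    try:
        for date in dates:
            for team in teams:
                try:
                    csv_text = fetch_csv(date, team)
                except requests.RequestException as err:
                    # Transient networking hiccup: log it, wait, and move on.
                    logging.warning("skipping %s / %s: %s", date, team, err)
                    time.sleep(10)
                    continue
                df = pd.read_csv(io.StringIO(csv_text))
                if df.empty:
                    continue  # off day or no game at this venue
                df.to_sql("statcast", conn, if_exists="append", index=False)
                logging.info("stored %d rows for %s / %s", len(df), date, team)
    finally:
        conn.close()
```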

In a data-collecting mood, I ended up creating three more databases, one each for the game information, play-by-play data, and weather details that MLB hosts for various media purposes. There you'll find a treasure trove of JSON and XML files, documented and organized to varying degrees depending on how far into the past you're willing to go.
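As an illustration of what reading from that source looked like, the sketch below pulls one game's play-by-play XML and counts plate appearances. Both the gd2.mlb.com host and the directory layout are my recollection of the (since-retired) Gameday file server, and the game directory is a made-up example, so treat all of it as an assumption.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Legacy Gameday file server, as I remember it circa 2017; the host, the
# directory scheme, and this particular game id are assumptions for
# illustration, not a guaranteed-working URL.
GAME_URL = ("http://gd2.mlb.com/components/game/mlb/"
            "year_2016/month_06/day_22/gid_2016_06_22_anamlb_nyamlb_1/"
            "inning/inning_all.xml")

with urllib.request.urlopen(GAME_URL) as resp:
    tree = ET.parse(resp)

# Each <atbat> element holds the play-by-play result of one plate appearance.
atbats = tree.getroot().findall(".//atbat")
print(f"parsed {len(atbats)} plate appearances")
```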

Instead of copying over code from my Statcast database, I generalized it, putting most of the functionality in an abstract superclass while leaving implementation details to individual subclasses for each dataset, which you can find here, here, here, and here. Note: there's an intermediate abstract class used for the three databases from the MLB source. To be honest, most of the work went into catching all the games that were only partially documented, mostly from spring training and often split-squad.
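In broad strokes, that hierarchy looked something like the sketch below: an abstract base class owns the generic fetch-and-store loop, and each dataset's subclass fills in the details (the intermediate abstract class for the MLB-hosted datasets is left out here). Class and method names are placeholders, not the ones from the repo.

```python
from abc import ABC, abstractmethod
import sqlite3

import pandas as pd

class ScrapedDatabase(ABC):
    """Shared scraping/storage machinery; subclasses supply dataset specifics."""

    #: name of the SQL table the dataset is stored in (set by each subclass)
    table = None

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)

    @abstractmethod
    def fetch(self, date, venue):
        """Return one game's worth of data as a DataFrame, or None if absent."""

    def update(self, dates, venues):
        """The generic loop: fetch every date/venue pair and append the results."""
        for date in dates:
            for venue in venues:
                df = self.fetch(date, venue)
                if df is not None and not df.empty:
                    df.to_sql(self.table, self.conn, if_exists="append", index=False)

class StatcastDatabase(ScrapedDatabase):
    table = "statcast"

    def fetch(self, date, venue):
        # e.g. hit the baseballsavant CSV endpoint, as in the earlier sketch
        ...

class WeatherDatabase(ScrapedDatabase):
    table = "weather"

    def fetch(self, date, venue):
        # e.g. parse the weather details MLB hosts for each game
        ...
```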

At the end of it all, I had batted-ball trajectory data going back to 2015, along with pitch trajectory, play-by-play, and weather data going all the way back to 2008, stored locally with a unique ID for each game linking the databases. Now for some analysis!


Questions | Comments | Suggestions

If you have any feedback and want to continue the conversation, please get in touch; I'd be happy to hear from you! Feel free to use the form, or just email me directly at matt.e.fay@gmail.com.
