Devin Pleuler — April 2020
February 2023 Update:
It's been almost three years since I originally published this online. I definitely was not expecting 1k+ GitHub stars on the project. Thank you!
This is a long-needed update to entirely rework the code samples, linked directly below. There have been various changes in the soccer analytics software ecosystem that have significantly changed what I believe are best practices for working with public soccer data in the Python environment. In particular, mplsoccer and kloppy are game-changing. There are also official vendor interfaces (e.g. statsbombpy) which make it easier for people to work with their data. More of this please!
Instead of separating the code into multiple notebooks, it's now centralized into a single document. It no longer requires an installation of my own custom modules and instead relies entirely on pip-installable packages. The code samples currently utilize StatsBomb and Metrica open data; both companies are owed a lot of credit for releasing some of their data for public exploration.
The code samples are also available within this GitHub repository here, which means that you can propose changes via pull requests and even file issues. My hope is that the handbook continues to gather wider community engagement and contribution.
April 2020 Original:
This is probably overdue. I believe that people who have managed to wiggle themselves into dream jobs have a responsibility to help others get there too. This was written during the depths of social isolation imposed by the COVID-19 pandemic. During this period, I've had an atypically large number of students and career changers reach out to me with questions, and a little extra free time, so I'm finally completing my assigned homework.
There are plenty of resources that cover the "how do I get a job in sports analytics" career-strategy questions, like THIS and THIS from Sam Gregory. This handbook is geared more toward the technical skills, concepts, and sports analytics history that I think are worth familiarizing yourself with.
In the handbook you can find three primary things:
- Resources and suggestions for technical skills worth having for work in soccer analytics (but can probably be extended to other sports)
- A series of tutorials delivered in Jupyter notebook format using StatsBomb Open Data, covering various data science techniques common in soccer analytics.
- Collected research and articles that I believe are required reading to get up to speed with both the history and state-of-the-art in soccer analytics.
Live long and prosper. 🖖🏻
The most important attributes for contributing to the soccer analytics landscape are a deep knowledge of the game, an ability to communicate clearly and effectively, and a bucket load of skepticism. Unfortunately, getting a job in soccer analytics is largely independent of these attributes and mostly depends on good fortune and timing. It is not a meritocracy, and I hope that changes.
But the most important technical skill once you have landed a job in soccer analytics is experience with scripting languages, preferably Python or R, as they're great for data science.
I personally prefer Python, and therefore my recommendations will be geared in that direction. My primary reasons for this suggestion are:
- Simple syntax makes it great for first-time programmers
- Excellent documentation and community support
- Most analytics departments are using it
- Plays nicely with others
- It's magic
There are a ton of great resources online for learning Python, so I'm not going to reinvent the wheel here. Here are some that look good:
- The Hitchhiker's Guide to Python
- The Python Tutorial direct from the official source: docs.python.org.
- Plenty others
Note: Starting with Python 3 (I'd suggest version 3.7+) is probably the best route at this point. Python 2 is in the painful process of being put out to pasture.
From a data science perspective, you can do just about anything worthwhile with the SciPy Stack. All of its libraries are well-supported, and easily google-able if you run into issues. As a beginner, I wouldn't stray too far from these foundational libraries. If you do, you should have a decent reason for it. Some of its important components include:
- NumPy - A fundamental library for scientific computing in Python. Particularly great for optimized vector and matrix calculations.
- Pandas - A fast data analysis and manipulation library. Its DataFrame functionality is super useful (and reminiscent of some good bits of R).
- Matplotlib is the de facto Python plotting library. It's finicky, but powerful. I've learned to love it.
I'd also suggest scikit-learn (a.k.a. sklearn), which I find very user-friendly and is built on top of the libraries mentioned above. In our tutorials, we will predominantly use the sklearn implementations.
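To give a feel for how these libraries fit together, here is a minimal sketch. It is not taken from the tutorials: the shot data is synthetic (generated with NumPy under the made-up assumption that closer shots score more often), and the column names are purely illustrative. It uses Pandas to summarize the data and scikit-learn to fit a toy one-feature "expected goals" model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic shots (NOT real data): closer shots are made more likely to score.
rng = np.random.default_rng(seed=42)
n_shots = 500
distance = rng.uniform(5, 35, size=n_shots)  # hypothetical distance from goal
true_prob = 1 / (1 + np.exp(-(1.0 - 0.15 * distance)))

shots = pd.DataFrame({
    "distance": distance,
    "is_goal": rng.random(n_shots) < true_prob,
})

# Pandas: bucket shots by distance and compare conversion rates per bucket.
shots["distance_band"] = pd.cut(
    shots["distance"], bins=[5, 15, 25, 35], include_lowest=True
)
conversion = shots.groupby("distance_band", observed=True)["is_goal"].mean()

# scikit-learn: fit a one-feature logistic regression as a toy xG model.
model = LogisticRegression()
model.fit(shots[["distance"]], shots["is_goal"])
shots["xg"] = model.predict_proba(shots[["distance"]])[:, 1]

print(conversion)
print(shots[["distance", "xg"]].head())
```

A real model would use far more than distance, but the shape of the workflow (arrays in NumPy, tables in Pandas, models in sklearn) stays about the same.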
First, it's worth explaining what varieties of soccer performance data exist in the wild. Typically, and colloquially, there are two types of data: Event Data and Tracking Data.
November 2020 Addition: It's probably now worth including Broadcast Tracking as a standalone category.
Event Data is effectively a chronological event-by-event tabulation of on-ball actions. It's typically collected from broadcast footage by third-party collectors and sold on the open market to clubs, broadcasters, the gambling industry, and even private individuals. The primary companies competing in this space are Opta (now owned by STATS Perform) and StatsBomb, but there are other competitors.
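As a rough illustration of what "event-by-event tabulation" means in code, here is a minimal sketch of an event record. The field names and values are hypothetical and do not follow any vendor's actual schema; real feeds carry many more attributes per event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """One on-ball action. Field names are illustrative, not a vendor schema."""
    minute: int
    second: int
    team: str
    player: str
    event_type: str  # e.g. "pass", "shot", "tackle"
    x: float         # pitch location of the action, in arbitrary pitch units
    y: float
    outcome: Optional[str] = None


# Made-up events, deliberately listed out of order.
events = [
    Event(0, 3, "Home", "Player A", "pass", 55.0, 40.0, "complete"),
    Event(0, 1, "Home", "Player B", "pass", 60.0, 40.0, "complete"),
    Event(0, 7, "Home", "Player A", "shot", 102.0, 38.0, "saved"),
]

# Event data is chronological, so sort by match clock before analysing it.
events.sort(key=lambda e: (e.minute, e.second))
passes = [e for e in events if e.event_type == "pass"]

print(len(passes), "passes; first event by", events[0].player)
```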
Tracking Data is an entirely different beast. Player tracking systems record the coordinate position of every player on the field (and usually the ball), many times per second. State-of-the-art systems collect up to 25 samples per second. Because these systems are expensive to install and operate, and require in-stadium hardware, this data is mostly available to the clubs themselves, but academics frequently get their hands on this data in a highly anonymized format through tediously painful research agreements. There are various competitors in this space, such as ChyronHego, Second Spectrum, STATS Perform, Metrica, Signality, and others.
The difference in scale between the two data types is enormous. A single game of Event Data features around 2-3 thousand individual events. A single game of Tracking Data represents 2+ million individual measurements.
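The back-of-the-envelope arithmetic behind those numbers can be sketched as follows. The inputs are my assumptions for illustration: 22 players plus the ball, a 25 samples-per-second system, exactly 90 minutes with no stoppage time, and 2,500 events as the midpoint of the 2-3 thousand range.

```python
# Assumptions: 22 players plus the ball, 25 Hz, 90 minutes without stoppage time.
tracked_objects = 22 + 1
samples_per_second = 25
match_seconds = 90 * 60

frames = samples_per_second * match_seconds    # 135000 frames per match
tracking_positions = frames * tracked_objects  # 3105000 (x, y) positions

events_per_match = 2500  # assumed midpoint of "2-3 thousand"
ratio = tracking_positions / events_per_match  # 1242.0

print(f"{frames:,} frames, {tracking_positions:,} positions")
print(f"roughly {ratio:,.0f}x more records than event data")
```

Under these assumptions a single match lands at around three million positions, which is consistent with the "2+ million" figure above and roughly three orders of magnitude more than event data.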
Broadcast Tracking is a new variety of data that has rapidly grown in popularity over the last couple of years. As the state-of-the-art in computer vision has progressed, collecting high-resolution tracking data from broadcast video has become a tractable problem. What is collected is not a complete data set, since it only covers what the broadcast camera shows, but the most important and relevant areas are captured. The leaders in this space appear to be SkillCorner and Sportlogiq.
The introduction of Broadcast Tracking is particularly interesting for the player recruitment theater. Since access to full-tracking data is typically limited to teams in a single league, it provides scouting departments a more complete picture of players in leagues that their team does not belong to.
StatsBomb has provided a large volume of data "freely available for public use" via their Open Data repository on GitHub in order to better serve the analytics community. We will be using this data in some of the tutorials below.
Metrica has released two matches of tracking data, which are the first examples of publicly available tracking data to my knowledge. This is a huge contribution to the soccer analytics community, and I plan on contributing some examples of how to best use tracking data.
SkillCorner has released 9 matches of broadcast tracking data as open source.
Last Row (Ricardo Tavares) has provided some tracking-like data for educational purposes on the Friends of Tracking GitHub.
How could I go so long without mentioning Jupyter? That's because it deserves its own section. I discovered Jupyter only a year or two ago, and I've become a much stronger analyst because of it.
Jupyter Notebooks are easily sharable documents that contain executable Python code alongside human-readable text for annotative purposes. They're perfect for sharing code and demonstrating concepts. We will be using these to deliver the tutorials below.
The notebooks will be hosted on Google Colab, which allows you to write, run, and share Jupyter notebooks within your Google Drive. For free!
If you're unfamiliar with Google Colab (or Jupyter), check out this introduction video.
After the programming side, I suggest gaining some experience with relational databases. In particular, I think MySQL or PostgreSQL are great places to start. Like the rest of my recommendations, they're both open source. I mention that here because you can find a ton of enterprise solutions in this area.
Understanding SQL, which has various dialects (but you really only need to know one to adequately Google the quirks between them), is important for efficiently fetching data before processing it. At some clubs, for a hire that is coming into an already functioning analytics department, this skill is possibly the most important.
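To make "fetching data before processing it" concrete, here is a minimal sketch. It uses SQLite through Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL, purely so it runs without a database server. The shots table, its columns, and its rows are all made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real MySQL/PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shots (player TEXT, team TEXT, xg REAL, is_goal INTEGER)"
)
conn.executemany(
    "INSERT INTO shots VALUES (?, ?, ?, ?)",
    [
        ("Player A", "Home", 0.25, 1),
        ("Player A", "Home", 0.5, 0),
        ("Player B", "Away", 0.125, 0),
        ("Player B", "Away", 0.125, 1),
        ("Player C", "Away", 0.5, 1),
    ],
)

# Let the database do the aggregation, then pull back only the summary rows.
query = """
    SELECT player, COUNT(*) AS shots, SUM(is_goal) AS goals, SUM(xg) AS total_xg
    FROM shots
    GROUP BY player
    ORDER BY total_xg DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for player, n_shots, goals, total_xg in rows:
    print(player, n_shots, goals, round(total_xg, 2))
```

The point is that the grouping and summing happen inside the database, so only a handful of rows ever reach Python. On a table with millions of shots that difference matters a lot.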
I use a lot of sqlalchemy, which has a little bit of a learning curve, but which I've found tremendously useful for bridging the gap between Python and SQL. And it's super cross-platform.
Don't forget Excel. It's possibly the most important piece of software ever built. Nobody is too good for Excel.
Having some data-visualization experience in your toolkit is also valuable. After Matplotlib, I would recommend:
- D3.js is highly recommended for those with even a bit of JavaScript and web development experience. The learning curve is totally worth it.
- Altair for those making the transition over from R who really miss ggplot, one of R's most redeeming qualities.
- Seaborn is a nice visualization library built on top of Matplotlib.
- Tableau is totally fine. The tradeoff between customizability and ease of use is worth it in plenty of situations. Don't be a hero.
- Don't forget how powerful conditional formatting is in Excel.
Knowing some basic version control is really important for working effectively on a data science or analytics team. Git (and GitHub) is the easy recommendation here. Also, code testing is a thing, unbeknownst to a majority of my code. I'd suggest using nose.
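For anyone who has never written a test, here is a minimal sketch of what one looks like. To keep it dependency-free it uses unittest from the standard library rather than nose, and the pass_completion_rate function is a hypothetical helper invented for this example.

```python
import unittest


def pass_completion_rate(completed: int, attempted: int) -> float:
    """Share of attempted passes that were completed (0.0 when none attempted)."""
    if attempted == 0:
        return 0.0
    return completed / attempted


class PassCompletionRateTest(unittest.TestCase):
    def test_typical_match(self):
        self.assertAlmostEqual(pass_completion_rate(45, 50), 0.9)

    def test_no_attempts_does_not_divide_by_zero(self):
        self.assertEqual(pass_completion_rate(0, 0), 0.0)


# Run the tests programmatically so this also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PassCompletionRateTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```

In a real project the test class would live in its own file and be run from the command line (for example with python -m unittest), but the idea is the same: state what the function should return, and let the machine check it every time the code changes.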
It's probably worth adding a note about IDEs (integrated development environments, i.e. what you write your code in) here for the sake of completeness. I've raved about Jupyter notebooks above, but they aren't great for larger software projects.
Personally, I enjoy using Atom (made by GitHub) because I'm apparently a glutton for punishment. A lot of people swear by PyCharm, and others love VS Code. They're all fine. It's also smart to get familiar with vim or emacs, and general bash commands. Survival skills in the command-line environment are important when you start getting into data engineering stuff.
When you eventually reach a place where you might want to put some of your analytics stuff online, but don't want to leave Python, I'd suggest using one of these web frameworks:
- Flask is an awesome lightweight framework that lets you prototype stuff easily and quickly. Great for building APIs.
- Django is a fully-featured framework that is a bit harder to use, but does a lot of hard stuff for you. Its ORM is quite similar to sqlalchemy, which is a plus.
And for the more-experienced analytics enthusiasts, I'd suggest picking up some of these:
- Apache Spark (and Databricks) for massive code parallelization across clusters.
- Numba for high performance Python.
- TensorFlow and/or Keras for deep learning (also PyTorch).
There are lots of different ways to install both Python and all these different packages. The easiest way to get up and running on your local machine is probably Anaconda. I also suggest learning how to use pip and virtual environments.
- A Framework for Tactical Analysis and ... by Sarah Rudd
- An Extension of the Pythagorean Expectation ... by Howard Hamilton
- Large-Scale Analysis of Soccer Matches ... by Alina Bialkowski et al.
- Spatio-Temporal Analysis of Team Sports – A Survey by Joachim Gudmundsson and Michael Horton
- Physics-Based Modeling of Pass Probabilities in Soccer by Will Spearman et al.
- Data-Driven Ghosting using Deep Imitation Learning by Hoang M. Le, Peter Carr, Yisong Yue, and Patrick Lucey
- Beyond Expected Goals by Spearman
- Not All Passes Are Created Equal: ... by Paul Power et al.
- Wide Open Spaces: ... by Javier Fernandez and Luke Bornn
- Decomposing the Immeasurable Sport: ... by Fernandez, Bornn, and Dan Cervone
- Modelling the Collective Movement of Football Players by Francisco José Peralta Alguacil
- Player Vectors: Characterizing Soccer Players’ Playing Style ... by Tom Decroos and Jesse Davis
- Actions Speak Louder than Goals: ... by Tom Decroos, Lotte Bransen, Jan Van Haaren, and Jesse Davis
They've created a Python library from this research. Find it in the Resources section below.
- Dynamic Analysis of Team Strategy in Professional Football by Laurie Shaw and Mark Glickman
- Ready Player Run: Off-ball run identification and classification by Sam Gregory
- SoccerMap: A Deep Learning Architecture for ... by Javier Fernandez and Luke Bornn
- A new look into Off-ball Scoring Opportunity: ... by Hugo M. R. Rios-Neto, Wagner Meira Jr., Pedro O. S. Vaz-de-Melo
- Assessing The Performance of Premier League Goalscorers by Sam Green
- Counting Across Borders by Ben Torvaney
- Defending Your Patch by Thom Lawrence
- Pass Footedness in the Premier League by James Yorke
- Messi Walks Better Than Most Players Run by Bobby Gardiner
- Game of Throw-Ins by Eliot McKinley
- Expected Threat by Karun Singh
- Passing Out at the Back by Will Gürpinar-Morgan
- The 10 Commandments of Football Analytics by Tom Worville
- Breaking Down Set Pieces ... by Euan Dewar
- Data Based Coaching: ... by Kieran Doyle
- Coaches Reward Goalscorers ... by McKinley and John Muller
Many of these are borrowed from Sam Gregory's list here. This is far from complete, and I will definitely add to it from time to time.
- Self-Supervised Representations for Tracking Data: This 2020 OptaPro Forum talk from Karun Singh represents some state-of-the-art research around autoencoders and feature extraction from tactical context.
- Fun conversation at SSAC 2019 between StatsBomb CEO Ted Knutson, Houston Rockets GM Daryl Morey, and some other guy.
- This classic 2018 OptaPro Forum talk from the effervescent Marek Kwiatkowski is one of my favorites. Suggests a mixed model approach for personalizing certain soccer metrics.
- Great talk from Thom Lawrence at the 2019 StatsBomb Innovation Conference covering approaches to Expected Possession value.
- Probably the smartest stuff I've seen on evaluation of goalkeeper performance, presented by Derrick Yam.
- This PyCon 2016 talk from Jake VanderPlas is a great crash course in doing statistics with for loops. It really provides a great perspective for those of us without an extensive background in hard statistics. Great speaker, too.
- This whole series, produced by a handful of soccer analytics experts including David Sumpter, is not-to-miss. It is probably the most comprehensive resource out there for getting started in soccer analytics. And it uses Python!
- A Python library for valuing the individual actions performed by soccer players. Includes an Expected Threat (xT) implementation. From Tom Decroos et al.
- A Python library written by Francisco Goitia to access StatsBomb data.
- A Python library for visualising soccer event data. Also by Tom Decroos.
- Not Python, but this soccer visualization library from Ben Torvaney is great.
- A Python library to convert StatsBomb's JSON data into CSV format.
- Jake VanderPlas made his entire Python Data Science Handbook and accompanying Jupyter notebooks available online. It's a tremendous resource.
- A Python library to access American Soccer Analysis data
- The Numbers Game by Chris Anderson and David Sally
- Football Hackers by Christoph Biermann
- Soccermatics by David Sumpter
I maintain a Twitter Thread of potential ideas that I think would be interesting soccer analytics projects.