Brought to you by Cambridge Spark, Data Science Specialists. Written by guest blogger, Manja Bogicevic.

In just 10 minutes, 16 players with 6 balls can produce almost 13 million data points!

The origins of football go way back, to the beginning of time, some would say. These days communities, families and friends around the world are coming together to watch the 2018 World Cup, hosted in Russia. The fascinating fact is that each game will also generate millions of data points and events. Sports such as basketball and tennis have a long history of using data. Now football is the latest sport to become data-driven.

Data-driven scouting is on the rise

One way in which football is strategically adopting data is by collecting various data points to scout for potential new signings. Arsenal paid over £2 million for the US company StatDNA [1], whose data has since been used to advise on their signings.

The data collected on players is used to build a database in which the club can search for the players with the best potential to join their team. The big profits made on these players show that using data can also have financial benefits.

Photo credit: Fauzan Saari

Betting using data

Data scientists are applying an analytical approach to betting on football matches [3], and are bringing the same mindset to advising clubs on which players would be the best fit for them to invest in.

Data scientists estimated the salaries of players based on 55 metrics (from goals scored to aggression and ball control) and compared these estimates with the actual salaries from the previous year to reveal overpaid and underpaid players [4]. This method could be used in any industry where there are identifiable attributes, in order to determine fairer wages. It could also help to address the gap between men's and women's salaries.
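
The idea of comparing modelled and actual salaries can be sketched in a few lines of Python. This is only an illustration, not the method used in [4]: it assumes a plain least-squares fit, uses two metrics instead of 55, and every player, metric value and salary below is made up.

```python
from typing import Dict, List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(features: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs: Sequence[float], feature_row: Sequence[float]) -> float:
    """Modelled salary for one player."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))


# Hypothetical data: (goals scored, ball control rating), weekly salary in £k.
players: Dict[str, Tuple[Tuple[float, float], float]] = {
    "Player A": ((20, 85), 150.0),
    "Player B": ((5, 70), 60.0),
    "Player C": ((12, 78), 140.0),
    "Player D": ((2, 65), 35.0),
    "Player E": ((15, 80), 90.0),
    "Player F": ((8, 72), 70.0),
}

coefs = fit_ols([f for f, _ in players.values()], [s for _, s in players.values()])

# Residual = actual salary minus modelled salary.
# Positive: paid more than the model suggests; negative: paid less.
residuals = {name: salary - predict(coefs, f) for name, (f, salary) in players.items()}

for name, gap in sorted(residuals.items(), key=lambda item: item[1], reverse=True):
    label = "overpaid" if gap > 0 else "underpaid"
    print(f"{name}: {gap:+.1f}k ({label} relative to the model)")
```

A positive gap only means the player earns more than this particular model expects. With so few metrics, the model could easily be missing whatever justifies the difference.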

Training with wearables

Players nowadays use wearable technologies [5], and balls are fitted with sensors [6], providing real-time performance statistics to clubs. The information collected can include distance covered on the pitch and passes completed. These insights can help managers decide who has performed well enough to earn a place in the starting eleven for the next match.
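
As a rough illustration of how a statistic such as distance covered could be derived from tracking data, here is a minimal Python sketch. It assumes the wearable reports timestamped pitch coordinates in metres. The sample format and the values are hypothetical, and real tracking systems are considerably more sophisticated.

```python
import math
from typing import List, Tuple

# Hypothetical sample format: (time in seconds, x in metres, y in metres).
Sample = Tuple[float, float, float]


def distance_covered(samples: List[Sample]) -> float:
    """Sum the straight-line distances between consecutive position samples."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(samples, samples[1:])
    )


def top_speed(samples: List[Sample]) -> float:
    """Highest speed in metres per second between two consecutive samples."""
    speeds = [
        math.hypot(x2 - x1, y2 - y1) / (t2 - t1)
        for (t1, x1, y1), (t2, x2, y2) in zip(samples, samples[1:])
        if t2 > t1  # skip duplicate or out-of-order timestamps
    ]
    return max(speeds, default=0.0)


# Made-up track: four samples taken one second apart.
track: List[Sample] = [
    (0.0, 0.0, 0.0),
    (1.0, 3.0, 4.0),
    (2.0, 3.0, 10.0),
    (3.0, 11.0, 16.0),
]

print(f"Distance covered: {distance_covered(track):.1f} m")
print(f"Top speed: {top_speed(track):.1f} m/s")
```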

Collecting data during training can also help to prevent injuries, increasing the likelihood of the team having a successful season. Hoffenheim have been playing in the German Bundesliga since 2008, and data science from their partner SAP forms a vital component of their training sessions [7].

Real-time tactics at the World Cup

Data collated from training can be used alongside statistics from previous games to aid tactical decisions before and during the match [8]. Many clubs work with data not just from their own past matches but also from their opponents' matches, allowing them to be strategic when they face those opponents.

Data science-oriented solutions are also extremely powerful during the match itself. Thanks to data scientists, coaches can receive a half-time report. Another benefit of a club becoming more data-driven is that algorithms can reveal insights that human statisticians would most likely miss.

How can a football club not rely on data?

Did you know that athletes are monitored not only by cameras in stadiums, but also by many smart devices such as heart rate sensors and even local GPS-like systems? Given the success story of the data-driven football club mentioned earlier, it is no surprise that an increasing number of clubs are following this pattern. A physiological monitoring service collects and transmits information directly from the athletes' bodies, including heart rate, distance, speed, acceleration and power, and then displays those metrics live on an iPad. All this information is available live to coaches and trainers on the sideline during training sessions, as well as after the session for in-depth analysis.

Interestingly enough, analysis of the data can help to distinguish the physically fit players from those who could use a rest. Many football leagues and clubs also collaborate with Opta, a leading provider of football data. Opta can record every single action of a player in a specific zone on the field [9], regardless of whether he has the ball or not. It can also measure the distance the player runs during the course of a game. There are more than 100 match-specific statistical categories [2], for instance shots, goals, assists, yellow and red cards, duels won and lost, and also some lesser-known categories.

It’s not uncommon to read statements from other data scientists such as “I’ve included all players who started more than 10 games, have more than 4 years in the league, didn’t miss time due to injury, and stayed with one team the entire time.” This often amounts to selecting on the dependent variable, which biases your results. Data scientists encounter this problem all the time. A lot of people don’t realise that many of the problems in sports analytics are just specific, substantive examples of commonly occurring modelling problems. I’m hoping to change this. Beyond producing summaries and estimates, we also want to express how confident we are about them.
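
To see why selecting on the dependent variable biases results, consider a small simulation in Python. It is a sketch under made-up assumptions: "skill" and "performance" are synthetic variables, the true relationship has a slope of 1, and the cut-off stands in for filters such as "started more than 10 games" that depend on the outcome being studied.

```python
import random
from typing import List, Sequence


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the least-squares line of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_SLOPE = 1.0         # assumed true effect of skill on performance
THRESHOLD = 0.5          # hypothetical cut-off applied to the outcome

# Synthetic league: performance depends on skill plus random noise.
skill: List[float] = [rng.gauss(0, 1) for _ in range(5000)]
performance: List[float] = [TRUE_SLOPE * s + rng.gauss(0, 1) for s in skill]

# Estimate using every player.
slope_all = ols_slope(skill, performance)

# Estimate using only players whose outcome cleared the cut-off.
kept = [(s, p) for s, p in zip(skill, performance) if p > THRESHOLD]
slope_selected = ols_slope([s for s, _ in kept], [p for _, p in kept])

print(f"Players kept after filtering: {len(kept)} of {len(skill)}")
print(f"Slope using all players:      {slope_all:.2f}")
print(f"Slope using selected players: {slope_selected:.2f}")
```

In the filtered sample the estimated slope comes out well below the slope estimated from all players, even though nothing about the underlying relationship has changed. The filter alone makes skill look less important than it is.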

If you want to work in data analytics in sports, go for it. It will be the next big thing. I’d also say don't get discouraged. This stuff is hard, and it takes a lot of practice and a willingness to make mistakes and be wrong before you get it right. And if I had one single piece of advice, it would be to practice matrix algebra. I’m still learning and making mistakes, but I am learning every day. I now know more than I did yesterday. If you study Python for an hour every day, you can become an expert in two years.

Until next time,

Manja

Follow me on LinkedIn, Instagram and Medium