Football modelling and expected goals.

Introduction

Over the past few years, football modelling has become a standard tool for clubs pursuing success. Clubs are seizing every opportunity to improve the performance of their first team, and advances in mathematics and machine learning now make it possible for them to apply data science to their analysis. Team and player evaluation is essential for gaining an edge in the competitive world of football.

Nowadays, the metrics offered by data companies such as OPTA are detailed and sophisticated. Every event on the pitch is recorded manually, together with additional variables, and in this way the quality of chances created and conceded can be estimated. That detailed data provided the foundation for the development of so-called “Expected Goals” models.

The probability of a shot being converted is called “Expected Goals” (xG for short), and this metric is now widely adopted in football analysis. Expected goals can help you understand what happened in a match, e.g. does the actual score reflect the quality of chances created? The same metric can tell you more about a team’s performance and a player’s form than other measures such as goal difference and points accumulated.

Toolbox

Reasonable analytics work is possible with basic computing and statistical knowledge. In football modelling, however, reasonable is not enough: you need to excel in areas such as programming, applied mathematics, database design and research.

A well-organized and up-to-date database is critical to your football analysis. You need to be able to extract and prepare datasets for analytics in the quickest possible way. This is probably the most time-consuming process when modelling football and having an easy-to-query data store is simply gold.

You can choose between two industry standards: SQL and NoSQL databases.

In a nutshell, SQL databases require you to define schemas in advance that determine the structure of your data. The information flowing in from your source must follow that structure, so significant up-front preparation is required. Skipping this step will cause all sorts of trouble in the long term! Industry-leading SQL databases include MySQL and PostgreSQL.
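
As a minimal sketch of what a pre-defined schema looks like, the snippet below uses Python's built-in sqlite3 module, chosen here only because it needs no server; MySQL or PostgreSQL would use broadly similar SQL. The table name shots, its columns and the three rows are hypothetical and made up for illustration, not taken from any data provider.

```python
import sqlite3

# In-memory database for illustration; a real store would live on disk or in the cloud.
conn = sqlite3.connect(":memory:")

# The schema is declared up front: every row must fit these (hypothetical) columns.
conn.execute(
    """
    CREATE TABLE shots (
        shot_id   INTEGER PRIMARY KEY,
        match_id  INTEGER NOT NULL,
        player    TEXT    NOT NULL,
        x         REAL    NOT NULL,
        y         REAL    NOT NULL,
        body_part TEXT    NOT NULL,
        is_goal   INTEGER NOT NULL CHECK (is_goal IN (0, 1))
    )
    """
)

# Made-up rows, purely to show the insert/query round trip.
rows = [
    (1, 100, "Player A", 94.0, 34.0, "foot", 1),
    (2, 100, "Player B", 80.0, 20.0, "head", 0),
    (3, 101, "Player A", 88.0, 40.0, "foot", 0),
]
conn.executemany("INSERT INTO shots VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Number of shots and goals per player.
summary = conn.execute(
    "SELECT player, COUNT(*), SUM(is_goal) FROM shots GROUP BY player ORDER BY player"
).fetchall()
print(summary)
```

Because the columns are declared NOT NULL, an incomplete row is rejected by the database instead of silently entering your data store, which is the kind of discipline the up-front preparation buys you.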

NoSQL databases, on the other hand, allow you to populate your data store without defining a schema, and you can add fields as you go. Examples of NoSQL databases include MongoDB, Bigtable and Redis.

It is possible to run your database on your personal computer, but the better option is to set up a cloud service that will have auto back-ups and protect you from power failures. Moreover, this will allow you to access your filtered statistics from anywhere.

Programming is an essential part of the football modelling process. Coding is needed when you are populating your database, extracting the data in a specific form and reporting the results of the analysis. Connecting to football data APIs and web scraping both require decent programming knowledge. Today there are many educational tools available online that will allow you to learn coding in a quick and straightforward way.

There are two main languages used in the data science community: Python and R.

Python is usually regarded as a general-purpose language with an easy-to-understand syntax. R, on the other hand, was developed specifically for statistical analysis. Python, together with the Anaconda distribution, seems to be the better choice here for data science. If you put in the effort to learn it to a high standard, you will be able to do miracles on the data science front.

Regardless of the programming language you choose, version control software is essential. It allows you to move between different versions of your code without any hassle and to track the progress of your work. Git is the industry standard here, and hosting services built around it offer free or paid accounts that can be set up in no time.

Football modelling is a complex process that requires knowledge of advanced statistics in order to derive robust insights. Simple counting and averaging will not put you on a level with professional betting syndicates and punters. Learning statistics on your own is a long and difficult process, so attending an online course will be more efficient. Generalized linear models (GLMs) are a comprehensive family of statistical methods that will quickly improve your football analysis.

The main idea behind your data science toolbox is to build a platform that will follow ordered processes leading to efficient football analysis.

Expected Goals

Nowadays, football is described and studied almost entirely through on-the-ball event types such as “tackles”, “big chances”, “aerial duels” and “dribbles”. These events are part of the game, but none of them is the fundamental unit of analysis; the main established concept at present is expected goals.

In simple terms, expected goals assigns a value to the chance of a shot resulting in a goal. A model takes data from thousands of shots and classifies them by factors such as distance, angle, type of shot, pattern of play, assist type, the number of defenders between the shooter and the goal, possession chains, body part and more. All these factors, and others depending on the football model, are used to produce the percentage chance of a shot becoming a goal.

Expected goals are usually expressed as a number between 0 and 1, with 1 being a certain goal. An expected goals value of 0.2 means that, on average, one out of every five such shots will result in a goal.
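
To make the arithmetic concrete, here is a small sketch of how per-shot values combine. The four xG values are invented for illustration, and treating the shots as independent of each other is a simplifying assumption of this example.

```python
import math

# Hypothetical xG values for one team's shots in a single match.
shot_xg = [0.2, 0.05, 0.5, 0.1]

# The expected number of goals is the sum of the per-shot probabilities (about 0.85).
total_xg = sum(shot_xg)

# Probability that none of the shots goes in, assuming independent shots.
p_no_goal = math.prod(1.0 - p for p in shot_xg)

# Probability of scoring at least once (about 0.658).
p_at_least_one = 1.0 - p_no_goal

print(round(total_xg, 3), round(p_at_least_one, 3))
```

The sum answers the question of how many goals the chances were worth, while the product shows how likely the team was to score at all.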

Some of the attributes are common to all expected goal models – i.e. shot distance and angle. Other parameters depend on the richness of the data and include variables such as assist type, pattern of play, possession chain. Discovering which parameters are essential in your model is the key to further development and success.
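
One common way to turn such attributes into a probability is a logistic regression, which is one type of generalized linear model. The sketch below is only illustrative: the function name xg_logistic and the coefficient values are hypothetical placeholders that have not been fitted to any data. In practice the coefficients would be estimated from thousands of recorded shots.

```python
import math

def xg_logistic(distance_m, angle_rad, b0=-1.0, b_dist=-0.1, b_angle=1.0):
    """Logistic form: p = 1 / (1 + exp(-(b0 + b_dist * distance + b_angle * angle))).

    The default coefficients are hypothetical placeholders, not fitted values.
    """
    z = b0 + b_dist * distance_m + b_angle * angle_rad
    return 1.0 / (1.0 + math.exp(-z))

# Two made-up shots: one close with a wide view of goal, one distant with a narrow view.
close_range = xg_logistic(distance_m=6.0, angle_rad=1.0)
long_range = xg_logistic(distance_m=25.0, angle_rad=0.3)

print(close_range > long_range)
```

The logistic function keeps every prediction strictly between 0 and 1, which is why it is a natural fit for a per-shot probability. Richer attributes such as assist type or pattern of play would enter as additional terms in the same sum.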

Let’s go over some of the attributes and explain their influence on expected goals.

Distance/Angle:

Shot location data provides coordinates (x,y) of the start and end position of each shot/event on the field and it is by far the most important predictor. These values can be used to calculate shot distance and angle. A shot closer to the goal has a greater chance of being converted than one further away.

Angle is another important metric, because a shot taken from directly in front of the goal has a greater chance of being converted than one from a tight angle. Two lines can be drawn from the shot location to each post, and the angle between these lines reflects the view of the goal that the player has. Distance and angle are standard parameters when calculating expected goals per shot.
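
The geometry can be sketched as follows. The coordinate system is an assumption of this example: positions are in metres, the centre of the goal line is the origin, x is the distance from the goal line and y is the sideways offset from the centre of the goal. Real data providers use their own coordinate scales, so a conversion step would be needed first.

```python
import math

GOAL_WIDTH_M = 7.32  # distance between the posts

def shot_distance(x, y):
    """Straight-line distance from the shot location to the centre of the goal."""
    return math.hypot(x, y)

def shot_angle(x, y):
    """Angle in radians between the lines from the shot location to each post."""
    half = GOAL_WIDTH_M / 2
    # Vectors from (x, y) to the posts at (0, -half) and (0, +half):
    # |cross product| = GOAL_WIDTH_M * x, dot product = x^2 + y^2 - half^2.
    return math.atan2(GOAL_WIDTH_M * x, x * x + y * y - half * half)

# A central shot from 11 m compared with a shot from 11 m out but 15 m wide.
central = shot_angle(11.0, 0.0)
wide = shot_angle(11.0, 15.0)
print(round(math.degrees(central), 1), round(math.degrees(wide), 1))
```

Both shots are the same distance from the goal line, yet the wide one sees a much narrower slice of the goal, which is why distance and angle are usually used together.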

Play type:

There are data sets that describe shots as the result of set-pieces, through balls, crosses, corners, dribbles or key passes. Some of these events are more likely than others to produce a shot or a goal, depending on the team and league being analyzed.

For example, through balls eliminate one or more defenders and increase the scoring chance. The pass that follows a through ball is even better, as it can eliminate further players along with the goalkeeper. Crosses are an efficient way to create goal chances, but they do not necessarily create quality attempts. Dribbles have a smaller effect than through balls, but they still eliminate at least one defender and increase the odds of scoring.

These attributes depend on your data source and if available, it is a good idea to include them in your model based on your analysis.

Body part:

Some data sources include the body part used for the shot, i.e. left foot, right foot or head. In some situations, certain body parts are more likely to be used, e.g. headers from crosses or corners. In general, shots taken with the foot are more likely to be converted than headers.

Competition/Country:

It is a no-brainer that shot conversion rates depend on the quality and characteristics of the competition. Analyzing each tournament and adjusting your model for it will definitely give you an advantage.

Big Chance:

A “big chance” is an attribute assigned when a player has a one-on-one chance against the goalkeeper. More generally, data companies like OPTA add this flag when the player is reasonably expected to score. The attribute has a significant impact on expected goals, and it is the perfect example of data being supplemented by human judgement. It is an important attribute to note and include in your model.

Assist:

There are attempts that are assisted and ones that are not. These assists can be intentional or not. The unintentional assists lead to a shot, but these weren’t meant to provide a scoring chance. When a player deliberately makes a choice to allow a team mate to shoot the ball on goal, then we note this attempt down as an intentional assist. Intentional assists are very important for your expected goal model as they illustrate a quality attempt.

Game State:

Creating goal scoring chances when 1 or 2 goals down is very different from creating them when no goals have been scored or when a team is leading, because teams defend differently in open play depending on the score line. This is another factor that should be taken into account when dealing with expected goals.

The impact defense has on xG:

Defensive positioning, which reduces your opponent’s chance of scoring, is just as important. For example, defenders can force a player to shoot in a different way or make last-minute adjustments to their movement that make it harder to score.

Analyzing the entire attacking process, from the creation of a chance to the final action, and accounting for the proximity of defenders and their influence on the quality of shots adds another level of detail to expected goals modelling.

This means that looking at where the goalkeeper and defenders are positioned in relation to where a shot is taken from could produce the most accurate expected goals output of all.

Team/Player Performance

Player performance can be understood easily by comparing the goals a player scored over a full season with the chances available to him, as measured by expected goals. If the player’s goals are significantly above his expected goals, this might be a sign of an unsustainable run. Expected goals can also tell us more about the player’s shot selection: we can find out whether a player is taking high-quality shots by looking at his average expected goals per shot.

Two good examples here are Jamie Vardy and Roberto Firmino in the 2017-2018 EPL season.

As we can see from the table below (courtesy of www.understat.com), Vardy and Firmino outperformed their season expected goals tally, which tells us that they converted more opportunities that had a lower probability of resulting in a goal. Essentially, this tells us about the strikers’ positioning and shot quality.

These expected goals projections can help us show the real performance of a team that might be under- or over-performing relative to the actual number of goals it is scoring.

A significant player/team insight from expected goals emerges when the xG values are plotted for different locations on the field. This can easily show, for instance, that a player often shoots from his favorite spot on the field but never scores. If the probabilities of these goal scoring opportunities are high, the player is evidently doing something wrong in these cases, and his actions could be analyzed in more detail. If, however, the probabilities are low for a part of the field from which the player shoots very often, someone could point out to him that shooting might not be the best decision there.

Another example comes from the case where players, especially strikers, score many goals in one season. It could, however, be the case that such a player scored a lot but had a much lower expected goals total. This could suggest that the player was lucky during that season.

If a player surpasses his expected goals for a few games and does not have a notable history of being a prolific goal scorer, he is probably on a hot streak that will not last forever.

But someone like Harry Kane, who scores more goals than the chances he gets would suggest, is clearly just better in front of goal than the average player and, moreover, better at being in the right place. In the 17/18 EPL season he scored 30 goals from 26.86 expected goals, a slight overperformance. Another such player, with an impressive tally of 32 goals, is Mohamed Salah, who had 25.14 xG and clearly did much better. He was the only player with a goals/xG difference as high as 6.86, and he deservedly became the EPL top goal scorer.
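
The comparison is simple arithmetic, sketched below with the goal and xG figures quoted in the text. The dictionary layout and the helper name overperformance are choices made for this example only.

```python
# Goals and xG for the 2017/18 EPL season, as quoted in the text.
players = {
    "Harry Kane": {"goals": 30, "xg": 26.86},
    "Mohamed Salah": {"goals": 32, "xg": 25.14},
}

def overperformance(goals, xg):
    """Goals scored minus expected goals; positive means more goals than the chances suggest."""
    return goals - xg

diffs = {
    name: round(overperformance(stats["goals"], stats["xg"]), 2)
    for name, stats in players.items()
}

for name, diff in diffs.items():
    print(f"{name}: {diff:+.2f}")
```

A single season's difference cannot separate skill from luck on its own, so it is worth tracking the same figure across several seasons before drawing conclusions.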

This can help clubs make decisions in many ways. For example, expected goals can identify players who are good at getting into goal scoring positions before they have started scoring a quantity of goals that actually makes teams take notice.

Of course, more research has to be performed on that player’s performance, but the expected goals indicator could be a useful tool in player acquisition.

Predicting match outcome

This section is for those who have more than an academic interest in predicting football results. Nowadays, expected goals models are used extensively by sports betting syndicates, professional punters and bookmakers. As noted above, there is no perfect formula for an expected goals model that works flawlessly. There are loads of performance metrics waiting to be exploited to create a better view of teams’ or players’ recent performance. Betting syndicates are constantly improving their xG models in order to find the value they need for their sports betting investment. Placing 400-500k GBP in bets, spread over 3 or more Asian handicap lines per football match, is daily business for the world’s top betting syndicates. Moreover, a 4-5% yield on their yearly betting turnover is standard.

Your expected goals model comes in handy here for the simple reason that it suggests the expected goals for team A and team B. The more accurate the model is, the more likely you are to find value bets. To put it simply, your betting profit depends on your model’s capacity to forecast accurate match score lines.

We already have estimated expected goals numbers for teams A and B, and we can use them to generate the odds for home, draw and away in that particular match. Goals in football matches approximately follow a Poisson distribution. The Poisson distribution lets you calculate the probability of each possible score line in the match if you have the xG for each team to hand. This means that if we assume team A will score 1.2 goals on average, the Poisson distribution gives the probability of team A scoring exactly 0, 1, 2, 3 goals, etc. From here we can also derive the probability of team A scoring more goals than team B.

Microsoft Excel has an easy implementation of the Poisson Distribution in the following formula:

=POISSON(x, mean, cumulative)

In the above formula, “x” is the exact number of goals whose probability we want to find. The “mean” is the expected goals value produced by our expected goals model. Finally, the “cumulative” argument has to be set to FALSE, which makes POISSON return the probability that a random variable takes a value exactly equal to x.
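
The same calculation can be sketched in Python using only the standard library. The helper name poisson_pmf is an assumption of this example; it plays the role of the Excel formula with cumulative set to FALSE. The mean of 1.2 is the team A figure used earlier.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals when the average is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

team_a_xg = 1.2

# Probability of team A scoring exactly 0, 1, 2 or 3 goals:
# roughly 0.301, 0.361, 0.217 and 0.087.
probs = {k: poisson_pmf(k, team_a_xg) for k in range(4)}

for goals, p in probs.items():
    print(goals, round(p, 3))
```

The four values do not add up to 1 because scores of four or more goals still carry some probability.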

Once we have calculated the probability of each team scoring 0, 1, 2, 3 or more goals, we can find the chances of the match finishing as a home win, a draw or an away win. We do that by estimating the probability of every possible result (e.g. 1-0, 2-0, 1-1, 1-2, etc.) and adding up the results that belong to each outcome.
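
A minimal sketch of that summation is shown below. It assumes that the two teams' goal counts are independent Poisson variables, which is a simplification, and it truncates the score grid at 10 goals per team. The function names and the example xG values of 1.5 and 1.1 are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals when the average is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def match_probabilities(home_xg, away_xg, max_goals=10):
    """Return (home win, draw, away win) probabilities from a grid of score lines."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            # Independence assumption: joint probability is the product.
            p = poisson_pmf(h, home_xg) * poisson_pmf(a, away_xg)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away

# Hypothetical expected goals for the home and away team.
p_home, p_draw, p_away = match_probabilities(1.5, 1.1)
print(round(p_home, 3), round(p_draw, 3), round(p_away, 3))
```

Score lines above 10 goals are ignored, so the three probabilities add up to very slightly less than 1; for typical football xG values the missing mass is negligible.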

All this gives us calculated home, draw and away odds, from which we can also create Asian handicap and total goals markets with the help of xG. Here comes the fun part, where you have to find odds that are better than your projections and bet on them. The logic is that the odds offered by the bookmaker are significantly higher than your own, meaning the bookmaker has mispriced the outcome and you are taking advantage of it. This is the whole logic behind value betting: comparing the odds from your football betting model with those of a bookmaker and finding where the value is. Professional players do this on an hourly basis, and being first in this part of the betting industry is key.
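
The comparison can be sketched as below, assuming decimal odds. The function names, the model probability of 0.5 and the bookmaker price of 2.2 are all hypothetical values chosen for illustration.

```python
def fair_odds(probability):
    """Decimal odds at which a bet with this win probability breaks even."""
    return 1.0 / probability

def expected_value(model_probability, bookmaker_odds, stake=1.0):
    """Average profit per bet if the model probability is correct."""
    return stake * (model_probability * bookmaker_odds - 1.0)

def is_value_bet(model_probability, bookmaker_odds):
    """True when the bookmaker's odds are higher than the model's fair odds."""
    return bookmaker_odds > fair_odds(model_probability)

# Hypothetical example: the model gives the home win a 50% chance (fair odds 2.0),
# while the bookmaker offers 2.2.
model_p = 0.5
offered = 2.2
print(fair_odds(model_p), is_value_bet(model_p, offered))
print(round(expected_value(model_p, offered), 2))
```

The calculation is only as good as the model probability that goes into it: if the xG model is wrong, an apparent value bet is simply a losing bet.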

Conclusion

Expected goals is a complicated attribute that has become an irreplaceable part of football analytics. It explains what has happened in a match better than other parameters, including goals scored, but there are still some significant limitations. For example, expected goals models don’t capture dangerous phases of play that don’t end in shots.

We do have to address again the problem of data acquisition. The top data companies use special panoramic cameras in order to obtain such a detailed level of statistics for each match. This is why the data is very expensive, which is a problem for the average data scientist. Some analysts resort to scraping the data from free data websites such as www.whoscored.com and www.squawka.com. Having a well-organized and up-to-date data source is probably the most important part of your expected goals adventure.

Obviously, any expected goals model has limitations when it comes to subjective factors such as unrest in the squad, new managers, injuries to top players, or a hard and demanding weekly schedule. That kind of information can be easily researched and noted in order to adjust the expected goals model.

If you manage to make a note of these non-stats factors and adjust your xG model, and combine it with well-formulated and accurate attack and defense ratings for each team, you will have a complex match performance metric that will help you with your analysis and value hunting.

The expected goals concept has been extended to other attributes in soccer, from assists to saves to passes and even defensive actions. One thing is certain: expected goals deserves its place in the toolbox of football analytics.