Machine Learning in sports betting

The global betting market grows every year as a direct result of consumer demand driven by advances in technology. Betting operators direct a significant part of their investment toward machine learning methods that have shown promising results in prediction. This investment takes the form of prediction models built in-house or services bought from specialized companies that provide very accurate probabilities for sports events. Given the large worldwide betting turnover, a betting operator needs to keep improving the accuracy of its sports predictions.

These models are based on detailed data. Indicators such as player performance, player location stats, expected goals, expected assists, sequence and possession, and defensive coverage all contribute to the game’s prediction process. As detailed data expands, more statistical metrics will be created and better predictive models developed.

In addition, sports teams, managers, professional betting syndicates and pro punters are turning to machine learning techniques to better understand and formulate the strategies necessary for accurate predictions.

In general, machine learning allows computer systems to learn directly from examples, data, and experience. The advantage is that these systems no longer have to follow pre-programmed rules and can carry out complex processes by learning from the data. With the increase in detailed data and computer processing power, machine learning systems can be trained on a large pool of examples. Machine learning can support potentially transformative advances in a range of areas, and the social and economic opportunities that follow are significant. As mentioned above, in betting, machine learning is helping bookmakers, teams and professional punters build better predictive algorithms and is offering new insights into more accurate predictive models.

There are three types of machine learning algorithms: supervised learning, unsupervised learning and reinforcement learning.

In a nutshell, supervised learning consists of a target variable which is to be predicted from a given set of predictors. The training process continues until the model achieves a desired level of accuracy on the training data. Regression, decision trees, random forests, KNN and logistic regression are examples of supervised learning.

With unsupervised learning, we don’t have an outcome variable to estimate. Patterns are based only on input data. Many unsupervised learning techniques are a form of cluster analysis, in which you group data items that have some measure of similarity based on characteristic values.

Reinforcement learning allows the machine to train itself continually using trial and error. By learning from past experience, it tries to capture the best possible knowledge and make accurate decisions.

The historical performance of teams, match results and players’ statistical indicators and metrics are used in such algorithms in order to create match probabilities and decide whether to bet on a certain match, given the bookmakers’ odds. We will briefly explain the above-mentioned algorithms and provide examples where possible.
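The decision of whether to bet, given a model's probability and the bookmaker's odds, can be sketched as a simple expected-value check. This is a minimal illustration, not a staking strategy: it assumes decimal odds, ignores the bookmaker's margin, and the function names and numbers are hypothetical.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds


def expected_value(model_probability, decimal_odds, stake=1.0):
    """Expected profit: win (odds - 1) * stake with probability p, otherwise lose the stake."""
    win_profit = (decimal_odds - 1.0) * stake
    return model_probability * win_profit - (1.0 - model_probability) * stake


def should_bet(model_probability, decimal_odds):
    """Bet only when the model's probability makes the expected profit positive."""
    return expected_value(model_probability, decimal_odds) > 0.0


# Hypothetical numbers: the model gives the home team 55%, the bookmaker offers 2.00.
model_probability = 0.55
odds = 2.00
print(implied_probability(odds))                # probability implied by the odds
print(expected_value(model_probability, odds))  # expected profit per unit staked
print(should_bet(model_probability, odds))
```

A bet has positive expected value only when the model's probability is higher than the probability implied by the odds.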

Linear regression helps you establish a relationship between independent and dependent variables by fitting a best-fit line. This helps you figure out how attributes correlate with each other and what their relationship looks like. The best-fit line (also known as the regression line) is given by the linear equation Y = a*X + b. Knowing the coefficients a (the slope) and b (the intercept) lets you estimate the dependent variable Y for a given value of X. An example is finding the relationship between an NBA team’s score difference and the time each of the team’s players spent on court with one another (Check our Introduction to basketball models and metrics). The score difference here is the dependent variable.
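As a minimal sketch, the coefficients a and b of Y = a*X + b can be found with ordinary least squares. The data below (minutes a lineup played together and the resulting score difference) is made up for illustration, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns slope a and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: minutes played together (X) and score difference (Y).
minutes = [5.0, 10.0, 15.0, 20.0]
score_diff = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(minutes, score_diff)
predicted = a * 12.0 + b  # estimated score difference for 12 minutes together
print(a, b, predicted)
```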

Logistic regression is used to estimate discrete values based on a given set of independent variables. It is also known as logit regression because it predicts the probability of an event happening by fitting data to a logit function. For example, in baseball, a logistic model can use a binomial response variable, such as whether a team makes it to the playoffs, with contributing factors such as the number of runs scored and the total number of strikeouts pitched during the regular season. Check out our Methods and indicators for baseball modelling.
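The idea can be sketched with a single predictor: the model passes a linear score through the logistic (sigmoid) function to get a probability, and the coefficients are fitted by gradient descent on the log loss. The feature (a standardized runs figure) and the data are hypothetical, and gradient descent is only one of several ways to fit such a model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: standardized runs scored and whether the team made the playoffs.
runs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
made_playoffs = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(runs, made_playoffs)
p_strong = sigmoid(w * 2.0 + b)   # playoff probability for a high-scoring team
p_weak = sigmoid(w * -2.0 + b)    # playoff probability for a low-scoring team
print(p_strong, p_weak)
```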

Decision trees are a type of supervised learning and are mostly used in classification problems. They work for both categorical and continuous input and output variables. Decision tree output is easy to understand, and it doesn’t require much statistical knowledge to read and interpret. A decision tree is one of the fastest ways to find the most significant variables and the relation between two or more variables. Decision trees have been used experimentally to predict sports results. One person used a decision tree model to predict the winner of the Stanley Cup 2011 Western Conference. The model concluded that if the Vancouver Canucks restricted Tampa Bay to fewer than 2.5 goals, they had a 93% chance to win.
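A rule such as "fewer than 2.5 goals conceded" is the kind of split a decision tree learns. The sketch below finds a single split (a decision stump, the building block of a tree) by minimizing weighted Gini impurity. The data is invented, and Gini is an assumed criterion, since the model described above does not state which one it used.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return the threshold with the lowest weighted Gini impurity, and that impurity."""
    distinct = sorted(set(values))
    best_threshold, best_score = None, float("inf")
    # Candidate thresholds are midpoints between neighboring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical data: goals conceded per game and whether the team won (1) or lost (0).
goals_conceded = [0, 1, 1, 2, 2, 3, 3, 4]
won = [1, 1, 1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(goals_conceded, won)
print(threshold, impurity)
```

A full tree repeats this search recursively on each side of the split.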

Decision trees can be more useful than classic techniques such as regression and support vector machines (SVMs) when it comes to predicting future sports performance. The relationships between different variables in sports are very complex, and regression generally cannot recognize them quite as well as decision trees can. Regression also makes it difficult to determine whether there is simply correlation or whether there is causation. Decision trees are better at discarding information that is essentially useless. For example, a decision tree can be used to classify players as good when their FIFA rating is over 70.

Support Vector Machines (SVMs) are models used for data classification. They can analyze data sets and identify patterns that can then be used to forecast classes for new data points. In this algorithm, a line is drawn between two classified groups of data so that it lies as far as possible from the points of each group that are closest to one another. SVMs can handle non-linear data and calculate probabilities rather than just output binary predictions. SVMs provide a viable approach for the calculation of expected goals. More about expected goals can be read here: Football modelling and expected goals.

Naive Bayes is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. For example, if you take attributes such as rain, pitch size and throw-ins to predict the match winner in soccer, you would assume that all three attributes contribute independently to the probability of the match winner. Even if these stats are related in some way, the model naively treats them as unrelated.

An advantage of Naive Bayes classifiers is that they are highly scalable when presented with large amounts of data. Also, despite its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods in some cases.
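The independence assumption shows up directly in code: the score for each class is the class prior multiplied by one likelihood per feature. The sketch below is a small categorical Naive Bayes with Laplace smoothing. The match data, feature values and function names are all hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count how often each class occurs and how often each feature value occurs per class."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            feature_counts[label][index][value] += 1
    return class_counts, feature_counts


def predict_proba(row, class_counts, feature_counts, n_values=2):
    """Posterior per class, multiplying feature likelihoods as if they were independent."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for index, value in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing out the product.
            score *= (feature_counts[label][index][value] + 1) / (count + n_values)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Hypothetical matches: (weather, pitch size) and the winner.
rows = [("rain", "small"), ("rain", "small"), ("dry", "large"),
        ("dry", "large"), ("rain", "large"), ("dry", "small")]
winners = ["home", "home", "away", "away", "home", "away"]

class_counts, feature_counts = train_naive_bayes(rows, winners)
probabilities = predict_proba(("rain", "small"), class_counts, feature_counts)
print(probabilities)
```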

k-Nearest Neighbors can be used for both classification and regression problems. In general, it classifies new cases by a majority vote of their k nearest neighbors: a case is assigned to the class that is most common among its k nearest neighbors, as measured by a distance function. As an example, k-Nearest Neighbors is used to evaluate soccer talents for suitable positions, considering their skills and characteristics.
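The majority vote can be sketched in a few lines, assuming Euclidean distance as the distance function. The player ratings (pace, passing) and position labels are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Assign the query to the most common class among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical players described by (pace, passing) ratings.
players = [(90, 60), (88, 65), (85, 62), (60, 90), (65, 88), (62, 85)]
positions = ["winger", "winger", "winger", "midfielder", "midfielder", "midfielder"]

fast_player = knn_predict(players, positions, (87, 63))
passing_player = knn_predict(players, positions, (63, 87))
print(fast_player, passing_player)
```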

K-Means groups a given data set into a certain number of clusters. Clustering is a technique for finding similarity groups in data, called clusters. It attempts to group individuals in a population together by similarity, but is not driven by a specific purpose. To run a k-means algorithm, you first randomly initialize k points called centroids, one per cluster; for example, three centroids if we want to group the data into three clusters. The K-means algorithm then repeats two steps: cluster assignment and move centroid.
In the cluster assignment step, the algorithm goes through each of the data points and assigns it to the cluster whose centroid is closest.
In the move centroid step, K-means moves each centroid to the average of the points in its cluster. In other words, the algorithm calculates the average of all the points in a cluster and moves the centroid to that average location.
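The two steps can be sketched as follows. To keep the example reproducible, the starting centroids are fixed by hand rather than chosen at random, and the points are made up. The loop stops once the centroids no longer move.

```python
import math


def assign_clusters(points, centroids):
    """Cluster assignment step: index of the closest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
        for point in points
    ]


def move_centroids(points, assignments, centroids):
    """Move centroid step: each centroid becomes the average of its assigned points."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == i]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its centroid
            continue
        new_centroids.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(len(old)))
        )
    return new_centroids


def k_means(points, centroids, max_iterations=100):
    """Repeat both steps until the centroids stop moving."""
    for _ in range(max_iterations):
        assignments = assign_clusters(points, centroids)
        updated = move_centroids(points, assignments, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assignments


# Hypothetical two-dimensional data forming three groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
start = [(0, 0), (10, 10), (0, 10)]  # fixed here; normally chosen at random

centroids, assignments = k_means(points, start)
print(centroids)
print(assignments)
```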

The fundamental component of the Random Forest learning algorithm is the decision tree. As mentioned above, decision trees are capable of fitting complex datasets and performing both classification and regression tasks. A random forest is an ensemble of decision trees that are trained, most of the time, with the “bagging” method. The idea behind this method is that a combination of learning models improves the overall result. Random forests are a good choice at the first stage, when you don’t yet know the underlying model, or when you want to build a decent model in a short time, because they have very few parameters to tune and can be used quite efficiently with default parameter settings.
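The bagging idea can be illustrated with very small trees. The sketch below trains one-feature decision stumps on bootstrap samples (drawn with replacement) and combines them by majority vote. It is a simplification and not a full random forest, which would also grow deeper trees and sample features at each split. The data and names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold on a single feature that misclassifies the fewest samples."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def fit_bagged_stumps(xs, ys, n_estimators=25, seed=0):
    """Train each stump on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        indices = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in indices], [ys[i] for i in indices]))
    return stumps


def predict_bagged(stumps, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, x) for stump in stumps)
    return votes.most_common(1)[0][0]


# Hypothetical data: a team strength rating and the match result.
ratings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
results = ["loss"] * 5 + ["win"] * 5

ensemble = fit_bagged_stumps(ratings, results)
print(predict_bagged(ensemble, 1), predict_bagged(ensemble, 10))
```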

Dimension reduction describes the process of converting a set of data with many dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely. These techniques are used in machine learning problems to obtain better features for a classification or regression task. The benefits are data compression and a reduction in the time needed to perform the same computations.

Boosting algorithms are used when we have plenty of data to make a prediction. Boosting is an ensemble method that combines the predictions of several base estimators in order to improve robustness over a single estimator. XGBoost is a boosting algorithm that offers both a linear model and a tree learning algorithm and can run parallel computations on a single machine.

Machine learning has been applied to sports betting for a while now, and companies like Stratagem are using the above-mentioned methods in their prediction models. Stratagem’s mission is very simple: they build betting models, look for patterns and make money out of them. The company uses human analysts to follow and analyze matches around the globe, adding valuable detailed information to its in-house models and improving their accuracy. As an example, Stratagem already uses machine learning to analyze its data (finding the best time to place a bet), but it is also developing AI tools that can analyze sporting events in real time, pulling out data that can help with match winner predictions. The company has already moved on to using deep neural networks to predict match outcomes. Because of the amount of data available nowadays, its in-house software tries to absorb as much data as possible and find the needed patterns via failure and success, the end goal being an AI that can manage multiple events simultaneously and extract insights during that process.

Artificial Neural Networks (ANNs) are one of the most common machine learning approaches to sports betting prediction. They consist of interconnected components that transform a set of inputs into a desired output. The ANN’s power comes from the non-linearity of the hidden neurons and from adjusting the weights that contribute to the final decision. The main step is to use the features (contained in the processed training dataset) to build the ANN classification model. In other words, the weights associated with the interconnected components are continuously adjusted during training, and this contributes to higher predictive power. An appealing feature of ANNs is that they are flexible in terms of defining the class variable.
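How inputs flow through the hidden neurons to an output can be sketched with a single forward pass. The weights below are fixed, hypothetical values chosen only for illustration. In practice they would be learned from the training data, which this sketch does not show.

```python
import math


def sigmoid(z):
    """Squash a real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> non-linear hidden layer -> win probability."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical network with two inputs, two hidden neurons and one output.
hidden_weights = [[1.0, 0.0], [0.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, -1.0]
output_bias = 0.0

# Hypothetical standardized features, e.g. turnover margin and time of possession.
features = [0.5, -1.0]
win_probability = forward(features, hidden_weights, hidden_biases, output_weights, output_bias)
print(win_probability)
```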

An ANN model has already been applied to the NFL, where five features were used: yards gained, rushing yards gained, turnover margin, time of possession and betting line odds. The difference between good and poor teams was discovered via unsupervised methods based on clustering. The accuracy achieved by M.C. Purucker was 61%, and this was found to be an effective approach.

ANNs have also been used in horse racing prediction. One ANN was used for each horse in the race, and the output was the finishing time of the horse. The input nodes were weight, type of race, horse trainer, horse jockey, number of horses in the race, race distance, track condition and weather. E. Davoodi and A. Khanteymoori reported an accuracy of 77% based on the above conditions.

Unanimous AI is a company that has made some astonishingly accurate predictions. It is based on “swarm intelligence”, the main logic being that the majority is better at solving problems and making decisions. They have successfully predicted the Super Bowl result down to the exact score. Another success was predicting the winners of the Kentucky Derby in the exact order.

Machine learning will become a standard tool of the sports betting industry, and companies such as fansunite.io are more than keen to make this known. The company is not shy about admitting that it incorporates machine learning in its risk management strategy. It is a powerful tool for producing win probabilities that minimize bias and variance. Their closing line will be the product of a best-in-class deep learning network, alongside other more common approaches.

Despite the increasing use of machine learning models for sports prediction, the industry needs new and more accurate algorithms. Betting turnover keeps growing, and participants in the betting industry need to seek useful strategies and accurate predictions. Machine learning is now a common method for sports prediction, and betting operators will keep modelling sports data to further enhance their prediction accuracy.

Georgie L (Analytics Writer)

Georgie has been in the industry for over 11 years, working as a trader and a broker for some of the largest syndicates in the world. Georgie has focused his model development on international soccer leagues.

Twitter: @WagerBop
Email: georgie@wagerbop.com