A Generative Model for Predicting Outcomes in College Basketball
01 March 2015
Gambling is one of mankind's oldest activities, as evidenced by writings and equipment found in tombs and other places, and the birth of probability theory is attributed to Pascal and Fermat, who in 1654 studied the fair division of the stakes in an interrupted game of chance (Hacking, 1975). In this paper, we aim to estimate probabilities in sports. Specifically, we focus on the March Madness tournament in college basketball, although the model is general enough to describe nearly any team sport, for both regular-season and play-off games (assuming that both teams are trying to win).
Estimating probabilities in sporting events is challenging, because it is unclear which variables affect the outcome and what information is publicly known before the games begin. Team sports are even more complicated, because information about individual players becomes relevant. Although there have been some attempts to model individual players (Miller, Bornn, Adams, and Goldsberry, 2014), there is no standard method to evaluate the importance of individual players and to remove their contribution to the team when they are absent, injured, or suspended. It is also unclear whether considering individual player information can improve predictions without overfitting. For college basketball, even more variables come into play: there are 351 teams divided into 32 conferences, each team plays only about 30 regular-season games, and the match-ups are not random, so the results do not directly reveal the level of each team.
In the literature, we can find several variants of a simple model for soccer that identifies each team by its attack and defense coefficients (Maher, 1982; Dixon and Coles, 1997; Crowder, Dixon, Ledford, and Robinson, 2002; Baio and Blangiardo, 2010; Heuer, Müller, and Rubner, 2010). In all these works, the score of the home team is drawn from a Poisson distribution whose mean is the product of the home team's attack coefficient and the away team's defense coefficient. The score of the visiting team is an independent Poisson random variable whose mean is the visiting team's attack coefficient multiplied by the home team's defense coefficient. These coefficients are estimated by maximum likelihood from past results and are then used to predict future outcomes.
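To make the attack/defense Poisson model described above concrete, the following is a minimal sketch in Python. It is not the implementation used in any of the cited works. The function names (`fit_attack_defense`, `home_win_probability`), the team labels, and the scores are hypothetical and chosen only for illustration. The fit uses alternating closed-form updates obtained by setting the derivative of the Poisson log-likelihood to zero for one set of coefficients at a time, which is one possible way to reach the maximum likelihood estimate; the cited papers may use other optimizers.

```python
import math
from collections import defaultdict


def fit_attack_defense(games, n_iter=200):
    """Fit attack/defense coefficients by maximum likelihood.

    games: list of (home, away, home_score, away_score) tuples.
    Model: home_score ~ Poisson(attack[home] * defense[away]),
           away_score ~ Poisson(attack[away] * defense[home]).
    Coefficients are identified only up to a common rescaling
    (attack * c, defense / c); the fitted means are unaffected.
    """
    teams = sorted({team for game in games for team in game[:2]})
    scored = defaultdict(float)    # total points scored by each team
    conceded = defaultdict(float)  # total points conceded by each team
    for home, away, home_score, away_score in games:
        scored[home] += home_score
        conceded[away] += home_score
        scored[away] += away_score
        conceded[home] += away_score

    attack = {t: 1.0 for t in teams}
    defense = {t: 1.0 for t in teams}
    for _ in range(n_iter):
        # d(log-likelihood)/d(attack[t]) = 0 gives
        # attack[t] = scored[t] / (sum of opponents' defense coefficients).
        opp_defense = defaultdict(float)
        for home, away, _h, _a in games:
            opp_defense[home] += defense[away]
            opp_defense[away] += defense[home]
        attack = {t: scored[t] / opp_defense[t] for t in teams}
        # Same argument for defense[t], using the opponents' attack coefficients.
        opp_attack = defaultdict(float)
        for home, away, _h, _a in games:
            opp_attack[home] += attack[away]
            opp_attack[away] += attack[home]
        defense = {t: conceded[t] / opp_attack[t] for t in teams}
    return attack, defense


def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu), mu > 0, computed in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def home_win_probability(attack, defense, home, away, max_score=250):
    """P(home score > away score) under independent Poisson scores.

    Scores above max_score are ignored (an assumed truncation), and
    ties are counted as neither a home win nor an away win.
    """
    mu_home = attack[home] * defense[away]
    mu_away = attack[away] * defense[home]
    prob, cdf_away = 0.0, 0.0  # cdf_away holds P(away score <= x - 1)
    for x in range(max_score + 1):
        prob += poisson_pmf(x, mu_home) * cdf_away
        cdf_away += poisson_pmf(x, mu_away)
    return prob


# Hypothetical results, for illustration only.
games = [
    ("A", "B", 78, 65), ("B", "C", 70, 72), ("C", "A", 60, 81),
    ("B", "A", 69, 75), ("C", "B", 66, 71), ("A", "C", 85, 62),
]
attack, defense = fit_attack_defense(games)
print(f"P(A beats C at home) = {home_win_probability(attack, defense, 'A', 'C'):.3f}")
```

Because the description above gives the home and visiting teams symmetric roles, this sketch includes no home-advantage term. The independent Poisson scores also assign positive probability to a tie, which would need separate handling in a sport where games cannot end level.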