About the Model/Simulations
NOTE: The information below is now outdated. Please read this post for an update on how the model is operated. I will leave the information below up for posterity.
The model reads in ball-by-ball data, the majority of which is taken from the indispensable CricSheet. All matches since 2015 are used, with more weight given to more recent matches.
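As a rough illustration of weighting recent matches more heavily, here is a minimal sketch. The exact weighting scheme is not described above, so the exponential decay, the `recency_weight` name, and the one-year half-life are all assumptions made for the example.

```python
from datetime import date


def recency_weight(match_date: date, as_of: date, half_life_days: float = 365.0) -> float:
    """Return a weight in (0, 1] that halves every `half_life_days`.

    Exponential decay and the default half-life are illustrative assumptions,
    not the scheme the model actually uses.
    """
    age_days = (as_of - match_date).days
    if age_days < 0:
        raise ValueError("match_date is later than as_of")
    return 0.5 ** (age_days / half_life_days)


# A match played today gets full weight; one played a year ago gets half.
today = date(2020, 1, 1)
print(recency_weight(date(2020, 1, 1), today))
print(recency_weight(date(2019, 1, 1), today))
```

Under this kind of scheme, each ball from a match would carry that match's weight when the scoring and wicket patterns are estimated.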
The ball-by-ball data allows for estimation of typical scoring and wicket taking patterns through the course of a limited-overs innings. The main purpose of the model is to assess team strength, so individual players are not modeled. Teams’ batting orders are broken up into top (1-3), middle (4-6), lower (7-8), and tail (9-11). Teams’ bowling attacks are broken up into opening bowlers (first two to appear) and other bowlers (everybody else). I would prefer to turn this into pace and spin and also model bowler type usage, but it’s a bit harder to match up the ball-by-ball data to individual players’ bowler types, so this may be a project for the future. For now, opening vs other serves as a not-terrible proxy for pace and spin.
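The groupings above can be sketched in a few lines. The position ranges and the "first two to appear" rule come from the description. The function names and the input format (a sequence of bowler names in the order they bowled) are hypothetical.

```python
from typing import Dict, Iterable


def batting_group(position: int) -> str:
    """Map a batting position (1-11) to its group."""
    if not 1 <= position <= 11:
        raise ValueError("batting position must be between 1 and 11")
    if position <= 3:
        return "top"
    if position <= 6:
        return "middle"
    if position <= 8:
        return "lower"
    return "tail"


def bowling_groups(bowlers_in_order: Iterable[str]) -> Dict[str, str]:
    """Label the first two distinct bowlers 'opening' and the rest 'other'."""
    groups: Dict[str, str] = {}
    for name in bowlers_in_order:
        if name not in groups:
            groups[name] = "opening" if len(groups) < 2 else "other"
    return groups


# Bowler names here are placeholders, listed over by over.
print(batting_group(5))
print(bowling_groups(["A", "B", "A", "B", "C", "D", "A"]))
```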
Essentially, we are modeling two things for each ball: run rate and wicket probability. Run rate is a simple model taking into account batter ability, bowler ability, and the expected runs in a given match state based on over and wickets lost. Wicket probability is a slightly more complex logistic regression model, but it takes into account the same things as the run rate model.
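To make the two per-ball quantities concrete, here is a minimal sketch. The functional forms are assumptions: the multiplicative adjustment for run rate and the additive log-odds adjustment for wicket probability are one plausible way to combine a match-state baseline with batting and bowling ability, not necessarily the way the model does it. All names and numbers are hypothetical.

```python
import math


def expected_runs(baseline_runs: float, batting_factor: float, bowling_factor: float) -> float:
    """Expected runs for one ball.

    `baseline_runs` is the typical runs per ball for the match state (over and
    wickets lost). The factors are assumed multipliers where 1.0 is average:
    a batting factor above 1.0 scores faster, and a bowling factor above 1.0
    concedes more.
    """
    return baseline_runs * batting_factor * bowling_factor


def wicket_probability(baseline_prob: float, batting_effect: float, bowling_effect: float) -> float:
    """Wicket probability for one ball from a logistic model.

    The baseline probability for the match state is moved to the log-odds
    scale, shifted by the (assumed additive) team effects, and mapped back.
    An effect of 0.0 is average; positive values make a wicket more likely.
    """
    log_odds = math.log(baseline_prob / (1.0 - baseline_prob))
    log_odds += batting_effect + bowling_effect
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative numbers only: a strong batting group against an average attack.
print(expected_runs(baseline_runs=0.9, batting_factor=1.1, bowling_factor=1.0))
print(wicket_probability(baseline_prob=0.03, batting_effect=-0.2, bowling_effect=0.0))
```

Working on the log-odds scale keeps the wicket probability between 0 and 1 no matter how large the team effects are, which is the usual reason to reach for logistic regression here.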
And that’s about it! We can use the ball-by-ball data to get team ratings, and then simulate thousands of matches to obtain rankings or simulations. The model is fairly simple, but for the purposes of team ranking, I’ve found that it does a pretty good job.
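For a sense of what simulating thousands of matches looks like, here is a heavily simplified Monte Carlo sketch. It is not the model's actual simulation. It assumes each team has a single wicket probability and a single set of run-outcome weights for every ball, whereas the real model varies these with the over and wickets lost. It also ignores extras and scores no runs on a wicket ball. All names and numbers are made up for illustration.

```python
import random
from typing import Optional, Sequence, Tuple

RUN_OUTCOMES = (0, 1, 2, 4, 6)  # simplified set of runs off one ball (assumption)
Team = Tuple[Sequence[float], float]  # (weights over RUN_OUTCOMES, wicket probability)


def simulate_innings(team: Team, rng: random.Random, balls: int = 300,
                     target: Optional[int] = None) -> int:
    """Simulate one innings ball by ball and return the runs scored."""
    run_weights, wicket_prob = team
    runs = 0
    wickets = 0
    for _ in range(balls):
        if rng.random() < wicket_prob:
            wickets += 1
            if wickets == 10:
                break  # all out
            continue
        runs += rng.choices(RUN_OUTCOMES, weights=run_weights)[0]
        if target is not None and runs >= target:
            break  # chase completed
    return runs


def win_probability(team_a: Team, team_b: Team, n_sims: int = 10000, seed: int = 0) -> float:
    """Estimate how often team A wins when batting first; a tie counts as half."""
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(n_sims):
        first = simulate_innings(team_a, rng)
        second = simulate_innings(team_b, rng, target=first + 1)
        if first > second:
            wins += 1.0
        elif first == second:
            wins += 0.5
    return wins / n_sims


# Made-up teams: the stronger side scores faster and loses wickets less often.
strong = ((45, 35, 6, 11, 3), 0.025)
weak = ((50, 35, 5, 8, 2), 0.035)
print(win_probability(strong, weak, n_sims=2000))
```

Repeating this for every pairing of teams and averaging the win rates would be one simple way to turn the simulations into a ranking.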