Benford’s law meets IPL, Intl. T20 and ODI cricket

“To grasp how different a million is from a billion, think about it like this: A million seconds is a little under two weeks; a billion seconds is about thirty-two years.”

“One of the pleasures of looking at the world through mathematical eyes is that you can see certain patterns that would otherwise be hidden.”

               Steven Strogatz, Prof at Cornell University

Introduction

Within the last two weeks, two of my friends introduced me to Benford’s Law. Initially, I looked it up on Google and was quite intrigued by the law. Subsequently, another friend asked me to check out the ‘Digits’ episode of the “Connected” series on Netflix by Latif Nasser, which I strongly recommend you watch.

Benford’s Law, also called the Newcomb–Benford law, the law of anomalous numbers, or the First Digit Law, states that, when dealing with quantities obtained from Nature, the frequency of appearance of each digit in the first significant place is logarithmic. For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30.1% of the time, the number 2 about 17.6%, the number 3 about 12.5%, all the way down to the number 9 at 4.6%. This interesting logarithmic pattern is observed in many natural datasets that span several orders of magnitude, such as population densities, river lengths, heights of skyscrapers and tax returns. What is really curious about this law is that it is scale invariant: if the lengths of rivers obey the law, they do so regardless of the units used to measure them, whether meters, feet or miles. There is something almost mystical about this law.
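The unit-independence described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "river lengths" are simulated values spread log-uniformly over six orders of magnitude (not real measurements), and the helper first_digit_share is a hypothetical name introduced here for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 100000

# Simulated "river lengths" in metres, log-uniform over six decades (assumption)
metres <- 10^runif(n, min = 0, max = 6)
feet   <- metres * 3.28084
miles  <- metres / 1609.344

# Share of each leading significant digit (1..9) in a positive numeric vector.
# Scientific notation puts the leading digit in the first character.
first_digit_share <- function(x) {
  d <- as.integer(substr(sprintf("%.15e", x), 1, 1))
  as.numeric(table(factor(d, levels = 1:9))) / length(x)
}

comparison <- cbind(
  benford = log10(1 + 1 / 1:9),
  metres  = first_digit_share(metres),
  feet    = first_digit_share(feet),
  miles   = first_digit_share(miles)
)
rownames(comparison) <- 1:9
round(comparison, 3)
```

Under these assumptions the three unit columns should stay close to the benford column, differing only by sampling noise. Real datasets follow the law approximately, so the agreement across units is approximate as well.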

The law has also been used widely to detect financial fraud, manipulation of tax statements, bots on Twitter, fake accounts in social networks, image manipulation etc. In this age of deep fakes, the ability to detect fake images will assume paramount importance. While deviations from Benford’s Law do not always signify fraud, to a large extent they point to an aberration. Prof. Nigrini, of Cape Town, used this law to highlight discrepancies in the financial statements of Enron, the company at the centre of the infamous scandal. Also, analyses of the 2009 Iranian election results reported that the digit percentages did not conform to those specified by Benford’s Law, which was cited as evidence of possible fraud.

While it cannot be said with absolute certainty, marked deviations from Benford’s law could indicate that natural processes have been manipulated. Possibly, Benford’s law could be used to detect large-scale match-fixing in cricket tournaments. However, we cannot look at this in isolation, and other statistical and forensic methods may be required to determine whether there is fraud. Here is an interesting paper on the subject: “Promises and perils of Benford’s law”.

A set of numbers is said to satisfy Benford’s law if the leading digit d (d ∈ {1, …, 9}) occurs with probability

P(d)=log_{10}(1+1/d)
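As a quick check of the formula, the expected first-digit percentages can be computed directly in base R. This is a minimal sketch; the variable name benford_prob is introduced here for illustration only.

```r
# Expected first-digit probabilities under Benford's law (base 10)
digits <- 1:9
benford_prob <- log10(1 + 1 / digits)
names(benford_prob) <- digits

# Percentages rounded to one decimal place
round(100 * benford_prob, 1)
##    1    2    3    4    5    6    7    8    9 
## 30.1 17.6 12.5  9.7  7.9  6.7  5.8  5.1  4.6

# The nine probabilities add up to 1
sum(benford_prob)
```

The first, second, third and last values reproduce the 30.1%, 17.6%, 12.5% and 4.6% quoted in the introduction.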

This law also works for numbers in other bases. In base b ≥ 2, the leading digit d (d ∈ {1, …, b−1}) occurs with probability

P(d)=log_{b}(1+1/d)
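One way to see that the base-b formula defines a valid probability distribution is to sum it over the admissible leading digits d = 1, …, b−1. Each term splits into a difference of logarithms, so the sum telescopes:

```latex
\sum_{d=1}^{b-1} \log_{b}\left(1+\frac{1}{d}\right)
  = \sum_{d=1}^{b-1} \left[\log_{b}(d+1)-\log_{b}(d)\right]
  = \log_{b}(b)-\log_{b}(1)
  = 1
```

For b = 10 this is the statement that the nine first-digit probabilities add up to 1.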

Interestingly, this law also applies to sports, for example to the number of points scored in basketball. I was curious to see if it applied to cricket. Using my R package yorkr, I had previously converted all the T20 and ODI data from Cricsheet; the converted data is available at yorkrData2020. I wanted to check whether Benford’s Law held for the runs scored, or the deliveries faced, by batsmen at a team level or at a tournament level (IPL, Intl. T20 or ODI).

Thankfully, R has a package, benford.analysis, for checking whether data behaves in accordance with Benford’s Law, and I have used this package in this post.

This post is also available in RPubs as Benford’s Law meets IPL, Intl. T20 and ODI

library(data.table)
library(reshape2)
library(dplyr)
library(benford.analysis)
library(yorkr)

In this post, I have checked a random selection of the data against Benford’s law. The fully converted dataset is available in yorkrData2020, mentioned above. You can try this on any of the datasets, including ODI (men, women), Intl. T20 (men, women), IPL, BBL, PSL, NTB and WBB.

1. Check the runs distribution by Royal Challengers Bangalore

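We can see that the behaviour is as expected under Benford’s law, with minor deviations: the MAD of about 0.0049 falls in the “Close conformity” band, and the chi-squared test (p-value = 0.735) does not reject conformity.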

load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Royal Challengers Bangalore-BattingDetails.RData")
rcbRunsTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = T, sign = "positive") 
rcbRunsTrends
## 
## Benford object:
##  
## Data: battingDetails$runs 
## Number of observations used = 1205 
## Number of obs. for second order = 99 
## First digits analysed = 1
## 
## Mantissa: 
## 
##    Statistic  Value
##         Mean  0.458
##          Var  0.091
##  Ex.Kurtosis -1.213
##     Skewness -0.025
## 
## 
## The 5 largest deviations: 
## 
##   digits absolute.diff
## 1      1         14.26
## 2      7         13.88
## 3      9          8.14
## 4      6          5.33
## 5      4          4.78
## 
## Stats:
## 
##  Pearson's Chi-squared test
## 
## data:  battingDetails$runs
## X-squared = 5.2091, df = 8, p-value = 0.735
## 
## 
##  Mantissa Arc Test
## 
## data:  battingDetails$runs
## L2 = 0.0022852, df = 2, p-value = 0.06369
## 
## Mean Absolute Deviation (MAD): 0.004941381
## MAD Conformity - Nigrini (2012): Close conformity
## Distortion Factor: -18.8725
## 
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
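To make the MAD line in the output less of a black box, here is a minimal base R sketch of how a first-digit Mean Absolute Deviation can be computed by hand and mapped to the conformity labels. The runs vector is simulated (an assumption, standing in for battingDetails$runs), the function name mad_conformity is hypothetical, and the band limits are the first-digit thresholds usually attributed to Nigrini (2012).

```r
set.seed(7)  # arbitrary seed, for reproducibility only

# Simulated positive integer scores (assumption: a stand-in for battingDetails$runs)
runs <- rgeom(2000, prob = 0.05)
runs <- runs[runs > 0]

# Observed share of each leading digit versus the Benford expectation
first_digit <- as.integer(substr(as.character(runs), 1, 1))
observed <- as.numeric(table(factor(first_digit, levels = 1:9))) / length(runs)
expected <- log10(1 + 1 / 1:9)

# MAD: the average absolute gap between observed and expected proportions
mad_value <- mean(abs(observed - expected))

# Map a first-digit MAD to a conformity label
mad_conformity <- function(m) {
  as.character(cut(m,
                   breaks = c(0, 0.006, 0.012, 0.015, Inf),
                   labels = c("Close conformity", "Acceptable conformity",
                              "Marginally acceptable conformity", "Nonconformity"),
                   right = FALSE))
}

mad_value
mad_conformity(mad_value)
```

Applied to the MAD values printed in this post, for example mad_conformity(0.004941381), these bands give the same labels as the benford() output.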

2. Check the ‘balls played’ distribution by Royal Challengers Bangalore

load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Royal Challengers Bangalore-BattingDetails.RData")
rcbBallsPlayedTrends = benford(battingDetails$ballsPlayed, number.of.digits = 1, discrete = T, sign = "positive") 
plot(rcbBallsPlayedTrends)

 

3. Check the runs distribution by Chennai Super Kings

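The distribution deviates from the expected behaviour to some extent, most visibly in the count for leading digit 5. The output below lists 5, 2, 1, 9 and 7 as the five largest deviations, and the MAD of about 0.0131 falls in the “Marginally acceptable conformity” band.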

load("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails/Chennai Super Kings-BattingDetails.RData")
cskRunsTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = T, sign = "positive") 
cskRunsTrends
## 
## Benford object:
##  
## Data: battingDetails$runs 
## Number of observations used = 1054 
## Number of obs. for second order = 94 
## First digits analysed = 1
## 
## Mantissa: 
## 
##    Statistic  Value
##         Mean  0.466
##          Var  0.081
##  Ex.Kurtosis -1.100
##     Skewness -0.054
## 
## 
## The 5 largest deviations: 
## 
##   digits absolute.diff
## 1      5         27.54
## 2      2         18.40
## 3      1         17.29
## 4      9         14.23
## 5      7         14.12
## 
## Stats:
## 
##  Pearson's Chi-squared test
## 
## data:  battingDetails$runs
## X-squared = 22.862, df = 8, p-value = 0.003545
## 
## 
##  Mantissa Arc Test
## 
## data:  battingDetails$runs
## L2 = 0.002376, df = 2, p-value = 0.08173
## 
## Mean Absolute Deviation (MAD): 0.01309597
## MAD Conformity - Nigrini (2012): Marginally acceptable conformity
## Distortion Factor: -17.90664
## 
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!

4. Check runs distribution in all of Indian Premier League (IPL)

battingDF <- NULL
teams <-c("Chennai Super Kings","Deccan Chargers","Delhi Daredevils",
          "Kings XI Punjab", 'Kochi Tuskers Kerala',"Kolkata Knight Riders",
          "Mumbai Indians", "Pune Warriors","Rajasthan Royals",
          "Royal Challengers Bangalore","Sunrisers Hyderabad","Gujarat Lions",
          "Rising Pune Supergiants")


setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/ipl/iplBattingBowlingDetails")
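for(team in teams){
  # Cleared on every pass so that a failed load contributes no rows
  battingDetails <- NULL
  val <- paste(team,"-BattingDetails.RData",sep="")
  print(val)
  # load() creates battingDetails; on error, report it and move on
  tryCatch(load(val),
           error = function(e) {
             print("No data1")
           }
  )
  # Append this team's rows (rbind ignores NULL)
  battingDF <- rbind(battingDF,battingDetails)
}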
## [1] "Chennai Super Kings-BattingDetails.RData"
## [1] "Deccan Chargers-BattingDetails.RData"
## [1] "Delhi Daredevils-BattingDetails.RData"
## [1] "Kings XI Punjab-BattingDetails.RData"
## [1] "Kochi Tuskers Kerala-BattingDetails.RData"
## [1] "Kolkata Knight Riders-BattingDetails.RData"
## [1] "Mumbai Indians-BattingDetails.RData"
## [1] "Pune Warriors-BattingDetails.RData"
## [1] "Rajasthan Royals-BattingDetails.RData"
## [1] "Royal Challengers Bangalore-BattingDetails.RData"
## [1] "Sunrisers Hyderabad-BattingDetails.RData"
## [1] "Gujarat Lions-BattingDetails.RData"
## [1] "Rising Pune Supergiants-BattingDetails.RData"
trends = benford(battingDF$runs, number.of.digits = 1, discrete = T, sign = "positive") 
trends
## 
## Benford object:
##  
## Data: battingDF$runs 
## Number of observations used = 10129 
## Number of obs. for second order = 123 
## First digits analysed = 1
## 
## Mantissa: 
## 
##    Statistic   Value
##         Mean  0.4521
##          Var  0.0856
##  Ex.Kurtosis -1.1570
##     Skewness -0.0033
## 
## 
## The 5 largest deviations: 
## 
##   digits absolute.diff
## 1      2        159.37
## 2      9        121.48
## 3      7         93.40
## 4      8         83.12
## 5      1         61.87
## 
## Stats:
## 
##  Pearson's Chi-squared test
## 
## data:  battingDF$runs
## X-squared = 78.166, df = 8, p-value = 1.143e-13
## 
## 
##  Mantissa Arc Test
## 
## data:  battingDF$runs
## L2 = 5.8237e-05, df = 2, p-value = 0.5544
## 
## Mean Absolute Deviation (MAD): 0.006627966
## MAD Conformity - Nigrini (2012): Acceptable conformity
## Distortion Factor: -20.90333
## 
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
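The output above pairs a tiny chi-squared p-value with an “Acceptable conformity” MAD, which is why the package warns against focusing on p-values. For a fixed pattern of deviations, Pearson’s statistic grows in proportion to the number of observations, while the MAD does not depend on the sample size. The sketch below illustrates this with a made-up deviation pattern (the shift vector is an assumption, not taken from the IPL data).

```r
expected <- log10(1 + 1 / 1:9)

# Hypothetical small, fixed departure from Benford (assumption); sums to zero
shift <- c(-0.006, 0.012, 0.002, 0.001, -0.001, 0.000, -0.003, -0.002, -0.003)
observed_share <- expected + shift

# MAD depends only on the proportions, not on the sample size
mad_value <- mean(abs(observed_share - expected))

# Pearson's statistic for n observations with these observed shares
pearson_stat <- function(n) {
  n * sum((observed_share - expected)^2 / expected)
}

sizes <- c(1000, 10000, 100000)
stat <- sapply(sizes, pearson_stat)
p_value <- pchisq(stat, df = 8, lower.tail = FALSE)

data.frame(n = sizes, X_squared = round(stat, 2), p_value = signif(p_value, 3))
mad_value
```

With the same proportions, a tenfold increase in n gives a tenfold increase in the statistic, so a deviation that is unremarkable for a single team can become highly significant for a whole tournament.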

5. Check the runs distribution in India’s Intl. T20 matches

setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20BattingBowlingDetails")
load("India-BattingDetails.RData")

indiaTrends = benford(battingDetails$runs, number.of.digits = 1, discrete = T, sign = "positive") 
plot(indiaTrends)

 

6. Check Benford’s law in all of Intl. T20

setwd("/Users/tvganesh/backup/software/cricket-package/yorkr-cricsheet/yorkrData2020/t20/t20BattingBowlingDetails")
teams <-c("Australia","India","Pakistan","West Indies", 'Sri Lanka',
          "England", "Bangladesh","Netherlands","Scotland", "Afghanistan",
          "Zimbabwe","Ireland","New Zealand","South Africa","Canada",
          "Bermuda","Kenya","Hong Kong","Nepal","Oman","Papua New Guinea",
          "United Arab Emirates","Namibia","Cayman Islands","Singapore",
          "United States of America","Bhutan","Maldives","Botswana","Nigeria",
          "Denmark","Germany","Jersey","Norway","Qatar","Malaysia","Vanuatu",
          "Thailand")

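# Reset the accumulator: if step 4 ran in the same session, battingDF still
# holds the IPL rows, and they would be counted in the Intl. T20 totals too
battingDF <- NULL
for(team in teams){
  # Cleared on every pass so that a failed load contributes no rows
  battingDetails <- NULL
  val <- paste(team,"-BattingDetails.RData",sep="")
  print(val)
  # load() creates battingDetails; on error, report it and move on
  tryCatch(load(val),
           error = function(e) {
             print("No data1")
           }
  )
  # Append this team's rows (rbind ignores NULL)
  battingDF <- rbind(battingDF,battingDetails)
}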
intlT20Trends = benford(battingDF$runs, number.of.digits = 1, discrete = T, sign = "positive") 
intlT20Trends
## 
## Benford object:
##  
## Data: battingDF$runs 
## Number of observations used = 21833 
## Number of obs. for second order = 131 
## First digits analysed = 1
## 
## Mantissa: 
## 
##    Statistic  Value
##         Mean  0.447
##          Var  0.085
##  Ex.Kurtosis -1.158
##     Skewness  0.018
## 
## 
## The 5 largest deviations: 
## 
##   digits absolute.diff
## 1      2        361.40
## 2      9        276.02
## 3      1        264.61
## 4      7        210.14
## 5      8        198.81
## 
## Stats:
## 
##  Pearson's Chi-squared test
## 
## data:  battingDF$runs
## X-squared = 202.29, df = 8, p-value < 2.2e-16
## 
## 
##  Mantissa Arc Test
## 
## data:  battingDF$runs
## L2 = 5.3983e-06, df = 2, p-value = 0.8888
## 
## Mean Absolute Deviation (MAD): 0.007821098
## MAD Conformity - Nigrini (2012): Acceptable conformity
## Distortion Factor: -24.11086
## 
## Remember: Real data will never conform perfectly to Benford's Law. You should not focus on p-values!
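Both tournament-level checks use the same load-and-append loop. As a possible tidy-up, the loop can be wrapped in a function that returns a fresh data frame on every call, so the result never depends on what an accumulator such as battingDF already holds. This is a sketch only: loadBattingDetails is a hypothetical helper (not part of yorkr), and it assumes each file defines a data frame named battingDetails, as the files used in this post do.

```r
# Load "<team>-BattingDetails.RData" for each team and stack the rows.
# Hypothetical helper; assumes each file defines a data frame 'battingDetails'.
loadBattingDetails <- function(teams, dir = ".") {
  frames <- lapply(teams, function(team) {
    path <- file.path(dir, paste0(team, "-BattingDetails.RData"))
    if (!file.exists(path)) {
      message("No data for ", team)
      return(NULL)
    }
    env <- new.env()         # load into a private environment, not the workspace
    load(path, envir = env)
    env$battingDetails
  })
  do.call(rbind, frames)     # rbind ignores the NULL entries
}
```

A call would then look like iplDF <- loadBattingDetails(teams, dir = "<path to iplBattingBowlingDetails>"), with the directory placeholder replaced by your local copy of yorkrData2020, followed by benford(iplDF$runs, ...) as before.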

Conclusion

Maths rules our lives, more than we are aware, more than we like to admit. It is there in all of nature. Whether it is the recursive patterns of Mandelbrot sets, the intrinsic notion of beauty through the golden ratio, the murmuration of starlings, the synchronous blinking of fireflies or the almost universality of Benford’s law on natural datasets, mathematics governs us.

Isn’t it strange that, while we humans pride ourselves on free will, the runs scored by batsmen in particular formats conform to Benford’s law for the first digits? It almost looks as if the runs that will be scored are, to some extent, predetermined to fall within specified ranges obeying Benford’s law. So much for choice.

Something to be pondered over!

Also see

  1. Introducing GooglyPlusPlus!!!
  2. Deconstructing Convolutional Neural Networks with Tensorflow and Keras
  3. Going deeper into IBM’s Quantum Experience!
  4. Experiments with deblurring using OpenCV
  5. Big Data 6: The T20 Dance of Apache NiFi and yorkpy
  6. Deep Learning from first principles in Python, R and Octave – Part 4
  7. Practical Machine Learning with R and Python – Part 4
  8. Re-introducing cricketr! : An R package to analyze performances of cricketers
  9. Bull in a china shop – Behind the scenes in Android

One thought on “Benford’s law meets IPL, Intl. T20 and ODI cricket”

  1. It’s truly incredible how faithfully the law is in effect…the unseen hand writes in the eternal background and we are all blind and arrogant in our ignorance until we are awakened…

    My thought on the 5 and above is that there is probably a need to look more deeply into the law. If we understood it at a deeper level, could we postulate a finely tuned version of Benford’s law that accounts for other subtle effects, say collecting the data only in the day/evening vs. continuously day and night, or along singularly divided demographic lines or species… or data sets dominated by male or female, or other abnormalities to nature’s true guidelines that humans have created due to their own idiosyncrasies?

    Is it possible to find filters or weightings that could be factored in to either modify the great Benford’s law, or alternatively to weight or normalize the data sets…

    A great topic to share with all
