R vs Python: Different similarities and similar differences

A debate about which language is better suited for data science, R or Python, can send diehard fans of either language into a tizzy. This post looks at some of the different similarities and similar differences between the two. To a large extent, the ease or difficulty of learning R or Python is subjective. I have heard that R has a steeper learning curve than Python, and also vice versa. This probably depends on one’s degree of familiarity with each language. To a large extent both R and Python do the same things in slightly different ways and syntaxes. The ease or difficulty of an R or Python construct is largely in the ‘eyes of the beholder’, or rather the programmer. I include my own experience with the languages below.

Check out my compact and minimal book  “Practical Machine Learning with R and Python:Third edition- Machine Learning in stereo”  available in Amazon in paperback($12.99) and kindle($8.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems and how to select the best model. Pick your copy today!!

1. R data types

R has the following data types

1.  Character
2. Integer
3. Numeric
4. Logical
5. Complex
6. Raw

Python has several data types

1. int
2. float
3. long (Python 2 only; Python 3 folds it into int)
4. complex and so on

2. R Vector vs Python List

A common data type in R is the vector. Python has a similar data type, the list.

# R vectors
a<-c(4,5,1,3,4,5)
print(a[3])
## [1] 1
print(a[3:4]) # R does not always need the explicit print. 
## [1] 1 3
#R type of variable
print(class(a))
## [1] "numeric"
# Length of a
print(length(a))
## [1] 6
# Python lists
a=[4,5,1,3,4,5] #
print(a[2]) # Some python IDEs require the explicit print
print(a[2:5])
print(type(a))
# Length of a
print(len(a))
## 1
## [1, 3, 4]
## <class 'list'>
## 6

2a. Other data types – Python

Python also has certain other data types like the tuple, dictionary etc., as shown below. R does not have as many built-in data types; nevertheless, we can do everything in R that Python does

# Python tuple
b = (4,5,7,8)
print(b)

#Python dictionary
c={'name':'Ganesh','age':54,'Work':'Professional'}
print(c)
#Print type of variable c

## (4, 5, 7, 8)
## {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}

2b. Type of variable

To know the type of a variable in R we use ‘class’. In Python the corresponding command is ‘type’

#R - Type of variable
a<-c(4,5,1,3,4,5)
print(class(a))
## [1] "numeric"
#Python - Print type of list a and tuple b
a=[4,5,1,3,4,5]
print(type(a))
b=(4,3,"the",2)
print(type(b))
## <class 'list'>
## <class 'tuple'>

3. Length

To know length in R, use length()

#R - Length of vector
# Length of a
a<-c(4,5,1,3,4,5)
print(length(a))
## [1] 6

To know the length of a list, tuple or dict in Python we can use len()

# Python - Length of list , tuple etc
# Length of a
a=[4,5,1,3,4,5]
print(len(a))
# Length of b
b = (4,5,7,8)
print(len(b))

## 6
## 4

4. Accessing help

To access help in R we use the ‘?’ or the ‘help’ function

#R - Help - To be done in R console or RStudio
#?sapply
#help(sapply)

To get help in Python we use the ‘help’ function, or ‘?’ in an IPython console

#Python help - This can be done on a (I)Python console
#help(len)
#?len

5. Subsetting

The key difference between R and Python with regard to subsetting is that in R the index starts at 1, while in Python it starts at 0, much like C, C++ or Java. To subset a vector in R we use

#R - Subset
a<-c(4,5,1,3,4,8,12,18,1)
print(a[3])
## [1] 1
# To print a range or a slice. Print from the 3rd to the 6th element
print(a[3:6])
## [1] 1 3 4 8

Python also uses indices. The difference is that in Python the index starts from 0, and a slice excludes its end index.

#Python - Subset
a=[4,5,1,3,4,8,12,18,1]
# Print the 4th element (starts from 0)
print(a[3])

# Print a slice from 4 to 6th element
print(a[3:6])
## 3
## [3, 4, 8]
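The same subscript `3:6` therefore picks different elements in the two languages. The sketch below makes this concrete; `r_style_slice` is a hypothetical helper written for this illustration, not a built-in, and it only mimics R's inclusive 1-based range on a Python list.

```python
# Hypothetical helper (not part of the original post): emulate R's
# inclusive, 1-based a[start:end] on a Python list.
def r_style_slice(values, start, end):
    """Return elements start..end, counting from 1 and including both ends."""
    return values[start - 1:end]

a = [4, 5, 1, 3, 4, 8, 12, 18, 1]

# Python: 0-based, end index excluded -> positions 3, 4 and 5
python_slice = a[3:6]

# R-style: 1-based, end index included -> 3rd to 6th element
r_slice = r_style_slice(a, 3, 6)

print(python_slice)
print(r_slice)
```

Comparing the two printed lists with the R and Python outputs above shows where the off-by-one comes from.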

6. Operations on vectors in R and operation on lists in Python

In R we can do many operations on vectors, e.g. element-by-element addition, subtraction, exponentiation, product etc., as shown below

#R - Operations on vectors
a<- c(5,2,3,1,7)
b<- c(1,5,4,6,8)

print(a+b)
## [1]  6  7  7  7 15
#Element wise subtraction
print(a-b)
## [1]  4 -3 -1 -5 -1
#Element wise product
print(a*b)
## [1]  5 10 12  6 56
# Exponentiating the elements of a vector
print(a^2)
## [1] 25  4  9  1 49

In Python, one way to do this on lists is to use ‘map’ with a ‘lambda’ function as follows

# Python - Operations on list
a =[5,2,3,1,7]
b =[1,5,4,6,8]

#Element wise addition with map & lambda
print(list(map(lambda x,y: x+y,a,b)))
#Element wise subtraction
print(list(map(lambda x,y: x-y,a,b)))
#Element wise product
print(list(map(lambda x,y: x*y,a,b)))
# Exponentiating the elements of a list
print(list(map(lambda x: x**2,a)))

## [6, 7, 7, 7, 15]
## [4, -3, -1, -5, -1]
## [5, 10, 12, 6, 56]
## [25, 4, 9, 1, 49]
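As an alternative to map and lambda, the same element-wise operations can be written as list comprehensions over `zip`. This is a minimal sketch using only built-ins; the variable names `sums`, `diffs`, `products` and `squares` are just illustrative.

```python
a = [5, 2, 3, 1, 7]
b = [1, 5, 4, 6, 8]

# zip pairs up the elements of a and b position by position
sums = [x + y for x, y in zip(a, b)]
diffs = [x - y for x, y in zip(a, b)]
products = [x * y for x, y in zip(a, b)]

# A single list needs no zip
squares = [x ** 2 for x in a]

print(sums)
print(diffs)
print(products)
print(squares)
```

Many Python programmers find comprehensions easier to read than map with a lambda, though both produce the same lists here.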

However, if we create ndarrays from the lists then we can do element-wise addition, subtraction, product etc. just like in R. NumPy is a really powerful module with many, many functions for matrix manipulation

import numpy as np
a =[5,2,3,1,7]
b =[1,5,4,6,8]
a=np.array(a)
b=np.array(b)
print(a+b)
#Element wise subtraction
print(a-b)
#Element wise product
print(a*b)
# Exponentiating the elements of a list
print(a**2)

## [ 6  7  7  7 15]
## [ 4 -3 -1 -5 -1]
## [ 5 10 12  6 56]
## [25  4  9  1 49]

7. Getting the index of element

To determine the index of an element which satisfies a specific logical condition in R use ‘which’. In the code below the index of the element equal to 1 is 4

# R - Which
a<- c(5,2,3,1,7)
print(which(a == 1))
## [1] 4

With a NumPy array in Python we can use np.where to get the same effect. The index will be 3, as the index starts from 0

# Python - np.where
import numpy as np
a =[5,2,3,1,7]
a=np.array(a)
print(np.where(a==1))
## (array([3], dtype=int64),)
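For a plain Python list, where NumPy is not involved, a similar lookup can be sketched with `enumerate`. The names `zero_based` and `one_based` are illustrative only; the second list simply shifts the positions to match what R's which() would report.

```python
a = [5, 2, 3, 1, 7]

# Positions of every element equal to 1, counted from 0 as Python does
zero_based = [i for i, value in enumerate(a) if value == 1]

# Shift by one to get the positions R's which() would report
one_based = [i + 1 for i in zero_based]

# list.index returns only the first match and raises ValueError if none exists
first_match = a.index(1)

print(zero_based, one_based, first_match)
```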

8. Data frames

R, by default, comes with a set of built-in datasets. Some datasets also come with the scikit-learn package

# R
# To check built datasets use
#data() - In R console or in R Studio
#iris - Don't print to console

We can use the built-in datasets that come with the scikit-learn package

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
# This creates a Sklearn bunch
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)

9. Working with dataframes

With R you can work with dataframes directly. For more complex dataframe operations in R there are convenient packages like dplyr, reshape2 etc. For Python we need to use the Pandas package. Pandas is quite comprehensive in the list of things we can do with data frames. The most common operations on a dataframe are

• Check the size of the dataframe
• Take a look at the top 5 or bottom 5 rows of dataframe
• Check the content of the dataframe

a.Size

In R use dim()

#R - Size
dim(iris)
## [1] 150   5

For Python use .shape

#Python - size
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.shape)
## (150, 4)

b. Top & bottom 5 rows of dataframe

To see the top and bottom rows of a data frame we use head() and tail(), as shown below for R and Python

#R
head(iris,5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
tail(iris,5)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.head(5))
print(iris.tail(5))
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 0                5.1               3.5                1.4               0.2
## 1                4.9               3.0                1.4               0.2
## 2                4.7               3.2                1.3               0.2
## 3                4.6               3.1                1.5               0.2
## 4                5.0               3.6                1.4               0.2
##      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 145                6.7               3.0                5.2               2.3
## 146                6.3               2.5                5.0               1.9
## 147                6.5               3.0                5.2               2.0
## 148                6.2               3.4                5.4               2.3
## 149                5.9               3.0                5.1               1.8

c. Check the content of the dataframe

#R
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
## 
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

d. Check column names

#R
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
#Get column names
print(iris.columns)
## Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
##        'petal width (cm)'],
##       dtype='object')

e. Rename columns

In R we can assign a vector to column names

#R
colnames(iris) <- c("lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal","Species")
colnames(iris)
## [1] "lengthOfSepal" "widthOfSepal"  "lengthOfPetal" "widthOfPetal"
## [5] "Species"

In Python we can assign a list to the columns attribute of the dataframe

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.columns = ["lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal"]
print(iris.columns)
## Index(['lengthOfSepal', 'widthOfSepal', 'lengthOfPetal', 'widthOfPetal'], dtype='object')

f. Details of dataframe

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

g. Subsetting dataframes

# R
#To subset a dataframe 'df' in R we use df[row,column] or df[row vector,column vector]
#df[row,column]
iris[3,4]
## [1] 0.2
#df[row vector, column vector]
iris[2:5,1:3]
##   lengthOfSepal widthOfSepal lengthOfPetal
## 2           4.9          3.0           1.4
## 3           4.7          3.2           1.3
## 4           4.6          3.1           1.5
## 5           5.0          3.6           1.4
#If we omit the row vector, then it implies all rows or if we omit the column vector
# then implies all columns for that row
iris[2:5,]
##   lengthOfSepal widthOfSepal lengthOfPetal widthOfPetal Species
## 2           4.9          3.0           1.4          0.2  setosa
## 3           4.7          3.2           1.3          0.2  setosa
## 4           4.6          3.1           1.5          0.2  setosa
## 5           5.0          3.6           1.4          0.2  setosa
# In R we can call specific columns by column names
# (this returns NULL here because the columns were renamed above)
iris$Sepal.Length[2:5]
## NULL
#Python
# To select an entire row we use .iloc. The index can be used with the ':', as in
# .iloc[start row: end row]. If start row is omitted then it implies the beginning of
# data frame, if end row is omitted then it implies all rows till end
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.iloc[3])
print(iris[:5])
# In python we can select columns by column name as follows
print(iris['sepal length (cm)'][2:6])
#If you want to select more than one column then you must use the double '[[]]' since the
# index is a list itself
print(iris[['sepal length (cm)','sepal width (cm)']][4:7])
## sepal length (cm)    4.6
## sepal width (cm)     3.1
## petal length (cm)    1.5
## petal width (cm)     0.2
## Name: 3, dtype: float64
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 0                5.1               3.5                1.4               0.2
## 1                4.9               3.0                1.4               0.2
## 2                4.7               3.2                1.3               0.2
## 3                4.6               3.1                1.5               0.2
## 4                5.0               3.6                1.4               0.2
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## Name: sepal length (cm), dtype: float64
##    sepal length (cm)  sepal width (cm)
## 4                5.0               3.6
## 5                5.4               3.9
## 6                4.6               3.4

h. Computing Mean, Standard deviation

#R
#Mean
mean(iris$lengthOfSepal)
## [1] 5.843333
#Standard deviation
sd(iris$widthOfSepal)
## [1] 0.4358663
#Python
#Mean
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris['sepal length (cm)'].mean())
#Standard deviation
print(iris['sepal width (cm)'].std())
## 5.843333333333335
## 0.4335943113621737

i. Boxplot

A boxplot can be produced in R using base plotting

#R
boxplot(iris$lengthOfSepal)

Matplotlib is a popular package in Python for plots

#Python
import sklearn as sklearn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img=plt.boxplot(iris['sepal length (cm)'])
plt.show()

j. Scatter plot

#R
plot(iris$widthOfSepal,iris$lengthOfSepal)

#Python
import matplotlib.pyplot as plt
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img=plt.scatter(iris['sepal width (cm)'],iris['sepal length (cm)'])
#plt.show()

k. Read from csv file

#R
tendulkar= read.csv("tendulkar.csv",stringsAsFactors = FALSE,na.strings=c(NA,"-"))
#Dimensions of dataframe
dim(tendulkar)
## [1] 347  13
names(tendulkar)
##  [1] "X"          "Runs"       "Mins"       "BF"         "X4s"
##  [6] "X6s"        "SR"         "Pos"        "Dismissal"  "Inns"
## [11] "Opposition" "Ground"     "Start.Date"

Use pandas.read_csv() for Python

#Python
import pandas as pd
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
print(tendulkar.columns)
## (347, 13)
## Index(['Unnamed: 0', 'Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos',
##        'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
##       dtype='object')

l. Clean the dataframe in R and Python

The following steps are done for R and Python

1. Remove rows with ‘DNB’
2. Remove rows with ‘TDNB’
3. Remove rows with ‘absent’
4. Remove the “*” indicating not out
5. Remove incomplete rows with NA for R or NaN in Python
6. Do a scatter plot

#R
# Remove rows with 'DNB'
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)
## [1] 330  13
# Remove rows with 'TDNB'
b <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[b,]
# Remove rows with absent
c <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 329  13
# Remove the "*" indicating not out
tendulkar$Runs <- as.numeric(gsub("\\*","",tendulkar$Runs))
dim(tendulkar)
## [1] 329  13
# Select only complete rows - complete.cases()
c <- complete.cases(tendulkar)
#Subset the rows which are complete
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 327  13
# Do some base plotting - Scatter plot
plot(tendulkar$BF,tendulkar$Runs)

#Python
import pandas as pd
import matplotlib.pyplot as plt
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
print(tendulkar.shape)
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
print(tendulkar.shape)
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar[c]
print(tendulkar.shape)
# Remove the "*" indicating not out (literal match, not a regex)
tendulkar.Runs= tendulkar.Runs.str.replace("*","",regex=False)
#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
print(tendulkar.shape)
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
#Scatter plot
plt.scatter(tendulkar.BF,tendulkar.Runs)
## (347, 13)
## (330, 13)
## (329, 13)
## (329, 13)
## (327, 13)

m.Chaining operations on dataframes

To chain a set of operations in R we need a package like dplyr, whereas Pandas supports chaining directly through its methods. The following operations are done on the tendulkar data frame, by dplyr for R and Pandas for Python, below

1. Group by ground
2. Compute average runs in each ground
3. Arrange in descending order
#R
library(dplyr)
tendulkar1 <- tendulkar %>% group_by(Ground) %>% summarise(meanRuns= mean(Runs)) %>%
arrange(desc(meanRuns))
head(tendulkar1,10)
## # A tibble: 10 × 2
##           Ground  meanRuns
##
## 1         Multan 194.00000
## 2          Leeds 193.00000
## 3  Colombo (RPS) 143.00000
## 4        Lucknow 142.00000
## 5          Dhaka 132.75000
## 6     Manchester  93.50000
## 7         Sydney  87.22222
## 8   Bloemfontein  85.00000
## 9     Georgetown  81.00000
## 10 Colombo (SSC)  77.55556
#Python
import pandas as pd
#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar[c]
# Remove the "*" indicating not out (literal match, not a regex)
tendulkar.Runs= tendulkar.Runs.str.replace("*","",regex=False)

#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
# Select the Runs column before taking the mean so that text columns are ignored
tendulkar1= tendulkar.groupby('Ground')['Runs'].mean().sort_values(ascending=False)
print(tendulkar1.head(10))
## (347, 13)
## Ground
## Multan           194.000000
## Leeds            193.000000
## Colombo (RPS)    143.000000
## Lucknow          142.000000
## Dhaka            132.750000
## Manchester        93.500000
## Sydney            87.222222
## Bloemfontein      85.000000
## Georgetown        81.000000
## Colombo (SSC)     77.555556
## Name: Runs, dtype: float64
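The cleaning and grouping steps above can also be written as one chained expression, which is closer in spirit to the dplyr pipeline. The sketch below is only an illustration: the `raw` data frame is a made-up miniature stand-in for tendulkar.csv with invented values, and it assumes a Pandas version that supports `regex=False` in `str.replace`.

```python
import pandas as pd

# Hypothetical miniature stand-in for tendulkar.csv (values invented for illustration)
raw = pd.DataFrame({
    "Runs":   ["15", "DNB", "100*", "41", "TDNB", "absent", "57"],
    "BF":     ["24", None, "150", "90", None, None, "80"],
    "Ground": ["Karachi", "Karachi", "Leeds", "Leeds", "Multan", "Multan", "Karachi"],
})

mean_runs = (
    raw[~raw["Runs"].isin(["DNB", "TDNB", "absent"])]    # drop rows where he did not bat
    .assign(Runs=lambda d: d["Runs"].str.replace("*", "", regex=False))  # strip not-out marker
    .dropna()                                             # keep complete rows only
    .astype({"Runs": int, "BF": int})                     # text -> integers
    .groupby("Ground")["Runs"]                            # group by ground
    .mean()                                               # average runs per ground
    .sort_values(ascending=False)                         # arrange in descending order
)
print(mean_runs)
```

Each method returns a new object, so the steps read top to bottom much like the `%>%` pipeline in the R version.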

10. Functions

#R
product <- function(a,b){
  c<- a*b
  c
}
product(5,7)
## [1] 35
#Python
def product(a,b):
    c = a*b
    return c

print(product(5,7))

## 35



Conclusion

Personally, I took to R much like a ‘duck takes to water’. I found the R syntax very simple and mostly intuitive. R packages like dplyr, ggplot2 and reshape2 make the language quite irresistible. R is dynamically typed and its everyday work revolves around a handful of basic types such as numeric and character, as opposed to the wider set of built-in data structures in Python.

Python has too many bells and whistles, which can be a little bewildering to the novice. It is possible that they may be useful as one becomes more experienced with the language. Also, I found that installing Python packages sometimes gives errors with Python versions 2.7 or 3.6. This will leave you scrambling to Google to find out how to fix these problems, which can be quite frustrating. R, on the other hand, makes installing R packages a breeze.

Anyway, this is my current opinion, and like all opinions, may change in the course of time. Let’s see!

I may write a follow up post with more advanced features of R and Python. So do keep checking! Long live R! Viva la Python!

Note: This post was created using RStudio’s RMarkdown which allows you to embed R and Python code snippets. It works perfectly, except that matplotlib’s pyplot does not display.

My 3 video presentations on “Essential R”

In this post I include my  3 video presentations on the topic “Essential R”. In these 3 presentations I cover the entire landscape of R. I cover the following

• R Language – The essentials
• Key R Packages (dplyr, lubridate, ggplot2, etc.)
• How to create R Markdown and share reports
• A look at Shiny apps
• How to create a simple R package

You can download the relevant slide deck and practice code at Essential R

Essential R – Part 1
This video covers basic R data types: character, numeric, vectors, matrices, lists and data frames. It also touches on how to subset these data types

Essential R – Part 2
This video continues with how to subset dataframes (the most important data type) and covers some important packages. It also presents one of the most important jobs of a data scientist, that of cleaning and shaping the data. This is done with an example unclean data frame. It also touches on some key operations of dplyr like select, filter, arrange, summarise and mutate. Other packages like lubridate and quantmod are also included. This presentation also shows how to use base plot and ggplot2

Essential R – Part 3
This final session covers R Markdown , and  touches on some of the key markdown elements. There is a brief overview of a simple Shiny app. Finally this presentation also shows the key steps to create an R package

These 3 R sessions cover most of the basic R topics that we tend to use in our day-to-day R way of life. With this you should be able to hit the ground running!

Hope you enjoy these video presentations and also hope you have an even greater time with R!

Check out my 2 books on cricket, a) Cricket analytics with cricketr b) Beaten by sheer pace – Cricket analytics with yorkr, now available in both paperback & kindle versions on Amazon!!! Pick up your copies today!


Natural language processing: What would Shakespeare say?

Here is a scene from Christopher Nolan’s classic movie Interstellar. In this scene Cooper, a crew member of the Endurance spaceship which is on its way to 3 distant planets via a wormhole, is conversing with TARS, a former US Marine Corps robot, some years in the future.

TARS (flippantly): “Everybody good? Plenty of slaves for my robot colony?”
TARS: [as Cooper repairs him] Settings. General settings. Security settings.
TARS: Honesty, new setting: ninety-five percent.
Cooper: Humor, seventy-five percent.
TARS: Confirmed. Self-destruct sequence in T minus 10, 9…
Cooper: Let’s make that sixty percent.
TARS: Sixty percent, confirmed. Knock knock.
Cooper: You want fifty-five?

Natural language has been an area of serious research for several decades, ever since Alan Turing in 1950 proposed a test in which a human evaluator would judge natural language conversations between another human and a machine designed to generate human-like responses, both behind closed doors. If the responses of the human and the machine were indistinguishable then we can say that the machine has passed the Turing test, signifying machine intelligence.

How cool would it be if we could converse with a machine using natural language, with all the subtleties of language including irony, sarcasm and humor? While considerable progress has been made in Natural Language Processing, e.g. Watson, Siri and Cortana, the ability to handle nuances like humor and sarcasm is probably many years away.

This post looks at one aspect of Natural Language Processing, particularly in dealing with the ability to predict the next word(s) given a word or phrase.

The title of this post should really be ‘Natural language Processing: What would Shakespeare say, and what would you say’ because this post includes two interactive apps that can predict the next word

a) The first app given a (Shakespearean) phrase will predict the most likely word that Shakespeare would have said
Try the Shiny app : What would Shakespeare have said?

b) The second app will, given a regular phrase  predict the next word(s)  in regular day to day English usage
Try the Shiny app: What would you say?

Checkout my book ‘Deep Learning from first principles- In vectorized Python, R and Octave’.  My book is available on Amazon  as paperback ($16.99) and in kindle version($6.65/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions.

Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. NLP encompasses many areas of computer science, besides inputs from the domains of linguistics, psychology, information theory, mathematics and statistics.

However NLP is a difficult domain as each language has its own quirkiness and ambiguities,  and English is no different. Let us take the following 2 sentences

Time flies like an arrow.
Fruit flies like a banana.

Clearly the words ‘flies like’ mean entirely different things in the 2 sentences. The English language is filled with many such ambiguous constructions.

There have been 2 main approaches to Natural Language Processing: the rationalist approach and the empiricist approach. The empiricists approached natural language as a data-driven problem based on statistics, while the rationalist school led by Noam Chomsky, the linguist, strongly believed that sentence structure should be analyzed at a deeper level than mere surface statistics.

In his book Syntactic Structures, Chomsky introduces a famous example in his criticism of finite-state probabilistic models. He cites 2 sentences (a) ‘colorless green ideas sleep furiously’ (b) ‘furiously sleep ideas green colorless’. Chomsky’s contention is that while neither sentence, nor any of its parts, has ever occurred in the past linguistic experience of English, it can be easily inferred that (a) is grammatical while (b) is not. Chomsky’s argument is that sentence structure is critical to Natural Language Processing of any kind. Here is a good post by Peter Norvig ‘On Chomsky and the two cultures of statistical learning’. In fact, from 1950 to the 1980s the empiricist approach fell out of favor while reasonable progress was made based on the rationalist approach to NLP.

The return of the empiricists
But thanks to great strides in processing power and the significant drop in hardware costs, the empiricist approach to Natural Language Processing made a comeback in the mid 1980s. The use of probabilistic language models combined with the increase in processing power saw the rise of the empiricists again. There had also been significant improvement in machine learning algorithms, which allowed the computing resources to be used more efficiently.

In this post I showcase 2 Shiny apps written in R that predict the next word given a phrase using  statistical approaches, belonging to the empiricist school of thought. The 1st one will try to predict what Shakespeare would have said  given a phrase (Shakespearean or otherwise)  and the 2nd is a regular app that will predict what we would say in our regular day to day conversation. These apps will predict the next word as you keep typing in each word.

In NLP the first step is to build a language model. In order to build a language model the program ingests a large corpus of documents. For the a) Shakespearean app, the corpus is the “Complete Works of Shakespeare”. This is also available in Free ebooks by Project Gutenberg, but you will have to do some cleaning and tokenizing before using it. For the b) regular English next word predicting app the corpus is composed of several hundred MBs of tweets, news items and blogs.

Once the corpus is ingested the software then creates an n-gram model. A 1-gram (unigram) model is a representation of all unique single words and their counts. Similarly a bigram model is a representation of all 2-word sequences found in the corpus and their counts. Similarly we can have trigram, quadgram and higher n-gram models as required. Typically language models don’t go beyond 5-grams, as the processing power needed increases for these larger n-gram models.
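To make the idea of an n-gram model concrete, here is a minimal Python sketch (the apps themselves were written in R, and this is not their implementation). The function name `ngram_counts` and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus, already cleaned and tokenized
tokens = "the quick brown fox jumped over the quick red fox".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(unigrams[("the",)])
print(bigrams[("the", "quick")])
print(len(bigrams), "unique bigrams")
```

A real corpus would need the cleaning and tokenizing mentioned earlier before this counting step.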

The probability of a sentence can be determined using the chain rule. This is shown below, where P(s) is the probability of a sentence ‘s’
$P(\text{The quick brown fox jumped}) = P(\text{The}) \, P(\text{quick}|\text{The}) \, P(\text{brown}|\text{The quick}) \, P(\text{fox}|\text{The quick brown}) \, P(\text{jumped}|\text{The quick brown fox})$
where P(The) is the probability of ‘The’ occurring at the beginning of the sentence (BOS) and

P(quick|The) is the probability of the word being ‘quick’ given that the previous word was ‘The’. These probabilities can be approximated using the Markov assumption, which allows us to compute the conditional probability
$P(w_{i}|w_{i-1})$

of a word based on only its immediately preceding word rather than the entire history. For a bigram model the approximation is
$P(w{_{i}}|w_{1}w_{2}w_{3}..w_{i-1}) \approx P(w{_{i}}|w_{i-1})$

The Maximum Likelihood Estimate (MLE) is given as follows for a bigram
$P_{MLE}(w_{i}|w_{i-1}) = count(w_{i-1},w_{i})/count(w_{i-1})$
$P_{MLE}(w_{i}|w_{i-1}) = c(w_{i-1},w_{i})/c(w_{i-1})$

Hence, for a given corpus, we can calculate the maximum likelihood estimate of a word given its previous word. This computation of the MLE can be extended to the trigram and the quadgram
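The bigram MLE can be computed directly from counts. The following is a small Python sketch on a toy corpus; `p_mle` and `sentence_probability` are hypothetical helper names, and the first word of a sentence is scored with its plain unigram frequency as a simplifying assumption.

```python
from collections import Counter

tokens = "the quick brown fox jumped over the quick red fox".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """P_MLE(word | prev) = c(prev, word) / c(prev); 0.0 if prev was never seen."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_probability(words):
    """Bigram approximation of the chain rule for a list of words."""
    prob = unigram_counts[words[0]] / len(tokens)   # simplification for the first word
    for prev, word in zip(words, words[1:]):
        prob *= p_mle(word, prev)
    return prob

print(p_mle("quick", "the"))     # 'the' is always followed by 'quick' here
print(p_mle("brown", "quick"))   # 'quick' is followed by 'brown' half the time
print(sentence_probability("the quick brown fox".split()))
```

Any bigram that never occurs in the corpus gets probability 0 under this estimate, which drives the whole sentence probability to 0.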

For a trigram
$P(w_{i}|w_{i-2}w_{i-1}) = c(w_{i-2}w_{i-1},w_{i})/c(w_{i-2}w_{i-1})$

Smoothing techniques
The MLE for many bigrams and trigrams will be 0, because we may not yet have seen certain combinations. But the fact that we have not seen these combinations in the corpus should not mean that they could never occur. So the MLE for the bigrams, trigrams etc. has to be smoothed so that it does not give a 0 conditional probability. One such method is ‘Laplace smoothing’, which takes some of the probability mass from the word combinations that occur in the corpus and redistributes it to the combinations that do not. This is the simplest smoothing technique; it is also known as ‘add +1’ smoothing and requires that 1 be added to all counts

So the MLE below
$P_{MLE}(w_{i}|w_{i-1}) = c(w_{i-1},w_{i})/c(w_{i-1})$

With the add +1 smoothing this becomes
$P_{Add1}(w_{i}|w_{i-1}) = (c(w_{i-1},w_{i})+1)/(c(w_{i-1})+V)$
where V is the number of unique words in the vocabulary.
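Here is a minimal Python sketch of add +1 smoothing for bigrams on a toy corpus. The helper name `p_add_one` is an assumption for illustration, and V is taken to be the number of unique words in the toy corpus.

```python
from collections import Counter

tokens = "the quick brown fox jumped over the quick red fox".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)   # vocabulary size: number of unique words

def p_add_one(word, prev):
    """Add +1 smoothed estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_add_one("brown", "quick"))   # seen bigram
print(p_add_one("fox", "quick"))     # unseen bigram, no longer 0
print(sum(p_add_one(w, "quick") for w in unigram_counts))   # sums to 1 over the vocabulary
```

The unseen bigram now gets a small non-zero probability, paid for by shaving some probability off the bigrams that were seen.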

This smoothing is done for bigrams, trigrams and quadgrams. Smoothing is usually used with an associated technique called ‘backoff’. If the phrase is not found in an n-gram model then we need to back off to an (n-1)-gram model. For example, a lookup will be done in the quadgrams; if the phrase is not found, the algorithm will back off to the trigrams, then the bigrams and finally to the unigrams.

Hence if we had the phrase
“on my way”

The smoothed MLE for a quadgram will be checked for the next word. If this is not found, we back off by searching the smoothed MLEs for trigrams for the phrase ‘my way’, and if this is not found either, we search the bigrams for the next word after ‘way’.
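The lookup order described above can be sketched in Python as follows. This is a deliberately simplified backoff that just picks the most frequent continuation from the longest matching context, with no discounting or smoothing, so it is not the Katz method and not the apps' implementation. The function names and the toy corpus are assumptions.

```python
from collections import Counter, defaultdict

def build_next_word_tables(tokens, max_n=4):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[tuple(context)][word] += 1
    return tables

def predict_next(phrase, tables, max_n=4):
    """Try the longest usable context first, then back off one word at a time."""
    words = phrase.lower().split()
    for size in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])
        if context in tables:
            return tables[context].most_common(1)[0][0]
    return None

corpus = ("i am on my way home she was on my way home "
          "there is no way out no way out no way out").split()
tables = build_next_word_tables(corpus)

print(predict_next("on my way", tables))       # found among the quadgrams
print(predict_next("lost my way", tables))     # backs off to the trigram 'my way'
print(predict_next("some other way", tables))  # backs off to the bigram 'way'
```

With an empty context the function falls through to the unigram counts and returns the most frequent word overall.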

One such method is the Katz backoff, which is based on the following approach
Bigrams with nonzero count r are discounted according to the discount ratio $d_{r}$, computed from the Good-Turing adjusted count $r^{*}$
$r^{*}=(r+1)n_{r+1}/n_{r}$
$d_{r} = r^{*}/r$
where $n_{r}$ is the number of bigrams that occur exactly r times.

The count mass subtracted from the nonzero counts is redistributed among the zero-count bigrams according to the next lower-order distribution (i.e. the unigram model)
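The discount ratios can be computed from the ‘count of counts’. The Python sketch below covers only this discounting step, not the full Katz backoff; the function name `good_turing_discounts` and the toy counts are assumptions, and counts r for which no n-gram occurs r+1 times are simply left without a discount.

```python
from collections import Counter

def good_turing_discounts(ngram_counts):
    """Return {r: d_r} with r* = (r + 1) * n_{r+1} / n_r and d_r = r* / r."""
    # n_r: number of distinct n-grams that occur exactly r times
    count_of_counts = Counter(ngram_counts.values())
    discounts = {}
    for r, n_r in count_of_counts.items():
        n_next = count_of_counts.get(r + 1, 0)
        if n_next == 0:
            continue   # r* is undefined here without further smoothing of n_r
        r_star = (r + 1) * n_next / n_r
        discounts[r] = r_star / r
    return discounts

# Toy bigram counts: six bigrams seen once, two seen twice, one seen three times
toy_counts = Counter({("w%d" % i, "x"): 1 for i in range(6)})
toy_counts.update({("y1", "z"): 2, ("y2", "z"): 2, ("q", "r"): 3})

discounts = good_turing_discounts(toy_counts)
print(discounts)
```

A ratio below 1 means that some count mass is taken away from bigrams with that count and becomes available for unseen bigrams.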

A better performance is obtained with the Kneser-Ney algorithm which computes the continuation probability of words. The Kneser-Ney algorithm is included below
$P_{\mathit{KN}}(w_i \mid w_{i-1}) = \dfrac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \lambda(w_{i-1}) \dfrac{\left| \{ w' : c(w', w_i) > 0 \} \right|}{\left| \{ (w_{j-1}, w_j) : c(w_{j-1},w_j) > 0\} \right|}$

where
$\lambda(w_{i-1}) = \dfrac{\delta}{c(w_{i-1})} \left| \{w' : c(w_{i-1}, w') > 0\} \right|$
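Below is a minimal Python sketch of interpolated Kneser-Ney for bigrams, following the formula above. It is an illustration under stated assumptions rather than the apps' implementation: the discount defaults to 0.75, the context total is taken as the sum of bigram counts that start with the previous word, and an unseen context falls back to the continuation probability alone.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, delta=0.75):
    """Return a function prob(word, prev) giving the Kneser-Ney bigram estimate."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()         # sum over w' of c(prev, w')
    followers = defaultdict(set)       # distinct words seen after prev
    predecessors = defaultdict(set)    # distinct words seen before word
    for (prev, word), count in bigram_counts.items():
        context_totals[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    n_bigram_types = len(bigram_counts)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts does word appear?
        p_continuation = len(predecessors.get(word, ())) / n_bigram_types
        total = context_totals.get(prev, 0)
        if total == 0:
            return p_continuation      # assumption: unseen context uses continuation only
        discounted = max(bigram_counts.get((prev, word), 0) - delta, 0) / total
        lam = delta * len(followers[prev]) / total
        return discounted + lam * p_continuation

    return prob

tokens = "the quick brown fox jumped over the quick red fox".split()
p_kn = kneser_ney_bigram(tokens)

print(p_kn("brown", "quick"))   # seen bigram
print(p_kn("fox", "quick"))     # unseen bigram, scored through the continuation term
print(sum(p_kn(w, "quick") for w in set(tokens)))   # sums to 1 over the vocabulary
```

The unseen bigram ‘quick fox’ gets a relatively high score here because ‘fox’ follows two different words in the toy corpus, which is exactly what the continuation probability rewards.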

This post was inspired by the final Capstone Project, in which I had to create a Shiny app for predicting the next word, as a part of the Data Science Specialization conducted by Johns Hopkins University, Bloomberg School of Public Health, at Coursera.

I further extended this concept to try to predict what Shakespeare would have said. For this I ingest the Complete Works of Shakespeare, which is the corpus. The add +1 smoothing with Katz backoff and the Kneser-Ney algorithm on the unigrams, bigrams, trigrams and quadgrams were then implemented.

Note: This post  in no way tries to belittle the genius of Shakespeare.  From the table below it can be seen that our day to day conversation has approximately 210K, 181K & 65K unique bigrams, trigrams and quadgrams. On the other hand, Shakespearean literature has 271K, 505K, & 517K bigrams, trigrams and quadgrams. It can be seen that Shakespeare had a rich and complex set of word combination.

Not surprisingly the computation of the conditional and continuation probabilities for the Shakespearean literature is orders of magnitude larger.
Here is a small table as comparison

This implementation was done entirely using R. The main R packages used for this implementation were tm, RWeka and dplyr. Here is a slide deck on the implementation details of the apps and key lessons learnt: PredictNextWord
Unfortunately I will not be able to include the implementation details as I am bound by The Coursera Honor Code.

If you have not already given the apps a try do give them a try
Try the Shiny apps
What would Shakespeare say?
What would you say?
