# Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR

“There’s something so paradoxical about pi. On the one hand, it represents order, as embodied by the shape of a circle, long held to be a symbol of perfection and eternity. On the other hand, pi is unruly, disheveled in appearance, its digits obeying no obvious rule, or at least none that we can perceive. Pi is elusive and mysterious, forever beyond reach. Its mix of order and disorder is what makes it so bewitching. ”

From  Infinite Powers by Steven Strogatz

Anybody who wants to be "anybody" in Big Data must necessarily be able to work on both large structured and unstructured data. Log analysis is critical in any enterprise, and log data is usually unstructured. As I mentioned in my previous post Big Data: On RDDs, Dataframes, Hive QL with Pyspark and SparkR-Part 3, RDDs are typically used to handle unstructured data. Spark provides the Dataframe abstraction over RDDs, which performs better since it is optimized by the Catalyst optimization engine. Nevertheless, it is important to be able to process data with RDDs. This post is a continuation of my 3 earlier posts on Big Data.

This post uses publicly available Web server logs from NASA. The logs are for the months of Jul 1995 and Aug 1995 and are a good place to start with unstructured text analysis/log analysis. I highly recommend parsing these publicly available logs with regular expressions yourself. It is only when you do this that the truth of Jamie Zawinski's pearl of wisdom

“Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.” – Jamie Zawinksi

hits home. I spent many hours struggling with regex!!

For the RDD part of this post, I referred to Dr. Fisseha Berhane's blog post Webserver Log Analysis, and for the Pyspark part, to the Univ. of California specialization Big Data Analysis with Apache Spark which I had done 3 years back. Once I had played around with the regex for RDDs and PySpark, I managed to get the SparkR and SparklyR versions to work.

The notebooks used in this post have been published and are available at

An essential and unavoidable aspect of Big Data processing is the need to process unstructured text. Web server logs are one such area which requires Big Data techniques to process massive amounts of logs. The Common Log Format, also known as the NCSA Common log format, is a standardized text file format used by web servers when generating server log files. Because the format is standardized, the files can be readily analyzed.

A publicly available set of webserver logs is the NASA-HTTP Web server logs. This is a good dataset with which we can play around to get familiar with handling web server logs. The logs can be accessed at NASA-HTTP

Description: These two traces contain two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida.

Format: The logs are an ASCII file with one line per request, with the following columns (a parsing sketch follows the list):

-host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.

-timestamp in the format “DAY MON DD HH:MM:SS YYYY”, where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.

-request given in quotes.
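For reference, here is how a single line in this format breaks up into fields. This is only an illustrative sketch: the sample line is taken from the NASA logs used later in this post, and the regex here is deliberately simpler than the rules developed below.

import re
# A sample NASA log line in the Common Log Format
line = 'in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839'
# Illustrative pattern: host, timestamp, request, status, content size
m = re.match(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)', line)
print(m.groups())
# ('in24.inetnebr.com', '01/Aug/1995:00:00:01 -0400', 'GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0', '200', '1839')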

## 1 Parse Web server logs with RDDs

### 1.1 Read NASA Web server logs

Read the logs files from NASA for the months Jul 95 and Aug 95

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("Spark-Logs-Handling").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

sqlcontext = SQLContext(sc)
rdd = sc.textFile("/FileStore/tables/NASA_access_log_*.gz")
rdd.count()
Out[1]: 3461613
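Since this raw RDD is re-parsed many times in the sections that follow, it may be worth caching it once. A minimal sketch (whether this actually helps depends on the memory available on the cluster):

# Cache the raw log RDD so that repeated parsing passes do not re-read the gzipped files
rdd.cache()
rdd.count()   # an action to materialize the cache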


### 1.2 Check content

Check the logs to identify the parsing rules required for the logs

i=0
for line in rdd.sample(withReplacement = False, fraction = 0.00001, seed = 100).collect():
    i=i+1
    print(line)
    if i >5:
        break

ix-stp-fl2-19.ix.netcom.com - - [03/Aug/1995:23:03:09 -0400] "GET /images/faq.gif HTTP/1.0" 200 263
slip183-1.kw.jp.ibm.net - - [04/Aug/1995:18:42:17 -0400] "GET /shuttle/missions/sts-70/images/DSC-95EC-0001.gif HTTP/1.0" 200 107133
piweba4y.prodigy.com - - [05/Aug/1995:19:17:41 -0400] "GET /icons/menu.xbm HTTP/1.0" 200 527
ruperts.bt-sys.bt.co.uk - - [07/Aug/1995:04:44:10 -0400] "GET /shuttle/countdown/video/livevideo2.gif HTTP/1.0" 200 69067
dal06-04.ppp.iadfw.net - - [07/Aug/1995:21:10:19 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
p15.ppp-1.directnet.com - - [10/Aug/1995:01:22:54 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204

### 1.3 Write the parsing rule for each of the fields

• host
• timestamp
• path
• status
• content_bytes

### 1.21 Get IP address/host name

This regex matches at the start of the log line and includes any non-whitespace characters

import re
rslt=(rdd.map(lambda line: re.search('\S+',line)
.group(0))
.take(3)) # Get the IP address \host name
rslt

Out[3]: ['in24.inetnebr.com', 'uplherc.upl.com', 'uplherc.upl.com']
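As a quick sanity check, the same host regex can be reused to count the distinct hosts. This is a small sketch, not part of the original notebook, and assumes every line begins with a host token:

# Count the number of distinct hosts/IP addresses
n_hosts = (rdd.map(lambda line: re.search('\S+',line).group(0))
              .distinct()
              .count())
n_hosts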

### 1.22 Get timestamp

Get the time stamp

rslt=(rdd.map(lambda line: re.search('(\S+ -\d{4})',line)
    .groups())
.take(3))  # Get the date
rslt

[('[01/Aug/1995:00:00:01 -0400',),
('[01/Aug/1995:00:00:07 -0400',),
('[01/Aug/1995:00:00:08 -0400',)]
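The captured string is still plain text (with the leading '['). If an actual timestamp object is needed, it can be converted with Python's datetime; a small sketch (note that %z parses the -0400 offset in Python 3):

from datetime import datetime
ts = '[01/Aug/1995:00:00:01 -0400'
dt = datetime.strptime(ts.lstrip('['), '%d/%b/%Y:%H:%M:%S %z')
print(dt)   # 1995-08-01 00:00:01-04:00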

### 1.23 HTTP request

Get the HTTP request sent to the Web server. The \w+ matches the HTTP method (e.g. GET)

# Get the REST call within " "
rslt=(rdd.map(lambda line: re.search('"\w+\s+([^\s]+)\s+HTTP.*"',line)
.groups())
.take(3)) # Get the REST call
rslt
[('/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',),
('/',),
('/images/ksclogo-medium.gif',)]

### 1.24 Get HTTP response status

Get the HTTP response to the request

rslt=(rdd.map(lambda line: re.search('"\s(\d{3})',line)
.groups())
.take(3)) #Get the status
rslt
Out[6]: [('200',), ('304',), ('304',)]

### 1.25 Get content size

Get the HTTP response in bytes

rslt=(rdd.map(lambda line: re.search('^.*\s(\d*)$',line)
    .groups())
.take(3)) # Get the content size
rslt

Out[7]: [('1839',), ('0',), ('0',)]

### 1.26 Putting it all together

Now put all the individual pieces together into 1 big regular expression and assign them to the groups

1. Host
2. Timestamp
3. Path
4. Status
5. Content_size

rslt=(rdd.map(lambda line: re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
.groups())
.take(3))
rslt

[('in24.inetnebr.com',
' -',
' ',
'-',
'[01/Aug/1995:00:00:01 -0400]',
'"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0"',
'/shuttle/missions/sts-68/news/sts-68-mcc-05.txt',
'200 1839',
'1839'),
('uplherc.upl.com',
' -',
' ',
'-',
'[01/Aug/1995:00:00:07 -0400]',
'"GET / HTTP/1.0"',
'/',
'304 0',
'0'),
('uplherc.upl.com',
' -',
' ',
'-',
'[01/Aug/1995:00:00:08 -0400]',
'"GET /images/ksclogo-medium.gif HTTP/1.0"',
'/images/ksclogo-medium.gif',
'304 0',
'0')]
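An equivalent regex with named groups makes the positional indices above easier to read. This is only an illustrative variant, not the rule used in the rest of the post:

log_pattern = re.compile(r'^(?P<host>\S+)\s+-\s+-\s+'
                         r'\[(?P<timestamp>[^\]]+)\]\s+'
                         r'"\w+\s+(?P<path>\S+)\s+HTTP[^"]*"\s+'
                         r'(?P<status>\d{3})\s+(?P<content_size>\S+)$')
m = log_pattern.search('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
m.groupdict()['host'], m.groupdict()['status'], m.groupdict()['content_size']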

### 1.27 Add a log parsing function

import re
def parse_log1(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
    if match is None:
        return(line,0)
    else:
        return(line,1)

### 1.28 Check for parsing failure

Check how many lines were successfully parsed with the parsing function

n_logs = rdd.count()
failed = rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1] == 0).count()
print('Out of a total of {} logs, {} failed to parse'.format(n_logs,failed))
# Get the failed records line[1] == 0
failed1=rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1]==0)
failed1.take(3)

Out of a total of 3461613 logs, 38768 failed to parse
Out[10]:
[('gw1.att.com - - [01/Aug/1995:00:03:53 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -', 0),
('js002.cc.utsunomiya-u.ac.jp - - [01/Aug/1995:00:07:33 -0400] "GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0" 404 -', 0),
('pipe1.nyc.pipeline.com - - [01/Aug/1995:00:12:37 -0400] "GET /history/apollo/apollo-13/apollo-13-patch-small.gif" 200 12859', 0)]

### 1.29 The above rule is not enough to parse the logs

It can be seen that the single rule only parses part of the logs, and we cannot map the regex groups for the lines that fail. The error "AttributeError: 'NoneType' object has no attribute 'group'" shows up when re.search() fails to match a line:

#rdd.map(lambda line: re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line[0]).group()).take(4)

File "/databricks/spark/python/pyspark/util.py", line 99, in wrapper
return f(*args, **kwargs)
File "<command-1348022240961444>", line 1, in <lambda>
AttributeError: 'NoneType' object has no attribute 'group'

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)

### 1.30 Add rule for parsing failed records

One of the issues with the earlier rule is that the content_size has "-" for some logs

import re
def parse_failed(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line)
    if match is None:
        return(line,0)
    else:
        return(line,1)

### 1.31 Parse records which fail

Parse the records that fail with the new rule

failed2=rdd.map(lambda line: parse_failed(line)).filter(lambda line: line[1]==1)
failed2.take(5)

Out[13]:
[('gw1.att.com - - [01/Aug/1995:00:03:53 -0400] "GET /shuttle/missions/sts-73/news HTTP/1.0" 302 -',
1),
('js002.cc.utsunomiya-u.ac.jp - - [01/Aug/1995:00:07:33 -0400] "GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0" 404 -',
1),
('tia1.eskimo.com - - [01/Aug/1995:00:28:41 -0400] "GET /pub/winvn/release.txt HTTP/1.0" 404 -',
1),
('itws.info.eng.niigata-u.ac.jp - - [01/Aug/1995:00:38:01 -0400] "GET /ksc.html/facts/about_ksc.html HTTP/1.0" 403 -',
1),
('grimnet23.idirect.com - - [01/Aug/1995:00:50:12 -0400] "GET /www/software/winvn/winvn.html HTTP/1.0" 404 -',
1)]

Add both rules for parsing the log.

Note: it can be shown that even with both rules not all the logs are parsed. Further rules may need to be added.

import re
def parse_log2(line):
    # Parse logs with the rule below
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
    # If the match failed then use the rule below
    if match is None:
        match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line)
    if match is None:
        return (line, 0) # Return 0 for failure
    else:
        return (line, 1) # Return 1 for success


### 1.32 Map the different regex matches to groups for handling

def map2groups(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
    if match is None:
        match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(-)$',line)
    return(match.groups())


### 1.33 Parse the logs and map the groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
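Before moving on, a quick check on how many records survive the two rules. A small sketch (caching parsed_rdd2 avoids re-parsing in the later sections):

parsed_rdd2.cache()
print('Successfully parsed records: {}'.format(parsed_rdd2.count()))
parsed_rdd2.take(1)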

## 2. Parse Web server logs with Pyspark

### 2.1 Read data into a Pyspark dataframe

import os
logs_file_path="/FileStore/tables/" + os.path.join('NASA_access_log_*.gz')

from pyspark.sql.functions import split, regexp_extract
# Read the raw log lines into a dataframe with a single 'value' column
# (assumed: the original notebook reads the files into base_df; the read step is not shown in the excerpt)
base_df = sqlcontext.read.text(logs_file_path)
#base_df.show(truncate=False)

split_df = base_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(5,truncate=False)

+---------------------+--------------------------+-----------------------------------------------+------+------------+
|host                 |timestamp                 |path                                           |status|content_size|
+---------------------+--------------------------+-----------------------------------------------+------+------------+
|199.72.81.55         |01/Jul/1995:00:00:01 -0400|/history/apollo/                               |200   |6245        |
|unicomp6.unicomp.net |01/Jul/1995:00:00:06 -0400|/shuttle/countdown/                            |200   |3985        |
|199.120.110.21       |01/Jul/1995:00:00:09 -0400|/shuttle/missions/sts-73/mission-sts-73.html   |200   |4085        |
|burger.letters.com   |01/Jul/1995:00:00:11 -0400|/shuttle/countdown/liftoff.html                |304   |0           |
|199.120.110.21       |01/Jul/1995:00:00:11 -0400|/shuttle/missions/sts-73/sts-73-patch-small.gif|200   |4179        |
+---------------------+--------------------------+-----------------------------------------------+------+------------+
only showing top 5 rows

### 2.2 Check data

bad_rows_df = split_df.filter(split_df['host'].isNull() |
                              split_df['timestamp'].isNull() |
                              split_df['path'].isNull() |
                              split_df['status'].isNull() |
                              split_df['content_size'].isNull())
bad_rows_df.count()

Out[20]: 33905

### 2.3 Check no of rows which do not have digits

We have already seen (with the RDDs) that the content_size field has '-' instead of digits for some records

bad_content_size_df = base_df.filter(~ base_df['value'].rlike(r'\d+$'))
bad_content_size_df.count()

Out[21]: 33905

To identify the rows that are bad, concatenate '*' to the value field where the content_size does not have digits. It can be seen that the content_size has '-' instead of a valid number

from pyspark.sql.functions import lit, concat
bad_content_size_df.select(concat(bad_content_size_df['value'], lit('*'))).show(4,truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------+
|concat(value, *)                                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|dd15-062.compuserve.com - - [01/Jul/1995:00:01:12 -0400] "GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0" 404 -*   |
|dynip42.efn.org - - [01/Jul/1995:00:02:14 -0400] "GET /software HTTP/1.0" 302 -*                                                                      |
|ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:02:40 -0400] "GET /software/winvn HTTP/1.0" 302 -*                                                       |
|ix-or10-06.ix.netcom.com - - [01/Jul/1995:00:03:24 -0400] "GET /software HTTP/1.0" 302 -*                                                             |
+----------------------------------------------------------------------------------------------------------------------------------------------------+

### 2.5 Fill NAs with 0s

# Replace all null content_size values with 0.

cleaned_df = split_df.na.fill({'content_size': 0})
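A quick check (sketch) that the fill worked, i.e. that no null content_size values remain:

# There should be no null content_size values left after the fill
cleaned_df.filter(cleaned_df['content_size'].isNull()).count()
# Expected: 0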

## 3. Webserver logs parsing with SparkR

library(SparkR)
library(stringr)
file_location = "/FileStore/tables/NASA_access_log_Jul95.gz"
file_location = "/FileStore/tables/NASA_access_log_Aug95.gz"

# Initiate a SparkR session
sparkR.session()
sc <- sparkR.session()
sqlContext <- sparkRSQL.init(sc)

# Read the raw log lines into a SparkR dataframe with a single 'value' column
# (assumed: the read step is not shown in this excerpt)
df <- read.text(file_location)
#df=SparkR::select(df, "value")
#m=regexp_extract(df$value,'\\\\S+',1)
a=df %>%
  withColumn('host', regexp_extract(df$value, '^(\\S+)', 1)) %>%
  withColumn('timestamp', regexp_extract(df$value, "((\\S+ -\\d{4}))", 2)) %>%
  withColumn('path', regexp_extract(df$value, '(\\"\\w+\\s+([^\\s]+)\\s+HTTP.*")', 2)) %>%
  # The remaining columns follow the same pattern (assumed; the original lines are truncated here)
  withColumn('status', regexp_extract(df$value, '(^.*"\\s+([^\\s]+))', 2)) %>%
  withColumn('content_size', regexp_extract(df$value, '(^.*\\s+(\\d+)$)', 2))



## 5 Hosts

### 5.1  RDD

#### 5.11 Parse and map the hosts to groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))

# Create tuples of (host,1), apply reduceByKey() and order by descending count
rslt=(parsed_rdd2.map(lambda x: (x[0],1))
.reduceByKey(lambda a,b:a+b)
.takeOrdered(10, lambda x: -x[1]))
rslt

Out[18]:
[('piweba3y.prodigy.com', 21988),
('piweba4y.prodigy.com', 16437),
('piweba1y.prodigy.com', 12825),
('edams.ksc.nasa.gov', 11962),
('163.206.89.4', 9697),
('news.ti.com', 8161),
('www-d1.proxy.aol.com', 8047),
('alyssa.prodigy.com', 8037),
('siltb10.orl.mmc.com', 7573),
('www-a2.proxy.aol.com', 7516)]

#### 5.12 Plot counts of hosts

import seaborn as sns

import pandas as pd
import matplotlib.pyplot as plt

df=pd.DataFrame(rslt,columns=['host','count'])
sns.barplot(x='host',y='count',data=df)
plt.subplots_adjust(bottom=0.6, right=0.8, top=0.9)
plt.xticks(rotation="vertical",fontsize=8)
display()

### 5.2 PySpark

#### 5.21 Compute counts of hosts

df= (cleaned_df
     .groupBy('host')
.count()
.orderBy('count',ascending=False))
df.show(5)

+——————–+—–+
| host|count|
+——————–+—–+
|piweba3y.prodigy….|21988|
|piweba4y.prodigy….|16437|
|piweba1y.prodigy….|12825|
| edams.ksc.nasa.gov |11964|
| 163.206.89.4 | 9697|
+——————–+—–+
only showing top 5 rows

### 5.22 Plot count of hosts

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df1=df.toPandas()
df2 = df1.head(10)   # plot the top 10 hosts (assumed; df2 is not defined in the original snippet)
df2.count()
sns.barplot(x='host',y='count',data=df2)
plt.xlabel("Hosts")
plt.ylabel('Count')
plt.xticks(rotation="vertical",fontsize=10)
display()

### 5.3 SparkR

#### 5.31 Compute count of hosts

c <- SparkR::select(a,a$host)
df=SparkR::summarize(SparkR::groupBy(c, a$host), noHosts = count(a$host))
df1 =head(arrange(df,desc(df$noHosts)),10)

host noHosts
1 piweba3y.prodigy.com   17572
2 piweba4y.prodigy.com   11591
3 piweba1y.prodigy.com    9868
4   alyssa.prodigy.com    7852
5  siltb10.orl.mmc.com    7573
6 piweba2y.prodigy.com    5922

### 5.32 Plot count of hosts

library(ggplot2)
p <-ggplot(data=df1, aes(x=host, y=noHosts,fill=host)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Host') + ylab('Count')
p



### 5.4 SparklyR

#### 5.41 Compute count of Hosts

# sdf is the SparklyR dataframe of parsed logs (its creation is not shown in this excerpt)
df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(host) %>% group_by(host) %>% summarise(noHosts=n()) %>% arrange(desc(noHosts))
df2 <- head(df1, 10)   # top 10 hosts (assumed; df2 is used in the plot below but not defined in the original)


### 5.42 Plot count of hosts

library(ggplot2)

p <-ggplot(data=df2, aes(x=host, y=noHosts,fill=host)) + geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Host') + ylab('Count')

p

## 6 Paths

### 6.11 Parse and map the requests to groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x: (x[5],1))
.reduceByKey(lambda a,b:a+b)
.takeOrdered(10, lambda x: -x[1]))
rslt

[('"GET /images/NASA-logosmall.gif HTTP/1.0"', 207520),
('"GET /images/KSC-logosmall.gif HTTP/1.0"', 164487),
('"GET /images/MOSAIC-logosmall.gif HTTP/1.0"', 126933),
('"GET /images/USA-logosmall.gif HTTP/1.0"', 126108),
('"GET /images/WORLD-logosmall.gif HTTP/1.0"', 124972),
('"GET /images/ksclogo-medium.gif HTTP/1.0"', 120704),
('"GET /ksc.html HTTP/1.0"', 83209),
('"GET /images/launch-logo.gif HTTP/1.0"', 75839),
('"GET /history/apollo/images/apollo-logo1.gif HTTP/1.0"', 68759),
('"GET /shuttle/countdown/ HTTP/1.0"', 64467)]
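Note that group 5 above is the full quoted request. The bare path sits in group 6 of map2groups(), so the counts can also be keyed on the path alone, which matches the PySpark results in the next section. A small sketch:

rslt_path=(parsed_rdd2.map(lambda x: (x[6],1))
.reduceByKey(lambda a,b:a+b)
.takeOrdered(10, lambda x: -x[1]))
rslt_path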

#### 6.12 Plot counts of HTTP Requests

import seaborn as sns

df=pd.DataFrame(rslt,columns=['path','count'])
sns.barplot(x='path',y='count',data=df)
plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9)
plt.xticks(rotation="vertical",fontsize=8)

display()

### 6.2 Pyspark

#### 6.21 Compute count of HTTP Requests

df= (cleaned_df
.groupBy('path')
.count()
.orderBy('count',ascending=False))
df.show(5)

Out[20]:
+——————–+——+
| path| count|
+——————–+——+
|/images/NASA-logo…|208362|
|/images/KSC-logos…|164813|
|/images/MOSAIC-lo…|127656|
|/images/USA-logos…|126820|
|/images/WORLD-log…|125676|
+——————–+——+
only showing top 5 rows

#### 6.22 Plot count of HTTP Requests

import matplotlib.pyplot as plt

import pandas as pd
import seaborn as sns
df1=df.toPandas()
df2 = df1.head(10)
df2.count()
sns.barplot(x='path',y='count',data=df2)

plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9)
plt.xlabel("HTTP Requests")
plt.ylabel('Count')
plt.xticks(rotation=90,fontsize=8)

display()

### 6.3 SparkR

library(SparkR)
c <- SparkR::select(a,a$path)
df=SparkR::summarize(SparkR::groupBy(c, a$path), numRequest = count(a$path))
df1=head(df)

#### 6.31 Plot count of HTTP Requests

library(ggplot2)
p <-ggplot(data=df1, aes(x=path, y=numRequest,fill=path)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ xlab('Path') + ylab('Count')
p

### 6.4 SparklyR

#### 6.41 Compute count of paths

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(path) %>% group_by(path) %>% summarise(noPaths=n()) %>% arrange(desc(noPaths))
df2 <-head(df1,10)
df2

# Source: spark [?? x 2]
# Ordered by: desc(noPaths)
   path                                    noPaths
 1 /images/NASA-logosmall.gif               208362
 2 /images/KSC-logosmall.gif                164813
 3 /images/MOSAIC-logosmall.gif             127656
 4 /images/USA-logosmall.gif                126820
 5 /images/WORLD-logosmall.gif              125676
 6 /images/ksclogo-medium.gif               121286
 7 /ksc.html                                 83685
 8 /images/launch-logo.gif                   75960
 9 /history/apollo/images/apollo-logo1.gif   68858
10 /shuttle/countdown/                       64695

#### 6.42 Plot count of Paths

library(ggplot2)
p <-ggplot(data=df2, aes(x=path, y=noPaths,fill=path)) + geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Path') + ylab('Count')
p

## 7 HTTP Status

### 7.1 RDD

#### 7.11 Compute count of HTTP Status

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x: (x[7],1))
.reduceByKey(lambda a,b:a+b)
.takeOrdered(10, lambda x: -x[1]))
rslt

Out[22]: [('200', 3095682), ('304', 266764), ('302', 72970), ('404', 20625), ('403', 225), ('500', 65), ('501', 41)]

#### 7.12 Plot counts of HTTP response status

import seaborn as sns
df=pd.DataFrame(rslt,columns=['status','count'])
sns.barplot(x='status',y='count',data=df)
plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9)
plt.xticks(rotation="vertical",fontsize=8)
display()

### 7.2 Pyspark

#### 7.21 Compute count of HTTP status

status_count=(cleaned_df
.groupBy('status')
.count()
.orderBy('count',ascending=False))
status_count.show()

+------+-------+
|status|  count|
+------+-------+
|   200|3100522|
|   304| 266773|
|   302|  73070|
|   404|  20901|
|   403|    225|
|   500|     65|
|   501|     41|
|   400|     15|
|  null|      1|
+------+-------+

#### 7.22 Plot count of HTTP status

Plot the HTTP return status vs the counts

df1=status_count.toPandas()
df2 = df1.head(10)
df2.count()
sns.barplot(x='status',y='count',data=df2)
plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9)
plt.xlabel("HTTP Status")
plt.ylabel('Count')
plt.xticks(rotation="vertical",fontsize=10)
display()

### 7.3 SparkR

#### 7.31 Compute count of HTTP Response status

library(SparkR)
c <- SparkR::select(a,a$status)
df=SparkR::summarize(SparkR::groupBy(c, a$status), numStatus = count(a$status))
df1=head(df)

#### 7.32 Plot count of HTTP Response status

library(ggplot2)
p <-ggplot(data=df1, aes(x=status, y=numStatus,fill=status)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Status') + ylab('Count')
p



### 7.4 SparklyR

#### 7.41 Compute count of status

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(status) %>% group_by(status) %>% summarise(noStatus=n()) %>% arrange(desc(noStatus))
df2 <- head(df1, 10)   # (assumed; df2 is displayed below but not defined in the original snippet)
df2

# Source: spark [?? x 2]
# Ordered by: desc(noStatus)
status noStatus

1 200     3100522
2 304      266773
3 302       73070
4 404       20901
5 403         225
6 500          65
7 501          41
8 400          15
9 ""            1

#### 7.42 Plot count of status

library(ggplot2)

p <-ggplot(data=df2, aes(x=status, y=noStatus,fill=status)) + geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Status') + ylab('Count')
p

## 8 Content size

### 8.1 RDD

#### 8.11 Compute count of content size

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x: (x[8],1))
.reduceByKey(lambda a,b:a+b)
.takeOrdered(10, lambda x: -x[1]))
rslt

Out[24]:
[('0', 280017),
('786', 167281),
('1204', 140505),
('363', 111575),
('234', 110824),
('669', 110056),
('5866', 107079),
('1713', 66904),
('1173', 63336),
('3635', 55528)]
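The same group can also be aggregated rather than counted. For example, the total and average number of bytes served can be computed as below (a sketch; records whose size is '-' or empty are excluded):

sizes = (parsed_rdd2.map(lambda x: x[8])
         .filter(lambda s: s.isdigit())
         .map(lambda s: int(s)))
total_bytes = sizes.sum()
print(total_bytes, total_bytes/sizes.count())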

#### 8.12 Plot content size

import seaborn as sns

df=pd.DataFrame(rslt,columns=['content_size','count'])
sns.barplot(x='content_size',y='count',data=df)
plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9)
plt.xticks(rotation="vertical",fontsize=8)
display()

### 8.2 Pyspark

#### 8.21 Compute count of content_size

size_counts=(cleaned_df
                .groupBy('content_size')
.count()
.orderBy('count',ascending=False))
size_counts.show(10)
+------------+------+
|content_size| count|
+------------+------+
|           0|313932|
|         786|167709|
|        1204|140668|
|         363|111835|
|         234|111086|
|         669|110313|
|        5866|107373|
|        1713| 66953|
|        1173| 63378|
|        3635| 55579|
+------------+------+
only showing top 10 rows
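Beyond the raw counts, summary statistics of the content size can be obtained directly from the cleaned dataframe (a small sketch):

cleaned_df.describe('content_size').show()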

#### 8.22 Plot counts of content size

Plot the content size versus the counts

df1=size_counts.toPandas()

df2 = df1.head(10)
df2.count()
sns.barplot(x='content_size',y='count',data=df2)
plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9)
plt.xlabel("content_size")
plt.ylabel('Count')
plt.xticks(rotation="vertical",fontsize=10)
display()

### 8.3 SparkR

#### 8.31 Compute count of content size

library(SparkR)
c <- SparkR::select(a,a$content_size)
df=SparkR::summarize(SparkR::groupBy(c, a$content_size), numContentSize = count(a$content_size))
df1=head(df)
df1

  content_size numContentSize
1        28426           1414
2        78382            293
3        60053              4
4        36067              2
5        13282            236
6        41785            174

#### 8.32 Plot count of content sizes

library(ggplot2)
p <-ggplot(data=df1, aes(x=content_size, y=numContentSize,fill=content_size)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Content Size') + ylab('Count')
p

### 8.4 SparklyR

#### 8.41 Compute count of content_size

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(content_size) %>% group_by(content_size) %>% summarise(noContentSize=n()) %>% arrange(desc(noContentSize))
df2 <-head(df1,10)
df2

# Source: spark [?? x 2]
# Ordered by: desc(noContentSize)
   content_size noContentSize
 1            0        280027
 2          786        167709
 3         1204        140668
 4          363        111835
 5          234        111086
 6          669        110313
 7         5866        107373
 8         1713         66953
 9         1173         63378
10         3635         55579

#### 8.42 Plot count of content_size

library(ggplot2)
p <-ggplot(data=df2, aes(x=content_size, y=noContentSize,fill=content_size)) + geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Content size') + ylab('Count')
p

Conclusion: I spent many, many hours struggling with regex and getting the RDD and Pyspark versions to work. I also had to spend a lot of time working out the syntax for SparkR and SparklyR for parsing. Once you have parsed the logs, the plotting and analysis is a piece of cake! This is definitely worth a try! Watch this space!!

To see all posts click Index of posts

# Revisiting World Bank data analysis with WDI and gVisMotionChart

Note: I had written a post about 3 years back on World Bank Data Analysis using World Development Indicators (WDI) & gVisMotionCharts. But the motion charts stopped working some time ago. I had always been wanting to fix this and I now got around to actually doing it. The issue was that 2 of the WDI indicators had changed. After I fixed this I was able to host the generated motion chart using github.io pages. Please make sure that you enable flash player if you open the motion charts with Google Chrome. You may also have to enable flash if using Firefox, IE etc.

Please check out the 2 motion charts with World Bank data. If you are using Chrome please enable (Allow) 'flash player' by clicking on the lock sign in the URL as shown.

## Introduction

Recently I was surfing the web when I came across a real cool post New R package to access World Bank data, by Markus Gesmann, on using googleVis and motion charts with World Bank Data. The post also introduced me to Hans Rosling, Professor at Sweden's Karolinska Institute. Hans Rosling, the creator of the famous Gapminder chart, the "Health and Wealth of Nations", displays global trends through animated charts (a must see!!!). As they say, in Hans Rosling's hands, data dances and sings. Take a look at his Ted talks, for e.g. Hans Rosling: New insights on poverty. Prof Rosling developed the breakthrough software behind the visualizations in Gapminder. The free software, which can be loaded with any data, was purchased by Google in March 2007. In this post, I recreate some of the Gapminder charts with the help of the R packages WDI and googleVis. The WDI package of Vincent Arel-Bundock provides a set of really useful functions to get data based on the World Bank Development Indicators. googleVis provides motion charts with which you can animate the data.
You can clone/download the code from Github at worldBankAnalysis which is in the form of an Rmd file.

library(WDI)
library(ggplot2)
library(googleVis)
library(plyr)

## 1. Get the data from 1960 to 2019 for the following

1. Population – SP.POP.TOTL
2. GDP in US$ – NY.GDP.MKTP.CD
3. Life Expectancy at birth (Years) – SP.DYN.LE00.IN
4. GDP Per capita income – NY.GDP.PCAP.PP.CD
5. Fertility rate (Births per woman) – SP.DYN.TFRT.IN
6. Poverty headcount ratio – SI.POV.NAHC
# World population total
population = WDI(indicator='SP.POP.TOTL', country="all",start=1960, end=2019)
# GDP in US $
gdp= WDI(indicator='NY.GDP.MKTP.CD', country="all",start=1960, end=2019)
# Life expectancy at birth (Years)
lifeExpectancy= WDI(indicator='SP.DYN.LE00.IN', country="all",start=1960, end=2019)
# GDP Per capita income
income = WDI(indicator='NY.GDP.PCAP.PP.CD', country="all",start=1960, end=2019)
# Fertility rate (births per woman)
fertility = WDI(indicator='SP.DYN.TFRT.IN', country="all",start=1960, end=2019)
# Poverty head count
poverty= WDI(indicator='SI.POV.NAHC', country="all",start=1960, end=2019)

## 2. Rename the columns

names(population)[3]="Total population"
names(lifeExpectancy)[3]="Life Expectancy (Years)"
names(gdp)[3]="GDP (US$)"
names(income)[3]="GDP per capita income"
names(fertility)[3]="Fertility (Births per woman)"
names(poverty)[3]="Poverty headcount ratio"

## 3.Join the data frames

Join the individual data frames to one large wide data frame with all the indicators for the countries

j1 <- join(population, gdp)

j2 <- join(j1,lifeExpectancy) 

j3 <- join(j2,income) 

j4 <- join(j3,poverty) 

wbData <- join(j4,fertility) 

## 4.Use WDI_data

Use WDI_data to get the list of indicators and the countries. Join the countries and region

#This returns  list of 2 matrixes
wdi_data =WDI_data
# The 1st matrix is the set of all World Bank Indicators
indicators=wdi_data[[1]]
# The 2nd  matrix gives the set of countries and regions
countries=wdi_data[[2]]
df = as.data.frame(countries)
aa <- df$region != "Aggregates"
# Remove the aggregates
countries_df <- df[aa,]
# Subset from the development data only those rows corresponding to the countries
bb = subset(wbData, country %in% countries_df$country)
cc = join(bb,countries_df)
dd = complete.cases(cc)
developmentDF = cc[dd,]

## 5.Create and display the motion chart

gg<- gvisMotionChart(cc,
idvar = "country",
timevar = "year",
xvar = "GDP",
yvar = "Life Expectancy",
sizevar ="Population",
colorvar = "region")
plot(gg)
cat(gg$html$chart, file="chart1.html")


Note: Unfortunately it is not possible to embed the motion chart in WordPress. It has to be hosted on a server as a Web page. After exploring several possibilities I came up with the following process to display the animation graph. The plot is saved as an html file using 'cat' as shown above. The WorldBank_chart1.html page is then hosted as a Github page (gh-page) on Github.

Here is the gvisMotionChart

Do give  World Bank Motion Chart1  a spin.  Here is how the Motion Chart has to be used

You can select Life Expectancy, Population, Fertility etc. by clicking the black arrows. The blue arrow shows the 'play' button to animate the motion chart. You can also select the countries and change the size of the circles. Do give it a try. Here are some quick analyses from playing around with the motion charts with different parameters chosen

The set of charts below are screenshots captured by running the motion chart World Bank Motion Chart1

a. Life Expectancy vs Fertility chart

This chart is used by Hans Rosling in his Ted talk. The left chart shows low life expectancy and high fertility rate for several sub-Saharan and East Asia Pacific countries in the early 1960's. Today fertility has dropped and life expectancy has increased overall. However the sub-Saharan countries still have a high fertility rate

b. Population vs GDP

The chart below shows that India and China have roughly the same GDP from 1973-1994, with the US and Japan well ahead.

From 1998- 2014 China really pulls away from India and Japan as seen below

c. Per capita income vs Life Expectancy

In the 1990's the per capita income and life expectancy of the sub-Saharan countries are low (42-50). Japan and the US have a good life expectancy in the 1990's. In 2014 the per capita income of the sub-Saharan countries is still low, though the life expectancy has marginally improved.

In the early 1990’s China had a higher poverty head count ratio than India. By 2004 China had this all figured out and the poverty head count ratio drops significantly. This can also be seen in the chart below.

In the chart above China shows a drastic reduction in poverty headcount ratio vs India. Strangely Zambia shows an increase in the poverty head count ratio.

## 6.Get the data for the 2nd set of indicators

1. Total population  – SP.POP.TOTL
2. GDP in US$ – NY.GDP.MKTP.CD
3. Access to electricity (% population) – EG.ELC.ACCS.ZS
4. Electricity consumption KWh per capita – EG.USE.ELEC.KH.PC
5. CO2 emissions – EN.ATM.CO2E.KT
6. Basic Sanitation Access – SH.STA.BASS.ZS

# World population
population = WDI(indicator='SP.POP.TOTL', country="all",start=1960, end=2016)
# GDP in US$
gdp= WDI(indicator='NY.GDP.MKTP.CD', country="all",start=1960, end=2016)
# Access to electricity (% population)
elecAccess= WDI(indicator='EG.ELC.ACCS.ZS', country="all",start=1960, end=2016)
# Electric power consumption Kwh per capita
elecConsumption= WDI(indicator='EG.USE.ELEC.KH.PC', country="all",start=1960, end=2016)
#CO2 emissions
co2Emissions= WDI(indicator='EN.ATM.CO2E.KT', country="all",start=1960, end=2016)
# Access to sanitation (% population)
sanitationAccess= WDI(indicator='SH.STA.ACSN', country="all",start=1960, end=2016)

## 7.Rename the columns

names(population)[3]="Total population"
names(gdp)[3]="GDP US($)"
names(elecAccess)[3]="Access to Electricity (% popn)"
names(elecConsumption)[3]="Electric power consumption (KWH per capita)"
names(co2Emissions)[3]="CO2 emisions"
names(sanitationAccess)[3]="Access to sanitation(% popn)"

## 8. Join the individual data frames

Join the individual data frames to one large wide data frame with all the indicators for the countries

j1 <- join(population, gdp)
j2 <- join(j1,elecAccess)
j3 <- join(j2,elecConsumption)
j4 <- join(j3,co2Emissions)
wbData1 <- join(j4,sanitationAccess)   # join with j4 so that the CO2 emissions column is retained

## 9. Use WDI_data

Use WDI_data to get the list of indicators and the countries. Join the countries and region

#This returns a list of 2 matrices
wdi_data =WDI_data
# The 1st matrix is the set of all World Bank Indicators
indicators=wdi_data[[1]]
# The 2nd matrix gives the set of countries and regions
countries=wdi_data[[2]]
df = as.data.frame(countries)
aa <- df$region != "Aggregates"
# Remove the aggregates
countries_df <- df[aa,]
# Subset from the development data only those corresponding to the countries
ee = subset(wbData1, country %in% countries_df$country) ff = join(ee,countries_df) ## Joining by: iso2c, country ## 10.Create and display the motion chart gg1<- gvisMotionChart(ff, idvar = "country", timevar = "year", xvar = "GDP", yvar = "Access to Electricity", sizevar ="Population", colorvar = "region") plot(gg1) cat(gg1$html$chart, file="chart2.html")  This is World Bank Motion Chart2 which has a different set of parameters like Access to Energy, CO2 emissions etc The set of charts below are screenshots of the motion chart World Bank Motion Chart 2 a. Access to Electricity vs Population The above chart shows that in China 100% population have access to electricity. India has made decent progress from 50% in 1990 to 79% in 2012. However Pakistan seems to have been much better in providing access to electricity. Pakistan moved from 59% to close 98% access to electricity b. Power consumption vs population The above chart shows the Power consumption vs Population. China and India have proportionally much lower consumption that Norway, US, Canada c. CO2 emissions vs Population In 1963 the CO2 emissions were fairly low and about comparable for all countries. US, India have shown a steady increase while China shows a steep increase. Interestingly UK shows a drop in CO2 emissions d. Access to sanitation India shows an improvement but it has a long way to go with only 40% of population with access to sanitation. China has made much better strides with 80% having access to sanitation in 2015. Strangely Nigeria shows a drop in sanitation by almost about 20% of population. The code is available at Github at worldBankAnalysis Conclusion: So there you have it. I have shown some screenshots of some sample parameters of the World indicators. Please try to play around with World Bank Motion Chart1 & World Bank Motion Chart 2 with your own set of parameters and countries. You can also create your own motion chart from the 100s of WDI indicators avaialable at World Bank Data indicator. To see all posts Index of posts # Analyzing performances of cricketers using cricketr template This post includes a template which you can use for analyzing the performances of cricketers, both batsmen and bowlers in Test, ODI and Twenty 20 cricket using my R package cricketr. To see actual usage of functions in the R package cricketr see Introducing cricketr! : An R package to analyze performances of cricketers. This template can be downloaded from Github at cricketer-template The ‘cricketr’ package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports all formats of the game including Test, ODI and Twenty20 versions. You should be able to install the package from GitHub and use the many functions available in the package. Please mindful of the ESPN Cricinfo Terms of Use Take a look at my short video tutorial on my R package cricketr on Youtube – R package cricketr – A short tutorial Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar # The cricketr package The cricketr package has several functions that perform several different analyses on both batsman and bowlers. The package has function that plot percentage frequency runs or wickets, runs likelihood for a batsman, relative run/strike rates of batsman and relative performance/economy rate for bowlers are available. 
Other interesting functions include batting performance moving average, forecast and a function to check whether the batsmans in in-form or out-of-form. The data for a particular player can be obtained with the getPlayerData() function. To do you will need to go to ESPN CricInfo Player and type in the name of the player for e.g Ricky Ponting, Sachin Tendulkar etc. This will bring up a page which have the profile number for the player e.g. for Sachin Tendulkar this would be http://www.espncricinfo.com/india/content/player/35320.html. Hence, Sachin’s profile is 35320. This can be used to get the data for Tendulkar as shown below The cricketr package is now available from CRAN!!! You should be able to install directly with ### 1. Install the cricketr package if (!require("cricketr")){ install.packages("cricketr",lib = "c:/test") } library(cricketr) The cricketr package includes some pre-packaged sample (.csv) files. You can use these sample to test functions as shown below # Retrieve the file path of a data file installed with cricketr #pathToFile <- system.file("data", "tendulkar.csv", package = "cricketr") #batsman4s(pathToFile, "Sachin Tendulkar") # The general format is pkg-function(pathToFile,par1,...) #batsman4s(<path-To-File>,"Sachin Tendulkar") “ The pre-packaged files can be accessed as shown above. To get the data of any player use the function in Test, ODI and Twenty20 use the following ### 2. For Test cricket #tendulkar <- getPlayerData(35320,dir="..",file="tendulkar.csv",type="batting",homeOrAway=c(1,2), result=c(1,2,4)) ### 2a. For ODI cricket #tendulkarOD <- getPlayerDataOD(35320,dir="..",file="tendulkarOD.csv",type="batting") ### 2b For Twenty 20 cricket #tendulkarT20 <- getPlayerDataTT(35320,dir="..",file="tendulkarT20.csv",type="batting") ## Analysis of batsmen Important Note This needs to be done only once for a player. This function stores the player’s data in a CSV file (for e.g. tendulkar.csv as above) which can then be reused for all other functions. Once we have the data for the players many analyses can be done. This post will use the stored CSV file obtained with a prior getPlayerData for all subsequent analyses ## Sachin Tendulkar’s performance – Basic Analyses The 3 plots below provide the following for Tendulkar 1. Frequency percentage of runs in each run range over the whole career 2. Mean Strike Rate for runs scored in the given range 3. A histogram of runs frequency percentages in runs ranges For example ### 3. Basic analyses par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #batsmanRunsFreqPerf("./tendulkar.csv","Tendulkar") #batsmanMeanStrikeRate("./tendulkar.csv","Tendulkar") #batsmanRunsRanges("./tendulkar.csv","Tendulkar") dev.off() ## null device ## 1 1. Player 1 2. Player 2 3. Player 3 4. Player 4 ### 4. 
More analyses par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #batsman4s("./player1.csv","Player1") #batsman6s("./player1.csv","Player1") #batsmanMeanStrikeRate("./player1.csv","Player1") # For ODI and T20 #batsmanScoringRateODTT("./player1.csv","Player1") dev.off() ## null device ## 1 par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #batsman4s("./player2.csv","Player2") #batsman6s("./player2.csv","Player2") #batsmanMeanStrikeRate("./player2.csv","Player2") # For ODI and T20 #batsmanScoringRateODTT("./player1.csv","Player1") dev.off() ## null device ## 1 par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #batsman4s("./player3.csv","Player3") #batsman6s("./player3.csv","Player3") #batsmanMeanStrikeRate("./player3.csv","Player3") # For ODI and T20 #batsmanScoringRateODTT("./player1.csv","Player1") dev.off() ## null device ## 1 par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #batsman4s("./player4.csv","Player4") #batsman6s("./player4.csv","Player4") #batsmanMeanStrikeRate("./player4.csv","Player4") # For ODI and T20 #batsmanScoringRateODTT("./player1.csv","Player1") dev.off() ## null device ## 1 Note: For mean strike rate in ODI and Twenty20 use the function batsmanScoringRateODTT() ### 5.Boxplot histogram plot This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency #batsmanPerfBoxHist("./player1.csv","Player1") #batsmanPerfBoxHist("./player2.csv","Player2") #batsmanPerfBoxHist("./player3.csv","Player3") #batsmanPerfBoxHist("./player4.csv","Player4") ### 6. Contribution to won and lost matches For the 2 functions below you will have to use the getPlayerDataSp() function. I have commented this as I already have these files. This function can only be used for Test matches #player1sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player1sp.csv",ttype="batting") #player2sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player2sp.csv",ttype="batting") #player3sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player3sp.csv",ttype="batting") #player4sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player4sp.csv",ttype="batting") par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanContributionWonLost("player1sp.csv","Player1") #batsmanContributionWonLost("player2sp.csv","Player2") #batsmanContributionWonLost("player3sp.csv","Player3") #batsmanContributionWonLost("player4sp.csv","Player4") dev.off() ## null device ## 1 ### 7, Performance at home and overseas This function also requires the use of getPlayerDataSp() as shown above. This can only be used for Test matches par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanPerfHomeAway("player1sp.csv","Player1") #batsmanPerfHomeAway("player2sp.csv","Player2") #batsmanPerfHomeAway("player3sp.csv","Player3") #batsmanPerfHomeAway("player4sp.csv","Player4") dev.off() ## null device ## 1 ### 8. Batsman average at different venues par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanAvgRunsGround("./player1.csv","Player1") #batsmanAvgRunsGround("./player2.csv","Player2") #batsmanAvgRunsGround("./player3.csv","Ponting") #batsmanAvgRunsGround("./player4.csv","Player4") dev.off() ## null device ## 1 ### 9. Batsman average against different opposition par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanAvgRunsOpposition("./player1.csv","Player1") #batsmanAvgRunsOpposition("./player2.csv","Player2") #batsmanAvgRunsOpposition("./player3.csv","Ponting") #batsmanAvgRunsOpposition("./player4.csv","Player4") dev.off() ## null device ## 1 ### 10. 
Runs Likelihood of batsman par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanRunsLikelihood("./player1.csv","Player1") #batsmanRunsLikelihood("./player2.csv","Player2") #batsmanRunsLikelihood("./player3.csv","Ponting") #batsmanRunsLikelihood("./player4.csv","Player4") dev.off() ## null device ## 1 ### 11. Moving Average of runs in career par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanMovingAverage("./player1.csv","Player1") #batsmanMovingAverage("./player2.csv","Player2") #batsmanMovingAverage("./player3.csv","Ponting") #batsmanMovingAverage("./player4.csv","Player4") dev.off() ## null device ## 1 ### 12. Cumulative Average runs of batsman in career par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanCumulativeAverageRuns("./player1.csv","Player1") #batsmanCumulativeAverageRuns("./player2.csv","Player2") #batsmanCumulativeAverageRuns("./player3.csv","Ponting") #batsmanCumulativeAverageRuns("./player4.csv","Player4") dev.off() ## null device ## 1 ### 13. Cumulative Average strike rate of batsman in career par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanCumulativeStrikeRate("./player1.csv","Player1") #batsmanCumulativeStrikeRate("./player2.csv","Player2") #batsmanCumulativeStrikeRate("./player3.csv","Ponting") #batsmanCumulativeStrikeRate("./player4.csv","Player4") dev.off() ## null device ## 1 ### 14. Future Runs forecast Here are plots that forecast how the batsman will perform in future. In this case 90% of the career runs trend is uses as the training set. the remaining 10% is the test set. A Holt-Winters forecating model is used to forecast future performance based on the 90% training set. The forecated runs trend is plotted. The test set is also plotted to see how close the forecast and the actual matches Take a look at the runs forecasted for the batsman below. par(mfrow=c(2,2)) par(mar=c(4,4,2,2)) #batsmanPerfForecast("./player1.csv","Player1") #batsmanPerfForecast("./player2.csv","Player2") #batsmanPerfForecast("./player3.csv","Player3") #batsmanPerfForecast("./player4.csv","Player4") dev.off() ## null device ## 1 ### 15. Relative Mean Strike Rate plot The plot below compares the Mean Strike Rate of the batsman for each of the runs ranges of 10 and plots them. The plot indicate the following frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv") names <- list("Player1","Player2","Player3","Player4") #relativeBatsmanSR(frames,names) ### 16. Relative Runs Frequency plot The plot below gives the relative Runs Frequency Percetages for each 10 run bucket. The plot below show frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv") names <- list("Player1","Player2","Player3","Player4") #relativeRunsFreqPerf(frames,names) ### 17. Relative cumulative average runs in career frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv") names <- list("Player1","Player2","Player3","Player4") #relativeBatsmanCumulativeAvgRuns(frames,names) ### 18. Relative cumulative average strike rate in career frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv") names <- list("Player1","Player2","Player3","player4") #relativeBatsmanCumulativeStrikeRate(frames,names) ### 19. Check Batsman In-Form or Out-of-Form The below computation uses Null Hypothesis testing and p-value to determine if the batsman is in-form or out-of-form. For this 90% of the career runs is chosen as the population and the mean computed. The last 10% is chosen to be the sample set and the sample Mean and the sample Standard Deviation are caculated. 
The Null Hypothesis (H0) assumes that the batsman continues to stay in-form where the sample mean is within 95% confidence interval of population mean The Alternative (Ha) assumes that the batsman is out of form the sample mean is beyond the 95% confidence interval of the population mean. A significance value of 0.05 is chosen and p-value us computed If p-value >= .05 – Batsman In-Form If p-value < 0.05 – Batsman Out-of-Form Note Ideally the p-value should be done for a population that follows the Normal Distribution. But the runs population is usually left skewed. So some correction may be needed. I will revisit this later This is done for the Top 4 batsman #checkBatsmanInForm("./player1.csv","Player1") #checkBatsmanInForm("./player2.csv","Player2") #checkBatsmanInForm("./player3.csv","Player3") #checkBatsmanInForm("./player4.csv","Player4") ### 20. 3D plot of Runs vs Balls Faced and Minutes at Crease The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted par(mfrow=c(1,2)) par(mar=c(4,4,2,2)) #battingPerf3d("./player1.csv","Player1") #battingPerf3d("./player2.csv","Player2") par(mfrow=c(1,2)) par(mar=c(4,4,2,2)) #battingPerf3d("./player3.csv","Player3") #battingPerf3d("./player4.csv","player4") dev.off() ## null device ## 1 ### 21. Predicting Runs given Balls Faced and Minutes at Crease A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease. BF <- seq( 10, 400,length=15) Mins <- seq(30,600,length=15) newDF <- data.frame(BF,Mins) #Player1 <- batsmanRunsPredict("./player1.csv","Player1",newdataframe=newDF) #Player2 <- batsmanRunsPredict("./player2.csv","Player2",newdataframe=newDF) #ponting <- batsmanRunsPredict("./player3.csv","Player3",newdataframe=newDF) #sangakkara <- batsmanRunsPredict("./player4.csv","Player4",newdataframe=newDF) #batsmen <-cbind(round(Player1$Runs),round(Player2$Runs),round(Player3$Runs),round(Player4$Runs)) #colnames(batsmen) <- c("Player1","Player2","Player3","Player4") #newDF <- data.frame(round(newDF$BF),round(newDF$Mins)) #colnames(newDF) <- c("BallsFaced","MinsAtCrease") #predictedRuns <- cbind(newDF,batsmen) #predictedRuns ## Analysis of bowlers 1. Bowler1 2. Bowler2 3. Bowler3 4. Bowler4 player1 <- getPlayerData(xxxx,dir=“..”,file=“player1.csv”,type=“bowling”) Note For One day you will have to use getPlayerDataOD() and for Twenty20 it is getPlayerDataTT() ### 21. Wicket Frequency Plot This plot below computes the percentage frequency of number of wickets taken for e.g 1 wicket x%, 2 wickets y% etc and plots them as a continuous line par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #bowlerWktsFreqPercent("./bowler1.csv","Bowler1") #bowlerWktsFreqPercent("./bowler2.csv","Bowler2") #bowlerWktsFreqPercent("./bowler3.csv","Bowler3") dev.off() ## null device ## 1 ### 22. Wickets Runs plot par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #bowlerWktsRunsPlot("./bowler1.csv","Bowler1") #bowlerWktsRunsPlot("./bowler2.csv","Bowler2") #bowlerWktsRunsPlot("./bowler3.csv","Bowler3") dev.off() ## null device ## 1 ### 23. Average wickets at different venues #bowlerAvgWktsGround("./bowler3.csv","Bowler3") ### 24. Average wickets against different opposition #bowlerAvgWktsOpposition("./bowler3.csv","Bowler3") ### 25. Wickets taken moving average par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #bowlerMovingAverage("./bowler1.csv","Bowler1") #bowlerMovingAverage("./bowler2.csv","Bowler2") #bowlerMovingAverage("./bowler3.csv","Bowler3") dev.off() ## null device ## 1 ### 26. 
Cumulative Wickets taken par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #bowlerCumulativeAvgWickets("./bowler1.csv","Bowler1") #bowlerCumulativeAvgWickets("./bowler2.csv","Bowler2") #bowlerCumulativeAvgWickets("./bowler3.csv","Bowler3") dev.off() ## null device ## 1 ### 27. Cumulative Economy rate par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #bowlerCumulativeAvgEconRate("./bowler1.csv","Bowler1") #bowlerCumulativeAvgEconRate("./bowler2.csv","Bowler2") #bowlerCumulativeAvgEconRate("./bowler3.csv","Bowler3") dev.off() ## null device ## 1 ### 28. Future Wickets forecast Here are plots that forecast how the bowler will perform in future. In this case 90% of the career wickets trend is used as the training set. the remaining 10% is the test set. A Holt-Winters forecating model is used to forecast future performance based on the 90% training set. The forecated wickets trend is plotted. The test set is also plotted to see how close the forecast and the actual matches par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #bowlerPerfForecast("./bowler1.csv","Bowler1") #bowlerPerfForecast("./bowler2.csv","Bowler2") #bowlerPerfForecast("./bowler3.csv","Bowler3") dev.off() ## null device ## 1 ### 29. Contribution to matches won and lost As discussed above the next 2 charts require the use of getPlayerDataSp(). This can only be done for Test matches #bowler1sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler1sp.csv",ttype="bowling") #bowler2sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler2sp.csv",ttype="bowling") #bowler3sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler3sp.csv",ttype="bowling") par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #bowlerContributionWonLost("bowler1sp","Bowler1") #bowlerContributionWonLost("bowler2sp","Bowler2") #bowlerContributionWonLost("bowler3sp","Bowler3") dev.off() ## null device ## 1 ### 30. Performance home and overseas. This can only be done for Test matches par(mfrow=c(1,3)) par(mar=c(4,4,2,2)) #bowlerPerfHomeAway("bowler1sp","Bowler1") #bowlerPerfHomeAway("bowler2sp","Bowler2") #bowlerPerfHomeAway("bowler3sp","Bowler3") dev.off() ## null device ## 1 ### 31 Relative Wickets Frequency Percentage frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv") names <- list("Bowler1","Bowler3","Bowler2") #relativeBowlingPerf(frames,names) ### 32 Relative Economy Rate against wickets taken frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv") names <- list("Bowler1","Bowler3","Bowler2") #relativeBowlingER(frames,names) ### 33 Relative cumulative average wickets of bowlers in career frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv") names <- list("Bowler1","Bowler3","Bowler2") #relativeBowlerCumulativeAvgWickets(frames,names) ### 34 Relative cumulative average economy rate of bowlers frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv") names <- list("Bowler1","Bowler3","Bowler2") #relativeBowlerCumulativeAvgEconRate(frames,names) ### 35 Check for bowler in-form/out-of-form The below computation uses Null Hypothesis testing and p-value to determine if the bowler is in-form or out-of-form. For this 90% of the career wickets is chosen as the population and the mean computed. The last 10% is chosen to be the sample set and the sample Mean and the sample Standard Deviation are caculated. The Null Hypothesis (H0) assumes that the bowler continues to stay in-form where the sample mean is within 95% confidence interval of population mean The Alternative (Ha) assumes that the bowler is out of form the sample mean is beyond the 95% confidence interval of the population mean. 
A significance value of 0.05 is chosen and p-value us computed If p-value >= .05 – Batsman In-Form If p-value < 0.05 – Batsman Out-of-Form Note Ideally the p-value should be done for a population that follows the Normal Distribution. But the runs population is usually left skewed. So some correction may be needed. I will revisit this later Note: The check for the form status of the bowlers indicate #checkBowlerInForm("./bowler1.csv","Bowler1") #checkBowlerInForm("./bowler2.csv","Bowler2") #checkBowlerInForm("./bowler3.csv","Bowler3") dev.off() ## null device ## 1 ## Key Findings ### Analysis of batsman ### Analysis of bowlers # Big Data-2: Move into the big league:Graduate from R to SparkR This post is a continuation of my earlier post Big Data-1: Move into the big league:Graduate from Python to Pyspark. While the earlier post discussed parallel constructs in Python and Pyspark, this post elaborates similar and key constructs in R and SparkR. While this post just focuses on the programming part of R and SparkR it is essential to understand and fully grasp the concept of Spark, RDD and how data is distributed across the clusters. This post like the earlier post shows how if you already have a good handle of R, you can easily graduate to Big Data with SparkR Note 1: This notebook has also been published at Databricks community site Big Data-2: Move into the big league:Graduate from R to SparkR Note 2: You can download this RMarkdown file from Github at Big Data- Python to Pyspark and R to SparkR 1a. Read CSV- R Note: To upload the CSV to databricks see the video Upload Flat File to Databricks Table # Read CSV file tendulkar= read.csv("/dbfs/FileStore/tables/tendulkar.csv",stringsAsFactors = FALSE,na.strings=c(NA,"-")) #Check the dimensions of the dataframe dim(tendulkar)  [1] 347 12 1b. Read CSV – SparkR # Load the SparkR library library(SparkR) # Initiate a SparkR session sparkR.session() tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", header = "true", delimiter = ",", source = "csv", inferSchema = "true", na.strings = "") # Check the dimensions of the dataframe dim(tendulkar1)  [1] 347 12 2a. Data frame shape – R # Get the shape of the dataframe in R dim(tendulkar)  [1] 347 12 2b. Dataframe shape – SparkR The same ‘dim’ command works in SparkR too! dim(tendulkar1)  [1] 347 12 3a . Dataframe columns – R # Get the names names(tendulkar) # Also colnames(tendulkar)   [1] "Runs" "Mins" "BF" "X4s" "X6s" [6] "SR" "Pos" "Dismissal" "Inns" "Opposition" [11] "Ground" "Start.Date" 3b. Dataframe columns – SparkR names(tendulkar1)   [1] "Runs" "Mins" "BF" "4s" "6s" [6] "SR" "Pos" "Dismissal" "Inns" "Opposition" [11] "Ground" "Start Date" 4a. Rename columns – R names(tendulkar)=c('Runs','Minutes','BallsFaced','Fours','Sixes','StrikeRate','Position','Dismissal','Innings','Opposition','Ground','StartDate') names(tendulkar)   [1] "Runs" "Minutes" "BallsFaced" "Fours" "Sixes" [6] "StrikeRate" "Position" "Dismissal" "Innings" "Opposition" [11] "Ground" "StartDate" 4b. Rename columns – SparkR names(tendulkar1)=c('Runs','Minutes','BallsFaced','Fours','Sixes','StrikeRate','Position','Dismissal','Innings','Opposition','Ground','StartDate') names(tendulkar1)   [1] "Runs" "Minutes" "BallsFaced" "Fours" "Sixes" [6] "StrikeRate" "Position" "Dismissal" "Innings" "Opposition" [11] "Ground" "StartDate" 5a. Summary – R summary(tendulkar)   Runs Minutes BallsFaced Fours Length:347 Min. : 1.0 Min. : 0.00 Min. 
      Runs              Minutes         BallsFaced         Fours
 Length:347         Min.   :  1.0   Min.   :  0.00   Min.   : 0.000
 Class :character   1st Qu.: 33.0   1st Qu.: 22.00   1st Qu.: 1.000
 Mode  :character   Median : 82.0   Median : 58.50   Median : 4.000
                    Mean   :125.5   Mean   : 89.75   Mean   : 6.274
                    3rd Qu.:181.0   3rd Qu.:133.25   3rd Qu.: 9.000
                    Max.   :613.0   Max.   :436.00   Max.   :35.000
                    NA's   :18      NA's   :19       NA's   :19
     Sixes          StrikeRate        Position     Dismissal
 Min.   :0.0000   Min.   :  0.00   Min.   :2.00   Length:347
 1st Qu.:0.0000   1st Qu.: 38.09   1st Qu.:4.00   Class :character
 Median :0.0000   Median : 52.25   Median :4.00   Mode  :character
 Mean   :0.2097   Mean   : 51.79   Mean   :4.24
 3rd Qu.:0.0000   3rd Qu.: 65.09   3rd Qu.:4.00
 Max.   :4.0000   Max.   :166.66   Max.   :7.00
 NA's   :18       NA's   :20       NA's   :18
    Innings       Opposition           Ground           StartDate
 Min.   :1.000   Length:347         Length:347         Length:347
 1st Qu.:1.000   Class :character   Class :character   Class :character
 Median :2.000   Mode  :character   Mode  :character   Mode  :character
 Mean   :2.376
 3rd Qu.:3.000
 Max.   :4.000
 NA's   :1

### 5b. Summary – SparkR

summary(tendulkar1)
SparkDataFrame[summary:string, Runs:string, Minutes:string, BallsFaced:string, Fours:string, Sixes:string, StrikeRate:string, Position:string, Dismissal:string, Innings:string, Opposition:string, Ground:string, StartDate:string]

### 6a. Displaying details of dataframe with str() – R

str(tendulkar)
'data.frame':   347 obs. of  12 variables:
 $ Runs      : chr  "15" "DNB" "59" "8" ...
 $ Minutes   : int  28 NA 254 24 124 74 193 1 50 324 ...
 $ BallsFaced: int  24 NA 172 16 90 51 134 1 44 266 ...
 $ Fours     : int  2 NA 4 1 5 5 6 0 3 5 ...
 $ Sixes     : int  0 NA 0 0 0 0 0 0 0 0 ...
 $ StrikeRate: num  62.5 NA 34.3 50 45.5 ...
 $ Position  : int  6 NA 6 6 7 6 6 6 6 6 ...
 $ Dismissal : chr  "bowled" NA "lbw" "run out" ...
 $ Innings   : int  2 4 1 3 1 1 3 2 3 1 ...
 $ Opposition: chr  "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" ...
 $ Ground    : chr  "Karachi" "Karachi" "Faisalabad" "Faisalabad" ...
 $ StartDate : chr  "15-Nov-89" "15-Nov-89" "23-Nov-89" "23-Nov-89" ...

### 6b. Displaying details of dataframe with str() – SparkR

str(tendulkar1)
'SparkDataFrame': 12 variables:
 $ Runs      : chr "15" "DNB" "59" "8" "41" "35"
 $ Minutes   : chr "28" "-" "254" "24" "124" "74"
 $ BallsFaced: chr "24" "-" "172" "16" "90" "51"
 $ Fours     : chr "2" "-" "4" "1" "5" "5"
 $ Sixes     : chr "0" "-" "0" "0" "0" "0"
 $ StrikeRate: chr "62.5" "-" "34.3" "50" "45.55" "68.62"
 $ Position  : chr "6" "-" "6" "6" "7" "6"
 $ Dismissal : chr "bowled" "-" "lbw" "run out" "bowled" "lbw"
 $ Innings   : chr "2" "4" "1" "3" "1" "1"
 $ Opposition: chr "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan"
 $ Ground    : chr "Karachi" "Karachi" "Faisalabad" "Faisalabad" "Lahore" "Sialkot"
 $ StartDate : chr "15-Nov-89" "15-Nov-89" "23-Nov-89" "23-Nov-89" "1-Dec-89" "9-Dec-89"

### 7a. Head & tail – R

print(head(tendulkar),3)
print(tail(tendulkar),3)
  Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1   15      28         24     2     0      62.50        6    bowled       2
2  DNB      NA         NA    NA    NA         NA       NA                 4
3   59     254        172     4     0      34.30        6       lbw       1
4    8      24         16     1     0      50.00        6   run out       3
5   41     124         90     5     0      45.55        7    bowled       1
6   35      74         51     5     0      68.62        6       lbw       1
  Opposition     Ground StartDate
1 v Pakistan    Karachi 15-Nov-89
2 v Pakistan    Karachi 15-Nov-89
3 v Pakistan Faisalabad 23-Nov-89
4 v Pakistan Faisalabad 23-Nov-89
5 v Pakistan     Lahore  1-Dec-89
6 v Pakistan    Sialkot  9-Dec-89
    Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
342   37     125         81     5     0      45.67        4    caught       2
343   21      71         23     2     0      91.30        4   run out       4
344   32      99         53     5     0      60.37        4       lbw       2
345    1       8          5     0     0      20.00        4       lbw       4
346   10      41         24     2     0      41.66        4       lbw       2
347   74     150        118    12     0      62.71        4    caught       2
       Opposition  Ground StartDate
342   v Australia  Mohali 14-Mar-13
343   v Australia  Mohali 14-Mar-13
344   v Australia   Delhi 22-Mar-13
345   v Australia   Delhi 22-Mar-13
346 v West Indies Kolkata  6-Nov-13
347 v West Indies  Mumbai 14-Nov-13

### 7b. Head – SparkR

head(tendulkar1,3)
  Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1   15      28         24     2     0       62.5        6    bowled       2
2  DNB       -          -     -     -          -        -         -       4
3   59     254        172     4     0       34.3        6       lbw       1
  Opposition     Ground StartDate
1 v Pakistan    Karachi 15-Nov-89
2 v Pakistan    Karachi 15-Nov-89
3 v Pakistan Faisalabad 23-Nov-89

### 8a. Determining the column types with sapply – R

sapply(tendulkar,class)
       Runs     Minutes  BallsFaced       Fours       Sixes  StrikeRate
"character"   "integer"   "integer"   "integer"   "integer"   "numeric"
   Position   Dismissal     Innings  Opposition      Ground   StartDate
  "integer" "character"   "integer" "character" "character" "character"

### 8b. Determining the column types with printSchema – SparkR

printSchema(tendulkar1)
root
 |-- Runs: string (nullable = true)
 |-- Minutes: string (nullable = true)
 |-- BallsFaced: string (nullable = true)
 |-- Fours: string (nullable = true)
 |-- Sixes: string (nullable = true)
 |-- StrikeRate: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Dismissal: string (nullable = true)
 |-- Innings: string (nullable = true)
 |-- Opposition: string (nullable = true)
 |-- Ground: string (nullable = true)
 |-- StartDate: string (nullable = true)

### 9a. Selecting columns – R

library(dplyr)
df=select(tendulkar,Runs,BallsFaced,Minutes)
head(df,5)
  Runs BallsFaced Minutes
1   15         24      28
2  DNB         NA      NA
3   59        172     254
4    8         16      24
5   41         90     124

### 9b. Selecting columns – SparkR

library(SparkR)
Sys.setenv(SPARK_HOME="/usr/hdp/2.6.0.3-8/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# Initiate a SparkR session
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv",
                header = "true",
                delimiter = ",",
                source = "csv",
                inferSchema = "true",
                na.strings = "")
df=SparkR::select(tendulkar1, "Runs", "BF","Mins")
head(SparkR::collect(df))
  Runs  BF Mins
1   15  24   28
2  DNB   -    -
3   59 172  254
4    8  16   24
5   41  90  124
6   35  51   74

### 10a. Filter rows by criteria – R

library(dplyr)
df=tendulkar %>% filter(Runs > 50)
head(df,5)
  Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1  DNB      NA         NA    NA    NA         NA       NA                 4
2   59     254        172     4     0      34.30        6       lbw       1
3    8      24         16     1     0      50.00        6   run out       3
4   57     193        134     6     0      42.53        6    caught       3
5   88     324        266     5     0      33.08        6    caught       1
     Opposition     Ground StartDate
1    v Pakistan    Karachi 15-Nov-89
2    v Pakistan Faisalabad 23-Nov-89
3    v Pakistan Faisalabad 23-Nov-89
4    v Pakistan    Sialkot  9-Dec-89
5 v New Zealand     Napier  9-Feb-90

### 10b. Filter rows by criteria – SparkR

df=SparkR::filter(tendulkar1, tendulkar1$Runs > 50)

  Runs Mins  BF 4s 6s    SR Pos Dismissal Inns     Opposition       Ground
1   59  254 172  4  0  34.3   6       lbw    1     v Pakistan   Faisalabad
2   57  193 134  6  0 42.53   6    caught    3     v Pakistan      Sialkot
3   88  324 266  5  0 33.08   6    caught    1  v New Zealand       Napier
4   68  216 136  8  0    50   6    caught    2      v England   Manchester
5  114  228 161 16  0  70.8   4    caught    2    v Australia        Perth
6  111  373 270 19  0 41.11   4    caught    2 v South Africa Johannesburg
Start Date
1  23-Nov-89
2   9-Dec-89
3   9-Feb-90
4   9-Aug-90
5   1-Feb-92
6  26-Nov-92
### 11a. Unique values – R
unique(tendulkar$Runs)
  [1] "15"   "DNB"  "59"   "8"    "41"   "35"   "57"   "0"    "24"   "88"
 [11] "5"    "10"   "27"   "68"   "119*" "21"   "11"   "16"   "7"    "40"
 [21] "148*" "6"    "17"   "114"  "111"  "1"    "73"   "50"   "9*"   "165"
 [31] "78"   "62"   "TDNB" "28"   "104*" "71"   "142"  "96"   "43"   "11*"
 [41] "34"   "85"   "179"  "54"   "4"    "0*"   "52*"  "2"    "122"  "31"
 [51] "177"  "74"   "42"   "18"   "61"   "36"   "169"  "9"    "15*"  "92"
 [61] "83"   "143"  "139"  "23"   "148"  "13"   "155*" "79"   "47"   "113"
 [71] "67"   "136"  "29"   "53"   "124*" "126*" "44*"  "217"  "116"  "52"
 [81] "45"   "97"   "20"   "39"   "201*" "76"   "65"   "126"  "36*"  "69"
 [91] "155"  "22*"  "103"  "26"   "90"   "176"  "117"  "86"   "12"   "193"
[101] "16*"  "51"   "32"   "55"   "37"   "44"   "241*" "60*"  "194*" "3"
[111] "32*"  "248*" "94"   "22"   "109"  "19"   "14"   "28*"  "63"   "64"
[121] "101"  "122*" "91"   "82"   "56*"  "154*" "153"  "49"   "10*"  "103*"
[131] "160"  "100*" "105*" "100"  "106"  "84"   "203"  "98"   "38"   "214"
[141] "53*"  "111*" "146"  "14*"  "56"   "80"   "25"   "81"   "13*"

### 11b. Unique values – SparkR

head(SparkR::distinct(tendulkar1[,"Runs"]),5)
  Runs
1 119*
2    7
3   51
4  169
5  32*

### 12a. Aggregate – Mean, min and max – R

library(dplyr)
library(magrittr)
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)

# Remove rows with 'TDNB'
c <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[c,]

# Remove rows with 'absent'
d <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[d,]
dim(tendulkar)

# Remove the "* indicating not out
tendulkar$Runs <- as.numeric(gsub("\\*","",tendulkar$Runs))
c <- complete.cases(tendulkar)

#Subset the rows which are complete
tendulkar <- tendulkar[c,]
print(dim(tendulkar))
df <-tendulkar %>%  group_by(Ground) %>% summarise(meanRuns= mean(Runs), minRuns=min(Runs), maxRuns=max(Runs))
#names(tendulkar)

[1] 327  12
# A tibble: 6 x 4
Ground       meanRuns minRuns maxRuns

3 Auckland         5.00      5.      5.
4 Bangalore       57.9       4.    214.
5 Birmingham      46.8       1.    122.
6 Bloemfontein    85.0      15.    155.
### 12b. Aggregate – Mean, Min, Max – SparkR
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv",
                header = "true",
                delimiter = ",",
                source = "csv",
                inferSchema = "true",
                na.strings = "")

print(dim(tendulkar1))
tendulkar1 <-SparkR::filter(tendulkar1,tendulkar1$Runs != "DNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "TDNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "absent")
print(dim(tendulkar1))

# Cast the string type Runs to double
withColumn(tendulkar1, "Runs", cast(tendulkar1$Runs, "double"))
# Remove the "* indicating not out
tendulkar1$Runs=SparkR::regexp_replace(tendulkar1$Runs, "\\*", "")
df=SparkR::summarize(SparkR::groupBy(tendulkar1, tendulkar1$Ground), mean = mean(tendulkar1$Runs), minRuns=min(tendulkar1$Runs),maxRuns=max(tendulkar1$Runs))

[1] 347  12
[1] 330  12
[1] 329  12
[1] 329  12
Ground       mean minRuns maxRuns
1      Bangalore  54.312500       0      96
3  Colombo (PSS)  37.200000      14      71
4   Christchurch  12.000000       0      24
5       Auckland   5.000000       5       5
6        Chennai  60.625000       0      81
7      Centurion  73.500000     111      36
8       Brisbane   7.666667       0       7
9     Birmingham  46.750000       1      40
11 Colombo (RPS) 143.000000     143     143
12    Chittagong  57.800000     101      36
13     Cape Town  69.857143      14       9
14    Bridgetown  26.000000       0      92
15      Bulawayo  55.000000      36      74
16         Delhi  39.947368       0      76
17    Chandigarh  11.000000      11      11
18  Bloemfontein  85.000000      15     155
19 Colombo (SSC)  77.555556     104       8
20       Cuttack   2.000000       2       2
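Note that in the snippet above the result of withColumn() is not assigned back to tendulkar1, so Runs remains a string column and min/max are computed lexicographically (which is why Centurion, for example, shows minRuns 111 and maxRuns 36). A minimal sketch of applying the cast before aggregating is shown below; this is an illustrative variant and not the original notebook code.

# Illustrative sketch: strip the trailing '*' and cast Runs to double first
tendulkar1$Runs <- SparkR::regexp_replace(tendulkar1$Runs, "\\*", "")
tendulkar1$Runs <- SparkR::cast(tendulkar1$Runs, "double")
df2 <- SparkR::summarize(SparkR::groupBy(tendulkar1, tendulkar1$Ground),
                         meanRuns = mean(tendulkar1$Runs),
                         minRuns = min(tendulkar1$Runs),
                         maxRuns = max(tendulkar1$Runs))
head(SparkR::collect(df2))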
### 13a. Using SQL with SparkR
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv",
                header = "true",
                delimiter = ",",
                source = "csv",
                inferSchema = "true",
                na.strings = "")

# Register this SparkDataFrame as a temporary view.
createOrReplaceTempView(tendulkar1, "tendulkar2")

# SQL statements can be run by using the sql method
df=SparkR::sql("SELECT * FROM tendulkar2 WHERE Ground='Karachi'")


  Runs Mins BF 4s 6s    SR Pos Dismissal Inns Opposition  Ground Start Date
1   15   28 24  2  0  62.5   6    bowled    2 v Pakistan Karachi  15-Nov-89
2  DNB    -  -  -  -     -   -         -    4 v Pakistan Karachi  15-Nov-89
3   23   49 29  5  0 79.31   4    bowled    2 v Pakistan Karachi  29-Jan-06
4   26   74 47  5  0 55.31   4    bowled    4 v Pakistan Karachi  29-Jan-06
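Any Spark SQL that is supported can be run against the temporary view. As a further, purely illustrative sketch (not in the original notebook), an aggregate query could look like this:

# Illustrative sketch: aggregate with SQL on the same temporary view
df2 <- SparkR::sql("SELECT Ground, COUNT(*) AS innings FROM tendulkar2 GROUP BY Ground ORDER BY innings DESC")
head(SparkR::collect(df2))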
Conclusion

This post discusses some of the key constructs in R and SparkR and how one can transition from R to SparkR fairly easily. I will be adding more constructs later. Do check back!

To see all posts click Index of posts

# My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

Note: The 3rd edition of this book is now available: My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and Kindle ($9.99/Rs449) versions. This edition includes more content, extensive comments and formatting for better readability.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Paperback – $12.99)
2. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Kindle – $9.99/Rs449)

This book is ideal both for beginners and experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface
Introduction
1. Essential R
2. Essential Python for Datascience
3. R vs Python
4. Regression of a continuous variable
5. Classification and Cross Validation
6. Regression techniques and regularization
7. SVMs, Decision Trees and Validation curves
8. Splines, GAMs, Random Forests and Boosting
9. PCA, K-Means and Hierarchical Clustering
References

Hope you have a great time learning as I did while implementing these algorithms!

# Deep Learning from first principles in Python, R and Octave – Part 1

“You don’t perceive objects as they are. You perceive them as you are.”
“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”

                          David Eagleman - The Brain: The Story of You

This is the first in a series of posts I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton, also on Coursera. In this post I implement Logistic Regression with a 2 layer Neural Network, i.e. a Neural Network that just has an input layer and an output layer, with no hidden layer. I am certain that any self-respecting Deep Learning/Neural Network practitioner would consider a Neural Network without hidden layers as no Neural Network at all!

This 2 layer network is implemented in Python, R and Octave. I have included Octave in the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent, vectorized implementations. So, if you are familiar with any one of these languages, you should be able to follow the corresponding code in the other two. You can download this R Markdown file and the Octave code from DeepLearning -Part 1

Check out my video presentations which discuss the derivations in detail
1. Elements of Neural Networks and Deep Learning – Part 1
2. Elements of Neural Networks and Deep Learning – Part 2

To start with, Logistic Regression is performed using sklearn’s logistic regression package for the cancer data set also from sklearn. This is shown below

## 1. Logistic Regression

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
# Call the Logistic Regression function
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96

To check on other classification algorithms, check my post Practical Machine Learning with R and Python – Part 2.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($14.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

## 2. Logistic Regression as a 2 layer Neural Network

In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in all three languages. This can be represented diagrammatically as below

The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.

This simple 2 layer Neural Network is shown below
At the input layer there are 30 features and the corresponding weights of these inputs which are initialized to small random values.
$Z= w_{1}x_{1} +w_{2}x_{2} +..+ w_{30}x_{30} + b$
where ‘b’ is the bias term

The Activation function is the sigmoid function which is $a= 1/(1+e^{-z})$
The Loss, when the sigmoid function is used in the output layer, is given by
$L=-(ylog(a) + (1-y)log(1-a))$ (1)

### Forward propagation

In the forward propagation cycle of the Neural Network, the output Z and the output of the activation function (the sigmoid function) are first computed. Then, using the known output ‘y’ for the given features, the ‘Loss’ is computed using equation (1) above.

### Backward propagation

The backward propagation cycle determines how the ‘Loss’ changes for small variations in the weights, working back from the output layer up to the input layer. In other words, backward propagation computes the changes in the weights at the input layer which will minimize the loss. Several cycles of gradient descent are performed along the path of steepest descent to find the local minima; that is, the set of weights and biases at the input layer which results in the lowest loss is computed by gradient descent. The weights at the input layer are adjusted by an amount controlled by a parameter known as the ‘learning rate’. Too big a ‘learning rate’ can overshoot the local minima, and too small a ‘learning rate’ can take a long time to reach the local minima. This is done for ‘m’ training examples.

Chain rule of differentiation
Let y=f(u)
and u=g(x) then
$\partial y/\partial x = \partial y/\partial u * \partial u/\partial x$

Derivative of sigmoid
$\sigma=1/(1+e^{-z})$
Let $x= 1 + e^{-z}$  then
$\sigma = 1/x$
$\partial \sigma/\partial x = -1/x^{2}$
$\partial x/\partial z = -e^{-z}$
Using the chain rule of differentiation we get
$\partial \sigma/\partial z = \partial \sigma/\partial x * \partial x/\partial z$
$=-1/(1+e^{-z})^{2}* -e^{-z} = e^{-z}/(1+e^{-z})^{2}$
Therefore $\partial \sigma/\partial z = \sigma(1-\sigma)$        -(2)
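As a quick sanity check of equation (2), a small R snippet (illustrative only, not part of the implementations below) can compare the analytical derivative with a central-difference approximation:

# Numerical check of d(sigma)/dz = sigma * (1 - sigma)
sigmoid <- function(z) 1/(1 + exp(-z))
z <- 0.7                                                   # arbitrary test point
h <- 1e-6
numerical  <- (sigmoid(z + h) - sigmoid(z - h)) / (2*h)    # central difference
analytical <- sigmoid(z) * (1 - sigmoid(z))
print(c(numerical, analytical))                            # the two values should agree closely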

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are
$L=-(y*log(a) + (1-y)*log(1-a))$      -(a)
$a=1/(1+e^{-Z})$      -(b)
$Z= w_{1}x_{1} +w_{2}x_{2} +...+ w_{30}x_{30} +b = \sum_{i} w_{i}x_{i} + b$ -(c)

The back propagation step requires the computation of $dL/dw_{i}$ and $dL/db_{i}$. In the case of regression it would be $dE/dw_{i}$ and $dE/db_{i}$, where E is the Mean Squared Error function.
Computing the derivatives for back propagation we have
$dL/da = -(y/a - (1-y)/(1-a))$          -(d)
because $d/dx(logx) = 1/x$
Also from equation (2) we get
$da/dZ = a (1-a)$                                  – (e)
By chain rule
$\partial L/\partial Z = \partial L/\partial a * \partial a/\partial Z$
therefore substituting the results of (d) & (e) we get
$\partial L/\partial Z = -(y/a - (1-y)/(1-a)) * a(1-a) = a-y$         -(f)
Finally
$\partial L/\partial w_{i}= \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial w_{i}$                                                           -(g)
$\partial Z/\partial w_{i} = x_{i}$            – (h)
and from (f) we have  $\partial L/\partial Z =a-y$
Therefore  (g) reduces to
$\partial L/\partial w_{i} = x_{i}* (a-y)$ -(i)
Also
$\partial L/\partial b = \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial b$ -(j)
Since
$\partial Z/\partial b = 1$ and using (f) in (j)
$\partial L/\partial b = a-y$

The gradient computes the weights at the input layer and the corresponding bias by using the values
of $dw_{i}$ and $db$
$w_{i} := w_{i} -\alpha * dw_{i}$
$b := b -\alpha * db$
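As a small worked example with made-up numbers: suppose for one training example $a=0.73$, $y=1$, $x_{i}=0.5$ and $\alpha=0.1$. Then $dw_{i} = x_{i}(a-y) = 0.5 \times (-0.27) = -0.135$ and $db = a-y = -0.27$, so the updates are $w_{i} := w_{i} - 0.1 \times (-0.135) = w_{i} + 0.0135$ and $b := b + 0.027$. Since the prediction was too low ($a < y$), the weight on this positive input and the bias both increase, as expected.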
I found the computation graph representation in the book Deep Learning: Ian Goodfellow, Yoshua Bengio, Aaron Courville, very useful to visualize and also compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

### 3. Neural Network for Logistic Regression -Python code (vectorized)

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Define the sigmoid function
def sigmoid(z):
    a = 1/(1+np.exp(-z))
    return a

# Initialize the weights and bias
def initialize(dim):
    w = np.zeros(dim).reshape(dim,1)
    b = 0
    return w, b

# Compute the loss
def computeLoss(numTraining,Y,A):
    loss = -1/numTraining * np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
    return(loss)

# Execute the forward propagation
def forwardPropagation(w,b,X,Y):
    # Compute Z
    Z = np.dot(w.T,X) + b
    # Determine the number of training samples
    numTraining = float(len(X))
    # Compute the output of the sigmoid activation function
    A = sigmoid(Z)
    # Compute the loss
    loss = computeLoss(numTraining,Y,A)
    # Compute the gradients dZ, dw and db
    dZ = A-Y
    dw = 1/numTraining*np.dot(X,dZ.T)
    db = 1/numTraining*np.sum(dZ)

    # Return the results as a dictionary
    gradients = {"dw": dw,
                 "db": db}
    loss = np.squeeze(loss)
    return gradients, loss

# Perform gradient descent
def gradientDescent(w, b, X, Y, numIerations, learningRate):
    losses=[]
    idx =[]
    # Iterate
    for i in range(numIerations):
        # Compute the gradients and loss through forward propagation
        gradients, loss = forwardPropagation(w,b,X,Y)
        # Get the derivatives
        dw = gradients["dw"]
        db = gradients["db"]
        w = w-learningRate*dw
        b = b-learningRate*db

        # Store the loss every 100 iterations
        if i % 100 == 0:
            idx.append(i)
            losses.append(loss)
    params = {"w": w,
              "b": b}
    grads = {"dw": dw,
             "db": db}
    return params, grads, losses, idx

# Predict the output for a training set
def predict(w,b,X):
    size = X.shape[1]
    yPredicted = np.zeros((1,size))
    Z = np.dot(w.T,X) + b
    # Compute the sigmoid
    A = sigmoid(Z)
    for i in range(A.shape[1]):
        # If the value is > 0.5 then set as 1
        if(A[0][i] > 0.5):
            yPredicted[0][i] = 1
        else:
            # Else set as 0
            yPredicted[0][i] = 0

    return yPredicted

# Normalize the data
def normalize(x):
    # Compute the norm of each row
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)
    # Divide each row by its norm
    x = x/x_norm
    return x

# Run the 2 layer Neural Network on the cancer data set

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
# Normalize the data for better performance
X_train1=normalize(X_train)

# Create weight vectors of zeros. The size is the number of features in the data set=30
w=np.zeros((X_train.shape[1],1))
#w=np.zeros((30,1))
b=0

#Normalize the training data so that gradient descent performs better
X_train1=normalize(X_train)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_train2=X_train1.T

# Reshape to remove the rank 1 array and then transpose
y_train1=y_train.reshape(len(y_train),1)
y_train2=y_train1.T

# Run gradient descent for 4000 iterations and compute the weights
# (the learning rate below is an assumed value for illustration)
parameters, grads, costs, idx = gradientDescent(w, b, X_train2, y_train2, numIerations=4000, learningRate=0.75)
w = parameters["w"]
b = parameters["b"]

# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=X_test1.T

#Reshape y_test
y_test1=y_test.reshape(len(y_test),1)
y_test2=y_test1.T

# Predict the values for
yPredictionTest = predict(w, b, X_test2)
yPredictionTrain = predict(w, b, X_train2)

# Print the accuracy
print("train accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTrain - y_train2)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTest - y_test)) * 100))

# Plot the Costs vs the number of iterations
fig1=plt.plot(idx,costs)
fig1=plt.title("Gradient descent-Cost vs No of iterations")
fig1=plt.xlabel("No of iterations")
fig1=plt.ylabel("Cost")
fig1.figure.savefig("fig1", bbox_inches='tight')
## train accuracy: 90.3755868545 %
## test accuracy: 89.5104895105 %

Note: It can be seen that the accuracy on the training and test set is 90.37% and 89.51% respectively. This is comparatively poorer than the 96% which sklearn's logistic regression achieves! But this is mainly because of the absence of hidden layers, which are the real power of neural networks.

### 4. Neural Network for Logistic Regression -R code (vectorized)

source("RFunctions-1.R")
# Define the sigmoid function
sigmoid <- function(z){
a <- 1/(1+ exp(-z))
a
}

# Compute the loss
computeLoss <- function(numTraining,Y,A){
loss <- -1/numTraining* sum(Y*log(A) + (1-Y)*log(1-A))
return(loss)
}

# Compute forward propagation
forwardPropagation <- function(w,b,X,Y){
# Compute Z
Z <- t(w) %*% X +b
#Set the number of samples
numTraining <- ncol(X)
# Compute the activation function
A=sigmoid(Z)

#Compute the loss
loss <- computeLoss(numTraining,Y,A)

# Compute the gradients dZ, dw and db
dZ<-A-Y
dw<-1/numTraining * X %*% t(dZ)
db<-1/numTraining*sum(dZ)

fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
return(fwdProp)
}

# Perform one cycle of Gradient descent
gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
losses <- NULL
idx <- NULL
# Loop through the number of iterations
for(i in 1:numIerations){
fwdProp <-forwardPropagation(w,b,X,Y)
#Get the derivatives
dw <- fwdProp$dw
db <- fwdProp$db
w = w-learningRate*dw
b = b-learningRate*db
l <- fwdProp$loss
# Store the loss
if(i %% 100 == 0){
    idx <- c(idx,i)
    losses <- c(losses,l)
}
}
# Return the weights and losses
gradDescnt <- list("w"=w,"b"=b,"dw"=dw,"db"=db,"losses"=losses,"idx"=idx)
return(gradDescnt)
}

# Compute the predicted value for input
predict <- function(w,b,X){
m=dim(X)[2]
# Create a vector of 0's
yPredicted=matrix(rep(0,m),nrow=1,ncol=m)
Z <- t(w) %*% X +b
# Compute sigmoid
A=sigmoid(Z)
for(i in 1:dim(A)[2]){
    # If A > 0.5 set value as 1
    if(A[1,i] > 0.5)
        yPredicted[1,i]=1
    else
        # Else set as 0
        yPredicted[1,i]=0
}
return(yPredicted)
}

# Normalize the matrix
normalize <- function(x){
# Create the norm of the matrix. Perform the Frobenius norm of the matrix
n<-as.matrix(sqrt(rowSums(x^2)))
# Sweep the rows by the norm. Note the '1' in the function, which performs this on every row
normalized<-sweep(x, 1, n, FUN="/")
return(normalized)
}

# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1,30),"output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Set the features
X_train <-train[,1:30]
y_train <- train[,31]
X_test <- test[,1:30]
y_test <- test[,31]
# Create a matrix of 0's with the number of features
w <-matrix(rep(0,dim(X_train)[2]))
b <-0
X_train1 <- normalize(X_train)
X_train2=t(X_train1)

# Reshape then transpose
y_train1=as.matrix(y_train)
y_train2=t(y_train1)

# Perform gradient descent
gradDescent= gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)

# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=t(X_test1)
#Reshape y_test and take transpose
y_test1=as.matrix(y_test)
y_test2=t(y_test1)

# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain = predict(gradDescent$w, gradDescent$b, X_train2)

sprintf("Train accuracy: %f",(100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f",(100 - mean(abs(yPredictionTest - y_test)) * 100))
## [1] "test accuracy: 87.323944"

df <-data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations","losses")
ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(col="blue") +
ggtitle("Gradient Descent - Losses vs No of Iterations") +
xlab("No of iterations") + ylab("Losses")

### 5. Neural Network for Logistic Regression -Octave code (vectorized)

1;
# Define sigmoid function
function a = sigmoid(z)
   a = 1 ./ (1+ exp(-z));
end

# Compute the loss
function loss=computeLoss(numtraining,Y,A)
   loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end

# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
   % Compute Z
   Z = w' * X + b;
   numtraining = size(X)(1,2);
   # Compute sigmoid
   A = sigmoid(Z);

   # Compute loss. Note this is element wise product
   loss = computeLoss(numtraining,Y,A);
   # Compute the gradients dZ, dw and db
   dZ = A-Y;
   dw = 1/numtraining* X * dZ';
   db = 1/numtraining*sum(dZ);
end

# Compute Gradient Descent
function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
   # Initialize losses and idx
   losses=[];
   index=[];
   # Loop through the number of iterations
   for i=1:numIerations,
      [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
      # Perform Gradient descent
      w = w - learningRate*dw;
      b = b - learningRate*db;
      if(mod(i,100) ==0)
         # Append index and loss
         index = [index i];
         losses = [losses loss];
      endif
   end
end

# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
   m = size(X)(1,2);
   yPredicted = zeros(1,m);
   # Compute Z
   Z = w' * X + b;
   # Compute sigmoid
   A = sigmoid(Z);
   for i=1:size(X)(1,2),
      # Set predicted as 1 if A >= 0.5
      if(A(1,i) >= 0.5)
         yPredicted(1,i)=1;
      else
         yPredicted(1,i)=0;
      endif
   end
end

# Normalize each row by dividing by the square root of its sum of squares
function normalized = normalize(x)
   # Compute Frobenius norm. Square the elements, sum rows and then find square root
   a = sqrt(sum(x .^ 2,2));
   # Perform element wise division
   normalized = x ./ a;
end

# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
   # Create a random index
   ix = randperm(length(dataset));
   # Split into training
   trainSize = floor(trainPercent/100 * length(dataset));
   train=dataset(ix(1:trainSize),:);
   # And test
   test=dataset(ix(trainSize+1:length(dataset)),:);
   X_train = train(:,1:30);
   y_train = train(:,31);
   X_test = test(:,1:30);
   y_test = test(:,31);
end

cancer=csvread("cancer.csv");
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);

# Normalize X_test
X_test1=normalize(X_test);
# Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);

trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100
testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100
trainAccuracy = 90.845
testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent- Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");

Conclusion
This post starts with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is poorer than that of the highly optimized Logistic Regression in sklearn. This is because the above neural network does not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers.

The Deep Learning journey has begun… Don’t miss the bus!
Stay tuned for more interesting posts in Deep Learning!!

To see all posts check Index of posts

# My 3 video presentations on “Essential R”

In this post I include my  3 video presentations on the topic “Essential R”. In these 3 presentations I cover the entire landscape of R. I cover the following

• R Language – The essentials
• Key R Packages (dplyr, lubridate, ggplot2, etc.)
• How to create R Markdown and share reports
• A look at Shiny apps
• How to create a simple R package

You can download the relevant slide deck and practice code at Essential R

Essential R – Part 1
This video covers basic R data types – character, numeric, vectors, matrices, lists and data frames. It also touches on how to subset these data types; a small illustrative snippet follows below.
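As a flavour of what is covered, here is a small, illustrative R snippet on subsetting these data types (not taken from the presentation itself):

# Illustrative examples of subsetting R data types
v <- c(10, 20, 30, 40)
v[2:3]                      # subset a vector by position
m <- matrix(1:6, nrow = 2)
m[1, ]                      # first row of a matrix
l <- list(a = 1, b = "two")
l[["b"]]                    # pick an element of a list by name
df <- data.frame(runs = c(15, 59, 8), ground = c("Karachi", "Faisalabad", "Faisalabad"))
df[df$runs > 10, "ground"]  # subset a data frame by a logical condition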

Essential R – Part 2
This video continues with how to subset dataframes (the most important data type) and covers some important packages. It also presents one of the most important jobs of a Data Scientist – that of cleaning and shaping the data – using an example unclean data frame. It also touches on some key operations of dplyr like select, filter, arrange, summarise and mutate (see the sketch below). Other packages like lubridate and quantmod are also included. This presentation also shows how to use base plot and ggplot2.
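The dplyr verbs mentioned above chain together naturally with the pipe; a minimal illustrative sketch (with a made-up data frame, not the one used in the video) is:

library(dplyr)
# Illustrative dplyr pipeline: select, filter, mutate, arrange, group_by, summarise
scores <- data.frame(player = c("A", "B", "A", "B"),
                     runs   = c(45, 12, 80, 33),
                     balls  = c(50, 20, 95, 40))
scores %>%
  select(player, runs, balls) %>%
  filter(runs > 20) %>%
  mutate(strike_rate = runs / balls * 100) %>%
  arrange(desc(runs)) %>%
  group_by(player) %>%
  summarise(mean_runs = mean(runs))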

Essential R – Part 3
This final session covers R Markdown and touches on some of the key markdown elements. There is a brief overview of a simple Shiny app (a minimal sketch is shown below). Finally, this presentation also shows the key steps to create an R package.
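For a sense of what a simple Shiny app looks like, here is a minimal illustrative sketch (a generic histogram app, not the one shown in the video):

library(shiny)
# Minimal illustrative Shiny app
ui <- fluidPage(
  titlePanel("A simple Shiny app"),
  sliderInput("n", "Number of observations", min = 10, max = 500, value = 100),
  plotOutput("hist")
)
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n), main = "Random normal sample"))
}
# Launch with: shinyApp(ui, server)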

These 3 R sessions cover most of the basic R topics that we tend to use in our day-to-day R way of life. With this you should be able to hit the ground running!

Hope you enjoy these video presentations and also hope you have an even greater time with R!

Check out my 2 books on cricket, a) Cricket analytics with cricketr b) Beaten by sheer pace – Cricket analytics with yorkr, now available in both paperback & kindle versions on Amazon!!! Pick up your copies today!

To see all my posts click – Index of posts