Big Data-4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR

“There’s something so paradoxical about pi. On the one hand, it represents order, as embodied by the shape of a circle, long held to be a symbol of perfection and eternity. On the other hand, pi is unruly, disheveled in appearance, its digits obeying no obvious rule, or at least none that we can perceive. Pi is elusive and mysterious, forever beyond reach. Its mix of order and disorder is what makes it so bewitching. ” 

From  Infinite Powers by Steven Strogatz

Anybody who wants to be “anybody” in Big Data must necessarily be able to work on both large structured and unstructured data.  Log analysis is critical in any enterprise which is usually unstructured. As I mentioned in my previous post Big Data: On RDDs, Dataframes,Hive QL with Pyspark and SparkR-Part 3 RDDs are typically used to handle unstructured data. Spark has the Dataframe abstraction over RDDs which performs better as it is optimized with the Catalyst optimization engine. Nevertheless, it is important to be able to process with RDDs.  This post is a continuation of my 3 earlier posts on Big Data namely

1. Big Data-1: Move into the big league:Graduate from Python to Pyspark
2. Big Data-2: Move into the big league:Graduate from R to SparkR
3. Big Data: On RDDs, Dataframes,Hive QL with Pyspark and SparkR-Part 3

This post uses publicly available Webserver logs from NASA. The logs are for the months Jul 95 and Aug 95 and are a good place to start unstructured text analysis/log analysis. I highly recommend parsing these publicly available logs with regular expressions. It is only when you do that the truth of Jamie Zawinski’s pearl of wisdom

“Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.” – Jamie Zawinksi

hits home. I spent many hours struggling with regex!!

For this post for the RDD part,  I had to refer to Dr. Fisseha Berhane’s blog post Webserver Log Analysis and for the Pyspark part, to the Univ. of California Specialization which I had done 3 years back Big Data Analysis with Apache Spark. Once I had played around with the regex for RDDs and PySpark I managed to get SparkR and SparklyR versions to work.

The notebooks used in this post have been published and are available at

  1. logsAnalysiswithRDDs
  2. logsAnalysiswithPyspark
  3. logsAnalysiswithSparkRandSparklyR

You can also download all the notebooks from Github at WebServerLogsAnalysis

An essential and unavoidable aspect of Big Data processing is the need to process unstructured text.Web server logs are one such area which requires Big Data techniques to process massive amounts of logs. The Common Log Format also known as the NCSA Common log format, is a standardized text file format used by web servers when generating server log files. Because the format is standardized, the files can be readily analyzed.

A publicly available webserver logs is the NASA-HTTP Web server logs. This is good dataset with which we can play around to get familiar to handling web server logs. The logs can be accessed at NASA-HTTP

Description These two traces contain two month’s worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida.

Format The logs are an ASCII file with one line per request, with the following columns:

-host making the request. A hostname when possible, otherwise the Internet address if the name could not be looked up.

-timestamp in the format “DAY MON DD HH:MM:SS YYYY”, where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400.

-request given in quotes.

-HTTP reply code.

-bytes in the reply.

1 Parse Web server logs with RDDs

1.1 Read NASA Web server logs

Read the logs files from NASA for the months Jul 95 and Aug 95

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("Spark-Logs-Handling").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

sqlcontext = SQLContext(sc)
rdd = sc.textFile("/FileStore/tables/NASA_access_log_*.gz")
rdd.count()
Out[1]: 3461613

1.2Check content

Check the logs to identify the parsing rules required for the logs

i=0
for line in rdd.sample(withReplacement = False, fraction = 0.00001, seed = 100).collect():
    i=i+1
    print(line)
    if i >5:
      break
ix-stp-fl2-19.ix.netcom.com – – [03/Aug/1995:23:03:09 -0400] “GET /images/faq.gif HTTP/1.0” 200 263
slip183-1.kw.jp.ibm.net – – [04/Aug/1995:18:42:17 -0400] “GET /shuttle/missions/sts-70/images/DSC-95EC-0001.gif HTTP/1.0” 200 107133
piweba4y.prodigy.com – – [05/Aug/1995:19:17:41 -0400] “GET /icons/menu.xbm HTTP/1.0” 200 527
ruperts.bt-sys.bt.co.uk – – [07/Aug/1995:04:44:10 -0400] “GET /shuttle/countdown/video/livevideo2.gif HTTP/1.0” 200 69067
dal06-04.ppp.iadfw.net – – [07/Aug/1995:21:10:19 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 200 786
p15.ppp-1.directnet.com – – [10/Aug/1995:01:22:54 -0400] “GET /images/KSC-logosmall.gif HTTP/1.0” 200 1204

1.3 Write the parsing rule for each of the fields

  • host
  • timestamp
  • path
  • status
  • content_bytes

1.21 Get IP address/host name

This regex is at the start of the log and includes any non-white characted

import re
rslt=(rdd.map(lambda line: re.search('\S+',line)
   .group(0))
   .take(3)) # Get the IP address \host name
rslt
Out[3]: [‘in24.inetnebr.com’, ‘uplherc.upl.com’, ‘uplherc.upl.com’]

1.22 Get timestamp

Get the time stamp

rslt=(rdd.map(lambda line: re.search(‘(\S+ -\d{4})’,line)
    .groups())
    .take(3))  #Get the  date
rslt
[(‘[01/Aug/1995:00:00:01 -0400’,),
(‘[01/Aug/1995:00:00:07 -0400’,),
(‘[01/Aug/1995:00:00:08 -0400’,)]

1.23 HTTP request

Get the HTTP request sent to Web server \w+ {GET}

# Get the REST call with ” “
rslt=(rdd.map(lambda line: re.search('"\w+\s+([^\s]+)\s+HTTP.*"',line)
    .groups())
    .take(3)) # Get the REST call
rslt
[(‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’,),
(‘/’,),
(‘/images/ksclogo-medium.gif’,)]

1.23Get HTTP response status

Get the HTTP response to the request

rslt=(rdd.map(lambda line: re.search('"\s(\d{3})',line)
    .groups())
    .take(3)) #Get the status
rslt
Out[6]: [(‘200’,), (‘304’,), (‘304’,)]

1.24 Get content size

Get the HTTP response in bytes

rslt=(rdd.map(lambda line: re.search(‘^.*\s(\d*)$’,line)
    .groups())
    .take(3)) # Get the content size
rslt
Out[7]: [(‘1839’,), (‘0’,), (‘0’,)]

1.24 Putting it all together

Now put all the individual pieces together into 1 big regular expression and assign to the groups

  1. Host 2. Timestamp 3. Path 4. Status 5. Content_size
rslt=(rdd.map(lambda line: re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
    .groups())
    .take(3))
rslt
[(‘in24.inetnebr.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:01 -0400]’,
‘”GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0″‘,
‘/shuttle/missions/sts-68/news/sts-68-mcc-05.txt’,
‘200 1839’,
‘1839’),
(‘uplherc.upl.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:07 -0400]’,
‘”GET / HTTP/1.0″‘,
‘/’,
‘304 0’,
‘0’),
(‘uplherc.upl.com’,
‘ -‘,
‘ ‘,
‘-‘,
‘[01/Aug/1995:00:00:08 -0400]’,
‘”GET /images/ksclogo-medium.gif HTTP/1.0″‘,
‘/images/ksclogo-medium.gif’,
‘304 0’,
‘0’)]

1.25 Add a log parsing function

import re
def parse_log1(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s(\d*)$)',line)
    if match is None:    
        return(line,0)
    else:
        return(line,1)

1.26 Check for parsing failure

Check how many lines successfully parsed with the parsing function

n_logs = rdd.count()
failed = rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1] == 0).count()
print('Out of a total of {} logs, {} failed to parse'.format(n_logs,failed))
# Get the failed records line[1] == 0
failed1=rdd.map(lambda line: parse_log1(line)).filter(lambda line: line[1]==0)
failed1.take(3)
Out of a total of 3461613 logs, 38768 failed to parse
Out[10]:
[(‘gw1.att.com – – [01/Aug/1995:00:03:53 -0400] “GET /shuttle/missions/sts-73/news HTTP/1.0” 302 -‘,
0),
(‘js002.cc.utsunomiya-u.ac.jp – – [01/Aug/1995:00:07:33 -0400] “GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0” 404 -‘,
0),
(‘pipe1.nyc.pipeline.com – – [01/Aug/1995:00:12:37 -0400] “GET /history/apollo/apollo-13/apollo-13-patch-small.gif” 200 12859’,
0)]

1.26 The above rule is not enough to parse the logs

It can be seen that the single rule only parses part of the logs and we cannot group the regex separately. There is an error “AttributeError: ‘NoneType’ object has no attribute ‘group'” which shows up

#rdd.map(lambda line: re.search(‘^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s(“\w+\s+([^\s]+)\s+HTTP.*”)\s(\d{3}\s(\d*)$)’,line[0]).group()).take(4)

File “/databricks/spark/python/pyspark/util.py”, line 99, in wrapper
return f(*args, **kwargs)
File “<command-1348022240961444>”, line 1, in <lambda>
AttributeError: ‘NoneType’ object has no attribute ‘group’

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)

1.27 Add rule for parsing failed records

One of the issues with the earlier rule is the content_size has “-” for some logs

import re
def parse_failed(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line)
    if match is None:        
        return(line,0)
    else:
        return(line,1)

1.28 Parse records which fail

Parse the records that fails with the new rule

failed2=rdd.map(lambda line: parse_failed(line)).filter(lambda line: line[1]==1)
failed2.take(5)
Out[13]:
[(‘gw1.att.com – – [01/Aug/1995:00:03:53 -0400] “GET /shuttle/missions/sts-73/news HTTP/1.0” 302 -‘,
1),
(‘js002.cc.utsunomiya-u.ac.jp – – [01/Aug/1995:00:07:33 -0400] “GET /shuttle/resources/orbiters/discovery.gif HTTP/1.0” 404 -‘,
1),
(‘tia1.eskimo.com – – [01/Aug/1995:00:28:41 -0400] “GET /pub/winvn/release.txt HTTP/1.0” 404 -‘,
1),
(‘itws.info.eng.niigata-u.ac.jp – – [01/Aug/1995:00:38:01 -0400] “GET /ksc.html/facts/about_ksc.html HTTP/1.0” 403 -‘,
1),
(‘grimnet23.idirect.com – – [01/Aug/1995:00:50:12 -0400] “GET /www/software/winvn/winvn.html HTTP/1.0” 404 -‘,
1)]

1.28 Add both rules

Add both rules for parsing the log.

Note it can be shown that even with both rules all the logs are not parse.Further rules may need to be added

import re
def parse_log2(line):
    # Parse logs with the rule below
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
    # If match failed then use the rule below
    if match is None:
        match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3}\s-$)',line)
    if match is None:
        return (line, 0) # Return 0 for failure
    else:
        return (line, 1) # Return 1 for success

1.29 Group the different regex to groups for handling

def map2groups(line):
    match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(\d*)$',line)
    if match is None:
        match = re.search('^(\S+)((\s)(-))+\s(\[\S+ -\d{4}\])\s("\w+\s+([^\s]+)\s+HTTP.*")\s(\d{3})\s(-)$',line)    
    return(match.groups())

1.30 Parse the logs and map the groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))

2. Parse Web server logs with Pyspark

2.1Read data into a Pyspark dataframe

import os
logs_file_path="/FileStore/tables/" + os.path.join('NASA_access_log_*.gz')
from pyspark.sql.functions import split, regexp_extract
base_df = sqlContext.read.text(logs_file_path)
#base_df.show(truncate=False)
from pyspark.sql.functions import split, regexp_extract
split_df = base_df.select(regexp_extract('value', r'^([^\s]+\s)', 1).alias('host'),
                          regexp_extract('value', r'^.*\[(\d\d\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})]', 1).alias('timestamp'),
                          regexp_extract('value', r'^.*"\w+\s+([^\s]+)\s+HTTP.*"', 1).alias('path'),
                          regexp_extract('value', r'^.*"\s+([^\s]+)', 1).cast('integer').alias('status'),
                          regexp_extract('value', r'^.*\s+(\d+)$', 1).cast('integer').alias('content_size'))
split_df.show(5,truncate=False)
+———————+————————–+———————————————–+——+————+
|host |timestamp |path |status|content_size|
+———————+————————–+———————————————–+——+————+
|199.72.81.55 |01/Jul/1995:00:00:01 -0400|/history/apollo/ |200 |6245 |
|unicomp6.unicomp.net |01/Jul/1995:00:00:06 -0400|/shuttle/countdown/ |200 |3985 |
|199.120.110.21 |01/Jul/1995:00:00:09 -0400|/shuttle/missions/sts-73/mission-sts-73.html |200 |4085 |
|burger.letters.com |01/Jul/1995:00:00:11 -0400|/shuttle/countdown/liftoff.html |304 |0 |
|199.120.110.21 |01/Jul/1995:00:00:11 -0400|/shuttle/missions/sts-73/sts-73-patch-small.gif|200 |4179 |
+———————+————————–+———————————————–+——+————+
only showing top 5 rows

2.2 Check data

bad_rows_df = split_df.filter(split_df[‘host’].isNull() |
                              split_df['timestamp'].isNull() |
                              split_df['path'].isNull() |
                              split_df['status'].isNull() |
                             split_df['content_size'].isNull())
bad_rows_df.count()
Out[20]: 33905

2.3Check no of rows which do not have digits

We have already seen that the content_type field has ‘-‘ instead of digits in RDDs

#bad_content_size_df = base_df.filter(~ base_df[‘value’].rlike(r’\d+$’))
bad_content_size_df.count()
Out[21]: 33905

2.4 Add ‘*’ to identify bad rows

To identify the rows that are bad, concatenate ‘*’ to the content_size field where the field does not have digits. It can be seen that the content_size has ‘-‘ instead of a valid number

from pyspark.sql.functions import lit, concat
bad_content_size_df.select(concat(bad_content_size_df['value'], lit('*'))).show(4,truncate=False)
+—————————————————————————————————————————————————+
|concat(value, *) |
+—————————————————————————————————————————————————+
|dd15-062.compuserve.com – – [01/Jul/1995:00:01:12 -0400] “GET /news/sci.space.shuttle/archive/sci-space-shuttle-22-apr-1995-40.txt HTTP/1.0” 404 -*|
|dynip42.efn.org – – [01/Jul/1995:00:02:14 -0400] “GET /software HTTP/1.0” 302 -* |
|ix-or10-06.ix.netcom.com – – [01/Jul/1995:00:02:40 -0400] “GET /software/winvn HTTP/1.0” 302 -* |
|ix-or10-06.ix.netcom.com – – [01/Jul/1995:00:03:24 -0400] “GET /software HTTP/1.0” 302 -* |
+—————————————————————————————————————————————————+

2.5 Fill NAs with 0s

# Replace all null content_size values with 0.

cleaned_df = split_df.na.fill({‘content_size’: 0})

3. Webserver  logs parsing with SparkR

library(SparkR)
library(stringr)
file_location = "/FileStore/tables/NASA_access_log_Jul95.gz"
file_location = "/FileStore/tables/NASA_access_log_Aug95.gz"
# Load the SparkR library


# Initiate a SparkR session
sparkR.session()
sc <- sparkR.session()
sqlContext <- sparkRSQL.init(sc)
df <- read.text(sqlContext,"/FileStore/tables/NASA_access_log_Jul95.gz")

#df=SparkR::select(df, "value")
#head(SparkR::collect(df))
#m=regexp_extract(df$value,'\\\\S+',1)

a=df %>% 
  withColumn('host', regexp_extract(df$value, '^(\\S+)', 1)) %>%
  withColumn('timestamp', regexp_extract(df$value, "((\\S+ -\\d{4}))", 2)) %>%
  withColumn('path', regexp_extract(df$value, '(\\"\\w+\\s+([^\\s]+)\\s+HTTP.*")', 2))  %>%
  withColumn('status', regexp_extract(df$value, '(^.*"\\s+([^\\s]+))', 2)) %>%
  withColumn('content_size', regexp_extract(df$value, '(^.*\\s+(\\d+)$)', 2))
#b=a%>% select(host,timestamp,path,status,content_type)
head(SparkR::collect(a),10)

1 199.72.81.55 – – [01/Jul/1995:00:00:01 -0400] “GET /history/apollo/ HTTP/1.0” 200 6245
2 unicomp6.unicomp.net – – [01/Jul/1995:00:00:06 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985
3 199.120.110.21 – – [01/Jul/1995:00:00:09 -0400] “GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0” 200 4085
4 burger.letters.com – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/countdown/liftoff.html HTTP/1.0” 304 0
5 199.120.110.21 – – [01/Jul/1995:00:00:11 -0400] “GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0” 200 4179
6 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /images/NASA-logosmall.gif HTTP/1.0” 304 0
7 burger.letters.com – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/video/livevideo.gif HTTP/1.0” 200 0
8 205.212.115.106 – – [01/Jul/1995:00:00:12 -0400] “GET /shuttle/countdown/countdown.html HTTP/1.0” 200 3985
9 d104.aa.net – – [01/Jul/1995:00:00:13 -0400] “GET /shuttle/countdown/ HTTP/1.0” 200 3985
10 129.94.144.152 – – [01/Jul/1995:00:00:13 -0400] “GET / HTTP/1.0” 200 7074
host timestamp
1 199.72.81.55 [01/Jul/1995:00:00:01 -0400
2 unicomp6.unicomp.net [01/Jul/1995:00:00:06 -0400
3 199.120.110.21 [01/Jul/1995:00:00:09 -0400
4 burger.letters.com [01/Jul/1995:00:00:11 -0400
5 199.120.110.21 [01/Jul/1995:00:00:11 -0400
6 burger.letters.com [01/Jul/1995:00:00:12 -0400
7 burger.letters.com [01/Jul/1995:00:00:12 -0400
8 205.212.115.106 [01/Jul/1995:00:00:12 -0400
9 d104.aa.net [01/Jul/1995:00:00:13 -0400
10 129.94.144.152 [01/Jul/1995:00:00:13 -0400
path status content_size
1 /history/apollo/ 200 6245
2 /shuttle/countdown/ 200 3985
3 /shuttle/missions/sts-73/mission-sts-73.html 200 4085
4 /shuttle/countdown/liftoff.html 304 0
5 /shuttle/missions/sts-73/sts-73-patch-small.gif 200 4179
6 /images/NASA-logosmall.gif 304 0
7 /shuttle/countdown/video/livevideo.gif 200 0
8 /shuttle/countdown/countdown.html 200 3985
9 /shuttle/countdown/ 200 3985
10 / 200 7074

4 Webserver logs parsing with SparklyR

install.packages("sparklyr")
library(sparklyr)
library(dplyr)
library(stringr)
#sc <- spark_connect(master = "local", version = "2.1.0")
sc <- spark_connect(method = "databricks")
sdf <-spark_read_text(sc, name="df", path = "/FileStore/tables/NASA_access_log*.gz")
sdf
Installing package into ‘/databricks/spark/R/lib’
# Source: spark [?? x 1]
   line                                                                         
                                                                           
 1 "199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] \"GET /history/apollo/ HTTP/1…
 2 "unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] \"GET /shuttle/countd…
 3 "199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] \"GET /shuttle/missions/sts…
 4 "burger.letters.com - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/countdow…
 5 "199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] \"GET /shuttle/missions/sts…
 6 "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /images/NASA-logo…
 7 "burger.letters.com - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdow…
 8 "205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] \"GET /shuttle/countdown/c…
 9 "d104.aa.net - - [01/Jul/1995:00:00:13 -0400] \"GET /shuttle/countdown/ HTTP…
10 "129.94.144.152 - - [01/Jul/1995:00:00:13 -0400] \"GET / HTTP/1.0\" 200 7074"
# … with more rows
#install.packages(“sparklyr”)
library(sparklyr)
library(dplyr)
library(stringr)
#sc <- spark_connect(master = "local", version = "2.1.0")
sc <- spark_connect(method = "databricks")
sdf <-spark_read_text(sc, name="df", path = "/FileStore/tables/NASA_access_log*.gz")
sdf <- sdf %>% mutate(host = regexp_extract(line, '^(\\\\S+)',1)) %>% 
               mutate(timestamp = regexp_extract(line, '((\\\\S+ -\\\\d{4}))',2)) %>%
               mutate(path = regexp_extract(line, '(\\\\"\\\\w+\\\\s+([^\\\\s]+)\\\\s+HTTP.*")',2)) %>%
               mutate(status = regexp_extract(line, '(^.*"\\\\s+([^\\\\s]+))',2)) %>%
               mutate(content_size = regexp_extract(line, '(^.*\\\\s+(\\\\d+)$)',2))

5 Hosts

5.1  RDD

5.11 Parse and map to hosts to groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))

# Create tuples of (host,1) and apply reduceByKey() and order by descending
rslt=(parsed_rdd2.map(lambda x😦x[0],1))
                 .reduceByKey(lambda a,b:a+b)
                 .takeOrdered(10, lambda x: -x[1]))
rslt
Out[18]:
[(‘piweba3y.prodigy.com’, 21988),
(‘piweba4y.prodigy.com’, 16437),
(‘piweba1y.prodigy.com’, 12825),
(‘edams.ksc.nasa.gov’, 11962),
(‘163.206.89.4’, 9697),
(‘news.ti.com’, 8161),
(‘www-d1.proxy.aol.com’, 8047),
(‘alyssa.prodigy.com’, 8037),
(‘siltb10.orl.mmc.com’, 7573),
(‘www-a2.proxy.aol.com’, 7516)]

5.12Plot counts of hosts

import seaborn as sns

import pandas as pd import matplotlib.pyplot as plt df=pd.DataFrame(rslt,columns=[‘host’,‘count’]) sns.barplot(x=‘host’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.6, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display()

5.2 PySpark

5.21 Compute counts of hosts

df= (cleaned_df
     .groupBy('host')
     .count()
     .orderBy('count',ascending=False))
df.show(5)
+——————–+—–+
| host|count|
+——————–+—–+
|piweba3y.prodigy….|21988|
|piweba4y.prodigy….|16437|
|piweba1y.prodigy….|12825|
| edams.ksc.nasa.gov |11964|
| 163.206.89.4 | 9697|
+——————–+—–+
only showing top 5 rows

5.22 Plot count of hosts

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df1=df.toPandas()
df2 = df1.head(10)
df2.count()
sns.barplot(x='host',y='count',data=df2)
plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9)
plt.xlabel("Hosts")
plt.ylabel('Count')
plt.xticks(rotation="vertical",fontsize=10)
display()

5.3 SparkR

5.31 Compute count of hosts

c <- SparkR::select(a,a$host)
df=SparkR::summarize(SparkR::groupBy(c, a$host), noHosts = count(a$host))
df1 =head(arrange(df,desc(df$noHosts)),10)
head(df1)
                  host noHosts
1 piweba3y.prodigy.com   17572
2 piweba4y.prodigy.com   11591
3 piweba1y.prodigy.com    9868
4   alyssa.prodigy.com    7852
5  siltb10.orl.mmc.com    7573
6 piweba2y.prodigy.com    5922

5.32 Plot count of hosts

library(ggplot2)
p <-ggplot(data=df1, aes(x=host, y=noHosts,fill=host)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Host') + ylab('Count')
p

5.4 SparklyR

5.41 Compute count of Hosts

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(host) %>% group_by(host) %>% summarise(noHosts=n()) %>% arrange(desc(noHosts))
df2 <-head(df1,10)

5.42 Plot count of hosts

library(ggplot2)

p <-ggplot(data=df2, aes(x=host, y=noHosts,fill=host)) + geom_bar(stat=identity”)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Host’) + ylab(Count’)

p

6 Paths

6.1 RDD

6.11 Parse and map to hosts to groups

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x😦x[5],1))
                 .reduceByKey(lambda a,b:a+b)
                 .takeOrdered(10, lambda x: -x[1]))
rslt
[(‘”GET /images/NASA-logosmall.gif HTTP/1.0″‘, 207520),
(‘”GET /images/KSC-logosmall.gif HTTP/1.0″‘, 164487),
(‘”GET /images/MOSAIC-logosmall.gif HTTP/1.0″‘, 126933),
(‘”GET /images/USA-logosmall.gif HTTP/1.0″‘, 126108),
(‘”GET /images/WORLD-logosmall.gif HTTP/1.0″‘, 124972),
(‘”GET /images/ksclogo-medium.gif HTTP/1.0″‘, 120704),
(‘”GET /ksc.html HTTP/1.0″‘, 83209),
(‘”GET /images/launch-logo.gif HTTP/1.0″‘, 75839),
(‘”GET /history/apollo/images/apollo-logo1.gif HTTP/1.0″‘, 68759),
(‘”GET /shuttle/countdown/ HTTP/1.0″‘, 64467)]

6.12 Plot counts of HTTP Requests

import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘path’,‘count’]) sns.barplot(x=‘path’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8)

display()

6.2 Pyspark

6.21 Compute count of HTTP Requests

df= (cleaned_df
     .groupBy('path')
     .count()
     .orderBy('count',ascending=False))
df.show(5)
Out[20]:
+——————–+——+
| path| count|
+——————–+——+
|/images/NASA-logo…|208362|
|/images/KSC-logos…|164813|
|/images/MOSAIC-lo…|127656|
|/images/USA-logos…|126820|
|/images/WORLD-log…|125676|
+——————–+——+
only showing top 5 rows

6.22 Plot count of HTTP Requests

import matplotlib.pyplot as plt

import pandas as pd import seaborn as sns df1=df.toPandas() df2 = df1.head(10) df2.count() sns.barplot(x=‘path’,y=‘count’,data=df2)

plt.subplots_adjust(bottom=0.7, right=0.8, top=0.9) plt.xlabel(“HTTP Requests”) plt.ylabel(‘Count’) plt.xticks(rotation=90,fontsize=8)

display()

 

6.3 SparkR

6.31Compute count of HTTP requests

library(SparkR)
c <- SparkR::select(a,a$path)
df=SparkR::summarize(SparkR::groupBy(c, a$path), numRequest = count(a$path))
df1=head(df)

3.14 Plot count of HTTP Requests

library(ggplot2)
p <-ggplot(data=df1, aes(x=path, y=numRequest,fill=path)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ xlab('Path') + ylab('Count')
p

6.4 SparklyR

6.41 Compute count of paths

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(path) %>% group_by(path) %>% summarise(noPaths=n()) %>% arrange(desc(noPaths))
df2 <-head(df1,10)
df2
# Source: spark [?? x 2]
# Ordered by: desc(noPaths)
   path                                    noPaths
                                        
 1 /images/NASA-logosmall.gif               208362
 2 /images/KSC-logosmall.gif                164813
 3 /images/MOSAIC-logosmall.gif             127656
 4 /images/USA-logosmall.gif                126820
 5 /images/WORLD-logosmall.gif              125676
 6 /images/ksclogo-medium.gif               121286
 7 /ksc.html                                 83685
 8 /images/launch-logo.gif                   75960
 9 /history/apollo/images/apollo-logo1.gif   68858
10 /shuttle/countdown/                       64695

6.42 Plot count of Paths

library(ggplot2)
p <-ggplot(data=df2, aes(x=path, y=noPaths,fill=path)) +   geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Path') + ylab('Count')
p

7.1 RDD

7.11 Compute count of HTTP Status

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x😦x[7],1))
                 .reduceByKey(lambda a,b:a+b)
                 .takeOrdered(10, lambda x: -x[1]))
rslt
Out[22]:
[(‘200’, 3095682),
(‘304’, 266764),
(‘302’, 72970),
(‘404’, 20625),
(‘403’, 225),
(‘500’, 65),
(‘501’, 41)]

1.37 Plot counts of HTTP response status’

import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘status’,‘count’]) sns.barplot(x=‘status’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8)

display()

7.2 Pyspark

7.21 Compute count of HTTP status

status_count=(cleaned_df
                .groupBy('status')
                .count()
                .orderBy('count',ascending=False))
status_count.show()
+——+——-+
|status| count|
+——+——-+
| 200|3100522|
| 304| 266773|
| 302| 73070|
| 404| 20901|
| 403| 225|
| 500| 65|
| 501| 41|
| 400| 15|
| null| 1|

7.22 Plot count of HTTP status

Plot the HTTP return status vs the counts

df1=status_count.toPandas()

df2 = df1.head(10) df2.count() sns.barplot(x=‘status’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel(“HTTP Status”) plt.ylabel(‘Count’) plt.xticks(rotation=“vertical”,fontsize=10) display()

7.3 SparkR

7.31 Compute count of HTTP Response status

library(SparkR)
c <- SparkR::select(a,a$status)
df=SparkR::summarize(SparkR::groupBy(c, a$status), numStatus = count(a$status))
df1=head(df)

3.16 Plot count of HTTP Response status

library(ggplot2)
p <-ggplot(data=df1, aes(x=status, y=numStatus,fill=status)) +   geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Status') + ylab('Count')
p

7.4 SparklyR

7.41 Compute count of status

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(status) %>% group_by(status) %>% summarise(noStatus=n()) %>% arrange(desc(noStatus))
df2 <-head(df1,10)
df2
# Source: spark [?? x 2]
# Ordered by: desc(noStatus)
  status noStatus
       
1 200     3100522
2 304      266773
3 302       73070
4 404       20901
5 403         225
6 500          65
7 501          41
8 400          15
9 ""            1

7.42 Plot count of status

library(ggplot2)

p <-ggplot(data=df2, aes(x=status, y=noStatus,fill=status)) + geom_bar(stat=identity”)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Status’) + ylab(Count’) p

8.1 RDD

8.12 Compute count of content size

parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map2groups(line))
rslt=(parsed_rdd2.map(lambda x😦x[8],1))
                 .reduceByKey(lambda a,b:a+b)
                 .takeOrdered(10, lambda x: -x[1]))
rslt
Out[24]:
[(‘0’, 280017),
(‘786’, 167281),
(‘1204’, 140505),
(‘363’, 111575),
(‘234’, 110824),
(‘669’, 110056),
(‘5866’, 107079),
(‘1713’, 66904),
(‘1173’, 63336),
(‘3635’, 55528)]

8.21 Plot content size

import seaborn as sns

df=pd.DataFrame(rslt,columns=[‘content_size’,‘count’]) sns.barplot(x=‘content_size’,y=‘count’,data=df) plt.subplots_adjust(bottom=0.4, right=0.8, top=0.9) plt.xticks(rotation=“vertical”,fontsize=8) display()

8.2 Pyspark

8.21 Compute count of content_size

size_counts=(cleaned_df
                .groupBy('content_size')
                .count()
                .orderBy('count',ascending=False))
size_counts.show(10)
+------------+------+
|content_size| count|
+------------+------+
|           0|313932|
|         786|167709|
|        1204|140668|
|         363|111835|
|         234|111086|
|         669|110313|
|        5866|107373|
|        1713| 66953|
|        1173| 63378|
|        3635| 55579|
+------------+------+
only showing top 10 rows

8.22 Plot counts of content size

Plot the path access versus the counts

df1=size_counts.toPandas()

df2 = df1.head(10) df2.count() sns.barplot(x=‘content_size’,y=‘count’,data=df2) plt.subplots_adjust(bottom=0.5, right=0.8, top=0.9) plt.xlabel(“content_size”) plt.ylabel(‘Count’) plt.xticks(rotation=“vertical”,fontsize=10) display()

8.3 SparkR

8.31 Compute count of content size

library(SparkR)
c <- SparkR::select(a,a$content_size)
df=SparkR::summarize(SparkR::groupBy(c, a$content_size), numContentSize = count(a$content_size))
df1=head(df)
df1
     content_size numContentSize
1        28426           1414
2        78382            293
3        60053              4
4        36067              2
5        13282            236
6        41785            174
8.32 Plot count of content sizes
library(ggplot2)

p <-ggplot(data=df1, aes(x=content_size, y=numContentSize,fill=content_size)) + geom_bar(stat=identity”) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab(Content Size’) + ylab(Count’)

p

8.4 SparklyR

8.41Compute count of content_size

df <- sdf %>% select(host,timestamp,path,status,content_size)
df1 <- df %>% select(content_size) %>% group_by(content_size) %>% summarise(noContentSize=n()) %>% arrange(desc(noContentSize))
df2 <-head(df1,10)
df2
# Source: spark [?? x 2]
# Ordered by: desc(noContentSize)
   content_size noContentSize
                   
 1 0                   280027
 2 786                 167709
 3 1204                140668
 4 363                 111835
 5 234                 111086
 6 669                 110313
 7 5866                107373
 8 1713                 66953
 9 1173                 63378
10 3635                 55579

8.42 Plot count of content_size

library(ggplot2)
p <-ggplot(data=df2, aes(x=content_size, y=noContentSize,fill=content_size)) +   geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab('Content size') + ylab('Count')
p

Conclusion: I spent many,many hours struggling with Regex and getting RDDs,Pyspark to work. Also had to spend a lot of time trying to work out the syntax for SparkR and SparklyR for parsing. After you parse the logs plotting and analysis is a piece of cake! This is definitely worth a try!

Watch this space!!

Also see
1. Practical Machine Learning with R and Python – Part 3
2. Deep Learning from first principles in Python, R and Octave – Part 5
3. My book ‘Cricket analytics with cricketr and cricpy’ is now on Amazon
4. Latency, throughput implications for the Cloud
5. Modeling a Car in Android
6. Architecting a cloud based IP Multimedia System (IMS)
7. Dabbling with Wiener filter using OpenCV

To see all posts click Index of posts

Revisiting World Bank data analysis with WDI and gVisMotionChart

Note: I had written a post about 3 years back on World Bank Data Analysis using World Development Indicators (WDI) & gVisMotionCharts. But the motion charts stopped working  some time ago. I have always been wanting to fix this and I now got to actually doing it. The issue was 2 of the WDI indicators had changed. After I fixed this I was able to host the generated motion chart using github.io pages. Please make sure that you enable flash player if you open the motion charts with Google Chrome. You may also have to enable flash if using Firefox, IE etc

Please check out the 2 motions charts with World Bank data

1. World Bank Chart 1
2. World Bank Chart 2

If you are using Chrome please enable (Allow)  ‘flash player’ by clicking on the lock sign in the URL as shown

 

 

 

 

 

 

Introduction

Recently I was surfing the web, when I came across a real cool post New R package to access World Bank data, by Markus Gesmann on using googleVis and motion charts with World Bank Data. The post also introduced me to Hans Rosling, Professor of Sweden’s Karolinska Institute. Hans Rosling, the creator of the famous Gapminder chart, the “Heath and Wealth of Nations” displays global trends through animated charts (A must see!!!). As they say, in Hans Rosling’s hands, data dances and sings. Take a look at  his Ted talks for e.g. Hans Rosling:New insights on poverty. Prof Rosling developed the breakthrough software behind the visualizations, in the Gapminder. The free software, which can be loaded with any data – was purchased by Google in March 2007.

In this post, I recreate some of the Gapminder charts with the help of R packages WDI and googleVis. The WDI  package of  Vincent Arel-Bundock, provides a set of really useful functions to get to data based on the World Bank Data indicators.  googleVis provides motion charts with which you can animate the data.

You can clone/download the code from Github at worldBankAnalysis which is in the form of an Rmd file.

library(WDI)
library(ggplot2)
library(googleVis)
library(plyr)

1.Get the data from 1960 to 2019 for the following

  1. Population – SP.POP.TOTL
  2. GDP in US $ – NY.GDP.MKTP.CD
  3. Life Expectancy at birth (Years) – SP.DYN.LE00.IN
  4. GDP Per capita income – NY.GDP.PCAP.PP.CD
  5. Fertility rate (Births per woman) – SP.DYN.TFRT.IN
  6. Poverty headcount ratio – SI.POV.NAHC
# World population total
population = WDI(indicator='SP.POP.TOTL', country="all",start=1960, end=2019)
# GDP in US $
gdp= WDI(indicator='NY.GDP.MKTP.CD', country="all",start=1960, end=2019)
# Life expectancy at birth (Years)
lifeExpectancy= WDI(indicator='SP.DYN.LE00.IN', country="all",start=1960, end=2019)
# GDP Per capita
income = WDI(indicator='NY.GDP.PCAP.PP.CD', country="all",start=1960, end=2019)
# Fertility rate (births per woman)
fertility = WDI(indicator='SP.DYN.TFRT.IN', country="all",start=1960, end=2019)
# Poverty head count
poverty= WDI(indicator='SI.POV.NAHC', country="all",start=1960, end=2019)

2.Rename the columns

names(population)[3]="Total population"
names(lifeExpectancy)[3]="Life Expectancy (Years)"
names(gdp)[3]="GDP (US$)"
names(income)[3]="GDP per capita income"
names(fertility)[3]="Fertility (Births per woman)"
names(poverty)[3]="Poverty headcount ratio"

3.Join the data frames

Join the individual data frames to one large wide data frame with all the indicators for the countries
j1 <- join(population, gdp)

j2 <- join(j1,lifeExpectancy)

j3 <- join(j2,income)

j4 <- join(j3,poverty)

wbData <- join(j4,fertility)

4.Use WDI_data

Use WDI_data to get the list of indicators and the countries. Join the countries and region

#This returns  list of 2 matrixes
wdi_data =WDI_data
# The 1st matrix is the list is the set of all World Bank Indicators
indicators=wdi_data[[1]]
# The 2nd  matrix gives the set of countries and regions
countries=wdi_data[[2]]
df = as.data.frame(countries)
aa <- df$region != "Aggregates"
# Remove the aggregates
countries_df <- df[aa,]
# Subset from the development data only those corresponding to the countries
bb = subset(wbData, country %in% countries_df$country)
cc = join(bb,countries_df)
dd = complete.cases(cc)
developmentDF = cc[dd,]

5.Create and display the motion chart

gg<- gvisMotionChart(cc,
                                idvar = "country",
                                timevar = "year",
                                xvar = "GDP",
                                yvar = "Life Expectancy",
                                sizevar ="Population",
                                colorvar = "region")
plot(gg)
cat(gg$html$chart, file="chart1.html")

Note: Unfortunately it is not possible to embed the motion chart in WordPress. It is has to hosted on a server as a Webpage. After exploring several possibilities I came up with the following process to display the animation graph. The plot is saved as a html file using ‘cat’ as shown above. The WorldBank_chart1.html page is then hosted as a Github page (gh-page) on Github.

Here is the ggvisMotionChart

Do give  World Bank Motion Chart1  a spin.  Here is how the Motion Chart has to be used

untitled

You can select Life Expectancy, Population, Fertility etc by clicking the black arrows. The blue arrow shows the ‘play’ button to set animate the motion chart. You can also select the countries and change the size of the circles. Do give it a try. Here are some quick analysis by playing around with the motion charts with different parameters chosen

The set of charts below are screenshots captured by running the motion chart World Bank Motion Chart1

a. Life Expectancy vs Fertility chart

This chart is used by Hans Rosling in his Ted talk. The left chart shows low life expectancy and high fertility rate for several sub Saharan and East Asia Pacific countries in the early 1960’s. Today the fertility has dropped and the life expectancy has increased overall. However the sub Saharan countries still have a high fertility rate

pic1

b. Population vs GDP

The chart below shows that GDP of India and China have the same GDP from 1973-1994 with US and Japan well ahead.

pic2

From 1998- 2014 China really pulls away from India and Japan as seen below

pic3

c. Per capita income vs Life Expectancy

In the 1990’s the per capita income and life expectancy of the sub -saharan countries are low (42-50). Japan and US have a good life expectancy in 1990’s. In 2014 the per capita income of the sub-saharan countries are still low though the life expectancy has marginally improved.

pic4

d. Population vs Poverty headcount

pic5

In the early 1990’s China had a higher poverty head count ratio than India. By 2004 China had this all figured out and the poverty head count ratio drops significantly. This can also be seen in the chart below.

pop_pov3

In the chart above China shows a drastic reduction in poverty headcount ratio vs India. Strangely Zambia shows an increase in the poverty head count ratio.

6.Get the data for the 2nd set of indicators

  1. Total population  – SP.POP.TOTL
  2. GDP in US$ – NY.GDP.MKTP.CD
  3. Access to electricity (% population) – EG.ELC.ACCS.ZS
  4. Electricity consumption KWh per capita -EG.USE.ELEC.KH.PC
  5. CO2 emissions -EN.ATM.CO2E.KT
  6. Basic Sanitation Access – SH.STA.BASS.ZS
# World population
population = WDI(indicator='SP.POP.TOTL', country="all",start=1960, end=2016)
# GDP in US $
gdp= WDI(indicator='NY.GDP.MKTP.CD', country="all",start=1960, end=2016)
# Access to electricity (% population)
elecAccess= WDI(indicator='EG.ELC.ACCS.ZS', country="all",start=1960, end=2016)
# Electric power consumption Kwh per capita
elecConsumption= WDI(indicator='EG.USE.ELEC.KH.PC', country="all",start=1960, end=2016)
#CO2 emissions
co2Emissions= WDI(indicator='EN.ATM.CO2E.KT', country="all",start=1960, end=2016)
# Access to sanitation (% population)
sanitationAccess= WDI(indicator='SH.STA.ACSN', country="all",start=1960, end=2016)

7.Rename the columns

names(population)[3]="Total population"
names(gdp)[3]="GDP US($)"
names(elecAccess)[3]="Access to Electricity (% popn)"
names(elecConsumption)[3]="Electric power consumption (KWH per capita)"
names(co2Emissions)[3]="CO2 emisions"
names(sanitationAccess)[3]="Access to sanitation(% popn)"

8.Join the individual data frames

Join the individual data frames to one large wide data frame with all the indicators for the countries


j1 <- join(population, gdp)
j2 <- join(j1,elecAccess)
j3 <- join(j2,elecConsumption)
j4 <- join(j3,co2Emissions)
wbData1 <- join(j3,sanitationAccess)

 

9.Use WDI_data

Use WDI_data to get the list of indicators and the countries. Join the countries and region

#This returns  list of 2 matrixes
wdi_data =WDI_data
# The 1st matrix is the list is the set of all World Bank Indicators
indicators=wdi_data[[1]]
# The 2nd  matrix gives the set of countries and regions
countries=wdi_data[[2]]
df = as.data.frame(countries)
aa <- df$region != "Aggregates"
# Remove the aggregates
countries_df <- df[aa,]
# Subset from the development data only those corresponding to the countries
ee = subset(wbData1, country %in% countries_df$country)
ff = join(ee,countries_df)
## Joining by: iso2c, country

10.Create and display the motion chart

gg1<- gvisMotionChart(ff,
                                idvar = "country",
                                timevar = "year",
                                xvar = "GDP",
                                yvar = "Access to Electricity",
                                sizevar ="Population",
                                colorvar = "region")
plot(gg1)
cat(gg1$html$chart, file="chart2.html")

This is World Bank Motion Chart2  which has a different set of parameters like Access to Energy, CO2 emissions etc

The set of charts below are screenshots of the motion chart World Bank Motion Chart 2

a. Access to Electricity vs Population
pic6The above chart shows that in China 100% population have access to electricity. India has made decent progress from 50% in 1990 to 79% in 2012. However Pakistan seems to have been much better in providing access to electricity. Pakistan moved from 59% to close 98% access to electricity

b. Power consumption vs population

powercon

The above chart shows the Power consumption vs Population. China and India have proportionally much lower consumption that Norway, US, Canada

c. CO2 emissions vs Population

pic7

In 1963 the CO2 emissions were fairly low and about comparable for all countries. US, India have shown a steady increase while China shows a steep increase. Interestingly UK shows a drop in CO2 emissions

d.  Access to sanitation
san

India shows an improvement but it has a long way to go with only 40% of population with access to sanitation. China has made much better strides with 80% having access to sanitation in 2015. Strangely Nigeria shows a drop in sanitation by almost about 20% of population.

The code is available at Github at worldBankAnalysis

Conclusion: So there you have it. I have shown some screenshots of some sample parameters of the World indicators. Please try to play around with World Bank Motion Chart1 & World Bank Motion Chart 2  with your own set of parameters and countries.  You can also create your own motion chart from the 100s of WDI indicators avaialable at  World Bank Data indicator.

Also see
1. My book ‘Deep Learning from first principles:Second Edition’ now on Amazon
2.  Dabbling with Wiener filter using OpenCV
3. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
4. Design Principles of Scalable, Distributed Systems
5. Re-introducing cricketr! : An R package to analyze performances of cricketers
6. Natural language processing: What would Shakespeare say?
7. Brewing a potion with Bluemix, PostgreSQL, Node.js in the cloud
8. Simulating an Edge Shape in Android

To see all posts Index of posts

Analyzing performances of cricketers using cricketr template

This post includes a template which you can use for analyzing the performances of cricketers, both batsmen and bowlers in Test, ODI and Twenty 20 cricket using my R package cricketr. To see actual usage of functions in the R package cricketr see Introducing cricketr! : An R package to analyze performances of cricketers.

This template can be downloaded from Github at cricketer-template

The ‘cricketr’ package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports all formats of the game including Test, ODI and Twenty20 versions.

You should be able to install the package from GitHub and use the many functions available in the package. Please mindful of the ESPN Cricinfo Terms of Use

Take a look at my short video tutorial on my R package cricketr on Youtube – R package cricketr – A short tutorial

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

The cricketr package

The cricketr package has several functions that perform several different analyses on both batsman and bowlers. The package has function that plot percentage frequency runs or wickets, runs likelihood for a batsman, relative run/strike rates of batsman and relative performance/economy rate for bowlers are available.

Other interesting functions include batting performance moving average, forecast and a function to check whether the batsmans in in-form or out-of-form.

The data for a particular player can be obtained with the getPlayerData() function. To do you will need to go to ESPN CricInfo Player and type in the name of the player for e.g Ricky Ponting, Sachin Tendulkar etc. This will bring up a page which have the profile number for the player e.g. for Sachin Tendulkar this would be http://www.espncricinfo.com/india/content/player/35320.html. Hence, Sachin’s profile is 35320. This can be used to get the data for Tendulkar as shown below

The cricketr package is now available from CRAN!!! You should be able to install directly with

1. Install the cricketr package

if (!require("cricketr")){
    install.packages("cricketr",lib = "c:/test")
}
library(cricketr)

The cricketr package includes some pre-packaged sample (.csv) files. You can use these sample to test functions as shown below

# Retrieve the file path of a data file installed with cricketr
#pathToFile <- system.file("data", "tendulkar.csv", package = "cricketr")
#batsman4s(pathToFile, "Sachin Tendulkar")

# The general format is pkg-function(pathToFile,par1,...)
#batsman4s(<path-To-File>,"Sachin Tendulkar")

“` The pre-packaged files can be accessed as shown above. To get the data of any player use the function in Test, ODI and Twenty20 use the following

2. For Test cricket

#tendulkar <- getPlayerData(35320,dir="..",file="tendulkar.csv",type="batting",homeOrAway=c(1,2), result=c(1,2,4))

2a. For ODI cricket

#tendulkarOD <- getPlayerDataOD(35320,dir="..",file="tendulkarOD.csv",type="batting")

2b For Twenty 20 cricket

#tendulkarT20 <- getPlayerDataTT(35320,dir="..",file="tendulkarT20.csv",type="batting")

Analysis of batsmen

Important Note This needs to be done only once for a player. This function stores the player’s data in a CSV file (for e.g. tendulkar.csv as above) which can then be reused for all other functions. Once we have the data for the players many analyses can be done. This post will use the stored CSV file obtained with a prior getPlayerData for all subsequent analyses

Sachin Tendulkar’s performance – Basic Analyses

The 3 plots below provide the following for Tendulkar

  1. Frequency percentage of runs in each run range over the whole career
  2. Mean Strike Rate for runs scored in the given range
  3. A histogram of runs frequency percentages in runs ranges For example

3. Basic analyses

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsmanRunsFreqPerf("./tendulkar.csv","Tendulkar")
#batsmanMeanStrikeRate("./tendulkar.csv","Tendulkar")
#batsmanRunsRanges("./tendulkar.csv","Tendulkar")
dev.off()
## null device 
##           1
  1. Player 1
  2. Player 2
  3. Player 3
  4. Player 4

4. More analyses

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player1.csv","Player1")
#batsman6s("./player1.csv","Player1")
#batsmanMeanStrikeRate("./player1.csv","Player1")

# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")
dev.off()
## null device 
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player2.csv","Player2")
#batsman6s("./player2.csv","Player2")
#batsmanMeanStrikeRate("./player2.csv","Player2")
# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")
dev.off()
## null device 
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player3.csv","Player3")
#batsman6s("./player3.csv","Player3")
#batsmanMeanStrikeRate("./player3.csv","Player3")
# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")

dev.off()
## null device 
##           1
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#batsman4s("./player4.csv","Player4")
#batsman6s("./player4.csv","Player4")
#batsmanMeanStrikeRate("./player4.csv","Player4")
# For ODI and T20
#batsmanScoringRateODTT("./player1.csv","Player1")
dev.off()
## null device 
##           1

Note: For mean strike rate in ODI and Twenty20 use the function batsmanScoringRateODTT()

5.Boxplot histogram plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

#batsmanPerfBoxHist("./player1.csv","Player1")
#batsmanPerfBoxHist("./player2.csv","Player2")
#batsmanPerfBoxHist("./player3.csv","Player3")
#batsmanPerfBoxHist("./player4.csv","Player4")

6. Contribution to won and lost matches

For the 2 functions below you will have to use the getPlayerDataSp() function. I have commented this as I already have these files. This function can only be used for Test matches

#player1sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player1sp.csv",ttype="batting")
#player2sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player2sp.csv",ttype="batting")
#player3sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player3sp.csv",ttype="batting")
#player4sp <- getPlayerDataSp(xxxx,tdir=".",tfile="player4sp.csv",ttype="batting")
par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanContributionWonLost("player1sp.csv","Player1")
#batsmanContributionWonLost("player2sp.csv","Player2")
#batsmanContributionWonLost("player3sp.csv","Player3")
#batsmanContributionWonLost("player4sp.csv","Player4")
dev.off()
## null device 
##           1

7, Performance at home and overseas

This function also requires the use of getPlayerDataSp() as shown above. This can only be used for Test matches

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanPerfHomeAway("player1sp.csv","Player1")
#batsmanPerfHomeAway("player2sp.csv","Player2")
#batsmanPerfHomeAway("player3sp.csv","Player3")
#batsmanPerfHomeAway("player4sp.csv","Player4")
dev.off()
## null device 
##           1

8. Batsman average at different venues

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanAvgRunsGround("./player1.csv","Player1")
#batsmanAvgRunsGround("./player2.csv","Player2")
#batsmanAvgRunsGround("./player3.csv","Ponting")
#batsmanAvgRunsGround("./player4.csv","Player4")
dev.off()
## null device 
##           1

9. Batsman average against different opposition

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanAvgRunsOpposition("./player1.csv","Player1")
#batsmanAvgRunsOpposition("./player2.csv","Player2")
#batsmanAvgRunsOpposition("./player3.csv","Ponting")
#batsmanAvgRunsOpposition("./player4.csv","Player4")
dev.off()
## null device 
##           1

10. Runs Likelihood of batsman

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanRunsLikelihood("./player1.csv","Player1")
#batsmanRunsLikelihood("./player2.csv","Player2")
#batsmanRunsLikelihood("./player3.csv","Ponting")
#batsmanRunsLikelihood("./player4.csv","Player4")
dev.off()
## null device 
##           1

11. Moving Average of runs in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanMovingAverage("./player1.csv","Player1")
#batsmanMovingAverage("./player2.csv","Player2")
#batsmanMovingAverage("./player3.csv","Ponting")
#batsmanMovingAverage("./player4.csv","Player4")
dev.off()
## null device 
##           1

12. Cumulative Average runs of batsman in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanCumulativeAverageRuns("./player1.csv","Player1")
#batsmanCumulativeAverageRuns("./player2.csv","Player2")
#batsmanCumulativeAverageRuns("./player3.csv","Ponting")
#batsmanCumulativeAverageRuns("./player4.csv","Player4")
dev.off()
## null device 
##           1

13. Cumulative Average strike rate of batsman in career

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanCumulativeStrikeRate("./player1.csv","Player1")
#batsmanCumulativeStrikeRate("./player2.csv","Player2")
#batsmanCumulativeStrikeRate("./player3.csv","Ponting")
#batsmanCumulativeStrikeRate("./player4.csv","Player4")
dev.off()
## null device 
##           1

14. Future Runs forecast

Here are plots that forecast how the batsman will perform in future. In this case 90% of the career runs trend is uses as the training set. the remaining 10% is the test set.

A Holt-Winters forecating model is used to forecast future performance based on the 90% training set. The forecated runs trend is plotted. The test set is also plotted to see how close the forecast and the actual matches

Take a look at the runs forecasted for the batsman below.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
#batsmanPerfForecast("./player1.csv","Player1")
#batsmanPerfForecast("./player2.csv","Player2")
#batsmanPerfForecast("./player3.csv","Player3")
#batsmanPerfForecast("./player4.csv","Player4")
dev.off()
## null device 
##           1

15. Relative Mean Strike Rate plot

The plot below compares the Mean Strike Rate of the batsman for each of the runs ranges of 10 and plots them. The plot indicate the following

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeBatsmanSR(frames,names)

16. Relative Runs Frequency plot

The plot below gives the relative Runs Frequency Percetages for each 10 run bucket. The plot below show

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeRunsFreqPerf(frames,names)

17. Relative cumulative average runs in career

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","Player4")
#relativeBatsmanCumulativeAvgRuns(frames,names)

18. Relative cumulative average strike rate in career

frames <- list("./player1.csv","./player2.csv","player3.csv","player4.csv")
names <- list("Player1","Player2","Player3","player4")
#relativeBatsmanCumulativeStrikeRate(frames,names)

19. Check Batsman In-Form or Out-of-Form

The below computation uses Null Hypothesis testing and p-value to determine if the batsman is in-form or out-of-form. For this 90% of the career runs is chosen as the population and the mean computed. The last 10% is chosen to be the sample set and the sample Mean and the sample Standard Deviation are caculated.

The Null Hypothesis (H0) assumes that the batsman continues to stay in-form where the sample mean is within 95% confidence interval of population mean The Alternative (Ha) assumes that the batsman is out of form the sample mean is beyond the 95% confidence interval of the population mean.

A significance value of 0.05 is chosen and p-value us computed If p-value >= .05 – Batsman In-Form If p-value < 0.05 – Batsman Out-of-Form

Note Ideally the p-value should be done for a population that follows the Normal Distribution. But the runs population is usually left skewed. So some correction may be needed. I will revisit this later

This is done for the Top 4 batsman

#checkBatsmanInForm("./player1.csv","Player1")
#checkBatsmanInForm("./player2.csv","Player2")
#checkBatsmanInForm("./player3.csv","Player3")
#checkBatsmanInForm("./player4.csv","Player4")

20. 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
#battingPerf3d("./player1.csv","Player1")
#battingPerf3d("./player2.csv","Player2")
par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
#battingPerf3d("./player3.csv","Player3")
#battingPerf3d("./player4.csv","player4")
dev.off()
## null device 
##           1

21. Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
#Player1 <- batsmanRunsPredict("./player1.csv","Player1",newdataframe=newDF)
#Player2 <- batsmanRunsPredict("./player2.csv","Player2",newdataframe=newDF)
#ponting <- batsmanRunsPredict("./player3.csv","Player3",newdataframe=newDF)
#sangakkara <- batsmanRunsPredict("./player4.csv","Player4",newdataframe=newDF)
#batsmen <-cbind(round(Player1$Runs),round(Player2$Runs),round(Player3$Runs),round(Player4$Runs))
#colnames(batsmen) <- c("Player1","Player2","Player3","Player4")
#newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
#colnames(newDF) <- c("BallsFaced","MinsAtCrease")
#predictedRuns <- cbind(newDF,batsmen)
#predictedRuns

Analysis of bowlers

  1. Bowler1
  2. Bowler2
  3. Bowler3
  4. Bowler4

player1 <- getPlayerData(xxxx,dir=“..”,file=“player1.csv”,type=“bowling”) Note For One day you will have to use getPlayerDataOD() and for Twenty20 it is getPlayerDataTT()

21. Wicket Frequency Plot

This plot below computes the percentage frequency of number of wickets taken for e.g 1 wicket x%, 2 wickets y% etc and plots them as a continuous line

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerWktsFreqPercent("./bowler1.csv","Bowler1")
#bowlerWktsFreqPercent("./bowler2.csv","Bowler2")
#bowlerWktsFreqPercent("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

22. Wickets Runs plot

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerWktsRunsPlot("./bowler1.csv","Bowler1")
#bowlerWktsRunsPlot("./bowler2.csv","Bowler2")
#bowlerWktsRunsPlot("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

23. Average wickets at different venues

#bowlerAvgWktsGround("./bowler3.csv","Bowler3")

24. Average wickets against different opposition

#bowlerAvgWktsOpposition("./bowler3.csv","Bowler3")

25. Wickets taken moving average

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerMovingAverage("./bowler1.csv","Bowler1")
#bowlerMovingAverage("./bowler2.csv","Bowler2")
#bowlerMovingAverage("./bowler3.csv","Bowler3")

dev.off()
## null device 
##           1

26. Cumulative Wickets taken

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerCumulativeAvgWickets("./bowler1.csv","Bowler1")
#bowlerCumulativeAvgWickets("./bowler2.csv","Bowler2")
#bowlerCumulativeAvgWickets("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

27. Cumulative Economy rate

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerCumulativeAvgEconRate("./bowler1.csv","Bowler1")
#bowlerCumulativeAvgEconRate("./bowler2.csv","Bowler2")
#bowlerCumulativeAvgEconRate("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

28. Future Wickets forecast

Here are plots that forecast how the bowler will perform in future. In this case 90% of the career wickets trend is used as the training set. the remaining 10% is the test set.

A Holt-Winters forecating model is used to forecast future performance based on the 90% training set. The forecated wickets trend is plotted. The test set is also plotted to see how close the forecast and the actual matches

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerPerfForecast("./bowler1.csv","Bowler1")
#bowlerPerfForecast("./bowler2.csv","Bowler2")
#bowlerPerfForecast("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

29. Contribution to matches won and lost

As discussed above the next 2 charts require the use of getPlayerDataSp(). This can only be done for Test matches

#bowler1sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler1sp.csv",ttype="bowling")
#bowler2sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler2sp.csv",ttype="bowling")
#bowler3sp <- getPlayerDataSp(xxxx,tdir=".",tfile="bowler3sp.csv",ttype="bowling")
par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerContributionWonLost("bowler1sp","Bowler1")
#bowlerContributionWonLost("bowler2sp","Bowler2")
#bowlerContributionWonLost("bowler3sp","Bowler3")
dev.off()
## null device 
##           1

30. Performance home and overseas.

This can only be done for Test matches

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
#bowlerPerfHomeAway("bowler1sp","Bowler1")
#bowlerPerfHomeAway("bowler2sp","Bowler2")
#bowlerPerfHomeAway("bowler3sp","Bowler3")
dev.off()
## null device 
##           1

31 Relative Wickets Frequency Percentage

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlingPerf(frames,names)

32 Relative Economy Rate against wickets taken

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlingER(frames,names)

33 Relative cumulative average wickets of bowlers in career

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlerCumulativeAvgWickets(frames,names)

34 Relative cumulative average economy rate of bowlers

frames <- list("./bowler1.csv","./bowler3.csv","bowler2.csv")
names <- list("Bowler1","Bowler3","Bowler2")
#relativeBowlerCumulativeAvgEconRate(frames,names)

35 Check for bowler in-form/out-of-form

The below computation uses Null Hypothesis testing and p-value to determine if the bowler is in-form or out-of-form. For this 90% of the career wickets is chosen as the population and the mean computed. The last 10% is chosen to be the sample set and the sample Mean and the sample Standard Deviation are caculated.

The Null Hypothesis (H0) assumes that the bowler continues to stay in-form where the sample mean is within 95% confidence interval of population mean The Alternative (Ha) assumes that the bowler is out of form the sample mean is beyond the 95% confidence interval of the population mean.

A significance value of 0.05 is chosen and p-value us computed If p-value >= .05 – Batsman In-Form If p-value < 0.05 – Batsman Out-of-Form

Note Ideally the p-value should be done for a population that follows the Normal Distribution. But the runs population is usually left skewed. So some correction may be needed. I will revisit this later

Note: The check for the form status of the bowlers indicate

#checkBowlerInForm("./bowler1.csv","Bowler1")
#checkBowlerInForm("./bowler2.csv","Bowler2")
#checkBowlerInForm("./bowler3.csv","Bowler3")
dev.off()
## null device 
##           1

Big Data-2: Move into the big league:Graduate from R to SparkR

This post is a continuation of my earlier post Big Data-1: Move into the big league:Graduate from Python to Pyspark. While the earlier post discussed parallel constructs in Python and Pyspark, this post elaborates similar and key constructs in R and SparkR. While this post just focuses on the programming part of R and SparkR it is essential to understand and fully grasp the concept of Spark, RDD and how data is distributed across the clusters. This post like the earlier post shows how if you already have a good handle of R, you can easily graduate to Big Data with SparkR

Note 1: This notebook has also been published at Databricks community site Big Data-2: Move into the big league:Graduate from R to SparkR

Note 2: You can download this RMarkdown file from Github at Big Data- Python to Pyspark and R to SparkR
1a. Read CSV- R

Note: To upload the CSV to databricks see the video Upload Flat File to Databricks Table

# Read CSV file
tendulkar= read.csv("/dbfs/FileStore/tables/tendulkar.csv",stringsAsFactors = FALSE,na.strings=c(NA,"-"))
#Check the dimensions of the dataframe
dim(tendulkar)
[1] 347  12
1b. Read CSV – SparkR
# Load the SparkR library
library(SparkR)
# Initiate a SparkR session
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")

# Check the dimensions of the dataframe
dim(tendulkar1)
[1] 347  12
2a. Data frame shape – R
# Get the shape of the dataframe in R
dim(tendulkar)
[1] 347  12
2b. Dataframe shape – SparkR

The same ‘dim’ command works in SparkR too!

dim(tendulkar1)
[1] 347  12
3a . Dataframe columns – R
# Get the names
names(tendulkar) # Also colnames(tendulkar)
 [1] "Runs"       "Mins"       "BF"         "X4s"        "X6s"       
 [6] "SR"         "Pos"        "Dismissal"  "Inns"       "Opposition"
[11] "Ground"     "Start.Date"
3b. Dataframe columns – SparkR
names(tendulkar1)
 [1] "Runs"       "Mins"       "BF"         "4s"         "6s"        
 [6] "SR"         "Pos"        "Dismissal"  "Inns"       "Opposition"
[11] "Ground"     "Start Date"
4a. Rename columns – R
names(tendulkar)=c('Runs','Minutes','BallsFaced','Fours','Sixes','StrikeRate','Position','Dismissal','Innings','Opposition','Ground','StartDate')
names(tendulkar)
 [1] "Runs"       "Minutes"    "BallsFaced" "Fours"      "Sixes"     
 [6] "StrikeRate" "Position"   "Dismissal"  "Innings"    "Opposition"
[11] "Ground"     "StartDate"
4b. Rename columns – SparkR
names(tendulkar1)=c('Runs','Minutes','BallsFaced','Fours','Sixes','StrikeRate','Position','Dismissal','Innings','Opposition','Ground','StartDate')
names(tendulkar1)
 [1] "Runs"       "Minutes"    "BallsFaced" "Fours"      "Sixes"     
 [6] "StrikeRate" "Position"   "Dismissal"  "Innings"    "Opposition"
[11] "Ground"     "StartDate"
5a. Summary – R
summary(tendulkar)
     Runs              Minutes        BallsFaced         Fours       
 Length:347         Min.   :  1.0   Min.   :  0.00   Min.   : 0.000  
 Class :character   1st Qu.: 33.0   1st Qu.: 22.00   1st Qu.: 1.000  
 Mode  :character   Median : 82.0   Median : 58.50   Median : 4.000  
                    Mean   :125.5   Mean   : 89.75   Mean   : 6.274  
                    3rd Qu.:181.0   3rd Qu.:133.25   3rd Qu.: 9.000  
                    Max.   :613.0   Max.   :436.00   Max.   :35.000  
                    NA's   :18      NA's   :19       NA's   :19      
     Sixes          StrikeRate        Position     Dismissal        
 Min.   :0.0000   Min.   :  0.00   Min.   :2.00   Length:347        
 1st Qu.:0.0000   1st Qu.: 38.09   1st Qu.:4.00   Class :character  
 Median :0.0000   Median : 52.25   Median :4.00   Mode  :character  
 Mean   :0.2097   Mean   : 51.79   Mean   :4.24                     
 3rd Qu.:0.0000   3rd Qu.: 65.09   3rd Qu.:4.00                     
 Max.   :4.0000   Max.   :166.66   Max.   :7.00                     
 NA's   :18       NA's   :20       NA's   :18                       
    Innings       Opposition           Ground           StartDate        
 Min.   :1.000   Length:347         Length:347         Length:347        
 1st Qu.:1.000   Class :character   Class :character   Class :character  
 Median :2.000   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2.376                                                           
 3rd Qu.:3.000                                                           
 Max.   :4.000                                                           
 NA's   :1
5b. Summary – SparkR
summary(tendulkar1)
SparkDataFrame[summary:string, Runs:string, Minutes:string, BallsFaced:string, Fours:string, Sixes:string, StrikeRate:string, Position:string, Dismissal:string, Innings:string, Opposition:string, Ground:string, StartDate:string]
6a. Displaying details of dataframe with str() – R
str(tendulkar)
'data.frame':	347 obs. of  12 variables:
 $ Runs      : chr  "15" "DNB" "59" "8" ...
 $ Minutes   : int  28 NA 254 24 124 74 193 1 50 324 ...
 $ BallsFaced: int  24 NA 172 16 90 51 134 1 44 266 ...
 $ Fours     : int  2 NA 4 1 5 5 6 0 3 5 ...
 $ Sixes     : int  0 NA 0 0 0 0 0 0 0 0 ...
 $ StrikeRate: num  62.5 NA 34.3 50 45.5 ...
 $ Position  : int  6 NA 6 6 7 6 6 6 6 6 ...
 $ Dismissal : chr  "bowled" NA "lbw" "run out" ...
 $ Innings   : int  2 4 1 3 1 1 3 2 3 1 ...
 $ Opposition: chr  "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" ...
 $ Ground    : chr  "Karachi" "Karachi" "Faisalabad" "Faisalabad" ...
 $ StartDate : chr  "15-Nov-89" "15-Nov-89" "23-Nov-89" "23-Nov-89" ...
6b. Displaying details of dataframe with str() – SparkR
str(tendulkar1)
'SparkDataFrame': 12 variables:
 $ Runs      : chr "15" "DNB" "59" "8" "41" "35"
 $ Minutes   : chr "28" "-" "254" "24" "124" "74"
 $ BallsFaced: chr "24" "-" "172" "16" "90" "51"
 $ Fours     : chr "2" "-" "4" "1" "5" "5"
 $ Sixes     : chr "0" "-" "0" "0" "0" "0"
 $ StrikeRate: chr "62.5" "-" "34.3" "50" "45.55" "68.62"
 $ Position  : chr "6" "-" "6" "6" "7" "6"
 $ Dismissal : chr "bowled" "-" "lbw" "run out" "bowled" "lbw"
 $ Innings   : chr "2" "4" "1" "3" "1" "1"
 $ Opposition: chr "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan"
 $ Ground    : chr "Karachi" "Karachi" "Faisalabad" "Faisalabad" "Lahore" "Sialkot"
 $ StartDate : chr "15-Nov-89" "15-Nov-89" "23-Nov-89" "23-Nov-89" "1-Dec-89" "9-Dec-89"
7a. Head & tail -R
print(head(tendulkar),3)
print(tail(tendulkar),3)
 Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1   15      28         24     2     0      62.50        6    bowled       2
2  DNB      NA         NA    NA    NA         NA       NA             4
3   59     254        172     4     0      34.30        6       lbw       1
4    8      24         16     1     0      50.00        6   run out       3
5   41     124         90     5     0      45.55        7    bowled       1
6   35      74         51     5     0      68.62        6       lbw       1
  Opposition     Ground StartDate
1 v Pakistan    Karachi 15-Nov-89
2 v Pakistan    Karachi 15-Nov-89
3 v Pakistan Faisalabad 23-Nov-89
4 v Pakistan Faisalabad 23-Nov-89
5 v Pakistan     Lahore  1-Dec-89
6 v Pakistan    Sialkot  9-Dec-89
    Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
342   37     125         81     5     0      45.67        4    caught       2
343   21      71         23     2     0      91.30        4   run out       4
344   32      99         53     5     0      60.37        4       lbw       2
345    1       8          5     0     0      20.00        4       lbw       4
346   10      41         24     2     0      41.66        4       lbw       2
347   74     150        118    12     0      62.71        4    caught       2
       Opposition  Ground StartDate
342   v Australia  Mohali 14-Mar-13
343   v Australia  Mohali 14-Mar-13
344   v Australia   Delhi 22-Mar-13
345   v Australia   Delhi 22-Mar-13
346 v West Indies Kolkata  6-Nov-13
347 v West Indies  Mumbai 14-Nov-13
7b. Head – SparkR
head(tendulkar1,3)
  Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1   15      28         24     2     0       62.5        6    bowled       2
2  DNB       -          -     -     -          -        -         -       4
3   59     254        172     4     0       34.3        6       lbw       1
  Opposition     Ground StartDate
1 v Pakistan    Karachi 15-Nov-89
2 v Pakistan    Karachi 15-Nov-89
3 v Pakistan Faisalabad 23-Nov-89
8a. Determining the column types with sapply -R
sapply(tendulkar,class)
       Runs     Minutes  BallsFaced       Fours       Sixes  StrikeRate 
"character"   "integer"   "integer"   "integer"   "integer"   "numeric" 
   Position   Dismissal     Innings  Opposition      Ground   StartDate 
  "integer" "character"   "integer" "character" "character" "character"
8b. Determining the column types with printSchema – SparkR
printSchema(tendulkar1)
root
 |-- Runs: string (nullable = true)
 |-- Minutes: string (nullable = true)
 |-- BallsFaced: string (nullable = true)
 |-- Fours: string (nullable = true)
 |-- Sixes: string (nullable = true)
 |-- StrikeRate: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Dismissal: string (nullable = true)
 |-- Innings: string (nullable = true)
 |-- Opposition: string (nullable = true)
 |-- Ground: string (nullable = true)
 |-- StartDate: string (nullable = true)
9a. Selecting columns – R
library(dplyr)
df=select(tendulkar,Runs,BallsFaced,Minutes)
head(df,5)
  Runs BallsFaced Minutes
1   15         24      28
2  DNB         NA      NA
3   59        172     254
4    8         16      24
5   41         90     124
9b. Selecting columns – SparkR
library(SparkR)
Sys.setenv(SPARK_HOME="/usr/hdp/2.6.0.3-8/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# Initiate a SparkR session
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")
df=SparkR::select(tendulkar1, "Runs", "BF","Mins")
head(SparkR::collect(df))
  Runs  BF Mins
1   15  24   28
2  DNB   -    -
3   59 172  254
4    8  16   24
5   41  90  124
6   35  51   74
10a. Filter rows by criteria – R
library(dplyr)
df=tendulkar %>% filter(Runs > 50)
head(df,5)
  Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1  DNB      NA         NA    NA    NA         NA       NA             4
2   59     254        172     4     0      34.30        6       lbw       1
3    8      24         16     1     0      50.00        6   run out       3
4   57     193        134     6     0      42.53        6    caught       3
5   88     324        266     5     0      33.08        6    caught       1
     Opposition     Ground StartDate
1    v Pakistan    Karachi 15-Nov-89
2    v Pakistan Faisalabad 23-Nov-89
3    v Pakistan Faisalabad 23-Nov-89
4    v Pakistan    Sialkot  9-Dec-89
5 v New Zealand     Napier  9-Feb-90
10b. Filter rows by criteria – SparkR
df=SparkR::filter(tendulkar1, tendulkar1$Runs > 50)
head(SparkR::collect(df))
  Runs Mins  BF 4s 6s    SR Pos Dismissal Inns     Opposition       Ground
1   59  254 172  4  0  34.3   6       lbw    1     v Pakistan   Faisalabad
2   57  193 134  6  0 42.53   6    caught    3     v Pakistan      Sialkot
3   88  324 266  5  0 33.08   6    caught    1  v New Zealand       Napier
4   68  216 136  8  0    50   6    caught    2      v England   Manchester
5  114  228 161 16  0  70.8   4    caught    2    v Australia        Perth
6  111  373 270 19  0 41.11   4    caught    2 v South Africa Johannesburg
  Start Date
1  23-Nov-89
2   9-Dec-89
3   9-Feb-90
4   9-Aug-90
5   1-Feb-92
6  26-Nov-92
11a. Unique values -R
unique(tendulkar$Runs)
  [1] "15"   "DNB"  "59"   "8"    "41"   "35"   "57"   "0"    "24"   "88"  
 [11] "5"    "10"   "27"   "68"   "119*" "21"   "11"   "16"   "7"    "40"  
 [21] "148*" "6"    "17"   "114"  "111"  "1"    "73"   "50"   "9*"   "165" 
 [31] "78"   "62"   "TDNB" "28"   "104*" "71"   "142"  "96"   "43"   "11*" 
 [41] "34"   "85"   "179"  "54"   "4"    "0*"   "52*"  "2"    "122"  "31"  
 [51] "177"  "74"   "42"   "18"   "61"   "36"   "169"  "9"    "15*"  "92"  
 [61] "83"   "143"  "139"  "23"   "148"  "13"   "155*" "79"   "47"   "113" 
 [71] "67"   "136"  "29"   "53"   "124*" "126*" "44*"  "217"  "116"  "52"  
 [81] "45"   "97"   "20"   "39"   "201*" "76"   "65"   "126"  "36*"  "69"  
 [91] "155"  "22*"  "103"  "26"   "90"   "176"  "117"  "86"   "12"   "193" 
[101] "16*"  "51"   "32"   "55"   "37"   "44"   "241*" "60*"  "194*" "3"   
[111] "32*"  "248*" "94"   "22"   "109"  "19"   "14"   "28*"  "63"   "64"  
[121] "101"  "122*" "91"   "82"   "56*"  "154*" "153"  "49"   "10*"  "103*"
[131] "160"  "100*" "105*" "100"  "106"  "84"   "203"  "98"   "38"   "214" 
[141] "53*"  "111*" "146"  "14*"  "56"   "80"   "25"   "81"   "13*"
11b. Unique values – SparkR
head(SparkR::distinct(tendulkar1[,"Runs"]),5)
  Runs
1 119*
2    7
3   51
4  169
5  32*
12a. Aggregate – Mean, min and max – R
library(dplyr)
library(magrittr)
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)

# Remove rows with 'TDNB'
c <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[c,]

# Remove rows with absent
d <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[d,]
dim(tendulkar)

# Remove the "* indicating not out
tendulkar$Runs <- as.numeric(gsub("\\*","",tendulkar$Runs))
c <- complete.cases(tendulkar)

#Subset the rows which are complete
tendulkar <- tendulkar[c,]
print(dim(tendulkar))
df <-tendulkar %>%  group_by(Ground) %>% summarise(meanRuns= mean(Runs), minRuns=min(Runs), maxRuns=max(Runs)) 
#names(tendulkar)
head(df)
[1] 327  12
# A tibble: 6 x 4
  Ground       meanRuns minRuns maxRuns
                   
1 Adelaide        32.6       0.    153.
2 Ahmedabad       40.1       4.    217.
3 Auckland         5.00      5.      5.
4 Bangalore       57.9       4.    214.
5 Birmingham      46.8       1.    122.
6 Bloemfontein    85.0      15.    155.
12b. Aggregate- Mean, Min, Max – SparkR
sparkR.session()

tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")

print(dim(tendulkar1))
tendulkar1 <-SparkR::filter(tendulkar1,tendulkar1$Runs != "DNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "TDNB")
print(dim(tendulkar1))
tendulkar1<-SparkR::filter(tendulkar1,tendulkar1$Runs != "absent")
print(dim(tendulkar1))

# Cast the string type Runs to double
withColumn(tendulkar1, "Runs", cast(tendulkar1$Runs, "double"))
head(SparkR::distinct(tendulkar1[,"Runs"]),20)
# Remove the "* indicating not out
tendulkar1$Runs=SparkR::regexp_replace(tendulkar1$Runs, "\\*", "")
head(SparkR::distinct(tendulkar1[,"Runs"]),20)
df=SparkR::summarize(SparkR::groupBy(tendulkar1, tendulkar1$Ground), mean = mean(tendulkar1$Runs), minRuns=min(tendulkar1$Runs),maxRuns=max(tendulkar1$Runs))
head(df,20)
[1] 347  12
[1] 330  12
[1] 329  12
[1] 329  12
          Ground       mean minRuns maxRuns
1      Bangalore  54.312500       0      96
2       Adelaide  32.600000       0      61
3  Colombo (PSS)  37.200000      14      71
4   Christchurch  12.000000       0      24
5       Auckland   5.000000       5       5
6        Chennai  60.625000       0      81
7      Centurion  73.500000     111      36
8       Brisbane   7.666667       0       7
9     Birmingham  46.750000       1      40
10     Ahmedabad  40.125000     100       8
11 Colombo (RPS) 143.000000     143     143
12    Chittagong  57.800000     101      36
13     Cape Town  69.857143      14       9
14    Bridgetown  26.000000       0      92
15      Bulawayo  55.000000      36      74
16         Delhi  39.947368       0      76
17    Chandigarh  11.000000      11      11
18  Bloemfontein  85.000000      15     155
19 Colombo (SSC)  77.555556     104       8
20       Cuttack   2.000000       2       2
13a Using SQL with SparkR
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv", 
                header = "true", 
                delimiter = ",", 
                source = "csv", 
                inferSchema = "true", 
                na.strings = "")

# Register this SparkDataFrame as a temporary view.
createOrReplaceTempView(tendulkar1, "tendulkar2")

# SQL statements can be run by using the sql method
df=SparkR::sql("SELECT * FROM tendulkar2 WHERE Ground='Karachi'")

head(df)

  Runs Mins BF 4s 6s    SR Pos Dismissal Inns Opposition  Ground Start Date
1   15   28 24  2  0  62.5   6    bowled    2 v Pakistan Karachi  15-Nov-89
2  DNB    -  -  -  -     -   -         -    4 v Pakistan Karachi  15-Nov-89
3   23   49 29  5  0 79.31   4    bowled    2 v Pakistan Karachi  29-Jan-06
4   26   74 47  5  0 55.31   4    bowled    4 v Pakistan Karachi  29-Jan-06
Conclusion

This post discusses some of the key constructs in R and SparkR and how one can transition from R to SparkR fairly easily. I will be adding more constructs later. Do check back!

You may also like
1. Exploring Quantum Gate operations with QCSimulator
2. Deep Learning from first principles in Python, R and Octave – Part 4
3. A Bluemix recipe with MongoDB and Node.js
4. Practical Machine Learning with R and Python – Part 5
5. Introducing cricketr! : An R package to analyze performances of cricketers

To see all posts click Index of posts

My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

Note: The 3rd edition of this book is now available My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($9.99/Rs449) versions.  This second edition includes more content,  extensive comments and formatting for better readability.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical machine with R and Python: Third Edition – Machine Learning in Stereo(Paperback-$12.99)
2. Practical machine with R and Third Edition – Machine Learning in Stereo(Kindle- $9.99/Rs449)

This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Table of Contents
Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Pick up your copy today!!
Hope you have a great time learning as I did while implementing these algorithms!

Deep Learning from first principles in Python, R and Octave – Part 1

“You don’t perceive objects as they are. You perceive them as you are.”
“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”

                          David Eagleman - The Brain: The Story of You

This is the first in the series of posts, I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton also on Coursera. In this post I implement Logistic regression with a 2 layer Neural Network i.e. a Neural Network that just has an input layer and an output layer and with no hidden layer.I am certain that any self-respecting Deep Learning/Neural Network would consider a Neural Network without hidden layers as no Neural Network at all!

This 2 layer network is implemented in Python, R and Octave languages. I have included Octave, into the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent vectorized implementations. So, if you are familiar in any one of the languages, you should be able to look at the corresponding code in the other two. You can download this R Markdown file and Octave code from DeepLearning -Part 1

Check out my video presentation which discusses the derivations in detail
1. Elements of Neural Networks and Deep Le- Part 1
2. Elements of Neural Networks and Deep Learning – Part 2

To start with, Logistic Regression is performed using sklearn’s logistic regression package for the cancer data set also from sklearn. This is shown below

1. Logistic Regression

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)
# Call the Logisitic Regression function
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96

To check on other classification algorithms, check my post Practical Machine Learning with R and Python – Part 2.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($14.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

2. Logistic Regression as a 2 layer Neural Network

In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below

 

The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.

This simple 2 layer Neural Network is shown below
At the input layer there are 30 features and the corresponding weights of these inputs which are initialized to small random values.
Z= w_{1}x_{1} +w_{2}x_{2} +..+ w_{30}x_{30} + b
where ‘b’ is the bias term

The Activation function is the sigmoid function which is a= 1/(1+e^{-z})
The Loss, when the sigmoid function is used in the output layer, is given by
L=-(ylog(a) + (1-y)log(1-a)) (1)

Gradient Descent

Forward propagation

In forward propagation cycle of the Neural Network the output Z and the output of activation function, the sigmoid function, is first computed. Then using the output ‘y’ for the given features, the ‘Loss’ is computed using equation (1) above.

Backward propagation

The backward propagation cycle determines how the ‘Loss’ is impacted for small variations from the previous layers upto the input layer. In other words, backward propagation computes the changes in the weights at the input layer, which will minimize the loss. Several cycles of gradient descent are performed in the path of steepest descent to find the local minima. In other words the set of weights and biases, at the input layer, which will result in the lowest loss is computed by gradient descent. The weights at the input layer are decreased by a parameter known as the ‘learning rate’. Too big a ‘learning rate’ can overshoot the local minima, and too small a ‘learning rate’ can take a long time to reach the local minima. This is done for ‘m’ training examples.

Chain rule of differentiation
Let y=f(u)
and u=g(x) then
\partial y/\partial x = \partial y/\partial u * \partial u/\partial x

Derivative of sigmoid
\sigma=1/(1+e^{-z})
Let x= 1 + e^{-z}  then
\sigma = 1/x
\partial \sigma/\partial x = -1/x^{2}
\partial x/\partial z = -e^{-z}
Using the chain rule of differentiation we get
\partial \sigma/\partial z = \partial \sigma/\partial x * \partial x/\partial z
=-1/(1+e^{-z})^{2}* -e^{-z} = e^{-z}/(1+e^{-z})^{2}
Therefore \partial \sigma/\partial z = \sigma(1-\sigma)        -(2)

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are
L=-(y*log(a) + (1-y)*log(1-a))      -(a)
a=1/(1+e^{-Z})      -(b)
Z= w_{1}x_{1} +w_{2}x_{2} +...+ w_{30}x_{30} +b = Z = \sum_{i} w_{i}*x_{i} + b -(c)

The back propagation step requires the computation of dL/dw_{i} and dL/db_{i}. In the case of regression it would be dE/dw_{i} and dE/db_{i} where dE is the Mean Squared Error function.
Computing the derivatives for back propagation we have
dL/da = -(y/a + (1-y)/(1-a))          -(d)
because d/dx(logx) = 1/x
Also from equation (2) we get
da/dZ = a (1-a)                                  – (e)
By chain rule
\partial L/\partial Z = \partial L/\partial a * \partial a/\partial Z
therefore substituting the results of (d) & (e) we get
\partial L/\partial Z = -(y/a + (1-y)/(1-a)) * a(1-a) = a-y         (f)
Finally
\partial L/\partial w_{i}= \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial w_{i}                                                           -(g)
\partial Z/\partial w_{i} = x_{i}            – (h)
and from (f) we have  \partial L/\partial Z =a-y
Therefore  (g) reduces to
\partial L/\partial w_{i} = x_{i}* (a-y) -(i)
Also
\partial L/\partial b = \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial b -(j)
Since
\partial Z/\partial b = 1 and using (f) in (j)
\partial L/\partial b = a-y

The gradient computes the weights at the input layer and the corresponding bias by using the values
of dw_{i} and db
w_{i} := w_{i} -\alpha * dw_{i}
b := b -\alpha * db
I found the computation graph representation in the book Deep Learning: Ian Goodfellow, Yoshua Bengio, Aaron Courville, very useful to visualize and also compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

3. Neural Network for Logistic Regression -Python code (vectorized)

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Define the sigmoid function
def sigmoid(z):  
    a=1/(1+np.exp(-z))    
    return a

# Initialize
def initialize(dim):
    w = np.zeros(dim).reshape(dim,1)
    b = 0   
    return w

# Compute the loss
def computeLoss(numTraining,Y,A):
    loss=-1/numTraining *np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
    return(loss)

# Execute the forward propagation
def forwardPropagation(w,b,X,Y):
    # Compute Z
    Z=np.dot(w.T,X)+b
    # Determine the number of training samples
    numTraining=float(len(X))
    # Compute the output of the sigmoid activation function 
    A=sigmoid(Z)
    #Compute the loss
    loss = computeLoss(numTraining,Y,A)
    # Compute the gradients dZ, dw and db
    dZ=A-Y
    dw=1/numTraining*np.dot(X,dZ.T)
    db=1/numTraining*np.sum(dZ)
    
    # Return the results as a dictionary
    gradients = {"dw": dw,
             "db": db}
    loss = np.squeeze(loss)
    return gradients,loss

# Compute Gradient Descent    
def gradientDescent(w, b, X, Y, numIerations, learningRate):
    losses=[]
    idx =[]
    # Iterate 
    for i in range(numIerations):
        gradients,loss=forwardPropagation(w,b,X,Y)
        #Get the derivates
        dw = gradients["dw"]
        db = gradients["db"]
        w = w-learningRate*dw
        b = b-learningRate*db

        # Store the loss
        if i % 100 == 0:
            idx.append(i)
            losses.append(loss)      
        # Set params and grads
        params = {"w": w,
                  "b": b}  
        grads = {"dw": dw,
                 "db": db}
    
    return params, grads, losses,idx

# Predict the output for a training set 
def predict(w,b,X):
    size=X.shape[1]
    yPredicted=np.zeros((1,size))
    Z=np.dot(w.T,X)
    # Compute the sigmoid
    A=sigmoid(Z)
    for i in range(A.shape[1]):
        #If the value is > 0.5 then set as 1
        if(A[0][i] > 0.5):
            yPredicted[0][i]=1
        else:
        # Else set as 0
            yPredicted[0][i]=0

    return yPredicted

#Normalize the data   
def normalize(x):
    x_norm = None
    x_norm = np.linalg.norm(x,axis=1,keepdims=True)
    x= x/x_norm
    return x

   
# Run the 2 layer Neural Network on the cancer data set

from sklearn.datasets import load_breast_cancer
# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)
# Normalize the data for better performance
X_train1=normalize(X_train)


# Create weight vectors of zeros. The size is the number of features in the data set=30
w=np.zeros((X_train.shape[1],1))
#w=np.zeros((30,1))
b=0

#Normalize the training data so that gradient descent performs better
X_train1=normalize(X_train)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_train2=X_train1.T

# Reshape to remove the rank 1 array and then transpose
y_train1=y_train.reshape(len(y_train),1)
y_train2=y_train1.T

# Run gradient descent for 4000 times and compute the weights
parameters, grads, costs,idx = gradientDescent(w, b, X_train2, y_train2, numIerations=4000, learningRate=0.75)
w = parameters["w"]
b = parameters["b"]
   

# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_test2=X_test1.T

#Reshape y_test
y_test1=y_test.reshape(len(y_test),1)
y_test2=y_test1.T

# Predict the values for 
yPredictionTest = predict(w, b, X_test2)
yPredictionTrain = predict(w, b, X_train2)

# Print the accuracy
print("train accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTrain - y_train2)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTest - y_test)) * 100))

# Plot the Costs vs the number of iterations
fig1=plt.plot(idx,costs)
fig1=plt.title("Gradient descent-Cost vs No of iterations")
fig1=plt.xlabel("No of iterations")
fig1=plt.ylabel("Cost")
fig1.figure.savefig("fig1", bbox_inches='tight')
## train accuracy: 90.3755868545 %
## test accuracy: 89.5104895105 %

Note: It can be seen that the Accuracy on the training and test set is 90.37% and 89.51%. This is comparatively poorer than the 96% which the logistic regression of sklearn achieves! But this is mainly because of the absence of hidden layers which is the real power of neural networks.

4. Neural Network for Logistic Regression -R code (vectorized)

source("RFunctions-1.R")
# Define the sigmoid function
sigmoid <- function(z){
    a <- 1/(1+ exp(-z))
    a
}

# Compute the loss
computeLoss <- function(numTraining,Y,A){
    loss <- -1/numTraining* sum(Y*log(A) + (1-Y)*log(1-A))
    return(loss)
}

# Compute forward propagation
forwardPropagation <- function(w,b,X,Y){
    # Compute Z
    Z <- t(w) %*% X +b
    #Set the number of samples
    numTraining <- ncol(X)
    # Compute the activation function
    A=sigmoid(Z) 
    
    #Compute the loss
    loss <- computeLoss(numTraining,Y,A)
    
    # Compute the gradients dZ, dw and db
    dZ<-A-Y
    dw<-1/numTraining * X %*% t(dZ)
    db<-1/numTraining*sum(dZ)
    
    fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
    return(fwdProp)
}

# Perform one cycle of Gradient descent
gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
    losses <- NULL
    idx <- NULL
    # Loop through the number of iterations
    for(i in 1:numIerations){
        fwdProp <-forwardPropagation(w,b,X,Y)
        #Get the derivatives
        dw <- fwdProp$dw
        db <- fwdProp$db
        #Perform gradient descent
        w = w-learningRate*dw
        b = b-learningRate*db
        l <- fwdProp$loss
        # Stoe the loss
        if(i %% 100 == 0){
            idx <- c(idx,i)
            losses <- c(losses,l)  
        }
    }
    
    # Return the weights and losses
    gradDescnt <- list("w"=w,"b"=b,"dw"=dw,"db"=db,"losses"=losses,"idx"=idx)
   
    return(gradDescnt)
}

# Compute the predicted value for input
predict <- function(w,b,X){
    m=dim(X)[2]
    # Create a ector of 0's
    yPredicted=matrix(rep(0,m),nrow=1,ncol=m)
    Z <- t(w) %*% X +b
    # Compute sigmoid
    A=sigmoid(Z)
    for(i in 1:dim(A)[2]){
        # If A > 0.5 set value as 1
        if(A[1,i] > 0.5)
        yPredicted[1,i]=1
       else
        # Else set as 0
        yPredicted[1,i]=0
    }

    return(yPredicted)
}

# Normalize the matrix
normalize <- function(x){
    #Create the norm of the matrix.Perform the Frobenius norm of the matrix 
    n<-as.matrix(sqrt(rowSums(x^2)))
    #Sweep by rows by norm. Note '1' in the function which performing on every row
    normalized<-sweep(x, 1, n, FUN="/")
    return(normalized)
}

# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1,30),"output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Set the features
X_train <-train[,1:30]
y_train <- train[,31]
X_test <- test[,1:30]
y_test <- test[,31]
# Create a matrix of 0's with the number of features
w <-matrix(rep(0,dim(X_train)[2]))
b <-0
X_train1 <- normalize(X_train)
X_train2=t(X_train1)

# Reshape  then transpose
y_train1=as.matrix(y_train)
y_train2=t(y_train1)

# Perform gradient descent
gradDescent= gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)


# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_test2=t(X_test1)

#Reshape y_test and take transpose
y_test1=as.matrix(y_test)
y_test2=t(y_test1)

# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain = predict(gradDescent$w, gradDescent$b, X_train2)

sprintf("Train accuracy: %f",(100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f",(100 - mean(abs(yPredictionTest - y_test)) * 100))
## [1] "test accuracy: 87.323944"
df <-data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations","losses")
ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(col="blue") +
    ggtitle("Gradient Descent - Losses vs No of Iterations") +
    xlab("No of iterations") + ylab("Losses")

4. Neural Network for Logistic Regression -Octave code (vectorized)


1;
# Define sigmoid function
function a = sigmoid(z)
a = 1 ./ (1+ exp(-z));
end
# Compute the loss
function loss=computeLoss(numtraining,Y,A)
loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end


# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
% Compute Z
Z = w' * X + b;
numtraining = size(X)(1,2);
# Compute sigmoid
A = sigmoid(Z);


#Compute loss. Note this is element wise product
loss =computeLoss(numtraining,Y,A);
# Compute the gradients dZ, dw and db
dZ = A-Y;
dw = 1/numtraining* X * dZ';
db =1/numtraining*sum(dZ);

end

# Compute Gradient Descent
function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
#Initialize losses and idx
losses=[];
index=[];
# Loop through the number of iterations
for i=1:numIerations,
[loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
# Perform Gradient descent
w = w - learningRate*dw;
b = b - learningRate*db;
if(mod(i,100) ==0)
# Append index and loss
index = [index i];
losses = [losses loss];
endif

end
end

# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
m = size(X)(1,2);
yPredicted=zeros(1,m);
# Compute Z
Z = w' * X + b;
# Compute sigmoid
A = sigmoid(Z);
for i=1:size(X)(1,2),
# Set predicted as 1 if A > 0,5
if(A(1,i) >= 0.5)
yPredicted(1,i)=1;
else
yPredicted(1,i)=0;
endif
end
end


# Normalize by dividing each value by the sum of squares
function normalized = normalize(x)
# Compute Frobenius norm. Square the elements, sum rows and then find square root
a = sqrt(sum(x .^ 2,2));
# Perform element wise division
normalized = x ./ a;
end


# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
# Create a random index
ix = randperm(length(dataset));
# Split into training
trainSize = floor(trainPercent/100 * length(dataset));
train=dataset(ix(1:trainSize),:);
# And test
test=dataset(ix(trainSize+1:length(dataset)),:);
X_train = train(:,1:30);
y_train = train(:,31);
X_test = test(:,1:30);
y_test = test(:,31);
end


cancer=csvread("cancer.csv");
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
# Normalize X_test
X_test1=normalize(X_test);
#Transpose X_train so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);


trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100
testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100
trainAccuracy = 90.845
testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent- Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");

Conclusion
This post starts with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is comparatively poor to the highly optimized sklearn’s Logistic Regression. This is because the above neural network did not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers

The Deep Learning journey has begun… Don’t miss the bus!
Stay tuned for more interesting posts in Deep Learning!!

References
1. Deep Learning Specialization
2. Neural Networks for Machine Learning
3. Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
4. Neural Networks: The mechanics of backpropagation
5. Machine Learning

Also see
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4
3. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon
4. Practical Machine Learning with R and Python – Part 4
5. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
6. A Bluemix recipe with MongoDB and Node.js
7. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

To see all posts check Index of posts

My 3 video presentations on “Essential R”

In this post I include my  3 video presentations on the topic “Essential R”. In these 3 presentations I cover the entire landscape of R. I cover the following

  • R Language – The essentials
  • Key R Packages (dplyr, lubridate, ggplot2, etc.)
  • How to create R Markdown and share reports
  • A look at Shiny apps
  • How to create a simple R package

You can download the relevant slide deck and practice code at Essential R

Essential R – Part 1
This video cover basic R data types – character, numeric, vectors, matrices, lists and data frames. It also touches on how to subset these data types

Essential R – Part 2
This video continues on how to subset dataframes (the most important data type) and some important packages. It also presents one of the most important job of a Data Scientist – that of cleaning and shaping the data. This is done with an example unclean data frame. It also  touches on some  key operations of dplyr like select, filter, arrange, summarise and mutate. Other packages like lubridate, quantmod are also included. This presentation also shows how to use base plot and ggplot2

Essential R – Part 3
This final session covers R Markdown , and  touches on some of the key markdown elements. There is a brief overview of a simple Shiny app. Finally this presentation also shows the key steps to create an R package

These 3 R sessions cover most of the basic R topics that we tend to use in a our day-to-day R way of life. With this you should be able to hit the ground running!

Hope you enjoy these video presentation and also hope you have an even greater time with R!

Check out my 2 books on cricket, a) Cricket analytics with cricketr b) Beaten by sheer pace – Cricket analytics with yorkr, now available in both paperback & kindle versions on Amazon!!! Pick up your copies today!

Also see
1. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
2. Computer Vision: Ramblings on derivatives, histograms and contours
3. Designing a Social Web Portal
4. Revisiting Whats up, Watson – Using Watson’s Question and Answer with Bluemix – Part 2
5. Re-introducing cricketr! : An R package to analyze performances of cricketers

To see all my posts click – Index of posts