# The Clash of the Titans in Test and ODI cricket

Who looks outside, dreams; who looks inside, awakes.
Show me a sane man and I will cure him for you.

            Carl Jung 

We’re made of star stuff. We are a way for the cosmos to know itself.
If you want to make an apple pie from scratch, you must first create the universe.

            Carl Sagan

## Introduction

The biggest nag in the collective psyche of cricketing fraternity these days, is whether Virat Kohli has surpassed Sachin Tendulkar. This question has been troubling cricket lovers the world over and particularly in India, for quite a while. This nagging question has only grown stronger with Kohli’s 41st ODI century and with Michael Vaughan bestowing the GOAT title to Virat Kohli for ODI cricket. Hence, I decided to do my bit in addressing this, by doing analysis of Kohli’s and Tendulkar’s performance in ODI cricket. I also wanted to address the the best among the cricketing idols of India in Test cricket, namely Sunil Gavaskar, Sachin Tendulkar and Virat Kohli. Hence this post has 2 parts

1. Analysis of Tendulkar, Gavaskar and Kohli in Test cricket
2. Analysis of Tendulkar and Kohli in ODIs

In this post, I analyze the performances of these titans in Test and ODI cricket using my R package cricketr. While some may feel that comparisons are not possible as these batsmen are from different eras. To some extent this is true. I would give some leeway to Gavaskar as he had to bat in a pre-helmet era. But with Tendulkar and Kohli a fair and objective comparison is possible. There were pre-eminient bowlers in the times of Tendulkar as there are now.

From the analysis below, it can be seen that Tendulkar is ahead  of everybody else in Test cricket. However it must be noted that Tendulkar’s performance deteriorated towards the end of his career. Such was not the case with Gavaskar. Kohli has some catching up to do and he still has a lot of Test cricket in him.

In ODI Kohli can be seen to pulling ahead of Tendulkar in several aspects.

My R package cricketr can be installed directly from CRAN and you can use it analyze cricketers.

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports all formats of the game including Test, ODI and Twenty20 versions.

You should be able to install the package from GitHub and use the many functions available in the package. Please mindful of the ESPN Cricinfo Terms of Use

Important note 1: The latest release of ‘cricketr’ now includes the ability to analyze performances of teams now!!  See Cricketr adds team analytics to its repertoire!!!

Important note 2 : Cricketr can now do a more fine-grained analysis of players, see Cricketr learns new tricks : Performs fine-grained analysis of players

Important note 3: Do check out the python avatar of cricketr, ‘cricpy’ in my post ‘Introducing cricpy:A python package to analyze performances of cricketers

Take a look at my short video tutorial on my R package cricketr on Youtube – R package cricketr – A short tutorial

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

Note 1: If you would like to do a similar analysis for a different set of batsman and bowlers, you can clone/download my skeleton cricketr templatefrom Github (which is the R Markdown file I have used for the analysis below).

Note 2: I sprinkle the charts with my observations. Feel free to look at them more closely and come to your conclusions.

If you are passionate about cricket, and love analyzing cricket performances, then check out my racy book on cricket ‘Cricket analytics with cricketr and cricpy – Analytics harmony with R & Python’! This book discusses and shows how to use my R package ‘cricketr’ and my Python package ‘cricpy’ to analyze batsmen and bowlers in all formats of the game (Test, ODI and T20). The paperback is available on Amazon at $21.99 and the kindle version at$9.99/Rs 449/-. A must read for any cricket lover! Check it out!!

Important note: Do check out the python avatar of cricketr, ‘cricpy’ in my post Introducing cricpy:A python package to analyze performances of cricketers

### 1 Load the cricketr package

if (!require("cricketr")){
install.packages("cricketr",lib = "c:/test")
}
library(cricketr)

## A Test cricket  – Analysis of Gavaskar, Tendulkar and Kohli

### 2. Get player data

tendulkar <- getPlayerData(35320,dir=".",file="tendulkar.csv",type="batting")
kohli <- getPlayerData(253802,dir=".",file="kohli.csv",type="batting")
gavaskar <- getPlayerData(28794,dir=".",file="gavaskar.csv",type="batting")

### 3a. Basic analyses for Tendulkar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./tendulkar.csv","Tendulkar")
batsmanMeanStrikeRate("./tendulkar.csv","Tendulkar")
batsmanRunsRanges("./tendulkar.csv","Tendulkar")
dev.off()

### 3b Basic analyses for Kohli

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsFreqPerf("./kohli.csv","Kohli")
batsmanMeanStrikeRate("./kohli.csv","Kohli")
batsmanRunsRanges("./kohli.csv","Kohli")
dev.off()

### 3c Basic analyses for Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanRunsRanges("./gavaskar.csv","Gavaskar")
dev.off()

### 4a.More analyses for Tendulkar

It can be seen that Tendulkar and Gavaskar has been bowled more often than Kohli. Also Kohli does not have as many sixes in Test cricket as Tendulkar and Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./tendulkar.csv","Tendulkar")
batsman6s("./tendulkar.csv","Tendulkar")
batsmanDismissals("./tendulkar.csv","Tendulkar")
dev.off()

### 4b. More analyses for Kohli

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./kohli.csv","Kohli")
batsman6s("./kohli.csv","Kohli")
batsmanDismissals("./kohli.csv","Kohli")
dev.off()

### 4c More analyses for Gavaskar

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsmanDismissals("./gavaskar.csv","Gavaskar")
dev.off()

### 5 Performance of batsmen on different grounds

par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./tendulkar.csv","Tendulkar")
batsmanAvgRunsGround("./kohli.csv","Kohli")
batsmanAvgRunsGround("./gavaskar.csv","Gavaskar")


a

#dev.off()

### 6. Performance if batsmen against different Opposition

1. Tendulkar averages 50 against the following countries – Australia, Bangladesh, England, Sri Lanka, West Indies and Zimbabwe
2. Kohli average almost 50 against all the nations he has played – Australia, Bangladesh, England, New Zealand, Sri Lanka and West Indies
3. Gavaskar averages 50 against Australia, Pakistan, West Indies, Sri Lanka
par(mar=c(4,4,2,2))
batsmanAvgRunsOpposition("./tendulkar.csv","Tendulkar")
batsmanAvgRunsOpposition("./kohli.csv","Kohli")
batsmanAvgRunsOpposition("./gavaskar.csv","Gavaskar")

### 7. Get player data special

This is required for the next 2 function calls

tendulkarsp <- getPlayerDataSp(35320,tdir=".",tfile="tendulkarsp.csv",ttype="batting")
kohlisp <- getPlayerDataSp(253802,tdir=".",tfile="kohlisp.csv",ttype="batting")

#dev.off()

### 8 Get contribution of batsmen in matches won and lost

Kohli contribution has had an equal contribution in won and lost matches. Tendulkar’s runs seem to have not helped in winning as much as only 50% of matches he has played have been won

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))

batsmanContributionWonLost("tendulkarsp.csv","Tendulkar")
batsmanContributionWonLost("./kohlisp.csv","Kohli")


a

### 9 Performance of batsmen at home and overseas

The boxplots show that Kohli performs better overseas than at home. The 3rd quartile is higher, though the median seems to lower overseas. For Tendulkar the performance is similar both ways. Gavaskar’s median runs scored overseas is higher.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))

batsmanPerfHomeAway("tendulkarsp.csv","Tendulkar")
batsmanPerfHomeAway("./kohlisp.csv","Kohli")



### 10. Moving average of runs

Gavaskar’s moving average was very good at the time of his retirement. Kohli seems to be going very strong. Tendulkar’s performance shows signs of deterioration around the time of his retirement.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))

batsmanMovingAverage("./tendulkar.csv","Tendulkar")
batsmanMovingAverage("./kohli.csv","Kohli")

#dev.off()

### 11 Boxplot and histogram of runs

Kohli has a marginally higher average (50.69) than Tendulkar (48.65) while Gavaskar 46. The median runs are same for Tendulkar and Kohli at 32

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfBoxHist("./tendulkar.csv","Sachin Tendulkar")
batsmanPerfBoxHist("./kohli.csv","Kohli")
batsmanPerfBoxHist("./gavaskar.csv","Gavaskar")

### 12 Cumulative average Runs for batsmen

Looking at the cumulative average runs we can see a gradual drop in the cumulative average for Tendulkar while Kohli and Gavaskar’s performance seems to be getting better

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanCumulativeAverageRuns("./tendulkar.csv","Tendulkar")
batsmanCumulativeAverageRuns("./kohli.csv","Kohli")
batsmanCumulativeAverageRuns("./gavaskar.csv","Gavaskar")

### 13. Cumulative average strike rate of batsmen

Tendulkar’s strike rate is better than Kohli and Gavaskar

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanCumulativeStrikeRate("./tendulkar.csv","Tendulkar")
batsmanCumulativeStrikeRate("./kohli.csv","Kohli")
batsmanCumulativeStrikeRate("./gavaskar.csv","Gavaskar")

### 14 Performance forecast of batsmen

The forecasted performance for Kohli and Gavaskar is higher than that of Tendulkar

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./tendulkar.csv","Sachin Tendulkar")
batsmanPerfForecast("./kohli.csv","Kohli")

#dev.off()

### 15. Relative strike rate of batsmen

par(mar=c(4,4,2,2))

relativeBatsmanSR(frames,names)
#dev.off()



### 16. Relative Runs frequency of batsmen

par(mar=c(4,4,2,2))
relativeRunsFreqPerf(frames,names)
#dev.off()


### 17. Relative cumulative average runs of batsmen

Tendulkar leads the way here, but it can be seem Kohli catching up.

par(mar=c(4,4,2,2))
relativeBatsmanCumulativeAvgRuns(frames,names)
#dev.off()


### 18. Relative cumulative average strike rate

Tendulkar has better strike rate than the other two.

par(mar=c(4,4,2,2))
relativeBatsmanCumulativeStrikeRate(frames,names)
#dev.off()


### 19. Check batsman in form

As in the moving average and performance forecast and cumulative average runs, Kohli and Gavaskar are in-form while Tendulkar was out-of-form towards the end.

checkBatsmanInForm("./tendulkar.csv","Sachin Tendulkar")
## [1] "**************************** Form status of Sachin Tendulkar ****************************
\n\n Population size: 294  Mean of population: 50.48 \n Sample size: 33  Mean of sample: 32.42 SD of
sample: 29.8 \n\n Null hypothesis H0 : Sachin Tendulkar 's sample average is within 95% confidence interval
of population average\n Alternative hypothesis Ha : Sachin Tendulkar 's sample average is below
the 95% confidence interval of population average\n\n
Sachin Tendulkar 's Form Status: Out-of-Form because the p value: 0.000713  is less than alpha=  0.05 \n *******************************************************************************************\n\n"
checkBatsmanInForm("./kohli.csv","Kohli")
## [1] "**************************** Form status of Kohli ****************************\n\n Population size: 117
Mean of population: 50.35 \n Sample size: 13  Mean of sample: 53.77 SD of sample: 46.15 \n\n Null
hypothesis H0 : Kohli 's sample average is within 95% confidence interval of population average\n
Alternative hypothesis Ha : Kohli 's sample average is below the 95% confidence interval of population
average\n\n Kohli 's Form Status: In-Form because the p value: 0.603244  is greater than alpha=  0.05 \n *******************************************************************************************\n\n"
checkBatsmanInForm("./gavaskar.csv","Gavaskar")
## [1] "**************************** Form status of Gavaskar ****************************\n\n
Population size: 125  Mean of population: 44.67 \n Sample size: 14  Mean of sample: 57.86 SD of sample:
58.55 \n\n Null hypothesis H0 : Gavaskar 's sample average is within 95% confidence interval of population
average\n Alternative hypothesis Ha : Gavaskar 's sample average is below the 95% confidence interval of
population average\n\n Gavaskar 's Form Status: In-Form because the p value: 0.793276  is greater
than alpha=  0.05 \n *******************************************************************************************\n\n"
#dev.off()

### 20. Performance 3D

A 3D regression plane is fitted between the the Balls faced, Minutes at crease and Runs scored

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
battingPerf3d("./tendulkar.csv","Sachin Tendulkar")
battingPerf3d("./kohli.csv","Kohli")
#dev.off()

### 20. Runs likelihood

This functions computes the K-Means and determines the runs the batsmen are likely to score.

par(mar=c(4,4,2,2))
batsmanRunsLikelihood("./tendulkar.csv","Tendulkar")
## Summary of  Tendulkar 's runs scoring likelihood
## **************************************************
##
## There is a 16.51 % likelihood that Tendulkar  will make  139 Runs in  251 balls over 353  Minutes
## There is a 25.08 % likelihood that Tendulkar  will make  66 Runs in  122 balls over  167  Minutes
## There is a 58.41 % likelihood that Tendulkar  will make  16 Runs in  31 balls over 44  Minutes
batsmanRunsLikelihood("./kohli.csv","Kohli")
## Summary of  Kohli 's runs scoring likelihood
## **************************************************
##
## There is a 20 % likelihood that Kohli  will make  143 Runs in  232 balls over 330  Minutes
## There is a 33.85 % likelihood that Kohli  will make  51 Runs in  92 balls over  127  Minutes
## There is a 46.15 % likelihood that Kohli  will make  11 Runs in  24 balls over 31  Minutes
batsmanRunsLikelihood("./gavaskar.csv","Gavaskar")
## Summary of  Gavaskar 's runs scoring likelihood
## **************************************************
##
## There is a 33.81 % likelihood that Gavaskar  will make  69 Runs in  159 balls over 214  Minutes
## There is a 8.63 % likelihood that Gavaskar  will make  172 Runs in  364 balls over  506  Minutes
## There is a 57.55 % likelihood that Gavaskar  will make  13 Runs in  35 balls over 48  Minutes

### 21. Predict runs for a random combination of Balls faced and runs scored

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
tendulkar <- batsmanRunsPredict("./tendulkar.csv","Tendulkar",newdataframe=newDF)
kohli <- batsmanRunsPredict("./kohli.csv","Kohli",newdataframe=newDF)
2. Practical machine with R and Third Edition – Machine Learning in Stereo(Kindle- $9.99/Rs449) This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better. Here is a look at the topics covered Table of Contents Preface …………………………………………………………………………….4 Introduction ………………………………………………………………………6 1. Essential R ………………………………………………………………… 8 2. Essential Python for Datascience ……………………………………………57 3. R vs Python …………………………………………………………………81 4. Regression of a continuous variable ……………………………………….101 5. Classification and Cross Validation ………………………………………..121 6. Regression techniques and regularization ………………………………….146 7. SVMs, Decision Trees and Validation curves ………………………………191 8. Splines, GAMs, Random Forests and Boosting ……………………………222 9. PCA, K-Means and Hierarchical Clustering ………………………………258 References ……………………………………………………………………..269 Pick up your copy today!! Hope you have a great time learning as I did while implementing these algorithms! # My book “Deep Learning from first principles” now on Amazon Note: The 2nd edition of this book is now available on Amazon My 4th book(self-published), “Deep Learning from first principles – In vectorized Python, R and Octave” (557 pages), is now available on Amazon in both paperback ($18.99) and kindle ($9.99/Rs449). The book starts with the most primitive 2-layer Neural Network and works its way to a generic L-layer Deep Learning Network, with all the bells and whistles. The book includes detailed derivations and vectorized implementations in Python, R and Octave. The code has been extensively commented and has been included in the Appendix section. Pick up your copy today!!! # Deep Learning from first principles in Python, R and Octave – Part 8 ## 1. Introduction You don’t understand anything until you learn it more than one way. Marvin Minsky No computer has ever been designed that is ever aware of what it’s doing; but most of the time, we aren’t either. Marvin Minsky A wealth of information creates a poverty of attention. Herbert Simon This post, Deep Learning from first Principles in Python, R and Octave-Part8, is my final post in my Deep Learning from first principles series. In this post, I discuss and implement a key functionality needed while building Deep Learning networks viz. ‘Gradient Checking’. Gradient Checking is an important method to check the correctness of your implementation, specifically the forward propagation and the backward propagation cycles of an implementation. In addition I also discuss some tips for tuning hyper-parameters of a Deep Learning network based on my experience. My post in this ‘Deep Learning Series’ so far were 1. Deep Learning from first principles in Python, R and Octave – Part 1 In part 1, I implement logistic regression as a neural network in vectorized Python, R and Octave 2. Deep Learning from first principles in Python, R and Octave – Part 2 In the second part I implement a simple Neural network with just 1 hidden layer and a sigmoid activation output function 3. Deep Learning from first principles in Python, R and Octave – Part 3 The 3rd part implemented a multi-layer Deep Learning Network with sigmoid activation output in vectorized Python, R and Octave 4. Deep Learning from first principles in Python, R and Octave – Part 4 The 4th part deals with multi-class classification. Specifically, I derive the Jacobian of the Softmax function and enhance my L-Layer DL network to include Softmax output function in addition to Sigmoid activation 5. Deep Learning from first principles in Python, R and Octave – Part 5 This post uses the Softmax classifier implemented to classify MNIST digits using a L-layer Deep Learning network 6. Deep Learning from first principles in Python, R and Octave – Part 6 The 6th part adds more bells and whistles to my L-Layer DL network, by including different initialization types namely He and Xavier. Besides L2 Regularization and random dropout is added. 7. Deep Learning from first principles in Python, R and Octave – Part 7 The 7th part deals with Stochastic Gradient Descent Optimization methods including momentum, RMSProp and Adam 8. Deep Learning from first principles in Python, R and Octave – Part 8 – This post implements a critical function for ensuring the correctness of a L-Layer Deep Learning network implementation using Gradient Checking Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449). You may also like my companion book “Practical Machine Learning with R and Python- Machine Learning in stereo” available in Amazon in paperback($9.99) and Kindle($6.99) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning. Gradient Checking is based on the following approach. One iteration of Gradient Descent computes and updates the parameters $\theta$ by doing $\theta := \theta - \frac{d}{d\theta}J(\theta)$. To minimize the cost we will need to minimize $J(\theta)$ Let $g(\theta)$ be a function that computes the derivative $\frac {d}{d\theta}J(\theta)$. Gradient Checking allows us to numerically evaluate the implementation of the function $g(\theta)$ and verify its correctness. We know the derivative of a function is given by $\frac {d}{d\theta}J(\theta) = lim->0 \frac {J(\theta +\epsilon) - J(\theta -\epsilon)} {2*\epsilon}$ Note: The above derivative is based on the 2 sided derivative. The 1-sided derivative is given by $\frac {d}{d\theta}J(\theta) = lim->0 \frac {J(\theta +\epsilon) - J(\theta)} {\epsilon}$ Gradient Checking is based on the 2-sided derivative because the error is of the order $O(\epsilon^{2})$ as opposed $O(\epsilon)$ for the 1-sided derivative. Hence Gradient Check uses the 2 sided derivative as follows. $g(\theta) = lim->0 \frac {J(\theta +\epsilon) - J(\theta -\epsilon)} {2*\epsilon}$ In Gradient Check the following is done A) Run one normal cycle of your implementation by doing the following a) Compute the output activation by running 1 cycle of forward propagation b) Compute the cost using the output activation c) Compute the gradients using backpropation (grad) B) Perform gradient check steps as below a) Set $\theta$ . Flatten all ‘weights’ and ‘bias’ matrices and vectors to a column vector. b) Initialize $\theta+$ by bumping up $\theta$ by adding $\epsilon$ ($\theta + \epsilon$) c) Perform forward propagation with $\theta+$ d) Compute cost with $\theta+$ i.e. $J(\theta+)$ e) Initialize $\theta-$ by bumping down $\theta$ by subtracting $\epsilon$ $(\theta - \epsilon)$ f) Perform forward propagation with $\theta-$ g) Compute cost with $\theta-$ i.e. $J(\theta-)$ h) Compute $\frac {d} {d\theta} J(\theta)$ or ‘gradapprox’ as$\frac {J(\theta+) - J(\theta-) } {2\epsilon}$using the 2 sided derivative. i) Compute L2norm or the Euclidean distance between ‘grad’ and ‘gradapprox’. If the diference is of the order of $10^{-5}$ or $10^{-7}$ the implementation is correct. In the Deep Learning Specialization Prof Andrew Ng mentions that if the difference is of the order of $10^{-7}$ then the implementation is correct. A difference of $10^{-5}$ is also ok. Anything more than that is a cause of worry and you should look at your code more closely. To see more details click Gradient checking and advanced optimization You can clone/download the code from Github at DeepLearning-Part8 After spending a better part of 3 days, I now realize how critical Gradient Check is for ensuring the correctness of you implementation. Initially I was getting very high difference and did not know how to understand the results or debug my implementation. After many hours of staring at the results, I was able to finally arrive at a way, to localize issues in the implementation. In fact, I did catch a small bug in my Python code, which did not exist in the R and Octave implementations. I will demonstrate this below ## 1.1a Gradient Check – Sigmoid Activation – Python import numpy as np import matplotlib exec(open("DLfunctions8.py").read()) exec(open("testcases.py").read()) #Load the data train_X, train_Y, test_X, test_Y = load_dataset() #Set layer dimensions layersDimensions = [2,4,1] parameters = initializeDeepModel(layersDimensions) #Perform forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="sigmoid") print("cost=",cost) #Perform backprop and get gradients gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid") epsilon = 1e-7 outputActivationFunc="sigmoid" # Set-up variables # Flatten parameters to a vector parameters_values, _ = dictionary_to_vector(parameters) #Flatten gradients to a vector grad = gradients_to_vector(parameters,gradients) num_parameters = parameters_values.shape[0] #Initialize J_plus = np.zeros((num_parameters, 1)) J_minus = np.zeros((num_parameters, 1)) gradapprox = np.zeros((num_parameters, 1)) # Compute gradapprox using 2 sided derivative for i in range(num_parameters): # Compute J_plus[i]. thetaplus = np.copy(parameters_values) thetaplus[i][0] = thetaplus[i][0] + epsilon AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaplus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) J_plus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc) # Compute J_minus[i]. thetaminus = np.copy(parameters_values) thetaminus[i][0] = thetaminus[i][0] - epsilon AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaminus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) J_minus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc) # Compute gradapprox[i] gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon) # Compare gradapprox to backward propagation gradients by computing difference. numerator = np.linalg.norm(grad-gradapprox) denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox) difference = numerator/denominator #Check the difference if difference > 1e-5: print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m") else: print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m") print(difference) print("\n") # The technique below can be used to identify # which of the parameters are in error # Covert grad to dictionary m=vector_to_dictionary2(parameters,grad) print("Gradients from backprop") print(m) print("\n") # Convert gradapprox to dictionary n=vector_to_dictionary2(parameters,gradapprox) print("Gradapprox from gradient check") print(n)  ## (300, 2) ## (300,) ## cost= 0.6931455556341791 ## [92mYour backward propagation works perfectly fine! difference = 1.1604150683743381e-06[0m ## 1.1604150683743381e-06 ## ## ## Gradients from backprop ## {'dW1': array([[-6.19439955e-06, -2.06438046e-06], ## [-1.50165447e-05, 7.50401672e-05], ## [ 1.33435433e-04, 1.74112143e-04], ## [-3.40909024e-05, -1.38363681e-04]]), 'db1': array([[ 7.31333221e-07], ## [ 7.98425950e-06], ## [ 8.15002817e-08], ## [-5.69821155e-08]]), 'dW2': array([[2.73416304e-04, 2.96061451e-04, 7.51837363e-05, 1.01257729e-04]]), 'db2': array([[-7.22232235e-06]])} ## ## ## Gradapprox from gradient check ## {'dW1': array([[-6.19448937e-06, -2.06501483e-06], ## [-1.50168766e-05, 7.50399742e-05], ## [ 1.33435485e-04, 1.74112391e-04], ## [-3.40910633e-05, -1.38363765e-04]]), 'db1': array([[ 7.31081862e-07], ## [ 7.98472399e-06], ## [ 8.16013923e-08], ## [-5.71764858e-08]]), 'dW2': array([[2.73416290e-04, 2.96061509e-04, 7.51831930e-05, 1.01257891e-04]]), 'db2': array([[-7.22255589e-06]])} ## 1.1b Gradient Check – Softmax Activation – Python (Error!!) In the code below I show, how I managed to spot a bug in your implementation import numpy as np exec(open("DLfunctions8.py").read()) N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data #plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) layersDimensions = [2,3,3] y1=y.reshape(-1,1).T train_X=X.T train_Y=y1 parameters = initializeDeepModel(layersDimensions) #Compute forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="softmax") print("cost=",cost) #Compute gradients from backprop gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") # Note the transpose of the gradients for Softmax has to be taken L= len(parameters)//2 print(L) gradients['dW'+str(L)]=gradients['dW'+str(L)].T gradients['db'+str(L)]=gradients['db'+str(L)].T # Perform gradient check gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax")  cost= 1.0986187818144022 2 There is a mistake in the backward propagation! difference = 0.7100295155692544 0.7100295155692544 Gradients from backprop {'dW1': array([[ 0.00050125, 0.00045194], [ 0.00096392, 0.00039641], [-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082], [-0.00224399], [ 0.00052305]]), 'dW2': array([[-8.40953794e-05, -9.52657769e-04, -1.10269379e-04], [-7.45469382e-04, 9.49795606e-04, 2.29045434e-04], [ 8.29564761e-04, 2.86216305e-06, -1.18776055e-04]]), 'db2': array([[-0.00253808], [-0.00505508], [ 0.00759315]])} Gradapprox from gradient check {'dW1': array([[ 0.00050125, 0.00045194], [ 0.00096392, 0.00039641], [-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082], [-0.00224399], [ 0.00052305]]), 'dW2': array([[-8.40960634e-05, -9.52657953e-04, -1.10268461e-04], [-7.45469242e-04, 9.49796908e-04, 2.29045671e-04], [ 8.29565305e-04, 2.86104473e-06, -1.18776100e-04]]), 'db2': array([[-8.46211989e-06], [-1.68487446e-05], [ 2.53108645e-05]])} Gradient Check gives a high value of the difference of 0.7100295. Inspecting the Gradients and Gradapprox we can see there is a very big discrepancy in db2. After I went over my code I discovered that I my computation in the function layerActivationBackward for Softmax was  # Erroneous code if activationFunc == 'softmax': dW = 1/numtraining * np.dot(A_prev,dZ) db = np.sum(dZ, axis=0, keepdims=True) dA_prev = np.dot(dZ,W) instead of # Fixed code if activationFunc == 'softmax': dW = 1/numtraining * np.dot(A_prev,dZ) db = 1/numtraining * np.sum(dZ, axis=0, keepdims=True) dA_prev = np.dot(dZ,W)  After fixing this error when I ran Gradient Check I get ## 1.1c Gradient Check – Softmax Activation – Python (Corrected!!) import numpy as np exec(open("DLfunctions8.py").read()) N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data #plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) layersDimensions = [2,3,3] y1=y.reshape(-1,1).T train_X=X.T train_Y=y1 #Set layer dimensions parameters = initializeDeepModel(layersDimensions) #Perform forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="softmax") print("cost=",cost) #Compute gradients from backprop gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") # Note the transpose of the gradients for Softmax has to be taken L= len(parameters)//2 print(L) gradients['dW'+str(L)]=gradients['dW'+str(L)].T gradients['db'+str(L)]=gradients['db'+str(L)].T #Perform gradient check gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax") ## cost= 1.0986193170234435 ## 2 ## [92mYour backward propagation works perfectly fine! difference = 5.268804859613151e-07[0m ## 5.268804859613151e-07 ## ## ## Gradients from backprop ## {'dW1': array([[ 0.00053206, 0.00038987], ## [ 0.00093941, 0.00038077], ## [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662], ## [-0.00210198], ## [ 0.00046741]]), 'dW2': array([[-7.83441270e-05, -9.70179498e-04, -1.08715815e-04], ## [-7.70175008e-04, 9.54478237e-04, 2.27690198e-04], ## [ 8.48519135e-04, 1.57012608e-05, -1.18974383e-04]]), 'db2': array([[-8.52190476e-06], ## [-1.69954294e-05], ## [ 2.55173342e-05]])} ## ## ## Gradapprox from gradient check ## {'dW1': array([[ 0.00053206, 0.00038987], ## [ 0.00093941, 0.00038077], ## [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662], ## [-0.00210198], ## [ 0.00046741]]), 'dW2': array([[-7.83439980e-05, -9.70180603e-04, -1.08716369e-04], ## [-7.70173925e-04, 9.54478718e-04, 2.27690089e-04], ## [ 8.48520143e-04, 1.57018842e-05, -1.18973720e-04]]), 'db2': array([[-8.52096171e-06], ## [-1.69964043e-05], ## [ 2.55162558e-05]])} ## 1.2a Gradient Check – Sigmoid Activation – R source("DLfunctions8.R") z <- as.matrix(read.csv("circles.csv",header=FALSE)) x <- z[,1:2] y <- z[,3] X <- t(x) Y <- t(y) #Set layer dimensions layersDimensions = c(2,5,1) parameters = initializeDeepModel(layersDimensions) #Perform forward prop retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid") AL <- retvals[['AL']] caches <- retvals[['caches']] dropoutMat <- retvals[['dropoutMat']] #Compute cost cost <- computeCost(AL, Y,outputActivationFunc="sigmoid", numClasses=layersDimensions[length(layersDimensions)]) print(cost) ## [1] 0.6931447 # Backward propagation. gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid",numClasses=layersDimensions[length(layersDimensions)]) epsilon = 1e-07 outputActivationFunc="sigmoid" #Convert parameter list to vector parameters_values = list_to_vector(parameters) #Convert gradient list to vector grad = gradients_to_vector(parameters,gradients) num_parameters = dim(parameters_values)[1] #Initialize J_plus = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) J_minus = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) gradapprox = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) # Compute gradapprox for(i in 1:num_parameters){ # Compute J_plus[i]. thetaplus = parameters_values thetaplus[i][1] = thetaplus[i][1] + epsilon retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaplus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) AL <- retvals[['AL']] J_plus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc) # Compute J_minus[i]. thetaminus = parameters_values thetaminus[i][1] = thetaminus[i][1] - epsilon retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaminus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) AL <- retvals[['AL']] J_minus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc) # Compute gradapprox[i] gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon) } # Compare gradapprox to backward propagation gradients by computing difference. #Compute L2Norm numerator = L2NormVec(grad-gradapprox) denominator = L2NormVec(grad) + L2NormVec(gradapprox) difference = numerator/denominator if(difference > 1e-5){ cat("There is a mistake, the difference is too high",difference) } else{ cat("The implementations works perfectly", difference) } ## The implementations works perfectly 1.279911e-06 # This can be used to check print("Gradients from backprop") ## [1] "Gradients from backprop" vector_to_list2(parameters,grad) ##$dW1
##               [,1]          [,2]
## [1,] -7.641588e-05 -3.427989e-07
## [2,] -9.049683e-06  6.906304e-05
## [3,]  3.401039e-06 -1.503914e-04
## [4,]  1.535226e-04 -1.686402e-04
## [5,] -6.029292e-05 -2.715648e-04
##
## $db1 ## [,1] ## [1,] 6.930318e-06 ## [2,] -3.283117e-05 ## [3,] 1.310647e-05 ## [4,] -3.454308e-05 ## [5,] -2.331729e-08 ## ##$dW2
##              [,1]         [,2]         [,3]        [,4]         [,5]
## [1,] 0.0001612356 0.0001113475 0.0002435824 0.000362149 2.874116e-05
##
## $db2 ## [,1] ## [1,] -1.16364e-05 print("Grad approx from gradient check") ## [1] "Grad approx from gradient check" vector_to_list2(parameters,gradapprox) ##$dW1
##               [,1]          [,2]
## [1,] -7.641554e-05 -3.430589e-07
## [2,] -9.049428e-06  6.906253e-05
## [3,]  3.401168e-06 -1.503919e-04
## [4,]  1.535228e-04 -1.686401e-04
## [5,] -6.029288e-05 -2.715650e-04
##
## $db1 ## [,1] ## [1,] 6.930012e-06 ## [2,] -3.283096e-05 ## [3,] 1.310618e-05 ## [4,] -3.454237e-05 ## [5,] -2.275957e-08 ## ##$dW2
##              [,1]         [,2]         [,3]         [,4]        [,5]
## [1,] 0.0001612355 0.0001113476 0.0002435829 0.0003621486 2.87409e-05
##
## $db2 ## [,1] ## [1,] -1.16368e-05 ## 1.2b Gradient Check – Softmax Activation – R source("DLfunctions8.R") Z <- as.matrix(read.csv("spiral.csv",header=FALSE)) # Setup the data X <- Z[,1:2] y <- Z[,3] X <- t(X) Y <- t(y) layersDimensions = c(2, 3, 3) parameters = initializeDeepModel(layersDimensions) #Perform forward prop retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax") AL <- retvals[['AL']] caches <- retvals[['caches']] dropoutMat <- retvals[['dropoutMat']] #Compute cost cost <- computeCost(AL, Y,outputActivationFunc="softmax", numClasses=layersDimensions[length(layersDimensions)]) print(cost) ## [1] 1.098618 # Backward propagation. gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax",numClasses=layersDimensions[length(layersDimensions)]) # Need to take transpose of the last layer for Softmax L=length(parameters)/2 gradients[[paste('dW',L,sep="")]]=t(gradients[[paste('dW',L,sep="")]]) gradients[[paste('db',L,sep="")]]=t(gradients[[paste('db',L,sep="")]]) #Perform gradient check gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7,outputActivationFunc="softmax") ## The implementations works perfectly 3.903011e-07[1] "Gradients from backprop" ##$dW1
##              [,1]          [,2]
## [1,] 0.0007962367 -0.0001907606
## [2,] 0.0004444254  0.0010354412
## [3,] 0.0003078611  0.0007591255
##
## $db1 ## [,1] ## [1,] -0.0017305136 ## [2,] 0.0005393734 ## [3,] 0.0012484550 ## ##$dW2
##               [,1]          [,2]          [,3]
## [1,] -3.515627e-04  7.487283e-04 -3.971656e-04
## [2,] -6.381521e-05 -1.257328e-06  6.507254e-05
## [3,] -1.719479e-04 -4.857264e-04  6.576743e-04
##
## $db2 ## [,1] ## [1,] -5.536383e-06 ## [2,] -1.824656e-05 ## [3,] 2.378295e-05 ## ## [1] "Grad approx from gradient check" ##$dW1
##              [,1]          [,2]
## [1,] 0.0007962364 -0.0001907607
## [2,] 0.0004444256  0.0010354406
## [3,] 0.0003078615  0.0007591250
##
## $db1 ## [,1] ## [1,] -0.0017305135 ## [2,] 0.0005393741 ## [3,] 0.0012484547 ## ##$dW2
##               [,1]          [,2]          [,3]
## [1,] -3.515632e-04  7.487277e-04 -3.971656e-04
## [2,] -6.381451e-05 -1.257883e-06  6.507239e-05
## [3,] -1.719469e-04 -4.857270e-04  6.576739e-04
##
## $db2 ## [,1] ## [1,] -5.536682e-06 ## [2,] -1.824652e-05 ## [3,] 2.378209e-05 ## 1.3a Gradient Check – Sigmoid Activation – Octave source("DL8functions.m") ################## Circles data=csvread("circles.csv"); X=data(:,1:2); Y=data(:,3); #Set layer dimensions layersDimensions = [2 5 1]; #tanh=-0.5(ok), #relu=0.1 best! [weights biases] = initializeDeepModel(layersDimensions); #Perform forward prop [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid"); #Compute cost cost = computeCost(AL, Y',outputActivationFunc=outputActivationFunc,numClasses=layersDimensions(size(layersDimensions)(2))); disp(cost); #Compute gradients from cost [gradsDA gradsDW gradsDB] = backwardPropagationDeep(AL, Y', activation_caches,forward_caches, droputMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid", numClasses=layersDimensions(size(layersDimensions)(2))); epsilon = 1e-07; outputActivationFunc="sigmoid"; # Convert paramter cell array to vector parameters_values = cellArray_to_vector(weights, biases); #Convert gradient cell array to vector grad = gradients_to_vector(gradsDW,gradsDB); num_parameters = size(parameters_values)(1); #Initialize J_plus = zeros(num_parameters, 1); J_minus = zeros(num_parameters, 1); gradapprox = zeros(num_parameters, 1); # Compute gradapprox for i = 1:num_parameters # Compute J_plus[i]. thetaplus = parameters_values; thetaplus(i,1) = thetaplus(i,1) + epsilon; [weights1 biases1] =vector_to_cellArray(weights, biases,thetaplus); [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights1, biases1, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc); J_plus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc); # Compute J_minus[i]. thetaminus = parameters_values; thetaminus(i,1) = thetaminus(i,1) - epsilon ; [weights1 biases1] = vector_to_cellArray(weights, biases,thetaminus); [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X',weights1, biases1, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc); J_minus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc); # Compute gradapprox[i] gradapprox(i) = (J_plus(i) - J_minus(i))/(2*epsilon); endfor #Compute L2Norm numerator = L2NormVec(grad-gradapprox); denominator = L2NormVec(grad) + L2NormVec(gradapprox); difference = numerator/denominator; disp(difference); #Check difference if difference > 1e-04 printf("There is a mistake in the implementation "); disp(difference); else printf("The implementation works perfectly"); disp(difference); endif [weights1 biases1] = vector_to_cellArray(weights, biases,grad); printf("Gradients from back propagation"); disp(weights1); disp(biases1); [weights2 biases2] = vector_to_cellArray(weights, biases,gradapprox); printf("Gradients from gradient check"); disp(weights2); disp(biases2);  0.69315 1.4893e-005 The implementation works perfectly 1.4893e-005 Gradients from back propagation { [1,1] = 5.0349e-005 2.1323e-005 8.8632e-007 1.8231e-006 9.3784e-005 1.0057e-004 1.0875e-004 -1.9529e-007 5.4502e-005 3.2721e-005 [1,2] = 1.0567e-005 6.0615e-005 4.6004e-005 1.3977e-004 1.0405e-004 } { [1,1] = -1.8716e-005 1.1309e-009 4.7686e-005 1.2051e-005 -1.4612e-005 [1,2] = 9.5808e-006 } Gradients from gradient check { [1,1] = 5.0348e-005 2.1320e-005 8.8485e-007 1.8219e-006 9.3784e-005 1.0057e-004 1.0875e-004 -1.9762e-007 5.4502e-005 3.2723e-005 [1,2] = [1,2] = 1.0565e-005 6.0614e-005 4.6007e-005 1.3977e-004 1.0405e-004 } { [1,1] = -1.8713e-005 1.1102e-009 4.7687e-005 1.2048e-005 -1.4609e-005 [1,2] = 9.5790e-006 }  ## 1.3b Gradient Check – Softmax Activation – Octave source("DL8functions.m") data=csvread("spiral.csv"); # Setup the data X=data(:,1:2); Y=data(:,3); # Set the layer dimensions layersDimensions = [2 3 3]; [weights biases] = initializeDeepModel(layersDimensions); # Run forward prop [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax"); # Compute cost cost = computeCost(AL, Y',outputActivationFunc=outputActivationFunc,numClasses=layersDimensions(size(layersDimensions)(2))); disp(cost); # Perform backward prop [gradsDA gradsDW gradsDB] = backwardPropagationDeep(AL, Y', activation_caches,forward_caches, droputMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax", numClasses=layersDimensions(size(layersDimensions)(2))); #Take transpose of last layer for Softmax L=size(weights)(2); gradsDW{L}= gradsDW{L}'; gradsDB{L}= gradsDB{L}'; #Perform gradient check difference= gradient_check_n(weights, biases, gradsDW,gradsDB, X, Y, epsilon = 1e-7, outputActivationFunc="softmax",numClasses=layersDimensions(size(layersDimensions)(2)));   1.0986 The implementation works perfectly 2.0021e-005 Gradients from back propagation { [1,1] = -7.1590e-005 4.1375e-005 -1.9494e-004 -5.2014e-005 -1.4554e-004 5.1699e-005 [1,2] = 3.3129e-004 1.9806e-004 -1.5662e-005 -4.9692e-004 -3.7756e-004 -8.2318e-005 1.6562e-004 1.7950e-004 9.7980e-005 } { [1,1] = -3.0856e-005 -3.3321e-004 -3.8197e-004 [1,2] = 1.2046e-006 2.9259e-007 -1.4972e-006 } Gradients from gradient check { [1,1] = -7.1586e-005 4.1377e-005 -1.9494e-004 -5.2013e-005 -1.4554e-004 5.1695e-005 3.3129e-004 1.9806e-004 -1.5664e-005 -4.9692e-004 -3.7756e-004 -8.2316e-005 1.6562e-004 1.7950e-004 9.7979e-005 } { [1,1] = -3.0852e-005 -3.3321e-004 -3.8197e-004 [1,2] = 1.1902e-006 2.8200e-007 -1.4644e-006 }  ## 2.1 Tip for tuning hyperparameters Deep Learning Networks come with a large number of hyper parameters which require tuning. The hyper parameters are 1. $\alpha$ -learning rate 2. Number of layers 3. Number of hidden units 4. Number of iterations 5. Momentum – $\beta$ – 0.9 6. RMSProp – $\beta_{1}$ – 0.9 7. Adam – $\beta_{1}$,$\beta_{2}$ and $\epsilon$ 8. learning rate decay 9. mini batch size 10. Initialization method – He, Xavier 11. Regularization – Among the above the most critical is learning rate $\alpha$ . Rather than just trying out random values, it may help to try out values on a logarithmic scale. So we could try out values -0.01,0.1,1.0,10 etc. If we find that the cost is between 0.01 and 0.1 we could use a technique similar to binary search or bisection, so we can try 0.01, 0.05. If we need to be bigger than 0.01 and 0.05 we could try 0.25 and then keep halving the distance etc. – The performance of Momentum and RMSProp are very good and work well with values 0.9. Even with this, it is better to try out values of 1-$\beta$ in the logarithmic range. So 1-$\beta$ could 0.001,0.01,0.1 and hence $\beta$ would be 0.999,0.99 or 0.9 – Increasing the number of hidden units or number of hidden layers need to be done gradually. I have noticed that increasing number of hidden layers heavily does not improve performance and sometimes degrades it. – Sometimes, I tend to increase the number of iterations if I think I see a steady decrease in the cost for a certain learning rate – It may also help to add learning rate decay if you see there is an oscillation while it decreases. – Xavier and He initializations also help in a fast convergence and are worth trying out. ## 3.1 Final thoughts As I come to a close in this Deep Learning Series from first principles in Python, R and Octave, I must admit that I learnt a lot in the process. * Building a L-layer, vectorized Deep Learning Network in Python, R and Octave was extremely challenging but very rewarding * One benefit of building vectorized versions in Python, R and Octave was that I was looking at each function that I was implementing thrice, and hence I was able to fix any bugs in any of the languages * In addition since I built the generic L-Layer DL network with all the bells and whistles, layer by layer I further had an opportunity to look at all the functions in each successive post. * Each language has its advantages and disadvantages. From the performance perspective I think Python is the best, followed by Octave and then R * Interesting, I noticed that even if small bugs creep into your implementation, the DL network does learn and does generate a valid set of weights and biases, however this may not be an optimum solution. In one case of an inadvertent bug, I was not updating the weights in the final layer of the DL network. Yet, using all the other layers, the DL network was able to come with a reasonable solution (maybe like random dropout, remaining units can still learn the data!) * Having said that, the Gradient Check method discussed and implemented in this post can be very useful in ironing out bugs. Feel free to clone/download the code from Github at DeepLearning-Part8 ## Conclusion These last couple of months when I was writing the posts and the also churning up the code in Python, R and Octave were very hectic. There have been times when I found that implementations of some function to be extremely demanding and I almost felt like giving up. Other times, I have spent quite some time on an intractable DL network which would not respond to changes in hyper-parameters. All in all, it was a great learning experience. I would suggest that you start from my first post Deep Learning from first principles in Python, R and Octave-Part 1 and work your way up. Feel free to take the code apart and try out things. That is the only way you will learn. Hope you had as much fun as I had. Stay tuned. I will be back!!! To see all post click Index of Posts # Deep Learning from first principles in Python, R and Octave – Part 7 Artificial Intelligence is the new electricity. – Prof Andrew Ng Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake. We need to solve the unsupervised learning problem before we can even think of getting to true AI. – Yann LeCun, March 14, 2016 (Facebook) # Introduction In this post ‘Deep Learning from first principles with Python, R and Octave-Part 7’, I implement optimization methods used in Stochastic Gradient Descent (SGD) to speed up the convergence. Specifically I discuss and implement the following gradient descent optimization techniques a.Vanilla Stochastic Gradient Descent b.Learning rate decay c. Momentum method d. RMSProp e. Adaptive Moment Estimation (Adam) This post, further enhances my generic L-Layer Deep Learning Network implementations in vectorized Python, R and Octave to also include the Stochastic Gradient Descent optimization techniques. You can clone/download the code from Github at DeepLearning-Part7 You can view my video presentation on Gradient Descent Optimization in Neural Networks 7 Incidentally, a good discussion of the various optimizations methods used in Stochastic Gradient Optimization techniques can be seen at Sebastian Ruder’s blog Note: In the vectorized Python, R and Octave implementations below only a 1024 random training samples were used. This was to reduce the computation time. You are free to use the entire data set (60000 training data) for the computation. This post is largely based of on Prof Andrew Ng’s Deep Learning Specialization. All the above optimization techniques for Stochastic Gradient Descent are based on the technique of exponentially weighted average method. So for example if we had some time series data $\theta_{1},\theta_{2},\theta_{3}... \theta_{t}$ then we we can represent the exponentially average value at time ‘t’ as a sequence of the the previous value $v_{t-1}$ and $\theta_{t}$ as shown below $v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}$ Here $v_{t}$ represent the average of the data set over $\frac {1}{1-\beta}$ By choosing different values of $\beta$, we can average over a larger or smaller number of the data points. We can write the equations as follows $v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}$ $v_{t-1} = \beta v_{t-2} + (1-\beta)\theta_{t-1}$ $v_{t-2} = \beta v_{t-3} + (1-\beta)\theta_{t-2}$ and $v_{t-k} = \beta v_{t-(k+1))} + (1-\beta)\theta_{t-k}$ By substitution we have $v_{t} = (1-\beta)\theta_{t} + \beta v_{t-1}$ $v_{t} = (1-\beta)\theta_{t} + \beta ((1-\beta)\theta_{t-1}) + \beta v_{t-2}$ $v_{t} = (1-\beta)\theta_{t} + \beta ((1-\beta)\theta_{t-1}) + \beta ((1-\beta)\theta_{t-2}+ \beta v_{t-3} )$ Hence it can be seen that the $v_{t}$ is the weighted sum over the previous values $\theta_{k}$, which is an exponentially decaying function. Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449). You may also like my companion book “Practical Machine Learning with R and Python- Machine Learning in stereo” available in Amazon in paperback($9.99) and Kindle($6.99) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning. ## 1.1a. Stochastic Gradient Descent (Vanilla) – Python import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd import sklearn import sklearn.datasets exec(open("DLfunctions7.py").read()) exec(open("load_mnist.py").read()) # Read the training data training=list(read(dataset='training',path=".\\mnist")) test=list(read(dataset='testing',path=".\\mnist")) lbls=[] pxls=[] for i in range(60000): l,p=training[i] lbls.append(l) pxls.append(p) labels= np.array(lbls) pixels=np.array(pxls) y=labels.reshape(-1,1) X=pixels.reshape(pixels.shape[0],-1) X1=X.T Y1=y.T # Create a list of 1024 random numbers. permutation = list(np.random.permutation(2**10)) # Subset 16384 from the data X2 = X1[:, permutation] Y2 = Y1[:, permutation].reshape((1,2**10)) # Set the layer dimensions layersDimensions=[784, 15,9,10] # Perform SGD with regular gradient descent parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax",learningRate = 0.01 , optimizer="gd", mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig1.png")  ## 1.1b. Stochastic Gradient Descent (Vanilla) – R source("mnist.R") source("DLfunctions7.R") #Load and read MNIST data load_mnist() x <- t(train$x)
X <- x[,1:60000]
y <-train$y y1 <- y[1:60000] y2 <- as.matrix(y1) Y=t(y2) # Subset 1024 random samples from MNIST permutation = c(sample(2^10)) # Randomly shuffle the training data X1 = X[, permutation] y1 = Y[1, permutation] y2 <- as.matrix(y1) Y1=t(y2) # Set layer dimensions layersDimensions=c(784, 15,9, 10) # Perform SGD with regular gradient descent retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='tanh', outputActivationFunc="softmax", learningRate = 0.05, optimizer="gd", mini_batch_size = 512, num_epochs = 5000, print_cost = True) #Plot the cost vs iterations iterations <- seq(0,5000,1000) costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs no of epochs") + xlab("No of epochss") + ylab("Cost")

## 1.1c. Stochastic Gradient Descent (Vanilla) – Octave

source("DL7functions.m")
#Create a random permutatation from 1024
permutation = randperm(1024);
disp(length(permutation));

# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);

# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with regular gradient descent
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.005,
lrDecay=true,
decayRate=1,
lambd=0,
keep_prob=1,
optimizer="gd",
beta=0.9,
beta1=0.9,
beta2=0.999,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000);

plotCostVsEpochs(5000,costs);


## 2.1. Stochastic Gradient Descent with Learning rate decay

Since in Stochastic Gradient Descent,with  each epoch, we use slight different samples, the gradient descent algorithm, oscillates across the ravines and wanders around the minima, when a fixed learning rate is used. In this technique of ‘learning rate decay’ the learning rate is slowly decreased with the number of epochs and becomes smaller and smaller, so that gradient descent can take smaller steps towards the minima.

There are several techniques employed in learning rate decay

a) Exponential decay: $\alpha = decayRate^{epochNum} *\alpha_{0}$
b) 1/t decay : $\alpha = \frac{\alpha_{0}}{1 + decayRate*epochNum}$
c) $\alpha = \frac {decayRate}{\sqrt(epochNum)}*\alpha_{0}$

In my implementation I have used the ‘exponential decay’. The code snippet for Python is shown below

if lrDecay == True:
learningRate = np.power(decayRate,(num_epochs/1000)) * learningRate


## 2.1a. Stochastic Gradient Descent with Learning rate decay – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets

lbls=[]
pxls=[]
for i in range(60000):
l,p=training[i]
lbls.append(l)
pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T

# Create  a list of random numbers of 1024
permutation = list(np.random.permutation(2**10))
# Subset 16384 from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
# Set layer dimensions
layersDimensions=[784, 15,9,10]
# Perform SGD with learning rate decay
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.01 , lrDecay=True, decayRate=0.9999,
optimizer="gd",
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig2.png")

## 2.1b. Stochastic Gradient Descent with Learning rate decay – R

source("mnist.R")
source("DLfunctions7.R")
x <- t(train$x) X <- x[,1:60000] y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)

# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
# Set layer dimensions
layersDimensions=c(784, 15,9, 10)
# Perform SGD with Learning rate decay
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.05,
lrDecay=TRUE,
decayRate=0.9999,
optimizer="gd",
mini_batch_size = 512,
num_epochs = 5000,
print_cost = True)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs df=data.frame(iterations,costs) ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost") ## 2.1c. Stochastic Gradient Descent with Learning rate decay – Octave source("DL7functions.m") #Load and read MNIST load('./mnist/mnist.txt.gz'); #Create a random permutatation from 1024 permutation = randperm(1024); disp(length(permutation)); # Use this 1024 as the batch X=trainX(permutation,:); Y=trainY(permutation,:); # Set layer dimensions layersDimensions=[784, 15, 9, 10]; # Perform SGD with regular Learning rate decay [weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.01, lrDecay=true, decayRate=0.999, lambd=0, keep_prob=1, optimizer="gd", beta=0.9, beta1=0.9, beta2=0.999, epsilon=10^-8, mini_batch_size = 512, num_epochs = 5000); plotCostVsEpochs(5000,costs)  ## 3.1. Stochastic Gradient Descent with Momentum Stochastic Gradient Descent with Momentum uses the exponentially weighted average method discusses above and more generally moves faster into the ravine than across it. The equations are $v_{dW}^l = \beta v_{dW}^l + (1-\beta)dW^{l}$ $v_{db}^l = \beta v_{db}^l + (1-\beta)db^{l}$ $W^{l} = W^{l} - \alpha v_{dW}^l$ $b^{l} = b^{l} - \alpha v_{db}^l$ where $v_{dW}$ and $v_{db}$ are the momentum terms which are exponentially weighted with the corresponding gradients ‘dW’ and ‘db’ at the corresponding layer ‘l’ The code snippet for Stochastic Gradient Descent with momentum in R is shown below # Perform Gradient Descent with momentum # Input : Weights and biases # : beta # : gradients # : learning rate # : outputActivationFunc - Activation function at hidden layer sigmoid/softmax #output : Updated weights after 1 iteration gradientDescentWithMomentum <- function(parameters, gradients,v, beta, learningRate,outputActivationFunc="sigmoid"){ L = length(parameters)/2 # number of layers in the neural network # Update rule for each parameter. Use a for loop. for(l in 1:(L-1)){ # Compute velocities # v['dWk'] = beta *v['dWk'] + (1-beta)*dWk v[[paste("dW",l, sep="")]] = beta*v[[paste("dW",l, sep="")]] + (1-beta) * gradients[[paste('dW',l,sep="")]] v[[paste("db",l, sep="")]] = beta*v[[paste("db",l, sep="")]] + (1-beta) * gradients[[paste('db',l,sep="")]] parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] - learningRate* v[[paste("dW",l, sep="")]] parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] - learningRate* v[[paste("db",l, sep="")]] } # Compute for the Lth layer if(outputActivationFunc=="sigmoid"){ v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] + (1-beta) * gradients[[paste('dW',L,sep="")]] v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] + (1-beta) * gradients[[paste('db',L,sep="")]] parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] - learningRate* v[[paste("dW",l, sep="")]] parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] - learningRate* v[[paste("db",l, sep="")]] }else if (outputActivationFunc=="softmax"){ v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] + (1-beta) * t(gradients[[paste('dW',L,sep="")]]) v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] + (1-beta) * t(gradients[[paste('db',L,sep="")]]) parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] - learningRate* t(gradients[[paste("dW",L,sep="")]]) parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] - learningRate* t(gradients[[paste("db",L,sep="")]]) } return(parameters) } ## 3.1a. Stochastic Gradient Descent with Momentum- Python import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd import sklearn import sklearn.datasets # Read and load data exec(open("DLfunctions7.py").read()) exec(open("load_mnist.py").read()) training=list(read(dataset='training',path=".\\mnist")) test=list(read(dataset='testing',path=".\\mnist")) lbls=[] pxls=[] for i in range(60000): l,p=training[i] lbls.append(l) pxls.append(p) labels= np.array(lbls) pixels=np.array(pxls) y=labels.reshape(-1,1) X=pixels.reshape(pixels.shape[0],-1) X1=X.T Y1=y.T # Create a list of random numbers of 1024 permutation = list(np.random.permutation(2**10)) # Subset 16384 from the data X2 = X1[:, permutation] Y2 = Y1[:, permutation].reshape((1,2**10)) layersDimensions=[784, 15,9,10] # Perform SGD with momentum parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax",learningRate = 0.01 , optimizer="momentum", beta=0.9, mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig3.png") ## 3.1b. Stochastic Gradient Descent with Momentum- R source("mnist.R") source("DLfunctions7.R") load_mnist() x <- t(train$x)
X <- x[,1:60000]
y <-train$y y1 <- y[1:60000] y2 <- as.matrix(y1) Y=t(y2) # Subset 1024 random samples from MNIST permutation = c(sample(2^10)) # Randomly shuffle the training data X1 = X[, permutation] y1 = Y[1, permutation] y2 <- as.matrix(y1) Y1=t(y2) layersDimensions=c(784, 15,9, 10) # Perform SGD with momentum retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='tanh', outputActivationFunc="softmax", learningRate = 0.05, optimizer="momentum", beta=0.9, mini_batch_size = 512, num_epochs = 5000, print_cost = True)  #Plot the cost vs iterations iterations <- seq(0,5000,1000) costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

## 3.1c. Stochastic Gradient Descent with Momentum- Octave

source("DL7functions.m")
#Create a random permutatation from 60K
permutation = randperm(1024);
disp(length(permutation));

# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);

# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with Momentum
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.01,
lrDecay=false,
decayRate=1,
lambd=0,
keep_prob=1,
optimizer="momentum",
beta=0.9,
beta1=0.9,
beta2=0.999,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000);

plotCostVsEpochs(5000,costs)


## 4.1. Stochastic Gradient Descent with RMSProp

Stochastic Gradient Descent with RMSProp tries to move faster towards the minima while dampening the oscillations across the ravine.
The equations are

$s_{dW}^l = \beta_{1} s_{dW}^l + (1-\beta_{1})(dW^{l})^{2}$
$s_{db}^l = \beta_{1} s_{db}^l + (1-\beta_{1})(db^{l})^2$
$W^{l} = W^{l} - \frac {\alpha s_{dW}^l}{\sqrt (s_{dW}^l + \epsilon) }$
$b^{l} = b^{l} - \frac {\alpha s_{db}^l}{\sqrt (s_{db}^l + \epsilon) }$
where $s_{dW}$ and $s_{db}$ are the RMSProp terms which are exponentially weighted with the corresponding gradients ‘dW’ and ‘db’ at the corresponding layer ‘l’

The code snippet in Octave is shown below

# Update parameters with RMSProp
# Input : parameters
#       : s
#       : beta
#       : learningRate
#       :
#output : Updated parameters RMSProp
L = size(weights)(2); # number of layers in the neural network
# Update rule for each parameter.
for l=1:(L-1)
weights{l} = weights{l} - learningRate* gradsDW{l} ./ sqrt(sdW{l} + epsilon);
biases{l} = biases{l} -  learningRate* gradsDB{l} ./ sqrt(sdB{l} + epsilon);
endfor

if (strcmp(outputActivationFunc,"sigmoid"))
weights{L} = weights{L} -learningRate* gradsDW{L} ./ sqrt(sdW{L} +epsilon);
biases{L} = biases{L} -learningRate* gradsDB{L} ./ sqrt(sdB{L} + epsilon);
elseif (strcmp(outputActivationFunc,"softmax"))
weights{L} = weights{L} -learningRate* gradsDW{L}' ./ sqrt(sdW{L} +epsilon);
biases{L} = biases{L} -learningRate* gradsDB{L}' ./ sqrt(sdB{L} + epsilon);
endif
end


## 4.1a. Stochastic Gradient Descent with RMSProp – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets

lbls=[]
pxls=[]
for i in range(60000):
l,p=training[i]
lbls.append(l)
pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T

print("X1=",X1.shape)
print("y1=",Y1.shape)

# Create  a list of random numbers of 1024
permutation = list(np.random.permutation(2**10))
# Subset 16384 from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))

layersDimensions=[784, 15,9,10]
# Use SGD with RMSProp
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="rmsprop", beta1=0.7, epsilon=1e-8,
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig4.png")

## 4.1b. Stochastic Gradient Descent with RMSProp – R

source("mnist.R")
source("DLfunctions7.R")
x <- t(train$x) X <- x[,1:60000] y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)

# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
#Perform SGD with RMSProp
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.001,
optimizer="rmsprop",
beta1=0.9,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000 ,
print_cost = True)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs df=data.frame(iterations,costs) ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")  ## 4.1c. Stochastic Gradient Descent with RMSProp – Octave source("DL7functions.m") load('./mnist/mnist.txt.gz'); #Create a random permutatation from 1024 permutation = randperm(1024); # Use this 1024 as the batch X=trainX(permutation,:); Y=trainY(permutation,:); # Set layer dimensions layersDimensions=[784, 15, 9, 10]; #Perform SGD with RMSProp [weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.005, lrDecay=false, decayRate=1, lambd=0, keep_prob=1, optimizer="rmsprop", beta=0.9, beta1=0.9, beta2=0.999, epsilon=1, mini_batch_size = 512, num_epochs = 5000); plotCostVsEpochs(5000,costs)  ## 5.1. Stochastic Gradient Descent with Adam Adaptive Moment Estimate is a combination of the momentum (1st moment) and RMSProp(2nd moment). The equations for Adam are below $v_{dW}^l = \beta_{1} v_{dW}^l + (1-\beta_{1})dW^{l}$ $v_{db}^l = \beta_{1} v_{db}^l + (1-\beta_{1})db^{l}$ The bias corrections for the 1st moment $vCorrected_{dW}^l= \frac {v_{dW}^l}{1 - \beta_{1}^{t}}$ $vCorrected_{db}^l= \frac {v_{db}^l}{1 - \beta_{1}^{t}}$ Similarly the moving average for the 2nd moment- RMSProp $s_{dW}^l = \beta_{2} s_{dW}^l + (1-\beta_{2})(dW^{l})^2$ $s_{db}^l = \beta_{2} s_{db}^l + (1-\beta_{2})(db^{l})^2$ The bias corrections for the 2nd moment $sCorrected_{dW}^l= \frac {s_{dW}^l}{1 - \beta_{2}^{t}}$ $sCorrected_{db}^l= \frac {s_{db}^l}{1 - \beta_{2}^{t}}$ The Adam Gradient Descent is given by $W^{l} = W^{l} - \frac {\alpha vCorrected_{dW}^l}{\sqrt (s_{dW}^l + \epsilon) }$ $b^{l} = b^{l} - \frac {\alpha vCorrected_{db}^l}{\sqrt (s_{db}^l + \epsilon) }$ The code snippet of Adam in R is included below # Perform Gradient Descent with Adam # Input : Weights and biases # : beta1 # : epsilon # : gradients # : learning rate # : outputActivationFunc - Activation function at hidden layer sigmoid/softmax #output : Updated weights after 1 iteration gradientDescentWithAdam <- function(parameters, gradients,v, s, t, beta1=0.9, beta2=0.999, epsilon=10^-8, learningRate=0.1,outputActivationFunc="sigmoid"){ L = length(parameters)/2 # number of layers in the neural network v_corrected <- list() s_corrected <- list() # Update rule for each parameter. Use a for loop. for(l in 1:(L-1)){ # v['dWk'] = beta *v['dWk'] + (1-beta)*dWk v[[paste("dW",l, sep="")]] = beta1*v[[paste("dW",l, sep="")]] + (1-beta1) * gradients[[paste('dW',l,sep="")]] v[[paste("db",l, sep="")]] = beta1*v[[paste("db",l, sep="")]] + (1-beta1) * gradients[[paste('db',l,sep="")]] # Compute bias-corrected first moment estimate. v_corrected[[paste("dW",l, sep="")]] = v[[paste("dW",l, sep="")]]/(1-beta1^t) v_corrected[[paste("db",l, sep="")]] = v[[paste("db",l, sep="")]]/(1-beta1^t) # Element wise multiply of gradients s[[paste("dW",l, sep="")]] = beta2*s[[paste("dW",l, sep="")]] + (1-beta2) * gradients[[paste('dW',l,sep="")]] * gradients[[paste('dW',l,sep="")]] s[[paste("db",l, sep="")]] = beta2*s[[paste("db",l, sep="")]] + (1-beta2) * gradients[[paste('db',l,sep="")]] * gradients[[paste('db',l,sep="")]] # Compute bias-corrected second moment estimate. s_corrected[[paste("dW",l, sep="")]] = s[[paste("dW",l, sep="")]]/(1-beta2^t) s_corrected[[paste("db",l, sep="")]] = s[[paste("db",l, sep="")]]/(1-beta2^t) # Update parameters. d1=sqrt(s_corrected[[paste("dW",l, sep="")]]+epsilon) d2=sqrt(s_corrected[[paste("db",l, sep="")]]+epsilon) parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] - learningRate * v_corrected[[paste("dW",l, sep="")]]/d1 parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] - learningRate*v_corrected[[paste("db",l, sep="")]]/d2 } # Compute for the Lth layer if(outputActivationFunc=="sigmoid"){ v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] + (1-beta1) * gradients[[paste('dW',L,sep="")]] v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] + (1-beta1) * gradients[[paste('db',L,sep="")]] # Compute bias-corrected first moment estimate. v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t) v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t) # Element wise multiply of gradients s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] + (1-beta2) * gradients[[paste('dW',L,sep="")]] * gradients[[paste('dW',L,sep="")]] s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] + (1-beta2) * gradients[[paste('db',L,sep="")]] * gradients[[paste('db',L,sep="")]] # Compute bias-corrected second moment estimate. s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t) s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t) # Update parameters. d1=sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon) d2=sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon) parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] - learningRate * v_corrected[[paste("dW",L, sep="")]]/d1 parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] - learningRate*v_corrected[[paste("db",L, sep="")]]/d2 }else if (outputActivationFunc=="softmax"){ v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] + (1-beta1) * t(gradients[[paste('dW',L,sep="")]]) v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] + (1-beta1) * t(gradients[[paste('db',L,sep="")]]) # Compute bias-corrected first moment estimate. v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t) v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t) # Element wise multiply of gradients s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] + (1-beta2) * t(gradients[[paste('dW',L,sep="")]]) * t(gradients[[paste('dW',L,sep="")]]) s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] + (1-beta2) * t(gradients[[paste('db',L,sep="")]]) * t(gradients[[paste('db',L,sep="")]]) # Compute bias-corrected second moment estimate. s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t) s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t) # Update parameters. d1=sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon) d2=sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon) parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] - learningRate * v_corrected[[paste("dW",L, sep="")]]/d1 parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] - learningRate*v_corrected[[paste("db",L, sep="")]]/d2 } return(parameters) }  ## 5.1a. Stochastic Gradient Descent with Adam – Python import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd import sklearn import sklearn.datasets exec(open("DLfunctions7.py").read()) exec(open("load_mnist.py").read()) training=list(read(dataset='training',path=".\\mnist")) test=list(read(dataset='testing',path=".\\mnist")) lbls=[] pxls=[] print(len(training)) #for i in range(len(training)): for i in range(60000): l,p=training[i] lbls.append(l) pxls.append(p) labels= np.array(lbls) pixels=np.array(pxls) y=labels.reshape(-1,1) X=pixels.reshape(pixels.shape[0],-1) X1=X.T Y1=y.T # Create a list of random numbers of 1024 permutation = list(np.random.permutation(2**10)) # Subset 16384 from the data X2 = X1[:, permutation] Y2 = Y1[:, permutation].reshape((1,2**10)) layersDimensions=[784, 15,9,10] #Perform SGD with Adam optimization parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax",learningRate = 0.01 , optimizer="adam", beta1=0.9, beta2=0.9, epsilon = 1e-8, mini_batch_size =512, num_epochs = 1000, print_cost = True, figure="fig5.png") ## 5.1b. Stochastic Gradient Descent with Adam – R source("mnist.R") source("DLfunctions7.R") load_mnist() x <- t(train$x)
X <- x[,1:60000]
y <-train$y y1 <- y[1:60000] y2 <- as.matrix(y1) Y=t(y2) # Subset 1024 random samples from MNIST permutation = c(sample(2^10)) # Randomly shuffle the training data X1 = X[, permutation] y1 = Y[1, permutation] y2 <- as.matrix(y1) Y1=t(y2) layersDimensions=c(784, 15,9, 10) #Perform SGD with Adam retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='tanh', outputActivationFunc="softmax", learningRate = 0.005, optimizer="adam", beta1=0.7, beta2=0.9, epsilon=10^-8, mini_batch_size = 512, num_epochs = 5000 , print_cost = True) #Plot the cost vs iterations iterations <- seq(0,5000,1000) costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

source("DL7functions.m")
#Create a random permutatation from 1024
permutation = randperm(1024);
disp(length(permutation));

# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];

# Note the high value for epsilon.
#Otherwise GD with Adam does not seem to converge
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.1,
lrDecay=false,
decayRate=1,
lambd=0,
keep_prob=1,
beta=0.9,
beta1=0.9,
beta2=0.9,
epsilon=100,
mini_batch_size = 512,
num_epochs = 5000);
plotCostVsEpochs(5000,costs)


Conclusion: In this post I discuss and implement several Stochastic Gradient Descent optimization methods. The implementation of these methods enhance my already existing generic L-Layer Deep Learning Network implementation in vectorized Python, R and Octave, which I had discussed in the previous post in this series on Deep Learning from first principles in Python, R and Octave. Check it out, if you haven’t already. As already mentioned the code for this post can be cloned/forked from Github at DeepLearning-Part7

Watch this space! I’ll be back!

To see all post click Index of posts