My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 6,7,8

This is the final set of presentations in my series ‘Elements of Neural Networks and Deep Learning’. This set follows the two earlier sets of presentations, namely
1. My presentations on ‘Elements of Neural Networks & Deep Learning’ -Part1,2,3
2. My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 4,5

In this final set of presentations I discuss initialization methods and regularization techniques, including dropout. Next I discuss gradient descent optimization methods such as momentum, RMSProp and Adam. Lastly, I briefly touch on hyper-parameter tuning approaches. The corresponding implementations in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles: Second edition – In vectorized Python, R and Octave’.

1. Elements of Neural Networks and Deep Learning – Part 6
This part discusses initialization methods, specifically He and Xavier initialization. The presentation also focuses on how to prevent over-fitting using regularization. Lastly, the dropout method of regularization is also discussed.

The corresponding implementations in vectorized R, Python and Octave of the above discussed methods are available in my post Deep Learning from first principles in Python, R and Octave – Part 6
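The He and Xavier schemes mentioned above differ only in how the random initial weights are scaled by the fan-in of each layer. As a rough illustrative sketch (my own toy code, not the implementation from the book or post, and the function name is hypothetical):

```python
import numpy as np

def initialize_weights(layer_dims, method="he", seed=42):
    """Initialize weights layer by layer; 'he' scales by sqrt(2/fan_in),
    'xavier' by sqrt(1/fan_in). Biases start at zero."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        scale = np.sqrt(2.0 / fan_in) if method == "he" else np.sqrt(1.0 / fan_in)
        parameters['W' + str(l)] = rng.standard_normal((layer_dims[l], fan_in)) * scale
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params = initialize_weights([2, 4, 1], method="he")
print(params['W1'].shape)  # (4, 2)
```

He initialization is the usual choice with ReLU hidden layers, while Xavier suits tanh/sigmoid activations.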

2. Elements of Neural Networks and Deep Learning – Part 7
This presentation introduces exponentially weighted moving average and shows how this is used in different approaches to gradient descent optimization. The key techniques discussed are learning rate decay, momentum method, rmsprop and adam.

The equivalent implementations of the gradient descent optimization techniques in R, Python and Octave can be seen in my post Deep Learning from first principles in Python, R and Octave – Part 7
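As a quick illustration of the exponentially weighted moving average and the momentum update discussed above, here is a minimal Python sketch (my own illustrative code; the function names are hypothetical and not from the posts):

```python
def ewma(series, beta=0.9):
    """Exponentially weighted moving average: v_t = beta*v_{t-1} + (1-beta)*x_t."""
    v = 0.0
    out = []
    for x in series:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return out

def momentum_update(w, grad, v, beta=0.9, lr=0.1):
    """One gradient-descent-with-momentum step: the velocity v is an EWMA
    of past gradients, and the parameter moves along v."""
    v = beta * v + (1 - beta) * grad
    w = w - lr * v
    return w, v

# Minimize J(w) = w^2 (gradient 2w) starting from w = 1
w, v = 1.0, 0.0
for _ in range(50):
    w, v = momentum_update(w, 2 * w, v)
print(round(w, 4))
```

RMSProp and Adam follow the same pattern, but keep an EWMA of squared gradients (RMSProp), or of both the gradients and their squares (Adam).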

3. Elements of Neural Networks and Deep Learning – Part 8
This last part touches upon hyper-parameter tuning in Deep Learning networks
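One common tuning approach is random search, sampling the learning rate on a log scale rather than a linear one. A small illustrative Python sketch (my own toy code; the ranges and function name are assumptions, not from the presentation):

```python
import random

def sample_hyperparameters(n_trials, seed=7):
    """Randomly sample learning rates on a log scale (1e-4 to 1e-1)
    and dropout keep-probabilities in [0.5, 1.0]."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        r = rng.uniform(-4, -1)      # exponent for the learning rate
        trials.append({"learning_rate": 10 ** r,
                       "keep_prob": rng.uniform(0.5, 1.0)})
    return trials

for t in sample_hyperparameters(3):
    print(t)
```

Sampling the exponent uniformly spends equal effort on each decade (1e-4 to 1e-3, 1e-3 to 1e-2, and so on), which matters because the learning rate's effect is multiplicative.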

This concludes this series of presentations on “Elements of Neural Networks and Deep Learning’

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version ($9.99/Rs449).

To see all posts click Index of posts

My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 4,5

This is the next set of presentations on “Elements of Neural Networks and Deep Learning”. In the 4th presentation I discuss and derive the generalized equations for a multi-unit, multi-layer Deep Learning network. The 5th presentation derives the equations for a Deep Learning network when performing multi-class classification, along with the derivations for cross-entropy loss. The corresponding implementations in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles: Second edition – In vectorized Python, R and Octave’.

1. Elements of Neural Network and Deep Learning – Part 4
This presentation is a continuation of my 3rd presentation, in which I derived the equations for a simple 3-layer Neural Network with 1 hidden layer. In this video presentation, I discuss step-by-step the derivations for an L-Layer, multi-unit Deep Learning Network, with any activation function g(z).

The implementations of the L-Layer, multi-unit Deep Learning Network in vectorized R, Python and Octave are available in my post Deep Learning from first principles in Python, R and Octave – Part 3
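To give a concrete feel for the generalized equations, here is a minimal NumPy sketch of forward propagation through an L-layer network with a pluggable activation g(z) (my own illustrative code; the function names are not those used in the posts):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_prop_L_layer(X, parameters, g=relu, g_out=sigmoid):
    """Forward propagation through an L-layer network:
    Z[l] = W[l] A[l-1] + b[l], A[l] = g(Z[l]), with g_out at the final layer."""
    L = len(parameters) // 2
    A = X
    for l in range(1, L + 1):
        Z = parameters['W' + str(l)] @ A + parameters['b' + str(l)]
        A = g_out(Z) if l == L else g(Z)
    return A

# Toy 2-4-1 network on 3 examples
rng = np.random.default_rng(0)
params = {'W1': rng.standard_normal((4, 2)) * 0.01, 'b1': np.zeros((4, 1)),
          'W2': rng.standard_normal((1, 4)) * 0.01, 'b2': np.zeros((1, 1))}
AL = forward_prop_L_layer(rng.standard_normal((2, 3)), params)
print(AL.shape)  # (1, 3)
```

Any activation g(z) (relu, tanh, sigmoid) can be swapped in for the hidden layers without changing the loop structure.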

2. Elements of Neural Network and Deep Learning – Part 5
This presentation discusses multi-class classification using the Softmax function. The detailed derivation for the Jacobian of the Softmax is discussed, and subsequently the derivative of cross-entropy loss is also discussed in detail. Finally the final set of equations for a Neural Network with multi-class classification is derived.

The corresponding implementations in vectorized R, Python and Octave are available in the following posts
a. Deep Learning from first principles in Python, R and Octave – Part 4
b. Deep Learning from first principles in Python, R and Octave – Part 5
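As a rough NumPy illustration of the Softmax, the cross-entropy loss, and the neat result that their combined derivative reduces to the softmax output minus the one-hot labels (all names here are my own illustrative choices, not from the posts):

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax, with the usual max-subtraction for numerical stability."""
    e = np.exp(Z - Z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy_loss(S, y):
    """Mean cross-entropy; S is (classes, m) of softmax probabilities,
    y holds integer class labels for the m examples."""
    m = y.shape[0]
    return -np.mean(np.log(S[y, np.arange(m)]))

def dZ_softmax_cross_entropy(S, y):
    """Chaining the softmax Jacobian with the cross-entropy derivative
    collapses to the simple result dZ = (S - one_hot(y)) / m."""
    m = y.shape[0]
    one_hot = np.zeros_like(S)
    one_hot[y, np.arange(m)] = 1.0
    return (S - one_hot) / m

Z = np.array([[2.0, 0.5], [1.0, 1.5], [0.1, 0.2]])  # 3 classes, 2 examples
y = np.array([0, 1])
S = softmax(Z)
print(cross_entropy_loss(S, y), dZ_softmax_cross_entropy(S, y).shape)
```

Each column of dZ sums to zero, since both the softmax probabilities and the one-hot labels sum to one per example.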

To be continued. Watch this space!

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version ($9.99/Rs449).

To see all posts click Index of Posts

My presentations on ‘Elements of Neural Networks & Deep Learning’ -Part1,2,3

I will be uploading a series of presentations on ‘Elements of Neural Networks and Deep Learning’. In these video presentations I discuss the derivations of L-Layer Deep Learning Networks, starting from the basics. The corresponding implementations in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles: Second edition – In vectorized Python, R and Octave’.

1. Elements of Neural Networks and Deep Learning – Part 1
This presentation introduces Neural Networks and Deep Learning. It takes a look at the history of Neural Networks and Perceptrons, explains why Deep Learning networks are required, and concludes with a simple toy example of a Neural Network and how it computes.

2. Elements of Neural Networks and Deep Learning – Part 2
This presentation takes logistic regression as an example and creates an equivalent 2-layer Neural Network. The presentation also takes a look at forward and backward propagation and how the cost is minimized using gradient descent.

The implementation of the discussed 2-layer Neural Network in vectorized R, Python and Octave is available in my post ‘Deep Learning from first principles in Python, R and Octave – Part 1’.
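The idea of treating logistic regression as a tiny neural network, with forward propagation, backward propagation and a gradient descent update, can be sketched in a few lines of NumPy (an illustrative toy example of mine, not the vectorized implementation from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_nn(X, Y, lr=0.5, iters=2000):
    """Logistic regression as a minimal neural network:
    forward pass A = sigmoid(w.T X + b), backward pass yields dw, db,
    and gradient descent updates w, b to reduce the cross-entropy cost."""
    n, m = X.shape
    w = np.zeros((n, 1)); b = 0.0
    for _ in range(iters):
        A = sigmoid(w.T @ X + b)      # forward propagation
        dZ = A - Y                    # backward propagation
        dw = (X @ dZ.T) / m
        db = dZ.sum() / m
        w -= lr * dw                  # gradient descent step
        b -= lr * db
    return w, b

# Toy separable data: class is 1 when x1 + x2 > 0
X = np.array([[1.0, 2.0, -1.0, -2.0], [1.0, 0.5, -1.5, -0.5]])
Y = np.array([[1, 1, 0, 0]])
w, b = train_logistic_nn(X, Y)
preds = (sigmoid(w.T @ X + b) > 0.5).astype(int)
print(preds)  # [[1 1 0 0]]
```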

3. Elements of Neural Networks and Deep Learning – Part 3
This 3rd part discusses a primitive neural network with an input layer, a hidden layer and an output layer. The neural network uses tanh activation in the hidden layer and sigmoid activation in the output layer. The equations for forward and backward propagation are derived.

To see the implementations for the above discussed video, see my post ‘Deep Learning from first principles in Python, R and Octave – Part 2’.
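The forward and backward propagation equations for this 3-layer network (tanh hidden layer, sigmoid output) can be sketched in NumPy as follows (my own illustrative code, not the implementation from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    """Forward propagation: tanh in the hidden layer, sigmoid at the output."""
    Z1 = p['W1'] @ X + p['b1']; A1 = np.tanh(Z1)
    Z2 = p['W2'] @ A1 + p['b2']; A2 = sigmoid(Z2)
    return A1, A2

def backward(X, Y, A1, A2, p):
    """Backward propagation, using tanh'(z) = 1 - tanh(z)^2 for the hidden layer."""
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m; db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (p['W2'].T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m; db1 = dZ1.sum(axis=1, keepdims=True) / m
    return {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}

rng = np.random.default_rng(1)
p = {'W1': rng.standard_normal((3, 2)) * 0.01, 'b1': np.zeros((3, 1)),
     'W2': rng.standard_normal((1, 3)) * 0.01, 'b2': np.zeros((1, 1))}
X = rng.standard_normal((2, 5))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)
A1, A2 = forward(X, p)
grads = backward(X, Y, A1, A2, p)
print(grads['dW1'].shape, grads['dW2'].shape)  # (3, 2) (1, 3)
```

Note that each gradient has the same shape as the parameter it updates, which is a handy sanity check on the derived equations.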

To be continued. Watch this space!

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version ($9.99/Rs449).

To see all posts click Index of posts

My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

Are you wondering whether to get into the ‘R’ bus or ‘Python’ bus?
My suggestion to you is: “Why not get into the ‘R and Python’ train?”

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($8.99/Rs449) versions. In the third edition all code sections have been re-formatted to use the fixed-width font ‘Consolas’. This neatly organizes output which has columnar structures, such as confusion matrices and dataframes, making the code more readable. There is a science to formatting too, which improves the look and feel. It is little wonder that Steve Jobs had a keen passion for calligraphy! Additionally, some typos have been fixed.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Paperback – $12.99)
2. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Kindle – $8.99/Rs449)

This book is ideal both for beginners and experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Hope you have a great time learning as I did while implementing these algorithms!

Introducing cricpy: A python package to analyze performances of cricketers

Full many a gem of purest ray serene,
The dark unfathomed caves of ocean bear;
Full many a flower is born to blush unseen,
And waste its sweetness on the desert air.

            Thomas Gray, An Elegy Written In A Country Churchyard


Introduction

It is finally here! cricpy, the python avatar of my R package cricketr, is now ready to rock-n-roll! My R package cricketr had its genesis a little over 3 years ago and went through a couple of enhancements. During this time I have always thought about creating an equivalent python package. Now I have finally done it.

So here it is. My python package ‘cricpy’!!!

This package uses the statistics info available in ESPN Cricinfo Statsguru. The current version of this package supports only Test cricket.

You should be able to install the package using pip install cricpy and use the many functions available in the package. Please be mindful of the ESPN Cricinfo Terms of Use.

Do check out my post on R package cricketr at Re-introducing cricketr! : An R package to analyze performances of cricketers

My book “Deep Learning from first principles” now on Amazon

Note: The 2nd edition of this book is now available on Amazon

My 4th book (self-published), “Deep Learning from first principles – In vectorized Python, R and Octave” (557 pages), is now available on Amazon in both paperback ($18.99) and kindle ($9.99/Rs449) versions. The book starts with the most primitive 2-layer Neural Network and works its way to a generic L-layer Deep Learning Network, with all the bells and whistles. The book includes detailed derivations and vectorized implementations in Python, R and Octave. The code has been extensively commented and has been included in the Appendix section. Pick up your copy today!!!

Deep Learning from first principles in Python, R and Octave – Part 8

1. Introduction

You don’t understand anything until you learn it more than one way. – Marvin Minsky

No computer has ever been designed that is ever aware of what it’s doing; but most of the time, we aren’t either. – Marvin Minsky

A wealth of information creates a poverty of attention. – Herbert Simon

This post, Deep Learning from first principles in Python, R and Octave – Part 8, is the final post in my ‘Deep Learning from first principles’ series. In this post I discuss and implement a key functionality needed while building Deep Learning networks, viz. ‘Gradient Checking’. Gradient Checking is an important method to check the correctness of your implementation, specifically the forward propagation and backward propagation cycles of an implementation. In addition, I also discuss some tips for tuning the hyper-parameters of a Deep Learning network, based on my experience.

My posts in this ‘Deep Learning series’ so far were
1. Deep Learning from first principles in Python, R and Octave – Part 1: In part 1, I implement logistic regression as a neural network in vectorized Python, R and Octave
2. Deep Learning from first principles in Python, R and Octave – Part 2: In the second part I implement a simple Neural Network with just 1 hidden layer and a sigmoid activation output function
3. Deep Learning from first principles in Python, R and Octave – Part 3: The 3rd part implements a multi-layer Deep Learning Network with sigmoid activation output in vectorized Python, R and Octave
4. Deep Learning from first principles in Python, R and Octave – Part 4: The 4th part deals with multi-class classification. Specifically, I derive the Jacobian of the Softmax function and enhance my L-Layer DL network to include the Softmax output function in addition to the Sigmoid activation
5. Deep Learning from first principles in Python, R and Octave – Part 5: This post uses the Softmax classifier implemented earlier to classify MNIST digits using an L-layer Deep Learning network
6. Deep Learning from first principles in Python, R and Octave – Part 6: The 6th part adds more bells and whistles to my L-Layer DL network, by including different initialization types, namely He and Xavier. Besides, L2 regularization and random dropout are added
7. Deep Learning from first principles in Python, R and Octave – Part 7: The 7th part deals with Stochastic Gradient Descent optimization methods, including momentum, RMSProp and Adam
8. Deep Learning from first principles in Python, R and Octave – Part 8: This post implements Gradient Checking, a critical function for ensuring the correctness of an L-Layer Deep Learning network implementation

You may also like my companion book “Practical Machine Learning with R and Python – Machine Learning in stereo”, available on Amazon in paperback ($9.99) and Kindle ($6.99) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python, which are essential to delve deep into Deep Learning.

Gradient Checking is based on the following approach.
One iteration of Gradient Descent computes and updates the parameters $\theta$ by doing $\theta := \theta - \alpha \frac{d}{d\theta}J(\theta)$, where $\alpha$ is the learning rate. To minimize the cost we will need to minimize $J(\theta)$.

Let $g(\theta)$ be a function that computes the derivative $\frac{d}{d\theta}J(\theta)$. Gradient Checking allows us to numerically evaluate the implementation of the function $g(\theta)$ and verify its correctness.

We know the derivative of a function is given by

$\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \to 0} \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon}$

Note: The above derivative is the 2-sided derivative. The 1-sided derivative is given by

$\frac{d}{d\theta}J(\theta) = \lim_{\epsilon \to 0} \frac{J(\theta+\epsilon) - J(\theta)}{\epsilon}$

Gradient Checking is based on the 2-sided derivative because its error is of the order $O(\epsilon^{2})$, as opposed to $O(\epsilon)$ for the 1-sided derivative. Hence Gradient Check uses the 2-sided derivative as follows, for a small fixed $\epsilon$:

$g(\theta) \approx \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon}$

In Gradient Check the following is done:

A) Run one normal cycle of your implementation:
a) Compute the output activation by running 1 cycle of forward propagation
b) Compute the cost using the output activation
c) Compute the gradients using backpropagation (‘grad’)

B) Perform the gradient check steps as below:
a) Set $\theta$: flatten all ‘weights’ and ‘bias’ matrices and vectors into a single column vector
b) Initialize $\theta^{+}$ by bumping up $\theta$ by adding $\epsilon$, i.e. $\theta + \epsilon$
c) Perform forward propagation with $\theta^{+}$
d) Compute the cost with $\theta^{+}$, i.e. $J(\theta^{+})$
e) Initialize $\theta^{-}$ by bumping down $\theta$ by subtracting $\epsilon$, i.e. $\theta - \epsilon$
f) Perform forward propagation with $\theta^{-}$
g) Compute the cost with $\theta^{-}$, i.e. $J(\theta^{-})$
h) Compute ‘gradapprox’, the approximation of $\frac{d}{d\theta}J(\theta)$, as $\frac{J(\theta^{+}) - J(\theta^{-})}{2\epsilon}$, using the 2-sided derivative
i) Compute the L2 norm (Euclidean distance) between ‘grad’ and ‘gradapprox’. If the difference is of the order of $10^{-5}$ or $10^{-7}$ the implementation is correct. In the Deep Learning Specialization Prof. Andrew Ng mentions that if the difference is of the order of $10^{-7}$ then the implementation is correct. A difference of $10^{-5}$ is also ok. Anything more than that is a cause for worry, and you should look at your code more closely.

To see more details click Gradient checking and advanced optimization

You can clone/download the code from Github at DeepLearning-Part8

After spending the better part of 3 days, I now realize how critical Gradient Check is for ensuring the correctness of your implementation. Initially I was getting a very high difference and did not know how to interpret the results or debug my implementation. After many hours of staring at the results, I was finally able to arrive at a way to localize issues in the implementation. In fact, I did catch a small bug in my Python code, which did not exist in the R and Octave implementations.
I will demonstrate this below 1.1a Gradient Check – Sigmoid Activation – Python import numpy as np import matplotlib exec(open("DLfunctions8.py").read()) exec(open("testcases.py").read()) #Load the data train_X, train_Y, test_X, test_Y = load_dataset() #Set layer dimensions layersDimensions = [2,4,1] parameters = initializeDeepModel(layersDimensions) #Perform forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="sigmoid") print("cost=",cost) #Perform backprop and get gradients gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid") epsilon = 1e-7 outputActivationFunc="sigmoid" # Set-up variables # Flatten parameters to a vector parameters_values, _ = dictionary_to_vector(parameters) #Flatten gradients to a vector grad = gradients_to_vector(parameters,gradients) num_parameters = parameters_values.shape[0] #Initialize J_plus = np.zeros((num_parameters, 1)) J_minus = np.zeros((num_parameters, 1)) gradapprox = np.zeros((num_parameters, 1)) # Compute gradapprox using 2 sided derivative for i in range(num_parameters): # Compute J_plus[i]. thetaplus = np.copy(parameters_values) thetaplus[i][0] = thetaplus[i][0] + epsilon AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaplus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) J_plus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc) # Compute J_minus[i]. 
thetaminus = np.copy(parameters_values) thetaminus[i][0] = thetaminus[i][0] - epsilon AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaminus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) J_minus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc) # Compute gradapprox[i] gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon) # Compare gradapprox to backward propagation gradients by computing difference. numerator = np.linalg.norm(grad-gradapprox) denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox) difference = numerator/denominator #Check the difference if difference > 1e-5: print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m") else: print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m") print(difference) print("\n") # The technique below can be used to identify # which of the parameters are in error # Covert grad to dictionary m=vector_to_dictionary2(parameters,grad) print("Gradients from backprop") print(m) print("\n") # Convert gradapprox to dictionary n=vector_to_dictionary2(parameters,gradapprox) print("Gradapprox from gradient check") print(n)  ## (300, 2) ## (300,) ## cost= 0.6931455556341791 ## [92mYour backward propagation works perfectly fine! 
difference = 1.1604150683743381e-06[0m ## 1.1604150683743381e-06 ## ## ## Gradients from backprop ## {'dW1': array([[-6.19439955e-06, -2.06438046e-06], ## [-1.50165447e-05, 7.50401672e-05], ## [ 1.33435433e-04, 1.74112143e-04], ## [-3.40909024e-05, -1.38363681e-04]]), 'db1': array([[ 7.31333221e-07], ## [ 7.98425950e-06], ## [ 8.15002817e-08], ## [-5.69821155e-08]]), 'dW2': array([[2.73416304e-04, 2.96061451e-04, 7.51837363e-05, 1.01257729e-04]]), 'db2': array([[-7.22232235e-06]])} ## ## ## Gradapprox from gradient check ## {'dW1': array([[-6.19448937e-06, -2.06501483e-06], ## [-1.50168766e-05, 7.50399742e-05], ## [ 1.33435485e-04, 1.74112391e-04], ## [-3.40910633e-05, -1.38363765e-04]]), 'db1': array([[ 7.31081862e-07], ## [ 7.98472399e-06], ## [ 8.16013923e-08], ## [-5.71764858e-08]]), 'dW2': array([[2.73416290e-04, 2.96061509e-04, 7.51831930e-05, 1.01257891e-04]]), 'db2': array([[-7.22255589e-06]])} 1.1b Gradient Check – Softmax Activation – Python (Error!!) In the code below I show, how I managed to spot a bug in your implementation import numpy as np exec(open("DLfunctions8.py").read()) N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data #plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) layersDimensions = [2,3,3] y1=y.reshape(-1,1).T train_X=X.T train_Y=y1 parameters = initializeDeepModel(layersDimensions) #Compute forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="softmax") print("cost=",cost) #Compute gradients from backprop 
gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") # Note the transpose of the gradients for Softmax has to be taken L= len(parameters)//2 print(L) gradients['dW'+str(L)]=gradients['dW'+str(L)].T gradients['db'+str(L)]=gradients['db'+str(L)].T # Perform gradient check gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax")  cost= 1.0986187818144022 2 There is a mistake in the backward propagation! difference = 0.7100295155692544 0.7100295155692544 Gradients from backprop {'dW1': array([[ 0.00050125, 0.00045194], [ 0.00096392, 0.00039641], [-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082], [-0.00224399], [ 0.00052305]]), 'dW2': array([[-8.40953794e-05, -9.52657769e-04, -1.10269379e-04], [-7.45469382e-04, 9.49795606e-04, 2.29045434e-04], [ 8.29564761e-04, 2.86216305e-06, -1.18776055e-04]]), 'db2': array([[-0.00253808], [-0.00505508], [ 0.00759315]])} Gradapprox from gradient check {'dW1': array([[ 0.00050125, 0.00045194], [ 0.00096392, 0.00039641], [-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082], [-0.00224399], [ 0.00052305]]), 'dW2': array([[-8.40960634e-05, -9.52657953e-04, -1.10268461e-04], [-7.45469242e-04, 9.49796908e-04, 2.29045671e-04], [ 8.29565305e-04, 2.86104473e-06, -1.18776100e-04]]), 'db2': array([[-8.46211989e-06], [-1.68487446e-05], [ 2.53108645e-05]])} Gradient Check gives a high value of the difference of 0.7100295. Inspecting the Gradients and Gradapprox we can see there is a very big discrepancy in db2. 
After I went over my code I discovered that I my computation in the function layerActivationBackward for Softmax was  # Erroneous code if activationFunc == 'softmax': dW = 1/numtraining * np.dot(A_prev,dZ) db = np.sum(dZ, axis=0, keepdims=True) dA_prev = np.dot(dZ,W) instead of # Fixed code if activationFunc == 'softmax': dW = 1/numtraining * np.dot(A_prev,dZ) db = 1/numtraining * np.sum(dZ, axis=0, keepdims=True) dA_prev = np.dot(dZ,W)  After fixing this error when I ran Gradient Check I get 1.1c Gradient Check – Softmax Activation – Python (Corrected!!) import numpy as np exec(open("DLfunctions8.py").read()) N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes X = np.zeros((N*K,D)) # data matrix (each row = single example) y = np.zeros(N*K, dtype='uint8') # class labels for j in range(K): ix = range(N*j,N*(j+1)) r = np.linspace(0.0,1,N) # radius t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta X[ix] = np.c_[r*np.sin(t), r*np.cos(t)] y[ix] = j # Plot the data #plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) layersDimensions = [2,3,3] y1=y.reshape(-1,1).T train_X=X.T train_Y=y1 #Set layer dimensions parameters = initializeDeepModel(layersDimensions) #Perform forward prop AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") #Compute cost cost = computeCost(AL, train_Y, outputActivationFunc="softmax") print("cost=",cost) #Compute gradients from backprop gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="softmax") # Note the transpose of the gradients for Softmax has to be taken L= len(parameters)//2 print(L) gradients['dW'+str(L)]=gradients['dW'+str(L)].T gradients['db'+str(L)]=gradients['db'+str(L)].T #Perform gradient check gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax") 
## cost= 1.0986193170234435 ## 2 ## [92mYour backward propagation works perfectly fine! difference = 5.268804859613151e-07[0m ## 5.268804859613151e-07 ## ## ## Gradients from backprop ## {'dW1': array([[ 0.00053206, 0.00038987], ## [ 0.00093941, 0.00038077], ## [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662], ## [-0.00210198], ## [ 0.00046741]]), 'dW2': array([[-7.83441270e-05, -9.70179498e-04, -1.08715815e-04], ## [-7.70175008e-04, 9.54478237e-04, 2.27690198e-04], ## [ 8.48519135e-04, 1.57012608e-05, -1.18974383e-04]]), 'db2': array([[-8.52190476e-06], ## [-1.69954294e-05], ## [ 2.55173342e-05]])} ## ## ## Gradapprox from gradient check ## {'dW1': array([[ 0.00053206, 0.00038987], ## [ 0.00093941, 0.00038077], ## [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662], ## [-0.00210198], ## [ 0.00046741]]), 'dW2': array([[-7.83439980e-05, -9.70180603e-04, -1.08716369e-04], ## [-7.70173925e-04, 9.54478718e-04, 2.27690089e-04], ## [ 8.48520143e-04, 1.57018842e-05, -1.18973720e-04]]), 'db2': array([[-8.52096171e-06], ## [-1.69964043e-05], ## [ 2.55162558e-05]])} 1.2a Gradient Check – Sigmoid Activation – R source("DLfunctions8.R") z <- as.matrix(read.csv("circles.csv",header=FALSE)) x <- z[,1:2] y <- z[,3] X <- t(x) Y <- t(y) #Set layer dimensions layersDimensions = c(2,5,1) parameters = initializeDeepModel(layersDimensions) #Perform forward prop retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid") AL <- retvals[['AL']] caches <- retvals[['caches']] dropoutMat <- retvals[['dropoutMat']] #Compute cost cost <- computeCost(AL, Y,outputActivationFunc="sigmoid", numClasses=layersDimensions[length(layersDimensions)]) print(cost) ## [1] 0.6931447 # Backward propagation. 
gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid",numClasses=layersDimensions[length(layersDimensions)]) epsilon = 1e-07 outputActivationFunc="sigmoid" #Convert parameter list to vector parameters_values = list_to_vector(parameters) #Convert gradient list to vector grad = gradients_to_vector(parameters,gradients) num_parameters = dim(parameters_values)[1] #Initialize J_plus = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) J_minus = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) gradapprox = matrix(rep(0,num_parameters), nrow=num_parameters,ncol=1) # Compute gradapprox for(i in 1:num_parameters){ # Compute J_plus[i]. thetaplus = parameters_values thetaplus[i][1] = thetaplus[i][1] + epsilon retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaplus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) AL <- retvals[['AL']] J_plus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc) # Compute J_minus[i]. thetaminus = parameters_values thetaminus[i][1] = thetaminus[i][1] - epsilon retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaminus), keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc) AL <- retvals[['AL']] J_minus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc) # Compute gradapprox[i] gradapprox[i] = (J_plus[i] - J_minus[i])/(2*epsilon) } # Compare gradapprox to backward propagation gradients by computing difference. 
#Compute L2Norm numerator = L2NormVec(grad-gradapprox) denominator = L2NormVec(grad) + L2NormVec(gradapprox) difference = numerator/denominator if(difference > 1e-5){ cat("There is a mistake, the difference is too high",difference) } else{ cat("The implementations works perfectly", difference) } ## The implementations works perfectly 1.279911e-06 # This can be used to check print("Gradients from backprop") ## [1] "Gradients from backprop" vector_to_list2(parameters,grad) ##$dW1
##               [,1]          [,2]
## [1,] -7.641588e-05 -3.427989e-07
## [2,] -9.049683e-06  6.906304e-05
## [3,]  3.401039e-06 -1.503914e-04
## [4,]  1.535226e-04 -1.686402e-04
## [5,] -6.029292e-05 -2.715648e-04
##
## $db1
##               [,1]
## [1,]  6.930318e-06
## [2,] -3.283117e-05
## [3,]  1.310647e-05
## [4,] -3.454308e-05
## [5,] -2.331729e-08
##
## $dW2
##              [,1]         [,2]         [,3]        [,4]         [,5]
## [1,] 0.0001612356 0.0001113475 0.0002435824 0.000362149 2.874116e-05
##
## $db2
##              [,1]
## [1,] -1.16364e-05

print("Grad approx from gradient check")
## [1] "Grad approx from gradient check"
vector_to_list2(parameters,gradapprox)
## $dW1
##               [,1]          [,2]
## [1,] -7.641554e-05 -3.430589e-07
## [2,] -9.049428e-06  6.906253e-05
## [3,]  3.401168e-06 -1.503919e-04
## [4,]  1.535228e-04 -1.686401e-04
## [5,] -6.029288e-05 -2.715650e-04
##
## $db1
##               [,1]
## [1,]  6.930012e-06
## [2,] -3.283096e-05
## [3,]  1.310618e-05
## [4,] -3.454237e-05
## [5,] -2.275957e-08
##
## $dW2
##              [,1]         [,2]         [,3]         [,4]        [,5]
## [1,] 0.0001612355 0.0001113476 0.0002435829 0.0003621486 2.87409e-05
##
## $db2
##              [,1]
## [1,] -1.16368e-05

1.2b Gradient Check – Softmax Activation – R

source("DLfunctions8.R")
Z <- as.matrix(read.csv("spiral.csv",header=FALSE))
# Setup the data
X <- Z[,1:2]
y <- Z[,3]
X <- t(X)
Y <- t(y)
layersDimensions = c(2, 3, 3)
parameters = initializeDeepModel(layersDimensions)
#Perform forward prop
retvals = forwardPropagationDeep(X, parameters, keep_prob=1, hiddenActivationFunc="relu",
                   outputActivationFunc="softmax")
AL <- retvals[['AL']]
caches <- retvals[['caches']]
dropoutMat <- retvals[['dropoutMat']]
#Compute cost
cost <- computeCost(AL, Y, outputActivationFunc="softmax",
                   numClasses=layersDimensions[length(layersDimensions)])
print(cost)
## [1] 1.098618
# Backward propagation
gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1,
                   hiddenActivationFunc="relu", outputActivationFunc="softmax",
                   numClasses=layersDimensions[length(layersDimensions)])
# Need to take the transpose of the last layer for Softmax
L=length(parameters)/2
gradients[[paste('dW',L,sep="")]]=t(gradients[[paste('dW',L,sep="")]])
gradients[[paste('db',L,sep="")]]=t(gradients[[paste('db',L,sep="")]])
#Perform gradient check
gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7, outputActivationFunc="softmax")
## The implementation works perfectly 3.903011e-07
## [1] "Gradients from backprop"
## $dW1
##              [,1]          [,2]
## [1,] 0.0007962367 -0.0001907606
## [2,] 0.0004444254  0.0010354412
## [3,] 0.0003078611  0.0007591255
##
## $db1
##               [,1]
## [1,] -0.0017305136
## [2,]  0.0005393734
## [3,]  0.0012484550
##
## $dW2
##               [,1]          [,2]          [,3]
## [1,] -3.515627e-04  7.487283e-04 -3.971656e-04
## [2,] -6.381521e-05 -1.257328e-06  6.507254e-05
## [3,] -1.719479e-04 -4.857264e-04  6.576743e-04
##
## $db2
##               [,1]
## [1,] -5.536383e-06
## [2,] -1.824656e-05
## [3,]  2.378295e-05
##
## [1] "Grad approx from gradient check"
## $dW1
##              [,1]          [,2]
## [1,] 0.0007962364 -0.0001907607
## [2,] 0.0004444256  0.0010354406
## [3,] 0.0003078615  0.0007591250
##
## $db1
##               [,1]
## [1,] -0.0017305135
## [2,]  0.0005393741
## [3,]  0.0012484547
##
## $dW2
##               [,1]          [,2]          [,3]
## [1,] -3.515632e-04  7.487277e-04 -3.971656e-04
## [2,] -6.381451e-05 -1.257883e-06  6.507239e-05
## [3,] -1.719469e-04 -4.857270e-04  6.576739e-04
##
## $db2
##               [,1]
## [1,] -5.536682e-06
## [2,] -1.824652e-05
## [3,]  2.378209e-05

1.3a Gradient Check – Sigmoid Activation – Octave

source("DL8functions.m")
################## Circles
data=csvread("circles.csv");
X=data(:,1:2);
Y=data(:,3);
#Set layer dimensions
layersDimensions = [2 5 1]; #tanh=-0.5(ok), #relu=0.1 best!
[weights biases] = initializeDeepModel(layersDimensions);
#Perform forward prop
[AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,
                   keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="sigmoid");
#Compute cost
cost = computeCost(AL, Y', outputActivationFunc=outputActivationFunc,
                   numClasses=layersDimensions(size(layersDimensions)(2)));
disp(cost);
#Compute gradients from cost
[gradsDA gradsDW gradsDB] = backwardPropagationDeep(AL, Y', activation_caches, forward_caches,
                   droputMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",
                   outputActivationFunc="sigmoid",
                   numClasses=layersDimensions(size(layersDimensions)(2)));
epsilon = 1e-07;
outputActivationFunc="sigmoid";
# Convert parameter cell array to vector
parameters_values = cellArray_to_vector(weights, biases);
#Convert gradient cell array to vector
grad = gradients_to_vector(gradsDW,gradsDB);
num_parameters = size(parameters_values)(1);
#Initialize
J_plus = zeros(num_parameters, 1);
J_minus = zeros(num_parameters, 1);
gradapprox = zeros(num_parameters, 1);
# Compute gradapprox
for i = 1:num_parameters
    # Compute J_plus[i]
    thetaplus = parameters_values;
    thetaplus(i,1) = thetaplus(i,1) + epsilon;
    [weights1 biases1] = vector_to_cellArray(weights, biases, thetaplus);
    [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights1, biases1,
                   keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc=outputActivationFunc);
    J_plus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc);
    # Compute J_minus[i]
    thetaminus = parameters_values;
    thetaminus(i,1) = thetaminus(i,1) - epsilon;
    [weights1 biases1] = vector_to_cellArray(weights, biases, thetaminus);
    [AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights1, biases1,
                   keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc=outputActivationFunc);
    J_minus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc);
    # Compute gradapprox[i]
    gradapprox(i) = (J_plus(i) - J_minus(i))/(2*epsilon);
endfor

#Compute L2 norm
numerator = L2NormVec(grad-gradapprox);
denominator = L2NormVec(grad) + L2NormVec(gradapprox);
difference = numerator/denominator;
disp(difference);
#Check difference
if difference > 1e-04
    printf("There is a mistake in the implementation ");
    disp(difference);
else
    printf("The implementation works perfectly");
    disp(difference);
endif
[weights1 biases1] = vector_to_cellArray(weights, biases, grad);
printf("Gradients from back propagation");
disp(weights1);
disp(biases1);
[weights2 biases2] = vector_to_cellArray(weights, biases, gradapprox);
printf("Gradients from gradient check");
disp(weights2);
disp(biases2);

0.69315
1.4893e-005
The implementation works perfectly 1.4893e-005
Gradients from back propagation
{
  [1,1] =
     5.0349e-005   2.1323e-005
     8.8632e-007   1.8231e-006
     9.3784e-005   1.0057e-004
     1.0875e-004  -1.9529e-007
     5.4502e-005   3.2721e-005
  [1,2] =
     1.0567e-005   6.0615e-005   4.6004e-005   1.3977e-004   1.0405e-004
}
{
  [1,1] =
    -1.8716e-005
     1.1309e-009
     4.7686e-005
     1.2051e-005
    -1.4612e-005
  [1,2] = 9.5808e-006
}
Gradients from gradient check
{
  [1,1] =
     5.0348e-005   2.1320e-005
     8.8485e-007   1.8219e-006
     9.3784e-005   1.0057e-004
     1.0875e-004  -1.9762e-007
     5.4502e-005   3.2723e-005
  [1,2] =
     1.0565e-005   6.0614e-005   4.6007e-005   1.3977e-004   1.0405e-004
}
{
  [1,1] =
    -1.8713e-005
     1.1102e-009
     4.7687e-005
     1.2048e-005
    -1.4609e-005
  [1,2] = 9.5790e-006
}

1.3b Gradient Check – Softmax Activation – Octave

source("DL8functions.m")
data=csvread("spiral.csv");
# Setup the data
X=data(:,1:2);
Y=data(:,3);
#
# Set the layer dimensions
layersDimensions = [2 3 3];
[weights biases] = initializeDeepModel(layersDimensions);
# Run forward prop
[AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,
                   keep_prob=1, hiddenActivationFunc="relu", outputActivationFunc="softmax");
# Compute cost
cost = computeCost(AL, Y', outputActivationFunc=outputActivationFunc,
                   numClasses=layersDimensions(size(layersDimensions)(2)));
disp(cost);
# Perform backward prop
[gradsDA gradsDW gradsDB] = backwardPropagationDeep(AL, Y', activation_caches, forward_caches,
                   droputMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",
                   outputActivationFunc="softmax",
                   numClasses=layersDimensions(size(layersDimensions)(2)));
#Take transpose of last layer for Softmax
L=size(weights)(2);
gradsDW{L}= gradsDW{L}';
gradsDB{L}= gradsDB{L}';
#Perform gradient check
difference= gradient_check_n(weights, biases, gradsDW, gradsDB, X, Y, epsilon = 1e-7,
                   outputActivationFunc="softmax",
                   numClasses=layersDimensions(size(layersDimensions)(2)));

1.0986
The implementation works perfectly 2.0021e-005
Gradients from back propagation
{
  [1,1] =
    -7.1590e-005   4.1375e-005
    -1.9494e-004  -5.2014e-005
    -1.4554e-004   5.1699e-005
  [1,2] =
     3.3129e-004   1.9806e-004  -1.5662e-005
    -4.9692e-004  -3.7756e-004  -8.2318e-005
     1.6562e-004   1.7950e-004   9.7980e-005
}
{
  [1,1] =
    -3.0856e-005  -3.3321e-004  -3.8197e-004
  [1,2] =
     1.2046e-006   2.9259e-007  -1.4972e-006
}
Gradients from gradient check
{
  [1,1] =
    -7.1586e-005   4.1377e-005
    -1.9494e-004  -5.2013e-005
    -1.4554e-004   5.1695e-005
  [1,2] =
     3.3129e-004   1.9806e-004  -1.5664e-005
    -4.9692e-004  -3.7756e-004  -8.2316e-005
     1.6562e-004   1.7950e-004   9.7979e-005
}
{
  [1,1] =
    -3.0852e-005  -3.3321e-004  -3.8197e-004
  [1,2] =
     1.1902e-006   2.8200e-007  -1.4644e-006
}

2.1 Tips for tuning hyper-parameters

Deep Learning networks come with a large number of hyper-parameters which require tuning. The hyper-parameters are

1. $\alpha$ – learning rate
2. Number of layers
3. Number of hidden units
4. Number of iterations
5.
Momentum – $\beta$ – 0.9
6. RMSProp – $\beta_{1}$ – 0.9
7. Adam – $\beta_{1}$, $\beta_{2}$ and $\epsilon$
8. Learning rate decay
9. Mini-batch size
10. Initialization method – He, Xavier
11. Regularization

– Among the above, the most critical is the learning rate $\alpha$. Rather than just trying out random values, it may help to try out values on a logarithmic scale. So we could try out values 0.01, 0.1, 1.0, 10 etc. If we find that the cost is lowest between 0.01 and 0.1, we could use a technique similar to binary search or bisection: try 0.01 and 0.05, and if the optimum appears to lie between 0.01 and 0.05, try 0.025 and keep halving the distance.
– The performance of Momentum and RMSProp is very good and they work well with values of 0.9. Even so, it is better to try out values of 1-$\beta$ on a logarithmic scale. So 1-$\beta$ could be 0.001, 0.01, 0.1 and hence $\beta$ would be 0.999, 0.99 or 0.9.
– The number of hidden units and the number of hidden layers need to be increased gradually. I have noticed that increasing the number of hidden layers heavily does not improve performance and sometimes degrades it.
– Sometimes I tend to increase the number of iterations if I see a steady decrease in the cost for a certain learning rate.
– It may also help to add learning rate decay if you see oscillation while the cost decreases.
– Xavier and He initializations also help in fast convergence and are worth trying out.

3.1 Final thoughts

As I come to a close in this Deep Learning Series from first principles in Python, R and Octave, I must admit that I learnt a lot in the process.
* Building an L-layer, vectorized Deep Learning Network in Python, R and Octave was extremely challenging but very rewarding.
* One benefit of building vectorized versions in Python, R and Octave was that I looked at each function I implemented thrice, and hence was able to fix bugs in any of the languages.
* In addition, since I built the generic L-layer DL network with all the bells and whistles layer by layer, I had a further opportunity to revisit all the functions in each successive post.
* Each language has its advantages and disadvantages. From the performance perspective I think Python is the best, followed by Octave and then R.
* Interestingly, I noticed that even if small bugs creep into your implementation, the DL network does learn and does generate a valid set of weights and biases; however this may not be an optimum solution. In one case of an inadvertent bug, I was not updating the weights in the final layer of the DL network. Yet, using all the other layers, the DL network was able to come up with a reasonable solution (maybe, as with random dropout, the remaining units can still learn the data!).
* Having said that, the Gradient Check method discussed and implemented in this post can be very useful in ironing out bugs.

Feel free to clone/download the code from Github at DeepLearning-Part8

Conclusion

These last couple of months, when I was writing the posts and also churning out the code in Python, R and Octave, were very hectic. There have been times when I found the implementation of some function extremely demanding and I almost felt like giving up. At other times, I have spent quite some time on an intractable DL network which would not respond to changes in hyper-parameters. All in all, it was a great learning experience. I would suggest that you start from my first post Deep Learning from first principles in Python, R and Octave-Part 1 and work your way up. Feel free to take the code apart and try out things.
That is the only way you will learn. Hope you had as much fun as I had. Stay tuned. I will be back!!!

To see all posts click Index of Posts

Deep Learning from first principles in Python, R and Octave – Part 7

Artificial Intelligence is the new electricity. – Prof Andrew Ng

Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake. We need to solve the unsupervised learning problem before we can even think of getting to true AI. – Yann LeCun, March 14, 2016 (Facebook)

Introduction

In this post, ‘Deep Learning from first principles with Python, R and Octave – Part 7’, I implement optimization methods used in Stochastic Gradient Descent (SGD) to speed up convergence. Specifically I discuss and implement the following gradient descent optimization techniques

a. Vanilla Stochastic Gradient Descent
b. Learning rate decay
c. Momentum method
d. RMSProp
e. Adaptive Moment Estimation (Adam)

This post further enhances my generic L-layer Deep Learning Network implementations in vectorized Python, R and Octave to also include these Stochastic Gradient Descent optimization techniques. You can clone/download the code from Github at DeepLearning-Part7

Incidentally, a good discussion of the various optimization methods used in Stochastic Gradient Descent can be seen at Sebastian Ruder’s blog

Note: In the vectorized Python, R and Octave implementations below, only 1024 random training samples were used. This was to reduce the computation time. You are free to use the entire data set (60000 training samples) for the computation.

This post is largely based on Prof Andrew Ng’s Deep Learning Specialization.
All the above optimization techniques for Stochastic Gradient Descent are based on the technique of the exponentially weighted average. For example, if we had some time series data $\theta_{1},\theta_{2},\theta_{3}... \theta_{t}$ then we can represent the exponentially weighted average value at time ‘t’ in terms of the previous value $v_{t-1}$ and $\theta_{t}$ as shown below

$v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}$

Here $v_{t}$ represents the average of the data set over approximately the previous $\frac {1}{1-\beta}$ data points. By choosing different values of $\beta$, we can average over a larger or smaller number of data points. We can write the equations as follows

$v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}$
$v_{t-1} = \beta v_{t-2} + (1-\beta)\theta_{t-1}$
$v_{t-2} = \beta v_{t-3} + (1-\beta)\theta_{t-2}$
and
$v_{t-k} = \beta v_{t-(k+1)} + (1-\beta)\theta_{t-k}$

By substitution we have
$v_{t} = (1-\beta)\theta_{t} + \beta v_{t-1}$
$v_{t} = (1-\beta)\theta_{t} + \beta (1-\beta)\theta_{t-1} + \beta^{2} v_{t-2}$
$v_{t} = (1-\beta)\theta_{t} + \beta (1-\beta)\theta_{t-1} + \beta^{2} (1-\beta)\theta_{t-2} + \beta^{3} v_{t-3}$

Hence it can be seen that $v_{t}$ is a weighted sum over the previous values $\theta_{k}$, with exponentially decaying weights.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python – Machine Learning in stereo” available in Amazon in paperback($9.99) and Kindle($6.99) versions.
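As an aside, the exponentially weighted average described earlier can be illustrated with a minimal NumPy sketch of my own (the `ewa` function below is an illustration, not code from the posts):

```python
import numpy as np

def ewa(theta, beta=0.9):
    """Exponentially weighted average: v_t = beta*v_{t-1} + (1-beta)*theta_t."""
    v = np.zeros_like(theta, dtype=float)
    prev = 0.0
    for t, x in enumerate(theta):
        prev = beta * prev + (1 - beta) * x
        v[t] = prev
    return v

# With beta=0.9 this averages over roughly 1/(1-beta) = 10 previous points
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(ewa(data, beta=0.9))
```

Note that with `prev` initialized to 0, the early values of `v` are biased towards zero; Adam, discussed later, applies a bias correction for exactly this reason.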
This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

1.1a. Stochastic Gradient Descent (Vanilla) – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
# Read the training data
training=list(read(dataset='training',path=".\\mnist"))
test=list(read(dataset='testing',path=".\\mnist"))
lbls=[]
pxls=[]
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
# Create a list of 1024 random numbers
permutation = list(np.random.permutation(2**10))
# Subset 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
# Set the layer dimensions
layersDimensions=[784, 15,9,10]
# Perform SGD with regular gradient descent
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
                   outputActivationFunc="softmax", learningRate = 0.01,
                   optimizer="gd", mini_batch_size =512, num_epochs = 1000,
                   print_cost = True, figure="fig1.png")

1.1b. Stochastic Gradient Descent (Vanilla) – R

source("mnist.R")
source("DLfunctions7.R")
#Load and read MNIST data
load_mnist()
x <- t(train$x)
X <- x[,1:60000]
y <- train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)
# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
# Set layer dimensions
layersDimensions=c(784, 15,9, 10)
# Perform SGD with regular gradient descent
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
                   hiddenActivationFunc='tanh',
                   outputActivationFunc="softmax",
                   learningRate = 0.05,
                   optimizer="gd",
                   mini_batch_size = 512,
                   num_epochs = 5000,
                   print_cost = TRUE)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs no of epochs") + xlab("No of epochs") + ylab("Cost")

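All the runs in this post draw mini-batches (mini_batch_size = 512) from a shuffled training set; the actual helper presumably lives in the DLfunctions7 sources. As a minimal Python sketch of my own showing the idea:

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=512, seed=0):
    """Shuffle the (features x samples) data and split it into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of samples
    permutation = rng.permutation(m)
    X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]
    batches = []
    for k in range(0, m, mini_batch_size):
        batches.append((X_shuf[:, k:k + mini_batch_size],
                        Y_shuf[:, k:k + mini_batch_size]))
    return batches

# 1024 MNIST-sized samples split into two mini-batches of 512
X = np.random.rand(784, 1024)
Y = np.random.randint(0, 10, (1, 1024))
batches = random_mini_batches(X, Y, 512)
print(len(batches))  # 2
```

Each epoch then loops over these batches, performing one forward/backward pass and one parameter update per batch.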
1.1c. Stochastic Gradient Descent (Vanilla) – Octave

source("DL7functions.m")
#Create a random permutation of 1024
permutation = randperm(1024);
disp(length(permutation));

# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);

# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with regular gradient descent
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.005,
lrDecay=true,
decayRate=1,
lambd=0,
keep_prob=1,
optimizer="gd",
beta=0.9,
beta1=0.9,
beta2=0.999,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000);

plotCostVsEpochs(5000,costs);


2.1. Stochastic Gradient Descent with Learning rate decay

Since in Stochastic Gradient Descent we use slightly different samples with each epoch, the gradient descent algorithm oscillates across the ravines and wanders around the minima when a fixed learning rate is used. In the technique of ‘learning rate decay’ the learning rate is slowly decreased with the number of epochs and becomes smaller and smaller, so that gradient descent can take smaller steps towards the minima.

There are several techniques employed in learning rate decay

a) Exponential decay: $\alpha = decayRate^{epochNum} *\alpha_{0}$
b) 1/t decay : $\alpha = \frac{\alpha_{0}}{1 + decayRate*epochNum}$
c) $\alpha = \frac {decayRate}{\sqrt(epochNum)}*\alpha_{0}$

In my implementation I have used the ‘exponential decay’. The code snippet for Python is shown below

if lrDecay == True:
    learningRate = np.power(decayRate,(num_epochs/1000)) * learningRate

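As an illustration (a minimal Python sketch of my own, separate from the posts’ code), the exponential decay schedule $\alpha = decayRate^{epochNum} * \alpha_{0}$ can be computed per epoch as:

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Exponential learning rate decay: alpha = decay_rate**epoch_num * alpha0."""
    return (decay_rate ** epoch_num) * alpha0

# The learning rate shrinks smoothly with the epoch number
for epoch in [0, 1000, 2000]:
    print(epoch, decayed_learning_rate(0.01, 0.9999, epoch))
```

With decayRate close to 1 (e.g. 0.9999, as used below), the decay is gentle enough that gradient descent still makes progress in early epochs while taking smaller steps near the minima.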

2.1a. Stochastic Gradient Descent with Learning rate decay – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets

lbls=[]
pxls=[]
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T

# Create a list of 1024 random numbers
permutation = list(np.random.permutation(2**10))
# Subset 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
# Set layer dimensions
layersDimensions=[784, 15,9,10]
# Perform SGD with learning rate decay
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.01 , lrDecay=True, decayRate=0.9999,
optimizer="gd",
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig2.png")

2.1b. Stochastic Gradient Descent with Learning rate decay – R

source("mnist.R")
source("DLfunctions7.R")
x <- t(train$x)
X <- x[,1:60000]
y <- train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)

# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
# Set layer dimensions
layersDimensions=c(784, 15,9, 10)
# Perform SGD with Learning rate decay
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.05,
lrDecay=TRUE,
decayRate=0.9999,
optimizer="gd",
mini_batch_size = 512,
num_epochs = 5000,
print_cost = TRUE)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
    ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

2.1c. Stochastic Gradient Descent with Learning rate decay – Octave

source("DL7functions.m")
#Load and read MNIST
load('./mnist/mnist.txt.gz');
#Create a random permutation of 1024
permutation = randperm(1024);
disp(length(permutation));
# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with Learning rate decay
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
    hiddenActivationFunc='relu',
    outputActivationFunc="softmax",
    learningRate = 0.01,
    lrDecay=true,
    decayRate=0.999,
    lambd=0,
    keep_prob=1,
    optimizer="gd",
    beta=0.9,
    beta1=0.9,
    beta2=0.999,
    epsilon=10^-8,
    mini_batch_size = 512,
    num_epochs = 5000);
plotCostVsEpochs(5000,costs)

3.1. Stochastic Gradient Descent with Momentum

Stochastic Gradient Descent with Momentum uses the exponentially weighted average method discussed above and, loosely speaking, moves faster along the ravine than across it.
The equations are

$v_{dW}^l = \beta v_{dW}^l + (1-\beta)dW^{l}$
$v_{db}^l = \beta v_{db}^l + (1-\beta)db^{l}$
$W^{l} = W^{l} - \alpha v_{dW}^l$
$b^{l} = b^{l} - \alpha v_{db}^l$

where $v_{dW}$ and $v_{db}$ are the momentum terms, which are exponentially weighted averages of the corresponding gradients ‘dW’ and ‘db’ at layer ‘l’

The code snippet for Stochastic Gradient Descent with momentum in R is shown below

# Perform Gradient Descent with momentum
# Input : Weights and biases
#       : beta
#       : gradients
#       : learning rate
#       : outputActivationFunc - Activation function at output layer sigmoid/softmax
# output : Updated weights after 1 iteration
gradientDescentWithMomentum <- function(parameters, gradients, v, beta, learningRate,
                                        outputActivationFunc="sigmoid"){
    L = length(parameters)/2 # number of layers in the neural network
    # Update rule for each parameter. Use a for loop.
    for(l in 1:(L-1)){
        # Compute velocities
        # v['dWk'] = beta *v['dWk'] + (1-beta)*dWk
        v[[paste("dW",l, sep="")]] = beta*v[[paste("dW",l, sep="")]] +
            (1-beta) * gradients[[paste('dW',l,sep="")]]
        v[[paste("db",l, sep="")]] = beta*v[[paste("db",l, sep="")]] +
            (1-beta) * gradients[[paste('db',l,sep="")]]
        parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] -
            learningRate* v[[paste("dW",l, sep="")]]
        parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] -
            learningRate* v[[paste("db",l, sep="")]]
    }
    # Compute for the Lth layer
    if(outputActivationFunc=="sigmoid"){
        v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] +
            (1-beta) * gradients[[paste('dW',L,sep="")]]
        v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] +
            (1-beta) * gradients[[paste('db',L,sep="")]]
        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate* v[[paste("dW",L, sep="")]]
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate* v[[paste("db",L, sep="")]]
    }else if (outputActivationFunc=="softmax"){
        v[[paste("dW",L, sep="")]] =
            beta*v[[paste("dW",L, sep="")]] + (1-beta) * t(gradients[[paste('dW',L,sep="")]])
        v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] +
            (1-beta) * t(gradients[[paste('db',L,sep="")]])
        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate* v[[paste("dW",L, sep="")]]
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate* v[[paste("db",L, sep="")]]
    }
    return(parameters)
}

3.1a. Stochastic Gradient Descent with Momentum – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
# Read and load data
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
training=list(read(dataset='training',path=".\\mnist"))
test=list(read(dataset='testing',path=".\\mnist"))
lbls=[]
pxls=[]
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
# Create a list of 1024 random numbers
permutation = list(np.random.permutation(2**10))
# Subset 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
layersDimensions=[784, 15,9,10]
# Perform SGD with momentum
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
                   outputActivationFunc="softmax", learningRate = 0.01,
                   optimizer="momentum", beta=0.9,
                   mini_batch_size =512, num_epochs = 1000, print_cost = True, figure="fig3.png")

3.1b. Stochastic Gradient Descent with Momentum – R

source("mnist.R")
source("DLfunctions7.R")
load_mnist()
x <- t(train$x)
X <- x[,1:60000]
y <- train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)
# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
# Perform SGD with momentum
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
                   hiddenActivationFunc='tanh',
                   outputActivationFunc="softmax",
                   learningRate = 0.05,
                   optimizer="momentum",
                   beta=0.9,
                   mini_batch_size = 512,
                   num_epochs = 5000,
                   print_cost = TRUE)

#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

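To make the momentum update concrete, here is a minimal NumPy sketch of my own (an illustration of one momentum step for a single weight matrix, not the posts’ `L_Layer_DeepModel_SGD` code):

```python
import numpy as np

def momentum_step(W, dW, vdW, beta=0.9, learning_rate=0.01):
    """One momentum update: v = beta*v + (1-beta)*dW; W = W - alpha*v."""
    vdW = beta * vdW + (1 - beta) * dW
    W = W - learning_rate * vdW
    return W, vdW

W = np.ones((3, 2))             # weights of one layer
vdW = np.zeros_like(W)          # velocity, initialized to zero
dW = np.full((3, 2), 0.5)       # gradient from backprop
W, vdW = momentum_step(W, dW, vdW)
print(vdW[0, 0])                # velocity after one step, ~0.05
```

Because the velocity averages successive gradients, components that oscillate across the ravine largely cancel out, while the component pointing along the ravine accumulates.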
3.1c. Stochastic Gradient Descent with Momentum- Octave

source("DL7functions.m")
#Create a random permutation of 1024
permutation = randperm(1024);
disp(length(permutation));

# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);

# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with Momentum
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.01,
lrDecay=false,
decayRate=1,
lambd=0,
keep_prob=1,
optimizer="momentum",
beta=0.9,
beta1=0.9,
beta2=0.999,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000);

plotCostVsEpochs(5000,costs)


4.1. Stochastic Gradient Descent with RMSProp

Stochastic Gradient Descent with RMSProp tries to move faster towards the minima while dampening the oscillations across the ravine.
The equations are

$s_{dW}^l = \beta_{1} s_{dW}^l + (1-\beta_{1})(dW^{l})^{2}$
$s_{db}^l = \beta_{1} s_{db}^l + (1-\beta_{1})(db^{l})^{2}$
$W^{l} = W^{l} - \frac {\alpha\, dW^{l}}{\sqrt {s_{dW}^l + \epsilon}}$
$b^{l} = b^{l} - \frac {\alpha\, db^{l}}{\sqrt {s_{db}^l + \epsilon}}$
where $s_{dW}$ and $s_{db}$ are the RMSProp terms, which are exponentially weighted averages of the squares of the corresponding gradients ‘dW’ and ‘db’ at layer ‘l’

The code snippet in Octave is shown below

# Update parameters with RMSProp
# Input : weights, biases
#       : sdW, sdB
#       : gradsDW, gradsDB
#       : learningRate, epsilon
# output : Updated weights and biases
L = size(weights)(2); # number of layers in the neural network
# Update rule for each parameter
for l=1:(L-1)
    weights{l} = weights{l} - learningRate* gradsDW{l} ./ sqrt(sdW{l} + epsilon);
    biases{l}  = biases{l}  - learningRate* gradsDB{l} ./ sqrt(sdB{l} + epsilon);
endfor

if (strcmp(outputActivationFunc,"sigmoid"))
    weights{L} = weights{L} - learningRate* gradsDW{L} ./ sqrt(sdW{L} + epsilon);
    biases{L}  = biases{L}  - learningRate* gradsDB{L} ./ sqrt(sdB{L} + epsilon);
elseif (strcmp(outputActivationFunc,"softmax"))
    weights{L} = weights{L} - learningRate* gradsDW{L}' ./ sqrt(sdW{L} + epsilon);
    biases{L}  = biases{L}  - learningRate* gradsDB{L}' ./ sqrt(sdB{L} + epsilon);
endif

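As a minimal NumPy sketch of my own (mirroring the RMSProp equations above for a single weight matrix, not the posts’ actual implementation):

```python
import numpy as np

def rmsprop_step(W, dW, sdW, beta=0.9, learning_rate=0.005, epsilon=1e-8):
    """One RMSProp update: s = beta*s + (1-beta)*dW**2; W -= alpha*dW/sqrt(s+eps)."""
    sdW = beta * sdW + (1 - beta) * dW ** 2
    W = W - learning_rate * dW / np.sqrt(sdW + epsilon)
    return W, sdW

W = np.zeros((2, 2))            # weights of one layer
sdW = np.zeros_like(W)          # running average of squared gradients
dW = np.full((2, 2), 0.5)       # gradient from backprop
W, sdW = rmsprop_step(W, dW, sdW)
```

Dividing by the root of the averaged squared gradient shrinks the step for parameters with large, oscillating gradients and enlarges it for parameters with small, consistent gradients, which is what dampens oscillation across the ravine.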

4.1a. Stochastic Gradient Descent with RMSProp – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets

lbls=[]
pxls=[]
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T

print("X1=",X1.shape)
print("y1=",Y1.shape)

# Create a list of 1024 random numbers
permutation = list(np.random.permutation(2**10))
# Subset 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))

layersDimensions=[784, 15,9,10]
# Use SGD with RMSProp
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="rmsprop", beta1=0.7, epsilon=1e-8,
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig4.png")

4.1b. Stochastic Gradient Descent with RMSProp – R

source("mnist.R")
source("DLfunctions7.R")
x <- t(train$x)
X <- x[,1:60000]
y <- train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)

# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
#Perform SGD with RMSProp
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.001,
optimizer="rmsprop",
beta1=0.9,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000 ,
print_cost = TRUE)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

4.1c. Stochastic Gradient Descent with RMSProp – Octave

source("DL7functions.m")
load('./mnist/mnist.txt.gz');
# Create a random permutation of 1024
permutation = randperm(1024);
# Use these 1024 samples as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with RMSProp
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.005,
lrDecay=false,
decayRate=1,
lambd=0,
keep_prob=1,
optimizer="rmsprop",
beta=0.9,
beta1=0.9,
beta2=0.999,
epsilon=1,
mini_batch_size = 512,
num_epochs = 5000);
plotCostVsEpochs(5000,costs)

5.1. Stochastic Gradient Descent with Adam

Adaptive Moment Estimation (Adam) combines momentum (the 1st moment) and RMSProp (the 2nd moment).
The equations for Adam are given below. The exponentially weighted average for the 1st moment (momentum) is
$v_{dW}^l = \beta_{1} v_{dW}^l + (1-\beta_{1})dW^{l}$
$v_{db}^l = \beta_{1} v_{db}^l + (1-\beta_{1})db^{l}$

The bias corrections for the 1st moment are
$vCorrected_{dW}^l= \frac {v_{dW}^l}{1 - \beta_{1}^{t}}$
$vCorrected_{db}^l= \frac {v_{db}^l}{1 - \beta_{1}^{t}}$

Similarly, the moving average for the 2nd moment (RMSProp) is
$s_{dW}^l = \beta_{2} s_{dW}^l + (1-\beta_{2})(dW^{l})^2$
$s_{db}^l = \beta_{2} s_{db}^l + (1-\beta_{2})(db^{l})^2$

The bias corrections for the 2nd moment are
$sCorrected_{dW}^l= \frac {s_{dW}^l}{1 - \beta_{2}^{t}}$
$sCorrected_{db}^l= \frac {s_{db}^l}{1 - \beta_{2}^{t}}$

The Adam gradient descent update is then
$W^{l} = W^{l} - \frac {\alpha\, vCorrected_{dW}^l}{\sqrt {sCorrected_{dW}^l + \epsilon}}$
$b^{l} = b^{l} - \frac {\alpha\, vCorrected_{db}^l}{\sqrt {sCorrected_{db}^l + \epsilon}}$

The code snippet of Adam in R is included below

# Perform Gradient Descent with Adam
# Input : parameters - weights and biases
#       : gradients
#       : v, s - 1st and 2nd moment estimates
#       : t - iteration count, used for bias correction
#       : beta1, beta2, epsilon
#       : learningRate
#       : outputActivationFunc - activation function at output layer sigmoid/softmax
# Output: Updated weights after 1 iteration
gradientDescentWithAdam <- function(parameters, gradients, v, s, t,
        beta1=0.9, beta2=0.999, epsilon=10^-8,
        learningRate=0.1, outputActivationFunc="sigmoid"){

  L = length(parameters)/2 # number of layers in the neural network
  v_corrected <- list()
  s_corrected <- list()
  # Update rule for each parameter. Use a for loop.
  for(l in 1:(L-1)){
    # v[['dWl']] = beta1*v[['dWl']] + (1-beta1)*dWl
    v[[paste("dW",l, sep="")]] = beta1*v[[paste("dW",l, sep="")]] + (1-beta1) * gradients[[paste('dW',l,sep="")]]
    v[[paste("db",l, sep="")]] = beta1*v[[paste("db",l, sep="")]] + (1-beta1) * gradients[[paste('db',l,sep="")]]
    # Compute bias-corrected first moment estimate
    v_corrected[[paste("dW",l, sep="")]] = v[[paste("dW",l, sep="")]]/(1-beta1^t)
    v_corrected[[paste("db",l, sep="")]] = v[[paste("db",l, sep="")]]/(1-beta1^t)
    # Element-wise squares of the gradients
    s[[paste("dW",l, sep="")]] = beta2*s[[paste("dW",l, sep="")]] + (1-beta2) * gradients[[paste('dW',l,sep="")]] * gradients[[paste('dW',l,sep="")]]
    s[[paste("db",l, sep="")]] = beta2*s[[paste("db",l, sep="")]] + (1-beta2) * gradients[[paste('db',l,sep="")]] * gradients[[paste('db',l,sep="")]]
    # Compute bias-corrected second moment estimate
    s_corrected[[paste("dW",l, sep="")]] = s[[paste("dW",l, sep="")]]/(1-beta2^t)
    s_corrected[[paste("db",l, sep="")]] = s[[paste("db",l, sep="")]]/(1-beta2^t)
    # Update parameters
    d1=sqrt(s_corrected[[paste("dW",l, sep="")]]+epsilon)
    d2=sqrt(s_corrected[[paste("db",l, sep="")]]+epsilon)
    parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] - learningRate * v_corrected[[paste("dW",l, sep="")]]/d1
    parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] - learningRate * v_corrected[[paste("db",l, sep="")]]/d2
  }
  # Compute for the Lth layer
  if(outputActivationFunc=="sigmoid"){
    v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] + (1-beta1) * gradients[[paste('dW',L,sep="")]]
    v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] + (1-beta1) * gradients[[paste('db',L,sep="")]]
    # Compute bias-corrected first moment estimate
    v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t)
    v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t)
    # Element-wise squares of the gradients
    s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] + (1-beta2) * gradients[[paste('dW',L,sep="")]] * gradients[[paste('dW',L,sep="")]]
    s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] + (1-beta2) * gradients[[paste('db',L,sep="")]] * gradients[[paste('db',L,sep="")]]
    # Compute bias-corrected second moment estimate
    s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t)
    s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t)
    # Update parameters
    d1=sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon)
    d2=sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon)
    parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] - learningRate * v_corrected[[paste("dW",L, sep="")]]/d1
    parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] - learningRate * v_corrected[[paste("db",L, sep="")]]/d2
  }else if (outputActivationFunc=="softmax"){
    v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] + (1-beta1) * t(gradients[[paste('dW',L,sep="")]])
    v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] + (1-beta1) * t(gradients[[paste('db',L,sep="")]])
    # Compute bias-corrected first moment estimate
    v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t)
    v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t)
    # Element-wise squares of the transposed gradients
    s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] + (1-beta2) * t(gradients[[paste('dW',L,sep="")]]) * t(gradients[[paste('dW',L,sep="")]])
    s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] + (1-beta2) * t(gradients[[paste('db',L,sep="")]]) * t(gradients[[paste('db',L,sep="")]])
    # Compute bias-corrected second moment estimate
    s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t)
    s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t)
    # Update parameters
    d1=sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon)
    d2=sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon)
    parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] - learningRate * v_corrected[[paste("dW",L, sep="")]]/d1
    parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] - learningRate * v_corrected[[paste("db",L, sep="")]]/d2
  }
  return(parameters)
}

5.1a. Stochastic Gradient Descent with Adam – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
training=list(read(dataset='training',path=".\\mnist"))
test=list(read(dataset='testing',path=".\\mnist"))
lbls=[]
pxls=[]
print(len(training))
#for i in range(len(training)):
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
# Create a random permutation of 1024 indices
permutation = list(np.random.permutation(2**10))
# Subset these 1024 samples from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
layersDimensions=[784, 15,9,10]
# Perform SGD with Adam optimization
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="adam", beta1=0.9, beta2=0.9, epsilon = 1e-8,
mini_batch_size =512, num_epochs = 1000, print_cost = True, figure="fig5.png")

5.1b. Stochastic Gradient Descent with Adam – R

source("mnist.R")
source("DLfunctions7.R")
load_mnist()
x <- t(train$x)
X <- x[,1:60000]
y <- train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)
# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
# Perform SGD with Adam
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.005,
optimizer="adam",
beta1=0.7,
beta2=0.9,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000 ,
print_cost = TRUE)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")
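Putting the 1st moment, 2nd moment and their bias corrections together, the Adam update can be sketched in NumPy for a single parameter matrix as follows (the helper `adam_update` and the toy quadratic are illustrative, not the post's implementation; epsilon is kept inside the square root to mirror the R code above):

```python
import numpy as np

def adam_update(W, dW, v, s, t, learning_rate=0.01,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    # 1st moment (momentum) and its bias correction
    v = beta1 * v + (1 - beta1) * dW
    v_corrected = v / (1 - beta1 ** t)
    # 2nd moment (RMSProp) and its bias correction
    s = beta2 * s + (1 - beta2) * dW ** 2
    s_corrected = s / (1 - beta2 ** t)
    # Parameter update combines both corrected moments
    W = W - learning_rate * v_corrected / np.sqrt(s_corrected + epsilon)
    return W, v, s

# Toy usage: minimize f(W) = W^2; note t starts at 1 for the bias correction
W = np.array([[5.0]])
v, s = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 301):
    dW = 2 * W
    W, v, s = adam_update(W, dW, v, s, t, learning_rate=0.1)
```

The bias corrections matter most in the first few iterations, when the moving averages are still warming up from zero and would otherwise understate the true gradient statistics.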

5.1c. Stochastic Gradient Descent with Adam – Octave

source("DL7functions.m")
load('./mnist/mnist.txt.gz');
# Create a random permutation of 1024
permutation = randperm(1024);
disp(length(permutation));

# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];

# Note the high value for epsilon.
#Otherwise GD with Adam does not seem to converge
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.1,
lrDecay=false,
decayRate=1,
lambd=0,
keep_prob=1,
optimizer="adam",
beta=0.9,
beta1=0.9,
beta2=0.9,
epsilon=100,
mini_batch_size = 512,
num_epochs = 5000);
plotCostVsEpochs(5000,costs)
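The `lrDecay` and `decayRate` arguments in the Octave calls above enable learning rate decay, where the learning rate is reduced as training progresses. A sketch of the common epoch-based schedule is below (the exact schedule used inside `L_Layer_DeepModel_SGD` is an assumption here):

```python
def decayed_learning_rate(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decay_rate * epoch): large steps early, small steps later
    return alpha0 / (1 + decay_rate * epoch)

# With alpha0 = 0.1 and decay_rate = 0.5, the rate halves by epoch 2
rates = [decayed_learning_rate(0.1, 0.5, e) for e in range(4)]
```

Decaying the learning rate lets gradient descent take large strides early and settle into the minimum later without oscillating around it.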


Conclusion: In this post I discuss and implement several Stochastic Gradient Descent optimization methods. These methods enhance my existing generic L-Layer Deep Learning Network implementation in vectorized Python, R and Octave, which I discussed in the previous post in this series on Deep Learning from first principles in Python, R and Octave. Check it out, if you haven’t already. As already mentioned, the code for this post can be cloned/forked from Github at DeepLearning-Part7

Watch this space! I’ll be back!

To see all posts click Index of posts

Deep Learning from first principles in Python, R and Octave – Part 6

“Today you are You, that is truer than true. There is no one alive who is Youer than You.”
Dr. Seuss

“Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.”
H L Mencken

Introduction

In this 6th instalment of ‘Deep Learning from first principles in Python, R and Octave’, I look at a couple of different initialization techniques used in Deep Learning, at L2 regularization, and at the ‘dropout’ method. Specifically, I implement “He initialization” and “Xavier initialization”. My earlier posts in this series on Deep Learning included

1. Part 1 – In the 1st part, I implemented logistic regression as a simple 2 layer Neural Network
2. Part 2 – In part 2, I implemented the most basic of Neural Networks, with just 1 hidden layer, and any number of activation units in that hidden layer. The implementation was in vectorized Python, R and Octave
3. Part 3 – In part 3, I derive the equations and also implement an L-Layer Deep Learning network with either the relu, tanh or sigmoid activation function in Python, R and Octave. The output activation unit was a sigmoid function for logistic classification
4. Part 4 – This part looks at multi-class classification, and I derive the Jacobian of a Softmax function and implement a simple problem to perform multi-class classification.
5. Part 5 – In the 5th part, I extend the L-Layer Deep Learning network implemented in Part 3, to include the Softmax classification. I also use this L-layer implementation to classify MNIST handwritten digits with Python, R and Octave.

The code in Python, R and Octave is essentially identical, and just takes into account some of the minor idiosyncrasies of each language. In this post, I implement different initialization techniques (random, He, Xavier), L2 regularization and finally dropout. Hence my generic L-Layer Deep Learning network includes these additional enhancements for enabling/disabling initialization methods, regularization or dropout in the algorithm. It already included sigmoid & softmax output activation for binary and multi-class classification, besides allowing relu, tanh and sigmoid activation for hidden units.

This R Markdown file and the code for Python, R and Octave can be cloned/downloaded from Github at DeepLearning-Part6

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

1. Initialization techniques

The usual initialization technique is to generate Gaussian or uniform random numbers and multiply them by a small value like 0.01. Two techniques used to speed up convergence are He initialization and Xavier initialization; both enable gradient descent to converge faster.
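The three schemes differ only in how the random weights are scaled. A minimal NumPy sketch of the idea is below (the function name `initialize_weights` is illustrative — the post's actual implementations in R and Octave appear later in this section):

```python
import numpy as np

def initialize_weights(layer_dims, init_type="default", seed=2):
    # default: small random values (scaled by 0.01)
    # He:     scale by sqrt(2 / fan_in) - suits relu activations
    # Xavier: scale by sqrt(1 / fan_in) - suits tanh/sigmoid activations
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        if init_type == "He":
            scale = np.sqrt(2.0 / fan_in)
        elif init_type == "Xavier":
            scale = np.sqrt(1.0 / fan_in)
        else:
            scale = 0.01
        # Weight matrix at layer l has shape (units in l, units in l-1)
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], fan_in)) * scale
        # Biases start at zero
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

# A 2-7-1 network, as used in the examples below
params = initialize_weights([2, 7, 1], init_type="He")
```

Scaling by the fan-in keeps the variance of each layer's pre-activations roughly constant, which is what prevents the gradients from vanishing or exploding as depth grows.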

1.1 a Default initialization – Python

This technique just initializes the weights to small random values drawn from a Gaussian or uniform distribution

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions61.py").read())

# Load the data
train_X, train_Y, test_X, test_Y = load_dataset()
# Set the layers dimensions
layersDimensions = [2,7,1]

# Train a deep learning network with random initialization
parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid",learningRate = 0.6, num_iterations = 9000, initType="default", print_cost = True,figure="fig1.png")

# Clear the plot
plt.clf()
plt.close()

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), train_X, train_Y,str(0.6),figure1="fig2.png")

1.1 b He initialization – Python

‘He’ initialization, attributed to He et al., multiplies the random weights by
$\sqrt{\frac{2}{dimension\ of\ previous\ layer}}$

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions61.py").read())

train_X, train_Y, test_X, test_Y = load_dataset()
# Set the layers dimensions
layersDimensions = [2,7,1]

# Train a deep learning network with He  initialization
parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid", learningRate =0.6,    num_iterations = 10000,initType="He",print_cost = True,                           figure="fig3.png")

plt.clf()
plt.close()
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), train_X, train_Y,str(0.6),figure1="fig4.png")

1.1 c Xavier initialization – Python

Xavier initialization multiplies the random weights by
$\sqrt{\frac{1}{dimension\ of\ previous\ layer}}$

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions61.py").read())

train_X, train_Y, test_X, test_Y = load_dataset()
# Set the layers dimensions
layersDimensions = [2,7,1]

# Train a L layer Deep Learning network
parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid",
learningRate = 0.6,num_iterations = 10000, initType="Xavier",print_cost = True,
figure="fig5.png")

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), train_X, train_Y,str(0.6),figure1="fig6.png")

1.2a Default initialization – R

source("DLfunctions61.R")
# Load the data
z <- as.matrix(read.csv("circles.csv",header=FALSE))
x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
#Set the layer dimensions
layersDimensions = c(2,11,1)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
numIterations = 8000,
initType="default",
print_cost = TRUE)
#Plot the cost vs iterations
iterations <- seq(0,8000,1000)
costs=retvals$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs iterations") + xlab("No of iterations") + ylab("Cost")
# Plot the decision boundary
plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",lr=0.5)

1.2b He initialization – R

The code for ‘He’ initialization in R is included below

# He initialization model for L layers
# Input : List of units in each layer
# Returns: Initial weights and biases matrices for all layers
# He initialization multiplies the random numbers with sqrt(2/layersDimensions[previous layer])
HeInitializeDeepModel <- function(layerDimensions){
  set.seed(2)
  # Initialize empty list
  layerParams <- list()
  # Note the weight matrix at layer 'l' is a matrix of size (l, l-1)
  # The bias is a vector of size (l, 1)
  # Loop through the layer dimensions from 1..L
  # Indices in R start from 1
  for(l in 2:length(layersDimensions)){
    # Create random numbers of size l x l-1
    w=rnorm(layersDimensions[l]*layersDimensions[l-1])
    # Create a weight matrix of size l x l-1 with these initial weights and
    # add to list W1, W2...WL
    # He initialization - multiply by sqrt(2/layersDimensions[previous layer])
    layerParams[[paste('W',l-1,sep="")]] = matrix(w,nrow=layersDimensions[l],
                      ncol=layersDimensions[l-1])*sqrt(2/layersDimensions[l-1])
    layerParams[[paste('b',l-1,sep="")]] = matrix(rep(0,layersDimensions[l]),
                      nrow=layersDimensions[l],ncol=1)
  }
  return(layerParams)
}

The code in R below uses He initialization to learn the data

source("DLfunctions61.R")
# Load the data
z <- as.matrix(read.csv("circles.csv",header=FALSE))
x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
# Set the layer dimensions
layersDimensions = c(2,11,1)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
numIterations = 9000,
initType="He",
print_cost = TRUE)
#Plot the cost vs iterations
iterations <- seq(0,9000,1000)
costs=retvals$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs iterations") + xlab("No of iterations") + ylab("Cost")

# Plot the decision boundary
plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",lr=0.5)

1.2c Xavier initialization – R

## Xav initialization
# Set the layer dimensions
layersDimensions = c(2,11,1)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
numIterations = 9000,
initType="Xav",
print_cost = TRUE)
#Plot the cost vs iterations
iterations <- seq(0,9000,1000)
costs=retvals$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs iterations") + xlab("No of iterations") + ylab("Cost")
# Plot the decision boundary
plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",lr=0.5)

1.3a Default initialization – Octave

source("DL61functions.m")
# Read the data
data=csvread("circles.csv");
X=data(:,1:2);
Y=data(:,3);
# Set the layer dimensions
layersDimensions = [2 11 1]; #tanh=-0.5(ok), #relu=0.1 best!
# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0,
keep_prob=1,
numIterations = 10000,
initType="default");
# Plot cost vs iterations
plotCostVsIterations(10000,costs)
#Plot decision boundary
plotDecisionBoundary(data,weights, biases,keep_prob=1, hiddenActivationFunc="relu")

1.3b He initialization – Octave

source("DL61functions.m")
#Load data
data=csvread("circles.csv");
X=data(:,1:2);
Y=data(:,3);
# Set the layer dimensions
layersDimensions = [2 11 1]; #tanh=-0.5(ok), #relu=0.1 best!
# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0,
keep_prob=1,
numIterations = 8000,
initType="He");
plotCostVsIterations(8000,costs)
#Plot decision boundary
plotDecisionBoundary(data,weights, biases,keep_prob=1,hiddenActivationFunc="relu")

1.3c Xavier initialization – Octave

The code snippet for Xavier initialization in Octave is shown below

# Xavier initialization for L layers
# Input : List of units in each layer
# Returns: Initial weights and biases matrices for all layers
function [W b] = XavInitializeDeepModel(layerDimensions)
  rand ("seed", 3);
  # Note the weight matrix at layer 'l' is a matrix of size (l, l-1)
  # The bias is a vector of size (l, 1)
  # Loop through the layer dimensions from 1..L
  # Create cell arrays for weights and biases
  for l =2:size(layerDimensions)(2)
    # Xavier initialization - multiply by sqrt(1/layerDimensions(previous layer))
    W{l-1} = rand(layerDimensions(l),layerDimensions(l-1))* sqrt(1/layerDimensions(l-1));
    b{l-1} = zeros(layerDimensions(l),1);
  endfor
end

The Octave code below uses Xavier initialization

source("DL61functions.m")
#Load data
data=csvread("circles.csv");
X=data(:,1:2);
Y=data(:,3);
#Set layer dimensions
layersDimensions = [2 11 1]; #tanh=-0.5(ok), #relu=0.1 best!
# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0,
keep_prob=1,
numIterations = 8000,
initType="Xav");
plotCostVsIterations(8000,costs)
plotDecisionBoundary(data,weights, biases,keep_prob=1,hiddenActivationFunc="relu")

2.1a Regularization: Circles data – Python

The cross-entropy cost for logistic classification is given by
$J = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}log((a^{L})^{(i)}) + (1-y^{(i)})log(1-(a^{L})^{(i)})]$
The regularized L2 cost adds the sum of the squares of all weights
$J = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}log((a^{L})^{(i)}) + (1-y^{(i)})log(1-(a^{L})^{(i)})] + \frac{\lambda}{2m}\sum_{l}\sum_{k}\sum_{j}(W_{kj}^{l})^{2}$

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions61.py").read())
#Load the data
train_X, train_Y, test_X, test_Y = load_dataset()
# Set the layers dimensions
layersDimensions = [2,7,1]
# Train a deep learning network
parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",learningRate = 0.6, lambd=0.1, num_iterations = 9000,
initType="default", print_cost = True,figure="fig7.png")
# Clear the plot
plt.clf()
plt.close()
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), train_X, train_Y,str(0.6),figure1="fig8.png")

2.1b Regularization: Spiral data – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions61.py").read())
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0,1,N) # radius
    t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j
# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.clf()
plt.close()
#Set layer dimensions
layersDimensions = [2,100,3]
y1=y.reshape(-1,1).T
# Train a deep learning network
parameters = L_Layer_DeepModel(X.T, y1, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax", learningRate = 1,lambd=1e-3, num_iterations = 5000,
print_cost = True,figure="fig9.png")
plt.clf()
plt.close()
W1=parameters['W1']
b1=parameters['b1']
W2=parameters['W2']
b2=parameters['b2']
#Plot the decision boundary
plot_decision_boundary1(X, y1,W1,b1,W2,b2,figure2="fig10.png")

2.2a Regularization: Circles data – R

source("DLfunctions61.R")
#Load data
z <- as.matrix(read.csv("circles.csv",header=FALSE))
x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
#Set layer dimensions
layersDimensions = c(2,11,1)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0.1,
numIterations = 9000,
initType="default",
print_cost = TRUE)
#Plot the cost vs iterations
iterations <- seq(0,9000,1000)
costs=retvals$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs iterations") + xlab("No of iterations") + ylab("Cost")

# Plot the decision boundary
plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",lr=0.5)
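The L2-regularized cost amounts to adding (λ/2m) times the sum of squared weights to the cross-entropy loss. A minimal NumPy sketch for the sigmoid/binary case is below (the helper name `l2_regularized_cost` is mine, not the post's function; the matching change in backpropagation adds (λ/m)·W to each dW):

```python
import numpy as np

def l2_regularized_cost(AL, Y, weights, lambd):
    # AL: output-layer sigmoid activations, shape (1, m); Y: labels, shape (1, m)
    m = Y.shape[1]
    # Plain cross-entropy cost
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    # L2 penalty: (lambda / 2m) * sum of squares of every weight in every layer
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy + l2_penalty

# Toy example: two samples, one hypothetical weight matrix
AL = np.array([[0.5, 0.5]])
Y = np.array([[1.0, 0.0]])
weights = [np.array([[1.0, 1.0]])]
cost = l2_regularized_cost(AL, Y, weights, lambd=0.1)
```

Since the penalty grows with the magnitude of the weights, minimizing the combined cost pushes the network toward smaller weights and hence smoother, less over-fitted decision boundaries.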

2.2b Regularization:Spiral data – R

# Read the spiral dataset
source("DLfunctions61.R")
# Load data
Z <- as.matrix(read.csv("spiral.csv",header=FALSE))
# Setup the data
X <- Z[,1:2]
y <- Z[,3]
X <- t(X)
Y <- t(y)
layersDimensions = c(2, 100, 3)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.5,
lambd=0.01,
numIterations = 9000,
print_cost = TRUE)
parameters<-retvals$parameters
#Plot decision boundary
plotDecisionBoundary1(Z,parameters)

2.3a Regularization: Circles data – Octave

source("DL61functions.m")
#Load data
data=csvread("circles.csv");
X=data(:,1:2);
Y=data(:,3);
layersDimensions = [2 11 1]; #tanh=-0.5(ok), #relu=0.1 best!
# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0.2,
keep_prob=1,
numIterations = 8000,
initType="default");
plotCostVsIterations(8000,costs)
#Plot decision boundary
plotDecisionBoundary(data,weights, biases,keep_prob=1,hiddenActivationFunc="relu")

2.3b Regularization: Spiral data – Octave

source("DL61functions.m")
data=csvread("spiral.csv");
# Setup the data
X=data(:,1:2);
Y=data(:,3);
layersDimensions = [2 100 3]
# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.6,
lambd=0.2,
keep_prob=1,
numIterations = 10000);
plotCostVsIterations(10000,costs)
#Plot decision boundary
plotDecisionBoundary1(data,weights, biases,keep_prob=1,hiddenActivationFunc="relu")

3.1a Dropout: Circles data – Python

The ‘dropout’ regularization technique was used with great effectiveness to prevent over-fitting by Alex Krizhevsky, Ilya Sutskever and Prof Geoffrey E. Hinton in Imagenet classification with Deep Convolutional Neural Networks. The technique of dropout works by dropping a random set of activation units in each hidden layer, based on a ‘keep_prob’ criterion, in the forward propagation cycle. Here is the code for Octave. A ‘dropoutMat’ is created for each layer, which specifies which units to drop. Note: the same ‘dropoutMat’ has to be used while computing the gradients in the backward propagation cycle. Hence the dropout matrices are stored in a cell array.

for l =1:L-1
    ...
    D=rand(size(A)(1),size(A)(2));
    D = (D < keep_prob) ;
    # Zero out some hidden units
    A= A .* D;
    # Divide by keep_prob to keep the expected value of A the same
    A = A ./ keep_prob;
    # Store D in a dropoutMat cell array
    dropoutMat{l}=D;
    ...
endfor

In the backward propagation cycle we have

for l =(L-1):-1:1
    ...
    D = dropoutMat{l};
    # Zero out the dAl based on the same dropout matrix
    dAl= dAl .* D;
    # Divide by keep_prob to maintain the expected value
    dAl = dAl ./ keep_prob;
    ...
endfor

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions61.py").read())
#Load the data
train_X, train_Y, test_X, test_Y = load_dataset()
# Set the layers dimensions
layersDimensions = [2,7,1]
# Train a deep learning network
parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",learningRate = 0.6, keep_prob=0.7, num_iterations = 9000,
initType="default", print_cost = True,figure="fig11.png")
# Clear the plot
plt.clf()
plt.close()
# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T,keep_prob=0.7), train_X, train_Y,str(0.6),figure1="fig12.png")

3.1b Dropout: Spiral data – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions61.py").read())
# Create an input data set - Taken from CS231n Convolutional Neural Networks,
# http://cs231n.github.io/neural-networks-case-study/
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0,1,N) # radius
    t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j
# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.clf()
plt.close()
layersDimensions = [2,100,3]
y1=y.reshape(-1,1).T
# Train a deep learning network
parameters = L_Layer_DeepModel(X.T, y1, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax", learningRate = 1,keep_prob=0.9, num_iterations = 5000,
print_cost = True,figure="fig13.png")
plt.clf()
plt.close()
W1=parameters['W1']
b1=parameters['b1']
W2=parameters['W2']
b2=parameters['b2']
#Plot decision boundary
plot_decision_boundary1(X, y1,W1,b1,W2,b2,figure2="fig14.png")

3.2a Dropout: Circles data – R

source("DLfunctions61.R")
#Load data
z <- as.matrix(read.csv("circles.csv",header=FALSE))
x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
layersDimensions = c(2,11,1)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
keep_prob=0.8,
numIterations = 9000,
initType="default",
print_cost = TRUE)
# Plot the decision boundary
plotDecisionBoundary(z,retvals,keep_prob=0.6, hiddenActivationFunc="relu",lr=0.5)

3.2b Dropout: Spiral data – R

# Read the spiral dataset
source("DLfunctions61.R")
# Load data
Z <- as.matrix(read.csv("spiral.csv",header=FALSE))
# Setup the data
X <- Z[,1:2]
y <- Z[,3]
X <- t(X)
Y <- t(y)
layersDimensions = c(2, 100, 3)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.1,
keep_prob=0.90,
numIterations = 9000,
print_cost = TRUE)
parameters<-retvals$parameters
#Plot decision boundary
plotDecisionBoundary1(Z,parameters)
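The inverted-dropout steps from the Octave snippets above translate directly into NumPy. A minimal sketch is below (the helper names are mine; the essential points are that the activations are rescaled by keep_prob and that the backward pass reuses the same mask):

```python
import numpy as np

rng = np.random.default_rng(7)

def dropout_forward(A, keep_prob):
    # Randomly zero out units with probability 1 - keep_prob
    D = rng.random(A.shape) < keep_prob
    # Rescale so the expected value of the activations is unchanged
    A = (A * D) / keep_prob
    return A, D

def dropout_backward(dA, D, keep_prob):
    # Apply the SAME mask used in the forward pass to the gradients
    return (dA * D) / keep_prob

# Toy usage on a 3x4 activation matrix of ones
A = np.ones((3, 4))
A_dropped, D = dropout_forward(A, keep_prob=0.8)
dA = dropout_backward(np.ones((3, 4)), D, keep_prob=0.8)
```

At prediction time dropout is switched off (keep_prob = 1); the rescaling during training is what makes this possible without adjusting the weights afterwards.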

3.3a Dropout: Circles data – Octave

source("DL61functions.m")
data=csvread("circles.csv");

X=data(:,1:2);
Y=data(:,3);
layersDimensions = [2 11  1]; #tanh=-0.5(ok), #relu=0.1 best!

# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0,
keep_prob=0.8,
numIterations = 10000,
initType="default");
plotCostVsIterations(10000,costs)
#Plot decision boundary
plotDecisionBoundary1(data,weights, biases,keep_prob=1, hiddenActivationFunc="relu")


3.3b Dropout  Spiral data – Octave

source("DL61functions.m")
# Load data
data=csvread("spiral.csv");

# Setup the data
X=data(:,1:2);
Y=data(:,3);

# Set the layer dimensions (2 input features, 100 hidden units, 3 classes)
layersDimensions = [2 100 3];
# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.1,
lambd=0,
keep_prob=0.8,
numIterations = 10000);

plotCostVsIterations(10000,costs)
#Plot decision boundary
plotDecisionBoundary1(data,weights, biases,keep_prob=1, hiddenActivationFunc="relu")


Note: The Python, R and Octave code can be cloned/downloaded from Github at DeepLearning-Part6
Conclusion
This post further enhances my earlier generic L-Layer Deep Learning network implementation to include options for initialization techniques (default, He, Xavier), L2 regularization and dropout regularization.

To see all posts click Index of posts