# Understanding Neural Style Transfer with Tensorflow and Keras

Neural Style Transfer (NST)  is a fascinating area of Deep Learning and Convolutional Neural Networks. NST is an interesting technique, in which the style from an image, known as the ‘style image’ is transferred to another image ‘content image’ and we get a third a image which is a generated image which has the content of the original image and the style of another image.

NST can be used to reimagine how famous painters like Van Gogh, Claude Monet or a Picasso would have visualised a scenery or architecture. NST uses Convolutional Neural Networks (CNNs) to achieve this artistic style transfer from one image to another. NST was originally implemented by Gati et al., in their paper Neural Algorithm of Artistic Style. Convolutional Neural Networks have been very successful in image classification image recognition et cetera. CNN networks have also been have also generated very interesting pictures using Neural Style Transfer which will be shown in this post. An interesting aspect of CNN’s is that the first couple of layers in the CNN capture basic features of the image like edges and  pixel values. But as we go deeper into the CNN, the network captures higher level features of the input image.

To get started with Neural Style transfer  we will be using the VGG19 pre-trained network. The VGG19 CNN is a compact pre-trained your network which can be used for performing the NST. However, we could have also used Resnet or InceptionV3 networks for this purpose but these are very large networks. The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

What needs to be done to transfer the style from one of the image to another image. This brings us to the question – What is ‘style’? What is it that distinguishes Van Gogh’s painting or Picasso’s cubist art. Convolutional Neural Networks capture basic features in the lower layers and much more complex features in the deeper layers.  Style can be computed by taking the correlation of the feature maps in a layer L. This is my interpretation of how style is captured.  Since style  is intrinsic to  the image, it  implies that the style feature would exist across all the filters in a layer. Hence, to pick up this style we would need to get the correlation of the filters across channels of a lawyer. This is computed mathematically, using the Gram matrix which calculates the correlation of the activation of a the filter by the style image and generated image

To transfer the style from one image to the content image we need to do two parallel operations while doing forward propagation
– Compute the content loss between the source image and the generated image
– Compute the style loss between the style image and the generated image
– Finally we need to compute the total loss

In order to get transfer the style from the ‘style’ image to the ‘content ‘image resulting in a  ‘generated’  image  the total loss has to be minimised. Therefore backward propagation with gradient descent  is done to minimise the total loss comprising of the content and style loss.

Initially we make the Generated Image ‘G’ the same as the source image ‘S’

The content loss at layer ‘l’

$L_{content} = 1/2 \sum_{i}^{j} ( F^{l}_{i,j} - P^{l}_{i,j})^{2}$

where $F^{l}_{i,j}$ and $P^{l}_{i,j}$ represent the activations at layer ‘l’ in a filter i, at position ‘j’. The intuition is that the activations will be same for similar source and generated image. We need to minimise the content loss so that the generated stylized image is as close to the original image as possible. An intermediate layer of VGG19 block5_conv2 is used

The Style layers that are are used are

style_layers = [‘block1_conv1’,
‘block2_conv1’,
‘block3_conv1’,
‘block4_conv1’,
‘block5_conv1’]
To compute the Style Loss the Gram matrix needs to be computed. The Gram Matrix is computed by unrolling the filters as shown below (source: Convolutional Neural Networks by Prof Andrew Ng, Coursera). The result is a matrix of size $n_{c}$ x $n_{c}$ where $n_{c}$ is the number of channels
The above diagram shows the filters of height $n_{H}$ and width $n_{W}$ with $n_{C}$ channels
The contribution of layer ‘l’ to style loss is given by
$L^{'}_{style} = \frac{\sum_{i}^{j} (G^{2}_{i,j} - A^l{i,j})^2}{4N^{2}_{l}M^{2}_{l}}$
where $G_{i,j}$  and $A_{i,j}$ are the Gram matrices of the style and generated images respectively. By minimising the distance in the gram matrices of the style and generated image we can ensure that generated image is a stylized version of the original image similar to the style image
The total loss is given by
$L_{total} = \alpha L_{content} + \beta L_{style}$
Back propagation with gradient descent works to minimise the content loss between the source and generated image, while the style loss tries to minimise the discrepancies in the style of the style image and generated image. Running through forward and backpropagation through several epochs successfully transfers the style from the style image to the source image.
You can check the Notebook at Neural Style Transfer

Note: The code in this notebook is largely based on the Neural Style Transfer tutorial from Tensorflow, though I may have taken some changes from other blogs. I also made a few changes to the code in this tutorial, like removing the scaling factor, or the class definition (Personally, I belong to the old school (C language) and am not much in love with the ‘self.”..All references are included below

Note: Here is a interesting thought. Could we do a Neural Style Transfer in music? Imagine Carlos Santana playing ‘Hotel California’ or Brian May style in ‘Another brick in the wall’. While our first reaction would be that it may not sound good as we are used to style of these songs, we may be surprised by a possible style transfer. This is definitely music to the ears!

Here are few runs from this

## A) Run 1

1. Neural Style Transfer – a) Content Image – My portrait.  b) Style Image – Wassily Kadinsky Oil on canvas, 1913, Vassily Kadinsky’s composition

2. Result of Neural Style Transfer

2) Run 2

a) Content Image – Portrait of my parents b) Style Image –  Vincent Van Gogh’s ,Starry Night Oil on canvas 1889

2. Result of Neural Style Transfer

## Run 3

1.  Content Image – Caesar 2 (Masai Mara- 20 Jun 2018).  Style Image – The Great Wave at Kanagawa – Katsushika Hokosai, 1826-1833

2. Result of Neural Style Transfer

## Run 4

1.   Content Image – Junagarh Fort , Rajasthan   Sep 2016              b) Style Image – Le Pont Japonais by Claude Monet, Oil on canvas, 1920

2. Result of Neural Style Transfer

Neural Style Transfer is a very ingenious idea which shows that we can segregate the style of a painting and transfer to another image.

### References

1. A Neural Algorithm of Artistic Style, Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
2. Neural style transfer
3. Neural Style Transfer: Creating Art with Deep Learning using tf.keras and eager execution
4. Convolutional Neural Network, DeepLearning.AI Specialization, Prof Andrew Ng
5. Intuitive Guide to Neural Style Transfer

To see all posts click Index of posts

# Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8

“Lights, camera and … action – Take 4+!”

This post includes  a rework of all presentation of ‘Elements of Neural Networks and Deep  Learning Parts 1-8 ‘ since my earlier presentations had some missing parts, omissions and some occasional errors. So I have re-recorded all the presentations.
This series of presentation will do a deep-dive  into Deep Learning networks starting from the fundamentals. The equations required for performing learning in a L-layer Deep Learning network  are derived in detail, starting from the basics. Further, the presentations also discuss multi-class classification, regularization techniques, and gradient descent optimization methods in deep networks methods. Finally the presentations also touch on how  Deep Learning Networks can be tuned.

The corresponding implementations are available in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles:Second edition- In vectorized Python, R and Octave

1. Elements of Neural Networks and Deep Learning – Part 1
This presentation introduces Neural Networks and Deep Learning. A look at history of Neural Networks, Perceptrons and why Deep Learning networks are required and concluding with a simple toy examples of a Neural Network and how they compute. This part also includes a small digression on the basics of Machine Learning and how the algorithm learns from a data set

2. Elements of Neural Networks and Deep Learning – Part 2
This presentation takes logistic regression as an example and creates an equivalent 2 layer Neural network. The presentation also takes a look at forward & backward propagation and how the cost is minimized using gradient descent

The implementation of the discussed 2 layer Neural Network in vectorized R, Python and Octave are available in my post ‘Deep Learning from first principles in Python, R and Octave – Part 1‘

3. Elements of Neural Networks and Deep Learning – Part 3
This 3rd part, discusses a primitive neural network with an input layer, output layer and a hidden layer. The neural network uses tanh activation in the hidden layer and a sigmoid activation in the output layer. The equations for forward and backward propagation are derived.

To see the implementations for the above discussed video see my post ‘Deep Learning from first principles in Python, R and Octave – Part 2

4. Elements of Neural Network and Deep Learning – Part 4
This presentation is a continuation of my 3rd presentation in which I derived the equations for a simple 3 layer Neural Network with 1 hidden layer. In this video presentation, I discuss step-by-step the derivations for a L-Layer, multi-unit Deep Learning Network, with any activation function g(z)

The implementations of L-Layer, multi-unit Deep Learning Network in vectorized R, Python and Octave are available in my post Deep Learning from first principles in Python, R and Octave – Part 3

5. Elements of Neural Network and Deep Learning – Part 5
This presentation discusses multi-class classification using the Softmax function. The detailed derivation for the Jacobian of the Softmax is discussed, and subsequently the derivative of cross-entropy loss is also discussed in detail. Finally the final set of equations for a Neural Network with multi-class classification is derived.

The corresponding implementations in vectorized R, Python and Octave are available in the following posts
a. Deep Learning from first principles in Python, R and Octave – Part 4
b. Deep Learning from first principles in Python, R and Octave – Part 5

6. Elements of Neural Networks and Deep Learning – Part 6
This part discusses initialization methods specifically like He and Xavier. The presentation also focuses on how to prevent over-fitting using regularization. Lastly the dropout method of regularization is also discussed

The corresponding implementations in vectorized R, Python and Octave of the above discussed methods are available in my post Deep Learning from first principles in Python, R and Octave – Part 6

7. Elements of Neural Networks and Deep Learning – Part 7
This presentation introduces exponentially weighted moving average and shows how this is used in different approaches to gradient descent optimization. The key techniques discussed are learning rate decay, momentum method, rmsprop and adam.

The equivalent implementations of the gradient descent optimization techniques in R, Python and Octave can be seen in my post Deep Learning from first principles in Python, R and Octave – Part 7

8. Elements of Neural Networks and Deep Learning – Part 8
This last part touches on the method to adopt while tuning hyper-parameters in Deep Learning networks

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

This concludes this series of presentations on “Elements of Neural Networks and Deep Learning’

To see all posts click Index of posts

# My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 6,7,8

This is the final set of presentations in my series ‘Elements of Neural Networks and Deep Learning’. This set follows the earlier 2 sets of presentations namely
1. My presentations on ‘Elements of Neural Networks & Deep Learning’ -Part1,2,3
2. My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 4,5

In this final set of presentations I discuss initialization methods, regularization techniques including dropout. Next I also discuss gradient descent optimization methods like momentum, rmsprop, adam etc. Lastly, I briefly also touch on hyper-parameter tuning approaches. The corresponding implementations are available in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles:Second edition- In vectorized Python, R and Octave

1. Elements of Neural Networks and Deep Learning – Part 6
This part discusses initialization methods specifically like He and Xavier. The presentation also focuses on how to prevent over-fitting using regularization. Lastly the dropout method of regularization is also discusses

The corresponding implementations in vectorized R, Python and Octave of the above discussed methods are available in my post Deep Learning from first principles in Python, R and Octave – Part 6

2. Elements of Neural Networks and Deep Learning – Part 7
This presentation introduces exponentially weighted moving average and shows how this is used in different approaches to gradient descent optimization. The key techniques discussed are learning rate decay, momentum method, rmsprop and adam.

The equivalent implementations of the gradient descent optimization techniques in R, Python and Octave can be seen in my post Deep Learning from first principles in Python, R and Octave – Part 7

3. Elements of Neural Networks and Deep Learning – Part 8
This last part touches upon hyper-parameter tuning in Deep Learning networks

This concludes this series of presentations on “Elements of Neural Networks and Deep Learning’

Important note: Do check out my later version of these videos at Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8 . These have more content and also include some corrections. Check it out!

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and and in kindle version($9.99/Rs449).

To see all posts click Index of posts

# My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 4,5

This is the next set of presentations on “Elements of Neural Networks and Deep Learning”.  In the 4th presentation I discuss and derive the generalized equations for a multi-unit, multi-layer Deep Learning network.  The 5th presentation derives the equations for a Deep Learning network when performing multi-class classification along with the derivations for cross-entropy loss. The corresponding implementations are available in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles:Second edition- In vectorized Python, R and Octave

Important note: Do check out my later version of these videos at Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8 . These have more content and also include some corrections. Check it out!

1. Elements of Neural Network and Deep Learning – Part 4
This presentation is a continuation of my 3rd presentation in which I derived the equations for a simple 3 layer Neural Network with 1 hidden layer. In this video presentation, I discuss step-by-step the derivations for a L-Layer, multi-unit Deep Learning Network, with any activation function g(z)

The implementations of L-Layer, multi-unit Deep Learning Network in vectorized R, Python and Octave are available in my post Deep Learning from first principles in Python, R and Octave – Part 3

2. Elements of Neural Network and Deep Learning – Part 5
This presentation discusses multi-class classification using the Softmax function. The detailed derivation for the Jacobian of the Softmax is discussed, and subsequently the derivative of cross-entropy loss is also discussed in detail. Finally the final set of equations for a Neural Network with multi-class classification is derived.

The corresponding implementations in vectorized R, Python and Octave are available in the following posts
a. Deep Learning from first principles in Python, R and Octave – Part 4
b. Deep Learning from first principles in Python, R and Octave – Part 5

To be continued. Watch this space!

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

To see all posts click Index of Posts

# My presentations on ‘Elements of Neural Networks & Deep Learning’ -Part1,2,3

I will be uploading a series of presentations on ‘Elements of Neural Networks and Deep Learning’. In these video presentations I discuss the derivations of L -Layer Deep Learning Networks, starting from the basics. The corresponding implementations are available in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles:Second edition- In vectorized Python, R and Octave

1. Elements of Neural Networks and Deep Learning – Part 1
This presentation introduces Neural Networks and Deep Learning. A look at history of Neural Networks, Perceptrons and why Deep Learning networks are required and concluding with a simple toy examples of a Neural Network and how they compute

2. Elements of Neural Networks and Deep Learning – Part 2
This presentation takes logistic regression as an example and creates an equivalent 2 layer Neural network. The presentation also takes a look at forward & backward propagation and how the cost is minimized using gradient descent

The implementation of the discussed 2 layer Neural Network in vectorized R, Python and Octave are available in my post ‘Deep Learning from first principles in Python, R and Octave – Part 1

3. Elements of Neural Networks and Deep Learning – Part 3
This 3rd part, discusses a primitive neural network with an input layer, output layer and a hidden layer. The neural network uses tanh activation in the hidden layer and a sigmoid activation in the output layer. The equations for forward and backward propagation are derived.

To see the implementations for the above discussed video see my post ‘Deep Learning from first principles in Python, R and Octave – Part 2

Important note: Do check out my later version of these videos at Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8 . These have more content and also include some corrections. Check it out!

To be continued. Watch this space!

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

To see all posts click Index of posts

# My book ‘Deep Learning from first principles:Second Edition’ now on Amazon

The second edition of my book ‘Deep Learning from first principles:Second Edition- In vectorized Python, R and Octave’, is now available on Amazon, in both paperback ($18.99) and kindle ($9.99/Rs449/-)  versions. Since this book is almost 70% code, all functions, and code snippets have been formatted to use the fixed-width font ‘Lucida Console’. In addition line numbers have been added to all code snippets. This makes the code more organized and much more readable. I have also fixed typos in the book

The book includes the following chapters

Table of Contents
Preface 4
Introduction 6
1. Logistic Regression as a Neural Network 8
2. Implementing a simple Neural Network 23
3. Building a L- Layer Deep Learning Network 48
4. Deep Learning network with the Softmax 85
5. MNIST classification with Softmax 103
6. Initialization, regularization in Deep Learning 121
7. Gradient Descent Optimization techniques 167
8. Gradient Check in Deep Learning 197
1. Appendix A 214
2. Appendix 1 – Logistic Regression as a Neural Network 220
3. Appendix 2 - Implementing a simple Neural Network 227
4. Appendix 3 - Building a L- Layer Deep Learning Network 240
5. Appendix 4 - Deep Learning network with the Softmax 259
6. Appendix 5 - MNIST classification with Softmax 269
7. Appendix 6 - Initialization, regularization in Deep Learning 302
8. Appendix 7 - Gradient Descent Optimization techniques 344
9. Appendix 8 – Gradient Check 405
References 475

To see posts click Index of Posts

# My book “Deep Learning from first principles” now on Amazon

Note: The 2nd edition of this book is now available on Amazon

My 4th book(self-published), “Deep Learning from first principles – In vectorized Python, R and Octave” (557 pages), is now available on Amazon in both paperback ($18.99) and kindle ($9.99/Rs449). The book starts with the most primitive 2-layer Neural Network and works  its way to a generic L-layer Deep Learning Network, with all the bells and whistles.  The book includes detailed derivations and vectorized implementations in Python, R and Octave.  The code has been extensively  commented and has been included in the Appendix section.

# Deep Learning from first principles in Python, R and Octave – Part 8

## 1. Introduction

You don’t understand anything until you learn it more than one way. Marvin Minsky
No computer has ever been designed that is ever aware of what it’s doing; but most of the time, we aren’t either. Marvin Minsky
A wealth of information creates a poverty of attention. Herbert Simon

This post, Deep Learning from first Principles in Python, R and Octave-Part8, is my final post in my Deep Learning from first principles series. In this post, I discuss and implement a key functionality needed while building Deep Learning networks viz. ‘Gradient Checking’. Gradient Checking is an important method to check the correctness of your implementation, specifically the forward propagation and the backward propagation cycles of an implementation. In addition I also discuss some tips for tuning hyper-parameters of a Deep Learning network based on my experience.

My post in this  ‘Deep Learning Series’ so far were
1. Deep Learning from first principles in Python, R and Octave – Part 1 In part 1, I implement logistic regression as a neural network in vectorized Python, R and Octave
2. Deep Learning from first principles in Python, R and Octave – Part 2 In the second part I implement a simple Neural network with just 1 hidden layer and a sigmoid activation output function
3. Deep Learning from first principles in Python, R and Octave – Part 3 The 3rd part implemented a multi-layer Deep Learning Network with sigmoid activation output in vectorized Python, R and Octave
4. Deep Learning from first principles in Python, R and Octave – Part 4 The 4th part deals with multi-class classification. Specifically, I derive the Jacobian of the Softmax function and enhance my L-Layer DL network to include Softmax output function in addition to Sigmoid activation
5. Deep Learning from first principles in Python, R and Octave – Part 5 This post uses the Softmax classifier implemented to classify MNIST digits using a L-layer Deep Learning network
6. Deep Learning from first principles in Python, R and Octave – Part 6 The 6th part adds more bells and whistles to my L-Layer DL network, by including different initialization types namely He and Xavier. Besides L2 Regularization and random dropout is added.
7. Deep Learning from first principles in Python, R and Octave – Part 7 The 7th part deals with Stochastic Gradient Descent Optimization methods including momentum, RMSProp and Adam
8. Deep Learning from first principles in Python, R and Octave – Part 8 – This post implements a critical function for ensuring the correctness of a L-Layer Deep Learning network implementation using Gradient Checking

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python- Machine Learning in stereo” available in Amazon in paperback($9.99) and Kindle($6.99) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

Gradient Checking is based on the following approach. One iteration of Gradient Descent computes and updates the parameters $\theta$ by doing
$\theta := \theta - \frac{d}{d\theta}J(\theta)$.
To minimize the cost we will need to minimize $J(\theta)$
Let $g(\theta)$ be a function that computes the derivative $\frac {d}{d\theta}J(\theta)$. Gradient Checking allows us to numerically evaluate the implementation of the function $g(\theta)$ and verify its correctness.
We know the derivative of a function is given by
$\frac {d}{d\theta}J(\theta) = lim->0 \frac {J(\theta +\epsilon) - J(\theta -\epsilon)} {2*\epsilon}$
Note: The above derivative is based on the 2 sided derivative. The 1-sided derivative  is given by $\frac {d}{d\theta}J(\theta) = lim->0 \frac {J(\theta +\epsilon) - J(\theta)} {\epsilon}$
Gradient Checking is based on the 2-sided derivative because the error is of the order $O(\epsilon^{2})$ as opposed $O(\epsilon)$ for the 1-sided derivative.
Hence Gradient Check uses the 2 sided derivative as follows.
$g(\theta) = lim->0 \frac {J(\theta +\epsilon) - J(\theta -\epsilon)} {2*\epsilon}$

In Gradient Check the following is done
A) Run one normal cycle of your implementation by doing the following
a) Compute the output activation by running 1 cycle of forward propagation
b) Compute the cost using the output activation

B) Perform gradient check steps as below
a) Set $\theta$ . Flatten all ‘weights’ and ‘bias’ matrices and vectors to a column vector.
b) Initialize $\theta+$ by bumping up $\theta$ by adding $\epsilon$ ($\theta + \epsilon$)
c) Perform forward propagation with $\theta+$
d) Compute cost with $\theta+$ i.e. $J(\theta+)$
e) Initialize  $\theta-$ by bumping down $\theta$ by subtracting $\epsilon$ $(\theta - \epsilon)$
f) Perform forward propagation with $\theta-$
g) Compute cost with $\theta-$ i.e.  $J(\theta-)$
h) Compute $\frac {d} {d\theta} J(\theta)$ or ‘gradapprox’ as$\frac {J(\theta+) - J(\theta-) } {2\epsilon}$using the 2 sided derivative.
i) Compute L2norm or the Euclidean distance between ‘grad’ and ‘gradapprox’. If the
diference is of the order of $10^{-5}$ or $10^{-7}$ the implementation is correct. In the Deep Learning Specialization Prof Andrew Ng mentions that if the difference is of the order of $10^{-7}$ then the implementation is correct. A difference of $10^{-5}$ is also ok. Anything more than that is a cause of worry and you should look at your code more closely. To see more details click Gradient checking and advanced optimization

After spending a better part of 3 days, I now realize how critical Gradient Check is for ensuring the correctness of you implementation. Initially I was getting very high difference and did not know how to understand the results or debug my implementation. After many hours of staring at the results, I  was able to finally arrive at a way, to localize issues in the implementation. In fact, I did catch a small bug in my Python code, which did not exist in the R and Octave implementations. I will demonstrate this below

## 1.1a Gradient Check – Sigmoid Activation – Python

import numpy as np
import matplotlib

train_X, train_Y, test_X, test_Y = load_dataset()
#Set layer dimensions
layersDimensions = [2,4,1]
parameters = initializeDeepModel(layersDimensions)
#Perform forward prop
AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1, hiddenActivationFunc="relu",outputActivationFunc="sigmoid")
#Compute cost
cost = computeCost(AL, train_Y, outputActivationFunc="sigmoid")
print("cost=",cost)
gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1,                                   hiddenActivationFunc="relu",outputActivationFunc="sigmoid")

epsilon = 1e-7
outputActivationFunc="sigmoid"

# Set-up variables
# Flatten parameters to a vector
parameters_values, _ = dictionary_to_vector(parameters)
num_parameters = parameters_values.shape[0]
#Initialize
J_plus = np.zeros((num_parameters, 1))
J_minus = np.zeros((num_parameters, 1))

# Compute gradapprox using 2 sided derivative
for i in range(num_parameters):
# Compute J_plus[i].
thetaplus = np.copy(parameters_values)
thetaplus[i][0] = thetaplus[i][0] + epsilon
AL, caches, dropoutMat = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaplus), keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc)
J_plus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc)

# Compute J_minus[i].
thetaminus = np.copy(parameters_values)
thetaminus[i][0] = thetaminus[i][0] - epsilon
AL, caches, dropoutMat  = forwardPropagationDeep(train_X, vector_to_dictionary(parameters,thetaminus), keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc)
J_minus[i] = computeCost(AL, train_Y, outputActivationFunc=outputActivationFunc)

difference =  numerator/denominator

#Check the difference
if difference > 1e-5:
print ("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
else:
print ("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")
print(difference)
print("\n")
# The technique below can be used to identify
# which of the parameters are in error
print(m)
print("\n")
print(n)

## (300, 2)
## (300,)
## cost= 0.6931455556341791
## [92mYour backward propagation works perfectly fine! difference = 1.1604150683743381e-06[0m
## 1.1604150683743381e-06
##
##
## {'dW1': array([[-6.19439955e-06, -2.06438046e-06],
##        [-1.50165447e-05,  7.50401672e-05],
##        [ 1.33435433e-04,  1.74112143e-04],
##        [-3.40909024e-05, -1.38363681e-04]]), 'db1': array([[ 7.31333221e-07],
##        [ 7.98425950e-06],
##        [ 8.15002817e-08],
##        [-5.69821155e-08]]), 'dW2': array([[2.73416304e-04, 2.96061451e-04, 7.51837363e-05, 1.01257729e-04]]), 'db2': array([[-7.22232235e-06]])}
##
##
## {'dW1': array([[-6.19448937e-06, -2.06501483e-06],
##        [-1.50168766e-05,  7.50399742e-05],
##        [ 1.33435485e-04,  1.74112391e-04],
##        [-3.40910633e-05, -1.38363765e-04]]), 'db1': array([[ 7.31081862e-07],
##        [ 7.98472399e-06],
##        [ 8.16013923e-08],
##        [-5.71764858e-08]]), 'dW2': array([[2.73416290e-04, 2.96061509e-04, 7.51831930e-05, 1.01257891e-04]]), 'db2': array([[-7.22255589e-06]])}

## 1.1b Gradient Check – Softmax Activation – Python (Error!!)

In the code below I show, how I managed to spot a bug in your implementation

import numpy as np
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
ix = range(N*j,N*(j+1))
t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j

# Plot the data
#plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
layersDimensions = [2,3,3]
y1=y.reshape(-1,1).T
train_X=X.T
train_Y=y1

parameters = initializeDeepModel(layersDimensions)
#Compute forward prop
AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc="softmax")
#Compute cost
cost = computeCost(AL, train_Y, outputActivationFunc="softmax")
print("cost=",cost)
gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc="softmax")
# Note the transpose of the gradients for Softmax has to be taken
L= len(parameters)//2
print(L)
gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax")

cost= 1.0986187818144022
2
There is a mistake in the backward propagation! difference = 0.7100295155692544
0.7100295155692544

{'dW1': array([[ 0.00050125,  0.00045194],
[ 0.00096392,  0.00039641],
[-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082],
[-0.00224399],
[ 0.00052305]]), 'dW2': array([[-8.40953794e-05, -9.52657769e-04, -1.10269379e-04],
[-7.45469382e-04,  9.49795606e-04,  2.29045434e-04],
[ 8.29564761e-04,  2.86216305e-06, -1.18776055e-04]]),
'db2': array([[-0.00253808],
[-0.00505508],
[ 0.00759315]])}

{'dW1': array([[ 0.00050125,  0.00045194],
[ 0.00096392,  0.00039641],
[-0.00014276, -0.00045639]]), 'db1': array([[ 0.00070082],
[-0.00224399],
[ 0.00052305]]), 'dW2': array([[-8.40960634e-05, -9.52657953e-04, -1.10268461e-04],
[-7.45469242e-04,  9.49796908e-04,  2.29045671e-04],
[ 8.29565305e-04,  2.86104473e-06, -1.18776100e-04]]),
'db2': array([[-8.46211989e-06],
[-1.68487446e-05],
[ 2.53108645e-05]])}

Gradient Check gives a high value of the difference of 0.7100295. Inspecting the Gradients and Gradapprox we can see there is a very big discrepancy in db2. After I went over my code I discovered that I my computation in the function layerActivationBackward for Softmax was


# Erroneous code
if activationFunc == 'softmax':
dW = 1/numtraining * np.dot(A_prev,dZ)
db = np.sum(dZ, axis=0, keepdims=True)
dA_prev = np.dot(dZ,W)
# Fixed code
if activationFunc == 'softmax':
dW = 1/numtraining * np.dot(A_prev,dZ)
db = 1/numtraining *  np.sum(dZ, axis=0, keepdims=True)
dA_prev = np.dot(dZ,W)


After fixing this error when I ran Gradient Check I get

## 1.1c Gradient Check – Softmax Activation – Python (Corrected!!)

import numpy as np
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
ix = range(N*j,N*(j+1))
t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j

# Plot the data
#plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
layersDimensions = [2,3,3]
y1=y.reshape(-1,1).T
train_X=X.T
train_Y=y1
#Set layer dimensions
parameters = initializeDeepModel(layersDimensions)
#Perform forward prop
AL, caches, dropoutMat = forwardPropagationDeep(train_X, parameters, keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc="softmax")
#Compute cost
cost = computeCost(AL, train_Y, outputActivationFunc="softmax")
print("cost=",cost)
gradients = backwardPropagationDeep(AL, train_Y, caches, dropoutMat, lambd=0, keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc="softmax")
# Note the transpose of the gradients for Softmax has to be taken
L= len(parameters)//2
print(L)
gradient_check_n(parameters, gradients, train_X, train_Y, epsilon = 1e-7,outputActivationFunc="softmax")
## cost= 1.0986193170234435
## 2
## [92mYour backward propagation works perfectly fine! difference = 5.268804859613151e-07[0m
## 5.268804859613151e-07
##
##
## {'dW1': array([[ 0.00053206,  0.00038987],
##        [ 0.00093941,  0.00038077],
##        [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662],
##        [-0.00210198],
##        [ 0.00046741]]), 'dW2': array([[-7.83441270e-05, -9.70179498e-04, -1.08715815e-04],
##        [-7.70175008e-04,  9.54478237e-04,  2.27690198e-04],
##        [ 8.48519135e-04,  1.57012608e-05, -1.18974383e-04]]), 'db2': array([[-8.52190476e-06],
##        [-1.69954294e-05],
##        [ 2.55173342e-05]])}
##
##
## {'dW1': array([[ 0.00053206,  0.00038987],
##        [ 0.00093941,  0.00038077],
##        [-0.00012177, -0.0004692 ]]), 'db1': array([[ 0.00072662],
##        [-0.00210198],
##        [ 0.00046741]]), 'dW2': array([[-7.83439980e-05, -9.70180603e-04, -1.08716369e-04],
##        [-7.70173925e-04,  9.54478718e-04,  2.27690089e-04],
##        [ 8.48520143e-04,  1.57018842e-05, -1.18973720e-04]]), 'db2': array([[-8.52096171e-06],
##        [-1.69964043e-05],
##        [ 2.55162558e-05]])}

## 1.2a Gradient Check – Sigmoid Activation – R

source("DLfunctions8.R")

x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
#Set layer dimensions
layersDimensions = c(2,5,1)
parameters = initializeDeepModel(layersDimensions)
#Perform forward prop
retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu",
outputActivationFunc="sigmoid")
AL <- retvals[['AL']]
caches <- retvals[['caches']]
dropoutMat <- retvals[['dropoutMat']]
#Compute cost
cost <- computeCost(AL, Y,outputActivationFunc="sigmoid",
numClasses=layersDimensions[length(layersDimensions)])
print(cost)
## [1] 0.6931447
# Backward propagation.
gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",
outputActivationFunc="sigmoid",numClasses=layersDimensions[length(layersDimensions)])
epsilon = 1e-07
outputActivationFunc="sigmoid"
#Convert parameter list to vector
parameters_values = list_to_vector(parameters)
num_parameters = dim(parameters_values)[1]
#Initialize
J_plus = matrix(rep(0,num_parameters),
nrow=num_parameters,ncol=1)
J_minus = matrix(rep(0,num_parameters),
nrow=num_parameters,ncol=1)
nrow=num_parameters,ncol=1)

for(i in 1:num_parameters){
# Compute J_plus[i].
thetaplus = parameters_values
thetaplus[i][1] = thetaplus[i][1] + epsilon
retvals = forwardPropagationDeep(X, vector_to_list(parameters,thetaplus), keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc)

AL <- retvals[['AL']]
J_plus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc)

# Compute J_minus[i].
thetaminus = parameters_values
thetaminus[i][1] = thetaminus[i][1] - epsilon
retvals  = forwardPropagationDeep(X, vector_to_list(parameters,thetaminus), keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc)
AL <- retvals[['AL']]
J_minus[i] = computeCost(AL, Y, outputActivationFunc=outputActivationFunc)

}
#Compute L2Norm
difference =  numerator/denominator
if(difference > 1e-5){
cat("There is a mistake, the difference is too high",difference)
} else{
cat("The implementations works perfectly", difference)
}
## The implementations works perfectly 1.279911e-06
# This can be used to check
print("Gradients from backprop")
## [1] "Gradients from backprop"
vector_to_list2(parameters,grad)
## $dW1 ## [,1] [,2] ## [1,] -7.641588e-05 -3.427989e-07 ## [2,] -9.049683e-06 6.906304e-05 ## [3,] 3.401039e-06 -1.503914e-04 ## [4,] 1.535226e-04 -1.686402e-04 ## [5,] -6.029292e-05 -2.715648e-04 ## ##$db1
##               [,1]
## [1,]  6.930318e-06
## [2,] -3.283117e-05
## [3,]  1.310647e-05
## [4,] -3.454308e-05
## [5,] -2.331729e-08
##
## $dW2 ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.0001612356 0.0001113475 0.0002435824 0.000362149 2.874116e-05 ## ##$db2
##              [,1]
## [1,] -1.16364e-05
print("Grad approx from gradient check")
## [1] "Grad approx from gradient check"
vector_to_list2(parameters,gradapprox)
## $dW1 ## [,1] [,2] ## [1,] -7.641554e-05 -3.430589e-07 ## [2,] -9.049428e-06 6.906253e-05 ## [3,] 3.401168e-06 -1.503919e-04 ## [4,] 1.535228e-04 -1.686401e-04 ## [5,] -6.029288e-05 -2.715650e-04 ## ##$db1
##               [,1]
## [1,]  6.930012e-06
## [2,] -3.283096e-05
## [3,]  1.310618e-05
## [4,] -3.454237e-05
## [5,] -2.275957e-08
##
## $dW2 ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.0001612355 0.0001113476 0.0002435829 0.0003621486 2.87409e-05 ## ##$db2
##              [,1]
## [1,] -1.16368e-05

## 1.2b Gradient Check – Softmax Activation – R

source("DLfunctions8.R")

# Setup the data
X <- Z[,1:2]
y <- Z[,3]
X <- t(X)
Y <- t(y)
layersDimensions = c(2, 3, 3)
parameters = initializeDeepModel(layersDimensions)
#Perform forward prop
retvals = forwardPropagationDeep(X, parameters,keep_prob=1, hiddenActivationFunc="relu",
outputActivationFunc="softmax")
AL <- retvals[['AL']]
caches <- retvals[['caches']]
dropoutMat <- retvals[['dropoutMat']]
#Compute cost
cost <- computeCost(AL, Y,outputActivationFunc="softmax",
numClasses=layersDimensions[length(layersDimensions)])
print(cost)
## [1] 1.098618
# Backward propagation.
gradients = backwardPropagationDeep(AL, Y, caches, dropoutMat, lambd=0, keep_prob=1, hiddenActivationFunc="relu",
outputActivationFunc="softmax",numClasses=layersDimensions[length(layersDimensions)])
# Need to take transpose of the last layer for Softmax
L=length(parameters)/2
epsilon = 1e-7,outputActivationFunc="softmax")
## The implementations works perfectly 3.903011e-07[1] "Gradients from backprop"
## $dW1 ## [,1] [,2] ## [1,] 0.0007962367 -0.0001907606 ## [2,] 0.0004444254 0.0010354412 ## [3,] 0.0003078611 0.0007591255 ## ##$db1
##               [,1]
## [1,] -0.0017305136
## [2,]  0.0005393734
## [3,]  0.0012484550
##
## $dW2 ## [,1] [,2] [,3] ## [1,] -3.515627e-04 7.487283e-04 -3.971656e-04 ## [2,] -6.381521e-05 -1.257328e-06 6.507254e-05 ## [3,] -1.719479e-04 -4.857264e-04 6.576743e-04 ## ##$db2
##               [,1]
## [1,] -5.536383e-06
## [2,] -1.824656e-05
## [3,]  2.378295e-05
##
## $dW1 ## [,1] [,2] ## [1,] 0.0007962364 -0.0001907607 ## [2,] 0.0004444256 0.0010354406 ## [3,] 0.0003078615 0.0007591250 ## ##$db1
##               [,1]
## [1,] -0.0017305135
## [2,]  0.0005393741
## [3,]  0.0012484547
##
## $dW2 ## [,1] [,2] [,3] ## [1,] -3.515632e-04 7.487277e-04 -3.971656e-04 ## [2,] -6.381451e-05 -1.257883e-06 6.507239e-05 ## [3,] -1.719469e-04 -4.857270e-04 6.576739e-04 ## ##$db2
##               [,1]
## [1,] -5.536682e-06
## [2,] -1.824652e-05
## [3,]  2.378209e-05

## 1.3a Gradient Check – Sigmoid Activation – Octave

source("DL8functions.m")
################## Circles

X=data(:,1:2);
Y=data(:,3);
#Set layer dimensions
layersDimensions = [2 5  1]; #tanh=-0.5(ok), #relu=0.1 best!
[weights biases] = initializeDeepModel(layersDimensions);
#Perform forward prop
[AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,keep_prob=1,
hiddenActivationFunc="relu", outputActivationFunc="sigmoid");
#Compute cost
cost = computeCost(AL, Y',outputActivationFunc=outputActivationFunc,numClasses=layersDimensions(size(layersDimensions)(2)));
disp(cost);
hiddenActivationFunc="relu", outputActivationFunc="sigmoid",
numClasses=layersDimensions(size(layersDimensions)(2)));
epsilon = 1e-07;
outputActivationFunc="sigmoid";
# Convert paramter cell array to vector
parameters_values = cellArray_to_vector(weights, biases);
#Convert gradient cell array to vector
num_parameters = size(parameters_values)(1);
#Initialize
J_plus = zeros(num_parameters, 1);
J_minus = zeros(num_parameters, 1);
for i = 1:num_parameters
# Compute J_plus[i].
thetaplus = parameters_values;
thetaplus(i,1) = thetaplus(i,1) + epsilon;
[weights1 biases1] =vector_to_cellArray(weights, biases,thetaplus);
[AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights1, biases1, keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc);
J_plus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc);

# Compute J_minus[i].
thetaminus = parameters_values;
thetaminus(i,1) = thetaminus(i,1) - epsilon ;
[weights1 biases1] = vector_to_cellArray(weights, biases,thetaminus);
[AL forward_caches activation_caches droputMat]  = forwardPropagationDeep(X',weights1, biases1, keep_prob=1,
hiddenActivationFunc="relu",outputActivationFunc=outputActivationFunc);
J_minus(i) = computeCost(AL, Y', outputActivationFunc=outputActivationFunc);

endfor

#Compute L2Norm
difference =  numerator/denominator;
disp(difference);
#Check difference
if difference > 1e-04
printf("There is a mistake in the implementation ");
disp(difference);
else
printf("The implementation works perfectly");
disp(difference);
endif
disp(weights1);
disp(biases1);
disp(weights2);
disp(biases2);

0.69315
1.4893e-005
The implementation works perfectly 1.4893e-005
{
[1,1] =
5.0349e-005 2.1323e-005
8.8632e-007 1.8231e-006
9.3784e-005 1.0057e-004
1.0875e-004 -1.9529e-007
5.4502e-005 3.2721e-005
[1,2] =
1.0567e-005 6.0615e-005 4.6004e-005 1.3977e-004 1.0405e-004
}
{
[1,1] =
-1.8716e-005
1.1309e-009
4.7686e-005
1.2051e-005
-1.4612e-005
[1,2] = 9.5808e-006
}
{
[1,1] =
5.0348e-005 2.1320e-005
8.8485e-007 1.8219e-006
9.3784e-005 1.0057e-004
1.0875e-004 -1.9762e-007
5.4502e-005 3.2723e-005
[1,2] =
[1,2] =
1.0565e-005 6.0614e-005 4.6007e-005 1.3977e-004 1.0405e-004
}
{
[1,1] =
-1.8713e-005
1.1102e-009
4.7687e-005
1.2048e-005
-1.4609e-005
[1,2] = 9.5790e-006
}


## 1.3b Gradient Check – Softmax Activation – Octave

source("DL8functions.m")

# Setup the data
X=data(:,1:2);
Y=data(:,3);
# Set the layer dimensions
layersDimensions = [2 3  3];
[weights biases] = initializeDeepModel(layersDimensions);
# Run forward prop
[AL forward_caches activation_caches droputMat] = forwardPropagationDeep(X', weights, biases,keep_prob=1,
hiddenActivationFunc="relu", outputActivationFunc="softmax");
# Compute cost
cost = computeCost(AL, Y',outputActivationFunc=outputActivationFunc,numClasses=layersDimensions(size(layersDimensions)(2)));
disp(cost);
# Perform backward prop
hiddenActivationFunc="relu", outputActivationFunc="softmax",
numClasses=layersDimensions(size(layersDimensions)(2)));

#Take transpose of last layer for Softmax
L=size(weights)(2);
outputActivationFunc="softmax",numClasses=layersDimensions(size(layersDimensions)(2)));

 1.0986
The implementation works perfectly  2.0021e-005
{
[1,1] =
-7.1590e-005  4.1375e-005
-1.9494e-004  -5.2014e-005
-1.4554e-004  5.1699e-005
[1,2] =
3.3129e-004  1.9806e-004  -1.5662e-005
-4.9692e-004  -3.7756e-004  -8.2318e-005
1.6562e-004  1.7950e-004  9.7980e-005
}
{
[1,1] =
-3.0856e-005
-3.3321e-004
-3.8197e-004
[1,2] =
1.2046e-006
2.9259e-007
-1.4972e-006
}
{
[1,1] =
-7.1586e-005  4.1377e-005
-1.9494e-004  -5.2013e-005
-1.4554e-004  5.1695e-005
3.3129e-004  1.9806e-004  -1.5664e-005
-4.9692e-004  -3.7756e-004  -8.2316e-005
1.6562e-004  1.7950e-004  9.7979e-005
}
{
[1,1] =
-3.0852e-005
-3.3321e-004
-3.8197e-004
[1,2] =
1.1902e-006
2.8200e-007
-1.4644e-006
}


## 2.1 Tip for tuning hyperparameters

Deep Learning Networks come with a large number of hyper parameters which require tuning. The hyper parameters are

1. $\alpha$ -learning rate
2. Number of layers
3. Number of hidden units
4. Number of iterations
5. Momentum – $\beta$ – 0.9
6. RMSProp – $\beta_{1}$ – 0.9
7. Adam – $\beta_{1}$,$\beta_{2}$ and $\epsilon$
8. learning rate decay
9. mini batch size
10. Initialization method – He, Xavier
11. Regularization

– Among the above the most critical is learning rate $\alpha$ . Rather than just trying out random values, it may help to try out values on a logarithmic scale. So we could try out values -0.01,0.1,1.0,10 etc. If we find that the cost is between 0.01 and 0.1 we could use a technique similar to binary search or bisection, so we can try 0.01, 0.05. If we need to be bigger than 0.01 and 0.05 we could try 0.25  and then keep halving the distance etc.
– The performance of Momentum and RMSProp are very good and work well with values 0.9. Even with this, it is better to try out values of 1-$\beta$ in the logarithmic range. So 1-$\beta$ could 0.001,0.01,0.1 and hence $\beta$ would be 0.999,0.99 or 0.9
– Increasing the number of hidden units or number of hidden layers need to be done gradually. I have noticed that increasing number of hidden layers heavily does not improve performance and sometimes degrades it.
– Sometimes, I tend to increase the number of iterations if I think I see a steady decrease in the cost for a certain learning rate
– It may also help to add learning rate decay if you see there is an oscillation while it decreases.
– Xavier and He initializations also help in a fast convergence and are worth trying out.

## 3.1 Final thoughts

As I come to a close in this Deep Learning Series from first principles in Python, R and Octave, I must admit that I learnt a lot in the process.

* Building a L-layer, vectorized Deep Learning Network in Python, R and Octave was extremely challenging but very rewarding
* One benefit of building vectorized versions in Python, R and Octave was that I was looking at each function that I was implementing thrice, and hence I was able to fix any bugs in any of the languages
* In addition since I built the generic L-Layer DL network with all the bells and whistles, layer by layer I further had an opportunity to look at all the functions in each successive post.
* Each language has its advantages and disadvantages. From the performance perspective I think Python is the best, followed by Octave and then R
* Interesting, I noticed that even if small bugs creep into your implementation, the DL network does learn and does generate a valid set of weights and biases, however this may not be an optimum solution. In one case of an inadvertent bug, I was not updating the weights in the final layer of the DL network. Yet, using all the other layers, the DL network was able to come with a reasonable solution (maybe like random dropout, remaining units can still learn the data!)
* Having said that, the Gradient Check method discussed and implemented in this post can be very useful in ironing out bugs.

## Conclusion

These last couple of months when I was writing the posts and the also churning up the code in Python, R and Octave were  very hectic. There have been times when I found that implementations of some function to be extremely demanding and I almost felt like giving up. Other times, I have spent quite some time on an intractable DL network which would not respond to changes in hyper-parameters. All in all, it was a great learning experience. I would suggest that you start from my first post Deep Learning from first principles in Python, R and Octave-Part 1 and work your way up. Feel free to take the code apart and try out things. That is the only way you will learn.

Hope you had as much fun as I had. Stay tuned. I will be back!!!

To see all post click Index of Posts

# Deep Learning from first principles in Python, R and Octave – Part 7

Artificial Intelligence is the new electricity. – Prof Andrew Ng

Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake. We need to solve the unsupervised learning problem before we can even think of getting to true AI.  – Yann LeCun, March 14, 2016 (Facebook)

# Introduction

In this post ‘Deep Learning from first principles with Python, R and Octave-Part 7’, I implement optimization methods used in Stochastic Gradient Descent (SGD) to speed up the convergence. Specifically I discuss and implement the following gradient descent optimization techniques

b.Learning rate decay
c. Momentum method
d. RMSProp

This post, further enhances my generic  L-Layer Deep Learning Network implementations in  vectorized Python, R and Octave to also include the Stochastic Gradient Descent optimization techniques. You can clone/download the code from Github at DeepLearning-Part7

You can view my video  presentation on Gradient Descent Optimization in Neural Networks 7

Incidentally, a good discussion of the various optimizations methods used in Stochastic Gradient Optimization techniques can be seen at Sebastian Ruder’s blog

Note: In the vectorized Python, R and Octave implementations below only a  1024 random training samples were used. This was to reduce the computation time. You are free to use the entire data set (60000 training data) for the computation.

This post is largely based of on Prof Andrew Ng’s Deep Learning Specialization.  All the above optimization techniques for Stochastic Gradient Descent are based on the technique of exponentially weighted average method. So for example if we had some time series data $\theta_{1},\theta_{2},\theta_{3}... \theta_{t}$ then we we can represent the exponentially average value at time ‘t’ as a sequence of the the previous value $v_{t-1}$ and $\theta_{t}$ as shown below
$v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}$

Here $v_{t}$ represent the average of the data set over $\frac {1}{1-\beta}$  By choosing different values of $\beta$, we can average over a larger or smaller number of the data points.
We can write the equations as follows
$v_{t} = \beta v_{t-1} + (1-\beta)\theta_{t}$
$v_{t-1} = \beta v_{t-2} + (1-\beta)\theta_{t-1}$
$v_{t-2} = \beta v_{t-3} + (1-\beta)\theta_{t-2}$
and
$v_{t-k} = \beta v_{t-(k+1))} + (1-\beta)\theta_{t-k}$
By substitution we have
$v_{t} = (1-\beta)\theta_{t} + \beta v_{t-1}$
$v_{t} = (1-\beta)\theta_{t} + \beta ((1-\beta)\theta_{t-1}) + \beta v_{t-2}$
$v_{t} = (1-\beta)\theta_{t} + \beta ((1-\beta)\theta_{t-1}) + \beta ((1-\beta)\theta_{t-2}+ \beta v_{t-3} )$

Hence it can be seen that the $v_{t}$ is the weighted sum over the previous values $\theta_{k}$, which is an exponentially decaying function.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python- Machine Learning in stereo” available in Amazon in paperback($9.99) and Kindle($6.99) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

## 1.1a. Stochastic Gradient Descent (Vanilla) – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets

lbls=[]
pxls=[]
for i in range(60000):
l,p=training[i]
lbls.append(l)
pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T

# Create  a list of 1024 random numbers.
permutation = list(np.random.permutation(2**10))
# Subset 16384 from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
# Set the layer dimensions
layersDimensions=[784, 15,9,10]
# Perform SGD with regular gradient descent
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="gd",
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig1.png")


## 1.1b. Stochastic Gradient Descent (Vanilla) – R

source("mnist.R")
source("DLfunctions7.R")
x <- t(train$x) X <- x[,1:60000] y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)

# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
# Set layer dimensions
layersDimensions=c(784, 15,9, 10)
# Perform SGD with regular gradient descent
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.05,
optimizer="gd",
mini_batch_size = 512,
num_epochs = 5000,
print_cost = True)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs df=data.frame(iterations,costs) ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs no of epochs") + xlab("No of epochss") + ylab("Cost") ## 1.1c. Stochastic Gradient Descent (Vanilla) – Octave source("DL7functions.m") #Load and read MNIST load('./mnist/mnist.txt.gz'); #Create a random permutatation from 1024 permutation = randperm(1024); disp(length(permutation)); # Use this 1024 as the batch X=trainX(permutation,:); Y=trainY(permutation,:); # Set layer dimensions layersDimensions=[784, 15, 9, 10]; # Perform SGD with regular gradient descent [weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.005, lrDecay=true, decayRate=1, lambd=0, keep_prob=1, optimizer="gd", beta=0.9, beta1=0.9, beta2=0.999, epsilon=10^-8, mini_batch_size = 512, num_epochs = 5000); plotCostVsEpochs(5000,costs);  ## 2.1. Stochastic Gradient Descent with Learning rate decay Since in Stochastic Gradient Descent,with each epoch, we use slight different samples, the gradient descent algorithm, oscillates across the ravines and wanders around the minima, when a fixed learning rate is used. In this technique of ‘learning rate decay’ the learning rate is slowly decreased with the number of epochs and becomes smaller and smaller, so that gradient descent can take smaller steps towards the minima. There are several techniques employed in learning rate decay a) Exponential decay: $\alpha = decayRate^{epochNum} *\alpha_{0}$ b) 1/t decay : $\alpha = \frac{\alpha_{0}}{1 + decayRate*epochNum}$ c) $\alpha = \frac {decayRate}{\sqrt(epochNum)}*\alpha_{0}$ In my implementation I have used the ‘exponential decay’. The code snippet for Python is shown below if lrDecay == True: learningRate = np.power(decayRate,(num_epochs/1000)) * learningRate  ## 2.1a. Stochastic Gradient Descent with Learning rate decay – Python import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd import sklearn import sklearn.datasets exec(open("DLfunctions7.py").read()) exec(open("load_mnist.py").read()) # Read the MNIST data training=list(read(dataset='training',path=".\\mnist")) test=list(read(dataset='testing',path=".\\mnist")) lbls=[] pxls=[] for i in range(60000): l,p=training[i] lbls.append(l) pxls.append(p) labels= np.array(lbls) pixels=np.array(pxls) y=labels.reshape(-1,1) X=pixels.reshape(pixels.shape[0],-1) X1=X.T Y1=y.T # Create a list of random numbers of 1024 permutation = list(np.random.permutation(2**10)) # Subset 16384 from the data X2 = X1[:, permutation] Y2 = Y1[:, permutation].reshape((1,2**10)) # Set layer dimensions layersDimensions=[784, 15,9,10] # Perform SGD with learning rate decay parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.01 , lrDecay=True, decayRate=0.9999, optimizer="gd", mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig2.png") ## 2.1b. Stochastic Gradient Descent with Learning rate decay – R source("mnist.R") source("DLfunctions7.R") # Read and load MNIST load_mnist() x <- t(train$x)
X <- x[,1:60000]
y <-train$y y1 <- y[1:60000] y2 <- as.matrix(y1) Y=t(y2) # Subset 1024 random samples from MNIST permutation = c(sample(2^10)) # Randomly shuffle the training data X1 = X[, permutation] y1 = Y[1, permutation] y2 <- as.matrix(y1) Y1=t(y2) # Set layer dimensions layersDimensions=c(784, 15,9, 10) # Perform SGD with Learning rate decay retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='tanh', outputActivationFunc="softmax", learningRate = 0.05, lrDecay=TRUE, decayRate=0.9999, optimizer="gd", mini_batch_size = 512, num_epochs = 5000, print_cost = True) #Plot the cost vs iterations iterations <- seq(0,5000,1000) costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")

## 2.1c. Stochastic Gradient Descent with Learning rate decay – Octave

source("DL7functions.m")
#Create a random permutatation from 1024
permutation = randperm(1024);
disp(length(permutation));

# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);

# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with regular Learning rate decay
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.01,
lrDecay=true,
decayRate=0.999,
lambd=0,
keep_prob=1,
optimizer="gd",
beta=0.9,
beta1=0.9,
beta2=0.999,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000);
plotCostVsEpochs(5000,costs)


## 3.1. Stochastic Gradient Descent with Momentum

Stochastic Gradient Descent with Momentum uses the exponentially weighted average method discusses above and more generally moves faster into the ravine than across it. The equations are
$v_{dW}^l = \beta v_{dW}^l + (1-\beta)dW^{l}$
$v_{db}^l = \beta v_{db}^l + (1-\beta)db^{l}$
$W^{l} = W^{l} - \alpha v_{dW}^l$
$b^{l} = b^{l} - \alpha v_{db}^l$ where
$v_{dW}$ and $v_{db}$ are the momentum terms which are exponentially weighted with the corresponding gradients ‘dW’ and ‘db’ at the corresponding layer ‘l’ The code snippet for Stochastic Gradient Descent with momentum in R is shown below

# Perform Gradient Descent with momentum
# Input : Weights and biases
#       : beta
#       : learning rate
#       : outputActivationFunc - Activation function at hidden layer sigmoid/softmax
#output : Updated weights after 1 iteration

L = length(parameters)/2 # number of layers in the neural network
# Update rule for each parameter. Use a for loop.
for(l in 1:(L-1)){
# Compute velocities
# v['dWk'] = beta *v['dWk'] + (1-beta)*dWk
v[[paste("dW",l, sep="")]] = beta*v[[paste("dW",l, sep="")]] +
v[[paste("db",l, sep="")]] = beta*v[[paste("db",l, sep="")]] +

parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] -
learningRate* v[[paste("dW",l, sep="")]]
parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] -
learningRate* v[[paste("db",l, sep="")]]
}
# Compute for the Lth layer
if(outputActivationFunc=="sigmoid"){
v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] +
v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] +

parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
learningRate* v[[paste("dW",l, sep="")]]
parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
learningRate* v[[paste("db",l, sep="")]]

}else if (outputActivationFunc=="softmax"){
v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] +
v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] +
parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
}
return(parameters)
}

## 3.1a. Stochastic Gradient Descent with Momentum- Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
lbls=[]
pxls=[]
for i in range(60000):
l,p=training[i]
lbls.append(l)
pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T

# Create  a list of random numbers of 1024
permutation = list(np.random.permutation(2**10))
# Subset 16384 from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
layersDimensions=[784, 15,9,10]
# Perform SGD with momentum
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="momentum", beta=0.9,
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig3.png")

## 3.1b. Stochastic Gradient Descent with Momentum- R

source("mnist.R")
source("DLfunctions7.R")
x <- t(train$x) X <- x[,1:60000] y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)

# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
# Perform SGD with momentum
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.05,
optimizer="momentum",
beta=0.9,
mini_batch_size = 512,
num_epochs = 5000,
print_cost = True)


#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs df=data.frame(iterations,costs) ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost") ## 3.1c. Stochastic Gradient Descent with Momentum- Octave source("DL7functions.m") #Load and read MNIST load('./mnist/mnist.txt.gz'); #Create a random permutatation from 60K permutation = randperm(1024); disp(length(permutation)); # Use this 1024 as the batch X=trainX(permutation,:); Y=trainY(permutation,:); # Set layer dimensions layersDimensions=[784, 15, 9, 10]; # Perform SGD with Momentum [weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.01, lrDecay=false, decayRate=1, lambd=0, keep_prob=1, optimizer="momentum", beta=0.9, beta1=0.9, beta2=0.999, epsilon=10^-8, mini_batch_size = 512, num_epochs = 5000); plotCostVsEpochs(5000,costs)  ## 4.1. Stochastic Gradient Descent with RMSProp Stochastic Gradient Descent with RMSProp tries to move faster towards the minima while dampening the oscillations across the ravine. The equations are $s_{dW}^l = \beta_{1} s_{dW}^l + (1-\beta_{1})(dW^{l})^{2}$ $s_{db}^l = \beta_{1} s_{db}^l + (1-\beta_{1})(db^{l})^2$ $W^{l} = W^{l} - \frac {\alpha s_{dW}^l}{\sqrt (s_{dW}^l + \epsilon) }$ $b^{l} = b^{l} - \frac {\alpha s_{db}^l}{\sqrt (s_{db}^l + \epsilon) }$ where $s_{dW}$ and $s_{db}$ are the RMSProp terms which are exponentially weighted with the corresponding gradients ‘dW’ and ‘db’ at the corresponding layer ‘l’ The code snippet in Octave is shown below # Update parameters with RMSProp # Input : parameters # : gradients # : s # : beta # : learningRate # : #output : Updated parameters RMSProp function [weights biases] = gradientDescentWithRMSProp(weights, biases,gradsDW,gradsDB, sdW, sdB, beta1, epsilon, learningRate,outputActivationFunc="sigmoid") L = size(weights)(2); # number of layers in the neural network # Update rule for each parameter. for l=1:(L-1) sdW{l} = beta1*sdW{l} + (1 -beta1) * gradsDW{l} .* gradsDW{l}; sdB{l} = beta1*sdB{l} + (1 -beta1) * gradsDB{l} .* gradsDB{l}; weights{l} = weights{l} - learningRate* gradsDW{l} ./ sqrt(sdW{l} + epsilon); biases{l} = biases{l} - learningRate* gradsDB{l} ./ sqrt(sdB{l} + epsilon); endfor if (strcmp(outputActivationFunc,"sigmoid")) sdW{L} = beta1*sdW{L} + (1 -beta1) * gradsDW{L} .* gradsDW{L}; sdB{L} = beta1*sdB{L} + (1 -beta1) * gradsDB{L} .* gradsDB{L}; weights{L} = weights{L} -learningRate* gradsDW{L} ./ sqrt(sdW{L} +epsilon); biases{L} = biases{L} -learningRate* gradsDB{L} ./ sqrt(sdB{L} + epsilon); elseif (strcmp(outputActivationFunc,"softmax")) sdW{L} = beta1*sdW{L} + (1 -beta1) * gradsDW{L}' .* gradsDW{L}'; sdB{L} = beta1*sdB{L} + (1 -beta1) * gradsDB{L}' .* gradsDB{L}'; weights{L} = weights{L} -learningRate* gradsDW{L}' ./ sqrt(sdW{L} +epsilon); biases{L} = biases{L} -learningRate* gradsDB{L}' ./ sqrt(sdB{L} + epsilon); endif end  ## 4.1a. Stochastic Gradient Descent with RMSProp – Python import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd import sklearn import sklearn.datasets exec(open("DLfunctions7.py").read()) exec(open("load_mnist.py").read()) # Read and load MNIST training=list(read(dataset='training',path=".\\mnist")) test=list(read(dataset='testing',path=".\\mnist")) lbls=[] pxls=[] for i in range(60000): l,p=training[i] lbls.append(l) pxls.append(p) labels= np.array(lbls) pixels=np.array(pxls) y=labels.reshape(-1,1) X=pixels.reshape(pixels.shape[0],-1) X1=X.T Y1=y.T print("X1=",X1.shape) print("y1=",Y1.shape) # Create a list of random numbers of 1024 permutation = list(np.random.permutation(2**10)) # Subset 16384 from the data X2 = X1[:, permutation] Y2 = Y1[:, permutation].reshape((1,2**10)) layersDimensions=[784, 15,9,10] # Use SGD with RMSProp parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax",learningRate = 0.01 , optimizer="rmsprop", beta1=0.7, epsilon=1e-8, mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig4.png") ## 4.1b. Stochastic Gradient Descent with RMSProp – R source("mnist.R") source("DLfunctions7.R") load_mnist() x <- t(train$x)
X <- x[,1:60000]
y <-train$y y1 <- y[1:60000] y2 <- as.matrix(y1) Y=t(y2) # Subset 1024 random samples from MNIST permutation = c(sample(2^10)) # Randomly shuffle the training data X1 = X[, permutation] y1 = Y[1, permutation] y2 <- as.matrix(y1) Y1=t(y2) layersDimensions=c(784, 15,9, 10) #Perform SGD with RMSProp retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='tanh', outputActivationFunc="softmax", learningRate = 0.001, optimizer="rmsprop", beta1=0.9, epsilon=10^-8, mini_batch_size = 512, num_epochs = 5000 , print_cost = True) #Plot the cost vs iterations iterations <- seq(0,5000,1000) costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")


## 4.1c. Stochastic Gradient Descent with RMSProp – Octave

source("DL7functions.m")
#Create a random permutatation from 1024
permutation = randperm(1024);

# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);

# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
#Perform SGD with RMSProp
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.005,
lrDecay=false,
decayRate=1,
lambd=0,
keep_prob=1,
optimizer="rmsprop",
beta=0.9,
beta1=0.9,
beta2=0.999,
epsilon=1,
mini_batch_size = 512,
num_epochs = 5000);
plotCostVsEpochs(5000,costs)


Adaptive Moment Estimate is a combination of the momentum (1st moment) and RMSProp(2nd moment). The equations for Adam are below
$v_{dW}^l = \beta_{1} v_{dW}^l + (1-\beta_{1})dW^{l}$
$v_{db}^l = \beta_{1} v_{db}^l + (1-\beta_{1})db^{l}$
The bias corrections for the 1st moment
$vCorrected_{dW}^l= \frac {v_{dW}^l}{1 - \beta_{1}^{t}}$
$vCorrected_{db}^l= \frac {v_{db}^l}{1 - \beta_{1}^{t}}$

Similarly the moving average for the 2nd moment- RMSProp
$s_{dW}^l = \beta_{2} s_{dW}^l + (1-\beta_{2})(dW^{l})^2$
$s_{db}^l = \beta_{2} s_{db}^l + (1-\beta_{2})(db^{l})^2$
The bias corrections for the 2nd moment
$sCorrected_{dW}^l= \frac {s_{dW}^l}{1 - \beta_{2}^{t}}$
$sCorrected_{db}^l= \frac {s_{db}^l}{1 - \beta_{2}^{t}}$

$W^{l} = W^{l} - \frac {\alpha vCorrected_{dW}^l}{\sqrt (s_{dW}^l + \epsilon) }$
$b^{l} = b^{l} - \frac {\alpha vCorrected_{db}^l}{\sqrt (s_{db}^l + \epsilon) }$
The code snippet of Adam in R is included below

# Perform Gradient Descent with Adam
# Input : Weights and biases
#       : beta1
#       : epsilon
#       : learning rate
#       : outputActivationFunc - Activation function at hidden layer sigmoid/softmax
#output : Updated weights after 1 iteration
beta1=0.9, beta2=0.999, epsilon=10^-8, learningRate=0.1,outputActivationFunc="sigmoid"){

L = length(parameters)/2 # number of layers in the neural network
v_corrected <- list()
s_corrected <- list()
# Update rule for each parameter. Use a for loop.
for(l in 1:(L-1)){
# v['dWk'] = beta *v['dWk'] + (1-beta)*dWk
v[[paste("dW",l, sep="")]] = beta1*v[[paste("dW",l, sep="")]] +
v[[paste("db",l, sep="")]] = beta1*v[[paste("db",l, sep="")]] +

# Compute bias-corrected first moment estimate.
v_corrected[[paste("dW",l, sep="")]] = v[[paste("dW",l, sep="")]]/(1-beta1^t)
v_corrected[[paste("db",l, sep="")]] = v[[paste("db",l, sep="")]]/(1-beta1^t)

# Element wise multiply of gradients
s[[paste("dW",l, sep="")]] = beta2*s[[paste("dW",l, sep="")]] +
s[[paste("db",l, sep="")]] = beta2*s[[paste("db",l, sep="")]] +

# Compute bias-corrected second moment estimate.
s_corrected[[paste("dW",l, sep="")]] = s[[paste("dW",l, sep="")]]/(1-beta2^t)
s_corrected[[paste("db",l, sep="")]] = s[[paste("db",l, sep="")]]/(1-beta2^t)

# Update parameters.
d1=sqrt(s_corrected[[paste("dW",l, sep="")]]+epsilon)
d2=sqrt(s_corrected[[paste("db",l, sep="")]]+epsilon)

parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] -
learningRate * v_corrected[[paste("dW",l, sep="")]]/d1
parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] -
learningRate*v_corrected[[paste("db",l, sep="")]]/d2
}
# Compute for the Lth layer
if(outputActivationFunc=="sigmoid"){
v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] +
v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] +

# Compute bias-corrected first moment estimate.
v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t)
v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t)

# Element wise multiply of gradients
s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] +
s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] +

# Compute bias-corrected second moment estimate.
s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t)
s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t)

# Update parameters.
d1=sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon)
d2=sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon)

parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
learningRate * v_corrected[[paste("dW",L, sep="")]]/d1
parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
learningRate*v_corrected[[paste("db",L, sep="")]]/d2

}else if (outputActivationFunc=="softmax"){
v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] +
v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] +

# Compute bias-corrected first moment estimate.
v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t)
v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t)

# Element wise multiply of gradients
s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] +
s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] +

# Compute bias-corrected second moment estimate.
s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t)
s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t)

# Update parameters.
d1=sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon)
d2=sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon)

parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
learningRate * v_corrected[[paste("dW",L, sep="")]]/d1
parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
learningRate*v_corrected[[paste("db",L, sep="")]]/d2
}
return(parameters)
}


import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
lbls=[]
pxls=[]
print(len(training))
#for i in range(len(training)):
for i in range(60000):
l,p=training[i]
lbls.append(l)
pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T

# Create  a list of random numbers of 1024
permutation = list(np.random.permutation(2**10))
# Subset 16384 from the data
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
layersDimensions=[784, 15,9,10]
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="adam", beta1=0.9, beta2=0.9, epsilon = 1e-8,
mini_batch_size =512, num_epochs = 1000, print_cost = True, figure="fig5.png")

source("mnist.R")
source("DLfunctions7.R")
x <- t(train$x) X <- x[,1:60000] y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)

# Subset 1024 random samples from MNIST
permutation = c(sample(2^10))
# Randomly shuffle the training data
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.005,
beta1=0.7,
beta2=0.9,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000 ,
print_cost = True)
#Plot the cost vs iterations
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs df=data.frame(iterations,costs) ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost") ## 5.1c. Stochastic Gradient Descent with Adam – Octave source("DL7functions.m") load('./mnist/mnist.txt.gz'); #Create a random permutatation from 1024 permutation = randperm(1024); disp(length(permutation)); # Use this 1024 as the batch X=trainX(permutation,:); Y=trainY(permutation,:); # Set layer dimensions layersDimensions=[784, 15, 9, 10]; # Note the high value for epsilon. #Otherwise GD with Adam does not seem to converge # Perform SGD with Adam [weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.1, lrDecay=false, decayRate=1, lambd=0, keep_prob=1, optimizer="adam", beta=0.9, beta1=0.9, beta2=0.9, epsilon=100, mini_batch_size = 512, num_epochs = 5000); plotCostVsEpochs(5000,costs)  Conclusion: In this post I discuss and implement several Stochastic Gradient Descent optimization methods. The implementation of these methods enhance my already existing generic L-Layer Deep Learning Network implementation in vectorized Python, R and Octave, which I had discussed in the previous post in this series on Deep Learning from first principles in Python, R and Octave. Check it out, if you haven’t already. As already mentioned the code for this post can be cloned/forked from Github at DeepLearning-Part7 Watch this space! I’ll be back! To see all post click Index of posts # Deep Learning from first principles in Python, R and Octave – Part 6 “Today you are You, that is truer than true. There is no one alive who is Youer than You.” Dr. Seuss “Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.” H L Mencken # Introduction In this 6th instalment of ‘Deep Learning from first principles in Python, R and Octave-Part6’, I look at a couple of different initialization techniques used in Deep Learning, L2 regularization and the ‘dropout’ method. Specifically, I implement “He initialization” & “Xavier Initialization”. My earlier posts in this series of Deep Learning included 1. Part 1 – In the 1st part, I implemented logistic regression as a simple 2 layer Neural Network 2. Part 2 – In part 2, implemented the most basic of Neural Networks, with just 1 hidden layer, and any number of activation units in that hidden layer. The implementation was in vectorized Python, R and Octave 3. Part 3 -In part 3, I derive the equations and also implement a L-Layer Deep Learning network with either the relu, tanh or sigmoid activation function in Python, R and Octave. The output activation unit was a sigmoid function for logistic classification 4. Part 4 – This part looks at multi-class classification, and I derive the Jacobian of a Softmax function and implement a simple problem to perform multi-class classification. 5. Part 5 – In the 5th part, I extend the L-Layer Deep Learning network implemented in Part 3, to include the Softmax classification. I also use this L-layer implementation to classify MNIST handwritten digits with Python, R and Octave. The code in Python, R and Octave are identical, and just take into account some of the minor idiosyncrasies of the individual language. In this post, I implement different initialization techniques (random, He, Xavier), L2 regularization and finally dropout. Hence my generic L-Layer Deep Learning network includes these additional enhancements for enabling/disabling initialization methods, regularization or dropout in the algorithm. It already included sigmoid & softmax output activation for binary and multi-class classification, besides allowing relu, tanh and sigmoid activation for hidden units. A video presentation of regularization and initialization techniques can be also be viewed in Neural Networks 6 This R Markdown file and the code for Python, R and Octave can be cloned/downloaded from Github at DeepLearning-Part6 Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449). You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning. ## 1. Initialization techniques The usual initialization technique is to generate Gaussian or uniform random numbers and multiply it by a small value like 0.01. Two techniques which are used to speed up convergence is the He initialization or Xavier. These initialization techniques enable gradient descent to converge faster. ## 1.1 a Default initialization – Python This technique just initializes the weights to small random values based on Gaussian or uniform distribution import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd import sklearn import sklearn.datasets exec(open("DLfunctions61.py").read()) #Load the data train_X, train_Y, test_X, test_Y = load_dataset() # Set the layers dimensions layersDimensions = [2,7,1] # Train a deep learning network with random initialization parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid",learningRate = 0.6, num_iterations = 9000, initType="default", print_cost = True,figure="fig1.png") # Clear the plot plt.clf() plt.close() # Plot the decision boundary plot_decision_boundary(lambda x: predict(parameters, x.T), train_X, train_Y,str(0.6),figure1="fig2.png") ## 1.1 b He initialization – Python ‘He’ initialization attributed to He et al, multiplies the random weights by $\sqrt{\frac{2}{dimension\ of\ previous\ layer}}$ import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd import sklearn import sklearn.datasets exec(open("DLfunctions61.py").read()) #Load the data train_X, train_Y, test_X, test_Y = load_dataset() # Set the layers dimensions layersDimensions = [2,7,1] # Train a deep learning network with He initialization parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid", learningRate =0.6, num_iterations = 10000,initType="He",print_cost = True, figure="fig3.png") plt.clf() plt.close() # Plot the decision boundary plot_decision_boundary(lambda x: predict(parameters, x.T), train_X, train_Y,str(0.6),figure1="fig4.png") ## 1.1 c Xavier initialization – Python Xavier initialization multiply the random weights by $\sqrt{\frac{1}{dimension\ of\ previous\ layer}}$ import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd import sklearn import sklearn.datasets exec(open("DLfunctions61.py").read()) #Load the data train_X, train_Y, test_X, test_Y = load_dataset() # Set the layers dimensions layersDimensions = [2,7,1] # Train a L layer Deep Learning network parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid", learningRate = 0.6,num_iterations = 10000, initType="Xavier",print_cost = True, figure="fig5.png") # Plot the decision boundary plot_decision_boundary(lambda x: predict(parameters, x.T), train_X, train_Y,str(0.6),figure1="fig6.png") ## 1.2a Default initialization – R source("DLfunctions61.R") #Load the data z <- as.matrix(read.csv("circles.csv",header=FALSE)) x <- z[,1:2] y <- z[,3] X <- t(x) Y <- t(y) #Set the layer dimensions layersDimensions = c(2,11,1) # Train a deep learning network retvals = L_Layer_DeepModel(X, Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid", learningRate = 0.5, numIterations = 8000, initType="default", print_cost = True) #Plot the cost vs iterations iterations <- seq(0,8000,1000) costs=retvals$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs iterations") + xlab("No of iterations") + ylab("Cost")

# Plot the decision boundary
plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",lr=0.5)

## 1.2b He initialization – R

The code for ‘He’ initilaization in R is included below

# He Initialization model for L layers
# Input : List of units in each layer
# Returns: Initial weights and biases matrices for all layers
# He initilization multiplies the random numbers with sqrt(2/layerDimensions[previouslayer])
HeInitializeDeepModel <- function(layerDimensions){
set.seed(2)

# Initialize empty list
layerParams <- list()

# Note the Weight matrix at layer 'l' is a matrix of size (l,l-1)
# The Bias is a vectors of size (l,1)

# Loop through the layer dimension from 1.. L
# Indices in R start from 1
for(l in 2:length(layersDimensions)){
# Initialize a matrix of small random numbers of size l x l-1
# Create random numbers of size  l x l-1
w=rnorm(layersDimensions[l]*layersDimensions[l-1])

# Create a weight matrix of size l x l-1 with this initial weights and
# Add to list W1,W2... WL
# He initialization - Divide by sqrt(2/layerDimensions[previous layer])
layerParams[[paste('W',l-1,sep="")]] = matrix(w,nrow=layersDimensions[l],
ncol=layersDimensions[l-1])*sqrt(2/layersDimensions[l-1])
layerParams[[paste('b',l-1,sep="")]] = matrix(rep(0,layersDimensions[l]),
nrow=layersDimensions[l],ncol=1)
}
return(layerParams)
}


The code in R below uses He initialization to learn the data

source("DLfunctions61.R")
x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
# Set the layer dimensions
layersDimensions = c(2,11,1)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
numIterations = 9000,
initType="He",
print_cost = True)
#Plot the cost vs iterations
iterations <- seq(0,9000,1000)
costs=retvals$costs df=data.frame(iterations,costs) ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs iterations") + xlab("No of iterations") + ylab("Cost") # Plot the decision boundary plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",0.5,lr=0.5) ## 1.2c Xavier initialization – R ## Xav initialization # Set the layer dimensions layersDimensions = c(2,11,1) # Train a deep learning network retvals = L_Layer_DeepModel(X, Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid", learningRate = 0.5, numIterations = 9000, initType="Xav", print_cost = True) #Plot the cost vs iterations iterations <- seq(0,9000,1000) costs=retvals$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs iterations") + xlab("No of iterations") + ylab("Cost")

# Plot the decision boundary
plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",0.5)

## 1.3a Default initialization – Octave

source("DL61functions.m")

X=data(:,1:2);
Y=data(:,3);
# Set the layer dimensions
layersDimensions = [2 11  1]; #tanh=-0.5(ok), #relu=0.1 best!

# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0,
keep_prob=1,
numIterations = 10000,
initType="default");
# Plot cost vs iterations
plotCostVsIterations(10000,costs)
#Plot decision boundary
plotDecisionBoundary(data,weights, biases,keep_prob=1, hiddenActivationFunc="relu")


## 1.3b He initialization – Octave

source("DL61functions.m")
X=data(:,1:2);
Y=data(:,3);
# Set the layer dimensions
layersDimensions = [2 11  1]; #tanh=-0.5(ok), #relu=0.1 best!

# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0,
keep_prob=1,
numIterations = 8000,
initType="He");
plotCostVsIterations(8000,costs)
#Plot decision boundary
plotDecisionBoundary(data,weights, biases,keep_prob=1,hiddenActivationFunc="relu")


## 1.3c Xavier initialization – Octave

The code snippet for Xavier initialization in Octave is shown below

source("DL61functions.m")
# Xavier Initialization for L layers
# Input : List of units in each layer
# Returns: Initial weights and biases matrices for all layers
function [W b] = XavInitializeDeepModel(layerDimensions)
rand ("seed", 3);
# note the Weight matrix at layer 'l' is a matrix of size (l,l-1)
# The Bias is a vectors of size (l,1)

# Loop through the layer dimension from 1.. L
# Create cell arrays for Weights and biases

for l =2:size(layerDimensions)(2)
W{l-1} = rand(layerDimensions(l),layerDimensions(l-1))* sqrt(1/layerDimensions(l-1)); #  Multiply by .01
b{l-1} = zeros(layerDimensions(l),1);

endfor
end


The Octave code below uses Xavier initialization

source("DL61functions.m")
X=data(:,1:2);
Y=data(:,3);
#Set layer dimensions
layersDimensions = [2 11 1]; #tanh=-0.5(ok), #relu=0.1 best!

# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0,
keep_prob=1,
numIterations = 8000,
initType="Xav");

plotCostVsIterations(8000,costs)
plotDecisionBoundary(data,weights, biases,keep_prob=1,hiddenActivationFunc="relu")



## 2.1a Regularization : Circles data – Python

The cross entropy cost for Logistic classification is given as $J = \frac{1}{m}\sum_{i=1}^{m}y^{i}log((a^{L})^{(i)}) - (1-y^{i})log((a^{L})^{(i)})$ The regularized L2 cost is given by $J = \frac{1}{m}\sum_{i=1}^{m}y^{i}log((a^{L})^{(i)}) - (1-y^{i})log((a^{L})^{(i)}) + \frac{\lambda}{2m}\sum \sum \sum W_{kj}^{l}$

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets

train_X, train_Y, test_X, test_Y = load_dataset()
# Set the layers dimensions
layersDimensions = [2,7,1]

# Train a deep learning network
parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",learningRate = 0.6, lambd=0.1, num_iterations = 9000,
initType="default", print_cost = True,figure="fig7.png")

# Clear the plot
plt.clf()
plt.close()

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), train_X, train_Y,str(0.6),figure1="fig8.png")

plt.clf()
plt.close()
#Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T,keep_prob=0.9), train_X, train_Y,str(2.2),"fig8.png",)

## 2.1 b Regularization: Spiral data  – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
ix = range(N*j,N*(j+1))
t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j

# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.clf()
plt.close()
#Set layer dimensions
layersDimensions = [2,100,3]
y1=y.reshape(-1,1).T
# Train a deep learning network
parameters = L_Layer_DeepModel(X.T, y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax",
learningRate = 1,lambd=1e-3, num_iterations = 5000, print_cost = True,figure="fig9.png")

plt.clf()
plt.close()
W1=parameters['W1']
b1=parameters['b1']
W2=parameters['W2']
b2=parameters['b2']
plot_decision_boundary1(X, y1,W1,b1,W2,b2,figure2="fig10.png")

## 2.2a Regularization: Circles data  – R

source("DLfunctions61.R")
x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
#Set layer dimensions
layersDimensions = c(2,11,1)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0.1,
numIterations = 9000,
initType="default",
print_cost = True)

#Plot the cost vs iterations
iterations <- seq(0,9000,1000)
costs=retvals$costs df=data.frame(iterations,costs) ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") + ggtitle("Costs vs iterations") + xlab("No of iterations") + ylab("Cost") # Plot the decision boundary plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",0.5) ## 2.2b Regularization:Spiral data – R # Read the spiral dataset #Load the data source("DLfunctions61.R") Z <- as.matrix(read.csv("spiral.csv",header=FALSE)) # Setup the data X <- Z[,1:2] y <- Z[,3] X <- t(X) Y <- t(y) layersDimensions = c(2, 100, 3) # Train a deep learning network retvals = L_Layer_DeepModel(X, Y, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.5, lambd=0.01, numIterations = 9000, print_cost = True) print_cost = True) parameters<-retvals$parameters
plotDecisionBoundary1(Z,parameters)

2.3a Regularization: Circles data – Octave

source("DL61functions.m")
X=data(:,1:2);
Y=data(:,3);
layersDimensions = [2 11  1]; #tanh=-0.5(ok), #relu=0.1 best!

# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0.2,
keep_prob=1,
numIterations = 8000,
initType="default");

plotCostVsIterations(8000,costs)
#Plot decision boundary
plotDecisionBoundary(data,weights, biases,keep_prob=1,hiddenActivationFunc="relu")


## 2.3b Regularization:Spiral data  2 – Octave

source("DL61functions.m")

# Setup the data
X=data(:,1:2);
Y=data(:,3);
layersDimensions = [2 100 3]
# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.6,
lambd=0.2,
keep_prob=1,
numIterations = 10000);

plotCostVsIterations(10000,costs)
#Plot decision boundary
plotDecisionBoundary1(data,weights, biases,keep_prob=1,hiddenActivationFunc="relu")


## 3.1 a Dropout: Circles data – Python

The ‘dropout’ regularization technique was used with great effectiveness, to prevent overfitting  by Alex Krizhevsky, Ilya Sutskever and Prof Geoffrey E. Hinton in the Imagenet classification with Deep Convolutional Neural Networks

The technique of dropout works by dropping a random set of activation units in each hidden layer, based on a ‘keep_prob’ criteria in the forward propagation cycle. Here is the code for Octave. A ‘dropoutMat’ is created for each layer which specifies which units to drop Note: The same ‘dropoutMat has to be used which computing the gradients in the backward propagation cycle. Hence the dropout matrices are stored in a cell array.

 for l =1:L-1
...
D=rand(size(A)(1),size(A)(2));
D = (D < keep_prob) ;
# Zero out some hidden units
A= A .* D;
# Divide by keep_prob to keep the expected value of A the same
A = A ./ keep_prob;
# Store D in a dropoutMat cell array
dropoutMat{l}=D;
...
endfor

In the backward propagation cycle we have

    for l =(L-1):-1:1
...
D = dropoutMat{l};
# Zero out the dAl based on same dropout matrix
dAl= dAl .* D;
# Divide by keep_prob to maintain the expected value
dAl = dAl ./ keep_prob;
...
endfor

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
train_X, train_Y, test_X, test_Y = load_dataset()
# Set the layers dimensions
layersDimensions = [2,7,1]

# Train a deep learning network
parameters = L_Layer_DeepModel(train_X, train_Y, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",learningRate = 0.6, keep_prob=0.7, num_iterations = 9000,
initType="default", print_cost = True,figure="fig11.png")

# Clear the plot
plt.clf()
plt.close()

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T,keep_prob=0.7), train_X, train_Y,str(0.6),figure1="fig12.png")


### 3.1b Dropout: Spiral data – Python

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
# Create an input data set - Taken from CS231n Convolutional Neural networks,
# http://cs231n.github.io/neural-networks-case-study/

N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
ix = range(N*j,N*(j+1))
t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j

# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.clf()
plt.close()
layersDimensions = [2,100,3]
y1=y.reshape(-1,1).T
# Train a deep learning network
parameters = L_Layer_DeepModel(X.T, y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax",
learningRate = 1,keep_prob=0.9, num_iterations = 5000, print_cost = True,figure="fig13.png")

plt.clf()
plt.close()
W1=parameters['W1']
b1=parameters['b1']
W2=parameters['W2']
b2=parameters['b2']
#Plot decision boundary
plot_decision_boundary1(X, y1,W1,b1,W2,b2,figure2="fig14.png")

## 3.2a Dropout: Circles data – R

source("DLfunctions61.R")

x <- z[,1:2]
y <- z[,3]
X <- t(x)
Y <- t(y)
layersDimensions = c(2,11,1)
# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
keep_prob=0.8,
numIterations = 9000,
initType="default",
print_cost = True)
# Plot the decision boundary
plotDecisionBoundary(z,retvals,keep_prob=0.6, hiddenActivationFunc="relu",0.5)

## 3.2b Dropout: Spiral data – R

# Read the spiral dataset
source("DLfunctions61.R")

# Setup the data
X <- Z[,1:2]
y <- Z[,3]
X <- t(X)
Y <- t(y)

# Train a deep learning network
retvals = L_Layer_DeepModel(X, Y, layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.1,
keep_prob=0.90,
numIterations = 9000,
print_cost = True)


parameters<-retvals\$parameters
#Plot decision boundary
plotDecisionBoundary1(Z,parameters)

## 3.3a Dropout: Circles data – Octave

data=csvread("circles.csv");

X=data(:,1:2);
Y=data(:,3);
layersDimensions = [2 11  1]; #tanh=-0.5(ok), #relu=0.1 best!

# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="sigmoid",
learningRate = 0.5,
lambd=0,
keep_prob=0.8,
numIterations = 10000,
initType="default");
plotCostVsIterations(10000,costs)
#Plot decision boundary
plotDecisionBoundary1(data,weights, biases,keep_prob=1, hiddenActivationFunc="relu")


## 3.3b Dropout  Spiral data – Octave

source("DL61functions.m")

# Setup the data
X=data(:,1:2);
Y=data(:,3);

layersDimensions = [numFeats numHidden  numOutput];
# Train a deep learning network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.1,
lambd=0,
keep_prob=0.8,
numIterations = 10000);

plotCostVsIterations(10000,costs)
#Plot decision boundary
plotDecisionBoundary1(data,weights, biases,keep_prob=1, hiddenActivationFunc="relu")


Note: The Python, R and Octave code can be cloned/downloaded from Github at DeepLearning-Part6
Conclusion
This post further enhances my earlier L-Layer generic implementation of a Deep Learning network to include options for initialization techniques, L2 regularization or dropout regularization

To see all posts click Index of posts