# Introduction

In this post, ‘Deep Learning from first principles with Python, R and Octave – Part 7’, I implement optimization methods used with Stochastic Gradient Descent (SGD) to speed up convergence. Specifically, I discuss and implement the following gradient descent optimization techniques:

a. Vanilla Stochastic Gradient Descent

b. Learning rate decay

c. Momentum method

d. RMSProp

e. Adaptive Moment Estimation (Adam)

This post further enhances my generic L-Layer Deep Learning Network implementations in vectorized Python, R and Octave to also include the Stochastic Gradient Descent optimization techniques. You can clone/download the code from Github at DeepLearning-Part7.

Incidentally, a good discussion of the various optimization methods used with Stochastic Gradient Descent can be found on Sebastian Ruder’s blog.

**Note**: In the vectorized Python, R and Octave implementations below, only 1024 random training samples were used, to reduce the computation time. You are free to use the entire data set (60000 training samples) for the computation.

This post is largely based on Prof Andrew Ng’s Deep Learning Specialization. All the above optimization techniques for Stochastic Gradient Descent are based on the exponentially weighted average method. For example, if we have some time series data $\theta_1, \theta_2, \ldots, \theta_t$, then we can represent the exponentially weighted average $v_t$ at time ‘t’ in terms of the previous average and the current data point, as shown below

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$

Here $v_t$ represents the average of the data over roughly the last $\frac{1}{1-\beta}$ points. By choosing different values of $\beta$, we can average over a larger or smaller number of data points.

We can write the equations as follows

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$

and

$$v_{t-1} = \beta v_{t-2} + (1-\beta)\theta_{t-1}$$

By substitution we have

$$v_t = (1-\beta)\theta_t + \beta(1-\beta)\theta_{t-1} + \beta^2 v_{t-2}$$

Hence it can be seen that $v_t$ is a weighted sum over the previous values $\theta_k$, with weights $(1-\beta)\beta^{t-k}$, which is an exponentially decaying function.
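To make the recurrence concrete, here is a minimal Python sketch. It assumes the usual form of the exponentially weighted average, v_t = beta * v_(t-1) + (1 - beta) * theta_t with v_0 = 0, and checks that the running value matches the expanded sum with exponentially decaying weights. The function names are illustrative and are not part of DLfunctions7.py.

```python
import numpy as np

def exponentially_weighted_average(theta, beta=0.9):
    """Running average v_t = beta*v_{t-1} + (1-beta)*theta_t, starting from v_0 = 0."""
    v = 0.0
    averages = []
    for x in theta:
        v = beta * v + (1.0 - beta) * x
        averages.append(v)
    return np.array(averages)

def expanded_weighted_sum(theta, beta=0.9):
    """Last average written directly as a sum with exponentially decaying weights."""
    theta = np.asarray(theta, dtype=float)
    n = len(theta)
    # The value k steps in the past gets the weight (1-beta)*beta**k
    weights = (1.0 - beta) * beta ** np.arange(n)
    return float(np.sum(weights * theta[::-1]))

rng = np.random.default_rng(0)
theta = rng.normal(size=50)            # hypothetical time series
v = exponentially_weighted_average(theta, beta=0.9)
direct = expanded_weighted_sum(theta, beta=0.9)
print(v[-1], direct)                   # the two values should agree
```

A larger beta spreads the weights over more past points, which gives a smoother but slower-reacting average.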

Check out my book ‘Deep Learning from first principles – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($16.99) and in Kindle version ($6.65/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python – Machine Learning in stereo”, available on Amazon in paperback ($9.99) and Kindle ($6.99) versions. This book is ideal as a quick reference for the various ML functions and associated measurements in both R and Python, which are essential before delving into Deep Learning.

## 1.1a. Stochastic Gradient Descent (Vanilla) – Python

```
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
# Read the training data
training=list(read(dataset='training',path=".\\mnist"))
test=list(read(dataset='testing',path=".\\mnist"))
lbls=[]
pxls=[]
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
permutation = list(np.random.permutation(2**10))
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
# Set the layer dimensions
layersDimensions=[784, 15,9,10]
# Perform SGD with regular gradient descent
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="gd",
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig1.png")
```
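The call above passes `mini_batch_size=512`, so each epoch works through the 1024 samples in two mini-batches. The partitioning itself happens inside DLfunctions7.py and is not shown in this post. The following is only a sketch of how such a shuffle-and-split step could look, assuming samples are stored column-wise as in `X2` and `Y2`. The helper name `random_mini_batches` and the random data are hypothetical.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=512, seed=0):
    """Shuffle the columns of X and Y together and cut them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                      # number of samples (one per column)
    permutation = rng.permutation(m)
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]
    batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size   # the last batch may be smaller
        batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return batches

# Stand-in data with the same shapes as X2 (784 x 1024) and Y2 (1 x 1024)
X_demo = np.random.rand(784, 1024)
Y_demo = np.random.randint(0, 10, size=(1, 1024))
batches = random_mini_batches(X_demo, Y_demo, mini_batch_size=512)
print(len(batches), batches[0][0].shape, batches[0][1].shape)
```

Reshuffling with a different seed in each epoch is what makes every epoch see slightly different mini-batches.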

## 1.1b. Stochastic Gradient Descent (Vanilla) – R

```
source("mnist.R")
source("DLfunctions7.R")
#Load and read MNIST data
load_mnist()
x <- t(train$x)
X <- x[,1:60000]
y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)
permutation = c(sample(2^10))
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
# Set layer dimensions
layersDimensions=c(784, 15,9, 10)
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.05,
optimizer="gd",
mini_batch_size = 512,
num_epochs = 5000,
print_cost = TRUE)
```

```
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
    ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")
```

## 1.1c. Stochastic Gradient Descent (Vanilla) – Octave

```
source("DL7functions.m")
#Load and read MNIST
load('./mnist/mnist.txt.gz');
#Create a random permutation from 1024
permutation = randperm(1024);
disp(length(permutation));
# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with regular gradient descent
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
    hiddenActivationFunc='relu',
    outputActivationFunc="softmax",
    learningRate = 0.005,
    lrDecay=true,
    decayRate=1,
    lambd=0,
    keep_prob=1,
    optimizer="gd",
    beta=0.9,
    beta1=0.9,
    beta2=0.999,
    epsilon=10^-8,
    mini_batch_size = 512,
    num_epochs = 5000);
plotCostVsEpochs(5000,costs);
```

## 2.1. Stochastic Gradient Descent with Learning rate decay

Since Stochastic Gradient Descent uses slightly different samples in each epoch, the gradient descent algorithm oscillates across the ravines and wanders around the minima when a fixed learning rate is used. In the ‘learning rate decay’ technique, the learning rate is slowly decreased with the number of epochs, so that gradient descent takes smaller and smaller steps as it approaches the minima.

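There are several techniques employed in learning rate decay. Below, $\alpha_0$ is the initial learning rate and $\alpha$ is the decayed learning rate.

a) Exponential decay: $\alpha = \text{decayRate}^{\,\text{epochNum}} \cdot \alpha_0$

b) 1/t decay: $\alpha = \dfrac{\alpha_0}{1 + \text{decayRate} \cdot \text{epochNum}}$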

c)

In my implementation I have used ‘exponential decay’. The code snippet for Python is shown below

```
if lrDecay == True:
    learningRate = np.power(decayRate,(num_epochs/1000)) * learningRate
```
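To see how quickly the learning rate shrinks under a given decay rate, here is a small Python sketch of the exponential and 1/t schedules. It assumes the standard forms alpha = decayRate^epoch * alpha0 and alpha = alpha0 / (1 + decayRate * epoch). The function names and the sample decay rates are illustrative, and the exponent here is the plain epoch number rather than the `num_epochs/1000` scaling used in the snippet above.

```python
import numpy as np

def exponential_decay(alpha0, decay_rate, epoch):
    """alpha = decay_rate**epoch * alpha0"""
    return alpha0 * np.power(decay_rate, epoch)

def inverse_time_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate*epoch)"""
    return alpha0 / (1.0 + decay_rate * epoch)

alpha0 = 0.01
for epoch in (0, 1000, 5000):
    print(epoch,
          exponential_decay(alpha0, 0.9999, epoch),
          inverse_time_decay(alpha0, 0.001, epoch))
```

With a decay rate as close to 1 as 0.9999, the exponential schedule changes the learning rate only gradually, which is why such values are paired with thousands of epochs.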

## 2.1a. Stochastic Gradient Descent with Learning rate decay – Python

```
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
# Read the MNIST data
training=list(read(dataset='training',path=".\\mnist"))
test=list(read(dataset='testing',path=".\\mnist"))
lbls=[]
pxls=[]
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
permutation = list(np.random.permutation(2**10))
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
# Set layer dimensions
layersDimensions=[784, 15,9,10]
# Perform SGD with learning rate decay
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",
learningRate = 0.01 , lrDecay=True, decayRate=0.9999,
optimizer="gd",
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig2.png")
```

## 2.1b. Stochastic Gradient Descent with Learning rate decay – R

```
source("mnist.R")
source("DLfunctions7.R")
# Read and load MNIST
load_mnist()
x <- t(train$x)
X <- x[,1:60000]
y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)
permutation = c(sample(2^10))
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
# Set layer dimensions
layersDimensions=c(784, 15,9, 10)
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.05,
lrDecay=TRUE,
decayRate=0.9999,
optimizer="gd",
mini_batch_size = 512,
num_epochs = 5000,
print_cost = TRUE)
```

```
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")
```

## 2.1c. Stochastic Gradient Descent with Learning rate decay – Octave

```
source("DL7functions.m")
#Load and read MNIST
load('./mnist/mnist.txt.gz');
#Create a random permutation from 1024
permutation = randperm(1024);
disp(length(permutation));
# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with learning rate decay
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
    hiddenActivationFunc='relu',
    outputActivationFunc="softmax",
    learningRate = 0.01,
    lrDecay=true,
    decayRate=0.999,
    lambd=0,
    keep_prob=1,
    optimizer="gd",
    beta=0.9,
    beta1=0.9,
    beta2=0.999,
    epsilon=10^-8,
    mini_batch_size = 512,
    num_epochs = 5000);
plotCostVsEpochs(5000,costs)
```

## 3.1. Stochastic Gradient Descent with Momentum

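Stochastic Gradient Descent with Momentum uses the exponentially weighted average method discussed above and, more generally, moves faster into the ravine than across it. The equations are

$$v_{dW}^{l} = \beta v_{dW}^{l} + (1-\beta)\,dW^{l}$$

$$v_{db}^{l} = \beta v_{db}^{l} + (1-\beta)\,db^{l}$$

$$W^{l} = W^{l} - \alpha v_{dW}^{l}$$

$$b^{l} = b^{l} - \alpha v_{db}^{l}$$

where $v_{dW}^{l}$ and $v_{db}^{l}$ are the momentum terms, which are exponentially weighted averages of the corresponding gradients ‘dW’ and ‘db’ at the corresponding layer ‘l’, and $\alpha$ is the learning rate. The code snippet for Stochastic Gradient Descent with momentum in R is shown below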

```
# Perform Gradient Descent with momentum
# Input : Weights and biases
#       : beta
#       : gradients
#       : learning rate
#       : outputActivationFunc - Activation function at output layer sigmoid/softmax
#output : Updated weights after 1 iteration
gradientDescentWithMomentum <- function(parameters, gradients,v, beta, learningRate,outputActivationFunc="sigmoid"){
    L = length(parameters)/2 # number of layers in the neural network
    # Update rule for each parameter. Use a for loop.
    for(l in 1:(L-1)){
        # Compute velocities
        # v['dWk'] = beta *v['dWk'] + (1-beta)*dWk
        v[[paste("dW",l, sep="")]] = beta*v[[paste("dW",l, sep="")]] +
            (1-beta) * gradients[[paste('dW',l,sep="")]]
        v[[paste("db",l, sep="")]] = beta*v[[paste("db",l, sep="")]] +
            (1-beta) * gradients[[paste('db',l,sep="")]]
        parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] -
            learningRate* v[[paste("dW",l, sep="")]]
        parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] -
            learningRate* v[[paste("db",l, sep="")]]
    }
    # Compute for the Lth layer
    if(outputActivationFunc=="sigmoid"){
        v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] +
            (1-beta) * gradients[[paste('dW',L,sep="")]]
        v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] +
            (1-beta) * gradients[[paste('db',L,sep="")]]
        # Use the velocity of layer L (not the last hidden layer l)
        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate* v[[paste("dW",L, sep="")]]
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate* v[[paste("db",L, sep="")]]
    }else if (outputActivationFunc=="softmax"){
        # The softmax gradients are stored transposed
        v[[paste("dW",L, sep="")]] = beta*v[[paste("dW",L, sep="")]] +
            (1-beta) * t(gradients[[paste('dW',L,sep="")]])
        v[[paste("db",L, sep="")]] = beta*v[[paste("db",L, sep="")]] +
            (1-beta) * t(gradients[[paste('db',L,sep="")]])
        # Update with the velocity rather than the raw gradient
        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate* v[[paste("dW",L, sep="")]]
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate* v[[paste("db",L, sep="")]]
    }
    return(parameters)
}
```
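For comparison, the same momentum update can be sketched in Python. This is only an illustration, not the code from DLfunctions7.py: it assumes parameters, gradients and velocities are kept in dictionaries keyed as `W1`, `b1`, `dW1`, `db1` and so on, and that all gradients already have the same shape as their parameters (so no transpose is needed for the output layer). The function name `update_with_momentum` is hypothetical.

```python
import numpy as np

def update_with_momentum(parameters, gradients, v, beta=0.9, learning_rate=0.01):
    """One momentum step: v = beta*v + (1-beta)*grad, then param = param - lr*v."""
    L = len(parameters) // 2            # number of layers (one W and one b per layer)
    for l in range(1, L + 1):
        for name in ("W", "b"):
            grad_key = "d" + name + str(l)
            # Exponentially weighted average of the gradients
            v[grad_key] = beta * v[grad_key] + (1 - beta) * gradients[grad_key]
            # Step along the averaged gradient
            parameters[name + str(l)] = (parameters[name + str(l)]
                                         - learning_rate * v[grad_key])
    return parameters, v

# Toy single-layer example with all gradients equal to 1
parameters = {"W1": np.ones((2, 2)), "b1": np.zeros((2, 1))}
gradients = {"dW1": np.ones((2, 2)), "db1": np.ones((2, 1))}
v = {key: np.zeros_like(value) for key, value in gradients.items()}
parameters, v = update_with_momentum(parameters, gradients, v,
                                     beta=0.9, learning_rate=0.1)
print(parameters["W1"])
```

Returning `v` along with the parameters lets the caller carry the velocity forward to the next mini-batch.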

## 3.1a. Stochastic Gradient Descent with Momentum- Python

```
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
# Read and load data
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
training=list(read(dataset='training',path=".\\mnist"))
test=list(read(dataset='testing',path=".\\mnist"))
lbls=[]
pxls=[]
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
permutation = list(np.random.permutation(2**10))
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
layersDimensions=[784, 15,9,10]
# Perform SGD with momentum
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="momentum", beta=0.9,
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig3.png")
```

## 3.1b. Stochastic Gradient Descent with Momentum- R

```
source("mnist.R")
source("DLfunctions7.R")
load_mnist()
x <- t(train$x)
X <- x[,1:60000]
y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)
permutation = c(sample(2^10))
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
# Perform SGD with momentum
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.05,
optimizer="momentum",
beta=0.9,
mini_batch_size = 512,
num_epochs = 5000,
print_cost = TRUE)
```

```
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")
```

## 3.1c. Stochastic Gradient Descent with Momentum- Octave

```
source("DL7functions.m")
#Load and read MNIST
load('./mnist/mnist.txt.gz');
#Create a random permutation from 1024
permutation = randperm(1024);
disp(length(permutation));
# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Perform SGD with Momentum
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
    hiddenActivationFunc='relu',
    outputActivationFunc="softmax",
    learningRate = 0.01,
    lrDecay=false,
    decayRate=1,
    lambd=0,
    keep_prob=1,
    optimizer="momentum",
    beta=0.9,
    beta1=0.9,
    beta2=0.999,
    epsilon=10^-8,
    mini_batch_size = 512,
    num_epochs = 5000);
plotCostVsEpochs(5000,costs)
```

## 4.1. Stochastic Gradient Descent with RMSProp

Stochastic Gradient Descent with RMSProp tries to move faster towards the minima while dampening the oscillations across the ravine.

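The equations are

$$s_{dW}^{l} = \beta_1 s_{dW}^{l} + (1-\beta_1)\,(dW^{l})^2$$

$$s_{db}^{l} = \beta_1 s_{db}^{l} + (1-\beta_1)\,(db^{l})^2$$

$$W^{l} = W^{l} - \alpha \frac{dW^{l}}{\sqrt{s_{dW}^{l} + \epsilon}}$$

$$b^{l} = b^{l} - \alpha \frac{db^{l}}{\sqrt{s_{db}^{l} + \epsilon}}$$

where $s_{dW}^{l}$ and $s_{db}^{l}$ are the RMSProp terms, which are exponentially weighted averages of the element-wise squares of the corresponding gradients ‘dW’ and ‘db’ at the corresponding layer ‘l’, and $\alpha$ is the learning rate.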

The code snippet in Octave is shown below

```
# Update parameters with RMSProp
# Input : weights, biases
#       : gradsDW, gradsDB - gradients
#       : sdW, sdB - RMSProp terms
#       : beta1
#       : epsilon
#       : learningRate
#       : outputActivationFunc - Activation function at output layer sigmoid/softmax
#output : Updated weights and biases after RMSProp
function [weights biases] = gradientDescentWithRMSProp(weights, biases,gradsDW,gradsDB, sdW, sdB, beta1, epsilon, learningRate,outputActivationFunc="sigmoid")
    L = size(weights)(2); # number of layers in the neural network
    # Update rule for each parameter.
    for l=1:(L-1)
        sdW{l} = beta1*sdW{l} + (1 -beta1) * gradsDW{l} .* gradsDW{l};
        sdB{l} = beta1*sdB{l} + (1 -beta1) * gradsDB{l} .* gradsDB{l};
        weights{l} = weights{l} - learningRate* gradsDW{l} ./ sqrt(sdW{l} + epsilon);
        biases{l} = biases{l} - learningRate* gradsDB{l} ./ sqrt(sdB{l} + epsilon);
    endfor
    if (strcmp(outputActivationFunc,"sigmoid"))
        sdW{L} = beta1*sdW{L} + (1 -beta1) * gradsDW{L} .* gradsDW{L};
        sdB{L} = beta1*sdB{L} + (1 -beta1) * gradsDB{L} .* gradsDB{L};
        weights{L} = weights{L} -learningRate* gradsDW{L} ./ sqrt(sdW{L} +epsilon);
        biases{L} = biases{L} -learningRate* gradsDB{L} ./ sqrt(sdB{L} + epsilon);
    elseif (strcmp(outputActivationFunc,"softmax"))
        # The softmax gradients are stored transposed
        sdW{L} = beta1*sdW{L} + (1 -beta1) * gradsDW{L}' .* gradsDW{L}';
        sdB{L} = beta1*sdB{L} + (1 -beta1) * gradsDB{L}' .* gradsDB{L}';
        weights{L} = weights{L} -learningRate* gradsDW{L}' ./ sqrt(sdW{L} +epsilon);
        biases{L} = biases{L} -learningRate* gradsDB{L}' ./ sqrt(sdB{L} + epsilon);
    endif
end
```
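The RMSProp step can also be sketched in Python. As with the Octave snippet above, epsilon is added inside the square root. The dictionary layout (`W1`, `b1`, `dW1`, `db1`), the assumption that gradients have the same shape as their parameters, and the function name `update_with_rmsprop` are illustrative choices, not the contents of DLfunctions7.py.

```python
import numpy as np

def update_with_rmsprop(parameters, gradients, s, beta1=0.9,
                        epsilon=1e-8, learning_rate=0.01):
    """One RMSProp step: s = beta1*s + (1-beta1)*grad**2,
    then param = param - lr * grad / sqrt(s + epsilon)."""
    L = len(parameters) // 2            # number of layers
    for l in range(1, L + 1):
        for name in ("W", "b"):
            grad_key = "d" + name + str(l)
            grad = gradients[grad_key]
            # Exponentially weighted average of the squared gradients
            s[grad_key] = beta1 * s[grad_key] + (1 - beta1) * grad * grad
            # Scale the step by the root of the running average
            parameters[name + str(l)] = (parameters[name + str(l)]
                                         - learning_rate * grad / np.sqrt(s[grad_key] + epsilon))
    return parameters, s

# Toy single-layer example with all gradients equal to 1
parameters = {"W1": np.ones((2, 2)), "b1": np.zeros((2, 1))}
gradients = {"dW1": np.ones((2, 2)), "db1": np.ones((2, 1))}
s = {key: np.zeros_like(value) for key, value in gradients.items()}
parameters, s = update_with_rmsprop(parameters, gradients, s,
                                    beta1=0.9, epsilon=1e-8, learning_rate=0.01)
print(parameters["W1"])
```

Because the gradient is divided by the root of its own running mean square, directions with consistently large gradients take relatively smaller steps, which is what dampens the oscillations across the ravine.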

## 4.1a. Stochastic Gradient Descent with RMSProp – Python

```
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
# Read and load MNIST
training=list(read(dataset='training',path=".\\mnist"))
test=list(read(dataset='testing',path=".\\mnist"))
lbls=[]
pxls=[]
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
print("X1=",X1.shape)
print("y1=",Y1.shape)
permutation = list(np.random.permutation(2**10))
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
layersDimensions=[784, 15,9,10]
# Use SGD with RMSProp
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="rmsprop", beta1=0.7, epsilon=1e-8,
mini_batch_size =512, num_epochs = 1000, print_cost = True,figure="fig4.png")
```

## 4.1b. Stochastic Gradient Descent with RMSProp – R

```
source("mnist.R")
source("DLfunctions7.R")
load_mnist()
x <- t(train$x)
X <- x[,1:60000]
y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)
permutation = c(sample(2^10))
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
#Perform SGD with RMSProp
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.001,
optimizer="rmsprop",
beta1=0.9,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000 ,
print_cost = TRUE)
```

```
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")
```

## 4.1c. Stochastic Gradient Descent with RMSProp – Octave

```
source("DL7functions.m")
load('./mnist/mnist.txt.gz');
#Create a random permutation from 1024
permutation = randperm(1024);
# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
#Perform SGD with RMSProp
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
    hiddenActivationFunc='relu',
    outputActivationFunc="softmax",
    learningRate = 0.005,
    lrDecay=false,
    decayRate=1,
    lambd=0,
    keep_prob=1,
    optimizer="rmsprop",
    beta=0.9,
    beta1=0.9,
    beta2=0.999,
    epsilon=1,
    mini_batch_size = 512,
    num_epochs = 5000);
plotCostVsEpochs(5000,costs)
```

## 5.1. Stochastic Gradient Descent with Adam

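Adaptive Moment Estimation (Adam) is a combination of momentum (1st moment) and RMSProp (2nd moment). The equations for Adam are below, written for ‘dW’; the equations for ‘db’ are analogous. The moving average for the 1st moment (momentum) is

$$v_{dW}^{l} = \beta_1 v_{dW}^{l} + (1-\beta_1)\,dW^{l}$$

The bias correction for the 1st moment

$$v_{dW}^{l,\,corrected} = \frac{v_{dW}^{l}}{1-\beta_1^{t}}$$

Similarly, the moving average for the 2nd moment (RMSProp)

$$s_{dW}^{l} = \beta_2 s_{dW}^{l} + (1-\beta_2)\,(dW^{l})^2$$

The bias correction for the 2nd moment

$$s_{dW}^{l,\,corrected} = \frac{s_{dW}^{l}}{1-\beta_2^{t}}$$

The Adam Gradient Descent is given by

$$W^{l} = W^{l} - \alpha \frac{v_{dW}^{l,\,corrected}}{\sqrt{s_{dW}^{l,\,corrected} + \epsilon}}$$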

The code snippet of Adam in R is included below

```
# Perform Gradient Descent with Adam
# Input : Weights and biases
#       : beta1
#       : epsilon
#       : gradients
#       : learning rate
#       : outputActivationFunc - Activation function at output layer sigmoid/softmax
#output : Updated weights after 1 iteration
gradientDescentWithAdam <- function(parameters, gradients,v, s, t,
                                    beta1=0.9, beta2=0.999, epsilon=10^-8, learningRate=0.1,outputActivationFunc="sigmoid"){
    L = length(parameters)/2 # number of layers in the neural network
    v_corrected <- list()
    s_corrected <- list()
    # Update rule for each parameter. Use a for loop.
    for(l in 1:(L-1)){
        # v['dWk'] = beta *v['dWk'] + (1-beta)*dWk
        v[[paste("dW",l, sep="")]] = beta1*v[[paste("dW",l, sep="")]] +
            (1-beta1) * gradients[[paste('dW',l,sep="")]]
        v[[paste("db",l, sep="")]] = beta1*v[[paste("db",l, sep="")]] +
            (1-beta1) * gradients[[paste('db',l,sep="")]]
        # Compute bias-corrected first moment estimate.
        v_corrected[[paste("dW",l, sep="")]] = v[[paste("dW",l, sep="")]]/(1-beta1^t)
        v_corrected[[paste("db",l, sep="")]] = v[[paste("db",l, sep="")]]/(1-beta1^t)
        # Element wise multiply of gradients
        s[[paste("dW",l, sep="")]] = beta2*s[[paste("dW",l, sep="")]] +
            (1-beta2) * gradients[[paste('dW',l,sep="")]] * gradients[[paste('dW',l,sep="")]]
        s[[paste("db",l, sep="")]] = beta2*s[[paste("db",l, sep="")]] +
            (1-beta2) * gradients[[paste('db',l,sep="")]] * gradients[[paste('db',l,sep="")]]
        # Compute bias-corrected second moment estimate.
        s_corrected[[paste("dW",l, sep="")]] = s[[paste("dW",l, sep="")]]/(1-beta2^t)
        s_corrected[[paste("db",l, sep="")]] = s[[paste("db",l, sep="")]]/(1-beta2^t)
        # Update parameters.
        d1=sqrt(s_corrected[[paste("dW",l, sep="")]]+epsilon)
        d2=sqrt(s_corrected[[paste("db",l, sep="")]]+epsilon)
        parameters[[paste("W",l,sep="")]] = parameters[[paste("W",l,sep="")]] -
            learningRate * v_corrected[[paste("dW",l, sep="")]]/d1
        parameters[[paste("b",l,sep="")]] = parameters[[paste("b",l,sep="")]] -
            learningRate*v_corrected[[paste("db",l, sep="")]]/d2
    }
    # Compute for the Lth layer
    if(outputActivationFunc=="sigmoid"){
        v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] +
            (1-beta1) * gradients[[paste('dW',L,sep="")]]
        v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] +
            (1-beta1) * gradients[[paste('db',L,sep="")]]
        # Compute bias-corrected first moment estimate.
        v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t)
        v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t)
        # Element wise multiply of gradients
        s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] +
            (1-beta2) * gradients[[paste('dW',L,sep="")]] * gradients[[paste('dW',L,sep="")]]
        s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] +
            (1-beta2) * gradients[[paste('db',L,sep="")]] * gradients[[paste('db',L,sep="")]]
        # Compute bias-corrected second moment estimate.
        s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t)
        s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t)
        # Update parameters.
        d1=sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon)
        d2=sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon)
        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate * v_corrected[[paste("dW",L, sep="")]]/d1
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate*v_corrected[[paste("db",L, sep="")]]/d2
    }else if (outputActivationFunc=="softmax"){
        # The softmax gradients are stored transposed
        v[[paste("dW",L, sep="")]] = beta1*v[[paste("dW",L, sep="")]] +
            (1-beta1) * t(gradients[[paste('dW',L,sep="")]])
        v[[paste("db",L, sep="")]] = beta1*v[[paste("db",L, sep="")]] +
            (1-beta1) * t(gradients[[paste('db',L,sep="")]])
        # Compute bias-corrected first moment estimate.
        v_corrected[[paste("dW",L, sep="")]] = v[[paste("dW",L, sep="")]]/(1-beta1^t)
        v_corrected[[paste("db",L, sep="")]] = v[[paste("db",L, sep="")]]/(1-beta1^t)
        # Element wise multiply of gradients
        s[[paste("dW",L, sep="")]] = beta2*s[[paste("dW",L, sep="")]] +
            (1-beta2) * t(gradients[[paste('dW',L,sep="")]]) * t(gradients[[paste('dW',L,sep="")]])
        s[[paste("db",L, sep="")]] = beta2*s[[paste("db",L, sep="")]] +
            (1-beta2) * t(gradients[[paste('db',L,sep="")]]) * t(gradients[[paste('db',L,sep="")]])
        # Compute bias-corrected second moment estimate.
        s_corrected[[paste("dW",L, sep="")]] = s[[paste("dW",L, sep="")]]/(1-beta2^t)
        s_corrected[[paste("db",L, sep="")]] = s[[paste("db",L, sep="")]]/(1-beta2^t)
        # Update parameters.
        d1=sqrt(s_corrected[[paste("dW",L, sep="")]]+epsilon)
        d2=sqrt(s_corrected[[paste("db",L, sep="")]]+epsilon)
        parameters[[paste("W",L,sep="")]] = parameters[[paste("W",L,sep="")]] -
            learningRate * v_corrected[[paste("dW",L, sep="")]]/d1
        parameters[[paste("b",L,sep="")]] = parameters[[paste("b",L,sep="")]] -
            learningRate*v_corrected[[paste("db",L, sep="")]]/d2
    }
    return(parameters)
}
```
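A compact Python sketch of one Adam step is given below for comparison. It follows the same order as the R function above (moving averages, bias correction, then the update with epsilon inside the square root). The dictionary layout, the assumption that gradients match the shapes of their parameters, and the function name `update_with_adam` are illustrative and not taken from DLfunctions7.py.

```python
import numpy as np

def update_with_adam(parameters, gradients, v, s, t, beta1=0.9, beta2=0.999,
                     epsilon=1e-8, learning_rate=0.01):
    """One Adam step at iteration t (t starts at 1)."""
    L = len(parameters) // 2            # number of layers
    for l in range(1, L + 1):
        for name in ("W", "b"):
            grad_key = "d" + name + str(l)
            grad = gradients[grad_key]
            # 1st moment: moving average of the gradients (momentum)
            v[grad_key] = beta1 * v[grad_key] + (1 - beta1) * grad
            # 2nd moment: moving average of the squared gradients (RMSProp)
            s[grad_key] = beta2 * s[grad_key] + (1 - beta2) * grad * grad
            # Bias corrections compensate for the zero initialisation of v and s
            v_corrected = v[grad_key] / (1 - beta1 ** t)
            s_corrected = s[grad_key] / (1 - beta2 ** t)
            parameters[name + str(l)] = (parameters[name + str(l)]
                                         - learning_rate * v_corrected / np.sqrt(s_corrected + epsilon))
    return parameters, v, s

# Toy single-layer example with all gradients equal to 1
parameters = {"W1": np.ones((2, 2)), "b1": np.zeros((2, 1))}
gradients = {"dW1": np.ones((2, 2)), "db1": np.ones((2, 1))}
v = {key: np.zeros_like(value) for key, value in gradients.items()}
s = {key: np.zeros_like(value) for key, value in gradients.items()}
parameters, v, s = update_with_adam(parameters, gradients, v, s, t=1,
                                    learning_rate=0.01)
print(parameters["W1"])
```

At t = 1 the bias corrections undo the (1 - beta) scaling, so the first step has a size close to the learning rate regardless of the gradient's magnitude.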

## 5.1a. Stochastic Gradient Descent with Adam – Python

```
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
import sklearn
import sklearn.datasets
exec(open("DLfunctions7.py").read())
exec(open("load_mnist.py").read())
training=list(read(dataset='training',path=".\\mnist"))
test=list(read(dataset='testing',path=".\\mnist"))
lbls=[]
pxls=[]
print(len(training))
for i in range(60000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)
y=labels.reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)
X1=X.T
Y1=y.T
permutation = list(np.random.permutation(2**10))
X2 = X1[:, permutation]
Y2 = Y1[:, permutation].reshape((1,2**10))
layersDimensions=[784, 15,9,10]
#Perform SGD with Adam optimization
parameters = L_Layer_DeepModel_SGD(X2, Y2, layersDimensions, hiddenActivationFunc='relu',
outputActivationFunc="softmax",learningRate = 0.01 ,
optimizer="adam", beta1=0.9, beta2=0.9, epsilon = 1e-8,
mini_batch_size =512, num_epochs = 1000, print_cost = True, figure="fig5.png")
```

## 5.1b. Stochastic Gradient Descent with Adam – R

```
source("mnist.R")
source("DLfunctions7.R")
load_mnist()
x <- t(train$x)
X <- x[,1:60000]
y <-train$y
y1 <- y[1:60000]
y2 <- as.matrix(y1)
Y=t(y2)
permutation = c(sample(2^10))
X1 = X[, permutation]
y1 = Y[1, permutation]
y2 <- as.matrix(y1)
Y1=t(y2)
layersDimensions=c(784, 15,9, 10)
#Perform SGD with Adam
retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions,
hiddenActivationFunc='tanh',
outputActivationFunc="softmax",
learningRate = 0.005,
optimizer="adam",
beta1=0.7,
beta2=0.9,
epsilon=10^-8,
mini_batch_size = 512,
num_epochs = 5000 ,
print_cost = TRUE)
```

```
iterations <- seq(0,5000,1000)
costs=retvalsSGD$costs
df=data.frame(iterations,costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
ggtitle("Costs vs number of epochs") + xlab("No of epochs") + ylab("Cost")
```

## 5.1c. Stochastic Gradient Descent with Adam – Octave

```
source("DL7functions.m")
load('./mnist/mnist.txt.gz');
#Create a random permutation from 1024
permutation = randperm(1024);
disp(length(permutation));
# Use this 1024 as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Note the high value for epsilon.
# Otherwise GD with Adam does not seem to converge
# Perform SGD with Adam
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
    hiddenActivationFunc='relu',
    outputActivationFunc="softmax",
    learningRate = 0.1,
    lrDecay=false,
    decayRate=1,
    lambd=0,
    keep_prob=1,
    optimizer="adam",
    beta=0.9,
    beta1=0.9,
    beta2=0.9,
    epsilon=100,
    mini_batch_size = 512,
    num_epochs = 5000);
plotCostVsEpochs(5000,costs)
```

**Conclusion**: In this post I discuss and implement several Stochastic Gradient Descent optimization methods. The implementation of these methods enhances my existing generic L-Layer Deep Learning Network implementation in vectorized Python, R and Octave, which I discussed in the previous post in this series on Deep Learning from first principles in Python, R and Octave. Check it out, if you haven’t already. As already mentioned, the code for this post can be cloned/forked from Github at DeepLearning-Part7.

Watch this space! I’ll be back!
