Deep Learning from first principles in Python, R and Octave – Part 2

“What does the world outside your head really ‘look’ like? Not only is there no color, there’s also no sound: the compression and expansion of air is picked up by the ears, and turned into electrical signals. The brain then presents these signals to us as mellifluous tones and swishes and clatters and jangles. Reality is also odorless: there’s no such thing as smell outside our brains. Molecules floating through the air bind to receptors in our nose and are interpreted as different smells by our brain. The real world is not full of rich sensory events; instead, our brains light up the world with their own sensuality.”
– “The Brain: The Story of You” by David Eagleman

The world is Maya, illusory. The ultimate reality, the Brahman, is all-pervading and all-permeating; it is colourless, odourless, tasteless, nameless and formless.

1. Introduction

This post is a follow-up to my earlier post Deep Learning from first principles in Python, R and Octave – Part 1. In the first part, I implemented Logistic Regression in vectorized Python, R and Octave as a wannabe Neural Network (a Neural Network with no hidden layers). In this second part, I implement a regular, but somewhat primitive, Neural Network with just 1 hidden layer. This part performs classification on manually created datasets in which the clusters of the 2 classes are not linearly separable.

Neural Networks perform really well in learning all sorts of non-linear boundaries between classes. Initially, logistic regression is used to perform the classification and the decision boundary is plotted; vanilla logistic regression performs quite poorly. Using SVMs with a radial basis kernel would have done much better in creating non-linear boundaries. To see R and Python implementations of SVMs, take a look at my post Practical Machine Learning with R and Python – Part 4.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

Take a look at my video presentation, which discusses the derivation below step-by-step: Elements of Neural Networks and Deep Learning – Part 3

You can clone and fork this R Markdown file along with the vectorized implementations of the 3 layer Neural Network for Python, R and Octave from Github DeepLearning-Part2

2. The 3 layer Neural Network

A simple representation of a 3 layer Neural Network (NN) with 1 hidden layer is shown below.

In the above Neural Network, there are 2 features at the input layer, 3 hidden units at the hidden layer and 1 unit at the output layer, since this deals with binary classification. The activation at the hidden layer can be tanh, sigmoid, relu etc. At the output layer the activation is a sigmoid, to handle binary classification.

# Superscript indicates layer 1
$z_{11} = w_{11}^{1}x_{1} + w_{21}^{1}x_{2} + b_{1}$
$z_{12} = w_{12}^{1}x_{1} + w_{22}^{1}x_{2} + b_{1}$
$z_{13} = w_{13}^{1}x_{1} + w_{23}^{1}x_{2} + b_{1}$

Also $a_{11} = tanh(z_{11})$
$a_{12} = tanh(z_{12})$
$a_{13} = tanh(z_{13})$

# Superscript indicates layer 2
$z_{21} = w_{11}^{2}a_{11} + w_{21}^{2}a_{12} + w_{31}^{2}a_{13} + b_{2}$
$a_{21} = sigmoid(z_{21})$

Hence
$Z1= \begin{pmatrix} z_{11}\\ z_{12}\\ z_{13} \end{pmatrix} =\begin{pmatrix} w_{11}^{1} & w_{21}^{1} \\ w_{12}^{1} & w_{22}^{1} \\ w_{13}^{1} & w_{23}^{1} \end{pmatrix} * \begin{pmatrix} x_{1}\\ x_{2} \end{pmatrix} + b_{1}$
And
$A1= \begin{pmatrix} a_{11}\\ a_{12}\\ a_{13} \end{pmatrix} = \begin{pmatrix} tanh(z_{11})\\ tanh(z_{12})\\ tanh(z_{13}) \end{pmatrix}$

Similarly
$Z2= z_{21} = \begin{pmatrix} w_{11}^{2} & w_{21}^{2} & w_{31}^{2} \end{pmatrix} *\begin{pmatrix} a_{11}\\ a_{12}\\ a_{13} \end{pmatrix} +b_{2}$
and $A2 = a_{21} = sigmoid(z_{21})$

These equations can be written as
$Z1 = W1 * X + b1$
$A1 = tanh(Z1)$
$Z2 = W2 * A1 + b2$
$A2 = sigmoid(Z2)$
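A minimal numpy sketch of this forward pass is shown below. It assumes W1 is (3 x 2), b1 is (3 x 1), W2 is (1 x 3) and b2 is (1 x 1); the function and variable names are illustrative and are not the exact ones used in DLfunctions.py.

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

def forward_pass(X, W1, b1, W2, b2):
    # X: (2, m) input features for m samples
    Z1 = np.dot(W1, X) + b1      # (3, m); b1 is broadcast across the columns
    A1 = np.tanh(Z1)             # hidden layer activations
    Z2 = np.dot(W2, A1) + b2     # (1, m)
    A2 = sigmoid(Z2)             # output probabilities
    return Z1, A1, Z2, A2

# Try it on 5 random samples
np.random.seed(1)
X = np.random.randn(2, 5)
W1, b1 = np.random.randn(3, 2)*0.01, np.zeros((3, 1))
W2, b2 = np.random.randn(1, 3)*0.01, np.zeros((1, 1))
Z1, A1, Z2, A2 = forward_pass(X, W1, b1, W2, b2)
print(A2.shape)   # (1, 5)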

I) Some important results (a memory refresher!)
$d/dx(e^{x}) = e^{x}$ and $d/dx(e^{-x}) = -e^{-x}$ -(a) and
$sinhx = (e^{x} - e^{-x})/2$ and $coshx = (e^{x} + e^{-x})/2$
Using (a) we can show that $d/dx(sinhx) = coshx$ and $d/dx(coshx) = sinhx$ -(b)
Now $d/dx(f(x)/g(x)) = (g(x)*d/dx(f(x)) - f(x)*d/dx(g(x)))/g(x)^{2}$ -(c)

Since $tanhx = sinhx/coshx$, using the quotient rule (c) we get
$d/dx(tanhx) = (coshx*d/dx(sinhx) - sinhx*d/dx(coshx))/cosh^{2}x$
Using the values of the derivatives of sinhx and coshx from (b) above we get
$d/dx(tanhx) = (cosh^{2}x - sinh^{2}x)/cosh^{2}x = 1 - tanh^{2}x$
Since $tanhx = z$
$d/dx(tanhx) = 1 - tanh^{2}x = 1 - z^{2}$ -(d)
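As a quick sanity check of result (d), the short sketch below compares a numerical derivative of tanh against $1 - tanh^{2}x$ on an arbitrary grid of points; the grid and tolerance are illustrative.

import numpy as np

z = np.linspace(-3, 3, 7)
h = 1e-6
numerical = (np.tanh(z + h) - np.tanh(z - h))/(2*h)   # central-difference derivative
analytical = 1 - np.tanh(z)**2                        # result (d)
print(np.allclose(numerical, analytical, atol=1e-6))  # True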

II) Derivatives
$L=-(Ylog(A2) + (1-Y)log(1-A2))$
$dL/dA2 = -(Y/A2 - (1-Y)/(1-A2))$
Since $A2 = sigmoid(Z2)$, therefore $dA2/dZ2 = A2(1-A2)$ (see Part 1)
$Z2 = W2A1 +b2$
$dZ2/dW2 = A1$
$dZ2/db2 = 1$
$A1 = tanh(Z1)$ and $dA1/dZ1 = 1 - A1^{2}$
$Z1 = W1X + b1$
$dZ1/dW1 = X$
$dZ1/db1 = 1$

III) Back propagation
Using the derivatives from II) we can derive the following results using Chain Rule
$\partial L/\partial Z2 = \partial L/\partial A2 * \partial A2/\partial Z2$
$= -(Y/A2 - (1-Y)/(1-A2)) * A2(1-A2) = A2 - Y$
$\partial L/\partial W2 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial W2$
$= (A2-Y) *A1$ -(A)
$\partial L/\partial b2 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial b2 = (A2-Y)$ -(B)

$\partial L/\partial Z1 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial A1 *\partial A1/\partial Z1 = (A2-Y) * W2 * (1-A1^{2})$
$\partial L/\partial W1 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial A1 *\partial A1/\partial Z1 *\partial Z1/\partial W1$
$=(A2-Y) * W2 * (1-A1^{2}) * X$ -(C)
$\partial L/\partial b1 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial A1 *\partial A1/\partial Z1 *\partial Z1/\partial b1$
$= (A2-Y) * W2 * (1-A1^{2})$ -(D)

The key computations in the backward cycle are
$W1 = W1-learningRate * \partial L/\partial W1$ – From (C)
$b1 = b1-learningRate * \partial L/\partial b1$ – From (D)
$W2 = W2-learningRate * \partial L/\partial W2$ – From (A)
$b2 = b2-learningRate * \partial L/\partial b2$ – From (B)

The weights and biases (W1,b1,W2,b2) are updated for each iteration thus minimizing the loss/cost.
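A minimal numpy sketch of this backward cycle is given below. It follows equations (A)-(D), with the gradients averaged over the m training samples (the derivations above are per sample); the names are illustrative and are not the exact ones used in DLfunctions.py.

import numpy as np

def backward_pass(X, Y, A1, A2, W1, b1, W2, b2, learningRate):
    # X: (2, m), Y: (1, m), A1: (3, m), A2: (1, m), W2: (1, 3)
    m = X.shape[1]
    dZ2 = A2 - Y                                         # (1, m)
    dW2 = (1.0/m) * np.dot(dZ2, A1.T)                    # equation (A)
    db2 = (1.0/m) * np.sum(dZ2, axis=1, keepdims=True)   # equation (B)
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1**2)                # (3, m)
    dW1 = (1.0/m) * np.dot(dZ1, X.T)                     # equation (C)
    db1 = (1.0/m) * np.sum(dZ1, axis=1, keepdims=True)   # equation (D)
    # Gradient descent update of weights and biases
    W1 = W1 - learningRate*dW1
    b1 = b1 - learningRate*db1
    W2 = W2 - learningRate*dW2
    b2 = b2 - learningRate*db2
    return W1, b1, W2, b2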

These derivations can be represented pictorially using a computation graph (in the style of the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville)

3. Manually create a data set that is not linearly separable

Initially I create a dataset with 2 classes built from several blob clusters, which cannot be separated by linear boundaries. Note: This data set is saved as data.csv and is reused for the R and Octave Neural Networks, to see how they perform on the same dataset.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
import sklearn
import sklearn.datasets

colors=['black','gold']
cmap = matplotlib.colors.ListedColormap(colors)
X, y = make_blobs(n_samples = 400, n_features = 2, centers = 7,
cluster_std = 1.3, random_state = 4)
#Create 2 classes
y=y.reshape(400,1)
y = y % 2
#Plot the figure
plt.figure()
plt.title('Non-linearly separable classes')
plt.scatter(X[:,0], X[:,1], c=y,
marker= 'o', s=50,cmap=cmap)
plt.savefig('fig1.png', bbox_inches='tight')

4. Logistic Regression

On the dataset created above, classification with logistic regression is performed and the decision boundary is plotted. It can be seen that logistic regression performs quite poorly.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
import sklearn
import sklearn.datasets

#from DLfunctions import plot_decision_boundary
execfile("./DLfunctions.py") # Since import does not work in Rmd!!!

colors=['black','gold']
cmap = matplotlib.colors.ListedColormap(colors)
X, y = make_blobs(n_samples = 400, n_features = 2, centers = 7,
cluster_std = 1.3, random_state = 4)
#Create 2 classes
y=y.reshape(400,1)
y = y % 2

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV();
clf.fit(X, y);

# Plot the decision boundary for logistic regression
plot_decision_boundary_n(lambda x: clf.predict(x), X.T, y.T,"fig2.png")


5. The 3 layer Neural Network in Python (vectorized)

The vectorized implementation is included below. Note that in the case of Python, a learning rate of 0.5 and 4 hidden units performs very well.

## Random data set with 9 clusters
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd

from sklearn.datasets import make_classification, make_blobs
execfile("./DLfunctions.py") # Since import does not work in Rmd!!!

X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,
cluster_std = 1.3, random_state = 4)
#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T

parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=0.5, numIterations = 10000)
plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(4),str(0.5),"fig3.png")
## Cost after iteration 0: 0.692669
## Cost after iteration 1000: 0.246650
## Cost after iteration 2000: 0.227801
## Cost after iteration 3000: 0.226809
## Cost after iteration 4000: 0.226518
## Cost after iteration 5000: 0.226331
## Cost after iteration 6000: 0.226194
## Cost after iteration 7000: 0.226085
## Cost after iteration 8000: 0.225994
## Cost after iteration 9000: 0.225915

6. The 3 layer Neural Network in R (vectorized)

For this, the dataset created in Python was saved as data.csv, to see how R performs on the same dataset. The vectorized implementation of the Neural Network in R was a little more interesting, as R does not have a package similar to 'numpy'. While numpy handles broadcasting implicitly, in R I had to use the 'sweep' command to broadcast; a small numpy sketch of this implicit broadcasting is shown below, followed by the R implementation. Note that since the initialization with random weights is slightly different, R performs best with a learning rate of 0.1 and with 6 hidden units.
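The following snippet illustrates the implicit numpy broadcasting that the R code emulates with sweep(); the shapes are chosen to mirror the bias addition Z1 = W1 * X + b1 for 3 hidden units and 2 samples (the values are arbitrary).

import numpy as np

Z = np.arange(6.).reshape(3, 2)       # (3, 2) pre-activations for 2 samples
b = np.array([[10.], [20.], [30.]])   # (3, 1) bias, one per hidden unit
print(Z + b)                          # b is broadcast across the 2 columns
# [[10. 11.]
#  [22. 23.]
#  [34. 35.]]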

source("DLfunctions2_1.R")
z <- as.matrix(read.csv("data.csv",header=FALSE)) #
x <- z[,1:2]
y <- z[,3]
x1 <- t(x)
y1 <- t(y)
nn <-computeNN(x1, y1, 6, learningRate=0.1,numIterations=10000) # Good
## [1] 0.7075341
## [1] 0.2606695
## [1] 0.2198039
## [1] 0.2091238
## [1] 0.211146
## [1] 0.2108461
## [1] 0.2105351
## [1] 0.210211
## [1] 0.2099104
## [1] 0.2096437
## [1] 0.209409
plotDecisionBoundary(z,nn,6,0.1)

7.  The 3 layer Neural Network in Octave (vectorized)

This uses the same dataset that was generated using Python code.
source("DL-function2.m") data=csvread("data.csv"); X=data(:,1:2); Y=data(:,3); # Make sure that the model parameters are correct. Take the transpose of X & Y
#Perform gradient descent [W1,b1,W2,b2,costs]= computeNN(X', Y',4, learningRate=0.5, numIterations = 10000);

8a. Performance  for different learning rates (Python)

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd

from sklearn.datasets import make_classification, make_blobs
execfile("./DLfunctions.py") # Since import does not work in Rmd!!!
# Create data
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,
cluster_std = 1.3, random_state = 4)
#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T
# Create a list of learning rates
learningRate=[0.5,1.2,3.0]
df=pd.DataFrame()
#Compute costs for each learning rate
for lr in learningRate:
    parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=lr, numIterations = 10000)
    print(costs)
    df1=pd.DataFrame(costs)
    df=pd.concat([df,df1],axis=1)
#Set the iterations
iterations=[0,1000,2000,3000,4000,5000,6000,7000,8000,9000]
#Create data frame
#Set index
df1=df.set_index([iterations])
df1.columns=[0.5,1.2,3.0]
fig=df1.plot()
fig=plt.title("Cost vs No of Iterations for different learning rates")
plt.savefig('fig4.png', bbox_inches='tight')

8b. Performance  for different hidden units (Python)

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd

from sklearn.datasets import make_classification, make_blobs
execfile("./DLfunctions.py") # Since import does not work in Rmd!!!
#Create data set
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,
cluster_std = 1.3, random_state = 4)
#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T
# Make a list of hidden units
numHidden=[3,5,7]
df=pd.DataFrame()
#Compute costs for different hidden units
for numHid in numHidden:
    parameters,costs = computeNN(X2, Y2, numHidden = numHid, learningRate=1.2, numIterations = 10000)
    print(costs)
    df1=pd.DataFrame(costs)
    df=pd.concat([df,df1],axis=1)
#Set the iterations
iterations=[0,1000,2000,3000,4000,5000,6000,7000,8000,9000]
#Set index
df1=df.set_index([iterations])
df1.columns=[3,5,7]
#Plot
fig=df1.plot()
fig=plt.title("Cost vs No of Iterations for different no of hidden units")
plt.savefig('fig5.png', bbox_inches='tight')

9a. Performance  for different learning rates (R)

source("DLfunctions2_1.R")
x <- z[,1:2]
y <- z[,3]
x1 <- t(x)
y1 <- t(y)
#Loop through learning rates and compute costs
learningRate <-c(0.1,1.2,3.0)
df <- NULL
for(i in seq_along(learningRate)){
nn <-  computeNN(x1, y1, 6, learningRate=learningRate[i],numIterations=10000)
cost <- nn$costs
df <- cbind(df,cost)
}
#Create dataframe
df <- data.frame(df)
iterations=seq(0,10000,by=1000)
df <- cbind(iterations,df)
names(df) <- c("iterations","0.1","1.2","3.0")
library(reshape2)
# Melt the data
df1 <- melt(df,id="iterations")
#Plot
ggplot(df1) + geom_line(aes(x=iterations,y=value,colour=variable),size=1) +
xlab("Iterations") +
ylab('Cost') + ggtitle("Cost vs No of iterations for different learning rates")

9b. Performance for different hidden units (R)

source("DLfunctions2_1.R")
# Loop through Num hidden units
numHidden <-c(4,6,9)
df <- NULL
for(i in seq_along(numHidden)){
nn <- computeNN(x1, y1, numHidden[i], learningRate=0.1,numIterations=10000)
cost <- nn$costs
df <- cbind(df,cost)

}      
df <- data.frame(df)
iterations=seq(0,10000,by=1000)
df <- cbind(iterations,df)
names(df) <- c("iterations","4","6","9")
library(reshape2)
# Melt
df1 <- melt(df,id="iterations")
# Plot
ggplot(df1) + geom_line(aes(x=iterations,y=value,colour=variable),size=1)  +
xlab("Iterations") +
ylab('Cost') + ggtitle("Cost vs No iterations for  different number of hidden units")

10a. Performance of the Neural Network for different learning rates (Octave)

source("DL-function2.m") plotLRCostVsIterations() print -djph figa.jpg

10b. Performance of the Neural Network for different number of hidden units (Octave)

source("DL-function2.m") plotHiddenCostVsIterations() print -djph figa.jpg

11. Turning the heat on the Neural Network

In this 2nd part, I create a central region of positives with the outside region as negatives. The points are generated using the equation of a circle $(x-a)^{2} + (y-b)^{2} = R^{2}$. How does the 3 layer Neural Network perform on this? Here's a look! Note: The same dataset is also used for the R and Octave Neural Network constructions.

12. Manually creating a circular central region

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
import sklearn
import sklearn.datasets

colors=['black','gold']
cmap = matplotlib.colors.ListedColormap(colors)
x1=np.random.uniform(0,10,800).reshape(800,1)
x2=np.random.uniform(0,10,800).reshape(800,1)
X=np.append(x1,x2,axis=1)
X.shape
# Create (x-a)^2 + (y-b)^2 = R^2
# Create a subset of values where the squared distance from (5,5) is <= 6. Perform ravel() to flatten this vector
a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel()
Y=a.reshape(800,1)

cmap = matplotlib.colors.ListedColormap(colors)

plt.figure()
plt.title('Non-linearly separable classes')
plt.scatter(X[:,0], X[:,1], c=Y,
marker= 'o', s=15,cmap=cmap)
plt.savefig('fig6.png', bbox_inches='tight')

13a. Decision boundary with hidden units=4 and learning rate = 2.2 (Python)

With the above hyperparameters the decision boundary is triangular

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model
execfile("./DLfunctions.py")
x1=np.random.uniform(0,10,800).reshape(800,1)
x2=np.random.uniform(0,10,800).reshape(800,1)
X=np.append(x1,x2,axis=1)
X.shape

# Create a subset of values where the squared distance from (5,5) is <= 6. Perform ravel() to flatten this vector
a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel()
Y=a.reshape(800,1)

X2=X.T
Y2=Y.T

parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=2.2, numIterations = 10000)
plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(4),str(2.2),"fig7.png")

## Cost after iteration 0: 0.692836
## Cost after iteration 1000: 0.331052
## Cost after iteration 2000: 0.326428
## Cost after iteration 3000: 0.474887
## Cost after iteration 4000: 0.247989
## Cost after iteration 5000: 0.218009
## Cost after iteration 6000: 0.201034
## Cost after iteration 7000: 0.197030
## Cost after iteration 8000: 0.193507
## Cost after iteration 9000: 0.191949

13b. Decision boundary with hidden units=12 and learning rate = 2.2 (Python)

With the above hyperparameters the decision boundary is triangular

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model
execfile("./DLfunctions.py")
x1=np.random.uniform(0,10,800).reshape(800,1)
x2=np.random.uniform(0,10,800).reshape(800,1)
X=np.append(x1,x2,axis=1)
X.shape

# Create a subset of values where the squared distance from (5,5) is <= 6. Perform ravel() to flatten this vector
a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel()
Y=a.reshape(800,1)

X2=X.T
Y2=Y.T

parameters,costs = computeNN(X2, Y2, numHidden = 12, learningRate=2.2, numIterations = 10000)
plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(12),str(2.2),"fig8.png")

## Cost after iteration 0: 0.693291
## Cost after iteration 1000: 0.383318
## Cost after iteration 2000: 0.298807
## Cost after iteration 3000: 0.251735
## Cost after iteration 4000: 0.177843
## Cost after iteration 5000: 0.130414
## Cost after iteration 6000: 0.152400
## Cost after iteration 7000: 0.065359
## Cost after iteration 8000: 0.050921
## Cost after iteration 9000: 0.039719

14a. Decision boundary with hidden units=9 and learning rate = 0.5 (R)

When the number of hidden units is 9 and the learning rate is 0.5, the decision boundary is also triangular in R

source("DLfunctions2_1.R")
x <- z[,1:2]
y <- z[,3]
x1 <- t(x)
y1 <- t(y)
nn <-computeNN(x1, y1, 9, learningRate=0.5,numIterations=10000) # Triangular
## [1] 0.8398838
## [1] 0.3303621
## [1] 0.3127731
## [1] 0.3012791
## [1] 0.3305543
## [1] 0.3303964
## [1] 0.2334615
## [1] 0.1920771
## [1] 0.2341225
## [1] 0.2188118
## [1] 0.2082687
plotDecisionBoundary(z,nn,9,0.5)

14b. Decision boundary with hidden units=8 and learning rate = 0.1 (R)

source("DLfunctions2_1.R")
x <- z[,1:2]
y <- z[,3]
x1 <- t(x)
y1 <- t(y)
nn <-computeNN(x1, y1, 8, learningRate=0.1,numIterations=10000) # Hemisphere
## [1] 0.7273279
## [1] 0.3169335
## [1] 0.2378464
## [1] 0.1688635
## [1] 0.1368466
## [1] 0.120664
## [1] 0.111211
## [1] 0.1043362
## [1] 0.09800573
## [1] 0.09126161
## [1] 0.0840379
plotDecisionBoundary(z,nn,8,0.1)

15a. Decision boundary with hidden units=12 and learning rate = 1.5 (Octave)

source("DL-function2.m") data=csvread("data1.csv"); X=data(:,1:2); Y=data(:,3); # Make sure that the model parameters are correct. Take the transpose of X & Y [W1,b1,W2,b2,costs]= computeNN(X', Y',12, learningRate=1.5, numIterations = 10000); plotDecisionBoundary(data, W1,b1,W2,b2) print -djpg fige.jpg

Conclusion: This post implemented a 3 layer Neural Network to create non-linear decision boundaries while performing classification. The Neural Network handles these non-linear boundaries well, and its performance can be seen to vary with the number of hidden units and the learning rate.

To be continued…
Watch this space!!

To see all posts check Index of posts

Deep Learning from first principles in Python, R and Octave – Part 1

“You don’t perceive objects as they are. You perceive them as you are.”
“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”

                          David Eagleman - The Brain: The Story of You

This is the first in a series of posts I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton, also on Coursera. In this post I implement Logistic Regression as a 2 layer Neural Network, i.e. a Neural Network that just has an input layer and an output layer, with no hidden layer. I am certain that any self-respecting Deep Learning/Neural Network practitioner would consider a Neural Network without hidden layers as no Neural Network at all!

This 2 layer network is implemented in Python, R and Octave. I have included Octave in the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent vectorized implementations, so if you are familiar with any one of the languages, you should be able to follow the corresponding code in the other two. You can download this R Markdown file and Octave code from Github at DeepLearning -Part 1

Check out my video presentation which discusses the derivations in detail
1. Elements of Neural Networks and Deep Learning – Part 1
2. Elements of Neural Networks and Deep Learning – Part 2

To start with, Logistic Regression is performed using sklearn's logistic regression package on the cancer data set, also from sklearn. This is shown below

1. Logistic Regression

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs, load_breast_cancer

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
# Call the Logisitic Regression function
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96

To check on other classification algorithms, check my post Practical Machine Learning with R and Python – Part 2.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($14.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

2. Logistic Regression as a 2 layer Neural Network

In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below

The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.

This simple 2 layer Neural Network is shown below
At the input layer there are 30 features and the corresponding weights of these inputs which are initialized to small random values.
$Z= w_{1}x_{1} +w_{2}x_{2} +..+ w_{30}x_{30} + b$
where ‘b’ is the bias term

The Activation function is the sigmoid function which is $a= 1/(1+e^{-z})$
The Loss, when the sigmoid function is used in the output layer, is given by
$L=-(ylog(a) + (1-y)log(1-a))$ (1)

Forward propagation

In the forward propagation cycle of the Neural Network, the output Z and the output of the activation function (the sigmoid), A, are first computed. Then, using the target 'y' for the given features, the 'Loss' is computed using equation (1) above.

Backward propagation

The backward propagation cycle determines how the 'Loss' is impacted by small variations in the weights, from the output back up to the input layer. In other words, backward propagation computes the changes in the weights at the input layer which will minimize the loss. Several cycles of gradient descent are performed along the path of steepest descent to find a local minimum; that is, gradient descent computes the set of weights and biases at the input layer which results in the lowest loss. In each cycle the weights are decreased by a fraction of the gradient known as the 'learning rate'. Too big a 'learning rate' can overshoot the local minimum, and too small a 'learning rate' can take a long time to reach it. This is done over the 'm' training examples.
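As a toy, one-dimensional illustration of this learning rate trade-off (this sketch is not part of the post's code), the snippet below runs gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.

# Gradient of f(w) = (w - 3)^2
def grad(w):
    return 2*(w - 3)

for lr in (0.01, 0.1, 1.1):     # too small, reasonable, too large
    w = 0.0
    for _ in range(25):
        w = w - lr*grad(w)      # gradient descent update
    print(lr, round(w, 4))
# lr=0.01 crawls towards 3, lr=0.1 gets close to 3, lr=1.1 overshoots and diverges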

Chain rule of differentiation
Let y=f(u)
and u=g(x) then
$\partial y/\partial x = \partial y/\partial u * \partial u/\partial x$

Derivative of sigmoid
$\sigma=1/(1+e^{-z})$
Let $x= 1 + e^{-z}$  then
$\sigma = 1/x$
$\partial \sigma/\partial x = -1/x^{2}$
$\partial x/\partial z = -e^{-z}$
Using the chain rule of differentiation we get
$\partial \sigma/\partial z = \partial \sigma/\partial x * \partial x/\partial z$
$=-1/(1+e^{-z})^{2}* -e^{-z} = e^{-z}/(1+e^{-z})^{2}$
Therefore $\partial \sigma/\partial z = \sigma(1-\sigma)$        -(2)
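A quick numerical check of result (2), on an arbitrary grid of z values:

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

z = np.linspace(-4, 4, 9)
h = 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h))/(2*h)    # central-difference derivative
analytical = sigmoid(z)*(1 - sigmoid(z))               # sigma * (1 - sigma)
print(np.allclose(numerical, analytical, atol=1e-6))   # True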

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are
$L=-(y*log(a) + (1-y)*log(1-a))$      -(a)
$a=1/(1+e^{-Z})$      -(b)
$Z= w_{1}x_{1} +w_{2}x_{2} +...+ w_{30}x_{30} +b = \sum_{i} w_{i}x_{i} + b$ -(c)

The back propagation step requires the computation of $dL/dw_{i}$ and $dL/db$. In the case of regression it would be $dE/dw_{i}$ and $dE/db$, where E is the Mean Squared Error function.
Computing the derivatives for back propagation we have
$dL/da = -(y/a - (1-y)/(1-a))$          -(d)
because $d/dx(logx) = 1/x$
Also from equation (2) we get
$da/dZ = a (1-a)$                                  – (e)
By chain rule
$\partial L/\partial Z = \partial L/\partial a * \partial a/\partial Z$
therefore substituting the results of (d) & (e) we get
$\partial L/\partial Z = -(y/a - (1-y)/(1-a)) * a(1-a) = a-y$         (f)
Finally
$\partial L/\partial w_{i}= \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial w_{i}$                                                           -(g)
$\partial Z/\partial w_{i} = x_{i}$            – (h)
and from (f) we have  $\partial L/\partial Z =a-y$
Therefore  (g) reduces to
$\partial L/\partial w_{i} = x_{i}* (a-y)$ -(i)
Also
$\partial L/\partial b = \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial b$ -(j)
Since
$\partial Z/\partial b = 1$ and using (f) in (j)
$\partial L/\partial b = a-y$

Gradient descent updates the weights at the input layer and the corresponding bias by using the values
of $dw_{i}$ and $db$
$w_{i} := w_{i} -\alpha * dw_{i}$
$b := b -\alpha * db$
I found the computation graph representation in the book Deep Learning: Ian Goodfellow, Yoshua Bengio, Aaron Courville, very useful to visualize and also compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

3. Neural Network for Logistic Regression -Python code (vectorized)

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Define the sigmoid function
def sigmoid(z):
    a=1/(1+np.exp(-z))
    return a

# Initialize the weight vector and bias
def initialize(dim):
    w = np.zeros(dim).reshape(dim,1)
    b = 0
    return w,b

# Compute the loss
def computeLoss(numTraining,Y,A):
    loss=-1/numTraining *np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
    return(loss)

# Execute the forward propagation
def forwardPropagation(w,b,X,Y):
    # Compute Z
    Z=np.dot(w.T,X)+b
    # Determine the number of training samples
    numTraining=float(len(X))
    # Compute the output of the sigmoid activation function
    A=sigmoid(Z)
    #Compute the loss
    loss = computeLoss(numTraining,Y,A)
    # Compute the gradients dZ, dw and db
    dZ=A-Y
    dw=1/numTraining*np.dot(X,dZ.T)
    db=1/numTraining*np.sum(dZ)

    # Return the gradients and the loss as a dictionary
    gradients = {"dw": dw,
                 "db": db}
    loss = np.squeeze(loss)
    return gradients,loss

# Perform gradient descent
def gradientDescent(w, b, X, Y, numIerations, learningRate):
    losses=[]
    idx =[]
    # Iterate
    for i in range(numIerations):
        # Compute the gradients and loss for the current weights
        gradients,loss = forwardPropagation(w,b,X,Y)
        #Get the derivatives
        dw = gradients["dw"]
        db = gradients["db"]
        w = w-learningRate*dw
        b = b-learningRate*db

        # Store the loss
        if i % 100 == 0:
            idx.append(i)
            losses.append(loss)
    # Return the weights, gradients and losses
    params = {"w": w,
              "b": b}
    gradients = {"dw": dw,
                 "db": db}
    return params,gradients,losses,idx

# Predict the output for a training set
def predict(w,b,X):
    size=X.shape[1]
    yPredicted=np.zeros((1,size))
    Z=np.dot(w.T,X)+b
    # Compute the sigmoid
    A=sigmoid(Z)
    for i in range(A.shape[1]):
        #If the value is > 0.5 then set as 1
        if(A[0][i] > 0.5):
            yPredicted[0][i]=1
        else:
            # Else set as 0
            yPredicted[0][i]=0

    return yPredicted

#Normalize the data (divide each row/sample by its L2 norm)
def normalize(x):
    x_norm = np.linalg.norm(x,axis=1,keepdims=True)
    x= x/x_norm
    return x

# Run the 2 layer Neural Network on the cancer data set

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
# Normalize the data for better performance
X_train1=normalize(X_train)

# Create weight vectors of zeros. The size is the number of features in the data set=30
w=np.zeros((X_train.shape[1],1))
#w=np.zeros((30,1))
b=0

#Normalize the training data so that gradient descent performs better
X_train1=normalize(X_train)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_train2=X_train1.T

# Reshape to remove the rank 1 array and then transpose
y_train1=y_train.reshape(len(y_train),1)
y_train2=y_train1.T

# Run gradient descent for 4000 iterations and compute the weights
# (the learning rate here is assumed; the R and Octave versions below use 0.77 and 0.75)
parameters,gradients,costs,idx = gradientDescent(w, b, X_train2, y_train2, numIerations=4000, learningRate=0.75)
w = parameters["w"]
b = parameters["b"]

# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_test2=X_test1.T

#Reshape y_test
y_test1=y_test.reshape(len(y_test),1)
y_test2=y_test1.T

# Predict the values for the test and training sets
yPredictionTest = predict(w, b, X_test2)
yPredictionTrain = predict(w, b, X_train2)

# Print the accuracy
print("train accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTrain - y_train2)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTest - y_test)) * 100))

# Plot the Costs vs the number of iterations
fig1=plt.plot(idx,costs)
fig1=plt.title("Gradient descent-Cost vs No of iterations")
fig1=plt.xlabel("No of iterations")
fig1=plt.ylabel("Cost")
fig1.figure.savefig("fig1", bbox_inches='tight')
## train accuracy: 90.3755868545 %
## test accuracy: 89.5104895105 %

Note: It can be seen that the accuracy on the training and test sets is 90.37% and 89.51% respectively. This is comparatively poorer than the 96% which the logistic regression of sklearn achieves! But this is mainly because of the absence of hidden layers, which is the real power of neural networks.

4. Neural Network for Logistic Regression -R code (vectorized)

source("RFunctions-1.R")
# Define the sigmoid function
sigmoid <- function(z){
a <- 1/(1+ exp(-z))
a
}

# Compute the loss
computeLoss <- function(numTraining,Y,A){
loss <- -1/numTraining* sum(Y*log(A) + (1-Y)*log(1-A))
return(loss)
}

# Compute forward propagation
forwardPropagation <- function(w,b,X,Y){
# Compute Z
Z <- t(w) %*% X +b
#Set the number of samples
numTraining <- ncol(X)
# Compute the activation function
A=sigmoid(Z)

#Compute the loss
loss <- computeLoss(numTraining,Y,A)

# Compute the gradients dZ, dw and db
dZ<-A-Y
dw<-1/numTraining * X %*% t(dZ)
db<-1/numTraining*sum(dZ)

fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
return(fwdProp)
}

# Perform one cycle of Gradient descent
gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
losses <- NULL
idx <- NULL
# Loop through the number of iterations
for(i in 1:numIerations){
fwdProp <-forwardPropagation(w,b,X,Y)
#Get the derivatives
dw <- fwdProp$dw
db <- fwdProp$db
w = w-learningRate*dw
b = b-learningRate*db
l <- fwdProp$loss
# Store the loss
if(i %% 100 == 0){
idx <- c(idx,i)
losses <- c(losses,l)
}
}
# Return the weights and losses
gradDescnt <- list("w"=w,"b"=b,"dw"=dw,"db"=db,"losses"=losses,"idx"=idx)
return(gradDescnt)
}

# Compute the predicted value for input
predict <- function(w,b,X){
m=dim(X)[2]
# Create a vector of 0's
yPredicted=matrix(rep(0,m),nrow=1,ncol=m)
Z <- t(w) %*% X +b
# Compute sigmoid
A=sigmoid(Z)
for(i in 1:dim(A)[2]){
# If A > 0.5 set value as 1
if(A[1,i] > 0.5)
yPredicted[1,i]=1
else
# Else set as 0
yPredicted[1,i]=0
}
return(yPredicted)
}

# Normalize the matrix
normalize <- function(x){
#Create the norm of the matrix. Perform the Frobenius norm of the matrix
n<-as.matrix(sqrt(rowSums(x^2)))
#Sweep by rows by norm. Note '1' in the function, which operates on every row
normalized<-sweep(x, 1, n, FUN="/")
return(normalized)
}

# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1,30),"output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]
# Set the features
X_train <-train[,1:30]
y_train <- train[,31]
X_test <- test[,1:30]
y_test <- test[,31]
# Create a matrix of 0's with the number of features
w <-matrix(rep(0,dim(X_train)[2]))
b <-0
X_train1 <- normalize(X_train)
X_train2=t(X_train1)
# Reshape then transpose
y_train1=as.matrix(y_train)
y_train2=t(y_train1)
# Perform gradient descent
gradDescent= gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)
# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=t(X_test1)
#Reshape y_test and take transpose
y_test1=as.matrix(y_test)
y_test2=t(y_test1)
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain = predict(gradDescent$w, gradDescent$b, X_train2)
sprintf("Train accuracy: %f",(100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f",(100 - mean(abs(yPredictionTest - y_test)) * 100))
## [1] "test accuracy: 87.323944"
df <-data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations","losses")
ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(col="blue") +
ggtitle("Gradient Descent - Losses vs No of Iterations") +
xlab("No of iterations") + ylab("Losses")

5. Neural Network for Logistic Regression -Octave code (vectorized)

1;
# Define sigmoid function
function a = sigmoid(z)
a = 1 ./ (1+ exp(-z));
end

# Compute the loss
function loss=computeLoss(numtraining,Y,A)
loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end

# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
% Compute Z
Z = w' * X + b;
numtraining = size(X)(1,2);
# Compute sigmoid
A = sigmoid(Z);
#Compute loss. Note this is element wise product
loss =computeLoss(numtraining,Y,A);
# Compute the gradients dZ, dw and db
dZ = A-Y;
dw = 1/numtraining* X * dZ';
db =1/numtraining*sum(dZ);
end

# Compute Gradient Descent
function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
#Initialize losses and idx
losses=[];
index=[];
# Loop through the number of iterations
for i=1:numIerations,
[loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
# Perform Gradient descent
w = w - learningRate*dw;
b = b - learningRate*db;
if(mod(i,100) ==0)
# Append index and loss
index = [index i];
losses = [losses loss];
endif
end
end

# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
m = size(X)(1,2);
yPredicted=zeros(1,m);
# Compute Z
Z = w' * X + b;
# Compute sigmoid
A = sigmoid(Z);
for i=1:size(X)(1,2),
# Set predicted as 1 if A >= 0.5
if(A(1,i) >= 0.5)
yPredicted(1,i)=1;
else
yPredicted(1,i)=0;
endif
end
end

# Normalize by dividing each value by the sum of squares
function normalized = normalize(x)
# Compute Frobenius norm. Square the elements, sum rows and then find square root
a = sqrt(sum(x .^ 2,2));
# Perform element wise division
normalized = x ./ a;
end

# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
# Create a random index
ix = randperm(length(dataset));
# Split into training
trainSize = floor(trainPercent/100 * length(dataset));
train=dataset(ix(1:trainSize),:);
# And test
test=dataset(ix(trainSize+1:length(dataset)),:);
X_train = train(:,1:30);
y_train = train(:,31);
X_test = test(:,1:30);
y_test = test(:,31);
end

cancer=csvread("cancer.csv");
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
# Normalize X_test
X_test1=normalize(X_test);
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);

trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100
testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100
trainAccuracy = 90.845
testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent- Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");

Conclusion

This post started with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is poorer than the highly optimized sklearn Logistic Regression. This is because the above neural network does not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers.

The Deep Learning journey has begun… Don't miss the bus!
Stay tuned for more interesting posts in Deep Learning!!

To see all posts check Index of posts

Neural Networks: The mechanics of backpropagation

The initial work on the 'Backpropagation Algorithm' started in the 1980s and led to an explosion of interest in Neural Networks and the application of backpropagation. The 'Backpropagation' algorithm computes the minimum of an error function with respect to the weights in the Neural Network, using the method of gradient descent. The combination of weights in a multi-layered neural network which minimizes the error/cost function is considered to be a solution of the learning problem.

In the Neural Network above
$out_{o1} =\sum_{i} w_{i}*x_{i}$
$E = 1/2(target - out)^{2}$
$\partial E/\partial out= 1/2*2*(target - out) *-1 = -(target - out)$
$\partial E/\partial w_{i} =\partial E/\partial out * \partial out/\partial w_{i}$
$\partial E/\partial w_{i} = -(target - out) * x_{i}$

Checkout my book 'Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave'. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

Perceptrons and single layered neural networks can only classify if the sample space is linearly separable. For non-linear decision boundaries, a multi-layered neural network with backpropagation is required to generate more complex boundaries. The backpropagation algorithm computes the minimum of the error function in weight space using the method of gradient descent. This computation of the gradient requires the activation function to be both differentiable and continuous. Hence the sigmoid or logistic function is typically chosen as the activation function at every layer.

This post looks at a 3 layer neural network with 1 input layer, 1 hidden layer and 1 output layer. To a large extent this post is based on Matt Mazur's detailed “A step by step backpropagation example“, Prof Hinton's “Neural Networks for Machine Learning” at Coursera, and a few other sources.

While Matt Mazur’s post uses example values, I generate the formulas for the gradient derivatives for each weight in the hidden and input layers. I intend to implement a vector version of backpropagation in Octave, R and Python. So this post is a prequel to that.

The 3 layer neural network is as below

Some basic derivations which are used in backpropagation

Chain rule of differentiation
Let y=f(u)
and u=g(x) then
$\partial y/\partial x = \partial y/\partial u * \partial u/\partial x$

An important result
$y=1/(1+e^{-z})$
Let $x= 1 + e^{-z}$  then
$y = 1/x$
$\partial y/\partial x = -1/x^{2}$
$\partial x/\partial z = -e^{-z}$

Using the chain rule of differentiation we get
$\partial y/\partial z = \partial y/\partial x * \partial x/\partial z$
$=-1/(1+e^{-z})^{2}* -e^{-z} = e^{-z}/(1+e^{-z})^{2}$
Therefore $\partial y/\partial z = y(1-y)$                                   -(A)

1) Feed forward network
The net output at the 1st hidden layer
$in_{h1} = w_{1}i_{1} + w_{2}i_{2} + b_{1}$
$in_{h2} = w_{3}i_{1} + w_{4}i_{2} + b_{1}$

The sigmoid/logistic function is used to generate the activation outputs for each hidden layer. The sigmoid is chosen because it is continuous and also has a continuous derivative

$out_{h1} = 1/(1+e^{-in_{h1}})$
$out_{h2} = 1/(1+e^{-in_{h2}})$

The net output at the output layer
$in_{o1} = w_{5}out_{h_{1}} + w_{6}out_{h_{2}} + b_{2}$
$in_{o2} = w_{7}out_{h_{1}} + w_{8}out_{h_{2}} + b_{2}$

Total error
$E_{total} = 1/2\sum (target - output)^{2}$
$E_{total} = E_{o1} + E_{o2}$
$E_{total} = 1/2(target_{o_{1}} - out_{o_{1}})^{2} + 1/2(target_{o_{2}} - out_{o_{2}})^{2}$

2)The backwards pass
In the backward pass we need to compute how the squared error changes with a changing weight, i.e. we compute $\partial E_{total}/\partial w_{i}$ for each weight $w_{i}$. This is shown below

A squared error is assumed

Error gradient  with $w_{5}$

$\partial E_{total}/\partial w_{5} = \partial E_{total}/\partial out_{o_{1}} * \partial out_{o_{1}}/\partial in_{o_{1}} * \partial in_{o_{1}}/ \partial w_{5}$                -(B)

Since
$E_{total} = 1/2\sum (target - output)^{2}$
$E_{total} = 1/2(target_{o_{1}} - out_{o_{1}})^{2} + 1/2(target_{o_{2}} - out_{o_{2}})^{2}$
$\partial E_{total}/\partial out_{o1} = \partial E_{o1}/\partial out_{o1} + \partial E_{o2}/\partial out_{o1}$
$\partial E_{total}/\partial out_{o1} = \partial /\partial out_{o1}[1/2(target_{o1}-out_{o1})^{2} + 1/2(target_{o2}-out_{o2})^{2}]$
$\partial E_{total}/\partial out_{o1} = 2 * 1/2*(target_{o1} - out_{o1}) *-1 + 0 = -(target_{o1} - out_{o1})$

Now considering the 2nd term in (B)
$\partial out_{o1}/\partial in_{o1} = \partial/\partial in_{o1} [1/(1+e^{-in_{o1}})]$

Using result (A)
$\partial out_{o1}/\partial in_{o1} = \partial/\partial in_{o1} [1/(1+e^{-in_{o1}})] = out_{o1}(1-out_{o1})$

The 3rd term in (B)
$\partial in_{o1}/\partial w_{5} = \partial/\partial w_{5} [w_{5}*out_{h1} + w_{6}*out_{h2}] = out_{h1}$
$\partial E_{total}/\partial w_{5}=-(target_{o1} - out_{o1}) * out_{o1} *(1-out_{o1}) * out_{h1}$

Having computed $\partial E_{total}/\partial w_{5}$, we now perform gradient descent, by computing a new weight, assuming a learning rate $\alpha$
$w_{5}^{+} = w_{5} - \alpha * \partial E_{total}/\partial w_{5}$

If we do this for $\partial E_{total}/\partial w_{6}$ we would get
$\partial E_{total}/\partial w_{6}=-(target_{o1} - out_{o1}) * out_{o1} *(1-out_{o1}) * out_{h2}$

3)Hidden layer

We now compute how the total error changes for a change in weight $w_{1}$
$\partial E_{total}/\partial w_{1}= \partial E_{total}/\partial out_{h1}* \partial out_{h1}/\partial in_{h1} * \partial in_{h1}/\partial w_{1}$ – (C)

Using
$E_{total} = E_{o1} + E_{o2}$ we get
$\partial E_{total}/\partial w_{1}= (\partial E_{o1}/\partial out_{h1}+ \partial E_{o2}/\partial out_{h1}) * \partial out_{h1}/\partial in_{h1} * \partial in_{h1}/\partial w_{1}$
$\partial E_{total}/\partial w_{1}=(\partial E_{o1}/\partial out_{h1}+ \partial E_{o2}/\partial out_{h1}) * out_{h1}*(1-out_{h1})*i_{1}$     -(D)

Considering the 1st term in (C)
$\partial E_{total}/\partial out_{h1}= \partial E_{o1}/\partial out_{h1}+ \partial E_{o2}/\partial out_{h1}$

Now
$\partial E_{o1}/\partial out_{h1} = \partial E_{o1}/\partial out_{o1} *\partial out_{o1}/\partial in_{o1} * \partial in_{o1}/\partial out_{h1}$
$\partial E_{o2}/\partial out_{h1} = \partial E_{o2}/\partial out_{o2} *\partial out_{o2}/\partial in_{o2} * \partial in_{o2}/\partial out_{h1}$

which gives the following
$\partial E_{o1}/\partial out_{o1} *\partial out_{o1}/\partial in_{o1} * \partial in_{o1}/\partial out_{h1} =-(target_{o1}-out_{o1}) *out_{o1}(1-out_{o1})*w_{5}$ – (E)
$\partial E_{o2}/\partial out_{o2} *\partial out_{o2}/\partial in_{o2} * \partial in_{o2}/\partial out_{h1} =-(target_{o2}-out_{o2}) *out_{o2}(1-out_{o2})*w_{7}$ – (F)

Combining (D), (E) & (F) we get
$\partial E_{total}/\partial w_{1} = -[(target_{o1}-out_{o1}) *out_{o1}(1-out_{o1})*w_{5} + (target_{o2}-out_{o2}) *out_{o2}(1-out_{o2})*w_{7}]*out_{h1}*(1-out_{h1})*i_{1}$

This can be represented as
$\partial E_{total}/\partial w_{1} = -\sum_{i}[(target_{oi}-out_{oi}) *out_{oi}(1-out_{oi})*w_{j}]*out_{h1}*(1-out_{h1})*i_{1}$

With this derivative a new value of $w_{1}$ is computed
$w_{1}^{+} = w_{1} - \alpha * \partial E_{total}/\partial w_{1}$

Hence there are 2 important results
At the output layer we have
a) $\partial E_{total}/\partial w_{j}=-(target_{oi} - out_{oi}) * out_{oi} *(1-out_{oi}) * out_{hi}$
At each hidden layer we compute
b) $\partial E_{total}/\partial w_{k} = -\sum_{i}[(target_{oi}-out_{oi}) *out_{oi}(1-out_{oi})*w_{j}]*out_{hk}*(1-out_{hk})*i_{k}$
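The sketch below applies results a) and b) to the 3 layer network above, with made-up inputs, targets, weights and biases (all numerical values are purely illustrative), computing $\partial E_{total}/\partial w_{5}$ at the output layer and $\partial E_{total}/\partial w_{1}$ at the hidden layer.

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

# Illustrative values for inputs, targets, weights and biases
i1, i2 = 0.1, 0.8
target_o1, target_o2 = 0.0, 1.0
w1, w2, w3, w4 = 0.2, -0.3, 0.4, 0.1
w5, w6, w7, w8 = 0.5, -0.2, 0.3, 0.6
b1, b2 = 0.1, -0.1

# Forward pass
out_h1 = sigmoid(w1*i1 + w2*i2 + b1)
out_h2 = sigmoid(w3*i1 + w4*i2 + b1)
out_o1 = sigmoid(w5*out_h1 + w6*out_h2 + b2)
out_o2 = sigmoid(w7*out_h1 + w8*out_h2 + b2)

# Result a): gradient for w5 at the output layer
dE_dw5 = -(target_o1 - out_o1) * out_o1*(1 - out_o1) * out_h1

# Result b): gradient for w1 at the hidden layer
# (w5 connects h1 to o1 and w7 connects h1 to o2)
delta_o1 = -(target_o1 - out_o1) * out_o1*(1 - out_o1)
delta_o2 = -(target_o2 - out_o2) * out_o2*(1 - out_o2)
dE_dw1 = (delta_o1*w5 + delta_o2*w7) * out_h1*(1 - out_h1) * i1

print(dE_dw5, dE_dw1)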

Backpropagation was very successful in the early years, but the algorithm does have its problems, e.g. the issue of the 'vanishing' and 'exploding' gradient. Yet it is a very key development in Neural Networks, and the issues with the backprop gradients have been addressed through techniques such as the momentum method, adaptive learning rates etc.

In this post, I derived the weights at the output layer and the hidden layer. As I already mentioned above, I intend to implement a vectorized version of the backpropagation algorithm in Octave, R and Python in the days to come.

Watch this space! I’ll be back

P.S. If you find any typos/errors, do let me know!

References
1. Neural Networks for Machine Learning by Prof Geoffrey Hinton
2. A Step by Step Backpropagation Example by Matt Mazur
3. The Backpropagation algorithm by R Rojas
4. Backpropagation Learning Artificial Neural Networks David S Touretzky
5. Artificial Intelligence, Prof Sudeshna Sarkar, NPTEL

To see all my posts go to ‘Index of Posts

Simplifying ML: Neural networks- Part 3

Neural networks try to overcome the shortcomings of logistic regression, in which we have to choose a non-linear hypothesis. Logistic regression requires that we choose an appropriate combination of polynomial terms and the order of the equation; the problem with this is that we sometimes tend to either overfit or underfit. Neural networks have the ability to learn new features from the basic raw features.

The neural network is modeled on the networking ability of neurons in the human brain. The brain is made up of billions of neurons. Each neuron is a processing unit, with several inputs at the dendrites and an output at the axon. The neurons communicate through a combination of electrochemical signals at the synapses, the spaces between neurons.

A neural network mimics the working of the neuron.

So in a neural network the features of the problem serve as input. For e.g. in the case of determining whether a mail is spam or not, the features could be the words in the subject line, the from address, the contents etc. Based on a combination of these features we need to classify whether the mail is spam or not.

The above diagram shows a simple neural network with features x1, x2, x3 and a bias unit x0

With a hypothesis function hƟ(x) = 1/(1 + e^-x)

The edges from the features xi  are the model parameters Ɵ. In other words the edges represent weights.

A typical neural network is a network of many logistic units organized in layers. The output of each layer forms the input to the next subsequent layer. This is shown below

As can be seen in a multi-layer neural network at the left we have the features x1,x2, .. xn.

These, at the next layer, become the activation units. The key advantage of neural networks over regular logistic regression is that the learned model parameters of one layer serve as input to the next layer, which learns the model parameters more finely. Hence this gives a better fit for the combination of parameters.

The activation parameters at the next layer are

a1(2) = g(Ɵ10(1)x0 + Ɵ11(1)x1 + Ɵ12(1)x2 + Ɵ13(1)x3), where g is the logistic function or the sigmoid function discussed in my previous post Simplifying ML: Logistic regression – Part 2, and the superscript in brackets denotes the layer

Here a1(2) is the 1st activation unit at layer 2 (the hidden layer)

Ɵ10(1) is the model parameter at layer 1 and is the 0th parameter. Similarly Ɵ11(1) is the model parameter at layer 1 and is the 1st parameter and so on.

Similarly the other activation parameters can be written as

a2(2) = g(Ɵ20(1)x0 + Ɵ21(1)x1 + Ɵ22(1)x2 + Ɵ23(1)x3)

a3(2) = g(Ɵ30(1)x0 + Ɵ31(1)x1 + Ɵ32(1)x2 + Ɵ33(1)x3)

hƟ(x) = a1(3) = g(Ɵ10(2)a0 + Ɵ11(2)a1 + Ɵ12(2)a2 + Ɵ13(2)a3)  – (A)

The crux of neural networks is that instead of creating a hypothesis based on the set of raw features, a neural network with multiple hidden layers can learn its own features. In equation (A) we can see that the hypothesis is not a function of the raw input features x1, x2, …, xn but of a new set of features, the activation units a1, a2, …, an. In other words the network has 'learned' its own features.

As mentioned above the output of each layer is the logistic function or the sigmoid function

The beauty of neural networks based on logistic functions is that we can easily realize the equivalent of logic gates like AND, OR, NOT, NOR etc.

The hypothesis for the above network would be

hƟ(x) = g(-30 + 20 * x1 + 20 * x2)

So for x1 = 0 and x2 = 0 we would have

hƟ(x) = g(-30 + 0 + 0) = g(-30)

Since g(-30) is very close to 0, i.e. well below the 0.5 threshold, hƟ(x) = 0. The output is 1 only when both x1 = 1 and x2 = 1, so the network realizes an AND gate, as the short check below shows.
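A small sketch that evaluates this hypothesis for all four input combinations, confirming that it behaves like an AND gate when thresholded at 0.5:

import numpy as np

def g(z):
    return 1/(1 + np.exp(-z))   # sigmoid

for x1 in (0, 1):
    for x2 in (0, 1):
        h = g(-30 + 20*x1 + 20*x2)
        print(x1, x2, int(h >= 0.5))
# Prints 0 0 0, 0 1 0, 1 0 0, 1 1 1 - i.e. x1 AND x2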

Similarly a NOT gate can be constructed with a neural network as follows

Neural networks can also be used for multi class classification.

Hence there are multiple advantages to neural networks. Neural networks are amenable to a) creating complex logical models from combinations of AND, NOT, OR gates

b) The model parameters are learned from the raw features and can be more flexible.

Interest in neural networks surged in the 1980s and then waned. The neural networks of that time were similar to the above and were based on forward propagation. In recent times, however, backward propagation has been used very successfully in the area of research known as 'deep learning'.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. A highly enjoyable and classic course!!!