# My presentations on ‘Elements of Neural Networks & Deep Learning’ -Parts 4,5

This is the next set of presentations on “Elements of Neural Networks and Deep Learning”. In the 4th presentation I discuss and derive the generalized equations for a multi-unit, multi-layer Deep Learning network. The 5th presentation derives the equations for a Deep Learning network performing multi-class classification, along with the derivations for cross-entropy loss. The corresponding implementations in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’.

Important note: Do check out my later version of these videos at Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8 . These have more content and also include some corrections. Check it out!

1. Elements of Neural Network and Deep Learning – Part 4
This presentation is a continuation of my 3rd presentation, in which I derived the equations for a simple 3-layer Neural Network with 1 hidden layer. In this video presentation, I discuss step by step the derivations for an L-layer, multi-unit Deep Learning network with any activation function g(z).
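
To make the idea of an L-layer network with an arbitrary activation g(z) concrete, here is a minimal forward-propagation sketch. It is not the code from the presentation or the book: the function names (`init_params`, `forward`), the small random initialization and the choice of relu for hidden layers with a sigmoid output are all assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(layer_dims, seed=0):
    # One (W, b) pair per layer; W has shape (units in layer l, units in layer l-1)
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:]):
        W = rng.standard_normal((n_out, n_in)) * 0.01
        b = np.zeros((n_out, 1))
        params.append((W, b))
    return params

def forward(X, params, g=relu, g_out=sigmoid):
    # Apply Z = W A + b followed by the activation, layer by layer.
    # Hidden layers use g, the last layer uses g_out.
    A = X
    for l, (W, b) in enumerate(params, start=1):
        Z = W @ A + b
        A = g_out(Z) if l == len(params) else g(Z)
    return A

# 5 examples with 2 features each, stored column-wise
X = np.random.default_rng(1).standard_normal((2, 5))
AL = forward(X, init_params([2, 9, 9, 1]))
print(AL.shape)
```

Passing a different function as `g` (for example `np.tanh`) changes the hidden activation without touching the loop, which is the point of deriving the equations for a generic g(z).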

The implementations of L-Layer, multi-unit Deep Learning Network in vectorized R, Python and Octave are available in my post Deep Learning from first principles in Python, R and Octave – Part 3

2. Elements of Neural Network and Deep Learning – Part 5
This presentation discusses multi-class classification using the Softmax function. The Jacobian of the Softmax is derived in detail, followed by the derivative of the cross-entropy loss. Finally, the complete set of equations for a Neural Network with multi-class classification is derived.
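
For reference, the two results mentioned above can be summarized as follows. This is the standard textbook form, written for a single example with $k$ classes and a one-hot label $y$; the notation in the slides may differ.

$S_{j} = \frac{e^{Z_{j}}}{\sum_{i=1}^{k}e^{Z_{i}}}, \qquad \frac{\partial S_{j}}{\partial Z_{i}} = S_{j}(\delta_{ij} - S_{i})$

where $\delta_{ij}$ is 1 when $i=j$ and 0 otherwise. With the cross-entropy loss $L = -\sum_{j=1}^{k} y_{j}\log S_{j}$, the chain rule gives

$\frac{\partial L}{\partial Z_{i}} = -\sum_{j=1}^{k}\frac{y_{j}}{S_{j}}S_{j}(\delta_{ij} - S_{i}) = -y_{i} + S_{i}\sum_{j=1}^{k}y_{j} = S_{i} - y_{i}$

since the one-hot labels sum to 1.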

The corresponding implementations in vectorized R, Python and Octave are available in the following posts
a. Deep Learning from first principles in Python, R and Octave – Part 4
b. Deep Learning from first principles in Python, R and Octave – Part 5

To be continued. Watch this space!

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version ($9.99/Rs449).

To see all posts click Index of Posts

# My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

Are you wondering whether to get into the ‘R’ bus or ‘Python’ bus?
My suggestion to you is “Why not get into the ‘R and Python’ train?”

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($8.99/Rs449) versions. In the third edition all code sections have been re-formatted to use the fixed width font ‘Consolas’. This neatly lines up output that has columns, such as confusion matrices and dataframes, making the code more readable. There is a science to formatting too, which improves the look and feel! It is little wonder that Steve Jobs had a keen passion for calligraphy! Additionally some typos have been fixed.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Paperback-$12.99)
2. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Kindle-$8.99/Rs449)

This book is ideal both for beginners and for experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Hope you have a great time learning as I did while implementing these algorithms!

# My book ‘Deep Learning from first principles:Second Edition’ now on Amazon

The second edition of my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’ is now available on Amazon, in both paperback ($18.99) and kindle ($9.99/Rs449) versions. Since this book is almost 70% code, all functions and code snippets have been formatted to use the fixed-width font ‘Lucida Console’. In addition, line numbers have been added to all code snippets. This makes the code more organized and much more readable. I have also fixed typos in the book.

The book includes the following chapters

Table of Contents
Preface 4
Introduction 6
1. Logistic Regression as a Neural Network 8
2. Implementing a simple Neural Network 23
3. Building a L- Layer Deep Learning Network 48
4. Deep Learning network with the Softmax 85
5. MNIST classification with Softmax 103
6. Initialization, regularization in Deep Learning 121
7. Gradient Descent Optimization techniques 167
8. Gradient Check in Deep Learning 197
1. Appendix A 214
2. Appendix 1 – Logistic Regression as a Neural Network 220
3. Appendix 2 - Implementing a simple Neural Network 227
4. Appendix 3 - Building a L- Layer Deep Learning Network 240
5. Appendix 4 - Deep Learning network with the Softmax 259
6. Appendix 5 - MNIST classification with Softmax 269
7. Appendix 6 - Initialization, regularization in Deep Learning 302
8. Appendix 7 - Gradient Descent Optimization techniques 344
9. Appendix 8 – Gradient Check 405
References 475

To see posts click Index of Posts

# My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

Note: The 3rd edition of this book is now available. See My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon.

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($9.99/Rs449) versions.  This second edition includes more content,  extensive comments and formatting for better readability.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Paperback-$12.99)
2. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Kindle-$9.99/Rs449)

This book is ideal both for beginners and for experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Hope you have a great time learning as I did while implementing these algorithms!

# My book “Deep Learning from first principles” now on Amazon

Note: The 2nd edition of this book is now available on Amazon

My 4th book (self-published), “Deep Learning from first principles – In vectorized Python, R and Octave” (557 pages), is now available on Amazon in both paperback ($18.99) and kindle ($9.99/Rs449). The book starts with the most primitive 2-layer Neural Network and works its way to a generic L-layer Deep Learning Network, with all the bells and whistles. The book includes detailed derivations and vectorized implementations in Python, R and Octave. The code has been extensively commented and has been included in the Appendix section.

# Deep Learning from first principles in Python, R and Octave – Part 5

## Introduction

a. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
b. A robot must obey orders given it by human beings except where such orders would conflict with the First Law.
c. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

      Isaac Asimov's Three Laws of Robotics 

Any sufficiently advanced technology is indistinguishable from magic.

      Arthur C Clarke.   

In this 5th part of Deep Learning from first principles in Python, R and Octave, I classify the MNIST data set of handwritten digits from the basics. To do this, I construct an L-layer, vectorized Deep Learning implementation in Python, R and Octave from scratch and use it on the MNIST data set. The MNIST training set contains 60000 handwritten digits from 0-9, and the test set contains 10000 digits. MNIST is a popular dataset for running Deep Learning tests, and has been rightfully termed the ‘drosophila’ of Deep Learning by none other than the venerable Prof Geoffrey Hinton.

The ‘Deep Learning from first principles in Python, R and Octave’ series has so far included Part 1, where I implemented logistic regression as a simple Neural Network. Part 2 implemented the most elementary neural network with 1 hidden layer, but with any number of activation units in that layer, and a sigmoid activation at the output layer.

This post, ‘Deep Learning from first principles in Python, R and Octave – Part 5’, largely builds upon Part 3, in which I implemented a multi-layer Deep Learning network with an arbitrary number of hidden layers and activation units per hidden layer, and with an output layer based on the sigmoid unit for binary classification. In Part 4, I derived the Jacobian of a Softmax, the Cross entropy loss and the gradient equations for a multi-class Softmax classifier. I also implemented a simple Neural Network using Softmax classification in Python, R and Octave.
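
The gradient result from Part 4 (the derivative of the cross-entropy loss with respect to the Softmax input is the Softmax output minus the one-hot label) is easy to sanity-check numerically. The snippet below is a small sketch for a single example; the helper names and the sample scores are made up for illustration and are not taken from the series' code.

```python
import numpy as np

def softmax(z):
    # Shift by the maximum for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    # y is a one-hot label vector
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])   # scores for 3 classes (illustrative values)
y = np.array([0.0, 0.0, 1.0])    # true class is the third one

# Analytic gradient: dL/dZ = S - y
analytic = softmax(z) - y

# Numerical gradient by central differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (cross_entropy(z + dz, y) - cross_entropy(z - dz, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```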

In this post I combine Part 3 and Part 4 to build an L-layer Deep Learning network, with an arbitrary number of hidden layers and hidden units, which can do both binary (sigmoid) and multi-class (softmax) classification.

Note: A detailed discussion of the derivation for multi-class classification can be seen in my video presentation Neural Networks 5.

The generic, vectorized L-Layer Deep Learning Network implementations in Python, R and Octave can be cloned/downloaded from GitHub at DeepLearning-Part5. This implementation allows for arbitrary number of hidden layers and hidden layer units. The activation function at the hidden layers can be one of sigmoid, relu and tanh (will be adding leaky relu soon). The output activation can be used for binary classification with the ‘sigmoid’, or multi-class classification with ‘softmax’. Feel free to download and play around with the code!

I thought the exercise of combining the two parts (Part 3 and Part 4) would be a breeze. But it was anything but. Incorporating a Softmax classifier into the generic L-Layer Deep Learning model was a challenge. Moreover, I found that I could not run batch gradient descent on all 60,000 training samples, as my laptop ran out of memory. So I had to implement Stochastic Gradient Descent (SGD) with mini-batches for Python, R and Octave. In addition, I also had to implement the numerically stable version of Softmax, as the plain softmax and its derivative would result in NaNs.
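
The memory saving comes from only ever pushing one small slice of the data through the network at a time. A minimal sketch of how such mini-batches can be created is shown below; the function name `random_mini_batches` and its signature are assumptions for illustration and may not match the helper in the downloadable code.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size, seed=0):
    # X has shape (features, m) and Y has shape (1, m), one example per column
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    # Shuffle the columns of X and Y with the same permutation
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    batches = []
    # Slice consecutive chunks; the last batch may be smaller
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size
        batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return batches

X = np.arange(20, dtype=float).reshape(2, 10)
Y = np.arange(10).reshape(1, 10)
batches = random_mini_batches(X, Y, mini_batch_size=4)
print([b[0].shape for b in batches])  # [(2, 4), (2, 4), (2, 2)]
```

In an SGD loop, each epoch would call this once and then run forward propagation, backward propagation and a parameter update per batch.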

### Numerically stable Softmax

The Softmax function $S_{j} =\frac{e^{Z_{j}}}{\sum_{i=1}^{k}e^{Z_{i}}}$ can be numerically unstable, because the exponentials of large scores overflow and the ratio then evaluates to NaN. To handle this problem we have to implement a stable Softmax function as below. Multiplying the numerator and denominator by any constant $C > 0$ leaves the value unchanged:

$S_{j} =\frac{e^{Z_{j}}}{\sum_{i=1}^{k}e^{Z_{i}}} = \frac{Ce^{Z_{j}}}{C\sum_{i=1}^{k}e^{Z_{i}}} = \frac{e^{Z_{j}+\log(C)}}{\sum_{i=1}^{k}e^{Z_{i}+\log(C)}}$
Therefore $S_{j} = \frac{e^{Z_{j}+ D}}{\sum_{i=1}^{k}e^{Z_{i}+ D}}$, where $D = \log(C)$.
Here ‘D’ can be any constant. A common choice is
$D=-\max(Z_{1},Z_{2},\ldots, Z_{k})$
which makes the largest exponent zero, so none of the exponentials can overflow.

Here is the stable Softmax implementation in Python

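    import numpy as np

    # A numerically stable Softmax implementation
    def stableSoftmax(Z):
        # Compute the softmax of Z (classes x examples) in a numerically stable way.
        # Work on Z.T (examples x classes) and subtract each example's maximum score.
        shiftZ = Z.T - np.max(Z.T, axis=1).reshape(-1, 1)
        exp_scores = np.exp(shiftZ)
        # Normalize them for each example (each row); A has shape (examples, classes)
        A = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        cache = Z
        return A, cache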
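
To see why the shift matters, the following small sketch compares a naive Softmax with the shifted version on scores that are large enough to overflow `np.exp`. The function names and the sample values are chosen only for this illustration.

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)
    return e / e.sum()

def shifted_softmax(z):
    # Same formula with D = -max(z) added to every score
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(z)
stable = shifted_softmax(z)

print(naive)
print(stable)
```

The shifted version returns the same probabilities as it would for the scores `[0, 1, 2]`, since only the differences between scores matter.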


While trying to create an L-layer generic Deep Learning network in the 3 languages, I found it useful to ensure that the model executed correctly on smaller datasets. You can run into numerous problems while setting up the matrices, and these become extremely difficult to debug. So in this post, I run the model on the 2 smaller data sets used in my earlier posts (Part 3 and Part 4), in each of the languages, before running the generic model on MNIST.
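
One cheap way to catch matrix set-up problems early is to assert the expected shapes before training starts. The sketch below is only an illustration of that idea: `init_params` and `check_shapes` are hypothetical helpers, and the parameter naming (`W1`, `b1`, ...) is an assumption rather than the layout used in the downloadable code.

```python
import numpy as np

def init_params(layer_dims, seed=0):
    # Parameters stored as W1, b1, ..., WL, bL
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W%d" % l] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params["b%d" % l] = np.zeros((layer_dims[l], 1))
    return params

def check_shapes(params, layer_dims, X, Y):
    # Inputs are stored column-wise: X is (features, m), Y is (1, m)
    assert X.shape[0] == layer_dims[0], "X must have one row per input feature"
    assert X.shape[1] == Y.shape[1], "X and Y must have the same number of examples"
    for l in range(1, len(layer_dims)):
        assert params["W%d" % l].shape == (layer_dims[l], layer_dims[l - 1])
        assert params["b%d" % l].shape == (layer_dims[l], 1)
    return True

layer_dims = [2, 9, 9, 1]
params = init_params(layer_dims)
X = np.zeros((2, 400))
Y = np.zeros((1, 400))
print(check_shapes(params, layer_dims, X, Y))  # True
```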

Here is a fair warning. If you think you can dive directly into Deep Learning with just some basic knowledge of Machine Learning, you are bound to run into serious issues. Moreover, your knowledge will be incomplete. It is essential that you have a good grasp of Machine and Statistical Learning, the different algorithms, and the measures and metrics for selecting models. It would help to be conversant with the common ML models, ML concepts, validation techniques, classification measures etc. Check out the internet/books for background.


You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

### 1. Random dataset with Sigmoid activation – Python

This random data with 9 clusters, was used in my post Deep Learning from first principles in Python, R and Octave – Part 3 , and was used to test the complete L-layer Deep Learning network with Sigmoid activation.

    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.datasets import make_classification, make_blobs
    exec(open("DLfunctions51.py").read()) # Cannot import in Rmd.
    # Create a random data set with 9 centers
    X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9, cluster_std = 1.3, random_state = 4)

    # Create 2 classes
    Y1 = Y1.reshape(400,1)
    Y1 = Y1 % 2
    X2 = X1.T
    Y2 = Y1.T
    # Set the dimensions of L-layer DL network
    layersDimensions = [2, 9, 9, 1] # 4-layer model
    # Execute DL network with hidden activation=relu and sigmoid output function
    parameters = L_Layer_DeepModel(X2, Y2, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid", learningRate = 0.3, num_iterations = 2500, print_cost = True)

### 2. Spiral dataset with Softmax activation – Python

The Spiral data was used in my post Deep Learning from first principles in Python, R and Octave – Part 4 and was used to test the complete L-layer Deep Learning network with multi-class Softmax activation at the output layer

    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.datasets import make_classification, make_blobs
    # L_Layer_DeepModel is defined in DLfunctions51.py (loaded in the previous example)

    # Create an input data set - Taken from CS231n Convolutional Neural networks
    # http://cs231n.github.io/neural-networks-case-study/
    N = 100 # number of points per class
    D = 2 # dimensionality
    K = 3 # number of classes
    X = np.zeros((N*K,D)) # data matrix (each row = single example)
    y = np.zeros(N*K, dtype='uint8') # class labels
    for j in range(K):
        ix = range(N*j,N*(j+1))
        r = np.linspace(0.0,1,N) # radius (this line was missing from the listing; it is part of the CS231n example)
        t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        y[ix] = j

    X1 = X.T
    Y1 = y.reshape(-1,1).T
    numHidden = 100 # No of hidden units in hidden layer
    numFeats = 2 # dimensionality
    numOutput = 3 # number of classes
    # Set the dimensions of the layers
    layersDimensions = [numFeats,numHidden,numOutput]
    parameters = L_Layer_DeepModel(X1, Y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.6, num_iterations = 9000, print_cost = True)
    ## Cost after iteration 0: 1.098759
    ## Cost after iteration 1000: 0.112666
    ## Cost after iteration 2000: 0.044351
    ## Cost after iteration 3000: 0.027491
    ## Cost after iteration 4000: 0.021898
    ## Cost after iteration 5000: 0.019181
    ## Cost after iteration 6000: 0.017832
    ## Cost after iteration 7000: 0.017452
    ## Cost after iteration 8000: 0.017161

### 3. MNIST dataset with Softmax activation – Python

In the code below, I execute Stochastic Gradient Descent on the MNIST training data of 60000. I used a mini-batch size of 1000. Python takes about 40 minutes to crunch the data. In addition I also compute the Confusion Matrix and other metrics like Accuracy, Precision and Recall for the MNIST data set. I get an accuracy of 0.93 on the MNIST test set. This accuracy can be improved by choosing more hidden layers or more hidden units and possibly also tweaking the learning rate and the number of epochs.

    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd
    import math
    from sklearn.datasets import make_classification, make_blobs
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Read the MNIST training and test sets
    # NOTE: the loading step is not shown in this listing. `training` and `test`
    # are assumed to be sequences of (label, pixels) pairs created beforehand.
    # Create labels and pixel arrays
    lbls = []
    pxls = []
    print(len(training))
    #for i in range(len(training)):
    for i in range(60000):
        l, p = training[i]
        lbls.append(l)
        pxls.append(p)
    labels = np.array(lbls)
    pixels = np.array(pxls)
    y = labels.reshape(-1,1)
    X = pixels.reshape(pixels.shape[0],-1)
    X1 = X.T
    Y1 = y.T
    # Set the dimensions of the layers. The MNIST data is 28x28 pixels= 784
    # Hence input layer is 784. For the 10 digits the Softmax classifier
    # has to handle 10 outputs
    layersDimensions = [784, 15, 9, 10] # Works very well,lr=0.01,mini_batch =1000, total=20000
    np.random.seed(1)
    costs = []
    # Run Stochastic Gradient Descent with Learning Rate=0.01, mini batch size=1000
    # number of epochs=3000
    parameters = L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.01, mini_batch_size = 1000, num_epochs = 3000, print_cost = True)

    # Compute the Confusion Matrix on Training set
    # Compute the training accuracy, precision and recall
    proba = predict_proba(parameters, X1, outputActivationFunc="softmax")
    #A2, cache = forwardPropagationDeep(X1, parameters)
    #proba=np.argmax(A2, axis=0).reshape(-1,1)
    a = confusion_matrix(Y1.T, proba)
    print(a)
    print('Accuracy: {:.2f}'.format(accuracy_score(Y1.T, proba)))
    print('Precision: {:.2f}'.format(precision_score(Y1.T, proba, average="micro")))
    print('Recall: {:.2f}'.format(recall_score(Y1.T, proba, average="micro")))

    # Create labels and pixel arrays for the test set
    lbls = []
    pxls = []
    print(len(test))
    for i in range(10000):
        l, p = test[i]
        lbls.append(l)
        pxls.append(p)
    testLabels = np.array(lbls)
    testPixels = np.array(pxls)
    ytest = testLabels.reshape(-1,1)
    Xtest = testPixels.reshape(testPixels.shape[0],-1)
    X1test = Xtest.T
    Y1test = ytest.T

    # Compute the Confusion Matrix on Test set
    # Compute the test accuracy, precision and recall
    probaTest = predict_proba(parameters, X1test, outputActivationFunc="softmax")
    #A2, cache = forwardPropagationDeep(X1, parameters)
    #proba=np.argmax(A2, axis=0).reshape(-1,1)
    a = confusion_matrix(Y1test.T, probaTest)
    print(a)
    print('Accuracy: {:.2f}'.format(accuracy_score(Y1test.T, probaTest)))
    print('Precision: {:.2f}'.format(precision_score(Y1test.T, probaTest, average="micro")))
    print('Recall: {:.2f}'.format(recall_score(Y1test.T, probaTest, average="micro")))

    ## 1. Confusion Matrix of Training set
    ##      0    1    2    3    4    5    6    7    8    9
    ## [[5854    0   19    2   10    7    0    1   24    6]
    ##  [   1 6659   30   10    5    3    0   14   20    0]
    ##  [  20   24 5805   18    6   11    2   32   37    3]
    ##  [   5    4  175 5783    1   27    1   58   60   17]
    ##  [   1   21    9    0 5780    0    5    2   12   12]
    ##  [  29    9   21  224    6 4824   18   17  245   28]
    ##  [   5    4   22    1   32   12 5799    0   43    0]
    ##  [   3   13  148  154   18    3    0 5883    4   39]
    ##  [  11   34   30   21   13   16    4    7 5703   12]
    ##  [  10    4    1   32  135   14    1   92  134 5526]]

    ## 2. Accuracy, Precision, Recall of Training set
    ## Accuracy: 0.96
    ## Precision: 0.96
    ## Recall: 0.96

    ## 3. Confusion Matrix of Test set
    ##      0    1    2    3    4    5    6    7    8    9
    ## [[ 954    1    8    0    3    3    2    4    4    1]
    ##  [   0 1107    6    5    0    0    1    2   14    0]
    ##  [  11    7  957   10    5    0    5   20   16    1]
    ##  [   2    3   37  925    3   13    0    8   18    1]
    ##  [   2    6    1    1  944    0    7    3    4   14]
    ##  [  12    5    4   45    2  740   24    8   42   10]
    ##  [   8    4    4    2   16    9  903    0   12    0]
    ##  [   4   10   27   18    5    1    0  940    1   22]
    ##  [  11   13    6   13    9   10    7    2  900    3]
    ##  [   8    5    1    7   50    7    0   20   29  882]]

    ## 4. Accuracy, Precision, Recall of Test set
    ## Accuracy: 0.93
    ## Precision: 0.93
    ## Recall: 0.93
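
Accuracy, precision and recall come out identical here because the precision and recall are micro-averaged: for single-label multi-class problems, every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The sketch below recomputes the three numbers directly from the test-set confusion matrix printed above, using plain NumPy rather than scikit-learn. Rows are true labels and columns are predictions, as in `sklearn.metrics.confusion_matrix`.

```python
import numpy as np

# Test-set confusion matrix copied from the output above
cm = np.array([
    [954,    1,   8,   0,   3,   3,   2,   4,   4,   1],
    [  0, 1107,   6,   5,   0,   0,   1,   2,  14,   0],
    [ 11,    7, 957,  10,   5,   0,   5,  20,  16,   1],
    [  2,    3,  37, 925,   3,  13,   0,   8,  18,   1],
    [  2,    6,   1,   1, 944,   0,   7,   3,   4,  14],
    [ 12,    5,   4,  45,   2, 740,  24,   8,  42,  10],
    [  8,    4,   4,   2,  16,   9, 903,   0,  12,   0],
    [  4,   10,  27,  18,   5,   1,   0, 940,   1,  22],
    [ 11,   13,   6,  13,   9,  10,   7,   2, 900,   3],
    [  8,    5,   1,   7,  50,   7,   0,  20,  29, 882],
])

tp = np.diag(cm)               # correctly classified, per class
fp = cm.sum(axis=0) - tp       # predicted as the class, but actually another
fn = cm.sum(axis=1) - tp       # belong to the class, but predicted as another

accuracy = tp.sum() / cm.sum()
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())

print('Accuracy: {:.2f}'.format(accuracy))
print('Precision: {:.2f}'.format(micro_precision))
print('Recall: {:.2f}'.format(micro_recall))
```

Per-class differences only show up with `average="macro"` or `average=None`, which are other options of scikit-learn's `precision_score` and `recall_score`.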

### 4. Random dataset with Sigmoid activation – R code

This is the random data set used in the Python code above which was saved as a CSV. The code is used to test a L -Layer DL network with Sigmoid Activation in R.

    source("DLfunctions5.R")
    library(ggplot2)
    # Read the random data set
    # NOTE: the read step is missing from the original listing; the file name
    # "data.csv" is assumed here, matching the file read by the Octave example
    z <- as.matrix(read.csv("data.csv",header=FALSE))
    x <- z[,1:2]
    y <- z[,3]
    X <- t(x)
    Y <- t(y)
    # Set the dimensions of the layers
    layersDimensions = c(2, 9, 9, 1)

    # Run Gradient Descent on the data set with relu hidden unit activation
    # sigmoid activation unit in the output layer
    retvals = L_Layer_DeepModel(X, Y, layersDimensions,
                                hiddenActivationFunc='relu',
                                outputActivationFunc="sigmoid",
                                learningRate = 0.3,
                                numIterations = 5000,
                                print_cost = TRUE)
    # Plot the cost vs iterations
    iterations <- seq(0,5000,1000)
    costs=retvals$costs
    df=data.frame(iterations,costs)
    ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
        ggtitle("Costs vs iterations") + xlab("Iterations") + ylab("Loss")

### 5. Spiral dataset with Softmax activation – R

The spiral data set used in the Python code above is reused to test multi-class classification with Softmax.

    source("DLfunctions5.R")
    Z <- as.matrix(read.csv("spiral.csv",header=FALSE))

    # Setup the data
    X <- Z[,1:2]
    y <- Z[,3]
    X <- t(X)
    Y <- t(y)

    # Initialize number of features, number of hidden units in hidden layer and
    # number of classes
    numFeats<-2 # No features
    numHidden<-100 # No of hidden units
    numOutput<-3 # No of classes

    # Set the layer dimensions
    layersDimensions = c(numFeats,numHidden,numOutput)

    # Perform gradient descent with relu activation unit for hidden layer
    # and softmax activation in the output
    retvals = L_Layer_DeepModel(X, Y, layersDimensions,
                                hiddenActivationFunc='relu',
                                outputActivationFunc="softmax",
                                learningRate = 0.5,
                                numIterations = 9000,
                                print_cost = TRUE)
    # Plot cost vs iterations
    iterations <- seq(0,9000,1000)
    costs=retvals$costs
    df=data.frame(iterations,costs)
    ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
        ggtitle("Costs vs iterations") + xlab("Iterations") + ylab("Costs")

### 6. MNIST dataset with Softmax activation – R

The code below executes a L – Layer Deep Learning network with Softmax output activation, to classify the 10 handwritten digits from MNIST with Stochastic Gradient Descent. The entire 60000 data set was used to train the data. R takes almost 8 hours to process this data set with a mini-batch size of 1000.  The use of ‘for’ loops is limited to iterating through epochs, mini batches and for creating the mini batches itself. All other code is vectorized. Yet, it seems to crawl. Most likely the use of ‘lists’ in R, to return multiple values is performance intensive. Some day, I will try to profile the code, and see where the issue is. However the code works!

Having said that, the Confusion Matrix in R dumps a lot of interesting statistics! There is a bunch of statistical measures for each class. For e.g. the Balanced Accuracy for the digits ‘6’ and ‘9’ is around 50%. Looks like, the classifier is confused by the fact that 6 is inverted 9 and vice-versa. The accuracy on the Test data set is just around 75%. I could have played around with the number of layers, number of hidden units, learning rates, epochs etc to get a much higher accuracy. But since each test took about 8+ hours, I may work on this, some other day!
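
For reference, the Balanced Accuracy reported per class is the mean of that class's sensitivity and specificity. The sketch below recomputes it for the digits ‘6’ and ‘9’ from counts read off the test-set confusion matrix in the output that follows (in caret's layout, rows are predictions and columns are reference labels). The function name is hypothetical. A value of 0.5 arises when a class is never, or almost never, predicted: sensitivity is then close to 0 while specificity is close to 1.

```python
def balanced_accuracy(tp, fn, fp, tn):
    # Mean of the true positive rate and the true negative rate for one class
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Digit 6 on the test set: never predicted, so tp = fp = 0,
# all 958 sixes are false negatives and the other 9042 digits are true negatives
ba6 = balanced_accuracy(tp=0, fn=958, fp=0, tn=9042)

# Digit 9 on the test set: 1 correct out of 1009 nines, 28 other digits predicted as 9
ba9 = balanced_accuracy(tp=1, fn=1008, fp=28, tn=8963)

print(ba6)            # 0.5
print(round(ba9, 7))  # 0.4989384
```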

source("DLfunctions5.R")
source("mnist.R")
show_digit(train$x[2,]) #Set the layer dimensions layersDimensions=c(784, 15,9, 10) # Works at 1500 x <- t(train$x)
X <- x[,1:60000]
y <-train$y y1 <- y[1:60000] y2 <- as.matrix(y1) Y=t(y2) # Subset 32768 random samples from MNIST permutation = c(sample(2^15)) # Randomly shuffle the training data X1 = X[, permutation] y1 = Y[1, permutation] y2 <- as.matrix(y1) Y1=t(y2) # Execute Stochastic Gradient Descent on the entire training set # with Softmax activation retvalsSGD= L_Layer_DeepModel_SGD(X1, Y1, layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.05, mini_batch_size = 512, num_epochs = 1, print_cost = True)  # Compute the Confusion Matrix library(caret) library(e1071) predictions=predictProba(retvalsSGD[['parameters']], X,hiddenActivationFunc='relu', outputActivationFunc="softmax") confusionMatrix(predictions,Y) # Confusion Matrix on the Training set > confusionMatrix(predictions,Y) Confusion Matrix and Statistics Reference Prediction 0 1 2 3 4 5 6 7 8 9 0 5738 1 21 5 16 17 7 15 9 43 1 5 6632 21 24 25 3 2 33 13 392 2 12 32 5747 106 25 28 3 27 44 4779 3 0 27 12 5715 1 21 1 20 1 13 4 10 5 21 18 5677 9 17 30 15 166 5 142 21 96 136 93 5306 5884 43 60 413 6 0 0 0 0 0 0 0 0 0 0 7 6 9 13 13 3 4 0 6085 0 55 8 8 12 7 43 1 32 2 7 5703 69 9 2 3 20 71 1 1 2 5 6 19 Overall Statistics Accuracy : 0.777 95% CI : (0.7737, 0.7804) No Information Rate : 0.1124 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.7524 Mcnemar's Test P-Value : NA Statistics by Class: Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Sensitivity 0.96877 0.9837 0.96459 0.93215 0.97176 0.97879 0.00000 Specificity 0.99752 0.9903 0.90644 0.99822 0.99463 0.87380 1.00000 Pos Pred Value 0.97718 0.9276 0.53198 0.98348 0.95124 0.43513 NaN Neg Pred Value 0.99658 0.9979 0.99571 0.99232 0.99695 0.99759 0.90137 Prevalence 0.09872 0.1124 0.09930 0.10218 0.09737 0.09035 0.09863 Detection Rate 0.09563 0.1105 0.09578 0.09525 0.09462 0.08843 0.00000 Detection Prevalence 0.09787 0.1192 0.18005 0.09685 0.09947 0.20323 0.00000 Balanced Accuracy 0.98314 0.9870 0.93551 0.96518 0.98319 0.92629 0.50000 Class: 
7 Class: 8 Class: 9 Sensitivity 0.9713 0.97471 0.0031938 Specificity 0.9981 0.99666 0.9979464 Pos Pred Value 0.9834 0.96924 0.1461538 Neg Pred Value 0.9967 0.99727 0.9009521 Prevalence 0.1044 0.09752 0.0991500 Detection Rate 0.1014 0.09505 0.0003167 Detection Prevalence 0.1031 0.09807 0.0021667 Balanced Accuracy 0.9847 0.98568 0.5005701  # Confusion Matrix on the Training set xtest <- t(test$x) Xtest <- xtest[,1:10000] ytest <-test$y ytest1 <- ytest[1:10000] ytest2 <- as.matrix(ytest1) Ytest=t(ytest2)  Confusion Matrix and Statistics Reference Prediction 0 1 2 3 4 5 6 7 8 9 0 950 2 2 3 0 6 9 4 7 6 1 3 1110 4 2 9 0 3 12 5 74 2 2 6 965 21 9 14 5 16 12 789 3 1 2 9 908 2 16 0 21 2 6 4 0 1 9 5 938 1 8 6 8 39 5 19 5 25 35 20 835 929 8 54 67 6 0 0 0 0 0 0 0 0 0 0 7 4 4 7 10 2 4 0 952 5 6 8 1 5 8 14 2 16 2 3 876 21 9 0 0 3 12 0 0 2 6 5 1 Overall Statistics Accuracy : 0.7535 95% CI : (0.7449, 0.7619) No Information Rate : 0.1135 P-Value [Acc > NIR] : < 2.2e-16 Kappa : 0.7262 Mcnemar's Test P-Value : NA Statistics by Class: Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Sensitivity 0.9694 0.9780 0.9351 0.8990 0.9552 0.9361 0.0000 Specificity 0.9957 0.9874 0.9025 0.9934 0.9915 0.8724 1.0000 Pos Pred Value 0.9606 0.9083 0.5247 0.9390 0.9241 0.4181 NaN Neg Pred Value 0.9967 0.9972 0.9918 0.9887 0.9951 0.9929 0.9042 Prevalence 0.0980 0.1135 0.1032 0.1010 0.0982 0.0892 0.0958 Detection Rate 0.0950 0.1110 0.0965 0.0908 0.0938 0.0835 0.0000 Detection Prevalence 0.0989 0.1222 0.1839 0.0967 0.1015 0.1997 0.0000 Balanced Accuracy 0.9825 0.9827 0.9188 0.9462 0.9733 0.9043 0.5000 Class: 7 Class: 8 Class: 9 Sensitivity 0.9261 0.8994 0.0009911 Specificity 0.9953 0.9920 0.9968858 Pos Pred Value 0.9577 0.9241 0.0344828 Neg Pred Value 0.9916 0.9892 0.8989068 Prevalence 0.1028 0.0974 0.1009000 Detection Rate 0.0952 0.0876 0.0001000 Detection Prevalence 0.0994 0.0948 0.0029000 Balanced Accuracy 0.9607 0.9457 0.4989384  ### 7. 
Random dataset with Sigmoid activation – Octave The Octave code below uses the random data set used by Python. The code below implements a L-Layer Deep Learning with Sigmoid Activation.  source("DL5functions.m") # Read the data data=csvread("data.csv"); X=data(:,1:2); Y=data(:,3); #Set the layer dimensions layersDimensions = [2 9 7 1]; #tanh=-0.5(ok), #relu=0.1 best! # Perform gradient descent [weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="sigmoid", learningRate = 0.1, numIterations = 10000); # Plot cost vs iterations plotCostVsIterations(10000,costs);  ### 8. Spiral dataset with Softmax activation – Octave The code below uses the spiral data set used by Python above. The code below implements a L-Layer Deep Learning with Softmax Activation. # Read the data data=csvread("spiral.csv"); # Setup the data X=data(:,1:2); Y=data(:,3); # Set the number of features, number of hidden units in hidden layer and number of classess numFeats=2; #No features numHidden=100; # No of hidden units numOutput=3; # No of classes # Set the layer dimensions layersDimensions = [numFeats numHidden numOutput]; #Perform gradient descent with softmax activation unit [weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions, hiddenActivationFunc='relu', outputActivationFunc="softmax", learningRate = 0.1, numIterations = 10000);  ### 9. MNIST dataset with Softmax activation – Octave The code below implements a L-Layer Deep Learning Network in Octave with Softmax output activation unit, for classifying the 10 handwritten digits in the MNIST dataset. Unfortunately, Octave can only index to around 10000 training at a time, and I was getting an error ‘error: out of memory or dimension too large for Octave’s index type error: called from…’, when I tried to create a batch size of 20000. 
So I had to come up with a workaround: create a random batch of 10000 samples and then run Stochastic Gradient Descent on it with mini-batches of 1000 samples. The performance was good. Octave takes about 15 minutes on a batch size of 10000 and a mini-batch of 1000. I thought that if the performance was not good, I could iterate through these random batches and keep refining the weights as follows

# Pseudo code that could be used since Octave only allows 10K batches
# at a time
# Randomly create weights
[weights biases] = initialize_weights()
for i=1:k
    # Create a random permutation and create a random batch
    permutation = randperm(10000);
    X=trainX(permutation,:);
    Y=trainY(permutation,:);
    # Compute weights from SGD and update weights in the next batch update
    [weights biases costs]=L_Layer_DeepModel_SGD(X,Y,mini_batch=1000,weights, biases,...);
    ...
endfor

# Load the MNIST data
load('./mnist/mnist.txt.gz');
# Create a random permutation from 60K
permutation = randperm(10000);
disp(length(permutation));
# Use this 10K as the batch
X=trainX(permutation,:);
Y=trainY(permutation,:);
# Set layer dimensions
layersDimensions=[784, 15, 9, 10];
# Run Stochastic Gradient descent with batch size=10K and mini_batch_size=1000
[weights biases costs]=L_Layer_DeepModel_SGD(X', Y', layersDimensions,
    hiddenActivationFunc='relu',
    outputActivationFunc="softmax",
    learningRate = 0.01,
    mini_batch_size = 2000, num_epochs = 5000);

#### 10. Final thoughts

Here are some of my final thoughts after working on Python, R and Octave in this series and in other projects

1. Python, with its highly optimized numpy library, is ideally suited for creating Deep Learning Models, which have a lot of matrix manipulations. Python is a real workhorse when it comes to Deep Learning computations.
2. R is somewhat clunky in comparison to its cousin Python in handling matrices or in returning multiple values. But R’s statistical libraries, dplyr, and ggplot are really superior to the Python peers. Also, I find R handles dataframes much better than Python.
3. Octave is a no-nonsense, minimalist language which is very efficient in handling matrices. It is ideally suited for implementing Machine Learning and Deep Learning from scratch. But Octave has its problems and cannot handle large matrix sizes, and also lacks the statistical libraries of R and Python. They possibly exist in its sibling, Matlab.

Feel free to clone/download the code from GitHub at DeepLearning-Part5.

#### Conclusion

Building a Deep Learning Network from scratch is quite challenging and time-consuming, but nevertheless an exciting task. While the statements in the different languages for manipulating matrices, summing up columns, or finding columns which have ones don’t take more than a single statement, extreme care has to be taken to ensure that the statements work well for any dimension. The lessons learnt from creating an L-Layer Deep Learning network are many and well worth it. Give it a try!

Hasta la vista! I’ll be back, so stick around! Watch this space!

To see all posts click Index of Posts

# Presentation on ‘Machine Learning in plain English – Part 2’

This presentation is a continuation of my earlier presentation Presentation on ‘Machine Learning in plain English – Part 1’. As the title suggests, the presentation is devoid of any math or programming constructs, and just focuses on the concepts and approaches to different Machine Learning algorithms.
In this 2nd part, I discuss KNN regression, KNN classification, Cross Validation techniques like (LOOCV, K-Fold) feature selection methods including best-fit,forward-fit and backward fit and finally Ridge (L2) and Lasso Regression (L1) If you would like to see the implementations of the discussed algorithms, in this presentation, do check out my book My book ‘Practical Machine Learning with R and Python’ on Amazon To see all post click Index of posts # Presentation on ‘Machine Learning in plain English – Part 1’ This is the first part on my series ‘Machine Learning in plain English – Part 1’ in which I discuss the intuition behind different Machine Learning algorithms, metrics and the approaches etc. These presentations will not include tiresome math or laborious programming constructs, and will instead focus on just the concepts behind the Machine Learning algorithms. This presentation discusses what Machine Learning is, Gradient Descent, linear, multi variate & polynomial regression, bias/variance, under fit, good fit and over fit and finally logistic regression etc. It is hoped that these presentations will trigger sufficient interest in you, to explore this fascinating field further To see actual implementations of the most widely used Machine Learning algorithms in R and Python, check out My book ‘Practical Machine Learning with R and Python’ on Amazon To see all post see “Index of posts # Deep Learning from first principles in Python, R and Octave – Part 2 “What does the world outside your head really ‘look’ like? Not only is there no color, there’s also no sound: the compression and expansion of air is picked up by the ears, and turned into electrical signals. The brain then presents these signals to us as mellifluous tones and swishes and clatters and jangles. Reality is also odorless: there’s no such thing as smell outside our brains. Molecules floating through the air bind to receptors in our nose and are interpreted as different smells by our brain. 
The real world is not full of rich sensory events; instead, our brains light up the world with their own sensuality.” The Brain: The Story of You” by David Eagleman The world is Maya, illusory. The ultimate reality, the Brahman, is all-pervading and all-permeating, which is colourless, odourless, tasteless, nameless and formless Bhagavad Gita ## 1. Introduction This post is a follow-up post to my earlier post Deep Learning from first principles in Python, R and Octave-Part 1. In the first part, I implemented Logistic Regression, in vectorized Python,R and Octave, with a wannabe Neural Network (a Neural Network with no hidden layers). In this second part, I implement a regular, but somewhat primitive Neural Network (a Neural Network with just 1 hidden layer). The 2nd part implements classification of manually created datasets, where the different clusters of the 2 classes are not linearly separable. Neural Network perform really well in learning all sorts of non-linear boundaries between classes. Initially logistic regression is used perform the classification and the decision boundary is plotted. Vanilla logistic regression performs quite poorly. Using SVMs with a radial basis kernel would have performed much better in creating non-linear boundaries. To see R and Python implementations of SVMs take a look at my post Practical Machine Learning with R and Python – Part 4. Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449). 
You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning. Take a look at my video presentation which discusses the below derivation step-by- step Elements of Neural Networks and Deep Learning – Part 3 You can clone and fork this R Markdown file along with the vectorized implementations of the 3 layer Neural Network for Python, R and Octave from Github DeepLearning-Part2 ### 2. The 3 layer Neural Network A simple representation of a 3 layer Neural Network (NN) with 1 hidden layer is shown below. In the above Neural Network, there are 2 input features at the input layer, 3 hidden units at the hidden layer and 1 output layer as it deals with binary classification. The activation unit at the hidden layer can be a tanh, sigmoid, relu etc. 
At the output layer the activation is a sigmoid, to handle binary classification.

In the equations below the superscript on a weight indicates the layer, here layer 1.

$z_{11} = w_{11}^{1}x_{1} + w_{21}^{1}x_{2} + b_{1}$
$z_{12} = w_{12}^{1}x_{1} + w_{22}^{1}x_{2} + b_{1}$
$z_{13} = w_{13}^{1}x_{1} + w_{23}^{1}x_{2} + b_{1}$

Also
$a_{11} = tanh(z_{11})$
$a_{12} = tanh(z_{12})$
$a_{13} = tanh(z_{13})$

For layer 2 (superscript 2)

$z_{21} = w_{11}^{2}a_{11} + w_{21}^{2}a_{12} + w_{31}^{2}a_{13} + b_{2}$
$a_{21} = sigmoid(z_{21})$

Hence
$Z1= \begin{pmatrix} z_{11}\\ z_{12}\\ z_{13} \end{pmatrix} =\begin{pmatrix} w_{11}^{1} & w_{21}^{1} \\ w_{12}^{1} & w_{22}^{1} \\ w_{13}^{1} & w_{23}^{1} \end{pmatrix} * \begin{pmatrix} x_{1}\\ x_{2} \end{pmatrix} + b_{1}$

And
$A1= \begin{pmatrix} a_{11}\\ a_{12}\\ a_{13} \end{pmatrix} = \begin{pmatrix} tanh(z_{11})\\ tanh(z_{12})\\ tanh(z_{13}) \end{pmatrix}$

Similarly
$Z2= z_{21} = \begin{pmatrix} w_{11}^{2} & w_{21}^{2} & w_{31}^{2} \end{pmatrix} *\begin{pmatrix} a_{11}\\ a_{12}\\ a_{13} \end{pmatrix} +b_{2}$
and $A2 = a_{21} = sigmoid(z_{21})$

These equations can be written as
$Z1 = W1 * X + b1$
$A1 = tanh(Z1)$
$Z2 = W2 * A1 + b2$
$A2 = sigmoid(Z2)$

I) Some important results (a memory refresher!)

$d/dx(e^{x}) = e^{x}$ and $d/dx(e^{-x}) = -e^{-x}$ -(a)
and $sinhx = (e^{x} - e^{-x})/2$ and $coshx = (e^{x} + e^{-x})/2$
Using (a) we can show that $d/dx(sinhx) = coshx$ and $d/dx(coshx) = sinhx$ -(b)
Now, by the quotient rule, $d/dx(f(x)/g(x)) = (g(x)*d/dx(f(x)) - f(x)*d/dx(g(x)))/g(x)^{2}$ -(c)
Since $tanhx =z= sinhx/coshx$, using (c) we get
$d/dx(tanhx) = (coshx*d/dx(sinhx) - sinhx*d/dx(coshx))/cosh^{2}x$
Using the values of the derivatives of sinhx and coshx from (b) above we get
$d/dx(tanhx) = (cosh^{2}x - sinh^{2}x)/cosh^{2}x = 1 - tanh^{2}x$
Since $tanhx =z$
$d/dx(tanhx) = 1 - tanh^{2}x= 1 - z^{2}$ -(d)

II) Derivatives

$L=-(Ylog(A2) + (1-Y)log(1-A2))$
$dL/dA2 = -(Y/A2 - (1-Y)/(1-A2))$
Since $A2 = sigmoid(Z2)$ therefore $dA2/dZ2 = A2(1-A2)$ (see Part 1)
$Z2 = W2A1 +b2$
$dZ2/dW2 = A1$
$dZ2/db2 = 1$
$A1 = tanh(Z1)$ and $dA1/dZ1 = 1 - A1^{2}$ from (d)
$Z1 = W1X + b1$
$dZ1/dW1 = X$
$dZ1/db1 = 1$

III) Back propagation

Using the derivatives from II) we can derive the following results using the Chain Rule

$\partial L/\partial Z2 = \partial L/\partial A2 * \partial A2/\partial Z2$
$= -(Y/A2 - (1-Y)/(1-A2)) * A2(1-A2) = A2 - Y$
$\partial L/\partial W2 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial W2$
$= (A2-Y) *A1$ -(A)
$\partial L/\partial b2 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial b2 = (A2-Y)$ -(B)
$\partial L/\partial Z1 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial A1 *\partial A1/\partial Z1 = (A2-Y) * W2 * (1-A1^{2})$
$\partial L/\partial W1 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial A1 *\partial A1/\partial Z1 *\partial Z1/\partial W1$
$=(A2-Y) * W2 * (1-A1^{2}) * X$ -(C)
$\partial L/\partial b1 = \partial L/\partial A2 * \partial A2/\partial Z2 * \partial Z2/\partial A1 *\partial A1/\partial Z1 *\partial Z1/\partial b1$
$= (A2-Y) * W2 * (1-A1^{2})$ -(D)

IV) Gradient Descent

The key computations in the backward cycle are
$W1 = W1-learningRate * \partial L/\partial W1$ – From (C)
$b1 = b1-learningRate * \partial L/\partial b1$ – From (D)
$W2 = W2-learningRate * \partial L/\partial W2$ – From (A)
$b2 = b2-learningRate * \partial L/\partial b2$ – From (B)

The weights and biases (W1,b1,W2,b2) are updated in each iteration, thus minimizing the loss/cost.

These derivations can be represented pictorially using the computation graph (from the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville)

### 3. Manually create a data set that is not linearly separable

Initially I create a dataset with 2 classes which has around 9 clusters that cannot be separated by linear boundaries. Note: This data set is saved as data.csv and is used for the R and Octave Neural networks to see how they perform on the same dataset.

import matplotlib.pyplot as plt
import matplotlib.colors
from sklearn.datasets import make_blobs

colors=['black','gold']
cmap = matplotlib.colors.ListedColormap(colors)
X, y = make_blobs(n_samples = 400, n_features = 2, centers = 7,
                  cluster_std = 1.3, random_state = 4)
#Create 2 classes
y=y.reshape(400,1)
y = y % 2
#Plot the figure
plt.figure()
plt.title('Non-linearly separable classes')
plt.scatter(X[:,0], X[:,1], c=y.ravel(), marker= 'o', s=50,cmap=cmap)
plt.savefig('fig1.png', bbox_inches='tight')

### 4. Logistic Regression

On the above created dataset, classification with logistic regression is performed, and the decision boundary is plotted.
It can be seen that logistic regression performs quite poorly import numpy as np import matplotlib.pyplot as plt import matplotlib.colors import sklearn.linear_model from sklearn.model_selection import train_test_split from sklearn.datasets import make_classification, make_blobs from matplotlib.colors import ListedColormap import sklearn import sklearn.datasets #from DLfunctions import plot_decision_boundary execfile("./DLfunctions.py") # Since import does not work in Rmd!!! colors=['black','gold'] cmap = matplotlib.colors.ListedColormap(colors) X, y = make_blobs(n_samples = 400, n_features = 2, centers = 7, cluster_std = 1.3, random_state = 4) #Create 2 classes y=y.reshape(400,1) y = y % 2 # Train the logistic regression classifier clf = sklearn.linear_model.LogisticRegressionCV(); clf.fit(X, y); # Plot the decision boundary for logistic regression plot_decision_boundary_n(lambda x: clf.predict(x), X.T, y.T,"fig2.png")  ### 5. The 3 layer Neural Network in Python (vectorized) The vectorized implementation is included below. Note that in the case of Python a learning rate of 0.5 and 3 hidden units performs very well. ## Random data set with 9 clusters import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd from sklearn.datasets import make_classification, make_blobs execfile("./DLfunctions.py") # Since import does not work in Rmd!!! 
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9, cluster_std = 1.3, random_state = 4) #Create 2 classes Y1=Y1.reshape(400,1) Y1 = Y1 % 2 X2=X1.T Y2=Y1.T #Perform gradient descent parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=0.5, numIterations = 10000) plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(4),str(0.5),"fig3.png") ## Cost after iteration 0: 0.692669 ## Cost after iteration 1000: 0.246650 ## Cost after iteration 2000: 0.227801 ## Cost after iteration 3000: 0.226809 ## Cost after iteration 4000: 0.226518 ## Cost after iteration 5000: 0.226331 ## Cost after iteration 6000: 0.226194 ## Cost after iteration 7000: 0.226085 ## Cost after iteration 8000: 0.225994 ## Cost after iteration 9000: 0.225915 ### 6. The 3 layer Neural Network in R (vectorized) For this the dataset created by Python is saved to see how R performs on the same dataset. The vectorized implementation of a Neural Network was just a little more interesting as R does not have a similar package like ‘numpy’. While numpy handles broadcasting implicitly, in R I had to use the ‘sweep’ command to broadcast. The implementaion is included below. Note that since the initialization with random weights is slightly different, R performs best with a learning rate of 0.1 and with 6 hidden units source("DLfunctions2_1.R") z <- as.matrix(read.csv("data.csv",header=FALSE)) # x <- z[,1:2] y <- z[,3] x1 <- t(x) y1 <- t(y) #Perform gradient descent nn <-computeNN(x1, y1, 6, learningRate=0.1,numIterations=10000) # Good ## [1] 0.7075341 ## [1] 0.2606695 ## [1] 0.2198039 ## [1] 0.2091238 ## [1] 0.211146 ## [1] 0.2108461 ## [1] 0.2105351 ## [1] 0.210211 ## [1] 0.2099104 ## [1] 0.2096437 ## [1] 0.209409 plotDecisionBoundary(z,nn,6,0.1) ### 7. The 3 layer Neural Network in Octave (vectorized) This uses the same dataset that was generated using Python code. 
source("DL-function2.m") data=csvread("data.csv"); X=data(:,1:2); Y=data(:,3); # Make sure that the model parameters are correct. Take the transpose of X & Y #Perform gradient descent [W1,b1,W2,b2,costs]= computeNN(X', Y',4, learningRate=0.5, numIterations = 10000); ### 8a. Performance for different learning rates (Python) import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd from sklearn.datasets import make_classification, make_blobs execfile("./DLfunctions.py") # Since import does not work in Rmd!!! # Create data X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9, cluster_std = 1.3, random_state = 4) #Create 2 classes Y1=Y1.reshape(400,1) Y1 = Y1 % 2 X2=X1.T Y2=Y1.T # Create a list of learning rates learningRate=[0.5,1.2,3.0] df=pd.DataFrame() #Compute costs for each learning rate for lr in learningRate: parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=lr, numIterations = 10000) print(costs) df1=pd.DataFrame(costs) df=pd.concat([df,df1],axis=1) #Set the iterations iterations=[0,1000,2000,3000,4000,5000,6000,7000,8000,9000] #Create data frame #Set index df1=df.set_index([iterations]) df1.columns=[0.5,1.2,3.0] fig=df1.plot() fig=plt.title("Cost vs No of Iterations for different learning rates") plt.savefig('fig4.png', bbox_inches='tight') ### 8b. Performance for different hidden units (Python) import numpy as np import matplotlib import matplotlib.pyplot as plt import sklearn.linear_model import pandas as pd from sklearn.datasets import make_classification, make_blobs execfile("./DLfunctions.py") # Since import does not work in Rmd!!! 
#Create data set X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9, cluster_std = 1.3, random_state = 4) #Create 2 classes Y1=Y1.reshape(400,1) Y1 = Y1 % 2 X2=X1.T Y2=Y1.T # Make a list of hidden unis numHidden=[3,5,7] df=pd.DataFrame() #Compute costs for different hidden units for numHid in numHidden: parameters,costs = computeNN(X2, Y2, numHidden = numHid, learningRate=1.2, numIterations = 10000) print(costs) df1=pd.DataFrame(costs) df=pd.concat([df,df1],axis=1) #Set the iterations iterations=[0,1000,2000,3000,4000,5000,6000,7000,8000,9000] #Set index df1=df.set_index([iterations]) df1.columns=[3,5,7] #Plot fig=df1.plot() fig=plt.title("Cost vs No of Iterations for different no of hidden units") plt.savefig('fig5.png', bbox_inches='tight') ### 9a. Performance for different learning rates (R) source("DLfunctions2_1.R") # Read data z <- as.matrix(read.csv("data.csv",header=FALSE)) # x <- z[,1:2] y <- z[,3] x1 <- t(x) y1 <- t(y) #Loop through learning rates and compute costs learningRate <-c(0.1,1.2,3.0) df <- NULL for(i in seq_along(learningRate)){ nn <- computeNN(x1, y1, 6, learningRate=learningRate[i],numIterations=10000) cost <- nn$costs
df <- cbind(df,cost)

}      

#Create dataframe
df <- data.frame(df)
iterations=seq(0,10000,by=1000)
df <- cbind(iterations,df)
names(df) <- c("iterations","0.1","1.2","3.0")  # Column names match the learning rates used above
library(reshape2)
library(ggplot2)
df1 <- melt(df,id="iterations")  # Melt the data
#Plot
ggplot(df1) + geom_line(aes(x=iterations,y=value,colour=variable),size=1)  +
xlab("Iterations") +
ylab('Cost') + ggtitle("Cost vs No iterations for  different learning rates")

### 9b. Performance  for different hidden units (R)

source("DLfunctions2_1.R")
# Loop through Num hidden units
numHidden <-c(4,6,9)
df <- NULL
for(i in seq_along(numHidden)){
nn <-  computeNN(x1, y1, numHidden[i], learningRate=0.1,numIterations=10000)
cost <- nn$costs
df <- cbind(df,cost)
}

df <- data.frame(df)
iterations=seq(0,10000,by=1000)
df <- cbind(iterations,df)
names(df) <- c("iterations","4","6","9")
library(reshape2)
# Melt
df1 <- melt(df,id="iterations")
# Plot
ggplot(df1) + geom_line(aes(x=iterations,y=value,colour=variable),size=1) +
xlab("Iterations") +
ylab('Cost') + ggtitle("Cost vs No iterations for different number of hidden units")

## 10a. Performance of the Neural Network for different learning rates (Octave)

source("DL-function2.m")
plotLRCostVsIterations()
print -djpg figa.jpg

## 10b. Performance of the Neural Network for different number of hidden units (Octave)

source("DL-function2.m")
plotHiddenCostVsIterations()
print -djpg figa.jpg

## 11. Turning the heat on the Neural Network

In this 2nd part I create a central region of positives, with the outside region as negatives. The points are generated using the equation of a circle $(x - a)^{2} + (y - b)^{2} = R^{2}$. How does the 3 layer Neural Network perform on this? Here’s a look! Note: The same dataset is also used for the R and Octave Neural Network constructions

## 12. Manually creating a circular central region

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors

colors=['black','gold']
cmap = matplotlib.colors.ListedColormap(colors)
x1=np.random.uniform(0,10,800).reshape(800,1)
x2=np.random.uniform(0,10,800).reshape(800,1)
X=np.append(x1,x2,axis=1)
# Create (x-a)^2 + (y-b)^2 = R^2
# Create a subset of values where squared is <0,4.
Perform ravel() to flatten this vector a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel() Y=a.reshape(800,1) cmap = matplotlib.colors.ListedColormap(colors) plt.figure() plt.title('Non-linearly separable classes') plt.scatter(X[:,0], X[:,1], c=Y, marker= 'o', s=15,cmap=cmap) plt.savefig('fig6.png', bbox_inches='tight') ### 13a. Decision boundary with hidden units=4 and learning rate = 2.2 (Python) With the above hyper parameters the decision boundary is triangular import numpy as np import matplotlib.pyplot as plt import matplotlib.colors import sklearn.linear_model execfile("./DLfunctions.py") x1=np.random.uniform(0,10,800).reshape(800,1) x2=np.random.uniform(0,10,800).reshape(800,1) X=np.append(x1,x2,axis=1) X.shape # Create a subset of values where squared is <0,4. Perform ravel() to flatten this vector a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel() Y=a.reshape(800,1) X2=X.T Y2=Y.T parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=2.2, numIterations = 10000) plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(4),str(2.2),"fig7.png")  ## Cost after iteration 0: 0.692836 ## Cost after iteration 1000: 0.331052 ## Cost after iteration 2000: 0.326428 ## Cost after iteration 3000: 0.474887 ## Cost after iteration 4000: 0.247989 ## Cost after iteration 5000: 0.218009 ## Cost after iteration 6000: 0.201034 ## Cost after iteration 7000: 0.197030 ## Cost after iteration 8000: 0.193507 ## Cost after iteration 9000: 0.191949 ### 13b. Decision boundary with hidden units=12 and learning rate = 2.2 (Python) With the above hyper parameters the decision boundary is triangular import numpy as np import matplotlib.pyplot as plt import matplotlib.colors import sklearn.linear_model execfile("./DLfunctions.py") x1=np.random.uniform(0,10,800).reshape(800,1) x2=np.random.uniform(0,10,800).reshape(800,1) X=np.append(x1,x2,axis=1) X.shape # Create a subset of values where squared is <0,4. 
Perform ravel() to flatten this vector a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel() Y=a.reshape(800,1) X2=X.T Y2=Y.T parameters,costs = computeNN(X2, Y2, numHidden = 12, learningRate=2.2, numIterations = 10000) plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(12),str(2.2),"fig8.png")  ## Cost after iteration 0: 0.693291 ## Cost after iteration 1000: 0.383318 ## Cost after iteration 2000: 0.298807 ## Cost after iteration 3000: 0.251735 ## Cost after iteration 4000: 0.177843 ## Cost after iteration 5000: 0.130414 ## Cost after iteration 6000: 0.152400 ## Cost after iteration 7000: 0.065359 ## Cost after iteration 8000: 0.050921 ## Cost after iteration 9000: 0.039719 ### 14a. Decision boundary with hidden units=9 and learning rate = 0.5 (R) When the number of hidden units is 6 and the learning rate is 0,1, is also a triangular shape in R source("DLfunctions2_1.R") z <- as.matrix(read.csv("data1.csv",header=FALSE)) # N x <- z[,1:2] y <- z[,3] x1 <- t(x) y1 <- t(y) nn <-computeNN(x1, y1, 9, learningRate=0.5,numIterations=10000) # Triangular ## [1] 0.8398838 ## [1] 0.3303621 ## [1] 0.3127731 ## [1] 0.3012791 ## [1] 0.3305543 ## [1] 0.3303964 ## [1] 0.2334615 ## [1] 0.1920771 ## [1] 0.2341225 ## [1] 0.2188118 ## [1] 0.2082687 plotDecisionBoundary(z,nn,6,0.1) ### 14b. Decision boundary with hidden units=8 and learning rate = 0.1 (R) source("DLfunctions2_1.R") z <- as.matrix(read.csv("data1.csv",header=FALSE)) # N x <- z[,1:2] y <- z[,3] x1 <- t(x) y1 <- t(y) nn <-computeNN(x1, y1, 8, learningRate=0.1,numIterations=10000) # Hemisphere ## [1] 0.7273279 ## [1] 0.3169335 ## [1] 0.2378464 ## [1] 0.1688635 ## [1] 0.1368466 ## [1] 0.120664 ## [1] 0.111211 ## [1] 0.1043362 ## [1] 0.09800573 ## [1] 0.09126161 ## [1] 0.0840379 plotDecisionBoundary(z,nn,8,0.1) ### 15a. 
Decision boundary with hidden units=12 and learning rate = 1.5 (Octave) source("DL-function2.m") data=csvread("data1.csv"); X=data(:,1:2); Y=data(:,3); # Make sure that the model parameters are correct. Take the transpose of X & Y [W1,b1,W2,b2,costs]= computeNN(X', Y',12, learningRate=1.5, numIterations = 10000); plotDecisionBoundary(data, W1,b1,W2,b2) print -djpg fige.jpg Conclusion: This post implemented a 3 layer Neural Network to create non-linear boundaries while performing classification. Clearly the Neural Network performs very well when the number of hidden units and learning rate are varied. To be continued… Watch this space!! To see all posts check Index of posts # Deep Learning from first principles in Python, R and Octave – Part 1 “You don’t perceive objects as they are. You perceive them as you are.” “Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.” “The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”  David Eagleman - The Brain: The Story of You This is the first in the series of posts, I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton also on Coursera. In this post I implement Logistic regression with a 2 layer Neural Network i.e. a Neural Network that just has an input layer and an output layer and with no hidden layer.I am certain that any self-respecting Deep Learning/Neural Network would consider a Neural Network without hidden layers as no Neural Network at all! This 2 layer network is implemented in Python, R and Octave languages. I have included Octave, into the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent vectorized implementations. 
So, if you are familiar in any one of the languages, you should be able to look at the corresponding code in the other two. You can download this R Markdown file and Octave code from DeepLearning -Part 1 Check out my video presentation which discusses the derivations in detail 1. Elements of Neural Networks and Deep Le- Part 1 2. Elements of Neural Networks and Deep Learning – Part 2 To start with, Logistic Regression is performed using sklearn’s logistic regression package for the cancer data set also from sklearn. This is shown below ## 1. Logistic Regression import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.datasets import make_classification, make_blobs from sklearn.metrics import confusion_matrix from matplotlib.colors import ListedColormap from sklearn.datasets import load_breast_cancer # Load the cancer data (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True) X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0) # Call the Logisitic Regression function clf = LogisticRegression().fit(X_train, y_train) print('Accuracy of Logistic regression classifier on training set: {:.2f}' .format(clf.score(X_train, y_train))) print('Accuracy of Logistic regression classifier on test set: {:.2f}' .format(clf.score(X_test, y_test))) ## Accuracy of Logistic regression classifier on training set: 0.96 ## Accuracy of Logistic regression classifier on test set: 0.96 To check on other classification algorithms, check my post Practical Machine Learning with R and Python – Part 2. Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. 
The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($14.99) and in kindle version($9.99/Rs449). You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning. ## 2. Logistic Regression as a 2 layer Neural Network In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification. This simple 2 layer Neural Network is shown below At the input layer there are 30 features and the corresponding weights of these inputs which are initialized to small random values. $Z= w_{1}x_{1} +w_{2}x_{2} +..+ w_{30}x_{30} + b$ where ‘b’ is the bias term The Activation function is the sigmoid function which is $a= 1/(1+e^{-z})$ The Loss, when the sigmoid function is used in the output layer, is given by $L=-(ylog(a) + (1-y)log(1-a))$ (1) ## Gradient Descent ### Forward propagation In forward propagation cycle of the Neural Network the output Z and the output of activation function, the sigmoid function, is first computed. Then using the output ‘y’ for the given features, the ‘Loss’ is computed using equation (1) above. 
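
The forward propagation step described above can be sketched for a single training example as follows. This is a minimal illustration in plain Python rather than the vectorized code used in this post; the helper names `forward` and `loss` and the 3-feature example values are assumptions made purely for illustration (the cancer data set has 30 features).

```python
import math

def sigmoid(z):
    """Sigmoid activation a = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Compute Z = w1*x1 + ... + wn*xn + b and return a = sigmoid(Z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def loss(y, a):
    """Loss from equation (1) for one example with label y and activation a."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Hypothetical example with 3 features
x = [0.5, -1.2, 2.0]
w = [0.0, 0.0, 0.0]   # zero weights give Z = 0, hence a = 0.5
b = 0.0

a = forward(w, b, x)
print(a, loss(1, a))
```

With all weights at zero the activation is 0.5 and the loss is log(2), roughly 0.693, whichever label is used. This is why a network with zero or near-zero initial weights typically reports a cost close to 0.69 at iteration 0.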
### Backward propagation

The backward propagation cycle determines how the ‘Loss’ changes for small variations in the parameters, working back from the output layer up to the input layer. In other words, backward propagation computes the gradients of the loss with respect to the weights at the input layer, and these are used to reduce the loss. Several cycles of gradient descent are performed along the path of steepest descent to find a local minimum. In other words, the set of weights and biases at the input layer which results in the lowest loss is computed by gradient descent. In each cycle the weights are decreased by the gradient scaled by a parameter known as the ‘learning rate’. Too big a ‘learning rate’ can overshoot the local minimum, and too small a ‘learning rate’ can take a long time to reach it. This is done for ‘m’ training examples.

Chain rule of differentiation
Let y=f(u) and u=g(x) then
$\partial y/\partial x = \partial y/\partial u * \partial u/\partial x$

Derivative of sigmoid
$\sigma=1/(1+e^{-z})$
Let $x= 1 + e^{-z}$ then $\sigma = 1/x$
$\partial \sigma/\partial x = -1/x^{2}$
$\partial x/\partial z = -e^{-z}$
Using the chain rule of differentiation we get
$\partial \sigma/\partial z = \partial \sigma/\partial x * \partial x/\partial z$
$=-1/(1+e^{-z})^{2}* -e^{-z} = e^{-z}/(1+e^{-z})^{2}$
Since $1/(1+e^{-z}) = \sigma$ and $e^{-z}/(1+e^{-z}) = 1-\sigma$, therefore
$\partial \sigma/\partial z = \sigma(1-\sigma)$ -(2)

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are
$L=-(y*log(a) + (1-y)*log(1-a))$ -(a)
$a=1/(1+e^{-Z})$ -(b)
$Z= w_{1}x_{1} +w_{2}x_{2} +...+ w_{30}x_{30} +b = \sum_{i} w_{i}*x_{i} + b$ -(c)

The back propagation step requires the computation of $dL/dw_{i}$ and $dL/db$. In the case of regression it would be $dE/dw_{i}$ and $dE/db$ where E is the Mean Squared Error function.
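
Equation (2) can be checked numerically before it is used in the chain rule. The sketch below compares the closed form $\sigma(1-\sigma)$ with a central finite difference; the helper names `sigmoid_grad` and `numeric_grad`, the step size and the sample points are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    """Sigmoid function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Closed-form derivative from equation (2): sigma * (1 - sigma)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z, h=1e-6):
    """Central finite-difference approximation of df/dz at z."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Compare the two at a few arbitrary points
for z in (-2.0, 0.0, 0.5, 3.0):
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The two columns should agree to several decimal places. The same finite-difference idea can be applied to the loss itself to check gradients obtained through back propagation.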
Computing the derivatives for back propagation we have

$dL/da = -(y/a - (1-y)/(1-a))$ -(d)

because $d/dx(logx) = 1/x$

Also from equation (2) we get

$da/dZ = a (1-a)$ -(e)

By chain rule

$\partial L/\partial Z = \partial L/\partial a * \partial a/\partial Z$

therefore substituting the results of (d) & (e) we get

$\partial L/\partial Z = -(y/a - (1-y)/(1-a)) * a(1-a) = -y(1-a) + (1-y)a = a-y$ -(f)

Finally

$\partial L/\partial w_{i}= \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial w_{i}$ -(g)

$\partial Z/\partial w_{i} = x_{i}$ -(h)

and from (f) we have $\partial L/\partial Z =a-y$

Therefore (g) reduces to

$\partial L/\partial w_{i} = x_{i}* (a-y)$ -(i)

Also

$\partial L/\partial b = \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial b$ -(j)

Since $\partial Z/\partial b = 1$ and using (f) in (j)

$\partial L/\partial b = a-y$

Gradient descent updates the weights at the input layer and the corresponding bias by using the values of $dw_{i}$ and $db$

$w_{i} := w_{i} -\alpha * dw_{i}$

$b := b -\alpha * db$

where $dw_{i} = \partial L/\partial w_{i}$, $db = \partial L/\partial b$ and $\alpha$ is the learning rate.

I found the computation graph representation in the book Deep Learning (Ian Goodfellow, Yoshua Bengio, Aaron Courville) very useful to visualize and also compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

### 3. Neural Network for Logistic Regression - Python code (vectorized)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Define the sigmoid function
def sigmoid(z):
    a=1/(1+np.exp(-z))
    return a

# Initialize the weights and bias to zeros
def initialize(dim):
    w = np.zeros(dim).reshape(dim,1)
    b = 0
    return w, b

# Compute the loss
def computeLoss(numTraining,Y,A):
    loss=-1/numTraining *np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
    return(loss)

# Execute the forward propagation
def forwardPropagation(w,b,X,Y):
    # Compute Z
    Z=np.dot(w.T,X)+b
    # Determine the number of training samples. X is (features, numSamples),
    # so this is the number of columns (len(X) would count the features)
    numTraining=float(X.shape[1])
    # Compute the output of the sigmoid activation function
    A=sigmoid(Z)
    #Compute the loss
    loss = computeLoss(numTraining,Y,A)
    # Compute the gradients dZ, dw and db
    dZ=A-Y
    dw=1/numTraining*np.dot(X,dZ.T)
    db=1/numTraining*np.sum(dZ)
    # Return the results as a dictionary
    gradients = {"dw": dw, "db": db}
    loss = np.squeeze(loss)
    return gradients,loss

# Compute Gradient Descent
def gradientDescent(w, b, X, Y, numIerations, learningRate):
    losses=[]
    idx =[]
    # Iterate
    for i in range(numIerations):
        gradients,loss=forwardPropagation(w,b,X,Y)
        #Get the derivatives
        dw = gradients["dw"]
        db = gradients["db"]
        w = w-learningRate*dw
        b = b-learningRate*db
        # Store the loss every 100 iterations
        if i % 100 == 0:
            idx.append(i)
            losses.append(loss)
    # Set params and grads
    params = {"w": w, "b": b}
    grads = {"dw": dw, "db": db}
    return params, grads, losses,idx

# Predict the output for a data set
def predict(w,b,X):
    size=X.shape[1]
    yPredicted=np.zeros((1,size))
    # Compute Z, including the bias term
    Z=np.dot(w.T,X)+b
    # Compute the sigmoid
    A=sigmoid(Z)
    for i in range(A.shape[1]):
        #If the value is > 0.5 then set as 1
        if(A[0][i] > 0.5):
            yPredicted[0][i]=1
        else:
            # Else set as 0
            yPredicted[0][i]=0
    return yPredicted

#Normalize the data by dividing each row (sample) by its L2 norm
def normalize(x):
    x_norm = np.linalg.norm(x,axis=1,keepdims=True)
    x= x/x_norm
    return x

# Run the 2 layer Neural Network on the cancer data set
from sklearn.datasets import load_breast_cancer
# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
# Create weight vectors of zeros. The size is the number of features in the data set=30
w=np.zeros((X_train.shape[1],1))
#w=np.zeros((30,1))
b=0
#Normalize the training data so that gradient descent performs better
X_train1=normalize(X_train)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_train2=X_train1.T
# Reshape to remove the rank 1 array and then transpose
y_train1=y_train.reshape(len(y_train),1)
y_train2=y_train1.T
# Run gradient descent for 4000 iterations and compute the weights
parameters, grads, costs,idx = gradientDescent(w, b, X_train2, y_train2, numIerations=4000, learningRate=0.75)
w = parameters["w"]
b = parameters["b"]
# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=X_test1.T
#Reshape y_test
y_test1=y_test.reshape(len(y_test),1)
y_test2=y_test1.T
# Predict the values for the test and training sets
yPredictionTest = predict(w, b, X_test2)
yPredictionTrain = predict(w, b, X_train2)
# Print the accuracy
print("train accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTrain - y_train2)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTest - y_test2)) * 100))
# Plot the Costs vs the number of iterations
plt.plot(idx,costs)
plt.title("Gradient descent-Cost vs No of iterations")
plt.xlabel("No of iterations")
plt.ylabel("Cost")
plt.savefig("fig1", bbox_inches='tight')

## train accuracy: 90.3755868545 %
## test accuracy: 89.5104895105 %

Note: It can be seen that the accuracy on the training and test sets is 90.37% and 89.51%. This is comparatively poorer than the 96% which the logistic regression of sklearn achieves!
But this is mainly because of the absence of hidden layers, which are the real power of neural networks.

### 4. Neural Network for Logistic Regression - R code (vectorized)

source("RFunctions-1.R")

# Define the sigmoid function
sigmoid <- function(z){
    a <- 1/(1+ exp(-z))
    a
}

# Compute the loss
computeLoss <- function(numTraining,Y,A){
    loss <- -1/numTraining* sum(Y*log(A) + (1-Y)*log(1-A))
    return(loss)
}

# Compute forward propagation
forwardPropagation <- function(w,b,X,Y){
    # Compute Z
    Z <- t(w) %*% X +b
    #Set the number of samples
    numTraining <- ncol(X)
    # Compute the activation function
    A=sigmoid(Z)
    #Compute the loss
    loss <- computeLoss(numTraining,Y,A)
    # Compute the gradients dZ, dw and db
    dZ<-A-Y
    dw<-1/numTraining * X %*% t(dZ)
    db<-1/numTraining*sum(dZ)
    fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
    return(fwdProp)
}

# Perform gradient descent for the given number of iterations
gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
    losses <- NULL
    idx <- NULL
    # Loop through the number of iterations
    for(i in 1:numIerations){
        fwdProp <-forwardPropagation(w,b,X,Y)
        #Get the derivatives
        dw <- fwdProp$dw
        db <- fwdProp$db
        #Perform gradient descent
        w = w-learningRate*dw
        b = b-learningRate*db
        l <- fwdProp$loss
        # Store the loss
        if(i %% 100 == 0){
            idx <- c(idx,i)
            losses <- c(losses,l)
        }
    }

    # Return the weights and losses (the fields used later as gradDescent$w etc.)
    gradDescent <- list("w"=w,"b"=b,"losses"=losses,"idx"=idx)
    return(gradDescent)
}

# Compute the predicted value for input
predict <- function(w,b,X){
    m=dim(X)[2]
    # Create a vector of 0's
    yPredicted=matrix(rep(0,m),nrow=1,ncol=m)
    Z <- t(w) %*% X +b
    # Compute sigmoid
    A=sigmoid(Z)
    for(i in 1:dim(A)[2]){
        # If A > 0.5 set value as 1
        if(A[1,i] > 0.5)
            yPredicted[1,i]=1
        else
            # Else set as 0
            yPredicted[1,i]=0
    }

    return(yPredicted)
}

# Normalize the matrix
normalize <- function(x){
    # Compute the L2 norm of each row of the matrix
    n<-as.matrix(sqrt(rowSums(x^2)))
    # Sweep by rows, dividing each row by its norm. The '1' makes sweep operate on every row
    normalized<-sweep(x, 1, n, FUN="/")
    return(normalized)
}

# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
# Rename the target variable
names(cancer) <- c(seq(1,30),"output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Set the features
X_train <-train[,1:30]
y_train <- train[,31]
X_test <- test[,1:30]
y_test <- test[,31]
# Create a matrix of 0's with the number of features
w <-matrix(rep(0,dim(X_train)[2]))
b <-0
X_train1 <- normalize(X_train)
X_train2=t(X_train1)

# Reshape  then transpose
y_train1=as.matrix(y_train)
y_train2=t(y_train1)

# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=t(X_test1)

#Reshape y_test and take transpose
y_test1=as.matrix(y_test)
y_test2=t(y_test1)

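# Run gradient descent to learn the weights. This call does not appear in the listing;
# the iteration count and learning rate are assumed here from the Octave version
gradDescent <- gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.75)
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain = predict(gradDescent$w, gradDescent$b, X_train2)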

sprintf("Train accuracy: %f",(100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f",(100 - mean(abs(yPredictionTest - y_test)) * 100))
## [1] "test accuracy: 87.323944"
# Plot the losses against the number of iterations (requires ggplot2)
library(ggplot2)
df <-data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations","losses")
ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(col="blue") +
    ggtitle("Gradient Descent - Losses vs No of Iterations") +
    xlab("No of iterations") + ylab("Losses")

### 5. Neural Network for Logistic Regression - Octave code (vectorized)

1;
# Define sigmoid function
function a = sigmoid(z)
  a = 1 ./ (1+ exp(-z));
end

# Compute the loss
function loss=computeLoss(numtraining,Y,A)
  loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end

# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
  % Compute Z
  Z = w' * X + b;
  numtraining = size(X)(1,2);
  # Compute sigmoid
  A = sigmoid(Z);
  #Compute loss. Note this is element wise product
  loss =computeLoss(numtraining,Y,A);
  # Compute the gradients dZ, dw and db
  dZ = A-Y;
  dw = 1/numtraining* X * dZ';
  db =1/numtraining*sum(dZ);
end

# Compute Gradient Descent
function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
  #Initialize losses and idx
  losses=[];
  index=[];
  # Loop through the number of iterations
  for i=1:numIerations,
    [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
    # Perform Gradient descent
    w = w - learningRate*dw;
    b = b - learningRate*db;
    if(mod(i,100) ==0)
      # Append index and loss
      index = [index i];
      losses = [losses loss];
    endif
  end
end

# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
  m = size(X)(1,2);
  yPredicted=zeros(1,m);
  # Compute Z
  Z = w' * X + b;
  # Compute sigmoid
  A = sigmoid(Z);
  for i=1:size(X)(1,2),
    # Set predicted as 1 if A >= 0.5
    if(A(1,i) >= 0.5)
      yPredicted(1,i)=1;
    else
      yPredicted(1,i)=0;
    endif
  end
end

# Normalize by dividing each row by its L2 norm
function normalized = normalize(x)
  # Compute the norm of each row. Square the elements, sum rows and then find square root
  a = sqrt(sum(x .^ 2,2));
  # Perform element wise division
  normalized = x ./ a;
end

# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
  # Create a random index
  ix = randperm(length(dataset));
  # Split into training
  trainSize = floor(trainPercent/100 * length(dataset));
  train=dataset(ix(1:trainSize),:);
  # And test
  test=dataset(ix(trainSize+1:length(dataset)),:);
  X_train = train(:,1:30);
  y_train = train(:,31);
  X_test = test(:,1:30);
  y_test = test(:,31);
end

cancer=csvread("cancer.csv");
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
# Normalize X_test
X_test1=normalize(X_test);
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);

trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100
testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100
## trainAccuracy = 90.845
## testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent- Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");
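All three implementations rely on the gradients dZ = A - Y, dw and db derived earlier. One way to gain confidence in them is a finite-difference gradient check. The Python sketch below uses a tiny made-up data set (2 features, 4 samples) and plain lists rather than NumPy. The data, the starting parameters, the function names and the step size `h` are assumptions for illustration only and are not part of the code above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, b, X, Y):
    # X is a list of samples (each a list of features), Y holds the 0/1 labels.
    # Returns the loss averaged over the samples and the analytic gradients.
    m = len(X)
    loss, dw, db = 0.0, [0.0] * len(w), 0.0
    for x, y in zip(X, Y):
        a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        loss += -(y * math.log(a) + (1 - y) * math.log(1 - a)) / m
        dz = a - y                      # dL/dZ = a - y
        for i, xi in enumerate(x):
            dw[i] += xi * dz / m        # dL/dw_i = x_i * (a - y)
        db += dz / m                    # dL/db = a - y
    return loss, dw, db

def numeric_grads(w, b, X, Y, h=1e-6):
    # Central differences on the loss, perturbing one parameter at a time
    dw = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += h
        w_minus = list(w)
        w_minus[i] -= h
        diff = loss_and_grads(w_plus, b, X, Y)[0] - loss_and_grads(w_minus, b, X, Y)[0]
        dw.append(diff / (2 * h))
    diff_b = loss_and_grads(w, b + h, X, Y)[0] - loss_and_grads(w, b - h, X, Y)[0]
    return dw, diff_b / (2 * h)

# Hypothetical toy data and starting parameters
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [0.2, -1.2]]
Y = [1, 0, 1, 0]
w, b = [0.1, -0.2], 0.05

_, dw, db = loss_and_grads(w, b, X, Y)
dw_num, db_num = numeric_grads(w, b, X, Y)
print(dw, dw_num)
print(db, db_num)
```

If the analytic and numeric gradients agree closely, the expressions for dw and db are very likely implemented correctly. A large mismatch usually points to a sign error or a wrong averaging factor.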

## Conclusion
This post starts with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is poorer than that of sklearn’s highly optimized Logistic Regression. This is because the above neural network did not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers.

The Deep Learning journey has begun… Don’t miss the bus!
Stay tuned for more interesting posts in Deep Learning!!

To see all posts check Index of posts