# Deep Learning from first principles in Python, R and Octave – Part 1

“You don’t perceive objects as they are. You perceive them as you are.”
“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”

David Eagleman - The Brain: The Story of You

This is the first in a series of posts I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton, also on Coursera. In this post I implement Logistic Regression with a 2 layer Neural Network, i.e. a Neural Network that has just an input layer and an output layer, with no hidden layer. I am certain that any self-respecting Deep Learning/Neural Network practitioner would consider a Neural Network without hidden layers as no Neural Network at all!

This 2 layer network is implemented in Python, R and Octave. I have included Octave in the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent vectorized implementations. So, if you are familiar with any one of the languages, you should be able to follow the corresponding code in the other two. You can download this R Markdown file and Octave code from DeepLearning -Part 1

Check out my video presentations, which discuss the derivations in detail
1. Elements of Neural Networks and Deep Learning – Part 1
2. Elements of Neural Networks and Deep Learning – Part 2

To start with, Logistic Regression is performed using sklearn’s logistic regression package on the breast cancer data set, which also ships with sklearn. This is shown below

## 1. Logistic Regression

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# load_breast_cancer provides the cancer data set used below
from sklearn.datasets import make_classification, make_blobs, load_breast_cancer

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
# Load the cancer data and split it into training and test sets
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                    random_state = 0)
# Call the Logistic Regression function
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96

To check on other classification algorithms, check my post Practical Machine Learning with R and Python – Part 2.

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($14.99) and in Kindle version ($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python: Second Edition - Machine Learning in stereo”, available on Amazon in paperback ($10.99) and Kindle ($7.99/Rs449) versions. This book is ideal as a quick reference for the various ML functions and associated measurements in both R and Python, which are essential before delving deep into Deep Learning.

## 2. Logistic Regression as a 2 layer Neural Network

In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below

The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.

This simple 2 layer Neural Network is shown below
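At the input layer there are 30 features, each with a corresponding weight. Weights are typically initialized to small random values; the code in this post initializes them to zero, which works here because there is no hidden layer. The weighted sum of the inputs is
$Z= w_{1}x_{1} +w_{2}x_{2} + \dots + w_{30}x_{30} + b$
where ‘b’ is the bias term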

The activation function is the sigmoid function, $a= 1/(1+e^{-Z})$
The Loss, when the sigmoid function is used in the output layer, is given by
$L=-(y\log(a) + (1-y)\log(1-a))$ (1)

### Forward propagation

In the forward propagation cycle of the Neural Network, the weighted sum Z and the output of the activation function (the sigmoid function) are computed first. Then, using the known output ‘y’ for the given features, the ‘Loss’ is computed using equation (1) above.
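As a minimal sketch of the forward pass described above, the snippet below computes Z, the sigmoid activation and the mean loss of equation (1) for a handful of examples. The toy data and the function name `forward` are illustrative assumptions and are not part of the implementation later in this post.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation a = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, X, Y):
    """Forward pass for m examples.

    w: (n, 1) weights, b: scalar bias,
    X: (n, m) features, Y: (1, m) labels in {0, 1}.
    Returns the activations A (1, m) and the loss of equation (1),
    averaged over the m examples.
    """
    Z = np.dot(w.T, X) + b
    A = sigmoid(Z)
    loss = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return A, loss

# Toy data (hypothetical): 3 features, 4 examples
X = np.array([[0.5, 1.0, -1.0, 2.0],
              [1.5, -0.5, 0.0, 1.0],
              [-1.0, 0.5, 2.0, 0.0]])
Y = np.array([[1, 0, 1, 0]])
w = np.zeros((3, 1))
b = 0.0

A, loss = forward(w, b, X, Y)
print(A)     # with zero weights every activation is 0.5
print(loss)  # and the loss is log(2), about 0.693
```

Starting from zero weights is a convenient sanity check: every prediction is 0.5, so the loss must equal log(2) whatever the labels are.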

### Backward propagation

The backward propagation cycle determines how the ‘Loss’ changes for small variations in the weights and bias, working back from the output layer up to the input layer. In other words, backward propagation computes the gradients of the loss with respect to the weights and the bias. Several cycles of gradient descent are then performed along the path of steepest descent to find a minimum of the loss, i.e. the set of weights and bias that results in the lowest loss. In each cycle the weights are decreased by the gradient multiplied by a parameter known as the ‘learning rate’. Too big a ‘learning rate’ can overshoot the minimum, and too small a ‘learning rate’ can take a long time to reach it. The gradients are averaged over the ‘m’ training examples.
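The effect of the learning rate can be illustrated on a much simpler problem than the cancer data. The sketch below, which is an illustration only, runs gradient descent on the hypothetical function f(w) = w², whose minimum is at w = 0, with three assumed learning rates.

```python
def gradient_descent(learning_rate, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w), starting from w0."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - learning_rate * grad
    return w

# Final value of w after 20 steps for a small, a moderate and a large rate
results = {lr: gradient_descent(lr) for lr in (0.01, 0.4, 1.1)}
for lr, w_final in results.items():
    print(lr, w_final)
```

With the small rate w is still far from 0 after 20 steps, with the moderate rate it is essentially 0, and with the large rate each step overshoots the minimum by more than the previous one, so w grows instead of shrinking.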

Chain rule of differentiation
Let y=f(u)
and u=g(x) then
$\partial y/\partial x = \partial y/\partial u * \partial u/\partial x$

Derivative of sigmoid
$\sigma=1/(1+e^{-z})$
Let $x= 1 + e^{-z}$  then
$\sigma = 1/x$
$\partial \sigma/\partial x = -1/x^{2}$
$\partial x/\partial z = -e^{-z}$
Using the chain rule of differentiation we get
$\partial \sigma/\partial z = \partial \sigma/\partial x * \partial x/\partial z$
$=-1/(1+e^{-z})^{2}* -e^{-z} = e^{-z}/(1+e^{-z})^{2}$
Since $1/(1+e^{-z}) = \sigma$ and $e^{-z}/(1+e^{-z}) = 1-\sigma$, we get
$\partial \sigma/\partial z = \sigma(1-\sigma)$        -(2)
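Equation (2) can be checked numerically. The following sketch compares σ(1-σ) with a central finite-difference estimate of the derivative at a few arbitrarily chosen points; the helper names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Analytic derivative from equation (2): sigma * (1 - sigma)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_derivative(f, z, h=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid_prime(z), numerical_derivative(sigmoid, z))
```

The two columns agree to many decimal places, and at z = 0 the derivative takes its largest value, 0.25.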

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are
$L=-(y*log(a) + (1-y)*log(1-a))$      -(a)
$a=1/(1+e^{-Z})$      -(b)
$Z= w_{1}x_{1} +w_{2}x_{2} + \dots + w_{30}x_{30} +b = \sum_{i} w_{i}x_{i} + b$ -(c)

The back propagation step requires the computation of $dL/dw_{i}$ and $dL/db$. In the case of regression it would be $dE/dw_{i}$ and $dE/db$, where E is the Mean Squared Error function.
Computing the derivatives for back propagation we have
$dL/da = -(y/a - (1-y)/(1-a))$          -(d)
because $d/dx(\log x) = 1/x$ and $d/da(\log(1-a)) = -1/(1-a)$
Also from equation (2) we get
$da/dZ = a (1-a)$                                  – (e)
By chain rule
$\partial L/\partial Z = \partial L/\partial a * \partial a/\partial Z$
therefore substituting the results of (d) & (e) we get
$\partial L/\partial Z = -(y/a - (1-y)/(1-a)) * a(1-a) = -y(1-a) + (1-y)a = a-y$         -(f)
Finally
$\partial L/\partial w_{i}= \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial w_{i}$                                                           -(g)
$\partial Z/\partial w_{i} = x_{i}$            – (h)
and from (f) we have  $\partial L/\partial Z =a-y$
Therefore  (g) reduces to
$\partial L/\partial w_{i} = x_{i}* (a-y)$ -(i)
Also
$\partial L/\partial b = \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial b$ -(j)
Since
$\partial Z/\partial b = 1$ and using (f) in (j)
$\partial L/\partial b = a-y$
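The results (i) and $\partial L/\partial b = a-y$ can be verified with a numerical gradient check. The sketch below uses a single made-up example with 3 features (the values of `x`, `w`, `b` and `y` are assumptions chosen for illustration) and compares the analytic gradients with finite-difference estimates of the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_single(w, b, x, y):
    """Loss of equation (a) for one example."""
    a = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def analytic_gradients(w, b, x, y):
    """dL/dw_i = x_i * (a - y) and dL/db = a - y."""
    a = sigmoid(np.dot(w, x) + b)
    return x * (a - y), a - y

# Hypothetical example
x = np.array([0.2, -1.0, 0.5])
w = np.array([0.1, 0.3, -0.2])
b = 0.05
y = 1.0

dw, db = analytic_gradients(w, b, x, y)

# Central finite differences, one weight at a time
h = 1e-6
dw_num = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = h
    dw_num[i] = (loss_single(w + step, b, x, y)
                 - loss_single(w - step, b, x, y)) / (2 * h)
db_num = (loss_single(w, b + h, x, y) - loss_single(w, b - h, x, y)) / (2 * h)

print(dw, dw_num)
print(db, db_num)
```

A check like this is a cheap way to catch sign errors in a hand derivation before they end up in the training code.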

Gradient descent then updates the weights at the input layer and the corresponding bias by using the values
of $dw_{i} = \partial L/\partial w_{i}$ and $db = \partial L/\partial b$, where $\alpha$ is the learning rate
$w_{i} := w_{i} -\alpha * dw_{i}$
$b := b -\alpha * db$
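For m training examples the per-example gradients are averaged, which in vectorized form is dw = (1/m) X dZᵀ and db = (1/m) Σ dZ with dZ = A - Y. As a sketch under assumed toy data (random values, 3 features, 5 examples), the snippet below checks the vectorized form against an explicit loop over the examples and then applies one update step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy problem: n features, m examples
rng = np.random.default_rng(0)
n, m = 3, 5
X = rng.normal(size=(n, m))          # shape (features, examples)
Y = rng.integers(0, 2, size=(1, m))  # labels in {0, 1}
w = rng.normal(size=(n, 1)) * 0.01
b = 0.0
alpha = 0.1                          # assumed learning rate

# Vectorized gradients
A = sigmoid(np.dot(w.T, X) + b)
dZ = A - Y
dw = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m

# The same gradients, accumulated one example at a time
dw_loop = np.zeros((n, 1))
db_loop = 0.0
for j in range(m):
    dw_loop[:, 0] += X[:, j] * (A[0, j] - Y[0, j])
    db_loop += A[0, j] - Y[0, j]
dw_loop /= m
db_loop /= m

# One gradient descent update
w_new = w - alpha * dw
b_new = b - alpha * db
print(np.allclose(dw, dw_loop), np.isclose(db, db_loop))
```

The vectorized form replaces the loop over examples with a single matrix product, which is what the implementations in the following sections rely on.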
I found the computation graph representation in the book Deep Learning: Ian Goodfellow, Yoshua Bengio, Aaron Courville, very useful to visualize and also compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

### 3. Neural Network for Logistic Regression -Python code (vectorized)

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Define the sigmoid function
def sigmoid(z):
    a=1/(1+np.exp(-z))
    return a

# Initialize the weights and the bias to zero
def initialize(dim):
    w = np.zeros(dim).reshape(dim,1)
    b = 0
    return w, b

# Compute the loss
def computeLoss(numTraining,Y,A):
    loss=-1/numTraining *np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
    return(loss)

# Execute the forward propagation
def forwardPropagation(w,b,X,Y):
    # Compute Z
    Z=np.dot(w.T,X)+b
    # Determine the number of training samples. X is (features, numSamples),
    # so this is the number of columns (len(X) would give the number of features)
    numTraining=float(X.shape[1])
    # Compute the output of the sigmoid activation function
    A=sigmoid(Z)
    #Compute the loss
    loss = computeLoss(numTraining,Y,A)
    # Compute the gradients dZ, dw and db
    dZ=A-Y
    dw=1/numTraining*np.dot(X,dZ.T)
    db=1/numTraining*np.sum(dZ)

    # Return the results as a dictionary
    gradients = {"dw": dw,
                 "db": db}
    loss = np.squeeze(loss)
    return gradients, loss

# Compute Gradient Descent
def gradientDescent(w, b, X, Y, numIerations, learningRate):
    losses=[]
    idx =[]
    # Iterate
    for i in range(numIerations):
        gradients, loss = forwardPropagation(w,b,X,Y)
        #Get the derivatives
        dw = gradients["dw"]
        db = gradients["db"]
        w = w-learningRate*dw
        b = b-learningRate*db

        # Store the loss
        if i % 100 == 0:
            idx.append(i)
            losses.append(loss)
    # Return the weights, the last gradients and the stored losses
    params = {"w": w,
              "b": b}
    grads = {"dw": dw,
             "db": db}
    return params, grads, losses, idx

# Predict the output for a training set
def predict(w,b,X):
    size=X.shape[1]
    yPredicted=np.zeros((1,size))
    Z=np.dot(w.T,X)+b
    # Compute the sigmoid
    A=sigmoid(Z)
    for i in range(A.shape[1]):
        #If the value is > 0.5 then set as 1
        if(A[0][i] > 0.5):
            yPredicted[0][i]=1
        else:
            # Else set as 0
            yPredicted[0][i]=0

    return yPredicted

#Normalize the data by dividing each row (sample) by its L2 norm
def normalize(x):
    x_norm = np.linalg.norm(x,axis=1,keepdims=True)
    x= x/x_norm
    return x
# Run the 2 layer Neural Network on the cancer data set
from sklearn.datasets import load_breast_cancer

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                    random_state = 0)

# Create weight vectors of zeros. The size is the number of features in the data set=30
w=np.zeros((X_train.shape[1],1))
#w=np.zeros((30,1))
b=0

#Normalize the training data so that gradient descent performs better
X_train1=normalize(X_train)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_train2=X_train1.T

# Reshape to remove the rank 1 array and then transpose
y_train1=y_train.reshape(len(y_train),1)
y_train2=y_train1.T

# Run gradient descent for 4000 times and compute the weights
# The learning rate of 0.75 is an assumption, taken from the Octave version below
parameters, grads, costs, idx = gradientDescent(w, b, X_train2, y_train2,
                                                numIerations=4000, learningRate=0.75)
w = parameters["w"]
b = parameters["b"]

# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=X_test1.T

#Reshape y_test
y_test1=y_test.reshape(len(y_test),1)
y_test2=y_test1.T

# Predict the values for the test and training sets
yPredictionTest = predict(w, b, X_test2)
yPredictionTrain = predict(w, b, X_train2)

# Print the accuracy
print("train accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTrain - y_train2)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTest - y_test2)) * 100))

# Plot the Costs vs the number of iterations
plt.plot(idx,costs)
plt.title("Gradient descent-Cost vs No of iterations")
plt.xlabel("No of iterations")
plt.ylabel("Cost")
plt.savefig("fig1", bbox_inches='tight')
## train accuracy: 90.3755868545 %
## test accuracy: 89.5104895105 %

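Note: It can be seen that the accuracy on the training and test sets is 90.37% and 89.51% respectively. This is poorer than the 96% which the logistic regression of sklearn achieves! Since both are logistic regression models without hidden layers, the gap most likely comes from sklearn’s optimized solver and regularization, and from the different preprocessing used here, rather than from the model itself. Hidden layers, which are the real power of neural networks, are still absent at this stage.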
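The explicit loop in `predict` above can also be written as a single thresholding operation. The sketch below is one possible vectorized variant; the names `predict_vectorized` and `accuracy` and the toy weights are illustrative assumptions, not part of the code used for the results above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_vectorized(w, b, X):
    """Return a (1, m) array of 0/1 predictions without a Python loop."""
    A = sigmoid(np.dot(w.T, X) + b)
    return (A > 0.5).astype(float)

def accuracy(y_pred, y_true):
    """Percentage of predictions that match the labels."""
    return 100.0 * np.mean(y_pred == y_true)

# Hypothetical weights and data: 2 features, 3 examples
w = np.array([[1.0], [-1.0]])
b = 0.0
X = np.array([[2.0, 0.0, -1.0],
              [0.0, 3.0, 1.0]])
Y = np.array([[1, 0, 1]])

y_pred = predict_vectorized(w, b, X)
print(y_pred)               # [[1. 0. 0.]]
print(accuracy(y_pred, Y))  # 2 of the 3 predictions match
```

Here Z is [2, -3, -2], so only the first activation exceeds 0.5 and the third example is misclassified.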

### 4. Neural Network for Logistic Regression -R code (vectorized)

source("RFunctions-1.R")
# Define the sigmoid function
sigmoid <- function(z){
  a <- 1/(1+ exp(-z))
  a
}

# Compute the loss
computeLoss <- function(numTraining,Y,A){
  loss <- -1/numTraining* sum(Y*log(A) + (1-Y)*log(1-A))
  return(loss)
}

# Compute forward propagation
forwardPropagation <- function(w,b,X,Y){
  # Compute Z
  Z <- t(w) %*% X +b
  #Set the number of samples
  numTraining <- ncol(X)
  # Compute the activation function
  A=sigmoid(Z)

  #Compute the loss
  loss <- computeLoss(numTraining,Y,A)

  # Compute the gradients dZ, dw and db
  dZ<-A-Y
  dw<-1/numTraining * X %*% t(dZ)
  db<-1/numTraining*sum(dZ)

  fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
  return(fwdProp)
}

# Perform Gradient descent
gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
  losses <- NULL
  idx <- NULL
  # Loop through the number of iterations
  for(i in 1:numIerations){
    fwdProp <-forwardPropagation(w,b,X,Y)
    #Get the derivatives
    dw <- fwdProp$dw
    db <- fwdProp$db
    w = w-learningRate*dw
    b = b-learningRate*db
    l <- fwdProp$loss
    # Store the loss
    if(i %% 100 == 0){
      idx <- c(idx,i)
      losses <- c(losses,l)
    }
  }
  # Return the weights and losses
  gradDescnt <- list("w"=w,"b"=b,"dw"=dw,"db"=db,"losses"=losses,"idx"=idx)
  return(gradDescnt)
}

# Compute the predicted value for input
predict <- function(w,b,X){
  m=dim(X)[2]
  # Create a vector of 0's
  yPredicted=matrix(rep(0,m),nrow=1,ncol=m)
  Z <- t(w) %*% X +b
  # Compute sigmoid
  A=sigmoid(Z)
  for(i in 1:dim(A)[2]){
    # If A > 0.5 set value as 1
    if(A[1,i] > 0.5) {
      yPredicted[1,i]=1
    } else {
      # Else set as 0
      yPredicted[1,i]=0
    }
  }
  return(yPredicted)
}

# Normalize the matrix
normalize <- function(x){
  #Compute the L2 norm of each row of the matrix
  n<-as.matrix(sqrt(rowSums(x^2)))
  #Sweep by rows by norm. Note the '1' in the function, which applies the division to every row
  normalized<-sweep(x, 1, n, FUN="/")
  return(normalized)
}

# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1,30),"output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]
# Set the features
X_train <-train[,1:30]
y_train <- train[,31]
X_test <- test[,1:30]
y_test <- test[,31]
# Create a matrix of 0's with the number of features
w <-matrix(rep(0,dim(X_train)[2]))
b <-0
X_train1 <- normalize(X_train)
X_train2=t(X_train1)
# Reshape then transpose
y_train1=as.matrix(y_train)
y_train2=t(y_train1)
# Perform gradient descent
gradDescent= gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)
# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=t(X_test1)
#Reshape y_test and take transpose
y_test1=as.matrix(y_test)
y_test2=t(y_test1)
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain = predict(gradDescent$w, gradDescent$b, X_train2)
sprintf("Train accuracy: %f",(100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f",(100 - mean(abs(yPredictionTest - y_test2)) * 100))
## [1] "test accuracy: 87.323944"
# ggplot2 provides ggplot(); loaded here in case RFunctions-1.R does not load it
library(ggplot2)
df <-data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations","losses")
ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(col="blue") +
  ggtitle("Gradient Descent - Losses vs No of Iterations") +
  xlab("No of iterations") + ylab("Losses")

### 5. Neural Network for Logistic Regression -Octave code (vectorized)

1;
# Define sigmoid function
function a = sigmoid(z)
  a = 1 ./ (1+ exp(-z));
end

# Compute the loss
function loss=computeLoss(numtraining,Y,A)
  loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end

# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
  % Compute Z
  Z = w' * X + b;
  numtraining = size(X)(1,2);
  # Compute sigmoid
  A = sigmoid(Z);
  #Compute loss. Note this is element wise product
  loss =computeLoss(numtraining,Y,A);
  # Compute the gradients dZ, dw and db
  dZ = A-Y;
  dw = 1/numtraining* X * dZ';
  db =1/numtraining*sum(dZ);
end

# Compute Gradient Descent
function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
  #Initialize losses and idx
  losses=[];
  index=[];
  # Loop through the number of iterations
  for i=1:numIerations,
    [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
    # Perform Gradient descent
    w = w - learningRate*dw;
    b = b - learningRate*db;
    if(mod(i,100) ==0)
      # Append index and loss
      index = [index i];
      losses = [losses loss];
    endif
  end
end

# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
  m = size(X)(1,2);
  yPredicted=zeros(1,m);
  # Compute Z
  Z = w' * X + b;
  # Compute sigmoid
  A = sigmoid(Z);
  for i=1:size(X)(1,2),
    # Set predicted as 1 if A >= 0.5
    if(A(1,i) >= 0.5)
      yPredicted(1,i)=1;
    else
      yPredicted(1,i)=0;
    endif
  end
end

# Normalize by dividing each row by its L2 norm
function normalized = normalize(x)
  # Compute the norm of each row. Square the elements, sum rows and then find square root
  a = sqrt(sum(x .^ 2,2));
  # Perform element wise division
  normalized = x ./ a;
end

# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
  # Create a random index
  ix = randperm(length(dataset));
  # Split into training
  trainSize = floor(trainPercent/100 * length(dataset));
  train=dataset(ix(1:trainSize),:);
  # And test
  test=dataset(ix(trainSize+1:length(dataset)),:);
  X_train = train(:,1:30);
  y_train = train(:,31);
  X_test = test(:,1:30);
  y_test = test(:,31);
end

cancer=csvread("cancer.csv");
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
# Normalize X_test
X_test1=normalize(X_test);
#Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);
trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100
testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100
## trainAccuracy = 90.845
## testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent- Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");

## Conclusion

This post starts with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is poorer than that of the highly optimized sklearn’s Logistic Regression. The network above uses plain gradient descent and does not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers.

The Deep Learning journey has begun… Don’t miss the bus!
Stay tuned for more interesting posts in Deep Learning!!

To see all posts check Index of posts

# My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

Then felt I like some watcher of the skies
When a new planet swims into his ken;
Or like stout Cortez when with eagle eyes
He star’d at the Pacific—and all his men
Look’d at each other with a wild surmise—
Silent, upon a peak in Darien.

On First Looking into Chapman’s Homer by John Keats

The above excerpt from John Keats’s poem captures the exhilaration that one experiences when discovering something for the first time. This also summarizes, to some extent, my own enjoyment while pursuing Data Science, Machine Learning and the like.

I decided to write this post, as occasionally youngsters approach me and ask me where they should start their adventure in Data Science & Machine Learning. There are other times, when the ‘not-so-youngsters’ want to know what their next step should be after having done some courses. This post includes my travels through the domains of Data Science, Machine Learning, Deep Learning and (soon to be done AI).

By no means am I an authority in this field, which is ever-widening and almost bottomless, yet I would like to share some of my experiences in this fascinating field. I include a short review of the courses I have done below. I also include alternative routes through courses which I did not do, but which are probably equally good. Feel free to pick and choose any course or set of courses. Alternatively, you may prefer to read books or attend bricks-n-mortar classes. In any case, I hope the list below will provide you with some overall direction.

All my learning in the above domains has come from MOOCs and I restrict myself to the top 3 MOOCs, or in my opinion, ‘the original MOOCs’, namely Coursera, edX or Udacity, but may throw in some courses from other online sites if they are only available there.
I would recommend these 3 MOOCs over the other numerous online courses and also over face-to-face classroom courses for the following reasons. These MOOCs
• Are offered by world class colleges and the lectures are delivered by top class Professors who have a great depth of knowledge and a wealth of experience
• The Professors, besides delivering quality content, also point out important tips, tricks and traps
• You can revisit lectures in online courses anytime to refresh your memory
• Lectures are usually short, between 8-15 mins (Personally, my attention span is around 15-20 mins at a time!)

Here is a fair warning and something quite obvious. No amount of courses, lectures or books will help if you don’t put it to use through some language like Octave, R or Python.

## The journey

My trip through Data Science and Machine Learning started with an off-chance remark, about 3 years ago, from an old friend of mine who spoke to me about having done a few courses at Coursera and having really liked them. He further suggested that I should try. This was the final push which set me sailing into this vast domain.

I have included the list of the courses I have done over the past 5 years (37+ certifications completed and another 9 audited, i.e. listened to only, without doing the assignments). For each of the courses I have included a short review of the course, whether I think the course is mandatory, the language on which the course is based, and finally whether I have done the course myself etc. I have also included alternative courses, which I may not have done, but which I think are equally good. Finally, I suggest some courses which I have heard of and which are very good and worth taking.

1. Machine Learning, Stanford, Prof Andrew Ng, Coursera (Requirement: Mandatory, Language: Octave, Status: Completed)
This course provides an excellent foundation to build your Machine Learning citadel on. The course covers the mathematical details of linear, logistic and multivariate regression.
There is also a good coverage of topics like Neural Networks, SVMs, Anomaly Detection, underfitting, overfitting, regularization etc. Prof Andrew Ng presents the material in a very lucid manner. It is a great course to start with. It would be a good idea to brush up some basics of linear algebra, matrices and a little bit of calculus, specifically computing the local maxima/minima. You should be able to take this course even if you don’t know Octave as the Prof goes over the key aspects of the language.

2. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford (Requirement: Mandatory, Language: R, Status: Completed)
The course includes linear and polynomial regression, and logistic regression. Details also include cross-validation and the bootstrap methods, how to do model selection and regularization (ridge and lasso). It also touches on non-linear models, generalized additive models, boosting and SVMs. Some unsupervised learning methods are also discussed. The 2 Professors take turns in delivering lectures with a slight touch of humor.

3a. Data Science Specialization: Prof Roger Peng, Prof Brian Caffo & Prof Jeff Leek, Johns Hopkins University (Requirement: Option A, Language: R, Status: Completed)
This is a comprehensive 10 module specialization based on R. This Specialization gives a very broad overview of Data Science and Machine Learning. The modules cover R programming, Statistical Inference, Practical Machine Learning, how to build R products and R packages and finally has a very good Capstone project on NLP

3b. Applied Data Science with Python Specialization: University of Michigan (Requirement: Option B, Language: Python, Status: Not done)
In this specialization I only did the Applied Machine Learning in Python (Prof Kevyn Collins-Thompson). This is a very good course that covers a lot of Machine Learning algorithms (linear, logistic, ridge, lasso regression, knn, SVMs etc.). Also included are confusion matrices, ROC curves etc.
This is based on Python’s Scikit Learn

3c. Machine Learning Specialization, University Of Washington (Requirement: Option C, Language: Python, Status: Not completed)
This appears to be a very good Specialization in Python

4. Statistics with R Specialization, Duke University (Requirement: Useful and a must know, Language: R, Status: Not Completed)
I audited (listened only) to the following 2 modules from this Specialization.
a. Inferential Statistics
b. Linear Regression and Modeling
Both these courses are taught by Prof Mine Çetinkaya-Rundel who delivers her lessons with extraordinary clarity. Her lectures are filled with many examples which she walks you through in great detail

5. Bayesian Statistics: From Concept to Data Analysis: Univ of California, Santa Cruz (Requirement: Optional, Language: R, Status: Completed)
This is an interesting course and provides an alternative point of view to the frequentist approach

6. Data Science and Engineering with Spark, University of California, Berkeley, Prof Anthony Joseph, Prof Ameet Talwalkar, Prof Jon Bates (Requirement: Mandatory for Big Data, Language: pySpark, Status: Completed)
This specialization contains 3 modules
a. Introduction to Apache Spark
b. Distributed Machine Learning with Apache Spark
c. Big Data Analysis with Apache Spark
This is an excellent course for those who want to make an entry into Distributed Machine Learning. The exercises are fairly challenging and your code will predominantly be made of map/reduce and lambda operations as you process data that is distributed across Spark RDDs. I really liked the part where the Prof shows how a matrix multiplication, which on a single machine is of the order of O(nd^2+d^3) (and which is the basis of Machine Learning), is reduced to O(nd^2) by taking outer products on data which is distributed.

7. Deep Learning, Prof Andrew Ng, Younes Bensouda Mourri, Kian Katanforoosh (Requirement: Mandatory, Language: Python, Tensorflow, Status: Completed)
This course had 5 Modules which start from the fundamentals of Neural Networks, their derivation and vectorized Python implementation. The specialization also covers regularization, optimization techniques, mini batch normalization, Convolutional Neural Networks, Recurrent Neural Networks and LSTMs applied to a wide variety of real world problems

The modules are
a. Neural Networks and Deep Learning
In this course Prof Andrew Ng explains differential calculus, linear algebra and vectorized Python implementations of Deep Learning algorithms. The derivation for back-propagation is done and then the Prof shows how to compute a multi-layered DL network
b. Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Deep Neural Networks can be very flexible, and come with a lot of knobs (hyper-parameters) to tune. In this module, Prof Andrew Ng shows a systematic way to tune hyperparameters and by how much one should tune. The course also covers regularization (L1, L2, dropout), gradient descent optimization and batch normalization methods. The visualizations used to explain the momentum method, RMSprop, Adam, LR decay and batch normalization are really powerful and serve to clarify the concepts. As an added bonus, the module also includes a great introduction to Tensorflow.
c. Structuring Machine Learning Projects
A very good module with useful tips, tricks and traps that need to be considered while working on Machine Learning and Deep Learning projects
d. Convolutional Neural Networks
This domain has a lot of really cool ideas, where images represented as 3D volumes are compressed and stretched longitudinally before applying a multi-layered deep learning neural network to this thin slice for performing classification, detection etc.
The Prof provides a glimpse into this fascinating world of image classification, detection and neural art transfer with frameworks like Keras and Tensorflow.
e. Sequence Models
This module covers in good detail concepts like RNNs, GRUs, LSTMs, word embeddings, beam search and the attention model.

8. Neural Networks for Machine Learning, Prof Geoffrey Hinton, University of Toronto (Requirement: Mandatory, Language: Octave, Status: Completed)
This is a broad course which starts from the basics of Perceptrons, all the way to Boltzmann Machines, RNNs, CNNs, LSTMs etc. The course also covers regularisation, learning rate decay, momentum method etc.

9. Probabilistic Graphical Models, Stanford, Prof Daphne Koller (Language: Octave, Status: Partially completed)
This has 3 courses
a. Probabilistic Graphical Models 1: Representation – Done
b. Probabilistic Graphical Models 2: Inference – To do
c. Probabilistic Graphical Models 3: Learning – To do
This course discusses how a system, which can be represented as a complex interaction of probability distributions, will behave. This is probably the toughest course I did. I did manage to get through the 1st module. While I felt that I grasped a few things, I did not wholly understand the import of this. However I feel this is an important domain and I will definitely revisit this in future

10. Reinforcement Learning Specialization: University of Alberta, Prof Adam White and Prof Martha White (Requirement: Very important, Language: Python, Status: Partially Completed)
This is a set of 4 courses. I did the first 2 of the 4. Reinforcement Learning appears deceptively simple, but it is anything but simple. Definitely a very critical area to learn.
a. Fundamentals of Reinforcement Learning: This course discusses Markov models, value functions, Bellman equations and dynamic programming.
b. Sample-based Learning Methods: This course touches on Monte Carlo methods, Temporal Difference methods, Q Learning etc.
Reinforcement Learning is a must-have in your AI arsenal.

11. Tensorflow in Practice Specialization – Prof Laurence Moroney – Deep Learning.AI (Requirement: Important, Language: Python, Status: Completed)
This is a good course but definitely do the Deep Learning Specialization by Prof Andrew Ng. There are 4 courses in this Specialization. I completed all 4 courses. They are fairly straightforward
a. Introduction to TensorFlow – This course introduces you to Tensorflow and image recognition with a brute-force method
b. Convolutional Neural Networks in Tensorflow – This course touches on how to build a CNN, image augmentation, transfer learning and multi-class classification
c. Natural Language Processing in Tensorflow – Word embeddings, sentiment analysis, LSTMs and RNNs are discussed.
d. Sequences, time series and prediction – This course discusses using RNNs for time series and auto correlation

12. Natural Language Processing Specialization – Prof Younes Bensouda Mourri, Lukasz Kaiser from DeepLearning.AI (Requirement: Very Important, Language: Python, Status: Partially Completed)
This is the latest specialization from Deep Learning.AI. I have completed the first 2 courses
a. Natural Language Processing with Classification and Vector Spaces – The first course deals with sentiment analysis with Naive Bayes, vector space models, capturing dependencies using PCA etc.
b. Natural Language Processing with Probabilistic Models – In this course techniques for auto correction, Markov models and the Viterbi algorithm for Parts of Speech tagging, auto completion and word embedding are discussed.

13. Mining Massive Data Sets, Prof Jure Leskovec, Prof Anand Rajaraman and Prof Jeff Ullman, Online Stanford (Status: Partially done)
I did quickly audit this course, a year back, when it used to be on Coursera. It now seems to have moved to Stanford online. But this is a very good course that discusses key concepts of Mining Big Data of the order of a few Petabytes

14. Introduction to Artificial Intelligence, Prof Sebastian Thrun & Prof Peter Norvig, Udacity
This is a really good course. I have started on this course a couple of times and somehow gave up. Will revisit to complete in future. Quite extensive in its coverage. Touches BFS, DFS, A-Star, PGM, Machine Learning etc.

15. Deep Learning (with TensorFlow), Vincent Vanhoucke, Principal Scientist at Google Brain
Got started on this one and abandoned some time back. In my to do list though

My learning journey is based on Lao Tzu’s dictum of ‘A good traveler has no fixed plans and is not intent on arriving’. You could have a goal and try to plan your courses accordingly.
And so my journey continues…

I hope you find this list useful.
Have a great journey ahead!!!

# Neural Networks: On Perceptrons and Sigmoid Neurons

Neural Networks had their beginnings in 1943 when Warren McCulloch, a neurophysiologist, and a young mathematician, Walter Pitts, wrote a paper on how neurons might work. Much later in 1958, Frank Rosenblatt, a neuro-biologist, proposed the Perceptron. The Perceptron is a computer model or computerized machine which is devised to represent or simulate the ability of the brain to recognize and discriminate. In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers

Initially it was believed that Perceptrons were capable of many things including “the ability to walk, talk, see, write, reproduce itself and be conscious of its existence.”

However, a subsequent paper by Marvin Minsky and Seymour Papert of MIT, titled “Perceptrons”, proved that the Perceptron was truly limited in its functionality. Specifically they showed that the Perceptron was incapable of producing XOR functionality. The Perceptron is only capable of classification where the data points are linearly separable.

Check out my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in Kindle version ($9.99/Rs449).

This post implements the simple learning algorithm of the ‘Linear Perceptron’ and the ‘Sigmoid Perceptron’.  The implementation has been done in Octave. This implementation is based on “Neural networks for Machine Learning” course by Prof Geoffrey Hinton at Coursera

Perceptron learning procedure
$z = \sum_{i} w_{i}x_{i} + b$
where $w_{i}$ is the ith weight and $x_{i}$ is the ith feature

For every training case compute the activation output $z_{i}$

• If the output classifies correctly, leave the weights alone
• If the output incorrectly classifies a ‘0’ as a ‘1’, then subtract the feature from the weight
• If the output incorrectly classifies a ‘1’ as a ‘0’, then add the feature to the weight
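The procedure above can be sketched in a few lines of Python (the post's own implementation is in Octave, so this is only an illustrative translation). The threshold at z >= 0, the treatment of the bias as a weight on a constant input of 1, the function names and the OR training data are all assumptions made for this sketch.

```python
def predict(weights, bias, features):
    """Binary threshold unit: output 1 when z = sum(w_i * x_i) + b >= 0."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if z >= 0 else 0


def train_perceptron(samples, labels, max_epochs=100):
    """Apply the perceptron rule until a full pass makes no mistakes."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for features, target in zip(samples, labels):
            if predict(weights, bias, features) == target:
                continue  # classified correctly: leave the weights alone
            # a '0' classified as '1' -> subtract; a '1' classified as '0' -> add
            sign = 1 if target == 1 else -1
            weights = [w + sign * x for w, x in zip(weights, features)]
            bias += sign  # assumption: bias is a weight on a constant input of 1
            errors += 1
        if errors == 0:
            break
    return weights, bias


# Hypothetical linearly separable data: the OR function
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(weights, bias, predictions)
```

Because the OR data is linearly separable, the loop stops after a handful of passes; on data that is not linearly separable (XOR, for instance) it would simply run until `max_epochs`.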

This simple neural network is represented below

Sigmoid neuron learning procedure
$z_{i} = sigmoid(\sum w_{i}x_{i} + b)$
where sigmoid is
$sigmoid(z) = \frac{1}{1+e^{-z}}$

Hence
$z_{i} = \frac{1}{1+e^{-(\sum w_{i}x_{i}+b)}}$
For every training case compute the activation output $z_{i}$

• If the output classifies correctly, leave the weights alone
• If the output incorrectly classifies a ‘0’ as a ‘1’, i.e. $z_{i} > sigmoid(0)$, then subtract the feature from the weight
• If the output incorrectly classifies a ‘1’ as a ‘0’, i.e. $z_{i} < sigmoid(0)$, then add the feature to the weight
• Iterate till errors <= 1
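A minimal Python sketch of this sigmoid procedure follows, again as an illustration and not the Octave code from the post. It assumes that errors are counted in a separate pass over the training set after each round of updates, that an activation exactly equal to sigmoid(0) is read as ‘0’, and that the bias is updated like a weight on a constant input of 1. The `max_errors` parameter defaults to the tolerance of 1 used in the text; the example call tightens it to 0 on hypothetical AND data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(weights, bias, features):
    """Read the sigmoid output as class 1 only when it exceeds sigmoid(0) = 0.5."""
    activation = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
    return 1 if activation > sigmoid(0) else 0


def count_errors(weights, bias, samples, labels):
    return sum(classify(weights, bias, s) != t for s, t in zip(samples, labels))


def train_sigmoid_neuron(samples, labels, max_errors=1, max_epochs=100):
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        for features, target in zip(samples, labels):
            if classify(weights, bias, features) == target:
                continue  # correct: leave the weights alone
            sign = 1 if target == 1 else -1  # add for a missed '1', subtract for a false '1'
            weights = [w + sign * x for w, x in zip(weights, features)]
            bias += sign
        if count_errors(weights, bias, samples, labels) <= max_errors:
            break
    return weights, bias


# Hypothetical linearly separable data: the AND function
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_sigmoid_neuron(samples, labels, max_errors=0)
predictions = [classify(weights, bias, s) for s in samples]
print(weights, bias, predictions)
```

Since the sigmoid is monotonic, comparing the output with sigmoid(0) is the same as comparing the weighted sum with 0, which is why this update behaves like the perceptron rule.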

This is shown below

I have implemented the learning algorithm of the Perceptron and Sigmoid Neuron in Octave. The code is available on GitHub at Perceptron.

1. Perceptron execution

I performed the tests on 2 different datasets

Data 1

Data 2

2. Sigmoid Perceptron execution
Data 1 & Data 2

It can be seen that the Perceptron does work for simple linearly separable data. I will be implementing other more advanced Neural Networks in the months to come.

Watch this space!

# Video presentation on Machine Learning, Data Science, NLP and Big Data – Part 1

Here is the 1st part of my video presentation on “Machine Learning, Data Science, NLP and Big Data – Part 1”

# The brave, new frontiers of computing

This article was published in Telecom Asia, 21 March 2014 – The brave new frontiers of computing

Von Neumann reference architecture and the sequential processing of Turing machines have been the basis for ‘classical’ computers for the last 6 decades. The juggernaut of technology has resulted in faster and denser processors being churned out inexorably by the semiconductor industry, substantiating Gordon Moore’s claim of transistor density in chips doubling every 18 months, now famously known as Moore’s law. These days we have processors with in excess of a billion transistors. We are now reaching the physical limit of the number of transistors on a chip. There is now an imminent need to look at alternative paradigms to crack the problems of the internet age confronting humanity, which cannot be solved by classical computing.

In the last decade or so 3 new, radical and lateral paradigms have surfaced which hold tremendous promise. They are

i) Deep learning ii) Quantum computing and iii) Genetic programming.

These techniques hold enormous potential and may offer solutions to problems which would take classical computers anywhere between a few years to a few decades to solve.

Deep Learning: Deep Learning is a new area of Machine Learning research. The objective of deep learning is to bring Machine Learning closer to one of its original goals namely Artificial Intelligence. Deep Learning is based on multi-level neural networks called deep neural networks. Deep Learning works on large sets of unclassified data and is able to learn lower level patterns on which it builds higher level representations much the same way the human brain works.

Deep learning tries to mimic the human brain. For example, the visual cortex shows a sequence of areas where signals flow from one level to the next. In the visual cortex the feature hierarchy represents input at a different level of abstraction, with more abstract features further up in the hierarchy, defined in terms of the lower-level ones. Deep Learning is based on the premise that humans organize ideas hierarchically and compose more abstract concepts from simpler ones.

Deep Learning algorithms generally require powerful processors and work on enormous amounts of data to learn key features. The characteristic of Deep Learning algorithms is that the input is passed through several non-linearities before generating its output.

About 3 years ago, researchers at Google Brain ran a deep learning algorithm on 10 million still images extracted from YouTube, on thousands of powerful processors. Google Brain was able to independently infer that these images contained a preponderance of cats. A seemingly trivial result, but of great significance as the algorithm inferred this result without any other input!

An interesting article in Nature, “The learning machines”, discusses how deep learning has proved useful for several scientific tasks including handwriting recognition, speech recognition, natural language processing, and in analyzing 3 dimensional images of brain slices etc.

The importance of Deep Learning has not been lost on the Tech titans like Google, Facebook, Microsoft and IBM which have all taken steps to stay ahead in this race.

Deep Learning is in its infancy and is still esoteric knowledge. Deep Learning is truly a fascinating area of research and may be the harbinger of the real breakthrough that Artificial Intelligence has been seeking for decades.

Genetic Programming (GP) is another radical approach to computing. It had its origins in the early 1950’s and has been gaining traction in the last decade. Genetic programming (GP) is a branch of AI, based on the Darwinian evolutionary principles of ‘natural selection’ and ‘survival of the fittest’. Essentially GP is a set of instructions and a fitness function to measure how well a computer program has performed a task. It is a specialization of genetic algorithms (GA) where each individual is a computer program.

Genetic Programming is a machine learning technique in which a population of computer programs is optimized according to ‘fitness criteria’ determined by a program’s ability to perform a given computational task. Fit programs survive and are moved along the evolutionary process. Fitness usually denotes the optimum value for a given objective function. In other words the fitness represents the ‘quality’ of a given solution over others. Individuals in a new population are created by the method of ‘reproduction’ and ‘cross over’.

In other words, the ‘most fit’ programs are crossbred and also possibly randomly mutated, creating a new generation of child programs. The unfit programs are discarded and the best are bred again.
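The select, cross over and mutate cycle can be illustrated with a small Python sketch. Note that this is a plain genetic algorithm over bit strings, not full genetic programming over computer programs, chosen only because it fits in a few lines. The fitness function (count of 1 bits), the population size, the mutation rate and all function names are hypothetical choices for the sketch.

```python
import random


def fitness(individual):
    """Hypothetical objective: the number of 1 bits (higher is fitter)."""
    return sum(individual)


def crossover(parent_a, parent_b, rng):
    """Single-point cross over: a prefix of one parent joined to the suffix of the other."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(individual, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in individual]


def evolve(population, generations, rng):
    """Return the final population and the best fitness seen in each generation."""
    history = [max(map(fitness, population))]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: len(ranked) // 2]  # the unfit half is discarded
        children = [
            mutate(crossover(*rng.sample(survivors, 2), rng), rng)
            for _ in range(len(population) - len(survivors))
        ]
        population = survivors + children  # survivors carry over unchanged
        history.append(max(map(fitness, population)))
    return population, history


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [[rng.randint(0, 1) for _ in range(20)] for _ in range(30)]
population, history = evolve(population, 40, rng)
print(history[0], history[-1])
```

Because the fittest half always survives unchanged, the best fitness in `history` can never decrease from one generation to the next.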

Once set up, the genetic program runs and evolves by itself and needs no further human input. Genetic Programming was pioneered by Stanford’s John Koza who was able to invent an antenna for NASA, identify proteins and invent electrical controllers.

The eerie part of GP is that the code is inscrutable. The program evolves and mutates into variations that cannot be easily reproduced. Clearly this is fodder for science fiction-like scenarios of self-aware, paranoid & psychopathic programs. Here is an interesting article that discusses this- This is What Happens When You Teach Machines the Power of Natural Selection

Quantum computing

Computers of today from hardy mainframes to smartphones operate on binary logic. The entire edifice of today’s computing is based on the binary states of the semiconductor which can be either in the state of ‘0’ or ‘1’. All computation can be reduced to arithmetic and logical operation on binary digits or more simply, binary arithmetic. Quantum computers deviate significantly from the binary arithmetic of classical computers. The unit in the quantum computer is the ‘qubit’ which can be in state ‘0’, in state ‘1’, or in both states at once through the principle of superposition.
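As a rough numerical illustration of superposition, a single qubit can be written as a pair of amplitudes, one for ‘0’ and one for ‘1’, and the probability of reading each value is the squared magnitude of its amplitude. The Python sketch below only does this bookkeeping; it does not simulate a quantum computer, and the function name is a hypothetical one chosen for the example.

```python
import math


def measurement_probabilities(amplitude_0, amplitude_1):
    """Return (P(0), P(1)) for a qubit with the given amplitudes for '0' and '1'."""
    p0 = abs(amplitude_0) ** 2
    p1 = abs(amplitude_1) ** 2
    if not math.isclose(p0 + p1, 1.0):
        raise ValueError("amplitudes must be normalized")
    return p0, p1


# A qubit that behaves like a classical bit set to 0
classical_zero = measurement_probabilities(1, 0)

# An equal superposition: reads '0' or '1' with the same probability
equal_superposition = measurement_probabilities(1 / math.sqrt(2), 1 / math.sqrt(2))

print(classical_zero, equal_superposition)
```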

To understand the power of quantum computing here is an excerpt from ArsTechnica “A tale of two qubits: How quantum computers work”:

“Bits, either classical or quantum, are the simplest possible units of information…. Measuring a bit, either classical or quantum, will result in one of two possible outcomes. At first glance, this makes it sound like there is no difference between bits and qubits. In fact, the difference is not in the possible answers, but in the possible questions. For normal bits, only a single measurement is permitted, meaning that only a single question can be asked: Is this bit a zero or a one? In contrast, a qubit is a system which can be asked many, many different questions, but to each question, only one of two answers can be given”

The article further goes on to state that “Classical computer memories are constrained to exist at any given time as a simple list of zeros and ones. In contrast, in a single quantum memory many such combinations can all exist simultaneously. During a quantum algorithm, this symphony of possibilities is split and merged, eventually coalescing around a single solution.”

Having more than 1 qubit results in an additional property called ‘quantum entanglement’. A pair of qubits cannot always be described by the states of the individual qubits alone. Those states which exhibit extra correlations are described as ‘entangled’ states. Hence in the case of 2 qubits ‘the whole is greater than the sum of its parts’. Entanglement and superposition are the cornerstones which give quantum computing its power. Here is a short and interesting animation of quantum computing

With classical computing techniques, searching an unsorted phonebook of 10,000 entries would require us to look up 5,000 entries on average, while a quantum search algorithm needs only about 100 guesses, roughly the square root of the number of entries. In other words it would take a quantum computer only about 5,000 guesses to search through a phonebook with 25 million names. That is the power of quantum computers!
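The phonebook figures follow from a square-root scaling, which is the behaviour of Grover's quantum search algorithm. The short Python check below reproduces the arithmetic; it ignores the constant factor (about π/4) in Grover's iteration count, as the text does, and the function names are made up for the example.

```python
import math


def classical_average_lookups(entries):
    """On average a linear scan of an unsorted list inspects half the entries."""
    return entries // 2


def quantum_guesses(entries):
    """Order-of-magnitude guess count for quantum search: the square root of N."""
    return math.isqrt(entries)


comparison = {
    entries: (classical_average_lookups(entries), quantum_guesses(entries))
    for entries in (10_000, 25_000_000)
}
print(comparison)
```

For 25 million names the classical average is 12.5 million lookups against roughly 5,000 quantum guesses, so the gap widens as the phonebook grows.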

Applications of quantum computers range from weather modeling and cryptography to solving problems that have been considered ‘intractable’ with classical computing methods. NASA is planning to use quantum computers in its search for exoplanets.

Deep Learning, Genetic Programming and Quantum Computing represent paradigmatic, lateral shifts in computing. They herald a new era in computing and will enable us to crack extremely complex problems in this Age of the Internet.

Classical computing will continue to play a role in our daily lives but for real world problems of the next decade & beyond it will be these 3 computing approaches that will hold the key to our future!

# Simplifying ML: Neural networks- Part 3

Neural networks try to overcome the shortcomings of logistic regression in which we have to choose a non-linear hypothesis. Logistic regression requires that we choose an appropriate combination of polynomial terms and the order of the equation. The problem with this is that we sometimes tend to either overfit or underfit. Neural networks have the ability to learn new model parameters from the basic raw parameters.

The neural network is modeled on the neural networking ability of the human brain. The brain is made of billions of neurons. Each neuron is a processing unit which has several inputs, the dendrites, and an output, the axon. The neurons communicate through a combination of electrochemical signals at the synapses, the spaces between the neurons.

A neural network mimics the working of the neuron.

So in a neural network the features of the problem serve as input. For example, in the case of determining whether a mail is spam or not, the features could be the words in the subject line, the from address, the contents etc. Based on a combination of these features we need to classify whether the mail is spam or not.

The above diagram shows a simple neural network with features $x_{1}, x_{2}, x_{3}$ and a bias unit $x_{0}$, with the hypothesis function

$h_{\Theta}(x) = \frac{1}{1 + e^{-\Theta^{T}x}}$

The edges from the features $x_{i}$ are the model parameters $\Theta$. In other words the edges represent weights.

A typical neural network is a network of many logistic units organized in layers. The output of each layer forms the input to the next subsequent layer. This is shown below

As can be seen, in a multi-layer neural network the features $x_{1}, x_{2}, \ldots, x_{n}$ enter at the left.

At the next layer these are combined into the activation units. The key advantage of neural networks over regular logistic regression is that the model parameters learned at one layer are input to the subsequent layers, which learn the model parameters more finely. Hence this gives a better fit for the combination of parameters.

The activation parameters at the next layer are

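$a_{1}^{(2)} = g(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3})$ where g is the logistic function or the sigmoid function discussed in my previous post Simplifying ML: Logistic regression – Part 2

Here $a_{1}^{(2)}$ is the first activation unit at layer 2, computed from the model parameters of layer 1

$\Theta_{10}^{(1)}$ is the model parameter at layer 1 and is the 0th parameter. Similarly $\Theta_{11}^{(1)}$ is the model parameter at layer 1 and is the 1st parameter and so on.

Similarly the other activation parameters can be written as

$a_{2}^{(2)} = g(\Theta_{20}^{(1)}x_{0} + \Theta_{21}^{(1)}x_{1} + \Theta_{22}^{(1)}x_{2} + \Theta_{23}^{(1)}x_{3})$

$a_{3}^{(2)} = g(\Theta_{30}^{(1)}x_{0} + \Theta_{31}^{(1)}x_{1} + \Theta_{32}^{(1)}x_{2} + \Theta_{33}^{(1)}x_{3})$

$h_{\Theta}(x) = a_{1}^{(3)} = g(\Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)})$ – (A)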
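The layer-by-layer computation that ends in equation (A) can be sketched in Python as below. This is only an illustration of forward propagation for a network with 3 inputs, 3 hidden activation units and 1 output; the weight values and the function names `layer` and `hypothesis` are hypothetical, and the bias units x0 and a0 are assumed to be fixed at 1.

```python
import math


def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, inputs):
    """Activations of one layer; each row of theta weights the bias unit and the inputs."""
    with_bias = [1.0] + list(inputs)  # prepend the bias unit (x0 or a0 = 1)
    return [g(sum(t * v for t, v in zip(row, with_bias))) for row in theta]


def hypothesis(theta1, theta2, x):
    """Forward propagation: raw features -> hidden activations -> output."""
    hidden = layer(theta1, x)       # the three activation units of layer 2
    output = layer(theta2, hidden)  # the single output unit, as in equation (A)
    return output[0]


# Hypothetical parameters: 3 hidden units x (bias + 3 features), 1 output x (bias + 3 activations)
theta1 = [
    [0.1, 0.2, -0.3, 0.4],
    [-0.5, 0.6, 0.7, -0.8],
    [0.9, -1.0, 1.1, 1.2],
]
theta2 = [[-1.0, 2.0, -3.0, 4.0]]
x = [1.0, 0.5, -1.5]

h = hypothesis(theta1, theta2, x)
print(h)
```

The output unit never sees the raw features directly; it only receives the hidden activations, which is the sense in which the network works with features of its own.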

The crux of neural networks is that instead of creating a hypothesis based on the set of raw features, the neural network with multiple hidden layers can learn its own features. In the equation (A) we can see that the hypothesis is not a function of the input raw features $x_{1}, x_{2}, \ldots, x_{n}$ but of a new set of features, the activation units $a_{1}, a_{2}, \ldots, a_{n}$. In other words the network has ‘learned’ its own features.

As mentioned above the output of each layer is the logistic function or the sigmoid function

The beauty of neural networks based on logistic functions is that we can easily realize the equivalent of logic gates like AND, OR, NOT, NOR etc.

The hypothesis for the above network would be

$h_{\Theta}(x) = g(-30 + 20x_{1} + 20x_{2})$

So for $x_{1} = 0$ and $x_{2} = 0$ we would have

$h_{\Theta}(x) = g(-30 + 0 + 0) = g(-30)$

Since $g(-30) < g(0) = 0.5$, the output is classified as 0
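Evaluating the same hypothesis for all four input combinations shows that these weights behave like an AND gate. The Python check below is a small sketch; rounding the sigmoid output to the nearest integer is an assumption used here to read it as a logic level.

```python
import math


def g(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def h(x1, x2):
    """Hypothesis with the weights from the text: -30, 20 and 20."""
    return g(-30 + 20 * x1 + 20 * x2)


# Round each output to 0 or 1 to read it as a logic level
truth_table = {(x1, x2): round(h(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
print(truth_table)
# {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```

Only the input (1, 1) pushes the weighted sum above zero (to +10), so it is the only case where the output is close to 1.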

Similarly a NOT gate can be constructed with a neural network as follows

Neural networks can also be used for multi class classification.

Hence there are multiple advantages to neural networks.

a) Neural networks are amenable to creating complex logic models from combinations of AND, NOT and OR gates

b) The model parameters are learned from the raw parameters and can be more flexible.

It appears that the interest in neural networks surged in the 1980s and then waned. The neural networks were similar to the above and were based on forward propagation. However it appears that in recent times backward propagation has been used successfully in areas of research known as ‘deep learning’

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. A highly enjoyable and classic course!!!