# My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

Note: The 3rd edition of this book is now available – see ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($9.99/Rs449) versions. This edition includes more content, extensive comments and formatting for better readability.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Paperback – $12.99)
2. Practical Machine Learning with R and Python: Third Edition – Machine Learning in Stereo (Kindle – $9.99/Rs449)

This book is ideal both for beginners and experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface
Introduction
1. Essential R
2. Essential Python for Datascience
3. R vs Python
4. Regression of a continuous variable
5. Classification and Cross Validation
6. Regression techniques and regularization
7. SVMs, Decision Trees and Validation curves
8. Splines, GAMs, Random Forests and Boosting
9. PCA, K-Means and Hierarchical Clustering
References

Hope you have a great time learning as I did while implementing these algorithms!

# Deep Learning from first principles in Python, R and Octave – Part 4

In this fourth post of my series on Deep Learning from first principles in Python, R and Octave, I explore the details of creating a multi-class classifier using the Softmax activation unit in a neural network. The earlier posts in this series were

1. Deep Learning from first principles in Python, R and Octave – Part 1. In this post I implemented logistic regression as a simple Neural Network in vectorized Python, R and Octave
2. Deep Learning from first principles in Python, R and Octave – Part 2. This 2nd part implemented the most elementary neural network, with 1 hidden layer and any number of activation units in the hidden layer, and with sigmoid activation at the output layer.
3. Deep Learning from first principles in Python, R and Octave – Part 3. The 3rd part implemented a multi-layer Deep Learning network with an arbitrary number of hidden layers and activation units per hidden layer. The output layer performed binary classification based on the sigmoid unit. This multi-layer deep network was implemented in vectorized Python, R and Octave.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

This 4th part takes a swing at multi-class classification and uses the Softmax as the activation unit in the output layer. Inclusion of the Softmax activation unit in the output layer requires us to compute the derivative of Softmax, or rather the “Jacobian” of the Softmax function, besides also computing the log loss for this Softmax activation during back propagation. Since the derivation of the Jacobian of a Softmax and the computation of the Cross Entropy/log loss is very involved, I have implemented a basic neural network with just 1 hidden layer and the Softmax activation at the output layer. I also perform multi-class classification on the ‘spiral’ data set from the CS231n Convolutional Neural Networks Stanford course, to test the performance and correctness of the implementations in Python, R and Octave. You can clone/download the code for the Python, R and Octave implementations from Github at Deep Learning – Part 4

Note: A detailed discussion of the derivation below can also be seen in my video presentation Neural Networks 5

The Softmax function takes an N-dimensional vector as input and generates an N-dimensional vector as output.
The Softmax function is given by
$S_{j}= \frac{e^{z_{j}}}{\sum_{k=1}^{N}e^{z_{k}}}$
The Softmax has a probabilistic interpretation: since each output value is divided by the sum over all values, the Softmax values always add up to 1.

As mentioned earlier, the Softmax takes a vector input and returns a vector of outputs. For example, the Softmax of the vector a=[1, 3, 6] is the vector S=[0.0063, 0.0471, 0.9464]. Notice that the output preserves the ordering of the inputs, with the largest input receiving the largest probability. Also, the derivative of a vector with respect to another vector is known as the Jacobian. By the way, The Matrix Calculus You Need For Deep Learning by Terence Parr and Jeremy Howard is a very good paper that distills all the main mathematical concepts for Deep Learning in one place.
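
The same numbers can be checked with a few lines of numpy. This is a minimal sketch (the softmax() helper below is mine and is not part of the implementation used later in this post):

import numpy as np
def softmax(z):
    # Subtract the max for numerical stability; this does not change the result
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

a = np.array([1.0, 3.0, 6.0])
print(softmax(a))        # approximately [0.0064 0.0471 0.9465]
print(softmax(a).sum())  # 1.0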

Let us take a simple 2-layered neural network with just 2 activation units in the hidden layer, as shown below

$Z_{1}^{1} =W_{11}^{1}x_{1} + W_{21}^{1}x_{2} + b_{1}^{1}$
$Z_{2}^{1} =W_{12}^{1}x_{1} + W_{22}^{1}x_{2} + b_{2}^{1}$
and
$A_{1}^{1} = g(Z_{1}^{1})$
$A_{2}^{1} = g(Z_{2}^{1})$
where g() is the activation function in the hidden layer, which can be a relu, sigmoid or a tanh function

Note: The superscript denotes the layer. The above denotes the equation for layer 1
of the neural network. For layer 2 with the Softmax activation, the equations are
$Z_{1}^{2} =W_{11}^{2}A_{1}^{1} + W_{21}^{2}A_{2}^{1} + b_{1}^{2}$
$Z_{2}^{2} =W_{12}^{2}A_{1}^{1} + W_{22}^{2}A_{2}^{1} + b_{2}^{2}$
and
$A_{1}^{2} = S(Z_{1}^{2})$
$A_{2}^{2} = S(Z_{2}^{2})$
where S() is the Softmax activation function
$S=\begin{pmatrix} S(Z_{1}^{2})\\ S(Z_{2}^{2}) \end{pmatrix}$
$S=\begin{pmatrix} \frac{e^{Z1}}{e^{Z1}+e^{Z2}}\\ \frac{e^{Z2}}{e^{Z1}+e^{Z2}} \end{pmatrix}$

The Jacobian of the softmax ‘S’ is given by
$\begin{pmatrix} \frac {\partial S_{1}}{\partial Z_{1}} & \frac {\partial S_{1}}{\partial Z_{2}}\\ \frac {\partial S_{2}}{\partial Z_{1}} & \frac {\partial S_{2}}{\partial Z_{2}} \end{pmatrix}$
$\begin{pmatrix} \frac{\partial}{\partial Z_{1}} \frac {e^{Z1}}{e^{Z1}+ e^{Z2}} & \frac{\partial}{\partial Z_{2}} \frac {e^{Z1}}{e^{Z1}+ e^{Z2}}\\ \frac{\partial}{\partial Z_{1}} \frac {e^{Z2}}{e^{Z1}+ e^{Z2}} & \frac{\partial}{\partial Z_{2}} \frac {e^{Z2}}{e^{Z1}+ e^{Z2}} \end{pmatrix}$     – (A)

Now the quotient rule of derivatives is as follows. If u and v are functions of x, then
$\frac{d}{dx} \frac {u}{v} =\frac {vdu -udv}{v^{2}}$
Using this to compute each element of the above Jacobian matrix, we see that
when i=j we have
$\frac {\partial}{\partial Z_{1}}\frac{e^{Z_{1}}}{e^{Z_{1}}+e^{Z_{2}}} = \frac {(e^{Z_{1}}+e^{Z_{2}})e^{Z_{1}} - e^{Z_{1}}e^{Z_{1}}}{(e^{Z_{1}}+e^{Z_{2}})^{2}}$
and when $i \neq j$
$\frac {\partial}{\partial Z_{1}}\frac{e^{Z_{2}}}{e^{Z_{1}}+e^{Z_{2}}} = \frac {0 - e^{Z_{1}}e^{Z_{2}}}{(e^{Z_{1}}+e^{Z_{2}})^{2}}$
This is of the general form
$\frac {\partial S_{j}}{\partial z_{i}} = S_{i}( 1-S_{j})$  when i=j
and
$\frac {\partial S_{j}}{\partial z_{i}} = -S_{i}S_{j}$  when $i \neq j$
Note: Since the Softmax essentially gives the probability the following
notation is also used
$\frac {\partial p_{j}}{\partial z_{i}} = p_{i}( 1-p_{j})$ when i=j
and
$\frac {\partial p_{j}}{\partial z_{i}} = -p_{i}p_{j}$ when $i \neq j$
If you throw the “Kronecker delta” into the equation, then the above equations can be expressed even more concisely as
$\frac {\partial p_{j}}{\partial z_{i}} = p_{i} (\delta_{ij} - p_{j})$
where $\delta_{ij} = 1$ when i=j and 0 when $i \neq j$

Substituting in equation (A), the Jacobian of the simple 2-output Softmax reduces to
$\begin{pmatrix} p_{1}(1-p_{1}) & -p_{1}p_{2} \\ -p_{2}p_{1} & p_{2}(1-p_{2}) \end{pmatrix}$
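
The formula $\frac {\partial p_{j}}{\partial z_{i}} = p_{i} (\delta_{ij} - p_{j})$ can also be sanity-checked numerically. The sketch below (my own helper functions, not the blog code) compares the analytic 2 x 2 Jacobian with central finite differences:

import numpy as np
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([0.5, -1.2])
p = softmax(z)
# Analytic Jacobian: J[i, j] = p_i * (delta_ij - p_j)
J_analytic = np.diag(p) - np.outer(p, p)
# Numerical Jacobian via central differences
eps = 1e-6
J_numeric = np.zeros((2, 2))
for j in range(2):
    dz = np.zeros(2); dz[j] = eps
    J_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
print(np.allclose(J_analytic, J_numeric, atol=1e-6))  # True
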
The log loss (cross entropy) of the Softmax is given by
$L = -\sum_{i} y_{i} \log(p_{i})$
For the 2 valued Softmax output this is
$\frac {dL}{dp1} = -\frac {y_{1}}{p_{1}}$
$\frac {dL}{dp2} = -\frac {y_{2}}{p_{2}}$
Using the chain rule we can write
$\frac {\partial L}{\partial w_{pq}} = \sum _{i}\frac {\partial L}{\partial p_{i}} \frac {\partial p_{i}}{\partial w_{pq}}$ (1)
and
$\frac {\partial p_{i}}{\partial w_{pq}} = \sum _{k}\frac {\partial p_{i}}{\partial z_{k}} \frac {\partial z_{k}}{\partial w_{pq}}$ (2)
In expanded form this is
$\frac {\partial L}{\partial w_{pq}} = \sum _{i}\frac {\partial L}{\partial p_{i}} \sum _{k}\frac {\partial p_{i}}{\partial z_{k}} \frac {\partial z_{k}}{\partial w_{pq}}$
Also
$\frac {\partial L}{\partial Z_{i}} =\sum _{j} \frac {\partial L}{\partial p_{j}} \frac {\partial p_{j}}{\partial Z_{i}}$
Therefore
$\frac {\partial L}{\partial Z_{1}} =\frac {\partial L}{\partial p_{1}} \frac {\partial p_{1}}{\partial Z_{1}} +\frac {\partial L}{\partial p_{2}} \frac {\partial p_{2}}{\partial Z_{1}}$
$\frac {\partial L}{\partial Z_{1}}=-\frac {y_{1}}{p_{1}} p_{1}(1-p_{1}) - \frac {y_{2}}{p_{2}}(-p_{2}p_{1})$
Since
$\frac {\partial p_{j}}{\partial z_{i}} = p_{i}( 1-p_{j})$ when i=j
and
$\frac {\partial p_{j}}{\partial z_{i}} = -p_{i}p_{j}$ when $i \neq j$
which simplifies to
$\frac {\partial L}{\partial Z_{1}} = -y_{1} + y_{1}p_{1} + y_{2}p_{1} = p_{1}(y_{1} + y_{2}) - y_{1}$
Since $\sum_{i} y_{i} =1$, this simplifies to
$\frac {\partial L}{\partial Z_{1}}= p_{1} - y_{1}$
Similarly
$\frac {\partial L}{\partial Z_{2}} =\frac {\partial L}{\partial p_{1}} \frac {\partial p_{1}}{\partial Z_{2}} +\frac {\partial L}{\partial p_{2}} \frac {\partial p_{2}}{\partial Z_{2}}$
$\frac {\partial L}{\partial Z_{2}}=-\frac {y_{1}}{p_{1}}(-p_{1}p_{2}) - \frac {y_{2}}{p_{2}} p_{2}(1-p_{2})$
$= y_{1}p_{2} + y_{2}p_{2} - y_{2} = p_{2}(y_{1} + y_{2}) - y_{2} = p_{2} - y_{2}$
In general this is of the form
$\frac {\partial L}{\partial z_{i}} = p_{i} -y_{i}$
For example, if the probabilities computed were p=[0.1, 0.7, 0.2], then this implies that the class with probability 0.7 is the likely class. The ‘One hot encoding’ for $y_{i}$ would be [0, 1, 0], and therefore the gradient $p_{i}-y_{i}$ = [0.1, -0.3, 0.2]
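
In numpy this gradient is just one subtraction. A small illustration (not part of the implementation that follows):

import numpy as np
p = np.array([0.1, 0.7, 0.2])   # softmax probabilities
y = np.array([0, 1, 0])         # one hot encoded true class
print(p - y)                    # [ 0.1 -0.3  0.2]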

Note: Further, we could extend this derivation for a Softmax activation output that outputs 3 classes
$S=\begin{pmatrix} \frac{e^{z1}}{e^{z1}+e^{z2}+e^{z3}}\\ \frac{e^{z2}}{e^{z1}+e^{z2}+e^{z3}} \\ \frac{e^{z3}}{e^{z1}+e^{z2}+e^{z3}} \end{pmatrix}$

We could derive
$\frac {\partial L}{\partial z1}= \frac {\partial L}{\partial p_{1}} \frac {\partial p_{1}}{\partial z_{1}} +\frac {\partial L}{\partial p_{2}} \frac {\partial p_{2}}{\partial z_{1}} +\frac {\partial L}{\partial p_{3}} \frac {\partial p_{3}}{\partial z_{1}}$ which similarly reduces to
$\frac {\partial L}{\partial Z_{1}}=-\frac {y_{1}}{p_{1}} p_{1}(1-p_{1}) - \frac {y_{2}}{p_{2}}(-p_{2}p_{1}) - \frac {y_{3}}{p_{3}}(-p_{3}p_{1})$
$= -y_{1}+ y_{1}p_{1} + y_{2}p_{1} + y_{3}p_{1} = p_{1}(y_{1} + y_{2} + y_{3}) - y_{1} = p_{1} - y_{1}$
Interestingly, despite the lengthy derivations the final result is simple and intuitive!
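
The 3-class result can also be verified numerically by differentiating the cross entropy loss directly. This is a sketch with my own helper functions, independent of the code used later in the post:

import numpy as np
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.2, -0.5, 1.3])
y = np.array([0.0, 0.0, 1.0])        # one hot label
grad_analytic = softmax(z) - y       # p - y
eps = 1e-6
grad_numeric = np.zeros(3)
for i in range(3):
    dz = np.zeros(3); dz[i] = eps
    grad_numeric[i] = (cross_entropy(z + dz, y) - cross_entropy(z - dz, y)) / (2 * eps)
print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True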

As seen in my post ‘Deep Learning from first principles with Python, R and Octave – Part 3’, the key equations for forward and backward propagation are

Forward propagation equations layer 1
$Z_{1} = W_{1}X +b_{1}$     and  $A_{1} = g(Z_{1})$
Forward propagation equations layer 2
$Z_{2} = W_{2}A_{1} +b_{2}$  and  $A_{2} = S(Z_{2})$

Using the result (A) in the back propagation equations below we have
Backward propagation equations layer 2
$\partial L/\partial W_{2} =\partial L/\partial Z_{2}*A_{1}=(p_{2}-y_{2})*A_{1}$
$\partial L/\partial b_{2} =\partial L/\partial Z_{2}=p_{2}-y_{2}$
$\partial L/\partial A_{1} = \partial L/\partial Z_{2} * W_{2}=(p_{2}-y_{2})*W_{2}$
Backward propagation equations layer 1
$\partial L/\partial W_{1} =\partial L/\partial Z_{1}*A_{0}$
$\partial L/\partial b_{1} =\partial L/\partial Z_{1}$
where $\partial L/\partial Z_{1} = \partial L/\partial A_{1}*g'(Z_{1})$ and g'() is the derivative of the relu activation in the hidden layer
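
The forward and backward propagation equations above can be condensed into a short numpy sketch. This is purely illustrative, with made-up shapes and random weights; the actual implementation in this post uses the layerActivationForward/layerActivationBackward functions from the accompanying Github code.

import numpy as np
def softmax(Z):
    e = np.exp(Z - Z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

m, nx, nh, nc = 5, 2, 4, 3                    # samples, features, hidden units, classes
X = np.random.randn(nx, m)
y = np.random.randint(0, nc, m)
Y = np.eye(nc)[y].T                           # one hot labels, shape (nc, m)
W1, b1 = np.random.randn(nh, nx) * 0.01, np.zeros((nh, 1))
W2, b2 = np.random.randn(nc, nh) * 0.01, np.zeros((nc, 1))

# Forward propagation: Z1 = W1 X + b1, A1 = relu(Z1); Z2 = W2 A1 + b2, A2 = softmax(Z2)
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
Z2 = W2 @ A1 + b2
A2 = softmax(Z2)
loss = -np.sum(Y * np.log(A2)) / m

# Backward propagation: dZ2 = A2 - Y, which is the Softmax result derived above
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = dZ2.sum(axis=1, keepdims=True) / m
dA1 = W2.T @ dZ2
dZ1 = dA1 * (Z1 > 0)                          # relu derivative
dW1 = dZ1 @ X.T / m
db1 = dZ1.sum(axis=1, keepdims=True) / m
print(loss, dW1.shape, dW2.shape)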

#### 2.0 Spiral data set

As I mentioned earlier, I will be using the ‘spiral’ data from CS231n Convolutional Neural Networks to ensure that my vectorized implementations in Python, R and Octave are correct. Here is the ‘spiral’ data set.

import numpy as np
import matplotlib.pyplot as plt
import os
os.chdir("C:/junk/dl-4/dl-4")

# Create an input data set - Taken from CS231n Convolutional Neural networks
# http://cs231n.github.io/neural-networks-case-study/
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0,1,N) # radius
    t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j
# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.savefig("fig1.png", bbox_inches='tight')

The implementations of the vectorized Python, R and Octave code are shown diagrammatically below

#### 2.1 Multi-class classification with Softmax – Python code

A simple 2-layer Neural Network with a single hidden layer, having 100 Relu activation units in the hidden layer and the Softmax activation unit in the output layer, is used for multi-class classification. This Deep Learning Network plots the non-linear boundary of the 3 classes as shown below

import numpy as np
import matplotlib.pyplot as plt
import os
os.chdir("C:/junk/dl-4/dl-4")

N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0,1,N) # radius
    t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j

# Set the number of features, hidden units in hidden layer and number of classes
numHidden=100 # No of hidden units in hidden layer
numFeats= 2 # dimensionality
numOutput = 3 # number of classes

# Initialize the model
parameters=initializeModel(numFeats,numHidden,numOutput)
W1= parameters['W1']
b1= parameters['b1']
W2= parameters['W2']
b2= parameters['b2']

# Set the learning rate
learningRate=0.6

# Initialize losses
losses=[]
for i in range(10000):
    # Forward propagation through hidden layer with Relu units
    A1,cache1= layerActivationForward(X.T,W1,b1,'relu')

    # Forward propagation through output layer with Softmax
    A2,cache2 = layerActivationForward(A1,W2,b2,'softmax')

    # No of training examples
    numTraining = X.shape[0]
    # Compute log probs. Take the log prob of correct class based on output y
    correct_logprobs = -np.log(A2[range(numTraining),y])
    # Compute loss
    loss = np.sum(correct_logprobs)/numTraining

    # Print and store the loss every 1000 iterations
    if i % 1000 == 0:
        print("iteration %d: loss %f" % (i, loss))
        losses.append(loss)

    dA=0

    # Backward propagation through output layer with Softmax
    dA1,dW2,db2 = layerActivationBackward(dA, cache2, y, activationFunc='softmax')
    # Backward propagation through hidden layer with Relu unit
    dA0,dW1,db1 = layerActivationBackward(dA1.T, cache1, y, activationFunc='relu')

    # Update parameters with the learning rate
    W1 += -learningRate * dW1
    b1 += -learningRate * db1
    W2 += -learningRate * dW2.T
    b2 += -learningRate * db2.T

#Plot losses vs iterations
i=np.arange(0,10000,1000)
plt.plot(i,losses)

plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.title('Losses vs Iterations')
plt.savefig("fig2.png", bbox="tight")

#Compute the multi-class Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# We need to determine the predicted values from the learnt data
# Forward propagation through hidden layer with Relu units
A1,cache1= layerActivationForward(X.T,W1,b1,'relu')

# Forward propagation through output layer with Softmax
A2,cache2 = layerActivationForward(A1,W2,b2,'softmax')
#Compute predicted values from weights and biases
yhat=np.argmax(A2, axis=1)

a=confusion_matrix(y.T,yhat.T)
print("Multi-class Confusion Matrix")
print(a)
## iteration 0: loss 1.098507
## iteration 1000: loss 0.214611
## iteration 2000: loss 0.043622
## iteration 3000: loss 0.032525
## iteration 4000: loss 0.025108
## iteration 5000: loss 0.021365
## iteration 6000: loss 0.019046
## iteration 7000: loss 0.017475
## iteration 8000: loss 0.016359
## iteration 9000: loss 0.015703
## Multi-class Confusion Matrix
## [[ 99   1   0]
##  [  0 100   0]
##  [  0   1  99]]

Check out my compact and minimal book “Practical Machine Learning with R and Python: Second edition – Machine Learning in stereo” available on Amazon in paperback ($10.99) and kindle ($7.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems and how to select the best model. Pick your copy today!!

#### 2.2 Multi-class classification with Softmax – R code

The spiral data set created with Python was saved, and is used as the input for the R code. The R Neural Network seems to perform much, much slower than both Python and Octave. Not sure why! Incidentally the computation of loss and the softmax derivative are identical for both R and Octave, yet R is much slower. To compute the softmax derivative I create matrices for the One Hot Encoded yi and then stack them before subtracting pi-yi. I am sure there is a more elegant and more efficient way to do this, much like Python. Any suggestions?
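
For comparison, the fully vectorized way of forming pi-yi in numpy, without explicitly stacking one-hot matrices, is to subtract 1 from the probability of the correct class for every sample. This is only a sketch of the idea (the same matrix-indexing trick could probably be carried over to R):

import numpy as np
A2 = np.array([[0.1, 0.7, 0.2],      # softmax output, one row per sample
               [0.8, 0.1, 0.1]])
y = np.array([1, 0])                 # true class labels
numTraining = A2.shape[0]
dZ = A2.copy()
dZ[range(numTraining), y] -= 1       # p_i - y_i, no one hot matrix needed
print(dZ)                            # [[ 0.1 -0.3  0.2] [-0.2  0.1  0.1]]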

library(ggplot2)
library(dplyr)
library(RColorBrewer)
source("DLfunctions41.R")
Z1=data.frame(Z)
#Plot the dataset
ggplot(Z1,aes(x=V1,y=V2,col=V3)) +geom_point() +
scale_colour_gradientn(colours = brewer.pal(10, "Spectral"))

# Setup the data
X <- Z[,1:2]
y <- Z[,3]
X1 <- t(X)
Y1 <- t(y)

# Initialize number of features, number of hidden units in hidden layer and
# number of classes
numFeats<-2 # No features
numHidden<-100 # No of hidden units
numOutput<-3 # No of classes

# Initialize model
parameters <-initializeModel(numFeats, numHidden,numOutput)

W1 <-parameters[['W1']]
b1 <-parameters[['b1']]
W2 <-parameters[['W2']]
b2 <-parameters[['b2']]

# Set the learning rate
learningRate <- 0.5
# Initialize losses
losses <- NULL
for(i in 0:9000){

    # Forward propagation through hidden layer with Relu units
    retvals <- layerActivationForward(X1,W1,b1,'relu')
    A1 <- retvals[['A']]
    cache1 <- retvals[['cache']]
    forward_cache1 <- cache1[['forward_cache1']]
    activation_cache <- cache1[['activation_cache']]

    # Forward propagation through output layer with Softmax units
    retvals <- layerActivationForward(A1,W2,b2,'softmax')
    A2 <- retvals[['A']]
    cache2 <- retvals[['cache']]
    forward_cache2 <- cache2[['forward_cache1']]
    activation_cache2 <- cache2[['activation_cache']]

    # No of training examples
    numTraining <- dim(X)[1]
    dA <- 0

    # Select the elements where the y values are 0, 1 or 2 and make a vector
    a <- c(A2[y==0,1],A2[y==1,2],A2[y==2,3])
    # Take log
    correct_probs <- -log(a)
    # Compute loss
    loss <- sum(correct_probs)/numTraining

    if(i %% 1000 == 0){
        sprintf("iteration %d: loss %f",i, loss)
        print(loss)
        # Store the loss for plotting later
        losses <- c(losses,loss)
    }
    # Backward propagation through output layer with Softmax units
    retvals <- layerActivationBackward(dA, cache2, y, activationFunc='softmax')
    dA1 <- retvals[['dA_prev']]
    dW2 <- retvals[['dW']]
    db2 <- retvals[['db']]
    # Backward propagation through hidden layer with Relu units
    retvals <- layerActivationBackward(t(dA1), cache1, y, activationFunc='relu')
    dA0 <- retvals[['dA_prev']]
    dW1 <- retvals[['dW']]
    db1 <- retvals[['db']]

    # Update parameters
    W1 <- W1 - learningRate * dW1
    b1 <- b1 - learningRate * db1
    W2 <- W2 - learningRate * t(dW2)
    b2 <- b2 - learningRate * t(db2)
}
## [1] 1.212487
## [1] 0.5740867
## [1] 0.4048824
## [1] 0.3561941
## [1] 0.2509576
## [1] 0.7351063
## [1] 0.2066114
## [1] 0.2065875
## [1] 0.2151943
## [1] 0.1318807

# Create the iterations at which the loss was recorded
iterations <- seq(0,9000,1000)
df <- data.frame(iterations,losses)
ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(color="blue") +
    ggtitle("Losses vs iterations") + xlab("Iterations") + ylab("Loss")

plotDecisionBoundary(Z,W1,b1,W2,b2)

Multi-class Confusion Matrix

library(caret)
library(e1071)

# Forward propagation through hidden layer with Relu units
retvals <- layerActivationForward(X1,W1,b1,'relu')
A1 <- retvals[['A']]

# Forward propagation through output layer with Softmax units
retvals = layerActivationForward(A1,W2,b2,'softmax')
A2 <- retvals[['A']]
yhat <- apply(A2, 1, which.max) - 1
# Compute the multi-class confusion matrix with caret (the call producing the output below is assumed)
confusionMatrix(factor(yhat), factor(y))

Confusion Matrix and Statistics
Reference
Prediction  0  1  2
0 97  0  1
1  2 96  4
2  1  4 95

Overall Statistics
Accuracy : 0.96
95% CI : (0.9312, 0.9792)
No Information Rate : 0.3333
P-Value [Acc > NIR] : <2e-16

Kappa : 0.94
Mcnemar's Test P-Value : 0.5724
Statistics by Class:

Class: 0 Class: 1 Class: 2
Sensitivity            0.9700   0.9600   0.9500
Specificity            0.9950   0.9700   0.9750
Pos Pred Value         0.9898   0.9412   0.9500
Neg Pred Value         0.9851   0.9798   0.9750
Prevalence             0.3333   0.3333   0.3333
Detection Rate         0.3233   0.3200   0.3167
Detection Prevalence   0.3267   0.3400   0.3333
Balanced Accuracy      0.9825   0.9650   0.9625


My book “Practical Machine Learning with R and Python” includes the implementation for many Machine Learning algorithms and associated metrics. Pick up your copy today!

#### 2.3 Multi-class classification with Softmax – Octave code

A 2 layer Neural network with the Softmax activation unit in the output layer is constructed in Octave. The same spiral data set is used for Octave also
 source("DL41functions.m") # Read the spiral data data=csvread("spiral.csv"); # Setup the data X=data(:,1:2); Y=data(:,3); # Set the number of features, number of hidden units in hidden layer and number of classes numFeats=2; #No features numHidden=100; # No of hidden units numOutput=3; # No of classes # Initialize model [W1 b1 W2 b2] = initializeModel(numFeats,numHidden,numOutput); # Initialize losses losses=[] #Initialize learningRate learningRate=0.5; for k =1:10000 # Forward propagation through hidden layer with Relu units [A1,cache1 activation_cache1]= layerActivationForward(X',W1,b1,activationFunc ='relu'); # Forward propagation through output layer with Softmax units [A2,cache2 activation_cache2] = layerActivationForward(A1,W2,b2,activationFunc='softmax'); # No of training examples numTraining = size(X)(1); # Select rows where Y=0,1,and 2 and concatenate to a long vector a=[A2(Y==0,1) ;A2(Y==1,2) ;A2(Y==2,3)]; #Select the correct column for log prob correct_probs = -log(a); #Compute log loss loss= sum(correct_probs)/numTraining; if(mod(k,1000) == 0) disp(loss); losses=[losses loss]; endif dA=0; # Backward propagation through output layer with Softmax units [dA1 dW2 db2] = layerActivationBackward(dA, cache2, activation_cache2,Y,activationFunc='softmax'); # Backward propagation through hidden layer with Relu units [dA0,dW1,db1] = layerActivationBackward(dA1', cache1, activation_cache1, Y, activationFunc='relu'); #Update parameters W1 += -learningRate * dW1; b1 += -learningRate * db1; W2 += -learningRate * dW2'; b2 += -learningRate * db2'; endfor # Plot Losses vs Iterations iterations=0:1000:9000 plotCostVsIterations(iterations,losses) # Plot the decision boundary plotDecisionBoundary( X,Y,W1,b1,W2,b2)

The code for the Python, R and Octave implementations can be downloaded from Github at Deep Learning – Part 4

#### Conclusion

In this post I have implemented a 2 layer Neural Network with the Softmax classifier. In Part 3, I implemented a multi-layer Deep Learning Network. I intend to include the Softmax activation unit into the generalized multi-layer Deep Network along with the other activation units of sigmoid, tanh and relu.

Stick around, I’ll be back!!
Watch this space!

To see all post click Index of posts

# Deep Learning from first principles in Python, R and Octave – Part 1

“You don’t perceive objects as they are. You perceive them as you are.”
“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”

                          David Eagleman - The Brain: The Story of You

This is the first in a series of posts I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton, also on Coursera. In this post I implement Logistic regression with a 2 layer Neural Network, i.e. a Neural Network that just has an input layer and an output layer, with no hidden layer. I am certain that any self-respecting Deep Learning/Neural Network practitioner would consider a Neural Network without hidden layers as no Neural Network at all!

This 2 layer network is implemented in Python, R and Octave. I have included Octave in the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent vectorized implementations, so if you are familiar with any one of the languages, you should be able to look at the corresponding code in the other two. You can download this R Markdown file and the Octave code from DeepLearning -Part 1

Check out my video presentation which discusses the derivations in detail
1. Elements of Neural Networks and Deep Learning – Part 1
2. Elements of Neural Networks and Deep Learning – Part 2

To start with, Logistic Regression is performed using sklearn’s logistic regression package for the cancer data set also from sklearn. This is shown below

## 1. Logistic Regression

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs
from sklearn.datasets import load_breast_cancer

from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
# Call the Logistic Regression function
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96

To check on other classification algorithms, check my post Practical Machine Learning with R and Python – Part 2.

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($14.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($10.99) and Kindle($7.99/Rs449) versions. This book is ideal for a quick reference of the various ML functions and associated measurements in both R and Python which are essential to delve deep into Deep Learning.

## 2. Logistic Regression as a 2 layer Neural Network

In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below

The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.

This simple 2 layer Neural Network is shown below
At the input layer there are 30 features, and the corresponding weights of these inputs are initialized to small random values.
$Z= w_{1}x_{1} +w_{2}x_{2} +..+ w_{30}x_{30} + b$
where ‘b’ is the bias term

The Activation function is the sigmoid function which is $a= 1/(1+e^{-z})$
The Loss, when the sigmoid function is used in the output layer, is given by
$L=-(y\log(a) + (1-y)\log(1-a))$ (1)
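
Equation (1) can be written in a couple of lines of Python (an illustration only): the loss is small when the predicted probability ‘a’ agrees with the label y, and large when it does not.

import numpy as np
def bce_loss(y, a):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

print(bce_loss(1, 0.9))   # ~0.105  confident and correct -> small loss
print(bce_loss(1, 0.1))   # ~2.303  confident and wrong   -> large loss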

### Forward propagation

In the forward propagation cycle of the Neural Network, the output Z and the output of the activation function (the sigmoid function) are first computed. Then, using the known output ‘y’ for the given features, the ‘Loss’ is computed using equation (1) above.

### Backward propagation

The backward propagation cycle determines how the ‘Loss’ is affected by small variations in the weights, working backwards from the output layer up to the input layer. In other words, backward propagation computes the changes in the weights at the input layer which will minimize the loss. Several cycles of gradient descent are performed along the path of steepest descent to find the local minimum; that is, the set of weights and biases at the input layer which results in the lowest loss is computed by gradient descent. The weights at the input layer are adjusted in steps scaled by a parameter known as the ‘learning rate’. Too big a ‘learning rate’ can overshoot the local minimum, and too small a ‘learning rate’ can take a long time to reach the local minimum. This is done for ‘m’ training examples.
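
The effect of the learning rate can be seen on a toy one-dimensional problem, minimizing f(w) = (w - 3)^2 by gradient descent. This sketch is unrelated to the cancer data set used below; it only illustrates the overshoot/slow-convergence behaviour described above.

def gradient_descent(lr, steps=25, w=0.0):
    for _ in range(steps):
        grad = 2 * (w - 3)       # df/dw
        w = w - lr * grad
    return w

print(gradient_descent(lr=0.1))    # ~2.99   converges close to the minimum at 3
print(gradient_descent(lr=0.01))   # ~1.19   too small: still far from 3 after 25 steps
print(gradient_descent(lr=1.1))    # huge value: too large, overshoots and diverges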

Chain rule of differentiation
Let y=f(u)
and u=g(x) then
$\partial y/\partial x = \partial y/\partial u * \partial u/\partial x$

Derivative of sigmoid
$\sigma=1/(1+e^{-z})$
Let $x= 1 + e^{-z}$  then
$\sigma = 1/x$
$\partial \sigma/\partial x = -1/x^{2}$
$\partial x/\partial z = -e^{-z}$
Using the chain rule of differentiation we get
$\partial \sigma/\partial z = \partial \sigma/\partial x * \partial x/\partial z$
$=-1/(1+e^{-z})^{2}* -e^{-z} = e^{-z}/(1+e^{-z})^{2}$
Therefore $\partial \sigma/\partial z = \sigma(1-\sigma)$        -(2)
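
A quick numerical check of equation (2), comparing a finite-difference estimate of the derivative with sigma*(1-sigma) (illustrative only):

import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)    # both ~0.2217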

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are
$L=-(y\log(a) + (1-y)\log(1-a))$      -(a)
$a=1/(1+e^{-Z})$      -(b)
$Z= w_{1}x_{1} +w_{2}x_{2} +...+ w_{30}x_{30} +b = \sum_{i} w_{i}x_{i} + b$ -(c)

The back propagation step requires the computation of $dL/dw_{i}$ and $dL/db$. In the case of regression it would be $dE/dw_{i}$ and $dE/db$, where E is the Mean Squared Error function.
Computing the derivatives for back propagation we have
$dL/da = -(y/a - (1-y)/(1-a))$          -(d)
because $d/dx(logx) = 1/x$
Also from equation (2) we get
$da/dZ = a (1-a)$                                  – (e)
By chain rule
$\partial L/\partial Z = \partial L/\partial a * \partial a/\partial Z$
therefore substituting the results of (d) & (e) we get
$\partial L/\partial Z = -(y/a - (1-y)/(1-a)) * a(1-a) = a-y$         -(f)
Finally
$\partial L/\partial w_{i}= \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial w_{i}$                                                           -(g)
$\partial Z/\partial w_{i} = x_{i}$            – (h)
and from (f) we have  $\partial L/\partial Z =a-y$
Therefore  (g) reduces to
$\partial L/\partial w_{i} = x_{i}* (a-y)$ -(i)
Also
$\partial L/\partial b = \partial L/\partial a * \partial a/\partial Z * \partial Z/\partial b$ -(j)
Since
$\partial Z/\partial b = 1$ and using (f) in (j)
$\partial L/\partial b = a-y$

Gradient descent updates the weights at the input layer and the corresponding bias by using the values
of $dw_{i}$ and $db$
$w_{i} := w_{i} -\alpha * dw_{i}$
$b := b -\alpha * db$
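
Equations (i) and (j) together with the update rule can be written out for a single training example as below. This is only an illustration; the vectorized version over all m examples follows in the Python, R and Octave implementations later in this post.

import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])      # one example with 3 features
y = 1.0                              # its label
w = np.zeros(3)
b = 0.0
alpha = 0.1                          # learning rate

a = sigmoid(np.dot(w, x) + b)        # forward pass
dw = x * (a - y)                     # dL/dw_i = x_i * (a - y)
db = a - y                           # dL/db   = a - y
w = w - alpha * dw                   # gradient descent update
b = b - alpha * db
print(w, b)
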
I found the computation graph representation in the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville very useful to visualize and also to compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

### 3. Neural Network for Logistic Regression -Python code (vectorized)

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Define the sigmoid function
def sigmoid(z):
    a=1/(1+np.exp(-z))
    return a

# Initialize the weights and bias
def initialize(dim):
    w = np.zeros(dim).reshape(dim,1)
    b = 0
    return w, b

# Compute the loss
def computeLoss(numTraining,Y,A):
    loss=-1/numTraining *np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
    return(loss)

# Execute the forward propagation
def forwardPropagation(w,b,X,Y):
    # Compute Z
    Z=np.dot(w.T,X)+b
    # Determine the number of training samples
    numTraining=float(len(X))
    # Compute the output of the sigmoid activation function
    A=sigmoid(Z)
    # Compute the loss
    loss = computeLoss(numTraining,Y,A)
    # Compute the gradients dZ, dw and db
    dZ=A-Y
    dw=1/numTraining*np.dot(X,dZ.T)
    db=1/numTraining*np.sum(dZ)

    # Return the gradients and the loss as a dictionary
    gradients = {"dw": dw,
                 "db": db}
    loss = np.squeeze(loss)
    return gradients, loss

def gradientDescent(w, b, X, Y, numIerations, learningRate):
    losses=[]
    idx =[]
    # Iterate
    for i in range(numIerations):
        # Get the derivatives and loss from forward propagation
        gradients, loss = forwardPropagation(w,b,X,Y)
        dw = gradients["dw"]
        db = gradients["db"]
        # Gradient descent update
        w = w-learningRate*dw
        b = b-learningRate*db

        # Store the loss every 100 iterations
        if i % 100 == 0:
            idx.append(i)
            losses.append(loss)
    # Return the parameters, gradients, losses and indices
    params = {"w": w,
              "b": b}
    grads = {"dw": dw,
             "db": db}
    return params, grads, losses, idx

# Predict the output for a training set
def predict(w,b,X):
    size=X.shape[1]
    yPredicted=np.zeros((1,size))
    Z=np.dot(w.T,X)+b
    # Compute the sigmoid
    A=sigmoid(Z)
    for i in range(A.shape[1]):
        # If the value is > 0.5 then set as 1
        if(A[0][i] > 0.5):
            yPredicted[0][i]=1
        else:
            # Else set as 0
            yPredicted[0][i]=0

    return yPredicted

# Normalize the data
def normalize(x):
    x_norm = None
    x_norm = np.linalg.norm(x,axis=1,keepdims=True)
    x= x/x_norm
    return x

# Run the 2 layer Neural Network on the cancer data set

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
# Normalize the data for better performance
X_train1=normalize(X_train)

# Create weight vectors of zeros. The size is the number of features in the data set=30
w=np.zeros((X_train.shape[1],1))
#w=np.zeros((30,1))
b=0

#Normalize the training data so that gradient descent performs better
X_train1=normalize(X_train)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_train2=X_train1.T

# Reshape to remove the rank 1 array and then transpose
y_train1=y_train.reshape(len(y_train),1)
y_train2=y_train1.T

# Run gradient descent for 4000 iterations and compute the weights
# (the learning rate value below is assumed)
parameters, grads, costs, idx = gradientDescent(w, b, X_train2, y_train2, numIerations=4000, learningRate=0.75)
w = parameters["w"]
b = parameters["b"]

# Normalize X_test
X_test1=normalize(X_test)
#Transpose X_train so that we have a matrix as (features, numSamples)
X_test2=X_test1.T

#Reshape y_test
y_test1=y_test.reshape(len(y_test),1)
y_test2=y_test1.T

# Predict the values for the test and training sets
yPredictionTest = predict(w, b, X_test2)
yPredictionTrain = predict(w, b, X_train2)

# Print the accuracy
print("train accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTrain - y_train2)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTest - y_test)) * 100))

# Plot the Costs vs the number of iterations
fig1=plt.plot(idx,costs)
fig1=plt.title("Gradient descent-Cost vs No of iterations")
fig1=plt.xlabel("No of iterations")
fig1=plt.ylabel("Cost")
fig1.figure.savefig("fig1", bbox_inches='tight')
## train accuracy: 90.3755868545 %
## test accuracy: 89.5104895105 %

Note: It can be seen that the accuracy on the training and test set is 90.37% and 89.51% respectively. This is poorer than the 96% which the logistic regression of sklearn achieves! But this is mainly because of the absence of hidden layers, which are the real power of neural networks.

### 4. Neural Network for Logistic Regression -R code (vectorized)

source("RFunctions-1.R")
# Define the sigmoid function
sigmoid <- function(z){
    a <- 1/(1+ exp(-z))
    a
}

# Compute the loss
computeLoss <- function(numTraining,Y,A){
    loss <- -1/numTraining* sum(Y*log(A) + (1-Y)*log(1-A))
    return(loss)
}

# Compute forward propagation
forwardPropagation <- function(w,b,X,Y){
    # Compute Z
    Z <- t(w) %*% X +b
    # Set the number of samples
    numTraining <- ncol(X)
    # Compute the activation function
    A=sigmoid(Z)

    # Compute the loss
    loss <- computeLoss(numTraining,Y,A)

    # Compute the gradients dZ, dw and db
    dZ <- A-Y
    dw <- 1/numTraining * X %*% t(dZ)
    db <- 1/numTraining*sum(dZ)

    fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
    return(fwdProp)
}

# Perform one cycle of Gradient descent
gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
    losses <- NULL
    idx <- NULL
    # Loop through the number of iterations
    for(i in 1:numIerations){
        fwdProp <- forwardPropagation(w,b,X,Y)
        # Get the derivatives
        dw <- fwdProp$dw
        db <- fwdProp$db
        w = w-learningRate*dw
        b = b-learningRate*db
        l <- fwdProp$loss
        # Store the loss every 100 iterations
        if(i %% 100 == 0){
            idx <- c(idx,i)
            losses <- c(losses,l)
        }
    }
    # Return the weights and losses
    gradDescnt <- list("w"=w,"b"=b,"dw"=dw,"db"=db,"losses"=losses,"idx"=idx)
    return(gradDescnt)
}

# Compute the predicted value for input
predict <- function(w,b,X){
    m=dim(X)[2]
    # Create a vector of 0's
    yPredicted=matrix(rep(0,m),nrow=1,ncol=m)
    Z <- t(w) %*% X +b
    # Compute sigmoid
    A=sigmoid(Z)
    for(i in 1:dim(A)[2]){
        # If A > 0.5 set value as 1
        if(A[1,i] > 0.5)
            yPredicted[1,i]=1
        else
            # Else set as 0
            yPredicted[1,i]=0
    }
    return(yPredicted)
}

# Normalize the matrix
normalize <- function(x){
    # Create the norm of the matrix. Perform the Frobenius norm of the matrix
    n <- as.matrix(sqrt(rowSums(x^2)))
    # Sweep by rows by norm. Note '1' in the function which performs on every row
    normalized <- sweep(x, 1, n, FUN="/")
    return(normalized)
}

# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1,30),"output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Set the features
X_train <- train[,1:30]
y_train <- train[,31]
X_test <- test[,1:30]
y_test <- test[,31]
# Create a matrix of 0's with the number of features
w <- matrix(rep(0,dim(X_train)[2]))
b <- 0
X_train1 <- normalize(X_train)
X_train2 = t(X_train1)

# Reshape then transpose
y_train1 = as.matrix(y_train)
y_train2 = t(y_train1)

# Perform gradient descent
gradDescent = gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)

# Normalize X_test
X_test1 = normalize(X_test)
# Transpose X_test so that we have a matrix as (features, numSamples)
X_test2 = t(X_test1)

# Reshape y_test and take transpose
y_test1 = as.matrix(y_test)
y_test2 = t(y_test1)

# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain = predict(gradDescent$w, gradDescent$b, X_train2)

sprintf("Train accuracy: %f",(100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f",(100 - mean(abs(yPredictionTest - y_test)) * 100))
## [1] "test accuracy: 87.323944"

df <- data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations","losses")
ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(col="blue") +
    ggtitle("Gradient Descent - Losses vs No of Iterations") +
    xlab("No of iterations") + ylab("Losses")

### 5. Neural Network for Logistic Regression – Octave code (vectorized)

1;
# Define sigmoid function
function a = sigmoid(z)
    a = 1 ./ (1+ exp(-z));
end

# Compute the loss
function loss = computeLoss(numtraining,Y,A)
    loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end

# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
    % Compute Z
    Z = w' * X + b;
    numtraining = size(X)(1,2);
    # Compute sigmoid
    A = sigmoid(Z);
    # Compute loss.
    # Note this is an element wise product
    loss = computeLoss(numtraining,Y,A);
    # Compute the gradients dZ, dw and db
    dZ = A-Y;
    dw = 1/numtraining* X * dZ';
    db = 1/numtraining*sum(dZ);
end

# Compute Gradient Descent
function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
    # Initialize losses and idx
    losses=[];
    index=[];
    # Loop through the number of iterations
    for i=1:numIerations,
        [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
        # Perform Gradient descent
        w = w - learningRate*dw;
        b = b - learningRate*db;
        if(mod(i,100) ==0)
            # Append index and loss
            index = [index i];
            losses = [losses loss];
        endif
    end
end

# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
    m = size(X)(1,2);
    yPredicted=zeros(1,m);
    # Compute Z
    Z = w' * X + b;
    # Compute sigmoid
    A = sigmoid(Z);
    for i=1:size(X)(1,2),
        # Set predicted as 1 if A > 0.5
        if(A(1,i) >= 0.5)
            yPredicted(1,i)=1;
        else
            yPredicted(1,i)=0;
        endif
    end
end

# Normalize by dividing each value by the sum of squares
function normalized = normalize(x)
    # Compute Frobenius norm. Square the elements, sum rows and then find square root
    a = sqrt(sum(x .^ 2,2));
    # Perform element wise division
    normalized = x ./ a;
end

# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
    # Create a random index
    ix = randperm(length(dataset));
    # Split into training
    trainSize = floor(trainPercent/100 * length(dataset));
    train=dataset(ix(1:trainSize),:);
    # And test
    test=dataset(ix(trainSize+1:length(dataset)),:);
    X_train = train(:,1:30);
    y_train = train(:,31);
    X_test = test(:,1:30);
    y_test = test(:,31);
end

cancer=csvread("cancer.csv");
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
# Normalize X_test
X_test1=normalize(X_test);
# Transpose X_test so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);

trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100
testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100
trainAccuracy = 90.845
testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent- Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");

### Conclusion

This post starts with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is poor compared to the highly optimized Logistic Regression of sklearn. This is because the above neural network did not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers.

The Deep Learning journey has begun… Don’t miss the bus!
Stay tuned for more interesting posts in Deep Learning!!

To see all posts check Index of posts

# My 2 video presentations on ‘Essential Python for Datascience’

Here, in this post, I include 2 sessions on ‘Essential Python for Datascience’. These 2 presentations cover the most important features of the Python language with which you can hit the ground running in datascience. All the related material for these sessions can be cloned/downloaded from Github at ‘EssentialPythonForDatascience’.
1. Essential Python for Datascience -1
In this video presentation I cover basic data types like tuples, lists and dictionaries, how to get the type of a variable, subsetting, and numpy arrays. Some basic operations on numpy arrays and slicing are also covered.
2. Essential Python for Datascience -2
In the 2nd part I cover Pandas, pandas Series, dataframes, how to subset dataframes using iloc/loc, selection of specific columns, filtering dataframes by criteria etc. Other operations include group_by, apply, agg. Lastly I also touch upon matplotlib.

This is by no means an exhaustive coverage of the multitude of features available in Python, but it can provide a good starting point for those venturing into datascience with Python.

Good luck with Python!

To see all posts see Index of posts

# R vs Python: Different similarities and similar differences

A debate about which language is better suited for Datascience, R or Python, can set off diehard fans of these languages into a tizzy. This post tries to look at some of the different similarities and similar differences between these languages. To a large extent the ease or difficulty in learning R or Python is subjective. I have heard that R has a steeper learning curve than Python, and also vice versa. This probably depends on the degree of familiarity with the language. To a large extent both R and Python do the same thing in just slightly different ways and syntaxes. The ease or difficulty of the R/Python constructs is largely in the ‘eyes of the beholder’, nay, the programmer, we could say. I include my own experience with the languages below.

Check out my compact and minimal book “Practical Machine Learning with R and Python: Third edition – Machine Learning in stereo” available on Amazon in paperback ($12.99) and kindle ($8.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems and how to select the best model. Pick your copy today!!

### 1. R data types

R has the following data types
1. Character
2. Integer
3. Numeric
4. Logical
5. Complex
6. Raw

Python has several data types
1. Int
2. float
3. Long
4. Complex
and so on

### 2. R Vector vs Python List

A common data type in R is the vector. Python has a similar data type, the list

# R vectors
a<-c(4,5,1,3,4,5)
print(a[3])
## [1] 1
print(a[3:4]) # R does not always need the explicit print.
## [1] 1 3
#R type of variable
print(class(a))
## [1] "numeric"
# Length of a
print(length(a))
## [1] 6

# Python lists
a=[4,5,1,3,4,5]
# print(a[2]) # Some python IDEs require the explicit print
print(a[2:5])
print(type(a))
# Length of a
print(len(a))
## 1
## [1, 3, 4]
## <class 'list'>
## 6

### 2a. Other data types – Python

Python also has certain other data types like the tuple, dictionary etc as shown below. R does not have as many data types; nevertheless we can do everything that Python does in R

# Python tuple
b = (4,5,7,8)
print(b)
#Python dictionary
c={'name':'Ganesh','age':54,'Work':'Professional'}
print(c)
#Print type of variable c
## (4, 5, 7, 8)
## {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}

### 2b. Type of Variable

To know the type of the variable in R we use ‘class’. In Python the corresponding command is ‘type’

#R - Type of variable
a<-c(4,5,1,3,4,5)
print(class(a))
## [1] "numeric"

#Python - Print type of list a and tuple b
a=[4,5,1,3,4,5]
print(type(a))
b=(4,3,"the",2)
print(type(b))
## <class 'list'>
## <class 'tuple'>

### 3.
Length

To know the length in R, use length()

#R - Length of vector
# Length of a
a<-c(4,5,1,3,4,5)
print(length(a))
## [1] 6

To know the length of a list, tuple or dict we can use len()

# Python - Length of list, tuple etc
# Length of a
a=[4,5,1,3,4,5]
print(len(a))
# Length of b
b = (4,5,7,8)
print(len(b))
## 6
## 4

### 4. Accessing help

To access help in R we use the ‘?’ or the ‘help’ function

#R - Help - To be done in R console or RStudio
#?sapply
#help(sapply)

Help in python on any topic involves

#Python help - This can be done on a (I)Python console
#help(len)
#?len

### 5. Subsetting

The key difference between R and Python with regards to subsetting is that in R the index starts at 1. In Python it starts at 0, much like C, C++ or Java.

To subset a vector in R we use

#R - Subset
a<-c(4,5,1,3,4,8,12,18,1)
print(a[3])
## [1] 1
# To print a range or a slice. Print from the 3rd to the 5th element
print(a[3:6])
## [1] 1 3 4 8

Python also uses indices. The difference in Python is that the index starts from 0.

#Python - Subset
a=[4,5,1,3,4,8,12,18,1]
# Print the 4th element (starts from 0)
print(a[3])
# Print a slice from 4 to 6th element
print(a[3:6])
## 3
## [3, 4, 8]

### 6. Operations on vectors in R and operations on lists in Python

In R we can do many operations on vectors, for e.g. element by element addition, subtraction, exponentiation, product etc. as shown

#R - Operations on vectors
a<- c(5,2,3,1,7)
b<- c(1,5,4,6,8)
#Element wise Addition
print(a+b)
## [1]  6  7  7  7 15
#Element wise subtraction
print(a-b)
## [1]  4 -3 -1 -5 -1
#Element wise product
print(a*b)
## [1]  5 10 12  6 56
# Exponentiating the elements of a vector
print(a^2)
## [1] 25  4  9  1 49

In Python to do this on lists we need to use the ‘map’ and the ‘lambda’ function as follows

# Python - Operations on list
a =[5,2,3,1,7]
b =[1,5,4,6,8]
#Element wise addition with map & lambda
print(list(map(lambda x,y: x+y,a,b)))
#Element wise subtraction
print(list(map(lambda x,y: x-y,a,b)))
#Element wise product
print(list(map(lambda x,y: x*y,a,b)))
# Exponentiating the elements of a list
print(list(map(lambda x: x**2,a)))
## [6, 7, 7, 7, 15]
## [4, -3, -1, -5, -1]
## [5, 10, 12, 6, 56]
## [25, 4, 9, 1, 49]

However if we create ndarrays from lists then we can do the element wise addition, subtraction, product, etc. like R. Numpy is really a powerful module with many, many functions for matrix manipulations

import numpy as np
a =[5,2,3,1,7]
b =[1,5,4,6,8]
a=np.array(a)
b=np.array(b)
#Element wise addition
print(a+b)
#Element wise subtraction
print(a-b)
#Element wise product
print(a*b)
# Exponentiating the elements of a list
print(a**2)
## [ 6  7  7  7 15]
## [ 4 -3 -1 -5 -1]
## [ 5 10 12  6 56]
## [25  4  9  1 49]

### 7. Getting the index of element

To determine the index of an element which satisfies a specific logical condition in R use ‘which’. In the code below the index of the element which is equal to 1 is 4

# R - Which
a<- c(5,2,3,1,7)
print(which(a == 1))
## [1] 4

In a Python array we can use np.where to get the same effect. The index will be 3 as the index starts from 0

# Python - np.where
import numpy as np
a =[5,2,3,1,7]
a=np.array(a)
print(np.where(a==1))
## (array([3], dtype=int64),)

### 8. Data frames

R, by default, comes with a set of in-built datasets.
There are some datasets which come with the Scikit-Learn package

# R
# To check built-in datasets use
#data() - In R console or in R Studio
#iris - Don't print to console

We can use the in-built data sets that come with the Scikit package

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
# This creates a Sklearn bunch
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)

### 9. Working with dataframes

With R you can work with dataframes directly. For more complex dataframe operations in R there are convenient packages like dplyr, reshape2 etc. For Python we need to use the Pandas package. Pandas is quite comprehensive in the list of things we can do with data frames.

The most common operations on a dataframe are
• Check the size of the dataframe
• Take a look at the top 5 or bottom 5 rows of the dataframe
• Check the content of the dataframe

#### a. Size

In R use dim()

#R - Size
dim(iris)
## [1] 150   5

For Python use .shape

#Python - size
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.shape

#### b. Top & bottom 5 rows of dataframe

To know the top and bottom rows of a data frame we use head() & tail() as shown below for R and Python

#R
head(iris,5)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
tail(iris,5)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.head(5))
print(iris.tail(5))
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 0                5.1               3.5                1.4               0.2
## 1                4.9               3.0                1.4               0.2
## 2                4.7               3.2                1.3               0.2
## 3                4.6               3.1                1.5               0.2
## 4                5.0               3.6                1.4               0.2
##      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 145                6.7               3.0                5.2               2.3
## 146                6.3               2.5                5.0               1.9
## 147                6.5               3.0                5.2               2.0
## 148                6.2               3.4                5.4               2.3
## 149                5.9               3.0                5.1               1.8

#### c. Check the content of the dataframe

#R
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
str(iris)
## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ##$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ##$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

#### d. Check column names

#R
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"
#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
#Get column names
print(iris.columns)
## Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
##        'petal width (cm)'],
##       dtype='object')

#### e. Rename columns

In R we can assign a vector to column names

#R
colnames(iris) <- c("lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal","Species")
colnames(iris)
## [1] "lengthOfSepal" "widthOfSepal"  "lengthOfPetal" "widthOfPetal"
## [5] "Species"

In Python we can assign a list to s.columns

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.columns = ["lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal"]
print(iris.columns)
## Index(['lengthOfSepal', 'widthOfSepal', 'lengthOfPetal', 'widthOfPetal'], dtype='object')

#### f. Details of dataframe

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm)    150 non-null float64
## sepal width (cm)     150 non-null float64
## petal length (cm)    150 non-null float64
## petal width (cm)     150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

#### g. Subsetting dataframes

# R
#To subset a dataframe 'df' in R we use df[row,column] or df[row vector,column vector]
#df[row,column]
iris[3,4]
## [1] 0.2
#df[row vector, column vector]
iris[2:5,1:3]
##   lengthOfSepal widthOfSepal lengthOfPetal
## 2           4.9          3.0           1.4
## 3           4.7          3.2           1.3
## 4           4.6          3.1           1.5
## 5           5.0          3.6           1.4
# If we omit the row vector it implies all rows; if we omit the column vector
# it implies all columns for those rows
iris[2:5,]
##   lengthOfSepal widthOfSepal lengthOfPetal widthOfPetal Species
## 2           4.9          3.0           1.4          0.2  setosa
## 3           4.7          3.2           1.3          0.2  setosa
## 4           4.6          3.1           1.5          0.2  setosa
## 5           5.0          3.6           1.4          0.2  setosa
# In R we can access specific columns by column names
# Note: the columns were renamed above, so we use the new names
iris$lengthOfSepal[2:5]
## [1] 4.9 4.7 4.6 5.0

#Python
# To select an entire row we use .iloc. The index can be used with the ':'. For
# .iloc[start row : end row], if start row is omitted then it implies the beginning of the
# data frame, and if end row is omitted then it implies all rows till the end
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.iloc[3])
print(iris[:5])
# In Python we can select columns by column name as follows
print(iris['sepal length (cm)'][2:6])
# If you want to select more than one column then you must use the double '[[ ]]' since the
# index is a list itself
print(iris[['sepal length (cm)','sepal width (cm)']][4:7])
## sepal length (cm)    4.6
## sepal width (cm)     3.1
## petal length (cm)    1.5
## petal width (cm)     0.2
## Name: 3, dtype: float64
##    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
## 0                5.1               3.5                1.4               0.2
## 1                4.9               3.0                1.4               0.2
## 2                4.7               3.2                1.3               0.2
## 3                4.6               3.1                1.5               0.2
## 4                5.0               3.6                1.4               0.2
## 2    4.7
## 3    4.6
## 4    5.0
## 5    5.4
## Name: sepal length (cm), dtype: float64
##    sepal length (cm)  sepal width (cm)
## 4                5.0               3.6
## 5                5.4               3.9
## 6                4.6               3.4

#### h. Computing Mean, Standard deviation

#R
#Mean
mean(iris$lengthOfSepal)
## [1] 5.843333
#Standard deviation
sd(iris$widthOfSepal)
## [1] 0.4358663

#Python
#Mean
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris['sepal length (cm)'].mean())
#Standard deviation
print(iris['sepal width (cm)'].std())
## 5.843333333333335
## 0.4335943113621737

#### i. Boxplot

A boxplot can be produced in R using base plotting

#R
boxplot(iris$lengthOfSepal)

Matplotlib is a popular package in Python for plots

#Python
import sklearn as sklearn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img = plt.boxplot(iris['sepal length (cm)'])
plt.show()
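Pandas also wraps matplotlib, so the same boxplot can be drawn directly from the Series. A small sketch, not from the original post:

#Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
data = datasets.load_iris()
iris = pd.DataFrame(data.data, columns=data.feature_names)
# Series.plot.box() delegates to matplotlib under the hood
iris['sepal length (cm)'].plot.box()
plt.show()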

#### j. Scatter plot

#R
plot(iris$widthOfSepal,iris$lengthOfSepal)

#Python
import matplotlib.pyplot as plt
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img = plt.scatter(iris['sepal width (cm)'],iris['sepal length (cm)'])
#plt.show()
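The same scatter plot can also be drawn directly from the dataframe with DataFrame.plot.scatter(). A minimal sketch, not in the original post:

#Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
data = datasets.load_iris()
iris = pd.DataFrame(data.data, columns=data.feature_names)
# x and y are column names; pandas builds the matplotlib scatter plot for us
iris.plot.scatter(x='sepal width (cm)', y='sepal length (cm)')
plt.show()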

#### k. Read from csv file
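The R and Python snippets in this section assume that Tendulkar's innings-by-innings batting statistics have already been read into a data frame named tendulkar. The original post does not show the read step, so the file name below is only an assumption for illustration. A minimal Python sketch:

#Python
import pandas as pd
# 'tendulkar.csv' is an assumed file name for the downloaded batting statistics
tendulkar = pd.read_csv("tendulkar.csv")

The R equivalent would be read.csv() on the same (assumed) file.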

#R
#Dimensions of dataframe
dim(tendulkar)
## [1] 347  13
names(tendulkar)
##  [1] "X"          "Runs"       "Mins"       "BF"         "X4s"
##  [6] "X6s"        "SR"         "Pos"        "Dismissal"  "Inns"
## [11] "Opposition" "Ground"     "Start.Date"

#Python
import pandas as pd
print(tendulkar.shape)
print(tendulkar.columns)
## (347, 13)
## Index(['Unnamed: 0', 'Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos',
##        'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
##       dtype='object')

#### l. Clean the dataframe in R and Python.

The following steps are performed in both R and Python
1. Remove rows with 'DNB'
2. Remove rows with 'TDNB'
3. Remove rows with 'absent'
4. Remove the "*" indicating not out
5. Remove incomplete rows with NA for R or NaN in Python
6. Do a scatter plot

#R
# Remove rows with 'DNB'
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)
## [1] 330  13
# Remove rows with 'TDNB'
b <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[b,]

# Remove rows with absent
c <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 329  13
# Remove the "*" indicating not out
tendulkar$Runs <- as.numeric(gsub("\\*","",tendulkar$Runs))
dim(tendulkar)
## [1] 329  13
# Select only complete rows - complete.cases()
c <- complete.cases(tendulkar)
# Subset the rows which are complete
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 327  13
# Do some base plotting - Scatter plot
plot(tendulkar$BF,tendulkar$Runs)

#Python
import pandas as pd
import matplotlib.pyplot as plt
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
print(tendulkar.shape)
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
print(tendulkar.shape)
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar[c]
print(tendulkar.shape)
# Remove the "*" indicating not out (regex=True so that '[*]' is treated as a regular expression)
tendulkar.Runs = tendulkar.Runs.str.replace(r"[*]","",regex=True)
#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
print(tendulkar.shape)
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
#Scatter plot
plt.scatter(tendulkar.BF,tendulkar.Runs)
## (347, 13)
## (330, 13)
## (329, 13)
## (329, 13)
## (327, 13)
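The three row filters above can also be combined into a single boolean mask with isin(). A small sketch of this alternative, not from the original post (the file name is again an assumption):

#Python
import pandas as pd
# 'tendulkar.csv' is an assumed file name
tendulkar = pd.read_csv("tendulkar.csv")
# Keep only rows whose Runs value is not one of the non-scoring markers
mask = ~tendulkar.Runs.isin(["DNB", "TDNB", "absent"])
tendulkar = tendulkar[mask]
print(tendulkar.shape)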

#### m. Chaining operations on dataframes

To chain a set of operations in R we need a package like dplyr; Pandas supports chaining natively. The following operations are performed on the tendulkar data frame, with dplyr in R and Pandas in Python:

1. Group by ground
2. Compute average runs in each ground
3. Arrange in descending order
#R
library(dplyr)
tendulkar1 <- tendulkar %>% group_by(Ground) %>% summarise(meanRuns= mean(Runs)) %>%
arrange(desc(meanRuns))
head(tendulkar1,10)
## # A tibble: 10 × 2
##           Ground  meanRuns
##             <chr>     <dbl>
## 1         Multan 194.00000
## 2          Leeds 193.00000
## 3  Colombo (RPS) 143.00000
## 4        Lucknow 142.00000
## 5          Dhaka 132.75000
## 6     Manchester  93.50000
## 7         Sydney  87.22222
## 8   Bloemfontein  85.00000
## 9     Georgetown  81.00000
## 10 Colombo (SSC)  77.55556
#Python
import pandas as pd
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar[c]
# Remove the "* indicating not out
tendulkar.Runs= tendulkar.Runs.str.replace(r"[*]","")

#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
# Select the Runs column before taking the mean to avoid averaging non-numeric columns
tendulkar1 = tendulkar.groupby('Ground')['Runs'].mean().sort_values(ascending=False)
print(tendulkar1.head(10))
## (347, 13)
## Ground
## Multan           194.000000
## Leeds            193.000000
## Colombo (RPS)    143.000000
## Lucknow          142.000000
## Dhaka            132.750000
## Manchester        93.500000
## Sydney            87.222222
## Bloemfontein      85.000000
## Georgetown        81.000000
## Colombo (SSC)     77.555556
## Name: Runs, dtype: float64
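For a style closer to dplyr's %>% pipeline, the same Pandas steps can be written as a single method chain. A minimal sketch, not from the original post, assuming tendulkar has already been cleaned as above:

#Python
tendulkar1 = (tendulkar
              .groupby('Ground')['Runs']      # group by ground and pick the Runs column
              .mean()                         # average runs in each ground
              .sort_values(ascending=False))  # arrange in descending order
print(tendulkar1.head(10))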

### 10. Functions

Finally, here is how a simple function that multiplies two numbers is written in R and in Python.

#R
product <- function(a,b){
  c <- a*b
  c
}
product(5,7)
## [1] 35

#Python
def product(a,b):
    c = a*b
    return c

print(product(5,7))

## 35



## Conclusion

Personally, I took to R much like a ‘duck takes to water’. I found the R syntax very simple and mostly intuitive. R packages like dplyr, ggplot2 and reshape2 make the language quite irresistible. R is dynamically and weakly typed, and in day-to-day use you mostly deal with a few basic types such as numeric and character vectors, as opposed to the full-fledged data types available in Python.

Python has many bells and whistles, which can be a little bewildering to the novice, though they may well become useful as one gains experience with the language. I also found that installing Python packages sometimes throws errors with Python versions 2.7 or 3.6, which leaves you scrambling to Google for fixes; this can be quite frustrating. R, on the other hand, makes installing packages a breeze.

Anyway, this is my current opinion, and like all opinions, may change in the course of time. Let’s see!

I may write a follow up post with more advanced features of R and Python. So do keep checking! Long live R! Viva la Python!

Note: This post was created using RStudio’s RMarkdown which allows you to embed R and Python code snippets. It works perfectly, except that matplotlib’s pyplot does not display.