Deconstructing Convolutional Neural Networks with Tensorflow and Keras

I have been very fascinated by how Convolutional Neural Networks (CNNs) are able to do image classification and image recognition so efficiently; CNNs have been very successful in both these tasks. A good paper that explores the workings of a CNN is Visualizing and Understanding Convolutional Networks by Matthew D Zeiler and Rob Fergus. They show how, through a reverse process of convolution using a deconvnet, the activations in a CNN can be mapped back to the input pixel space.

In their paper they show how, by passing the feature map through a deconvnet, which does the reverse process of the convnet, they can reconstruct what input pattern originally caused a given activation in the feature map.

In the paper they say “A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features, it does the opposite. An input image is presented to the CNN and feature activations are computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.”
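
To make the three reverse steps concrete, here is a minimal sketch of my own (not the paper's implementation) of unpool, rectify and filter for a single Conv2D + MaxPooling2D block, assuming TensorFlow 2.x eager execution. The unpooling is approximated with plain upsampling instead of the paper's max-location "switches", and the layer objects are stand-ins for one block of a trained convnet.

import tensorflow as tf

# Forward pass of one block: filtering followed by pooling
x = tf.random.uniform((1, 28, 28, 1))                      # dummy input image
conv = tf.keras.layers.Conv2D(8, (3, 3), padding='same')   # filtering
pool = tf.keras.layers.MaxPooling2D((2, 2))                # pooling
a = conv(x)
p = pool(a)

# (i) Unpool: approximated here with nearest-neighbour upsampling;
# the paper instead records the exact max locations ("switches")
unpooled = tf.keras.layers.UpSampling2D((2, 2))(p)

# (ii) Rectify: pass the reconstructed maps through ReLU
rectified = tf.nn.relu(unpooled)

# (iii) Filter: apply the transposed versions of the learned filters
kernel, bias = conv.get_weights()
reconstruction = tf.nn.conv2d_transpose(
    rectified, kernel, output_shape=tf.shape(x), strides=1, padding='SAME')

print(reconstruction.shape)   # back in input pixel space: (1, 28, 28, 1)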

I started to scout the internet to see how I could implement this reverse process of convolution to understand what really happens under the hood of a CNN. There are a lot of good articles and blogs, but I found that the post Applied Deep Learning – Part 4: Convolutional Neural Networks takes the visualization of the CNN one step further.

This post takes VGG16 as the pre-trained network and uses it to display the intermediate visualizations. While that post was very informative and the visualizations of the various images were very clear, I wanted to simplify the problem for my own understanding.

Hence I decided to take the MNIST digit classification as my base problem. I created a simple 3 layer CNN which gives close to 99.1% accuracy and decided to see if I could do the visualization.

As mentioned in the above post, there are 3 major visualisations

  1. Feature activations at the layer
  2. Visualisation of the filters
  3. Visualisation of the class outputs

Feature Activation – This visualization shows the feature activations at the 3 different layers for a given input image. It can be seen that the first layer activates based on the edges of the image, while deeper layers activate in a more abstract way.

Visualization of the filters: This visualization shows the patterns to which the filters respond maximally. This is implemented in Keras here.

To do this, the following two steps are repeated in a loop (a minimal sketch appears after the list):

  • Choose a loss function that maximizes the value of a convnet filter activation
  • Do gradient ascent (maximization) in input space that increases the filter activation
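
A minimal sketch of that loop, assuming TensorFlow 2.x and tf.GradientTape (the notebook itself uses the keras-vis visualize_activation call shown further below, which does the same thing with extras such as jitter). The function name here is illustrative only.

import tensorflow as tf

def visualize_filter(model, layer_name, filter_index, steps=100, lr=1.0):
    # Model that maps the input image to the chosen layer's activations
    layer = model.get_layer(layer_name)
    feature_extractor = tf.keras.Model(model.input, layer.output)

    # Start from a small random (nearly gray) image
    img = tf.Variable(tf.random.uniform((1, 28, 28, 1)) * 0.1 + 0.45)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = feature_extractor(img)
            # Loss = mean activation of the chosen filter
            loss = tf.reduce_mean(activation[..., filter_index])
        grads = tape.gradient(loss, img)
        grads = grads / (tf.norm(grads) + 1e-8)   # normalize the gradient
        img.assign_add(lr * grads)                # gradient *ascent* in input space
    return img.numpy()[0, :, :, 0]

# Example usage (layer name as reported by model.summary(), e.g. 'conv2d_1'):
# pattern = visualize_filter(model, 'conv2d_1', filter_index=0)
# plt.imshow(pattern, cmap='gray')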

Visualizing Class Outputs of the MNIST Convnet: This process is similar to determining the filter activation. Here the convnet is made to generate an image that represents the category maximally.

You can access the Google colab notebook here – Deconstructing Convolutional Neural Networks in TensorFlow and Keras

import numpy as np
import pandas as pd
import os
import tensorflow as tf
import matplotlib.pyplot as plt
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, Input
from keras.models import Model
from sklearn.model_selection import train_test_split
from keras.utils import np_utils
Using TensorFlow backend.
In [0]:
mnist=tf.keras.datasets.mnist
# Set training and test data and labels
(training_images,training_labels),(test_images,test_labels)=mnist.load_data()
In [0]:
#Normalize training data
X =np.array(training_images).reshape(training_images.shape[0],28,28,1) 
# Normalize the images by dividing by 255.0
X = X/255.0
X.shape
# Use one hot encoding for the labels
Y = np_utils.to_categorical(training_labels, 10)
Y.shape
# Split training data into training and validation data in the ratio of 80:20
X_train, X_validation, y_train, y_validation = train_test_split(X,Y,test_size=0.20, random_state=42)
In [4]:
# Normalize test data
X_test =np.array(test_images).reshape(test_images.shape[0],28,28,1) 
X_test=X_test/255.0
#Use OHE for the test labels
Y_test = np_utils.to_categorical(test_labels, 10)
X_test.shape
Out[4]:
(10000, 28, 28, 1)

Display data

Display the training data and the corresponding labels

In [5]:
print(training_labels[0:10])
f, axes = plt.subplots(1, 10, sharey=True,figsize=(10,10))
for i,ax in enumerate(axes.flat):
    ax.axis('off')
    ax.imshow(X[i,:,:,0],cmap="gray")

Create a Convolutional Neural Network

The CNN consists of 3 layers

  • Conv2D of size 28 x 28 with 24 filters
  • Perform Max pooling
  • Conv2D of size 14 x 14 with 48 filters
  • Perform max pooling
  • Conv2d of size 7 x 7 with 64 filters
  • Flatten
  • Use Dense layer with 128 units
  • Perform 25% dropout
  • Perform categorical cross entropy with softmax activation function
In [0]:
num_classes=10
inputs = Input(shape=(28,28,1))
x = Conv2D(24,kernel_size=(3,3),padding='same',activation="relu")(inputs)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(48, (3, 3), padding='same',activation='relu')(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(64, (3, 3), padding='same',activation='relu')(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.25)(x)
output = Dense(num_classes,activation="softmax")(x)

model = Model(inputs,output)

model.compile(loss='categorical_crossentropy', 
          optimizer='adam', 
          metrics=['accuracy'])

Summary of CNN

Display the summary of CNN

In [7]:
model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 28, 28, 1)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 24)        240       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 24)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 14, 14, 48)        10416     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 7, 7, 48)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 7, 7, 64)          27712     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 3, 3, 64)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               73856     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
=================================================================
Total params: 113,514
Trainable params: 113,514
Non-trainable params: 0
_________________________________________________________________
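
As a quick sanity check, the parameter counts in the summary above can be reproduced by hand: each Conv2D layer has (kernel height x kernel width x input channels + 1 bias) x filters parameters, and each Dense layer has (inputs + 1 bias) x units.

# Reproducing the "Param #" column of model.summary()
conv2d_1 = (3 * 3 * 1 + 1) * 24      # 240
conv2d_2 = (3 * 3 * 24 + 1) * 48     # 10,416
conv2d_3 = (3 * 3 * 48 + 1) * 64     # 27,712
dense_1  = (3 * 3 * 64 + 1) * 128    # 73,856 (the flattened 3 x 3 x 64 = 576 inputs)
dense_2  = (128 + 1) * 10            # 1,290
print(conv2d_1 + conv2d_2 + conv2d_3 + dense_1 + dense_2)   # 113514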

Perform Gradient descent and validate with the validation data

In [8]:
epochs = 20
batch_size=256
history = model.fit(X_train,y_train,
         epochs=epochs,
         batch_size=batch_size,
         validation_data=(X_validation,y_validation))
#————————————————
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc)) # Get number of epochs
#————————————————
# Plot training and validation accuracy per epoch
#————————————————
plt.plot(epochs, acc, label="training accuracy")
plt.plot(epochs, val_acc, label="validation accuracy")
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
#————————————————
# Plot training and validation loss per epoch
#————————————————
plt.plot(epochs, loss, label="training loss")
plt.plot(epochs, val_loss, label="validation loss")
plt.title('Training and validation loss')
plt.legend()

Test model on test data

f, axes = plt.subplots(1, 10, sharey=True,figsize=(10,10))
for i,ax in enumerate(axes.flat):
    ax.axis('off')
    ax.imshow(X_test[i,:,:,0],cmap="gray")
l=[]
for i in range(10):
  x=X_test[i].reshape(1,28,28,1)
  y=model.predict(x)
  m = np.argmax(y, axis=1)
  print(m)
[7]
[2]
[1]
[0]
[4]
[1]
[4]
[9]
[5]
[9]

Generate the filter activations at the intermediate CNN layers

In [12]:
img = test_images[51].reshape(1,28,28,1)
fig = plt.figure(figsize=(5,5))
print(img.shape)
plt.imshow(img[0,:,:,0],cmap="gray")
plt.axis('off')

Display the activations at the intermediate layers

This displays the intermediate activations as the image passes through the filters and generates these feature maps

In [13]:
layer_names = ['conv2d_4', 'conv2d_5', 'conv2d_6']

layer_outputs = [layer.output for layer in model.layers if 'conv2d' in layer.name]
activation_model = Model(inputs=model.input,outputs=layer_outputs)
intermediate_activations = activation_model.predict(img)
images_per_row = 8
max_images = 8

for layer_name, layer_activation in zip(layer_names, intermediate_activations):
    print(layer_name,layer_activation.shape)
    n_features = layer_activation.shape[-1]
    print("features=",n_features)
    n_features = min(n_features, max_images)
    print(n_features)

    size = layer_activation.shape[1]
    print("size=",size)
    n_cols = n_features // images_per_row
    display_grid = np.zeros((size * n_cols, images_per_row * size))


    for col in range(n_cols):
      for row in range(images_per_row):
          channel_image = layer_activation[0,:, :, col * images_per_row + row]

          channel_image -= channel_image.mean()
          channel_image /= channel_image.std()
          channel_image *= 64
          channel_image += 128
          channel_image = np.clip(channel_image, 0, 255).astype('uint8')
          display_grid[col * size : (col + 1) * size,
                         row * size : (row + 1) * size] = channel_image
    scale = 2. / size
    plt.figure(figsize=(scale * display_grid.shape[1],
                        scale * display_grid.shape[0]))
    plt.axis('off')
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect='auto', cmap='viridis')
    
plt.show()

It can be seen that at the higher layers only abstract features of the input image are captured.
# To fix the "ImportError: cannot import name 'imresize'" in the next cell, run this cell, then comment it out, restart and run all
#!pip install scipy==1.1.0

Visualize the pattern that the filters respond to maximally

  • Choose a loss function that maximizes the value of the CNN filter in a given layer
  • Start from a blank input image.
  • Do gradient ascent in input space. Modify input values so that the filter activates more
  • Repeat this in a loop.
In [14]:
from vis.visualization import visualize_activation, get_num_filters
from vis.utils import utils
from vis.input_modifiers import Jitter

max_filters = 24
selected_indices = []
vis_images = [[], [], [], [], []]
i = 0
selected_filters = [[0, 3, 11, 15, 16, 17, 18, 22], 
    [8, 21, 23, 25, 31, 32, 35, 41], 
    [2, 7, 11, 14, 19, 26, 35, 48]]

# Set the layers
layer_name = ['conv2d_4', 'conv2d_5', 'conv2d_6']
# Set the layer indices
layer_idx = [1,3,5]
for layer_name,layer_idx in zip(layer_name,layer_idx):


    # Visualize all filters in this layer.
    if selected_filters:
        filters = selected_filters[i]
    else:
        # Randomly select filters
        filters = sorted(np.random.permutation(get_num_filters(model.layers[layer_idx]))[:max_filters])
    selected_indices.append(filters)

    # Generate input image for each filter.
    # Loop through the selected filters in each layer and generate the activation of these filters
    for idx in filters:
        img = visualize_activation(model, layer_idx, filter_indices=idx, tv_weight=0., 
                                   input_modifiers=[Jitter(0.05)], max_iter=300) 
        vis_images[i].append(img)

    # Generate stitched image palette with 4 cols so we get 2 rows.
    stitched = utils.stitch_images(vis_images[i], cols=4)    
    plt.figure(figsize=(20, 30))
    plt.title(layer_name)
    plt.axis('off')
    stitched = stitched.reshape(1,61,127,1)
    plt.imshow(stitched[0,:,:,0])
    plt.show()
    i += 1
from vis.utils import utils
new_vis_images = [[], [], [], [], []]
i = 0
layer_name = ['conv2d_4', 'conv2d_5', 'conv2d_6']
layer_idx = [1,3,5]
for layer_name,layer_idx in zip(layer_name,layer_idx):
   
    # Generate input image for each filter.
    for j, idx in enumerate(selected_indices[i]):
        img = visualize_activation(model, layer_idx, filter_indices=idx, 
                                   seed_input=vis_images[i][j], input_modifiers=[Jitter(0.05)], max_iter=300) 
        #img = utils.draw_text(img, 'Filter {}'.format(idx))  
        new_vis_images[i].append(img)

    stitched = utils.stitch_images(new_vis_images[i], cols=4)   
    plt.figure(figsize=(20, 30))
    plt.title(layer_name)
    plt.axis('off')
    print(stitched.shape)
    stitched = stitched.reshape(1,61,127,1)
    plt.imshow(stitched[0,:,:,0])
    plt.show()
    i += 1

Visualizing Class Outputs

Here the CNN will generate the image that maximally represents each category. Each of the outputs represents one of the digits, as can be seen below.

In [16]:
from vis.utils import utils
from keras import activations
codes = '''
zero 0
one 1
two 2
three 3
four 4
five 5
six 6
seven 7
eight 8
nine 9
'''
layer_idx=10
initial = []
images = []
tuples = []
# Swap softmax with linear for better visualization
model.layers[layer_idx].activation = activations.linear
model = utils.apply_modifications(model)
for line in codes.split('\n'):
    if not line:
        continue
    name, idx = line.rsplit(' ', 1)
    idx = int(idx)
    img = visualize_activation(model, layer_idx, filter_indices=idx, 
                               tv_weight=0., max_iter=300, input_modifiers=[Jitter(0.05)])

    initial.append(img)
    tuples.append((name, idx))

i = 0
for name, idx in tuples:
    img = visualize_activation(model, layer_idx, filter_indices=idx,
                               seed_input = initial[i], max_iter=300, input_modifiers=[Jitter(0.05)])
    #img = utils.draw_text(img, name) # Unable to display text on gray scale image
    i += 1
    images.append(img)

stitched = utils.stitch_images(images, cols=4)
plt.figure(figsize=(20, 20))
plt.axis('off')
stitched = stitched.reshape(1,94,127,1)
plt.imshow(stitched[0,:,:,0])

plt.show()

In the grid below the class outputs show the MNIST digit to which each output responds maximally. We can see the ghostly outlines of the digits 0 – 9. We can clearly see the outlines of 0, 1, 2, 3, 4, 5 (yes, it is there!), 6, 7, 8 and 9. If you look at this from a little distance the digits are clearly visible. Isn’t that really cool!!


 

Conclusion:


It is really interesting to see the class outputs, which show the image to which each class output responds maximally. In the post Applied Deep Learning – Part 4: Convolutional Neural Networks the class outputs show much more complicated images and are worth a look. It is also really interesting to note that the model has adjusted the filter values and the weights of the fully connected network to respond maximally to the MNIST digits.

References

1. Visualizing and Understanding Convolutional Networks
2. Applied Deep Learning – Part 4: Convolutional Neural Networks
3. Visualizing Intermediate Activations of a CNN trained on the MNIST Dataset
4. How convolutional neural networks see the world
5. Keras – Activation_maximization

Also see

1. Using Reinforcement Learning to solve Gridworld
2. Deep Learning from first principles in Python, R and Octave – Part 8
3. Cricketr learns new tricks : Performs fine-grained analysis of players
4. Video presentation on Machine Learning, Data Science, NLP and Big Data – Part 1
5. Big Data-2: Move into the big league:Graduate from R to SparkR
6. OpenCV: Fun with filters and convolution
7. Powershell GUI – Adding bells and whistles

To see all posts click Index of posts

Practical Machine Learning with R and Python – Part 3

In this post ‘Practical Machine Learning with R and Python – Part 3’,  I discuss ‘Feature Selection’ methods. This post is a continuation of my 2 earlier posts

  1. Practical Machine Learning with R and Python – Part 1
  2. Practical Machine Learning with R and Python – Part 2

While applying Machine Learning techniques, the data set will usually include a large number of predictors for a target variable. It is quite likely that not all the predictors or feature variables will have an impact on the output. Hence it becomes necessary to choose only those features which influence the output variable, thus reducing to a smaller feature set on which to train the ML model. The techniques that are used are the following

  • Best fit
  • Forward fit
  • Backward fit
  • Ridge Regression or L2 regularization
  • Lasso or L1 regularization

This post includes the equivalent ML code in R and Python.

All these methods remove those features which do not sufficiently influence the output. As in my previous 2 posts on ‘Practical Machine Learning with R and Python’, this post is largely based on the topics in the following 2 MOOC courses
1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford
2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

You can download this R Markdown file and the associated data from Github – Machine Learning-RandPython-Part3. 

Note: Please listen to my video presentations Machine Learning in youtube
1. Machine Learning in plain English-Part 1
2. Machine Learning in plain English-Part 2
3. Machine Learning in plain English-Part 3

Check out my compact and minimal book  “Practical Machine Learning with R and Python:Third edition- Machine Learning in stereo”  available in Amazon in paperback($12.99) and kindle($8.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems and how to select the best model. Pick your copy today!!

 

1.1 Best Fit

For a dataset with features f1,f2,f3…fn, the ‘Best fit’ approach chooses all possible combinations of features and creates a separate ML model for each of the different combinations. The best fit algorithm then uses some filtering criterion based on Adjusted R-squared, Cp, BIC or AIC to pick out the best model among all the models.

Since the Best Fit approach searches the entire solution space it is computationally infeasible for large feature sets. The number of models that have to be searched increases exponentially as the number of predictors increases. For ‘p’ predictors a total of 2^{p} ML models have to be searched. This can be shown as follows

There are \binom{n}{1} ways to choose single-feature ML models among ‘n’ features, \binom{n}{2} ways to choose 2-feature models among ‘n’ features and so on. Hence the total number of models in Best Fit is
1+\binom{n}{1} + \binom{n}{2} +... + \binom{n}{n}
From the Binomial theorem we have
(1+x)^{n} = 1+\binom{n}{1}x + \binom{n}{2}x^{2} +... + \binom{n}{n}x^{n}
Setting x=1 in the equation above, this becomes
2^{n} = 1+\binom{n}{1} + \binom{n}{2} +... + \binom{n}{n}

Hence there are 2^{n} models to search amongst in Best Fit. For 10 features this is 2^{10} or ~1000 models, and for 40 features this becomes 2^{40}, which is almost 1 trillion. There are often datasets with 1000 or even 100,000 features, for which Best Fit becomes computationally infeasible.

Anyway, I have included the Best Fit approach as I use the Boston crime dataset, which is available in both the MASS package in R and sklearn in Python, and which has 13 features. Even this small feature set takes a bit of time since Best Fit needs to search among ~2^{13} = 8192 models.
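
As a quick check of this count (plain Python, purely for illustration), enumerating the feature subsets for p = 13 gives the 8191 non-empty combinations that the exhaustive search below reports; adding the null model gives 2^{13} = 8192.

from itertools import combinations

p = 13
n_models = sum(1 for k in range(1, p + 1) for _ in combinations(range(p), k))
print(n_models, 2**p - 1)   # 8191 8191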

Initially I perform a simple Linear Regression fit to identify the features that are statistically insignificant. By looking at the p-values of the features it can be seen that the ‘indus’ and ‘age’ features have high p-values and are not significant.

1.1a Linear Regression – R code

source('RFunctions-1.R')
#Read the Boston crime data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                            "distances","highways","tax","teacherRatio","color","status","cost")
dim(df1)
## [1] 506  14
# Linear Regression fit
fit <- lm(cost~. ,data=df1)
summary(fit)
## 
## Call:
## lm(formula = cost ~ ., data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crimeRate    -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zone          4.642e-02  1.373e-02   3.382 0.000778 ***
## indus         2.056e-02  6.150e-02   0.334 0.738288    
## charles       2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox          -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rooms         3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age           6.922e-04  1.321e-02   0.052 0.958229    
## distances    -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## highways      3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax          -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## teacherRatio -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## color         9.312e-03  2.686e-03   3.467 0.000573 ***
## status       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Next we apply the different feature selection methods below to automatically remove the features that are not significant.

1.1b Best Fit – R code

The Best Fit requires the ‘leaps’ R package

library(leaps)
source('RFunctions-1.R')
#Read the Boston crime data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                            "distances","highways","tax","teacherRatio","color","status","cost")

# Perform a best fit
bestFit=regsubsets(cost~.,df1,nvmax=13)

# Generate a summary of the fit
bfSummary=summary(bestFit)

# Plot the Residual Sum of Squares vs number of variables 
plot(bfSummary$rss,xlab="Number of Variables",ylab="RSS",type="l",main="Best fit RSS vs No of features")
# Get the index of the minimum value
a=which.min(bfSummary$rss)
# Mark this in red
points(a,bfSummary$rss[a],col="red",cex=2,pch=20)

The plot below shows that the Best fit occurs with all 13 features included. Notice that there is no significant change in RSS from 11 features onward.

# Plot the CP statistic vs Number of variables
plot(bfSummary$cp,xlab="Number of Variables",ylab="Cp",type='l',main="Best fit Cp vs No of features")
# Find the lowest CP value
b=which.min(bfSummary$cp)
# Mark this in red
points(b,bfSummary$cp[b],col="red",cex=2,pch=20)

Based on Cp metric the best fit occurs at 11 features as seen below. The values of the coefficients are also included below

# Display the set of features which provide the best fit
coef(bestFit,b)
##   (Intercept)     crimeRate          zone       charles           nox 
##  36.341145004  -0.108413345   0.045844929   2.718716303 -17.376023429 
##         rooms     distances      highways           tax  teacherRatio 
##   3.801578840  -1.492711460   0.299608454  -0.011777973  -0.946524570 
##         color        status 
##   0.009290845  -0.522553457
#  Plot the BIC value
plot(bfSummary$bic,xlab="Number of Variables",ylab="BIC",type='l',main="Best fit BIC vs No of Features")
# Find and mark the min value
c=which.min(bfSummary$bic)
points(c,bfSummary$bic[c],col="red",cex=2,pch=20)

# R has some other good plots for best fit
plot(bestFit,scale="r2",main="Rsquared vs No Features")

R has a set of really nice visualizations for the best fit. The plot below shows the R-squared for different sets of predictor variables. It can be seen that for the model where R-squared is 0.74, indus, charles and age have not been included.

plot(bestFit,scale="Cp",main="Cp vs NoFeatures")

The Cp plot below similarly shows that indus, charles and age are not included in the Best fit.

plot(bestFit,scale="bic",main="BIC vs Features")

1.1c Best fit (Exhaustive Search) – Python code

The Python package for performing a Best Fit is the Exhaustive Feature Selector EFS.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

# Read the Boston crime data
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")

#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
# Set X and y 
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']

# Perform an Exhaustive Search. The EFS and SFS packages use 'neg_mean_squared_error'. The 'mean_squared_error' seems to have been deprecated. I think this is just the MSE with a negative sign.
lr = LinearRegression()
efs1 = EFS(lr, 
           min_features=1,
           max_features=13,
           scoring='neg_mean_squared_error',
           print_progress=True,
           cv=5)


# Create a efs fit
efs1 = efs1.fit(X.as_matrix(), y.as_matrix())

print('Best negative mean squared error: %.2f' % efs1.best_score_)
## Print the IDX of the best features 
print('Best subset:', efs1.best_idx_)
Features: 8191/8191Best negative mean squared error: -28.92
## ('Best subset:', (0, 1, 4, 6, 7, 8, 9, 10, 11, 12))

The indices for the best subset are shown above.

1.2 Forward fit

Forward fit is a greedy algorithm that tries to optimize the features selected by optimizing the selection criterion (adjusted R-squared, Cp, AIC or BIC) at every step. For a dataset with features f1,f2,f3…fn, the forward fit starts with the NULL set. It then picks the ML model with a single feature from the n features which has the highest adjusted R-squared, or minimum Cp, BIC or some such criterion. After picking the 1 feature from n which satisfies the criterion the most, the next feature from the remaining n-1 features is chosen. Once the 2-feature model which best satisfies the selection criterion is chosen, another feature from the remaining n-2 features is added, and so on. The forward fit is a sub-optimal algorithm: there is no guarantee that the final list of features chosen will be the best among the lot. The computation required for this is n + (n-1) + (n-2) + ... + 1 = n(n+1)/2, which is of the order of n^{2}. Though forward fit is a sub-optimal solution it is far more computationally efficient than best fit. A minimal sketch of the greedy forward step is shown below.
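
The greedy step is easy to sketch in a few lines of Python. This is only an illustration of the idea, using scikit-learn's LinearRegression and the training R-squared as the criterion; the actual fits below use the leaps and mlxtend packages, and the function name is made up.

from sklearn.linear_model import LinearRegression

def forward_select(X, y, n_features):
    # Greedy forward selection: at each step add the single remaining feature
    # that gives the largest improvement in training R-squared (lowest RSS)
    remaining, chosen = list(X.columns), []
    lr = LinearRegression()
    for _ in range(n_features):
        best = max(remaining,
                   key=lambda f: lr.fit(X[chosen + [f]], y).score(X[chosen + [f]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# e.g. forward_select(X, y, 5) with the X, y set up in the Python sections below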

1.2a Forward fit – R code

Forward fit in R determines that 11 features are required for the best fit. The features are shown below

library(leaps)
# Read the data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")

# Select columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                     "distances","highways","tax","teacherRatio","color","status","cost")

#Split as training and test 
train_idx <- trainTestSplit(df1,trainPercent=75,seed=5)
train <- df1[train_idx, ]
test <- df1[-train_idx, ]

# Find the best forward fit
fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="forward")

# Compute the MSE
valErrors=rep(NA,13)
test.mat=model.matrix(cost~.,data=test)
for(i in 1:13){
    coefi=coef(fitFwd,id=i)
    pred=test.mat[,names(coefi)]%*%coefi
    valErrors[i]=mean((test$cost-pred)^2)
}

# Plot the Residual Sum of Squares
plot(valErrors,xlab="Number of Variables",ylab="Validation Error",type="l",main="Forward fit RSS vs No of features")
# Gives the index of the minimum value
a<-which.min(valErrors)
print(a)
## [1] 11
# Highlight the smallest value
points(a,valErrors[a],col="blue",cex=2,pch=20)

Forward fit R selects 11 predictors as the best ML model to predict the ‘cost’ output variable. The values for these 11 predictors are included below

#Print the coefficients
coefi=coef(fitFwd,id=i)
coefi
##   (Intercept)     crimeRate          zone         indus       charles 
##  2.397179e+01 -1.026463e-01  3.118923e-02  1.154235e-04  3.512922e+00 
##           nox         rooms           age     distances      highways 
## -1.511123e+01  4.945078e+00 -1.513220e-02 -1.307017e+00  2.712534e-01 
##           tax  teacherRatio         color        status 
## -1.330709e-02 -8.182683e-01  1.143835e-02 -3.750928e-01

1.2b Forward fit with Cross Validation – R code

The Python package SFS includes N-fold Cross Validation errors for forward and backward fit, so I decided to add this code in R. This is not available in the ‘leaps’ R package; however, the implementation is quite simple. Another implementation is also available at Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford 2.

library(dplyr)
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")

# Select columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                     "distances","highways","tax","teacherRatio","color","status","cost")

set.seed(6)
# Set max number of features
nvmax<-13
cvError <- NULL
# Loop over the number of features
for(i in 1:nvmax){
    # Set no of folds
    noFolds=5
    # Create the rows which fall into different folds from 1..noFolds
    folds = sample(1:noFolds, nrow(df1), replace=TRUE) 
    cv<-0
    # Loop through the folds
    for(j in 1:noFolds){
        # The training is all rows for which the row is != j (k-1 folds -> training)
        train <- df1[folds!=j,]
        # The rows which have j as the index become the test set
        test <- df1[folds==j,]
        # Create a forward fitting model for this
        fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="forward")
        # Select the number of features and get the feature coefficients
        coefi=coef(fitFwd,id=i)
        #Get the value of the test data
        test.mat=model.matrix(cost~.,data=test)
        # Multiply the test data with the fitted coefficients to get the predicted value
        # pred = b0 + b1x1+b2x2... b13x13
        pred=test.mat[,names(coefi)]%*%coefi
        # Compute mean squared error
        rss=mean((test$cost - pred)^2)
        # Add all the Cross Validation errors
        cv=cv+rss
    }
    # Compute the average of MSE for K folds for number of features 'i'
    cvError[i]=cv/noFolds
}
a <- seq(1,13)
d <- as.data.frame(t(rbind(a,cvError)))
names(d) <- c("Features","CVError")
#Plot the CV Error vs No of Features
ggplot(d,aes(x=Features,y=CVError),color="blue") + geom_point() + geom_line(color="blue") +
    xlab("No of features") + ylab("Cross Validation Error") +
    ggtitle("Forward Selection - Cross Valdation Error vs No of Features")

Forward fit with 5 fold cross validation indicates that all 13 features are required

# This gives the index of the minimum value
a=which.min(cvError)
print(a)
## [1] 13
#Print the 13 coefficients of these features
coefi=coef(fitFwd,id=a)
coefi
##   (Intercept)     crimeRate          zone         indus       charles 
##  36.650645380  -0.107980979   0.056237669   0.027016678   4.270631466 
##           nox         rooms           age     distances      highways 
## -19.000715500   3.714720418   0.019952654  -1.472533973   0.326758004 
##           tax  teacherRatio         color        status 
##  -0.011380750  -0.972862622   0.009549938  -0.582159093

1.2c Forward fit – Python code

The Forward fit in Python uses the Sequential Feature Selector (SFS) package from mlxtend (https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/).

Note: The cross validation error for SFS in sklearn is negative, possibly because it computes the 'neg_mean_squared_error'. The earlier 'mean_squared_error' in the package seems to have been deprecated. I have taken the negative of this neg_mean_squared_error, which I think gives the mean_squared_error.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression


df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()
# Create a forward fit model
sfs = SFS(lr, 
          k_features=(1,13), 
          forward=True, # Forward fit
          floating=False, 
          scoring='neg_mean_squared_error',
          cv=5)

# Fit this on the data
sfs = sfs.fit(X.as_matrix(), y.as_matrix())
# Get all the details of the forward fits
a=sfs.get_metric_dict()
n=[]
o=[]

# Compute the mean cross validation scores
for i in np.arange(1,13):
    n.append(-np.mean(a[i]['cv_scores']))  
m=np.arange(1,13)
# Get the index of the minimum CV score

# Plot the CV scores vs the number of features
fig1=plt.plot(m,n)
fig1=plt.title('Mean CV Scores vs No of features')
fig1.figure.savefig('fig1.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sfs.get_metric_dict(confidence_interval=0.90)).T)

idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best forward fit and convert to list
b=list(a[idx]['feature_idx'])
print(b)

# Index the column names. 
# Features from forward fit
print("Features selected in forward fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...   
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...   
## 3   -34.1001    20.87  [-9.43012884381, -25.9584955394, -36.184188174...   
## 4   -33.7681  20.1638  [-8.86076528781, -28.650217633, -35.7246353855...   
## 5   -33.6392  20.5271  [-8.90807628524, -28.0684679108, -35.827463022...   
## 6   -33.6276  19.0859  [-9.549485942, -30.9724602876, -32.6689523347,...   
## 7   -32.4082  19.1455  [-10.0177149635, -28.3780298492, -30.926917231...   
## 8   -32.3697   18.533  [-11.1431684243, -27.5765510172, -31.168994094...   
## 9   -32.4016  21.5561  [-10.8972555995, -25.739780653, -30.1837430353...   
## 10  -32.8504  22.6508  [-12.3909282079, -22.1533250755, -33.385407342...   
## 11  -34.1065  24.7019  [-12.6429253721, -22.1676650245, -33.956999528...   
## 12  -35.5814   25.693  [-12.7303397453, -25.0145323483, -34.211898373...   
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...   
## 
##                                    feature_idx  std_dev  std_err  
## 1                                        (12,)  18.9042  9.45212  
## 2                                     (10, 12)  16.1965  8.09826  
## 3                                  (10, 12, 5)  20.7142  10.3571  
## 4                               (10, 3, 12, 5)  20.0132  10.0066  
## 5                            (0, 10, 3, 12, 5)  20.3738  10.1869  
## 6                         (0, 3, 5, 7, 10, 12)  18.9433  9.47167  
## 7                      (0, 2, 3, 5, 7, 10, 12)  19.0026  9.50128  
## 8                   (0, 1, 2, 3, 5, 7, 10, 12)  18.3946  9.19731  
## 9               (0, 1, 2, 3, 5, 7, 10, 11, 12)  21.3952  10.6976  
## 10           (0, 1, 2, 3, 4, 5, 7, 10, 11, 12)  22.4816  11.2408  
## 11        (0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12)  24.5175  12.2587  
## 12     (0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12)  25.5012  12.7506  
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546  
## No of features= 7
## [0, 2, 3, 5, 7, 10, 12]
## #################################################################################
## Features selected in forward fit
## Index([u'crimeRate', u'indus', u'chasRiver', u'rooms', u'distances',
##        u'teacherRatio', u'status'],
##       dtype='object')

The table above shows the average score, the 5 fold CV errors, the features included at every step, the std. deviation and the std. error.

The above plot indicates that 8 features provide the lowest Mean CV error

1.3 Backward Fit

Backward fit belongs to the class of greedy algorithms which tries to optimize the feature set by dropping, at every stage, the feature whose removal degrades the fit the least for a given criterion of adjusted R-squared, Cp, BIC or AIC. For a dataset with features f1,f2,f3…fn, the backward fit starts with all the features f1,f2,..,fn. It then picks the ML model with n-1 features by dropping the feature f_{j} whose exclusion affects the chosen criterion the least. At every step 1 feature is dropped. There is no guarantee that the final list of features chosen will be the best among the lot. The computation required for this is n + (n-1) + (n-2) + ... + 1 = n(n+1)/2, which is of the order of n^{2}. Though backward fit is a sub-optimal solution it is far more computationally efficient than best fit. A sketch of the backward step follows below.
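
A corresponding sketch of the backward step (again just an illustration with scikit-learn and training R-squared as the criterion; the actual fits below use leaps and mlxtend, and the function name is made up):

from sklearn.linear_model import LinearRegression

def backward_select(X, y, n_features):
    # Greedy backward elimination: at each step drop the feature whose
    # removal reduces the training R-squared the least
    chosen = list(X.columns)
    lr = LinearRegression()
    while len(chosen) > n_features:
        drop = max(chosen,
                   key=lambda f: lr.fit(X[[c for c in chosen if c != f]], y)
                                   .score(X[[c for c in chosen if c != f]], y))
        chosen.remove(drop)
    return chosen

# e.g. backward_select(X, y, 9)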

1.3a Backward fit – R code

library(dplyr)
# Read the data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")

# Select columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                     "distances","highways","tax","teacherRatio","color","status","cost")

set.seed(6)
# Set max number of features
nvmax<-13
cvError <- NULL
# Loop over the number of features
for(i in 1:nvmax){
    # Set no of folds
    noFolds=5
    # Create the rows which fall into different folds from 1..noFolds
    folds = sample(1:noFolds, nrow(df1), replace=TRUE) 
    cv<-0
    for(j in 1:noFolds){
        # The training is all rows for which the row is != j 
        train <- df1[folds!=j,]
        # The rows which have j as the index become the test set
        test <- df1[folds==j,]
        # Create a backward fitting model for this
        fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="backward")
        # Select the number of features and get the feature coefficients
        coefi=coef(fitFwd,id=i)
        #Get the value of the test data
        test.mat=model.matrix(cost~.,data=test)
        # Multiply the test data with the fitted coefficients to get the predicted value
        # pred = b0 + b1x1+b2x2... b13x13
        pred=test.mat[,names(coefi)]%*%coefi
        # Compute mean squared error
        rss=mean((test$cost - pred)^2)
        # Add the Residual sum of square
        cv=cv+rss
    }
    # Compute the average of MSE for K folds for number of features 'i'
    cvError[i]=cv/noFolds
}
a <- seq(1,13)
d <- as.data.frame(t(rbind(a,cvError)))
names(d) <- c("Features","CVError")
# Plot the Cross Validation Error vs Number of features
ggplot(d,aes(x=Features,y=CVError),color="blue") + geom_point() + geom_line(color="blue") +
    xlab("No of features") + ylab("Cross Validation Error") +
    ggtitle("Backward Selection - Cross Valdation Error vs No of Features")

# This gives the index of the minimum value
a=which.min(cvError)
print(a)
## [1] 13
#Print the 13 coefficients of these features
coefi=coef(fitFwd,id=a)
coefi
##   (Intercept)     crimeRate          zone         indus       charles 
##  36.650645380  -0.107980979   0.056237669   0.027016678   4.270631466 
##           nox         rooms           age     distances      highways 
## -19.000715500   3.714720418   0.019952654  -1.472533973   0.326758004 
##           tax  teacherRatio         color        status 
##  -0.011380750  -0.972862622   0.009549938  -0.582159093

Backward selection in R also indicates that all 13 features, with the corresponding coefficients shown above, provide the best fit.

1.3b Backward fit – Python code

The Backward fit in Python uses the Sequential Feature Selector (SFS) package from mlxtend (https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/).

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# Read the data
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

# Create the SFS model
sfs = SFS(lr, 
          k_features=(1,13), 
          forward=False, # Backward
          floating=False, 
          scoring='neg_mean_squared_error',
          cv=5)

# Fit the model
sfs = sfs.fit(X.as_matrix(), y.as_matrix())
a=sfs.get_metric_dict()
n=[]
o=[]

# Compute the mean of the validation scores
for i in np.arange(1,13):
    n.append(-np.mean(a[i]['cv_scores'])) 
m=np.arange(1,13)

# Plot the Validation scores vs number of features
fig2=plt.plot(m,n)
fig2=plt.title('Mean CV Scores vs No of features')
fig2.figure.savefig('fig2.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sfs.get_metric_dict(confidence_interval=0.90)).T)

# Get the index of minimum cross validation error
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best forward fit and convert to list
b=list(a[idx]['feature_idx'])
# Index the column names. 
# Features from backward fit
print("Features selected in bacward fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...   
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...   
## 3   -35.4992  13.9619  [-17.2329292677, -44.4178648308, -51.633177846...   
## 4    -33.463  12.4081  [-20.6415333292, -37.3247852146, -47.479302977...   
## 5   -33.1038  10.6156  [-20.2872309863, -34.6367078466, -45.931870352...   
## 6   -32.0638  10.0933  [-19.4463829372, -33.460638577, -42.726257249,...   
## 7   -30.7133  9.23881  [-19.4425181917, -31.1742902259, -40.531266671...   
## 8   -29.7432  9.84468  [-19.445277268, -30.0641187173, -40.2561247122...   
## 9   -29.0878  9.45027  [-19.3545569877, -30.094768669, -39.7506036377...   
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...   
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...   
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...   
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...   
## 
##                                    feature_idx  std_dev  std_err  
## 1                                        (12,)  18.9042  9.45212  
## 2                                     (10, 12)  16.1965  8.09826  
## 3                                  (10, 12, 7)  13.8576  6.92881  
## 4                               (12, 10, 4, 7)  12.3154  6.15772  
## 5                            (4, 7, 8, 10, 12)  10.5363  5.26816  
## 6                         (4, 7, 8, 9, 10, 12)  10.0179  5.00896  
## 7                      (1, 4, 7, 8, 9, 10, 12)  9.16981  4.58491  
## 8                  (1, 4, 7, 8, 9, 10, 11, 12)  9.77116  4.88558  
## 9               (0, 1, 4, 7, 8, 9, 10, 11, 12)  9.37969  4.68985  
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634  
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092  
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265  
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546  
## No of features= 9
## Features selected in backward fit
## Index([u'crimeRate', u'zone', u'NO2', u'distances', u'idxHighways', u'taxRate',
##        u'teacherRatio', u'color', u'status'],
##       dtype='object')

The table above shows the average score, the 5 fold CV errors, the features included at every step, the std. deviation and the std. error.

Backward fit in Python indicates that 10 features provide the best fit.

1.3c Sequential Floating Forward Selection (SFFS) – Python code

The Sequential Feature search also includes ‘floating’ variants, which conditionally include or exclude features after they have been excluded or included. The SFFS can conditionally re-include features which were excluded in a previous step, if doing so results in a better fit. This option will tend to give a better solution than plain SFS. These variants are included below.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression


df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

# Create the floating forward search
sffs = SFS(lr, 
          k_features=(1,13), 
          forward=True,  # Forward
          floating=True,  #Floating
          scoring='neg_mean_squared_error',
          cv=5)

# Fit a model
sffs = sffs.fit(X.as_matrix(), y.as_matrix())
a=sffs.get_metric_dict()
n=[]
o=[]
# Compute mean validation scores
for i in np.arange(1,13):
    n.append(-np.mean(a[i]['cv_scores'])) 
   
    
    
m=np.arange(1,13)


# Plot the cross validation score vs number of features
fig3=plt.plot(m,n)
fig3=plt.title('SFFS:Mean CV Scores vs No of features')
fig3.figure.savefig('fig3.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sffs.get_metric_dict(confidence_interval=0.90)).T)
# Get the index of the minimum CV score
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best forward floating fit and convert to list
b=list(a[idx]['feature_idx'])
print(b)

print("#################################################################################")
# Index the column names. 
# Features from forward fit
print("Features selected in forward fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...   
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...   
## 3   -34.1001    20.87  [-9.43012884381, -25.9584955394, -36.184188174...   
## 4   -33.7681  20.1638  [-8.86076528781, -28.650217633, -35.7246353855...   
## 5   -33.6392  20.5271  [-8.90807628524, -28.0684679108, -35.827463022...   
## 6   -33.6276  19.0859  [-9.549485942, -30.9724602876, -32.6689523347,...   
## 7   -32.1834  12.1001  [-17.9491036167, -39.6479234651, -45.470227740...   
## 8   -32.0908  11.8179  [-17.4389015788, -41.2453629843, -44.247557798...   
## 9   -31.0671  10.1581  [-17.2689542913, -37.4379370429, -41.366372300...   
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...   
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...   
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...   
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...   
## 
##                                    feature_idx  std_dev  std_err  
## 1                                        (12,)  18.9042  9.45212  
## 2                                     (10, 12)  16.1965  8.09826  
## 3                                  (10, 12, 5)  20.7142  10.3571  
## 4                               (10, 3, 12, 5)  20.0132  10.0066  
## 5                            (0, 10, 3, 12, 5)  20.3738  10.1869  
## 6                         (0, 3, 5, 7, 10, 12)  18.9433  9.47167  
## 7                      (0, 1, 2, 3, 7, 10, 12)  12.0097  6.00487  
## 8                   (0, 1, 2, 3, 7, 8, 10, 12)  11.7297  5.86484  
## 9                (0, 1, 2, 3, 7, 8, 9, 10, 12)  10.0822  5.04111  
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634  
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092  
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265  
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546  
## No of features= 9
## [0, 1, 2, 3, 7, 8, 9, 10, 12]
## #################################################################################
## Features selected in forward fit
## Index([u'crimeRate', u'zone', u'indus', u'chasRiver', u'distances',
##        u'idxHighways', u'taxRate', u'teacherRatio', u'status'],
##       dtype='object')

The table above shows the average score, the 5 fold CV errors, the features included at every step, the std. deviation and the std. error.

SFFS provides the best fit with 10 predictors

1.3d Sequential Floating Backward Selection (SFBS) – Python code

The SFBS is an extension of the SBS (Sequential Backward Selection). Here, features that are excluded at any stage can be conditionally re-included if the resulting feature set gives a better fit.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression


df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

sffs = SFS(lr, 
          k_features=(1,13), 
          forward=False, # Backward
          floating=True, # Floating
          scoring='neg_mean_squared_error',
          cv=5)

sffs = sffs.fit(X.as_matrix(), y.as_matrix())
a=sffs.get_metric_dict()
n=[]
o=[]
# Compute the mean cross validation score
for i in np.arange(1,13):
    n.append(-np.mean(a[i]['cv_scores']))  
    
m=np.arange(1,13)

fig4=plt.plot(m,n)
fig4=plt.title('SFBS: Mean CV Scores vs No of features')
fig4.figure.savefig('fig4.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sffs.get_metric_dict(confidence_interval=0.90)).T)

# Get the index of the minimum CV score
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best backward floating fit and convert to list
b=list(a[idx]['feature_idx'])
print(b)

print("#################################################################################")
# Index the column names. 
# Features from forward fit
print("Features selected in backward floating fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...   
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...   
## 3   -34.1001    20.87  [-9.43012884381, -25.9584955394, -36.184188174...   
## 4    -33.463  12.4081  [-20.6415333292, -37.3247852146, -47.479302977...   
## 5   -32.3699  11.2725  [-20.8771078371, -34.9825657934, -45.813447203...   
## 6   -31.6742  11.2458  [-20.3082500364, -33.2288990522, -45.535507868...   
## 7   -30.7133  9.23881  [-19.4425181917, -31.1742902259, -40.531266671...   
## 8   -29.7432  9.84468  [-19.445277268, -30.0641187173, -40.2561247122...   
## 9   -29.0878  9.45027  [-19.3545569877, -30.094768669, -39.7506036377...   
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...   
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...   
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...   
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...   
## 
##                                    feature_idx  std_dev  std_err  
## 1                                        (12,)  18.9042  9.45212  
## 2                                     (10, 12)  16.1965  8.09826  
## 3                                  (10, 12, 5)  20.7142  10.3571  
## 4                               (4, 10, 7, 12)  12.3154  6.15772  
## 5                            (12, 10, 4, 1, 7)  11.1883  5.59417  
## 6                        (4, 7, 8, 10, 11, 12)  11.1618  5.58088  
## 7                      (1, 4, 7, 8, 9, 10, 12)  9.16981  4.58491  
## 8                  (1, 4, 7, 8, 9, 10, 11, 12)  9.77116  4.88558  
## 9               (0, 1, 4, 7, 8, 9, 10, 11, 12)  9.37969  4.68985  
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634  
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092  
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265  
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546  
## No of features= 9
## [0, 1, 4, 7, 8, 9, 10, 11, 12]
## #################################################################################
## Features selected in backward floating fit
## Index([u'crimeRate', u'zone', u'NO2', u'distances', u'idxHighways', u'taxRate',
##        u'teacherRatio', u'color', u'status'],
##       dtype='object')

The table above shows the average score, the 5 fold CV errors, the features included at every step, the std. deviation and the std. error.

SFBS indicates that 10 features are needed for the best fit

1.4 Ridge regression

In Linear Regression the Residual Sum of Squares (RSS) is given as

RSS = \sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_jx_{ij})^{2}
Ridge regularization = \sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}x_{ij})^{2} + \lambda \sum_{j=1}^{p}\beta_{j}^{2}

where \lambda is the regularization or tuning parameter. Increasing \lambda increases the penalty on the coefficients, thus shrinking them. However, in Ridge Regression, features that do not influence the target variable will shrink closer to zero but never become exactly zero, except for very large values of \lambda.
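
A tiny numerical illustration of this shrinkage (numpy only, with made-up data and arbitrary lambda values), using the closed-form ridge solution \beta = (X^{T}X + \lambda I)^{-1}X^{T}y:

import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(100, 5)                                    # 5 standardized predictors
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.randn(100)

for lam in [0, 1, 10, 100, 1000]:
    # Closed-form ridge estimate for this value of lambda
    beta = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    print(lam, np.round(beta, 3))   # coefficients shrink towards, but not exactly to, zero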

Ridge regression in R requires the ‘glmnet’ package

1.4a Ridge Regression – R code

library(glmnet)
library(dplyr)
# Read the data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
#Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                            "distances","highways","tax","teacherRatio","color","status","cost")

# Set X and y as matrices
X=as.matrix(df1[,1:13])
y=df1$cost

# Fit a Ridge model
fitRidge <-glmnet(X,y,alpha=0)

#Plot the model where the coefficient shrinkage is plotted vs log lambda
plot(fitRidge,xvar="lambda",label=TRUE,main="Ridge regression coefficient shrinkage vs log lambda")

The plot below shows how the 13 coefficients for the 13 predictors vary as lambda is increased. The x-axis shows log(lambda). We can see that increasing lambda from 10^{2} to 10^{6} significantly shrinks the coefficients. If we draw a vertical line at a chosen value on the x-axis we can read off the values of the 13 coefficients, some of which will be close to zero

# Compute the cross validation error
cvRidge=cv.glmnet(X,y,alpha=0)

#Plot the cross validation error
plot(cvRidge, main="Ridge regression Cross Validation Error (10 fold)")

This gives the 10 fold Cross Validation Error with respect to log(lambda). As lambda increases the MSE increases

1.4b Ridge Regression – Python code

In Python the coefficient shrinkage can be plotted, much as in R, by fitting ridge models over a range of alpha values and plotting the resulting coefficient paths. This is included below

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)

# Scale the X_train and X_test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit a ridge regression with alpha=20
linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)

# Print the training R squared
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
# Print the test Rsquared
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

trainingRsquared=[]
testRsquared=[]
# Plot the effect of alpha on the test Rsquared
print('Ridge regression: effect of alpha regularization parameter\n')
# Choose a list of alpha values
for this_alpha in [0.001,.01,.1,0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
    # Compute training rsquared
    r2_train = linridge.score(X_train_scaled, y_train)
    # Compute test rsquared
    r2_test = linridge.score(X_test_scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    trainingRsquared.append(r2_train)
    testRsquared.append(r2_test)
    
# Create a dataframe
alpha=[0.001,.01,.1,0, 1, 10, 20, 50, 100, 1000]    
trainingRsquared=pd.DataFrame(trainingRsquared,index=alpha)
testRsquared=pd.DataFrame(testRsquared,index=alpha)

# Plot training and test R squared as a function of alpha
df3=pd.concat([trainingRsquared,testRsquared],axis=1)
df3.columns=['trainingRsquared','testRsquared']

fig5=df3.plot()
fig5=plt.title('Ridge training and test R-squared vs alpha')
fig5.figure.savefig('fig5.png', bbox_inches='tight')

# Plot the coefficient shrinkage by fitting ridge models over a range of alphas

from sklearn import linear_model
# #############################################################################
# Compute paths

n_alphas = 200
alphas = np.logspace(0, 8, n_alphas)

coefs = []
for a in alphas:
    ridge = linear_model.Ridge(alpha=a, fit_intercept=False)
    ridge.fit(X_train_scaled, y_train)
    coefs.append(ridge.coef_)

# #############################################################################
# Display results

ax = plt.gca()

fig6=ax.plot(alphas, coefs)
fig6=ax.set_xscale('log')
fig6=ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
fig6=plt.xlabel('alpha')
fig6=plt.ylabel('weights')
fig6=plt.title('Ridge coefficients as a function of the regularization')
fig6=plt.axis('tight')
plt.savefig('fig6.png', bbox_inches='tight')
## R-squared score (training): 0.620
## R-squared score (test): 0.438
## Number of non-zero features: 13
## Ridge regression: effect of alpha regularization parameter

The plot below shows the training and test R-squared as the tuning or regularization parameter ‘alpha’ is increased

The Python coefficient-shrinkage plot must be read from right to left, since the x-axis is reversed and alpha increases towards the left. As alpha increases the coefficients shrink towards 0.
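
Rather than reading the best alpha off the train/test curves by eye, scikit-learn's RidgeCV can choose it by cross-validation. A short sketch, assuming the X_train_scaled, X_test_scaled, y_train and y_test variables from the ridge code above:

import numpy as np
from sklearn.linear_model import RidgeCV

# Search a grid of alphas with 10-fold cross validation
alphas = np.logspace(-3, 3, 50)
ridgecv = RidgeCV(alphas=alphas, cv=10).fit(X_train_scaled, y_train)

print('Best alpha chosen by 10-fold CV:', ridgecv.alpha_)
print('Test R-squared at best alpha: {:.3f}'.format(ridgecv.score(X_test_scaled, y_test)))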

1.5 Lasso regularization

The Lasso is another form of regularization, also known as L1 regularization. Unlike Ridge Regression, where the coefficients of features that do not influence the target only tend towards zero, in lasso regularization these coefficients actually become 0. The general form of the Lasso is as follows

\sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_{j}x_{ij})^{2} + \lambda \sum_{j=1}^{p}|\beta_{j}|
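
The reason the lasso can drive coefficients exactly to zero, while ridge only shrinks them, is the soft-thresholding operation that appears in its coordinate-wise update. Here is a minimal, illustrative sketch of just that thresholding step (not a full solver such as the ones in glmnet or scikit-learn):

import numpy as np

def soft_threshold(z, lam):
    # Shrink z towards 0 by lam and set it exactly to 0 when |z| <= lam
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(soft_threshold(z, 1.0))
# [-2. -0.  0.  0.  3.] : entries with |z| <= lambda are set exactly to 0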

1.5a Lasso regularization – R code

library(glmnet)
library(dplyr)
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                            "distances","highways","tax","teacherRatio","color","status","cost")

# Set X and y as matrices
X=as.matrix(df1[,1:13])
y=df1$cost

# Fit the lasso model
fitLasso <- glmnet(X,y)
# Plot the coefficient shrinkage as a function of log(lambda)
plot(fitLasso,xvar="lambda",label=TRUE,main="Lasso regularization - Coefficient shrinkage vs log lambda")

The plot below shows that in L1 regularization the coefficients actually become zero with increasing lambda

# Compute the cross validation error (10 fold)
cvLasso=cv.glmnet(X,y,alpha=1)
# Plot the cross validation error
plot(cvLasso)

This gives the 10 fold cross validation MSE for the lasso model as a function of log(lambda)

1.5b Lasso regularization – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model

scaler = MinMaxScaler()
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linlasso = Lasso(alpha=0.1, max_iter = 10).fit(X_train_scaled, y_train)

print('Non-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
     .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_test_scaled, y_test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

for e in sorted (list(zip(list(X), linlasso.coef_)),
                key = lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))
        

print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

trainingRsquared=[]
testRsquared=[]
#for alpha in [0.01,0.05,0.1, 1, 2, 3, 5, 10, 20, 50]:
for alpha in [0.01,0.07,0.05, 0.1, 1,2, 3, 5, 10]:
    linlasso = Lasso(alpha, max_iter = 10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    trainingRsquared.append(r2_train)
    testRsquared.append(r2_test)
    
alpha=[0.01,0.07,0.05, 0.1, 1,2, 3, 5, 10]    
#alpha=[0.01,0.05,0.1, 1, 2, 3, 5, 10, 20, 50]
trainingRsquared=pd.DataFrame(trainingRsquared,index=alpha)
testRsquared=pd.DataFrame(testRsquared,index=alpha)

df3=pd.concat([trainingRsquared,testRsquared],axis=1)
df3.columns=['trainingRsquared','testRsquared']

fig7=df3.plot()
fig7=plt.title('LASSO training and test R-squared vs alpha')
fig7.figure.savefig('fig7.png', bbox_inches='tight')



## Non-zero features: 7
## R-squared score (training): 0.726
## R-squared score (test): 0.561
## 
## Features with non-zero weight (sorted by absolute magnitude):
##  status, -18.361
##  rooms, 18.232
##  teacherRatio, -8.628
##  taxRate, -2.045
##  color, 1.888
##  chasRiver, 1.670
##  distances, -0.529
## Lasso regression: effect of alpha regularization
## parameter on number of features kept in final model
## 
## Computing regularization path using the LARS ...
## .C:\Users\Ganesh\ANACON~1\lib\site-packages\sklearn\linear_model\coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
##   ConvergenceWarning)

The plot below gives the training and test R-squared as a function of alpha

1.5c Lasso coefficient shrinkage – Python code

To plot the coefficient shrinkage for the Lasso, the Least Angle Regression (LARS) path implementation in scikit-learn is used. This is shown below

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model
scaler = MinMaxScaler()
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


print("Computing regularization path using the LARS ...")
alphas, _, coefs = linear_model.lars_path(X_train_scaled, y_train, method='lasso', verbose=True)

xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]

fig8=plt.plot(xx, coefs.T)

ymin, ymax = plt.ylim()
fig8=plt.vlines(xx, ymin, ymax, linestyle='dashed')
fig8=plt.xlabel('|coef| / max|coef|')
fig8=plt.ylabel('Coefficients')
fig8=plt.title('LASSO Path - Coefficient Shrinkage vs L1')
fig8=plt.axis('tight')
plt.savefig('fig8.png', bbox_inches='tight')
This plot shows the coefficient shrinkage for the lasso.
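
Instead of sweeping alpha by hand as was done above, scikit-learn's LassoCV can pick it by cross-validation. A short sketch, assuming the X_train_scaled, X_test_scaled, y_train and y_test variables from the lasso code earlier:

import numpy as np
from sklearn.linear_model import LassoCV

# Choose alpha by 10-fold cross validation on the scaled training data
lassocv = LassoCV(cv=10, max_iter=10000).fit(X_train_scaled, y_train)

print('Best alpha chosen by 10-fold CV:', lassocv.alpha_)
print('Non-zero features:', int(np.sum(lassocv.coef_ != 0)))
print('Test R-squared: {:.3f}'.format(lassocv.score(X_test_scaled, y_test)))
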
This 3rd part of the series covers the main feature selection and regularization methods. I hope these posts serve as a quick and useful reference to ML code in both R and Python!
Stay tuned for further updates to this series!
Watch this space!

 


My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

Then felt I like some watcher of the skies 
When a new planet swims into his ken; 
Or like stout Cortez when with eagle eyes 
He star’d at the Pacific—and all his men 
Look’d at each other with a wild surmise— 
Silent, upon a peak in Darien. 
On First Looking into Chapman’s Homer by John Keats

The above excerpt from John Keats’ poem captures the exhilaration that one experiences when discovering something for the first time. It also summarizes, to some extent, my own enjoyment while pursuing Data Science, Machine Learning and the like.

I decided to write this post, as occasionally youngsters approach me and ask me where they should start their adventure in Data Science & Machine Learning. There are other times, when the ‘not-so-youngsters’ want to know what their next step should be after having done some courses. This post includes my travels through the domains of Data Science, Machine Learning, Deep Learning and (soon to be done AI).

By no means am I an authority in this field, which is ever-widening and almost bottomless, yet I would like to share some of my experiences in this fascinating domain. I include a short review of the courses I have done below. I also include alternative routes through courses which I did not do, but which are probably equally good. Feel free to pick and choose any course or set of courses. Alternatively, you may prefer to read books or attend bricks-and-mortar classes. In any case, I hope the list below will provide you with some overall direction.

All my learning in the above domains has come from MOOCs, and I restrict myself to the top 3 MOOCs (or, in my opinion, ‘the original MOOCs’), namely Coursera, edX and Udacity, though I may throw in some courses from other online sites if they are only available there. I would recommend these 3 MOOCs over the numerous other online courses and also over face-to-face classroom courses for the following reasons. These MOOCs

  • Are offered by world-class colleges, with lectures delivered by top-class Professors who have great depth of knowledge and a wealth of experience
  • The Professors, besides delivering quality content, also point out important tips, tricks and traps
  • You can revisit lectures in online courses anytime to refresh your memory
  • Lectures are usually short between 8 -15 mins (Personally, my attention span is around 15-20 mins at a time!)

Here is a fair warning and something quite obvious. No amount of courses, lectures or books will help if you don’t put it to use through some language like Octave, R or Python.

The journey
My trip through Data Science and Machine Learning started with an off-chance remark, about 3 years ago, from an old friend of mine who spoke to me about having done a few courses at Coursera, and how much he liked them. He further suggested that I should try them. This was the final push which set me sailing into this vast domain.

I have included the list of the courses I have done over the past 5 years (37+ certifications completed and another 9 audited, i.e. listened to only, without doing the assignments). For each of the courses I have included a short review, whether I think the course is mandatory, the language the course is based on, and finally whether I have done the course myself. I have also included alternative courses, which I may not have done, but which I think are equally good. Finally, I suggest some courses which I have heard of and which are very good and worth taking.

1. Machine Learning, Stanford, Prof Andrew Ng, Coursera
(Requirement: Mandatory, Language:Octave,Status:Completed)
This course provides an excellent foundation to build your Machine Learning citadel on. The course covers the mathematical details of linear, logistic and multivariate regression. There is also good coverage of topics like Neural Networks, SVMs, Anomaly Detection, underfitting, overfitting, regularization etc. Prof Andrew Ng presents the material in a very lucid manner. It is a great course to start with. It would be a good idea to brush up on some basics of linear algebra, matrices and a little bit of calculus, specifically computing the local maxima/minima. You should be able to take this course even if you don’t know Octave, as the Prof goes over the key aspects of the language.

2. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online (Requirement: Mandatory, Language: R, Status: Completed)
The course includes linear and polynomial regression, logistic regression. Details also include cross-validation and the bootstrap methods, how to do model selection and regularization (ridge and lasso). It also touches on non-linear models, generalized additive models, boosting and SVMs. Some unsupervised learning methods are  also discussed. The 2 Professors take turns in delivering lectures with a slight touch of humor.

3a. Data Science Specialization: Prof Roger Peng, Prof Brian Caffo & Prof Jeff Leek, Johns Hopkins University (Requirement: Option A, Language: R, Status: Completed)
This is a comprehensive 10 module specialization based on R. This Specialization gives a very broad overview of Data Science and Machine Learning. The modules cover R programming, Statistical Inference, Practical Machine Learning, how to build R products and R packages and finally has a very good Capstone project on NLP

3b. Applied Data Science with Python Specialization: University of Michigan (Requirement: Option B, Language: Python, Status: Not done)
In this specialization I only did the Applied Machine Learning in Python course (Prof Kevyn Collins-Thompson). This is a very good course that covers a lot of Machine Learning algorithms (linear, logistic, ridge and lasso regression, kNN, SVMs etc.). Also included are confusion matrices, ROC curves etc. This is based on Python’s scikit-learn

3c. Machine Learning Specialization, University Of Washington (Requirement:Option C, Language:Python, Status : Not completed). This appears to be a very good Specialization in Python

4. Statistics with R Specialization, Duke University (Requirement: Useful and a must know, Language R, Status:Not Completed)
I audited (listened only) to the following 2 modules from this Specialization.
a.Inferential Statistics
b.Linear Regression and Modeling
Both these courses are taught by Prof Mine Çetinkaya-Rundel, who delivers her lessons with extraordinary clarity. Her lectures are filled with many examples which she walks you through in great detail

5.Bayesian Statistics: From Concept to Data Analysis: Univ of California, Santa Cruz (Requirement: Optional, Language : R, Status:Completed)
This is an interesting course and provides an alternative point of view to the frequentist approach

6. Data Science and Engineering with Spark, University of California, Berkeley, Prof Antony Joseph, Prof Ameet Talwalkar, Prof Jon Bates
(Requirement: Mandatory for Big Data, Status: Completed, Language: pySpark)
This specialization contains 3 modules
a.Introduction to Apache Spark
b.Distributed Machine Learning with Apache Spark
c.Big Data Analysis with Apache Spark

This is an excellent course for those who want to make an entry into Distributed Machine Learning. The exercises are fairly challenging and your code will predominantly be made of map/reduce and lambda operations as you process data that is distributed across Spark RDDs. I really liked the part where the Prof shows how matrix multiplication (which underlies much of Machine Learning), of the order of O(nd^2+d^3) on a single machine, is reduced to O(nd^2) by taking outer products on data which is distributed.

7. Deep Learning Specialization, Prof Andrew Ng, Younes Bensouda Mourri, Kian Katanforoosh (Requirement: Mandatory, Language: Python, Tensorflow, Status: Completed)

This specialization has 5 modules which start from the fundamentals of Neural Networks, their derivation and vectorized Python implementation. The specialization also covers regularization, optimization techniques, mini-batch gradient descent, batch normalization, Convolutional Neural Networks, Recurrent Neural Networks and LSTMs applied to a wide variety of real-world problems

The modules are
a. Neural Networks and Deep Learning
In this course Prof Andrew Ng explains differential calculus, linear algebra and vectorized Python implementations of Deep Learning algorithms. The derivation for back-propagation is done and then the Prof shows how to compute a multi-layered DL network
b.Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Deep Neural Networks can be very flexible and come with a lot of knobs (hyper-parameters) to tune. In this module, Prof Andrew Ng shows a systematic way to tune hyperparameters and by how much one should tune them. The course also covers regularization (L1, L2, dropout), gradient descent optimization and batch normalization methods. The visualizations used to explain the momentum method, RMSprop, Adam, LR decay and batch normalization are really powerful and serve to clarify the concepts. As an added bonus, the module also includes a great introduction to Tensorflow.
c.Structuring Machine Learning Projects
A very good module with useful tips, tricks and traps that need to be considered while working on Machine Learning and Deep Learning projects
d. Convolutional Neural Networks
This domain has a lot of really cool ideas, where images, represented as 3D volumes, are compressed and stretched longitudinally before a multi-layered deep learning network is applied to this thin slice for classification, detection etc. The Prof provides a glimpse into this fascinating world of image classification, detection and neural style transfer with frameworks like Keras and Tensorflow.
e. Sequence Models
This module covers, in good detail, concepts like RNNs, GRUs, LSTMs, word embeddings, beam search and the attention model.

8. Neural Networks for Machine Learning, Prof Geoffrey Hinton,University of Toronto
(Requirement: Mandatory, Language: Octave, Status: Completed)
This is a broad course which starts from the basics of Perceptrons and goes all the way to Boltzmann Machines, RNNs, CNNs, LSTMs etc. The course also covers regularisation, learning rate decay, the momentum method etc

9. Probabilistic Graphical Models, Stanford, Prof Daphne Koller (Language: Octave, Status: Partially completed)
This has 3 courses
a.Probabilistic Graphical Models 1: Representation – Done
b.Probabilistic Graphical Models 2: Inference – To do
c.Probabilistic Graphical Models 3: Learning – To do
This course discusses how a system, which can be represented as a complex interaction of probability distributions, will behave. This is probably the toughest course I did. I did manage to get through the 1st module, and while I felt that I grasped a few things, I did not wholly understand the import of this. However, I feel this is an important domain and I will definitely revisit it in future

10. Reinforcement Learning Specialization: University of Alberta, Prof Adam White and Prof Martha White
(Requirement: Very important, Language: Python, Status: Partially Completed)
This is a set of 4 courses. I did the first 2 of the 4. Reinforcement Learning appears deceptively simple, but it is anything but simple. Definitely a very critical area to learn.

a.Fundamentals of Reinforcement Learning: This course discusses Markov models, value functions and Bellman equations and dynamic programming.
b. Sample-based Learning Methods: This course touches on Monte Carlo methods, Temporal Difference methods, Q-Learning etc.

Reinforcement Learning is a must-have in your AI arsenal.

11. Tensorflow in Practice Specialization – Prof Laurence Moroney – Deep Learning.AI
(Requirement: Important, Language: Python, Status: Completed)
This is a good course, but definitely do the Deep Learning Specialization by Prof Andrew Ng
There are 4 courses in this Specialization. I completed all 4 courses. They are fairly straightforward
a. Introduction to TensorFlow – This course introduces you to Tensorflow and image recognition with a brute-force method
b. Convolutional Neural Networks in Tensorflow – This course touches on how to build a CNN, image augmentation, transfer learning and multi-class classification
c. Natural Language Processing in Tensorflow – Word embeddings, sentiment analysis, LSTMs, RNNs are discussed.
d. Sequences, time series and prediction – This course discusses using RNNs for time series, auto correlation

12. Natural Language Processing Specialization – Prof Younes Bensouda Mourri and Lukasz Kaiser, DeepLearning.AI
(Requirement: Very Important, Language: Python, Status: Partially Completed)
This is the latest specialization from Deep Learning.AI. I have completed the first 2 courses
a. Natural Language Processing with Classification and Vector Spaces – The first course deals with sentiment analysis with Naive Bayes, vector space models, capturing dependencies using PCA etc
b. Natural Language Processing with Probabilistic Models – In this course techniques for auto-correction, Markov models and the Viterbi algorithm for Parts of Speech tagging, auto-completion and word embeddings are discussed.

13. Mining Massive Data Sets, Prof Jure Leskovec, Prof Anand Rajaraman and Prof Jeff Ullman, Stanford Online (Status: Partially done)
I did quickly audit this course a year back, when it used to be on Coursera. It now seems to have moved to Stanford Online. This is a very good course that discusses key concepts of mining Big Data of the order of a few petabytes

14. Introduction to Artificial Intelligence, Prof Sebastian Thrun & Prof Peter Norvig, Udacity
This is a really good course. I have started on this course a couple of times and somehow gave up. Will revisit it to complete in future. Quite extensive in its coverage. Touches on BFS, DFS, A-Star, PGMs, Machine Learning etc.

15.Deep Learning (with TensorFlow), Vincent Vanhoucke, Principal Scientist at Google Brain.
Got started on this one and abandoned it some time back. It is on my to-do list though

My learning journey is based on Lao Tzu’s dictum of ‘A good traveler has no fixed plans and is not intent on arriving’. You could have a goal and try to plan your courses accordingly.
And so my journey continues…

I hope you find this list useful.
Have a great journey ahead!!!

Close encounters with the future


Published in Telecom Asia, Oct 22,2013 – Close encounters with the future

Where a calculator on the ENIAC is equipped with 18,000 vacuum tubes and weighs 30 tons, computers in the future may have only 1,000 vacuum tubes and perhaps weigh 1.5 tons.—POPULAR MECHANICS, 1949

Introduction: Ray Kurzweil in his non-fiction book “The Singularity is near – When humans transcend biology” predicts that by the year 2045 the Singularity will allow humans to transcend our ‘frail biological bodies’ and our ‘petty, derivative and circumscribed brains’ . Specifically the book claims “that there will be a ‘technological singularity’ in the year 2045, a point where progress is so rapid it outstrips humans’ ability to comprehend it. Irreversibly transformed, people will augment their minds and bodies with genetic alterations, nanotechnology, and artificial intelligence”.

He believes that advances in robotics, AI, nanotechnology and genetics will grow exponentially and will lead us into a future realm of intelligence that will far exceed biological intelligence. This explosion will be the result of “accelerating returns from significant advances in technology”.

Futurescape

Here is a look at some of the more fascinating key trends in technology. You can decide whether we are heading to Singularity or not.

Autonomous Vehicles (AVs): Self-driving cars have moved from the realm of science fiction to reality in recent times. Google’s autonomous cars have already driven around half a million miles. All the major car manufacturers of the world, from BMW, Mercedes, Toyota and Nissan to Ford and GM, are coming out with their own versions of autonomous cars. These cars are equipped with Adaptive Cruise Control and Collision Avoidance technologies and are already taking away control from drivers. Moreover, AVs alert drivers if their attention strays from the road ahead for too long. Autonomous Vehicles work with the help of Vehicular Communication Technology.

Vehicular Communication along with the Intelligent Transport Systems (ITS) achieves safety by enabling communication between vehicles, people and roads. Vehicle-to-vehicle communications are the fundamental building block of autonomous, self-driving cars. It enables the exchange of data between vehicles and allows automobiles to “see” and adapt to driving obstacles more completely, preventing accidents besides resulting in more efficient driving.

Smart Assistants: From the defeat of Kasparov in chess by IBM’s Deep Blue in 1997, and then subsequently to the resounding victory of IBM’s Watson in Jeopardy, capable of understanding natural human language, to the more prevalent Apple intelligent assistant Siri, Artificially Intelligent (AI) systems have come a long way. The newest trend in this area is Smart Assistants. Robots are currently analyzing documents, filling prescriptions, and handling other tasks that were once exclusively done by humans. Smart Assistants are already taking over the tasks of BPO operators, paralegals, store clerks and babysitters. Robots, in many ways, are not only smarter than humans, but also do not get easily bored.

Intelligent homes and intelligent offices: Rapid advances in technology will be closer to the home, both literally and figuratively. The future home will have the ability to detect the presence of people, pets, smoke and changes to humidity, moisture, lighting and temperature. Smart devices will monitor the environment and take appropriate steps to save energy, improve safety and enhance the security of homes. Devices will start learning your habits and enhance your comfort and convenience. Everything from thermostats, fire detectors and washing machines to refrigerators will be equipped with electronics capable of adapting to the environment. All gadgets at home will be accessible through laptops, tablets or smartphones from anywhere. We will be able to monitor all aspects of our intelligent home from anywhere.

Smart devices will also make major inroads into offices leading to the birth of intelligent offices where the lighting, heating, cooling will be based on the presence of people in the offices. This will result in an enormous savings in energy. The advances in intelligent homes and intelligent offices will be in the greater context of the Smart Grid.

Swarms of drones: Contrary to the use of weaponized drones for unmanned aerial surveys of enemy territory, we will soon have commercial drones. Drones will start being used for civilian purposes. The most compelling aspect of drones these days is the fact that they can be easily manufactured in large quantities, are cheap and can perform complex tasks either singly or collectively. Remotely controlled drones can perform hundreds of civilian jobs, including traffic monitoring, aerial surveying, oil pipeline inspections and monitoring of crop conditions. Drones are also being employed for the conservation of wildlife. In the wilderness of Africa, drones are already helping in providing aerial footage of the landscape, tracking poachers and also in herding elephants. However, before drones become a common sight, it is necessary to ensure that appropriate laws are made for maintaining the safety and security of civilians. This is likely to happen in the US in 2015, when the Federal Aviation Administration (FAA) will come up with rules to safely integrate drones into the American skies.

MOOC (Massive Online Open Course): The concept of MOOC, or the ‘Massive Open Online Course’ from top colleges, though just a few years old, is already taking the world by storm. Coursera, edX and Udacity are the top 3 MOOCs besides many others and offer a variety of courses on technology, philosophy, sociology, computer science etc.  As more courses are available online, the requirements of having a uniform start and end date will diminish gradually. The availability of course lectures at all times and through all devices, namely the laptop, tablet or smartphone, will result in large scale adoption by students of all ages.

Contrary to regimented classes MOOCs now allow students to take classes at their own pace. It is likely that some students will breeze through an entire semester worth of classes in a few weeks. It is also likely that a few students will graduate in 4 years with more than a couple of degrees. MOOCs are a natural development considering that the world is going to be more knowledge driven where there will be the need for experts with a diverse set of in-depth skills. Here is an interesting article in WSJ “What College will be like in 2023

3D Printing: This is another technology that is bound to become ubiquitous in our future. 3D printers will revolutionize manufacturing in ways we could never imagine. A 3-D printer is similar to a hot-glue gun attached to a robotic arm. A 3-D printer creates an object by stacking one layer of material, typically plastic or metal, on top of another.  3D printers have been used for making everything from prosthetic limbs, phone cases, lamps all the way to a NASA funded 3D pizza. Here is a great article in New York Times “Dinner is Printed” It is likely that a 3D printer would be indispensable to our future homes much like the refrigerator and microwave.

Artificial sense organs: A recent news item in Science 2.0, “The Future touch sensitive prosthetic limbs”, discusses the invention of a prosthetic limb that can actually provide the sense of touch by stimulating the regions of the brain that deal with the sense of touch. The researchers identified the neural activity that occurs when grasping or feeling an object and successfully induced these patterns in the brain. Two parallel efforts are underway to understand how the human brain works. They are “The Human Brain Project”, which has 130 members of the European Union, and Obama’s BRAIN project. Both these projects attempt ‘to give us a deeper and more meaningful understanding of how the human brain operates’. Possibilities as in the movies ‘Avatar’ or ‘Terminator’ may not be far away.

The Others: Besides the above, technologies like Big Data, Cloud Computing, the Semantic Web, the Internet of Things and the Smart Grid will also swamp us in the future, and much has already been said about them.

Conclusion: The above sets of technologies represent seismic shifts and are bound to explode in our future in a million ways.

Given the advances in bionic limbs, machine-intelligent AI systems, MOOCs and Autonomous Vehicles, are we on target for the Singularity?

I wouldn’t be surprised at all!


The moving edge of computing

Published in The Hindu – 30 Sep 2012 as “Three computing technologies that will power the world”

“The moving edge of computing computes and having computed moves on…” We could thus rephrase the Rubaiyat of Omar Khayyam’s “The moving hand…” Computing technology has really advanced by leaps and bounds. We are now in a new era of computing. We are in the midst of “intelligent and cognitive” computing.

From the initial days of number crunching with languages like FORTRAN, to the procedural methodology of Pascal or C, and later the object oriented paradigm of C++ and Java, we have come a long way. In this age of information overload, technologies that can just solve problems through steps and procedures are no longer adequate. We need technology to detect complex patterns and trends, understand nuances in human language and automatically resolve problems. In this new era of computing the following 3 technologies are furthering the frontiers of computing technology.

Predictive Analytics

By 2016, 130 exabytes (130 * 2^60 bytes) of data will rip through the internet. The number of mobile devices will exceed the human population this year, 2012, and by 2016 the number of connected devices will touch almost 10 billion. The devices connected to the net will range from mobiles, laptops, tablets and sensors to the millions of devices based on the “internet of things”. All these devices will constantly spew data onto the internet. A hot and happening trend in computing is the ability to make business and strategic decisions by determining patterns, trends and outliers among mountains of data. Predictive analytics will be a key discipline in our future and experts in it will be much sought after. Predictive analytics uses statistical methods to mine intelligence, information and patterns from structured, unstructured and streaming data. It will be applied across many domains, from banking, insurance, retail and telecom to energy. There are also applications in energy grids and water management, besides determining user sentiment by mining data from social networks.

Cognitive Computing

The most famous technological product in the domain of cognitive computing is IBM’s supercomputer Watson. IBM’s Watson is an artificial intelligence computer system capable of answering questions posed in natural language. IBM’s supercomputer Watson is best known for successfully trouncing a national champion in the popular US TV quiz competition, Jeopardy. What makes this victory more astonishing is that IBM’s Watson had to successfully decipher the nuances of natural language and pick the correct answer.  Following the success at Jeopardy, IBM’s Watson supercomputer has now  been employed by a leading medical insurance firm in US to diagnose medical illnesses and to recommend treatment options for patients. Watson will be able to analyze 1 million books, or roughly 200 million pages of information. The other equally well known mobile app is Siri the voice recognition app on the iPhone. The earlier avatar of cognitive computing was expert systems based on Artificial Intelligence. These expert systems were inference engines that were based on knowledge rules. The most famous among the expert systems were “Dendral” and “Mycin”. We appear to be on the cusp of tremendous advancement in cognitive computing based on the success of IBM’s Watson.

Autonomic Computing

This is another computing trend that will become prevalent in the networks of tomorrow. Autonomic computing refers to the self-managing characteristics of a network. Typically it signifies the ability of a network to self-heal in the event of failures or faults. Autonomic network can quickly localize and isolate faults in the network while keeping other parts of the network unaffected. Besides these networks can quickly correct and heal the faulty hardware without human intervention. Autonomic networks are typical in smart grids where a fault can be quickly isolated and the network healed without resulting in a major outage in the electrical grid.

These are truly exciting times in computing as we move towards true intelligence!
