GenAI: The Science of Attention, Transformers

“Each of us is on our own trajectory – steered by our genes and our experiences – and as a result every brain has a different internal life. Brains are as unique as snowflakes.”

David Eagleman

Introduction

The rapidly expanding wavefront of Generative AI (Gen AI), in the last couple of years, can be largely attributed to a seminal paper “Attention is all you need” by Google. This paper by Google researchers, was a landmark in the field of AI and led to an explosion of ideas and spawned an entire industry based on its theme. The paper introduces 2 key concepts a) Attention mechanism the b) Transformer Architecture. In this post, I distil the essence of Attention and Transformers.

Transformers were originally invented for Natural Language Processing (NLP) tasks like language translation, classification, sentiment analysis, summarisation, chat sessions etc. However, it led to its adaptation to languages, voice, music, images, videos etc. Prior to the advent of transformers, Natural Language Processing (NLP) was largely done through Recurrent Neural Networks (RNNs) . The problem with encoder-decoder based RNNs is that it had a fixed length for an internal-hidden state, which stored the information, for translation or other NLP tasks. Clearly, it was difficult to capture all the relevant information in this fixed length internal-hidden state. A single, final hidden state had to capture all information from the input sequence enabling it to generate the output sequence for translation and other tasks. There were some enhancements to address the shortcomings of RNNs with approaches such as Long Short-term Memory(LSTM), Gated Recurrent Unit (GRU) etc., but by and large the performance of these NLP models fell short of being reliable and consistent. This shortcoming was addressed in the paper by Bahdanau et al in the paper ‘Neural machine translation by jointly learning to align and translate‘, which discussed how ‘attention’ can be used to identify which words, align to which other words in its neighbourhood, which is computed as context vector. It implemented a ‘mechanism of attention’ by enabling the decoder to decide which parts of the sentence it needs to pay attention to, thus relieving the encoder to encode all information of the sentence into a single internal-hidden state

The attention-based transformer architecture in the paper ‘Attention is all you need‘ took its inspiration from the above Bahdanau paper and eventually evolved into the current Large Language Models (LLMs). The transformer architecture based on the attention mechanism was able to effectively address the shortcomings of the earlier RNNs. The architecture of the LLM is based on 2 key principles

a. An attention mechanism which determines the relationships between words in a sequence. It identifies how each word relates to others words in the sequence

b. A feed-forward neural network that takes the output of the attention module and enriches the relationships between the words

A final layer using softmax can predict the next word in a given sequence

LLM’s are based on the Transformer architecture. LLMs like ChatGPT, Claude, Cohere, Llama etc., typically go through 2 stages a) Pre-training b) Fine tuning

During pre-training the LLM is trained on a large corpus of unstructured text from the internet, wikipedia, arXiv, stack overflow etc. The pre-training helps the LLMs in general language understanding, enabling LLMs to learn grammar, syntax, context etc. This is followed by a fine-tuning phase where the language models is trained for specific domain or task using a smaller curated and annotated dataset of input, output pairs. This adjusts the weights of the LLM to handle the specific task in a particular domain. This may be further enhanced based on Reinforcement Learning with Human Feedback (RLHF) with reward-penalty for a given task during training. (In many ways, we humans also go through the stages of pre-training and fine-tuning in my opinion. As David Eagleman states above, we all come with a genetic blueprint based on millions of years of evolution of responses to triggers. During our early formative years this genetic DNA will create certain neural pathways in the brain akin to pre-training. Further from 2-5 years, through a couple of years of fine-tuning we learn a lot more – languages, recognition, emotion etc. This does simplify things to an extent but still I think to a large extent it holds)

Clearly, our brain is not only much more complex but also uses a minuscule energy about 60W to compute complex tasks, which is roughly equivalent to a light bulb. While for e.g. training GPT-3 which has 175 billion parameters, consumes an estimated 1287 MWH, which is roughly equivalent the consumption of an average US household for 120 years (Ref: https://adasci.org/how-much-energy-do-llms-consume-unveiling-the-power-behind-ai/?ts=1734961973)

NLP is based on the fact that human language is some ordered sequence of words. Moreover, words in a language are repetitive and thus inherently categorical. Hence, we need to use a technique for handling these categorical words for e.g. One-Hot-Encoding (OHE). But since the vocabulary of languages is extremely large, using OHE would become unmanageable. There are several other more efficient encoding methods available. Large Language Models (LLMs), which are the backbone of GenAI are trained on a large corpus of text spanning the internet, wikipedia, and other domains. The text is first converted into a numerical form through a process called tokenisation, where the words, subwords are assigned a numerical value based on some scheme. Tokenisation, can be at the character level, sub-word level, word level, sentence or even paragraph level. The choice of encoding is trade-off between vocabulary size vs sequence or context length. For character level encoding, the vocabulary will be around ~36 including letters, punctuation etc., but the sequences generated for sentences with this method will be very long. While word encodings will have a large vocabulary, an entire sentences can be captured in a shorter sequences. The encodings typically used are Byte Pair Encoding (BPE) from OpenAI, WordPiece or Sentence encoding. The sentences are first converted to tokens.

The tokens are then converted into embedding vectors which can 16, 32 or 128 real-valued dimensions. The embedding vectors convert the tokens into a multi-dimensional continuous space and capture the semantic meaning of the tokens as they get trained. The embeddings assigned, do not inherently capture the semantic meaning of words fully. But in a rough sort of way. For e.g. “I sat on the bank of a river” and “I deposited money in a bank”, the word bank may have the same embedding. But as the model is trained through the transformer with sequences of text passing through the attention module, the embeddings are updated with contextual information. So for e.g. in the 1st sentence “bank” will be associated with the word “river” and in the 2nd sentence the attention module will also capture the context of the “bank” and associate it with the word “money”

A transformer is well suited for predicting the next word in a given sequence. This is called a auto-regressive decoder-only model. The sequence of steps a enable a Transformer to be capable of predicting the next word in a given sequence is based on the following steps

a) Training on a large corpus of text from internet, wikipedia, books, and research repositories like arXiv etc

The text are tokenised based on one of the tokenisation schemes mentioned above like BPE, Wordpiece etc. to convert the words into numerical values

The tokens are then converted into multi-dimensional real-valued embedding vectors. The embeddings are vectors which through multiple iterations capture richer meaning context-aware meaning of sentences

The Attention module determines the affinity each word has to the other words in the sentence. This affinity can be captured over longer sentence structures too and this is based on the context (sequence) length depending on the size of the model.

The weights from the output of the Attention module then go to a simple 2 layer Feed Forward Neural Network (FFN) which tries to predict the next word in a sentence. For this each sentence is taken as input with the target being the same sentence shifted by one place.

For e,g,

Input: Mary had a little lamb

Target: had a little lamb <end>

So in a sentence w1 , w2, w3, … , wn the FFN will use

w1 to predict w2

w1 , w2 to predict w3 and so on
During back propagation, the error between the predicted word and the actual target word is calculated and propagated backwards through the network updating the weights and biases of the NN. The FFN uses tries to minimise the cross-entropy or log loss which computes the difference between the predicted probabilities and target values.

Attention module

For e.g. if we had the sentence “The quick brown fox jumped over the lazy dog”, Attention is computed as follows

Each word in the above sentence is tokenised and represented as dense vector. The transformer architecture uses 3 weight matrices call Wq , Wk, Wv called the Query Weight, Key Weight and Value weight matrices which are learnable matrices. The embedding vectors of the sentence are multiplied with these Wq, Wk, Wv matrices to give Q (Query), K(Key) and V (Value) vectors.

The dot product of the Query vector with all the Key vectors is performed. Since these are vectors, the dot product will determine the similarity or alignment, the query ‘The’ has for the each of the Keys. This is the fundamental concept of of the Attention module. This is because in a multi-dimensional vector space, vectors which are closer together will give a high dot product. Hence, the dot product between the Query and all the Keys gives the affinity the Query has to all other Keys. This is computed as the Attention Score.

For e,g the above process could show that quick and brown attend to the fox, and lazy attends to the dog – and they have relatively high Attention Scores compared to the rest. In addition the Attention operation may also determine that there is a relation between fox and dog in the sentence.

These affinities show up over several iterations through batches of sentences as Wq, WK, Wv are learnable parameters. The relationship learned is depicted below

Next the values are normalised using the Softmax function since this will result in a mean of 0 and a variance of 1. This will give normalised attention scores

Causal attention. Since future words cannot affect the earlier words these values are made -Infinity so when we perform Softmax we get the value 0 for these values

Self-Attention Mechanism enables the model to evaluate the importance of tokens relative to each other. The self-attention mechanism is written as

Attention(Q,K,V) = softmax(\frac{Q.K^{T}}{sqrt(d_{K})})V

where Q, K, V are Query, Key and Value vectors and dK is the dimensionality of the Key vectors. \sqrt{d_{K}} scales the dot product so that the dot product values are not overly large

where the Scaled Attention score = (\frac{Q.K^{T}}{sqrt(d_{K})})

The Attention weights = softmax(Scaled Attention score)

Attention Output = \sum Attention Weights * V_{j}

This computes a context-aware representation of each token with respect to all other tokens

Feed Forward Network (FFN)

In the case of training a language model the fact that language is sequential enables the model to be trained on the language itself. This is done by training the model on large corpus of text, where the language learns to predict the next words in the sequence. So for example in the sentence

Feedforward Network (FFN) comprises two linear transformations separated by a non-linearity, typically modeled

with the first layer transformation as

\sigma(xW_{1}+b_{1})

and the second layer transformation is

FFN(x)=\sigma(xW_{1}+b_{1})W_{2} + b_{2}
where W_{1} and W_{2} are the weight matrices, and b_{1} and b_{2} are the biases

where W_{1} = d_{model} x d_{hidden} and

W_{2} = d_{hidden} x d_{model} and d_{hidden} is usually 4 times the dimesnion of d_{hidden}

\sigma is the activation function which can be ReLU, GELU or SwiGLU

Input to the FFN

The Decoder receives the output of the Self Attention module to the FFN network. This output from the Attention module is context-aware with the words in the input sequence having different affinities to other words in the sequence

The Target of the FFN is the input sequence shifted by one word

The output from the Attention head and the layer normalization

Normed Output = LayerNorm(Input+MultiHeadOutput)

In essence the Decoder only architecture can be boiled down to the following main modules

  1. Tokenization – The input text is split either on characters, subwords, words, to convert the text into numbers
  2. Vector Embedding – The numerical tokens are then converted into Dense vectors
  3. Positional Embedding – The position order of the text sequence is encoded using the positional embedding. This positional embedding is added to the vector embedding
    • Input embedding = Vector embedding + Positional embedding
  4. Attention module – The attention module computes the affinity the different words have for other words in its vicinity. This is done through the the use of 3 weight matrices W_{Q}, W_{K}, W_{V}. By multiplyting these matrices with the input vectors we get Q,K and V vectors. The attention is computed as
    • Attention(Q,K,V)=softmax(\frac{Q.K^{T}}{\sqrt{d_{K}}})V
  5. For the decoder, attention is masked to prevent the model from looking at future tokens during training also known as causal attention mentioned above
  6. The output pf the Attention module is passed to a 2 layer FFN which uses GeLU activation with Adam optimszation. This involves the following 2 steps
    • Computing the cross-entropy (loss) using the predicted words and the actual labels
    • Backpropagting the error through all the layers and updating the layer weights, W_{Q}, W_{K}, W_{V}
  7. If the size of the FFN’s output is the vocabulary size then we can use
    P(next word|context)=softmax(FFN output)
    If the size of the model output is not the vocabulary size then the a final linear layer embeds the output to the size of the dictionary. This maps the model’s hidden states to the vocabulary size enabling the predicting of the next word from the vocabulary
  8. Next word prediction : The next word prediction is done by applying softmax on the output of the FFN layer (logits) to compute the probability for the vocabulary
    • P(next word∣context)=softmax(Logits)
  9. After computing the probability the model selects the next word based on one of many options – to either choose the most probable word or on some other algorithm

The above sequence of steps is a bare-bones attention and transformer module. In of itself it can achieve little as the transformer module will have to contend with vanishing or exploding gradient issues. It needs additional bells and whistles to make it work effectively

Additional layers to the above architecture

a) Residual Connection and Layer Normalisation (Add + Norm)

i) Residual, skip connections

Residual connection or skip connections are added the input of each layer to the output to enable the gradients to propagate effectively. This is based on the paper ‘Deep Residual Learning for Image Recognition” from Microsoft

Residual connections also known as skip or shortcut connections are performed by adding the input of layer to the output of the layer as a shortcut. This helps in preventing the vanishing gradient, because of the gradients become progressively smaller as they pass through successive layers.

ii) Layer normalisation

In addition layer normalisation is done to stabilise the activation across a single feature to have 0 mean and a variance of 1 by computing

Mean and variance calculation

\mu = \frac{1}{D}\sum_{i=1}^{D}x_{i} , \sigma ^{2} = \frac{1}{D} \sum_{i=1}^{D} (x_{i}-\mu)^{2}

Normalization

\hat{x_{i}} =\frac{x_{i-\mu}}{\sqrt{\sigma^{2}-\epsilon}}

Layer normalization introduces learnable parameters using the equation

y_{i} = \gamma \hat{x_{i}} +\beta

This can be written as
ResidualOutput=Input+Output of Attention/FFN

The above statement mentions that the Input layer to the Attention /FFN module is added to the output to mitigate the vanishing gradient problem

NormedOutput=LayerNorm(Residual Output)

Layer Normalisation is then applied to the Residual Output to stabilise the activations.

b) Multi-headed Attention : Typically Transformer use multiple parallel heads of attention. Each head will compute a slightly different variations to the attention values, thus making the whole learning process richer. Multi-headed learning is capable of capturing more nuanced affinities of different words in the sentence to other words in the sentence/context.

c) Dropout : Dropout is a technique where random hidden units or neurons are dropped from the network during training. This prevents overfitting and helps to regularise/generalise the learning of the network. In Transformer Architectures, dropout is used after calculating the Attention Weights. Dropout can also be applied in the Feed Forward Network or in the Residual Connections

This is shown diagrammatically here

Points to note:

a) The Attention mechanism is able to pick out affinities between words in a sentence. This happens despite the fact the the WQ, WK, WV matrices are randomly initialised. As the model trained iteratively through a large corpus of text using next token prediction for Auto Regressive Transformers and Masked prediction as in the case of BERT, then the affinities show up. This training allows the model to learn the contextual relationships and affinities words have with each other. The dot product Q, K measures the affinity words have for each other and will be high if they are highly related to each other. This is because they will aligned in a the multi-dimensional embedding space of these vectors, besides semantically and contextually related tokens are closer to each other.

b) The Feed Forward Network (FFN) in the Transformer’s Attention block is relatively small and has just 2 layers. This is for computational efficiency and deeper Neural Networks can increase costs. Moreover, it has been found that deeper and wider networks did not significantly improve performance while also preventing overfitting.

c) The above architecture is based on the Causal Attention, Decoder only transformer. The original paper includes both the encoder and the decoder to enable translation across different languages. In addition architectures like BERT use ‘masked attention’ and randomly mask words


The flow of vectors and dimensionality from the input sentence tokens to the output token prediction is as follows

a) For a batch (B) of 2 sentences with 6 words (T) each, where each word is converted into a token. If Byte Pair Encoding (BPE) is used then an integer value between 1-50257 will be obtained.

Input shape = (B x T) = (2 x 6)

b) Token embedding – Each token in the vocabulary is converted into an embedding vector of size d_{model} = 512 dimension vector

Output shape = (B x T x d_{model}) = (2 x 6 x 512)

c) Positional embedding is added

Shape of positional embedding = T x d_{model} = (6 x512)

d) Output shape with token and positional embedding is the same

Output shape = (B x T x d_{model}) = (2 x 6 x 512)

d) Multi-head attention

e) The WQ, WK, WV learnable matrices are each of size

d_{model} x d_{model}

f) Q = X x WQ = (B x T x d_{model}) x (d_{model} x d_{model})

Output shape of Q, K, V = (B x T x d_{model}) = (2 x 6 x 512)

g) Number of heads h = 8

Dimensionality of each head = d_{model}/8 = d_{k} = 64

h) Splitting across the heads we have

Shape per head = (B, h, T, d_{k}) = ( 2, 8, 6, 64)

h) Weighted sum of values =

Output shape per head = (B, h, T, d_{k}) = ( 2, 8, 6, 64)

i) All the heads are concatenated

(B x T x d_{model}) = (2 x 6 x 512)

j) The FFN has one hidden layer which is 4 times d_{model}

d_{hidden} = d_{model} x 4

Final output of FFN after passing through hidden layer and back

Output shape =(B x T x d_{model}) = (2 x 6 x 512)

k) Residual, shortcut connections and layer norm do not change shape

Output shape =(B x T x d_{model}) = (2 x 6 x 512)

l) The final output is projected back into the original vocabulary space. For BPE it

50257.

Using a weight matrix (512 x vocab_size) = (512 x 50257)

Final output shape = (B x T x vocab_size) = (2 x 6 x 50257)

The output is in probabilities and hence gives the most likely next word in the sentence

Conclusion

This post tries to condense the key concepts around the Attention mechanism and the Transformer Architecture which have been the catalyst in the last few years, resulting in an explosion in the area of Gen AI, and there seems to be no stopping. It is indeed fascinating how the human language has been mathematically analysed for semantic meaning and relevance.

References

  1. Building LLMs from Scratch by Sebastian Raschka
  2. Hands-on Large Language Model by Jay Alammar, Maarten Grootendorst
  3. Let’s build GPT from scratch: from scratch, in code, spelled out – Andrej Karpathy
  4. Attention in Transformers, visually explained – 3Blue1Brown
  5. Awesome LLM – Hannibal046 (collection of LLM papers)

Also see

  1. Singularity – Short science fiction
  2. Introducing QCSimulator – A 5 qubit quantum computing simulator in R
  3. GooglyPlusPlus: Win Probability using Deep Learning and player embeddings
  4. Natural Language Processing; What would Shakespeare say?
  5. Deep Learning from first principles in vectorized Python, R and Octave
  6. Big Data 4: Webserver log analysis with RDDs, Pyspark, SparkR and SparklyR
  7. Reintroducing cricketr: An R package to analyse performances of cricketers in R

To see all posts, click Index of posts

My presentations on ‘Elements of Neural Networks & Deep Learning’ -Part1,2,3

I will be uploading a series of presentations on ‘Elements of Neural Networks and Deep Learning’. In these video presentations I discuss the derivations of L -Layer Deep Learning Networks, starting from the basics. The corresponding implementations are available in vectorized R, Python and Octave are available in my book ‘Deep Learning from first principles:Second edition- In vectorized Python, R and Octave

1. Elements of Neural Networks and Deep Learning – Part 1
This presentation introduces Neural Networks and Deep Learning. A look at history of Neural Networks, Perceptrons and why Deep Learning networks are required and concluding with a simple toy examples of a Neural Network and how they compute

2. Elements of Neural Networks and Deep Learning – Part 2
This presentation takes logistic regression as an example and creates an equivalent 2 layer Neural network. The presentation also takes a look at forward & backward propagation and how the cost is minimized using gradient descent


The implementation of the discussed 2 layer Neural Network in vectorized R, Python and Octave are available in my post ‘Deep Learning from first principles in Python, R and Octave – Part 1

3. Elements of Neural Networks and Deep Learning – Part 3
This 3rd part, discusses a primitive neural network with an input layer, output layer and a hidden layer. The neural network uses tanh activation in the hidden layer and a sigmoid activation in the output layer. The equations for forward and backward propagation are derived.


To see the implementations for the above discussed video see my post ‘Deep Learning from first principles in Python, R and Octave – Part 2

Important note: Do check out my later version of these videos at Take 4+: Presentations on ‘Elements of Neural Networks and Deep Learning’ – Parts 1-8 . These have more content and also include some corrections. Check it out!

To be continued. Watch this space!

Checkout my book ‘Deep Learning from first principles: Second Edition – In vectorized Python, R and Octave’. My book starts with the implementation of a simple 2-layer Neural Network and works its way to a generic L-Layer Deep Learning Network, with all the bells and whistles. The derivations have been discussed in detail. The code has been extensively commented and included in its entirety in the Appendix sections. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like
1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon
2. Introducing cricpy:A python package to analyze performances of cricketers
3. Natural language processing: What would Shakespeare say?
4. TWS-4: Gossip protocol: Epidemics and rumors to the rescue
5. Getting started with memcached-libmemcached
6. Simplifying ML: Impact of degree of polynomial degree on bias & variance and other insights

To see all posts click Index of posts

My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

Are you wondering whether to get into the ‘R’ bus or ‘Python’ bus?
My suggestion is to you is “Why not get into the ‘R and Python’ train?”

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($8.99/Rs449) versions.  In the third edition all code sections have been re-formatted to use the fixed width font ‘Consolas’. This neatly organizes output which have columns like confusion matrix, dataframes etc to be columnar, making the code more readable.  There is a science to formatting too!! which improves the look and feel. It is little wonder that Steve Jobs had a keen passion for calligraphy! Additionally some typos have been fixed.

 

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical machine with R and Python: Third Edition – Machine Learning in Stereo(Paperback-$12.99)
2. Practical machine with R and Python Third Edition – Machine Learning in Stereo(Kindle- $8.99/Rs449)

This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Table of Contents
Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Pick up your copy today!!
Hope you have a great time learning as I did while implementing these algorithms!

My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

Note: The 3rd edition of this book is now available My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

The third edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($12.99) and kindle ($9.99/Rs449) versions.  This second edition includes more content,  extensive comments and formatting for better readability.

In this book I implement some of the most common, but important Machine Learning algorithms in R and equivalent Python code.
1. Practical machine with R and Python: Third Edition – Machine Learning in Stereo(Paperback-$12.99)
2. Practical machine with R and Third Edition – Machine Learning in Stereo(Kindle- $9.99/Rs449)

This book is ideal both for beginners and the experts in R and/or Python. Those starting their journey into datascience and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages, R or Python, will find the equivalent code ideal for brushing up on the other language. And finally,those who are proficient in both languages, can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Table of Contents
Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Pick up your copy today!!
Hope you have a great time learning as I did while implementing these algorithms!

Practical Machine Learning with R and Python – Part 3

In this post ‘Practical Machine Learning with R and Python – Part 3’,  I discuss ‘Feature Selection’ methods. This post is a continuation of my 2 earlier posts

  1. Practical Machine Learning with R and Python – Part 1
  2. Practical Machine Learning with R and Python – Part 2

While applying Machine Learning techniques, the data set will usually include a large number of predictors for a target variable. It is quite likely, that not all the predictors or feature variables will have an impact on the output. Hence it is becomes necessary to choose only those features which influence the output variable thus simplifying  to a reduced feature set on which to train the ML model on. The techniques that are used are the following

  • Best fit
  • Forward fit
  • Backward fit
  • Ridge Regression or L2 regularization
  • Lasso or L1 regularization

This post includes the equivalent ML code in R and Python.

All these methods remove those features which do not sufficiently influence the output. As in my previous 2 posts on “Practical Machine Learning with R and Python’, this post is largely based on the topics in the following 2 MOOC courses
1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford
2. Applied Machine Learning in Python Prof Kevyn-Collin Thomson, University Of Michigan, Coursera

You can download this R Markdown file and the associated data from Github – Machine Learning-RandPython-Part3. 

Note: Please listen to my video presentations Machine Learning in youtube
1. Machine Learning in plain English-Part 1
2. Machine Learning in plain English-Part 2
3. Machine Learning in plain English-Part 3

Check out my compact and minimal book  “Practical Machine Learning with R and Python:Third edition- Machine Learning in stereo”  available in Amazon in paperback($12.99) and kindle($8.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems and how to select the best model. Pick your copy today!!

 

1.1 Best Fit

For a dataset with features f1,f2,f3…fn, the ‘Best fit’ approach, chooses all possible combinations of features and creates separate ML models for each of the different combinations. The best fit algotithm then uses some filtering criteria based on Adj Rsquared, Cp, BIC or AIC to pick out the best model among all models.

Since the Best Fit approach searches the entire solution space it is computationally infeasible. The number of models that have to be searched increase exponentially as the number of predictors increase. For ‘p’ predictors a total of 2^{p} ML models have to be searched. This can be shown as follows

There are C_{1} ways to choose single feature ML models among ‘n’ features, C_{2} ways to choose 2 feature models among ‘n’ models and so on, or
1+C_{1} + C_{2} +... + C_{n}
= Total number of models in Best Fit.  Since from Binomial theorem we have
(1+x)^{n} = 1+C_{1}x + C_{2}x^{2} +... + C_{n}x^{n}
When x=1 in the equation (1) above, this becomes
2^{n} = 1+C_{1} + C_{2} +... + C_{n}

Hence there are 2^{n} models to search amongst in Best Fit. For 10 features this is 2^{10} or ~1000 models and for 40 features this becomes 2^{40} which almost 1 trillion. Usually there are datasets with 1000 or maybe even 100000 features and Best fit becomes computationally infeasible.

Anyways I have included the Best Fit approach as I use the Boston crime datasets which is available both the MASS package in R and Sklearn in Python and it has 13 features. Even this small feature set takes a bit of time since the Best fit needs to search among ~2^{13}= 8192  models

Initially I perform a simple Linear Regression Fit to estimate the features that are statistically insignificant. By looking at the p-values of the features it can be seen that ‘indus’ and ‘age’ features have high p-values and are not significant

1.1a Linear Regression – R code

source('RFunctions-1.R')
#Read the Boston crime data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                            "distances","highways","tax","teacherRatio","color","status","cost")
dim(df1)
## [1] 506  14
# Linear Regression fit
fit <- lm(cost~. ,data=df1)
summary(fit)
## 
## Call:
## lm(formula = cost ~ ., data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crimeRate    -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zone          4.642e-02  1.373e-02   3.382 0.000778 ***
## indus         2.056e-02  6.150e-02   0.334 0.738288    
## charles       2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox          -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rooms         3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age           6.922e-04  1.321e-02   0.052 0.958229    
## distances    -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## highways      3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax          -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## teacherRatio -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## color         9.312e-03  2.686e-03   3.467 0.000573 ***
## status       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Next we apply the different feature selection models to automatically remove features that are not significant below

1.1a Best Fit – R code

The Best Fit requires the ‘leaps’ R package

library(leaps)
source('RFunctions-1.R')
#Read the Boston crime data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                            "distances","highways","tax","teacherRatio","color","status","cost")

# Perform a best fit
bestFit=regsubsets(cost~.,df1,nvmax=13)

# Generate a summary of the fit
bfSummary=summary(bestFit)

# Plot the Residual Sum of Squares vs number of variables 
plot(bfSummary$rss,xlab="Number of Variables",ylab="RSS",type="l",main="Best fit RSS vs No of features")
# Get the index of the minimum value
a=which.min(bfSummary$rss)
# Mark this in red
points(a,bfSummary$rss[a],col="red",cex=2,pch=20)

The plot below shows that the Best fit occurs with all 13 features included. Notice that there is no significant change in RSS from 11 features onward.

# Plot the CP statistic vs Number of variables
plot(bfSummary$cp,xlab="Number of Variables",ylab="Cp",type='l',main="Best fit Cp vs No of features")
# Find the lowest CP value
b=which.min(bfSummary$cp)
# Mark this in red
points(b,bfSummary$cp[b],col="red",cex=2,pch=20)

Based on Cp metric the best fit occurs at 11 features as seen below. The values of the coefficients are also included below

# Display the set of features which provide the best fit
coef(bestFit,b)
##   (Intercept)     crimeRate          zone       charles           nox 
##  36.341145004  -0.108413345   0.045844929   2.718716303 -17.376023429 
##         rooms     distances      highways           tax  teacherRatio 
##   3.801578840  -1.492711460   0.299608454  -0.011777973  -0.946524570 
##         color        status 
##   0.009290845  -0.522553457
#  Plot the BIC value
plot(bfSummary$bic,xlab="Number of Variables",ylab="BIC",type='l',main="Best fit BIC vs No of Features")
# Find and mark the min value
c=which.min(bfSummary$bic)
points(c,bfSummary$bic[c],col="red",cex=2,pch=20)

# R has some other good plots for best fit
plot(bestFit,scale="r2",main="Rsquared vs No Features")

R has the following set of really nice visualizations. The plot below shows the Rsquared for a set of predictor variables. It can be seen when Rsquared starts at 0.74- indus, charles and age have not been included. 

plot(bestFit,scale="Cp",main="Cp vs NoFeatures")

The Cp plot below for value shows indus, charles and age as not included in the Best fit

plot(bestFit,scale="bic",main="BIC vs Features")

1.1b Best fit (Exhaustive Search ) – Python code

The Python package for performing a Best Fit is the Exhaustive Feature Selector EFS.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

# Read the Boston crime data
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")

#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
# Set X and y 
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']

# Perform an Exhaustive Search. The EFS and SFS packages use 'neg_mean_squared_error'. The 'mean_squared_error' seems to have been deprecated. I think this is just the MSE with the a negative sign.
lr = LinearRegression()
efs1 = EFS(lr, 
           min_features=1,
           max_features=13,
           scoring='neg_mean_squared_error',
           print_progress=True,
           cv=5)


# Create a efs fit
efs1 = efs1.fit(X.as_matrix(), y.as_matrix())

print('Best negtive mean squared error: %.2f' % efs1.best_score_)
## Print the IDX of the best features 
print('Best subset:', efs1.best_idx_)
Features: 8191/8191Best negtive mean squared error: -28.92
## ('Best subset:', (0, 1, 4, 6, 7, 8, 9, 10, 11, 12))

The indices for the best subset are shown above.

1.2 Forward fit

Forward fit is a greedy algorithm that tries to optimize the feature selected, by minimizing the selection criteria (adj Rsqaured, Cp, AIC or BIC) at every step. For a dataset with features f1,f2,f3…fn, the forward fit starts with the NULL set. It then pick the ML model with a single feature from n features which has the highest adj Rsquared, or minimum Cp, BIC or some such criteria. After picking the 1 feature from n which satisfies the criteria the most, the next feature from the remaining n-1 features is chosen. When the 2 feature model which satisfies the selection criteria the best is chosen, another feature from the remaining n-2 features are added and so on. The forward fit is a sub-optimal algorithm. There is no guarantee that the final list of features chosen will be the best among the lot. The computation required for this is of  n + n-1 + n -2 + .. 1 = n(n+1)/2 which is of the order of n^{2}. Though forward fit is a sub optimal solution it is far more computationally efficient than best fit

1.2a Forward fit – R code

Forward fit in R determines that 11 features are required for the best fit. The features are shown below

library(leaps)
# Read the data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")

# Select columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                     "distances","highways","tax","teacherRatio","color","status","cost")

#Split as training and test 
train_idx <- trainTestSplit(df1,trainPercent=75,seed=5)
train <- df1[train_idx, ]
test <- df1[-train_idx, ]

# Find the best forward fit
fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="forward")

# Compute the MSE
valErrors=rep(NA,13)
test.mat=model.matrix(cost~.,data=test)
for(i in 1:13){
    coefi=coef(fitFwd,id=i)
    pred=test.mat[,names(coefi)]%*%coefi
    valErrors[i]=mean((test$cost-pred)^2)
}

# Plot the Residual Sum of Squares
plot(valErrors,xlab="Number of Variables",ylab="Validation Error",type="l",main="Forward fit RSS vs No of features")
# Gives the index of the minimum value
a<-which.min(valErrors)
print(a)
## [1] 11
# Highlight the smallest value
points(c,valErrors[a],col="blue",cex=2,pch=20)

Forward fit R selects 11 predictors as the best ML model to predict the ‘cost’ output variable. The values for these 11 predictors are included below

#Print the 11 ccoefficients
coefi=coef(fitFwd,id=i)
coefi
##   (Intercept)     crimeRate          zone         indus       charles 
##  2.397179e+01 -1.026463e-01  3.118923e-02  1.154235e-04  3.512922e+00 
##           nox         rooms           age     distances      highways 
## -1.511123e+01  4.945078e+00 -1.513220e-02 -1.307017e+00  2.712534e-01 
##           tax  teacherRatio         color        status 
## -1.330709e-02 -8.182683e-01  1.143835e-02 -3.750928e-01

1.2b Forward fit with Cross Validation – R code

The Python package SFS includes N Fold Cross Validation errors for forward and backward fit so I decided to add this code to R. This is not available in the ‘leaps’ R package, however the implementation is quite simple. Another implementation is also available at Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford 2.

library(dplyr)
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")

# Select columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                     "distances","highways","tax","teacherRatio","color","status","cost")

set.seed(6)
# Set max number of features
nvmax<-13
cvError <- NULL
# Loop through each features
for(i in 1:nvmax){
    # Set no of folds
    noFolds=5
    # Create the rows which fall into different folds from 1..noFolds
    folds = sample(1:noFolds, nrow(df1), replace=TRUE) 
    cv<-0
    # Loop through the folds
    for(j in 1:noFolds){
        # The training is all rows for which the row is != j (k-1 folds -> training)
        train <- df1[folds!=j,]
        # The rows which have j as the index become the test set
        test <- df1[folds==j,]
        # Create a forward fitting model for this
        fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="forward")
        # Select the number of features and get the feature coefficients
        coefi=coef(fitFwd,id=i)
        #Get the value of the test data
        test.mat=model.matrix(cost~.,data=test)
        # Multiply the tes data with teh fitted coefficients to get the predicted value
        # pred = b0 + b1x1+b2x2... b13x13
        pred=test.mat[,names(coefi)]%*%coefi
        # Compute mean squared error
        rss=mean((test$cost - pred)^2)
        # Add all the Cross Validation errors
        cv=cv+rss
    }
    # Compute the average of MSE for K folds for number of features 'i'
    cvError[i]=cv/noFolds
}
a <- seq(1,13)
d <- as.data.frame(t(rbind(a,cvError)))
names(d) <- c("Features","CVError")
#Plot the CV Error vs No of Features
ggplot(d,aes(x=Features,y=CVError),color="blue") + geom_point() + geom_line(color="blue") +
    xlab("No of features") + ylab("Cross Validation Error") +
    ggtitle("Forward Selection - Cross Valdation Error vs No of Features")

Forward fit with 5 fold cross validation indicates that all 13 features are required

# This gives the index of the minimum value
a=which.min(cvError)
print(a)
## [1] 13
#Print the 13 coefficients of these features
coefi=coef(fitFwd,id=a)
coefi
##   (Intercept)     crimeRate          zone         indus       charles 
##  36.650645380  -0.107980979   0.056237669   0.027016678   4.270631466 
##           nox         rooms           age     distances      highways 
## -19.000715500   3.714720418   0.019952654  -1.472533973   0.326758004 
##           tax  teacherRatio         color        status 
##  -0.011380750  -0.972862622   0.009549938  -0.582159093

1.2c Forward fit – Python code

The Backward Fit in Python uses the Sequential feature selection (SFS) package (SFS)(https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/)

Note: The Cross validation error for SFS in Sklearn is negative, possibly because it computes the ‘neg_mean_squared_error’. The earlier ‘mean_squared_error’ in the package seems to have been deprecated. I have taken the -ve of this neg_mean_squared_error. I think this would give mean_squared_error.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression


df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()
# Create a forward fit model
sfs = SFS(lr, 
          k_features=(1,13), 
          forward=True, # Forward fit
          floating=False, 
          scoring='neg_mean_squared_error',
          cv=5)

# Fit this on the data
sfs = sfs.fit(X.as_matrix(), y.as_matrix())
# Get all the details of the forward fits
a=sfs.get_metric_dict()
n=[]
o=[]

# Compute the mean cross validation scores
for i in np.arange(1,13):
    n.append(-np.mean(a[i]['cv_scores']))  
m=np.arange(1,13)
# Get the index of the minimum CV score

# Plot the CV scores vs the number of features
fig1=plt.plot(m,n)
fig1=plt.title('Mean CV Scores vs No of features')
fig1.figure.savefig('fig1.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sfs.get_metric_dict(confidence_interval=0.90)).T)

idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best forward fit and convert to list
b=list(a[idx]['feature_idx'])
print(b)

# Index the column names. 
# Features from forward fit
print("Features selected in forward fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...   
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...   
## 3   -34.1001    20.87  [-9.43012884381, -25.9584955394, -36.184188174...   
## 4   -33.7681  20.1638  [-8.86076528781, -28.650217633, -35.7246353855...   
## 5   -33.6392  20.5271  [-8.90807628524, -28.0684679108, -35.827463022...   
## 6   -33.6276  19.0859  [-9.549485942, -30.9724602876, -32.6689523347,...   
## 7   -32.4082  19.1455  [-10.0177149635, -28.3780298492, -30.926917231...   
## 8   -32.3697   18.533  [-11.1431684243, -27.5765510172, -31.168994094...   
## 9   -32.4016  21.5561  [-10.8972555995, -25.739780653, -30.1837430353...   
## 10  -32.8504  22.6508  [-12.3909282079, -22.1533250755, -33.385407342...   
## 11  -34.1065  24.7019  [-12.6429253721, -22.1676650245, -33.956999528...   
## 12  -35.5814   25.693  [-12.7303397453, -25.0145323483, -34.211898373...   
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...   
## 
##                                    feature_idx  std_dev  std_err  
## 1                                        (12,)  18.9042  9.45212  
## 2                                     (10, 12)  16.1965  8.09826  
## 3                                  (10, 12, 5)  20.7142  10.3571  
## 4                               (10, 3, 12, 5)  20.0132  10.0066  
## 5                            (0, 10, 3, 12, 5)  20.3738  10.1869  
## 6                         (0, 3, 5, 7, 10, 12)  18.9433  9.47167  
## 7                      (0, 2, 3, 5, 7, 10, 12)  19.0026  9.50128  
## 8                   (0, 1, 2, 3, 5, 7, 10, 12)  18.3946  9.19731  
## 9               (0, 1, 2, 3, 5, 7, 10, 11, 12)  21.3952  10.6976  
## 10           (0, 1, 2, 3, 4, 5, 7, 10, 11, 12)  22.4816  11.2408  
## 11        (0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12)  24.5175  12.2587  
## 12     (0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12)  25.5012  12.7506  
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546  
## No of features= 7
## [0, 2, 3, 5, 7, 10, 12]
## #################################################################################
## Features selected in forward fit
## Index([u'crimeRate', u'indus', u'chasRiver', u'rooms', u'distances',
##        u'teacherRatio', u'status'],
##       dtype='object')

The table above shows the average score, 10 fold CV errors, the features included at every step, std. deviation and std. error

The above plot indicates that 8 features provide the lowest Mean CV error

1.3 Backward Fit

Backward fit belongs to the class of greedy algorithms which tries to optimize the feature set, by dropping a feature at every stage which results in the worst performance for a given criteria of Adj RSquared, Cp, BIC or AIC. For a dataset with features f1,f2,f3…fn, the backward fit starts with the all the features f1,f2.. fn to begin with. It then pick the ML model with a n-1 features by dropping the feature,f_{j}, for e.g., the inclusion of which results in the worst performance in adj Rsquared, or minimum Cp, BIC or some such criteria. At every step 1 feature is dopped. There is no guarantee that the final list of features chosen will be the best among the lot. The computation required for this is of n + n-1 + n -2 + .. 1 = n(n+1)/2 which is of the order of n^{2}. Though backward fit is a sub optimal solution it is far more computationally efficient than best fit

1.3a Backward fit – R code

library(dplyr)
# Read the data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")

# Select columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                     "distances","highways","tax","teacherRatio","color","status","cost")

set.seed(6)
# Set max number of features
nvmax<-13
cvError <- NULL
# Loop through each features
for(i in 1:nvmax){
    # Set no of folds
    noFolds=5
    # Create the rows which fall into different folds from 1..noFolds
    folds = sample(1:noFolds, nrow(df1), replace=TRUE) 
    cv<-0
    for(j in 1:noFolds){
        # The training is all rows for which the row is != j 
        train <- df1[folds!=j,]
        # The rows which have j as the index become the test set
        test <- df1[folds==j,]
        # Create a backward fitting model for this
        fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="backward")
        # Select the number of features and get the feature coefficients
        coefi=coef(fitFwd,id=i)
        #Get the value of the test data
        test.mat=model.matrix(cost~.,data=test)
        # Multiply the tes data with teh fitted coefficients to get the predicted value
        # pred = b0 + b1x1+b2x2... b13x13
        pred=test.mat[,names(coefi)]%*%coefi
        # Compute mean squared error
        rss=mean((test$cost - pred)^2)
        # Add the Residual sum of square
        cv=cv+rss
    }
    # Compute the average of MSE for K folds for number of features 'i'
    cvError[i]=cv/noFolds
}
a <- seq(1,13)
d <- as.data.frame(t(rbind(a,cvError)))
names(d) <- c("Features","CVError")
# Plot the Cross Validation Error vs Number of features
ggplot(d,aes(x=Features,y=CVError),color="blue") + geom_point() + geom_line(color="blue") +
    xlab("No of features") + ylab("Cross Validation Error") +
    ggtitle("Backward Selection - Cross Valdation Error vs No of Features")

# This gives the index of the minimum value
a=which.min(cvError)
print(a)
## [1] 13
#Print the 13 coefficients of these features
coefi=coef(fitFwd,id=a)
coefi
##   (Intercept)     crimeRate          zone         indus       charles 
##  36.650645380  -0.107980979   0.056237669   0.027016678   4.270631466 
##           nox         rooms           age     distances      highways 
## -19.000715500   3.714720418   0.019952654  -1.472533973   0.326758004 
##           tax  teacherRatio         color        status 
##  -0.011380750  -0.972862622   0.009549938  -0.582159093

Backward selection in R also indicates the 13 features and the corresponding coefficients as providing the best fit

1.3b Backward fit – Python code

The Backward Fit in Python uses the Sequential feature selection (SFS) package (SFS)(https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/)

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# Read the data
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

# Create the SFS model
sfs = SFS(lr, 
          k_features=(1,13), 
          forward=False, # Backward
          floating=False, 
          scoring='neg_mean_squared_error',
          cv=5)

# Fit the model
sfs = sfs.fit(X.as_matrix(), y.as_matrix())
a=sfs.get_metric_dict()
n=[]
o=[]

# Compute the mean of the validation scores
for i in np.arange(1,13):
    n.append(-np.mean(a[i]['cv_scores'])) 
m=np.arange(1,13)

# Plot the Validation scores vs number of features
fig2=plt.plot(m,n)
fig2=plt.title('Mean CV Scores vs No of features')
fig2.figure.savefig('fig2.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sfs.get_metric_dict(confidence_interval=0.90)).T)

# Get the index of minimum cross validation error
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best forward fit and convert to list
b=list(a[idx]['feature_idx'])
# Index the column names. 
# Features from backward fit
print("Features selected in bacward fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...   
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...   
## 3   -35.4992  13.9619  [-17.2329292677, -44.4178648308, -51.633177846...   
## 4    -33.463  12.4081  [-20.6415333292, -37.3247852146, -47.479302977...   
## 5   -33.1038  10.6156  [-20.2872309863, -34.6367078466, -45.931870352...   
## 6   -32.0638  10.0933  [-19.4463829372, -33.460638577, -42.726257249,...   
## 7   -30.7133  9.23881  [-19.4425181917, -31.1742902259, -40.531266671...   
## 8   -29.7432  9.84468  [-19.445277268, -30.0641187173, -40.2561247122...   
## 9   -29.0878  9.45027  [-19.3545569877, -30.094768669, -39.7506036377...   
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...   
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...   
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...   
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...   
## 
##                                    feature_idx  std_dev  std_err  
## 1                                        (12,)  18.9042  9.45212  
## 2                                     (10, 12)  16.1965  8.09826  
## 3                                  (10, 12, 7)  13.8576  6.92881  
## 4                               (12, 10, 4, 7)  12.3154  6.15772  
## 5                            (4, 7, 8, 10, 12)  10.5363  5.26816  
## 6                         (4, 7, 8, 9, 10, 12)  10.0179  5.00896  
## 7                      (1, 4, 7, 8, 9, 10, 12)  9.16981  4.58491  
## 8                  (1, 4, 7, 8, 9, 10, 11, 12)  9.77116  4.88558  
## 9               (0, 1, 4, 7, 8, 9, 10, 11, 12)  9.37969  4.68985  
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634  
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092  
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265  
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546  
## No of features= 9
## Features selected in bacward fit
## Index([u'crimeRate', u'zone', u'NO2', u'distances', u'idxHighways', u'taxRate',
##        u'teacherRatio', u'color', u'status'],
##       dtype='object')

The table above shows the average score, 10 fold CV errors, the features included at every step, std. deviation and std. error

Backward fit in Python indicate that 10 features provide the best fit

1.3c Sequential Floating Forward Selection (SFFS) – Python code

The Sequential Feature search also includes ‘floating’ variants which include or exclude features conditionally, once they were excluded or included. The SFFS can conditionally include features which were excluded from the previous step, if it results in a better fit. This option will tend to a better solution, than plain simple SFS. These variants are included below

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression


df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

# Create the floating forward search
sffs = SFS(lr, 
          k_features=(1,13), 
          forward=True,  # Forward
          floating=True,  #Floating
          scoring='neg_mean_squared_error',
          cv=5)

# Fit a model
sffs = sffs.fit(X.as_matrix(), y.as_matrix())
a=sffs.get_metric_dict()
n=[]
o=[]
# Compute mean validation scores
for i in np.arange(1,13):
    n.append(-np.mean(a[i]['cv_scores'])) 
   
    
    
m=np.arange(1,13)


# Plot the cross validation score vs number of features
fig3=plt.plot(m,n)
fig3=plt.title('SFFS:Mean CV Scores vs No of features')
fig3.figure.savefig('fig3.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sffs.get_metric_dict(confidence_interval=0.90)).T)
# Get the index of the minimum CV score
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best forward floating fit and convert to list
b=list(a[idx]['feature_idx'])
print(b)

print("#################################################################################")
# Index the column names. 
# Features from forward fit
print("Features selected in forward fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...   
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...   
## 3   -34.1001    20.87  [-9.43012884381, -25.9584955394, -36.184188174...   
## 4   -33.7681  20.1638  [-8.86076528781, -28.650217633, -35.7246353855...   
## 5   -33.6392  20.5271  [-8.90807628524, -28.0684679108, -35.827463022...   
## 6   -33.6276  19.0859  [-9.549485942, -30.9724602876, -32.6689523347,...   
## 7   -32.1834  12.1001  [-17.9491036167, -39.6479234651, -45.470227740...   
## 8   -32.0908  11.8179  [-17.4389015788, -41.2453629843, -44.247557798...   
## 9   -31.0671  10.1581  [-17.2689542913, -37.4379370429, -41.366372300...   
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...   
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...   
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...   
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...   
## 
##                                    feature_idx  std_dev  std_err  
## 1                                        (12,)  18.9042  9.45212  
## 2                                     (10, 12)  16.1965  8.09826  
## 3                                  (10, 12, 5)  20.7142  10.3571  
## 4                               (10, 3, 12, 5)  20.0132  10.0066  
## 5                            (0, 10, 3, 12, 5)  20.3738  10.1869  
## 6                         (0, 3, 5, 7, 10, 12)  18.9433  9.47167  
## 7                      (0, 1, 2, 3, 7, 10, 12)  12.0097  6.00487  
## 8                   (0, 1, 2, 3, 7, 8, 10, 12)  11.7297  5.86484  
## 9                (0, 1, 2, 3, 7, 8, 9, 10, 12)  10.0822  5.04111  
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634  
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092  
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265  
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546  
## No of features= 9
## [0, 1, 2, 3, 7, 8, 9, 10, 12]
## #################################################################################
## Features selected in forward fit
## Index([u'crimeRate', u'zone', u'indus', u'chasRiver', u'distances',
##        u'idxHighways', u'taxRate', u'teacherRatio', u'status'],
##       dtype='object')

The table above shows the average score, 10 fold CV errors, the features included at every step, std. deviation and std. error

SFFS provides the best fit with 10 predictors

1.3d Sequential Floating Backward Selection (SFBS) – Python code

The SFBS is an extension of the SBS. Here features that are excluded at any stage can be conditionally included if the resulting feature set gives a better fit.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression


df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

sffs = SFS(lr, 
          k_features=(1,13), 
          forward=False, # Backward
          floating=True, # Floating
          scoring='neg_mean_squared_error',
          cv=5)

sffs = sffs.fit(X.as_matrix(), y.as_matrix())
a=sffs.get_metric_dict()
n=[]
o=[]
# Compute the mean cross validation score
for i in np.arange(1,13):
    n.append(-np.mean(a[i]['cv_scores']))  
    
m=np.arange(1,13)

fig4=plt.plot(m,n)
fig4=plt.title('SFBS: Mean CV Scores vs No of features')
fig4.figure.savefig('fig4.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sffs.get_metric_dict(confidence_interval=0.90)).T)

# Get the index of the minimum CV score
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best backward floating fit and convert to list
b=list(a[idx]['feature_idx'])
print(b)

print("#################################################################################")
# Index the column names. 
# Features from forward fit
print("Features selected in backward floating fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...   
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...   
## 3   -34.1001    20.87  [-9.43012884381, -25.9584955394, -36.184188174...   
## 4    -33.463  12.4081  [-20.6415333292, -37.3247852146, -47.479302977...   
## 5   -32.3699  11.2725  [-20.8771078371, -34.9825657934, -45.813447203...   
## 6   -31.6742  11.2458  [-20.3082500364, -33.2288990522, -45.535507868...   
## 7   -30.7133  9.23881  [-19.4425181917, -31.1742902259, -40.531266671...   
## 8   -29.7432  9.84468  [-19.445277268, -30.0641187173, -40.2561247122...   
## 9   -29.0878  9.45027  [-19.3545569877, -30.094768669, -39.7506036377...   
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...   
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...   
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...   
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...   
## 
##                                    feature_idx  std_dev  std_err  
## 1                                        (12,)  18.9042  9.45212  
## 2                                     (10, 12)  16.1965  8.09826  
## 3                                  (10, 12, 5)  20.7142  10.3571  
## 4                               (4, 10, 7, 12)  12.3154  6.15772  
## 5                            (12, 10, 4, 1, 7)  11.1883  5.59417  
## 6                        (4, 7, 8, 10, 11, 12)  11.1618  5.58088  
## 7                      (1, 4, 7, 8, 9, 10, 12)  9.16981  4.58491  
## 8                  (1, 4, 7, 8, 9, 10, 11, 12)  9.77116  4.88558  
## 9               (0, 1, 4, 7, 8, 9, 10, 11, 12)  9.37969  4.68985  
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634  
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092  
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265  
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546  
## No of features= 9
## [0, 1, 4, 7, 8, 9, 10, 11, 12]
## #################################################################################
## Features selected in backward floating fit
## Index([u'crimeRate', u'zone', u'NO2', u'distances', u'idxHighways', u'taxRate',
##        u'teacherRatio', u'color', u'status'],
##       dtype='object')

The table above shows the average score, 10 fold CV errors, the features included at every step, std. deviation and std. error

SFBS indicates that 10 features are needed for the best fit

1.4 Ridge regression

In Linear Regression the Residual Sum of Squares (RSS) is given as

RSS = \sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_jx_{ij})^{2}
Ridge regularization =\sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_jx_{ij})^{2} + \lambda \sum_{j=1}^{p}\beta^{2}

where is the regularization or tuning parameter. Increasing increases the penalty on the coefficients thus shrinking them. However in Ridge Regression features that do not influence the target variable will shrink closer to zero but never become zero except for very large values of

Ridge regression in R requires the ‘glmnet’ package

1.4a Ridge Regression – R code

library(glmnet)
library(dplyr)
# Read the data
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
#Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                            "distances","highways","tax","teacherRatio","color","status","cost")

# Set X and y as matrices
X=as.matrix(df1[,1:13])
y=df1$cost

# Fit a Ridge model
fitRidge <-glmnet(X,y,alpha=0)

#Plot the model where the coefficient shrinkage is plotted vs log lambda
plot(fitRidge,xvar="lambda",label=TRUE,main= "Ridge regression coefficient shrikage vs log lambda")

The plot below shows how the 13 coefficients for the 13 predictors vary when lambda is increased. The x-axis includes log (lambda). We can see that increasing lambda from 10^{2} to 10^{6} significantly shrinks the coefficients. We can draw a vertical line from the x-axis and read the values of the 13 coefficients. Some of them will be close to zero

# Compute the cross validation error
cvRidge=cv.glmnet(X,y,alpha=0)

#Plot the cross validation error
plot(cvRidge, main="Ridge regression Cross Validation Error (10 fold)")

This gives the 10 fold Cross Validation  Error with respect to log (lambda) As lambda increase the MSE increases

1.4a Ridge Regression – Python code

The coefficient shrinkage for Python can be plotted like R using Least Angle Regression model a.k.a. LARS package. This is included below

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)

# Scale the X_train and X_test
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit a ridge regression with alpha=20
linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)

# Print the training R squared
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
# Print the test Rsquared
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

trainingRsquared=[]
testRsquared=[]
# Plot the effect of alpha on the test Rsquared
print('Ridge regression: effect of alpha regularization parameter\n')
# Choose a list of alpha values
for this_alpha in [0.001,.01,.1,0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
    # Compute training rsquared
    r2_train = linridge.score(X_train_scaled, y_train)
    # Compute test rsqaured
    r2_test = linridge.score(X_test_scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    trainingRsquared.append(r2_train)
    testRsquared.append(r2_test)
    
# Create a dataframe
alpha=[0.001,.01,.1,0, 1, 10, 20, 50, 100, 1000]    
trainingRsquared=pd.DataFrame(trainingRsquared,index=alpha)
testRsquared=pd.DataFrame(testRsquared,index=alpha)

# Plot training and test R squared as a function of alpha
df3=pd.concat([trainingRsquared,testRsquared],axis=1)
df3.columns=['trainingRsquared','testRsquared']

fig5=df3.plot()
fig5=plt.title('Ridge training and test squared error vs Alpha')
fig5.figure.savefig('fig5.png', bbox_inches='tight')

# Plot the coefficient shrinage using the LARS package

from sklearn import linear_model
# #############################################################################
# Compute paths

n_alphas = 200
alphas = np.logspace(0, 8, n_alphas)

coefs = []
for a in alphas:
    ridge = linear_model.Ridge(alpha=a, fit_intercept=False)
    ridge.fit(X_train_scaled, y_train)
    coefs.append(ridge.coef_)

# #############################################################################
# Display results

ax = plt.gca()

fig6=ax.plot(alphas, coefs)
fig6=ax.set_xscale('log')
fig6=ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
fig6=plt.xlabel('alpha')
fig6=plt.ylabel('weights')
fig6=plt.title('Ridge coefficients as a function of the regularization')
fig6=plt.axis('tight')
plt.savefig('fig6.png', bbox_inches='tight')
## R-squared score (training): 0.620
## R-squared score (test): 0.438
## Number of non-zero features: 13
## Ridge regression: effect of alpha regularization parameter

The plot below shows the training and test error when increasing the tuning or regularization parameter ‘alpha’

For Python the coefficient shrinkage with LARS must be viewed from right to left, where you have increasing alpha. As alpha increases the coefficients shrink to 0.

1.5 Lasso regularization

The Lasso is another form of regularization, also known as L1 regularization. Unlike the Ridge Regression where the coefficients of features which do not influence the target tend to zero, in the lasso regualrization the coefficients become 0. The general form of Lasso is as follows

\sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_jx_{ij})^{2} + \lambda \sum_{j=1}^{p}|\beta|

1.5a Lasso regularization – R code

library(glmnet)
library(dplyr)
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
              "distances","highways","tax","teacherRatio","color","status","cost")
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                            "distances","highways","tax","teacherRatio","color","status","cost")

# Set X and y as matrices
X=as.matrix(df1[,1:13])
y=df1$cost

# Fit the lasso model
fitLasso <- glmnet(X,y)
# Plot the coefficient shrinkage as a function of log(lambda)
plot(fitLasso,xvar="lambda",label=TRUE,main="Lasso regularization - Coefficient shrinkage vs log lambda")

The plot below shows that in L1 regularization the coefficients actually become zero with increasing lambda

# Compute the cross validation error (10 fold)
cvLasso=cv.glmnet(X,y,alpha=0)
# Plot the cross validation error
plot(cvLasso)

This gives the MSE for the lasso model

1.5 b Lasso regularization – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model

scaler = MinMaxScaler()
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linlasso = Lasso(alpha=0.1, max_iter = 10).fit(X_train_scaled, y_train)

print('Non-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
     .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_test_scaled, y_test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

for e in sorted (list(zip(list(X), linlasso.coef_)),
                key = lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))
        

print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

trainingRsquared=[]
testRsquared=[]
#for alpha in [0.01,0.05,0.1, 1, 2, 3, 5, 10, 20, 50]:
for alpha in [0.01,0.07,0.05, 0.1, 1,2, 3, 5, 10]:
    linlasso = Lasso(alpha, max_iter = 10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    trainingRsquared.append(r2_train)
    testRsquared.append(r2_test)
    
alpha=[0.01,0.07,0.05, 0.1, 1,2, 3, 5, 10]    
#alpha=[0.01,0.05,0.1, 1, 2, 3, 5, 10, 20, 50]
trainingRsquared=pd.DataFrame(trainingRsquared,index=alpha)
testRsquared=pd.DataFrame(testRsquared,index=alpha)

df3=pd.concat([trainingRsquared,testRsquared],axis=1)
df3.columns=['trainingRsquared','testRsquared']

fig7=df3.plot()
fig7=plt.title('LASSO training and test squared error vs Alpha')
fig7.figure.savefig('fig7.png', bbox_inches='tight')



## Non-zero features: 7
## R-squared score (training): 0.726
## R-squared score (test): 0.561
## 
## Features with non-zero weight (sorted by absolute magnitude):
##  status, -18.361
##  rooms, 18.232
##  teacherRatio, -8.628
##  taxRate, -2.045
##  color, 1.888
##  chasRiver, 1.670
##  distances, -0.529
## Lasso regression: effect of alpha regularization
## parameter on number of features kept in final model
## 
## Computing regularization path using the LARS ...
## .C:\Users\Ganesh\ANACON~1\lib\site-packages\sklearn\linear_model\coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
##   ConvergenceWarning)

The plot below gives the training and test R squared error

1.5c Lasso coefficient shrinkage – Python code

To plot the coefficient shrinkage for Lasso the Least Angle Regression model a.k.a. LARS package. This is shown below

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model
scaler = MinMaxScaler()
df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
              "distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


print("Computing regularization path using the LARS ...")
alphas, _, coefs = linear_model.lars_path(X_train_scaled, y_train, method='lasso', verbose=True)

xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]

fig8=plt.plot(xx, coefs.T)

ymin, ymax = plt.ylim()
fig8=plt.vlines(xx, ymin, ymax, linestyle='dashed')
fig8=plt.xlabel('|coef| / max|coef|')
fig8=plt.ylabel('Coefficients')
fig8=plt.title('LASSO Path - Coefficient Shrinkage vs L1')
fig8=plt.axis('tight')
plt.savefig('fig8.png', bbox_inches='tight')
This plot show the coefficient shrinkage for lasso.
This 3rd part of the series covers the main ‘feature selection’ methods. I hope these posts serve as a quick and useful reference to ML code both for R and Python!
Stay tuned for further updates to this series!
Watch this space!

 

You may also like

1. Natural language processing: What would Shakespeare say?
2. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
3. GooglyPlus: yorkr analyzes IPL players, teams, matches with plots and tables
4. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)
5. Experiments with deblurring using OpenCV
6. R vs Python: Different similarities and similar differences

To see all posts see Index of posts

My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

Then felt I like some watcher of the skies 
When a new planet swims into his ken; 
Or like stout Cortez when with eagle eyes 
He star’d at the Pacific—and all his men 
Look’d at each other with a wild surmise— 
Silent, upon a peak in Darien. 
On First Looking into Chapman’s Homer by John Keats

The above excerpt from John Keat’s poem captures the the exhilaration that one experiences, when discovering something for the first time. This also  summarizes to some extent my own as enjoyment while pursuing Data Science, Machine Learning and the like.

I decided to write this post, as occasionally youngsters approach me and ask me where they should start their adventure in Data Science & Machine Learning. There are other times, when the ‘not-so-youngsters’ want to know what their next step should be after having done some courses. This post includes my travels through the domains of Data Science, Machine Learning, Deep Learning and (soon to be done AI).

By no means, am I an authority in this field, which is ever-widening and almost bottomless, yet I would like to share some of my experiences in this fascinating field. I include a short review of the courses I have done below. I also include alternative routes through  courses which I did not do, but are probably equally good as well.  Feel free to pick and choose any course or set of courses. Alternatively, you may prefer to read books or attend bricks-n-mortar classes, In any case,  I hope the list below will provide you with some overall direction.

All my learning in the above domains have come from MOOCs and I restrict myself to the top 3 MOOCs, or in my opinion, ‘the original MOOCs’, namely Coursera, edX or Udacity, but may throw in some courses from other online sites if they are only available there. I would recommend these 3 MOOCs over the other numerous online courses and also over face-to-face classroom courses for the following reasons. These MOOCs

  • Are taken by world class colleges and the lectures are delivered by top class Professors who have a great depth of knowledge and a wealth of experience
  • The Professors, besides delivering quality content, also point out to important tips, tricks and traps
  • You can revisit lectures in online courses anytime to refresh your memory
  • Lectures are usually short between 8 -15 mins (Personally, my attention span is around 15-20 mins at a time!)

Here is a fair warning and something quite obvious. No amount of courses, lectures or books will help if you don’t put it to use through some language like Octave, R or Python.

The journey
My trip through Data Science, Machine Learning  started with an off-chance remark,about 3 years ago,  from an old friend of mine who spoke to me about having done a few  courses at Coursera, and really liked it.  He further suggested that I should try. This was the final push which set me sailing into this vast domain.

I have included the list of the courses I have done over the past 5 years (37+ certifications completed and another 9 audited-listened only without doing the assignments). For each of the courses I have included a short review of the course, whether I think the course is mandatory, the language in which the course is based on, and finally whether I have done the course myself etc. I have also included alternative courses, which I may have not done, but which I think are equally good. Finally, I suggest some courses which I have heard of and which are very good and worth taking.

1. Machine Learning, Stanford, Prof Andrew Ng, Coursera
(Requirement: Mandatory, Language:Octave,Status:Completed)
This course provides an excellent foundation to build your Machine Learning citadel on. The course covers the mathematical details of linear, logistic and multivariate regression. There is also a good coverage of topics like Neural Networks, SVMs, Anamoly Detection, underfitting, overfitting, regularization etc. Prof Andrew Ng presents the material in a very lucid manner. It is a great course to start with. It would be a good idea to brush up  some basics of linear algebra, matrices and a little bit of calculus, specifically computing the local maxima/minima. You should be able to take this course even if you don’t know Octave as the Prof goes over the key aspects of the language.

2. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford– (Requirement:Mandatory, Language:R, Status;Completed) –
The course includes linear and polynomial regression, logistic regression. Details also include cross-validation and the bootstrap methods, how to do model selection and regularization (ridge and lasso). It also touches on non-linear models, generalized additive models, boosting and SVMs. Some unsupervised learning methods are  also discussed. The 2 Professors take turns in delivering lectures with a slight touch of humor.

3a. Data Science Specialization: Prof Roger Peng, Prof Brian Caffo & Prof Jeff Leek, John Hopkins University (Requirement: Option A, Language: R Status: Completed)
This is a comprehensive 10 module specialization based on R. This Specialization gives a very broad overview of Data Science and Machine Learning. The modules cover R programming, Statistical Inference, Practical Machine Learning, how to build R products and R packages and finally has a very good Capstone project on NLP

3b. Applied Data Science with Python Specialization: University of Michigan (Requirement: Option B, Language: Python, Status: Not done)
In this specialization I only did  the Applied Machine Learning in Python (Prof Kevyn-Collin Thomson). This is a very good course that covers a lot of Machine Learning algorithms(linear, logistic, ridge, lasso regression, knn, SVMs etc. Also included are confusion matrices, ROC curves etc. This is based on Python’s Scikit Learn

3c. Machine Learning Specialization, University Of Washington (Requirement:Option C, Language:Python, Status : Not completed). This appears to be a very good Specialization in Python

4. Statistics with R Specialization, Duke University (Requirement: Useful and a must know, Language R, Status:Not Completed)
I audited (listened only) to the following 2 modules from this Specialization.
a.Inferential Statistics
b.Linear Regression and Modeling
Both these courses are taught by Prof Mine Cetikya-Rundel who delivers her lessons with extraordinary clarity.  Her lectures are filled with many examples which she walks you through in great detail

5.Bayesian Statistics: From Concept to Data Analysis: Univ of California, Santa Cruz (Requirement: Optional, Language : R, Status:Completed)
This is an interesting course and provides an alternative point of view to frequentist approach

6. Data Science and Engineering with Spark, University of California, Berkeley, Prof Antony Joseph, Prof Ameet Talwalkar, Prof Jon Bates
(Required: Mandatory for Big Data, Status:Completed, Language; pySpark)
This specialization contains 3 modules
a.Introduction to Apache Spark
b.Distributed Machine Learning with Apache Spark
c.Big Data Analysis with Apache Spark

This is an excellent course for those who want to make an entry into Distributed Machine Learning. The exercises are fairly challenging and your code will predominantly be made of map/reduce and lambda operations as you process data that is distributed across Spark RDDs. I really liked  the part where the Prof shows how a matrix multiplication on a single machine is of the order of O(nd^2+d^3) (which is the basis of Machine Learning) is reduced to O(nd^2) by taking outer products on data which is distributed.

7. Deep Learning Prof Andrew Ng, Younes Bensouda Mourri, Kian Katanforoosh : Requirement:Mandatory,Language:Python, Tensorflow Status:Completed)

This course had 5 Modules which start from the fundamentals of Neural Networks, their derivation and vectorized Python implementation. The specialization also covers regularization, optimization techniques, mini batch normalization, Convolutional Neural Networks, Recurrent Neural Networks, LSTMs applied to a wide variety of real world problems

The modules are
a. Neural Networks and Deep Learning
In this course Prof Andrew Ng explains differential calculus, linear algebra and vectorized Python implementations of Deep Learning algorithms. The derivation for back-propagation is done and then the Prof shows how to compute a multi-layered DL network
b.Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
Deep Neural Networks can be very flexible, and come with a lots of knobs (hyper-parameters) to tune with. In this module, Prof Andrew Ng shows a systematic way to tune hyperparameters and by how much should one tune. The course also covers regularization(L1,L2,dropout), gradient descent optimization and batch normalization methods. The visualizations used to explain the momentum method, RMSprop, Adam,LR decay and batch normalization are really powerful and serve to clarify the concepts. As an added bonus,the module also includes a great introduction to Tensorflow.
c.Structuring Machine Learning Projects
A very good module with useful tips, tricks and traps that need to be considered while working on Machine Learning and Deep Learning projects
d. Convolutional Neural Networks
This domain has a lot of really cool ideas, where images represented as 3D volumes, are compressed and stretched longitudinally before applying a multi-layered deep learning neural network to this thin slice for performing classification,detection etc. The Prof provides a glimpse into this fascinating world of image classification, detection andl neural art transfer with frameworks like Keras and Tensorflow.
e. Sequence Models
In this module covers in good detail concepts like RNNs, GRUs, LSTMs, word embeddings, beam search and attention model.

8. Neural Networks for Machine Learning, Prof Geoffrey Hinton,University of Toronto
(Requirement: Mandatory, Language;Octave, Status:Completed)
This is a broad course which starts from the basic of Perceptrons, all the way to Boltzman Machines, RNNs, CNNS, LSTMs etc The course also covers regularisation, learning rate decay, momentum method etc

9.Probabilistic Graphical Models, Stanford  Prof Daphne Koller(Language:Octave, Status: Partially completed)
This has 3 courses
a.Probabilistic Graphical Models 1: Representation – Done
b.Probabilistic Graphical Models 2: Inference – To do
c.Probabilistic Graphical Models 3: Learning – To do
This course discusses how a system, which can be represented as a complex interaction
of probability distributions, will behave. This is probably the toughest course I did.  I did manage to get through the 1st module, While I felt that grasped a few things, I did not wholly understand the import of this. However I feel this is an important domain and I will definitely revisit this in future

10. Reinforcement Specialization : University of Alberta, Prof Adam White and Prof Martha White
(Requirement: Very important, Language;Python, Status: Partially Completed)
This is a set of 4 courses. I did the first 2 of the 4. Reinforcement Learning appears deceptively simple, but it is anything but simple. Definitely a very critical area to learn.

a.Fundamentals of Reinforcement Learning: This course discusses Markov models, value functions and Bellman equations and dynamic programming.
b.Sample based learning Learning methods: This course touches on Monte Carlo methods, Temporal Difference methods, Q Learning etc.

Reinforcement Learning is a must-have in your AI arsenal.

11. Tensorflow in Practice Specialization – Prof Laurence Moroney – Deep Learning.AI
(Requirement: Important, Language;Python, Status: Completed)
This is a good course but definitely do the Deep Learning Specialization by Prof Andrew Ng
There are 4 courses in this Specialization. I completed all 4 courses. They are fairly straight forward
a. Introduction to TensorFlow – This course introduces you to Tensorflow, image recognition with brute-force method
b. Convolutional Neural Networks in Tensorflow – This course touches on how to build a CNN, image augmentation, transfer learning and multi-class classification
c. Natural Language Processing in Tensorflow – Word embeddings, sentiment analysis, LSTMs, RNNs are discussed.
d. Sequences, time series and prediction – This course discusses using RNNs for time series, auto correlation

12. Natural Language Processing  Specialization – Prof Younes Bensouda, Lukasz Kaiser from DeepLearning.AI
(Requirement: Very Important, Language;Python, Status: Partially Completed)
This is the latest specialization from Deep Learning.AI. I have completed the first 2 courses
a.Natural Language Processing with Classification and Vector Spaces -The first course deals with sentiment analysis with Naive Bayes, vector space models, capturing dependencies using PCA etc
b. Natural Language Processing with Probabilistic Models – In this course techniques for auto correction, Markov models and Viterbi algorithm for Parts of Speech tagging, auto completion and word embedding are discussed.

13. Mining Massive Data Sets Prof Jure Leskovec, Prof Anand Rajaraman and ProfJeff Ullman. Online Stanford, Status Partially done.,
I did quickly audit this course, a year back, when it used to be in Coursera. It now seems to have moved to Stanford online. But this is a very good course that discusses key concepts of Mining Big Data of the order a few Petabytes

14. Introduction to Artificial Intelligence, Prof Sebastian Thrun & Prof Peter Norvig, Udacity
This is a really good course. I have started on this course a couple of times and somehow gave up. Will revisit to complete in future. Quite extensive in its coverage.Touches BFS,DFS, A-Star, PGM, Machine Learning etc.

15.Deep Learning (with TensorFlow), Vincent Vanhoucke, Principal Scientist at Google Brain.
Got started on this one and abandoned some time back. In my to do list though

16. Generative AI with LLMs

My learning journey is based on Lao Tzu’s dictum of ‘A good traveler has no fixed plans and is not intent on arriving’. You could have a goal and try to plan your courses accordingly.
And so my journey continues…

I hope you find this list useful.
Have a great journey ahead!!!

Video presentation on Machine Learning, Data Science, NLP and Big Data – Part 1

Here is the 1st part of my video presentation on “Machine Learning, Data Science, NLP and Big Data – Part 1”

An Octave primer

Here is a simple Octave Primer. Octave is a powerful language for implementing Machine Learning algorithms. As I have mentioned its strength is its simplicity. I am including some basic commands with which you can get by implementing fairly complex code

%%Matrix
A matrix can be created as a = [1 2 3; 4 7 8; 12 35 14]; % This is 3 x 3 matrix
Matrix multiplication can be done between m x n * n x k matrix as follows

a = [4 56 3; 2 3 4]; b = [23 1; 3 12; 34 12]; % a = 3 x 2 matrix b = 2 x 3 matrix
c = a*b; %% c = 3 x 2 * 2 * 3 = 3 x 3 matrix

c =
362 712
191 86

%%Inverse of a matrix can be obtained by
d = pinv(c);
octave-3.2.4.exe:37> d = pinv(c)
d =
-8.2014e-004 6.7900e-003
1.8215e-003 -3.4522e-003

%%Transpose of a matrix
e = c'; % e is the transpose of done

octave-3.2.4.exe:38> e = c'
e =
362 191
712 86

The following operations are done on all elements of a matrix or a vector
k = 5;
a = [1 2; 3 4; 5 6]; k = 5.23;
c = k * a;
d = a - 2
e = a / 5
f = a .* a % Dot product
g = a .^2; % Square each elements

%% Select slice of matrix
b = a(:,2); % Select column 2 of matrix a (all rows)
c = a(2,:) % Select row of matrix 'a' (all columns)

d = [7 8; 8 9; 10 11; 12 13]; % 4 rows 2 columns
d(2:3,:); %Select from rows 2 to 3 (all columns)

octave-3.2.4.exe:41> d
d =
7 8
8 9
10 11
12 13
octave-3.2.4.exe:43> d(2:3,:)
ans =
8 9
10 11

%% Appending rows to matrix
a = [ 4 5; 5 6; 5 7; 9 8]; % 4 x 2
b = [ 1 3; 2 4]; % 2 x 2
c = [ a; b] % stack a over b
d = [b ; a] % stack b over a*b

octave-3.2.4.exe:44> a = [ 4 5; 5 6; 5 7; 9 8] % 4 x 2
a =
4 5
5 6
5 7
9 8

octave-3.2.4.exe:45> b = [ 1 3; 2 4] % 2 x 2
b =
1 3
2 4

octave-3.2.4.exe:46> c = [ a; b] % stack a over b
c =
4 5
5 6
5 7
9 8
1 3
2 4

octave-3.2.4.exe:47> d = [b ; a] % stack b over a*b
d =
1 3
2 4
4 5
5 6
5 7
9 8

%% Appending columns
a = [ 1 2 3; 3 4 5]; b = [ 1 2; 3 4];
c = [a b];
d = [b a];

octave-3.2.4.exe:48> a = [ 1 2 3; 3 4 5]
a =
1 2 3
3 4 5

octave-3.2.4.exe:49> b = [ 1 2; 3 4]
b =
1 2
3 4

octave-3.2.4.exe:50> c = [a b]
c =
1 2 3 1 2
3 4 5 3 4

octave-3.2.4.exe:51> d = [b a]
d =
1 2 1 2 3
3 4 3 4 5
%%Size of a matrix
[c d ] = size(a);

Creating a matrix of all zeros or ones
d = ones(3,2);
e = zeros(4,3);

%Appending an intercept term to a matrix
a = [1 2 3; 4 5 6]; %2 x 3
b = ones(2,1);
a = [b a];

%% Plotting
Creating 2 vectors
x = [1 3 4 5 6];
y = [5 6 7 8 9];
plot(x,y);

%%Create labels
xlabel("X values); ylabel("Y values);
axis([1 10 4 10]); % Set the range of x and y
title("Test plot);

%%Creating a 3D scatter plot
If we have 3 column csv file then we can load the data as follows
data = load('values.csv');
X = data(:, 1:2);
y = data(:, 3);
scatter3(X(:,1),X(:,2),y,[],[240 15 15],'x'); % X(:,1) - x axis X(:,2) - yaxis y[] - z axis

%% Drawing a 3D mesh
x = linspace(0,xrange + 20,10);
y = linspace(1,yrange+ 20,10);
[XX, YY ] = meshgrid(x,y);

[a b] = size(XX)

Draw the mesh
for i=1:a,
for j= 1:b,
ZZ(i,j) = [1 (XX(i,j)-mu(1))/sigma(1) (YY(i,j) - mu(2))/sigma(2) ] * theta;
end;
end;
mesh(XX,YY,ZZ);

For more details please see post Informed choices using Machine Learning 2- Pitting Kumble, Kapil and B S Chandra
kapil-2

%% Creating different polynomial equations
Let X be a feature vector
then
X = [X X.^2 X^3] %X X^2 X^3

This can be created using a for loop as follows
for i= 1:n
xtemp = xinput .^i;
x = [x xtemp];
end;

 

Finally while doing multivariate regression if we wanted to create polynomial terms of higher we could do as follows. Let us say we have a feature vector X made of 3 features x1, x2,

Let us say we wanted to create a polynomial of the form x1^2 x1.x2 x2^2 then we could create X as

X = [X(:,1) .^2 X(:,1) . X(:,2) X(:,2) .^2]

As you can see Octave is really powerful language for Machine Learning and has just a few handful of constructs with which one can implement powerful Machine Learning algorithms

Applying the principles of Machine Learning

While working with multivariate regression there are certain essential principles that must be applied to ensure the correctness of the solution while being able to pick the most optimum solution. This is all the more important when the problem has a large number of features. In this post I apply these important principles to a regression data set which I was able to pull of the internet. This data set was taken from the UCI Machine Learning repository and deals with Boston housing data.  The housing data provides the cost of house in Boston suburbs given the number of rooms, the connectivity to main highways, and crime rate in the area and several other data.  There are a total of 506 data points in this data set with a total of 13 features.

This seemed a reasonable dataset to start to try out the principles of Machine Learning I had picked up from Coursera’s ML course.

Out of a total of 13 features 2 features ’ZN’ and ‘CHAS’ proximity to  Charles river were dropped as the values were mostly zero in these columns . The remaining 11 features were used to map to the output variable of the price.

The following key rules have been applied on the

  • The dataset was divided into training samples (60%), cross-validation set (20%) and test set (20%) using a random index
  • Try out different polynomial functions while performing gradient descent to determine the theta values
  • Different combinations of ‘alpha’ learning rate and ‘lambda’ the regularization parameter were tried while performing gradient descent
  • The error rate is then calculated on the cross-validation and test set
  • The theta values that were obtained for the lowest cost for a polynomial was used to compute and plot the learning curve for the different polynomials against increasing number of training and cross-validation test samples to check for bias and variance.
  • The plot of the cost versus the polynomial degree was plotted to obtain the best fit polynomial for the data set.

A multivariate regression hypothesis can be represented as

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + …
And the cost can is determined as
J(θ0, θ1, θ2, θ3..) = 1/2m ∑ (hΘ (xi) -yi)2
The implementation was done using Octave. As in my previous posts some functions have not been include to comply with Coursera’s Honor Code. The code can be cloned from GitHub at machine-learning-principles

a) housing compute.m. In this module I perform gradient descent for different polynomial degrees and check the error that is obtained when using the computed theta on the cross validation and test set

max_degrees =4;
J_history = zeros(max_degrees, 1);
Jcv_history = zeros(max_degrees, 1);
for degree = 1:max_degrees;
[J Jcv alpha lambda] = train_samples(randidx, training,cross_validation,test_data,degree);
end;

b) train_samples.m – This module uses gradient descent to check the best fit for a given polynomial degree for different combinations of alpha (learning rate) and lambda( regularization).

for i = 1:length(alpha_arr),
for j = 1:length(lambda_arr)
alpha = alpha_arr{i};
lambda= lambda_arr{j};
% Perform Gradient descent
% Compute error for training sample for computed theta values
% Compute the error rate for the cross validation samples
% Compute the error rate against the test set
end;
end;

c) cross_validation.m – This module uses the theta values to compute cost for the cross validation set

d) test-samples.m – This modules computes the error when using the trained theta on the test set

e) poly.m – This module constructs polynomial vectors based on the degree as follows
function [x] = poly(xinput, n)
x = [];
for i= 1:n
xtemp = xinput .^i;
x = [x xtemp];
end;

e) learning_curve.m – The learning curve module plots the error rate for increasing number of training and cross validation samples. This is done as follows. For the theta with the lowest cost as determined by gradient descent
for i from 1 to 100

  • Compute the error for ‘i’ training samples
  • Compute the error for ‘i’ cross-validation
  • Plot the learning curve to determine the bias and variance of the polynomial fit

This is included below
for i = 1: 100
xsample = xtrain(1:i,:);
ysample = ytrain(1:i,:);
size(xsample);
size(ysample);
[xsample] = poly(xsample,degree);
xsample= [ones(i, 1) xsample];
[c d] = size(xsample);
theta = zeros(d, 1);
% Minimize using fmincg
J = computeCost(xsample, ysample, theta);
Jtrain(i) = J;
xsample_cv = xcv(1:i,:);
ysample_cv = ycv(1:i,:);
[xsample_cv] = poly(xsample_cv,degree);
xsample_cv= [ones(i, 1) xsample_cv];
J_cv = computeCost(xsample_cv, ysample_cv,theta)
Jcv(i) = J_cv;
end;

Finally a plot is done been different lambda and the cost.

The results are included below

A) Polynomial degree 1
Convergence graph
convergence-1

Learning curve
learning-curve-1

The above figure does show a stronger bias. Note: the learning curve was done with around 100 samples
B) Polynomial degree 2

Convergence graph
convergence-2

Learning curve
learning-curve-2

The learning curve for degree 2 shows a stronger variance.

C) Polynomial degree 3
Convergence graph

convergence-3

Learning curve
learning-curve-3

D) Polynomial degree 4
Convergence graph
convergence-4

E) Learning curve
learning-curve-4

This plot is useful to determine which polynomial degree will give the best fit for the dataset and the lowest cost

degree-cost-1

Clearly from the above it can be seen that degree 2 will give a good fit for the data set.

F) Lambda vs Cost function
lambda-cost-1

The above code demonstrates some key principles while performing multivariate regression
The code can be cloned from GitHub at machine-learning-principles

Informed choices through Machine Learning-2: Pitting together Kumble, Kapil, Chandra

Continuing my earlier ‘innings’, of test driving my knowledge in Machine Learning acquired via Coursera,  I now turn my attention towards the bowling performances of our Indian bowling heroes. In this post I give a slightly different ‘spin’ to the bowling analysis and hope I can ‘swing’ your opinion based on my assessment.

I guess that is enough of my cricketing ‘double-speak’ for now and I will get down to the real business of my bowling analysis!

If you are passionate about cricket, and love analyzing cricket performances, then check out my 2 racy books on cricket! In my books, I perform detailed yet compact analysis of performances of both batsmen, bowlers besides evaluating team & match performances in Tests , ODIs, T20s & IPL. You can buy my books on cricket from Amazon at $12.99 for the paperback and $4.99/$6.99 respectively for the kindle versions. The books can be accessed at Cricket analytics with cricketr  and Beaten by sheer pace-Cricket analytics with yorkr  A must read for any cricket lover! Check it out!!

1

 

As in my earlier post Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid ,the first part of the post has my analyses and the latter part has the details of the implementation of the algorithm. Feel free to read the first part and either scan or skip the latter.

To perform this analysis I have skipped the data on our recent crop of new bowlers. The reason being that data is scant on these bowlers, besides they also seem to have a relatively shorter shelf life (hope there are a couple of finds in this Australian tour of Dec 2014). For the analyses I have chosen B S Chandrasekhar, Kapil Dev Anil Kumble. My rationale as to why I chose the above 3

B S Chandrasekhar also known as “Chandra’ was one of the most lethal leg spinners in the late 1970’s. He had a very dangerous combination of fast leg breaks, searing tops spins interspersed with the  occasional googly. On many occasions he would leave most batsmen completely clueless.

Kapil Nikhanj Dev, the Haryana Hurricane who could outwit the most technically sound batsmen  through some really clever bowling. His variations were almost always effective and he would achieve the vital breakthrough outsmarting the opponent.

And finally Anil Kumble, I chose Kumble because in my opinion he is truly the embodiment of the ‘thinking’ bowler. Many times I have seen Kumble repeatedly beat batsmen. It was like he was telling the batsman ‘check’ as he bowled faster leg breaks, flippers, a straighter delivery or top spins before finally crashing into the wickets or trapping the batsmen. It felt he was saying ‘checkmate dude!’

I have taken the data for the 3 bowlers from ESPN Cricinfo. Only the Test matches were considered for the analyses. All tests against all oppositions both at home and away were included

The assumptions taken and basis of the computation is included below
a.The data is based on the following 2 input variables a) Overs bowled b) Runs given. The output variable is ‘Wickets taken’

b.To my surprise I found that in the late 1970’s when BS Chandrasekhar used to bowl, an over had 8 balls for matches in Australia. So, I had to normalize this data for Chandra to make it on par with the others. Hence for Chandra where the overs were made up of 8 balls the overs was calculated as follows
Overs (O) = (Overs * 8)/6

c.The Economy rate E was calculated as below
E = Overs/runs was chosen as input variable to take into account fewer runs given by the bowler

d.The output variable was re-calculated as Strike Rate (SR) to determine the ‘bowling effectiveness’
Strike Rate = Wickets/Overs
(not be confused with a batsman’s strike rate batsman strike rate = runs/ balls faced)

e.Hence the analysis is based on
f(O,E) = SR
An outline of the Octave code and the data used can be cloned from GitHub at ml-bowling-analyze

 1. Surface of Bowling Effectiveness (SBE)
In my earlier post I was able to fit a ‘prediction plane’ based on the minutes at crease, balls faced versus the runs scored. But in this case a plane did not make sense as the wickets can only range from 0 – 10 and in most cases averaging between 3 and 5. So I plot the best fitting 3-D surface over the predicted hypothesis function. The steps performed are

1) The data for the different  bowlers were cleaned with data which indicated (DNB – Did not bowl)
2) The Economy Rate (E) = Runs given/Overs and Strike Rate(SR) = Wickets/overs were calculated.
3) The product of Overs (O), and Economy(E) were stored as Over_Economy(OE)
4) The hypothesis function was computed as h(O, E, OE) = y
5) Theta was calculated using the Normal Equation. The Surface of Bowling Effectiveness( SBE) was then plotted. The plots for each of the bowler is shown below

Here are the plots

A) Anil Kumble
The  data of Kumble, based on Overs bowled & Economy rate versus the Strike Rate is plotted as a 3-D scatter plot (pink crosses). The best fit as determined by solving the optimum theta using the Normal Equation is plotted as 3-D surface shown below.
kumble-1
The 3-D surface is what I have termed as ‘Surface of Bowling Effectiveness (SBE)’ as it depicts bowlers overall effectiveness as it plots the overs (O), ‘economy rate’ E against predicted ‘strike rate’ SR.
Here is another view
kumble-2
The theta values obtained for Kumble are
Theta =
0.104208
-0.043769
-0.016305
0.011949

And the cost at this theta is
Cost Function J = 0.0046269

B) B S Chandrasekhar
Here are the best optimal surface plot for Chandra with the data on O,E vs SR plotted as a 3D scatter plot.  Note: The dataset for  Chandrasekhar is smaller compared to the other two.
chandra-1Another view for Chandra
chandra-2

Theta values for B S Chandrasekhar are
Theta =
0.095780
-0.025377
-0.024847
0.023415
and the cost is
Cost Function J = 0.0032980

c) Kapil Dev
The plots  for Kapil
kapil-1
Another view of SBE for Kapil
kapil-2
The Theta values and cost function for Kapil are
Theta =
0.090219
0.027725
0.023894
-0.021434
Cost Function J = 0.0035123

2. Predicting wickets
In the previous section the optimum theta with the lowest Cost Function J was calculated. Based on the value of theta, the wickets that will be taken by a bowler can be computed as the product of the hypothesis function and theta. i.e.

y= h(x) * theta  => Strike Rate (SR) = [1 O E OE] * theta
Now predicted wickets can be calculated as

wickets = Strike rate(SR) * Overs(O)
This is done  for Kumble, Chandra and Kapil  for different combinations of Overs(O) and Economy(E) rate.

Here are the results
Predicted wickets for Anil Kumble
The plot of predicted wickets for Kumble is represented below
kumble-wickets-1
This can also be represented as a a table
kumble-wkts-tbl

Predicted wickets for B S Chandrasekhar
chandra-wickets-1
The table for Chandra
chandra-wkts-tbl
 Predicted wickets for Kapil Dev

The plot
kapil-wicket-2

The predicted table from the hypothesis function for Kapil Dev
kapil-wkts-tbl

Observation: A closer look at  the predicted wickets for Kapil, Kumble and B S Chandra shows an interesting aspect. The predicted number of wickets is higher for lower economy rates. With a little thought we can see bowlers on turning or pitches with a lot of movement can not only be more economical but can also be destructive and take a lot of wickets. Hence the higher wickets for lower economy rates!

Implementation details
In this post I have used the Normal Equation to get the optimal values of theta for local minimum of the Gradient function.  As mentioned above when I had run the 3D scatter plot fitting a 2D plane did not seem quite right. So I had to experiment with different polynomial equations first trying 2nd order, 3rd order and also the sqrt

I tried the following where ‘O is Overs, ‘E’ stands for Economy Rate and ‘SR’ the predicated Strike rate. Theta is the computed theta from the Normal Equation. The notation in  Matrix notation is shown below

i) A linear plane
SR = [1 O E] * theta

ii) Using the sqrt function
SR = [1 sqrt(O) sqrt(E)]  * theta

iii) Using 2nd order plynomial
SR = [1 O^2 E^2] * theta

iv) Using the 3rd order polynomial
SR = [1 O^3 E^3] * theta

v) Before finally settling on
SR = [1 O E OE] * theta

where OE  = O .* E

The last one seemed to give me the lowest cost and also seemed the most logical visual choice.

A good resource to play around with different functions and check out the shapes of combinations of variables and polynomial order of equation is at WolframAlpha: Plotting and Graphics

Note 1: The gradient descent with the Normal Equation has been performed on the entire data set (approx 220 for Kumble & Kapil) and 99 for Chandra. The proper process for verifying a Machine Learning algorithm is to split the data set into (60% training data, 20% cross validation data and 20% as the test set).  We need to validate the prediction function against the cross-validation set, fine tune it and finally ensure that it  fits  the test set samples well.  However, this split was not done as the data set itself was very low. The entire data set was used to perform the optimal surface fit

Note 2: The optimal theta values have been chosen with a feature vector that is of the form
[1 x y x .* y] The Surface of  Bowling Effectiveness’ has been plotted above. It may appear that there is a’high bias’ in the fit and an even better fit could be obtained by choosing higher order polynomials like
[1 x y x*y x^2 y^2 (x^2) .* y x  .* (y^2)] or
[1 x y x*y x^2 y^2 x^3 y^3]  etc
While we can get a better fit we could run into the problem of ‘high variance; and without the cross validation and test set we will not be able to verify the results, Hence the simpler option [1 x y x*y] was chosen

The Octave code outline and the data used can be cloned from GitHub at ml-bowling-analyze

 Conclusion:

1) Predicted wickets: The predicted number of wickets is higher at lower economy rates
2) Comparing performances: There are different ways of looking at the results. One possible way is to check for a particular number of overs and economy rate who is most effective. Here is one way. Taking a small slice from each bowler’s predicted wickets table for anm Economy Rate=4.0 the predicted wickets are

comp

From the above it does appear that Kapil is definitely more effective than the other two. However one could slice and dice in different ways, maybe the most economical for a given numbers and wickets combination or wickets taken in the least overs etc. Do add your thoughts. comments on my assessment or analysis

Also see
1. Analyzing cricket’s batting legends – Through the mirage with R
2. Masters of spin: Unraveling the web with R

You may also like
1. A peek into literacy in India:Statistical learning with R
2. A crime map of India in R: Crimes against women
3.  What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1