# Practical Machine Learning with R and Python – Part 3

In this post ‘Practical Machine Learning with R and Python – Part 3’,  I discuss ‘Feature Selection’ methods. This post is a continuation of my 2 earlier posts

While applying Machine Learning techniques, the data set will usually include a large number of predictors for a target variable. It is quite likely, that not all the predictors or feature variables will have an impact on the output. Hence it is becomes necessary to choose only those features which influence the output variable thus simplifying  to a reduced feature set on which to train the ML model on. The techniques that are used are the following

• Best fit
• Forward fit
• Backward fit
• Ridge Regression or L2 regularization
• Lasso or L1 regularization

This post includes the equivalent ML code in R and Python.

All these methods remove those features which do not sufficiently influence the output. As in my previous 2 posts on “Practical Machine Learning with R and Python’, this post is largely based on the topics in the following 2 MOOC courses
1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford
2. Applied Machine Learning in Python Prof Kevyn-Collin Thomson, University Of Michigan, Coursera

You can download this R Markdown file and the associated data from Github – Machine Learning-RandPython-Part3.

1. Machine Learning in plain English-Part 1
2. Machine Learning in plain English-Part 2
3. Machine Learning in plain English-Part 3

Check out my compact and minimal book  “Practical Machine Learning with R and Python:Third edition- Machine Learning in stereo”  available in Amazon in paperback($12.99) and kindle($8.99) versions. My book includes implementations of key ML algorithms and associated measures and metrics. The book is ideal for anybody who is familiar with the concepts and would like a quick reference to the different ML algorithms that can be applied to problems and how to select the best model. Pick your copy today!!

1.1 Best Fit

For a dataset with features f1,f2,f3…fn, the ‘Best fit’ approach, chooses all possible combinations of features and creates separate ML models for each of the different combinations. The best fit algotithm then uses some filtering criteria based on Adj Rsquared, Cp, BIC or AIC to pick out the best model among all models.

Since the Best Fit approach searches the entire solution space it is computationally infeasible. The number of models that have to be searched increase exponentially as the number of predictors increase. For ‘p’ predictors a total of $2^{p}$ ML models have to be searched. This can be shown as follows

There are $C_{1}$ ways to choose single feature ML models among ‘n’ features, $C_{2}$ ways to choose 2 feature models among ‘n’ models and so on, or
$1+C_{1} + C_{2} +... + C_{n}$
= Total number of models in Best Fit.  Since from Binomial theorem we have
$(1+x)^{n} = 1+C_{1}x + C_{2}x^{2} +... + C_{n}x^{n}$
When x=1 in the equation (1) above, this becomes
$2^{n} = 1+C_{1} + C_{2} +... + C_{n}$

Hence there are $2^{n}$ models to search amongst in Best Fit. For 10 features this is $2^{10}$ or ~1000 models and for 40 features this becomes $2^{40}$ which almost 1 trillion. Usually there are datasets with 1000 or maybe even 100000 features and Best fit becomes computationally infeasible.

Anyways I have included the Best Fit approach as I use the Boston crime datasets which is available both the MASS package in R and Sklearn in Python and it has 13 features. Even this small feature set takes a bit of time since the Best fit needs to search among ~$2^{13}= 8192$  models

Initially I perform a simple Linear Regression Fit to estimate the features that are statistically insignificant. By looking at the p-values of the features it can be seen that ‘indus’ and ‘age’ features have high p-values and are not significant

# 1.1a Linear Regression – R code

source('RFunctions-1.R')
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")
dim(df1)
## [1] 506  14
# Linear Regression fit
fit <- lm(cost~. ,data=df1)
summary(fit)
##
## Call:
## lm(formula = cost ~ ., data = df1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.595  -2.730  -0.518   1.777  26.199
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crimeRate    -1.080e-01  3.286e-02  -3.287 0.001087 **
## zone          4.642e-02  1.373e-02   3.382 0.000778 ***
## indus         2.056e-02  6.150e-02   0.334 0.738288
## charles       2.687e+00  8.616e-01   3.118 0.001925 **
## nox          -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rooms         3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age           6.922e-04  1.321e-02   0.052 0.958229
## distances    -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## highways      3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax          -1.233e-02  3.760e-03  -3.280 0.001112 **
## teacherRatio -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## color         9.312e-03  2.686e-03   3.467 0.000573 ***
## status       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Next we apply the different feature selection models to automatically remove features that are not significant below

# 1.1a Best Fit – R code

The Best Fit requires the ‘leaps’ R package

library(leaps)
source('RFunctions-1.R')
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")

# Perform a best fit
bestFit=regsubsets(cost~.,df1,nvmax=13)

# Generate a summary of the fit
bfSummary=summary(bestFit)

# Plot the Residual Sum of Squares vs number of variables
plot(bfSummary$rss,xlab="Number of Variables",ylab="RSS",type="l",main="Best fit RSS vs No of features") # Get the index of the minimum value a=which.min(bfSummary$rss)
# Mark this in red
points(a,bfSummary$rss[a],col="red",cex=2,pch=20) The plot below shows that the Best fit occurs with all 13 features included. Notice that there is no significant change in RSS from 11 features onward. # Plot the CP statistic vs Number of variables plot(bfSummary$cp,xlab="Number of Variables",ylab="Cp",type='l',main="Best fit Cp vs No of features")
# Find the lowest CP value
b=which.min(bfSummary$cp) # Mark this in red points(b,bfSummary$cp[b],col="red",cex=2,pch=20)

Based on Cp metric the best fit occurs at 11 features as seen below. The values of the coefficients are also included below

# Display the set of features which provide the best fit
coef(bestFit,b)
##   (Intercept)     crimeRate          zone       charles           nox
##  36.341145004  -0.108413345   0.045844929   2.718716303 -17.376023429
##         rooms     distances      highways           tax  teacherRatio
##   3.801578840  -1.492711460   0.299608454  -0.011777973  -0.946524570
##         color        status
##   0.009290845  -0.522553457
#  Plot the BIC value
plot(bfSummary$bic,xlab="Number of Variables",ylab="BIC",type='l',main="Best fit BIC vs No of Features") # Find and mark the min value c=which.min(bfSummary$bic)
points(c,bfSummary$bic[c],col="red",cex=2,pch=20) # R has some other good plots for best fit plot(bestFit,scale="r2",main="Rsquared vs No Features") R has the following set of really nice visualizations. The plot below shows the Rsquared for a set of predictor variables. It can be seen when Rsquared starts at 0.74- indus, charles and age have not been included. plot(bestFit,scale="Cp",main="Cp vs NoFeatures") The Cp plot below for value shows indus, charles and age as not included in the Best fit plot(bestFit,scale="bic",main="BIC vs Features") ## 1.1b Best fit (Exhaustive Search ) – Python code The Python package for performing a Best Fit is the Exhaustive Feature Selector EFS. import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS # Read the Boston crime data df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] # Set X and y X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] # Perform an Exhaustive Search. The EFS and SFS packages use 'neg_mean_squared_error'. The 'mean_squared_error' seems to have been deprecated. I think this is just the MSE with the a negative sign. lr = LinearRegression() efs1 = EFS(lr, min_features=1, max_features=13, scoring='neg_mean_squared_error', print_progress=True, cv=5) # Create a efs fit efs1 = efs1.fit(X.as_matrix(), y.as_matrix()) print('Best negtive mean squared error: %.2f' % efs1.best_score_) ## Print the IDX of the best features print('Best subset:', efs1.best_idx_) Features: 8191/8191Best negtive mean squared error: -28.92 ## ('Best subset:', (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)) The indices for the best subset are shown above. # 1.2 Forward fit Forward fit is a greedy algorithm that tries to optimize the feature selected, by minimizing the selection criteria (adj Rsqaured, Cp, AIC or BIC) at every step. For a dataset with features f1,f2,f3…fn, the forward fit starts with the NULL set. It then pick the ML model with a single feature from n features which has the highest adj Rsquared, or minimum Cp, BIC or some such criteria. After picking the 1 feature from n which satisfies the criteria the most, the next feature from the remaining n-1 features is chosen. When the 2 feature model which satisfies the selection criteria the best is chosen, another feature from the remaining n-2 features are added and so on. The forward fit is a sub-optimal algorithm. There is no guarantee that the final list of features chosen will be the best among the lot. The computation required for this is of $n + n-1 + n -2 + .. 1 = n(n+1)/2$ which is of the order of $n^{2}$. Though forward fit is a sub optimal solution it is far more computationally efficient than best fit ## 1.2a Forward fit – R code Forward fit in R determines that 11 features are required for the best fit. The features are shown below library(leaps) # Read the data df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL # Rename the columns names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Select columns df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") #Split as training and test train_idx <- trainTestSplit(df1,trainPercent=75,seed=5) train <- df1[train_idx, ] test <- df1[-train_idx, ] # Find the best forward fit fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="forward") # Compute the MSE valErrors=rep(NA,13) test.mat=model.matrix(cost~.,data=test) for(i in 1:13){ coefi=coef(fitFwd,id=i) pred=test.mat[,names(coefi)]%*%coefi valErrors[i]=mean((test$cost-pred)^2)
}

# Plot the Residual Sum of Squares
plot(valErrors,xlab="Number of Variables",ylab="Validation Error",type="l",main="Forward fit RSS vs No of features")
# Gives the index of the minimum value
a<-which.min(valErrors)
print(a)
## [1] 11
# Highlight the smallest value
points(c,valErrors[a],col="blue",cex=2,pch=20)

Forward fit R selects 11 predictors as the best ML model to predict the ‘cost’ output variable. The values for these 11 predictors are included below

#Print the 11 ccoefficients
coefi=coef(fitFwd,id=i)
coefi
##   (Intercept)     crimeRate          zone         indus       charles
##  2.397179e+01 -1.026463e-01  3.118923e-02  1.154235e-04  3.512922e+00
##           nox         rooms           age     distances      highways
## -1.511123e+01  4.945078e+00 -1.513220e-02 -1.307017e+00  2.712534e-01
##           tax  teacherRatio         color        status
## -1.330709e-02 -8.182683e-01  1.143835e-02 -3.750928e-01

## 1.2b Forward fit with Cross Validation – R code

The Python package SFS includes N Fold Cross Validation errors for forward and backward fit so I decided to add this code to R. This is not available in the ‘leaps’ R package, however the implementation is quite simple. Another implementation is also available at Statistical Learning, Prof Trevor Hastie & Prof Robert Tibesherani, Online Stanford 2.

library(dplyr)
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")

# Select columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")

set.seed(6)
# Set max number of features
nvmax<-13
cvError <- NULL
# Loop through each features
for(i in 1:nvmax){
# Set no of folds
noFolds=5
# Create the rows which fall into different folds from 1..noFolds
folds = sample(1:noFolds, nrow(df1), replace=TRUE)
cv<-0
# Loop through the folds
for(j in 1:noFolds){
# The training is all rows for which the row is != j (k-1 folds -> training)
train <- df1[folds!=j,]
# The rows which have j as the index become the test set
test <- df1[folds==j,]
# Create a forward fitting model for this
fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="forward")
# Select the number of features and get the feature coefficients
coefi=coef(fitFwd,id=i)
#Get the value of the test data
test.mat=model.matrix(cost~.,data=test)
# Multiply the tes data with teh fitted coefficients to get the predicted value
# pred = b0 + b1x1+b2x2... b13x13
pred=test.mat[,names(coefi)]%*%coefi
# Compute mean squared error
rss=mean((test$cost - pred)^2) # Add all the Cross Validation errors cv=cv+rss } # Compute the average of MSE for K folds for number of features 'i' cvError[i]=cv/noFolds } a <- seq(1,13) d <- as.data.frame(t(rbind(a,cvError))) names(d) <- c("Features","CVError") #Plot the CV Error vs No of Features ggplot(d,aes(x=Features,y=CVError),color="blue") + geom_point() + geom_line(color="blue") + xlab("No of features") + ylab("Cross Validation Error") + ggtitle("Forward Selection - Cross Valdation Error vs No of Features") Forward fit with 5 fold cross validation indicates that all 13 features are required # This gives the index of the minimum value a=which.min(cvError) print(a) ## [1] 13 #Print the 13 coefficients of these features coefi=coef(fitFwd,id=a) coefi ## (Intercept) crimeRate zone indus charles ## 36.650645380 -0.107980979 0.056237669 0.027016678 4.270631466 ## nox rooms age distances highways ## -19.000715500 3.714720418 0.019952654 -1.472533973 0.326758004 ## tax teacherRatio color status ## -0.011380750 -0.972862622 0.009549938 -0.582159093 ## 1.2c Forward fit – Python code The Backward Fit in Python uses the Sequential feature selection (SFS) package (SFS)(https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/) Note: The Cross validation error for SFS in Sklearn is negative, possibly because it computes the ‘neg_mean_squared_error’. The earlier ‘mean_squared_error’ in the package seems to have been deprecated. I have taken the -ve of this neg_mean_squared_error. I think this would give mean_squared_error. import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.datasets import load_boston from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs import matplotlib.pyplot as plt from mlxtend.feature_selection import SequentialFeatureSelector as SFS from sklearn.linear_model import LinearRegression df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] lr = LinearRegression() # Create a forward fit model sfs = SFS(lr, k_features=(1,13), forward=True, # Forward fit floating=False, scoring='neg_mean_squared_error', cv=5) # Fit this on the data sfs = sfs.fit(X.as_matrix(), y.as_matrix()) # Get all the details of the forward fits a=sfs.get_metric_dict() n=[] o=[] # Compute the mean cross validation scores for i in np.arange(1,13): n.append(-np.mean(a[i]['cv_scores'])) m=np.arange(1,13) # Get the index of the minimum CV score # Plot the CV scores vs the number of features fig1=plt.plot(m,n) fig1=plt.title('Mean CV Scores vs No of features') fig1.figure.savefig('fig1.png', bbox_inches='tight') print(pd.DataFrame.from_dict(sfs.get_metric_dict(confidence_interval=0.90)).T) idx = np.argmin(n) print "No of features=",idx #Get the features indices for the best forward fit and convert to list b=list(a[idx]['feature_idx']) print(b) # Index the column names. # Features from forward fit print("Features selected in forward fit") print(X.columns[b]) ## avg_score ci_bound cv_scores \ ## 1 -42.6185 19.0465 [-23.5582499971, -41.8215743748, -73.993608929... ## 2 -36.0651 16.3184 [-18.002498199, -40.1507894517, -56.5286659068... ## 3 -34.1001 20.87 [-9.43012884381, -25.9584955394, -36.184188174... ## 4 -33.7681 20.1638 [-8.86076528781, -28.650217633, -35.7246353855... ## 5 -33.6392 20.5271 [-8.90807628524, -28.0684679108, -35.827463022... ## 6 -33.6276 19.0859 [-9.549485942, -30.9724602876, -32.6689523347,... ## 7 -32.4082 19.1455 [-10.0177149635, -28.3780298492, -30.926917231... ## 8 -32.3697 18.533 [-11.1431684243, -27.5765510172, -31.168994094... ## 9 -32.4016 21.5561 [-10.8972555995, -25.739780653, -30.1837430353... ## 10 -32.8504 22.6508 [-12.3909282079, -22.1533250755, -33.385407342... ## 11 -34.1065 24.7019 [-12.6429253721, -22.1676650245, -33.956999528... ## 12 -35.5814 25.693 [-12.7303397453, -25.0145323483, -34.211898373... ## 13 -37.1318 23.2657 [-12.4603005692, -26.0486211062, -33.074137979... ## ## feature_idx std_dev std_err ## 1 (12,) 18.9042 9.45212 ## 2 (10, 12) 16.1965 8.09826 ## 3 (10, 12, 5) 20.7142 10.3571 ## 4 (10, 3, 12, 5) 20.0132 10.0066 ## 5 (0, 10, 3, 12, 5) 20.3738 10.1869 ## 6 (0, 3, 5, 7, 10, 12) 18.9433 9.47167 ## 7 (0, 2, 3, 5, 7, 10, 12) 19.0026 9.50128 ## 8 (0, 1, 2, 3, 5, 7, 10, 12) 18.3946 9.19731 ## 9 (0, 1, 2, 3, 5, 7, 10, 11, 12) 21.3952 10.6976 ## 10 (0, 1, 2, 3, 4, 5, 7, 10, 11, 12) 22.4816 11.2408 ## 11 (0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12) 24.5175 12.2587 ## 12 (0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12) 25.5012 12.7506 ## 13 (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) 23.0919 11.546 ## No of features= 7 ## [0, 2, 3, 5, 7, 10, 12] ## ################################################################################# ## Features selected in forward fit ## Index([u'crimeRate', u'indus', u'chasRiver', u'rooms', u'distances', ## u'teacherRatio', u'status'], ## dtype='object') ## 1.3 Backward Fit Backward fit belongs to the class of greedy algorithms which tries to optimize the feature set, by dropping a feature at every stage which results in the worst performance for a given criteria of Adj RSquared, Cp, BIC or AIC. For a dataset with features f1,f2,f3…fn, the backward fit starts with the all the features f1,f2.. fn to begin with. It then pick the ML model with a n-1 features by dropping the feature,$f_{j}$, for e.g., the inclusion of which results in the worst performance in adj Rsquared, or minimum Cp, BIC or some such criteria. At every step 1 feature is dopped. There is no guarantee that the final list of features chosen will be the best among the lot. The computation required for this is of $n + n-1 + n -2 + .. 1 = n(n+1)/2$ which is of the order of $n^{2}$. Though backward fit is a sub optimal solution it is far more computationally efficient than best fit ## 1.3a Backward fit – R code library(dplyr) # Read the data df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL # Rename the columns names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Select columns df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") set.seed(6) # Set max number of features nvmax<-13 cvError <- NULL # Loop through each features for(i in 1:nvmax){ # Set no of folds noFolds=5 # Create the rows which fall into different folds from 1..noFolds folds = sample(1:noFolds, nrow(df1), replace=TRUE) cv<-0 for(j in 1:noFolds){ # The training is all rows for which the row is != j train <- df1[folds!=j,] # The rows which have j as the index become the test set test <- df1[folds==j,] # Create a backward fitting model for this fitFwd=regsubsets(cost~.,data=train,nvmax=13,method="backward") # Select the number of features and get the feature coefficients coefi=coef(fitFwd,id=i) #Get the value of the test data test.mat=model.matrix(cost~.,data=test) # Multiply the tes data with teh fitted coefficients to get the predicted value # pred = b0 + b1x1+b2x2... b13x13 pred=test.mat[,names(coefi)]%*%coefi # Compute mean squared error rss=mean((test$cost - pred)^2)
# Add the Residual sum of square
}
# Compute the average of MSE for K folds for number of features 'i'
cvError[i]=cv/noFolds
}
a <- seq(1,13)
d <- as.data.frame(t(rbind(a,cvError)))
names(d) <- c("Features","CVError")
# Plot the Cross Validation Error vs Number of features
ggplot(d,aes(x=Features,y=CVError),color="blue") + geom_point() + geom_line(color="blue") +
xlab("No of features") + ylab("Cross Validation Error") +
ggtitle("Backward Selection - Cross Valdation Error vs No of Features")

# This gives the index of the minimum value
a=which.min(cvError)
print(a)
## [1] 13
#Print the 13 coefficients of these features
coefi=coef(fitFwd,id=a)
coefi
##   (Intercept)     crimeRate          zone         indus       charles
##  36.650645380  -0.107980979   0.056237669   0.027016678   4.270631466
##           nox         rooms           age     distances      highways
## -19.000715500   3.714720418   0.019952654  -1.472533973   0.326758004
##           tax  teacherRatio         color        status
##  -0.011380750  -0.972862622   0.009549938  -0.582159093

Backward selection in R also indicates the 13 features and the corresponding coefficients as providing the best fit

## 1.3b Backward fit – Python code

The Backward Fit in Python uses the Sequential feature selection (SFS) package (SFS)(https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/)

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

# Create the SFS model
sfs = SFS(lr,
k_features=(1,13),
forward=False, # Backward
floating=False,
scoring='neg_mean_squared_error',
cv=5)

# Fit the model
sfs = sfs.fit(X.as_matrix(), y.as_matrix())
a=sfs.get_metric_dict()
n=[]
o=[]

# Compute the mean of the validation scores
for i in np.arange(1,13):
n.append(-np.mean(a[i]['cv_scores']))
m=np.arange(1,13)

# Plot the Validation scores vs number of features
fig2=plt.plot(m,n)
fig2=plt.title('Mean CV Scores vs No of features')
fig2.figure.savefig('fig2.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sfs.get_metric_dict(confidence_interval=0.90)).T)

# Get the index of minimum cross validation error
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best forward fit and convert to list
b=list(a[idx]['feature_idx'])
# Index the column names.
# Features from backward fit
print("Features selected in bacward fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...
## 3   -35.4992  13.9619  [-17.2329292677, -44.4178648308, -51.633177846...
## 4    -33.463  12.4081  [-20.6415333292, -37.3247852146, -47.479302977...
## 5   -33.1038  10.6156  [-20.2872309863, -34.6367078466, -45.931870352...
## 6   -32.0638  10.0933  [-19.4463829372, -33.460638577, -42.726257249,...
## 7   -30.7133  9.23881  [-19.4425181917, -31.1742902259, -40.531266671...
## 8   -29.7432  9.84468  [-19.445277268, -30.0641187173, -40.2561247122...
## 9   -29.0878  9.45027  [-19.3545569877, -30.094768669, -39.7506036377...
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...
##
##                                    feature_idx  std_dev  std_err
## 1                                        (12,)  18.9042  9.45212
## 2                                     (10, 12)  16.1965  8.09826
## 3                                  (10, 12, 7)  13.8576  6.92881
## 4                               (12, 10, 4, 7)  12.3154  6.15772
## 5                            (4, 7, 8, 10, 12)  10.5363  5.26816
## 6                         (4, 7, 8, 9, 10, 12)  10.0179  5.00896
## 7                      (1, 4, 7, 8, 9, 10, 12)  9.16981  4.58491
## 8                  (1, 4, 7, 8, 9, 10, 11, 12)  9.77116  4.88558
## 9               (0, 1, 4, 7, 8, 9, 10, 11, 12)  9.37969  4.68985
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546
## No of features= 9
## Features selected in bacward fit
## Index([u'crimeRate', u'zone', u'NO2', u'distances', u'idxHighways', u'taxRate',
##        u'teacherRatio', u'color', u'status'],
##       dtype='object')

## 1.3c Sequential Floating Forward Selection (SFFS) – Python code

The Sequential Feature search also includes ‘floating’ variants which include or exclude features conditionally, once they were excluded or included. The SFFS can conditionally include features which were excluded from the previous step, if it results in a better fit. This option will tend to a better solution, than plain simple SFS. These variants are included below

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

# Create the floating forward search
sffs = SFS(lr,
k_features=(1,13),
forward=True,  # Forward
floating=True,  #Floating
scoring='neg_mean_squared_error',
cv=5)

# Fit a model
sffs = sffs.fit(X.as_matrix(), y.as_matrix())
a=sffs.get_metric_dict()
n=[]
o=[]
# Compute mean validation scores
for i in np.arange(1,13):
n.append(-np.mean(a[i]['cv_scores']))

m=np.arange(1,13)

# Plot the cross validation score vs number of features
fig3=plt.plot(m,n)
fig3=plt.title('SFFS:Mean CV Scores vs No of features')
fig3.figure.savefig('fig3.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sffs.get_metric_dict(confidence_interval=0.90)).T)
# Get the index of the minimum CV score
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best forward floating fit and convert to list
b=list(a[idx]['feature_idx'])
print(b)

print("#################################################################################")
# Index the column names.
# Features from forward fit
print("Features selected in forward fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...
## 3   -34.1001    20.87  [-9.43012884381, -25.9584955394, -36.184188174...
## 4   -33.7681  20.1638  [-8.86076528781, -28.650217633, -35.7246353855...
## 5   -33.6392  20.5271  [-8.90807628524, -28.0684679108, -35.827463022...
## 6   -33.6276  19.0859  [-9.549485942, -30.9724602876, -32.6689523347,...
## 7   -32.1834  12.1001  [-17.9491036167, -39.6479234651, -45.470227740...
## 8   -32.0908  11.8179  [-17.4389015788, -41.2453629843, -44.247557798...
## 9   -31.0671  10.1581  [-17.2689542913, -37.4379370429, -41.366372300...
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...
##
##                                    feature_idx  std_dev  std_err
## 1                                        (12,)  18.9042  9.45212
## 2                                     (10, 12)  16.1965  8.09826
## 3                                  (10, 12, 5)  20.7142  10.3571
## 4                               (10, 3, 12, 5)  20.0132  10.0066
## 5                            (0, 10, 3, 12, 5)  20.3738  10.1869
## 6                         (0, 3, 5, 7, 10, 12)  18.9433  9.47167
## 7                      (0, 1, 2, 3, 7, 10, 12)  12.0097  6.00487
## 8                   (0, 1, 2, 3, 7, 8, 10, 12)  11.7297  5.86484
## 9                (0, 1, 2, 3, 7, 8, 9, 10, 12)  10.0822  5.04111
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546
## No of features= 9
## [0, 1, 2, 3, 7, 8, 9, 10, 12]
## #################################################################################
## Features selected in forward fit
## Index([u'crimeRate', u'zone', u'indus', u'chasRiver', u'distances',
##        u'idxHighways', u'taxRate', u'teacherRatio', u'status'],
##       dtype='object')

## 1.3d Sequential Floating Backward Selection (SFBS) – Python code

The SFBS is an extension of the SBS. Here features that are excluded at any stage can be conditionally included if the resulting feature set gives a better fit.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
lr = LinearRegression()

sffs = SFS(lr,
k_features=(1,13),
forward=False, # Backward
floating=True, # Floating
scoring='neg_mean_squared_error',
cv=5)

sffs = sffs.fit(X.as_matrix(), y.as_matrix())
a=sffs.get_metric_dict()
n=[]
o=[]
# Compute the mean cross validation score
for i in np.arange(1,13):
n.append(-np.mean(a[i]['cv_scores']))

m=np.arange(1,13)

fig4=plt.plot(m,n)
fig4=plt.title('SFBS: Mean CV Scores vs No of features')
fig4.figure.savefig('fig4.png', bbox_inches='tight')

print(pd.DataFrame.from_dict(sffs.get_metric_dict(confidence_interval=0.90)).T)

# Get the index of the minimum CV score
idx = np.argmin(n)
print "No of features=",idx
#Get the features indices for the best backward floating fit and convert to list
b=list(a[idx]['feature_idx'])
print(b)

print("#################################################################################")
# Index the column names.
# Features from forward fit
print("Features selected in backward floating fit")
print(X.columns[b])
##    avg_score ci_bound                                          cv_scores  \
## 1   -42.6185  19.0465  [-23.5582499971, -41.8215743748, -73.993608929...
## 2   -36.0651  16.3184  [-18.002498199, -40.1507894517, -56.5286659068...
## 3   -34.1001    20.87  [-9.43012884381, -25.9584955394, -36.184188174...
## 4    -33.463  12.4081  [-20.6415333292, -37.3247852146, -47.479302977...
## 5   -32.3699  11.2725  [-20.8771078371, -34.9825657934, -45.813447203...
## 6   -31.6742  11.2458  [-20.3082500364, -33.2288990522, -45.535507868...
## 7   -30.7133  9.23881  [-19.4425181917, -31.1742902259, -40.531266671...
## 8   -29.7432  9.84468  [-19.445277268, -30.0641187173, -40.2561247122...
## 9   -29.0878  9.45027  [-19.3545569877, -30.094768669, -39.7506036377...
## 10  -28.9225  9.39697  [-18.562171585, -29.968504938, -39.9586835965,...
## 11  -29.4301  10.8831  [-18.3346152225, -30.3312847532, -45.065432793...
## 12  -30.4589  11.1486  [-18.493389527, -35.0290639374, -45.1558231765...
## 13  -37.1318  23.2657  [-12.4603005692, -26.0486211062, -33.074137979...
##
##                                    feature_idx  std_dev  std_err
## 1                                        (12,)  18.9042  9.45212
## 2                                     (10, 12)  16.1965  8.09826
## 3                                  (10, 12, 5)  20.7142  10.3571
## 4                               (4, 10, 7, 12)  12.3154  6.15772
## 5                            (12, 10, 4, 1, 7)  11.1883  5.59417
## 6                        (4, 7, 8, 10, 11, 12)  11.1618  5.58088
## 7                      (1, 4, 7, 8, 9, 10, 12)  9.16981  4.58491
## 8                  (1, 4, 7, 8, 9, 10, 11, 12)  9.77116  4.88558
## 9               (0, 1, 4, 7, 8, 9, 10, 11, 12)  9.37969  4.68985
## 10           (0, 1, 4, 6, 7, 8, 9, 10, 11, 12)   9.3268   4.6634
## 11        (0, 1, 3, 4, 6, 7, 8, 9, 10, 11, 12)  10.8018  5.40092
## 12     (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)  11.0653  5.53265
## 13  (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)  23.0919   11.546
## No of features= 9
## [0, 1, 4, 7, 8, 9, 10, 11, 12]
## #################################################################################
## Features selected in backward floating fit
## Index([u'crimeRate', u'zone', u'NO2', u'distances', u'idxHighways', u'taxRate',
##        u'teacherRatio', u'color', u'status'],
##       dtype='object')

# 1.4 Ridge regression

In Linear Regression the Residual Sum of Squares (RSS) is given as

$RSS = \sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_jx_{ij})^{2}$
Ridge regularization =$\sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_jx_{ij})^{2} + \lambda \sum_{j=1}^{p}\beta^{2}$

where is the regularization or tuning parameter. Increasing increases the penalty on the coefficients thus shrinking them. However in Ridge Regression features that do not influence the target variable will shrink closer to zero but never become zero except for very large values of

Ridge regression in R requires the ‘glmnet’ package

## 1.4a Ridge Regression – R code

library(glmnet)
library(dplyr)
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
#Rename the columns
names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")
# Select specific columns
df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
"distances","highways","tax","teacherRatio","color","status","cost")

# Set X and y as matrices
X=as.matrix(df1[,1:13])
y=df1$cost # Fit a Ridge model fitRidge <-glmnet(X,y,alpha=0) #Plot the model where the coefficient shrinkage is plotted vs log lambda plot(fitRidge,xvar="lambda",label=TRUE,main= "Ridge regression coefficient shrikage vs log lambda") The plot below shows how the 13 coefficients for the 13 predictors vary when lambda is increased. The x-axis includes log (lambda). We can see that increasing lambda from $10^{2}$ to $10^{6}$ significantly shrinks the coefficients. We can draw a vertical line from the x-axis and read the values of the 13 coefficients. Some of them will be close to zero # Compute the cross validation error cvRidge=cv.glmnet(X,y,alpha=0) #Plot the cross validation error plot(cvRidge, main="Ridge regression Cross Validation Error (10 fold)") This gives the 10 fold Cross Validation Error with respect to log (lambda) As lambda increase the MSE increases ## 1.4a Ridge Regression – Python code The coefficient shrinkage for Python can be plotted like R using Least Angle Regression model a.k.a. LARS package. This is included below import numpy as np import pandas as pd import os import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1") #Rename the columns df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status","cost"] X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age", "distances","idxHighways","taxRate","teacherRatio","color","status"]] y=df['cost'] from sklearn.preprocessing import MinMaxScaler scaler = MinMaxScaler() from sklearn.linear_model import Ridge X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0) # Scale the X_train and X_test X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test) # Fit a ridge regression with alpha=20 linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train) # Print the training R squared print('R-squared score (training): {:.3f}' .format(linridge.score(X_train_scaled, y_train))) # Print the test Rsquared print('R-squared score (test): {:.3f}' .format(linridge.score(X_test_scaled, y_test))) print('Number of non-zero features: {}' .format(np.sum(linridge.coef_ != 0))) trainingRsquared=[] testRsquared=[] # Plot the effect of alpha on the test Rsquared print('Ridge regression: effect of alpha regularization parameter\n') # Choose a list of alpha values for this_alpha in [0.001,.01,.1,0, 1, 10, 20, 50, 100, 1000]: linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train) # Compute training rsquared r2_train = linridge.score(X_train_scaled, y_train) # Compute test rsqaured r2_test = linridge.score(X_test_scaled, y_test) num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0) trainingRsquared.append(r2_train) testRsquared.append(r2_test) # Create a dataframe alpha=[0.001,.01,.1,0, 1, 10, 20, 50, 100, 1000] trainingRsquared=pd.DataFrame(trainingRsquared,index=alpha) testRsquared=pd.DataFrame(testRsquared,index=alpha) # Plot training and test R squared as a function of alpha df3=pd.concat([trainingRsquared,testRsquared],axis=1) df3.columns=['trainingRsquared','testRsquared'] fig5=df3.plot() fig5=plt.title('Ridge training and test squared error vs Alpha') fig5.figure.savefig('fig5.png', bbox_inches='tight') # Plot the coefficient shrinage using the LARS package from sklearn import linear_model # ############################################################################# # Compute paths n_alphas = 200 alphas = np.logspace(0, 8, n_alphas) coefs = [] for a in alphas: ridge = linear_model.Ridge(alpha=a, fit_intercept=False) ridge.fit(X_train_scaled, y_train) coefs.append(ridge.coef_) # ############################################################################# # Display results ax = plt.gca() fig6=ax.plot(alphas, coefs) fig6=ax.set_xscale('log') fig6=ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis fig6=plt.xlabel('alpha') fig6=plt.ylabel('weights') fig6=plt.title('Ridge coefficients as a function of the regularization') fig6=plt.axis('tight') plt.savefig('fig6.png', bbox_inches='tight') ## R-squared score (training): 0.620 ## R-squared score (test): 0.438 ## Number of non-zero features: 13 ## Ridge regression: effect of alpha regularization parameter The plot below shows the training and test error when increasing the tuning or regularization parameter ‘alpha’ For Python the coefficient shrinkage with LARS must be viewed from right to left, where you have increasing alpha. As alpha increases the coefficients shrink to 0. ## 1.5 Lasso regularization The Lasso is another form of regularization, also known as L1 regularization. Unlike the Ridge Regression where the coefficients of features which do not influence the target tend to zero, in the lasso regualrization the coefficients become 0. The general form of Lasso is as follows $\sum_{i=1}^{n} (y_{i} - \beta_{0} - \sum_{j=1}^{p}\beta_jx_{ij})^{2} + \lambda \sum_{j=1}^{p}|\beta|$ ## 1.5a Lasso regularization – R code library(glmnet) library(dplyr) df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL names(df) <-c("no","crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") df1 <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age", "distances","highways","tax","teacherRatio","color","status","cost") # Set X and y as matrices X=as.matrix(df1[,1:13]) y=df1$cost

# Fit the lasso model
fitLasso <- glmnet(X,y)
# Plot the coefficient shrinkage as a function of log(lambda)
plot(fitLasso,xvar="lambda",label=TRUE,main="Lasso regularization - Coefficient shrinkage vs log lambda")

The plot below shows that in L1 regularization the coefficients actually become zero with increasing lambda

# Compute the cross validation error (10 fold)
cvLasso=cv.glmnet(X,y,alpha=0)
# Plot the cross validation error
plot(cvLasso)

This gives the MSE for the lasso model

## 1.5 b Lasso regularization – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model

scaler = MinMaxScaler()
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linlasso = Lasso(alpha=0.1, max_iter = 10).fit(X_train_scaled, y_train)

print('Non-zero features: {}'
.format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
.format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
.format(linlasso.score(X_test_scaled, y_test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

for e in sorted (list(zip(list(X), linlasso.coef_)),
key = lambda e: -abs(e[1])):
if e[1] != 0:
print('\t{}, {:.3f}'.format(e[0], e[1]))

print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

trainingRsquared=[]
testRsquared=[]
#for alpha in [0.01,0.05,0.1, 1, 2, 3, 5, 10, 20, 50]:
for alpha in [0.01,0.07,0.05, 0.1, 1,2, 3, 5, 10]:
linlasso = Lasso(alpha, max_iter = 10000).fit(X_train_scaled, y_train)
r2_train = linlasso.score(X_train_scaled, y_train)
r2_test = linlasso.score(X_test_scaled, y_test)
trainingRsquared.append(r2_train)
testRsquared.append(r2_test)

alpha=[0.01,0.07,0.05, 0.1, 1,2, 3, 5, 10]
#alpha=[0.01,0.05,0.1, 1, 2, 3, 5, 10, 20, 50]
trainingRsquared=pd.DataFrame(trainingRsquared,index=alpha)
testRsquared=pd.DataFrame(testRsquared,index=alpha)

df3=pd.concat([trainingRsquared,testRsquared],axis=1)
df3.columns=['trainingRsquared','testRsquared']

fig7=df3.plot()
fig7=plt.title('LASSO training and test squared error vs Alpha')
fig7.figure.savefig('fig7.png', bbox_inches='tight')

## Non-zero features: 7
## R-squared score (training): 0.726
## R-squared score (test): 0.561
##
## Features with non-zero weight (sorted by absolute magnitude):
##  status, -18.361
##  rooms, 18.232
##  teacherRatio, -8.628
##  taxRate, -2.045
##  color, 1.888
##  chasRiver, 1.670
##  distances, -0.529
## Lasso regression: effect of alpha regularization
## parameter on number of features kept in final model
##
## Computing regularization path using the LARS ...
## .C:\Users\Ganesh\ANACON~1\lib\site-packages\sklearn\linear_model\coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
##   ConvergenceWarning)

## 1.5c Lasso coefficient shrinkage – Python code

To plot the coefficient shrinkage for Lasso the Least Angle Regression model a.k.a. LARS package. This is shown below

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model
scaler = MinMaxScaler()
#Rename the columns
df.columns=["no","crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status","cost"]
X=df[["crimeRate","zone","indus","chasRiver","NO2","rooms","age",
"distances","idxHighways","taxRate","teacherRatio","color","status"]]
y=df['cost']
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Computing regularization path using the LARS ...")
alphas, _, coefs = linear_model.lars_path(X_train_scaled, y_train, method='lasso', verbose=True)

xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]

fig8=plt.plot(xx, coefs.T)

ymin, ymax = plt.ylim()
fig8=plt.vlines(xx, ymin, ymax, linestyle='dashed')
fig8=plt.xlabel('|coef| / max|coef|')
fig8=plt.ylabel('Coefficients')
fig8=plt.title('LASSO Path - Coefficient Shrinkage vs L1')
fig8=plt.axis('tight')
plt.savefig('fig8.png', bbox_inches='tight')
This 3rd part of the series covers the main ‘feature selection’ methods. I hope these posts serve as a quick and useful reference to ML code both for R and Python!
Stay tuned for further updates to this series!
Watch this space!

You may also like

To see all posts see Index of posts

# cricketr digs the Ashes!

Published in R bloggers: cricketr digs the Ashes

# Introduction

In some circles the Ashes is considered the ‘mother of all cricketing battles’. But, being a staunch supporter of all things Indian, cricket or otherwise, I have to say that the Ashes pales in comparison against a India-Pakistan match. After all, what are a few frowns and raised eyebrows at the Ashes in comparison to the seething emotions and reckless exuberance of Indian fans.

Anyway, the Ashes are an interesting duel and I have decided to do some cricketing analysis using my R package cricketr. For this analysis I have chosen the top 2 batsman and top 2 bowlers from both the Australian and English sides.

Batsmen

1. Steven Smith (Aus) – Innings – 58 , Ave: 58.52, Strike Rate: 55.90
2. David Warner (Aus) – Innings – 76, Ave: 46.86, Strike Rate: 73.88
3. Alistair Cook (Eng) – Innings – 208 , Ave: 46.62, Strike Rate: 46.33
4. J E Root (Eng) – Innings – 53, Ave: 54.02, Strike Rate: 51.30

Bowlers

1. Mitchell Johnson (Aus) – Innings-131, Wickets – 299, Econ Rate : 3.28
2. Peter Siddle (Aus) – Innings – 104 , Wickets- 192, Econ Rate : 2.95
3. James Anderson (Eng) – Innings – 199 , Wickets- 406, Econ Rate : 3.05
4. Stuart Broad (Eng) – Innings – 148 , Wickets- 296, Econ Rate : 3.08

It is my opinion if any 2 of the 4 in either team click then they will be able to swing the match in favor of their team.

I have interspersed the plots with a few comments. Feel free to draw your conclusions!

If you are passionate about cricket, and love analyzing cricket performances, then check out my racy book on cricket ‘Cricket analytics with cricketr and cricpy – Analytics harmony with R & Python’! This book discusses and shows how to use my R package ‘cricketr’ and my Python package ‘cricpy’ to analyze batsmen and bowlers in all formats of the game (Test, ODI and T20). The paperback is available on Amazon at $21.99 and the kindle version at$9.99/Rs 449/-. A must read for any cricket lover! Check it out!!

cks), and $4.99/Rs 320 and$6.99/Rs448 respectively

Important note 1: The latest release of ‘cricketr’ now includes the ability to analyze performances of teams now!!  See Cricketr adds team analytics to its repertoire!!!

Important note 2 : Cricketr can now do a more fine-grained analysis of players, see Cricketr learns new tricks : Performs fine-grained analysis of players

Important note 3: Do check out the python avatar of cricketr, ‘cricpy’ in my post ‘Introducing cricpy:A python package to analyze performances of cricketers

The analysis is included below. Note: This post has also been hosted at Rpubs as cricketr digs the Ashes!
You can also download this analysis as a PDF file from cricketr digs the Ashes!

Do check out my interactive Shiny app implementation using the cricketr package – Sixer – R package cricketr’s new Shiny avatar

Note: If you would like to do a similar analysis for a different set of batsman and bowlers, you can clone/download my skeleton cricketr template from Github (which is the R Markdown file I have used for the analysis below). You will only need to make appropriate changes for the players you are interested in. Just a familiarity with R and R Markdown only is needed.

Important note: Do check out my other posts using cricketr at cricketr-posts

The package can be installed directly from CRAN

if (!require("cricketr")){
install.packages("cricketr",lib = "c:/test")
}
library(cricketr)

or from Github

library(devtools)
install_github("tvganesh/cricketr")
library(cricketr)

## Analyses of Batsmen

The following plots gives the analysis of the 2 Australian and 2 English batsmen. It must be kept in mind that Cooks has more innings than all the rest put together. Smith has the best average, and Warner has the best strike rate

## Box Histogram Plot

This plot shows a combined boxplot of the Runs ranges and a histogram of the Runs Frequency

batsmanPerfBoxHist("./smith.csv","S Smith")

batsmanPerfBoxHist("./warner.csv","D Warner")

batsmanPerfBoxHist("./cook.csv","A Cook")

batsmanPerfBoxHist("./root.csv","JE Root")

## Plot os 4s, 6s and the type of dismissals

A. Steven Smith

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./smith.csv","S Smith")
batsman6s("./smith.csv","S Smith")
batsmanDismissals("./smith.csv","S Smith")

dev.off()
## null device
##           1

B. David Warner

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./warner.csv","D Warner")
batsman6s("./warner.csv","D Warner")
batsmanDismissals("./warner.csv","D Warner")

dev.off()
## null device
##           1

C. Alistair Cook

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./cook.csv","A Cook")
batsman6s("./cook.csv","A Cook")
batsmanDismissals("./cook.csv","A Cook")

dev.off()
## null device
##           1

D. J E Root

par(mfrow=c(1,3))
par(mar=c(4,4,2,2))
batsman4s("./root.csv","JE Root")
batsman6s("./root.csv","JE Root")
batsmanDismissals("./root.csv","JE Root")

dev.off()
## null device
##           1

## Relative Mean Strike Rate

In this first plot I plot the Mean Strike Rate of the batsmen. It can be Warner’s has the best strike rate (hit outside the plot!) followed by Smith in the range 20-100. Root has a good strike rate above hundred runs. Cook maintains a good strike rate.

par(mar=c(4,4,2,2))
frames <- list("./smith.csv","./warner.csv","cook.csv","root.csv")
names <- list("Smith","Warner","Cook","Root")
relativeBatsmanSR(frames,names)

## Relative Runs Frequency Percentage

The plot below show the percentage contribution in each 10 runs bucket over the entire career.It can be seen that Smith pops up above the rest with remarkable regularity.COok is consistent over the entire range.

frames <- list("./smith.csv","./warner.csv","cook.csv","root.csv")
names <- list("Smith","Warner","Cook","Root")
relativeRunsFreqPerf(frames,names)

## Moving Average of runs over career

The moving average for the 4 batsmen indicate the following 1. S Smith is the most promising. There is a marked spike in Performance. Cook maintains a steady pace and is consistent over the years averaging 50 over the years.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanMovingAverage("./smith.csv","S Smith")
batsmanMovingAverage("./warner.csv","D Warner")
batsmanMovingAverage("./cook.csv","A Cook")
batsmanMovingAverage("./root.csv","JE Root")

dev.off()
## null device
##           1

## Runs forecast

The forecast for the batsman is shown below. As before Cooks’s performance is really consistent across the years and the forecast is good for the years ahead. In Cook’s case it can be seen that the forecasted and actual runs are reasonably accurate

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
batsmanPerfForecast("./smith.csv","S Smith")
batsmanPerfForecast("./warner.csv","D Warner")
batsmanPerfForecast("./cook.csv","A Cook")
## Warning in HoltWinters(ts.train): optimization difficulties: ERROR:
## ABNORMAL_TERMINATION_IN_LNSRCH
batsmanPerfForecast("./root.csv","JE Root")

dev.off()
## null device
##           1

## 3D plot of Runs vs Balls Faced and Minutes at Crease

The plot is a scatter plot of Runs vs Balls faced and Minutes at Crease. A prediction plane is fitted

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./smith.csv","S Smith")
battingPerf3d("./warner.csv","D Warner")

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
battingPerf3d("./cook.csv","A Cook")
battingPerf3d("./root.csv","JE Root")

dev.off()
## null device
##           1

## Predicting Runs given Balls Faced and Minutes at Crease

A multi-variate regression plane is fitted between Runs and Balls faced +Minutes at crease.

BF <- seq( 10, 400,length=15)
Mins <- seq(30,600,length=15)
newDF <- data.frame(BF,Mins)
smith <- batsmanRunsPredict("./smith.csv","S Smith",newdataframe=newDF)
warner <- batsmanRunsPredict("./warner.csv","D Warner",newdataframe=newDF)
cook <- batsmanRunsPredict("./cook.csv","A Cook",newdataframe=newDF)
root <- batsmanRunsPredict("./root.csv","JE Root",newdataframe=newDF)

The fitted model is then used to predict the runs that the batsmen will score for a given Balls faced and Minutes at crease. It can be seen that Warner sets a searing pace in the predicted runs for a given Balls Faced and Minutes at crease while Smith and Root are neck to neck in the predicted runs

batsmen <-cbind(round(smith$Runs),round(warner$Runs),round(cook$Runs),round(root$Runs))
colnames(batsmen) <- c("Smith","Warner","Cook","Root")
newDF <- data.frame(round(newDF$BF),round(newDF$Mins))
colnames(newDF) <- c("BallsFaced","MinsAtCrease")
predictedRuns <- cbind(newDF,batsmen)
predictedRuns
##    BallsFaced MinsAtCrease Smith Warner Cook Root
## 1          10           30     9     12    6    9
## 2          38           71    25     33   20   25
## 3          66          111    42     53   33   42
## 4          94          152    58     73   47   59
## 5         121          193    75     93   60   75
## 6         149          234    91    114   74   92
## 7         177          274   108    134   88  109
## 8         205          315   124    154  101  125
## 9         233          356   141    174  115  142
## 10        261          396   158    195  128  159
## 11        289          437   174    215  142  175
## 12        316          478   191    235  155  192
## 13        344          519   207    255  169  208
## 14        372          559   224    276  182  225
## 15        400          600   240    296  196  242

## Highest runs likelihood

The plots below the runs likelihood of batsman. This uses K-Means. It can be seen Smith has the best likelihood around 40% of scoring around 41 runs, followed by Root who has 28.3% likelihood of scoring around 81 runs

A. Steven Smith

batsmanRunsLikelihood("./smith.csv","S Smith")

## Summary of  S Smith 's runs scoring likelihood
## **************************************************
##
## There is a 40 % likelihood that S Smith  will make  41 Runs in  73 balls over 101  Minutes
## There is a 36 % likelihood that S Smith  will make  9 Runs in  21 balls over  27  Minutes
## There is a 24 % likelihood that S Smith  will make  139 Runs in  237 balls over 338  Minutes

B. David Warner

batsmanRunsLikelihood("./warner.csv","D Warner")

## Summary of  D Warner 's runs scoring likelihood
## **************************************************
##
## There is a 11.11 % likelihood that D Warner  will make  134 Runs in  159 balls over 263  Minutes
## There is a 63.89 % likelihood that D Warner  will make  17 Runs in  25 balls over  37  Minutes
## There is a 25 % likelihood that D Warner  will make  73 Runs in  105 balls over 156  Minutes

C. Alastair Cook

batsmanRunsLikelihood("./cook.csv","A Cook")

## Summary of  A Cook 's runs scoring likelihood
## **************************************************
##
## There is a 27.72 % likelihood that A Cook  will make  64 Runs in  140 balls over 195  Minutes
## There is a 59.9 % likelihood that A Cook  will make  15 Runs in  32 balls over  46  Minutes
## There is a 12.38 % likelihood that A Cook  will make  141 Runs in  300 balls over 420  Minutes

D. J E Root

batsmanRunsLikelihood("./root.csv","JE Root")

## Summary of  JE Root 's runs scoring likelihood
## **************************************************
##
## There is a 28.3 % likelihood that JE Root  will make  81 Runs in  158 balls over 223  Minutes
## There is a 7.55 % likelihood that JE Root  will make  179 Runs in  290 balls over  425  Minutes
## There is a 64.15 % likelihood that JE Root  will make  16 Runs in  39 balls over 59  Minutes

## Average runs at ground and against opposition

A. Steven Smith

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./smith.csv","S Smith")
batsmanAvgRunsOpposition("./smith.csv","S Smith")

dev.off()
## null device
##           1

B. David Warner

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./warner.csv","D Warner")
batsmanAvgRunsOpposition("./warner.csv","D Warner")

dev.off()
## null device
##           1

C. Alistair Cook

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./cook.csv","A Cook")
batsmanAvgRunsOpposition("./cook.csv","A Cook")

dev.off()
## null device
##           1

D. J E Root

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
batsmanAvgRunsGround("./root.csv","JE Root")
batsmanAvgRunsOpposition("./root.csv","JE Root")

dev.off()
## null device
##           1

## Analysis of bowlers

1. Mitchell Johnson (Aus) – Innings-131, Wickets – 299, Econ Rate : 3.28
2. Peter Siddle (Aus) – Innings – 104 , Wickets- 192, Econ Rate : 2.95
3. James Anderson (Eng) – Innings – 199 , Wickets- 406, Econ Rate : 3.05
4. Stuart Broad (Eng) – Innings – 148 , Wickets- 296, Econ Rate : 3.08

Anderson has the highest number of inning and wickets followed closely by Broad and Mitchell who are in a neck to neck race with respect to wickets. Johnson is on the more expensive side though. Siddle has fewer innings but a good economy rate.

# Wicket Frequency percentage

This plot gives the percentage of wickets for each wickets (1,2,3…etc)

par(mfrow=c(1,4))
par(mar=c(4,4,2,2))
bowlerWktsFreqPercent("./johnson.csv","Johnson")
bowlerWktsFreqPercent("./siddle.csv","Siddle")
bowlerWktsFreqPercent("./anderson.csv","Anderson")

dev.off()
## null device
##           1

## Wickets Runs plot

The plot below gives a boxplot of the runs ranges for each of the wickets taken by the bowlers

par(mfrow=c(1,4))
par(mar=c(4,4,2,2))
bowlerWktsRunsPlot("./johnson.csv","Johnson")
bowlerWktsRunsPlot("./siddle.csv","Siddle")
bowlerWktsRunsPlot("./anderson.csv","Anderson")

dev.off()
## null device
##           1

## Average wickets in different grounds and opposition

A. Mitchell Johnson

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./johnson.csv","Johnson")
bowlerAvgWktsOpposition("./johnson.csv","Johnson")

dev.off()
## null device
##           1

B. Peter Siddle

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./siddle.csv","Siddle")
bowlerAvgWktsOpposition("./siddle.csv","Siddle")

dev.off()
## null device
##           1

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))

dev.off()
## null device
##           1

D. James Anderson

par(mfrow=c(1,2))
par(mar=c(4,4,2,2))
bowlerAvgWktsGround("./anderson.csv","Anderson")
bowlerAvgWktsOpposition("./anderson.csv","Anderson")

dev.off()
## null device
##           1

## Relative bowling performance

The plot below shows that Mitchell Johnson is the mopst effective bowler among the lot with a higher wickets in the 3-6 wicket range. Broad and Anderson seem to perform well in 2 wickets in comparison to Siddle but in 3 wickets Siddle is better than Broad and Anderson.

relativeBowlingPerf(frames,names)

## Relative Economy Rate against wickets taken

Anderson followed by Siddle has the best economy rates. Johnson is fairly expensive in the 4-8 wicket range.

relativeBowlingER(frames,names)

## Moving average of wickets over career

Johnson is on his second peak while Siddle is on the decline with respect to bowling. Broad and Anderson show improving performance over the years.

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
bowlerMovingAverage("./johnson.csv","Johnson")
bowlerMovingAverage("./siddle.csv","Siddle")
bowlerMovingAverage("./anderson.csv","Anderson")

dev.off()
## null device
##           1

## Wickets forecast

par(mfrow=c(2,2))
par(mar=c(4,4,2,2))
bowlerPerfForecast("./johnson.csv","Johnson")
bowlerPerfForecast("./siddle.csv","Siddle")
bowlerPerfForecast("./anderson.csv","Anderson")

dev.off()
## null device
##           1

# Key findings

Here are some key conclusions

1. Cook has the most number of innings and has been extremly consistent in his scores
2. Warner has the best strike rate among the lot followed by Smith and Root
3. The moving average shows a marked improvement over the years for Smith
4. Johnson is the most effective bowler but is fairly expensive
5. Anderson has the best economy rate followed by Siddle
6. Johnson is at his second peak with respect to bowling while Broad and Anderson maintain a steady line and length in their career bowling performance

Also see my other posts in R

You may also like