This post takes off from my earlier post Simplifying Machine Learning: Bias, variance, regularization and odd facts- Part 4. As discussed earlier a poor hypothesis function could either underfit or overfit the data. If the number of features selected were small of the order of 1 or 2 features, then we could plot the data and try to determine how the hypothesis function fits the data. We could also see whether the function is capable of predicting output target values for new data.

However if the number of features were large for e.g. of the order of 10’s of features then there needs to be method by which one can determine if the learned hypotheses is a ‘just right’ fit for all the data.

Checkout my book ‘Deep Learning from first principles Second Edition- In vectorized Python, R and Octave’. My book is available on Amazon as paperback ($18.99) and in kindle version($9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback($12.99) and Kindle($9.99/Rs449) versions.

The following technique can be used to determine the ‘goodness’ of a hypothesis or how well the hypothesis can fit the data and can also generalize to new examples not in the training set.

Several insights on how to evaluate a hypothesis is given below

Consider a hypothesis function

h_{Ɵ} (x) = Ɵ_{0} + Ɵ_{1}x_{1} + Ɵ_{2}x_{2}^{2} + Ɵ_{3}x_{3}^{3 } + Ɵ_{4}x_{4}^{4}

The above hypothesis does not generalize well enough for new examples in the data set.

Let us assume that there 100 training examples or data sets. Instead of using the entire set of 100 examples to learn the hypothesis function, the data set is divided into training set and test set in a 70%:30% ratio respectively

The hypothesis is learned from the training set. The learned hypothesis is then checked against the 30% test set data to determine whether the hypothesis is able to generalize on the test set also.

This is done by determining the error when the hypothesis is used against the test set.

For linear regression the error is computed by determining the average mean square error of the output value against the actual value as follows

The test set error is computed as follows

J_{test}(Ɵ) = 1/2m_{test }Σ(hƟ (x_{test}^{i } – y_{test}^{i})^{2}

For logistic regression the test set error is similarly determined as

J_{test}(Ɵ) = = 1/m_{test} Σ -y_{test} * log(h_{Ɵ }(x_{test})) – (1-y_{test}) * (log(1 – h_{Ɵ }(x_{test}))

The idea is that the test set error should as low as possible.

**Model selection**

A typical problem in determining the hypothesis is to choose the degree of the polynomial or to choose an appropriate model for the hypothesis

The method that can be followed is to choose 10 polynomial models

- h
_{Ɵ}(x) = Ɵ_{0}+ Ɵ_{1}x_{1} - h
_{Ɵ}(x) = Ɵ_{0}+ Ɵ_{1}x_{1}+ Ɵ_{2}x_{2}^{2} - h
_{Ɵ}(x) = Ɵ_{0}+ Ɵ_{1}x_{1}^{2}+ Ɵ_{2}x_{2}^{2}+ Ɵ_{3}x_{3}^{3 } - …

Here‘d’ is the degree of the polynomial. One method is to train all the 10 models. Run each of the model’s hypotheses against the test set and then choose the model with the smallest error cost.

While this appears to a good technique to choose the best fit hypothesis, in reality it is not so. The reason is that the hypothesis chosen is based on the best fit and the least error for the test data. However this does not generalize well for examples not in the training or test set.

So the correct method is to divide the data into 3 sets as 60:20:20 where 60% is the training set, 20% is used as a test set to determine the best fit and the remaining 20% is the cross-validation set.

The steps carried out against the data is

- Train all 10 models against the training set (60%)
- Compute the cost value J against the cross-validation set (20%)
- Determine the lowest cost model
- Use this model against the test set and determine the generalization error.

**Degree of the polynomial versus bias and variance**

How does the degree of the polynomial affect the bias and variance of a hypothesis?

Clearly for a given training set when the degree is low the hypothesis will underfit the data and there will be a high bias error. However when the degree of the polynomial is high then the fit will get better and better on the training set (Note: This does not imply a good generalization)

We run all the models with different polynomial degrees on the cross validation set. What we will observe is that when the degree of the polynomial is low then the error will be high. This error will decrease as the degree of the polynomial increases as we will tend to get a better fit. However the error will again increase as higher degree polynomials that overfit the training set will be a poor fit for the cross validation set.

This is shown below

**Effect of regularization on bias & variance**

Here is the technique to choose the optimum value for the regularization parameter λ

When λ is small then Ɵ_{i} values are large and we tend to overfit the data set. Hence the training error will be low but the cross validation error will be high. However when λ is large then the values of Ɵ_{i }become negligible almost leading to a polynomial degree of 1. These will underfit the data and result in a high training error and a cross validation error. Hence the chosen value of λ should be such that the cross validation error is the lowest

Plotting learning curves

This is another technique to identify if the learned hypothesis has a high bias or a high variance based on the number of training examples

A high bias indicates an underfit. When the number of samples in training set if low then the training error and cross validation error will be low as it will be easy to create a hypothesis if there are few training examples. As the number of samples increase the error will increase for the training set and will slightly decrease for the cross validation set. However for a high bias, or underfit, after a certain point increasing the number of samples will not change the error. This is the case of a high bias

In the case of high variance where a high degree polynomial is used for the hypothesis the training error will be low for smaller number of training examples. As the number of training examples increase the error will increase slowly. The cross validation error will be high for lesser number of training samples but will slowly decrease as the number of samples grow as the hypothesis will learn better. Hence for the case of high variance increasing the number of samples in the training set size will decrease the gap between the cross validation and the training error as shown below

Note: This post, line previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Also see

1. My book ‘Practical Machine Learning in R and Python: Third edition’ on Amazon

2.My book ‘Deep Learning from first principles:Second Edition’ now on Amazon

3.The Clash of the Titans in Test and ODI cricket

4. Introducing QCSimulator: A 5-qubit quantum computing simulator in R

5.Latency, throughput implications for the Cloud

6. Simulating a Web Joint in Android

5. Pitching yorkpy … short of good length to IPL – Part 1