# Simplifying ML: Recommender Systems – Part 7

In this age of Amazon, Netflix and App stores where products, movies and apps are purchased online the method of up-selling and cross-selling online is through the use of recommender based systems.

When you go to site like Amazon/Flipkart or purchase apps on App store/Google Play we often see things like “People who bought this book/app also bought X, Y, Z”. These recommendations are the recommender system algorithms in action.

Recently, Netflix ran a competition in which users had to come with the best algorithm to recommend films that a user would also like. The prize money for this was of the order of \$1 million. That’s how critical recommender systems are to organizations of today where most of the transactions happen on the web.

Typically users are asked to give a rating of 1 to 5 with 1 being the lowest and 5 being the highest.  So for example if we had classics like Moby Dick, Great Expectations and current best sellers like The Client, The da Vinci Code and a Science Fiction like 2001- A Space Odyssey we can expect that different people will rate the books differently. Obviously not everybody would have read every book in the list and some elements would be blank.

Recommender Systems are based on machine learning algorithms. The goal of these algorithms is to predict what score any user would give to books they did not rate. In other words what would be rating the buyers would give for books or apps they did not buy. So if the algorithm predicts a high rating then we could recommend that the user would also ‘like’ them. Or we could give recommendations of books/apps bought by users who bought the books/apps bought by this user.

The notation is

nu = Number of users

nb = Number of books

r(i,j) = Boolean whether user j rated a book i

y(i,j) = The rating user j gave book i

mj  = The number of books that user j rated

Content based recommendation

In a typical content based recommendation algorithm we assume that we have data about some items we want to recommend rating for e.g. books/products/apps.  In the example for books bought in an online bookstore we assume some features in our case ‘classic’, “fiction” etc

So each book has its own feature vector where x1 is the feature vector of the first book x2  feature vector of the 2nd book and so on

This can be done through linear regression by minimizing the cost function of the sum of squared errors from the predicted value

So for a parameter vector Ɵjand a feature vector xi the recommender system will try to predict the rating that a user j will give a book i.

This can be written as

Number of stars (rating) = (θj) T xi

This reduces to the minimization problem over all θj for r=1

min 1/2m Σ  ((θj)T xi  – y i,j)2

θj                  i:r=1

Adding the regularization term this becomes

min 1/2m Σ((θj)T xi  – y i,j)2  + λ/2m(Σ θj)2

θj                  i:r=1

The recommender algorithm in essence tries to learn parameters θj for a set of features of xi the chosen system for e.g.  books in this case.

The recommender tries to learn the parameters for all the users

min 1/2m Σ Σ((θj)T x – y i,j)2  + λ/2m(Σ Σ θj)2

θ1…θn                i:r=1

The minimization is performed by gradient descent as

Θj k:= Θjk – α (Σ((θj)T x – y i,j)xi + λ Θj k

Recommender systems tries to learn the parameters for a set of chosen features over all users. Based on the learnt paramaters it then tries to predict the rating the user would give to books/apps that he is yet to purchase and push up those apps for which the user is likely to give a high rating based on the given set of ratings.

Recommender systems contribute substantially to the revenues of e-commerce sites like Amazon, Flipkart, Netflix etc

Note: This post, line previous posts on Machine Learning,  is based on the Coursera course on Machine Learning by Professor Andrew Ng

# Perils and pitfalls of Big Data

Big Data is hurtling towards us in a big way. It is already in the news and the blip seems to getting bigger. Big Data will soon become the key driver for almost any kind of decision that is to be made in manufacturing, retail, finance all the way to astronomy, oceanography etc. The common aspect of all industries and areas is that data is generated in the order of several petabytes to exabytes. Big Data is the technique to analyze such large volumes of data.

Big Data represents the technique to handle the huge deluge of data that is already becoming enmeshed in our lives. Multiple disparate, varied streams of data (text. tweets, click streams, html) flow through with tremendous volume & velocity. The key aspects of data in the world are the volume, variety and the velocity. It is never ending and never seems to stop. How do we handle this deluge? How do we make sense of this data is what Big Data is all about.

Big Data provides algorithms to find patterns, determine trends or classify data depending on the features provided. It is supposed to enable the decision makes to make key decisions based on the answers the algorithms spew forth.

Big Data is also complicated by the fact that data comes is multiple forms from click streams, tweets, html, texts, CSVs, structured and non –structured data.

The ability to detect patterns, determine trends, classify, identify outliers is no easy task

In this post I try to take a philosophical look at Big Data and ask whether it can really help us. Will it help or will take is on wild goose chase? Can we trust the results?

Big Data depends on algorithms to make sense of data. Big Data deals with data that is in the order of Petabytes to Exabytes. At this scale with multiple features our cognitive abilities are of no use. We must rely on machines and algorithms to make sense of these large amounts of data. Our mind can handle a few hundred data points and at most 3 dimensions. Beyond that the data  can hardly make any sense.

Data by itself, in the absence of features & algorithms, is indistinguishable from noise. It is data science that makes sense of data. Data science separates the signal from the noise.

It is the algorithms that try to determine the best fit for a given set of data. But how reliable are the results. For example let us take the following case

An unsupervised learning algorithm for the above data points could try to separate the data into 2 sets. Clearly this is one way but what is more appropriate is that we have 2 shapes, the circle & the rectangle. A machine algorithm would try to work based on the features that we choose. Are we in a position to decide whether the answer the algorithm gives us is correct? We have no way of knowing because the amount of data is beyond our cognitive capabilities,

In other words, Big Data is full of perils and pitfalls.

When we let the machine to analyze on our behalf the possibility of coming to a wrong conclusion is fairly high.  This coupled with the fact that we are sometimes led to erroneous judgments, as discussed below, the problem is further compounded.

In his book “Thinking fast, thinking slow” Daniel Kahneman discusses several situations where our mind falls into the traps of lazy thinking. We come to wrong conclusions. Also our minds tend to detect patterns in data where there are none. Sometimes according to Kahneman ‘randomness appears as regularity or a tendency to cluster’. Also he says ‘the tendency to see patterns in randomness is overwhelming’. We could argue that in Big Data it is the algorithm that is determining the pattern we could be tricked into coming to false conclusions. Sometimes the human mind sees causality where there is none. Occasionally we fail to see the obvious.

In the ‘famous gorilla experiment’ the researchers tried to assess selective attention. The participants are asked to count the number of passes those in white t-shirts make. Surprisingly a large number of the participants were complete oblivious a gorilla that appears midway in the video. When we, as human fail to see such large objects, can we expect the machine to accurately identify patterns and perform accurate classifications?

There are techniques that help in determining false positives for e.g. the Bonferroni correction. Simply put the Bonferroni correction tries to determine the possibility of getting at least 1 significant result when one is testing 20 hypothesis simultaneously. If we want to test 20 hypotheses with the significance of 0.05 then the probability of at least 1 significant result is

P(at least one significant result) = 1 – P(no significant results)

= 1 – (1 – 0:05)^20

= 0.64

So, with 20 tests being considered, we have a 64% chance of observing at least one significant result, even if all of the tests are actually not significant. This would be a false positive.

Given that our ability to come to significant conclusions depends largely on being able to choose appropriate features, we must also be able to maneuver between false negatives and false positives. In addition we must also take into account the fallibility of the human mind.

Clearly, Big Data is the future! However with Big Data we are really on treacherous, slippery ground!

# Reducing to the Map-Reduce paradigm- Thinking Web Scale – Part 1

In physics there are 4 types of forces – gravitational forces among celestial bodies, electro-magnetic forces and strong and weak forces at the sub-atomic level. The equations that seem to work among large bodies don’t seem to apply at the sub-atomic level though there have been several attempts at grand unification theories

Similarly in computing we have: – computing at personal level, enterprise level, data-center level and a web scale level. The problems and paradigms at each level are very different and unique. The sequential processing, relational database accesses or network speeds at the local area network level are very different to the parallel processing requirements, NoSQL based storage accesses  and WAN latencies.

Here is the first of my posts on paradigms at the Web Scale.

The internet now contains in excess of 1 billion hosts.  This is based on a report in the World Fact Book published in 2012.

In these 1 billion and odd hosts there are at least ~1.5 billion pages that have been indexed. There must be several hundred million that are not indexed by the major search engines.

Search engines like Google, Bing or Yahoo have to work on several hundred million pages.  Similarly social web sites like Facebook, Twitter or LinkedIn have to deal with several hundred million users who constantly perform status updates, upload images, tweet etc. To handle large quantities of data efficiently and quickly there is a need for web scale algorithms.

One such algorithm is the map-reduce, that had its origins in Google. The map reduce essentially consists of a set of mappers which take as input a key-value pair and outputs 0 or more key value pairs. The reducer takes all tuples with the same key and combines them based on some function and emits a key value pair

Map-reduce, and its open source avatar, Hadoop, are now used routinely to solve several large scale problems. To be honest, I was and still am, puzzled whether the 2 simple tasks types of mapping & reducing can be used for a large variety of problems. However, it appears so.

I would have assumed that there would have been other flavors, maybe an ‘identify-update’, ‘determine-solve’ or some such equivalent, unless a large set of problems can be expressed as some combination of the map reduce paradigm.

Anyway here a few examples for which the map reduce algorithm is useful.

Word Counting: The standard example for map-reduce is the word counting program. In this the map reduce algorithm generates a list of words with their corresponding word count from a set of input files. The Map task reads each document and breaks it into a sequence of words (w1, w2, w3 …). It then emits a key value pair as follows

(w1,1),(w2,1),(w3,1),(w1,1) and so on. If a word is repeated in the document it occurs multiple times in the output.  Now the entire key, value pairs are grouped by keys and sent to one of the reducer tasks. Each reducer will then sum all the values thus giving the total for each word.

Matrix multiplication: Big Data is a typical challenge in the web where there is a need to determine patterns and trends in mountains of data. Machine learning algorithms are utilized to determine structure in data that has 3 characteristics of volume, variety and velocity. Machine learning algorithms typically depend on matrix operations. Map-reduce is ideally suited for this and one of the original purposes of Google for map-reduce was with matrix multiplication.

Let us assume that we have a n x n matrix M whose element in row i and column j is mij

Also let us assume that there is a vector ‘v’ whose jth element is vj . Then the matrix vector product can be is the vector x of the length n whose ith element is given as

xi = ∑ mijvj

Map function: The map function applies to each single element of the matrix M. For each element mij the map task outputs a key-value pair as follows (i, mijvj).  Hence we will have a key-value pairs for all ‘i’ from 1 to n.

Reduce function:  The reduce function takes all pairs with the same key ‘i’ and sum it up.

Hence each reducer will generate

xi = ∑ mijvj

(Reference: Mining of Massive Datasets– Anand Rajaraman, Jure Leskovec, Jeffrey D Ullman)

This link gives a good write-up on a matrix x matrix multiplication,

Map-reduce for Relational Operations: Map-reduce can be used to perform a number of operations on large scale data that are used in database operations. Multiple database operations can be performed on large scale data like selection, projection, union, intersection, difference, natural join, grouping etc.

Here is a an example taken from ‘Web Intelligence & Big Data’ course from Coursera any Gautam Shroff.

Let us assume that there are 2 tables ‘Sales by address’ and “City by address’ and the need is to find the total ‘Sales by City’. The SQL query for this

SELECT SUM(Sale),City FROM Sales, City WHERE Sales.Addr_id = Cities.Addr_id GROUP BY City

This can be done by 2 map-reduce tasks.

The first map-reduce task GROUPs BY Sales as follows

Map1: The first map task will emit (Address, rest of record (SALE/City))

Reduce1: The first reduce task will SUM (Sales) by Address for every City. Clearly this will have multiple occurrences of City.

At this point we will have the sum of the sales for every city. However each city can occur multiple times. Now we have to GROUP BY City

Map2: Now the mapper emits the (City, rest of record (SALES)

Reduce2: The 2nd reduce now SUMS all the sales for each city.

Clearly the map-reduce algorithm does solve some major areas. It is extremely useful when there is a need to perform the same operation on multiple documents. It would definitely be useful in building the inverted index or in Page rank. Also, map-reduce is very powerful in handling matrix operations. Large class of problems like machine learning, computer vision all use matrices extensively and map-reduce is extremely critical when it has done in large volumes of data.  Besides, the ability of map-reduce to perform a large set of database operations is something that can be used in many situations in the web.

However it is no silver bullet for all types of problems.

# Simplifying ML: Impact of degree of polynomial degree on bias & variance and other insights

This post takes off from my earlier post Simplifying Machine Learning: Bias, variance, regularization and odd facts- Part 4. As discussed earlier a poor hypothesis function could either underfit or overfit the data.  If the number of features selected were small of the order of 1 or 2 features, then we could plot the data and try to determine how the hypothesis function fits the data. We could also see whether the function is capable of predicting output target values for new data.

However if the number of features were large for e.g. of the order of 10’s of features then there needs to be method by which one can determine if the learned hypotheses is a ‘just right’ fit for all the data.

Checkout my book ‘Deep Learning from first principles Second Edition- In vectorized Python, R and Octave’.  My book is available on Amazon  as paperback (\$18.99) and in kindle version(\$9.99/Rs449).

You may also like my companion book “Practical Machine Learning with R and Python:Second Edition- Machine Learning in stereo” available in Amazon in paperback(\$12.99) and Kindle(\$9.99/Rs449) versions.

The following technique can be used to determine the ‘goodness’ of a hypothesis or how well the hypothesis can fit the data and can also generalize to new examples not in the training set.

Several insights on how to evaluate a hypothesis is  given below

Consider a hypothesis function

hƟ (x) = Ɵ0 + Ɵ1x1 + Ɵ2x22 + Ɵ3x33  +  Ɵ4x44

The above hypothesis does not generalize well enough for new examples in the data set.

Let us assume that there 100 training examples or data sets. Instead of using the entire set of 100 examples to learn the hypothesis function, the data set is divided into training set and test set in a 70%:30% ratio respectively

The hypothesis is learned from the training set. The learned hypothesis is then checked against the 30% test set data to determine whether the hypothesis is able to generalize on the test set also.

This is done by determining the error when the hypothesis is used against the test set.

For linear regression the error is computed by determining the average mean square error of the output value against the actual value as follows

The test set error is computed as follows

Jtest(Ɵ) = 1/2mtest Σ(hƟ (xtest – ytesti)2

For logistic regression the test set error is similarly determined as

Jtest(Ɵ) = = 1/mtest Σ -ytest * log(hƟ (xtest))  – (1-ytest) * (log(1 – hƟ (xtest))

The idea is that the test set error should as low as possible.

Model selection

A typical problem in determining the hypothesis is to choose the degree of the polynomial or to choose an appropriate model for the hypothesis

The method that can be followed is to choose 10 polynomial models

1. hƟ (x) = Ɵ0 + Ɵ1x1
2. hƟ (x) = Ɵ0 + Ɵ1x1 + Ɵ2x22
3. hƟ (x) = Ɵ0 + Ɵ1x12 + Ɵ2x22 + Ɵ3x33

Here‘d’ is the degree of the polynomial. One method is to train all the 10 models. Run each of the model’s hypotheses against the test set and then choose the model with the smallest error cost.

While this appears to a good technique to choose the best fit hypothesis, in reality it is not so. The reason is that the hypothesis chosen is based on the best fit and the least error for the test data. However this does not generalize well for examples not in the training or test set.

So the correct method is to divide the data into 3 sets  as 60:20:20 where 60% is the training set, 20% is used as a test set to determine the best fit and the remaining 20% is the cross-validation set.

The steps carried out against the data is

1. Train all 10 models against the training set (60%)
2. Compute the cost value J against the cross-validation set (20%)
3. Determine the lowest cost model
4. Use this model against the test set and determine the generalization error.

Degree of the polynomial versus bias and variance

How does the degree of the polynomial affect the bias and variance of a hypothesis?

Clearly for a given training set when the degree is low the hypothesis will underfit the data and there will be a high bias error. However when the degree of the polynomial is high then the fit will get better and better on the training set (Note: This does not imply a good generalization)

We run all the models with different polynomial degrees on the cross validation set. What we will observe is that when the degree of the polynomial is low then the error will be high. This error will decrease as the degree of the polynomial increases as we will tend to get a better fit. However the error will again increase as higher degree polynomials that overfit the training set will be a poor fit for the cross validation set.

This is shown below

Effect of regularization on bias & variance

Here is the technique to choose the optimum value for the regularization parameter λ

When λ is small then Ɵi values are large and we tend to overfit the data set. Hence the training error will be low but the cross validation error will be high. However when λ is large then the values of Ɵi become negligible almost leading to a polynomial degree of 1. These will underfit the data and result in a high training error and a cross validation error. Hence the chosen value of λ should be such that the cross validation error is the lowest

Plotting learning curves

This is another technique to identify if the learned hypothesis has a high bias or a high variance based on the number of training examples

A high bias indicates an underfit. When the number of samples in training set if low then the training error and cross validation error will be low as it will be easy to create a hypothesis if there are few training examples. As the number of samples increase the error will increase for the training set and will slightly decrease for the cross validation set. However for a high bias, or underfit, after a certain point increasing the number of samples will not change the error. This is the case of a high bias

In the case of high variance where a high degree polynomial is used for the hypothesis the training error will be low for smaller number of training examples. As the number of training examples increase the error will increase slowly. The cross validation error will be high for lesser number of training samples but will slowly decrease as the number of samples grow as the hypothesis will learn better. Hence for the case of high variance increasing the number of samples in the training set size will decrease the gap between the cross validation and the training error as shown below

Note: This post, line previous posts on Machine Learning,  is based on the Coursera course on Machine Learning by Professor Andrew Ng

# Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4

In both linear and logistic regression the choice of the degree of the polynomial for the hypothesis function is extremely critical. A low degree for the polynomial can result in an underfit, while a very high degree can overfit the data as shown below

The figure on the left the data is underfit as we try to fit the data with a first order polynomial which is a straight line. This is a case of strong ‘bias’

The rightmost figure a much higher polynomial is used. All the data points are covered by the polynomial curve however it is not effective in predicting other values. This is a case of overfitting or a high variance.

The middle figure is just right as it intuitively fits the data points the best possible way.

A similar problem exists with logistic regression as shown below

There are 2 ways to handle overfitting

a)      Reducing the number of features selected

b)      Using regularization

In regularization the magnitude of the parameters Ɵ is decreased to reduce the effect of overfitting

Hence if we choose a hypothesis function

hƟ (x) = Ɵ0 + Ɵ1x12 + Ɵ2x22 + Ɵ3x33 +  Ɵ4x44

The cost function for this without regularization as mentioned in earlier posts

J(Ɵ) = 1/2m Σ(hƟ (xi  – yi)2

Where the key is minimize the above function for the least error

The cost function with regularization becomes

J(Ɵ) = 1/2m Σ(hƟ (xi  – yi)2 + λ Σ Ɵj2

As can be seen the regularization now adds a factor Ɵj2  as a part of the cost function which needs to be minimized.

Hence with the regularization factor the problem of underfitting/overfitting can be solved

However the trick is determine the value of λ. If λ is too big then it would result in underfitting or resulting in a high bias.

Similarly the regularized equation for logistic regression is as shown below

J(Ɵ) = |1/m Σ  -y * log(hƟ (x))  – (1-y) * (log(1 – hƟ (x))  | + λ/2m Σ Ɵj2

Some tips suggested by Prof Andrew Ng while determining the parameters and features for regression

a)      Get as many training examples. It is worth spending more effort in getting as much examples

c)      Observe changes to the learning algorithm with different values of λ

This post is continued in my next post – Simplifying ML: Impact of degree of polynomial on bias, variance and other insights

Note: This post, in line with my previous posts on Machine Learning,  is based on the Coursera course on Machine Learning by Professor Andrew Ng

# Simplifying ML: Logistic regression – Part 2

Logistic regression is another class of Machine Learning algorithms which comes under supervised learning. In this regression technique we need to classify data. Take a look at my earlier post Simplifying Machine Learning algorithms – Part 1 I had discussed linear regression. For e.g if we had data on tumor sizes versus the fact that the tumor was benign or malignant, the question is whether given a tumor size we can predict whether this tumor would be benign or cancerous. So we need to have the ability to classify this data.

This is shown below

It is obvious that a line with a certain slope could easily separate the two.

As another example we could have an algorithm that is able to automatically classify mail as either spam or not spam based on the subject line. So for e.g if the subject line had words like medicine, prize, lottery etc we could with a fair degree of probability classify this as spam.

However some classification problems could be far more complex.  We may need to classify another problem as shown below.

From the above it can be seen that hypothesis function is second order equation which is either a circle or an ellipse.

In the case of logistic regression the hypothesis function should be able to switch between 2 values 0 or 1 almost like a transistor either being in cutoff or in saturation state.

In the case of logistic regression 0 <= hƟ <= 1

The hypothesis function uses function of the following form

g(z) = 1/(1 + e‑z)

and hƟ (x) = g(ƟTX)

The function g(z) shown above has the characteristic required for logistic regression as it has the following shape

The function rapidly asymptotes at 1 when hƟ (x) >= 0.5 and  hƟ (x) asymptotes to 0 when hƟ (x) < 0.5

As in linear regression we can have hypothesis function be of an appropriate order. So for e.g. in the ellipse figure above one could choose a hypothesis function as follows

hƟ (x) = Ɵ0 + Ɵ1x12 + Ɵ2x22 + Ɵ3x1 +  Ɵ4x2

or

hƟ (x) = 1/(1 + e –(Ɵ0 + Ɵ1×12 + Ɵ2×22 + Ɵ3×1 +  Ɵ4×2))

We could choose the general form of a circle which is

f(x) = ax2 + by2 +2gx + 2hy + d

The cost function for logistic regression is given below

Cost(hƟ (x),y) = { -log(hƟ (x))             if y = 1

-log(1 – hƟ (x)))       if y = 0

In the case of regression there was a single cost function which could determine the error of the data against the predicted value.

The cost in the event of logistic regression is given as above as a set of 2 equations one for the case where the data is 1 and another for the case where the data is 0.

The reason for this is as follows. If we consider y =1 as a positive value, then when our hypothesis correctly predicts 1 then we have a ‘true positive’ however if we predict 0 when it should be 1 then we have a false negative. Similarly when the data is 0 and we predict a 1 then this is the case of a false positive and if we correctly predict 0 when it is 0 it is true negative.

Here is the reason as how the cost function

Cost(hƟ (x),y) = { -log(hƟ (x))             if y = 1

-log(1 – hƟ (x)))       if y = 0

Was arrived at. By definition the cost function gives the error between the predicted value and the data value.

The logic for determining the appropriate function is as follows

For y = 1

y=1 & hypothesis = 1 then cost = 0

y= 1 & hypothesis = 0 then cost = Infinity

Similarly for y = 0

y = 0 & hypotheses  = 0 then cost = 0

y = 0 & hypothesis = 1 then cost = Infinity

and the the functions above serve exactly this purpose as can be seen

Hence the cost can be written as

J(Ɵ) = Cost(hƟ (x),y) = -y * log(hƟ (x))  – (1-y) * (log(1 – hƟ (x))

This is the same as the equation above

The same gradient descent algorithm can now be used to minimize the cost function

So we can iterate througj

Ɵj =   Ɵj – α δ/δ Ɵj J(Ɵ0, Ɵ1,… Ɵn)

This works out to a function that is similar to linear regression

Ɵj = Ɵj – α 1/m { Σ hƟ (xi) – yi} xj i

This will enable the machine to fairly accurately determine the parameters Ɵj for the features x and provide the hypothesis function.

This is based on the Coursera course on Machine Learning by Professor Andrew Ng. Highly recommended!!!

# Simplifying Machine Learning algorithms – Part 1

Machine learning or the ability to use computers to predict values, classify data or identify patterns is truly a fascinating field. It is amazing how algorithms can come to conclusions on data. Detecting patterns is a inborn ability of the human mind. But our mind cannot handle large quantities of data with many features. It is here that machines have an edge over us.

This post is inspired by the Machine Learning course at Coursera conducted by Professor Andrew Ng of Stanford. The lectures are truly lucid and delivered with amazing clarity. In a series of post I will be trying to distil the meaning and motivation behind the algorithms that are part of machine learning.

There are 2 major types of learning

a)      Supervised learning b) Unsupervised learning

Supervised learning: In supervised learning we have to infer the relationship between input data and output values. The intention of supervised learning is determine the possible out for some random input once the relationship has been determined. Some examples of supervised learning are linear regression, logistic regression etc.

Unsupervised learning: In unsupervised learning the problem is to determine patterns and structure in unlabeled data. Some examples of unsupervised learning are K-Means clustering, hidden Markov models etc.

In this post I would like to take a look at Supervised Learning algorithms

Linear Regression

In regression problems we try to infer the relationship between a set of input parameters to an output value. Let us we have data for the number of rooms vs. price of the house as shown below

Depending on the data we could either fit a straight line or use a linear fit. Alternatively we could fit a higher order curve to data.

The function that determines the relationship is also known as hypothesis function. This can be represented as follows for e.g a hypothesis function with a single feature

hƟ(x) = Ɵ1x+ Ɵ0

The above equation is the hypothesis function where Ɵ is the parameter and x is the feature

We could have a higher order hypothesis function as follows

hƟ(x) = Ɵ2x2+ Ɵ1x+Ɵ0

To evaluate whether the hypothesis function is able to map the input and related output accurately is known as the ‘cost function’.

The cost function can be represented as

J(Ɵ) = 1/2m Σ(hƟ (xi)  – y i)2

The cost function really calculates the ‘mean squared error’ of the actual data points (y) with the points on the hypothesis function (hƟ). Clearly higher the value of J(Ɵ) the greater is the error in predicting the output based on a set of input parameters. If we just took the error instead of the squared error then if there were data points on either side of the predicted line then the positive & negative errors could cancel out. Hence the approach is usually to take the mean of the squared error.

The goal would be to minimize the error which will result in the best fit.

So the approach would be to choose values for the parameters Ɵi

The algorithm that is used for determining the values of the parameters that will result in the minimum error is gradient descent

The formula is

Ɵj := Ɵj – αd/d Ɵj J(Ɵ)

Where α is the learning rate

Gradient descent starts by picking a random value for Ɵi. Then the algorithm looks around to search for the next combination that will take us down fastest. By continuing this process the local minima is determined.

Gradient descent is based on the observation that if the multivariable function  is defined and differentiable in a neighborhood of a point , then  decreases fastest if one goes from  in the direction of the negative gradient. This is shown in the below diagram taken from Wikipedia.

For e.g for a curve as shown below

This how I think the gradient descent works. In the above diagram at point A the slope is +ve and taking the negative of the slope multiplied by the learning factor α and subtracting it from Ɵj will result in a value that is less than Ɵj. That is we move towards the minima or C. Similarly at point B the slope will be -ve. If we multiply by  – α then we will add to Ɵj. Hence we will move to the right or towards point C.

By applying the iterative process of gradient descent we can get the combination of parameter values for  Ɵ that will provide the best fit for the set of data points

The iterative process of gradient descent is applied to minimize the cost function which is function of the error in the current hypothesis

δ/δ J(Ɵ) = δ/ δ Ɵ * 1/2m Σ(hƟ (xi)  – y i)2

This process is applied iteratively to the below equation to arrive at the values of Ɵi

The formula is

Ɵj := Ɵj – αd/d Ɵj J(Ɵ)

to obtain the values for the best fit equation

hƟ(x) = Ɵ2xn+ Ɵ1xn-1+ …+  Ɵ0

Also read my post on Simplifying ML: Logistic regression – Part 2