Our brain is an extraordinary apparatus. It is amazing how we humans can instantaneously perceive shapes, objects, forms. For e.g. when see a scene with many objects we are immediately able to identify the different objects in the scene. View this against the backdrop of a recent Google’s artificial brain experiment of a neural network with 16000 processors and a billion connections. This artificial brain was fed with 10 million thumbnails of you tube videos before it was able to recognize cat videos.

That’s an awful lot of work to recognize cat videos!

We can see that a lot of work involved getting a computer to do something as simple thing as this.

Consider how a baby learns to recognize objects for e.g. cat, dog, toy etc. The human brain does not try to measure the number of eyes, spacing between the eyes, the mouth shape of face etc. The brain immediately is able to distinguish the different animals. How does it do it? Amazing right?

In any case here is a machine learning algorithm that is capable of identifying structure in data. This is also known as K-Means and is a form of unsupervised learning algorithm.

The K-Means algorithm takes as input an unlabeled data set and identifies groups in the set. It tries to determine structure in the data set.

Take a look at the picture below

It is readily obvious that there are 2 clusters in the above diagram. However to the computer this is just a random set of points.

How does the K-Means cluster identify the clusters in the above diagram?

The algorithm is fairly simply and intuitive.

1) Let us start by choosing 2 random points which we call as ‘cluster centroids

2) We then associate each centroid with the points in the dataset that are closest to it.

3) We then compute the average of each group of associated points in the centroid and move the centroid to that average.

4) We then repeat steps 2 – 4 until there is no significant change in the centroid

This is shown below

The above algorithm can be implemented iteratively as follows

For training set (x^{1}, x^{2}, x^{3} …)

Randomly initialize K cluster centroids μ_{1}, μ_{2}, μ_{3} … μ_{K}

_{ }

Repeat {

for 1 to m

c(i) = The cluster index from 1 to K that is closest to x^{i} => (A)

end

for k = 1 to K

u(k) = average of all points assigned to K => (B)

end

}

In step (A) the points x^{i} closest to the centroid k is added to the centroid’s set. Hence if points 1,3,4,8 are in centroid 1 then

x^{1} = 1, x^{3} = 1, x^{4} = 1, x^{8} =1

In step (B) the mean of the points 1, 3, 4, 8 is taken

So the centroid

c_{1x} = ¼ { x_{1} + x_{3} + x_{4} + x_{8}} and c_{1y} = ¼ { y_{1} + y_{3} + y_{4} + y_{8}}

This becomes the new c_{1}

_{ }

However there can be occasions where the K-means cluster would get stuck in local optima. To choose optimum cluster centroid we have to determine the least cost. This can be done with the optimization objective.

The optimization objective of K-Means is as follows

K-Mean cluster determination is the problem of minimizing the distance of each point from its centroid. This is also known as the K-Means cost function or distortion function.

J(c^{1},c^{2}…c^{m},, μ_{1},… μ_{2}) = 1/m Σ|| x^{i} – μ_{c(i)} ||^{2}

I like to visualize the algorithm as follows.

In step 1 we can visualize that there is a force of attraction between the datapoints and the cluster centroid based on proximity of the centroid.

In step 2 we can visualize that each datapoint attracts the centroid towards it. The centroid moves to the point where the attraction among all the datapoints balances out. This is average mean squared difference.

As can be seen the objective is to determine the average of the mean squared error of each data point to its closest centroid.

Given a set of data points how we choose the random centroids? One way is to initially pick some random data points themselves as the cluster centroid. The algorithm is then iterated to identify the real cluster centroids.

As mentioned before the algorithm can sometimes get stuck in local optima. One option is to choose another random set of data points and continue to iterate. We need to run this several times to determine the best clustering

There is also the problem of determining the number of cluster centroids. How we to determine how many clusters are would be there in a random data set? Visually we can easily identify the number of clusters. But a machine cannot.

One technique that can be used to determine the number of clusters is as follows. Start with 2, 3… 10 clusters and plot the cost function. Then pick the one with the least cost.

Note: This post, line previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng

Find me on Google+