
This picture is a good way to start talking about KMeans. It contains many data points separated into groups by color. It might represent a customer segmentation, students classified as average and above average, or an image classification based on the pixels of an image.
KMeans

The main idea of the KMeans algorithm is to form subgroups based on the features and their values. Imagine the following scenario:

  1. There is a dataset in which the data denotes the weights of students. Your task is to separate them into groups (clusters) whose members are as similar as possible. There are no labels or dependent variables indicating any predefined classes, which means KMeans is an unsupervised machine learning algorithm.

Note: I am planning to explain this through my own paperwork rather than downloaded screenshots. I hope this makes it easy to follow.

Definition:

KMeans is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.

NOTE: The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

The way the KMeans algorithm works is as follows:

  1. Specify the number of clusters K.
  2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points as the centroids, without replacement.
  3. Keep iterating until there is no change to the centroids. In each iteration:

Compute the sum of the squared distances between the data points and all centroids.

Assign each data point to the closest cluster (centroid).

Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
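To make these steps concrete, here is a minimal sketch of the loop in Python/NumPy. The function name, the random initialization and the stopping test are my own illustration (not code from any particular library), and the empty-cluster edge case is ignored:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """A bare-bones KMeans on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick K data points (without replacement) as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment: each point goes to the cluster of its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```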

Let’s dive into the problem: consider that the following data are the weights of students, and the task is to split them into two groups (clusters) without overlapping.

Values: 15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65.

Step:1

Number of Clusters(k)=2

Step:2

In this case I take 16 and 22 as the center points (16 and 22 are chosen at random; there is no particular reason behind the choice).

In simple words, we have to find the distance of each data point from the center points, and each data point is grouped with the centroid it is closest to.
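As a rough sketch of this assignment step in plain Python (the data and the initial centroids 16 and 22 are the ones from this example; the tie-breaking rule for equal distances is my own choice, made to match the grouping used in the iteration below):

```python
weights = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
c1, c2 = 16, 22  # the randomly chosen initial centroids

for w in weights:
    d1, d2 = abs(w - c1), abs(w - c2)   # distance to each centroid
    cluster = 1 if d1 < d2 else 2       # nearest centroid wins; ties go to cluster 2 here
    print(w, "-> cluster", cluster, "(diff", d1, "vs", d2, ")")
```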

Step:3-Keep iterating until there is no change to the centroids.

Iteration:1

Centroid = 16 (cluster 1), centroid = 22 (cluster 2). Calculate the distance between each data point and the respective centroids.

|16 - 15| = 1 (diff = 1), |22 - 15| = 7 (diff = 7), which means the value 15 lies very close to the centroid 16 and far away from the centroid 22.

So we can say 15 has to go into cluster 1. In the same way, we can calculate the distances for the remaining data points.

The 2nd data point is 15, so obviously that also goes into cluster 1.

The 3rd data point is 16, which is the centroid of cluster 1 itself (distance 0), so it also belongs to cluster 1.

The 4th and 5th data points are 19. Let’s see where they go.

|16 - 19| = 3 (diff = 3), |22 - 19| = 3 (diff = 3): the same distance from both centroids. In a tie like this we do not need to worry; the algorithm simply breaks the tie consistently, and here both 19s are placed in cluster 2.

The next data point is 20, and here we can see a change.

|16 - 20| = 4 (diff = 4), |22 - 20| = 2 (diff = 2), which means the value 20 lies very close to the centroid 22 and farther away from the centroid 16.

So we can say 20 has to go into cluster 2.

If you look at the data points once again, everything after 20 is greater than 20, so without calculating we can say those points also join cluster 2.

Now compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.

Mean of the cluster 1:

(15 + 15 + 16) / 3 = 15.33

Mean of the cluster 2:

(19 + 19 + 20 + 20 + 21 + 22 + 28 + 35 + 40 + 41 + 42 + 43 + 44 + 60 + 61 + 65) / 16 = 36.25
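These two averages can be double-checked with a couple of lines of Python (a quick sketch using the cluster memberships from iteration 1):

```python
cluster1 = [15, 15, 16]
cluster2 = [19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]

print(sum(cluster1) / len(cluster1))  # 15.33... -> new centroid of cluster 1
print(sum(cluster2) / len(cluster2))  # 36.25    -> new centroid of cluster 2
```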

We have to repeat the same iteration process until there is no change in the centroids (center values).

For your better understanding, I will attach my paperwork, in which I did the further iterations.

Iteration:2

At the end of the 2nd iteration we get centroid values of 18 and 45 for cluster 1 and cluster 2 respectively.

Here I hand over the next iterations to you. Please continue until there is no change in the centroids.

Note: If we follow the same procedure, we get the same centroid values in both the 3rd and 4th iterations. The centroid value of cluster 1 is 19 and of cluster 2 is 47 in both the 3rd and 4th iterations, so we stop here.
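Here is a small sketch that repeats the whole assign-and-update loop on the same data until the centroids stop changing. It keeps the exact decimal means instead of the rounded values in my paperwork, so the printed centroids differ slightly in the decimals but follow the same convergence pattern, ending near 19.5 and 47.9 (the unrounded counterparts of 19 and 47 above):

```python
weights = [15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
c1, c2 = 16.0, 22.0  # initial centroids from Step 2

while True:
    # assignment step: each point goes to the nearer centroid (ties go to cluster 2)
    cluster1 = [w for w in weights if abs(w - c1) < abs(w - c2)]
    cluster2 = [w for w in weights if abs(w - c1) >= abs(w - c2)]
    # update step: the new centroid is the mean of the cluster
    new_c1 = sum(cluster1) / len(cluster1)
    new_c2 = sum(cluster2) / len(cluster2)
    print("centroids:", round(new_c1, 2), round(new_c2, 2))
    if (new_c1, new_c2) == (c1, c2):  # no change -> converged, stop
        break
    c1, c2 = new_c1, new_c2
```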

So, the data points 15, 15, 16, 19, 19, 20, 20, 21, 22, 28 come under the umbrella of cluster 1,

and 35, 40, 41, 42, 43, 44, 60, 61, 65 come under the umbrella of cluster 2.

Applications

The KMeans algorithm is very popular and is used in a variety of applications such as market segmentation, document clustering, image segmentation and image compression. The goal when we undertake a cluster analysis is usually one of the following:

  1. To get a meaningful intuition of the structure of the data we are dealing with.
  2. Cluster-then-predict, where different models are built for different subgroups, if we believe there is wide variation in the behaviors of the subgroups.
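In practice we rarely code the loop by hand; a library such as scikit-learn (assuming it is available) does the same job in a few lines, and its output can then feed a segmentation or cluster-then-predict workflow:

```python
import numpy as np
from sklearn.cluster import KMeans

weights = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                    35, 40, 41, 42, 43, 44, 60, 61, 65]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(weights)
print(km.cluster_centers_.ravel())  # the two centroids
print(km.labels_)                   # cluster index assigned to each student
```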

Note:

In this article, I just tried to give you a basic understanding of what KMeans is. There are a few more engineering things we can do to inspect the model:

Elbow method: to find the optimal number of clusters K.

Silhouette score: ranges from -1 to 1; an important metric for analyzing how well formed the clusters are, i.e., for checking the goodness of the clustering.
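For reference, a minimal sketch of how both are usually computed with scikit-learn (again assuming it is installed) looks like this; the elbow method looks for the K where the inertia stops dropping sharply, and a higher silhouette score means better-separated clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

weights = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                    35, 40, 41, 42, 43, 44, 60, 61, 65]).reshape(-1, 1)

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(weights)
    # inertia_ is the within-cluster sum of squared distances used by the elbow plot
    print(k, round(km.inertia_, 1), round(silhouette_score(weights, km.labels_), 3))
```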

I will discuss the elbow method and the silhouette score in detail soon.

Thank you….
