K-Means Clustering
In this post, I will talk about one of the unsupervised learning techniques: K-Means clustering. It is easy to understand and to implement from scratch.
Because it is unsupervised, the inputs come without any labels. All this technique does is partition the data into groups; when a new data point is given for prediction, it outputs which group that point belongs to. We can use this in many situations.
For example, suppose we are given the locations of retail shops that we send our supplies to, and we would like to know where to build our storage units for those supplies. If we are only planning to have one mega-size storage unit, we could put it at the center of those shops. In general, we can choose any number of storage units (k) and decide which locations would be ideal for better, faster, and easier access and delivery.
Below is an animation showing simple K-Means clustering and how it works.
The numpy function `np.random.normal(loc, scale, size)` creates randomly generated data points centered at the parameter `loc`. `scale` indicates how much the generated numbers spread out from the center, and `size` is the number of data points to create.
For our problem, let us create two variables, `x` and `y`, for the whole data set. To plot and see the points, `x` and `y` should have the same size.
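The original snippet isn't preserved in this copy, so below is a minimal sketch of how such data could be generated; the specific centers (`loc`), spreads (`scale`), and sizes are my own choices, not necessarily the original post's.

```python
import numpy as np

# Two clouds of points: one centered near (-2, -2), one near (2, 2).
# loc sets the center, scale the spread, size the number of points.
x = np.concatenate([np.random.normal(loc=-2, scale=1, size=100),
                    np.random.normal(loc=2, scale=1, size=100)])
y = np.concatenate([np.random.normal(loc=-2, scale=1, size=100),
                    np.random.normal(loc=2, scale=1, size=100)])
```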
And if we plot them, it will look like the following.
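The original figure isn't reproduced here, but the plotting step is a one-liner with matplotlib's pyplot, which the post relies on throughout:

```python
import matplotlib.pyplot as plt

plt.scatter(x, y, alpha=0.5)  # a little transparency makes dense regions visible
plt.show()
```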
Below are the functions necessary to implement K-Means clustering; I've commented each one so that you can understand what it is doing. Even without the comments, it is pretty easy to grasp what each of them does.
For this exercise I am using 2D examples, so if your data has a different dimensionality, `assign_points` should be modified to take more arguments. It is also possible to accept a variable number of arguments dynamically; feel free to do so if you want.
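The original function bodies are not preserved in this copy, so here is a minimal 2D reconstruction. `assign_points` is the name mentioned in the text; `euclidean` and `update_points` are hypothetical names I chose for the other two steps.

```python
def euclidean(px, py, qx, qy):
    # Distance between the point (px, py) and the point (qx, qy).
    return np.sqrt((px - qx) ** 2 + (py - qy) ** 2)

def assign_points(x, y, cx, cy):
    # For every data point, return the index of the nearest group point.
    assignments = []
    for px, py in zip(x, y):
        distances = [euclidean(px, py, qx, qy) for qx, qy in zip(cx, cy)]
        assignments.append(int(np.argmin(distances)))
    return np.array(assignments)

def update_points(x, y, cx, cy, assignments):
    # Move each group point to the mean of the points assigned to it;
    # a group point with no assigned points stays where it is.
    new_cx, new_cy = [], []
    for i in range(len(cx)):
        mask = assignments == i
        new_cx.append(x[mask].mean() if mask.any() else cx[i])
        new_cy.append(y[mask].mean() if mask.any() else cy[i])
    return np.array(new_cx), np.array(new_cy)
```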
Now let's create five group points, plot them together with the earlier `x` and `y` values, and look at our first graph.
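A sketch of that step; the initialization range below is an assumption on my part, chosen to roughly match the data.

```python
k = 5
# Initialize the k group points randomly over roughly the same range as the data.
cx = np.random.normal(loc=0, scale=2, size=k)
cy = np.random.normal(loc=0, scale=2, size=k)

plt.scatter(x, y, alpha=0.3)            # data points, slightly transparent
plt.scatter(cx, cy, marker='*', s=200)  # the five group points
plt.show()
```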
Most likely your graph will look different from this one, since the generated data is random. Also, I made the data points a bit transparent so they are easier to distinguish from the 5 group points.
Now that we have our data points and group points, let's loop until the algorithm converges. Note that the code below is longer than necessary because of the visualization; if you don't need any kind of plots, you can use the commented code instead.
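The original loop isn't shown in this copy; below is a minimal sketch using the functions above, with the optional per-iteration plotting left as comments.

```python
while True:
    assignments = assign_points(x, y, cx, cy)
    new_cx, new_cy = update_points(x, y, cx, cy, assignments)

    # Optional per-iteration visualization:
    # plt.scatter(x, y, c=assignments, alpha=0.3)
    # plt.scatter(new_cx, new_cy, marker='*', s=200, c='red')
    # plt.show()

    # Converged once no group point moves any more.
    if np.allclose(new_cx, cx) and np.allclose(new_cy, cy):
        break
    cx, cy = new_cx, new_cy
```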
It works well! Note that depending on where the group points are initialized, the final graph can turn out differently. Even though the data points are the same, there is no guarantee that two runs will end up with the same result; how the group points are initialized can greatly impact the outcome, so keep that in mind.
Now that we've trained a model, let's test it with five randomly generated data points and predict the class (0 to 4) of each one.
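Prediction is just another nearest-center assignment, so `assign_points` can be reused directly; the test points below are my own sketch.

```python
# Five new points to classify.
test_x = np.random.normal(loc=0, scale=2, size=5)
test_y = np.random.normal(loc=0, scale=2, size=5)

# Each prediction is the index (0 to 4) of the nearest group point.
predictions = assign_points(test_x, test_y, cx, cy)
print(predictions)  # e.g. [3 0 4 1 1] -- varies with the random data
```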
Let's look at one sample next. As we can see in the two plots above, we have two data points (blue) at (-1, 0) and (1, 0). In the left plot each point belongs to a different group, while in the right plot both points belong to the orange group and the green group has no data points at all.
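The plots themselves aren't reproduced in this copy, but the same situation can be sketched with `assign_points` and two hand-picked initializations; the center coordinates below are my own choices, not the original post's.

```python
sample_x = np.array([-1.0, 1.0])
sample_y = np.array([0.0, 0.0])

# Initialization 1: one group point near each data point.
print(assign_points(sample_x, sample_y,
                    np.array([-1.0, 1.0]), np.array([1.0, 1.0])))
# -> [0 1]: each point ends up in its own group.

# Initialization 2: both group points on the same side.
print(assign_points(sample_x, sample_y,
                    np.array([0.5, 3.0]), np.array([0.5, 3.0])))
# -> [0 0]: both points join group 0 and group 1 stays empty.
```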
The code above can all be further improved, either for efficiency and optimization or for better visualization, but as this post is not about that, I will leave it as is.
Personally I use Jupyter Notebook; if you run the code directly with python, some visualizations can be messy, or popup windows will stay open until you close them. So if you don't use the notebook, you may have to adjust the pyplot calls accordingly.
Thank you for reading the post, and please let me know if you find any errors.