K-Nearest Neighbour & K-Means
Lecture 5
kNN Introduction
Problem statement
The first algorithm we’re going to see today is a very simple one. Let’s imagine we have a feature space with labelled data points, such as this:
We want to use these labelled data points as our training data to be able to predict the classification of new data points (such as those from our testing set).
The algorithm we’re going to use to do this classification is called K-nearest neighbour, or kNN for short. Unlike some of the algorithms we’ve seen, kNN isn’t mathematically derived; it is instead based on intuition.
Example solution
kNN is a classification algorithm where we, as the user, get to set $k$: the number of neighbours to consider when classifying a new data point.
The neighbours of a new data point can be determined by computing the Euclidean distance to every labelled point and selecting the $k$ closest.
Let’s say we set $k = 3$, for example: the new data point is then assigned the most common class among its 3 nearest neighbours.
The effect of $k$ is a trade-off: a small $k$ makes the classifier sensitive to noise, while a large $k$ smooths the decision boundary but can let distant points outvote genuinely local structure.
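To make the procedure concrete, here is a minimal sketch in Python (the function name and the toy data are our own illustration, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbours."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# A tiny labelled training set in a 2-D feature space.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> red
```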
Accounting for ‘ties’/‘draws’
What if, when using an even value of $k$, the vote among the neighbours is split equally between two classes? There are a few ways to break such ties:
- Only use odd values of $k$.
- Decrease $k$ until the tie is broken.
- Weight neighbours by their distance, as sketched below.
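As a sketch of that last option (our own illustration; the lecture doesn’t fix a particular weighting), each neighbour’s vote can be weighted by the inverse of its distance, so closer neighbours count for more:

```python
def weighted_vote(distances, labels):
    """Break ties by weighting each neighbour's vote by 1/distance."""
    votes = {}
    for d, label in zip(distances, labels):
        # Small epsilon guards against division by zero for exact matches.
        votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)

# Two neighbours per class (a tie by count), but red's are closer.
print(weighted_vote([0.5, 2.0, 0.6, 1.9], ["red", "blue", "red", "blue"]))  # -> red
```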
K-Means Introduction
Problem statement
Say we had a set of unlabelled data and wanted to separate the points into groups or classes. Below we have an example where, as humans, we can see 3 distinct groups of data points. In today’s lecture, we’re going to look at an algorithm that can identify these same clusters, or groups, systematically.
K-Means clustering
This algorithm is called K-means. In essence, it is an algorithm that finds the centres, or centroids, of $K$ clusters in the data.
Of course, we have to pick the value of $K$ ourselves.
Starting point
K-means is an iterative algorithm: to begin, the centroids of the clusters are randomly placed in the feature space. Let’s say that we initialise a K-means algorithm with $K = 3$ centroids.
Iterative process
As mentioned, K-means is an iterative process of updating the positions of the cluster centroids. After randomly placing each centroid in the feature space, the algorithm iteratively moves the centroids to better match the true clustering of the data points. We’ll get back to how this is done mathematically later in the lecture; for now we want to understand the intuition.
Assigning centroids
After the algorithm has converged or stopped, we will have 3 centroids that, hopefully, match the true clustering of the data points.
Once we have these positioned centroids, they can be used to label new data points by determining which cluster each new point falls under, i.e. which centroid it is closest to.
K-Means Algorithm Detail
Initialisation
Let $X = \{x_1, x_2, \dots, x_n\}$ be our set of data points.
And let $c_1, c_2, \dots, c_K$ be the centroids of the $K$ clusters.
To initialise the K-means algorithm, we randomly select $K$ positions in the feature space as the starting centroids.
After, we compute the distance between every data point $x_i$ and every centroid $c_k$, e.g. the squared Euclidean distance $\|x_i - c_k\|^2$.
So we assign each data point to the cluster whose centroid it is closest to: $\operatorname{cluster}(x_i) = \arg\min_k \|x_i - c_k\|^2$.
The position of each centroid is then recomputed as the mean of the points assigned to it: $c_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_i$, where $S_k$ is the set of points currently in cluster $k$.
Iteration
Classic optimisation problem: minimise the within-cluster sum of squared distances, $\arg\min_{C} \sum_{k=1}^{K} \sum_{x_i \in S_k} \|x_i - c_k\|^2$, by alternating the assignment and update steps above.
There are three criteria for stopping the iterative process:
- Moving the centroids no longer changes the clusters.
- Points remain within the same cluster as before.
- A maximum number of steps/iterations has been reached.
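Putting the assignment step, the update step, and the stopping criteria together, a minimal NumPy sketch of the loop might look like this (initialising the centroids at randomly chosen data points is one common choice, and the names here are our own):

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """Cluster the rows of X into K groups; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids at K randomly chosen data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):  # stopping criterion 3: a maximum number of steps
        # Assignment step: each point joins the cluster of the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = np.argmin(distances, axis=1)
        # Stopping criteria 1 and 2: no point changed cluster this iteration.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(K):
            if np.any(assignments == k):  # guard against empty clusters
                centroids[k] = X[assignments == k].mean(axis=0)
    return centroids, assignments
```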
Classification
To determine which cluster a new data point $x'$ falls within, we compute the distance from $x'$ to each of the $K$ centroids.
So we select the cluster to which our new data point is closest: $\arg\min_k \|x' - c_k\|^2$.
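In code, this classification step is just an arg-min over the distances to the fitted centroids (continuing the sketch above):

```python
import numpy as np

def assign_cluster(x_new, centroids):
    """Return the index of the centroid closest to the new point."""
    return int(np.argmin(np.linalg.norm(centroids - x_new, axis=1)))
```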
Evaluation of K-means
Since we don’t have true labels against which to evaluate the K-means algorithm, we must take a different approach to evaluating the groups of points it has clustered together. This works by evaluating the structure of the clusters themselves.
Intra-cluster distance
intra-cluster distance – the average distance between all data points in the same cluster.
intra-cluster diameter – the distance between the two most remote objects in a cluster.
Inter-cluster distance
inter-cluster distance – average smallest distance to a different cluster.
silhouette score – for each point, compares the mean intra-cluster distance $a$ with the mean distance to the nearest other cluster $b$: $s = \frac{b - a}{\max(a, b)}$. Values near 1 indicate tight, well-separated clusters; values near 0 or below indicate overlap.
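If scikit-learn is available, the silhouette score can be computed directly; a small self-contained example (the data here is our own illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated blobs and a cluster assignment that matches them.
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
assignments = np.array([0, 0, 1, 1])

# Mean silhouette over all points: near 1 means tight, well-separated
# clusters; near 0 or below means overlapping clusters.
print(silhouette_score(X, assignments))  # close to 1 for this data
```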
The effect of $K$
The cluster-quality metrics above, such as the silhouette score, can be computed for a range of values of $K$ and compared.
This may give us some indication as to how many clusters to use.
Other times, the value for $K$ is dictated by the problem itself, for example when we know in advance how many groups to expect.
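One way to sketch the first approach: run K-means for a range of $K$ values and compare a quality metric such as the silhouette score (using scikit-learn here for brevity; the synthetic data is our own):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic data: three Gaussian blobs, so K = 3 should score best.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

for K in range(2, 7):
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    print(K, round(silhouette_score(X, labels), 3))
# The K with the highest silhouette score is a reasonable candidate.
```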
Summary
In today’s lecture, we’ve had a look at two different classification algorithms:
- K-Nearest Neighbour, where we classify data points by looking at the existing classification of the $k$ nearest neighbours.
- K-Means, where, for unlabelled data, the algorithm finds the centroids of $K$ clusters, which we can use in future to classify new data points depending on which cluster they fall within.
For each of these algorithms, we first tried to understand, intuitively, what the algorithm is attempting to achieve. After that, we took a look at the mathematics behind the algorithm so that we can gain a deeper understanding and appreciation of its mechanics.