K-means Clustering and Its Use Case in the Security Domain
What is Unsupervised Learning?
Unsupervised learning is a type of machine learning that draws conclusions from input data without labeled responses. In this setting, the training data is simply a collection of unlabeled examples.
What is K-Means Clustering?
The term “k-means” was first used by James MacQueen in 1967 in his paper “Some Methods for Classification and Analysis of Multivariate Observations”. The standard algorithm was first proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation. Essentially the same method was published by E. W. Forgy in 1965, which is why it is sometimes referred to as the Lloyd-Forgy method.
“Clustering” is the assignment of items to homogeneous groups (called clusters) such that items in different groups are dissimilar. Clustering is an unsupervised task because it tries to uncover the hidden structure of the items without using labels.
K-means clustering is a common unsupervised machine learning technique that is both simple and effective.
Put simply, clustering separates a population or set of data points into groups so that data points in the same group are more similar to each other than to data points in other groups. The objective of the k-means algorithm is to discover such groupings in the data.
K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters so that the sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized. The less variation there is within a cluster, the more homogeneous (similar) its data points are.
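The objective described above can be written compactly. The notation is an assumption introduced here for illustration and does not come from the original text: x_i denotes a data point, C_k the set of points assigned to cluster k, and \mu_k the centroid of that cluster.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

A smaller J means the points sit closer to their own centroids, which is the "less variation within clusters" property mentioned above.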
The way the K-means algorithm works is as follows:
- Specify the number of clusters K.
- Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
- Repeat the following steps until the centroids no longer change, i.e. the assignment of data points to clusters stops changing:
  - Compute the squared distance between each data point and every centroid.
  - Assign each data point to the closest cluster (centroid).
  - Recompute the centroid of each cluster by taking the average of all data points that belong to it.
The approach K-means follows to solve the problem is called Expectation-Maximization. The E-step assigns the data points to the closest cluster. The M-step recomputes the centroid of each cluster.
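The steps listed above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function names, the `max_iter` and `seed` parameters, and the sample points are all assumptions made for this example, and an empty cluster simply keeps its previous centroid.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Arithmetic mean of a non-empty list of points, coordinate by coordinate."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of numeric tuples) into `k` groups.

    Returns (centroids, labels, sse): labels[i] is the cluster index of
    points[i] and sse is the within-cluster sum of squared distances.
    """
    rng = random.Random(seed)
    # Initialise: pick k data points as centroids, without replacement.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment (E-step): each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update (M-step): move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


# Hypothetical 2-D data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, sse = kmeans(points, k=2)
print(labels, sse)
```

Which group receives index 0 depends on the random initialisation, so only the grouping itself is meaningful, not the label numbers.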
Applications of K-means Clustering
The k-means method is used in a wide range of applications, including market segmentation, document clustering, image segmentation, and image compression, among others. It is mostly used to categorise unlabeled data.
- Customer Profiling
- Market segmentation
- Computer vision
- Document clustering
- Identifying crime-prone areas
- Cluster analysis
- Feature learning or dictionary learning
- Insurance fraud detection
- Public transport data analysis
Cyber profiling using K-Means
1. Document analysis
There are many different reasons why you would want to run an analysis on a document. In this scenario, you want to be able to organize the documents quickly and efficiently.
Problem: Imagine you are limited in time and need to organize information held in documents quickly. To be able to complete this task you need to: understand the theme of the text, compare it with other documents and classify it.
Solution: Hierarchical clustering has been used to solve this problem. The algorithm looks at the text and groups it into different themes. Using this technique, you can quickly cluster and organize similar documents based on the characteristics identified in the text.
2. Spam filter
Do you know the junk folder in your email inbox? It is where emails end up once an algorithm has identified them as spam. Many machine learning courses, such as Andrew Ng’s famed Coursera course, use the spam filter as a standard example. It is usually framed as a supervised classification problem, but clustering can also be applied to it.
Problem: Spam emails are at best an annoying part of modern-day marketing techniques, and at worst, an example of people phishing for your personal data. To keep these emails out of your main inbox, email companies use algorithms. The purpose of these algorithms is to decide correctly whether or not an email is spam.
Solution: K-means clustering has been used to help identify spam. It works by looking at the different sections of the email (header, sender, and content) and then grouping similar emails together.
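As a rough illustration of the grouping idea, the sketch below turns email text into word-count vectors and clusters them. Everything in it is an assumption made for this example: the four toy emails are invented, only the message content is used (not the header or sender), and the `cluster` helper seeds its centroids with the first k vectors to keep the result reproducible.

```python
from collections import Counter

# Hypothetical toy emails; a real system would also use header and sender fields.
emails = [
    "win free prize now claim free money",
    "free money win cash prize now",
    "meeting agenda for project review tomorrow",
    "project review notes and meeting schedule",
]


def vectorize(texts):
    """Turn each text into a word-count vector over the shared vocabulary."""
    vocab = sorted({word for text in texts for word in text.split()})
    vectors = []
    for text in texts:
        counts = Counter(text.split())
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cluster(vectors, k, iterations=10):
    """Compact k-means that uses the first k vectors as initial centroids
    (a simplification of the usual random initialisation)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = []
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vocab, vectors = vectorize(emails)
labels = cluster(vectors, k=2)
for text, label in zip(emails, labels):
    print(label, text)
```

The algorithm only produces groups. Deciding which group is the spam group still requires someone to inspect a few emails from each cluster.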
3. Identifying fraudulent or criminal activity
In this scenario, we are going to focus on fraudulent taxi driver behavior. However, the technique has been used in multiple scenarios.
Problem: You need to look into fraudulent driving activity. The challenge is telling genuine activity apart from fraudulent activity.
Solution: By analysing the GPS logs, the algorithm is able to group similar behaviors. Based on the characteristics of each group, you can then classify the groups as genuine or fraudulent.
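One way to picture this is to reduce each trip to a single number and cluster those numbers. The sketch below is purely illustrative: the trips are invented, the coordinates are assumed to be already projected to kilometres, and the "detour ratio" feature (path length divided by straight-line distance) is an assumption for this example, not a feature named in the original text.

```python
import math

# Hypothetical GPS logs: each trip is a list of (x_km, y_km) waypoints
# in a local projected coordinate system.
trips = {
    "trip_a": [(0, 0), (1, 0), (2, 0), (3, 0)],
    "trip_b": [(0, 0), (1, 1), (2, 2)],
    "trip_c": [(0, 0), (0, 3), (3, 3), (3, 0), (1, 0)],
    "trip_d": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 1)],
    "trip_e": [(0, 0), (0, 2), (1, 2)],
}


def detour_ratio(waypoints):
    """Path length divided by the straight-line distance from start to end.

    Assumes the start and end points differ, otherwise the ratio is undefined.
    """
    path = sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))
    direct = math.dist(waypoints[0], waypoints[-1])
    return path / direct


def two_means_1d(values, iterations=20):
    """K-means with K=2 on scalars, seeded with the smallest and largest value."""
    low, high = min(values), max(values)
    labels = []
    for _ in range(iterations):
        # Assign each value to the nearer of the two centroids.
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        # Recompute the centroids as the group means.
        if group0:
            low = sum(group0) / len(group0)
        if group1:
            high = sum(group1) / len(group1)
    return labels


names = list(trips)
ratios = [detour_ratio(trips[name]) for name in names]
labels = two_means_1d(ratios)
# Cluster 1 is the group around the larger centroid, i.e. the long detours.
high_detour = [name for name, lab in zip(names, labels) if lab == 1]
print(high_detour)
```

A high detour ratio is not proof of fraud. The clustering only surfaces a group of unusual trips that an analyst would then review.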