Difference Between K-Means and K-Medoids Clustering
Clustering is one of the most basic forms of data grouping in data analysis and machine learning: a given set of objects is partitioned into groups so that objects within the same group are more similar to one another than to objects in other groups. Two popular algorithms for this task are K-Means and K-Medoids. Although related in purpose and usage, they are not the same.
This article outlines the major points of difference between the K-Means and K-Medoids clustering algorithms.
What is K-Means Clustering?
K-Means is an iterative algorithm that partitions a dataset into K clusters, where each cluster is represented by the average of all the points assigned to it, usually referred to as the centroid. The steps involved in K-Means are listed below, followed by a minimal code sketch:
- Initialization: Choose K points at random as the initial centroids.
- Assignment: Assign every data point to the cluster of its closest centroid, typically using squared Euclidean distance.
- Update: Recompute each centroid as the average of all the points assigned to that cluster.
- Repeat: Alternate the assignment and update steps until the centroids no longer change (this is called convergence).
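A minimal NumPy sketch of these four steps (the function and variable names here are our own, not from any particular library; for production use, a library implementation such as scikit-learn's `KMeans` is the usual choice):

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    # X: (n_samples, n_features) array; no empty-cluster guard, for brevity.
    rng = np.random.default_rng(seed)
    # Initialization: choose K points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: label each point with its nearest centroid
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat: stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.3, 7.9]])
labels, centroids = k_means(X, k=2)
print(labels)     # e.g. [0 0 1 1] (cluster numbering may differ)
print(centroids)
```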
Advantages of K-Means
- Simplicity: K-Means is easy to use and its algorithm is straightforward.
- Efficiency: Its per-iteration time complexity is low, so it scales well to large datasets.
- Speed: It generally converges quickly.
Disadvantages of K-Means
- Sensitivity to Outliers: K-Means is susceptible to noise and outliers, chiefly because it relies on means when forming cluster assignments.
- Shape Assumption: It implicitly treats clusters as spherical and of similar size, which is not always true.
- Initial Centroids: Different choices of the initial K centroids can lead to different, and sometimes poor, clusterings.
What is K-Medoids Clustering?
K-Medoids, or Partitioning Around Medoids (PAM), is similar to K-Means but represents each cluster by a medoid rather than a mean. A medoid is an actual data point in the cluster whose total distance to the other points in that cluster is minimal, making it the cluster's most representative member. The steps in K-Medoids are listed below, followed by a minimal sketch:
- Initialization: Choose K data points at random as the initial medoids.
- Assignment: Assign each data point to its nearest medoid according to a chosen distance metric.
- Update: For each cluster, choose as the new medoid the member that minimizes the sum of distances to the other points in the cluster.
- Repeat: Alternate the assignment and update steps until the medoids no longer change (convergence).
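A minimal sketch of this alternating scheme, assuming NumPy and SciPy are available (full PAM additionally evaluates swaps between medoids and non-medoids across clusters, and third-party packages such as scikit-learn-extra ship ready-made implementations):

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_medoids(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick K actual data points as the starting medoids.
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    D = cdist(X, X)  # full pairwise (Euclidean) distance matrix
    for _ in range(n_iters):
        # Assignment: each point goes to its nearest medoid.
        labels = D[:, medoid_idx].argmin(axis=1)
        # Update: within each cluster, the new medoid is the member with
        # the smallest total distance to the other members.
        # (Assumes no duplicate medoid points, for brevity.)
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_idx[j] = members[costs.argmin()]
        # Repeat: stop once the set of medoids no longer changes.
        if set(new_idx) == set(medoid_idx):
            break
        medoid_idx = new_idx
    # Final assignment against the final medoids.
    labels = D[:, medoid_idx].argmin(axis=1)
    return labels, X[medoid_idx]

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.3, 7.9]])
labels, medoids = k_medoids(X, k=2)
print(labels, medoids, sep="\n")  # medoids are rows of X itself
```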
Advantages of K-Medoids
- Robustness to Outliers: Medoids are less affected by outliers and noise than centroids.
- Flexibility in Distance Metrics: K-Medoids can use any distance metric, not just Euclidean distance, which makes it applicable to a wider range of data types, as the snippet below shows.
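One way to see this flexibility, using SciPy's `cdist`: the sketch above touches the data only through the pairwise distance matrix `D`, so switching metrics is a one-line change (the metric names below are SciPy's):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0]])

D_euclidean = cdist(X, X)                      # default: Euclidean
D_manhattan = cdist(X, X, metric="cityblock")  # Manhattan / L1
D_cosine    = cdist(X, X, metric="cosine")     # cosine distance
```

Any of these matrices can be substituted for the `D` computed inside the `k_medoids` sketch, whereas K-Means is tied to means, which only make sense with (squared) Euclidean distance.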
Disadvantages of K-Medoids
- Computationally Intensive: K-Medoids is noticeably slower than K-Means because of its higher per-iteration complexity, especially on larger datasets.
- Complexity: It is also somewhat harder to implement and understand than K-Means, despite often producing more robust clusterings.
Key Differences Between K-Means and K-Medoids
Centroid vs. Medoid
- K-Means: Uses the mean of the points in a cluster as the centroid, which may not be an actual data point.
- K-Medoids: Uses actual data points as medoids, making it more interpretable.
Distance Measures
- K-Means: Typically uses Euclidean distance, which may not be suitable for all data types.
- K-Medoids: Can use any distance measure, providing more flexibility.
Sensitivity to Outliers
- K-Means: Sensitive to outliers, as they can significantly affect the mean.
- K-Medoids: More robust to outliers, as medoids are actual data points and are less influenced by extreme values; the small example below illustrates this.
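A tiny one-dimensional example makes both of the points above concrete (the numbers are illustrative only):

```python
import numpy as np

# One compact cluster (1..4) plus a single outlier (100).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

mean = x.mean()  # 22.0: dragged far from the cluster, and not a data point

# Medoid: the point with the smallest total distance to all the others.
total_dists = np.abs(x[:, None] - x[None, :]).sum(axis=1)
medoid = x[total_dists.argmin()]  # 3.0: an actual point, barely affected
print(mean, medoid)
```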
Computational Complexity
- K-Means: Generally faster and more efficient, making it suitable for very large datasets.
- K-Medoids: Slower, because classic PAM evaluates many candidate medoid swaps; better suited to smaller datasets or cases where robustness is crucial. Rough per-iteration cost estimates are sketched below.
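As a rough, textbook-style account of why this gap arises (per iteration, for n points, k clusters, and d dimensions):

```latex
\underbrace{O(nkd)}_{\text{K-Means: assign + average}}
\qquad\text{vs.}\qquad
\underbrace{O\!\left(k\,(n-k)^{2}\right)}_{\text{PAM: evaluate all medoid swaps}}
```

The quadratic term comes from trying each of the k(n-k) possible medoid/non-medoid swaps and re-scoring the clustering for each.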
Convergence
- K-Means: Converges faster but may end up in local minima.
- K-Medoids: Tends to reach better local optima, since each medoid swap is scored against the full clustering cost, but at a higher computational cost.
Difference Between K-Means and K-Medoids Clustering
| Aspect | K-Means Clustering | K-Medoids Clustering |
|---|---|---|
| Representation of Clusters | Uses the mean of the points (centroid) to represent a cluster; the centroid may not be an actual data point. | Uses the most centrally located data point (medoid) to represent a cluster. |
| Sensitivity to Outliers | Highly sensitive to outliers. | More robust to outliers. |
| Distance Metrics | Primarily uses Euclidean distance. | Can use any distance metric. |
| Computational Efficiency | Generally faster and more efficient. | Slower, due to the need to calculate pairwise distances within clusters. |
| Cluster Shape Assumption | Assumes roughly spherical clusters of similar size. | Makes no strong assumptions about cluster shape. |
Practical Considerations
When to Use K-Means
- When dealing with large datasets.
- When computational efficiency is a priority.
- When the data is well-behaved and not heavily influenced by outliers.
When to Use K-Medoids
- When robustness to outliers is important.
- When the dataset is smaller and the flexibility of using different distance measures is beneficial.
- When interpretability of cluster centers as actual data points is needed.
Conclusion
K-Means and K-Medoids are two important clustering algorithms, each with its own strengths and weaknesses. Which one to choose depends on the characteristics of the data, the available computational budget, and how strongly outliers would affect the result. Understanding these trade-offs lets a data scientist pick the method best suited to the clustering job at hand.