PERFORMANCE EVALUATION OF SELECTED DISTANCE-BASED AND DISTRIBUTION-BASED CLUSTERING ALGORITHMS
Abstract
Clustering is an automated search for hidden patterns in a data set that unveils groups of related observations. It is one of the viable means of revealing the patterns or internal structure of data within the same collection. Choosing the right algorithm to obtain clusters of good quality is usually a challenge, especially when the number of clusters cannot be determined in advance. This study evaluates how well selected clustering algorithms find quality clusters in a data set. To that end, one prominent technique from each of the distance-based and distribution-based families, namely k-means and EM clustering respectively, was implemented. The data set on which the algorithms were run comprised 1,309 records of passengers who boarded a ship, retrieved from the RapidMiner open repository. Experiments were conducted and clusters were formed according to the chosen number of partitions, k. The quality of the clusters formed was measured with an external validation criterion, Normalized Mutual Information (NMI). The results show that the distance-based algorithm finds clusters of higher quality, with an NMI value of 0.912 out of a maximum achievable value of 1. The experiments also report the average execution time each algorithm takes to form the cluster model. The findings further offer useful insight into the choice of clustering algorithm with regard to its support for particular data types and its ease of execution.
Keywords: clustering, data mining, k-means, EM clustering, unsupervised learning.
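
As an illustration of the external criterion named in the abstract, the following minimal sketch computes NMI between a set of reference class labels and a set of cluster assignments. It is not the implementation used in the study. The function names (entropy, mutual_information, nmi) are hypothetical, and the normalization by the arithmetic mean of the two entropies is an assumption, since the abstract does not state which NMI variant was applied.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(truth, pred):
    """Mutual information between two labelings of the same records."""
    n = len(truth)
    joint = Counter(zip(truth, pred))   # co-occurrence counts
    t_counts = Counter(truth)           # reference class sizes
    p_counts = Counter(pred)            # cluster sizes
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * math.log(c * n / (t_counts[t] * p_counts[p]))
    return mi


def nmi(truth, pred):
    """NMI in [0, 1]; assumes arithmetic-mean normalization (one of several variants)."""
    if len(truth) != len(pred):
        raise ValueError("label sequences must have the same length")
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t + h_p == 0:
        # Both labelings put every record in a single group: treat as identical.
        return 1.0
    return 2.0 * mutual_information(truth, pred) / (h_t + h_p)


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    clusters = [1, 1, 0, 0]   # same grouping, different cluster names
    print(nmi(reference, clusters))
```

Because NMI depends only on how records are grouped and not on the cluster names, a clustering that reproduces the reference grouping exactly scores 1 even when its labels are permuted, while a clustering that is statistically independent of the reference scores 0.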