
K-means clustering versus validation measures: A data distribution perspective

  • Hui Xiong*
  • Junjie Wu
  • Jian Chen
  • *Corresponding author for this work
  • Rutgers - The State University of New Jersey, New Brunswick
  • Tsinghua University

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

K-means is a widely used partitional clustering method. Although considerable research has been devoted to characterizing the key features of K-means clustering, further investigation is needed to reveal whether and how data distributions affect its performance. In this paper, we revisit the K-means clustering problem by answering three questions. First, how do the "true" cluster sizes affect the performance of K-means clustering? Second, is entropy an algorithm-independent validation measure for K-means clustering? Finally, what is the distribution of the clustering results produced by K-means? To that end, we first illustrate that K-means tends to generate clusters with relatively uniform sizes. In addition, we show that the entropy measure, an external clustering validation measure, favors clustering algorithms that tend to reduce high variation in the cluster sizes. Finally, our experimental results indicate that K-means tends to produce clusters in which the variation of the cluster sizes, as measured by the Coefficient of Variation (CV), falls in a specific range, approximately from 0.3 to 1.0.
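
The two quantities named in the abstract can be made concrete with a small sketch. The definitions used here are assumptions rather than formulas quoted from the paper: CV is taken as the sample standard deviation of the cluster sizes divided by their mean, and entropy as the size-weighted average of the per-cluster class entropy (base 2). The function names are hypothetical.

```python
import math
from collections import Counter


def coefficient_of_variation(sizes):
    """CV of cluster sizes: sample standard deviation divided by the mean.

    Assumes at least two clusters (the sample variance divides by k - 1).
    """
    k = len(sizes)
    mean = sum(sizes) / k
    variance = sum((s - mean) ** 2 for s in sizes) / (k - 1)
    return math.sqrt(variance) / mean


def clustering_entropy(true_labels, cluster_labels):
    """Size-weighted average of the class entropy inside each cluster.

    This is one common form of the external entropy measure; it is an
    assumed definition, not necessarily the exact one used by the authors.
    """
    n = len(true_labels)
    # Group the true class labels by the cluster each point was assigned to.
    members_by_cluster = {}
    for true_label, cluster in zip(true_labels, cluster_labels):
        members_by_cluster.setdefault(cluster, []).append(true_label)

    total = 0.0
    for members in members_by_cluster.values():
        size = len(members)
        cluster_entropy = 0.0
        for count in Counter(members).values():
            p = count / size
            cluster_entropy -= p * math.log2(p)
        total += (size / n) * cluster_entropy
    return total
```

Under these assumed definitions, perfectly balanced cluster sizes give a CV of 0, and lower entropy indicates clusters that are purer with respect to the true classes.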

Original language: English
Title of host publication: KDD 2006
Subtitle of host publication: Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages: 779-784
Number of pages: 6
State: Published - 2006
Externally published: Yes
Event: KDD 2006: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Philadelphia, PA, United States
Duration: 20 Aug 2006 to 23 Aug 2006

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Volume: 2006

Conference

Conference: KDD 2006: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/Territory: United States
City: Philadelphia, PA
Period: 20/08/06 to 23/08/06

Keywords

  • Coefficient of Variation (CV)
  • Entropy
  • K-means Clustering
