Letting the computer automatically find groupings in data is incredibly powerful and is at the heart of “data mining” and “machine learning”. One of the most widely used methods for clustering data is *k*-means clustering. Unfortunately, *k*-means clustering can fail spectacularly as in the example below.

Centroid-based clustering algorithms work on multi-dimensional data by partitioning data points into *k* clusters such that the sum of squares from points to the assigned cluster centers is minimized. In simple terms, clusters contain all of the data points that are “close” to the cluster center.
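To make the objective concrete, here is a tiny illustration (our own toy numbers, not from the monitoring data): the quantity *k*-means minimizes is the total within-cluster sum of squares, computed from each point's distance to its assigned center.

```r
# Toy 1-dimensional example of the k-means objective.
points <- c(0, 1, 10, 11)        # four data points
centers <- c(0.5, 10.5)          # two hypothetical cluster centers
assignment <- c(1, 1, 2, 2)      # which cluster each point belongs to

# Total within-cluster sum of squares: each point contributes its
# squared distance to its assigned center.
wss <- sum((points - centers[assignment])^2)
wss  # 1 (four points, each 0.5 away from its center: 4 * 0.25)
```

*k*-means searches over possible centers and assignments to drive this number down.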

Many of the simple examples online demonstrate what happens when a clustering algorithm partitions data into roughly equal groups. But what happens when we know ahead of time that the groups we are looking for have very different sizes?

In our case, we are working with pollution monitoring data. Each file is associated with a particular monitor, and each record within the file contains the latitude and longitude associated with an hourly measurement made by that monitor. A single file thus traces the life of one of these monitors:

- turned on in the lab for a few hours for testing
- turned on in the parking lot across from the lab for a day of outside testing
- field tested a few miles away
- field tested again after repositioning
- shipped across the country to be deployed
- deployed at site A
- moved to site B
- repositioned at site B

So we have a few records in one or two locations very close to each other. Then a few more records at a couple of sites nearby. Then some longer time series at one or more deployment sites with minor repositioning at the sites. Our goal is to cluster the latitude-longitude pairs into groups that differentiate between deployments that are hundreds of miles apart but that group together small movements associated with “repositioning”.

A simple mockup of this situation would be the following:

```r
x <- y <- c(rep(0,3), rep(1,3), rep(10,20), rep(11,20), rep(100,50), rep(101,50))
m <- cbind(x, y)
plot(m, cex = 4, lwd = 2)
```

The most powerful statistical tools we will ever use, our eyes, clearly identify three groups. Let’s see how R’s kmeans function does.

```r
layout(matrix(seq(8), ncol = 2))
par(mar = c(2,2,2,2))
for (i in 1:8) {
  clus <- stats::kmeans(m, 3)
  plot(x, y, xlab = '', ylab = '', axes = FALSE, xpd = NA,
       cex = 4, pch = as.character(clus$cluster))
  # Test the within-cluster sum of squares
  col <- ifelse(clus$tot.withinss < 500, 'black', 'red')
  box(col = col)
  text(50, 50, 'kmeans', cex = 2, font = 2)
}
```

When you run the code you may get different results because kmeans initializes its cluster centers randomly. With our experimental dataset, the R implementation of kmeans generates bogus clusters about 20% of the time. And the cluster identifiers change between runs. Yikes!
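One standard mitigation worth mentioning (our addition, not something we relied on in the post): kmeans accepts an nstart argument that runs the algorithm from multiple random starts and keeps the solution with the lowest total within-cluster sum of squares, which makes the bad outcomes much less likely.

```r
x <- y <- c(rep(0,3), rep(1,3), rep(10,20), rep(11,20), rep(100,50), rep(101,50))
m <- cbind(x, y)

set.seed(42)  # for reproducibility
# nstart = 25 runs kmeans from 25 random initializations and
# returns the best fit found.
clus <- stats::kmeans(m, 3, nstart = 25)
clus$tot.withinss  # should now fall well under the 500 threshold used above
```

This reduces, but does not eliminate, the risk of a bad local minimum, so visual inspection is still warranted.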

Luckily for those of us who aren’t statisticians, the CRAN Cluster Analysis Task View has advice that quickly leads to the cluster package, which has LOTS of clustering functions.

Cutting to the chase, for our very simple use of clustering, the sister functions pam and clara worked well. Partitioning Around Medoids (pam) is a *k*-medoids function that you can read more about if you’re really interested in why it works better than *k*-means. For large datasets, pam can be very slow and clara is recommended instead.

Here are the results of running pam on our dataset.

```r
for (i in 1:8) {
  clus <- cluster::pam(m, 3)
  plot(x, y, xlab = '', ylab = '', axes = FALSE, xpd = NA,
       cex = 4, pch = as.character(clus$cluster))
  box()
  text(50, 50, 'pam', cex = 2, font = 2)
}
```

Clusters appear where we expect them, with consistent cluster identifiers between runs.
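Part of the reason pam behaves well here (our reading of the cluster package documentation, not something we dug into deeply): its cluster centers, called medoids, are actual observations from the dataset rather than computed averages, so the centers can never drift into the empty space between wildly unequal groups.

```r
library(cluster)

x <- y <- c(rep(0,3), rep(1,3), rep(10,20), rep(11,20), rep(100,50), rep(101,50))
m <- cbind(x, y)

clus <- cluster::pam(m, 3)
clus$medoids  # each row is a real observation from m, one per cluster
```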

For clara we will try a slightly larger dataset:

```r
x <- y <- c(rep(0,30), rep(1,30), rep(10,200), rep(11,200), rep(100,500), rep(101,500))
m <- cbind(x, y)
for (i in 1:8) {
  clus <- cluster::clara(m, 3, samples = 10)
  plot(x, y, xlab = '', ylab = '', axes = FALSE, xpd = NA,
       cex = 4, pch = as.character(clus$cluster))
  box()
  text(50, 50, 'clara', cex = 2, font = 2)
}
```

Excellent results again.

While this may not be news to long-time clustering gurus, it was a little sobering for us. It taught us once again that just because you’ve heard of some fancy algorithm and are using R’s implementation of it does not guarantee that you’ll get the results you expect. You always have to inspect results visually.

In the end, “The eyes have it.”

Best of luck creating meaningful clusters!

Nice example!

You give a “Yikes” about the cluster numbers changing between runs, but it’s really not unusual in clustering, and it makes a lot of sense once you get into it. Not saying that it’s a benefit or anything, just that it makes sense: the clusters don’t come first with points going into them afterwards; the clusters are made up of points that are similar in some manner. Whether you call the groups “A”, “B”, “C”, or “1”, “2”, “3”, or “2”, “1”, “3” is a different matter. Some algorithms might return the same numbers for the same data, but that’s because of dependencies on the ordering of your data set or something else that has nothing to do with the actual data itself.

That’s a major pitfall for people jumping into clustering. It’s not unique to k-means or the R implementation of k-means.