efg's Research Notes:  R TechNotes and Graphics Gallery

Correlation "Distances" and
Hierarchical Clustering

Earl F. Glynn
Stowers Institute for Medical Research
29 Dec 2005

 

Purpose
There are many ways to perform hierarchical clustering in R, but a "controlled" experiment can help show which methods are most useful when analyzing experimental data.  This TechNote describes an R experiment in which data with "known" correlations are analyzed to create a hierarchical clustering.  Some of the results presented here should be useful when analyzing experimental data where the "answer" is not known.

Background
Creating random values in R is easy with its various random number generators.  We could create completely random data to analyze, but the approach here is to create pairs of data variables with known correlations ranging from 0.0 (completely uncorrelated) to 1.0 (completely correlated).  Actually, the correlation range from -1.0 to +1.0 will be spanned by picking pairs of values with these correlations:  0.0, -0.20, +0.40, -0.60, +0.80, -1.00.  How can random values with known correlations be computed?

Step-by-Step Instructions

A recent posting by Robin Hankin to the R-News mailing list provided some details.  The CorrelatedXY function below creates N pairs of normal random values with given means and variances, and with a specified correlation:

library(mvtnorm)

# Based on R-News posting by Robin Hankin, 16 Dec 2005
# http://tolstoy.newcastle.edu.au/R/help/05/12/17693.html
CorrelatedXY <- function(N, mean1, mean2,
                         variance1, variance2,
                         correlation)
{
  corvar <- correlation * sqrt(variance1*variance2)
  rmvnorm(n=N, mean=c(mean1, mean2),
          sigma=matrix(c(variance1, corvar,
                         corvar,    variance2), 2, 2))
}
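
As a quick sanity check (a sketch, not part of the original note), the function can be verified by generating one large sample and comparing the sample statistics with the requested parameters:

# Check one pair: means 2 and 5, variances 1 and 2, correlation 0.8
xy <- CorrelatedXY(10000, 2, 5, 1, 2, 0.8)
round(colMeans(xy), 2)       # should be near 2 and 5
round(apply(xy, 2, var), 2)  # should be near 1 and 2
round(cor(xy)[1,2], 2)       # should be near 0.8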

Let's now create six sets of correlated, normally-distributed values with these parameters:

N <- 1000
mean1 <- 2
mean2 <- 5
variance1 <- 1
variance2 <- 2

Let's set the random number seed so this experiment is repeatable.

set.seed(17)

The following may seem a bit obscure but is not.  The Raw matrix is initialized with N pairs of random values that have correlation 0.0 with each other.  On each pass, the for loop then adds another set of N pairs of random values as two more columns of the matrix (using cbind).  The added sets have the approximate correlations -0.2, +0.4, -0.6, +0.8, and -1.0:

Raw <- CorrelatedXY(N, mean1, mean2, variance1, variance2, 0.0)
for (i in 1:5)
{
  Raw <- cbind(Raw, CorrelatedXY(N, mean1, mean2,
                    variance1, variance2,
                    (-1)^i * 0.2*i) )
}

Column names are assigned to the Raw data matrix.  The prefix "P" is for "Plus" and "M" is for "Minus" correlation.  The value 00 means a correlation of 0.0, the value 02 means a correlation of 0.2, ..., the value 10 means a correlation of 1.0. The "A" and "B" suffixes are used to distinguish the pair of values in each set:

colnames(Raw) <- paste(rep(c("P", "P", "M", "M"), 3),
                       sprintf("%2.2d", c(0,0,2,2,4,4,6,6,8,8,10,10)),
                       rep(c("A", "B"), 6), sep="")

 

# Look at correlation matrix
> round( cor(Raw), 4)
        P00A    P00B    M02A    M02B    P04A    P04B    M06A    M06B    P08A    P08B    M10A    M10B
P00A  1.0000  0.0322 -0.0054 -0.0258 -0.0247  0.0102 -0.0159  0.0163 -0.0279 -0.0248  0.0150 -0.0150
P00B  0.0322  1.0000 -0.0349 -0.0052  0.0072  0.0231  0.0107  0.0088 -0.0303 -0.0299  0.0397 -0.0397
M02A -0.0054 -0.0349  1.0000 -0.2106 -0.0145  0.0125 -0.0175 -0.0124  0.0135  0.0283 -0.0246  0.0246
M02B -0.0258 -0.0052 -0.2106  1.0000  0.0740  0.0145  0.0479 -0.0402  0.0252  0.0327  0.0292 -0.0292
P04A -0.0247  0.0072 -0.0145  0.0740  1.0000  0.3925  0.0734 -0.0272 -0.0184 -0.0048 -0.0231  0.0231
P04B  0.0102  0.0231  0.0125  0.0145  0.3925  1.0000  0.0340 -0.0367 -0.0452 -0.0329  0.0229 -0.0229
M06A -0.0159  0.0107 -0.0175  0.0479  0.0734  0.0340  1.0000 -0.5693  0.0201 -0.0280  0.0084 -0.0084
M06B  0.0163  0.0088 -0.0124 -0.0402 -0.0272 -0.0367 -0.5693  1.0000 -0.0400  0.0093 -0.0168  0.0168
P08A -0.0279 -0.0303  0.0135  0.0252 -0.0184 -0.0452  0.0201 -0.0400  1.0000  0.7842  0.0278 -0.0278
P08B -0.0248 -0.0299  0.0283  0.0327 -0.0048 -0.0329 -0.0280  0.0093  0.7842  1.0000  0.0475 -0.0475
M10A  0.0150  0.0397 -0.0246  0.0292 -0.0231  0.0229  0.0084 -0.0168  0.0278  0.0475  1.0000 -1.0000
M10B -0.0150 -0.0397  0.0246 -0.0292  0.0231 -0.0229 -0.0084  0.0168 -0.0278 -0.0475 -1.0000  1.0000

 

The paired off-diagonal entries above (for example, -0.2106 for M02A/M02B, 0.3925 for P04A/P04B, ..., -1.0000 for M10A/M10B) show the assigned correlation values are approximately correct, and the rest of the data is essentially uncorrelated.

A "map" of the correlation matrix can be created by assigning a color to each cell of the matrix:

corRaw <- cor(Raw)
library(spatstat) # "im" function
plot(im(corRaw[nrow(corRaw):1,]), main="Correlation Matrix Map")
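
If the spatstat package is not available, a roughly equivalent map can be drawn with base graphics (this alternative is a sketch, not part of the original note):

# image() draws row 1 of the matrix at the bottom, so flip the rows to keep
# P00A in the upper-left corner
image(1:ncol(corRaw), 1:nrow(corRaw), t(corRaw[nrow(corRaw):1, ]),
      axes=FALSE, xlab="", ylab="", main="Correlation Matrix Map")
axis(1, at=1:ncol(corRaw), labels=colnames(corRaw), las=2)
axis(2, at=1:nrow(corRaw), labels=rev(rownames(corRaw)), las=2)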

[Figure: Correlation Matrix Map]
Several methods of "dissimilarity" were explored to see if one method is "better" in some way than the others:

  • Dissimilarity = 1 - Correlation
  • Dissimilarity = (1 - Correlation)/2
  • Dissimilarity = 1 - Abs(Correlation)
  • Dissimilarity = Sqrt(1 - Correlation^2)

The 1st, 3rd and 4th methods above were suggested in a Dec 28 posting to R-Help.  The 2nd method is suggested in R's ?dist help page; it simply rescales the 1st method so the values always run from 0.0 to 1.0 instead of 0.0 to 2.0.
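
For reference, all four dissimilarity matrices can be computed directly from the correlation matrix (the variable names below are only illustrative):

C  <- cor(Raw)
d1 <- 1 - C          # 1 - Correlation
d2 <- (1 - C)/2      # (1 - Correlation)/2
d3 <- 1 - abs(C)     # 1 - Abs(Correlation)
d4 <- sqrt(1 - C^2)  # Sqrt(1 - Correlation^2)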

Here's the R code that creates a distance matrix from 1-Correlation as the dissimilarity index:

dissimilarity <- 1 - cor(Raw)
distance <- as.dist(dissimilarity)

Note the as.dist function is used here to assign the correlation values to be "distances".  (In some cases, you may want to use the dist function to compute distances using a variety of distance metrics instead.)
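
For example, the following alternative (a sketch, not used in the rest of this note) would compute Euclidean distances between the 12 variables from the raw values themselves rather than reuse the correlations:

# dist() computes distances between rows, so transpose Raw to compare variables
euclidean.distance <- dist(t(Raw))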

> round(distance, 4)
       P00A   P00B   M02A   M02B   P04A   P04B   M06A   M06B   P08A   P08B   M10A
P00B 0.9678                                                                      
M02A 1.0054 1.0349                                                               
M02B 1.0258 1.0052 1.2106                                                        
P04A 1.0247 0.9928 1.0145 0.9260                                                 
P04B 0.9898 0.9769 0.9875 0.9855 0.6075                                          
M06A 1.0159 0.9893 1.0175 0.9521 0.9266 0.9660                                   
M06B 0.9837 0.9912 1.0124 1.0402 1.0272 1.0367 1.5693                            
P08A 1.0279 1.0303 0.9865 0.9748 1.0184 1.0452 0.9799 1.0400                     
P08B 1.0248 1.0299 0.9717 0.9673 1.0048 1.0329 1.0280 0.9907 0.2158              
M10A 0.9850 0.9603 1.0246 0.9708 1.0231 0.9771 0.9916 1.0168 0.9722 0.9525       
M10B 1.0150 1.0397 0.9754 1.0292 0.9769 1.0229 1.0084 0.9832 1.0278 1.0475 2.0000

The hierarchical clustering function, hclust, expects a dissimilarity ("dist") object.  The plot function knows how to plot a dendrogram from hclust's result:

plot(hclust(distance),
     main="Dissimilarity = 1 - Correlation", xlab="")

The results for this simple dissimilarity index are not good.  In particular, the values with a correlation of -1.0 are grouped incorrectly (M10B with M02A, and M10A with P00B).

[Figure: dendrogram, Dissimilarity = 1 - Correlation]

As expected, the second method only changed the scaling and did not affect the clusters:
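
In R terms, the only change is the dissimilarity definition (a sketch following the pattern above):

dissimilarity <- (1 - cor(Raw))/2
plot(hclust(as.dist(dissimilarity)),
     main="Dissimilarity = (1 - Correlation)/2", xlab="")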

[Figure: dendrogram, Dissimilarity = (1 - Correlation)/2]
The dissimilarity measure 1 - Abs(Correlation) worked well to discriminate all correlated pairs.  In addition, pairs with "stronger" correlation were ordered correctly from the bottom (abs(correlation) = 1.0) to the top (correlation = 0.0), almost linearly with their correlation values.  All correlated values were paired correctly, from abs(correlation) = 0.2 up to 1.0.  The values that were not correlated, P00A and P00B, were not grouped together, but that is expected.
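
The corresponding call (again, a sketch following the same pattern):

dissimilarity <- 1 - abs(cor(Raw))
plot(hclust(as.dist(dissimilarity)),
     main="Dissimilarity = 1 - Abs(Correlation)", xlab="")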

[Figure: dendrogram, Dissimilarity = 1 - Abs(Correlation)]
The final dissimilarity measure, Sqrt(1 - Correlation^2), also worked well, but it narrowed the vertical spread shown in the diagram.  This method might be better when only a small number of highly correlated clusters are desired.  Many clusters that are resolved a bit better by the previous method are not as well resolved in this diagram:
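
A sketch of the corresponding call:

dissimilarity <- sqrt(1 - cor(Raw)^2)
plot(hclust(as.dist(dissimilarity)),
     main="Dissimilarity = Sqrt(1 - Correlation^2)", xlab="")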

[Figure: dendrogram, Dissimilarity = Sqrt(1 - Correlation^2)]

The R cluster package provides an alternative "agnes" function (agglomerative nesting) for plotting dendrograms:

library(cluster)
plot(agnes(distance))

"agnes" can work with a dissimilarity matrix, like hclust, or can work on the raw data directly if the parameter diss = FALSE.

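For instance, the following sketch (an assumption, not from the original note) clusters the variables from the raw data directly; with diss = FALSE, agnes computes its own distances (Euclidean by default) between the rows of its input, which is not the same as the correlation-based distances used above:

# Cluster the 12 variables: each row of t(Raw) is one variable's 1000 values
plot(agnes(t(Raw), diss=FALSE))
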
The "banner" plot is explained under ?plot.agnes:

The banner displays the hierarchy of clusters, and is equivalent to a tree .... The banner plots distances at which observations and clusters are merged. The  observations are listed in the order found by the 'agnes' algorithm, and the numbers in the 'height' vector are represented as bars between the observations.

The leaves of the clustering tree are the original observations. Two branches come together at the distance between the two clusters being merged.

The agglomerative coefficient is explained under ?agnes.object:

ac: the agglomerative coefficient, measuring the clustering structure of the dataset.

For each observation i, denote by m(i) its dissimilarity to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm. The 'ac' is the average of all 1 - m(i). It can also be seen as the average width (or the percentage filled) of the banner plot. Because 'ac' grows with the number of observations, this measure should not be used to compare datasets of very different sizes.
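
The coefficient itself can be read from the agnes object, for example:

agn <- agnes(distance)
agn$ac   # agglomerative coefficient for this dissimilarity matrix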

The usefulness of the agglomerative coefficient is not clear from this brief exercise.

[Figures: agnes banner plot and dendrogram]

The dendrograms from hclust and agnes look similar but are slightly different.  There is no single "correct" ordering of the clustered objects.  While agnes put the two "random" variables together (P00A and P00B), hclust instead associated each of them as being "closer" to other clusters.  The pvclust package below tries to improve on this by assigning probabilities to the various clusters.

In addition to agnes, the R "cluster" package provides an alternative "pam" (partitioning around medoids) function for partitioning the data.  For example:

plot( pam(distance, 6))

[Figure: pam(distance, 6) partition plots]

See ?pam.object for details.
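
A few components of the pam object that may be worth inspecting (a sketch; see ?pam.object for the full list):

pam6 <- pam(distance, 6)
pam6$medoids            # the six observations chosen as cluster medoids
pam6$clustering         # cluster membership for each variable
pam6$silinfo$avg.width  # average silhouette width of the 6-cluster partition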

Yet another hierarchical clustering alternative is provided by the Hmisc package:

library(Hmisc)

plot( varclus(Raw, similarity="spearman") )
plot( varclus(Raw, similarity="pearson") )

[Figures: varclus dendrograms using Spearman and Pearson similarities]

The results of the Hmisc varclus function are roughly similar to the results from hclust with the dissimilarity measure 1 - Abs(Correlation).

Finally, the dendrogram option from the pvclust package was considered (see paper or poster). The package helps assess the uncertainty in hierarchical cluster analysis by calculating p-values.

library(pvclust)

cluster.bootstrap <- pvclust(Raw, nboot=1000, method.dist="abscor")
plot(cluster.bootstrap)
pvrect(cluster.bootstrap)

[Figure: pvclust dendrogram with AU/BP values, method.dist="abscor"]

The pvclust site gives this information about how to interpret the AU and BP p-values above:

pvclust provides two types of p-values: AU (Approximately Unbiased) p-value and BP (Bootstrap Probability) value. AU p-value, which is computed by multiscale bootstrap resampling, is a better approximation to unbiased p-value than BP value computed by normal bootstrap resampling.

The pvrect function can be used to automatically highlight significant clusters.
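
The same clusters can also be listed, rather than drawn, with pvclust's pvpick function, using the same default threshold that pvrect uses:

# Clusters whose AU p-value is at least 0.95
pvpick(cluster.bootstrap, alpha=0.95)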

The dendrogram above was created using the "abscor" method, i.e., the dissimilarity measure 1-Abs(Correlation).  If the simpler "correlation" method is used, only two clusters are found to be significant:

# Default "correlation" method does not work well here
cluster.bootstrap <- pvclust(Raw, nboot=1000,
                             method.dist="correlation")
plot(cluster.bootstrap)
pvrect(cluster.bootstrap)

[Figure: pvclust dendrogram, method.dist="correlation"]

Conclusions
The dissimilarity measure, 1-Abs(Correlation), seems to work best if correlation values can be either positive or negative.  Only in the case of positive-only correlations should a simpler dissimilarity measure, 1-Correlation, be used.

Dendrograms for the same data can be presented in a variety of ways.  The computationally intensive bootstrap procedure is useful to assess the uncertainty in hierarchical clustering.


Download R code to reproduce the above figures


Updated
29 Dec 2005