efg's Research Notes:  R TechNotes and Graphics Gallery

 Correlation "Distances" and Hierarchical Clustering Earl F. Glynn Stowers Institute for Medical Research 29 Dec 2005

 Purpose There are many ways to perform hierarchical clustering in R, but a "controlled" experiment may be useful in understanding which methods may be more useful when analyzing experimental data.  This Technote explains an R experiment where data with "known" correlations are analyzed to create a hierarchical clustering. Some of the results presented here should be useful in analyzing experimental data when the "answer" is not known. Background Creating random values in R is easy with its various random number generators.  We could create completely random data to analyze, but the approach here is to create pairs of data variables with known correlations ranging from 0.0 (completely uncorrelated) to 1.0 (complete correlated).  Actually, the correlation range from -1.0 to +1.0 will be spanned by picking pairs of values with these correlations:  0.0, -0.20, +0.40, -0.60, +0.80, -1.00.  How can random values with known correlations be computed? Step-by-Step Instructions A recent posting by Robin Hankin to the R-News mailing list provided some details.  The CorrelatedXY function below creates the specified pairs, N, of normal random values with given mean and variance, but with a specified correlation, too: library(mvtnorm) # Based on R-News posting by Robin Hankin, 16 Dec 2005 # http://tolstoy.newcastle.edu.au/R/help/05/12/17693.html CorrelatedXY <- function(N, mean1,mean2,                          variance1,variance2,                          correlation) {   corvar <- correlation * sqrt(variance1*variance2)   rmvnorm(n=N,mean=c(mean1,mean2),   sigma=matrix(c(variance1, corvar,   corvar, variance2), 2,2)) } Let's now create six sets of correlated, normally-distributed values with these parameters: N <- 1000 mean1 <- 2 mean2 <- 5 variance1 <- 1 variance2 <- 2 Let's set this random number seed so this experiment is repeatable. set.seed(17) The following may seem a bit obscure but is not.  The Raw matrix is initialized to be N pairs of random values that have correlation 0.0 with each other.  The for loop then adds another N pairs of random values (using cbind) as columns to the matrix.  The sets have the approximate correlations:  -0.2, +0.4, -0.6, +0.8, -1.0: Raw <- CorrelatedXY(N, mean1, mean2, variance1, variance2, 0.0) for (i in 1:5) {   Raw <- cbind(Raw, CorrelatedXY(N, mean1, mean2,                     variance1, variance2,                     (-1)^i * 0.2*i) ) } Column names are assigned to the Raw data matrix.  The prefix "P" is for "Plus" and "M" is for "Minus" correlation.  The value 00 means a correlation of 0.0, the value 02 means a correlation of 0.2, ..., the value 10 means a correlation of 1.0. The "A" and "B" suffixes are used to distinguish the pair of values in each set: colnames(Raw) <- paste(rep(c("P", "P", "M", "M"), 3),   sprintf("%2.2d", c(0,0,2,2,4,4,6,6,8,8,10,10)),   rep( c("A", "B"), 6), sep="")

 ```# Look at correlation matrix> round( cor(Raw), 4) P00A P00B M02A M02B P04A P04B M06A M06B P08A P08B M10A M10B P00A 1.0000 0.0322 -0.0054 -0.0258 -0.0247 0.0102 -0.0159 0.0163 -0.0279 -0.0248 0.0150 -0.0150 P00B 0.0322 1.0000 -0.0349 -0.0052 0.0072 0.0231 0.0107 0.0088 -0.0303 -0.0299 0.0397 -0.0397 M02A -0.0054 -0.0349 1.0000 -0.2106 -0.0145 0.0125 -0.0175 -0.0124 0.0135 0.0283 -0.0246 0.0246 M02B -0.0258 -0.0052 -0.2106 1.0000 0.0740 0.0145 0.0479 -0.0402 0.0252 0.0327 0.0292 -0.0292 P04A -0.0247 0.0072 -0.0145 0.0740 1.0000 0.3925 0.0734 -0.0272 -0.0184 -0.0048 -0.0231 0.0231 P04B 0.0102 0.0231 0.0125 0.0145 0.3925 1.0000 0.0340 -0.0367 -0.0452 -0.0329 0.0229 -0.0229 M06A -0.0159 0.0107 -0.0175 0.0479 0.0734 0.0340 1.0000 -0.5693 0.0201 -0.0280 0.0084 -0.0084 M06B 0.0163 0.0088 -0.0124 -0.0402 -0.0272 -0.0367 -0.5693 1.0000 -0.0400 0.0093 -0.0168 0.0168 P08A -0.0279 -0.0303 0.0135 0.0252 -0.0184 -0.0452 0.0201 -0.0400 1.0000 0.7842 0.0278 -0.0278 P08B -0.0248 -0.0299 0.0283 0.0327 -0.0048 -0.0329 -0.0280 0.0093 0.7842 1.0000 0.0475 -0.0475 M10A 0.0150 0.0397 -0.0246 0.0292 -0.0231 0.0229 0.0084 -0.0168 0.0278 0.0475 1.0000 -1.0000 M10B -0.0150 -0.0397 0.0246 -0.0292 0.0231 -0.0229 -0.0084 0.0168 -0.0278 -0.0475 -1.0000 1.0000 ```

 The values in bold above show the assigned correlation values are approximately correct and the rest of the data is not correlated: A "map" of the correlation matrix can be created by assigning a color to each cell of the matrix: corRaw <- cor(Raw) library(spatstat) # "im" function plot(im(corRaw[nrow(corRaw):1,]), main="Correlation Matrix Map")

 Several methods of "dissimilarity" were explored to see if one method is "better" in some way over the other methods: Dissimilarity = 1 - Correlation Dissimilarity = (1 - Correlation)/2 Dissimilarity = 1 - Abs(Correlation) Dissimilarity = Sqrt(1 - Correlation2) The 1st, 3rd and 4th methods above were suggested in a Dec 28 posting to R-Help.  The 2nd method is suggested as part of R's ?dist help, which is likely used to rescale the 1st method to always have values from 0.0 to 1.0 instead of 0.0 to 2.0. Here's the R code that creates a distance matrix from 1-Correlation as the dissimilarity index: dissimilarity <- 1 - cor(Raw) distance <- as.dist(dissimilarity) Note the as.dist function is used here to assign the correlation values to be "distances".  (In some cases, you may want to use the dist function to compute distances using a variety of distance metrics instead.)
 ```>round(distance, 4) P00A P00B M02A M02B P04A P04B M06A M06B P08A P08B M10A P00B 0.9678 M02A 1.0054 1.0349 M02B 1.0258 1.0052 1.2106 P04A 1.0247 0.9928 1.0145 0.9260 P04B 0.9898 0.9769 0.9875 0.9855 0.6075 M06A 1.0159 0.9893 1.0175 0.9521 0.9266 0.9660 M06B 0.9837 0.9912 1.0124 1.0402 1.0272 1.0367 1.5693 P08A 1.0279 1.0303 0.9865 0.9748 1.0184 1.0452 0.9799 1.0400 P08B 1.0248 1.0299 0.9717 0.9673 1.0048 1.0329 1.0280 0.9907 0.2158 M10A 0.9850 0.9603 1.0246 0.9708 1.0231 0.9771 0.9916 1.0168 0.9722 0.9525 M10B 1.0150 1.0397 0.9754 1.0292 0.9769 1.0229 1.0084 0.9832 1.0278 1.0475 2.0000```

 The hierarchical clustering function, hclust, expects a dissimilarity matrix.  The plot function knows how to plot a dendrogram from hclust's result plot(hclust(distance),      main="Dissimilarity = 1 - Correlation", xlab="") The results for this simple dissimilarity index are not good.  In particular, the values with a correlation of -1.0 are grouped incorrectly (M10B with M02A, and M10A with P00B).

 As expected, the second method only changed the scaling and did not affect the clusters:

 The dissimilarity measure 1-Abs(Correlation) worked well to discriminate all correlated pairs. In addition, pairs with "stronger" correlation were ordered correctly from the bottom (abs(correlation) =1.0) to the top (correlation = 0.0), almost linearly with their correlation values.  All correlated values were paired correctly from abs(correction) 0.2 to 1.0.  The values that were not correlated, P00A and P00B were not grouped together, but that is expected.

 The final dissimilarity measure, Sqrt(1-Correlation2), worked well also, but narrowed the vertical spread shown in the diagram.  This method might be better when only a small number of highly correlated clusters are desired.  Many clusters that are resolved a bit better in the method shown above, are not as resolvable in this diagram:

 The R cluster package provides an alternative "agnes" function (agglomerative nesting) for plotting dendrograms: library(cluster) plot(agnes(distance)) "agnes" can work with a dissimilarity matrix, like hclust, or can manipulate raw data directly if the the parameter diss = FALSE. The "banner" plot is explained under ?plot.agnes: The banner displays the hierarchy of clusters, and is equivalent to a tree .... The banner plots distances at which observations and clusters are merged. The  observations are listed in the order found by the 'agnes' algorithm, and the numbers in the 'height' vector are represented as bars between the observations. The leaves of the clustering tree are the original observations. Two branches come together at the distance between the two clusters being merged. The agglomerative coefficient is explained under ?agnes.object: ac: the agglomerative coefficient, measuring the clustering structure of the dataset. For each observation i, denote by m(i) its dissimilarity to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm. The 'ac' is the average of all 1 - m(i). It can also be seen as the average width (or the percentage filled) of the banner plot. Because 'ac' grows with the number of observations, this measure should not be used to compare datasets of very different sizes. The usefulness of the agglomerative coefficient is not clear from this brief exercise.

 The dendrograms from hclust and agnes look similar but are slightly different.  There is no single "correct assignment" of the order of the clustered objects.  While agnes put the two "random" variables together (P00A and P00B), hclust randomly associated them as being "closer" to other clusters.  The "pvclust" package below tries to improve on this by assigning probabilities to the various clusters. In addition to agnes, the R "cluster" package provides an alternative "pam" (partitioning around medoids) function for partitioning the data.  For example: plot( pam(distance, 6))

 See ?pam.object for details. Yet another hierarchical clustering alternative is provided by the Hmisc package: library(Hmisc) plot( varclus(Raw, similarity="spearman") ) plot( varclus(Raw, similarity="pearson") )

 The results of the Hmisc varclus function are rougly similar to the results from hclust with a dissimilarity measure 1-Abs(Correlation). Finally, the dendrogram option from the pvclust package was considered (see paper or poster). The package helps assess the uncertainty in hierarchical cluster analysis by calculating p-values. library(pvclust) cluster.bootstrap <- pvclust(Raw, nboot=1000, method.dist="abscor") plot(cluster.bootstrap) pvrect(cluster.bootstrap)

 The pvclust site gives this information about how to interpret the AU and BP p values above: pvclust provides two types of p-values: AU (Approximately Unbiased) p-value and BP (Bootstrap Probability) value. AU p-value, which is computed by multiscale bootstrap resampling, is a better approximation to unbiased p-value than BP value computed by normal bootstrap resampling. The pvrect function can be used to automatically highlight significant clusters. The dendrogram above was created using the "abscor" method, i.e., the dissimilarity measure 1-Abs(Correlation).  If the simpler "correlation" method is used, only two clusters are found to be significant: # Default "correlation" methods does not work well here cluster.bootstrap <- pvclust(Raw, nboot=1000,                              method.dist="correlation") plot(cluster.bootstrap) pvrect(cluster.bootstrap)

 Conclusions The dissimilarity measure, 1-Abs(Correlation), seems to work best if correlation values can be either positive or negative.  Only in the case of positive-only correlations should a simpler dissimilarity measure, 1-Correlation, be used. Dendrograms for the same data can be presented in a vareity of ways.  The computational-intensive bootstrap procedure is useful to assess the uncertainty in hierarchical clustering. Download:  R code to reproduce the above figues

Updated
29 Dec 2005