Computing Generalized-Average Based Distances



Many popular distances used for comparing binary vectors belong to the generalized-average (GA) family of distances (for details, see Glazko, Gordon and Mushegian, "The choice of optimal distance measure in genome-wide data sets", Bioinformatics, 2005). GADIST.r computes the GA distance for any given value of the exponent lambda, the key parameter of the family, and computes the first four moments of the resulting distance distributions. These moments are useful when choosing an appropriate distance measure for genome analysis (see the paper for examples).
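For a pair of binary vectors X_m and X_n, with X_{mn} = X_m X_n and |X| the number of 1s in X, the GA distance is computed as

$$ D_\lambda(X_m, X_n) = 1 - \frac{|X_{mn}|}{M_\lambda(|X_m|, |X_n|)} $$

where

$$ M_\lambda(|X_m|, |X_n|) = \left( \frac{|X_m|^\lambda + |X_n|^\lambda}{2} \right)^{1/\lambda} $$

is the generalized average cardinality of two sets, of exponent lambda.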
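
As an illustration, the GA distance for one pair of binary vectors can be sketched in a few lines of R. This is a minimal sketch, not the code of GADIST.r: it assumes the distance is 1 - |Xmn| / M, where M is the power mean ((a^lambda + b^lambda) / 2)^(1/lambda) of the two cardinalities a and b, and that both vectors contain at least one 1. The function names ga_mean and ga_dist are hypothetical. Rescaling by the larger (or smaller) cardinality is a design choice made here so that exponents such as +100 or -100 do not overflow or underflow.

```r
# Generalized (power) average of two positive cardinalities a and b.
# Assumption: lambda = 0 is taken as the limiting case, the geometric average.
ga_mean <- function(a, b, lambda) {
  if (lambda == 0) return(sqrt(a * b))
  # Rescale so that every power lies in [0, 1]: use max for lambda > 0, min for lambda < 0.
  s <- if (lambda > 0) max(a, b) else min(a, b)
  s * (((a / s)^lambda + (b / s)^lambda) / 2)^(1 / lambda)
}

# GA distance between two binary (0/1) vectors of equal length.
ga_dist <- function(x, y, lambda) {
  shared <- sum(x * y)                      # |Xmn|: positions where both vectors are 1
  1 - shared / ga_mean(sum(x), sum(y), lambda)
}

x <- c(1, 1, 1, 1, 0, 0)
y <- c(1, 0, 0, 0, 0, 0)
ga_dist(x, y, 0)     # geometric average of 4 and 1 is 2, so the distance is 0.5
ga_dist(x, y, 1)     # arithmetic average of 4 and 1 is 2.5, so the distance is 0.6
ga_dist(x, y, -100)  # average is close to min(4, 1), so the distance is close to 0
```

With lambda = -100 the average is within about 1% of the smaller cardinality, which is why a large negative exponent can stand in for minus infinity.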



USAGE:
>R
>source('GADIST.r')

INPUT: You will be prompted to input

1. The desired exponent lambda for the GA distance.

lambda = 0, complement to geometric average;
lambda = -100 (-infinity), complement to Simpson index;
lambda = +100 (+infinity);
lambda = n, any number in the range [-inf,+inf]

2. Tab-delimited matrix of data (input file name, file1). For example:
name  Eco   EcZ   Ecs   Ype   Sty   Buc   Vch   Pae   Hin   Pmu   Xfa   Atu 
COG0001     1     1     1     1     1     0     1     1     0     1     1     1
COG0002     1     1     1     1     1     1     1     1     0     1     1     1
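
For readers who want to inspect such a file before running the script, the following sketch shows one way to load it in R. It is an assumption about how the file can be read, not a description of what GADIST.r does internally. The two example rows above are first written to a temporary file so that the snippet runs on its own.

```r
# Write the two example rows shown above to a temporary tab-delimited file.
genomes <- c("Eco", "EcZ", "Ecs", "Ype", "Sty", "Buc",
             "Vch", "Pae", "Hin", "Pmu", "Xfa", "Atu")
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  paste(c("name", genomes), collapse = "\t"),
  paste(c("COG0001", 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1), collapse = "\t"),
  paste(c("COG0002", 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1), collapse = "\t")
), tmp)

# Read the table: the first line holds column names, the first column holds row names.
X <- as.matrix(read.table(tmp, header = TRUE, sep = "\t", row.names = 1))

# Sanity check: the data should be binary.
stopifnot(all(X %in% c(0, 1)))
dim(X)  # 2 rows (COGs) by 12 columns (genomes)
```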

GADIST.r will compute
1. the distance for the given lambda;
2. the distances for every integer exponent in the range from -lambda to +lambda;
3. the correlation-based distance and the distances for lambda = -100 and lambda = 100 (see above).

Note: the distances are computed between column vectors.
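
The column-wise computation can be sketched for a whole matrix at once. This is a hedged illustration rather than the actual GADIST.r code: it assumes the GA distance is 1 - |Xmn| / M, with M the power mean of the two column cardinalities, and that no column is all zeros. The function name ga_dist_matrix and the toy matrix are made up for the example.

```r
# Pairwise GA distances between the columns of a binary matrix X.
ga_dist_matrix <- function(X, lambda) {
  X <- as.matrix(X)
  n <- ncol(X)
  inter <- crossprod(X)            # inter[m, n] = |Xmn|, number of rows where both columns are 1
  card <- colSums(X)               # card[m] = |Xm|, number of 1s in column m
  a <- matrix(card, n, n)          # a[i, j] = card[i]
  b <- t(a)                        # b[i, j] = card[j]
  if (lambda == 0) {
    avg <- sqrt(a * b)             # limiting case: geometric average
  } else {
    # Rescale by the larger (lambda > 0) or smaller (lambda < 0) cardinality
    # so that large exponents such as +100 or -100 stay finite.
    s <- if (lambda > 0) pmax(a, b) else pmin(a, b)
    avg <- s * (((a / s)^lambda + (b / s)^lambda) / 2)^(1 / lambda)
  }
  1 - inter / avg
}

# Toy data (hypothetical): 5 rows, 3 columns A, B, C.
X <- cbind(A = c(1, 1, 1, 1, 0),
           B = c(1, 1, 0, 0, 0),
           C = c(0, 1, 1, 0, 1))
D <- ga_dist_matrix(X, lambda = 1)
round(D, 3)
```

Because crossprod(X) counts shared 1s for every pair of columns, rows play the role of features (for example COGs) and columns the role of the objects being compared (for example genomes). To compare rows instead, pass t(X).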

Example of distance matrix output:
Eco   EcZ   Ecs   Ype   Sty   Buc   Vch   Pae   Hin   Pmu   Xfa   Atu   Sme
Eco   0.000 0.079 0.086 0.263 0.155 0.603 0.368 0.466 0.435 0.394 0.556 0.606
EcZ   0.079 0.000 0.017 0.244 0.145 0.614 0.379 0.463 0.444 0.406 0.562 0.612
Ecs   0.086 0.017 0.000 0.246 0.158 0.615 0.385 0.475 0.427 0.411 0.566 0.621
Ype   0.263 0.244 0.246 0.000 0.266 0.580 0.347 0.453 0.412 0.368 0.549 0.601
Sty   0.155 0.145 0.158 0.266 0.000 0.609 0.363 0.448 0.443 0.401 0.550 0.609


OUTPUT:
1. Table with the first four moments of the distance distributions, in file file1.stat.
2. Distances with the maximum and minimum absolute values of skewness and kurtosis, in files
1) file1.maxSk; 2) file1.maxKu; 3) file1.minSk; 4) file1.minKu.
3. Distance matrix for the given lambda, in file file1.lamda.
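
To make the reported statistics concrete, here is a small sketch of how the first four moments of a set of distances could be computed in base R. The exact estimators used by GADIST.r are not stated here, so this sketch assumes the plain moment-based definitions: population variance, skewness m3 / m2^1.5, and kurtosis m4 / m2^2 without subtracting 3. The function name dist_moments is hypothetical. The input values are the pairwise distances among Eco, EcZ, Ecs, Ype and Sty from the example matrix above.

```r
# Mean, variance, skewness and kurtosis of a numeric vector of distances.
# Assumption: moment-based (population) estimators, kurtosis not centered at 3.
dist_moments <- function(d) {
  m  <- mean(d)
  m2 <- mean((d - m)^2)   # second central moment
  m3 <- mean((d - m)^3)   # third central moment
  m4 <- mean((d - m)^4)   # fourth central moment
  c(mean = m, variance = m2, skewness = m3 / m2^1.5, kurtosis = m4 / m2^2)
}

# Off-diagonal distances among the first five genomes of the example output.
d <- c(0.079, 0.086, 0.263, 0.155, 0.017,
       0.244, 0.145, 0.246, 0.158, 0.266)
round(dist_moments(d), 3)
```

Applying such a function to the distances obtained for each exponent, and then comparing the absolute skewness and kurtosis across exponents, is one way to reproduce the kind of selection described in item 2.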