Many popular distances used for comparing binary vectors
belong to the generalized-average (GA) family of distances (for more detail, see
Glazko, Gordon and Mushegian. The choice of optimal distance measure in
genome-wide data sets. Bioinformatics, 2005). GADIST.r produces
distances for any given value of the crucial lambda parameter and computes
the first four moments of their distributions. This is useful when choosing the
appropriate distance measure for genome analysis (see the paper for examples).
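For a pair of binary vectors X_m and X_n, with X_mn = X_m X_n their
element-wise product and |X| the number of ones in X, the GA distance is computed as
$$D_\lambda(X_m, X_n) = 1 - \frac{|X_{mn}|}{M_\lambda(|X_m|, |X_n|)}$$
where
$$M_\lambda(|X_m|, |X_n|) = \left(\frac{|X_m|^\lambda + |X_n|^\lambda}{2}\right)^{1/\lambda}$$
is the generalized average cardinality of two sets, of exponent lambda.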
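As an illustration of the definition, here is a minimal R sketch of the GA distance for a single pair of binary vectors. It assumes the distance is 1 minus the shared count divided by the generalized average of the two cardinalities, with the usual limits of the generalized average (geometric mean at lambda = 0, minimum at -Inf, maximum at +Inf). The function names gen_mean and ga_distance are hypothetical and are not taken from GADIST.r, and vectors with no ones are not handled.

```r
# Generalized average (power mean) of two positive numbers a and b.
# Assumed limits: lambda = 0 -> geometric mean, -Inf -> min, +Inf -> max.
gen_mean <- function(a, b, lambda) {
  if (lambda == 0) return(sqrt(a * b))
  if (lambda == -Inf) return(min(a, b))
  if (lambda == Inf) return(max(a, b))
  ((a^lambda + b^lambda) / 2)^(1 / lambda)
}

# GA distance between two binary (0/1) vectors of equal length
# (hypothetical helper, not the GADIST.r implementation).
ga_distance <- function(xm, xn, lambda) {
  shared <- sum(xm * xn)   # |X_mn|: positions where both vectors are 1
  1 - shared / gen_mean(sum(xm), sum(xn), lambda)
}

xm <- c(1, 1, 0, 1, 0, 1)   # four ones
xn <- c(1, 0, 0, 1, 0, 0)   # two ones, both shared with xm

ga_distance(xm, xn, 1)      # arithmetic mean in the denominator
ga_distance(xm, xn, 0)      # geometric mean in the denominator
ga_distance(xm, xn, -Inf)   # minimum in the denominator
ga_distance(xm, xn, Inf)    # maximum in the denominator
```

With these vectors the shared count is 2 and the cardinalities are 4 and 2, so the four calls differ only in which average ends up in the denominator.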
USAGE:
>R
>source('GADIST.r')
INPUT: You will be prompted to input
1. The desired exponent lambda for GA distance.
lambda = 0, complement to geometric average;
lambda = -100 (-infinity), complement to Simpson index;
lambda = +100 (+infinity);
lambda = n, any number in the range [-inf,+inf]
2. Tab-delimited matrix of data (input file name, file1).
For example:
name    Eco EcZ Ecs Ype Sty Buc Vch Pae Hin Pmu Xfa Atu
COG0001 1   1   1   1   1   0   1   1   0   1   1   1
COG0002 1   1   1   1   1   1   1   1   0   1   1   1
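For readers who want to check an input file before running the script, the following R sketch writes the two example rows to a temporary file and reads them back as a 0/1 matrix. How GADIST.r itself parses the file is not documented here, so the read.table call and the variable names (path, x) are assumptions made for illustration only.

```r
# Build a two-row example file; the values are copied from the example above.
genomes <- c("Eco", "EcZ", "Ecs", "Ype", "Sty", "Buc",
             "Vch", "Pae", "Hin", "Pmu", "Xfa", "Atu")
rows <- list(COG0001 = c(1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1),
             COG0002 = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1))

path <- tempfile(fileext = ".txt")   # stands in for file1
lines <- c(paste(c("name", genomes), collapse = "\t"),
           vapply(names(rows),
                  function(id) paste(c(id, rows[[id]]), collapse = "\t"),
                  character(1)))
writeLines(lines, path)

# Read it back: the first column becomes row names, the first line column names.
x <- as.matrix(read.table(path, header = TRUE, sep = "\t", row.names = 1))

# Basic sanity check: the matrix should contain only 0 and 1.
stopifnot(all(x %in% c(0, 1)))
dim(x)   # rows are COGs, columns are genomes
```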
GADIST.r will compute
1. the distance for the given lambda;
2. the range of distances for integer exponents from -lambda to +lambda;
3. the correlation-based distance and the distances for lambda = -100 and
lambda = 100 (see above).
Do not forget: the distances are computed between column vectors.
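To make the column-wise convention concrete, here is a small R sketch that fills a distance matrix by comparing every pair of columns. The data matrix x, its column names G1 to G3, and the helper ga_distance are hypothetical; the helper assumes the GA distance is 1 minus the shared count divided by the generalized average of the two column sums, and it only covers finite lambda.

```r
# Hypothetical helper: GA distance for finite lambda (geometric mean at 0).
ga_distance <- function(xm, xn, lambda) {
  a <- sum(xm)
  b <- sum(xn)
  m <- if (lambda == 0) sqrt(a * b) else ((a^lambda + b^lambda) / 2)^(1 / lambda)
  1 - sum(xm * xn) / m
}

# Toy data: rows are features (e.g. COGs), columns are the vectors compared.
x <- cbind(G1 = c(1, 1, 1, 0),
           G2 = c(1, 1, 0, 0),
           G3 = c(0, 1, 1, 1))

# Fill a square matrix with the distance between every pair of columns.
n <- ncol(x)
d <- matrix(0, n, n, dimnames = list(colnames(x), colnames(x)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- ga_distance(x[, i], x[, j], lambda = 1)
  }
}
round(d, 3)
```

The result has one row and one column per column of x, so an input with genomes in columns yields a genome-by-genome matrix like the example output.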
Example of distance matrix output:
    Eco   EcZ   Ecs   Ype   Sty   Buc   Vch   Pae   Hin   Pmu   Xfa   Atu   Sme
Eco 0.000 0.079 0.086 0.263 0.155 0.603 0.368 0.466 0.435 0.394 0.556 0.606
EcZ 0.079 0.000 0.017 0.244 0.145 0.614 0.379 0.463 0.444 0.406 0.562 0.612
Ecs 0.086 0.017 0.000 0.246 0.158 0.615 0.385 0.475 0.427 0.411 0.566 0.621
Ype 0.263 0.244 0.246 0.000 0.266 0.580 0.347 0.453 0.412 0.368 0.549 0.601
Sty 0.155 0.145 0.158 0.266 0.000 0.609 0.363 0.448 0.443 0.401 0.550 0.609
…
OUTPUT:
1. Table with the first four moments of the distance distributions, in file
file1.stat.
2. Distances with the maximum and minimum absolute values of skewness and kurtosis,
in files
1) file1.maxSk; 2) file1.maxKu; 3) file1.minSk; 4) file1.minKu.
3. Distance matrix for the given lambda: file1.lamda.
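As a rough illustration of the statistics reported in file1.stat, the R sketch below computes the mean, variance, skewness and kurtosis of the pairwise distances in a symmetric distance matrix. The exact conventions used by GADIST.r are not stated here, so the choices in this sketch are assumptions: each pair is counted once, the diagonal is excluded, moments divide by n, and kurtosis is not the excess kurtosis. The function name dist_moments is hypothetical.

```r
# First four moments of the pairwise distances in a symmetric matrix d
# (hypothetical helper; conventions are assumptions, see the text above).
dist_moments <- function(d) {
  v <- d[lower.tri(d)]        # each pair once, diagonal excluded
  m <- mean(v)
  s2 <- mean((v - m)^2)       # population variance (divides by n)
  c(mean = m,
    variance = s2,
    skewness = mean((v - m)^3) / s2^1.5,
    kurtosis = mean((v - m)^4) / s2^2)   # not excess kurtosis
}

# The first five rows and columns of the example distance matrix above.
ids <- c("Eco", "EcZ", "Ecs", "Ype", "Sty")
d <- matrix(c(0.000, 0.079, 0.086, 0.263, 0.155,
              0.079, 0.000, 0.017, 0.244, 0.145,
              0.086, 0.017, 0.000, 0.246, 0.158,
              0.263, 0.244, 0.246, 0.000, 0.266,
              0.155, 0.145, 0.158, 0.266, 0.000),
            nrow = 5, byrow = TRUE, dimnames = list(ids, ids))

dist_moments(d)
```

Running the same function on the matrices obtained for several values of lambda would give one row of moments per lambda, which is the kind of table the skewness and kurtosis comparison relies on.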