efg's Research Notes:  R TechNotes and Graphics Gallery

Mixtures of Gaussians

Earl F. Glynn
Stowers Institute for Medical Research
9 February 2007


Data measurements of many properties are often normally distributed, but with heterogeneous populations, sometimes data measurements reflect a mixture of normal distributions. The following notes are intended to give some insight about such mixtures and how they can be "unmixed" into separate distributions. R code is provided to reproduce the calculations and figures below.

The normal or Gaussian probability density function (pdf)can be written as follows:

 

Below in Fig 1 the n(x,0,1), "Single Gaussian," has μ = 0 (mean) and σ = 1 (standard deviation).

Fig. 1.  Single Gaussian with its 1st and 2nd Derivatives
y = n(x; 0.0, 1.0)

 

According to Abramowitz & Stegun (section 26.2.9), inflections points in the pdf are at μ ± σ.  Above, the inflection points in the pdf are the peaks in the first derivative at x = ± 1.

The peaks in the second derivative occur approximately at x = ± 1.73

To create Fig. 1 see SingleGaussian.R.  Output is in SingleGaussian.pdf.

Mixture of Two Gaussians.

Consider a population that is a mixture of two Gaussian distributions.  Shown in Fig 2 is this mixture:  y = n(x; 3.75, 0.75) + n(x; 6.00, 0.50)

Fig. 2.  Sum of Two Gaussians along with 1st and 2nd Derivatives
y = n(x; 3.75, 0.75) + n(x; 6.00, 0.50)

 

Based on our observations of the Single Gaussian above, we can graphically find the mean and standard deviations of these two distributions.  To create Fig. 2, or see all the R code mentioned below, see TwoGaussians.R with output in TwoGaussians.pdf.

The x distance from the peak of the first derivative to the valley of the first derivative should be 2σ, which can be used to find σ:

> s.approx <- (derivative1$x[valleys.Deriv1] - derivative1$x[peaks.Deriv1])/2
> s.approx
[1] 0.70 0.48

Graphically we find the σ of the first Gaussian is 0.70 (expected 0.75), while the σ of the second Gaussian is 0.48 (expected 0.50). 

With some mixtures, all the peaks of the original distribution may not be obvious (see Heming Lake Pike problem in the next section).  Instead, the valleys in the second derivative may be a better way to find the mean values.

> # Approximate location of mean of normal distribution:
> derivative2$x[valleys.Deriv2]
[1] 3.75 6.01

Graphically we find the μ of the first Gaussian is 3.75 (expected 3.75), while the μ of the second Gaussian is 6.01 (expected 6.00).  Overall, the graphical approach worked well, but a non-linear curve fitting algorithm will be useful in situations where we cannot find the answers by simply studying the derivative plots. The R nls (nonlinear least squares) function can be used to find the best fit.

Oddly, the nls function fails with "toy" problems like this with exact solutions.  One must add a "small" amount of noise to the original function to get around this problem.  (I think this design of nls is flawed and should be fixed, but I've never studied exactly what the problem is.)

# add small random noise
set.seed(17)
y <- y + rnorm(length(y), 1E-5, 1E-4)

fit2 <- nls(y ~ (a/b)*exp(-(x-c)^2/(2*b^2)) +
               
(d/e)*exp(-(x-f)^2/(2*e^2)),
              start=list(a=(1/sqrt(2*pi)) / s.approx[1],
                         b=s.approx[1], 
                         c=mu.approx[1],
                         d=(1/sqrt(2*pi)) / s.approx[2],
                         e=s.approx[2],
                         f=mu.approx[2]),
        control=nls.control(tol=1E-5, minFactor=1/1024),
        trace=TRUE)

fit2
Nonlinear regression model
model: y ~ (a/b) * exp(-(x - c)^2/(2 * b^2)) +
           (d/e) * exp(-(x - f)^2/(2 * e^2))
data: parent.frame()
        a         b         c         d         e         f
0.3989549 0.7500404 3.7499843 0.3989563 0.5000215 6.0000002
residual sum-of-squares: 1.001001e-05

This iterative technique requires "good" initial values, and there's no guarantee of convergence. Above, the μ (mean) values are c=3.74998 and f=6.00000; the σ values are b=0.75004 and e=0.50002.  Overall, these are good results.  The a and d values in this case should be 1/sqrt(2*pi) = 0.39894.

What if our mixtures are weighted, e.g.,

y = A*n(x; 3.75, 0.75) + B*n(x; 6.00, 0.50)

The "Sum of Two Gaussians" section of MixingGaussians.doc shows several examples of this.

The Heming Lake Pike Mixture.

As a final example, let's look at the Heming Lake Pike mixture. The figures below were created from summary μ, σ statistics from data in this file. These statistics were used to create an "exact" curve for analysis purposes. 

> weight <- c(55, 243, 156, 47, 22) # Number of fish
> weight <- weight / sum(weight)> weight
[1] 0.10516252 0.46462715 0.29827916 0.08986616 0.04206501

> mean <- c(23.32727, 33.09053, 41.27244, 51.24468, 61.31818)
> sigma <- c(2.436638, 2.997594, 4.274585, 5.077521, 7.070303)

These weights, means, and standard deviations are represented in this expression:

y = w1*n(x; μ1, σ1) + w2*n(x; μ2, σ2) + ... + w5*n(x; μ5, σ5)

The top panel in Fig 3 below shows a graph of this function in bold black, with each of the component distributions in various colors.

Fig. 3.  Heming Lake Pike Mixture

The number of Gaussians in this mixture is not clear from the original plot.  There are two obvious peaks in the original data, but the "bump" on the right side of the major peak hints there is a third Gaussian.  The 1st derivative shows there are at least four peak/valley pairs, so there must be at least four Gaussians in the mixture. The peaks in the 2nd derivative show 4 clear peaks, with possibly 5 or 6 peaks.  Here the 6th peak is erroneous.  The five Gaussian distributions an be estimated from only graphical analysis.

The graphical analyis gives these results for the first four distributions:

> mu.approx
[1] 23.2 33.0 42.6 54.4
> s.approx
[1] 2.1 2.8 2.2 1.2

These graphical results compare well with the actual values (see below):

> mean <- c(23.32727, 33.09053, 41.27244, 51.24468, 61.31818)
> sigma <- c(2.436638, 2.997594, 4.274585, 5.077521, 7.070303)

However the graphical results do not give the weights of the origial probability density functions.

set.seed(17)
y <- y + rnorm(length(y), 1E-5, 1E-4)  # add some random noise

fit.pike <- nls(y ~
    (a/b)*exp(-(x-c)^2/(2*b^2)) +
    (d/e)*exp(-(x-f)^2/(2*e^2)) +
    (g/h)*exp(-(x-i)^2/(2*h^2)) +
    (j/k)*exp(-(x-l)^2/(2*k^2)),
  start=list(a=(1/sqrt(2*pi)) / s.approx[1], b=s.approx[1],
             c=mu.approx[1],
             d=(1/sqrt(2*pi)) / s.approx[2], e=s.approx[2],
             f=mu.approx[2],
             g=(1/sqrt(2*pi)) / s.approx[3], h=s.approx[3],
             i=mu.approx[3],
             j=(1/sqrt(2*pi)) / s.approx[4], k=s.approx[4],
             l=mu.approx[4] ),
  control=nls.control(tol=1E-5, minFactor=1/1024), trace=FALSE)

summary(fit.pike)

Formula: y ~ (a/b) * exp(-(x - c)^2/(2 * b^2)) + (d/e) * exp(-(x - f)^2/(2 * e^2)) + (g/h) * exp(-(x - i)^2/(2 * h^2)) + (j/k) * exp(-(x - l)^2/(2 * k^2))

Parameters:
   Estimate Std. Error t value Pr(>|t|)
a 3.788e-02  3.696e-04  102.50   <2e-16 ***
b 2.338e+00  1.211e-02  193.07   <2e-16 ***
c 2.326e+01  7.939e-03 2930.49   <2e-16 ***
d 1.727e-01  7.988e-04  216.18   <2e-16 ***
e 2.940e+00  5.599e-03  525.02   <2e-16 ***
f 3.312e+01  6.988e-03 4738.80   <2e-16 ***
g 8.028e-02  1.168e-03   68.76   <2e-16 ***
h 3.992e+00  2.698e-02  147.96   <2e-16 ***
i 4.114e+01  2.634e-02 1561.80   <2e-16 ***
j 1.080e-01  2.013e-03   53.62   <2e-16 ***
k 1.089e+01  8.879e-02  122.65   <2e-16 ***
l 4.587e+01  2.108e-01  217.63   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0001335 on 389 degrees of freedom

Remember, this fit included only four of the five terms used to compute the original function.

Means ( μ) :

> coef(fit.pike)[3*1:4]
       c        f        i        l
23.26463 33.11712 41.14406 45.87142

Standard Deviations (σ):

> coef(fit.pike)[3*1:4-1]
       b        e        h         k
2.338391 2.939640 3.992482 10.890327

Expected values:

> mean <- c(23.32727, 33.09053, 41.27244, 51.24468, 61.31818)
> sigma <- c(2.436638, 2.997594, 4.274585, 5.077521, 7.070303)

Weights:  the computed values multiplied by a constant, sqrt(2*pi), allows comparison to the original values:

> coef(fit.pike)[3*1:4-2] * sqrt(2*pi)
         a          d          g          j
0.09495491 0.43287679 0.20124450 0.27060049

> weight[1:4]
[1] 0.10516252 0.46462715 0.29827916 0.08986616

The weights show general agreement but with less accuracy than the means or standard deviations.  In general, the coefficients of each term in a mixture of distributions need to be comparable.  Some "tweaking" of the convergence criterial in the nonlinear least squares call sometimes can improve results.

Files: 
Input:  HemingLakePike.dat  Processing: HemingLakePike.R  Output: HemingLakePike.pdf

 


Updated
9 Feb 2007