Detecting periodic patterns
in unevenly spaced gene expression time series
using Lomb-Scargle periodograms

Home    Paper    Supplementary Information    R Code    Feedback

A.  Numerical experiments
B.  Exploratory analysis
C.  Results
Note:  Section 8 was added on 17 Aug 2007.

1.  Single Dominant Frequency.

1aIdeal Gene.  A cosine curve, which can represent an "ideal gene," was used as the starting point.  The first 10 pages of  SinglePeriod.pdf show Lomb-Scargle periodograms for cosine curves using even time spacing with  various periods:

  • 48-hour period as the baseline (for studying Bozdech's data)
  • periods longer (shorter frequency) than baseline:  60 and 72 hours
  • periods shorter (higher frequency) than baseline:  36, 24, 16, 12, 8, 4 hours

With a 48-hour time window, periods longer than 48 hours were not detected as accurately (e.g., 53.3 instead of 60 hours, and 64 instead of 72 hours) as periods 48 hours or shorter.  A frequency range from 0.00 to 0.20/hour is sufficient to detect periods of 8 hours or longer. 

Expected Period [hrs] Observed Period [hrs] p-value Comments
48
48.0
3.3E-9
"Ideal" Plasmoidum Gene
60
53.3
5.8E-9
longer than 48 hours
72
64.0
6.2E-9
36
36.4
6.3E-9
shorter than 48 hours
24
24.0
3.3E-9
16
16.0
3.3E-9
12
12.0
3.3E-9
8
8.0
3.3E-9
4
4.0
3.3E-9

Note there are two pages for the 4 hour period.  The second page shows the Nyquist "problem" described more in section 2 below.

1b.  Uneven Time Spacing, Cosine Signal + Noise.  An "ideal gene" doesn't exist in biology, so a cosine signal with various amounts of Gaussian noise was used to simulate a non-ideal gene. The last eight pages of SinglePeriod.pdf show Lomb-Scargle periodograms for uneven time spacing for a gene with 48-hour period approximated by a cosine signal and various amounts of  noise, with the results summarized here:

Noise* Observed Period [hr] p value Comments
0.00
43.6
3.9E-9 p value quite significant
0.10
53.3
6.1E-9
0.25
48.0
4.1E-8
0.50
43.6
1.3E-5
0.75
40.0
6.2E-4
1.00
48.0
0.02 Figure 1 in paper
1.50
43.6
0.20 p value not significant
2.00
10.9
0.14 p value not significant

*Noise is Normal(mean=0.0, stdev=value),
which are the parameters used in R.

As noise increases the p-values are reduced, compared to the ideal gene with no noise. If noise is too great, the p-value will no longer indicate significance of any detected periodicity.

See file SinglePeriod.R for the parameters used to create these figures.

2 "Problems" in Lomb-Scargle periodogram above Nyquist limit

The Lomb-Scargle periodogram "behaves" normally up to the Nyquist limit.   Spurious spikes can be seen beyond the Nyquist frequency limit in the periodogram.  If the frequency search space is too coarse, peaks might be missed entirely.  With the Nyquist limit of 0.50/hr in this case (since time points were hourly), the frequency search space from 0.00 to 0.20/hour will stay away from any Nyquist problems and allow detection of periods of at least 8 hours up to 48 hours or more. 

Reproduce figures in R using file Nyquist.R.

Two peaks in a periodogram could indicate two periodicities, or a single period with an asymmetrical "duty cycle."

3.1  Two Periodicities  What if a gene has two simultaneous periodicities present because it responds to two periodic, exogenous signals? 

These graphics show ideal "cosine" genes with period of 24 and 48 hours, but with different amplitude weights, A1 (for 24-hr period) and A2 (for 48-hr period).  The baseline is 1:1, but other ratios are considered, including 1:0.5, 1:2, 1:4, and 1:8. With a ratio of about 2:1 or 4:1, one dominant frequency is statistically significant, and the less frequency can be seen in the periodogram, but with questionable statistical significance.  Above 4:1 or so, only the dominant frequency could be observed.

Addition of 24-hour (A1) and 48-hour (A2) Cosine Signals + Noise

A1 A2 Time spacing Noise* Dominant
Period [hr]
p-value Comments about
Periodogram
1 1 Even
0.00
21.4
7.1E-6 both periods approximately same magnitude in periodogram, and roughly the same significance based on the p values
Uneven
0.00
21.1
4.3E-6
0.10
21.4
1.6E-6
0.25
21.1
1.9E-4
0.50
22.2
1.5E-5
0.75
21.8
0.0043
1.00
21.4
0.041
1.50
21.4
0.014 can only see peak for 24 hour period
1.0 0.5 Even
0.0
22.2
4.7E-8 24-hour peak dominant
1.0 2.0
52.2
2.1E-7 48-hour peak dominant
1.0 4.0
50.0
1.1E-8 48-hour peak dominant
1.0 8.0
50.0
4.5E-9 48-hour peak dominant;
24-hour peak quite weak
0.5 1.0
52.2
2.1E-7 48-hour peak dominant
2.0 1.0
22.2
4.7E-8 24-hour peak dominant
4.0 1.0
23.1
6.7E-9 24-hour peak dominant;
48-hour peak not visible
8.0 1.0
23.5
3.9E-9 24-hour peak dominant;
48-hour peak not visible

*Noise is Normal(mean=0.0, stdev=value)

3.2  Asymmetrical "Duty Cycle".  If periodic gene expression is symmetrical the Lomb-Scargle periodogram has one dominant peak.  If the gene expression "duty cycle" is not symmetrical, additional peaks can be present in the periodogram.  However, these additional peaks may not show statistical significance. 

24 Period Cycle

Duty Cycle Observed
Period [hr]
p-value Comments about
Periodogram
1/2 on, 1/2 off
24
2.5E-7 single peak
2/3 on, 1/3 off
24
5.1E-6 one dominant peak; weak second peak
5/6 on, 1/6 off
24
0.0095 Two strong peaks;
weak third peak
11/12 on, 1/12 off
24
0.51 p-value is not significant; four weak peaks

See file TwoPeriods.R for the parameters used to create these figures.  (Also needs DutyCycle.R).

4. Three Periodicities. What if a gene expression profile is a complicated response to multiple periodic, exogenous signals?  These charts show the sum of ideal "cosine" genes with periods of 8, 24, and 48 hours, but with different weights (A1 for the 8-hour period, A2 for the 24-hour period, and A3 for the 48-hour period)..  The baseline is 1:1:1, but other ratios are considered, including 1:1:2, 1:1:3, 1:1:4.

When all cosines have the same weight, the periodogram can resolve all three peaks.  However, with a ratio of 1:2:1, only the dominant periodicity can be resolved with statistical significance.

Addition of 8-hour (A1), 24-hour (A2),
and 48-hour (A3) Cosine Signals + Noise

A1
A2
A3
Time spacing Noise* Dominant
Period [hr]
p-value Comments about
Periodogram
1
1
1
Even
0.00
21.8
0.0025 All three peaks present and roughly equal
Uneven
0.00
7.7
0.0014 24-hour peak smaller than 8- or 48-hour peaks
0.10
53.3
3.5E-4 6 peaks present, quite noisy;
peak near 48-hours strongest
0.25
20.9
1.5E-4 5 peaks present; quite noisy;
peak near 24-hours strongest
0.50
22.9
9.1E-4 4 peaks present, noisy;
peak near 24-hours strongest
0.75
53.3
0.0015 3 strong peaks, corresponding to known periodicities
1.00
20.9
0.069 2 strong peaks, one weak, corresponding to known periodicities.  Figure 3 in paper.
1.50
53.3
0.02 3 good peaks, corresponding to known periodicities, but p-values are a bit weak.
1
2
1
Even
0.0
22.9
2.5E-6 Dominant 24-hour peak with weak peaks for 8- and 48-hours.
1
3
1
22.9
1.1E-7 24-hour peak, with other two expected peaks quite weak.
1
4
1
22.9
2.7E-8 Only single 24-hour peak
1
8
1
24.0
6.72E-9 Only single 24-hour peak
2
1
1
7.9
4.8E-6 8-hour peak, with other two expected peaks quite weak.
3
1
1
7.9
1.9E-7 Only single 8-hour peak
4
1
1
7.9
4.4E-8 Only single 8-hour peak
8
1
1
8.0
6.7E-9 Only single 8-hour peak
1
1
2
53.3
4.9E-6 Strong peaks at 48- and 24-hours.  Weak peak at 8-hours.
1
1
3
53.3
1.8E-7 Dominant 4 8-hour peak ; weak 24-hour peak. 8-hour peak missing.
1
1
4
53.3
4.4E-8 Dominant 4 8-hour peak ; weak 24-hour peak. 8-hour peak missing.
1
1
8
48.0
6.7E-9 Dominant 4 8-hour peak ; quite weak 24-hour peak. 8-hour peak missing.

*Noise is Normal(mean=0.0, stdev=value)

Multiple periodicities may be difficult to recover in general. For further study:  A Procedure of Multiple Period Searching in Unequally Spaced Time-Series with the Lomb-Scargle Method, H. P. A. Van Dongen, et al, Biological Rhythm Research, 1999, 30(2), pp. 149-177.

Reproduce figures in R using file ThreePeriods.R.

5Sample Size.  How does sample size, N, of a cosine expression series affect the peak significance, p, of the Lomb-Scargle periodogram?  This experiment suggests a simple approximate regression line: 

  log10(p) = 1 – 0.2N. 

This means if we have N=30, we can expect a single perfect gene (cosine profile) to have approximately p = 1E-5, and for N=40, we can expect approx. p = 1E-7.  Reproduce figures in R using file Ndetermination.R.

After a large number of profiles are obtained each with a separate "p" value, a given false discovery rate level should be used with a multiple testing correction to identify profiles that are significantly periodic.

6.  "Noise" Experiments.  The purpose of this experiment was to understand how p-values change with increasing noise and fewer data points. A study of the distribution of results (periods and p-values) for various amounts of noise (including, no signal and all noise) was the basis of the "Mixture" experiments described in the section below. Only evenly-spaced time series were considered here, which would be the approximate best case behavior for unevenly-spaced series.

This summary of "Noise" Experiments shows results for 48-hour period cosine "signal" + various amounts of Gaussian noise for N = 12, 24, and 48, and a separate case of no signal and all Gaussian noise.  Noise levels were specified as Normal(mean=0, standard deviation), with the s.d. values of 0.0, 0.1, 0.2, 0.5, and 1.0.  For each noise level, 5000 gene expression profiles were simulated and analyzed.  In all there were 3 "N" values * 6 noise levels * 5000 = 90,000 simulated genes.

The "best" and "worst" case statistics in this table were formed by looking for the min and max peak spectral power density in the Lomb-Scargle periodogram, along with the corresponding p-value statistics. Obviously, for no noise, i.e., s.d.=0.0, the min and max values were the same since there was no variation in the results.  For small amounts of noise, s.d. = 0.10 to 0.20, the min and max values varied little, and Lomb-Scargle method showed little sensitivity to noise.  Only for larger amounts of noise, s.d.=0.50 or 1.0, did the best and worst cases diverge.

The results show excellent detection of the cosine signal, especially with N=48, even in the presence of considerable noise. Reasonable results were observed for N=24 and even N=12 when noise up to Normal(mean=0, s.d.=0.20) was added to the original cosine signal.

Cosine Signal + Noise Histograms.  Histograms were created showing the distribution of the periods and log10(p-values) for the cosine + noise experiments.  Here is a summary of those charts:

Histograms of log10(p-values) for Cosine + Noise Experiment

Noise Level

Comments for N=48 Comments for N=24
0.0
single value at -9 single value at -4
0.1
narrow spike distribution at
~-8.5
narrow spike distribution at
~-3.5
0.2
narrow distribution ranging from about -8.2 to -7.2, with mode near -7.8 narrow distribution ranging from about -3.8 to -2.9, with mode near -3.5
0.5
peaked distribution from -7 to -3, with mode approx. -5 peaked distribution from -3.2 to -0.2, with mode approx. -2.5
1.0
peaked distribution from -6 to -0, with mode approx. -2.  Right-side "tail" is truncated. peaked distribution from -2.7 to -0, with mode approx. -1.  Right-side "tail" is truncated, almost from peak.

The log10(p-values) distribution broadens for increased noise and shifts to the right for increased noise.  Decreasing N also results in broadening and shifting to the right.

Histograms of Periods for Cosine + Noise Experiment

Noise Level

Comments for N=48 Comments for N=24
0.0
single value at 48 hours single value at 48 hours
0.1
single value at 48 hours single value at 48 hours
0.2
single value at 48 hours single value at 48 hours
0.5
Bimodal distribution:  ~80% of values at 48 hours with ~20% of values at 51 hours Bimodal distribution:  ~80% of values at 48 hours with ~20% of values at 37 hours
1.0
~80% of values are approx. 51; ~ 15% of values at about 46 or 48 ; ~ 5% of values at 5 points ranging from 32 to 46. ~55% of values are approx 17; ~20 of values in peak at 48, with small adjoining peaks at 40 and 45; ~25% of values at peak at 14, or nearby.

The period histograms show the correct period is found most of the time, even with Normal(mean=0, s.d.=1.0) Gaussian noise for N=48, which was not the case with N=24. Oddly, with this maximum noise contribution, the period results for N=12 were better than for N=24.

Noise Histograms.  Histograms were created showing the distribution of the periods and log10(p-values) for the noise only experiments

Note the log(p) histograms peak t the right at log(p) = 0 (i.e., p = 1) and fall exponentially with a tail to the left.  The tail shows some simulated noise "genes" with log(p) as low as -4 to about -2.  This means this area can represent false hits if the genes with periodic "signal" do not have a log(p) lower than about -4. 

Period values from the pure-noise experiments varied widely, and seemed to prefer shorter values, which would be consistent with higher-frequency noise. Few periods were observed near 48 hours with this data.

As shown above, a histogram of a cosine "signal" with N(mean=0, s.d.=0.50) noise peaks at about -5, and at -7.8 with noise N(mean=0, s.d.=0.20).  This demonstrates strong signals should be easy to discriminate from pure noise, which is described next in the "Mixture" experiment.

Reproduce simulation data in R using file Noise-1.R and analyze with Noise-2.R and Noise-3.R.

7Mixture Experiments of "Periodic" and "Aperiodic" Genes .  The simulated genes above were either "signal" (periodic signal with noise), or pure "noise" (not periodic).  In a genome we wish to discriminate the periodic genes from the non-periodic genes.  The "Noise" experiments above suggest the distribution of log10(p-values) may be effective in this discrimination. 

For this experiment the noise was fixed at Normal(mean=0, s.d.=0.2) and the fraction of periodic genes was varied:  0%, 25%, 50%, 75%, 100%. Only evenly-spaced time series were considered here, which would be the approximate best case behavior for unevenly-spaced series.

Here are the charts from the mixture experiments of "signal" and "noise".  These histograms are bimodal, and a wide valley separates the periodic signal distribution and the noise gene distribution.  The periodic genes show a "spike" or almost a normal distribution with low p-values on the left side.  The aperiodic, "noise" genes are clustered at the right. A value of log10(p) = -4 is the middle of this valley, which suggests a good separation threshold value. 

The two distributions ("periodic" and "aperiodic") have their own characteristic shapes, which should be useful in interpreting results from a biological experiment, such as Bozdech's Plasmodium datatset.  This process allows one to make a number of assumptions about the presence (e.g., periodic gene rate) and signals (e.g., approximate noise levels) from periodic genes and simulate the results.  Such simulations might be useful in estimating false positive rates in the list of identified periodic genes -- where the two distributions overlap.

Results from using various multiple testing correction methods are also shown.  All synthetic genes were found using a significance cutoff of 1E-4, except the Bonferroni correction method rejected some of the genes at this significance level. While results for various multiple testing methods (provided by "R") are shown, our choice is the Benjamini-Hochberg "FDR" method.

Reproduce simulation data in R using file Mixture-1.R and analyze it with Mixture-2.R

8.  Phase shiftSee these notes for a brief discussion of how to extract phase shift information from the Lomb-Scargle results.  [Added 17 Aug 2007]

B.  Exploratory analysis  |  C.  Results

Updated
17 Aug 2007