![]() |
Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Note: Section 8 was added on 17 Aug 2007. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
1a. Ideal Gene. A cosine curve, which can represent an "ideal gene," was used as the starting point. The first 10 pages of SinglePeriod.pdf show Lomb-Scargle periodograms for cosine curves using even time spacing with various periods:
With a 48-hour time window, periods longer than 48 hours were not detected as accurately (e.g., 53.3 instead of 60 hours, and 64 instead of 72 hours) as periods 48 hours or shorter. A frequency range from 0.00 to 0.20/hour is sufficient to detect periods of 8 hours or longer.
Note there are two pages for the 4 hour period. The second page shows the Nyquist "problem" described more in section 2 below. 1b. Uneven Time Spacing, Cosine Signal + Noise. An "ideal gene" doesn't exist in biology, so a cosine signal with various amounts of Gaussian noise was used to simulate a non-ideal gene. The last eight pages of SinglePeriod.pdf show Lomb-Scargle periodograms for uneven time spacing for a gene with 48-hour period approximated by a cosine signal and various amounts of noise, with the results summarized here:
*Noise is Normal(mean=0.0, stdev=value), As noise increases the p-values are reduced, compared to the ideal gene with no noise. If noise is too great, the p-value will no longer indicate significance of any detected periodicity. See file SinglePeriod.R for the parameters used to create these figures. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2. "Problems" in Lomb-Scargle periodogram above Nyquist limit The Lomb-Scargle periodogram "behaves" normally up to the Nyquist limit. Spurious spikes can be seen beyond the Nyquist frequency limit in the periodogram. If the frequency search space is too coarse, peaks might be missed entirely. With the Nyquist limit of 0.50/hr in this case (since time points were hourly), the frequency search space from 0.00 to 0.20/hour will stay away from any Nyquist problems and allow detection of periods of at least 8 hours up to 48 hours or more. Reproduce figures in R using file Nyquist.R. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Two peaks in a periodogram could indicate two periodicities, or a single period with an asymmetrical "duty cycle." 3.1 Two Periodicities What if a gene has two simultaneous periodicities present because it responds to two periodic, exogenous signals? These graphics show ideal "cosine" genes with period of 24 and 48 hours, but with different amplitude weights, A1 (for 24-hr period) and A2 (for 48-hr period). The baseline is 1:1, but other ratios are considered, including 1:0.5, 1:2, 1:4, and 1:8. With a ratio of about 2:1 or 4:1, one dominant frequency is statistically significant, and the less frequency can be seen in the periodogram, but with questionable statistical significance. Above 4:1 or so, only the dominant frequency could be observed. Addition of 24-hour (A1) and 48-hour (A2) Cosine Signals + Noise
*Noise is Normal(mean=0.0, stdev=value) 3.2 Asymmetrical "Duty Cycle". If periodic gene expression is symmetrical the Lomb-Scargle periodogram has one dominant peak. If the gene expression "duty cycle" is not symmetrical, additional peaks can be present in the periodogram. However, these additional peaks may not show statistical significance. 24 Period Cycle
See file TwoPeriods.R for the parameters used to create these figures. (Also needs DutyCycle.R). |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4. Three Periodicities. What if a gene expression profile is a complicated response to multiple periodic, exogenous signals? These charts show the sum of ideal "cosine" genes with periods of 8, 24, and 48 hours, but with different weights (A1 for the 8-hour period, A2 for the 24-hour period, and A3 for the 48-hour period).. The baseline is 1:1:1, but other ratios are considered, including 1:1:2, 1:1:3, 1:1:4. When all cosines have the same weight, the periodogram can resolve all three peaks. However, with a ratio of 1:2:1, only the dominant periodicity can be resolved with statistical significance. Addition of 8-hour (A1), 24-hour (A2),
*Noise is Normal(mean=0.0, stdev=value) Multiple periodicities may be difficult to recover in general. For further study: A Procedure of Multiple Period Searching in Unequally Spaced Time-Series with the Lomb-Scargle Method, H. P. A. Van Dongen, et al, Biological Rhythm Research, 1999, 30(2), pp. 149-177. Reproduce figures in R using file ThreePeriods.R. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
5. Sample Size. How does sample size, N, of a cosine expression series affect the peak significance, p, of the Lomb-Scargle periodogram? This experiment suggests a simple approximate regression line: log10(p) = 1 – 0.2N. This means if we have N=30, we can expect a single perfect gene (cosine profile) to have approximately p = 1E-5, and for N=40, we can expect approx. p = 1E-7. Reproduce figures in R using file Ndetermination.R. After a large number of profiles are obtained each with a separate "p" value, a given false discovery rate level should be used with a multiple testing correction to identify profiles that are significantly periodic. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
6. "Noise" Experiments. The purpose of this experiment was to understand how p-values change with increasing noise and fewer data points. A study of the distribution of results (periods and p-values) for various amounts of noise (including, no signal and all noise) was the basis of the "Mixture" experiments described in the section below. Only evenly-spaced time series were considered here, which would be the approximate best case behavior for unevenly-spaced series. This summary of "Noise" Experiments shows results for 48-hour period cosine "signal" + various amounts of Gaussian noise for N = 12, 24, and 48, and a separate case of no signal and all Gaussian noise. Noise levels were specified as Normal(mean=0, standard deviation), with the s.d. values of 0.0, 0.1, 0.2, 0.5, and 1.0. For each noise level, 5000 gene expression profiles were simulated and analyzed. In all there were 3 "N" values * 6 noise levels * 5000 = 90,000 simulated genes. The "best" and "worst" case statistics in this table were formed by looking for the min and max peak spectral power density in the Lomb-Scargle periodogram, along with the corresponding p-value statistics. Obviously, for no noise, i.e., s.d.=0.0, the min and max values were the same since there was no variation in the results. For small amounts of noise, s.d. = 0.10 to 0.20, the min and max values varied little, and Lomb-Scargle method showed little sensitivity to noise. Only for larger amounts of noise, s.d.=0.50 or 1.0, did the best and worst cases diverge. The results show excellent detection of the cosine signal, especially with N=48, even in the presence of considerable noise. Reasonable results were observed for N=24 and even N=12 when noise up to Normal(mean=0, s.d.=0.20) was added to the original cosine signal. Cosine Signal + Noise Histograms. Histograms were created showing the distribution of the periods and log10(p-values) for the cosine + noise experiments. Here is a summary of those charts: Histograms of log10(p-values) for Cosine + Noise Experiment
The log10(p-values) distribution broadens for increased noise and shifts to the right for increased noise. Decreasing N also results in broadening and shifting to the right. Histograms of Periods for Cosine + Noise Experiment
The period histograms show the correct period is found most of the time, even with Normal(mean=0, s.d.=1.0) Gaussian noise for N=48, which was not the case with N=24. Oddly, with this maximum noise contribution, the period results for N=12 were better than for N=24. Noise Histograms. Histograms were created showing the distribution of the periods and log10(p-values) for the noise only experiments. Note the log(p) histograms peak t the right at log(p) = 0 (i.e., p = 1) and fall exponentially with a tail to the left. The tail shows some simulated noise "genes" with log(p) as low as -4 to about -2. This means this area can represent false hits if the genes with periodic "signal" do not have a log(p) lower than about -4. Period values from the pure-noise experiments varied widely, and seemed to prefer shorter values, which would be consistent with higher-frequency noise. Few periods were observed near 48 hours with this data. As shown above, a histogram of a cosine "signal" with N(mean=0, s.d.=0.50) noise peaks at about -5, and at -7.8 with noise N(mean=0, s.d.=0.20). This demonstrates strong signals should be easy to discriminate from pure noise, which is described next in the "Mixture" experiment. Reproduce simulation data in R using file Noise-1.R and analyze with Noise-2.R and Noise-3.R. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
7. Mixture Experiments of "Periodic" and "Aperiodic" Genes . The simulated genes above were either "signal" (periodic signal with noise), or pure "noise" (not periodic). In a genome we wish to discriminate the periodic genes from the non-periodic genes. The "Noise" experiments above suggest the distribution of log10(p-values) may be effective in this discrimination. For this experiment the noise was fixed at Normal(mean=0, s.d.=0.2) and the fraction of periodic genes was varied: 0%, 25%, 50%, 75%, 100%. Only evenly-spaced time series were considered here, which would be the approximate best case behavior for unevenly-spaced series. Here are the charts from the mixture experiments of "signal" and "noise". These histograms are bimodal, and a wide valley separates the periodic signal distribution and the noise gene distribution. The periodic genes show a "spike" or almost a normal distribution with low p-values on the left side. The aperiodic, "noise" genes are clustered at the right. A value of log10(p) = -4 is the middle of this valley, which suggests a good separation threshold value. The two distributions ("periodic" and "aperiodic") have their own characteristic shapes, which should be useful in interpreting results from a biological experiment, such as Bozdech's Plasmodium datatset. This process allows one to make a number of assumptions about the presence (e.g., periodic gene rate) and signals (e.g., approximate noise levels) from periodic genes and simulate the results. Such simulations might be useful in estimating false positive rates in the list of identified periodic genes -- where the two distributions overlap. Results from using various multiple testing correction methods are also shown. All synthetic genes were found using a significance cutoff of 1E-4, except the Bonferroni correction method rejected some of the genes at this significance level. While results for various multiple testing methods (provided by "R") are shown, our choice is the Benjamini-Hochberg "FDR" method. Reproduce simulation data in R using file Mixture-1.R and analyze it with Mixture-2.R |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8. Phase shift. See these notes for a brief discussion of how to extract phase shift information from the Lomb-Scargle results. [Added 17 Aug 2007] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
B. Exploratory analysis | C. Results
Updated
17 Aug 2007