![]() |
Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms |
|
||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1. Plot of mean values by time period |
||||||||||||||||||||||||||||||||||||||||
2. Lomb-Scargle periodogram of mean expression profile |
||||||||||||||||||||||||||||||||||||||||
3. Histogram of "p" values assigned to periodogram peak for 6,875 probes in "complete" set. The lack of a sharp peak at log(p)=0 at the very right is somewhat unexpected, since a sharper peak was always present in numerical experiments with Gaussian noise. (See Numerical experiments for some background on this mixture of distributions.) Code to create chart is in Analyze1.R. |
||||||||||||||||||||||||||||||||||||||||
4. Histogram of period values computed for all 6,875 probes in "complete" set. The sparse spacing of the histogram above 48 hours is an artifact of how the Lomb-Scargle algorithm works for periods longer than the a single cycle. Code to create chart is in Analyze1.R. |
||||||||||||||||||||||||||||||||||||||||
5. The various adjustments to p-values supported by R were explored, including the new "Benjamini & Yekutieli" method introduced in R 2.1.0. Our a priori plan was to use the Benjamini & Hochberg (BH) method. The results show little difference between the BH method and making no p-value adjustment at all (in this case). With a change between R 2.0.1 and R 2.1.0 to correct how missing values were treated, the R p.adjust function gives slightly different results by version of R. These changes affect the p.adjust.methods bonferroni, holm, and Benjamini and Hochberg ("fdr" in old R version, "BH" in new version). [The Hommel method was not explored because it's quite slow] Code to create chart is in Analyze1.R and should execute in both old and new versions of R. NOTE: Results will vary slightly by version of R. |
||||||||||||||||||||||||||||||||||||||||
6a. The cardinality of the periodic gene sets identified by Bozdech and Lomb-Scargle methods was computed by the code in StatsAndSets.R and used as the basis of the Venn diagram in the paper (Fig. 5). The output includes the various numeric values shown in Figure 5, as well as a number of diagrams comparing the Bozdech data with the Lomb-Scargle approach. Several of the pages from the StatsAndSets.pdf file are introduced below. 744 probes were identified as periodic that Bozdech did not. These 744 can be divided into a "Small N" set of 243, and another set of 501. |
||||||||||||||||||||||||||||||||||||||||
6b. Heuristic criteria used by Bozdech. This icon represents p. 3 of StatsAndSets.pdf, which shows the two heuristic criteria Bozdech used to select periodic genes, namely: (powerMax >= 2.5) AND (Percentage Power at Peak Frequency >= 70) Histograms on pp. 1 and 2 of StatsAndSets.pdf show two criteria separately. The second term, (Percentage Power at Peak Frequency >= 70), was much more important than the first, (powerMax >= 2.5). |
||||||||||||||||||||||||||||||||||||||||
| 6c. Page 16 of StatsAndSets.pdf compares results between the Lomb-Scargle p-value statistic and the ad hoc Bozdech Power at Peak Frequency value, which was the most important factor as described above. The points in blue were the probes selected as periodic by both methods. The diagram shows the Power at Peak Frequency value correlated well with the Lomb-Scargle p-value. | ||||||||||||||||||||||||||||||||||||||||
6d. This icon represents p. 13 of StatsAndSets.pdf, where the phase shift values of the two methods were compared with the Overview dataset (also see 9 below). Bozdech's Fourier approach resulted in a phase shift angle ranging from -pi to pi. The Lomb-Scargle technique did not yield a phase shift value, but instead the peak of a loess-smoothed curve was used as a measure of the phase shift. Since the peak of a sine curve would occur at 12 hours in a 48-hour cycle, a 12-hour adjustment was introduced to make the comparison more meaningful. Overall, the phase shifts of the two methods were comparable, but the peak of the loess-smoothed curve showed some problems a the boundaries of the single cycle (see the time values "48 / 0" in the plot). |
||||||||||||||||||||||||||||||||||||||||
7a. Bozdech did not consider 1795 of the 6875 probes in the Complete dataset because of too many missing values. The Lomb-Scargle method could analyze these 1795 "small N" probes (in the range N=32 to N=42) without imputing any values for the missing data. List of 243 new probes to analyze with "small N" found to be periodic by Lomb-Scargle analysis, and passing the Benjamini and Hochberg multiple testing with a significance level of 1E-4. Periodograms of 98 of 243 periodic expression profiles for N < 40. Code to create charts is in Analyze2.R. Annotation information was available for 200 of the 243 probes representing 189 unique gene IDs. Since Bozdech already considered 37 of these 189 genes in his analysis, there are 152 "Small N" genes to consider. One of these 152 did not have sequence information. See this list of 151 genes with annotations representing 157 probes. Lomb-Scargle identified 501 probes as periodic that Bozdech did not. 440 of these 501 had annotation information, which reduced to 395 unique gene IDs. 55 genes already identified as periodic by Bozdech were excluded, leaving 340 genes (374 probes). |
||||||||||||||||||||||||||||||||||||||||
7b. Peridograms of two probes with unusually long periods. These peak of the periodogram for these two probes was the very first point, which gave them an unusually long period Both of these cases appear to have multiple periodicities, however, with a lower peak near a period of 24 hours, and a lower-yet peak near a 48 hour period. Code to create charts is in Analyze2.R. |
||||||||||||||||||||||||||||||||||||||||
8. Periodogram Map of Periodic Probes. Probes are ordered by the frequency corresponding to the "dominant" peak in each Lomb-Scargle periodogram. This diagram shows "frequency space" and should not be be confused with a phaseogram (see below). Clustering of a periodogram map may be a useful exploratory tool in some cases, but here a simple sort by peak frequency shows the dominance of frequency band corresponding to the 48-hour period in the Plasmodium data. Whether the weak 24-hour band is an indication of some biological process, or is only an artifact of the Lomb-Scargle technique, is not yet clear. Code to create chart is in Analyze3.R. |
||||||||||||||||||||||||||||||||||||||||
| 9. Phaseogram of Periodic Probes. Probes are ordered by the phase shift derived from the peaks of the expression profiles (also see 6d above). The phase shift was defined as the peak time in the smoothed expression pattern (using R's loess function for local regression smoothing).
The white spaces in the diagram reflect the locations of "missing" data. The white vertical bars correspond to all the missing time points for periods TP23 and TP29. This diagram, which reflects all 4355 periodic probes found using Lomb-Scargle, compares favorably with the phaseogram shown as Figure 2 in Bozdech's paper. Code to create chart is in Analyze3.R. |
||||||||||||||||||||||||||||||||||||||||
![]() |
10. Missing Value Tolerance. The Bozdech dataset was missing all time points for periods 23 and 29 and had a considerable number of other missing time points. What if all 7 points from time periods 23 through 29 had been missing? What if the gap "grew" and contained 11, 15, 19, or even 25 missing, consecutive points? Is any analysis for periodic genes possible with so many missing values? What is the "proper" way to impute 7 to 25 consecutive pseudodata time points (so that FFT analysis could be applied)? Loess smoothing could be used to impute such a large number of missing points (see example of loess imputation of data points for an "ideal gene" cosine curve). But should a researcher "invent" so many data points for analysis purposes? There is no rational way to impute data for such a large gap. The Lomb-Scargle method doesn't require any data imputation and can analyze what data are available. The Bozdech Complete Dataset was analyzed for various scenarios of missing points in a "gap", starting with the points for periods 23 and 29, which were already missing: Lomb-Scargle Periodic Genes
*A priori significance criteria Of the 3617 periodic probes identified by Lomb-Scargle with a data gap of 7 consecutive values, 3609 were identified in the original 4355 set. So this 7-value gap introduced 8 false positive periodic probes. Likewise, with a gap of 11 the 2502 of the 2506 probes were "correct" while 4 new false positives were present. 2504 of the 2506 were present in the set of 3617. The histograms of the log10(p-values) show a mixture of distributions of periodic probes and aperiodic probes, as discussed before. The histogram of this mixture maintains much of its shape as more missing values were considered, but shifts to the right along the x-axis to higher values. The "valley" in these histograms suggests that an adaptive threshold may have better discriminative value in finding periodic versus aperiodic probes. Plots of the adjusted p-values for several different adjustment methods, including BH, keep much their same shape, but with shifted p-values. See R script MissingTests.R. |
|||||||||||||||||||||||||||||||||||||||
Updated
24 Oct 2005