![]() |
Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms |
Step-by-Step Instructions: The file Bozdech03.R was used to process the Plasmodium dataset in the file Complete_Dataset.csv. That R script is not very general, and is likely not a good starting point for processing your dataset. Instead, download and modify the file AnalyzeMyTimeSeriesSet.R to process your data. The lines in this file are explained below. The following assumes you have already extracted the various .R files from the .ZIP (or .gz) on the R Code page. At a minimum you need these files in your working directory:
The first important line in AnalyzeMyTimeSeriesSet.R directly or indirectly references all the above R code:
|
||||||||||||||||||||||||||||||||||||||||||||||||
Step 1. Create a comma-separated value (.csv) file containing a time series in each row for a particular probe from your microarray study. (Alternately, you can create an Excel spreadsheet with these values, and create a .csv from that, or modify R code to read the Excel file directly.) Ideally, data should be collected over a time period of more than one cycle. Your file should have this structure of N time points for each of the K probes:
Column labels are assumed to be present in row 1, and row labels in column 1 are assumed to be the various probe names from a microarray study. These values are assumed to be log2 ratios. We have worked with time series where N ranges from approximately 20 to 50, and where there are about 5,000 to 45,000 probes. In theory, any size dataset that fits in memory could be processed. Obviously, it takes more time to process more probes. For the following discussion, the LombScargleExampleSet.csv file was created using SetupLombScargleExample.R. (You may want to consider processing a test dataset of known periodic patterns to help understand how "perfect" periodic genes would perform.) Here is the code form AnalyazeMyTimeSeriesSet.R that loads the raw expression file into a data frame called "RawData":
Use your filename instead of LombScargleExampleSet.csv in the assignment to filename. If you have a number of expression datasets and would like to be prompted for which one to process, use this R statement:
If you have an Excel file, LombScargleExampleSet.xls, instead of a .csv file, the following usually works (problems with some Excel files can be avoided by always using .csv files instead):
Note that a forward slash, "/", can be used in R for either Windows or Linux paths. You can use the R dim command to verify if the correct number of rows and columns are present in the data frame:
|
||||||||||||||||||||||||||||||||||||||||||||||||
Step 2. Define data needed for Lomb-Scargle Analysis. In some cases the RawData from Step 1 can be directly used in subsequent analysis. You may have raw expression ratios and may need to take logarithms before proceeding. You may want to perform data "quality" tests, or check for "outliers" before continuing. In this case the RawData contains log ratio values, so the ID and Expression objects can be extracted from the RawData data frame:
If you have ratio data and need to compute log ratios, you may want to replace 0 values with small values to avoid the -Inf values from log2(0):
This last assignment also converts RawData to an Expression matrix of double precision floating-point values, which is more computationally efficient to work with than a list:
|
||||||||||||||||||||||||||||||||||||||||||||||||
Step 3. Define the time points for the series. Unlike Fourier analysis, time values with Lomb-Scargle analysis need not be uniform. In this case, the time values were originally defined from uniform random numbers over the interval 0 to 24 hours.
The consistency check above ensures the number of defined times matches the number of values in each of the time series. While a bit of kludge, the global assignment to "unit" changes text in certain plots from "unit" to "hour". |
||||||||||||||||||||||||||||||||||||||||||||||||
Step 4. Define the test frequencies. Unlike Fourier analysis where only certain "Fourier frequencies" are usually considered, the range and number of test frequencies need to be defined for the Lomb-Scargle analysis. A rule of thumb from [Press2002] suggests that 2N or 4N test frequencies are usually adequate. These test frequencies could range up to the Nyquist limit, but usually practical considerations limit this range. For example, if you are observing a biological phenomenon that has a period of 120 minutes, perhaps a range of 60 to 180 minutes would be adequate. In this case with a dominant frequency of 24 hours, let's look at the range of frequencies corresponding to periods of 6 to 48 hours.
The regression formula by [Horne86] is used to estimate the number of these test frequencies that are independent, which is used in the computation of p values. Missing numeric values in a time series will automatically be excluded from the Lomb-Scargle analysis and affect the computation of the number of independent frequencies, too, since it's based on sample size, N. |
||||||||||||||||||||||||||||||||||||||||||||||||
Step 5. Define directory to hold results. A directory is used to hold the output files.
As mentioned in the comments above, the "chart" directory is only used if SavePlots=TRUE is specified below. Normally, you would use SavePlots=FALSE if you are processing a large number of probes. If some other plot format is desired instead of PDFs, change this statement in ProcessExpressionDataset function in the file ProcessLombScargleDataset.R:
|
||||||||||||||||||||||||||||||||||||||||||||||||
Step 6. Process the time series in the Expression matrix:
This step and all the steps above are executed as a batch using this R statement:
By default the graphic for each probe is displayed as it is being processed. Three output files are created in the specified OUTPUT_PATH: (The links here show the results) Since SavePlots=TRUE in this example, a folder called charts is created in the OUTPUT_PATH with one PDF file per probe: To speed processing, especially with large Expression datasets, you may want to use SavePlots=FALSE to suppress these PDF files. In addition, you can suppress the display of graphics and save considerable processing time by defining these two global variables:
The loess peak is used to define the "phase shift" (see below) and this is not computed when SHOW.LOESS.PEAK is set to FALSE. An additional time savings can be achieved by suppressing all displayed graphics (and the loess computation) by setting SHOW.PLOT to FALSE. These statements can be directly edited in the LombScargle.R file, or set after this file is source, such as in the file ProcessLombScargleDataset.R. (Note: Always specify SavePlots=FALSE when specifying SHOW.PLOT=FALSE). |
||||||||||||||||||||||||||||||||||||||||||||||||
Step 7. Analyze Results -- Periodicity. The individual plots show the result for each probe, but the Dominant.CSV file can be used for overall summaries. The period and p-value columns from this file (shown below) indicate whether a gene is periodic. Because this is an artificial example defined by the SetupLombScargleExample.R file, the Expected Period column was added to show expected results were obtained:
The probe for Series1 shows a period of 48 hours, but this should be ignored because of the NaN (not a number) p-value. The remaining individual p-values are fairly good, but should be considered in the context of the multiple testing procedure outlined in our paper. For Series 10 both the peaks corresponding to periods of 8 hours and 40 hours were significant and about the same, but the procedure only picks the highest one, which corresponds to the 8-hour period. This is a limitation of the existing program. For Series 11 two peaks were not seen since the two frequencies were quite close. This is another limitation: frequencies corresponding to 8 and 40 hour periods can be resolved, but frequencies corresponding to periods of 12 and 18 hours cannot be resolved. |
||||||||||||||||||||||||||||||||||||||||||||||||
Step 8. Analyze Results -- Phase Shift. Probes 4 through 9 all have the same computed periods of approximately 24 hours. A set of probes with the same period can be further categorized by phase shift, which is a measure of whether they are aligned in time. This phase shift was computed by finding the peak in the loess smoothed curve, which is represented by a dot in the various plots. The PhaseShift and PhaseShiftHeight values below are from the Dominant.CSV file. PhaseShift can be expressed in terms of an offset within the 24 hour period, or in terms of radians. In this case the PhaseShift can range from 0 to 24 hours, which corresponds to 0 to 2π radians. The Adjusted Phase Shifts (see below) compare quite favorably with the Expected Phase Shifts. The adjustment was necessary because periodic functions can be expressed in alternative and equivalent ways.
The phase shift corresponding to the original value in radians in the SetupLombScargleExample.R file can be computed as follows: (24 - PhaseShift) * (2π/24). The PhaseShiftHeight values can be used to reject periodic patterns with low amplitude. These heights can be ignored in this case, since all probes have similar amplitudes. |
||||||||||||||||||||||||||||||||||||||||||||||||
| The Periodgram.CSV and pValue.CSV files contain all the numeric values displayed in the periodogram and p-value charts for additional analysis if desired. |
Updated
22 Sept 2005