efg's Research Notes:  R TechNotes and Graphics Gallery

loess Smoothing
and Data Imputation

Earl F. Glynn
Stowers Institute for Medical Research
18 March 2005

Purpose

This TechNote shows examples of loess (local polynomial regression fitting) smoothing for various "span" values. The online R documentation (?loess) says the default span value is 0.75, but offers little guidance and no visual examples of how the span value affects smoothing.

In addition to simply smoothing a curve, the R loess function can be used to impute missing data points. An example of data imputation with loess is shown.

Background

Software Requirements

R 2.0.1 or later

Step-by-Step Procedure

Let's take a sine curve, add some "noise" to it, and then see how the loess "span" parameter affects the look of the smoothed curve.

1. Create a sine curve and add some noise:

> period <- 120
> x <- 1:120
> y <- sin(2*pi*x/period) + runif(length(x),-1,1)

2. Plot the points on this noisy sine curve:

> plot(x,y, main="Sine Curve + 'Uniform' Noise")
> mtext("showing loess smoothing (local regression smoothing)")

3. Apply loess smoothing using the default span value of 0.75:

> y.loess <- loess(y ~ x, span=0.75, data.frame(x=x, y=y))

4. Compute loess smoothed values for all points along the curve:

> y.predict <- predict(y.loess, data.frame(x=x))

5. Plot the loess smoothed curve along with the points that were already plotted:

> lines(x,y.predict)

6. Let's use the R "optimize" function to find the peak of the loess smoothed curve and plot that point:

> peak <- optimize(function(x, model)
            predict(model, data.frame(x=x)),
            c(min(x),max(x)),
            maximum=TRUE,
            model=y.loess)
> points(peak$maximum,peak$objective,
            pch=FILLED.CIRCLE<-19)
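
One way to sanity-check the peak is to compare it with a brute-force search over a fine grid. The sketch below repeats steps 1, 3 and 6 with a fixed seed; the seed, the 0.1 grid spacing, and the names grid, grid.fit and grid.peak are illustrative assumptions, not part of the original procedure. Because optimize only guarantees a local maximum, the two answers may differ if the smoothed curve has more than one bump.

```r
set.seed(19)                          # arbitrary seed, only for reproducibility
period <- 120
x <- 1:120
y <- sin(2*pi*x/period) + runif(length(x), -1, 1)
y.loess <- loess(y ~ x, span=0.75, data.frame(x=x, y=y))

# Peak located by optimize, as in step 6
peak <- optimize(function(x, model) predict(model, data.frame(x=x)),
                 c(min(x), max(x)), maximum=TRUE, model=y.loess)

# Peak located by evaluating the smoothed curve on a fine grid (assumed spacing)
grid <- seq(min(x), max(x), by=0.1)
grid.fit <- predict(y.loess, data.frame(x=grid))
grid.peak <- grid[which.max(grid.fit)]

# The noise-free sine curve peaks at period/4
c(optimize=peak$maximum, grid=grid.peak, noise.free=period/4)
```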

 

7. Repeat steps 1-6 above for various span values. A script was created to automate this. Run this script by entering the following R statement:

> source("http://research.stowers-institute.org/efg/R/Statistics/loess-sin+runif.R")
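
What such a loop over span values might look like can be sketched as follows. This is a minimal sketch and not the script sourced above: the seed, the particular set of span values, and the names d, spans and peaks are assumptions chosen for illustration.

```r
set.seed(19)                          # arbitrary seed, only for reproducibility
period <- 120
x <- 1:120
y <- sin(2*pi*x/period) + runif(length(x), -1, 1)
d <- data.frame(x=x, y=y)

spans <- c(0.10, 0.25, 0.50, 0.75, 1.00, 2.00)   # assumed span values

plot(x, y, main="Sine Curve + 'Uniform' Noise")
mtext("loess fits for several span values")

# One row per span: location and height of the smoothed peak
peaks <- data.frame(span=spans, x=NA_real_, y=NA_real_)

for (i in seq_along(spans)) {
  fit <- loess(y ~ x, data=d, span=spans[i])
  lines(x, predict(fit, data.frame(x=x)), col=i)

  peak <- optimize(function(x, model) predict(model, data.frame(x=x)),
                   c(min(x), max(x)), maximum=TRUE, model=fit)
  points(peak$maximum, peak$objective, pch=19, col=i)
  peaks[i, c("x", "y")] <- c(peak$maximum, peak$objective)
}

legend("topright", legend=format(spans), col=seq_along(spans),
       lty=1, title="span")
peaks
```

Using the loop index as the color keeps each curve and its peak marker visually paired.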

 

8. Compare "noise" from a uniform distribution from -1 to 1 (above) to Gaussian noise, with mean 0 and standard deviation 1.0 (below):

> source("http://research.stowers-institute.org/efg/R/Statistics/loess-sin+rnorm.R")
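
When comparing the two cases, keep in mind that the two noise sources do not have the same spread: a uniform distribution on (-1, 1) has standard deviation 1/sqrt(3), about 0.58, while the Gaussian noise here has standard deviation 1.0. The sketch below fits both cases at the default span and reports the spread of the residuals; the seed and the names signal, y.unif, y.norm and noise.sd are assumptions for illustration, and this is not the script sourced above.

```r
set.seed(19)                          # arbitrary seed, only for reproducibility
period <- 120
x <- 1:120
signal <- sin(2*pi*x/period)

# Same sine curve with two different kinds of noise
y.unif <- signal + runif(length(x), -1, 1)
y.norm <- signal + rnorm(length(x), mean=0, sd=1)

fit.unif <- loess(y ~ x, span=0.75, data.frame(x=x, y=y.unif))
fit.norm <- loess(y ~ x, span=0.75, data.frame(x=x, y=y.norm))

# Standard deviation of the residuals around each smoothed curve
noise.sd <- c(uniform=sd(residuals(fit.unif)),
              gaussian=sd(residuals(fit.norm)))
noise.sd
```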

 

9. Let's use loess to impute data points. Start with a noisy sine curve like the one computed above, but leave out 15 of the 120 data points using R's "sample" function:

> period <- 120
> FullList <- 1:120
> x <- FullList

> # "randomly" make 15 of the points "missing"
> MissingList <- sample(x,15)
> x[MissingList] <- NA

> # Create sine curve with noise
> y <- sin(2*pi*x/period) + runif(length(x),-1,1) 

> # Plot points on noisy curve
> plot(x,y, main="Sine Curve + 'Uniform' Noise")
> mtext("Using loess smoothed fit to impute missing values")

10. As before, use the loess and predict functions to get smoothed values at the defined points:

> y.loess <- loess(y ~ x, span=0.75, data.frame(x=x, y=y))
> y.predict <- predict(y.loess, data.frame(x=FullList))

> # Plot the loess smoothed curve showing gaps for missing data
> lines(x,y.predict)

11. Use the loess and predict functions to also impute the values at the missing points:

> # Show imputed points to fill in gaps
> y.Missing <- predict(y.loess, data.frame(x=MissingList))
> points(MissingList, y.Missing, pch=FILLED.CIRCLE<-19)
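
To use the imputed values rather than just plot them, they can be written back into a copy of y. The sketch below repeats steps 9 to 11 with a fixed seed; the seed and the names y.filled, x.range and outside are assumptions for illustration. One caveat: with the default loess settings, predict returns NA for x values outside the range of the observed data, so a missing point at either end of the curve (for example x = 1 or x = 120) is not filled in by this approach.

```r
set.seed(19)                          # arbitrary seed, only for reproducibility
period <- 120
FullList <- 1:120
x <- FullList
MissingList <- sample(x, 15)
x[MissingList] <- NA
y <- sin(2*pi*x/period) + runif(length(x), -1, 1)

# loess drops the rows with missing values when fitting
y.loess <- loess(y ~ x, span=0.75, data.frame(x=x, y=y))

# Copy the observed values, then fill the gaps with loess predictions
y.filled <- y
y.filled[MissingList] <- predict(y.loess, data.frame(x=MissingList))

# Flag gaps that lie outside the observed x range; these stay NA
x.range <- range(x, na.rm=TRUE)
outside <- MissingList < x.range[1] | MissingList > x.range[2]
data.frame(x=MissingList, imputed=y.filled[MissingList], outside=outside)
```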

 

12. Compare the loess smoothed fit and imputed points for various span values:

> source("http://research.stowers-institute.org/efg/R/Statistics/loess-sin+runif-impute.R")

 

Discussion/Conclusion

Span values as small as 0.10 do not provide much smoothing and can result in a "jerky" curve. Span values as large as 2.0 provide perhaps too much smoothing, at least in the cases shown above. Overall, the default value of 0.75 worked fairly well in "finding" the sine curve.
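
Because the underlying curve is known in this simulation, the visual impression can be backed by a number: the root-mean-square error between each smoothed curve and the noise-free sine. The following is a minimal sketch; the seed, the set of span values, and the names signal, d, spans and rmse are assumptions for illustration, and results will vary with the random noise.

```r
set.seed(19)                          # arbitrary seed, only for reproducibility
period <- 120
x <- 1:120
signal <- sin(2*pi*x/period)          # noise-free curve
y <- signal + runif(length(x), -1, 1)
d <- data.frame(x=x, y=y)

spans <- c(0.10, 0.25, 0.50, 0.75, 1.00, 2.00)   # assumed span values

# RMSE between the loess fit and the noise-free curve, one value per span
rmse <- sapply(spans, function(s) {
  fit <- loess(y ~ x, data=d, span=s)
  sqrt(mean((fitted(fit) - signal)^2))
})

data.frame(span=spans, rmse=round(rmse, 3))
```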

Updated
24 June 2005