statgenHTP tutorial: 4. Outlier detection for series of observations
Emilie Millet, Bart-Jan van Rossum, Isabelle Sanchez, Nadine Hilgert, Fred van Eeuwijk
2022-12-13
Source:vignettes/vignettesSite/OutlierSerieObs_HTP.Rmd
OutlierSerieObs_HTP.Rmd
Introduction
This document describes a protocol to detect outlying time courses, observed on a single plant or plot, with examples from various platforms. Traits may be measured directly or indirectly (through image analysis for example). The protocol considers time series on single traits. The protocol is applicable to raw data or data that were corrected spatially before (see statgenHTP tutorial: 3. Correction for spatial trends) or on raw data.
A nonparametric smoothing associated with a PCA
Each time course is modeled by a non-parametric smoothing spline with a fixed number of knots. This is a piecewise cubic polynomial ((Eubank 1999), (Eilers, Marx, and Durbán 2015)) fitted as a mixed model ((Currie and Durban 2002)).
![P-spline smoothing with 20 knots and a low penalty, from [@Hugelier2016]. The individual B-splines (with the correct coefficients) are shown (colored lines), as well as their sum representing the fit (thick black line).](figures/pspline.png)
P-spline smoothing with 20 knots and a low penalty, from (Hugelier, Devos, and Ruckebusch 2016). The individual B-splines (with the correct coefficients) are shown (colored lines), as well as their sum representing the fit (thick black line).
The estimates for the spline coefficients are then extracted per time course (typically per plant) and correlations between those coefficient vectors are calculated to identify outlying time courses, i.e., plants. An outlying time course will have low correlation to the majority of time courses. To support the analysis by correlations, a principal component analysis can be done on the plant (time course) by spline coefficient matrix. A PCA plot of the plant scores will show the outlying plants.
For P-splines in a mixed model, the smoothing coefficient is optimized by restricted maximum likelihood but the number of knots is chosen by the user.
Principal component analysis on smoothed time series to identify outlying series.
The function fitSpline()
fits a P-spline per plant for the selected trait
. The penalty (i.e. the amount of smoothing) will be chosen by REML and the number of knots can be determined by the user using knots
. In P-spline, the knots are equally spaced and their number can be large. The user should also chose an appropriate minimum number of time points that should be in the data set per plant minNoTP
. When a plant has less time points than the minimum, it will be skipped from the analysis.
The functions are illustrated with the three example data sets. For more information about the data, see statgenHTP tutorial: 1. Introduction.
Example 1
The data from the Phenovator platform have been corrected for spatial trends and time points outliers have been removed (see statgenHTP tutorial: 2. Outlier detection for single observations and statgenHTP tutorial: 3. Correction for spatial trends). At this stage, the cleaned and corrected data are used:
data(spatCorrectedVator)
# Fit P-splines using on a subset of genotypes.
subGenoVator <- c("G070", "G160", "G151", "G179", "G175", "G004", "G055")
fit.spline <- fitSpline(inDat = spatCorrectedVator,
trait = "EffpsII_corr",
genotypes = subGenoVator,
knots = 50)
# Extracting the tables of predicted values and P-spline coefficients
predDat <- fit.spline$predDat
coefDat <- fit.spline$coefDat
The object fit.spline
contains the P-spline model coefficients (coefDat
) and the predicted value (pred.value
in the table below), i.e the values predicted using the P-spline model coefficients. Predictions are made on a denser grid of time points: the time points for prediction are calculated as the smallest gap between two time points divided by 9, so dividing the smallest gap in 10 segments. The object fit.spline
also contains the first and second derivatives (deriv
and deriv2
in the table below, see also statgenHTP tutorial: 6. Estimation of parameters from time courses).
timeNumber | timePoint | pred.value | deriv | deriv2 | plotId | genotype |
---|---|---|---|---|---|---|
0 | 2018-05-31 16:37:00 | 0.6687993 | -8e-07 | 0 | c10r29 | G160 |
800 | 2018-05-31 16:50:20 | 0.6681258 | -8e-07 | 0 | c10r29 | G160 |
1600 | 2018-05-31 17:03:40 | 0.6674536 | -8e-07 | 0 | c10r29 | G160 |
2400 | 2018-05-31 17:17:00 | 0.6667829 | -8e-07 | 0 | c10r29 | G160 |
3200 | 2018-05-31 17:30:20 | 0.6661139 | -8e-07 | 0 | c10r29 | G160 |
4000 | 2018-05-31 17:43:40 | 0.6654466 | -8e-07 | 0 | c10r29 | G160 |
Conversion to numerical time is required to fit P-splines. To keep the same time scale as in the original timePoint
column, a numerical transformation of the time points is made using the first time point as origin (column timeNumber
in the table above).
The numerical time can also be provided by the user using the option useTimeNumber = TRUE
. It allows using thermal time or any other manual conversion. In this example, we provide a new column specified in timeNumber
with time in hours since first measurement.
fit.splineNum <- fitSpline(inDat = spatCorrectedVator,
trait = "EffpsII_corr",
genotypes = subGenoVator,
knots = 50,
useTimeNumber = TRUE,
timeNumber = "timeNumHour")
We can then visualize the P-spline predictions and first derivatives for a subset of genotypes or for a subset of plots.
plot(fit.spline,
genotypes = "G160")
plot(fit.spline,
plotIds = "c10r29",
plotType = "predictions")
plot(fit.spline,
plotIds = "c10r29",
plotType = "derivatives")
The object fit.spline
also contains the values of the P-splines coefficients:
obj.coefficients | plotId | type | genotype |
---|---|---|---|
0.6965965 | c10r29 | timeNumber1 | G160 |
0.6685027 | c10r29 | timeNumber2 | G160 |
0.6421887 | c10r29 | timeNumber3 | G160 |
0.6265465 | c10r29 | timeNumber4 | G160 |
0.6263577 | c10r29 | timeNumber5 | G160 |
0.6420706 | c10r29 | timeNumber6 | G160 |
The coefficients are then used to tag suspect time courses with the function detectSerieOut()
. This function performs a PCA on the coefficients (from data.frame coefDat
) per genotype
and calculates the pairwise angle between the plants in the PCA plot. Plants are tagged when the mean angle is above threshold (thrPca)
, see the lines reason = angle
in the table below. The function also calculates the pairwise-correlation of the coefficients per genotype. Plants are tagged when the correlation is below a given threshold (thrCor
), see the lines reason = mean corr
in the table below. Finally the pairwise-ratios of the slopes of a linear model fitted through the spline coefficients are computed. Plants are tagged when the average pairwise-ratio is lower the a given threshold (thrSlope
), see the lines reasion = slope
in the table below.
For obvious reasons, the detection will only work when there are at least three replicates per genotype. Genotypes with less than three replicates will be skipped.
outVator <- detectSerieOut(corrDat = spatCorrectedVator,
predDat = predDat,
coefDat = coefDat,
trait = "EffpsII_corr",
genotypes = subGenoVator,
thrCor = 0.9,
thrPca = 30,
thrSlope = 0.7)
plotId | genotype | reason | value |
---|---|---|---|
c9r4 | G160 | mean corr | 0.6775630 |
c9r4 | G160 | angle | 47.4954956 |
c9r4 | G160 | slope | 0.6058944 |
c3r43 | G055 | mean corr | 0.8921578 |
For this subset of genotypes, 2 plants were tagged as outliers:
- c3r43 and c9r4 had both low correlations. In addition c9r4 also had high angles and a low average ratio for the slope.
The outVator
can be visualized by selecting genotypes. Here genotype G160 which has plant c9r4 tagged as outlier:
plot(outVator, genotypes = "G160")
The figure above contains:
- (top) A scatter plot of the trait value on the y-axis and time on the x-axis. Points are the raw or corrected data and lines are the P-spline predictions, with one color per plant (in legend). Filled dots represent outlying plants, open dots non-outlying plants.
- (bottom left) A matrix with in the bottom right the correlations of the plant scores as a heatmap. The scale is centered on 0.9 (the correlation threshold) to see at a glance the outlying plants with low correlations (usually correlation is high between plants). The top left of the matrix shows the average slopes as a heatmap. The scale is centered on 0.7 (the slope threshold) to see at a glance the outlying plants with low average slope (usually average slopes are high between plants).
- (bottom right) A PCA plot of the plant scores. Usually, all plants are grouped and the first axis explains most of the variation. When a plant is outlying, it will be located apart from the other plants on the second axis.
It is possible to visualize only the plants tagged for one or two reasons instead of all three. This can be done by specifying e.g. reason = slope
for only visualizing plants tagged because of a low average slope. When doing so only the relevant plots will be shown. So in this case only the upper left part of the correlation plot and no PCA plot.
plot(outVator,
genotypes = "G160",
reason = "slope")
When numerical time was used to fit the splines, it can be used also to plot the outliers using the option useTimeNumber = TRUE
and providing the column name in timeNumber
.
plot(outVator,
genotypes = "G160",
useTimeNumber = TRUE,
timeNumber = "timeNumHour")
Finally, the outlying plants can be removed from the data set…
spatCorrectedVatorOut <- removeSerieOut(dat = spatCorrectedVator,
serieOut = outVator)
# Check for time series outliers in data from which individual outlying
# observations were already removed and to which a spatial adjustment has been applied
head(spatCorrectedVator[spatCorrectedVator$plotId == "c9r4",
c("EffpsII_corr", "EffpsII")])
#> EffpsII_corr EffpsII
#> 12377 0.5643125 0.559
#> 16648 0.6693775 0.669
#> 18073 0.7261539 0.727
#> 19502 0.7994927 0.799
#> 20933 NA NA
#> 22361 NA NA
# Check the same value in the new corrected data.frame
head(spatCorrectedVatorOut[spatCorrectedVatorOut$plotId == "c9r4",
c("EffpsII_corr", "EffpsII")])
#> EffpsII_corr EffpsII
#> 12377 NA 0.559
#> 16648 NA 0.669
#> 18073 NA 0.727
#> 19502 NA 0.799
#> 20933 NA NA
#> 22361 NA NA
… and from the predictions.
fit.splineOut <- removeSerieOut(fitSpline = fit.spline,
serieOut = outVator)
fit.splineNumOut <- removeSerieOut(fitSpline = fit.splineNum,
serieOut = outVator)
It is possible to remove only the plants tagged for one or two reasons instead of all three. This can be done by specifying e.g. reason = slope
for only removing plants tagged because of a low average slope.
spatCorrectedVatorOut <- removeSerieOut(dat = spatCorrectedVator,
serieOut = outVator,
reason = "slope")
Impact of the number of knots on the smoothing
For one plant of the same data set, we fit the P-spline with 10 or 50 knots and visualize the predictions to compare the smoothness. We advise the user to perform tests of the number of knots on a subset of plants before running the function on all plants.
With 10 knots:
sp10k <- fitSpline(inDat = spatCorrectedVator,
trait = "EffpsII_corr",
plotIds = "c10r29",
knots = 10)
plot(sp10k)
The predicted curve is very smooth and, in this case, it is not following precisely the real data curve shape. When comparing the plants of a same genotype with this curve shape we might not identify the outlying plant.
With 50 knots:
sp50k <- fitSpline(inDat = spatCorrectedVator,
trait = "EffpsII_corr",
plotIds = "c10r29",
knots = 50)
plot(sp50k)
The predicted curve is less smooth and follows the actual curve shape. This seems to be a good setting to detect strange curve shape among the replicates of a genotype.
Example 2
The data from the PhenoArch platform have been corrected for spatial trends and individually outlying observations have been removed (see statgenHTP tutorial: 2. Outlier detection for single observations and statgenHTP tutorial: 3. Correction for spatial trends).
data(spatCorrectedArch)
subGenoArch <- c("GenoA01", "GenoA02", "GenoA34", "GenoA04", "GenoB01", "GenoB02", "GenoB07")
fit.splineArch <- fitSpline(inDat = spatCorrectedArch,
trait = "LeafArea_corr",
genotypes = subGenoArch,
knots = 30,
minNoTP = 18)
predDatArch <- fit.splineArch$predDat
coefDatArch <- fit.splineArch$coefDat
plot(fit.splineArch,
plotIds = "c11r9",
plotType = "predictions")
plot(fit.splineArch,
plotIds = "c11r9",
plotType = "derivatives")
Here, the geno.decomp
option is also to split the plants of each genotype those under well watered and water deficit conditions. The outlier detection is run per treatment with a narrow angle threshold
outArch <- detectSerieOut(corrDat = spatCorrectedArch,
predDat = predDatArch,
coefDat = coefDatArch,
trait = "LeafArea_corr",
genotypes = subGenoArch,
thrCor = 0.9,
thrPca = 10,
thrSlope = 0.8,
geno.decomp = "geno.decomp")
plotId | genotype | geno.decomp | reason | value |
---|---|---|---|---|
c13r6 | GenoA01 | WD_Panel1 | mean corr | 0.893585 |
c13r6 | GenoA01 | WD_Panel1 | angle | 14.472281 |
c16r2 | GenoA01 | WD_Panel1 | angle | 13.054901 |
c20r24 | GenoA01 | WD_Panel1 | angle | 12.805672 |
c21r56 | GenoA01 | WD_Panel1 | angle | 10.462828 |
c24r26 | GenoA01 | WD_Panel1 | angle | 13.075548 |
plot(outArch, genotypes = "GenoA01", geno.decomp = "WD_Panel1")
spatCorrectedArchOut <- removeSerieOut(dat = spatCorrectedArch,
serieOut = outArch)
# Check for time series outliers in data from which individual outlying
# observations were already removed and to which a spatial adjustment has been applied
head(spatCorrectedArch[spatCorrectedArch$plotId == "c16r2",
c("LeafArea_corr", "LeafArea")])
#> LeafArea_corr LeafArea
#> 1 0.002564006 0.002871676
#> 2173 0.004164635 0.004066799
#> 3843 0.005768308 0.005690233
#> 5509 0.008211606 0.008043967
#> 7180 0.009454458 0.010399790
#> 10049 0.015047478 0.015390857
# Check the same value in the new corrected data.frame
head(spatCorrectedArchOut[spatCorrectedArchOut$plotId == "c16r2",
c("LeafArea_corr", "LeafArea")])
#> LeafArea_corr LeafArea
#> 1 NA 0.002871676
#> 2173 NA 0.004066799
#> 3843 NA 0.005690233
#> 5509 NA 0.008043967
#> 7180 NA 0.010399790
#> 10049 NA 0.015390857
Example 3
The data from the RootPhAir platform have not been corrected for spatial trends but individually outlying observations have been removed (see statgenHTP tutorial: 2. Outlier detection for single observations).
subGenoRoot <- c( "2", "6", "8", "9", "10", "520", "522")
fit.splineRoot <- fitSpline(inDat = noCorrectedRoot,
trait = "tipPos_y",
knots = 10,
genotypes = subGenoRoot,
minNoTP = 0,
useTimeNumber = TRUE,
timeNumber = "thermalTime")
predDatRoot <- fit.splineRoot$predDat
coefDatRoot <- fit.splineRoot$coefDat
row.names(coefDatRoot) <- 1:nrow(coefDatRoot)
plot(fit.splineRoot,
genotypes = "2")
outRoot <- detectSerieOut(corrDat = noCorrectedRoot,
predDat = predDatRoot,
coefDat = coefDatRoot,
trait = "tipPos_y",
genotypes = subGenoRoot,
thrCor = 0.9,
thrPca = 20,
thrSlope = 0.7)
plot(outRoot,
genotypes = "6")
noCorrectedRootOut <- removeSerieOut(dat = noCorrectedRoot,
serieOut = outRoot)
# Check one value annotated as outlier in the original corrected data.frame
head(noCorrectedRoot[noCorrectedRoot$plotId == "A_21_2", "tipPos_y"])
#> [1] 3.042857 3.057143 NA 2.317857 2.325000 2.371429
# Check the same value in the new corrected data.frame
head(noCorrectedRootOut[noCorrectedRootOut$plotId == "A_21_2", "tipPos_y"])
#> [1] NA NA NA NA NA NA
Detection of outlier plants on a multi-criteria basis, with expert rules
An outlier plant is defined as a biological replicate deviating from the overall distribution of plants on a multi-criteria basis, regardless of the quality of measurements. Detecting outlier plants can be done by monitoring a single criterion, such as plant height or biomass. In this case, a procedure for detecting outlying series of observations should be used, as discussed in the previous sections. But taking into account only one criterion can sometimes be restrictive in deciding whether a plant is outlier or not. A multi-criteria approach will then be more relevant. A multi-criteria method considers several traits jointly, with rules set by experts depending on the species. We describe below the approach followed on a maize experiment, see (Alvarez Prado et al. 2019).
Rules in the case of maize
We consider two categories of potentially outlier plants, namely apparently too small or too large plants. For the detection of unexpectedly small plants with likely physiological disorders, the progression of leaf stages was considered in addition to the time course of shoot biomass. Indeed, leaf appearance rate carries a non-redundant information compared with biomass (confirmed by standard correlation calculations). It usually presents a low plant-to-plant variability except in case of severe disorders, and is relatively insensitive to environmental cues other than temperature. For the detection of unexpectedly large plants, potentially associated with wrong genotype identification, combining plant height and biomass can result in an efficient identification.
Statistical modelling: a mixed model at a given time (defined by the expert)
Each of the selected traits was measured (or estimated) at a specific time (for example 24 d20°C for the PhenoArch data set), time just before the beginning of the differentiation of the two watering treatments. This allows to have more replicates per genotype. It also reduces the dimensionality of the time courses to only one point, which will simplify the statistical models to be implemented later.

Schematic representation of a multi-trait approach for detection of outlier plants (case of maize), from Alvarez Prado et al. (2019).
As shown above, traits are modeled with a mixed model that considers fixed experiment (Env) effect and random genotypic (G), replicate (R) and spatial (C) effects. The model can be fitted with the SpATS R-package, (Rodríguez-Álvarez et al. (2018)). Residuals (deviations) can be directly computed from the fitting, with a confidence interval. Plants, whose deviations for the criteria of leaf appearance rate and biomass are less than the lower bound of this interval, are considered as outlier small plants. Plants, whose deviations for the criteria of plant height and biomass are greater than the upper bound of the confidence interval, are considered as outlier large plants.
Implementation in statgenHTP
Mixed models are fitted with the R-package SpATS through the use of the detectSingleOutMaize
function, which was developed specifically for the maize data, as described above. This function fits a mixed model for each parameter, and tests whether the residual deviations are lower (for the criteria considered for small plants) or higher (for the criteria considered for big plants) than a specified threshold.
phenoTParch <- createTimePoints(dat = PhenoarchDat1,
experimentName = "Phenoarch",
genotype = "Genotype",
timePoint = "Date",
plotId = "pos",
rowNum = "Row",
colNum = "Col")
test2 <- detectSingleOutMaize(phenoTParch,
timeBeforeTrt = "2017-04-27",
trait1 = "Biomass",
trait2 = "PlantHeight",
trait3 = "phyllocron")
The detectSingleOutMaize
returns a list of 3 elements :
- modDat: a data.frame with the used data set, the fitted values and residuals calculated by the models, and the plants flagged as outlier
- smallPlants: a data.frame of the “small” plants detected as outliers
- bigPlants: a data.frame of the “big” plants detected as outliers
The table below shows the smallPlants data.frame.
timePoint | plotId | genotype | Scenario | population | |
---|---|---|---|---|---|
5811 | 2017-04-27 | c5r3 | GenoA06 | WW | Panel1 |
8476 | 2017-04-27 | c6r57 | GenoA57 | WW | Panel1 |
8594 | 2017-04-27 | c7r2 | GenoB07 | WD | Panel2 |
8986 | 2017-04-27 | c7r18 | GenoB11 | WD | Panel2 |
17106 | 2017-04-27 | c12r50 | GenoA30 | WD | Panel1 |
18085 | 2017-04-27 | c13r28 | GenoA27 | WD | Panel1 |
References
Alvarez Prado, Santiago, Isabelle Sanchez, Llorenç Cabrera-Bosquet, Antonin Grau, Claude Welcker, François Tardieu, and Nadine Hilgert. 2019. “To Clean or Not to Clean Phenotypic Datasets for Outlier Plants in Genetic Analyses?” Journal of Experimental Botany 70 (15): 3693–8. https://doi.org/10.1093/jxb/erz191.
Currie, I D, and M Durban. 2002. “Flexible Smoothing with P-Splines: A Unified Approach.” Statistical Modelling 2 (4): 333–49. https://doi.org/10.1191/1471082x02st039ob.
Eilers, Paul H. C., Brian D. Marx, and Maria Durbán. 2015. “Twenty Years of P-Splines.” Statistics and Operations Research Transactions 39 (2): 149–86.
Eubank, Randall L. 1999. Nonparametric Regression and Spline Smoothing. Statistics: A Series of Textbooks and Monographs. CRC Press.
Hugelier, S., O. Devos, and C. Ruckebusch. 2016. “Chapter 14 - a Smoothness Constraint in Multivariate Curve Resolution-Alternating Least Squares of Spectroscopy Data.” In Resolving Spectral Mixtures, edited by Cyril Ruckebusch, 30:453–76. Data Handling in Science and Technology. Elsevier. https://doi.org/10.1016/B978-0-444-63638-6.00014-0.
Rodríguez-Álvarez, María, Martin P. Boer, Fred van Eeuwijk, and Paul H. C. Eilers. 2018. “Correcting for Spatial Heterogeneity in Plant Breeding Experiments with P-Splines.” Spatial Statistics 23 (October): 52–71. https://doi.org/10.1016/j.spasta.2017.10.003.