# statgenHTP tutorial: 4. Outlier detection for series of observations

#### Emilie Millet, Bart-Jan van Rossum, Isabelle Sanchez, Nadine Hilgert, Fred van Eeuwijk

#### 2022-12-13

Source:`vignettes/vignettesSite/OutlierSerieObs_HTP.Rmd`

`OutlierSerieObs_HTP.Rmd`

## Introduction

This document describes a protocol to detect outlying time courses, observed on a single plant or plot, with examples from various platforms. Traits may be measured directly or indirectly (through image analysis for example). The protocol considers time series on single traits. The protocol is applicable to raw data or data that were corrected spatially before (see **statgenHTP tutorial: 3. Correction for spatial trends**) or on raw data.

### A nonparametric smoothing associated with a PCA

Each time course is modeled by a non-parametric smoothing spline with a fixed number of knots. This is a piecewise cubic polynomial ((Eubank 1999), (Eilers, Marx, and Durbán 2015)) fitted as a mixed model ((Currie and Durban 2002)).

The estimates for the spline coefficients are then extracted per time course (typically per plant) and correlations between those coefficient vectors are calculated to identify outlying time courses, *i.e.*, plants. An outlying time course will have low correlation to the majority of time courses. To support the analysis by correlations, a principal component analysis can be done on the plant (time course) by spline coefficient matrix. A PCA plot of the plant scores will show the outlying plants.

For P-splines in a mixed model, the smoothing coefficient is optimized by restricted maximum likelihood but the number of knots is chosen by the user.

## Principal component analysis on smoothed time series to identify outlying series.

The function `fitSpline()`

fits a P-spline per plant for the selected `trait`

. The penalty (i.e. the amount of smoothing) will be chosen by REML and the number of knots can be determined by the user using `knots`

. In P-spline, the knots are equally spaced and their number can be large. The user should also chose an appropriate minimum number of time points that should be in the data set per plant `minNoTP`

. When a plant has less time points than the minimum, it will be skipped from the analysis.

The functions are illustrated with the three example data sets. For more information about the data, see **statgenHTP tutorial: 1. Introduction**.

### Example 1

The data from the Phenovator platform have been corrected for spatial trends and time points outliers have been removed (see **statgenHTP tutorial: 2. Outlier detection for single observations** and **statgenHTP tutorial: 3. Correction for spatial trends**). At this stage, the cleaned and corrected data are used:

```
data(spatCorrectedVator)
# Fit P-splines using on a subset of genotypes.
subGenoVator <- c("G070", "G160", "G151", "G179", "G175", "G004", "G055")
fit.spline <- fitSpline(inDat = spatCorrectedVator,
trait = "EffpsII_corr",
genotypes = subGenoVator,
knots = 50)
# Extracting the tables of predicted values and P-spline coefficients
predDat <- fit.spline$predDat
coefDat <- fit.spline$coefDat
```

The object `fit.spline`

contains the P-spline model coefficients (`coefDat`

) and the predicted value (`pred.value`

in the table below), i.e the values predicted using the P-spline model coefficients. Predictions are made on a denser grid of time points: the time points for prediction are calculated as the smallest gap between two time points divided by 9, so dividing the smallest gap in 10 segments. The object `fit.spline`

also contains the first and second derivatives (`deriv`

and `deriv2`

in the table below, see also **statgenHTP tutorial: 6. Estimation of parameters from time courses**).

timeNumber | timePoint | pred.value | deriv | deriv2 | plotId | genotype |
---|---|---|---|---|---|---|

0 | 2018-05-31 16:37:00 | 0.6687993 | -8e-07 | 0 | c10r29 | G160 |

800 | 2018-05-31 16:50:20 | 0.6681258 | -8e-07 | 0 | c10r29 | G160 |

1600 | 2018-05-31 17:03:40 | 0.6674536 | -8e-07 | 0 | c10r29 | G160 |

2400 | 2018-05-31 17:17:00 | 0.6667829 | -8e-07 | 0 | c10r29 | G160 |

3200 | 2018-05-31 17:30:20 | 0.6661139 | -8e-07 | 0 | c10r29 | G160 |

4000 | 2018-05-31 17:43:40 | 0.6654466 | -8e-07 | 0 | c10r29 | G160 |

Conversion to numerical time is required to fit P-splines. To keep the same time scale as in the original `timePoint`

column, a numerical transformation of the time points is made using the first time point as origin (column `timeNumber`

in the table above).

The numerical time can also be provided by the user using the option `useTimeNumber = TRUE`

. It allows using thermal time or any other manual conversion. In this example, we provide a new column specified in `timeNumber`

with time in hours since first measurement.

```
fit.splineNum <- fitSpline(inDat = spatCorrectedVator,
trait = "EffpsII_corr",
genotypes = subGenoVator,
knots = 50,
useTimeNumber = TRUE,
timeNumber = "timeNumHour")
```

We can then visualize the P-spline predictions and first derivatives for a subset of genotypes or for a subset of plots.

```
plot(fit.spline,
genotypes = "G160")
```

```
plot(fit.spline,
plotIds = "c10r29",
plotType = "predictions")
```

```
plot(fit.spline,
plotIds = "c10r29",
plotType = "derivatives")
```

The object `fit.spline`

also contains the values of the P-splines coefficients:

obj.coefficients | plotId | type | genotype |
---|---|---|---|

0.6965965 | c10r29 | timeNumber1 | G160 |

0.6685027 | c10r29 | timeNumber2 | G160 |

0.6421887 | c10r29 | timeNumber3 | G160 |

0.6265465 | c10r29 | timeNumber4 | G160 |

0.6263577 | c10r29 | timeNumber5 | G160 |

0.6420706 | c10r29 | timeNumber6 | G160 |

The coefficients are then used to tag suspect time courses with the function `detectSerieOut()`

. This function performs a PCA on the coefficients (from data.frame `coefDat`

) per `genotype`

and calculates the pairwise angle between the plants in the PCA plot. Plants are tagged when the mean angle is above threshold (`thrPca)`

, see the lines `reason = angle`

in the table below. The function also calculates the pairwise-correlation of the coefficients per genotype. Plants are tagged when the correlation is below a given threshold (`thrCor`

), see the lines `reason = mean corr`

in the table below. Finally the pairwise-ratios of the slopes of a linear model fitted through the spline coefficients are computed. Plants are tagged when the average pairwise-ratio is lower the a given threshold (`thrSlope`

), see the lines `reasion = slope`

in the table below.

For obvious reasons, the detection will only work when there are at least three replicates per genotype. Genotypes with less than three replicates will be skipped.

```
outVator <- detectSerieOut(corrDat = spatCorrectedVator,
predDat = predDat,
coefDat = coefDat,
trait = "EffpsII_corr",
genotypes = subGenoVator,
thrCor = 0.9,
thrPca = 30,
thrSlope = 0.7)
```

plotId | genotype | reason | value |
---|---|---|---|

c9r4 | G160 | mean corr | 0.6775630 |

c9r4 | G160 | angle | 47.4954956 |

c9r4 | G160 | slope | 0.6058944 |

c3r43 | G055 | mean corr | 0.8921578 |

For this subset of genotypes, 2 plants were tagged as outliers:

- c3r43 and c9r4 had both low correlations. In addition c9r4 also had high angles and a low average ratio for the slope.

The `outVator`

can be visualized by selecting genotypes. Here genotype G160 which has plant c9r4 tagged as outlier:

`plot(outVator, genotypes = "G160")`

The figure above contains:

- (top) A scatter plot of the trait value on the y-axis and time on the x-axis. Points are the raw or corrected data and lines are the P-spline predictions, with one color per plant (in legend). Filled dots represent outlying plants, open dots non-outlying plants.
- (bottom left) A matrix with in the bottom right the correlations of the plant scores as a heatmap. The scale is centered on 0.9 (the correlation threshold) to see at a glance the outlying plants with low correlations (usually correlation is high between plants). The top left of the matrix shows the average slopes as a heatmap. The scale is centered on 0.7 (the slope threshold) to see at a glance the outlying plants with low average slope (usually average slopes are high between plants).
- (bottom right) A PCA plot of the plant scores. Usually, all plants are grouped and the first axis explains most of the variation. When a plant is outlying, it will be located apart from the other plants on the second axis.

It is possible to visualize only the plants tagged for one or two reasons instead of all three. This can be done by specifying e.g. `reason = slope`

for only visualizing plants tagged because of a low average slope. When doing so only the relevant plots will be shown. So in this case only the upper left part of the correlation plot and no PCA plot.

```
plot(outVator,
genotypes = "G160",
reason = "slope")
```

When numerical time was used to fit the splines, it can be used also to plot the outliers using the option `useTimeNumber = TRUE`

and providing the column name in `timeNumber`

.

```
plot(outVator,
genotypes = "G160",
useTimeNumber = TRUE,
timeNumber = "timeNumHour")
```

Finally, the outlying plants can be removed from the data set…

```
spatCorrectedVatorOut <- removeSerieOut(dat = spatCorrectedVator,
serieOut = outVator)
# Check for time series outliers in data from which individual outlying
# observations were already removed and to which a spatial adjustment has been applied
head(spatCorrectedVator[spatCorrectedVator$plotId == "c9r4",
c("EffpsII_corr", "EffpsII")])
#> EffpsII_corr EffpsII
#> 12377 0.5643125 0.559
#> 16648 0.6693775 0.669
#> 18073 0.7261539 0.727
#> 19502 0.7994927 0.799
#> 20933 NA NA
#> 22361 NA NA
# Check the same value in the new corrected data.frame
head(spatCorrectedVatorOut[spatCorrectedVatorOut$plotId == "c9r4",
c("EffpsII_corr", "EffpsII")])
#> EffpsII_corr EffpsII
#> 12377 NA 0.559
#> 16648 NA 0.669
#> 18073 NA 0.727
#> 19502 NA 0.799
#> 20933 NA NA
#> 22361 NA NA
```

… and from the predictions.

```
fit.splineOut <- removeSerieOut(fitSpline = fit.spline,
serieOut = outVator)
fit.splineNumOut <- removeSerieOut(fitSpline = fit.splineNum,
serieOut = outVator)
```

It is possible to remove only the plants tagged for one or two reasons instead of all three. This can be done by specifying e.g. `reason = slope`

for only removing plants tagged because of a low average slope.

```
spatCorrectedVatorOut <- removeSerieOut(dat = spatCorrectedVator,
serieOut = outVator,
reason = "slope")
```

#### Impact of the number of knots on the smoothing

For one plant of the same data set, we fit the P-spline with 10 or 50 knots and visualize the predictions to compare the smoothness. We advise the user to perform tests of the number of knots on a subset of plants before running the function on all plants.

*With 10 knots:*

```
sp10k <- fitSpline(inDat = spatCorrectedVator,
trait = "EffpsII_corr",
plotIds = "c10r29",
knots = 10)
plot(sp10k)
```

The predicted curve is very smooth and, in this case, it is not following precisely the real data curve shape. When comparing the plants of a same genotype with this curve shape we might not identify the outlying plant.

*With 50 knots:*

```
sp50k <- fitSpline(inDat = spatCorrectedVator,
trait = "EffpsII_corr",
plotIds = "c10r29",
knots = 50)
plot(sp50k)
```

The predicted curve is less smooth and follows the actual curve shape. This seems to be a good setting to detect strange curve shape among the replicates of a genotype.

### Example 2

The data from the PhenoArch platform have been corrected for spatial trends and individually outlying observations have been removed (see **statgenHTP tutorial: 2. Outlier detection for single observations** and **statgenHTP tutorial: 3. Correction for spatial trends**).

```
data(spatCorrectedArch)
subGenoArch <- c("GenoA01", "GenoA02", "GenoA34", "GenoA04", "GenoB01", "GenoB02", "GenoB07")
fit.splineArch <- fitSpline(inDat = spatCorrectedArch,
trait = "LeafArea_corr",
genotypes = subGenoArch,
knots = 30,
minNoTP = 18)
predDatArch <- fit.splineArch$predDat
coefDatArch <- fit.splineArch$coefDat
```

```
plot(fit.splineArch,
plotIds = "c11r9",
plotType = "predictions")
```

```
plot(fit.splineArch,
plotIds = "c11r9",
plotType = "derivatives")
```

Here, the `geno.decomp`

option is also to split the plants of each genotype those under well watered and water deficit conditions. The outlier detection is run per treatment with a narrow angle threshold

```
outArch <- detectSerieOut(corrDat = spatCorrectedArch,
predDat = predDatArch,
coefDat = coefDatArch,
trait = "LeafArea_corr",
genotypes = subGenoArch,
thrCor = 0.9,
thrPca = 10,
thrSlope = 0.8,
geno.decomp = "geno.decomp")
```

plotId | genotype | geno.decomp | reason | value |
---|---|---|---|---|

c13r6 | GenoA01 | WD_Panel1 | mean corr | 0.893585 |

c13r6 | GenoA01 | WD_Panel1 | angle | 14.472281 |

c16r2 | GenoA01 | WD_Panel1 | angle | 13.054901 |

c20r24 | GenoA01 | WD_Panel1 | angle | 12.805672 |

c21r56 | GenoA01 | WD_Panel1 | angle | 10.462828 |

c24r26 | GenoA01 | WD_Panel1 | angle | 13.075548 |

`plot(outArch, genotypes = "GenoA01", geno.decomp = "WD_Panel1")`

```
spatCorrectedArchOut <- removeSerieOut(dat = spatCorrectedArch,
serieOut = outArch)
# Check for time series outliers in data from which individual outlying
# observations were already removed and to which a spatial adjustment has been applied
head(spatCorrectedArch[spatCorrectedArch$plotId == "c16r2",
c("LeafArea_corr", "LeafArea")])
#> LeafArea_corr LeafArea
#> 1 0.002564006 0.002871676
#> 2173 0.004164635 0.004066799
#> 3843 0.005768308 0.005690233
#> 5509 0.008211606 0.008043967
#> 7180 0.009454458 0.010399790
#> 10049 0.015047478 0.015390857
# Check the same value in the new corrected data.frame
head(spatCorrectedArchOut[spatCorrectedArchOut$plotId == "c16r2",
c("LeafArea_corr", "LeafArea")])
#> LeafArea_corr LeafArea
#> 1 NA 0.002871676
#> 2173 NA 0.004066799
#> 3843 NA 0.005690233
#> 5509 NA 0.008043967
#> 7180 NA 0.010399790
#> 10049 NA 0.015390857
```

### Example 3

The data from the RootPhAir platform have not been corrected for spatial trends but individually outlying observations have been removed (see **statgenHTP tutorial: 2. Outlier detection for single observations**).

```
subGenoRoot <- c( "2", "6", "8", "9", "10", "520", "522")
fit.splineRoot <- fitSpline(inDat = noCorrectedRoot,
trait = "tipPos_y",
knots = 10,
genotypes = subGenoRoot,
minNoTP = 0,
useTimeNumber = TRUE,
timeNumber = "thermalTime")
predDatRoot <- fit.splineRoot$predDat
coefDatRoot <- fit.splineRoot$coefDat
row.names(coefDatRoot) <- 1:nrow(coefDatRoot)
```

```
plot(fit.splineRoot,
genotypes = "2")
```

```
outRoot <- detectSerieOut(corrDat = noCorrectedRoot,
predDat = predDatRoot,
coefDat = coefDatRoot,
trait = "tipPos_y",
genotypes = subGenoRoot,
thrCor = 0.9,
thrPca = 20,
thrSlope = 0.7)
```

```
plot(outRoot,
genotypes = "6")
```

```
noCorrectedRootOut <- removeSerieOut(dat = noCorrectedRoot,
serieOut = outRoot)
# Check one value annotated as outlier in the original corrected data.frame
head(noCorrectedRoot[noCorrectedRoot$plotId == "A_21_2", "tipPos_y"])
#> [1] 3.042857 3.057143 NA 2.317857 2.325000 2.371429
# Check the same value in the new corrected data.frame
head(noCorrectedRootOut[noCorrectedRootOut$plotId == "A_21_2", "tipPos_y"])
#> [1] NA NA NA NA NA NA
```

## Detection of outlier plants on a multi-criteria basis, with expert rules

An outlier plant is defined as a biological replicate deviating from the overall distribution of plants on a multi-criteria basis, regardless of the quality of measurements. Detecting outlier plants can be done by monitoring a single criterion, such as plant height or biomass. In this case, a procedure for detecting outlying series of observations should be used, as discussed in the previous sections. But taking into account only one criterion can sometimes be restrictive in deciding whether a plant is outlier or not. A multi-criteria approach will then be more relevant. A multi-criteria method considers several traits jointly, with rules set by experts depending on the species. We describe below the approach followed on a maize experiment, see (Alvarez Prado et al. 2019).

### Rules in the case of maize

We consider two categories of potentially outlier plants, namely apparently **too small** or **too large** plants. For the detection of unexpectedly small plants with likely physiological disorders, the progression of leaf stages was considered in addition to the time course of shoot biomass. Indeed, leaf appearance rate carries a non-redundant information compared with biomass (confirmed by standard correlation calculations). It usually presents a low plant-to-plant variability except in case of severe disorders, and is relatively insensitive to environmental cues other than temperature. For the detection of unexpectedly large plants, potentially associated with wrong genotype identification, combining plant height and biomass can result in an efficient identification.

### Statistical modelling: a mixed model at a given time (defined by the expert)

Each of the selected traits was measured (or estimated) at a specific time (for example 24 d20°C for the PhenoArch data set), time just before the beginning of the differentiation of the two watering treatments. This allows to have more replicates per genotype. It also reduces the dimensionality of the time courses to only one point, which will simplify the statistical models to be implemented later.

As shown above, traits are modeled with a mixed model that considers fixed experiment (Env) effect and random genotypic (G), replicate (R) and spatial (C) effects. The model can be fitted with the SpATS R-package, (Rodríguez-Álvarez et al. (2018)). Residuals (deviations) can be directly computed from the fitting, with a confidence interval. Plants, whose deviations for the criteria of leaf appearance rate and biomass are less than the lower bound of this interval, are considered as **outlier small plants**. Plants, whose deviations for the criteria of plant height and biomass are greater than the upper bound of the confidence interval, are considered as **outlier large plants**.

### Implementation in statgenHTP

Mixed models are fitted with the R-package SpATS through the use of the `detectSingleOutMaize`

function, which was developed specifically for the maize data, as described above. This function fits a mixed model for each parameter, and tests whether the residual deviations are lower (for the criteria considered for small plants) or higher (for the criteria considered for big plants) than a specified threshold.

```
phenoTParch <- createTimePoints(dat = PhenoarchDat1,
experimentName = "Phenoarch",
genotype = "Genotype",
timePoint = "Date",
plotId = "pos",
rowNum = "Row",
colNum = "Col")
test2 <- detectSingleOutMaize(phenoTParch,
timeBeforeTrt = "2017-04-27",
trait1 = "Biomass",
trait2 = "PlantHeight",
trait3 = "phyllocron")
```

The `detectSingleOutMaize`

returns a list of 3 elements :

- modDat: a data.frame with the used data set, the fitted values and residuals calculated by the models, and the plants flagged as outlier
- smallPlants: a data.frame of the “small” plants detected as outliers
- bigPlants: a data.frame of the “big” plants detected as outliers

The table below shows the smallPlants data.frame.

timePoint | plotId | genotype | Scenario | population | |
---|---|---|---|---|---|

5811 | 2017-04-27 | c5r3 | GenoA06 | WW | Panel1 |

8476 | 2017-04-27 | c6r57 | GenoA57 | WW | Panel1 |

8594 | 2017-04-27 | c7r2 | GenoB07 | WD | Panel2 |

8986 | 2017-04-27 | c7r18 | GenoB11 | WD | Panel2 |

17106 | 2017-04-27 | c12r50 | GenoA30 | WD | Panel1 |

18085 | 2017-04-27 | c13r28 | GenoA27 | WD | Panel1 |

### References

Alvarez Prado, Santiago, Isabelle Sanchez, Llorenç Cabrera-Bosquet, Antonin Grau, Claude Welcker, François Tardieu, and Nadine Hilgert. 2019. “To Clean or Not to Clean Phenotypic Datasets for Outlier Plants in Genetic Analyses?” *Journal of Experimental Botany* 70 (15): 3693–8. https://doi.org/10.1093/jxb/erz191.

Currie, I D, and M Durban. 2002. “Flexible Smoothing with P-Splines: A Unified Approach.” *Statistical Modelling* 2 (4): 333–49. https://doi.org/10.1191/1471082x02st039ob.

Eilers, Paul H. C., Brian D. Marx, and Maria Durbán. 2015. “Twenty Years of P-Splines.” *Statistics and Operations Research Transactions* 39 (2): 149–86.

Eubank, Randall L. 1999. *Nonparametric Regression and Spline Smoothing*. Statistics: A Series of Textbooks and Monographs. CRC Press.

Hugelier, S., O. Devos, and C. Ruckebusch. 2016. “Chapter 14 - a Smoothness Constraint in Multivariate Curve Resolution-Alternating Least Squares of Spectroscopy Data.” In *Resolving Spectral Mixtures*, edited by Cyril Ruckebusch, 30:453–76. Data Handling in Science and Technology. Elsevier. https://doi.org/10.1016/B978-0-444-63638-6.00014-0.

Rodríguez-Álvarez, María, Martin P. Boer, Fred van Eeuwijk, and Paul H. C. Eilers. 2018. “Correcting for Spatial Heterogeneity in Plant Breeding Experiments with P-Splines.” *Spatial Statistics* 23 (October): 52–71. https://doi.org/10.1016/j.spasta.2017.10.003.