# statgenHTP tutorial: 2. Outlier detection for single observations

#### Emilie Millet, Bart-Jan van Rossum, Isabelle Sanchez, Nadine Hilgert

#### 2022-12-13

Source:`vignettes/vignettesSite/OutlierSingleObs_HTP.Rmd`

`OutlierSingleObs_HTP.Rmd`

## Introduction

An outlier is usually defined as an observation that appears to be inconsistent with the remainder of the data set (Barnett and Lewis 1994). Observations may be single time points (Grubbs 1950) or whole time courses of one or more variables (Hubert, Rousseeuw, and Segaert 2015). An illustration is given below, where we see the difficulty of deciding whether a biomass time course (red curve at the bottom left) is atypical in a set of time courses from plants of the same genotype, or whether this slow growth is due to the plant’s position in the platform.

The concept of outlier can be extended to “outlier plants”, defined here as biological replicates deviating from the overall distribution of plants on a multi-criteria basis, regardless of the quality of measurements (Alvarez Prado et al. 2019).

This documents describes the procedures to detect outlying single observations in a time course with examples on experimental data measured directly on the plants or indirectly via image analysis for example.

## Definition of outlying single observation

Time courses of phenotypic data are viewed as continuous time-related functions. The first cleaning step consists of roughly checking the consistency of each point with its neighbors within a time course. Outlying single observations are measurements that do not follow the expected behavior at a given time. The detection of outlying observations is done “one time course after the other”. Identifying and removing these points will facilitate the outlier detection of series of observations or plants, as proposed in **statgenHTP tutorial: 4. Outlier detection for series of observations**.

The detection requires fitting a model from the data, as a function of time. Two types of models can be used, one based on nonlinear parametric regression (Gompertz Model or sigmoidal function for example) and the other on non-parametric regression. Data annotation is based on the comparison of the experimental data with its estimated value from the model. If they differ significantly, the data will be annotated as suspect.

NOTE: We only consider non-parametric regression in this package because of its flexibility to fit any type of curve.

## Illustration of the non-parametric function

Local regression is a well known smoothing technique which locally approximates an unknown function by parametric functions. It is a two-step procedure included in the `detectSingleOut`

function: the `locfit()`

function fits a local regression at a set of points, then, the `predict()`

function is used to interpolate this fit to other points. A confidence interval can then be calculated. Points outside this interval will be annotated as outliers.

The user can set:

the smoothing parameter

`nnLocfit`

. This is the parameter`nn`

from the locfit package (Nearest neighbor component of the smoothing parameter). This parameter values range from 0 to 1, with the higher the value, the smoother the curve.the level at which the confidence interval

`confIntSize`

is calculated. A large value calculates a wide interval and fewer outliers are annotated.the parameter

`checkEdges`

. By setting this parameter to`TRUE`

(the default), before the local regression a check is done to determine if the first and last time point in the time series for a plot are outliers. The`locfit`

function sometimes has problems determining these outliers. The use of this parameter is demonstrated in example 2.

The functions are illustrated with the three example data sets. For more information about the data, see **statgenHTP tutorial: 1. Introduction, data description and preparation**.

### Example 1

First a `TP`

object is created containing all the time points.

```
PhenovatorDat1 <- PhenovatorDat1[!PhenovatorDat1$pos %in% c("c24r41", "c7r18", "c7r49"), ]
## Create a TP object containing the data from the Phenovator.
phenoTP <- createTimePoints(dat = PhenovatorDat1,
experimentName = "Phenovator",
genotype = "Genotype",
timePoint = "timepoints",
repId = "Replicate",
plotId = "pos",
rowNum = "y", colNum = "x",
addCheck = TRUE,
checkGenotypes = c("check1", "check2", "check3", "check4"))
```

We then run the function on a subset of plants to set the parameters, using the `plotIds`

option. The smoothing parameter `nnLocfit`

is set at 0.1. The curves are atypical and to detect only the very strange points, fitted curves cannot be too smooth. The confidence interval size `confIntSize`

is set at 3 which is a relatively narrow interval.

```
# First select a subset of plants, for example here 8 plants.
plantSel <- c("c1r17","c13r17","c6r51","c21r24","c5r54","c21r32","c14r48","c5r26")
# Then run on the subset
resuVatorHTP <- detectSingleOut(TP = phenoTP,
trait = "EffpsII",
plotIds = plantSel,
confIntSize = 3,
nnLocfit = 0.1)
```

plotId | timePoint | EffpsII | yPred | sd_yPred | lwr | upr | outlier |
---|---|---|---|---|---|---|---|

c13r17 | 2018-06-03 09:07:00 | 0.751 | 0.6852574 | 0.0389194 | 0.5684992 | 0.8020156 | 0 |

c13r17 | 2018-06-03 11:37:00 | 0.538 | 0.6833963 | 0.0255016 | 0.6068916 | 0.7599011 | 1 |

c13r17 | 2018-06-03 14:37:00 | 0.737 | 0.7045487 | 0.0342136 | 0.6019078 | 0.8071896 | 0 |

c13r17 | 2018-06-04 09:07:00 | 0.619 | 0.6330957 | 0.0436610 | 0.5021126 | 0.7640788 | 0 |

c13r17 | 2018-06-04 11:37:00 | 0.711 | 0.6655915 | 0.0322765 | 0.5687620 | 0.7624211 | 0 |

c13r17 | 2018-06-04 14:37:00 | 0.633 | 0.6861254 | 0.0330415 | 0.5870010 | 0.7852499 | 0 |

The function output is a `data.frame`

containing for each plant at each time point: the predicted value, `yPred`

, the standard deviation of the prediction, `sd_yPred`

, the limits of the confidence interval, `lwr`

and `upr`

, and the `outlier`

status. When a point is annotated as outlying, the value of the outlier column is 1, it is 0 otherwise.

The predicted values can be visualized to adjust the smoothing parameter and the confidence interval using the `plot`

function. The option `outOnly`

, default value `TRUE`

, enables visualizing only plants with annotated outliers. Here, we are visualizing one plant and then all the selected plants to check the prediction.

```
plot(resuVatorHTP,
plotIds = "c13r17",
outOnly = FALSE)
```

```
plot(resuVatorHTP,
plotIds = plantSel,
outOnly = FALSE)
```

In the plots, the black dots are the raw data, the red line is the predicted curve, the green lines are the lower and upper limits of the confidence interval. Outlying points are highlighted in blue.

The annotated points can be replaced by NA for the studied trait using the function `removeSingleOut`

. It creates a new `TP`

object.

```
phenoTPOut <- removeSingleOut(phenoTP,
resuVatorHTP)
# Check one value annotated as outlier in the original TP object
phenoTP[[16]][phenoTP[[16]]$plotId=="c14r32", c("plotId", "EffpsII")]
#> plotId EffpsII
#> 16 c14r32 0.63
# Check the same value in the new TP object
phenoTPOut[[16]][phenoTPOut[[16]]$plotId=="c14r32", c("plotId", "EffpsII")]
#> plotId EffpsII
#> 16 c14r32 0.63
```

### Example 2

First a `TP`

object is created containing all the time points.

```
phenoTParch <- createTimePoints(dat = PhenoarchDat1,
experimentName = "Phenoarch",
genotype = "Genotype",
timePoint = "Date",
plotId = "pos",
rowNum = "Row",
colNum = "Col")
```

We then run the function on a subset of plants to set the parameters, using the `plotIds`

option. The smoothing parameter `nnLocfit`

is set at 0.5. The curves are linear and medium smooth is enough. The confidence interval size `confIntSize`

is set at 5 which is a medium interval.

```
# First select a subset of plants, for example here 7 plants
plantSelArch <- c("c22r7", "c8r22", "c1r10", "c21r43", "c14r15", "c25r12", "c6r54")
# Then run on the subset
resuArchHTP <- detectSingleOut(TP = phenoTParch,
trait = "LeafArea",
plotIds = plantSelArch,
confIntSize = 5,
nnLocfit = 0.5)
```

plotId | timePoint | LeafArea | yPred | sd_yPred | lwr | upr | outlier | |
---|---|---|---|---|---|---|---|---|

22 | c14r15 | 2017-05-09 | 0.3095688 | 0.2998288 | 0.0023225 | 0.2882164 | 0.3114412 | 0 |

23 | c14r15 | 2017-05-11 | 0.3268933 | 0.3305959 | 0.0023188 | 0.3190019 | 0.3421899 | 0 |

24 | c14r15 | 2017-05-12 | 0.3420478 | 0.3450079 | 0.0028001 | 0.3310072 | 0.3590086 | 0 |

25 | c14r15 | 2017-05-13 | 0.3621102 | 0.3587838 | 0.0037381 | 0.3400936 | 0.3774741 | 0 |

26 | c14r15 | 2017-05-15 | 0.0910594 | 0.3871152 | 0.0062159 | 0.3560358 | 0.4181947 | 1 |

The prediction values can be visualized to adjust the smoothing parameter and the confidence interval using the `plot`

function.

```
plot(resuArchHTP,
plotIds = "c14r15",
outOnly = FALSE)
```

```
plot(resuArchHTP,
plotIds = plantSelArch,
outOnly = FALSE)
```

In the plots, the black dots are the raw data, the red line is the predicted curve, the green lines are the lower and upper limits of the confidence interval. Outlying points are highlighted in blue.

It is possible to run the outlier detection without first checking if the first and last time point are outlying. Doing this could be useful is some cases, but might give undesired results as well as the example below shows.

```
# Detect outliers without first checking edges.
resuArchHTPNoEdges <- detectSingleOut(TP = phenoTParch,
trait = "LeafArea",
plotIds = "c14r15",
checkEdges = FALSE,
confIntSize = 5,
nnLocfit = 0.5)
```

`plot(resuArchHTPNoEdges)`

The annotated points can be replaced by NA for the studied trait using the function `removeSingleOut`

. It creates a new `TP`

object.

```
phenoTParchOut <- removeSingleOut(phenoTParch,
resuArchHTP)
# Check one value annotated as outlier in the original TP object
phenoTParch[[30]][phenoTParch[[30]]$plotId == "c12r31", c("plotId", "LeafArea")]
#> plotId LeafArea
#> 16655 c12r31 0.2447164
# Check the same value in the new TP object
phenoTParchOut[[30]][phenoTParchOut[[30]]$plotId == "c12r31", c("plotId", "LeafArea")]
#> plotId LeafArea
#> 16655 c12r31 0.2447164
```

### Example 3

First a `TP`

object is created containing all the time points.

```
rootTP <- createTimePoints(dat = RootDat1,
experimentName = "UCL1",
genotype = "Genotype",
timePoint = "Time",
plotId = "plantId",
rowNum = "Strip",
colNum = "Pos")
```

We then run the function on a subset of plants to set the parameters, using the `plotIds`

option. The smoothing parameter `nnLocfit`

is set at 1. The curves are almost linear and a strong smooth is enough to get an accurate curve shape (line). The confidence interval size `confIntSize`

is set at 5 which is a medium interval.

```
# First select a subset of plants, for example here 7 plants
plantSelRoot <- unique(RootDat1$plantId)[1:7]
# Then run on the subset
resuRootHTP <- detectSingleOut(TP = rootTP,
trait = "tipPos_y",
plotIds = plantSelRoot,
confIntSize = 5,
nnLocfit = 1)
```

plotId | timePoint | tipPos_y | yPred | sd_yPred | lwr | upr | outlier | |
---|---|---|---|---|---|---|---|---|

38 | A_01_2 | 2016-11-12 03:21:58 | 6.775000 | 6.355150 | 0.1571693 | 5.569304 | 7.140996 | 0 |

39 | A_01_2 | 2016-11-12 05:41:17 | 6.760714 | 6.451046 | 0.1570347 | 5.665872 | 7.236219 | 0 |

40 | A_01_2 | 2016-11-12 14:57:43 | 6.775000 | 6.856314 | 0.1555333 | 6.078648 | 7.633981 | 0 |

41 | A_01_2 | 2016-11-12 17:16:51 | 6.921429 | 6.965262 | 0.1550652 | 6.189936 | 7.740588 | 0 |

42 | A_01_2 | 2016-11-13 04:52:57 | 6.014286 | 7.576345 | 0.1541887 | 6.805402 | 8.347288 | 1 |

43 | A_01_2 | 2016-11-13 07:12:14 | 5.728571 | 7.711560 | 0.1549867 | 6.936627 | 8.486493 | 1 |

The prediction values can be visualized to adjust the smoothing parameter and the confidence interval using the `plot`

function.

```
plot(resuRootHTP,
plotIds = "A_03_2",
outOnly = FALSE)
```

```
plot(resuRootHTP,
plotIds = plantSelRoot,
outOnly = FALSE)
```

In the plots, the black dots are the raw data, the red line is the predicted curve, the green lines are the lower and upper limits of the confidence interval. Outlying points are highlighted in blue.

The annotated points can be replaced by NA for the studied trait using the function `removeSingleOut`

. It creates an new `TP`

object.

```
rootTPOut <- removeSingleOut(rootTP,
resuRootHTP)
# Check one value annotated as outlier in the original TP object
rootTP[[11127]][rootTP[[11127]]$plotId == "A_01_2", c("plotId", "tipPos_y")]
#> plotId tipPos_y
#> 42 A_01_2 6.014286
# Check the same value in the new TP object
rootTPOut[[11127]][rootTPOut[[11127]]$plotId == "A_01_2", c("plotId", "tipPos_y")]
#> plotId tipPos_y
#> 42 A_01_2 NA
```

### References

Alvarez Prado, Santiago, Isabelle Sanchez, Llorenç Cabrera-Bosquet, Antonin Grau, Claude Welcker, François Tardieu, and Nadine Hilgert. 2019. “To Clean or Not to Clean Phenotypic Datasets for Outlier Plants in Genetic Analyses?” *Journal of Experimental Botany* 70 (15): 3693–8. https://doi.org/10.1093/jxb/erz191.

Barnett, Vic, and Tobby Lewis. 1994. *Outliers in Statistical Data*. Wiley Series in Probability and Statistics. Wiley. https://books.google.nl/books?id=B44QAQAAIAAJ.

Grubbs, Frank E. 1950. “Sample Criteria for Testing Outlying Observations.” *Ann. Math. Statist.* 21 (1): 27–58. https://doi.org/10.1214/aoms/1177729885.

Hubert, Mia, Peter J. Rousseeuw, and Pieter Segaert. 2015. “Multivariate Functional Outlier Detection.” *Statistical Methods & Applications* 24 (2): 177–202. https://doi.org/10.1007/s10260-015-0297-8.