Skip to contents

The statgenQTLxT package performs multi-trait and multi-environment Genome Wide Association Studies (GWAS), following the approach of . It builds on the statgenGWAS package (for single trait GWAS) which is available from CRAN. The package uses data structures and plots defined in the statgenGWAS package. It is recommended to read the vignette of this package, accessible in R via vignette(package = "statgenGWAS") or online at https://biometris.github.io/statgenGWAS/articles/GWAS.html to get a general idea of those.

## Multi-Trait GWAS

### Theoretical background

Multi-trait GWAS in the statgenQTLxT package estimates and tests the effect of a SNP in different trials or on different traits, one SNP at a time. Genetic and residual covariances are fitted only once, for a model without SNPs. Given balanced data on $$n$$ genotypes and $$p$$ traits (or trials) we fit a mixed model of the form

$$Y = \left(\begin{array}{c} Y_1 \\ \vdots \\ Y_p\end{array}\right) = \left(\begin{array}{c} X\gamma_1 \\ \vdots \\ X\gamma_p\end{array}\right) + \left(\begin{array}{c} x\beta_1 \\ \vdots \\ x\beta_p\end{array}\right) + \left(\begin{array}{c} G_1 \\ \vdots \\ G_p\end{array}\right) + \left(\begin{array}{c} E_1 \\ \vdots \\ E_p\end{array}\right)$$

where $$Y_1$$ to $$Y_p$$ are $$n × 1$$ vectors with the phenotypic values for traits or trials $$1, \ldots, p$$. $$x$$ is the $$n × 1$$ vector of scores for the marker under consideration, and $$X$$ the $$n × q$$ design matrix for the other covariates. By default only a trait (environment) specific intercept is included. The vector containing the genetic background effects $$G_1, \ldots, G_p$$ is Gaussian with zero mean and covariance $$V_g \otimes K$$, where $$V_g$$ is a $$p × p$$ matrix of genetic (co)variances, and $$K$$ an $$n × n$$ kinship matrix. Similarly, the residual errors ($$E_1, \ldots, E_p$$) have covariance $$V_e \otimes I_n$$, for a $$p × p$$ matrix $$V_e$$ of residual (co)variances.

#### Hypotheses for the SNP-effects

For each SNP, the null-hypothesis $$\beta_1 = \dots = \beta_p = 0$$ is tested, using the likelihood ratio test (LRT) described in . If estCom = TRUE, additional tests for a common effect and for QTL-by-trait (QTL × T) or QTL-by-environment (QTL × E) are performed, using the parameterization $$\beta_j = \alpha + \alpha_j (1 \leq j \leq p)$$. As in , we use likelihood ratio tests, but not restricted to the bivariate case. For the common effect, we fit the reduced model $$\beta_j = \alpha$$, and test if $$\alpha = 0$$. For the interactions, we test if $$\alpha_1 = \dots = \alpha_p = 0$$.

#### Models for the genetic and residual covariance

$$V_g$$ and $$V_e$$ can be provided by the user (fitVarComp = FALSE); otherwise one of the following models is used, depending on covModel. If covModel = "unst", an unstructured model is assumed, as in : $$V_g$$ and $$V_e$$ can be any positive-definite matrix, requiring a total of $$p(p+1)/2$$ parameters per matrix. If covModel = "fa", a factor-analytic model is fitted using an EM-algorithm, as in . $$V_g$$ and $$V_e$$ are assumed to be of the form $$W W^t + D$$, where $$W$$ is a $$p × m$$ matrix of factor loadings and $$D$$ a diagonal matrix with trait or environment specific values. $$m$$ is the order of the model, and the arguments mG and mE specify the order used for respectively $$V_g$$ and $$V_e$$. maxIter sets the maximum number of iterations used in the EM-algorithm. Finally, if covModel = "pw", $$V_g$$ and $$V_e$$ are estimated ‘pairwise’, as in . Looping over pairs of traits or trials $$1 \leq j < k \leq p$$, $$V_g[j,k] = V_g[k,j]$$ and $$V_e[j,k] = V_e[k,j]$$ are estimated assuming a bivariate mixed model. The diagonals of $$V_g$$ and $$V_e$$ are fitted assuming univariate mixed models. If the resulting $$V_g$$ or $$V_e$$ is not positive-definite, they are replaced by the nearest positive-definite matrix. In case covModel = "unst" or "pw" it is possible to assume that $$V_e$$ is diagonal (VeDiag = TRUE).

## The class gData

Data for analysis on genomic data comes from different sources and is stored in one data object of class gData (genomic Data) for convenience. A gData object will contain all data needed for performing analyses, so the first thing to do when using the statgenQTLxT package is creating a gData object; see the statgenGWAS package for more details. A gData object contains the following components:

• Marker map, a data.frame describing the physical positions of the markers on the chromosomes,
• Marker matrix, a numerical matrix containing the genotyping,
• Phenotypic data, either a single data.frame or a list of data.frames containing the phenotypic data,
• Kinship matrix, a matrix describing the genetic relatedness between the different genotypes,
• Further covariates that can be used in the analyses.

In our examples below we will show how a gData object is created.

## Worked example 1: multiple traits in one trial

As examples of the functionality of the package two worked examples are provided using maize data from the European Union project DROPS. The data is available from https://doi.org/10.15454/IASSTN and the relevant data sets are included as data.frames in the statgenGWAS package. We will first show how to load the data and create a gData object. Users already familiar with the statgenGwas packages might want to skip this part and go straight to Running Multi-trait GWAS section.

dropsMarkers contains the coded marker information for 41,722 SNPs and 246 genotypes. dropsMap contains information about the positions of those SNPs on the B73 reference genome V2. dropsPheno contains data for the genotypic means (Best Linear Unbiased Estimators, BLUEs) for a subset of ten experiments, with one value per experiment per genotype, for eight traits. For a more detailed description of the contents of the data see help(dropsData, package = statgenGWAS).

### Create gData object

The first step is to create a gData object from the raw data that can be used for the GWAS analysis. For this the raw data has to be converted to a suitable format for a gData object, see help(createGData, package = statgenGWAS) and the statgenGWAS vignette for more details.

When running a multi-trait or multi-environment GWAS, all traits used in the analysis should be in the same data.frame, with genotype as first column and the phenotypic data in subsequent columns. In case of a multi-trait analysis the phenotypic columns contain different traits, measured in one environment, while for a multi-environment the columns correspond to the different environments (same trait). In both cases the data.frame may only contain phenotypic data. Additional covariates need to be stored in covar.

Below are some examples of what these data.frames should look like.

genotype Trait1 Trait2 Trait3
G1 0.3 17 277
G2 0.4 19 408
G3 0.5 17 206
G4 0.7 13 359
genotype Trait1-Trial1 Trait1-Trial2 Trait1-Trial3
G1 0.3 0.7 0.5
G2 0.4 0.9 0.1
G3 0.5 0.8 0.2
G4 0.7 0.5 0.4

In our first example, we want to perform multi-trait GWAS for one of the DROPS environments. dropsPheno contains genotypic means for six traits in ten trials. To run a multi-trait GWAS analysis for each of the ten trials, the data has to be added as a list of ten data.frames. Recall that these data.frames should have “genotype” as their first column and may only contain traits after that. Other columns need to be dropped.

The code below creates this list of data.frames from dropsPheno.

data(dropsPheno, package = "statgenGWAS")

## Convert phenotypic data to a list.
colnames(dropsPheno)[1] <- "genotype"
dropsPheno <- dropsPheno[c("Experiment", "genotype", "grain.yield", "grain.number",
"anthesis", "silking", "plant.height", "ear.height")]
## Split data by experiment.
dropsPhenoList <- split(x = dropsPheno, f = dropsPheno[["Experiment"]])
## Remove Experiment column.
## phenotypic data should consist only of genotype and traits.
dropsPhenoList <- lapply(X = dropsPhenoList, FUN = [, -1)

If the phenotypic data consists of only one trial/experiment, it can be added as a single data.frame without first converting it to a list. In that case createGData will convert the input to a list with one item.

Now a gData object containing map, marker information, and phenotypes can be created. Kinship matrix and covariates may be added later on.

## Load marker data.
data("dropsMarkers", package = "statgenGWAS")
## Add genotypes as row names of dropsMarkers and drop Ind column.
rownames(dropsMarkers) <- dropsMarkers[["Ind"]]
dropsMarkers <- dropsMarkers[, -1]

## Load genetic map.
data("dropsMap", package = "statgenGWAS")
## Add genotypes as row names of dropsMap.
rownames(dropsMap) <- dropsMap[["SNP.names"]]
## Rename Chromosome and Position columns.
colnames(dropsMap)[2:3] <- c("chr", "pos")

## Create a gData object containing map, marker and phenotypic information.
gDataDrops <- statgenGWAS::createGData(geno = dropsMarkers,
map = dropsMap,
pheno = dropsPhenoList)

To get an idea of the contents of the data a summary of the gData object can be made. This will give an overview of the content of the map and markers and also print a summary per trait per trial. Since there are ten trials and six traits in gDataDrops we restrict the output to one trial, using the trials argument of the summary function.

## Summarize gDataDrops.
summary(gDataDrops, trials = "Mur13W")
#> map
#>  Number of markers: 41722
#>  Number of chromosomes: 10
#>
#> markers
#>  Number of markers: 41722
#>  Number of genotypes: 246
#>  Content:
#>   0 1 2 NA
#>   0.28 0.01 0.71 0
#>
#> pheno
#>  Number of trials: 1
#>
#>  Mur13W:
#>      Number of traits: 6
#>      Number of genotypes: 246
#>
#>         grain.yield grain.number anthesis silking plant.height ear.height
#> Min.            3.3         1348       56      59          222        102
#> 1st Qu.         6.3         2641       61      64          251        125
#> Median          7.5         2965       63      66          258        133
#> Mean            7.4         2986       63      66          259        133
#> 3rd Qu.         8.4         3359       66      68          266        141
#> Max.           11.4         4510       71      74          294        172
#> NA's            0.0            0        0       0            0          0

### Recoding and cleaning of markers

Marker data has to be numerical and without missing values in order to do GWAS analysis. This can be achieved using the codeMarkers() function, which can also perform imputation of missing markers. The marker data available for the DROPS project has already been converted from A/T/C/G to 0/1/2. We still use the codeMarkers() function to further clean the markers, in this case by removing the duplicate SNPs.

## Set seed.
set.seed(1234)

## Remove duplicate SNPs from gDataDrops.
gDataDropsDedup <- statgenGWAS::codeMarkers(gDataDrops,
impute = FALSE,
verbose = TRUE)
#> Input contains 41722 SNPs for 246 genotypes.
#> 0 genotypes removed because proportion of missing values larger than or equal to 1.
#> 0 SNPs removed because proportion of missing values larger than or equal to 1.
#> 5098 duplicate SNPs removed.
#> Output contains 36624 SNPs for 246 genotypes.

Note that duplicate SNPs are removed at random. To get reproducible results make sure to set a seed.

To demonstrate the options of the codeMarkers() function, see help(codeMarkers, package = statgenGWAS) and the statgenGWAS vignette for more details.

### Running Multi-trait GWAS

The cleaned gData object can be used for performing multi-trait GWAS analysis. In this example the trial Mur13W is used to demonstrate the options of the runMultiTraitGwas() function, for a subset of five traits. As in we choose a factor analytic model for the genetic and residual covariance.

## Run multi-trait GWAS for 5 traits in trial Mur13W.
GWASDrops <- runMultiTraitGwas(gData = gDataDropsDedup,
traits = c("grain.yield","grain.number",
"anthesis", "silking" ,"plant.height"),
trials = "Mur13W",
covModel = "fa")

The output of the runMultiTraitGwas() function is an object of class GWAS. This is a list consisting of five elements described below.

1. GWAResult: a list of data.tables, one for each trial for which the analysis was run. Each data.table has the following columns:
 snp SNP name trait trait name chr chromosome on which the SNP is located pos position of the SNP on the chromosome pValue P-value for the SNP LOD LOD score for the SNP, defined as $$-\log_{10}(pValue)$$ effect effect of the SNP on the trait value effectSe standard error of the effect of the SNP on the trait value allFreq allele frequency of the SNP
head(GWASDrops$GWAResult$Mur13W)
#>              snp    trait chr    pos pValue   LOD effect effectSe allFreq
#> 1:         SYN83 anthesis   1   3498  0.928 0.033  -0.14     0.19    0.60
#> 2: PZE-101000060 anthesis   1 157104  0.185 0.732   0.50     0.21    0.72
#> 3: PZE-101000088 anthesis   1 238347  0.078 1.110  -0.65     0.25    0.84
#> 4: PZE-101000083 anthesis   1 239225  0.503 0.298  -0.16     0.18    0.58
#> 5: PZE-101000108 anthesis   1 255850  0.654 0.185   0.17     0.34    0.90
#> 6: PZE-101000111 anthesis   1 263938  0.467 0.331  -0.41     0.24    0.83

Ve = GWASDropsFA$GWASInfo$varComp$Ve) Parallel computing To improve performance when using a pairwise variance component model, it is possible to use parallel computing. To do this, a parallel back-end has to be specified by the user, e.g. by using registerDoParallel from the doParallel package (see the example below). In addition, in the runMultiTraitGwas() function the argument parallel has to be set to TRUE. With these settings the computations are done in parallel per pair of traits. ## Register parallel back-end with 2 cores. doParallel::registerDoParallel(cores = 2) ## Run multi-trait GWAS for one trait in all trials. GWASDropsxEPar <- runMultiTraitGwas(gData = gDataDropsDedupxE, covModel = "pw", parallel = TRUE) ### Covariates Covariates can be included as extra fixed effects in the GWAS model. The runMultiTraitGwas() function distinguishes between ‘usual’ covariates and SNP-covariates. The former could be design factors such as block, or other traits one wants to condition on. In the latter case, the covariate(s) are one or more of the markers contained in the genotypic data. SNP-covariates can be set with the argument snpCov, which should be a vector of marker names. Similarly, other covariates should be specified using the argument covar, containing a vector of covariate names. The gData object should contain these covariates in gData$covar.

In case SNP-covariates are used, GWAS for all the other SNPs is performed with the the SNP-covariates as extra fixed effect; also the null model used to estimate the variance components includes these effects. For each SNP in SNP-covariates, a P-value is obtained using the same F-test and null model to estimate the variance components, but with only all other SNPs (if any) in SNP-covariates as fixed effects.

## Run multi-trait GWAS for Mur13W.
## Use PZE-106021410, the most significant SNP, as SNP covariate.
GWASDropsSnpCov <- runMultiTraitGwas(gData = gDataDropsDedup,
trials = "Mur13W",
snpCov = "PZE-106021410",
covModel = "fa")

### Minor Allele Frequency

It is recommended to remove SNPs with a low minor allele frequency (MAF) from the data before starting a GWAS analysis. However it is also possible to do so in the analysis itself. The difference between these approaches is that codeMarkers() removes the SNPs, whereas runMultiTraitGwas() excludes them from the analysis but leaves them in the output (with results set to NA). In the latter case it will still be possible to see the allele frequency of the SNP.
By default all SNPs with a MAF lower than 0.01 are excluded from the analysis. This can be controlled by the argument MAF. Setting MAF to 0 will still exclude duplicate SNPs since duplicates cause problems when fitting the underlying models.

## Run multi-trait GWAS for Mur13W.
## Only include SNPs that have a MAF of 0.05 or higher.
GWASDropsMAF <- runMultiTraitGwas(gData = gDataDropsDedup,
trials = "Mur13W",
covModel = "fa",
MAF = 0.05)

### Estimation of common SNP effects and QTL×E effects.

Besides a normal SNP-effect model, it is possible to fit a common SNP-effect model as well (see Hypotheses for the SNP-effects). When doing so, in addition to the SNP-effect, also the common SNP-effect and the QTL×E effect and corresponding standard errors and P-values are returned. These are included as extra columns in the GWAResult data.table in the output of the function.

## Run multi-trait GWAS for Mur13W.
## Fit an additional common sNP effect model.
GWASDropsCommon <- runMultiTraitGwas(gData = gDataDropsDedup,
trials = "Mur13W",
covModel = "fa",
estCom = TRUE)
head(GWASDropsCommon$GWAResult$Mur13W)
#>              snp    trait chr    pos pValue    LOD effect effectSe allFreq pValCom effsCom effsComSe pValQtlE
#> 1:         SYN83 anthesis   1   3498   0.98 0.0087  -0.12     0.19    0.60   0.536  -0.028     0.045     0.98
#> 2: PZE-101000060 anthesis   1 157104   0.20 0.6991   0.47     0.20    0.72   0.029   0.103     0.048     0.58
#> 3: PZE-101000088 anthesis   1 238347   0.19 0.7142  -0.61     0.25    0.84   0.049  -0.115     0.059     0.44
#> 4: PZE-101000083 anthesis   1 239225   0.76 0.1207  -0.16     0.18    0.58   0.203  -0.054     0.043     0.88
#> 5: PZE-101000108 anthesis   1 255850   0.60 0.2253   0.15     0.34    0.90   0.774  -0.023     0.079     0.48
#> 6: PZE-101000111 anthesis   1 263938   0.60 0.2217  -0.37     0.24    0.83   0.248  -0.064     0.055     0.66

The SNPs with a significant QTLxE effect come from the same regions as the SNPs in supplementary table 6 in Millet et al. (2016), who however used a different SNP set. Another difference is that here we explicitly test for QTLxE within the GWAS, whereas Millet et al. (2016) regressed SNP effects on environmental covariates.

### References

Astle, William, and David J. Balding. 2009. “Population Structure and Cryptic Relatedness in Genetic Association Studies.” Statistical Science 24 (4): 451–71. https://doi.org/10.1214/09-sts307.
Brzyski, Damian, Christine B. Peterson, Piotr Sobczyk, Emmanuel J. Candès, Malgorzata Bogdan, and Chiara Sabatti. 2016. “Controlling the Rate of GWAS False Discoveries.” Genetics 205 (1): 61–75. https://doi.org/10.1534/genetics.116.193987.
Furlotte, N. A., and E. Eskin. 2015. “Efficient Multiple-Trait Association and Estimation of Genetic Correlation Using the Matrix-Variate Linear Mixed Model.” Genetics 200 (1): 59–68. https://doi.org/10.1534/genetics.114.171447.
Korte, Arthur, Bjarni J Vilhjálmsson, Vincent Segura, Alexander Platt, Quan Long, and Magnus Nordborg. 2012. “A Mixed-Model Approach for Genome-Wide Association Studies of Correlated Traits in Structured Populations.” Nature Genetics 44 (9): 1066–71. https://doi.org/10.1038/ng.2376.
Millet, Emilie J., Cyril Pommier, Mélanie Buy, Axel Nagel, Willem Kruijer, Therese Welz-Bolduan, Jeremy Lopez, et al. 2019. “A Multi-Site Experiment in a Network of European Fields for Assessing the Maize Yield Response to Environmental Scenarios.” Portail Data Inra. https://doi.org/10.15454/IASSTN.
Millet, Emilie J., Claude Welcker, Willem Kruijer, Sandra Negro, Stephane Nicolas, Sebastien Praud, Nicolas Ranc, et al. 2016. “Genome-Wide Analysis of Yield in Europe: Allelic Effects as Functions of Drought and Heat Scenarios.” Plant Physiology, July, pp.00621.2016. https://doi.org/10.1104/pp.16.00621.
Rincent, Renaud, Laurence Moreau, Hervé Monod, Estelle Kuhn, Albrecht E. Melchinger, Rosa A. Malvar, Jesus Moreno-Gonzalez, et al. 2014. “Recovering Power in Association Mapping Panels with Variable Levels of Linkage Disequilibrium.” Genetics 197 (1): 375–87. https://doi.org/10.1534/genetics.113.159731.
VanRaden, P. M. 2008. “Efficient Methods to Compute Genomic Predictions.” Journal of Dairy Science 91 (11): 4414–23. https://doi.org/10.3168/jds.2007-0980.
Zhou, Xiang, and Matthew Stephens. 2014. “Efficient Multivariate Linear Mixed Model Algorithms for Genome-Wide Association Studies.” Nature Methods 11 (4): 407–9. https://doi.org/10.1038/nmeth.2848.