Statistical framework

See Zheng et al. 2014, Zheng 2015, Zheng et al. 2015, and Zheng et al. 2018 for a detailed description of the Hidden Markov Model (HMM) framework of RABBIT. See Zheng et al. 2018(2) and Zheng et al. 2024 for a description of the genotype imputation algorithm. See Zheng et al. 2019 and Zheng et al. 2024(2) for a description of the genetic map construction algorithm.

The RABBIT HMM framework consists of two basic components: a hidden Markov process and a genotype data model.

Ancestral origin process

The hidden Markov process refers to the prior ancestral origin process, describing how ancestral origins change along two homologous chromosomes in a diploid offspring.

RABBIT has a keyword argument (keyarg) model that specifies the dependence between the prior ancestral origin processes of the two homologous chromosomes. It must be "depmodel", "indepmodel", or "jointmodel", denoting complete dependence, complete independence, or intermediate dependence, respectively. RABBIT uses the general "jointmodel" by default. For magicimpute, "depmodel" is recommended for almost homozygous populations, since it is much faster than the default "jointmodel".

Genotype data model

The genotype data model describes the (emission) probability of the observed genotypic data given the hidden ancestral origin state, and it varies with the genotype format. See Zheng et al. 2024 for a detailed description of the data model.

Discrete genotype (GT)

The data model for "GT" has a parameter describing the allelic error rate, that is, the probability that an error occurs on one allele. If an error occurs on an allele, that allele is observed as the other allele.

RABBIT introduces two likelihood parameters for "GT": foundererror and offspringerror, denoting the allelic error rates for founders and offspring, respectively.
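The allelic error model can be illustrated with a minimal Python sketch. It is not RABBIT code: the function names are hypothetical, and it assumes a biallelic marker, unphased genotypes, and errors that occur independently on the two alleles. The value passed as error would be foundererror for a founder and offspringerror for an offspring.

```python
def allele_prob(observed: int, true: int, error: float) -> float:
    """P(observed allele | true allele): an error flips the allele to the other one."""
    return 1.0 - error if observed == true else error


def gt_likelihood(observed: tuple, true: tuple, error: float) -> float:
    """P(observed unphased genotype | true unphased genotype).

    Genotypes are pairs of alleles coded 0 or 1, e.g. (0, 1).
    Assumption (not from the source): errors on the two alleles are independent.
    """
    o1, o2 = observed
    t1, t2 = true
    # Probability of observing the alleles in the given order.
    p = allele_prob(o1, t1, error) * allele_prob(o2, t2, error)
    # An unphased heterozygous observation can arise in either allele order.
    if o1 != o2:
        p += allele_prob(o2, t1, error) * allele_prob(o1, t2, error)
    return p
```

For any true genotype, the probabilities of the three possible observations (0, 0), (0, 1), and (1, 1) sum to one, which is a quick sanity check on the sketch.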

Allelic depth (AD)

Sequence reads are assumed to be generated in two steps: (1) true genotypes are mis-aligned using the random allele model with the allelic error rates foundererror and offspringerror, and (2) conditional on the mis-aligned genotypes, sequence reads are sampled with the parameters seqerror, allelebalancemean, allelebalancedisperse, and alleledropout.

We extend the binomial model such that a heterozygous genotype is modeled by a 0- and 1-inflated beta-binomial distribution. Conditional on the mis-aligned genotypes, we model sequence reads as follows:

\[\begin{aligned} P(r0,r1|0/0) &= B(r0|n, 1-seqerror) \\ P(r0,r1|0/1) &= \text{Inf-BetaBin}(r0|n, seqerror, allelebalancemean, allelebalancedisperse, alleledropout) \\ P(r0,r1|1/1) &= B(r0|n, seqerror) \end{aligned}\]

where r0 is the read count for allele 0, r1 is the read count for allele 1, $n=r0+r1$, and $B(r0|n,p)$ is the binomial distribution. Conditional on a heterozygous genotype, there are three cases: with probability alleledropout/2, allele 0 is dropped out and P(r0,r1|0/1) is given by P(r0,r1|1/1); with probability alleledropout/2, allele 1 is dropped out and P(r0,r1|0/1) is given by P(r0,r1|0/0); otherwise, with probability 1-alleledropout, P(r0,r1|0/1) follows a beta-binomial distribution $BetaBin(r0|n, \alpha, \beta)$. The beta-binomial distribution can be regarded as the compound distribution that results from $r0 \sim B(r0|n,p)$ with $p \sim Beta(\alpha, \beta)$, where p is the allelic balance for an offspring. We reparameterize $\alpha$ and $\beta$ as follows:

\[\begin{aligned} \alpha &= allelebalancemean/allelebalancedisperse \\ \beta &= (1-allelebalancemean)/allelebalancedisperse \end{aligned}\]
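The read model above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the RABBIT implementation: the function names are hypothetical, and in the heterozygous case seqerror enters only through the two dropout components, following the description in the text. It requires allelebalancedisperse > 0.

```python
from math import comb, exp, lgamma


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability B(k | n, p)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function, computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabin_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """Beta-binomial probability BetaBin(k | n, alpha, beta)."""
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def read_likelihood(r0, r1, genotype, seqerror, allelebalancemean,
                    allelebalancedisperse, alleledropout):
    """P(r0, r1 | genotype) for genotype "0/0", "0/1", or "1/1"."""
    n = r0 + r1
    p_hom0 = binom_pmf(r0, n, 1.0 - seqerror)  # P(r0, r1 | 0/0)
    p_hom1 = binom_pmf(r0, n, seqerror)        # P(r0, r1 | 1/1)
    if genotype == "0/0":
        return p_hom0
    if genotype == "1/1":
        return p_hom1
    if genotype != "0/1":
        raise ValueError(f"unknown genotype: {genotype}")
    # Reparameterization of the beta-binomial shape parameters.
    alpha = allelebalancemean / allelebalancedisperse
    beta = (1.0 - allelebalancemean) / allelebalancedisperse
    p_het = betabin_pmf(r0, n, alpha, beta)
    # Mixture: allele 0 dropped (looks like 1/1), allele 1 dropped
    # (looks like 0/0), or no dropout (beta-binomial).
    half = alleledropout / 2.0
    return half * p_hom1 + half * p_hom0 + (1.0 - alleledropout) * p_het
```

For example, read_likelihood(3, 7, "0/1", 0.001, 0.5, 0.1, 0.05) evaluates the heterozygous likelihood of 3 reads for allele 0 and 7 reads for allele 1; the parameter values here are arbitrary illustrations, not RABBIT defaults.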

Sequence read model
  • $BetaBin(n, \alpha, \beta) \rightarrow B(n, allelebalancemean)$, as $allelebalancedisperse \rightarrow 0$.
  • The sequence read model reduces to the binomial model as $allelebalancedisperse \rightarrow 0$ and $allelebalancemean \rightarrow 0.5$.
  • $allelebalancemean - 0.5$ measures allelic balance bias, and $allelebalancedisperse$ measures allelic balance overdispersion.
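
The first limit and the overdispersion interpretation can be checked from the standard moments of the beta-binomial distribution. The derivation below is a textbook calculation rather than one taken from the RABBIT papers; $m$ and $d$ are shorthand introduced here for allelebalancemean and allelebalancedisperse.

\[\begin{aligned} \alpha + \beta &= 1/d \\ E[r0] &= \frac{n\alpha}{\alpha+\beta} = n m \\ Var[r0] &= \frac{n\alpha\beta(\alpha+\beta+n)}{(\alpha+\beta)^2(\alpha+\beta+1)} = n m (1-m) \frac{1+nd}{1+d} \end{aligned}\]

The factor $(1+nd)/(1+d)$ tends to 1 as $d \rightarrow 0$, which recovers the binomial variance $n m (1-m)$. For $d > 0$ and $n > 1$ the factor exceeds 1, so the read counts are more variable than under the binomial model.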

References

Zheng, Chaozhi, Martin P Boer, and Fred A Van Eeuwijk. 2014. “A General Modeling Framework for Genome Ancestral Origins in Multiparental Populations.” Genetics 198 (1): 87–101. https://doi.org/10.1534/genetics.114.163006.

———. 2015. “Reconstruction of Genome Ancestry Blocks in Multiparental Populations.” Genetics 200 (4): 1073–87. https://doi.org/10.1534/genetics.115.177873.

———. 2018. “Recursive Algorithms for Modeling Genomic Ancestral Origins in a Fixed Pedigree.” G3 Genes|Genomes|Genetics 8 (10): 3231–45. https://doi.org/10.1534/G3.118.200340.

———. 2018(2). “Accurate Genotype Imputation in Multiparental Populations from Low-Coverage Sequence.” Genetics 210 (1): 71–82. https://doi.org/10.1534/genetics.118.300885.

———. 2019. “Construction of Genetic Linkage Maps in Multiparental Populations.” Genetics 212 (4): 1031–44. https://doi.org/10.1534/genetics.119.302229.

Zheng et al. 2024. “Genotype Imputation in Connected Multiparental Populations.” In preparation.

Zheng et al. 2024(2). “Efficient Consensus Map Construction in Connected Multiparental Populations.” In preparation.