S5 MagicImpute

MagicImpute

magicmask_impute
  • magicmask_impute combines magicmask, magicimpute, and imputeaccuracy.
  • By default, magicmask masks 10% of genotypes in founders and in offspring (foundermask=0.1, offspringmask=0.1). For sequence data, it masks only genotypes with read depth >= minread (minread=10).
  • To impute without masking genotypes, use magicimpute.
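
The masking rule above can be sketched as follows. This is only an illustration, not the magicmask implementation: choose_mask and its arguments are hypothetical names, and it assumes the masking fraction is applied to the genotypes that pass the read-depth filter.

```julia
using Random

# Hypothetical helper, not part of MagicImpute: choose which genotypes to mask.
# `depth` holds the read depth of each genotype (markers × individuals).
function choose_mask(depth::AbstractMatrix{<:Integer};
        maskfrac::Real = 0.1, minread::Integer = 10,
        rng::AbstractRNG = Random.default_rng())
    # Only genotypes with read depth >= minread are eligible for masking.
    eligible = findall(>=(minread), depth)
    # Assumption: the fraction is taken over the eligible genotypes.
    nmask = round(Int, maskfrac * length(eligible))
    mask = falses(size(depth))
    mask[shuffle(rng, eligible)[1:nmask]] .= true
    return mask
end

depth = rand(MersenneTwister(1), 0:30, 50, 20)   # toy read depths
mask = choose_mask(depth; rng = MersenneTwister(2))
```

Entries where mask is true would be set to missing in the masked genofile, while their original values are kept as ground truth for the accuracy calculation.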

magicimpute first imputes founders and then imputes offspring. During the iterative founder imputation, it simultaneously performs the following:

  • Impute founder genotypes.
  • Delete markers that do not fit in. By default, isdelmarker = true.
  • Correct founder genotypes. By default, iscorrectfounder = true if model="depmodel" or offspring do not have genotypes in AD format.
  • Infer marker-specific error rates. By default, isinfererror = true if model is not "depmodel" or isspacemarker = true.
  • Refine local marker ordering. By default, isordermarker = true if mapfile is not nothing.
  • Refine inter-marker distances. By default, isspacemarker = true if mapfile is not nothing or isordermarker = true or isphysmap=true.
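
The default rules in the list above depend on one another (isinfererror depends on isspacemarker, which depends on isordermarker). A minimal sketch of how they resolve, written directly from the wording above; default_options and offspring_has_AD are hypothetical names and this is not the package's own code:

```julia
# Hypothetical helper that mirrors the defaults described in the list above.
function default_options(; model::AbstractString, mapfile = nothing,
        isphysmap::Bool = false, offspring_has_AD::Bool = false)
    isdelmarker = true
    iscorrectfounder = model == "depmodel" || !offspring_has_AD
    isordermarker = !isnothing(mapfile)
    isspacemarker = !isnothing(mapfile) || isordermarker || isphysmap
    isinfererror = model != "depmodel" || isspacemarker
    return (; isdelmarker, iscorrectfounder, isinfererror, isordermarker, isspacemarker)
end

# With a mapfile, ordering, spacing, and error inference are all switched on.
withmap = default_options(model = "depmodel",
    mapfile = "example_magicmap_construct_map.csv.gz")
# Without a mapfile and with model="depmodel", they are all switched off.
nomap = default_options(model = "depmodel")
```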

After founder imputation, magicimpute imputes offspring.

  • If phasealg=forwardbackward, for each offspring at each marker, the posterior diplotype probabilities (in format GP) corresponding to the phased genotypes 0|0, 0|1, 1|0, and 1|1 are calculated by the forward-backward algorithm, and the called phased genotype (in format GT) is the one with the largest posterior diplotype probability, provided that probability is greater than threshimpute.
  • If phasealg=viterbi (default), the posterior diplotype probabilities (GP) are the same as those of phasealg=forwardbackward, and the called phased genotypes (GT) are calculated by the Viterbi algorithm.
  • If phasealg=unphase, the posterior genotype probabilities (GP) corresponding to the unphased genotypes 0/0, 0/1, and 1/1 are calculated by transforming the posterior diplotype probabilities of phasealg=forwardbackward, and the called unphased genotype is the one with the largest posterior genotype probability, provided that probability is greater than threshimpute.
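
The calling rules for forwardbackward and unphase can be illustrated with a small sketch. The function names and the threshold value 0.7 are hypothetical and chosen only for the example; the sketch assumes the unphased transformation sums the two heterozygous diplotype probabilities, and it does not cover the Viterbi path.

```julia
# Hypothetical helpers, not MagicImpute functions.
const DIPLOTYPES = ("0|0", "0|1", "1|0", "1|1")
const GENOTYPES = ("0/0", "0/1", "1/1")

# Call the phased genotype with the largest posterior probability,
# or return missing if that probability is not greater than threshimpute.
function call_phased(gp::AbstractVector{<:Real}, threshimpute::Real)
    pmax, imax = findmax(gp)
    return pmax > threshimpute ? DIPLOTYPES[imax] : missing
end

# Assumed transformation: merge 0|1 and 1|0 into the unphased genotype 0/1.
unphase(gp::AbstractVector{<:Real}) = [gp[1], gp[2] + gp[3], gp[4]]

function call_unphased(gp::AbstractVector{<:Real}, threshimpute::Real)
    pmax, imax = findmax(unphase(gp))
    return pmax > threshimpute ? GENOTYPES[imax] : missing
end

gp = [0.05, 0.6, 0.3, 0.05]          # toy posterior for 0|0, 0|1, 1|0, 1|1
phased = call_phased(gp, 0.7)        # phase is uncertain, so no call is made
unphased = call_unphased(gp, 0.7)    # heterozygosity itself is well supported
```

In this toy case the phased call is missing while the unphased call is 0/1, which shows why an unphased call can pass the threshold when neither phased heterozygote does.
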
# code for Julia
using MagicImpute
cd(@__DIR__)
outstem = "example"   # prefix shared by the input and output files
genofile = outstem*"_magicfilter_geno.vcf.gz"
pedfile = outstem*"_magicfilter_ped.csv"
magicmask_impute(genofile,pedfile;
    mapfile = outstem*"_magicmap_construct_map.csv.gz",
    outstem
)
# code for Linux shell.
# For Windows CMD, replace the line-continuation character \ by ^, and replace the comment character # by ::
julia rabbit_magicmask_impute.jl -g example_magicfilter_geno.vcf.gz \
    -p example_magicfilter_ped.csv \
    --mapfile example_magicmap_construct_map.csv.gz \
    --nworker 5 \
    -o example

Output files

outfile                                       Description
outstem*"_magicimpute.log"                    log file
outstem*"_magicmask_geno.vcf.gz"              genofile with some genotypes being masked
outstem*"_magicmask_reversed.vcf.gz"          ground-truth genofile used to calculate imputation accuracy
outstem*"_magicimpute_founder.vcf.gz"         intermediate genofile after founder imputation
outstem*"_magicimpute_geno.vcf.gz"            result genofile for downstream analysis
outstem*"_magicimpute_map.csv"                result mapfile
outstem*"_magicimpute_compare_inputmap.png"   comparison of the refined mapfile with the keyword mapfile (if not nothing)
outstem*"_magicimpute_delete.csv"             collection of deleted markers
outstem*"_magicimpute_peroffspringerror.csv"  genotyping error per offspring
outstem*"_magicimpute_peroffspringerror.png"  plot of genotyping error per offspring
outstem*"_magicimpute_permarkererror.png"     plot of per-marker error rates that are saved in genofile
outstem*"_magicimpute_founderacc.csv"         imputation accuracy for each founder
outstem*"_magicimpute_offspringacc.csv"       imputation accuracy for each subpopulation

Output: imputation accuracy

outstem*"_magicimpute_founderacc.csv" gives imputation accuracies for each founder, and outstem*"_magicimpute_offspringacc.csv" gives imputation accuracies for each subpopulation, assuming that the masked genotypes in outstem*"_magicmask_reversed.vcf.gz" are the ground truth. These output files, together with outstem*"_magicmask_geno.vcf.gz", are not produced by magicimpute.

using CSV, DataFrames
outstem = "example"
CSV.read(outstem*"_magicimpute_founderacc.csv",DataFrame; comment="##")
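4×7 DataFrame
 Row │ founder  noffspring  nmarker  miss_afterimpute  ntrue  nonimpute  correctimpute
     │ String3  Int64       Int64    Float64           Int64  Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │ P1              200      236         0.0254237     31  0.0645161       0.724138
   2 │ P2              300      236               0.0     25        0.0           0.88
   3 │ P3              300      236               0.0     35        0.0       0.714286
   4 │ P4              200      236         0.0169492     12  0.0833333       0.727273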
CSV.read(outstem*"_magicimpute_offspringacc.csv",DataFrame; comment="##")
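3×7 DataFrame
 Row │ subpop   subpopsize  nmarker  miss_afterimpute  ntrue  nonimpute  correctimpute
     │ String7  Int64       Int64    Float64           Int64  Float64    Float64
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │ pop1            100      236         0.0274576    441  0.0748299       0.958333
   2 │ pop2            100      236         0.0677119    565   0.145133       0.888199
   3 │ pop3            200      236         0.0620763   1608  0.0951493       0.931959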
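
The column definitions are not spelled out here. The values are consistent with reading ntrue as the number of masked genotypes, nonimpute as the fraction of them left unimputed, and correctimpute as the fraction of the imputed ones that match the ground truth; for founder P1, 2/31 = 0.0645161 and 21/29 = 0.724138. This reading is an interpretation rather than a documented definition. A minimal sketch under that assumption, with a hypothetical helper name and toy data:

```julia
# Hypothetical helper, not part of MagicImpute.
# truth and imputed hold the masked genotypes; missing marks an unimputed one.
function masked_accuracy(truth::AbstractVector, imputed::AbstractVector)
    ntrue = length(truth)
    isimputed = .!ismissing.(imputed)
    # Fraction of masked genotypes that remain missing after imputation.
    nonimpute = 1 - count(isimputed) / ntrue
    # Fraction of imputed genotypes that agree with the ground truth.
    ncorrect = count(i -> isimputed[i] && imputed[i] == truth[i], eachindex(truth))
    correctimpute = ncorrect / count(isimputed)
    return (; ntrue, nonimpute, correctimpute)
end

truth   = ["0|0", "0|1", "1|1", "1|0"]       # toy ground truth
imputed = ["0|0", "0|1", missing, "0|1"]     # toy imputation result
acc = masked_accuracy(truth, imputed)
```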

Output: map refinement

outstem*"_magicimpute_map.csv" is the refined mapfile after refining local marker ordering and/or inter-marker distances. outstem*"_magicimpute_compare_inputmap.png" compares the refined mapfile with the map of the input genofile and the map of the keyword mapfile (if not nothing). These output files are not produced if the input map is not changed.