It is nowadays increasingly feasible to conduct genetic analyses in
polyploid populations thanks to developments in genotyping technologies
as well as tools designed to deal with these genotypes.
polyqtlR
is one such tool that provides a number of
functions to perform quantitative trait locus (QTL) mapping in F1
populations of outcrossing, heterozygous polyploid species. For more
details on the methodology, please see the 2021 Bioinformatics
publication of Bourke et
al..
polyqtlR
assumes that a number of prior steps have
already been performed - in particular, that an integrated linkage map
for the mapping population has been generated. The package has been
developed to utilise the output of the mapping software
polymapR
, although any alternative mapping software that
produces phased and integrated linkage maps could also be used. However,
the input files may have to be altered in such cases. Currently,
bi-allelic SNP markers are the expected raw genotypic data, which is
also the data format used by polymapR
in map construction.
However, some functions can also accept probabilistic genotypic data,
for example in the estimation of IBD probabilities. Full functionality
of probabilistic genotypes in the package has yet to be implemented but
is planned for future releases. The assignment of marker dosages in
polyploid species can be achieved using a number of R packages such as
fitPoly
(Voorrips, Gort, and Vosman
2011). More background on the steps of dosage-calling and
polyploid linkage mapping can be found in a review of the topic by (Peter M. Bourke, Voorrips, et al. 2018).
The QTL analysis functions primarily rely on identity-by-descent
(IBD) probabilities, which are the probabilities of inheritance of
particular combinations of parental haplotypes. Single-marker QTL
detection methods are also available. The package was originally
developed with the IBD output of TetraOrigin (Zheng et al. 2016) in mind. However, the
TetraOrigin package was implemented in the proprietary software
Mathematica, which many users may not have access to without
buying a licence. Since 2021 an updated method called PolyOrigin
(extended for multi-parental populations) has been released in the
OpenSource julia software (Zheng et al.
2021). Alternative options to estimate IBD probabilities for
multiple ploidy levels (2x, 3x, 4x and 6x) are also provided directly
polyqtlR
.
Genotyping errors are a regular feature of modern marker datasets.
Although linkage mapping software such as polymapR
has been
shown to be relatively robust against genotyping errors (Peter M. Bourke, Geest, et al. 2018), if
present in sufficiently large proportions (~10 % errors or more) issues
may arise with map accuracy and QTL detection. Therefore, a number of
functions have been included in polyqtlR
to check map
accuracy using the estimated IBD probabilities, and to re-impute marker
genotypes if required. These imputed genotypes may then be used in an
iterative process to re-estimate the linkage map and re-run a QTL
analysis.
This tutorial goes through each of the steps in a typical QTL
analysis using the example datasets provided with the package, outlining
the different options that are available. However, it is certainly
recommended to consult the documentation that accompanies each of the
functions by running the ?
command before the function name
(e.g. ?QTLscan
).
polyqtlR
polyqtlR
can be installed using the following call from
within R:
install.packages("polyqtlR")
There are a number of package dependencies which should be installed,
e.g. Rcpp
, foreach
, or
doParallel
. Usually these will be automatically installed
as well but if not, you can always install them yourself one by one,
e.g.
install.packages("Rcpp")
install.packages("foreach")
install.packages("doParallel")
before re-trying the installation of polyqtlR
.
Eventually when all dependencies are on your computer, you should be
able to successfully run the following command (i.e. without
any error message):
library(polyqtlR)
It is also a good time to load the datasets that you will need to perform a typical analysis, namely a phased maplist, a set of SNP marker genotypes (discrete dosage scores) and a set of trait data (phenotypes):
data("phased_maplist.4x", "SNP_dosages.4x", "Phenotypes_4x")
In the example that follows we are using a simulated tetraploid dataset with 2 chromosomes for simplicity.
Currently two options to estimate IBD probabilities in an F1
population exist within polyqtlR
. The first uses a
heuristic approach originally developed for tetraploids but later
applied to hexaploid populations (Peter M. Bourke
2014; Geest et al. 2017) and implemented here in the
polyqtlR::estimate_IBD
function. This can be very
computationally efficient at higher ploidy levels (and has been
generalised to all ploidy levels), but the algorithm ignores
partially-informative marker information. In datasets with large numbers
of haplotype-specific markers this is not such an issue, but in datasets
with fewer such markers the accuracy of the resulting IBD probabilities
is compromised. The method behind TetraOrigin (Zheng et al. 2016) for offspring IBD
probability estimation given a known parental marker phasing has also
been implemented in the polyqtlR::estimate_IBD
function
(this is the default method used). Note that unlike the original
TetraOrigin software, parental marker phasing is not re-estimated.
However, this is usually a direct output of the linkage mapping
procedure (e.g. using polymapR
).
As the IBD data are the most central objects in this package it is
worth spending a moment describing them. In polyqtlR
, IBD
probabilities are organised in nested list form. The first level of
structure are the linkage groups. In our example dataset, the list has
two elements corresponding to the two linkage groups, each of which is
itself a list containing the following elements:
IBDtype The type of IBD probability, either “genotypeIBD” or “haplotypeIBD”
IBDarray A 3d array of IBD probabilities, with dimensions “locus”,“genotype class”, “individual”
map The integrated linkage map (marker names and cM positions, in order)
parental_phase The phasing of the markers in the parental map
marginal.likelihoods A list of marginal
likelihoods of different valencies if method “hmm” was used, otherwise
NULL
valency The predicted valency that maximised the
marginal likelihood, per offspring. For method “heur”,
NULL
offspring The offspring names
biv_dec Whether bivalents only
(TRUE
) or also multivalents were allowed
(FALSE
) in the procedure
gap Here NULL
, but later this can
hold a numeric value (e.g. 1 cM) if IBDs are ‘splined’ or
interpolated.
genocodes Ordered list of genotype codes used to represent the different genotype classes
pairing Log likelihoods of each of the different pairing scenarios considered
ploidy The ploidy of parent 1
ploidy2 The ploidy of parent 2
method The method used, either “hmm” (Hidden Markov Model), “heur” (heuristic) or “hmm_TO” (Hidden Markov Model TetraOrigin)
error The offspring error prior (eps) used in the calculation.
All functions within polyqtlR
for estimating or
importing IBD probabilities will automatically return them in this
form.
The estimate_IBD
function of polyqtlR
with
the default setting method = "hmm"
estimates offspring IBD
probabilities using a Hidden Markov Model (HMM) developed by Zheng et al
(Zheng et al. 2016) in the TetraOrigin
package but generalised in polyqtlR
to multiple ploidy
levels. Currently, diploid (2x), triploid (3x = 4x \(\times\) 2x), tetraploid (4x) and hexaploid
(6x) populations are handled. genotypes
can either be
discrete (i.e. integer marker dosage scores), or probabilistic
(i.e. the probabilities of each of the ploidy + 1 possible
scores). For triploids, tetraploids and hexaploids, the possibility of
incorporating double reduction (Peter M. Bourke
et al. 2015) (i.e. allowing for the phenomenon of
multivalent pairing) is available, although this can have serious
performance implications for hexaploids (and in hexaploids, the option
of allowing multivalent pairing in both parents for the same
chromosome is by default disabled as it requires very high RAM
requirements (> 32 Gb). Use argument
full_multivalent_hexa = TRUE
to allow multivalents
simultaneously at the same position from both hexaploid parents).
Some of the code used to estimate these IBD probabilities has been
programmed in C++ to improve performance, and relies on both the
Rcpp
and RcppArmadillo
packages to provide the
interface between R, C++ and the Armadillo C++ library.
The expected format of input files is that used by the mapping
software polymapR
(Peter M. Bourke,
Geest, et al. 2018). If you have used other software for map
construction and/or genotype calling, you will need to convert your
input files to the correct format. There are already some tools
available to do this (although it should be quite straightforward if you
look at the expected format using the example data files provided here).
For example, if you generated the phased linkage map using
mappoly
(Mollinari and Garcia
2019), you can convert your map to the required format using the
convert_mappoly_to_phased.maplist
function (check out the
help using ?convert_mappoly_to_phased.maplist
for
details).
The genotypes can be either discrete or probabilistic. If the
genotypes are discrete, they must be provided as a matrix of marker
dosage scores (for example as provided in this tutorial in
SNP_dosages.4x
) with marker names as row-names, and
individual names (including the two parents) as column-names. Checks are
performed and non-numeric data is converted to numeric data if needed.
Alternatively, probabilistic genotypes can be provided which can either
be the direct output of the fitpoly
(Voorrips, Gort, and Vosman 2011) function
saveMarkerModels
, or from some other polyploid-appropriate
genotype-calling software. For example, if polyRAD
(Clark, Lipka, and Sacks 2019) or
updog
(Gerard et al. 2018)
were used for genotype calling, a conversion step is needed. Functions
for doing this are provided by polymapR
. Check out
?polymapR::convert_polyRAD
or
?polymapR::convert_updog
for details.
We run the function estimate_IBD
using the phased
linkage map phased_maplist.4x
as generated by
polymapR
, in this case allowing for multivalents
(bivalent_decoding = FALSE
), as follows:
IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
genotypes = SNP_dosages.4x,
ploidy = 4,
bivalent_decoding = FALSE,
ncores = 4)
Note that the default setting of bivalent_decoding
is
TRUE
, as the function evaluates faster if only bivalents
are considered (the difference in run-time becomes more pronounced for
hexaploids). However, allowing multivalents in the model gives a more
complete and correct picture.
Here we chose to enable parallel processing to speed up the
calculations. Note that jobs are split across linkage groups, so with 5
linkage groups it would have made more sense to use 5 cores if they were
available. Use parallel::detectCores()
to display how many
CPU cores are available, and use 1 or 2 less than this at the very most.
In this dataset there are only 2 linkage groups, so no more than 2 cores
are needed.
nc <- parallel::detectCores() - 1
If so desired, the TetraOrigin package (Mathematica software) or the
PolyOrigin package (julia software) can be used to estimate IBD
probabilities rather than using the estimate_IBD
function.
The results should be quite similar for diploids or tetraploids. If
mapping in triploid or hexaploid populations,
polyqtlR::estimate_IBD
should be used.
TetraOrigin produces by default a large .txt output file after running the function “inferTetraOrigin”. However, it is convenient to produce a summary of this output using the “saveAsSummaryITO” function, which produces a .csv file summarising the results. Information on how to use TetraOrigin is provided with the package itself, downloadable from GitHub.
The function import_IBD
imports the .csv output of
TetraOrigin::saveAsSummaryITO. It takes a number of arguments, such as
folder
, which is the folder containing the output of
TetraOrigin. For example, if we wanted to import IBDs from the subfolder
“TetraOrigin” for a species with 5 linkage groups, we would run the
following:
IBD_4x <- import_IBD(method = "TO", #TO for TetraOrigin
folder = "TetraOrigin",
filename = paste0("TetraOrigin_Output_bivs_LinkageGroup",1:5,"_Summary"),
bivalent_decoding = TRUE)
Here it is assumed that the TetraOrigin summary filenames are “TetraOrigin_Output_bivs_LinkageGroup1_Summary.csv” etc. If all goes well, the following messages will be printed per linkage group:
Importing map data under description inferTetraOrigin-Summary,Genetic map of biallelic markers
Importing parental phasing under description inferTetraOrigin-Summary,MAP of parental haplotypes
Importing IBD data under description inferTetraOrigin-Summary,Conditonal genotype probability
A more recent version of the TetraOrigin software has been released in julia, extended for multi-parental populations of either diploid or tetraploid species. Details of the method can be found in Zheng et al. (2021) (Zheng et al. 2021). The software, as well as vignettes detailing how to use it can also be found on GitHub.
Assuming the output files are in a folder called “PolyOrigin” with filestem “PolyOrigin_Output”, a similar call to import the IBDs would be:
IBD_4x <- import_IBD(method = "PO", #PO for PolyOrigin
folder = "PolyOrigin",
filename = "PolyOrigin_Output")
Note that bivalent_decoding
or error
do not
need to be specified when method = "PO"
, as these
parameters are looked up from the PolyOrigin log file. Note also that
previously the summary output of TetraOrigin was split up per linkage
group. In PolyOrigin, the output from all linkage groups is kept
together.
In cases where the results are needed quickly, or where there are
very large numbers of markers, or for ploidy levels above 6, it is
convenient to use the heuristic approach to IBD estimation. We do this
using the estimate_IBD
function as follows:
IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
genotypes = SNP_dosages.4x,
method = "heur",
ploidy = 4)
Note that the attribute IBDtype
is now
haplotypeIBD
, referring to the fact that these IBDs are the
probabilities of inheriting each parental haplotype at a locus, as
opposed to the probabilities of inheriting particular combinations of
parental alleles at a locus (genotypeIBD
). Although
similar, certain downstream functions such as exploreQTL
can only work with the former (although you will still be able to
visualise QTL allele effects with the visualiseQTL
function). The accuracy of these IBDs is generally lower than
if method = "hmm"
had been used (and therefore the power of
subsequent QTL analyses will be somewhat reduced).
Particularly at higher ploidy levels (6x), it becomes computationally
expensive to estimate IBDs using the hmm
method. Our
experience has shown that for hexaploids with a mapping population size
of 400 or more individuals running on 4 cores, about 200 - 300 markers
per chromosome can be accommodated on an “above-average” desktop
computer (16 GB RAM). Running on a single core will reduce the committed
memory, but the evaluation time will therefore be much longer. 250
markers per chromosome should be already enough to estimate IBDs with
high accuracy - including extra markers over this amount will add
incrementally less information (following the law of diminishing
returns).
The function thinmap
can be used to make a selection of
markers that tries to maximise their distribution across the genome and
across parental homologues. The argument bin_size
is used
to specify the size of the bins in which markers are selected -
increasing this results in fewer markers being used. The ideal
bin_size
is 1 cM, although at higher ploidies and with
higher marker densities, wider bins may be needed. The function is
called as follows:
thinned_maplist.4x <- thinmap(maplist = phased_maplist.4x,
dosage_matrix = SNP_dosages.4x)
## 87 markers from a possible 93 on LG1 were included.
##
## 89 markers from a possible 93 on LG2 were included.
The object thinned_maplist.4x
can then be used in place
of phased_maplist.4x
in a call to
estimate_IBD
.
Regardless of how they were generated, IBD probabilities are estimated at all marker positions provided. When we perform an initial scan for QTL, it is often more efficient to look for QTL at a grid of evenly-spaced positions (such as at every 1 cM). This is because the genotype data has been replaced with multi-point IBD estimates at the marker positions. Information at e.g. ten different markers within a 0.5 cM window is virtually identical and therefore just one representative for this locus should approximate the full information, while reducing the number of statistical tests. This becomes particularly relevant when we generate significance thresholds using permutation tests later, which are computationally quite demanding.
It is therefore recommended to interpolate the probabilities using
the spline_IBD
function as follows:
IBD_4x.spl <- spline_IBD(IBD_list = IBD_4x,
gap = 1) #grid at 1 cM spacing
Before performing a QTL analysis, it is a good idea to visualise the
IBD probabilities as these can give indications about the quality of the
genotypes, the linkage maps and the parental marker phasing. The
function visualiseHaplo
was developed originally to examine
the inherited haplotypes of offspring with extreme (usually favourable)
trait scores to see whether their inherited alleles are consistent with
a predicted QTL model (we return to this later). But as a first look at
the data quality, the function is also useful.
Haplotypes of linkage group 1 of the first nine F1 offspring can be visualised as follows:
visualiseHaplo(IBD_list = IBD_4x,
display_by = "name",
linkage_group = 1,
select_offspring = colnames(SNP_dosages.4x)[3:11], #cols 1+2 are parents
multiplot = c(3,3)) #plot layout in 3x3 grid
Here the quality of the predicted haplotypes appears to be high -
dark colours signifying high confidence (probabilities close to 1) and
small numbers of recombinations predicted. Regions of dark red signify
double reduction, which were permitted during the estimation of IBDs
using estimate_IBD
. In poorer-quality datasets,
considerably more “noise” may be present, with less certainty about the
correct configuration per offspring.
The function returns a list of 2 elements (recombinants
and allele_fish
) which do not concern us here but which we
will return to later. Note that we selected the
offspring to display using the select_offspring
argument -
using the option select_offspring = all
will return the
whole population. We have also opted to
display_by = "name"
, as opposed to
display_by = "phenotype"
, in which case further arguments
are required. For convenience the plots were combined into a 3 x 3 grid
using the multiplot
argument. For more options, including
how to overlay recombination events, see
?visualiseHaplo
.
An example of IBD probabilities that show much higher levels of uncertainty might look something like the following: