Introduction

It is nowadays increasingly feasible to conduct genetic analyses in polyploid populations thanks to developments in genotyping technologies as well as tools designed to deal with these genotypes. polyqtlR is one such tool that provides a number of functions to perform quantitative trait locus (QTL) mapping in F1 populations of outcrossing, heterozygous polyploid species. For more details on the methodology, please see the 2021 Bioinformatics publication of Bourke et al..

polyqtlR assumes that a number of prior steps have already been performed - in particular, that an integrated linkage map for the mapping population has been generated. The package has been developed to utilise the output of the mapping software polymapR, although any alternative mapping software that produces phased and integrated linkage maps could also be used. However, the input files may have to be altered in such cases. Currently, bi-allelic SNP markers are the expected raw genotypic data, which is also the data format used by polymapR in map construction. However, some functions can also accept probabilistic genotypic data, for example in the estimation of IBD probabilities. Full functionality of probabilistic genotypes in the package has yet to be implemented but is planned for future releases. The assignment of marker dosages in polyploid species can be achieved using a number of R packages such as fitPoly (Voorrips, Gort, and Vosman 2011). More background on the steps of dosage-calling and polyploid linkage mapping can be found in a review of the topic by (Peter M. Bourke, Voorrips, et al. 2018).

The QTL analysis functions primarily rely on identity-by-descent (IBD) probabilities, which are the probabilities of inheritance of particular combinations of parental haplotypes. Single-marker QTL detection methods are also available. The package was originally developed with the IBD output of TetraOrigin (Zheng et al. 2016) in mind. However, the TetraOrigin package was implemented in the proprietary software Mathematica, which many users may not have access to without buying a licence. Since 2021 an updated method called PolyOrigin (extended for multi-parental populations) has been released in the OpenSource julia software (Zheng et al. 2021). Alternative options to estimate IBD probabilities for multiple ploidy levels (2x, 3x, 4x and 6x) are also provided directly polyqtlR.

Genotyping errors are a regular feature of modern marker datasets. Although linkage mapping software such as polymapR has been shown to be relatively robust against genotyping errors (Peter M. Bourke, Geest, et al. 2018), if present in sufficiently large proportions (~10 % errors or more) issues may arise with map accuracy and QTL detection. Therefore, a number of functions have been included in polyqtlR to check map accuracy using the estimated IBD probabilities, and to re-impute marker genotypes if required. These imputed genotypes may then be used in an iterative process to re-estimate the linkage map and re-run a QTL analysis.

This tutorial goes through each of the steps in a typical QTL analysis using the example datasets provided with the package, outlining the different options that are available. However, it is certainly recommended to consult the documentation that accompanies each of the functions by running the ? command before the function name (e.g. ?QTLscan).

Installing polyqtlR

polyqtlR can be installed using the following call from within R:

install.packages("polyqtlR")

There are a number of package dependencies which should be installed, e.g. Rcpp, foreach, or doParallel. Usually these will be automatically installed as well but if not, you can always install them yourself one by one, e.g.

install.packages("Rcpp")
install.packages("foreach")
install.packages("doParallel")

before re-trying the installation of polyqtlR. Eventually when all dependencies are on your computer, you should be able to successfully run the following command (i.e. without any error message):

library(polyqtlR)

It is also a good time to load the datasets that you will need to perform a typical analysis, namely a phased maplist, a set of SNP marker genotypes (discrete dosage scores) and a set of trait data (phenotypes):

data("phased_maplist.4x", "SNP_dosages.4x", "Phenotypes_4x")

In the example that follows we are using a simulated tetraploid dataset with 2 chromosomes for simplicity.

IBD probabilities

Currently two options to estimate IBD probabilities in an F1 population exist within polyqtlR. The first uses a heuristic approach originally developed for tetraploids but later applied to hexaploid populations (Peter M. Bourke 2014; Geest et al. 2017) and implemented here in the polyqtlR::estimate_IBD function. This can be very computationally efficient at higher ploidy levels (and has been generalised to all ploidy levels), but the algorithm ignores partially-informative marker information. In datasets with large numbers of haplotype-specific markers this is not such an issue, but in datasets with fewer such markers the accuracy of the resulting IBD probabilities is compromised. The method behind TetraOrigin (Zheng et al. 2016) for offspring IBD probability estimation given a known parental marker phasing has also been implemented in the polyqtlR::estimate_IBD function (this is the default method used). Note that unlike the original TetraOrigin software, parental marker phasing is not re-estimated. However, this is usually a direct output of the linkage mapping procedure (e.g. using polymapR).

Data structures

As the IBD data are the most central objects in this package it is worth spending a moment describing them. In polyqtlR, IBD probabilities are organised in nested list form. The first level of structure are the linkage groups. In our example dataset, the list has two elements corresponding to the two linkage groups, each of which is itself a list containing the following elements:

  • IBDtype The type of IBD probability, either “genotypeIBD” or “haplotypeIBD”

  • IBDarray A 3d array of IBD probabilities, with dimensions “locus”,“genotype class”, “individual”

  • map The integrated linkage map (marker names and cM positions, in order)

  • parental_phase The phasing of the markers in the parental map

  • marginal.likelihoods A list of marginal likelihoods of different valencies if method “hmm” was used, otherwise NULL

  • valency The predicted valency that maximised the marginal likelihood, per offspring. For method “heur”, NULL

  • offspring The offspring names

  • biv_dec Whether bivalents only (TRUE) or also multivalents were allowed (FALSE) in the procedure

  • gap Here NULL, but later this can hold a numeric value (e.g. 1 cM) if IBDs are ‘splined’ or interpolated.

  • genocodes Ordered list of genotype codes used to represent the different genotype classes

  • pairing Log likelihoods of each of the different pairing scenarios considered

  • ploidy The ploidy of parent 1

  • ploidy2 The ploidy of parent 2

  • method The method used, either “hmm” (Hidden Markov Model), “heur” (heuristic) or “hmm_TO” (Hidden Markov Model TetraOrigin)

  • error The offspring error prior (eps) used in the calculation.

All functions within polyqtlR for estimating or importing IBD probabilities will automatically return them in this form.

HMM for IBD estimation

The estimate_IBD function of polyqtlR with the default setting method = "hmm" estimates offspring IBD probabilities using a Hidden Markov Model (HMM) developed by Zheng et al (Zheng et al. 2016) in the TetraOrigin package but generalised in polyqtlR to multiple ploidy levels. Currently, diploid (2x), triploid (3x = 4x \(\times\) 2x), tetraploid (4x) and hexaploid (6x) populations are handled. genotypes can either be discrete (i.e. integer marker dosage scores), or probabilistic (i.e. the probabilities of each of the ploidy + 1 possible scores). For triploids, tetraploids and hexaploids, the possibility of incorporating double reduction (Peter M. Bourke et al. 2015) (i.e. allowing for the phenomenon of multivalent pairing) is available, although this can have serious performance implications for hexaploids (and in hexaploids, the option of allowing multivalent pairing in both parents for the same chromosome is by default disabled as it requires very high RAM requirements (> 32 Gb). Use argument full_multivalent_hexa = TRUE to allow multivalents simultaneously at the same position from both hexaploid parents).

Some of the code used to estimate these IBD probabilities has been programmed in C++ to improve performance, and relies on both the Rcpp and RcppArmadillo packages to provide the interface between R, C++ and the Armadillo C++ library.

The expected format of input files is that used by the mapping software polymapR (Peter M. Bourke, Geest, et al. 2018). If you have used other software for map construction and/or genotype calling, you will need to convert your input files to the correct format. There are already some tools available to do this (although it should be quite straightforward if you look at the expected format using the example data files provided here). For example, if you generated the phased linkage map using mappoly (Mollinari and Garcia 2019), you can convert your map to the required format using the convert_mappoly_to_phased.maplist function (check out the help using ?convert_mappoly_to_phased.maplist for details).

The genotypes can be either discrete or probabilistic. If the genotypes are discrete, they must be provided as a matrix of marker dosage scores (for example as provided in this tutorial in SNP_dosages.4x) with marker names as row-names, and individual names (including the two parents) as column-names. Checks are performed and non-numeric data is converted to numeric data if needed. Alternatively, probabilistic genotypes can be provided which can either be the direct output of the fitpoly (Voorrips, Gort, and Vosman 2011) function saveMarkerModels, or from some other polyploid-appropriate genotype-calling software. For example, if polyRAD (Clark, Lipka, and Sacks 2019) or updog (Gerard et al. 2018) were used for genotype calling, a conversion step is needed. Functions for doing this are provided by polymapR. Check out ?polymapR::convert_polyRAD or ?polymapR::convert_updog for details.

We run the function estimate_IBD using the phased linkage map phased_maplist.4x as generated by polymapR, in this case allowing for multivalents (bivalent_decoding = FALSE), as follows:

IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
                       genotypes = SNP_dosages.4x,
                       ploidy = 4,
                       bivalent_decoding = FALSE,
                       ncores = 4)

Note that the default setting of bivalent_decoding is TRUE, as the function evaluates faster if only bivalents are considered (the difference in run-time becomes more pronounced for hexaploids). However, allowing multivalents in the model gives a more complete and correct picture.

Here we chose to enable parallel processing to speed up the calculations. Note that jobs are split across linkage groups, so with 5 linkage groups it would have made more sense to use 5 cores if they were available. Use parallel::detectCores() to display how many CPU cores are available, and use 1 or 2 less than this at the very most. In this dataset there are only 2 linkage groups, so no more than 2 cores are needed.

nc <- parallel::detectCores() - 1

Optional: Importing IBDs from TetraOrigin or PolyOrigin

If so desired, the TetraOrigin package (Mathematica software) or the PolyOrigin package (julia software) can be used to estimate IBD probabilities rather than using the estimate_IBD function. The results should be quite similar for diploids or tetraploids. If mapping in triploid or hexaploid populations, polyqtlR::estimate_IBD should be used.

TetraOrigin

TetraOrigin produces by default a large .txt output file after running the function “inferTetraOrigin”. However, it is convenient to produce a summary of this output using the “saveAsSummaryITO” function, which produces a .csv file summarising the results. Information on how to use TetraOrigin is provided with the package itself, downloadable from GitHub.

The function import_IBD imports the .csv output of TetraOrigin::saveAsSummaryITO. It takes a number of arguments, such as folder, which is the folder containing the output of TetraOrigin. For example, if we wanted to import IBDs from the subfolder “TetraOrigin” for a species with 5 linkage groups, we would run the following:

IBD_4x <- import_IBD(method = "TO", #TO for TetraOrigin
                     folder = "TetraOrigin",
                     filename = paste0("TetraOrigin_Output_bivs_LinkageGroup",1:5,"_Summary"),
                     bivalent_decoding = TRUE)

Here it is assumed that the TetraOrigin summary filenames are “TetraOrigin_Output_bivs_LinkageGroup1_Summary.csv” etc. If all goes well, the following messages will be printed per linkage group:

Importing map data under description inferTetraOrigin-Summary,Genetic map of biallelic markers
Importing parental phasing under description inferTetraOrigin-Summary,MAP of parental haplotypes
Importing IBD data under description inferTetraOrigin-Summary,Conditonal genotype probability
PolyOrigin

A more recent version of the TetraOrigin software has been released in julia, extended for multi-parental populations of either diploid or tetraploid species. Details of the method can be found in Zheng et al. (2021) (Zheng et al. 2021). The software, as well as vignettes detailing how to use it can also be found on GitHub.

Assuming the output files are in a folder called “PolyOrigin” with filestem “PolyOrigin_Output”, a similar call to import the IBDs would be:

IBD_4x <- import_IBD(method = "PO", #PO for PolyOrigin
                     folder = "PolyOrigin",
                     filename = "PolyOrigin_Output")

Note that bivalent_decoding or error do not need to be specified when method = "PO", as these parameters are looked up from the PolyOrigin log file. Note also that previously the summary output of TetraOrigin was split up per linkage group. In PolyOrigin, the output from all linkage groups is kept together.

“Heuristic” method for IBD estimation

In cases where the results are needed quickly, or where there are very large numbers of markers, or for ploidy levels above 6, it is convenient to use the heuristic approach to IBD estimation. We do this using the estimate_IBD function as follows:

IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
                       genotypes = SNP_dosages.4x,
                       method = "heur",
                       ploidy = 4)

Note that the attribute IBDtype is now haplotypeIBD, referring to the fact that these IBDs are the probabilities of inheriting each parental haplotype at a locus, as opposed to the probabilities of inheriting particular combinations of parental alleles at a locus (genotypeIBD). Although similar, certain downstream functions such as exploreQTL can only work with the former (although you will still be able to visualise QTL allele effects with the visualiseQTL function). The accuracy of these IBDs is generally lower than if method = "hmm" had been used (and therefore the power of subsequent QTL analyses will be somewhat reduced).

High marker densities

Particularly at higher ploidy levels (6x), it becomes computationally expensive to estimate IBDs using the hmm method. Our experience has shown that for hexaploids with a mapping population size of 400 or more individuals running on 4 cores, about 200 - 300 markers per chromosome can be accommodated on an “above-average” desktop computer (16 GB RAM). Running on a single core will reduce the committed memory, but the evaluation time will therefore be much longer. 250 markers per chromosome should be already enough to estimate IBDs with high accuracy - including extra markers over this amount will add incrementally less information (following the law of diminishing returns).

The function thinmap can be used to make a selection of markers that tries to maximise their distribution across the genome and across parental homologues. The argument bin_size is used to specify the size of the bins in which markers are selected - increasing this results in fewer markers being used. The ideal bin_size is 1 cM, although at higher ploidies and with higher marker densities, wider bins may be needed. The function is called as follows:

thinned_maplist.4x <- thinmap(maplist = phased_maplist.4x,
                              dosage_matrix = SNP_dosages.4x)
## 87 markers from a possible 93 on LG1 were included.
## 
## 89 markers from a possible 93 on LG2 were included.

The object thinned_maplist.4x can then be used in place of phased_maplist.4x in a call to estimate_IBD.

Interpolating IBDs

Regardless of how they were generated, IBD probabilities are estimated at all marker positions provided. When we perform an initial scan for QTL, it is often more efficient to look for QTL at a grid of evenly-spaced positions (such as at every 1 cM). This is because the genotype data has been replaced with multi-point IBD estimates at the marker positions. Information at e.g. ten different markers within a 0.5 cM window is virtually identical and therefore just one representative for this locus should approximate the full information, while reducing the number of statistical tests. This becomes particularly relevant when we generate significance thresholds using permutation tests later, which are computationally quite demanding.

It is therefore recommended to interpolate the probabilities using the spline_IBD function as follows:

IBD_4x.spl <- spline_IBD(IBD_list = IBD_4x, 
                         gap = 1) #grid at 1 cM spacing

Visualising IBD haplotypes

Before performing a QTL analysis, it is a good idea to visualise the IBD probabilities as these can give indications about the quality of the genotypes, the linkage maps and the parental marker phasing. The function visualiseHaplo was developed originally to examine the inherited haplotypes of offspring with extreme (usually favourable) trait scores to see whether their inherited alleles are consistent with a predicted QTL model (we return to this later). But as a first look at the data quality, the function is also useful.

Haplotypes of linkage group 1 of the first nine F1 offspring can be visualised as follows:

visualiseHaplo(IBD_list = IBD_4x,
               display_by = "name",
               linkage_group = 1,
               select_offspring = colnames(SNP_dosages.4x)[3:11], #cols 1+2 are parents
               multiplot = c(3,3)) #plot layout in 3x3 grid

Here the quality of the predicted haplotypes appears to be high - dark colours signifying high confidence (probabilities close to 1) and small numbers of recombinations predicted. Regions of dark red signify double reduction, which were permitted during the estimation of IBDs using estimate_IBD. In poorer-quality datasets, considerably more “noise” may be present, with less certainty about the correct configuration per offspring.

The function returns a list of 2 elements (recombinants and allele_fish) which do not concern us here but which we will return to later. Note that we selected the offspring to display using the select_offspring argument - using the option select_offspring = all will return the whole population. We have also opted to display_by = "name", as opposed to display_by = "phenotype", in which case further arguments are required. For convenience the plots were combined into a 3 x 3 grid using the multiplot argument. For more options, including how to overlay recombination events, see ?visualiseHaplo.

An example of IBD probabilities that show much higher levels of uncertainty might look something like the following: