Introduction

It is nowadays increasingly feasible to conduct genetic analyses in polyploid populations thanks to developments in genotyping technologies as well as tools designed to deal with these genotypes. polyqtlR is one such tool that provides a number of functions to perform quantitative trait locus (QTL) mapping in F1 populations of outcrossing, heterozygous polyploid species. For more details on the methodology, please see the 2021 Bioinformatics publication of Bourke et al..

polyqtlR assumes that a number of prior steps have already been performed - in particular, that an integrated linkage map for the mapping population has been generated. The package has been developed to utilise the output of the mapping software polymapR, although any alternative mapping software that produces phased and integrated linkage maps could also be used. However, the input files may have to be altered in such cases. Currently, bi-allelic SNP markers are the expected raw genotypic data, which is also the data format used by polymapR in map construction. However, some functions can also accept probabilistic genotypic data, for example in the estimation of IBD probabilities. Full functionality of probabilistic genotypes in the package has yet to be implemented but is planned for future releases. The assignment of marker dosages in polyploid species can be achieved using a number of R packages such as fitPoly [1]. More background on the steps of dosage-calling and polyploid linkage mapping can be found in a review of the topic by Bourke et al. (2018) [2].

The QTL analysis functions primarily rely on identity-by-descent (IBD) probabilities, which are the probabilities of inheritance of particular combinations of parental haplotypes. Single-marker QTL detection methods are also available. The package was originally developed with the IBD output of TetraOrigin [3] in mind. However, the TetraOrigin package is only applicable to tetraploid datasets, and has been implemented in the proprietary software Mathematica. Therefore, alternative options to estimate IBD probabilities within R for multiple ploidy levels are offered in polyqtlR.

Genotyping errors are a regular feature of modern marker datasets. Although linkage mapping software such as polymapR has been shown to be relatively robust against genotyping errors [4], if present in sufficiently large proportions (~10 % errors or more) issues may arise with map accuracy and QTL detection. Therefore, a number of functions have been included in polyqtlR to check map accuracy using the estimated IBD probabilities, and to re-impute marker genotypes if required. These imputed genotypes may then be used in an iterative process to re-estimate the linkage map and re-run a QTL analysis.

This tutorial goes through each of the steps in a typical QTL analysis using the example datasets provided with the package, outlining the different options that are available. However, it is certainly recommended to consult the documentation that accompanies each of the functions by running the ? command before the function name (e.g. ?QTLscan).

Installing polyqtlR

polyqtlR can be installed using the following call from within R:

install.packages("polyqtlR")

There are a number of package dependencies which should be installed, e.g. Rcpp, foreach, or doParallel. Usually these will be automatically installed as well but if not, you can always install them yourself one by one, e.g.

install.packages("Rcpp")
install.packages("foreach")
install.packages("doParallel")

before re-trying the installation of polyqtlR. Eventually when all dependencies are on your computer, you should be able to successfully run the following command (i.e. without any error message):

library(polyqtlR)

It is also a good time to load the datasets that you will need to perform a typical analysis, namely a phased maplist, a set of SNP marker genotypes (discrete dosage scores) and a set of trait data (phenotypes):

data("phased_maplist.4x", "SNP_dosages.4x", "Phenotypes_4x")

In the example that follows we are using a simulated tetraploid dataset with 2 chromosomes for simplicity.

IBD probabilities

Currently two options to estimate IBD probabilities in an F1 population exist within polyqtlR. The first uses a heuristic approach originally developed for tetraploids but later applied to hexaploid populations [5, 6] and implemented here in the polyqtlR::estimate_IBD function. This can be very computationally efficient at higher ploidy levels (and has been generalised to all ploidy levels), but the algorithm ignores partially-informative marker information. In datasets with large numbers of haplotype-specific markers this is not such an issue, but in datasets with fewer such markers the accuracy of the resulting IBD probabilities is compromised. The method behind TetraOrigin [3] for offspring IBD probability estimation given a known parental marker phasing has also been replicated in the polyqtlR::estimate_IBD function (this is the default method used). Note that unlike the original TetraOrigin software, parental marker phasing is not re-estimated. However, this is usually a direct output of the linkage mapping procedure (e.g. using polymapR).

Data structures

As the IBD data are the most central objects in this package it is worth spending a moment describing them. In polyqtlR, IBD probabilities are organised in nested list form. The first level of structure are the linkage groups. In our example dataset, the list should have two elements corresponding to the two linkage groups, each of which is itself a list containing the following elements:

• IBDtype The type of IBD probability, either “genotypeIBD” or “haplotypeIBD”

• IBDarray A 3d array of IBD probabilities, with dimensions “locus”,“genotype class”, “individual”

• map The integrated linkage map (marker names and cM positions, in order)

• parental_phase The phasing of the markers in the parental map

• marginal.likelihoods A list of marginal likelihoods of different valencies if method “hmm” was used, otherwise NULL

• valency The predicted valency that maximised the marginal likelihood, per offspring. For method “heur”, NULL

• offspring The offspring names

• biv_dec Whether bivalents only (TRUE) or also multivalents were allowed (FALSE) in the procedure

• gap Here NULL, but later this can hold a numeric value (e.g. 1 cM) if IBDs are ‘splined’ or interpolated.

• genocodes Ordered list of genotype codes used to represent the different genotype classes

• pairing Log likelihoods of each of the different pairing scenarios considered

• ploidy The ploidy of parent 1

• ploidy2 The ploidy of parent 2

• method The method used, either “hmm” (Hidden Markov Model), “heur” (heuristic) or “hmm_TO” (Hidden Markov Model TetraOrigin)

• error The offspring error prior (eps) used in the calculation.

All functions within polyqtlR for estimating or importing IBD probabilities will automatically return them in this form.

HMM for IBD estimation

The estimate_IBD function of polyqtlR with the default setting method = "hmm" estimates offspring IBD probabilities using a Hidden Markov Model (HMM) developed by Zheng et al [3] in the TetraOrigin package but generalised in polyqtlR to multiple ploidy levels. Currently, diploid (2x), triploid (3x = 4x $$\times$$ 2x), tetraploid (4x) and hexaploid (6x) populations are handled. genotypes can either be discrete (i.e. integer marker dosage scores), or probabilistic (i.e. the probabilities of each of the ploidy + 1 possible scores). For triploids, tetraploids and hexaploids, the possibility of incorporating double reduction [7] (i.e. allowing for the phenomenon of multivalent pairing) is available, although this can have serious performance implications for hexaploids (and in hexaploids, the option of allowing multivalent pairing in both parents for the same chromosome is by default disabled as it requires very high RAM requirements (> 32 Gb). Use argument full_multivalent_hexa = TRUE to allow multivalents simultaneously at the same position from both hexaploid parents).

Some of the code used to estimate these IBD probabilities has been programmed in C++ to improve performance, and relies on both the Rcpp and RcppArmadillo packages to provide the interface between R, C++ and the Armadillo C++ library.

The expected format of input files is that used by the mapping software polymapR [4]. If you have used other software for map construction and/or genotype calling, you will need to convert your input files to the correct format. There are already some tools available to do this (although it should be quite straightforward if you look at the expected format using the example data files provided here). For example, if you generated the phased linkage map using mappoly [8], you can convert your map to the required format using the convert_mappoly_to_phased.maplist function (check out the help using ?convert_mappoly_to_phased.maplist for details).

The genotypes can be either discrete or probabilistic. If the genotypes are discrete, they must be provided as a matrix of marker dosage scores (for example as provided in this tutorial in SNP_dosages.4x) with marker names as row-names, and individual names (including the two parents) as column-names. Checks are performed and non-numeric data is converted to numeric data if needed. Alternatively, probabilistic genotypes can be provided which can either be the direct output of the fitpoly [1] function saveMarkerModels, or from some other polyploid-appropriate genotype-calling software. For example, if polyRAD [9] or updog [10] were used for genotype calling, a conversion step is needed. Functions for doing this are provided by polymapR. Check out ?polymapR::convert_polyRAD or ?polymapR::convert_updog for details.

We run the function estimate_IBD using the phased linkage map phased_maplist.4x as generated by polymapR, in this case allowing for multivalents (bivalent_decoding = FALSE), as follows:

IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
genotypes = SNP_dosages.4x,
ploidy = 4,
bivalent_decoding = FALSE,
ncores = 4)

Note that the default setting of bivalent_decoding is TRUE, as the function evaluates faster if only bivalents are considered (the difference in run-time becomes more pronounced for hexaploids). However, allowing multivalents in the model gives a more complete and correct picture.

Here we chose to enable parallel processing to speed up the calculations. Note that jobs are split across linkage groups, so with 5 linkage groups it would have made more sense to use 5 cores if they were available. Use parallel::detectCores() to display how many CPU cores are available, and use 1 or 2 less than this at the very most. In this dataset there are only 2 linkage groups, so no more than 2 cores are needed.

nc <- parallel::detectCores() - 1

Optional: Importing IBDs from TetraOrigin

If so desired, the original TetraOrigin package can be used to estimate IBD probabilities (again using a Hidden Markov Model) rather than using the estimate_IBD function. The results should be identical for tetraploids, while polyqtlR also offers the flexibility of handling multiple ploidy levels (2x, 3x, 4x and 6x).

TetraOrigin produces by default a large .txt output file after running the function “inferTetraOrigin”. However, it is convenient to produce a summary of this output using the “saveAsSummaryITO” function, which produces a .csv file summarising the results. Information on how to use TetraOrigin is provided with the package itself, downloadable from GitHub.

The function import_IBD imports the .csv output of TetraOrigin::saveAsSummaryITO. It takes a number of arguments, such as folder, which is the folder containing the output of TetraOrigin. For example, if we wanted to import IBDs from the folder “TetraOrigin” for a species with 5 linkage groups, we would run the following:

IBD_4x <- import_IBD(folder = "TetraOrigin",
bivalent_decoding = TRUE)

Here it is assumed that the TetraOrigin summary filenames are “TetraOrigin_Output_bivs_LinkageGroup1_Summary.csv” etc. If all goes well, the following messages will be printed per linkage group:

Importing map data under description inferTetraOrigin-Summary,Genetic map of biallelic markers
Importing parental phasing under description inferTetraOrigin-Summary,MAP of parental haplotypes
Importing IBD data under description inferTetraOrigin-Summary,Conditonal genotype probability

“Heuristic” method for IBD estimation

In cases where the results are needed quickly, or where there are very large numbers of markers, or for ploidy levels above 6, it is convenient to use the heuristic approach to IBD estimation. We do this using the estimate_IBD function as follows:

IBD_4x <- estimate_IBD(phased_maplist = phased_maplist.4x,
genotypes = SNP_dosages.4x,
method = "heur",
ploidy = 4)

Note that the attribute IBDtype is now haplotypeIBD, referring to the fact that these IBDs are the probabilities of inheriting each parental haplotype at a locus, as opposed to the probabilities of inheriting particular combinations of parental alleles at a locus (genotypeIBD). Although similar, certain downstream functions such as exploreQTL can only work with the former (although you will still be able to visualise QTL allele effects with the visualiseQTL function). The accuracy of these IBDs is generally lower than if method = "hmm" had been used (and therefore the power of subsequent QTL analyses will be somewhat reduced).

High marker densities

Particularly at higher ploidy levels (6x), it becomes computationally expensive to estimate IBDs using the hmm method. Our experience has shown that for hexaploids with a mapping population size of 400 or more individuals running on 4 cores, about 200 - 300 markers per chromosome can be accommodated on an “above-average” desktop computer (16 GB RAM). Running on a single core will reduce the committed memory, but the evaluation time will therefore be much longer. 250 markers per chromosome should be already enough to estimate IBDs with high accuracy - including extra markers over this amount will add incrementally less information (following the law of diminishing returns).

The function thinmap can be used to make a selection of markers that tries to maximise their distribution across the genome and across parental homologues. The argument bin_size is used to specify the size of the bins in which markers are selected - increasing this results in fewer markers being used. The ideal bin_size is 1 cM, although at higher ploidies and with higher marker densities, wider bins may be needed. The function is called as follows:

thinned_maplist.4x <- thinmap(maplist = phased_maplist.4x,
dosage_matrix = SNP_dosages.4x)
## 87 markers from a possible 93 on LG1 were included.
##
## 89 markers from a possible 93 on LG2 were included.

The object thinned_maplist.4x can then be used in place of phased_maplist.4x in a call to estimate_IBD.

Interpolating IBDs

Regardless of how they were generated, IBD probabilities are estimated at all marker positions provided. When we perform an initial scan for QTL, it is often more efficient to look for QTL at a grid of evenly-spaced positions (such as at every 1 cM). This is because the genotype data has been replaced with multi-point IBD estimates at the marker positions. Information at e.g. ten different markers within a 0.5 cM window is virtually identical and therefore just one representative for this locus should approximate the full information, while reducing the number of statistical tests. This becomes particularly relevant when we generate significance thresholds using permutation tests later, which are computationally quite demanding.

It is therefore recommended to interpolate the probabilities using the spline_IBD function as follows:

IBD_4x.spl <- spline_IBD(IBD_list = IBD_4x,
gap = 1) #grid at 1 cM spacing

Visualising IBD haplotypes

Before performing a QTL analysis, it is a good idea to visualise the IBD probabilities as these can give indications about the quality of the genotypes, the linkage maps and the parental marker phasing. The function visualiseHaplo was developed originally to examine the inherited haplotypes of offspring with extreme (usually favourable) trait scores to see whether their inherited alleles are consistent with a predicted QTL model (we return to this later). But as a first look at the data quality, the function is also useful.

Haplotypes of linkage group 1 of the first nine F1 offspring can be visualised as follows:

visualiseHaplo(IBD_list = IBD_4x,
display_by = "name",
select_offspring = colnames(SNP_dosages.4x)[3:11], #cols 1+2 are parents
multiplot = c(3,3)) #plot layout in 3x3 grid

Here the quality of the predicted haplotypes appears to be high - dark colours signifying high confidence (probabilities close to 1) and small numbers of recombinations predicted. Regions of dark red signify double reduction, which were permitted during the estimation of IBDs using estimate_IBD. In poorer-quality datasets, considerably more “noise” may be present, with less certainty about the correct configuration per offspring.

The function returns a list of 2 elements (recombinants and allele_fish) which do not concern us here but which we will return to later. Note that we selected the offspring to display using the select_offspring argument - using the option select_offspring = all will return the whole population. We have also opted to display_by = "name", as opposed to display_by = "phenotype", in which case further arguments are required. For convenience the plots were combined into a 3 x 3 grid using the multiplot argument. For more options, including how to overlay recombination events, see ?visualiseHaplo.

An example of IBD probabilities that show much higher levels of uncertainty might look something like the following:

Here, there are unrealistically-high numbers of recombinations predicted, suggesting that the algorithm had difficulty correctly assigning genotype classes. In such cases it may be worthwhile to re-examine the steps prior to IBD estimation, in particular genotype calling. We will return to this issue again later.

Genotypic Information

Another approach to assessing the quality of the IBD probabilities, as well as understanding the “QTL detection landscape” to some degree is to estimate the Genotypic Information Coefficient (GIC). The GIC is calculated per homologue across the whole population, and ranges from 0 (no information) to 1 (full information). To maximise QTL detection power and precision, we would like the GIC to be uniform and as high as possible [11]. Note that the GIC is only defined for bivalent pairing, and therefore estimates of GIC from a multivalent-aware HMM are based on offspring predicted to have come from a bivalent-based meiosis for that linkage group (this tends to be most of the population anyway).

We calculate the GIC as follows:

GIC_4x <- estimate_GIC(IBD_list = IBD_4x)

We can also visualise the output of this function as follows:

visualiseGIC(GIC_list = GIC_4x)