


Rhometa: Population recombination rate estimation from metagenomic read datasets [1]

Sidaswar Krishnan (Climate Change Cluster, Faculty of Science, University of Technology Sydney, Sydney, NSW), Matthew Z. DeMaere (Australian Institute for Microbiology and Infection), Dominik Beck, et al.

Date: 2023-04

Prokaryotic evolution is influenced by the exchange of genetic information between species through a process referred to as recombination. The rate of recombination is a useful measure of the adaptive capacity of a prokaryotic population. We introduce Rhometa (https://github.com/sid-krish/Rhometa), a new software package to determine recombination rates from shotgun sequencing reads of metagenomes. It extends the composite likelihood approach for population recombination rate estimation and enables the analysis of modern short-read datasets. We evaluated Rhometa over a broad range of sequencing depths and complexities, using simulated and real experimental short-read data aligned to external reference genomes. Rhometa offers a comprehensive solution for determining population recombination rates from contemporary metagenomic read datasets. It extends the capabilities of conventional sequence-based composite likelihood population recombination rate estimators to modern aligned metagenomic read datasets of diverse sequencing depths, thereby enabling the effective application of these techniques, and their high accuracy, to the field of metagenomics. Using simulated datasets, we show that our method performs well, with its accuracy improving as the number of genomes increases. Rhometa was validated on a real S. pneumoniae transformation experiment, for which we show that it obtains plausible estimates of the rate of recombination. Finally, the program was also run on ocean surface water metagenomic datasets, demonstrating that it works on uncultured metagenomic datasets.

Microbes, specifically prokaryotes, are able to exchange DNA with one another through a process called recombination, which takes the form of gene conversion. Recombination plays a fundamentally important role in microbial speciation and evolution. Metagenomics allows us to study microbes in their natural environment, as they are, via direct sequencing and analysis of environmental DNA. Indeed, most microbes cannot be cultured and can only be studied in this manner. Metagenomic datasets therefore represent an excellent resource for measuring prokaryotic recombination. Rhometa is a new software program that we have designed to interrogate modern metagenomic shotgun sequencing read datasets and estimate population recombination rates. It extends the composite likelihood approach for population recombination rate estimation and makes it applicable to modern aligned metagenomic datasets of various depths. The input requires little pre-processing: only an aligned BAM file and a reference FASTA, in the form of a complete sequence or a MAG, are needed. The program performs well on large BAM files, and the included subsampling functionality ensures that files of arbitrarily large size can be subsampled and analysed. The program has been validated on simulated datasets, and further on experimental and environmental datasets where the amount of recombination has been quantified. Through this validation we show that the program is able to reliably estimate recombination rates in simulated and experimental datasets.

Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: A. E. Darling holds equity in Illumina Inc and is employed by its subsidiary Illumina Australia Pty Ltd, a company that develops and sells DNA sequencing technology. All other authors declare no competing financial interests.

Funding: This work was supported by an Australian Government Research Training Program Scholarship. This research was supported by the Australian Government through the Australian Research Council Discovery Projects funding scheme under the project DP180101506, http://purl.org/au-research/grants/arc/DP180101506 (to AED). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: All supporting information can be accessed here: https://doi.org/10.5281/zenodo.7634208 . The Rhometa software package is available at https://github.com/sid-krish/Rhometa . The metagenomic dataset simulation pipeline is available at https://github.com/sid-krish/rhometa_sim , the LDhat Nextflow Pipeline is available at: https://github.com/sid-krish/Nextflow_LDhat , the full genome version of Rhometa, developed for testing purposes, is available at https://github.com/sid-krish/Rhometa_Full_Genome and the Nextflow_LDhat_sim simulation pipeline (used for simulating full sequences for both Rhometa Full Genome and LDhat Nextflow Pipeline) is available at: https://github.com/sid-krish/Nextflow_LDhat_Sim .

Copyright: © 2023 Krishnan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

A primary question in the field of microbial ecology is to understand the rate at which prokaryotes evolve and form species in nature. A major driving factor of prokaryotic evolution is recombination [ 1 ]. Within prokaryotes, recombination often takes the form of gene-conversion, where homologous sequences of DNA are non-reciprocally transferred, one replacing the other [ 2 , 3 ]. This process can occur between repeated sequences within the same chromosome and between homologous chromosomal sequences [ 3 ]. Recombination often plays a greater role than de novo mutation in prokaryotic evolution [ 4 ]. Furthermore, it is thought that recombination plays an important role in counteracting the effects of Muller’s ratchet, the theorised process whereby deleterious mutations inevitably accumulate over time, leading to the irrevocable loss of most mutation-free genotypes in a population [ 5 , 6 ]. Understanding the rate at which recombination occurs within prokaryotic populations can therefore provide insight into a crucial biological process that is necessary for their adaptation and survival.

Currently the most viable way to study microbial populations is via metagenomics, which allows us to study microbes in their natural environment via direct sequencing and analysis of environmental DNA [ 7 , 8 ]. Shotgun metagenomic sequencing yields fragments of DNA sequences, referred to as reads, which taken together represent a random sampling of genome fragments from all the microbes in the environmental sample [ 9 ]. These reads can then be used to estimate the rates of recombination.

The rate of recombination within a population can be inferred using population genetic models for evolution. The Wright-Fisher model provides an analytical framework that quantifies various forces that can impact the evolution of a population such as random genetic drift and mutation [ 10 ]. Coalescent theory, building on the Wright-Fisher population model, provides an analytical framework for DNA polymorphism data and can be used to obtain quantitative estimates for recombination and mutation rates [ 11 , 12 ].

Coalescent theory provides the population-scaled recombination rate for the gene-conversion model of recombination, formulated as ρ = 2N_e r, i.e., 2 × effective population size × per-individual, per-generation recombination rate [ 12 ], as well as the haploid population-scaled mutation rate, θ = 2N_e u, i.e., 2 × effective population size × per-individual, per-generation mutation rate [ 12 ]. It is difficult to estimate r or u directly without additional prior information, so recombination and mutation rates are typically computed as the population-scaled statistics ρ and θ, or jointly as the ratio r/u, also denoted r/m (per-site recombination rate to per-site mutation rate) [ 12 , 13 ]. Note that r and u are per-site rates; as estimated here, ρ applies to the entire genome, whereas θ is per site.

Several approaches have been used to estimate the recombination rate ρ. These include moment estimators, full-likelihood estimators and composite likelihood estimators. Moment estimators use summary statistics to estimate ρ, but their accuracy is limited by the fact that they cannot use all the genetic information available [ 14 – 16 ]. Full-likelihood estimators are able to utilise all the genetic information available to them, but are so computationally intensive that their usage is impractical. To mitigate these issues and make the approach more computationally tractable, composite likelihood estimators were developed [ 12 , 16 , 17 ]. Composite likelihood estimators reduce the scope of the data analysed, e.g. to only pairs of alleles; this approach is far less computationally intensive, with only a slight loss in accuracy compared to the full-likelihood approach [ 12 , 16 – 18 ].

There are several programs available that implement the composite likelihood approach for estimating the recombination rate, including LDhat [ 12 , 19 ], LDhelmet [ 20 ], LDhot [ 21 ], PIIM [ 22 ] and pyrho [ 23 ]. Each is excellent for its respective use case, but all have limitations that make them unsuitable for modern read-based metagenomic datasets.

More specifically, LDhat [ 12 , 19 ], LDhelmet [ 20 ] and LDhot [ 21 ] were designed for genome sequence analysis, not metagenomes. PIIM [ 22 ] was a pioneering attempt at a metagenomic read-based population recombination rate estimator that deals with the complexities of such datasets, such as varying depths across loci. While innovative at the time, its application is impractical today. PIIM’s approach included computationally expensive techniques to integrate out uncertainty in low-quality base calls so as to retain as much information as possible from the scarce data available at the time. Today, deep sequencing is affordable and highly accurate, such that it is often more practical to simply discard low-quality sequence data rather than account for it computationally using complex algorithms. As such, PIIM’s approach is impractical for the ever-larger datasets generated by modern sequencing techniques. Furthermore, it lacks support for modern sequence data formats (e.g., BAM), being limited to the obsolete ACE assembly format that is rarely used today.

pyrho [ 23 ] is a recent composite likelihood estimator supporting read-based data; however, it is limited to the analysis of diploid genomes, making it unsuitable for the haploid genomes (prokaryotes) that commonly dominate metagenomic datasets. Still other programs exist that calculate the population recombination rate through different approaches, such as LDjump [ 18 ] and CodABC [ 24 ], which utilise summary statistics [ 18 ], and programs such as ClonalFrameML [ 25 , 26 ], which provide an estimate of the recombination rate relative to the mutation rate but are designed around bacterial isolate genomes.

mcorr [ 27 ] is a program that can work with metagenomic reads and estimate the relative rate of recombination to mutation as well as the recombinational divergence. mcorr uses an alternative mathematical formulation to parameterise the recombination process, with ϕ_pool ≡ 2Tγ, where T “is the mean pairwise coalescence time across all loci in the bulk pool” and γ is the per base pair (bp), per generation recombination rate, equivalent to r in ρ = 2N_e r. Furthermore, the program is limited to coding regions and requires a gene annotation file. It is our aim to build on methods established in previous composite likelihood estimators for population recombination rate estimation to create a tailored solution that is applicable to modern aligned read-based metagenomic datasets.

Here, we present Rhometa, a software implementation of the composite likelihood based population recombination rate (ρ = 2N_e r) estimation method, which builds upon the methods introduced in the LDhat pairwise program [ 12 ] and can be applied directly to modern aligned shotgun metagenomic read datasets for prokaryotes. Details of its implementation are presented in the methods, while an evaluation of its accuracy on simulated and real data and comparison to existing tools are presented in the results.

Description of the method

Our approach focuses on advancing the composite likelihood recombination rate estimator for use with aligned metagenomic read datasets. We have built our metagenomic population recombination rate estimator upon the approach introduced in the LDhat program, specifically the LDhat pairwise module [12]. LDhat is a well-known and widely used program with support for microbial datasets, specifically for the gene-conversion type of recombination that occurs in microbes; however, it is limited to aligned genome sequences. We have designed our program to work with aligned read-based metagenomic datasets, where the complications of varying depth and short reads need to be addressed. For our implementation, we have also incorporated features from pyrho [23]. pyrho, while lacking support for microbial (haploid) datasets, is a modern composite likelihood estimator implemented in Python. Like pyrho, our program is implemented in Python and aims to make use of modern libraries and their features. As a result of this shared implementation approach, we were able to call applicable functions from pyrho directly, avoiding unnecessary code rewrites.

User Input. We have endeavoured to make the process of preparing a metagenomic dataset for analysis with Rhometa straightforward, with few pre-processing steps. Short reads can be aligned to existing reference genomes representing the metagenomic dataset, to reference MAGs (metagenome-assembled genomes) or to pangenomes representative of the microbial community. Multi-sequence references are also supported. The reference genomes provide a scaffold to align the reads, from which the rates of recombination and mutation are determined. As input files, the Rhometa pipeline requires a FASTA format reference sequence and a BAM file of the metagenomic reads of interest aligned to that reference. For our pipeline evaluation, we used BWA MEM (default parameters) to produce the input BAM file [28] where necessary.
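For readers preparing their own inputs, the following is a minimal sketch of how an input BAM might be produced with BWA MEM (default parameters) and samtools. The file names are placeholders, and the exact commands used for the paper's evaluation may differ.

```python
import subprocess

# Placeholder file names; substitute your own reference and read files.
ref = "reference.fasta"
r1, r2 = "reads_1.fastq.gz", "reads_2.fastq.gz"
bam = "aligned.bam"

# Index the reference once (required by BWA).
subprocess.run(["bwa", "index", ref], check=True)

# Align with BWA MEM (default parameters) and sort the output into a BAM file.
align = subprocess.Popen(["bwa", "mem", ref, r1, r2], stdout=subprocess.PIPE)
subprocess.run(["samtools", "sort", "-o", bam, "-"], stdin=align.stdout, check=True)
align.stdout.close()
align.wait()

# Index the BAM so downstream tools can random-access it.
subprocess.run(["samtools", "index", bam], check=True)
```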

Variant site pairs. The first step of the pipeline involves identifying variant sites (also known as segregating sites). Our program first filters the user-supplied BAM for mapping quality and relative alignment score, and subsequently performs variant calling against the user-supplied reference FASTA using the program freebayes (default parameters, with -p (ploidy) = 1) [29]. The resulting VCF file, containing information on all predicted variant sites, is reduced to single nucleotide polymorphisms (SNPs) only, using bcftools [30]. Rather than individual variant sites, the composite likelihood estimator as implemented in LDhat considers variant site pairs, tracking their count and position within the reference genome’s coordinate space to estimate the recombination rate. For instance, if variant sites are found at reference positions 1, 3, and 5, the set of variant site pairs is (1, 3), (1, 5), and (3, 5).
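As a minimal illustration of the pair enumeration in that example, the following sketch lists all unordered pairs of variant positions; note that Rhometa additionally restricts the pairs to those linked by a read or read pair, as described in the next step.

```python
from itertools import combinations

# SNP positions (reference coordinates) taken from the filtered VCF;
# the values below reproduce the example in the text.
variant_positions = [1, 3, 5]

# All unordered pairs of variant sites considered by the composite likelihood estimator.
variant_site_pairs = list(combinations(variant_positions, 2))
print(variant_site_pairs)  # [(1, 3), (1, 5), (3, 5)]
```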

Pairwise table. The LDhat pairwise module was designed for genome sequences and considers all possible variant site pair combinations across the sequences being analysed. Rhometa restricts its consideration to the set of variant site pairs linked by individual reads or read pairs. For single-end reads, both sites within a variant pair must fall within the extent of an individual read, while for paired-end reads the variants can fall within the insert length. A separation limit of 1000 bp is imposed on paired-end variant site pairs, reflecting a practical upper limit on insert size for current Illumina short-read sequencing technology [31]. Rhometa performs well with both single-end and paired-end reads, with very little difference in the results between the two (S2 Fig). For all accepted variant site pairs, we construct a pairwise table of observational frequency (Table 1). The pairwise table allows for all 16 possible base combinations at any variant site pair. The table also captures the fact that multiple reads can align at a position. Instances where variant site pairs contain an ambiguous base (e.g., N) are ignored.

Table 1. Pairwise table example. https://doi.org/10.1371/journal.pgen.1010683.t001
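The following is a simplified, illustrative sketch of how such a pairwise observation table could be assembled from per-read base calls at variant sites. The data structures are hypothetical and do not reflect Rhometa's actual implementation.

```python
from collections import Counter, defaultdict

# Hypothetical input: for each read (or read pair), the base observed at each
# variant site it covers, e.g. reads[read_id][position] = base.
reads = {
    "read_1": {130: "A", 136: "T"},
    "read_2": {130: "A", 136: "C"},
    "read_3": {130: "G"},            # covers only one site; contributes nothing
}

def pairwise_table(reads, site_pairs):
    """Count the 16 possible two-site base combinations for each variant site
    pair, using only reads (or read pairs) that span both sites."""
    table = defaultdict(Counter)
    for bases in reads.values():
        for i, j in site_pairs:
            if i in bases and j in bases:
                pair = bases[i] + bases[j]
                if "N" not in pair:          # ignore ambiguous base calls
                    table[(i, j)][pair] += 1
    return table

# One 'AT' and one 'AC' observation are recorded for the pair (130, 136).
print(pairwise_table(reads, [(130, 136)]))
```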

Splitting the pairwise table by depth. In the pairwise table, the total alignment depth of a variant site pair is calculated by row summation (e.g., for the pair (130, 136) from Table 1, the total depth is 13 + 21 + 10 = 44). In the whole-genome approach of LDhat this marginal value is a constant, whereas for metagenomic data the depth of coverage can vary greatly across sites. It is therefore necessary to split the pairwise table into constant-depth subtables so that depth can be taken into account and handled in downstream processing.
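A minimal sketch of this splitting step, continuing the hypothetical table structure from the previous sketch:

```python
from collections import defaultdict

def split_by_depth(table):
    """Group variant site pairs by their total alignment depth (the row sum of
    the pairwise table), yielding one constant-depth subtable per depth."""
    subtables = defaultdict(dict)
    for pair, counts in table.items():
        depth = sum(counts.values())   # e.g. 13 + 21 + 10 = 44 for (130, 136)
        subtables[depth][pair] = counts
    return subtables
```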

Bi-allelic pairwise table. For each constant-depth subtable, site pairs that do not contain exactly two alleles are excluded, so that only bi-allelic sites remain in the pairwise table.

Lookup tables. Lookup tables improve the computational efficiency of the composite likelihood approach by precomputing the likelihoods for different configurations of sets of allele pairs. Lookup tables are generated under a fixed population mutation rate and a range of population recombination rates, typically 0–100 [12,19]. We use the program LDpop [32] for generating lookup tables, as it is currently the most feature-rich and efficient program of its kind. Details on how the lookup tables are used can be found in S1 Appendix. It is a standard process, for which we have made use of some functions from pyrho to avoid reimplementation. Generation of lookup tables with LDpop is parameterised by the number of genomes, the range of population recombination rates, and θ per site. Further, the “approx” option is used, which is significantly faster yet still quite accurate compared to LDpop’s exact algorithm. This also makes the generation of large lookup tables, as is necessary for Rhometa, more tractable. In general, generating large lookup tables is a computationally intensive process; more information can be found in the paper associated with LDpop [32].

Watterson’s theta estimate. A subprogram is provided to estimate the population mutation rate (θ = 2N_e u) per site. The formulation is directly based on Watterson’s theta estimate as implemented in LDhat [12], with some changes to the input parameters. The program requires the aligned BAM file and the reference FASTA file, and makes use of freebayes to identify the variant (segregating) sites required for the Watterson estimate. θ per site is a required parameter for lookup table generation. To adjust for read-based datasets, the θ estimate is calculated using the dataset depth (specifically the mean and median depth) in place of the number of genomes used in the original formulation (S4 Fig). For metagenomic samples, where the exact number of genomes is unknown, the true number of distinct genomes is likely to be large; using the depth is therefore reasonable and in many cases is likely to be a conservative estimate of the number of genomes.
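A minimal sketch of the per-site Watterson estimator described above, with the sample size taken as read depth; this is an illustration, not Rhometa's exact code.

```python
def watterson_theta_per_site(num_segregating_sites, n, sequence_length):
    """Watterson's estimator of the per-site population mutation rate,
    theta = S / (a_n * L), with a_n = sum_{i=1}^{n-1} 1/i.  Here the sample
    size n is taken as the mean or median read depth rather than the number
    of genomes."""
    a_n = sum(1.0 / i for i in range(1, n))
    return num_segregating_sites / (a_n * sequence_length)

# Example: 500 segregating sites, median depth 20, 100 kb reference.
print(watterson_theta_per_site(500, 20, 100_000))
```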

Lookup table and depth. The size of each constant-depth pairwise subtable determines which lookup table is matched against for precomputed negative log-likelihoods. In the context of Rhometa and aligned reads, the number-of-genomes parameter used to generate lookup tables is instead taken as the depth. Where the BAM file has high depth, this can result in pairwise subtables of high depth; in such cases, the available lookup tables may not be of high enough depth to cover them. A subsampling feature is included that can automatically downsample the BAM to a given depth. This ensures that positions with a depth exceeding that of the highest generated lookup table are still evaluated rather than omitted from consideration. BAM subsampling uses a random sampling process and permits a list of seed values for testing and identifying any variance stemming from the subsampling. In general, if the depth of the largest available lookup table is small, the increased need for downsampling may reduce estimation accuracy.
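A rough sketch of depth-based BAM downsampling using samtools view -s; the fraction calculation shown here is an assumption for illustration rather than Rhometa's exact subsampling logic.

```python
import shutil
import subprocess

def subsample_bam(in_bam, out_bam, mean_depth, target_depth, seed=0):
    """Randomly downsample a BAM so its depth roughly matches the largest
    available lookup table.  Illustrative sketch only: how the fraction is
    derived is an assumption, not Rhometa's exact logic."""
    fraction = target_depth / mean_depth
    if fraction >= 1.0:
        shutil.copy(in_bam, out_bam)        # already at or below the target depth
    else:
        # samtools view -s expects SEED.FRACTION, e.g. "3.2500" = seed 3, keep 25%.
        subprocess.run(
            ["samtools", "view", "-b", "-s", f"{seed + fraction:.4f}",
             "-o", out_bam, in_bam],
            check=True,
        )
    subprocess.run(["samtools", "index", out_bam], check=True)
```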

Calculating r_ij. The next step is to calculate recombination rate values for each variant site pair; these values are denoted r_ij, with i and j referring to the variant sites. The method of calculation differs between the crossing-over and gene-conversion modes of recombination. Prokaryotes undergo recombination via gene conversion, and the equation used to calculate r_ij is as follows [12]: (1) In Eq 1, c represents the per-base recombination rate, t the mean gene conversion tract length and d_ij the distance between the two sites of a variant site pair. ct is taken together and represents the range of population recombination rates being evaluated; this is typically 0–100 and is the same as the range of ρ values used when generating the lookup tables. The process essentially involves computing r_ij for each variant site pair across the range of population recombination rate values.
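Since the typeset form of Eq 1 is not reproduced here, the sketch below uses a commonly cited gene-conversion parameterisation, r_ij = ct(1 − exp(−d_ij/t)), purely for illustration; the authoritative expression is Eq 1 of the article (following LDhat's pairwise gene-conversion model).

```python
import numpy as np

def gene_conversion_r_ij(rho_grid, d_ij, tract_length):
    """Effective two-site recombination rate under gene conversion for a pair
    of variant sites separated by d_ij bases.  The functional form used here,
    r_ij = ct * (1 - exp(-d_ij / t)), is an assumed illustration; the exact
    expression used by Rhometa is Eq 1 of the article."""
    ct = np.asarray(rho_grid, dtype=float)   # ct: the rho values being evaluated
    return ct * (1.0 - np.exp(-d_ij / tract_length))

# Example: rho grid 0-100, sites 300 bp apart, mean tract length 1000 bp.
r_ij = gene_conversion_r_ij(np.linspace(0, 100, 101), d_ij=300, tract_length=1000)
```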

Final pairwise likelihoods. Next, we bring together the information generated thus far: the outputs of the “Lookup table and depth” step and the “Calculating r_ij” step. For each variant site pair, we take the corresponding negative log-likelihood values from the lookup table and apply linear interpolation to determine the negative log-likelihoods at the r_ij values for that variant pair configuration. The interpolation is performed against the range of population recombination rates used to generate the lookup tables. This process is repeated for all variant site pairs, and the results are the final negative log-likelihoods over the range of population recombination rates being evaluated, which again is typically 0–100.
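A minimal sketch of the interpolation step using NumPy, with hypothetical likelihood values standing in for a real lookup-table row:

```python
import numpy as np

# The population recombination rates the lookup table was built for.
rho_grid = np.linspace(0, 100, 101)

# Precomputed negative log-likelihoods for one variant site pair's
# configuration, taken from the matching constant-depth lookup table
# (hypothetical values shown here).
neg_log_likelihoods = -0.01 * (rho_grid - 30.0) ** 2 - 5.0

# r_ij values for this pair, one per rho being evaluated (assumed form, see
# the previous sketch).
r_ij = rho_grid * (1.0 - np.exp(-300.0 / 1000.0))

# Linear interpolation of the lookup-table likelihood curve at the r_ij values.
interpolated = np.interp(r_ij, rho_grid, neg_log_likelihoods)
```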

Population recombination rate. Prior to computing the final log-likelihood sums, we weight-adjust the negative log-likelihoods. As the number of observations at a given depth represents the degree of evidential support towards the final ρ estimate, we introduce a novel weighting algorithm that accounts for the additional information in high-depth, high-observation-count site pairs relative to low-depth, low-count site pairs. The weighting algorithm is a simple way to reliably give more weight to higher-depth regions and to the number of observations at that depth (there can be many regions of the same depth). The weighting and final negative log-likelihood summation algorithms are as follows: (2) (3) Here, d represents the depths observed in the dataset. Weighting is performed on each per-depth table, denoted w_d(ρ) (Eq 2). On the right-hand side of Eq 2, T_d is the unweighted per-depth table and n_d the number of unique variant site pairs. The reweighted negative log-likelihoods w_d(ρ) are collected across depths and summed with respect to the range of population recombination rates being evaluated. The maximum negative log-likelihood value (the one closest to 0) then corresponds to the final population recombination rate estimate, ρ_max (Eq 3).
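A sketch of the weighting and summation logic; the specific weighting shown (scaling each per-depth curve by depth × number of site pairs) is an assumption for illustration, since the typeset form of Eq 2 is not reproduced here.

```python
import numpy as np

def estimate_rho(per_depth_tables, rho_grid):
    """Combine per-depth negative log-likelihood curves into a single estimate.
    per_depth_tables maps depth d -> (T_d, n_d), where T_d is the summed
    negative log-likelihood curve over the rho grid for that depth and n_d is
    the number of unique variant site pairs at that depth.  The weighting used
    below is only an illustrative assumption; the exact weighting is given by
    Eq 2 of the article."""
    rho_grid = np.asarray(rho_grid, dtype=float)
    total = np.zeros_like(rho_grid)
    for d, (T_d, n_d) in per_depth_tables.items():
        w_d = d * n_d * np.asarray(T_d, dtype=float)   # assumed weighting
        total += w_d
    # The rho maximising the summed negative log-likelihood (closest to zero).
    return rho_grid[int(np.argmax(total))]
```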

Program structure. The program is organised into four pipelines, each dedicated to a specific task. These pipelines are written using Nextflow, a framework for pipeline management [33]. All the scripts used in the individual pipeline steps were written in Python, using various Python libraries. Some Python scripts were adapted or used as-is from the programs LDpop [32] and pyrho [23]. Additional programs used in the pipelines include msprime [34], ART [35], BWA MEM [28], seaborn [36] and samtools [30]. The four pipelines, sim_gen, theta_est, lookup_table_gen and rho_est (Fig 1), correspond to the Nextflow pipeline names (e.g. sim_gen.nf) within Rhometa, and perform the functions described in the following paragraphs.

Fig 1. The pipelines that together make up Rhometa. (A) Pipeline for generating simulated metagenomic read datasets. (B) Pipeline for estimating the population mutation rate. (C) Pipeline for generating the lookup tables required for the recombination rate estimator. (D) Pipeline for estimating the population recombination rate. https://doi.org/10.1371/journal.pgen.1010683.g001

Sim_gen (Fig 1A) is used to generate BAM files and FASTA reference files with simulated reads from bacterial genomes with recombination. The bacterial genomes are simulated using msprime. This pipeline is included primarily so that the simulated datasets used for this paper can be reproduced; it is not required to analyse real datasets. It is maintained in a separate repository and can be accessed at: https://github.com/sid-krish/Rhometa_sim

Theta_est (Fig 1B) is used to determine the population mutation rate per site (θ), based on the Watterson estimate as implemented in LDhat (details in Methods). This pipeline estimates θ on the dataset of interest; θ per site is one of the required parameters for generating lookup tables. The user has the option to use the estimated θ or a different value when generating lookup tables.

The Lookup_table_gen (Fig 1C) component makes use of LDpop and pyrho to generate the lookup tables required for the recombination rate estimator and can be launched in one of two ways: it can either use a pre-generated high-depth lookup table, which is then downsampled for each depth from 3 up to the depth of that lookup table, or it can generate a high-depth lookup table from scratch and then perform the downsampling step. The downsampling algorithm is part of pyrho; it is significantly faster to generate the required smaller lookup tables from a larger table via downsampling, and the results are essentially identical.

The rho_est pipeline (Fig 1D) is used to estimate the population recombination rate of metagenomic read-based datasets provided in the form of BAM and reference FASTA files. It makes use of the lookup tables generated by the lookup_table_gen pipeline. Rhometa is available at: https://github.com/sid-krish/Rhometa

Our pipelines for evaluating LDhat, the Rhometa_full_genome pipeline and the simulated dataset generator for these pipelines can be accessed here: LDhat Nextflow Pipeline: https://github.com/sid-krish/Nextflow_LDhat

Rhometa Full Genome Pipeline: https://github.com/sid-krish/Rhometa_Full_Genome

Nextflow_LDhat_sim (used for both Rhometa Full Genome and LDhat Nextflow Pipeline): https://github.com/sid-krish/Nextflow_LDhat_Sim

Simulated datasets. The development of our program was performed in two major phases. In the first phase, we created a full-genome recombination rate estimation pipeline for bacterial sequences based on the LDhat methodology (Rhometa_full_genome); once we were certain that we could replicate LDhat’s results exactly, we carefully adapted the program to work with read-based datasets (Rhometa).

To evaluate LDhat and Rhometa_full_genome, we utilised msprime [34] to simulate bacterial sequences with recombination. Our simulations included multiple genomes (5–100 genomes) of size 25 kb, under population recombination rates [5, 12.5, 25, 37.5, 50], a mean recombination tract length of 500 bp, 10 replicates (seed values 1–10) and a population mutation rate of 0.01. Lookup tables for population mutation rate 0.01 and population recombination rates 0–100 (101 steps) were used. The LDhat pipeline configured for gene conversion is available at: https://github.com/sid-krish/Nextflow_LDhat. The Rhometa_full_genome pipeline is available at: https://github.com/sid-krish/Rhometa_Full_Genome. The full-genome simulation pipeline is available at: https://github.com/sid-krish/Nextflow_LDhat_Sim. A point of note is that the θ estimator is implemented separately by us, as per [12, Eq 1], in both our LDhat pipeline and the Rhometa_full_genome pipeline. Additionally, all variant sites are used for θ estimation, not just bi-allelic ones.

When simulating the population recombination rate with msprime, the number of samples (genomes), sequence length, recombination rate (r), mean gene conversion tract length (t), seed value and mutation rate (u) are provided; the effective population size (N_e) was 1 (default) and the ploidy (i) was fixed to 1. Default options were used in all other cases. Within msprime, the population recombination rate ρ is calculated as 2 × i × N_e × r × t, and the per-site population mutation rate θ as 2 × i × N_e × u (an illustrative sketch follows below).

Initially the number of genomes was fixed and we varied the length of the genomes, but this analysis revealed that varying the genome length does not have a significant impact on the final population recombination rate estimates (S1 Fig). We therefore fixed the genome length and varied the number of genomes, and in doing so found that as the number of genomes increased, the accuracy and variance of the final estimates also improved (Fig 2).
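As referenced above, the following is a hedged msprime sketch of a haploid simulation with gene conversion along the lines described; the exact parameters and workflow of the Rhometa_sim and Nextflow_LDhat_Sim pipelines may differ.

```python
import msprime

# Illustrative simulation of haploid genomes with gene-conversion style
# recombination, loosely following the parameters described in the text.
ts = msprime.sim_ancestry(
    samples=20,                        # number of genomes
    ploidy=1,                          # haploid (prokaryotic) genomes
    population_size=1,                 # N_e = 1, as in the text
    sequence_length=25_000,            # 25 kb
    gene_conversion_rate=0.025,        # r per site; with t = 500, rho = 2*r*t = 25
    gene_conversion_tract_length=500,  # mean tract length t
    random_seed=1,
)
mts = msprime.sim_mutations(ts, rate=0.005, random_seed=1)  # theta = 2*u = 0.01
print("segregating sites:", mts.num_sites)
```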

Fig 2. Comparison of LDhat and Rhometa_full_genome when running on simulated full genomes. (A) LDhat: simulated vs estimated population recombination rate (ρ) for varying numbers of simulated full bacterial genome sequences. (B) Rhometa_full_genome: simulated vs estimated population recombination rate (ρ) for varying numbers of simulated full bacterial genome sequences. https://doi.org/10.1371/journal.pgen.1010683.g002

We took a similar approach to evaluating the read-based pipeline Rhometa to that used with LDhat and Rhometa_full_genome. For the read-based pipeline, the full bacterial sequences simulated via msprime are further processed into reads using the read simulator ART [35], which simulates sequencing reads by mimicking real sequencing processes with empirical error models. The reads were simulated based on the Illumina HiSeqX TruSeq system. These reads are then aligned to one of the bacterial sequences, which serves as the reference FASTA file; the first of the simulated sequences is used for this (Fig 1A). The aligned BAM and reference FASTA are then used for recombination rate estimation.

The simulation parameters were as follows: the effective population size was 1 (default) and the ploidy was set to 1, with 20–200 genomes, a genome length of 100 kb, population recombination rates [10.0, 20.0, 30.0, 40.0, 50.0], a mean recombination tract length of 1000 bp, 20 replicates (seed values 1–20) and a population mutation rate of 0.01. Each seed value applies to all aspects of the pipeline where a seed is required. The reads were paired-end of length 150 bp, with an insert length of 300 bp and a standard deviation of 25 bp; the window size was set to 1000 during analysis and the fold coverage values were [1, 4, 8, 16]. The lookup tables were generated and used for 3–250 genomes, under a population mutation rate of 0.01, for population recombination rates 0–100 (0–1 in 101 steps plus 1–100 in 100 steps). BAM subsampling was also applied automatically by Rhometa during analysis where needed.

Additionally, for the read-based pipeline, we evaluated the deviation of the estimated results from the simulated values. The formula used to calculate the deviation is (Estimated ρ (mean) − Simulated ρ) / Simulated ρ. This makes it easier to gauge the magnitude of deviation from the expected value.
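The deviation metric is straightforward; in Python:

```python
def relative_deviation(estimated_rho_mean, simulated_rho):
    """Relative deviation of the mean estimated rho from the simulated value,
    as used to summarise the read-based pipeline's accuracy."""
    return (estimated_rho_mean - simulated_rho) / simulated_rho

# Example: a mean estimate of 27.5 against a simulated rho of 25 deviates by +10%.
print(relative_deviation(27.5, 25.0))  # 0.1
```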

Real Datasets—Transformation experiment. To further evaluate Rhometa, we applied our pipeline to data derived from a previously published laboratory transformation experiment in which the extent and distribution of recombination events were quantified. In the experiment [37], in vitro recombination through transformation was performed on a S. pneumoniae strain. Transformed isolates were then sequenced and recombination events were identified. This dataset was also used by the authors of mcorr to evaluate their method, so it provides an opportunity to compare the results of our pipeline against those published in the mcorr paper. The transformation experiments were performed with different concentrations of donor DNA, 5 ng mL-1 and 500 ng mL-1. The 5 ng mL-1 and 500 ng mL-1 experiments (S2 Table) had a similar number of recombination events, with the 5 ng mL-1 experiment having a slightly larger number; the authors state that this indicates a single piece of DNA can act as the origin for multiple recombination events. The dataset is available in the form of reads, which Rhometa was designed to analyse.

Each 5 ng mL-1 sample from experiment 1 was aligned to the S. pneumoniae reference sequence ATCC 700669 (NCBI accession NC_011900.1); the resulting BAM files were then merged and analysed with Rhometa. To analyse the datasets, we first estimated θ for the median depth using the θ estimation pipeline; the resulting θ is per site by default. We then generated lookup tables, based on this θ, for population recombination rates 0–20 in 201 steps for 3–200 genomes, and used these lookup tables in the recombination rate estimation pipeline. Subsampling was enabled, with a window size of 1000 for paired-end reads. As given in the Croucher et al. paper [37], we used t = 2300 bp as the mean tract length for the analysis. Additionally, we used 5 different seed values [0, 1, 2, 3, 4] for the subsampling step to account for any variance, and then took the average of the resulting recombination rate estimates.

Using the ρ and θ estimates along with information from the experiment, we also calculated the ρ per site and r/m values (a worked sketch of these conversions appears at the end of this subsection). The default ρ estimate produced by Rhometa is a whole-genome estimate; to obtain ρ per site, the estimated ρ value was divided by the mean tract length of 2300 bp. To obtain the r/m value, we used the conversion formula [26]: r/m = ρ (per site) / θ (per site) × tract length × substitution probability. We estimated the substitution probability between the donor and recipient to be (17,534 − 385) / 2,221,315 ≈ 0.0077, where 2,221,315 bp is the recipient genome length, based on the information provided by Croucher et al. (2012): “… the donor DNA identified 17,534 SNPs when aligned to the recipient sequence, of which 385 appeared to be false positives … These positions were excluded from subsequent analyses.” [37].

We repeated the process above for each 500 ng mL-1 sample from experiment 1, and the final merged BAM was analysed with Rhometa. Furthermore, the 5 ng mL-1 and 500 ng mL-1 samples from experiment 1 were analysed together, corresponding to 84 sequences; we prepared this dataset by merging the final 5 ng and 500 ng BAMs. We performed this analysis to enable comparison with mcorr’s published results. The analysis was performed using the same process as for the 5 ng mL-1 and 500 ng mL-1 samples.
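As referenced above, the per-site ρ and r/m conversions amount to the following; the ρ and θ values shown are placeholders, with only the substitution probability and tract length taken from the text.

```python
def rho_per_site(rho_genome_wide, tract_length):
    """Convert Rhometa's genome-wide rho estimate to a per-site value."""
    return rho_genome_wide / tract_length

def r_over_m(rho_site, theta_site, tract_length, substitution_probability):
    """r/m conversion formula quoted in the text (after ClonalFrameML [26])."""
    return (rho_site / theta_site) * tract_length * substitution_probability

# Substitution probability between donor and recipient, from Croucher et al. [37]:
# (17,534 SNPs - 385 false positives) / 2,221,315 bp recipient genome length.
sub_prob = (17_534 - 385) / 2_221_315   # ~0.0077

# Placeholder estimates, not the values reported in the paper.
rho_site = rho_per_site(rho_genome_wide=1.0, tract_length=2300)
print(r_over_m(rho_site, theta_site=0.01, tract_length=2300,
               substitution_probability=sub_prob))
```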

---
[1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010683
