(C) PLOS One

This story was originally published by PLOS One and is unaltered.

nf-core/airrflow: An adaptive immune receptor repertoire analysis workflow employing the Immcantation framework [1]

Gisela Gabernet (Department of Pathology, Yale School of Medicine, New Haven, Connecticut, United States of America; Quantitative Biology Center, Eberhard-Karls University of Tübingen, Tübingen), Susanna Marquez

Date: 2024-08

nf-core/airrflow benchmarking with simulated data

To evaluate the ability of nf-core/airrflow to recover immune repertoire sequences and infer clonal relationships from ground truth sequencing data, we simulated three BCR repertoires with known V(D)J sequences, varying clonal abundances and increasing frequency of sequencing errors. The germline sequences were generated by simulating V(D)J recombination with ImmuneSIM [68]. Clonal lineage trees and somatic hypermutation were either simulated with SHazaM to obtain a power law and a uniform clonal size distribution (repA and repB, respectively) or extracted from a previously published real BCR repertoire sample [69] (repC). To achieve a frequency similar to that observed in the real BCR repertoire repC, 5,000 singleton sequences, representing naive B cells that are not clonally expanded, were added to the synthetic repertoires repA and repB (see S1 Text and Fig A in S1 Text for further details on the repertoire simulation). We then simulated paired-end raw sequencing data for each repertoire with Grinder [70]. To assess the impact of sequencing errors on the final sequence recovery, the sequencing data simulations were performed with increasing percentages of sequencing errors modeled in a linear fashion along the read length, increasing from zero to a predetermined percentage, in accordance with previous studies on Illumina sequencing error values and their distribution along the read positions [71,72]. Five simulated libraries were prepared, with 0%, 0.1%, 0.25%, 0.5%, and 1.0% sequencing errors in the center of the reads. Additionally, two library preparation protocols were compared: with UMIs (UMI) and without UMIs (sans-UMI). Adding UMIs to the library preparation procedure offers the potential for sequencing error correction by constructing a consensus sequence from the recovered sequences with identical UMIs, which is a widely used strategy in BCR and TCR sequencing protocols.
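The linear error model described above can be illustrated with a minimal Python sketch. It is not the Grinder implementation: the function names are hypothetical, only substitution errors are modeled, and the `end_rate` parameter is an assumption standing in for the "predetermined percentage" reached at the last read position.

```python
import random


def linear_error_profile(read_length, end_rate):
    """Per-position error probability rising linearly from 0 to end_rate.

    `end_rate` is an illustrative parameter (a fraction, e.g. 0.01 for 1%),
    not a value taken from the benchmark configuration.
    """
    if read_length < 1:
        raise ValueError("read_length must be at least 1")
    if read_length == 1:
        return [end_rate]
    return [end_rate * i / (read_length - 1) for i in range(read_length)]


def add_substitution_errors(read, profile, rng):
    """Replace each base with a different base with its position's probability."""
    bases = "ACGT"
    mutated = []
    for base, error_prob in zip(read, profile):
        if rng.random() < error_prob:
            # Pick one of the three other nucleotides uniformly at random.
            mutated.append(rng.choice([b for b in bases if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


if __name__ == "__main__":
    rng = random.Random(42)
    read = "ACGT" * 25  # toy 100-nt read
    profile = linear_error_profile(len(read), end_rate=0.02)
    noisy = add_substitution_errors(read, profile, rng)
    print(sum(a != b for a, b in zip(read, noisy)), "substitutions introduced")
```

Under this assumption the mean error rate over the read is half of `end_rate`, so errors concentrate toward the 3' end of each simulated read.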

The simulated BCR sequencing libraries were processed with nf-core/airrflow, and the ability to recover the original sequences in the simulated repertoires was evaluated (Fig 2 and Fig B in S1 Text). For the protocol including UMIs, we evaluated the proportion of correctly identified sequences for each of the repertoires with exact sequence matches (sensitivity exact matches) and additionally considered the matches of sequences that contain “N” nucleotides (Fig 2A and 2B). The N nucleotides are introduced when UMIs are used for error correction by building a consensus sequence and the consensus base falls below a frequency threshold (default minimum frequency 0.6) or quality threshold (default minimum quality 0), so that there is insufficient consensus to call a particular base. nf-core/airrflow correctly recovered over 99% of the sequences in each of the three repertoires when no simulated sequencing errors were present. Sequencing errors decreased the proportion of correctly identified sequences, but sensitivity was maintained above 97% for exact sequence matches and above 98% for matches containing “N” nucleotides due to not reaching sufficient consensus. Two incorrect sequences were reported, owing to the rare occurrence of duplicate UMIs being assigned to two highly similar sequences from the same clone and of two sequences with the same UMI carrying simulated errors at the same position (Fig 2C). The number of missing sequences ranged from 100 to 300 out of a total of 21,321 (repA), 20,959 (repB), and 15,329 (repC). The sensitivity and number of missing sequences were comparable to those of MiXCR [28], an alternative pipeline for bulk and single-cell AIRR-seq data analysis (Fig 2A–2D and Table A in S1 Text). Differences in the number of correctly recovered sequences can be explained by the different sequencing error correction and UMI error correction methods used by nf-core/airrflow and MiXCR.
nf-core/airrflow performs UMI error correction with pRESTO BuildConsensus and ClusterSets, which allows the identification of UMI groups with a dissimilar sequence of origin [21]. MiXCR, on the other hand, performs UMI error correction by clustering UMI sequences and assigning small UMI groups to the closest larger UMI group [28]. When simulating a protocol without UMIs, sensitivity started at 99% in the absence of sequencing errors but dropped below 50%, with up to 125 incorrect sequences reported, as sequencing errors increased (Fig C in S1 Text), reinforcing the importance of using protocols that include UMIs for sequencing error correction. The high number of missing sequences is due to a quality control step that eliminates sequences lacking at least two representative copies, as such sequences are attributed to sequencing errors (Fig C in S1 Text).
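To make the origin of the “N” nucleotides and the two sensitivity measures concrete, the following Python sketch builds a per-UMI consensus with a minimum frequency threshold and scores recovery with and without tolerance for masked bases. It is an illustration under stated assumptions, not the pRESTO BuildConsensus algorithm: reads are assumed to be pre-aligned and of equal length, base qualities are ignored, all function names are hypothetical, and treating the N-tolerant sensitivity as "exact matches plus matches differing only at N positions" is our reading of the text.

```python
from collections import Counter, defaultdict


def consensus(reads, min_freq=0.6):
    """Column-wise majority consensus; bases below min_freq are masked as 'N'."""
    if not reads:
        raise ValueError("need at least one read")
    if any(len(r) != len(reads[0]) for r in reads):
        raise ValueError("reads must be aligned to equal length")
    called = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        called.append(base if count / len(reads) >= min_freq else "N")
    return "".join(called)


def consensus_by_umi(tagged_reads, min_freq=0.6):
    """Group (umi, sequence) pairs by UMI and build one consensus per group."""
    groups = defaultdict(list)
    for umi, seq in tagged_reads:
        groups[umi].append(seq)
    return {umi: consensus(seqs, min_freq) for umi, seqs in groups.items()}


def matches_allowing_n(recovered, truth):
    """True if the sequences agree at every position that is not masked."""
    return len(recovered) == len(truth) and all(
        r == t or r == "N" for r, t in zip(recovered, truth)
    )


def sensitivity(recovered, truth, allow_n=False):
    """Fraction of distinct truth sequences found among the recovered ones."""
    truth_set = set(truth)
    recovered_set = set(recovered)
    found = 0
    for t in truth_set:
        if t in recovered_set:
            found += 1
        elif allow_n and any(matches_allowing_n(r, t) for r in recovered_set):
            found += 1
    return found / len(truth_set)


if __name__ == "__main__":
    reads = [("UMI1", "ACGT"), ("UMI1", "ACGT"), ("UMI1", "ACGA"),
             ("UMI2", "TTTT"), ("UMI2", "TTTA")]
    print(consensus_by_umi(reads))
```

In the toy input, the UMI2 group has only two reads that disagree at the last position, so neither base reaches the 0.6 frequency threshold and that position is masked.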

Fig 2. Performance assessment of the nf-core/airrflow pipeline on three simulated BCR repertoires compared to MiXCR. Sensitivity of the nf-core/airrflow and MiXCR pipelines on the data simulated with UMIs. Sensitivity was calculated for exact VDJ sequence matches to the truth repertoires (A), and matches that contain N nucleotides due to not reaching sufficient consensus (B). C. Number of incorrect sequences. D. Number of sequences present in the truth repertoires that were not identified by the pipelines. E. Number of clones identified by each pipeline. The discontinuous line indicates the true number of clones in the simulated repertoires. F. Mean clonal abundance (solid line) with max and min intervals (shaded area) from n = 200 bootstrap samples of N = 15,250 sequences of the three simulated repertoires. The x axis shows the clone rank number when ordering the clones by size from largest to smallest. https://doi.org/10.1371/journal.pcbi.1012265.g002

In addition to recovering the original sequences in the sample, AIRR-seq data analysis often involves determining the clonal relationships of the individual sequences. This is important to assess whether a sequence comes from an expanded clone of the same progenitor cell. This step is particularly relevant in BCR data analysis, as mutations are introduced during clonal expansion by targeted somatic hypermutation. Thus, we assessed the ability of the workflow to recover the original number and size distribution of B-cell clones (Fig 2E). Close to 5,500 clones and 5,700 clones were identified for repA and repB (ground truth 5,100 for both), and 6,500 clones for repC (ground truth 6,305). The number of clones was overestimated by 7%, 11%, and 3%, respectively, but was robust with respect to simulated sequencing errors. The clonal abundance distribution (Fig 2F) reflected the true clonal distribution in all three repertoires for the simulated protocol with UMIs. Clonal inference by the Immcantation tools incorporated into nf-core/airrflow was superior to the inference method implemented in MiXCR (Fig 2E and 2F). While we used the default parameters for performing clonal inference with the MiXCR pipeline, adjusting these parameters could potentially lead to improved clonal inference results. The number of clones and the clonal abundance inferred by both tools were affected by the increasing sequencing errors in the sans-UMI protocol, highlighting once more the importance of UMI error correction (Fig B in S1 Text). Regarding runtime, nf-core/airrflow took 2 h 2 min to process the UMI benchmarking data and 1 h 47 min to process the sans-UMI benchmarking data, with an average time per sample of 8 min and 7 min, respectively (Table C in S1 Text).
When including the lineage tree reconstruction step, which is skipped by default, users should be aware that this is often the most time-consuming step and that the method chosen (maximum likelihood vs maximum parsimony) and the presence of large clones in the repertoire, as is the case in repA, can greatly influence the runtime (Table C in S1 Text).
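The clonal abundance curves in Fig 2F are described as the mean, minimum and maximum over n = 200 bootstrap samples of N = 15,250 sequences, with clones ordered by size. A minimal Python sketch of such a rank-abundance bootstrap is given below. It is an assumption-laden illustration rather than the procedure used by the Immcantation tools: the function names are hypothetical, resampling is plain sampling with replacement from the per-sequence clone labels, and ranks missing from a given resample are counted as zero abundance.

```python
import random
from collections import Counter


def rank_abundance(clone_ids):
    """Relative clone sizes, ordered from the largest clone to the smallest."""
    counts = sorted(Counter(clone_ids).values(), reverse=True)
    return [count / len(clone_ids) for count in counts]


def bootstrap_rank_abundance(clone_ids, n_boot=200, sample_size=15250, seed=0):
    """Return (mean, min, max) relative abundance for each clone rank."""
    rng = random.Random(seed)
    curves = [
        rank_abundance(rng.choices(clone_ids, k=sample_size))  # with replacement
        for _ in range(n_boot)
    ]
    max_rank = max(len(curve) for curve in curves)
    summary = []
    for rank in range(max_rank):
        # A rank absent from a resample contributes an abundance of zero.
        values = [curve[rank] if rank < len(curve) else 0.0 for curve in curves]
        summary.append((sum(values) / n_boot, min(values), max(values)))
    return summary


if __name__ == "__main__":
    # Toy repertoire: one clone label per sequence.
    clone_ids = ["c1"] * 60 + ["c2"] * 30 + ["c3"] * 10
    for rank, (mean, low, high) in enumerate(
        bootstrap_rank_abundance(clone_ids, n_boot=50, sample_size=100), start=1
    ):
        print(rank, round(mean, 3), round(low, 3), round(high, 3))
```

Drawing every resample at the same depth keeps the curves of repertoires with different total sequence counts comparable.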

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012265

Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
