(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:
https://journals.plos.org/plosone/s/licenses-and-copyright
------------
Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences
['Grace A. Blackwell', 'Embl-Ebi', 'Wellcome Genome Campus', 'Hinxton', 'United Kingdom', 'Wellcome Sanger Institute', 'Martin Hunt', 'Nuffield Department Of Medicine', 'University Of Oxford', 'Oxford']
Date: 2021-11
Abstract The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
Citation: Blackwell GA, Hunt M, Malone KM, Lima L, Horesh G, Alako BTF, et al. (2021) Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLoS Biol 19(11): e3001421.
https://doi.org/10.1371/journal.pbio.3001421 Academic Editor: William P. Hanage, Harvard School of Public Health, UNITED STATES Received: March 3, 2021; Accepted: September 21, 2021; Published: November 9, 2021 Copyright: © 2021 Blackwell et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: The 661,405 assemblies as well as the COBS, minHash and pp_sketch indices are available from: ftp.ebi.ac.uk/pub/databases/ENA2018-bacteria-661k. In addition, the 661,405 assemblies have been deposited as third-party assemblies to the European Nucleotide Archive (ENA) under Umbrella project PRJEB46036 (individual project accessions for the major species can be found in S1 Table). The pipeline used for download and assembly of reads from the ENA
https://github.com/iqbal-lab-org/assemble-all-ena. Additional metadata, characterisation files and Rnotebooks detailing figure generation have been deposited in figshare (
https://dx.doi.org/10.6084/m9.figshare.16437939): -Full metadata downloaded from the ENA for each assembly in json form (Json1_ENA_metadata) -Full QC and general characterisation including AMR gene and plasmid replicon detection, for each assembly in json form (Json2_QC_characterisation_amr_plasmid) -Kraken/Bracken output including the top 50 species for each assembly (File1_full_krakenbracken) -The taxid lineage of the major species determined using NCBITaxa (File2_taxid_lineage_661K) -Summarised metadata from the ENA for each assembly (File3_metadata_661K) -Summarised QC and general characterisation for each assembly (File4_QC_characterisation_661K) -Summarised AMR genes, MCR status, plus genes, plasmid replicons for each assembly (File5_AMR_plasmids_661K) -Presence/absence matrix of AMR genes in each assembly (File6_AMR_presenceabsence_661K) -Class of each AMR gene extracted by AMRFinder (File7_gene_class_AMRFinder) -Presence/absence matrix of plasmid replicons in each assembly (File8_plasmidreplicons_presenceabsence_661K) -All available metadata from the ENA for each sample in the 661K (File9_all_metadata_ena_661K.txt) -Bacterial assemblies in GenBank as of the 6th March 2021 (File10_Genbank_assemblies_06.03.21.txt) -Bacterial assemblies in PATRIC as of the 6th March 2021 (File10_PATRIC_assemblies_06.03.21.txt) -Code used to generate figures in the QC and filtering section (Rnotebook1_QC_filtering_section) -Code used to generate figures in the Species breakdown section (Rnotebook2_species_breakdown_section) -Code used to generate figures in the AMR section (Rnotebook3_AMR_section_figures) -Code used to generate the figures comparing the composition of the 661K, GenBank assemblies and PATRIC database (Rnotebook4_661K_vs_genbank_patric). Funding: This work was supported by Wellcome Trust (206194). M.H. was funded by a Wellcome Trust/Newton Fund-MRC Collaborative Award (200205) and an award from the Bill & Melinda Gates Foundation Trust (OPP1133541). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist. Abbreviations: AMR, antimicrobial resistance; COBS, COmpact Bit-sliced Signature; DDBJ, DNA Data Bank of Japan; DRA, DDBJ, Sequence Read Archive; EMBL-EBI, European Bioinformatics Institute; ENA, European Nucleotide Archive; GTB, Genome Taxonomy Database; INSDC, International Nucleotide Sequence Database Collaboration; MCR, multi-class resistant; NCBI, National Center for Biotechnology Information; SNP, single nucleotide polymorphism; SRA, Sequence Read Archive; WGS, whole genome sequencing
Introduction The widespread availability of high-throughput sequencing has resulted in a huge wealth of bacterial genomic data collected from countries all over the world that are shared openly through the public archives, representing a unique and essential resource. Studying the extreme diversity of bacterial species is of broad interest to communities working in basic science, agriculture, and medicine. Beyond their primary function of genomic data storage, sequence repositories show trends in funding, biases in the collection strategies of bacteria, and even reveal the drive and focus of individuals pursuing particular lines of research. Sequence read data are held by members of the International Nucleotide Sequence Database Collaboration (INSDC) [1], who include the DNA Data Bank of Japan (DDBJ), the European Bioinformatics Institute (EMBL-EBI), and the National Centre for Biotechnology Information (NCBI). Submission of genomic data to the European Nucleotide Archive (ENA) (EMBL-EBI) or its INSDC partners (DDBJ Sequence Read Archive, DRA, for DDBJ and Sequence Read Archive, SRA, for NCBI) has become a central and mandatory step in the dissemination of research to the scientific community and a way to ensure open and free access to data [1]. Each of these repositories host the raw read data as well as genome assemblies, at different levels of completeness, which have been submitted by a user. These archives are continuing to grow at a remarkable rate, with current estimation of doubling time of datasets in the ENA to be just over 2 years (
https://www.ebi.ac.uk/ena/browser/about/statistics). The ever increasing data size presents difficulties for storage capacity. A general user’s ability to access and effectively use the data is restricted, whether due to their computational skills, the biological question, the volume of data, the IT infrastructure, or other resources required. Furthermore, once a user has their genomes of interest, significant processing for quality control, including remapping of the underlying raw data, is ideally required prior to applying specific analyses. Even after that is done, any potential discovery needs to be carefully reviewed in the light of the fact that different genome assemblers were applied to each genome [2], and, realistically, it is very hard to rule this out with confidence. Careful users therefore might assemble the raw data from scratch, and, over time, many of these processing steps will be performed repeatedly by different researchers worldwide. Other databases exist that provide a high level of curation, including NCBI’s RefSeq [3]. RefSeq (195,316 assemblies in September 2020) is composed of a selection of assemblies that have been submitted to INSDC databases that meet their quality control requirements, and most have been reannotated using NCBI’s prokaryotic genome annotation pipeline [4] to provide consistency across the data. The assemblies are widely used for taxonomic identification [5,6], but are also commonly used to examine the distribution of genes or elements of interest or as test sets for new algorithms or programmes [7,8]. However, the RefSeq assemblies have been collated progressively over time using a range of assembly algorithms, making the assemblies less consistent and so potentially more problematic for drawing wide-ranging conclusions [9,10]. Attempts to standardise the assembled dataset tend to have a community focus such as Enterobase, which holds sequencing data from the Enterobacteriaceae, and includes curated genome data for 466,670 Salmonella, Escherichia/Shigella, Clostridioides, Vibrio, Helicobacter, Yersinia, and Moraxella genomes [11]. Enterobase gathers sequence data with associated metadata by actively searching for new sequence submissions for supported genera or through direct submissions. The raw data are then processed in a uniform way (assembly and annotation), and basic organism-specific typing is performed [11]. However, while standardised, the scope of this type of database is by definition limited. Depending on an individual’s focus, this can act to further fragment genome data and lead to even more incompatibility issues if the complete genome dataset, agnostic of organism, is to be analysed. The above databases lack comprehensive facilities for obtaining subsets of interest matching different criteria, and the support of a few key use cases would significantly transform their utility. First, it should be possible to filter for genomes containing a given DNA sequence (e.g., a specific single nucleotide polymorphism (SNP), gene, or plasmid). We know that there is currently no known algorithm that allows a full BLAST alignment on this scale, but it has been shown that k-mer indexes allow presence/absence search for SNPs, genes, and plasmids [12]. Secondly, it would be valuable to be able to select a subset that is representative of diversity of a species, genus, or other taxon—this would be enabled by precalculating genetic distances or a phylogenetic tree. Finally, availability of lightweight methods for whole genome comparison [7,13] would enable users to easily query the database for any genomes similar to their own. With these, a host of possibilities can be realised including the ability to easily perform generic queries of a user such as “Has this plasmid isolated from Salmonella enterica ever been seen in other species?,” “Does this SNP occur in many Escherichia coli genomes?,” and “How do I get a set of Listeria genomes representative of the genomic diversity?.” Here, we present a uniformly processed archive of 661K bacterial genomes that were available in the ENA at the end of November in 2018. These include 311,006 isolates that did not previously have publicly available assemblies. Through the quality control steps, characterisation of the assemblies, and the provision of multiple search tools, we remove technical barriers for the interrogation of the public sequences. We use these data to examine the composition of the sequencing archives and in doing so highlight the influence of sampling and sequencing trends.
Discussion Bacteria are a vast, diverse, and ancient family of single-celled organisms that dominate this planet. In our efforts to understand and categorise this most abundant life form, hundreds upon thousands of bacterial sequences are submitted yearly into sequence archives such as the ENA. In the last 2 decades and with the advent of cheap high-throughput short read sequencing, the trend has moved away from the submission of finished or draft genome assemblies to one where simply the raw reads are submitted to public archives. These data usually require substantial preprocessing before they are analysis ready. This takes significant time, expertise, and computational power to do. By uniformly processing the data present in the ENA in November of 2018, we have collated a set of 661,405 standardised assemblies, including 311,006 that were previously only available in the read archives. The additional standard characterisation and quality control we have performed enables the data to be easily subsetted for the purposes of identifying all the assemblies of a particular species or sequence type or to those containing a specific AMR gene. Furthermore, this dataset can be interrogated for a specific gene or mutation using the COBS search index, for a specific genome by use of the provided MinHash index and glean estimations of genetic distances of genomes of interest using the pp-sketch index. These facilities hint at the power of this unified resource, allowing phylogenetic relationships between genomes to be quickly elucidated and hypotheses rapidly tested. This resource will empower more scientists to harness the multitude of data in the ENA both for surveillance and public health projects, as well as to address questions of basic science. The overrepresentation of species belonging to Proteobacteria, Actinobacteria, Firmicutes, or Bacteriodetes phyla is a trend consistent with reports of public database composition from 2002 [40] and 2015 [41]. Concerningly, there has been a reduction in the species diversity outside of these 4 phyla, decreasing from 29% in 2002, 8% in 2015, and 0.68% of all submissions analysed in the 661K snapshot. The count of 2,336 species in this snapshot is well below the number of bacterial species in the taxonomic databases such as NCBI taxonomy (>20,000 species,
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2) and Genome Taxonomy Database (GTB; >30,000 species,
https://gtdb.ecogenomic.org/). Some of the sequence diversity within the snapshot may have been missed due to limitations of the Kraken 2 database used for taxonomic assignment and abundance estimation, a research project in its own right. For a small proportion of the assemblies (6.1%), a major species could not be assigned with high confidence, despite being shown by CheckM to contain little or no contamination, indicating that there was not a good match for it in the database (see Methods and S2D Fig). The inclusion of genomes originating from metagenomic sequences from different sources (e.g., gut, skin, soil, and ocean) would likely improve the overall species diversity, but the methods of assembly and analysis are very different to those used here. Many of the sequenced genomes could be defined as MCR based on the carriage of AMR genes. While we observe many occurrences of AMR mechanisms in the 661K assemblies, both in the organisms that are already known to be problematic (species outlined on WHO priority pathogens list) and in newly emerged threats (such as E. bugandensis, C. striatum, and R. planticola), it is difficult to estimate how well these reflect the true prevalence of resistance in a given species. This is due to many biases, including, for example, how isolates were sampled, and some projects implementing preselection steps with only the antimicrobial resistant strains being then sequenced [42–44]. This intrinsically biases the archive, preventing prevalence estimations. It also limits the power to track the origins of accessory genes, and, consequently, the species interactions that can be inferred from this. Ideally, strategies to sequence a wider variety of species, including susceptible isolates, from diverse environments and global locations must be implemented before the dynamics of gene flow can be accurately studied. The uniform resource of 661K bacterial assemblies that we present here removes several technical barriers to harnessing the wealth of public data stored in the ENA, enabling a broader community to access and leverage these data for their research. We envisage this to be a valuable resource that can provide the substrate for a wide range of future studies. Nevertheless, it is intrinsically limited through the nature of our scientific practice, by the diversity of sequences it holds. Rather, the current composition highlights the influences of the past quarter of century of funding and scientific focus. The enormous contribution of just a few projects shows that even the drive and focus of individual groups has influenced our view of recent bacterial diversity. Sampling and sequencing strategies must change if we want to reveal the bacterial “tree” of life.
Methods Download of reads, assembly, and characterisation of genomes The bacterial whole genome sequencing (WGS) datasets in the ENA as of November 26, 2018 were downloaded and assembled as a part of an assembly pipeline (
https://github.com/iqbal-lab-org/assemble-all-ena) [45]. Only paired-end reads were included, and those where the instrument platform was “PACBIO_SMRT” or “OXFORD_NANOPORE” were excluded. In addition, those with a library source of “METAGENOMIC” and “TRANSCRIPTOMIC” were also ignored. Available metadata and appropriate reads were downloaded, and if multiple read sets were available, they were appended together. Reads were assembled using Shovill v1.0.4 (Seeman, T.,
https://github.com/tseemann/shovill) with default options. Shovill uses SPAdes (v3.12.0) [46] for assembly and includes some additional pre- and post-processing steps that utilise Lighter [47], FLASH [48], Trimmomatic [49], SAMtools [50], BWA-MEM [50,51], seqtk (
https://github.com/lh3/seqtk), Pilon [52], and samclip (Seeman, T.,
https://github.com/tseemann/samclip) to speed up the assembly and to correct minor assembly errors. A total of 664,877 assemblies were produced by this pipeline. Separate from the assembly pipeline, Kraken 2 v2.0.8-beta [6] was run on the read FASTQ files using the Kraken 2 microbial database (2018, 30 GB), and the resulting taxonomy labels assigned by Kraken 2 were analysed by Bracken v2.5 [14] to estimate the species abundance within each set of reads. From the assemblies, contigs of less than 200 bp were removed using the script available at
https://github.com/sanger-pathogens/Fastaq, and contigs of k-mer depth less than 10 were noted, but not removed. QUAST v5.0.2 [53] was used to summarise assembly statistics, and CheckM v1.1.2 [54] using the “--reduced_tree” flag was used for estimations of completeness and contamination of an assembly. Assemblies with a genome length of less than 100 kb or longer than 15 Mb were removed (3,472 assemblies), leaving 661,405 assemblies. A MinHash sketch of each assembly (“-n 5000”) was produced using sourmash v3.5.0 [13]. A searchable k-mer database of the 661K assemblies was constructed by COBS (
https://github.com/bingmann/cobs, checkout 7c030bb) using “compact-construct” with default options [17]. Core and accessory distances were calculated between the assemblies using poppunk_sketch v1.5.1 with default options except “--k-step 3” [18]. MLST was determined where possible using mlst v2.19.0 (Seeman, T.,
https://github.com/tseemann/mlst), E. coli phylotype determined using clermonTyping v1.4.1 [55], and Salmonella were serotyped using SeqSero2_package.py v1.1.1 [56]. Plasmid replicons were detected using Abricate v1.0.1 (Seeman, T.,
https://github.com/tseemann/abricate) with the plasmidfinder 2020-May-7 database [57], and AMR, heavy metal, and virulence genes were detected using AMRFinderPlus v3.6.15 [58], with standard thresholds of minimum identity (curated cutoff if it exists and 0.9 otherwise) and default coverage of 0.5. All figures were generated in R using ggplot2 v3.3.3 [59] and where required were edited manually using Inkscape 2 v0.92. Taxid lineage, species comparison, and adjustment species abundance The taxid lineage of the major bracken species was acquired by NCBITaxa [60]. Where the major species from the Bracken analysis belonged to either the M. tuberculosis complex or B. cereus s.l. complex or was identified as a Shigella sp. or an E. coli, the remainder of the read assignments were examined to see if they belonged to other members of that complex. If they were members, their assigned percentage was added to that of the major species. High-quality assemblies Filtering was applied using the reports generated by QUAST and CheckM analysis for each genome. The high-quality assemblies met the requirements of less than 2,000 contigs, a genome length that is within the acceptable range for that species, as calculated for species with 4 or more assemblies in GenBank (
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/species_genome_size.txt.gz, August 27, 2020) or is unknown, a N50 of greater than 5,000, a completeness score of at least 90%, and a contamination score of less than or equal to 5%. In total, 639,981 assemblies met these requirements. Composition against other public databases The list of genomes in NCBI assemblies was downloaded (
https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt, March 6, 2021) [19], and the taxid lineage of each genome was acquired by NCBITaxa [60]. In total, 867,940 belonged to the bacterial superkingdom. The list of 422,590 genomes that were in PATRIC [20] (March 6, 2021) were also downloaded. The sample accessions in the 3 databases (661K snapshot, NCBI assemblies, and PATRIC) were compared using UpSetR v1.4.0. Multi-class resistance MCR was defined as containing genes conferring resistance to at least 3 classes of antimicrobial (antimicrobial classes were extracted from the AMRFinderPlus output). Only species with at least 10 samples were included, and a species was classed as MCR if at least 50% of individual assemblies were MCR.
Supporting information S1 Table. Projects under Umbrella project PRJEB46036.
https://doi.org/10.1371/journal.pbio.3001421.s001 (DOCX) S1 Fig. Quality control measures used to filter genome assemblies. (A) Distribution of number of contigs per assembly in the collection. A total of 1,766 assemblies had greater than 3,000 contigs. Red line: Assemblies with more than 2,000 contigs were filtered from the high-quality assemblies. (B) Distribution of the N50 of each of the assemblies in the collection and 26,142 had an N50 of greater than 500,000. Red line: assemblies with an N50 less then 5,000 were filtered from the high-quality assemblies. (C) Comparison of genome size of each assembly to that expected of its species. Where available, the genome size range accepted for each species was extracted from
ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/species_genome_size.txt.gz, downloaded August 27, 2020. Those genomes that are greater than or less than the expected length was filtered from the high-quality assemblies. (D) Correlation between the genome completeness and contamination percentages produced by CheckM for each assembly. A total of 1,785 assemblies had a contamination score greater then 100%. Red lines indicate cutoffs applied; bottom right corner are the high-quality genomes. The data underlying this figure may be found in
https://doi.org/10.6084/m9.figshare.16437939.
https://doi.org/10.1371/journal.pbio.3001421.s002 (TIFF) S2 Fig. QC of the 661,405 assemblies. Number of (A) assemblies and (B) species remaining following each stage of filtering. Status 1; removal of genomes with >2,000 contigs, status 2; removal of genomes with an N50 <5,000, status 3; removal of genomes with length outside the range expected for that species (note: if expected range is not known, the assemblies are kept), status 4; assemblies with a completeness score = >90% and with a contamination score = <5%. A total of 639,981 assemblies passed the 4 levels of filtering and are the high-quality genomes. (C) Distribution of high-quality assemblies (filtering status 4) with >50% abundance of major species. A total of 9,595 assemblies were below this threshold. (D) Within-sample abundance of major species vs completeness of the high-quality assemblies. For (C) and (D), the abundance of major species is the adjusted abundance values (see Methods). The data underlying this figure may be found in
https://doi.org/10.6084/m9.figshare.16437939.
https://doi.org/10.1371/journal.pbio.3001421.s003 (TIFF) S3 Fig. High-quality assemblies that have a major species abundance of less than 90%. The data underlying this figure may be found in
https://doi.org/10.6084/m9.figshare.16437939.
https://doi.org/10.1371/journal.pbio.3001421.s004 (TIFF) S4 Fig. Composition of the assemblies in the 661K snapshot compared to those in the GenBank bacterial and PATRIC databases. (A) Upset plot shows the number of shared sample accessions (also called biosample) in the 661K snapshot, the GenBank bacterial, and PATRIC databases. Each column corresponds to an exclusive intersection that includes the elements denoted by the dark circles, but not of the others. (B) The top 50 species in the 311,600 sample accessions unique to the 661K snapshot. The data underlying this figure may be found in
https://doi.org/10.6084/m9.figshare.16437939.
https://doi.org/10.1371/journal.pbio.3001421.s005 (TIFF) S5 Fig. Composition of the 639,981 high-quality assemblies. (A) Breakdown of assemblies by year first public in the ENA. (B) Fraction of assemblies covered by accumulating projects. (C) Tracking proportions of the top 10 bacterial species for a year. The data underlying this figure may be found in
https://doi.org/10.6084/m9.figshare.16437939. ENA, European Nucleotide Archive.
https://doi.org/10.1371/journal.pbio.3001421.s006 (TIFF) S6 Fig. Distribution of AMR gene alleles by antibiotic class. Gene variants are coloured by their level of spread, from being detected in genomes from different phyla to only found in a single species. The top graph includes genomes with the major species being > = 90% abundance, and the lower graph is when this threshold was increased to > = 98% abundance. The data underlying this figure may be found in
https://doi.org/10.6084/m9.figshare.16437939. AMR, antimicrobial resistance.
https://doi.org/10.1371/journal.pbio.3001421.s007 (TIFF)
Acknowledgments We thank Alexandre Almeida, Kate Mellor, Alyce Taylor-Brown, and all other members of the Iqbal and Thomson research teams for their useful discussions and suggestions. We would also like to thank John Lees for his helpful guidance and support when creating the pp-sketch index of the 661K assemblies. We also thank Colman O’Cathail and Nadim Rahman for their help with accession of the assemblies in the ENA.
[END]
[1] Url:
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001421
(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL:
https://creativecommons.org/licenses/by/4.0/
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/