(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------



Filling gaps in bacterial catabolic pathways with computation and high-throughput genetics

Morgan N. Price (Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America), Adam M. Deutschbauer, Adam P. Arkin

Date: 2022-06

To discover novel catabolic enzymes and transporters, we combined high-throughput genetic data from 29 bacteria with an automated tool to find gaps in their catabolic pathways. GapMind for carbon sources automatically annotates the uptake and catabolism of 62 compounds in bacterial and archaeal genomes. For the compounds that are utilized by the 29 bacteria, we systematically examined the gaps in GapMind’s predicted pathways, and we used the mutant fitness data to find additional genes that were involved in their utilization. We identified novel pathways or enzymes for the utilization of glucosamine, citrulline, myo-inositol, lactose, and phenylacetate, and we annotated 299 diverged enzymes and transporters. We also curated 125 proteins from published reports. For the 29 bacteria with genetic data, GapMind finds high-confidence paths for 85% of utilized carbon sources. In diverse bacteria and archaea, 38% of utilized carbon sources have high-confidence paths, which was improved from 27% by incorporating the fitness-based annotations and our curation. GapMind for carbon sources is available as a web server ( http://papers.genomics.lbl.gov/carbon ) and takes just 30 seconds for the typical genome.

For many microbes, we know little about them beyond their genome sequences. In principle, we could use genome sequences to predict microbes’ traits, such as which carbon sources they can eat, but first we need to identify more of the genes involved. We built an automated tool, GapMind, to annotate the transporters and enzymes for utilizing 62 common carbon sources, and used GapMind to identify gaps: transporters or enzymes that should be present, to explain how a bacterium uses a carbon source, but could not be found in the genome. By comparing these gaps to large-scale genetic data for 29 bacteria, we identified hundreds of novel transporters and enzymes, and a new metabolic pathway for consuming glucosamine. When we added these novel genes to GapMind, its results for diverse bacteria and archaea improved significantly.

Funding: This material by ENIGMA- Ecosystems and Networks Integrated with Genes and Molecular Assemblies ( http://enigma.lbl.gov ), a Science Focus Area Program at Lawrence Berkeley National Laboratory is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Biological & Environmental Research under contract number DE-AC02-05CH11231. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: The code for GapMind, including the rules that describe carbon catabolism, is available in the PaperBLAST code base ( https://github.com/morgannprice/PaperBLAST ). The code and the analysis results are also archived ( https://doi.org/10.6084/m9.figshare.16906993.v1 ). All of the fitness data we analyzed is available in the fitness browser ( http://fit.genomics.lbl.gov/ ) or for download ( https://doi.org/10.6084/m9.figshare.16913530.v1 ).

Copyright: © 2022 Price et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Genome sequences are now available for tens of thousands of bacterial species [1], and for most of these bacteria, little else is known. In principle, the genome sequence could allow us to predict the capabilities of the organism, such as what nutrients it can use, but in practice this is challenging. For instance, metabolic models can be generated automatically from a genome sequence, and these metabolic models can be used to predict which carbon sources the organism can grow on, but these predictions are only 50–70% accurate [2,3]. More accurate predictions are not currently feasible because annotations of the functions of transporters and enzymes are often erroneous [4,5] and because new families of transporters and enzymes and new catabolic pathways continue to be discovered. Also, even if the genome contains genes for the necessary proteins, the proteins might not be expressed.

To discover novel catabolic enzymes and transporters on a large scale, we used a combination of large-scale mutant fitness data and computation. First, we built an automated tool to annotate catabolic pathways. GapMind for carbon sources uses an approach similar to that of GapMind for amino acids [6]. GapMind relies on known pathways (mostly from MetaCyc [7]) and a database of experimentally-characterized proteins. Given a genome and a carbon source, GapMind identifies the most plausible pathway for consuming the compound, and it highlights any gaps.

Next, we used large-scale mutant fitness data from 29 heterotrophic bacteria [4,8] to try to fill these gaps. For each of these bacteria, a pool of tens of thousands of barcoded transposon mutants was grown in various defined media and the change in each mutant’s abundance was quantified by DNA sequencing. If the initial version of GapMind (developed without using the fitness data) had any gaps, we tried to fill them by using genes that were important for fitness during growth on that carbon source, but were not important in most other conditions. Using this approach, we identified functions for hundreds of diverged proteins. Highlights include a new pathway for the utilization of glucosamine; a new family of citrullinases; a new family of aldolases that are involved in myo-inositol catabolism; the first identification of genes for 3’-ketolactose hydrolases, which are involved in lactose catabolism; and a novel oxepin-CoA hydrolase for phenylacetate catabolism. By using PaperBLAST to find papers about homologs of the candidate genes [9], we also identified over 100 relevant proteins that were experimentally characterized but whose function was not described in curated databases such as Swiss-Prot [10], BRENDA [11], MetaCyc [7], CAZy [12], or TCDB [13].

We incorporated all of these additional enzymes and transporters into GapMind, and we asked how much the coverage of catabolism in diverse bacteria and archaea had improved. We relied on the IJSEM database, which reports carbon sources utilized by diverse bacteria and archaea [14]. (The International Journal of Systematic and Evolutionary Microbiology publishes species descriptions, which often report carbon sources that are utilized by the type strain.) Across diverse bacteria and archaea with sequenced genomes, coverage by high-confidence paths improved by 11 percentage points (from 27% to 38%) after the incorporation of annotated and curated proteins into GapMind. We also used the fitness data from the 29 heterotrophic bacteria to confirm that GapMind usually selects the correct pathway and genes for utilizing each carbon source. Overall, we filled many gaps in carbon catabolism, and we improved our understanding of catabolism in diverse prokaryotes significantly, but much remains to be discovered.

Results and discussion

Glucosamine utilization via putative transmembrane transacetylase NagX

Fitness data from five diverse bacteria showed that the protein NagX is involved in the utilization of glucosamine as the sole source of carbon or nitrogen (Fig 3A–3E). The NagX family of transmembrane proteins is often found in operons for chitin utilization [18], but its function is not known. In four of the five bacteria, we found that N-acetylglucosamine 6-phosphate deacetylase NagA was also involved in glucosamine utilization (Fig 3A–3D). And in four of the five bacteria, the transporter NagP or another putative sugar transporter was also involved in glucosamine utilization (Fig 3A, 3B, 3D and 3E).

Fig 3. Role of NagX in glucosamine utilization. (A-E) Fitness data from five different bacteria with glucosamine or NAcGln as the sole source of carbon or nitrogen. As a control, we also show fitness with D,L-lactate or glucose as the carbon source. Each colored cell shows the fitness value for a gene in an individual experiment. The fitness of a gene is the log2 change in the relative abundance of mutants in that gene during 4–8 generations of growth (from inoculation at OD600 = 0.02 until saturation). Cells with strongly negative fitness are dark blue. (F) The proposed role of NagX. https://doi.org/10.1371/journal.pgen.1010156.g003

NagX proteins are distantly related (25–31% amino acid identity) to human heparan-α-glucosaminide N-acetyltransferase (HGSNAT), which transfers acetyl groups from cytoplasmic acetyl-CoA to terminal glucosamine residues in lysosomal heparan sulfate [19]. Similarly, we propose that NagX is a transmembrane transacetylase that uses cytoplasmic acetyl-CoA to convert periplasmic glucosamine to N-acetylglucosamine (NAcGln). Although NagX is much shorter than HGSNAT, with 309–395 amino acids instead of 663, NagX contains the entire catalytic domain (PFam PF07786; [20]). Furthermore, the catalytic histidine which carries the acetyl group across the membrane is conserved: for instance, His72 of Shewana3_3111 aligns to His297 of HGSNAT (SwissProt Q68CP4). Once NAcGln is formed, it can be transported across the membrane and phosphorylated (such as by NagP and NagK, or by a phosphotransferase system), followed by deacetylation by NagA. Our proposal explains why NagA, NagP, and NagK are involved in glucosamine utilization as well as NAcGln utilization. Our proposal also explains why NagX is important for the utilization of glucosamine but not NAcGln (Fig 3A–3E, although NagX might be involved in NAcGln utilization in Caulobacter crescentus).

We also noticed that in Echinicola vietnamensis KMM 6221, a putative acetyl-CoA synthase (acs) is important during glucosamine utilization (Fig 3E), but not in most other conditions (not shown); we speculate that it produces acetyl-CoA for NagX. NagX is also distantly related to a putative N-acetylmuramate transporter (TfMurT) from Tannerella forsythia [21], so we also considered that NagX might be a glucosamine transporter. However, this seems inconsistent with the involvement of the deacetylase NagA and of other sugar transporters in glucosamine utilization.
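The Fig 3 caption defines gene fitness as the log2 change in the relative abundance of a gene's mutants. A minimal sketch of that definition follows, assuming hypothetical barcode read counts before and after growth and a pseudocount of 1 to avoid taking the log of zero. The function name and all counts are illustrative; the actual fitness pipeline combines many strains per gene and applies further normalization, which is not reproduced here.

```python
import math

def gene_fitness(before, after, total_before, total_after, pseudocount=1.0):
    """log2 change in the relative abundance of a gene's mutants.

    `before` and `after` are hypothetical read counts for the gene's mutants;
    `total_before` and `total_after` are total reads in each sample.
    The pseudocount is an assumption, not taken from the paper.
    """
    rel_before = (before + pseudocount) / total_before
    rel_after = (after + pseudocount) / total_after
    return math.log2(rel_after / rel_before)

# Hypothetical gene whose mutants drop 16-fold in relative abundance.
fitness = gene_fitness(before=1599, after=99,
                       total_before=1_000_000, total_after=1_000_000)
print(round(fitness, 2))
```

On this scale, a strongly negative value (the dark blue cells in Fig 3) means that mutants in the gene were depleted during growth, so the gene was important in that condition.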

Citrulline utilization via putative citrullinase CtlX

Using fitness data from Phaeobacter inhibens DSM 17395 (BS107), Pseudomonas simiae WCS417, and Pseudomonas fluorescens FW300-N2E3, we previously identified [4] a family of putative hydrolases that are involved in citrulline utilization (Fig 4A–4C). These hydrolases, which we will call CtlX, are distantly related to arginine deiminases, which hydrolyze arginine to citrulline and ammonia. We previously proposed that the arginine deiminase reaction might run in reverse [4]. But eQuilibrator estimates that the reverse reaction is thermodynamically unfavorable, with an equilibrium constant of under 10⁻⁶ M⁻¹ [22]. If arginine deiminase is operating in reverse, then the genes for converting citrulline to arginine (argGH) should be dispensable. We lack fitness data for argGH from P. simiae WCS417 or P. inhibens BS107, but in P. fluorescens FW300-N2E3, argG and argH were very important for fitness with citrulline as the sole source of either carbon or nitrogen (Fig 4A). Furthermore, the arginine deiminases and related enzymes that act on substrates with guanidino groups (-NH-C(=NH2+)-NH2) have two conserved substrate-binding aspartate residues [23], while CtlX has asparagines at these positions instead (FTRD → FPNN and HLD → HTN).

Fig 4. Putative citrullinase CtlX. (A-C) In diverse bacteria, ctlX and either ornithine cyclodeaminase (ocd) or ornithine/arginine succinyltransferase (aruFG) are important for the utilization of citrulline as a carbon source. The color-coded cells show fitness values, which are log2 changes in the relative abundance of mutants in each gene. (D) Gene neighborhoods of ctlX. The drawing is modified from Gene Graphics [25]. (E) Pathways of citrulline utilization. https://doi.org/10.1371/journal.pgen.1010156.g004

We noticed that CtlX is often encoded adjacent to ornithine cyclodeaminase ocd or ornithine/arginine N-succinyltransferase aruG (Fig 4D). These enzymes are also involved in citrulline utilization (Fig 4A–4C), which suggests that ornithine is an intermediate. This led us to consider that CtlX might hydrolyze citrulline to ornithine and carbamate (Fig 4E). The replacement of substrate-binding aspartates with asparagines seems consistent with an amide substrate. Unfortunately, citrulline is not included in the IJSEM database [14], so we do not have a large data set of citrulline-utilizing bacteria. But ctlX is present in four of the five bacteria we have studied that grow with citrulline as the sole source of carbon. (Besides the three bacteria shown in Fig 4, ctlX is present in P. fluorescens FW300-N1B4, but we lack fitness data for the gene.) From a study of bacteria that can use citrulline as the sole source of carbon [24], we found two with genome sequences, and both encode ctlX (C8E02_RS07400 from Vogesella indigofera ATCC 19706 = DSM 3303, and DM41_RS32400 from Burkholderia cepacia NCTC 10743 = ATCC 25416 = DSM 7288). Furthermore, ctlX from B. cepacia is encoded adjacent to ocd (Fig 4D). So CtlX is widespread in citrulline-utilizing bacteria.

To further investigate the role of CtlX, we collected additional fitness data for P. fluorescens FW300-N2E3, P. simiae WCS417, and P. inhibens DSM 17395 during growth with varying concentrations of citrulline or ornithine as the sole source of carbon. We had expected that CtlX would be important for the utilization of citrulline, but not ornithine. Instead, we observed that CtlX was important for the utilization of both citrulline and ornithine in all three bacteria. We suspect that ornithine is being converted to citrulline and then arginine by enzymes of the arginine biosynthesis pathway, and that CtlX is important for fitness because it counteracts this. First, P. fluorescens FW300-N2E3 has three ways to consume arginine: the arginine succinyltransferase pathway, the arginine decarboxylase pathway, and arginine deiminase ArcA, which converts arginine to citrulline. Genes from all three pathways are strongly detrimental to fitness during growth on ornithine; in other words, mutants in these pathways are enriched after growth on ornithine (S1 Fig). This suggests that an excess of arginine is being formed (although we do not understand why disrupting just one of three catabolic pathways is beneficial). Second, in P. simiae WCS417, several genes from the arginine succinyltransferase pathway are important for fitness during growth on ornithine (S2 Fig). This is consistent with flux to arginine in excess of requirements for protein synthesis, although these genes could be involved in ornithine catabolism instead, as AruFG can succinylate both arginine and ornithine [26]. We also noticed that all four transposon insertions within the ctlX of P. simiae WCS417 have the antibiotic resistance marker in the antisense orientation, which might prevent expression of the downstream ornithine cyclodeaminase (ocd) in these strains. Ocd is important for utilization of ornithine (S2 Fig), so the phenotype of insertions in ctlX could be a polar effect. Third, in P. inhibens DSM 17395, arginase (which hydrolyzes arginine to ornithine and urea) was very important for fitness during growth on either ornithine or citrulline, which again implies excess flux to arginine (S3 Fig).

Because of the complexity of citrulline and arginine metabolism, biochemical studies will be needed to prove the function of CtlX. In the current release of GapMind, we assume that CtlX converts citrulline to ornithine.

The only citrullinase from bacteria that has been reported before, Ctu from Francisella tularensis [27], is not homologous to CtlX (PFam PF00795, not PF02274). Also, many Pseudomonas can use ornithine carbamoyltransferase and carbamate kinase (both in reverse) to consume citrulline and form ATP (Fig 4E). (Both of the Pseudomonas with the putative citrullinase also encode carbamate kinase, but Phaeobacter inhibens DSM 17395 does not.) In Pseudomonas aeruginosa, these enzymes are repressed under aerobic conditions [28], and all of our experiments with citrulline were conducted aerobically, so the carbamate kinase pathway may not have been expressed. Although the carbamate kinase pathway generates one more ATP per molecule of citrulline than the citrullinase pathway, the first step of the carbamate kinase pathway (ornithine carbamoyltransferase in reverse) is thermodynamically quite unfavorable, with an estimated equilibrium constant of 5 · 10⁻⁶ [22]. So we speculate that the citrullinase pathway is faster, which would explain why it is preferred when oxygen is available.
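To put the quoted equilibrium constant in more familiar terms, it can be converted to a standard free energy change with ΔG = -RT ln K. The sketch below is a back-of-the-envelope conversion, not part of the original analysis. The temperature of 25 °C is an assumption, as is treating the quoted constant as dimensionless; the function name is illustrative.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol*K)

def standard_free_energy(k_eq, temperature_k=298.15):
    """Free energy change in kJ/mol from an equilibrium constant.

    Uses delta G = -R * T * ln(K). The default of 298.15 K (25 C) is an
    assumption; the source does not state the conditions used.
    """
    return -R_KJ * temperature_k * math.log(k_eq)

# Ornithine carbamoyltransferase in reverse, K estimated as 5e-6 in the text.
dg_reverse_otc = standard_free_energy(5e-6)
print(f"{dg_reverse_otc:.1f} kJ/mol")
```

Under these assumptions the reverse reaction is uphill by roughly 30 kJ/mol, which is consistent with the description of this step as thermodynamically quite unfavorable.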

An alternative 2-deoxy-5-keto-D-gluconate 6-phosphate aldolase for myo-inositol utilization

2-deoxy-5-keto-D-gluconate 6-phosphate aldolase (EC 4.1.2.29) is involved in myo-inositol catabolism via inosose dehydratase and 5-deoxy-D-glucuronate. As far as we know, the only previously-characterized enzymes are IolJ from Bacillus subtilis [29] and a similar protein from Phaeobacter inhibens, PGA1_c07220, which was identified using fitness data [4]. Of the 11 bacteria for which we have fitness data with myo-inositol as the sole carbon source, just two encode IolJ-like proteins, so we searched for alternative aldolases using the fitness data. We noticed that in the other nine bacteria, a putative 2-deoxy-5-keto-D-gluconate kinase (IolC) is fused to an uncharacterized domain, DUF2090 (PFam PF09863). All of these fusion proteins were important for fitness during myo-inositol utilization but not in most other conditions (Fig 5).

Fig 5. IolC-DUF2090 fusion proteins are important for myo-inositol utilization. Each point shows a gene fitness value (x axis) from a separate experiment. Values under -4 are shown at -4. The y axis is arbitrary. Experiments with myo-inositol as the sole source of carbon are highlighted. https://doi.org/10.1371/journal.pgen.1010156.g005

DUF2090 is related to aldolases: for instance, D-tagatose-bisphosphate aldolase LacD from Streptococcus pyogenes (PDB:5ff7) has a statistically significant alignment to PF09863.9 (uncorrected E = 6.5 · 10⁻⁸, hmmsearch 3.3.1). The catalytic residues of LacD are Lys126 and Glu164 [30]. When we aligned LacD and the DUF2090 fusion proteins (via the PFam model and hmmsearch), we found that these catalytic residues were fully conserved. For instance, BPHYT_RS13910 from Burkholderia phytofirmans PsJN has Lys493 and Glu531. We propose that DUF2090 is the missing 2-deoxy-5-keto-D-gluconate 6-phosphate aldolase.

When we examined the genomes of diverse myo-inositol-utilizing microbes from the IJSEM database [14], we found that none contained IolJ, but 7 of 15 (47%) contained DUF2090, and in each case, DUF2090 was fused to IolC. Just 22 of 232 genomes (9%) from organisms not known to utilize myo-inositol contained DUF2090, which was significantly less (odds ratio 0.12, P = 0.0005, Fisher exact test). (To identify members of DUF2090, we used hmmsearch with PF09863.9 and the trusted cutoff, and proteins that had higher bit scores for alignments to the DeoC/LacD family (PF01791.9) than to PF09863.9 were ignored.) If we combine the 11 myo-inositol-utilizing bacteria with fitness data with the 15 microbes from IJSEM, then of the 26 genomes, 16 encode IolC-DUF2090 and just 2 encode IolJ. Thus, DUF2090 is associated with myo-inositol utilization, which supports our prediction that DUF2090 domains are 2-deoxy-5-keto-D-gluconate 6-phosphate aldolases.
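The association between DUF2090 and myo-inositol utilization rests on a 2×2 table: 22 of 232 non-utilizers and 7 of 15 utilizers contain the domain. The following sketch recomputes the odds ratio and a two-sided Fisher exact test from those counts using only the Python standard library. The function is an illustrative re-implementation, not the authors' code, and the two-sided convention used (summing all tables no more probable than the observed one) is an assumption about how the reported P value was obtained.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the
    hypergeometric probabilities of every table with the same margins
    that is no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(p for p in (prob(x) for x in range(lowest, highest + 1))
                  if p <= observed * (1 + 1e-7))
    return (a * d) / (b * c), p_value

# Rows: organisms not known to utilize myo-inositol, then utilizers.
# Columns: genomes with DUF2090, genomes without DUF2090.
odds_ratio, p_value = fisher_exact_two_sided(22, 232 - 22, 7, 15 - 7)
print(round(odds_ratio, 2), p_value)
```

The sample odds ratio, (22 × 8) / (210 × 7), rounds to the 0.12 reported in the text.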

An alternative oxepin-CoA hydrolase for phenylacetate utilization

Phenylacetate is an end product of phenylalanine fermentation, and phenylacetate or phenylacetyl-CoA are common intermediates in the degradation of phenylalanine and other aromatic compounds. The aerobic pathway for phenylacetate utilization [41,42] begins with activation to phenylacetyl-CoA, followed by oxygenation to 1,2-epoxyphenylacetyl-CoA, isomerization to oxepin-CoA, hydrolytic ring-opening to 3-oxo-5,6-didehydrosuberyl-CoA semialdehyde, and oxidation to 3-oxo-5,6-didehydrosuberyl-CoA. Additional thiolase, isomerase, dehydrogenase, and enoyl-CoA hydratase enzymes convert this to acetyl-CoA and succinyl-CoA (Fig 7A). In E. coli, the ring opening reaction and the next step in the pathway, the oxidation of 3-oxo-5,6-didehydrosuberyl-CoA semialdehyde, are catalyzed by PaaZ, which combines an enoyl-CoA hydratase (ECH) domain that performs ring opening with an aldehyde dehydrogenase domain [43]. But in many other bacteria that encode this pathway, the 3-oxo-5,6-didehydrosuberyl-CoA semialdehyde dehydrogenase is a separate protein (for instance, PacL, [43]). To our knowledge, the oxepin-CoA hydrolase from these bacteria has not been identified. Teufel and colleagues did identify a protein (ECH-Aa) that had some activity as an oxepin-CoA hydrolase, but ECH-Aa was ~1,000 times more active as a crotonyl-CoA hydratase than as oxepin-CoA hydrolase, so it is not clear if ECH-Aa’s oxepin-CoA hydrolase activity is physiologically relevant [43].

Fig 7. Phenylacetate utilization via an alternative oxepin-CoA hydrolase. (A) The aerobic pathway for phenylacetate utilization. (B) Fitness data from P. bryophila 376MFSha3.1 growing in minimal media with phenylacetate or glucose as the carbon source. Except for the experiments with 20 mM glucose, the media also contained 1% dimethylsulfoxide (by volume). https://doi.org/10.1371/journal.pgen.1010156.g007

To study this question, we analyzed fitness data from Paraburkholderia bryophila 376MFSha3.1 with phenylacetate as the carbon source (Robin Herbert and Trenton Owens, personal communication). Most of the genes of the aerobic pathway were identified in the genome and were important for phenylacetate utilization, including the phenylacetate-CoA ligase paaK, the oxygenase paaABCDE, the isomerase paaG, a pacL-like 3-oxo-5,6-didehydrosuberyl-CoA semialdehyde dehydrogenase, the thiolase paaJ, and the enoyl-CoA hydratase paaF (Fig 7B). The only missing steps were the oxepin-CoA hydrolase and the 3-hydroxyadipoyl-CoA dehydrogenase (PaaH). Using the fitness data, we identified candidates for both steps.

First, a putative enoyl-CoA hydratase, H281DRAFT_04594, was important for phenylacetate utilization (Fig 7B). A closely related protein from Burkholderia sp. OAS925 (97% identity) is also important for phenylalanine utilization (Ga0395975_5191, fitness = -4.1 and -3.9, Marta Torres, personal communication), which confirms our genetic data. We predict that these proteins provide the missing oxepin-CoA hydrolase activity. H281DRAFT_04594 is related to enoyl-CoA hydratases that form (3S)-hydroxyacyl-CoA from 2-trans-enoyl-CoA, while the ECH domain of PaaZ is related to enoyl-CoA hydratases that form (3R)-hydroxyacyl-CoA. Both families of hydratases use acid-base chemistry to act on CoA thioesters, and neither oxepin-CoA nor the hydrolysis product has chiral centers (except within the coenzyme A group), so either type of ECH domain could catalyze the hydrolysis of oxepin-CoA. H281DRAFT_04594 is 32% identical to enoyl-CoA hydratase from rat liver, whose catalytic mechanism has been studied [44]. The side chains that participate in catalysis (E144 and Q162) are not conserved in H281DRAFT_04594: the corresponding residues are S118 and M135, respectively. This suggests that H281DRAFT_04594 has another function, which is consistent with our proposal.

Second, the gene for the 3-hydroxyadipoyl-CoA dehydrogenase PaaH was not clearly identified, but there are at least three 3-hydroxyacyl-CoA dehydrogenases that might have this activity. One of them, H281DRAFT_00361, was important for phenylacetate utilization (Fig 7B). A close homolog from B. phytofirmans PsJN was also important for phenylacetate utilization (BPHYT_RS13545, fitness = -1.7 or -2.0; data from [45]). H281DRAFT_00361 is 49% identical to PimB from Rhodopseudomonas palustris; the pim operon is involved in dicarboxylic fatty acid degradation [46], which suggests that PimB may be active on 3-hydroxyadipoyl-CoA (the 3-hydroxyacyl-CoA intermediate in adipate degradation). H281DRAFT_00361 has an ECH domain as well as an aldehyde dehydrogenase domain; we do not have a proposal for the role of its ECH domain.

Annotation of 299 diverged enzymes and transporters

While developing GapMind, we used the fitness data to identify transporters and enzymes that were important for utilization of various carbon sources, and hence to predict these proteins’ functions. Overall, we annotated 716 proteins, comprising 555 enzymes and 161 transporters or transporter components. (Proteins whose functions we had previously identified from the fitness data are not included in these counts.) Many of these proteins are distantly related to previously-characterized proteins from the seven curated databases that GapMind relies on (Fig 8A). For proteins that were over 40% identical to one or more characterized proteins, 22% (117 of 534) had a different function than their best hit.

For example, PS417_22145 from Pseudomonas simiae WCS417 is 88% identical to GtsA from P. putida KT2440, which is reported in the transporter classification database (TCDB) to be the substrate-binding component of a glucose transporter. PS417_22145 was important for the utilization of D-glucose 6-phosphate (fitness = -3.2 and -2.0) and D-xylose (fitness = -1.8 and -1.6) but not in most other conditions (data of [4,45]; also, the other components of this ABC transporter had similar phenotypes). Glucose 6-phosphate may be hydrolyzed to glucose before uptake, which would explain why a glucose transporter is important for fitness; but the phenotype during growth on D-xylose suggests that PS417_22145 binds xylose as well as glucose. Indeed, in strains of P. putida that were engineered to utilize xylose, GtsA is required for xylose utilization [47]. This information is not in TCDB: since the xylose-utilizing strains of P. putida had mutations in GtsA, it is not clear if the wild-type protein from P. putida binds to xylose. But GtsA from Pseudomonas simiae WCS417 does seem to be involved in xylose transport.

Overall, we used the fitness data to identify functions for 299 diverged proteins that have a different function than their closest characterized homolog or are less than 40% identical to any characterized protein in the databases.

Fig 8. Similarity of the proteins that we annotated to previously-characterized proteins from seven curated databases. Panel A shows the 716 proteins that we annotated using fitness data, and panel B shows the 125 proteins that we annotated using the scientific literature. Homologs were identified using protein BLAST against a database of 125,685 experimentally-characterized proteins. We required E < 0.001 and 70% coverage of both the query and the subject. Proteins whose functions we had previously identified using the fitness data were not included in the database. https://doi.org/10.1371/journal.pgen.1010156.g008
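The criteria above (hits must have E < 0.001 and 70% coverage of both query and subject; a protein is "diverged" if it has a different function than its closest characterized homolog or is less than 40% identical to any characterized protein) can be written down as a small classifier. This is a sketch under stated assumptions: the `Hit` record, the function names, and the example function labels are hypothetical, and taking the "closest" homolog to be the valid hit with the highest percent identity is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hit:
    """One BLAST hit to a characterized protein (hypothetical record layout)."""
    subject: str
    identity: float          # percent amino acid identity
    evalue: float
    query_coverage: float    # fraction of the query that is aligned
    subject_coverage: float  # fraction of the subject that is aligned
    function: str            # curated function of the subject

def best_valid_hit(hits: List[Hit]) -> Optional[Hit]:
    """Most identical hit that passes the E-value and coverage thresholds."""
    valid = [h for h in hits
             if h.evalue < 0.001
             and h.query_coverage >= 0.7
             and h.subject_coverage >= 0.7]
    return max(valid, key=lambda h: h.identity, default=None)

def is_diverged(assigned_function: str, hits: List[Hit]) -> bool:
    """True if no valid hit reaches 40% identity or the best hit differs in function."""
    best = best_valid_hit(hits)
    if best is None or best.identity < 40.0:
        return True
    return best.function != assigned_function

# Loosely modeled on PS417_22145; the function labels are illustrative only.
gtsa_hit = Hit("GtsA", 88.0, 1e-150, 0.99, 0.99, "glucose transport")
print(is_diverged("glucose and xylose transport", [gtsa_hit]))
```

With exact string comparison of function labels, any difference in wording counts as a different function, so a real implementation would need a controlled vocabulary of reactions or substrates.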

Curation of enzymes and transporters from the literature

While developing GapMind, we identified 125 proteins that have published experimental data about their function, are relevant to the utilization of the 62 carbon sources, but are not included in any of the curated databases. For example, in Pseudomonas putida KT2440, the putative lactonase PP_1170 is important during growth on D-glucuronate and D-galacturonate (fitness < -2, Mitchell Thompson and Matthias Schmidt, personal communication), but not in over 100 other experiments (all fitness ≥ -0.5). A uronate dehydrogenase (PP_1171) is also important for glucuronate utilization, which indicates that P. putida uses an oxidative pathway and suggests that PP_1170 is a glucurono-1,5-lactonase. This reaction is not linked to protein sequences by any of the curated databases we used, so at first we thought we had identified a novel enzyme. But by using PaperBLAST [9], we found that PP_1170 is 72% identical to PSPTO_1052, which hydrolyzes D-glucurono-1,5-lactone in vitro [48]. GapMind now associates the glucurono-1,5-lactonase reaction with PP_1170, PSPTO_1052, and five other lactonases studied by [48].

Of the 125 proteins we curated from the literature, 61 are enzymes and 64 are transporters. The majority of these proteins are quite diverged from characterized proteins in the databases, or have different functions (Fig 8B). The median similarity to the most-similar characterized protein is 38%.
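The PP_1170 example illustrates the kind of filter used throughout to pick candidate genes: a strong defect in the condition of interest and little phenotype elsewhere. A minimal sketch of such a filter follows. The thresholds of -2 and -0.5 are borrowed from this one example, and the tolerance for a small fraction of defects in other experiments is an assumption; the function name and data layout are hypothetical, and the published criteria for a "specific" phenotype are defined in [4] and may differ.

```python
def has_specific_phenotype(experiments, condition, strong=-2.0, mild=-0.5,
                           max_other_fraction=0.05):
    """Check for a strong defect in `condition` and few defects elsewhere.

    experiments: list of (condition name, fitness value) pairs for one gene.
    All thresholds are illustrative assumptions.
    """
    in_condition = [f for c, f in experiments if c == condition]
    others = [f for c, f in experiments if c != condition]
    if not in_condition or not others:
        return False  # specificity cannot be judged without both kinds of data
    if any(f >= strong for f in in_condition):
        return False  # the defect must be strong in every replicate
    defects_elsewhere = sum(1 for f in others if f < mild)
    return defects_elsewhere <= max_other_fraction * len(others)

# Hypothetical fitness values for a gene that behaves like PP_1170.
lactonase = [("D-glucuronate", -2.6), ("D-glucuronate", -3.1)]
lactonase += [("other", -0.1)] * 100
print(has_specific_phenotype(lactonase, "D-glucuronate"))
```

A gene that is important in many conditions, such as one needed for growth in any minimal medium, fails this filter even if its fitness on the carbon source of interest is strongly negative.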

Quality of GapMind’s results

To assess the quality of GapMind’s results, we examined its predictions for organisms that are reported to grow, or not, with these compounds as the sole source of carbon. First, we compared GapMind’s results to growth data for 29 heterotrophic bacteria across 57 of the 62 carbon sources in GapMind [4,8]. (Deoxyinosine, deoxyribonate, mannitol, phenylacetate and sucrose were not included because we do not have comprehensive growth data.) As shown in Fig 9A, GapMind identified a high-confidence path for 85% of carbon sources that support growth, and for just 24% of other carbon sources. For carbon sources that are utilized, transport steps on the best path are more likely to be low- or medium-confidence than enzymatic steps are (5.9% vs. 2.6%, P = 1.5 · 10⁻⁷, Fisher exact test). We suspect that this reflects the greater difficulty of annotating transporters by similarity, and also the greater difficulty of identifying transporters from fitness data because they are often genetically redundant (see below).

Fig 9. Quality of GapMind’s results. (A) Confidence of the best path for utilized and non-utilized carbon sources, across 57 carbon sources and 29 heterotrophic bacteria with fitness data. A path is low confidence if it has any low-confidence steps (and similarly for medium confidence). Proportions are from 700 utilized cases and 953 non-utilized cases. (B) Whether high-confidence and non-redundant genes on the best path were important for fitness. Proportions are from 962 genes that encode transporters and 1,254 genes that encode enzymes. Genes lack fitness data if they have insufficient coverage by transposon insertions (usually these are essential or short genes). A phenotype is “specific” if the gene has little phenotype in most other conditions [4]. (C) Confidence of the best path for utilized carbon sources across diverse bacteria and archaea. The phylum assignments are from the Genome Taxonomy Database [49], and “other phyla” includes 14 phyla with fewer than 100 organism × compound pairs each. There were 54 pairs for archaea. https://doi.org/10.1371/journal.pgen.1010156.g009

Cases where the organism does not grow, despite having high-confidence candidates for all of the necessary steps, could indicate inadequate expression of those genes. For example, EcoCyc reports that E. coli K-12 does not grow aerobically at 37°C on 11 of the carbon sources in GapMind, despite containing all of the proteins necessary for their uptake and catabolism. These compounds are arginine, asparagine, aspartate, cellobiose, citrate, ethanol, glutamate, lysine, proline, putrescine, and L-serine. Of the eight nitrogen-containing compounds, seven (all except glutamate) support the growth of E. coli as the sole source of nitrogen, which confirms that they are taken up and metabolized.

We also used the fitness data to check if GapMind selected the correct genes for consuming each carbon source. We considered steps that were on the best path, and which had just one high-confidence candidate, because otherwise the genes for the step might be genetically redundant. We analyzed genes that encode enzymes and transporters separately.

When fitness data is available for that gene and condition, 82% of genes that encode enzymes were important for fitness in the condition (Fig 9B). To understand why some of these genes were not important for fitness, we examined a random sample of 20 cases. In 12 of the 20 cases, GapMind identified another high-confidence path as well. For example, in Shewanella loihica PV-4, acetate might be converted to the central metabolite acetyl-CoA by acetyl-CoA synthase (acs) or else by acetate kinase (in reverse) and phosphate acetyltransferase (ackA and pta). E. coli K-12 uses both pathways to consume acetate [50], so the lack of a phenotype for ackA in S. loihica could indicate genetic redundancy with acs. More broadly, if GapMind identifies two high-confidence pathways, it arbitrarily chooses the one with more steps. (Our intuition is that one step might be annotated erroneously, but the presence of several steps is unlikely unless the pathway is present.) GapMind might guess wrong, or the two pathways might be genetically redundant. For another 6 of the 20 cases we examined, genes for other steps on the selected path were important for fitness during growth on the carbon source, which suggests that GapMind selected the correct path.

For genes that encode transporters on the best path, and for which fitness data is available, 56% were important for fitness in the condition (Fig 9B). We examined a random sample of 20 cases where the transporter gene was not important for fitness. In most of those cases (18/20), GapMind identified another high-confidence transporter as well, so the genes for the two types of transporters might be genetically redundant.

Overall, enzymes and transporters that are part of GapMind’s best path for consuming a compound are usually important for fitness during growth with that compound as the sole source of carbon and energy, and most of the exceptions could be due to genetic redundancy.
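Two rules stated in this section lend themselves to a short illustration: a path is only as confident as its least confident step (Fig 9A), and when two paths are both high-confidence, the one with more steps is chosen. The sketch below encodes just those two rules. It is not GapMind's actual scoring code; the function names, the data layout, and the step labels are hypothetical.

```python
LEVELS = {"low": 0, "medium": 1, "high": 2}

def path_confidence(step_confidences):
    """Confidence of a path is that of its weakest step.

    step_confidences: non-empty list of "low", "medium" or "high".
    """
    return min(step_confidences, key=LEVELS.__getitem__)

def best_path(paths):
    """Pick the path with the highest confidence, breaking ties by more steps.

    paths: dict mapping a path name to its list of step confidences.
    """
    return max(paths,
               key=lambda name: (LEVELS[path_confidence(paths[name])],
                                 len(paths[name])))

# Acetate activation in Shewanella loihica PV-4, with illustrative labels:
# one step via acs, or two steps via ackA and pta.
acetate_paths = {
    "acs": ["high"],
    "ackA-pta": ["high", "high"],
}
print(best_path(acetate_paths))
```

In the acetate example both routes are high-confidence, so the tie-break favors the two-step route even though the single-step route might carry the flux, which is one way a gene on the selected path can lack a phenotype.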


[1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010156

(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL: https://creativecommons.org/licenses/by/4.0/

