(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
Robust genetic codes enhance protein evolvability [1]
['Hana Rozhoňová', 'Institute Of Integrative Biology', 'Eth Zürich', 'Zürich', 'Swiss Institute Of Bioinformatics', 'Lausanne', 'Carlos Martí-Gómez', 'Simons Center For Quantitative Biology', 'Cold Spring Harbor Laboratory', 'Cold Spring Harbor']
Date: 2024-05
The standard genetic code defines the rules of translation for nearly every life form on Earth. It also determines the amino acid changes accessible via single-nucleotide mutations, thus influencing protein evolvability—the ability of mutation to bring forth adaptive variation in protein function. One of the most striking features of the standard genetic code is its robustness to mutation, yet it remains an open question whether such robustness facilitates or frustrates protein evolvability. To answer this question, we use data from massively parallel sequence-to-function assays to construct and analyze 6 empirical adaptive landscapes under hundreds of thousands of rewired genetic codes, including those of codon compression schemes relevant to protein engineering and synthetic biology. We find that robust genetic codes tend to enhance protein evolvability by rendering smooth adaptive landscapes with few peaks, which are readily accessible from throughout sequence space. However, the standard genetic code is rarely exceptional in this regard, because many alternative codes render smoother landscapes than the standard code. By constructing low-dimensional visualizations of these landscapes, which each comprise more than 16 million mRNA sequences, we show that such alternative codes radically alter the topological features of the network of high-fitness genotypes. Whereas the genetic codes that optimize evolvability depend to some extent on the detailed relationship between amino acid sequence and protein function, we also uncover general design principles for engineering nonstandard genetic codes for enhanced and diminished evolvability, which may facilitate directed protein evolution experiments and the bio-containment of synthetic organisms, respectively.
Funding: This work was funded by the Swiss National Science Foundation (https://www.snf.ch; grants PP00P3_202672 and 310030_192541 to J.L.P.), NIH (https://www.nih.gov/; grant R35GM133613 to D.M.M.), an Alfred P. Sloan Research Fellowship (https://sloan.org/; to D.M.M.), and additional funding from the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory (https://www.cshl.edu/research/quantitative-biology/; to D.M.M.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Here, we overcome the limitations of prior studies using experimental data from massively parallel sequence-to-function assays [ 35 ]. In particular, we use combinatorially complete data, which provide a quantitative characterization of protein phenotype for all possible combinations of 20^L amino acid sequence variants at a small number L of protein sites [ 36 – 40 ]. These data facilitate the construction of complete adaptive landscapes without assumptions regarding the combined effects of individual mutations (e.g., additivity). Importantly, the combinatorially complete nature of these data allows us to construct such landscapes under arbitrary genetic codes. The reason is that, no matter which code we use, we are guaranteed that each of the 4^(3L) possible mRNA sequences can be computationally translated into an amino acid sequence with an experimentally assayed phenotype. We characterize the topographies of 6 such empirical adaptive landscapes under the standard genetic code, as well as under hundreds of thousands of rewired codes, and perform population-genetic simulations on these landscapes. We show that robust genetic codes tend to produce smooth adaptive landscapes with few peaks and, consequently, allow evolving populations to reach on average higher fitness. Thus, the robustness of a genetic code not only helps to mitigate the potentially deleterious effects of replication and translation errors, but it also transforms the problem of molecular evolution from one that depends on the vicissitudes of individual mutations into one where evolving populations can readily find mutational paths toward adaptation.
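The guarantee described above can be checked on a small scale. The following is a minimal sketch, not the authors' pipeline: it assumes a hypothetical, randomly generated code (`random_code`), uses L = 2 sites so that the enumeration is instant, and simply sets aside sequences that contain a stop codon, which is an assumption of this sketch rather than a statement about how the study handles them.

```python
import itertools
import random

BASES = "ACGU"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids
STOP = "*"

def random_code(seed):
    """Give each of the 64 codons an arbitrary meaning (hypothetical code)."""
    rng = random.Random(seed)
    codons = ["".join(c) for c in itertools.product(BASES, repeat=3)]
    return {codon: rng.choice(AMINO_ACIDS + STOP) for codon in codons}

def translate(mrna, code):
    """Translate codon by codon; return None if a stop codon is encountered."""
    protein = "".join(code[mrna[i:i + 3]] for i in range(0, len(mrna), 3))
    return None if STOP in protein else protein

L = 2  # number of protein sites, kept small for illustration
# A combinatorially complete data set contains every one of the 20^L variants.
assayed = {"".join(p) for p in itertools.product(AMINO_ACIDS, repeat=L)}
code = random_code(seed=0)
mrnas = ["".join(s) for s in itertools.product(BASES, repeat=3 * L)]
sense = [m for m in mrnas if translate(m, code) is not None]
# Whatever the code, every stop-free translation is one of the assayed variants.
covered = all(translate(m, code) in assayed for m in sense)
print(len(assayed), len(mrnas), covered)  # 400 4096 True
```

For L = 4 the same counts become 20^4 = 160,000 amino acid variants and 4^12 = 16,777,216 mRNA sequences.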
Whether code robustness hinders or facilitates protein evolvability therefore remains an open question. Whereas steps have been taken towards answering this question [ 20 , 21 , 27 – 29 ], these studies suffer from at least 1 of 2 key limitations. The first is a focus on how missense mutations change the physicochemical properties of amino acids [ 15 , 16 , 28 ], rather than how missense mutations change the phenotype of a protein (e.g., its stability or catalytic activity) or the corresponding fitness of an organism. The second limitation is a lack of suitable data, with studies relying on a purely theoretical model of landscape topography [ 28 ], a categorical, rather than quantitative, phenotype [ 27 ], an incomplete fitness landscape [ 20 ], or assumptions of additivity regarding the combined effects of mutations [ 29 ]. We therefore do not know how the structure of a genetic code, standard or otherwise, influences the evolvability of proteins beyond one-step adaptation. This is an important knowledge gap, because protein evolution often proceeds via a sequence of adaptive mutations that improve protein function, as evidenced by comparisons of orthologous sequences [ 30 , 31 ] and directed protein evolution experiments [ 32 , 33 ]. Moreover, given the increasing interest in engineering nonstandard genetic codes [ 24 ], it is desirable to deduce design principles for engineering genetic codes with reduced or enhanced evolvability, as these might be used to form a “genetic firewall” [ 34 ] or accelerate directed evolution [ 22 ], respectively.
What are the implications of code robustness for protein evolvability? By definition, a robust genetic code limits the amount of phenotypic variation that point mutations can cause. However, opinions differ on whether this hinders or facilitates evolvability. Inspired by Fisher’s geometric model [ 18 ], early theoretical work argues that code robustness may facilitate protein evolvability exactly because it minimizes the effects of mutations, thus increasing the probability that mutations will be adaptive [ 19 ]. Indeed, by analyzing the fitness effects of point mutations to the antibiotic resistance gene TEM-1 β-lactamase and 2 influenza hemagglutinin inhibitor genes, it has been shown that missense mutations are enriched for adaptive amino acid changes, relative to amino acid changes that require multiple point mutations [ 20 , 21 ]. In contrast, more recent theoretical work [ 22 ], motivated by advances in synthetic biology [ 23 – 26 ], argues that protein evolvability can be enhanced by reducing code robustness, because by doing so one can increase the number and diversity of amino acids accessible via point mutation to any codon.
The structure, history, and evolutionary implications of the standard genetic code have fascinated scientists for decades [ 10 – 14 ]. Given the nearly infinite space of alternatives, why did life converge on the standard genetic code? What makes it so special? Answers to this question are typically based on comparisons of the properties of the standard genetic code to those of hypothetical, alternative codes [ 15 , 16 ], of which there are many [ 17 ]. Even if one maintains the degeneracy of the standard code, but simply randomizes which amino acids are assigned to which codon blocks, there are 20! ≈ 10^18 possible rewirings. By sampling a large number of such rewired codes, one can ask whether a given quantitative property of the standard genetic code has a value higher or lower than expected by chance. For example, using a measure of so-called “error tolerance” based on how well point mutations preserve polar requirement (a measure of hydrophilicity), and taking into consideration mutation bias toward transitions relative to transversions, Freeland and Hurst [ 16 ] showed that only one in a million rewired codes preserves the hydrophilicity of amino acids to a greater extent than the standard genetic code. The standard genetic code is thus highly robust to this form of error, in that point mutations and mistranslations tend to cause minor changes to this physicochemical property of amino acids.
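The amino acid permutation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the sampling procedure of any cited study: the helper name `permute_code` is hypothetical, stop codons are left in place, and each amino acid's full set of codons is treated as a single block.

```python
import math
import random

BASES = "UCAG"
AA_BY_CODON = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AA_BY_CODON))  # standard code, "*" marks stop

def permute_code(code, rng):
    """Reassign amino acids to codon blocks; block structure and stops stay fixed."""
    amino_acids = sorted(set(code.values()) - {"*"})
    shuffled = amino_acids[:]
    rng.shuffle(shuffled)
    swap = dict(zip(amino_acids, shuffled))
    swap["*"] = "*"
    return {codon: swap[aa] for codon, aa in code.items()}

rewired = permute_code(STANDARD, random.Random(1))
print(math.factorial(20))  # 2432902008176640000, i.e. about 2.4e18
```

Because only the labels of the blocks change, every rewired code produced this way has the same degeneracy pattern as the standard code.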
What determines whether a protein’s adaptive landscape is smooth or rugged? One primary factor is the genetic code an organism uses for translation. Its importance arises because it determines which amino acid changes are accessible via alteration of a single nucleotide. For example, under the standard code, point mutations to the CUG codon can change the amino acid leucine to methionine (AUG), valine (GUG), proline (CCG), glutamine (CAG), and arginine (CGG), but not to any other of the remaining 14 amino acids. A genetic code thus defines which protein sequences are “near” one another in sequence space [ 9 ], and which mutational paths to adaptation are closed or open ( Fig 1C ).
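The CUG example can be reproduced directly from the codon table. The sketch below is illustrative only; the helper names `point_mutants` and `accessible` are introduced here and do not come from the original study.

```python
BASES = "UCAG"
AA_BY_CODON = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AA_BY_CODON))  # standard code, "*" marks stop

def point_mutants(codon):
    """The 9 codons that differ from `codon` at exactly one position."""
    return [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]

def accessible(codon, code):
    """Meanings reachable by one point mutation, excluding synonymous changes."""
    return sorted({code[m] for m in point_mutants(codon)} - {code[codon]})

print(accessible("CUG", STANDARD))  # ['M', 'P', 'Q', 'R', 'V']
```

Passing a rewired code instead of `STANDARD` shows how the same codon acquires a different mutational neighborhood.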
(A) In one-step adaptation, evolvability depends on the amount of adaptive phenotypic variation accessible via point mutation. Therefore, the genotype shown with a filled circle in the right panel is more evolvable than the one shown in the left panel. (B) Zooming out and considering multi-step adaptation, landscape topography becomes important. Smoother landscapes promote evolvability (left panel), whereas rugged landscapes hinder evolvability (right panel), because an evolving population is more likely to be trapped on a local optimum. (C) Landscape topography is influenced by the genetic code. As a toy model, a sequence consisting of a single codon is shown. Under the standard genetic code, there is a single peak, which is also a global optimum (left panel). If the meaning of the CUG codon is changed from leucine to serine (as is the case in some yeast species [ 8 ]), an adaptive valley is formed (right panel). The population now cannot leave the local optimum consisting of the AUG codon without crossing a maladaptive valley.
Central to this process is evolvability—the ability of mutation to bring forth adaptive phenotypic variation [ 4 , 5 ]. For short-term, one-step adaptation, evolvability depends on the immediate mutational neighborhood of a protein sequence ( Fig 1A ). That is, it depends on the amount of adaptive phenotypic variation accessible via point mutation. For longer-term, multi-step adaptation, evolvability depends on the topography of the adaptive landscape. A smooth single-peaked landscape facilitates evolvability, because mutation can easily bring forth adaptive phenotypic variation from anywhere in the landscape, except atop the global peak; in contrast, a rugged landscape diminishes evolvability, because its adaptive valleys often preclude the generation of adaptive phenotypic variation [ 4 , 6 , 7 ] ( Fig 1B ).
Proteins are the workhorses of the cell. They are the building blocks of cellular infrastructure, they transport molecules, regulate gene expression, and catalyze essential biochemical reactions. How do such protein functions evolve? The classic metaphor of the adaptive landscape is helpful to conceptualize this process [ 1 ]. An adaptive landscape is a mapping from genotype space onto fitness or some related quantitative phenotype, which defines the “elevation” of each coordinate in this space. For proteins, genotype space comprises the set of all possible amino acid sequences of a given length [ 2 ] and the quantitative phenotypes of these sequences include catalytic activity, folding stability, and binding affinity. The evolution of protein function can then be viewed as a hill-climbing process in such a landscape, in which mutation and natural selection tend to drive evolving populations toward adaptive peaks of improved functionality [ 3 ].
Results
Data
We construct empirical adaptive landscapes using 6 combinatorially complete data sets for 4 proteins. The first protein is GB1, a Streptococcal protein that binds immunoglobulin [41,42]. Wu and colleagues [36] experimentally assayed the binding affinity of GB1 to immunoglobulin for all 20^4 = 160,000 amino acid sequences at 4 protein sites (V39, D40, G41, and V54; Fig A in S1 Text), which are known to interact epistatically and influence binding affinity [43]. In particular, they measured the relative frequencies of sequence variants before and after selection for binding immunoglobulin. Binding affinities are then defined as log enrichment ratios (Methods). The second protein is ParD3, a bacterial antitoxin that is part of the ParD-ParE family of toxin-antitoxin systems, which are commonly found on bacterial plasmids and chromosomes [44]. Such systems comprise a toxin that inhibits cell growth unless bound and inhibited by the cognate antitoxin. Lite and colleagues [37] experimentally assayed bacterial cell growth for all 20^3 = 8,000 amino acid sequence variants at 3 sites in ParD3 (D61, K64, E80; Fig A in S1 Text), in the presence of its cognate toxin ParE3, as well as a related, but non-cognate toxin ParE2. This resulted in 2 data sets, 1 per toxin, in which cell growth was used as a quantitative readout of the degree to which individual ParD3 variants antagonize a given toxin. The third protein is ParB, a DNA-binding protein crucial for bacterial chromosome segregation [45]. The binding site of ParB, parS, is a palindrome of GTTTCAC. Jalal and colleagues [39] experimentally measured the binding affinity of ParB to the cognate DNA sequence, parS, as well as a related DNA-binding site, NBS (palindrome of ATTTCCC), for all 20^4 = 160,000 variants at 4 positions (R173, T179, A184, and G201; Fig A in S1 Text). This again resulted in 2 data sets, 1 per DNA-binding site.
The fourth protein is dihydrofolate reductase (DHFR), an essential metabolic enzyme in Escherichia coli. Papkou and colleagues [40] generated all possible 64^3 = 262,144 combinations of codons at 3 positions (A26, D27, L28) of the corresponding folA gene. Missense mutations at these positions are known to confer resistance to the antibiotic trimethoprim [46,47]. Using a mass-selection experiment, Papkou and colleagues [40] measured the fitness of each variant in the presence of a sublethal dose of trimethoprim. The majority (89.7%) of the variants are nonfunctional, in that they are sensitive to trimethoprim. Following the protein evolution literature [3,36,48], we assume that fitness is directly proportional to binding affinity (GB1, ParB) or growth rate (ParD3, DHFR), and will use the term “fitness” generically for all landscapes from now on. Using the raw measurements described above (binding affinities and cell growth), we inferred the fitness values and imputed the missing sequence variants (6.6% of the GB1 data set) using empirical variance component regression [49] (Methods and Fig B in S1 Text). For each of the 6 data sets, we constructed adaptive landscapes using the standard genetic code, as well as hundreds of thousands of rewired codes. Specifically, we represented each mRNA sequence of length 12 (GB1, ParB-parS, ParB-NBS) or 9 (ParD-ParE2, ParD-ParE3, and DHFR) as a vertex in a mutational network and connected vertices with an edge if their corresponding sequences differed by a single point mutation [50] (Methods). We labeled each vertex with the fitness of its corresponding translation under a given genetic code, thus defining the “elevation” of each coordinate in genotype space.
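The construction just described can be sketched for the smallest possible case, a single codon. Everything numerical here is hypothetical: the fitness values in `aa_fitness` are invented for illustration, stop codons and unlisted amino acids are assigned fitness 0, and a peak is taken to be a connected set of equal-fitness sequences with no fitter neighbor, which is an assumption of this sketch and may differ from the definition in the Methods.

```python
from collections import deque

BASES = "UCAG"
AA_BY_CODON = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AA_BY_CODON))  # standard code, "*" marks stop

def neighbors(seq):
    """All sequences one point mutation away (the edges of the network)."""
    return [seq[:i] + b + seq[i + 1:]
            for i in range(len(seq)) for b in BASES if b != seq[i]]

def count_peaks(code, aa_fitness):
    """Count peaks, treating each connected equal-fitness set as one plateau."""
    fitness = {c: aa_fitness.get(code[c], 0.0) for c in CODONS}
    seen, peaks = set(), 0
    for start in CODONS:
        if start in seen:
            continue
        seen.add(start)
        queue, is_peak = deque([start]), True
        while queue:  # breadth-first search over the plateau containing `start`
            seq = queue.popleft()
            for nb in neighbors(seq):
                if fitness[nb] > fitness[seq]:
                    is_peak = False  # a fitter neighbor exists
                elif fitness[nb] == fitness[seq] and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        peaks += int(is_peak)
    return peaks

aa_fitness = {"W": 1.0, "M": 0.9}  # hypothetical values, not measured data
swap = {"M": "L", "L": "M"}        # rewired code: Met and Leu exchange codon blocks
rewired = {c: swap.get(aa, aa) for c, aa in STANDARD.items()}
print(count_peaks(STANDARD, aa_fitness), count_peaks(rewired, aa_fitness))  # 2 1
```

In this toy case the standard code isolates AUG (Met) from UGG (Trp), giving two peaks, whereas the rewired code places a Met codon (UUG) next to UGG and leaves a single peak. The real landscapes apply the same idea to sequences of 9 or 12 nucleotides.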
Evolutionary simulations reveal complex relationship between code robustness and evolvability
Our analyses suggest that code robustness promotes evolvability by producing smooth adaptive landscapes with few peaks and little sign epistasis. As a consequence, we anticipate evolving populations to obtain higher fitness, on average, when translating proteins using more robust codes than when using less robust codes. To determine if this is the case, we turn to evolutionary simulations, specifically of greedy adaptive walks [7]. These walks model the adaptive evolution of a large population with pervasive clonal interference, such that all possible point mutations to a sequence are simultaneously present in the population, and the fittest of these variants goes to fixation. For each of the 100,000 amino acid permutation landscapes and each of the 6 data sets, we initialized a walk at each nucleotide sequence encoding a functional product. We terminated a walk when it reached a local or global adaptive peak, and recorded the fitness of that peak sequence (Methods). Fig 3A shows the average fitness reached by the greedy adaptive walks in relation to code robustness. As expected from our landscape-based analyses, evolving populations reached higher fitness, on average, when translating proteins using more robust genetic codes for the GB1, ParD-ParE2, and ParB-NBS landscapes (Table D in S1 Text). However, the results for the ParD-ParE3 landscape were not statistically significant (R = −0.004, p = 0.182) and we even observed a negative correlation in the ParB-parS (R = −0.0118, p = 1.89 × 10^−4) and DHFR data sets (R = −0.096, p < 2.2 × 10^−16). Similar to the analysis of accessible paths, we reasoned that these results might be caused by variation in the size of the global peak, such that larger global peaks are easier to “find” than smaller global peaks, simply because they contain more mRNA sequences.
Indeed, we observe a positive correlation between the size of the global peak and mean fitness reached by the greedy adaptive walks in all 6 data sets (Table E in S1 Text). We thus again restricted our analysis to those genetic codes for which the size of the global peak is the same as under the standard genetic code and occupies a single connected region in genotype space. However, even in this subset of codes, we observe a positive correlation between code robustness and mean fitness reached by the greedy adaptive walks for only 4 out of 6 data sets (Table D in S1 Text). Whether code robustness promotes or diminishes evolvability thus depends on the particular landscape, which is surprising given that robust genetic codes are associated with smoother fitness landscapes in all 6 data sets. In Section 7 in S1 Text, we show that the correlations between code robustness and mean fitness result from a complex interplay between the heights and sizes of the basins of attraction of the peaks, as well as from idiosyncrasies specific to particular data sets.
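A greedy adaptive walk of the kind used in these simulations can be sketched on a toy landscape. The landscape below is entirely hypothetical: genotypes are 2-nucleotide sequences, the weights `W0` and `W1` and the bonus for `AA` are invented so that there is one local peak (AA) and one global peak (UU) with no fitness ties, and the handling of equal-fitness neighbors in the actual study may differ.

```python
BASES = "ACGU"

def neighbors(seq):
    """All sequences one point mutation away."""
    return [seq[:i] + b + seq[i + 1:]
            for i in range(len(seq)) for b in BASES if b != seq[i]]

# Hypothetical per-position weights; "AA" gets an epistatic bonus below.
W0 = {"A": 0, "C": 10, "G": 20, "U": 30}
W1 = {"A": 0, "C": 11, "G": 22, "U": 33}

def fitness(seq):
    return 55 if seq == "AA" else W0[seq[0]] + W1[seq[1]]

def greedy_walk(start):
    """Move to the fittest point mutant until no neighbor is strictly fitter."""
    path = [start]
    while True:
        best = max(neighbors(path[-1]), key=fitness)
        if fitness(best) <= fitness(path[-1]):
            return path  # the last entry is a local or global peak
        path.append(best)

genotypes = [a + b for a in BASES for b in BASES]
endpoints = [greedy_walk(g)[-1] for g in genotypes]
mean_fitness = sum(fitness(e) for e in endpoints) / len(endpoints)
print(greedy_walk("CC"))  # ['CC', 'CU', 'UU']
print(greedy_walk("CA"))  # ['CA', 'AA']
print(mean_fitness)
```

Walks that start next to AA are captured by the local peak, which lowers the mean fitness reached. The same averaging over starting sequences, applied to each rewired code, yields the quantity that is compared against code robustness.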
Fig 3. Relationship between code robustness and results of greedy adaptive walks. The labeled point denotes the results obtained using the standard genetic code. Data pertain to GB1. The data and code required to generate this figure can be found at https://zenodo.org/records/10677993.
https://doi.org/10.1371/journal.pbio.3002594.g003
We also observe that the average length of the walks tended to be longer under robust codes (Fig 3C; 4.89 versus 5.30 steps, on average, for the 1% least and most robust codes, respectively), revealing that the benefit of increased fitness afforded by code robustness comes at the cost of longer evolutionary trajectories to adaptation. This is in line with our observations concerning landscape ruggedness. In landscapes with many local peaks, a greedy walk is more likely to be initialized near one of these peaks, which it will likely ascend in only a small number of mutational steps. In contrast, in landscapes with few local peaks, a greedy walk is more likely to be initialized farther away from one of these peaks, thus increasing the length of the mutational path to adaptation, be it to a local or global peak. We also highlight that the greedy walks reached higher mean fitness under the standard genetic code than under 85% of the amino acid permutation codes in 5 out of 6 landscapes and ranked among the top 5% in 3 of them. Similarly, the standard code resulted in exceptionally low Shannon entropy of the distribution of reached peaks in 5 out of 6 data sets (Table C in S1 Text), meaning that under the standard genetic code, greedy walks preferentially converged on a small number of fitness peaks. We observe qualitatively the same results in simulations of the weak-mutation regime (Section 8 in S1 Text) and using codes constructed by restricted amino acid permutation (Section 4 in S1 Text) and random codon assignment (Section 5 in S1 Text).
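The Shannon entropy of the distribution of reached peaks can be computed from the list of walk endpoints. The sketch below assumes base-2 logarithms and uses made-up peak labels; the base and normalization used in the study are not stated in this section.

```python
import math
from collections import Counter

def shannon_entropy(endpoints):
    """Entropy (in bits) of the empirical distribution of reached peaks."""
    counts = Counter(endpoints)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# Hypothetical endpoints of 8 greedy walks under three different codes.
print(shannon_entropy(["p1"] * 8))                    # 0.0: all walks reach one peak
print(shannon_entropy(["p1"] * 4 + ["p2"] * 4))       # 1.0: two equally likely peaks
print(shannon_entropy(["p1", "p2", "p3", "p4"] * 2))  # 2.0: four equally likely peaks
```

Low entropy therefore corresponds to walks converging on a small number of peaks, as reported for the standard genetic code.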
The genetic code governs the genetic architecture of long-term molecular evolution
In the previous section, we studied a short-term adaptive process, in which high-fitness protein variants evolve from low-fitness variants via mutation and selection. However, once an evolving population reaches high fitness, it behaves like a random walk among the mutationally interconnected set of high-fitness variants. To assess how different code rewirings influence this random walk, we apply a visualization technique that captures the dynamics of a finite population evolving on a fitness landscape at mutation-selection-drift balance [59], where the distances between genotypes reflect the expected amount of time to evolve from one genotype to another (squared distances have units of time, and time is scaled such that each nucleotide mutation occurs at rate 1; see Methods). In an earlier study, we used this technique to explore the structure of the GB1 landscape at the amino acid level [60] and found that it consists of 3 main regions of high-fitness protein variants that differ primarily in the placement of small non-polar and bulkier amino acids at positions 41 and 54 (Fig F in S1 Text). The first and largest of these regions is characterized by having G at position 41, which is compatible with most amino acids at position 54 and contains the wild-type sequence (VDGV); we will refer to this as Region 1. The second region typically has G at position 54, while tolerating T at 54 in some contexts, together with L or F at position 41, and we will refer to this as Region 2. The final region, Region 3, is characterized by A at position 54, which can be paired at position 41 with C, S, or A, and to a lesser extent L and F. Moreover, each of these 3 regions is connected via functional intermediates with the other 2 regions (see Fig F in S1 Text and ref. [60]).
Here, we consider how the genetic code, standard or otherwise, reshapes the structure of these regions and restricts their mutational interconnections, focusing on the standard genetic code as well as the 2 most and 2 least robust codes in our set of 100,000 amino acid permutation codes (see Fig G in S1 Text for the corresponding codon tables). Fig 4 shows the resulting visualizations, where for each code we plot the visualization using a sufficient number of dimensions to show the major features of the corresponding fitness landscape. These dimensions are called Diffusion Axes because they reflect the dynamics of diffusion in sequence space, and they are ordered such that the first k Diffusion Axes provide an optimal approximation of the expected times to evolve from one sequence to another (see Methods). In order to better see the structure of the high-fitness set, we also show a second visualization for each code where we only plot the high-fitness sequences (which in what follows we will take to be the fittest 1%) and color these sequences by their corresponding region in amino acid sequence space.
Fig 4. The genetic code governs genotype network topology and the genetic architecture of long-term molecular evolution. Fitness landscape for GB1 at positions 39, 40, 41, and 54 under the (A) standard genetic code, (B, C) the 2 most and (D, E) the 2 least robust codes in the amino acid permutation set. Vertices represent 12-nucleotide sequences and edges connect vertices if their corresponding sequences differ by a single point mutation. Vertex color represents protein fitness (color bar in (A) applies to all panels). Vertices are placed at the coordinates along the diffusion axes, which at a technical level are defined by the subdominant eigenvectors of the rate matrix describing the weak mutation dynamics and have units of square root of time [59], and where time is scaled such that each possible nucleotide mutation occurs at rate 1 (see Methods for details). For each pair of diffusion axes shown, there are 2 subpanels: one that shows all ≈16 million genotypes, with the location of the sequences encoding the wild-type protein sequence V39 D40 G41 V54 marked, and another that shows only the genotype network of high-fitness variants (top 1% of fitness distribution), which better shows the connectivity between high-fitness regions and which is annotated with the protein sequence features that characterize each cluster or subset of nucleotide sequences. Colors in these panels represent the main regions of functional protein sequences as highlighted in Fig F in S1 Text and show the extent to which the connectivity between these regions of amino acid sequence space is rewired under the different genetic codes. See Section 9 in S1 Text for a more detailed description of the visualizations. The data and code required to generate this figure can be found at https://github.com/parizkh/rewired_codes_landscapes/tree/main/GB1/05_landscape_visualizations.
https://doi.org/10.1371/journal.pbio.3002594.g004
We find that different rewirings of the genetic code produce fitness landscapes whose structures differ dramatically from one another and from that of the fitness landscape in amino acid sequence space. For example, under the standard genetic code, Region 1 is no longer directly connected to Region 3 because neither 41F nor 41L is accessible from 41G under the standard genetic code (Fig 4A). Indeed, under many codes the set of high-fitness sequences becomes split into several distinct components separated by lower fitness sequences, as observed in Robust Code A (Fig 4B), Robust Code B (Fig H in S1 Text), and Non-robust Code B (Fig 4E), so that moving from one component to another requires the fixation of less fit sequences. The waiting time for such deleterious fixations is long, increasing the amount of time required for a population to explore the landscape. We can quantify this in terms of the relaxation time of the rate matrix for the evolutionary random walk, given by the inverse of the absolute value of its largest non-zero eigenvalue. For Robust Code A and Non-robust Code B, the relaxation time is, respectively, 3.17- and 3.22-fold longer than the expected waiting time for individual nucleotide mutations; this is roughly 50% longer than for the standard genetic code or Non-robust Code A, with relaxation times 2.31- and 2.05-fold longer than the expected waiting time for individual nucleotide mutations (see Methods for details). We also find that high-fitness sequences connected by high-fitness paths can be connected via very different structures in sequence space. For example, the standard genetic code, as well as the 2 least robust codes (Fig 4A, D, and E) all show long branch-like structures where the high-fitness paths connecting genotypes require that the mutations be accumulated in a specific order.
This can result in very long paths; for example, Fig I in S1 Text shows an 11-mutation path connecting Regions 1 and 3 under the standard genetic code that does not include any substitutions at positions 39 or 40, synonymous changes, or reversions. In contrast, we can also see regions of sequence space where the high-fitness set shows a grid-like structure in which mutations at a pair of sites can accumulate independently from each other, as seen in the right-hand panel of Fig 4B for Robust Code A, or under the standard genetic code, where a population can switch between F or L at position 41 more or less independently of whether T, A, or G is found at position 54 (Fig 4A, negative values of Diffusion Axis 1). Besides these large-scale differences in the pattern of connectivity between high-fitness sequences, the density of high-fitness paths can also vary greatly. One particularly interesting case is where a pair of sequences are connected by many long high-fitness paths but are also accessible via a smaller number of short high-fitness paths that can only be accessed on very specific genetic backgrounds; we call these rare shortcuts “wormholes” because they are short paths that connect otherwise distant regions of the high-fitness set. For example, under the standard genetic code, we can see that distant parts of the network of high-fitness sequences are in fact accessible from one another via 41S_4 (where the 2 disconnected sets of S codons are broken into S_2 and S_4, named for the number of codons in each set [61]; Fig 4A, bottom right). In this case, the average probability for 41G-54L sequences of arriving at a high-fitness genotype with L or F at 41 and T, A, or G at 54 through a high-fitness S_4 intermediate is 1.13% (see Section 9 in S1 Text for additional details). Thus, although these short paths are possible, they are taken only a small minority of the time.
We see another example of such a wormhole under Non-robust Code A, where WWLP sequences bridge the otherwise distant regions characterized by 41L-54A and 41L-54T (Fig 4D). This wormhole is used even more rarely, only 0.002% of the time. In summary, these visualizations illustrate the richness and variety of landscape topographies that can be induced by different genetic codes, and the extent to which even exceptionally robust codes can interact with the idiosyncrasies of a particular protein fitness landscape to break crucial links between high-fitness variants.
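The relaxation times quoted in this section can be written compactly. The notation below is introduced here for illustration and is not taken from the Methods: Q denotes the rate matrix of the evolutionary random walk, and its eigenvalues are assumed to be real and ordered from the zero eigenvalue downward.

```latex
% Relaxation time of the rate matrix Q (eigenvalue ordering is an assumed notation)
\tau = \frac{1}{|\lambda_2|},
\qquad 0 = \lambda_1 > \lambda_2 \ge \lambda_3 \ge \dots

% Ratios of the reported relaxation times, in units of the expected
% waiting time for an individual nucleotide mutation
\frac{3.17}{2.31} \approx 1.37, \quad
\frac{3.22}{2.31} \approx 1.39, \quad
\frac{3.17}{2.05} \approx 1.55, \quad
\frac{3.22}{2.05} \approx 1.57
```

Because time is scaled so that each nucleotide mutation occurs at rate 1, the relaxation time is expressed in multiples of the expected waiting time for a single nucleotide mutation. The four ratios, ranging from about 37% to 57% longer, are the arithmetic behind the statement that the relaxation times for Robust Code A and Non-robust Code B are roughly 50% longer than those for the standard genetic code and Non-robust Code A.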
---
[1] Url:
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002594
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.