(C) PLOS One [1]. This content originally appeared in journals.plos.org.
Licensed under a Creative Commons Attribution (CC BY 4.0) license.
URLs:
https://journals.plos.org/plosone/s/licenses-and-copyright
https://creativecommons.org/licenses/by/4.0/
------------
Large-scale design and refinement of stable proteins using sequence-only models
Authors: Jedediah M. Singer (Two Six Technologies, Arlington, Virginia, United States of America), Scott Novotney, Devin Strickland (Department of Electrical and Computer Engineering, University of Washington), et al.
Date: 2022-04
Engineered proteins generally must possess a stable structure in order to achieve their designed function. Stable designs, however, are astronomically rare within the space of all possible amino acid sequences. As a consequence, many designs must be tested computationally and experimentally in order to find stable ones, which is expensive in terms of time and resources. Here we use a high-throughput, low-fidelity assay to experimentally evaluate the stability of approximately 200,000 novel proteins. These include a wide range of sequence perturbations, providing a baseline for future work in the field. We build a neural network model that predicts protein stability given only sequences of amino acids, and compare its predictions to the assayed values. We also report another network model that is able to generate the amino acid sequences of novel stable proteins given requested secondary sequences. Finally, we show that the predictive model—despite weaknesses including a noisy data set—can be used to substantially increase the stability of both expert-designed and model-generated proteins.
Competing interests: JMS and AZ are employed by Two Six Technologies, which has filed a patent on a portion of the technology described in this manuscript. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
Funding: This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory under Contract No. FA8750-17-C-0231 (and related contracts held by SD2 Publication Consortium Members). The specified contract number applies to JMS, SN, NL, AZ, and HE; related contracts under the same program apply to the other authors. We thank the staff at the Northeastern Collaborative Access Team (NECAT) at the Advanced Photon Source for the beamtime. Representatives of DARPA (the funders) asked scientific questions that may have provided ideas for study design and data analysis, but played no other role in study design or data analysis. They had no role in data collection or preparation of the manuscript. They encouraged publication after the decision to publish was made by the authors.
Most proteins, natural or designed, require a stable tertiary structure for functions such as binding [1], catalysis [2], or self-assembly [3]. Because structural stability derives from thousands of interactions between atoms, both attractive and repulsive, whose net sum is close to zero, precise prediction of stability is extremely challenging. Current approaches use combinations of statistical and physics-based potentials to approximate and refine calculations of these interactions. While much progress has been made, these programs comprise many terms that scale unfavorably in compute time, are imperfectly parameterized, and attempt to model poorly understood processes such as the entropy of the unfolded state or the polarizability of some atoms [4, 5]. Thus, calculating the stability of even a small protein is both computationally expensive and usually insufficiently accurate for practical use. This means that creating a small number of successful proteins typically requires the design and evaluation of a large number of candidates, at significant cost. Data-driven approaches, especially neural networks, implicitly solve only the calculations necessary to arrive at a prediction, and have been used in computational prediction tasks in other domains lacking accurate statistical or physics-based models. A successful application to protein stability would lead to both higher design-build-test throughput and higher accuracy in design for a wide range of applications.
There has been a recent surge in the application of machine learning to understanding properties of proteins. In particular, there is substantial interest in predicting the forces guiding protein folding and dynamics, leading to physically plausible models of protein structure [6]. [7] presents an approach for learning protein structure from primary sequence leveraging geometric constraints, which yields competitive accuracy with high computational performance. [8, 9] applied similar geometric constraints to a model that augmented its primary sequence input with data about mutation correlations in homologous proteins to yield state-of-the-art performance in structure prediction. There have been a few attempts to use primarily or exclusively primary sequence data to make predictions about proteins, leaving the underlying biology, chemistry, and physics to be learned implicitly. [10] showed that a greedy search guided by simple sequence-based machine-learning models could substantially improve the efficiency of guided evolution for protein binding. [11] developed a sequence-based recurrent model which generated fixed-length vector embeddings of proteins, based solely on primary sequence. They demonstrated that these embeddings were useful for solving complex problems in protein engineering, including stability prediction, using very simple models. [12-14] explored attention-based models that had been trained on large sets of protein sequences and found that the models' learned representations aligned with such biophysical properties as 3D structure and binding domains. In [15] an ensemble of simple machine learning models predicted stability based on features of protein secondary and tertiary structure, after training on a limited set of natural proteins. [16] used a regression model over Fourier transforms of encoded biophysical properties of primary sequences to identify mutations of a particular enzyme that increased stability while decreasing detrimental aggregation.
Data-driven models have also recently been applied to the direct design of proteins. [17] used recurrent neural networks to generate novel peptide designs based on a training set of helical antimicrobial peptides, with results that were predicted to be more active than random draws from the training set distribution. These designs were not tested in the laboratory, however. In [18], constraints derived from evolutionary statistics of single and paired amino acids yielded novel designs that demonstrated the intended enzymatic activity. [19] demonstrated that a model trained to predict 3D structure could be used to "hallucinate" novel sequences that formed experimentally validated proteins.
One drawback of machine learning models is that they require large amounts of training data. This presents a particular problem for modeling protein stability because, historically, experimental measurements of stability have been laborious [20-22]. Many previous applications of machine learning to problems in protein science have taken advantage of existing large datasets such as the Protein Data Bank [23], UniProt [24], or a handful of existing datasets of empirical measurements (e.g. [25, 26]). Such efforts can only address questions for which there are pertinent data, and typically cannot test model predictions comprehensively on new data or fully support iterative or reinforcement learning strategies. Machine learning in combination with the generation of new data has yielded new insights, for example in the prediction of functional motifs such as alternative polyadenylation [27], alternative splicing [28], small molecule inhibitor discovery [29], and protein-protein interaction surfaces [30].
Here we describe a data-driven process for the high-speed, automated creation of stable protein designs (Fig 1). We use a previously described combination [26] of parallel oligo library synthesis, yeast surface display, and next-generation sequencing, coupled with Aquarium lab automation software (aquarium.bio), to generate very large datasets suitable for training and testing a neural network model of protein stability. We describe a computationally efficient neural network, the Evaluator Model (EM), that is able to predict protein stability with high accuracy, based solely on amino acid sequence. We validate the EM using data on almost 200,000 new protein designs, assayed using this robust experimental pipeline. We also demonstrate the use of the EM to refine the stability of protein designs by making multiple changes, increasing stability tenfold as evaluated by the assay described in [26]. We show that these refinements can easily be made to respect additional constraints on how they change the proteins, which in conjunction with other tools could lead to preservation of structure or function. We describe another neural network, the Generator Model (GM), that is able to create novel stable protein sequences at high speed; these sequences can also be successfully refined by the EM. Finally, we demonstrate via low-resolution methods that selected examples fold into stable structures, and report a high-resolution crystal structure of one design that matches the expected topology.
Fig 1. Proteins are designed, either by an expert using Rosetta or dTERMen software or by a neural network model that transforms secondary sequences into primary sequences. These designs are refined to maximize stability via an iterative procedure. At each step, the stability of all possible single-site substitutions is predicted by another neural network model. The mutants with the highest predicted stability are saved and used as seeds for the next round of optimization.
Results
High-throughput measurement of protein stability

In order to generate enough data to train a sequence-based neural network model of stability, we adapted and automated a recently developed technique for assaying the stability of small proteins through their relative resistance to digestion by proteases (S1 Fig). This approach uses DNA oligonucleotide gene library synthesis to encode the designed proteins, which are expressed in a yeast surface display system so that each cell displays many copies of a single designed protein. The yeast library is treated with trypsin and chymotrypsin in a range of concentrations, and sorted in a fluorescence activated cell sorter (FACS) to collect yeast cells with a high proportion of undigested protein. The resulting pools are then deep-sequenced, and the relative resistance to protease digestion is computed from the frequencies of sequence reads. We used this combination of the existing assay and software-based automation to evaluate the stability of up to 100,000 proteins in a single experiment.

While resistance to digestion by protease is partly determined by a protein's unfolding free energy (i.e. its stability), this relationship is complicated by the intrinsic, unfolded-state resistance of the sequence to cleavage. [26] devised a method to subtract the intrinsic resistance predicted by an unfolded-state model (USM) from the overall observed resistance. This yields a stability score that, in principle, represents only the component of resistance derived from folding. Because this correction is imperfect—for example, it only modestly improves agreement between trypsin- and chymotrypsin-derived stability scores—we reasoned that potential latent factors, such as the propensity of the sequence to form residual unfolded-state structures incompatible with binding to the active site of the protease, could also affect the cleavage resistance. We developed a more flexible USM to account for these additional factors. This new USM yielded predictions that improved upon the original USM in several metrics (S2 Fig). We also confirmed that the stability score calculated using the new USM was an improvement over raw EC50, by comparing relationships between EC50 values for the two proteases (S3 Fig) and relationships between stability scores for the two proteases (S4 Fig). Given the apparent improvements in stability score over raw EC50, and in the new USM over the original, we chose to use a stability score based on the new USM for all analyses. Analyses performed with the original USM yield similar results and unchanged conclusions.
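To make the score arithmetic concrete, here is a minimal sketch, assuming (as in [26]) that resistance is expressed as a log10 EC50 and that the USM predicts an unfolded-state EC50 on the same scale; the function names are illustrative, and taking the minimum of the two per-protease scores follows the convention noted later for the final stability score.

```python
import numpy as np

def stability_score(observed_log10_ec50, predicted_unfolded_log10_ec50):
    """Per-protease stability score (sketch): observed protease resistance
    minus the resistance the unfolded-state model (USM) predicts for the
    same sequence. Both are log10 EC50 values, so the score is logarithmic:
    +1 corresponds to a ten-fold gain in folding-derived resistance."""
    return observed_log10_ec50 - predicted_unfolded_log10_ec50

def final_stability_score(trypsin_score, chymotrypsin_score):
    # The lower of the two per-protease scores determines the final score.
    return np.minimum(trypsin_score, chymotrypsin_score)
```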
Predicting stability with a sequence-only model

We built a convolutional neural network (CNN) model, which we call the Evaluator Model (EM), to predict the stability score of a sequence of amino acids (Fig 2). This model was trained on a corpus of 107,948 proteins ("Corpus A") designed by experts using Rosetta [4] or dTERMen [31] software (S5 and S6 Figs), and achieved high performance on held-out test data. Corpus A comprised the designs reported in [26] and [32], as well as previously unpublished designs.

Fig 2. Evaluator model and performance. (A) Architecture of the Evaluator Model. (1) Input: one-hot encoding of the protein's primary sequence. (2) Three convolutional layers; the first flattens the one-hot encoding to a single dimension, and successive filters span longer windows of sequence. Three dense layers (3) yield trypsin and chymotrypsin stability scores (4). The final stability score (5) is the minimum of the two. (6) A separate dense layer from the final convolutional layer yields a one-hot encoding of the protein's secondary structure. (B) Success of EM predictions on a library of new designs. We used the EM to predict the stability of 45,840 new protein sequences that the model had not seen before (described later as "Corpus B"); the distribution of predictions is shown in pink. The blue curve shows the fraction of these designs that were empirically stable (stability score >1.0) as a function of the model's a priori stability predictions (dotted black line: stability threshold for predicted stability). 281 outliers (predicted stability score <-1.0 or >3.0) excluded for clarity. (C) Predicted versus observed stability scores for the same data, with outliers included.
https://doi.org/10.1371/journal.pone.0265020.g002
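As a concrete rendering of the architecture in Fig 2A, the following is a minimal Keras sketch; the sequence length, filter counts, kernel widths, and layer sizes are placeholder assumptions rather than the published hyperparameters, and the width-1 first convolution is one reading of "flattens the one-hot encoding to a single dimension".

```python
import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, N_AA, N_SS = 80, 20, 3  # illustrative sizes, not the paper's values

inp = layers.Input(shape=(MAX_LEN, N_AA))           # (1) one-hot primary sequence
x = layers.Conv1D(64, 1, activation="relu")(inp)    # (2) width-1 conv: per-position mixing
x = layers.Conv1D(64, 5, padding="same", activation="relu")(x)
conv_out = layers.Conv1D(64, 9, padding="same", activation="relu")(x)

h = layers.Flatten()(conv_out)
h = layers.Dense(128, activation="relu")(h)         # (3) dense layers
h = layers.Dense(64, activation="relu")(h)
scores = layers.Dense(2, name="per_protease")(h)    # (4) trypsin & chymotrypsin scores
final = layers.Lambda(lambda s: tf.reduce_min(s, axis=-1),
                      name="stability")(scores)     # (5) minimum of the two

ss = layers.Dense(N_SS, activation="softmax",       # (6) per-residue secondary
                  name="secondary_structure")(conv_out)  # structure head

model = tf.keras.Model(inp, [scores, final, ss])
```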
The EM demonstrated predictive performance near the limit of the experimental data using a random testing subset. We reserved a set of 5000 randomly selected designs for testing model performance, with the remaining designs used for training. We calculated both the squared Pearson correlation coefficient (i.e. r2) and the R2 goodness-of-fit score for the test set, for five versions of the EM (built by training from scratch five times). R2 is more stringent than r2 because it does not assume a linear regression of model predictions onto data, and can be negative when the model performs worse than simply predicting the sample mean. The mean R2 score for the EM was 0.48 (r2 = 0.49), slightly better than the relationships between trypsin and chymotrypsin stability scores (trypsin onto chymotrypsin R2 = 0.34, chymotrypsin onto trypsin R2 = 0.38; r2 = 0.47). Ideally, trypsin and chymotrypsin stability scores would be equal; that EM predictions are as similar to stability scores as the per-protease scores are to each other suggests that the EM's performance is near the limit imposed by the assay.

We also evaluated EM performance when controlling for the possibility that it was memorizing artifacts of the design process rather than modeling underlying aspects of stability. It is reasonable to think that a neural network might overfit the data by learning the patterns associated with a given topology or methodology. Topologies each have a loose but distinctive patterning of hydrophobic and hydrophilic residues [33]. Similarly, a particular design methodology may have a sequence-level "tell"—even one that is not easy for humans to discern [34]. In order to guard against overfitting to such features, we took advantage of the fact that our training corpus comprised numerous topologies, design approaches, and sizes. We partitioned the corpus into 27 classes based on these factors. For each class, we tested performance on the designs within that class after training each of three models on data from the other 26 classes. We again used R2 goodness-of-fit scores, taking the mean of the three retrained EMs for each held-out class. The EM achieved a mean R2 of 0.17 over all classes, indicating that the previous results reflect some overfitting to topology or design methodology, but also demonstrating a substantial level of generalization. Performance on each held-out class is shown in S7 Fig. To verify that differences in topology and design yielded distinct proteins, we used BLAST with default parameters using BLOSUM62 [35] to identify each protein's nearest homolog from among the other 26 classes. Maximum bit scores and maximum percent identities were both low (mean ± SD of 34.5±5.1 and 47.0±8.3, respectively), indicating substantial differences between the 27 classes. Moreover, EM performance was not generally better for proteins that were more similar to those in other classes; we found negligible correlations between model error and bit score (Spearman ρ = −0.0011) and percent identity (Spearman ρ = 0.014).

We next asked whether the EM can predict thermal stability for short natural proteins. We procured from the literature the melting temperatures of single-site mutations of several proteins: Cold shock protein CspB [36], Regulatory protein rop [37], Pancreatic trypsin inhibitor [38], Transcriptional repressor arc [39], Toxin CcdB [40], α-lactalbumin [41], and Subtilisin-chymotrypsin inhibitor-2A [42].
We ran these sequences through the EM to predict stability scores, and calculated the Pearson correlation coefficient between experimentally measured melting temperatures and antilogs of stability scores (Pearson r because the units are not directly comparable, and antilogs because stability score is logarithmic) for each data set. Table 1 presents information about the data sets, the Pearson r, and associated p-values. Even without any training on natural protein thermal stability data, there are weak but consistently positive correlations between EM stability predictions and thermal stability.
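The comparison itself is simple; here is a sketch using scipy (variable names are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

def tm_correlation(melting_temps, predicted_stability_scores):
    """Correlate measured melting temperatures with EM predictions.
    Stability scores are logarithmic, so antilogs (10**score) are compared;
    Pearson r is used because the two quantities have different units."""
    r, p = pearsonr(melting_temps,
                    10.0 ** np.asarray(predicted_stability_scores))
    return r, p
```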
Table 1. Evaluator model performance on natural proteins.
https://doi.org/10.1371/journal.pone.0265020.t001
We next sought to evaluate the EM's ability to predict the effects of sequence perturbations and to refine the stability of synthetic protein designs. To support this investigation, we constructed a new dataset ("Corpus B") of 96,799 designs, which fell into three main categories: expert designs from the held-out test subset of Corpus A subjected to a variety of perturbations, expert designs from the held-out test subset of Corpus A subjected to refinement, and novel neural network designs subjected to refinement (all described below). We used the EM to predict the stability of these new designs, and then tested them empirically in the stability assay. We analyzed the 45,840 of these designs for which the credible intervals for both measured protease EC50 values were no greater than 2. Across this set of designs, the EM's predictions were highly consistent with the observed stability scores (R2 = 0.52, r2 = 0.59, Fig 2B and 2C). We note the marked prevalence of predicted stability scores in two peaks slightly above and below 0; we attribute no meaning or significance to these peaks, however—a drawback of using neural networks, where interpretability is often elusive.

Many designs in Corpus B were expert designs mutated to test whether the EM could be misled by local or global changes to a sequence. Such mistakes could reveal weaknesses caused by the model's training data or approach. The EM was trained only on stability scores of expert designs created with stability as a goal (though it was exposed to natural protein sequences during training; see Methods), which made overfitting a particular concern. The EM might, for instance, learn to make predictions based mostly on a design's overall amino acid composition, taking advantage of statistical differences between the training set and the larger set of possible sequences. In that case, the EM may be relatively insensitive to single-site mutations or changes that alter the ordering of amino acids but maintain their overall frequencies. Conversely, if the EM has learned to make predictions based on specific memorized motifs present in the training data, then single-site mutations away from these motifs may cause the model to react unpredictably.

To test these possibilities, we evaluated the effects of fourteen different types of perturbations to each of the 5000 designs in the original test set. These fell into three classes: single-site mutations (substitution, insertion, deletion), global rearrangements that did not change amino-acid frequency (sequence reversal, cyclic shifts), and global compositional changes (deleting half the amino acids). The model was able to predict the stability of perturbed proteins with reasonable fidelity (S8 Fig). As expected, all fourteen classes of perturbations tended to decrease a protein's stability. Also as expected, the two classes of global disruptions tended to decrease stability more than the local mutations. To evaluate the EM's ability to predict decreases in stability, we limited analysis to cases where the base protein had, or was predicted to have, a stability score of at least 1. As seen in S9 Fig, the model systematically underestimated the impacts of these sequence perturbations. However, it ordered the fourteen perturbations reasonably well by mean destabilization (Spearman ρ = 0.908, p = 7.30 × 10^−6) and showed some success at ranking the impacts of disruptions to individual proteins (Spearman ρ = 0.432, p = 1.68 × 10^−202).
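The text names the three perturbation classes without enumerating all fourteen types; the sketch below implements representative members of each class (the exact variants used, such as which half of the sequence is deleted, are assumptions).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def substitute(seq, rng=random):
    """Single-site mutation: substitute a random residue."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i + 1:]

def insert_residue(seq, rng=random):
    """Single-site mutation: insert a random residue."""
    i = rng.randrange(len(seq) + 1)
    return seq[:i] + rng.choice(AMINO_ACIDS) + seq[i:]

def delete_residue(seq, rng=random):
    """Single-site mutation: delete one residue."""
    i = rng.randrange(len(seq))
    return seq[:i] + seq[i + 1:]

def reverse(seq):
    """Global rearrangement preserving amino-acid frequencies."""
    return seq[::-1]

def cyclic_shift(seq, k):
    """Global rearrangement: rotate the sequence by k positions."""
    return seq[k:] + seq[:k]

def delete_half(seq):
    """Global compositional change (one hypothetical variant: drop the
    C-terminal half of the sequence)."""
    return seq[:len(seq) // 2]
```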
Refining stability with a sequence-only model

Successfully predicting that mutations reduce stability could, in principle, be achieved by overfitting to signatures of expert design—mutated versions of such designs, lacking those signatures, could be judged as inferior. Therefore, we asked whether the EM can also predict stabilizing mutations, which presumably would not be seen by the model as improving upon an adventitious signature. The EM was able to find multi-site mutations that increased stability score from among astronomically large sets of possible mutations to a collection of proteins. This demonstrates extrapolation of the EM's predictions to those regions of design space where a design's stability increases. More importantly, it raises the possibility of rapid and automatic improvement to existing protein design pipelines.

To evaluate this possibility, we randomly selected a subset of 1215 of the test set designs, and subjected these sequences to incremental stabilization guided by the predictions of the EM. We performed a five-round beam search (Fig 3A) to generate successive single-site substitutions with higher predicted stability. Although this approach required the prediction of stability for hundreds of millions of primary sequences, the computationally efficient EM yielded these predictions in only a few hours using a single GPU. On average, this iterative refinement increased assayed stability more than ten-fold (i.e., one stability score unit) after five single-site substitutions (Fig 3B).
Fig 3. Refinement and its effects. (A) Beam search refinement. Refinement begins with a protein's amino acid sequence (left, green). All possible single-site substitutions are generated (bold red characters in middle sequences), and they are sorted according to the EM's prediction of their stability (middle). The design with the highest predicted stability (middle, green) is reserved as the product of refinement at this stage. The k single-site substitutions with the highest predicted stability (middle, green and yellow; k = 2 in this illustration, though we used k = 50 to stabilize proteins) are then used as new bases. For each of the k new bases, the process is repeated, combining all single-site substitutions of all k new bases in the new sorted list (right). In this fashion, we predicted the best mutations of 1-5 amino acid substitutions for each of the base designs. (B) Effect of guided and random substitutions on expert-designed proteins. Guided substitutions (orange) raised the mean stability score from 0.23 in the base population (green) to 1.27 after five amino acid changes, as compared to random substitutions (blue), which dropped it to -0.06. Because stability score is logarithmic, the increase in stability is more than ten-fold after five guided substitutions. Annotated black bars indicate means, notches indicate bootstrapped 95% confidence intervals around the medians, boxes indicate upper and lower quartiles, and whiskers indicate 1.5 times the inter-quartile range.
https://doi.org/10.1371/journal.pone.0265020.g003
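A minimal sketch of the beam search in Fig 3A follows; here predict_stability stands in for the EM and is assumed to score one sequence at a time, whereas the real pipeline batched hundreds of millions of predictions on a GPU.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_site_substitutions(seq):
    """All sequences differing from seq at exactly one position."""
    for i, aa in product(range(len(seq)), AMINO_ACIDS):
        if seq[i] != aa:
            yield seq[:i] + aa + seq[i + 1:]

def beam_search_refine(seq, predict_stability, rounds=5, beam_width=50):
    """EM-guided refinement sketch. Each round pools all single-site
    substitutions of every sequence in the beam, sorts the pool by
    predicted stability, reserves the top design as that round's product,
    and keeps the top beam_width designs as seeds for the next round."""
    beam, best_per_round = [seq], []
    for _ in range(rounds):
        candidates = {m for base in beam
                        for m in single_site_substitutions(base)}
        ranked = sorted(candidates, key=predict_stability, reverse=True)
        best_per_round.append(ranked[0])
        beam = ranked[:beam_width]
    return best_per_round
```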
Notably, this refinement is successful even though the EM's ability to accurately predict the effects of single-site mutations is limited. One of the proteins we subjected to guided refinement, "HEEH_rd3_0726", had single-site mutations exhaustively characterized previously [26]. The Pearson correlation coefficient between antilogs of predicted and experimental stability scores for this protein was only 0.419, not much higher than some of the correlations seen with single-site mutations of natural proteins (Table 1). Yet, for this protein, three and four guided mutations increased stability 2.25-fold and 1.89-fold, respectively.

Substantial improvements to stability are seen even when we restrict analysis only to proteins that are already at least marginally stable and have EC50 values well below the assay's ceiling (S10 Fig). This demonstrates that the EM can stabilize designs that are already somewhat stable, rather than only transforming entirely unstable sequences into partially folded structures. By limiting EC50 values we ensure that the EM is not being misled by, or taking advantage of, unfolded-state predictions outside the assay's range.

Iterative stabilization of proteins also succeeds with respect to the individual components of the final stability score. In S11 and S12 Figs, we show stability scores for each protease separately. We see that the effect of guided substitutions is smaller, and the effect of random substitutions is larger, for trypsin than for chymotrypsin. As also seen in S4 Fig, trypsin stability scores tend to be lower than chymotrypsin stability scores, which means that they more often determine the final stability score. In every case, however, we see increasing divergence between guided and random substitutions as the number of substitutions increases.

To separate how much of the change in stability score was due to changes in protease resistance and how much was due to changes in predicted unfolded-state resistance, we examined the experimentally observed EC50 values for each protease. S13 and S14 Figs show the results, broken down by protease. In both cases, there is increasing divergence between EC50 values for guided versus random substitutions as the number of substitutions increases. Trypsin EC50 values increase modestly as a result of EM guidance, while chymotrypsin EC50 values hold steady. For both proteases, much of the increase in stability score is associated with predicted decreases in unfolded-state resistance, in the absence of a corresponding decrease in observed protease resistance. As with stability scores, we see the same pattern of results when we restrict analysis only to proteins that are already at least marginally stable and which demonstrate EC50 values well below the assay's ceiling (S15 and S16 Figs).
Generating and refining novel proteins with sequence-only models

Given that a data-driven neural network model can rapidly and successfully predict stability from sequence, we next asked if a similar model can also rapidly generate new stable designs. Specifically, we approached de novo protein design as a language generation task where the amino acids are the words of our language. We adapted deep learning sequence models from neural machine translation [43, 44] for our task. This architecture translates secondary structure sequences to primary sequences of the same length using an attention-based encoder-decoder recurrent neural network [45]. This yields primary sequences that are likely to have close to the requested secondary sequence and likely to be similar to sequences in the training set, with variation that likely reflects distributions in the training set. We refer to this model as the Generator Model (GM, Fig 4A). Beyond demonstrating proof of concept for this translation approach, we use the GM to create new proteins with which to evaluate EM-guided stabilization.

Fig 4. Generator model and its performance. (A) Architecture of the GM, adapted for use with protein secondary and primary sequences from [45]. (B) Density plot of experimental stability scores for training designs, designs from the GM, and scrambles of the GM designs. (C) Density plot of trypsin EC50 values. (D) Density plot of chymotrypsin EC50 values.
https://doi.org/10.1371/journal.pone.0265020.g004
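For a structural picture, here is a minimal Keras sketch of an attention-based encoder-decoder of the kind described; it is not the authors' implementation from [45], and the vocabulary sizes, hidden widths, and training setup (teacher forcing with a right-shifted target sequence) are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_SS, N_AA, UNITS = 4, 21, 128  # placeholder vocab sizes (incl. padding) and width

# Encoder: embed the secondary-structure tokens (e.g. H/E/L) and summarize
# them with a bidirectional GRU.
enc_in = layers.Input(shape=(None,), dtype="int32", name="secondary")
enc_states = layers.Bidirectional(layers.GRU(UNITS, return_sequences=True))(
    layers.Embedding(N_SS, 32, mask_zero=True)(enc_in))

# Decoder: predict the next amino acid given the previous ones (teacher
# forcing), attending over the encoder states with dot-product attention.
dec_in = layers.Input(shape=(None,), dtype="int32", name="primary_shifted")
dec_states = layers.GRU(2 * UNITS, return_sequences=True)(
    layers.Embedding(N_AA, 32, mask_zero=True)(dec_in))
context = layers.Attention()([dec_states, enc_states])
logits = layers.Dense(N_AA)(layers.Concatenate()([dec_states, context]))

gm = tf.keras.Model([enc_in, dec_in], logits)
gm.compile(optimizer="adam",
           loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```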
As a preliminary diagnostic to confirm that this class of model is fundamentally able to map between the domains of primary and secondary sequence, we consider the reverse task: predicting secondary sequence from primary sequence. We trained a model with architecture equivalent to the GM's to predict ground-truth secondary structure sequences derived from applying the DSSP algorithm [46] to Rosetta tertiary structure models for a set of primary sequences. This "Reverse GM", which recapitulates the functionality of tools like PSIPRED [47], achieves a secondary-sequence character error rate (CER) of 0.67%. CER is the average fraction of characters that are wrong; more precisely, the Levenshtein edit distance between the predicted and reference secondary sequences divided by sequence length. Even when evaluated on topologies that the model was not trained on, it achieves a CER of 9% for secondary sequence. We also used the Reverse GM to evaluate primary sequences produced by the GM, as described below.

We used the GM to create a set of primary sequences based on secondary sequences of expert-designed proteins that had been held out from the training set. To evaluate how well the model was able to learn principles of expert-designed proteins, we measured the perplexity [48, 49] of the predicted primary sequences relative to the actual primary sequences of the test proteins. Perplexity, the standard metric in language modeling [50], is the antilog of the average of the per-residue cross entropy of the model's predictions. It can be intuitively understood as the average number of reasonable choices available to the model for each predicted amino acid, and can be as low as 1 when predictions are perfect. A perplexity of 19 would indicate chance (we exclude cysteine from the GM's vocabulary because disulfide bonds can invalidate the stability assay by preventing the fluorescent end of the expression vector from being washed away even when the protein has been cleaved [26]). As a control, we trained a basic recurrent neural network on the same data set. When tasked with predicting an amino acid in a protein's sequence based on the amino acids preceding it, this control model achieved a perplexity of 1.8. The GM expands upon such a model by considering, for each amino acid, not only the primary sequence generated so far but also the desired secondary sequence. In doing so, it achieves a perplexity of 1.1—by considering secondary structure information, it is able to produce primary sequences that are more reliably similar to those in the test set.
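Both metrics have precise definitions that fit in a few lines; here is a sketch in plain Python/NumPy using the standard dynamic-programming edit distance (variable names are illustrative).

```python
import math
import numpy as np

def character_error_rate(predicted, reference):
    """Levenshtein edit distance divided by reference length."""
    d = np.zeros((len(predicted) + 1, len(reference) + 1), dtype=int)
    d[:, 0] = np.arange(len(predicted) + 1)
    d[0, :] = np.arange(len(reference) + 1)
    for i in range(1, len(predicted) + 1):
        for j in range(1, len(reference) + 1):
            d[i, j] = min(d[i - 1, j] + 1,            # deletion
                          d[i, j - 1] + 1,            # insertion
                          d[i - 1, j - 1]             # substitution (or match)
                          + (predicted[i - 1] != reference[j - 1]))
    return d[len(predicted), len(reference)] / len(reference)

def perplexity(per_residue_probs):
    """Antilog of the mean per-residue cross entropy, exp(-mean(log p)),
    where p is the probability the model assigned to each true residue.
    E.g. assigning p=0.9 everywhere gives perplexity of about 1.11."""
    return math.exp(-np.mean(np.log(per_residue_probs)))
```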
We selected 1000 GM designs with the highest agreement between the input secondary structure and the Reverse GM's predicted secondary structure, speculating that these designs might have underlying features best encoded by the GM. Note that this selection process further encourages sequences to be similar to those in the training set. These 1000 designs were assayed for stability as part of Corpus B. Their measured stability scores and EC50 values were similar to those of the expert-designed proteins on which they were trained, and much higher than random scrambles of the same designs (Fig 4B-4D). We used BLAST [51] to compute the maximum identity of the designs with respect to designs in the training data set. Of the 993 designs that expressed at a detectable level, 275 had a maximum identity less than 95%, and 47 of these 275 (17%) had an empirical stability score greater than 1.0 (classified as "stable"). Of the 120 designs with a maximum identity below 75%, 17 (14%) were stable. The GM produced stable sequences for six of the nine topologies that were represented by at least 500 designs in the training set.

Note that the GM is simply learning to translate from secondary sequences to primary sequences; it knows nothing explicit about tertiary structure or stability. Because we wanted to generate stable sequences, the training set for the GM was only a subset of Corpus A, with many unstable designs excluded and greater weight given during training to stable designs (see Materials and methods). The GM was able to nearly match the stability distribution of this stability-enriched training set (Fig 4B).

We questioned whether EM-based refinement would improve these 1000 GM designs as it did expert-designed proteins, or, alternatively, whether the EM would see designs by another neural network model trained on the same data as unimprovable. Guided substitutions overall yielded ten-fold increases in the stability of GM designs (Fig 5A, stability score increase of 1.0). By contrast, random substitutions overall yielded proteins that are less stable than those created by the GM. Importantly, the increase in stability with EM refinement is apparent even for proteins that are the most different from those in the training corpus (Fig 5B, S17 Fig). This demonstrates that we are able to design and iteratively improve highly stable proteins de novo, using an automated, rapid pipeline.
Fig 5. Refinement of GM designs, overall and as a function of novelty. (A) Effect of guided and random substitutions on designs created by the GM. The base stability score was much higher for this population of designs than for the expert-designed proteins tested, with a mean of 0.67; EM-guided refinement further increased it to 1.67. As with the expert-designed proteins, this demonstrates a ten-fold increase in stability. Random substitutions again had a deleterious effect, dropping mean stability to 0.29. (B) Stability of GM designs, and guided and random substitutions within those designs, as novelty increases. We consider designs to be more novel when BLAST percent identity with the most-similar design in the training corpus is lower.
https://doi.org/10.1371/journal.pone.0265020.g005 S18 and S19 Figs show increasing divergence between guided and random substitutions for each protease’s individual stability score as the number of substitutions increases. S20 and S21 Figs show increased divergence between guided and random substitutions for each protease’s EC 50 as the number of substitutions increases.
Guided refinement demonstrates biases but yields stable monomers

We observed that individual single-site substitutions were much more successful at improving assayed stability when guided by the EM than when carried out at random. Considering mutations to both expert-designed and GM sequences, there were 5850 guided single-site substitutions evaluated in Corpus B, and 4798 random substitutions. For each type of substitution (i.e. for each change from amino acid x to amino acid y) we calculated the mean stability change when that substitution was applied due to the EM's guidance and when it was applied at random. There were 191 of these types of substitution for which we have data in both conditions. Of these, guided substitutions were more successful than random for 148, with an average stability increase over all types of 0.155 (Fig 6). As a somewhat arbitrary threshold for considering differences between guided and random substitutions to be reliable, we ran t-tests between the two approaches for each type of substitution (two-sample unpaired two-tailed t-test, p < 0.05 uncorrected for multiple comparisons). 24 of the substitution types showed significant differences between guided and random substitutions, indicating sufficient examples of both treatments as well as sufficient differences between them; in 23 of those, guided substitutions were more successful than random.

Fig 6. Differential effects on stability between guided and random single-site substitutions. For each original amino acid (indexed on the y-axis) and each replacement amino acid (indexed on the x-axis), the mean effect on stability when that substitution was guided by the EM is computed, as is the mean effect on stability when that substitution was applied randomly. The difference between these two effects is plotted for each from-to pair that was represented in the data; redder circles indicate that guided substitutions were more beneficial for stability, bluer circles indicate that random substitutions were more beneficial. Circles with heavy black outlines showed a significant difference (two-sample unpaired two-tailed t-test, p < 0.05 uncorrected) between guided and random effects. Bar graphs indicate mean differences in stability score (guided substitutions minus random substitutions) averaged across all replacement amino acids for each original amino acid (left) and averaged across all original amino acids for each replacement amino acid (bottom).
https://doi.org/10.1371/journal.pone.0265020.g006
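A sketch of this per-type comparison follows, assuming the observed stability changes have been grouped by (original, replacement) amino-acid pair; scipy's ttest_ind performs the two-sample unpaired two-tailed test named above.

```python
import numpy as np
from scipy.stats import ttest_ind

def compare_substitution_types(guided, random_):
    """guided / random_: dicts mapping (from_aa, to_aa) pairs to lists of
    observed stability-score changes. Returns, for each type seen under
    both treatments, (mean guided change - mean random change, p-value)."""
    results = {}
    for pair in guided.keys() & random_.keys():
        g, r = guided[pair], random_[pair]
        if len(g) >= 2 and len(r) >= 2:  # need at least two samples per group
            _, p = ttest_ind(g, r)       # two-sample unpaired two-tailed t-test
            results[pair] = (np.mean(g) - np.mean(r), p)
    return results
```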
Further breaking down these analyses and looking at guided and random substitutions separately, we see that guided substitutions increased stability score 4138 out of 5850 times (70.7%); the mean stability change over all guided substitutions was 0.208 (S22 Fig). Out of 196 types of guided substitution, 43 showed a significant positive effect on stability (two-tailed t-test, p < 0.05 uncorrected for multiple comparisons), and 2 showed a significant negative effect. If guided substitutions had no real effect on stability, we would expect to see approximately 10 types crossing this significance threshold in each direction. Random substitutions, meanwhile, decreased stability score 2731 out of 4798 times (56.9%); the mean stability change over all random substitutions was -0.074 (S23 Fig). Out of 318 types of random substitution, 6 were significantly helpful (two-tailed t-test, p < 0.05 uncorrected), and 55 were significantly deleterious; we would expect approximately 16 in each direction if random substitutions had no real effect on stability.

The general lack of improvement with random substitutions conforms with the intuition that stabilizing proteins is difficult. That EM-guided substitutions perform so well is encouraging but invites further scrutiny. The EM's refinement of proteins frequently substituted amino acids with tryptophan (S22 Fig)—it was the amino acid of choice in 3276 of the 5850 guided single-site substitutions. This raises the possibility that the EM has learned how to exploit gaps between what the stability assay is actually computing (a difference between observed protease resistance and predicted unfolded-state resistance) and the assay's objective (protein stability). For example, indiscriminately adding tryptophan to a sequence might cause proteins to aggregate or undergo nonspecific hydrophobic collapse [52], or even to form regions of stable structure [53]. Either of these could increase protease resistance without necessarily increasing the stability of the intended tertiary structure. However, random mutations to tryptophan show virtually no change in assayed stability score (mean increase of 0.016), compared to a substantial increase in stability score when those substitutions are guided by the EM (mean increase of 0.215). Lysine and glutamic acid are the two most common residues to be mutated to tryptophan, and these residues usually occur on the surface of the protein. By favoring these mutations, the EM could be implicitly learning to bias tryptophan substitutions toward surface sites, versus random mutations, which would be more uniformly distributed and thus less likely to lead to aggregation. However, if those biases were leading to aggregation and thus artificially inflated stability scores, random mutations from lysine or glutamic acid to tryptophan should show a similar increase in stability score as guided mutations. Guided mutations, however, increase stability scores substantially more. Taken together, these observations show that increasing the tryptophan count of a protein does not in and of itself increase stability—the changes to tryptophan must be at the right location for the protein in question.

To directly test the quality of the proteins generated, we selected twelve designs by hand for further analysis, on the basis of low sequence identity to the training set, diversity of primary and secondary sequences, and the progenitor sequences of four apparently stable refinements.
We expressed these designs in Escherichia coli with N-terminal His6 affinity purification tags and purified them using immobilized metal affinity chromatography (IMAC, Fig 7A). Eight of the designs were expressed as reasonably soluble proteins, and seven of those were monomeric as judged by size exclusion chromatography (SEC). Mass spectrometry confirmed that the purified fractions were predominantly molecules of the correct molecular weight. Circular dichroism (CD) spectroscopy with wavelength scans and thermal melts was used to assess the secondary structure and overall stability, respectively, of each of the soluble proteins. The nmt_0457_guided_03 and nmt_0994_guided_02 proteins, expected to be all alpha-helical, show distinct minima at 222 nm and 208 nm, consistent with the intended helical structure. For all the other designs, the spectra are consistent with a mixed alpha-beta secondary structure (S24 Fig). Thermal stability of the designed proteins was determined by monitoring the CD signal at 222 nm as a function of temperature. Five of the eight monomeric/oligomeric designs are highly thermostable (S24 Fig); four do not unfold even at a temperature of 99°C. These four are precisely the non-aggregating proteins with stability score > 1, providing additional experimental validation of the previously published finding [26] that stability scores are reasonably correlated with thermal stability. For one of these highly stable designs (Fig 7B) we obtained a crystal structure (Fig 7C); this showed an all-helical structure, consistent with predictions of the Reverse GM, CD, and AlphaFold2 [54]. Two of the three helical segments observed in the crystal structure match AlphaFold2's prediction, with the third helix taking a slightly different trajectory from the prediction. We also predicted the tertiary structure of all twelve designs using Rosetta (S25 Fig). These results confirm that our data-driven pipeline can design stable monomeric proteins, amenable to analysis with current tertiary structure prediction models, without itself making any explicit calculation of tertiary structure.
Fig 7. Laboratory analyses of GM proteins. (A) Results of targeted analyses of twelve GM proteins. All twelve proteins had less than 60% identity with respect to the entire set of training proteins, as calculated by BLAST. Reported topology was predicted by PSIPRED [47] and Rosetta (in that order, when predictions differ). (B) "Life cycle" of one refined protein, nmt_0994_guided_02. The design began with a requested secondary structure fed into the GM. The GM produced a primary sequence (nmt_0994) stochastically translated from that secondary structure; however, the Reverse GM correctly predicted that two of the requested helices were actually merged into one in the generated protein's structure. EM-guided refinement then changed two residues to tryptophan, which raised the empirical stability score from -0.18 to 1.88. Green characters highlight differences from original sequences. (C) Crystal structure for nmt_0994_guided_02 (dark grey), showing that it also has the three helices predicted by the Reverse GM for its pre-refinement progenitor. It is shown aligned to the structure predicted by AlphaFold2 (cyan). The prediction and the crystal structure have a Cα RMSD of 3.4 Å.
https://doi.org/10.1371/journal.pone.0265020.g007
[1] URL:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0265020