(C) PLOS One

This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .

Combining machine learning with structure-based protein design to predict and engineer post-translational modifications of proteins [1]

Moritz Ertelt (Institute for Drug Discovery, Leipzig University Medical Faculty, Leipzig; Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig); Vikram Khipple Mulligan (Center for Computational Biology, Flatiron Institute)

Date: 2024-04

Abstract Post-translational modifications (PTMs) of proteins play a vital role in their function and stability. These modifications influence protein folding, signaling, protein-protein interactions, enzyme activity, binding affinity, aggregation, degradation, and much more. To date, over 400 types of PTMs have been described, representing chemical diversity well beyond the genetically encoded amino acids. Such modifications pose a challenge to the successful design of proteins, but also represent a major opportunity to diversify the protein engineering toolbox. To this end, we first trained artificial neural networks (ANNs) to predict eighteen of the most abundant PTMs, including protein glycosylation, phosphorylation, methylation, and deamidation. In a second step, these models were implemented inside the computational protein modeling suite Rosetta, which allows flexible combination with existing protocols to model the modified sites and understand their impact on protein stability as well as function. Lastly, we developed a new design protocol that either maximizes or minimizes the predicted probability of a particular site being modified. We find that this combination of ANN prediction and structure-based design can enable the modification of existing, as well as the introduction of novel, PTMs. The potential applications of our work include, but are not limited to, glycan masking of epitopes, strengthening protein-protein interactions through phosphorylation, as well as protecting proteins from deamidation liabilities. These applications are especially important for the design of new protein therapeutics where PTMs can drastically change the therapeutic properties of a protein. Our work adds novel tools to Rosetta’s protein engineering toolbox that allow for the rational design of PTMs.

Author summary Machine learning is changing the world of protein design, from structure prediction methods like AlphaFold to fixed-backbone design methods like ProteinMPNN. Machine learning methods have made much progress in various aspects of protein computational biology, both complementing and, in some cases, surpassing traditional macromolecular modeling methods such as those combined in libraries like the Rosetta software suite. However, a lack of compatibility and flexibility can hinder interoperability with existing methods, preventing the full potential of these new solutions from being realized. Here, we first present a new machine learning tool for predicting post-translational modifications (PTMs), which play an important role in the stability and function of proteins, and then highlight how the implementation of this tool in the existing Rosetta toolbox can facilitate new applications. To this end, we combine PTM prediction with protein design, maximizing or minimizing the predicted probability of a post-translational modification occurring at a specific site. As one example, we predict the N-linked glycosylation of influenza hemagglutinin, which has applications in both understanding the evolution of viral strains over time and engineering additional glycosylation sites to mask unwanted epitopes of vaccine candidates.

Citation: Ertelt M, Mulligan VK, Maguire JB, Lyskov S, Moretti R, Schiffner T, et al. (2024) Combining machine learning with structure-based protein design to predict and engineer post-translational modifications of proteins. PLoS Comput Biol 20(3): e1011939. https://doi.org/10.1371/journal.pcbi.1011939 Editor: Joanna Slusky, University of Kansas, UNITED STATES Received: June 18, 2023; Accepted: February 20, 2024; Published: March 14, 2024 Copyright: © 2024 Ertelt et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: All data and code used for running experiments, model training, and plotting is available on a GitHub repository at https://github.com/meilerlab/PTMPrediction. Additional documentation for the Rosetta SimpleMetric can be found at https://www.rosettacommons.org/docs/latest/scripting_documentation/RosettaScripts/SimpleMetrics/simple_metric_pages/PTMPredictionMetric. Funding: This work is supported through a Rosetta mini-grant under award number RC22021 from RosettaCommons (www.rosettacommons.org) held by CTS. ME, JM and CTS acknowledge the financial support by the Federal Ministry of Education and Research of Germany and by the Sächsische Staatsministerium für Wissenschaft Kultur und Tourismus in the program Center of Excellence for AI-research "Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig", project identification number: ScaDS.AI (https://scads.ai/). ME's position is funded through an award by ScaDS.AI. VKM is supported by the Simons Foundation (https://www.simonsfoundation.org/). TS is supported by a Sofja Kovalevskaja prize from the Alexander-von-Humboldt foundation (https://www.humboldt-foundation.de/), while JM is supported by an Alexander-von-Humboldt professorship. 
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

This is a PLOS Computational Biology Methods paper.

Introduction PTMs play an important role in modulating both protein stability and many aspects of protein function. PTMs can be divided into reversible and irreversible modifications, with some modifications, like N-linked glycosylation, even occurring before protein folding. The diversity of possible PTMs highlights the complex chemical composition of proteins, which is not limited to the standard 20-letter amino acid code. Understanding the impact of modifications is especially vital in the field of protein therapeutics, where PTMs can range from being essential for desired therapeutic function to completely blocking therapeutic function through unforeseen changes in stability and function over time [1]. Glycosylation describes the enzymatic attachment of an oligosaccharide to a protein residue. This generally occurs in the endoplasmic reticulum (ER) and Golgi apparatus for proteins bound for secretion or for cell surface expression, though rare cytoplasmic and nuclear glycoproteins are known [2,3]. Glycosylation is further classified into N- and O-glycosylation, where the carbohydrate linkage occurs either at the side-chain amide nitrogen of an asparagine residue (N-glycosylation), or the hydroxyl oxygen atom of a serine, threonine, or (very rarely) tyrosine residue (O-glycosylation). Additionally, N-glycosylation occurs in the unfolded state while O-glycosylation occurs after the protein is already folded. Both N- and O-glycosylation tend to increase thermostability and solubility [4], and both can modulate interactions with other proteins [2]. For N-glycosylation, there exists a well-known sequence motif: NxT/S, where x is any amino acid except proline [5]. While this sequence motif is helpful in identifying potential sites, the existence of a sequon is not sufficient to guarantee glycosylation. Additionally, multiple improved sequons have been discovered through trial and error, highlighting the complexity beyond the NxT/S motif [6,7].
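The NxT/S motif described above can be sketched as a simple sequence scan. This is only an illustration of the sequon definition, not the prediction method of this paper: the function name find_sequons and the toy sequence are made up for the example, and a match only marks a candidate site, since a sequon alone does not guarantee glycosylation.

```python
import re

# N-x-T/S sequon with x != proline. The lookahead lets overlapping
# sequons (for example "NNST") each be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")

def find_sequons(sequence):
    """Return 1-based positions of asparagines that start an N-x-T/S sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]

# Made-up toy sequence, not a real protein.
toy = "MNGTANPSQNNSTK"
print(find_sequons(toy))  # -> [2, 10, 11]; N6 is skipped because x is proline
```

A scan like this yields candidate sites only; whether a candidate is actually occupied is the question the trained models are meant to address.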
For O-glycosylation a clear sequence motif is not known; however, O-glycosylation sites tend to cluster in proline/serine rich flexible regions of proteins [8]. Glycosylation of protein therapeutics can impact their folding, solubility, thermal stability, chemical stability, and aggregation propensity [9,10,11]. The list of protein drugs affected by glycosylation is long and includes chymotrypsin [12,13], insulin [14], lenograstim [15,16,17], antithrombin [18], agalsidase alfa/beta [19,20,21], and various antibodies [11] (for a detailed review refer to [10]). For this reason, when engineering protein therapeutics, it is essential to be able to predict glycosylation, and extremely useful to be able to rationally design for or against it. This “glycoengineering” can be particularly useful in vaccine development: off-target epitopes, for instance in engineered epitope-presentation scaffolds, can be “masked” by suitable introduction of glycosylation sites [22,23]. Glycans are commonly used by viruses to hide antigenic protein surfaces; however, this mechanism can also be used to prevent unwanted immune reactions in vaccines and direct an immune response to a desired site. For influenza hemagglutinin, for example, the creation of a “hyper-glycosylated” variant through seven additional glycosylation sites led to better protection against morbidity and mortality in mice upon virus challenge by directing the immune response to a neutralizing epitope left unglycosylated [24]. Deamidation, the spontaneous reaction of asparagine to isoaspartate, is one of the most commonly occurring PTMs known. The resulting modification leads to structural changes through the insertion of a negative charge and through significant alteration of the protein backbone (effectively, replacing an α-L-amino acid with a β3-amino acid with the chiral center reversed), affecting both protein stability and function.
In vivo, deamidation is thought to play the role of a molecular “clock”, marking proteins for degradation through increased susceptibility to proteolysis [25,26,27]. The rate of deamidation is not only influenced by pH and temperature but also by its local environment, including the neighboring residues, secondary structure, and solvent accessibility [28,29]. Therefore, the deamidation half-life of a protein can be as long as several months or as short as hours, at which point it can begin to affect the pharmacokinetics of therapeutic proteins. A commonly described deamidation motif is an asparagine in a flexible loop, followed by a glycine residue [28]. The occurrence of a deamidation site, however, cannot be simply derived from sequence alone and thus remains unpredictable without experimental characterization. For therapeutic proteins the rate of deamidation can strongly influence both shelf life and persistence time in the body, through either loss of function or stability, and therefore render them ineffective [30,31]. Therapeutic proteins affected include, but are not limited to, antibodies [32], vaccine antigens [33,34,35], peptides [36], adeno-associated virus (AAV) serotypes used for human gene therapy [37], human hormones [25,38] and enzymes [25]. In case of AAV vectors, multiple deamidation sites were discovered and engineered for enhanced stability against deamidation, leading to higher transduction efficiencies in mice, as well as different T cell activation profiles [39]. For antibodies, deamidation potentially leads not only to aggregation but also to drastic decreases in antigen binding affinity [32]. Deamidation sites are commonly discovered late in the development process and then corrected by trial-and-error mutation studies, leading to unnecessary costs and liabilities. Although nowadays many companies use computational liability screening methods, most of them are purely sequence-based. 
More recently, several studies have used rational design to create proteins responsive to changes in either phosphorylation or glutathionylation with potential applications in building biomaterials or controlling cellular behavior. Scheuermann et al. [40] and Gao et al. [41] designed minimal domains derived from EF-Hand calcium-binding domains that only bind terbium upon glutathionylation or phosphorylation of a key residue, therefore regulating the function of a protein through its modification status. Similarly, Winter et al. [42] and Thompson et al. [43] designed proteins with their multimerization status being defined by whether a particular residue located at the interface is phosphorylated or not. Woodall et al. [44] combine the two approaches, creating tyrosine and serine kinase-driven protein switches where protein association is controlled by kinase activity, leading to the reconstitution of green fluorescent protein fluorescence or the inhibition of the protease calpain. These seminal studies highlight the potential benefit of PTM-aware protein engineering. As the occurrence and rate of PTMs is dependent on multiple factors, prediction needs to take many features into account. Previously, multiple studies used machine learning methods to predict PTMs, generally focusing on a single prominent modification. Often, sequence is the only readily available information, and therefore used as the main feature in combination with in silico predicted structural features like solvent accessibility or secondary structure. In the case of protein deamidation, for example, a recent study used both sequence and selected structural features, including neighboring residues, solvent accessible surface area (SASA), dihedral angles and half-life times derived from a mass spectrometry poly-peptide study [27]. 
In the case of N-linked glycosylation, multiple studies [45,46,47,48] trained neural networks on the sequence context of glycosylation sites, often using the full-length protein sequence, or leveraging homology-based features. In these cases, it is not entirely clear whether the model learned general sequon preferences or simply protein homology, especially in the case where proteins with cellular localization in the nucleus or no glycosylation sequon were used as negative examples. However, the usefulness of a predictive model is not measured alone by its accuracy, but whether the choice of data reflects the downstream task the model is intended to be used for. The approaches do not only differ by the features or neural network architectures used, but crucially by their choice and filtering of data. These filtering steps are especially important to avoid overestimating the performance of a model, because of, for example, missed homology or false negatives. While these models are potentially useful for predicting glycosylation in natural proteins, they are of limited use in the case of (re-)engineering proteins. With the recent revolution in protein structure prediction [49,50], however, structural features are more readily available to complement sequence information. The engineering of modification sites would offer both the reduction of liabilities from unwanted PTMs, as well as the introduction of desirable PTMs in order to improve stability or alter functionality of therapeutics. The protein modeling suite Rosetta [51] has proven successful in tasks such as designing proteins for thermodynamic stability [52] and functionality [53]. 
By implementing accurate prediction of PTMs using machine learning in Rosetta, we can combine this new tool with Rosetta’s existing structure-based protein design toolbox to either screen pools of natural, reengineered, or de novo designed proteins for the presence or absence of a PTM, or to impose the presence or absence of a PTM as a requirement during the design process. Moreover, by bringing this into the context of existing protein design protocols, we can combine PTM restrictions or requirements with other design objectives for which well-validated optimization protocols already exist, permitting multi-objective optimization. The integration into the existing Rosetta ecosystem also permits the use of these tools for analytical purposes, to model different modifications in the contexts in which they are likely to occur in order to aid understanding of their impact on protein function and stability. For example, the already present glycosylation modeling tools [54,55,56] allow us to further test the plausibility of a predicted glycosylation site, as well as make predictions about its impact on, for instance, a modelled protein-protein interaction. To our knowledge, no protocol for engineering PTMs which combines machine learning with structure-based design has been implemented yet. We argue that this combination of predictive machine learning methods with structure-based design has great potential for a variety of protein engineering applications [57]. In this study, we implemented both a metric that scans a given protein structure for predicted PTM sites and a protocol that uses protein design to either increase or decrease the predicted probability of a modification occurring. Compared to earlier work, we leverage recent improvements in the field of natural language processing, as well as similarities between modifications, to improve the prediction accuracy.
Additionally, the models are implemented as a SimpleMetric [54] (in Rosetta, a module for measuring a property of a structure), allowing seamless integration with other RosettaScripts objects. Internally, the implemented SimpleMetric, called the PTMPredictionMetric, accesses the Tensorflow model through Tensorflow’s C API [58]. To ensure robustness, avoid repeated load and initialization of the Tensorflow model, and minimize developer error, we also built a framework, called the RosettaTensorflowManager, for structured C++-style interaction with machine learning models. By implementing these methods in Rosetta, we benefit from existing infrastructure for unit, integration, and scientific testing [59], ensuring that the methods remain functional and that results produced with them remain reproducible. In comparison, Python library compatibilities can be notoriously hard to organize and maintain, hindering reproducibility. A recent study on computational biology webservers found only 31% of them to be consistently working [60]. As a demonstration of our methods, to modify the predicted probability of a modification to occur, we design proteins using a Monte Carlo protocol optimizing the Rosetta score as well as the predicted modification probability. This combination allows us to find a tradeoff between thermodynamic stability and predicted PTM rate. Additionally, given a functionally relevant structure, like an antibody-antigen complex, we can further ensure that the mutation is not disrupting the functionality of a given protein.

Discussion In this work, we combined machine learning with structure-based protein design to predict and (re-)engineer PTMs in proteins. Our main result is that this combination of accurate prediction and design allows the modification of the predicted rate of PTMs occurring in proteins. We were able to predict PTM probabilities not only on native structures, but also on structures altered with Rosetta design. Interestingly, combining the prediction of certain PTMs with the prediction of other modifications increased performance for multiple cases. To do so, we pooled data for PTMs with unique modified amino acids (for example only one kind of lysine modification) and switched to a multi-class classification setting. Additionally, as this increased the number of examples for training, we added a small attention-based layer to our sequence track which is also responsible for the better performance. The improvement was especially prominent for cases with few or unbalanced data. Our initial reasoning for combining different modifications was that the surroundings of a PTM site should share a similar feature space as, e.g., a potential site must be exposed to enable enzyme binding. In the case of Protein A deamidation, we correctly predicted the susceptibility of four out of five asparagine residues. Additionally, we could show that using Rosetta structural modelling in combination with modification prediction was able to recapitulate changes in deamidation probability. For asparagine at position 28, a mutation of its neighbor from glycine to alanine resulted in a drastically reduced deamidation probability, which is confirmed by previous experimental data [69]. In the case of influenza hemagglutinin, we were able to correctly predict four out of five glycosylation sites of the early H3N2 HK68 strain and three out of four later acquired glycosylation sites by modifying the structure of the original strain. 
One reason for the misclassified positions could be the inadequate modeling of the mutated backbone, which prevents accurate prediction. For the de novo serine-kinase driven phosphorylation switch from Woodall et al. [44] we accurately predicted the four introduced phosphorylation sites and used a Monte Carlo-based optimization protocol to find mutations that increased the predicted phosphorylation probability of site S93 from 0.63 to 0.88. The best design had a Q97R mutation at the n+4 site, which is in line with previous characterization of protein kinase A preferences [74]. Effective phosphorylation should increase the extent of activation in the presence of kinase and is therefore likely to improve the dynamic range of the protein switch. Taken together, these results show promise for accurate prediction not only of native, but also of designed and/or modeled proteins. As the field of PTM engineering grows, more cases should become available to build a test set that goes beyond the case studies presented here. To facilitate thorough testing, a shift toward also publishing negative data for failed PTM engineering examples will be necessary. Additionally, we did not experimentally validate the resulting mutations of our case studies, as such a verification should test a broad set of proteins for one PTM, something that is out of scope of the current work, which focuses on the prediction and engineering of many modifications. Overall, it must be pointed out that the method presented here is very challenging to benchmark, as appropriate data are not necessarily available, especially for protein design tasks. We foresee that as more data become available, our method will require updates and retraining. Multiple other studies have worked on predicting PTMs [46,75,76,77], mainly focusing on one modification using sequence data. Here, in addition to sequence information, we leveraged the power of AlphaFold2 to enrich our features with structural data.
In the case of N-linked glycosylation, some studies have not limited themselves to the NxT/S sequon and therefore achieve higher accuracies on their data sets [45]. Similarly, a recent study on predicting N-linked glycosylation used proteins that were known to be localized in the cell nucleus (and therefore never glycosylated) as negative examples [46]. While the prediction of cellular localization of proteins is interesting, this would not translate to designing new glycosylation sites. A noteworthy exception is an earlier study [78] which also used a stringent filtering approach to select positive and negative sequons based on the PDB, showing that a combination of structure and sequence features was superior to sequence features alone. Since this study was published in 2012, the number of glycosylated proteins in the Protein Data Bank has steadily increased, and we showed that new progress in the field of natural language processing and the combined prediction with other PTMs further increases the prediction performance. A limitation of our study is the quality of structures predicted with AlphaFold2. While we filtered for local and overall pLDDT, the accuracy of all predicted structural models is not guaranteed. Additionally, it has been shown that regions with low AlphaFold2 pLDDT can correlate with intrinsically disordered regions (IDRs) [79], which are known to be enriched in modifications like phosphorylation or O-linked glycosylation [80,81]. By removing protein models with low pLDDT we might have biased our prediction toward areas with well-defined secondary structure. However, the distinction between intrinsically disordered regions and low-quality regions is not possible with pLDDT alone. While this limits prediction of PTMs for IDRs, it reflects the engineering use case for which the tool was created. Engineering IDRs is an exciting future prospect that will be enabled by accurate prediction of such regions.
An important caveat of this work, especially in the context of lower prediction performance for some PTMs, is the focus on PTM-aware protein engineering. We base our prediction on the local context of a potentially modified site to generalize beyond natural proteins. As the lower accuracies for some modifications highlight, our models are not intended to, e.g., screen a whole proteome for glycosylation sites, as models that consider homology would probably achieve higher accuracies. Instead, we focus on the downstream task of engineering particular modifications for which we optimized our prediction models, and we argue that this provides practically useful tools for protein engineering tasks. While our method allows the prediction of modifications irrespective of protein homology or other global features like cellular localization, it therefore requires the user to be informed about the to-be-engineered protein. For example, optimizing the probability of an N-linked glycosylation site will still not result in a glycosylated protein if the protein lacks a secretion tag or is expressed in an unsuitable system like Escherichia coli. In the case of N-linked glycosylation, a major limitation is the availability of high-quality data. While we extensively curated our dataset, including cross-referencing UniProt data and manually checking electron densities, false negative sequons could still be present when electron densities were missing and UniProt annotations not available. One option to supplement PTMs with low data availability would have been to leverage enzyme profiling data, which are available for, e.g., O- and N-glycosyltransferases [82]. However, the profiling studies are based on analyzing short peptides independent from proteins, which provides information on enzyme specificity in an idealized system. Using this kind of data would therefore prevent us from using certain features calculated from protein structures, like solvent-accessible surface area (SASA).
Additionally, taking the example of N-linked glycosylation, modification is far from being based on substrate recognition alone, as is shown by the preferences for loops over structured residue sites. Training models with substrate recognition data could therefore, especially in the case of PTMs with low data availability, lead to models unable to accurately predict a modification in its full protein structure context. As we think that this would be a major limitation in our downstream engineering task, we chose to limit ourselves to determined, or predicted, protein structures. To achieve better performance, data on sequons that are not occupied will be necessary, as most databases focus on positive examples.

Conclusion The combination of accurate prediction and structure-based design should enable the modification of existing, as well as the introduction of novel, PTMs. The potential applications of our work include, but are not limited to, glycan masking of epitopes, strengthening protein-protein interactions through phosphorylation, designing PTM-dependent protein switches, as well as protecting proteins from deamidation liabilities. In conclusion, our work adds novel tools to Rosetta’s protein engineering toolbox that allow for the rational design of PTMs.

Methods Collection of proteins with PTMs We first collected experimentally verified modification sites from the dbPTM non-homologous benchmark dataset [61,62]. To enrich our features with structural data, we additionally used the AlphaFold2 database to download a predicted model for each protein in the dataset, filtering the models by local and overall pLDDT greater than 50. While the benchmark present in the dbPTM is already non-homologous, we clustered the sequence windows surrounding a potentially modified site (10 residues) to 90% sequence identity with CD-HIT [83] to further avoid redundancy. We calculated the SASA, dihedral angles and secondary structure for all remaining proteins using PyRosetta [63]. This procedure was done for all PTMs except for N-linked glycosylation and deamidation. In the case of N-linked glycosylation, no benchmark comparing occupied and unoccupied sequons was readily available. Therefore, we collected all eukaryotic proteins from the Protein Data Bank [64] with at least one N-linked glycosylation site present and searched them for additional unoccupied sequons. Next, we cross-checked potential negative sites against UniProt annotations [65,66] and removed any that were annotated as experimentally verified to be glycosylated. To further avoid false negatives, we manually checked the electron densities of all potential negatives and excluded all with ambiguous densities. As the last step, we clustered the sequence windows to 90% sequence identity using CD-HIT. In the case of deamidation, the largest dataset available is from Delmar et al. [67]; however, no full sequences were published or shared on request, so the dataset was used without protein structure prediction or feature calculation in PyRosetta. All datasets and detailed scripts can be found at github.com/MeilerLab/PTMPrediction.
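As a rough illustration of the filtering and windowing steps described above, the following sketch applies a pLDDT cutoff of 50 and cuts a sequence window around a candidate site. It is not the authors' pipeline (their scripts are in the repository named above). The helper names, the reading of "local pLDDT" as the value at the site itself, the flank size, and the padding character are all assumptions made for this example.

```python
def passes_plddt(per_residue_plddt, site_index, cutoff=50.0):
    """Keep a model only if the mean pLDDT and the pLDDT at the site exceed the cutoff.

    Assumption: "local" pLDDT is taken as the single value at the candidate site.
    """
    overall = sum(per_residue_plddt) / len(per_residue_plddt)
    return overall > cutoff and per_residue_plddt[site_index] > cutoff


def sequence_window(sequence, site_index, flank=5, pad="X"):
    """Return the residues around a 0-based site index, padded at the termini.

    Assumption: flank=5 and the pad character "X" are illustrative choices.
    """
    padded = pad * flank + sequence + pad * flank
    center = site_index + flank
    return padded[center - flank:center + flank + 1]


# Made-up inputs for demonstration only.
print(passes_plddt([90.0, 40.0, 80.0], site_index=1))  # site itself is below the cutoff
print(sequence_window("ACDEFGHIK", site_index=0, flank=2))
```

Windows produced this way would then be clustered (the text uses CD-HIT at 90% sequence identity), a step that is not reproduced here.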
Training of a two-track neural network to predict PTMs We trained a two-track neural network with Tensorflow and Keras [58,84], using 10-fold cross-validation through Sklearn [85] and different sampling strategies from imbalanced-learn [86]. We oversampled the positive class for all PTMs, except for phosphorylation and O-linked glycosylation, where we undersampled the negative classes, resulting in both cases in a 1:1 ratio of negative and positive cases. Additionally, numpy [87], pandas [88,89], matplotlib [90] and seaborn [91] were used for data preparation and plotting. The first track of our neural network uses a sequence window of eight residues (-4/+4 around modification) as input into an embedding layer, followed by a global average pooling layer and a dense layer. The second track uses phi/psi angles of the potentially modified residue and its two neighbors, as well as the secondary structure and SASA of the potentially modified residue, as input into two fully connected dense layers. The two tracks were concatenated into one dense layer with a sigmoidal activation function outputting a probability between zero and one. In the case of training on multiple modification predictions, we added a small attention layer after the embedding layer to the sequence track and an additional fully connected dense layer to the structure track, and changed the output layer to a softmax activation function (Fig 2). For the optimization we used Adam with a learning rate of 0.0001 and trained for 200 epochs with early stopping. For the multi-class training we additionally used a learning rate warmup with cosine decay. A binary cross-entropy loss was applied for the single models and a sparse categorical cross-entropy loss for the multi-class approach. A script to reproduce the training can be found at github.com/MeilerLab/PTMPrediction.
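The learning rate warmup with cosine decay mentioned for the multi-class training can be sketched as a plain function of the training step. Only the peak learning rate of 0.0001 comes from the text. The linear warmup shape, the number of warmup steps, the total step count, and the minimum learning rate are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr=1e-4, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay from base_lr down to min_lr."""
    if step < warmup_steps:
        # Linear ramp: reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Assumed schedule: 10 warmup steps out of 100 total steps.
for step in (0, 9, 10, 55, 100):
    print(step, warmup_cosine_lr(step, total_steps=100, warmup_steps=10))
```

In a Keras training loop, a function like this would typically be wrapped in a learning rate schedule or callback; that wiring is left out to keep the sketch free of framework-specific calls.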
Incorporation of the neural network into Rosetta To enable rapid combination with existing design and analysis methods in Rosetta, we incorporated our prediction method as a RosettaScripts [92] element. RosettaScripts enables the rapid and flexible combination of existing protocols without proficiency in C++/Python. Therefore, we implemented feature calculation and inference in a Rosetta SimpleMetric (a module for measuring properties of a Pose) called the PTMPredictionMetric using the newly developed RosettaTensorflowManager. Full details are in S1 Text. Example protocols for compiling Rosetta with the required submodules and for running PTM prediction and PTM design are deposited at github.com/MeilerLab/PTMPrediction. Deamidation rate prediction of Protein A We collected the structure of Protein A from the Protein Data Bank (ID: 1DEE [68]) and relaxed it using FastRelax [93,94] in RosettaScripts [92]. Afterwards we predicted the deamidation probability for each asparagine using the newly developed PTMPredictionMetric, which uses the described neural net. A script for this task can be found at github.com/MeilerLab/PTMPrediction. Next, we used FastDesign [95] to mutate the neighbor of N23 and N28 to all possible amino acids except cysteine and then repeated our deamidation rate prediction. ChimeraX was used to visualize the structures [96]. Glycosylation prediction of influenza hemagglutinin For prediction of influenza hemagglutinin N-linked glycosylation we first removed any ligands/glycans of the H3N2 HK68 strain (PDB ID: 4FNK [71]) and relaxed the structure using FastRelax [93,94]. We then predicted the glycosylation sites of the already present sequons using the newly developed PTMPredictionMover. Next, we introduced the sequons (including residues –2/+2) of glycosylation sites from newer strains into the original HK68 structure using Rosetta FastDesign [95], configured with a resfile specifying the particular mutations (i.e.
with a fully determined sequence), and predicted their glycosylation probability. For visualization, the SimpleGlycosylateMover [54] was used to glycosylate N-linked glycosylation sites, and ChimeraX was used to render the resulting structures [96]. Scripts for prediction of glycosylation can be found at github.com/MeilerLab/PTMPrediction. Monte Carlo optimization of a de novo serine-kinase driven protein switch First, we relaxed the modeled structure of pGFP-S4 from Woodall et al. [44] using FastRelax [93,94]. Next, we predicted the phosphorylation probability of all Ser/Thr residues using the newly developed PTMPredictionMover. We then created a custom RosettaScripts script incorporating the GenericMonteCarloMover to optimize the predicted probability of the phosphorylation site S93. Starting from the initial structure, we randomly mutated a neighbor residue (positions 89, 92, 94, 95, 96 or 97) to any amino acid except cysteine and then accepted or rejected the mutation based on whether it improved Rosetta total score and predicted phosphorylation probability, repeating this for 50 trials in one trajectory. Using this protocol, 1000 designs were created and ranked by improvements in total score and predicted phosphorylation probability. Scripts for the prediction and design can be found at github.com/MeilerLab/PTMPrediction.
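The mutate-and-accept loop described above can be sketched in plain Python. This is a simplified stand-in, not the RosettaScripts protocol: the acceptance rule here (keep a mutation only if neither objective gets worse) is an assumption and does not reproduce the acceptance behaviour of the GenericMonteCarloMover, and toy_score and toy_prob are made-up placeholders for the Rosetta total score and the predicted phosphorylation probability. The function name, the toy sequence, and the positions are likewise hypothetical.

```python
import random

AMINO_ACIDS = "ADEFGHIKLMNPQRSTVWY"  # 19 residues: cysteine is excluded

def monte_carlo_design(sequence, positions, score_fn, prob_fn, trials=50, seed=0):
    """Randomly mutate allowed positions, keeping mutations that do not worsen
    the score (lower is better) or the predicted probability (higher is better)."""
    rng = random.Random(seed)
    current = list(sequence)
    best_score, best_prob = score_fn(current), prob_fn(current)
    for _ in range(trials):
        candidate = current.copy()
        candidate[rng.choice(positions)] = rng.choice(AMINO_ACIDS)
        score, prob = score_fn(candidate), prob_fn(candidate)
        if score <= best_score and prob >= best_prob:
            current, best_score, best_prob = candidate, score, prob
    return "".join(current), best_score, best_prob

# Hypothetical toy objectives; real values would come from Rosetta and the model.
def toy_score(seq):
    return float(seq.count("W"))  # pretend tryptophans are destabilizing

def toy_prob(seq):
    return min(1.0, 0.5 + 0.1 * seq.count("R"))  # pretend arginines help

start = "GSAQSLQAE"  # made-up sequence; index 4 plays the role of the target serine
designed, score, prob = monte_carlo_design(start, [0, 1, 3], toy_score, toy_prob)
print(designed, score, prob)
```

Running many such trajectories with different seeds and ranking the results by both objectives mirrors the idea of generating 1000 designs, although the ranking step is not shown.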

Supporting information S1 Text. Table A in S1 Text. Classes implemented for running Tensorflow models in Rosetta. Table B in S1 Text. Classes implemented to support the PTMPredictionMetric. Table C in S1 Text. Summary of positive and negative examples for each PTM type https://doi.org/10.1371/journal.pcbi.1011939.s001 (DOCX)

Acknowledgments Computations for this work were done (in part) using resources of the Leipzig University Computing Centre.

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011939

Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.