(C) PLOS One
This story was originally published by PLOS One and is unaltered.



Functional unknomics: Systematic screening of conserved genes of unknown function [1]

['João J. Rocha', 'Mrc Laboratory Of Molecular Biology', 'Cambridge', 'United Kingdom', 'Satish Arcot Jayaram', 'Tim J. Stevens', 'Nadine Muschalik', 'Rajen D. Shah', 'Centre For Mathematical Sciences', 'University Of Cambridge']

Date: 2023-08

The human genome encodes approximately 20,000 proteins, many still uncharacterised. It has become clear that scientific research tends to focus on well-studied proteins, leading to a concern that poorly understood genes are unjustifiably neglected. To address this, we have developed a publicly available and customisable “Unknome database” that ranks proteins based on how little is known about them. We applied RNA interference (RNAi) in Drosophila to 260 unknown genes that are conserved between flies and humans. Knockdown of some genes resulted in loss of viability, and functional screening of the rest revealed hits for fertility, development, locomotion, protein quality control, and resilience to stress. CRISPR/Cas9 gene disruption validated a component of Notch signalling and 2 genes contributing to male fertility. Our work illustrates the importance of poorly understood genes, provides a resource to accelerate future research, and highlights a need to support database curation to ensure that misannotation does not erode our awareness of our own ignorance.

Funding: This work was supported by the Medical Research Council, as part of United Kingdom Research and Innovation (MC_U105178783 to SM and MC_U105178780 to MF). Work in MF’s lab was supported by Wellcome Investigator Awards 101035/Z/13/Z and 220887/Z/20/Z. RDS was funded by the Engineering and Physical Sciences Research Council (EP/R013381/1) and by the Alan Turing Institute through a Turing Fellowship (TU/B/00006). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

In this work, we have investigated directly the potential biological significance of conserved genes of unknown function by developing a systematic approach to their identification and characterisation. We have created an “Unknome database” that assigns to each protein from a particular organism a “knownness” score based on a user-controlled application of the widely-used Gene Ontology (GO) annotations [ 21 , 22 ]. The database allows selection of an “unknome” for humans, or a chosen model organism, that can be tuned to reflect the degree of conservation in other species, for example, allowing a focus on those proteins of unknown function that have orthologs in humans or are widely conserved in evolution. We use this database to evaluate the human unknome and find that it is shrinking only slowly. To assess the value of the unknome as a foundation for experimental work, we selected a set of 260 Drosophila proteins of unknown function that are conserved in humans and used RNA interference (RNAi) to test their contribution to a wide range of biological processes. This revealed proteins important for diverse biological roles, including cilia function and Notch pathway signalling. Overall, our approach demonstrates that significant and unexplored biology is encoded in the neglected parts of proteomes.

Whatever the reasons, this inadvertent neglect of the unknown is clear and does not appear to be diminishing [ 9 ]. This has led to concern that important fundamental or clinical insight, as well as potential for therapeutic intervention, is being missed, and hence, the launch of several initiatives to address the problem. These include programmes to generate proteome-wide sets of reagents such as antibodies or mouse knock-out lines [ 10 , 11 ]. In addition, the NIH’s Illuminating the Druggable Genome initiative supports work on understudied kinases, ion channels, and GPCRs [ 12 ]. There have been initiatives to develop new means to predict protein function or structure [ 13 – 17 ]. Finally, databases such as Pharos, Harmonizome, and neXtProt link human genes to expression and genetic association studies with the aim of highlighting understudied genes relevant to disease and drug discovery [ 18 – 20 ].

This apparent bias in biological research toward the previously studied reflects several linked factors. Clearly, funding and peer-review systems are more likely to support research on proteins with prior evidence for functional or clinical importance, and individual perception of project risk seems likely to also contribute. In addition, scientific factors have been proposed, including a lack of specific reagents like antibodies or small molecule inhibitors, and a tendency to focus on proteins that are abundant and widely expressed and so likely to be present in cell lines and model organisms [ 4 , 7 , 9 ]. Finally, some genes may have roles that are not relevant to laboratory conditions [ 5 ].

The advent of genome sequencing revealed in humans and other species thousands of open reading frames that encode proteins that had not been identified by earlier biochemical or genetic studies. Since the release of the first draft of the human genome sequence in 2000, the application of transcriptomics and proteomics has confirmed that most of these new proteins are expressed, and the function of many of them has been identified [ 1 ]. However, despite over 20 years of extensive effort, there are also many others that still have no known function [ 2 , 3 ]. The mystery and the potential biological significance of these unknown genes are enhanced by many of them being well conserved and often being unrelated to known proteins and thus lacking clues to their function. Analysis of publication trends has revealed that research efforts continue to focus on genes and proteins of known function, with similar trends seen in gene and protein annotation databases [ 2 , 4 , 5 ]. This is despite clear evidence from studies of gene expression and genetic variation that many of the poorly characterised proteins are linked to disease, including those that are eminently druggable [ 6 , 7 ]. Indeed, it has long been argued that ignorance can drive scientific advance [ 8 ].

Results

Construction of an Unknome database

Much of the progress in understanding protein function has come from research in model organisms selected for their experimental tractability. Application of this research to the proteins of humans requires being able to identify the orthologs of these proteins in model organisms. Although it is not certain that orthologs in different species have precisely the same function, they generally have similar or related functions, implying that work from model organisms at the very least provides plausible hypotheses to test. Thus, our Unknome database was designed to link a particular protein with what is known about its orthologs in humans and popular model organisms. A range of methods for identifying orthologs have been developed based on sequence conservation and although none are perfect, several achieve an accuracy in excess of 70%. We initially used the OrthoMCL database as it covered a wide range of organisms [23]. However, OrthoMCL was not being updated, and so the current Unknome database is based on the PANTHER database (version 17.0) which covers over 143 organisms, is currently in continuous development, and has a good level of sensitivity and accuracy [24–26].

The heart of the Unknome database has been the development of an approach to assigning a “knownness” score to proteins. This is not trivial and is inevitably a somewhat subjective measure. Definitions of “known” range from a simple statement of activity to an understanding of mechanism at atomic resolution, and even well-characterised proteins can reveal unexpected extra roles. Thus, we designed the database so that the criteria for knownness can be user-defined, as well as having a default set of criteria. The GO Consortium provides annotations of protein function that are well suited to this application. Firstly, GO annotation is based on a controlled vocabulary and so is consistent between different species, and secondly, it is well structured thus allowing a user to apply their own definition of knownness. The Unknome database combines PANTHER protein family groups (which we term “clusters”) with the GO annotations for each member of the cluster. This includes annotations from humans and the 11 model organisms selected by the GO Consortium for their Reference Genome Annotation Project. The sequence-similar protein clusters (primary PANTHER families) not only contain orthologs, but also recent paralogs: duplications within individual species or lineages.

The knownness score for each protein is calculated from the number of GO annotations it possesses. It is important, however, to recognise that GO annotations do not all have equal evidential value, but they helpfully include an evidence code that indicates the type of source it is derived from. The Unknome database allows users to make use of this in generating a knownness score with an option to apply greater weight to annotations that are more likely to be reliable, such as those from a “Traceable Author Statement” rather than those “Inferred from Electronic Annotation” (Fig 1A and S1A Fig). In addition, weighting allows the selection of annotations most relevant to function. For instance, a protein’s subcellular location is often included in its GO annotation, but this may not helpfully restrict the range of possible functions, so the database provides the option of excluding it when calculating a knownness value. The final knownness score of a cluster of proteins is set as the highest score of a protein in the cluster (Fig 1B).
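The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration of the stated idea (sum the evidence-weighted GO annotations per protein, then take the highest protein score as the cluster score). The weight values, the default weight, the `Annotation` type, and the function names are all assumptions made for this example and are not taken from the Unknome database itself.

```python
from dataclasses import dataclass

# Hypothetical evidence-code weights, chosen only for illustration.
# The real database lets users set their own weighting.
EVIDENCE_WEIGHTS = {
    "TAS": 1.0,  # Traceable Author Statement
    "IDA": 1.0,  # Inferred from Direct Assay
    "IEA": 0.1,  # Inferred from Electronic Annotation
}
DEFAULT_WEIGHT = 0.5  # assumed weight for any code not listed above


@dataclass(frozen=True)
class Annotation:
    go_term: str        # placeholder identifier in this sketch
    aspect: str         # "F" function, "P" process, "C" cellular component
    evidence_code: str  # GO evidence code such as "TAS" or "IEA"


def protein_knownness(annotations, excluded_aspects=frozenset()):
    """Sum the weights of a protein's annotations, skipping excluded aspects."""
    return sum(
        EVIDENCE_WEIGHTS.get(a.evidence_code, DEFAULT_WEIGHT)
        for a in annotations
        if a.aspect not in excluded_aspects
    )


def cluster_knownness(cluster, excluded_aspects=frozenset()):
    """Cluster score is the highest score of any protein in the cluster.

    `cluster` maps a protein name to its list of annotations.
    """
    return max(
        (protein_knownness(anns, excluded_aspects) for anns in cluster.values()),
        default=0.0,
    )


# Toy cluster with made-up proteins and placeholder GO terms.
toy_cluster = {
    "protein_a": [
        Annotation("GO:toy1", "F", "TAS"),
        Annotation("GO:toy2", "C", "IEA"),
    ],
    "protein_b": [Annotation("GO:toy3", "P", "IEA")],
}
score_all = cluster_knownness(toy_cluster)
score_no_location = cluster_knownness(toy_cluster, excluded_aspects=frozenset({"C"}))
```

Passing `excluded_aspects=frozenset({"C"})` mirrors the option of ignoring subcellular location when it does not help to narrow down function.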


Fig 1. The Unknome database. (A, B) Calculation of a knownness score for a cluster of orthologs based on the highest score in the cluster. Illustrated with a cluster corresponding to a subunit of a mitochondrial inner membrane translocase; (A) shows the GO annotations for mouse TIMM10, and derivation of a score based on the number of annotations weighted for their confidence, while (B) shows the scores for all the members of the cluster containing TIMM10 (UKP01389), with the highest score of a member being the knownness of the cluster. (C) The Unknome database contains information for each cluster showing its distribution across species, links to information for the protein from each species, and the change in knownness over time, as illustrated for cluster UKP01389. (D) User interface to list clusters from a user-selected set of model organisms by the knownness of the cluster. The list indicates the best-known member of the cluster and the human member(s) of the cluster. (E) The 10 best-known protein clusters, showing the best-known human gene in each. (F) Plot of the number of PubMed citations in the UniProt comments section for human-gene containing clusters in the indicated range of knownness. The data underlying the plot can be found in S1 Data. GO, Gene Ontology. https://doi.org/10.1371/journal.pbio.3002222.g001

The Unknome database is available as a website (http://unknome.org) that provides all protein clusters that contain at least 1 protein from humans or any of 11 model organisms (Fig 1C). The clusters can be ranked by knownness, and the user can modify this list so as to include only those proteins that are present in a particular combination of species, such as human plus a preferred model organism (Fig 1D). For each protein family, the interface shows the orthologs in its cluster and how the knownness of the cluster has changed over time (Fig 1C). These design principles maximise the versatility and power of the Unknome database as a tool for researchers from different biomedical fields.

Validation of the Unknome database

To confirm that the Unknome database was accurately capturing current understanding of protein function, we ranked the 7,515 clusters of orthologs and paralogs that contain at least 1 human protein. Reassuringly, the top 10 scoring proteins have well-known roles in development and cell function (Fig 1E). In contrast, proteins containing one of the “Domains of Unknown Function” defined by the Pfam database were concentrated at the bottom of the range (S1B Fig). Clusters with a score of 1.0 or less correspond to 18.3% of all clusters but to 36% of the domains of unknown function (DUFs) and 59% of the related uncharacterised protein families (UPFs). The exceptions were typically multidomain proteins of known function that contain 1 domain whose role is unclear. Finally, the total number of PubMed citations for each protein shows a good correlation with the knownness scores from the database (Fig 1F). Overall, we conclude that the calculated knownness score provides a useful means to identify proteins of unknown function.

The change of the Unknome over time

Unlike most databases, the Unknome will shrink over time. The knownness scores for clusters containing human proteins have increased across the whole range of proteins, but the proportion with a knownness score of 2 or less has declined from 43% to 23% over the last 12 years, with the decline being less in nonhuman model organisms (Fig 2A and S2A Fig). This slow progress is unlikely to represent a deficit in GO annotation, which is kept up to date, but rather reflects the fact that human genes and proteins are much more likely to have been published on in the last 12 years if they are in clusters that were already well known at the start of this period (Fig 2B and S2B Fig). Consistent with this, knownness increases more rapidly over time for genes that were already well annotated (S2C Fig). These observations provide further support to the notion that research activity tends to focus on what has already been studied in depth [2,4,27].

There are 753 human clusters whose knownness was zero 12 years ago but has since increased to 2 or above. The GO terms most enriched in this set are mostly associated with cilia, reflecting recent acceleration of progress in studying this large and complex structure that is absent from some model organisms such as yeast (Fig 2C). Consistent with this, the less known human genes tend to be less likely to be conserved outside of vertebrates, and generally have fewer orthologs, suggesting that progress has been hampered by there being fewer orthologs that could be found by genetic screens in non-vertebrates (S2D and S2E Fig). Interestingly, the most highly known proteins are also less likely to be conserved outside of metazoans, reflecting the fact that many are involved in important developmental pathways or signalling events relevant to multicellularity (S2D Fig). However, of the 1,606 human-containing clusters with a current knownness score of less than 2.0, 68% are detectably conserved outside of vertebrates and 45% are conserved outside of metazoans (Fig 2D). Interestingly, no one model organism contains all of these, indicating that each has a role to play in illuminating the human unknome.


Fig 2. Analysis of trends in knownness. (A) Change in the distribution of knownness of the 7,515 clusters that contain at least 1 protein from humans. (B) Mean number of publications added each year since 2010 to the UniProt entry for the human protein in each of the 7,515 clusters that contain at least 1 human protein, ranked into deciles based on knownness at 2010. Where there was more than 1 human protein in the cluster, their publications were summed. The best-known clusters in 2010 received the most publications in subsequent years. (C) The 10 largest GO term enrichments for the 753 human proteins from clusters whose knownness has increased from 0 in 2010 to 2.0 or above by 2022. When there was more than 1 human protein in the cluster, a single one was used, chosen by alphabetical order to avoid bias. GO enrichment analysis used ShinyGO [112]. (D) Venn diagram showing the distribution of genes from the indicated species in the 1,551 clusters of knownness <2.0 that contain at least 1 human protein and at least 1 protein from another species. Not shown are the 55 clusters that appear only in humans. The data underlying the graphs shown in the figure can be found in S1 Data. GO, Gene Ontology. https://doi.org/10.1371/journal.pbio.3002222.g002

Functional unknomics in Drosophila

To test the value of the Unknome database, and to pilot experimental approaches to studying neglected but well-conserved proteins, we selected a set of unknown human proteins that are conserved in Drosophila and hence amenable to genetic analysis. Drosophila also tends to lack partial redundancy between closely related paralogs, as in humans this arose in many gene families from the 2 whole-genome duplications that occurred early in vertebrate evolution [28]. A powerful approach to investigating gene function in Drosophila is to knock down a gene's expression with RNAi and assess the biological consequences [29,30]. We thus determined the effect of expressing hairpin RNAs to direct RNAi against a panel of genes of unknown function. We initially selected all genes that had a knownness score of ≤1.0 and are conserved in both humans and flies, as well as being present in at least 80% of available metazoan genome sequences. Of the 629 corresponding Drosophila genes, 358 were available in the KK library that was the best available genome-wide RNAi library at the time (S1 Table) [31]. This, and other RNAi libraries, have been used for several genome-wide screens for phenotypes readily analysed at large scale, but had not been used for the screens that we applied [31].

These KK library stocks were crossed to lines containing Gal4 drivers to express the hairpin RNAs in either the whole fly or in specific tissues. After testing for viability, the nonessential genes were then screened with a panel of quantitative assays designed to reveal potential roles in a wide range of biological functions. These include male and female fertility, tissue growth (in the wing), response to the stresses of starvation or reactive oxygen species, proteostasis, and locomotion. The results of these screens are discussed below.
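The three selection criteria stated above (knownness ≤1.0, presence in both humans and flies, and presence in at least 80% of metazoan genomes) amount to a simple filter over clusters. The sketch below shows that filter in Python. The `Cluster` record, its field names, and the toy entries are hypothetical, and the real selection was made through the Unknome database rather than with this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    cluster_id: str            # made-up identifier in this sketch
    knownness: float           # cluster knownness score
    species: frozenset         # species with at least one member in the cluster
    metazoan_fraction: float   # fraction of metazoan genomes containing the cluster


def select_unknome(clusters, max_knownness=1.0,
                   required_species=("human", "fly"),
                   min_metazoan_fraction=0.8):
    """Keep clusters that satisfy all three criteria given in the text."""
    required = set(required_species)
    return [
        c for c in clusters
        if c.knownness <= max_knownness
        and required <= c.species
        and c.metazoan_fraction >= min_metazoan_fraction
    ]


# Toy data, invented for illustration only.
toy_clusters = [
    Cluster("toy_1", 0.4, frozenset({"human", "fly", "worm"}), 0.92),
    Cluster("toy_2", 0.4, frozenset({"human", "mouse"}), 0.95),   # no fly member
    Cluster("toy_3", 3.2, frozenset({"human", "fly"}), 0.99),     # too well known
    Cluster("toy_4", 1.0, frozenset({"human", "fly"}), 0.55),     # poorly conserved
]
selected_ids = [c.cluster_id for c in select_unknome(toy_clusters)]
```

With these toy entries only `toy_1` passes all three criteria.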

Unknown genes have essential functions

To determine if the genes were required for viability, a ubiquitous GAL4 driver (daughterless-Gal4) was used to direct RNAi throughout development. For 162 of the 358 genes, the resulting progeny showed compromised viability with either all (lethal) or almost all (semi-lethal) failing to develop beyond pupal eclosion, suggesting that these genes are essential for development or cell function (S1 Table). However, it was subsequently reported that in a subset of the lines in the KK RNAi library, the transgene is integrated in a locus (40D) that itself results in serious developmental defects when the transgene is expressed with a GAL4 driver [32,33]. Following PCR screening, we removed all of the stocks that had this integration site, all but one of them having been lethal in the initial screen. For the remaining 260 genes, the stocks used the alternative integration site which is not problematic, with KK stocks having been used successfully in a range of different screens [29,34]. For these, the RNAi compromised viability in 62 cases (24%).

In considering the results from RNAi screens, one must always be mindful of off-target effects, and in Drosophila, the possible effects of variability in genetic background and conditions of rearing and maintenance. Nonetheless, of these 62 genes, 12% were also identified in a recent genome-wide screen of genes required for viability of S2 cells; in contrast, only 4% of the 198 nonessential genes were hits in the S2 cell screen [35]. The S2 study estimated that 17% of genes known to be essential in flies are also essential in S2 cells, and it is likely that using RNAi to knock down gene function underestimates lethality. Our screen in whole organisms reveals that, despite several decades of extensive genetic screens in Drosophila, there are many genes with essential roles that have eluded characterisation.

Of course, there is more to life than being alive. We therefore subjected the 198 apparently nonessential genes to a range of phenotypic tests to determine if they had detectable roles in a wide range of organismal functions. On the grounds that the long history of Drosophila genetic screens may have saturated the discovery of mutants with easily detectable phenotypes (mostly developmental defects), we targeted our search to nonstandard and quantitative phenotypes that are harder to assess. In practice, this meant designing phenotypic screens that were more complex than normal. Our hope was that this would identify a larger proportion of genes that had not been hit in more standard Drosophila screens. The results of these functional screens are described below, followed by a validation of selected hits, with the screening data provided in S2 and S3 Data and the results summarised in S2 Table.

Contribution of unknome genes to fertility

To test fertility, specific GAL4 drivers were used to knock down the set of 198 unknown genes in either the male or female germline. Even with collecting data for multiple flies per gene, the resulting brood sizes showed some variability, as expected for a quantitative measure of a biological process. Thus, for all our assays, we needed to determine if outliers had a phenotype that exceeded to a statistically significant degree the variation intrinsic in the population. To do this, we used statistical tests based on 3 steps. First, we performed a regression on the replicate data for each gene to estimate its parameters and standard errors within the assay. Next, an outlier region was determined by fitting the parameter estimates for all analysed genes to a normal distribution, which was then used to define a boundary for outliers. Finally, for each gene, we tested the hypothesis that it falls within the outlier boundary. This approach is summarised in the Methods and described in detail in the Supporting information (S1 Text).

To display the data from the fertility tests, mean brood sizes obtained from RNAi-treated males were plotted against those obtained from RNAi-treated females for each gene (Fig 3A). Several of the RNAi lines gave a substantial reduction in brood size that was sex specific and highly statistically significant.
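The three-step outlier test outlined above can be illustrated with a deliberately simplified, one-dimensional sketch. The actual procedure is the one given in S1 Text; everything below is an assumption made for illustration. In particular, step 1 is reduced to an intercept-only regression (the sample mean and its standard error), the 90% coverage used for the boundary is an assumed default, and the final test uses a normal approximation. The function names and the brood-size numbers are invented.

```python
import math
from statistics import NormalDist, mean, stdev


def estimate(replicates):
    """Step 1 (simplified): per-gene estimate and standard error.

    An intercept-only regression reduces to the mean and its standard error.
    """
    return mean(replicates), stdev(replicates) / math.sqrt(len(replicates))


def outlier_boundaries(estimates, coverage=0.9):
    """Step 2: fit a normal distribution to all per-gene estimates and
    return the (lower, upper) limits of its central `coverage` region."""
    fitted = NormalDist(mean(estimates), stdev(estimates))
    tail = (1.0 - coverage) / 2.0
    return fitted.inv_cdf(tail), fitted.inv_cdf(1.0 - tail)


def p_value_below(gene_estimate, standard_error, lower):
    """Step 3: one-sided p-value for the null hypothesis that the gene's
    true value lies at or above the lower boundary (i.e. is not an outlier)."""
    z = (gene_estimate - lower) / standard_error
    return NormalDist().cdf(z)


# Invented brood sizes: nine unremarkable genes and one strong hit.
toy_data = {f"gene_{i}": [91 + i, 96 + i, 101 + i] for i in range(9)}
toy_data["gene_hit"] = [8, 10, 12]

fits = {gene: estimate(reps) for gene, reps in toy_data.items()}
lower, upper = outlier_boundaries([est for est, _ in fits.values()])
p_hit = p_value_below(*fits["gene_hit"], lower)
p_typical = p_value_below(*fits["gene_4"], lower)
```

In this toy data set the strong hit itself widens the fitted distribution, which is one reason to treat the sketch as a teaching aid and not as a substitute for the method in S1 Text.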


Fig 3. Testing of the unknome set of genes for roles in fertility and wing growth. (A) Plot of brood sizes obtained from matings in which each gene was knocked down in either the male or female germline. Dotted lines indicate outlier boundaries, with the genes named being those whose position outside of the boundary is statistically significant, error bars show standard deviation, and the size of the circles is inversely proportional to the p-value. Controls: Vret is involved in piRNA biogenesis and affects female fertility [113], and Ref1 is an essential protein predicted to be involved in RNA export [114], and affects both males and females. (B) Summary of the significant hits from the test of male fertility, showing the human ortholog and the phenotype reported for patients with loss of function mutations (PCD, MMAF). (C) Adult wing illustrating the posterior domain that expresses engrailed during development and hence the engrailed-Gal4 driver used to express the hairpin RNAs. Also shown are the intervein areas measured to assess tissue growth in the anterior and posterior halves of the wing. (D) Plot of the mean area of the anterior and posterior intervein areas as in (C) for flies in which each gene was knocked down by RNAi in the posterior domain (pixel dimensions 2.5 μm × 2.5 μm). Errors are shown as tilted ellipses with the major/minor axes being the square roots of the eigenvalues of the covariance matrix. Dotted lines indicate the outlier boundary, with the genes named being those whose position outside of the boundary is statistically significant, with the size of the circles being inversely proportional to the p-value. The genes Hippo (growth repressor) and Chico (growth stimulator) were included as controls. (E) Representative wings from flies expressing hairpin RNA for the indicated genes in the posterior domain. Hippo and Chico are controls as in (D), with CG11103 and CG5885 showing an increase or decrease in the posterior domain, respectively. The means and variances used for the graphs shown in the figure can be found in S2 Data with the data points in S3 Data. MMAF, multiple morphological abnormalities of the sperm flagella; PCD, primary ciliary dyskinesia; RNAi, RNA interference. https://doi.org/10.1371/journal.pbio.3002222.g003

Female fertility. Two genes gave a partial, but significant, reduction in female brood size. During the course of our work, a mouse ortholog, MARF1, of one of these hits, CG17018, was identified in a genetic screen as being required for maintaining female fertility, apparently by controlling mRNA homeostasis in oocytes [36,37]. A recent study of CG17018 has confirmed that it is indeed required for female fertility in Drosophila, despite lacking some domains present in MARF1. Its appearance as a hit in our screen is therefore an encouraging validation of the approach [38]. The other gene, CG8237, has not previously been linked to fertility, but has a mammalian ortholog (FAM8A1) that has been recently proposed to help assemble the machinery for ER-associated degradation (ERAD) and so may have an indirect effect on oogenesis [39,40]. We selected CG8237 for validation by CRISPR/Cas9 gene disruption as described below.

Male fertility. Seven genes showed near complete male sterility, with 5 further genes giving a statistically significant reduction in brood size. In humans, male sterility is one of the symptoms associated with primary ciliary dyskinesia (PCD), a disorder affecting motile cilia and flagella. While our analysis was in progress, exome-sequencing allowed the identification of many new PCD genes [41,42]. Interestingly, 5 of the genes identified in our assay are homologs of human PCD genes (Fig 3B), of which CG5155 (ARMC4) and CG31320 (DNAAF5) have since been shown to be required in Drosophila for male fertility [43,44]. All of these genes comprise, or help assemble, the dynein-based system that drives the beating of cilia and flagella. In addition, human orthologs of 2 of the semi-sterile hits in the Unknome screen have been found to be mutated in related familial conditions. CFAP43 (orthologous to CG17687) is mutated in patients with multiple morphological abnormalities of the sperm flagella (MMAF), and CFAP52 (orthologous to CG10064) is mutated in laterality disorder, a condition caused by defects in ciliary beating during development [45,46]. A further semi-sterile hit, CG14183, is an ortholog of DRC11, a subunit of the nexin-dynein regulatory complex that regulates flagellar beating in Chlamydomonas [47]. These findings support the value of the Unknome database approach to identifying new genes of biological significance and validate the RNAi-based screening approach.

Of the 4 remaining genes that showed male fertility defects, CG11025 is now only partially unknown as its human ortholog (UBAC1) is a non-catalytic subunit of the Kip1 ubiquitination-promoting complex, an E3 ubiquitin ligase [48]. CG11025 was recently identified in a genetic screen for defects in ciliary traffic and found to be required for fertility [49]. However, the other 3 genes, CG8135, CG6153, and CG16890 (orthologous to LMBRD2, PITHD1, and FRA10AC1), remain poorly understood in any species. They are less likely to be flagellar components as they are not predominantly expressed in testes and, as described below, 2 were selected for validation by CRISPR/Cas9 gene disruption, along with CG10064 whose ortholog CFAP52 is mutated in laterality disorder.

Contribution of unknome genes to tissue growth

To test the unknome set of genes for roles in tissue formation and growth, we knocked them down in the posterior compartment of the wing imaginal disc and compared the area of the posterior compartment of the adult wing to that of the control anterior compartment (Fig 3C), a method previously used to detect effects of a range of different genes [50,51]. As controls, we used Hippo, a negative regulator of tissue size, and Chico, a component of the PI 3-kinase pathway that stimulates organ growth [52,53]. Knockdown of 3 of the unknome genes in the posterior compartment caused a statistically significant increase in its area (Fig 3D and 3E). These include CG12090, the Drosophila ortholog of mammalian DEPDC5, which, during the protracted course of our studies, was found to be part of the GATOR1 complex that inhibits the Tor pathway. Mutants in GATOR1 subunits promote cell growth by increasing Tor activity [54,55]. The other 2 are CG14905 and CG11103. CG14905 is a paralog of a testes-specific gene CG17083, and both are orthologs of mammalian CCDC63/CCDC114 that have a role in attaching dynein to motile cilia, although CG14905 seems likely to have additional roles as it is ubiquitously expressed [56]. CG11103 (TM2D2) encodes a small membrane protein that shares a TM2 domain with Almondex, a protein with an uncharacterised role in Notch signalling [57]. We therefore selected CG11103 for further validation by CRISPR/Cas9 as described below.

A larger number of genes caused a reduced compartment size when knocked down (Fig 3D). However, this could arise from a wide range of causes and so this is a broad-ranging assay for protein importance, and indeed mammalian orthologs of several of the stronger hits have been subsequently found to act in known cellular processes such as membrane traffic (CG13957, the ortholog of human WASHC4), lipid degradation (CG3625/AIG1), or tRNA production (CG15896/PRORP). The strongest effect was seen with CG5885, an ortholog of a subunit of the translocon-associated protein (TRAP) complex that is associated with the Sec61 ER translocon [58]. TRAP’s role is enigmatic and so it was also selected for CRISPR/Cas9 validation.
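The wing-area plot (Fig 3D) shows the uncertainty of each gene's anterior and posterior means as a tilted ellipse derived from a 2 × 2 covariance matrix. As a minimal sketch of the standard construction, the semi-axes can be taken as the square roots of the eigenvalues of that matrix, with the tilt given by the direction of the leading eigenvector. The function name and the example numbers are assumptions, and the authors' plotting code may differ in scaling or convention.

```python
import math


def ellipse_from_covariance(var_x, var_y, cov_xy):
    """Return (semi_major, semi_minor, tilt_radians) for a 2x2 covariance matrix.

    Uses the closed-form eigenvalues of [[var_x, cov_xy], [cov_xy, var_y]].
    The tilt is the angle of the major axis measured from the x axis.
    """
    half_trace = (var_x + var_y) / 2.0
    radius = math.hypot((var_x - var_y) / 2.0, cov_xy)
    eig_major = half_trace + radius
    eig_minor = max(half_trace - radius, 0.0)  # guard against rounding below zero
    tilt = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return math.sqrt(eig_major), math.sqrt(eig_minor), tilt


# Uncorrelated errors: axes follow the coordinate axes.
axis_aligned = ellipse_from_covariance(4.0, 1.0, 0.0)

# Equal variances with positive covariance: ellipse tilted by 45 degrees.
tilted = ellipse_from_covariance(2.0, 2.0, 1.0)
```

A positive covariance between the anterior and posterior measurements tilts the ellipse towards the diagonal, which is what would be expected if whole-wing size varies between flies.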

Contribution of unknome genes to protein quality control

The removal of aberrant proteins is a fundamental aspect of cellular metabolism, and thereby organismal health, but it is a function that does not necessarily contribute substantially to well-screened developmental phenotypes. It also exemplifies our suspicion that a disproportionately high number of the unknome set of genes may be involved in quality control and stress response functions, which are likely to have been missed by many traditional experimental approaches. We therefore tested the unknome gene set for protein quality control phenotypes, using an assay based on aggregation of GFP-tagged polyglutamine, a structure found in mutants of huntingtin that cause Huntington’s disease [59]. When this Httex1-Q46-eGFP reporter is expressed in the eye, the aggregates can be detected by fluorescence imaging (Fig 4A). The RNAi guides were co-expressed in the eye to knock down unknome genes, and the number of polyQ aggregates was quantified for 2 different size ranges. Although there was considerable variation in aggregate number, statistical analysis allowed the identification of clear outliers among the unknome RNAi set (Fig 4B).

Most of the genes showing the largest increase in aggregates remain of unknown function (CG7785 (SPRYD7 in humans), CG16890 (FRA10AC1), CG14105 (TTC36), and CG18812 (GDAP2)), although mutation of GDAP2 in humans causes neurodegeneration, consistent with a role in quality control [60]. More is now known about 2 of the hits. CG4050 is an ortholog of mammalian TMTC3, one of a family of ER proteins recently shown to be O-mannosyltransferases; deletion of TMTC3 causes neurological defects [61,62]. CG5885 is the ortholog of the SSR3 subunit of the TRAP complex that also showed reduced wing size; in mammalian cells, the TRAP complex is up-regulated by ER stress [58]. These hits are consistent with reports that ER stress can increase cytosolic protein aggregation [63].

Fig 4. Testing of the unknome set of genes for roles in quality control and responses to stress. (A) Fluorescence micrographs of eyes from stocks expressing Httex1-Q46-eGFP along with either no RNAi, or one to the screen hit CG5885, both under the control of the GMR-GAL4 driver. The GFP fusion protein forms aggregates whose number and size increase over time. (B) Plot of the mean number of large (≥50 pixels) or small (<50 pixels) aggregates of Httex1-Q46-eGFP formed after 18 days in flies in which the unknome set of genes has been knocked down by RNAi (pixel dimensions 0.5 μm × 0.5 μm). Errors are shown as tilted ellipses with the major/minor axes being the square roots of the eigenvalues of the covariance matrix. Dotted lines indicate an outlier boundary set at 90% of the variation in the dataset, with the genes named being those whose position outside of the boundary is statistically significant with a p-value <0.05, with the size of the circles being inversely proportional to the p-value. (C) Flywheel apparatus for time-lapse imaging of 96-well plates containing 1 fly per well. Each of 3 wheels holds 20 plates that rotate under a camera to be imaged once per hour. (D) Use of time-lapse imaging to assay viability: 96-well plates were imaged every hour and the movement between frames quantified for the fly in each well. Plots of movement size over time allow the time point for cessation of movement and hence loss of viability to be determined automatically. (E) Survival plots obtained from the flywheel for flies in 96-well plates with food containing the indicated concentration of the oxidative stressor paraquat. Increased levels of paraquat shorten survival times. Two independent 96-well plates are shown for each condition to illustrate the reproducibility of the assay. (F) Plot of the median survival time of fly lines in which the unknome set of genes has been knocked down by RNAi and which were then exposed to paraquat to induce oxidative stress or were starved for amino acids. Dotted lines indicate an outlier boundary set at 80% of the variation in the dataset, with the genes named being those whose position outside of the boundary is statistically significant (p-value <0.05), with error bars showing standard deviation and the size of the circles inversely proportional to the p-value. The means and variances used for the graphs shown in (B) and (F) can be found in S2 Data with the individual data points in S3 Data. The data underlying the graph in (E) can be found in S1 Data. RNAi, RNA interference. https://doi.org/10.1371/journal.pbio.3002222.g004
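
The tilted error ellipses in (B) can be related to a 2 × 2 covariance matrix as follows: the semi-axis lengths are the square roots of its two eigenvalues, and the tilt is the direction of the eigenvector belonging to the larger eigenvalue. The sketch below works this out in closed form for a symmetric 2 × 2 matrix. The function name and the example variances are assumptions for illustration, and how the authors scaled or estimated the covariance is not stated here.

```python
import math


def ellipse_axes(var_x, var_y, cov_xy):
    """Return (semi_major, semi_minor, tilt_radians) for the covariance matrix
    [[var_x, cov_xy], [cov_xy, var_y]] (hypothetical helper, illustration only)."""
    mean = (var_x + var_y) / 2.0
    # Half the gap between the two eigenvalues of a symmetric 2x2 matrix.
    half_gap = math.hypot((var_x - var_y) / 2.0, cov_xy)
    lam_major = mean + half_gap
    lam_minor = max(mean - half_gap, 0.0)   # guard against tiny negative rounding
    # Angle of the major axis (eigenvector of the larger eigenvalue) from the x axis.
    tilt = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    return math.sqrt(lam_major), math.sqrt(lam_minor), tilt


# Made-up variances of small (x) and large (y) aggregate counts for one line.
major, minor, tilt = ellipse_axes(var_x=5.0, var_y=5.0, cov_xy=3.0)
print(major, minor, math.degrees(tilt))
```

For this example the eigenvalues are 8 and 2, so the semi-axes are √8 and √2 and the ellipse is tilted by 45 degrees.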

Contribution of unknome genes to resilience to stress

Genomes have evolved to deal with many environmental stresses, and again, these are processes poorly investigated by traditional genetic approaches. We therefore tested resilience to stress, following knockdown of the unknome set. To quantify the viability of large numbers of flies, individual flies were arrayed in 96-well plates, and the plates maintained on a “flywheel” that rotated them under a camera every hour (Fig 4C and S1 Video). Viability was indicated by movement between images, allowing time of death to be determined with an accuracy of +/− 1 h (Fig 4D and 4E). We applied this method with 2 challenges likely to be associated with different cellular resilience mechanisms: amino acid starvation and oxidative stress.

Resilience under starvation. Under conditions of amino acid deprivation, knockdown of 8 of the unknome test set significantly prolonged survival (Fig 4F). Seven of these genes remain of unknown function, but interestingly, 5 have orthologs in other species whose localisation or interactions suggest that they have roles in the endosomal system. Thus DEF8, the mammalian ortholog of CG11534, has been reported to interact with Rab7 [64,65], and TMEM184A (CG5850) has been reported to act in the endocytosis of heparin [66]. In addition, the mammalian orthologs of CG4593 and CG9536 (CCDC25 and TMEM115) are Golgi-localised proteins of unknown function, and the yeast ortholog of CG13784 (ANY1) has been found to suppress loss of lipid flippases that act in endosome-to-Golgi recycling [67,68]. Our identification of this cluster of genes with related functions suggests that defects in endocytic recycling can prolong survival in starvation, possibly by altering autophagy or by reducing signalling from receptors that promote anabolism. The other 2 genes that improved starvation resilience when knocked down have no known function in any species, with loss of CG31259 (TMEM135) causing mitochondrial defects, and nothing reported for CG3223 (UBL7) [69,70]. One gene, CG15738 (NDUFAF6), caused an increased susceptibility to starvation, and it has been found to be an assembly factor for mitochondrial complex I, whose loss compromises viability [71].

Resilience under oxidative stress. Resistance to oxidative stress was tested with paraquat, an insecticide widely used to elevate superoxide levels in Drosophila [72,73]. There was considerable variability in the survival times, but 11 genes gave a statistically meaningful increase in resistance (Fig 4F). Most of these genes remain unknown, but 3 have since been reported to have functions related to oxidative stress signalling. The mammalian ortholog of CG4025 (DRAM1/2) is induced by p53 in response to DNA damage and promotes apoptosis and autophagy [74]. The mammalian orthologs of CG13604 (UBASH3A/B) are tyrosine phosphatases that repress SYK kinase, an enzyme reported to help protect cells against ROS, with superoxide activation of Drosophila Syk kinase signalling tissue injury [75–77]. Finally, the ortholog of CG3709 in archaea has tRNA pseudouridine synthase activity, but the human ortholog PUS10 has been reported to be cleaved during apoptosis and promote caspase-3 activity, thus its loss may slow apoptotic cell death [78]. Of the other 8 hits, 5 remain poorly characterised, 1 is involved in mitochondrial function and so may reduce ROS production, and 2 are involved in microtubule function with no clear link to superoxide responses. Although further validation will be required, these 5 genes seem good candidates to have a role in mitochondria or ROS-response pathways.
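
The flywheel readout described above, in which hourly movement between frames is used to call the time of death, could be sketched as below. This is only an illustration under stated assumptions: the movement threshold, the minimum number of trailing still frames, the function name, and the example traces are all hypothetical, and the authors' actual criteria are not given in this section.

```python
from statistics import median


def time_of_death(movement, threshold=1.0, min_still_frames=3):
    """Return the hour (one frame per hour) at which movement ceases for good,
    or None if too few still frames follow the last movement to call death.
    `threshold` and `min_still_frames` are illustrative assumptions."""
    last_moving = -1
    for hour, amount in enumerate(movement):
        if amount > threshold:
            last_moving = hour
    trailing_still = len(movement) - 1 - last_moving
    if trailing_still < min_still_frames:
        return None
    return last_moving + 1


# Made-up movement traces for three wells of one plate (arbitrary units).
plate = {
    "A1": [5.0, 4.2, 3.1, 0.2, 2.5, 0.1, 0.0, 0.1, 0.0],
    "A2": [6.0, 5.5, 0.3, 0.0, 0.1, 0.0, 0.2, 0.0, 0.1],
    "A3": [4.0, 3.8, 4.1, 3.0, 2.2, 3.5, 2.9, 0.4, 0.1],
}
deaths = {well: time_of_death(trace) for well, trace in plate.items()}

# Median over wells with a called death; flies still alive at the end are
# simply dropped here, which a real survival analysis would treat as censored.
observed = [hours for hours in deaths.values() if hours is not None]
median_hours = median(observed)
print(deaths, median_hours)
```

A brief pause in movement (as at hour 3 in well A1) does not trigger a call because the function looks for the last frame with movement above the threshold, not the first still frame.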

Contribution of unknome genes to locomotion

Metazoans benefit from having a musculature under neuronal control. We therefore addressed the possibility of neuromuscular functions by testing the role of the unknome set of genes in locomotion, using the iFly tracking system in which the climbing trajectories of adult flies are quantified by imaging and automated analysis (Fig 5A) [79,80]. Climbing speed declines with age, so the assay was performed at both 8 days and 22 days post eclosion. Climbing speeds are inevitably somewhat variable, even in wild-type flies, but nonetheless 6 genes were statistically significant outliers when assayed after 8 days (Fig 5B). Two of these genes remain poorly understood, and for 3 of the others recent work indicates a role in muscle or neuronal function. These include CG9951, whose human homolog CCDC22 has recently been found to be a subunit of the retriever complex that acts in endosomal transport, with missense mutations in CCDC22 causing intellectual disability [81,82]. The human ortholog of CG13920 (TMEM35A) is required for assembly of acetylcholine receptors [83]. Finally, CG3479 is the gene mutated in the Drosophila outspread (osp) wing morphology allele, and is expressed in muscle, with one of its 2 mammalian orthologs (MPRIP) having been found to regulate actomyosin filaments [84,85].

Fig 5. Testing the unknome set of genes for roles in locomotion. (A) iFly tracking system for automatic quantitation of Drosophila locomotion (reproduced from Kohlhoff and colleagues [80]). Drosophila are knocked to the bottom of a glass vial and placed in an imaging chamber that allows them to be viewed from 3 angles and their climbing to be tracked automatically. (B) Plot of the mean climbing speeds of fly lines in which the unknome set of genes has been knocked down by RNAi, with the speeds for each line determined at 8 days or 22 days post eclosion. Loss of the Parkinson’s gene Pink1 affects climbing speed and it was included as a control [115]. Dotted lines indicate an outlier boundary set at 90% of the variation in the dataset, with the genes named being those whose position outside of the boundary is statistically significant with a p-value <0.1, with error bars showing standard deviation and the size of the circles inversely proportional to the p-value. The means and variances used for the plot shown in the figure can be found in S2 Data with the data points in S3 Data. RNAi, RNA interference. https://doi.org/10.1371/journal.pbio.3002222.g005

Validation of fertility screen hits by gene disruption

Analysis of gene function by RNAi can be confounded by off-target effects. We therefore used CRISPR/Cas9 gene disruption to validate selected hits from 2 of the phenotypic screens. From the fertility screens, 3 male steriles and 1 female sterile were selected for genetic disruption. Of the male hits, CG10064 and CG6153 were both confirmed as being required for male fertility (Fig 6A to 6D). CG10064 is a WD40 repeat protein, and mutation of its human ortholog, CFAP52, results in abnormal left-right asymmetry patterning, a process known to depend on motile cilia [46,86]. CG6153 comprises a PITH domain that is also found in TXNL1, a thioredoxin-like protein that associates with the 19S regulatory domain of the proteasome through its PITH domain [87,88]. Males lacking CG6153 made morphologically normal sperm, but the sperm did not accumulate in the seminal vesicle, the organ in which nascent sperm are stored prior to deployment, suggesting that they have limited viability (Fig 6E to 6J). Neither CG6153 nor its human ortholog PITHD1 is testis specific, and, indeed, orthologs are also present in non-ciliated plants and yeasts, suggesting that the protein has a role in an aspect of proteasome biology that is of particular importance for maturing viable sperm. Recent work on mouse PITHD1 indicates it has a role in both olfaction and fertility [89,90]. The other male sterile hit, CG16890 (FRA10AC1), and the female sterile hit, CG8237 (FAM8A1), did not show reduced fertility when disrupted and presumably represent off-target RNAi effects (S3 Fig).

Fig 6. Validation of RNAi male sterility phenotypes using CRISPR/Cas9 gene disruption. (A, B) Schematics of the genomic locus of candidate genes, position of CRISPR target sites and mutant alleles analysed. (C, D) Assessment of male fertility of mutants (homozygous and over a deficiency). The graphs show mean values +/− SD of the number of progeny produced by mutant males. Three crosses with 5 wild-type virgins and 3 mutant males were analysed for each genotype. Wild-type males or males carrying in-frame mutations were used as controls. Where possible, alleles covering both alternative reading frames were analysed. (E–G) Widefield fluorescent micrographs of male reproductive systems of control and JS27/CG6153 mutants expressing Don Juan-GFP to label sperm. Mutants exhibit empty seminal vesicles. (E’-G’) show zoomed regions of seminal vesicles from E–G (yellow dashed squares). (H–J) Widefield phase micrographs of reproductive systems of control and mutant males. Sperm are produced in both (asterisks), suggesting that sperm are made in the mutant but do not survive. Note that some mutant sperm get into the ejaculatory duct (J). AG, accessory gland; ED, ejaculatory duct; SV, seminal vesicle; T, testis. Scale bars, 200 μm (H, I), 100 μm (J). The data underlying the graphs shown in the figure can be found in S1 Data. RNAi, RNA interference. https://doi.org/10.1371/journal.pbio.3002222.g006

[END]
---
[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002222

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
