(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
A deep learning model for prediction of autism status using whole-exome sequencing data [1]
['Qing Wu', 'Department Of Molecular Biology', 'Cell Biology', 'Biochemistry', 'Brown University', 'Providence', 'Rhode Island', 'United States Of America', 'Center For Translational Neuroscience', 'Robert J.']
Date: 2024-12
Autism status prediction model with selected features
We developed a DL model to predict autism status from an individual’s genetic data (Fig 1). We named the model “Separate Translated Autism Research–Neural Network (STAR-NN)” because it separates variants according to their functional effects on each gene. The model considers both common and rare variants. Rare protein truncating variants (PTVs) include nonsense, frameshift and canonical splicing variants. Missense variants were annotated with the Missense Badness, Polyphen-2, Constraint (MPC) score [31] and separated into three groups: MisA, MisB and MisC (Methods). A higher MPC score indicates that a missense variant is more likely to be damaging. Missense variants with an MPC score above 1, predicted to be possibly damaging, were grouped into MisA and MisB, which together form MisAB. MisC represents the possibly benign missense variants. Because of its structure, STAR-NN learns the impact of different types of variants in the same gene separately. Gender, a polygenic score (PGS) calculated from significantly associated common variants (MAF > 1%), and three types of rare variants (PTVs, MisAB and MisC) were used as input to predict a binary autism status: individual with autism or non-autistic control. We trained and tested the model on the combined WES1 and WES2 datasets (WES12) from the SPARK dataset (16,809 individuals with autism and 26,394 non-autistic controls), using 80% of the samples for training, 10% for validation and 10% for testing.
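The variant grouping described above can be illustrated with a minimal sketch. The consequence labels, the tuple layout, and the use of per-gene counts as the input encoding are assumptions made for illustration. The handling of a variant with an MPC score of exactly 1 is also an assumption, since the text only states "above 1" for MisAB.

```python
from collections import defaultdict

# Hypothetical consequence labels; real annotation tools use their own vocabularies.
PTV_CONSEQUENCES = {"nonsense", "frameshift", "canonical_splice"}


def categorize(consequence, mpc=None):
    """Map one rare variant to 'PTV', 'MisAB', 'MisC', or None (not used)."""
    if consequence in PTV_CONSEQUENCES:
        return "PTV"
    if consequence == "missense" and mpc is not None:
        if mpc > 1:   # MisA and MisB merged, as in the text
            return "MisAB"
        if mpc > 0:   # assumption: MPC of exactly 1 falls into MisC
            return "MisC"
    return None


def gene_counts(variants):
    """Aggregate (gene, consequence, mpc) tuples into per-gene category counts."""
    counts = defaultdict(lambda: {"PTV": 0, "MisAB": 0, "MisC": 0})
    for gene, consequence, mpc in variants:
        category = categorize(consequence, mpc)
        if category is not None:
            counts[gene][category] += 1
    return dict(counts)


# Toy variants for one individual; gene names are placeholders.
example = [
    ("GENE1", "frameshift", None),
    ("GENE1", "missense", 2.4),
    ("GENE1", "missense", 0.3),
    ("GENE2", "missense", 1.5),
    ("GENE2", "synonymous", None),
]
counts = gene_counts(example)
```

Each gene ends up with three separate numbers rather than a single burden count, which is the property the model structure relies on.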
Fig 1. The workflow and framework of STAR-NN. After quality control, rare variants (minor allele frequency, MAF < 1%) identified from whole-exome sequencing data were separated into four categories based on their functional effect: protein truncating variants (PTVs), MisA (missense variants with MPC > 2), MisB (missense variants with 1 < MPC < 2) and MisC (missense variants with 0 < MPC < 1). MisA and MisB were then combined as MisAB. Three types of rare exonic variants were used as input for the STAR-NN model. In addition, a polygenic score (PGS) generated from common variants (MAF > 1%) in microarray data was also used as input for STAR-NN. STAR-NN uses a three-to-one mapping strategy to learn different types of variants on the same gene separately. G represents a gene node; S (grey) represents the option to add a gene-set node before the final output (shaded circle). *, Quality control on WES1 and WES2 used the same standards; further details are provided in Materials and Methods. #, numbers in brackets show (the count of variants, the count of individuals) in the dataset.
https://doi.org/10.1371/journal.pcbi.1012468.g001
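The three-to-one mapping in Fig 1 can be sketched as a sparsely connected layer in which each gene node receives only the PTV, MisAB and MisC inputs of its own gene. This is a toy forward pass and not the authors' implementation. The tanh and sigmoid activations, the direct connection of gender and PGS to the output node, and all weight values are assumptions chosen for illustration.

```python
import math


def gene_nodes(variant_counts, weights, biases):
    """Three-to-one mapping: gene i sees only its own (PTV, MisAB, MisC) inputs."""
    nodes = []
    for (ptv, misab, misc), (w_ptv, w_ab, w_c), b in zip(variant_counts, weights, biases):
        nodes.append(math.tanh(w_ptv * ptv + w_ab * misab + w_c * misc + b))
    return nodes


def predict(variant_counts, weights, biases, out_weights,
            covariates, cov_weights, out_bias=0.0):
    """Combine gene nodes and covariates (assumed: gender, PGS) into one score in (0, 1)."""
    g = gene_nodes(variant_counts, weights, biases)
    z = out_bias
    z += sum(v * gi for v, gi in zip(out_weights, g))
    z += sum(u * c for u, c in zip(cov_weights, covariates))
    return 1.0 / (1.0 + math.exp(-z))


# Two genes with illustrative weights; a trained model would learn these values.
variant_counts = [(1, 0, 2), (0, 1, 0)]
weights = [(0.9, 0.5, 0.1), (0.8, 0.6, 0.05)]
biases = [0.0, 0.0]
score = predict(variant_counts, weights, biases,
                out_weights=[1.0, 1.0],
                covariates=[1.0, 0.002],   # hypothetical gender code and PGS
                cov_weights=[0.3, 0.5])
```

Because each gene has its own weight per variant type, the same count can contribute differently depending on whether it comes from PTVs, MisAB or MisC.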
Although feature selection is not a necessary step in DL models, a DL model might not efficiently learn the importance of certain rare features because rare variants are sparse. For that reason, we used an automated ML tool, the Tree-based Pipeline Optimization Tool (TPOT) [32], to select the best features for the model (Methods, S1 Fig). The final list comprised 1489 selected features: 1487 genes, PGS and gender. Of the 1487 pre-selected genes (S1 Table), 115 have previously been reported in “SFARI Gene”, a database of genes implicated in autism susceptibility [33] (hypergeometric test, p-value = 3.057e-5; SFARI gene list released on 1/11/2022).
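An enrichment test of this kind can be sketched with the standard library alone. The function below computes the upper tail of the hypergeometric distribution. The background of 19,117 genes and the 1,031 SFARI genes are taken from the gene sets reported elsewhere in this section; whether the authors used exactly these parameters for their test is an assumption.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, drawn, observed):
    """P(X >= observed) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` belong to the reference set."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return tail / total  # exact integers, converted to float only at the end


# Assumed parameters: 19,117 background genes, 1,031 SFARI genes,
# 1,487 selected genes, 115 of them in SFARI Gene.
p_value = hypergeom_upper_tail(19117, 1031, 1487, 115)
```

Under these assumed parameters, roughly 80 overlapping genes would be expected by chance (1487 × 1031 / 19117), so an overlap of 115 lies well into the upper tail.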
Our model outperformed traditional ML models, including decision trees, random forest, XGBoost, L1 logistic regression, L2 logistic regression and a linear support vector machine, as well as a basic DNN model that does not separate rare variants by their functional effect. This further supports the importance of separating variants based on their functional effect. The Receiver Operating Characteristic Area Under the Curve (ROC-AUC) for STAR-NN was 0.7317 (Fig 2A and S2 Table). Our model also trained faster than logistic regression with L2 regularization (L2LR), which had the highest performance among the traditional ML models: STAR-NN took an average of 179.4 seconds per training run, whereas L2LR took an average of 690 seconds (S3 Table). Each model was trained using one CPU with 32 cores and 256 GB of memory on Oscar, the high-performance computing cluster maintained and supported by the Center for Computation and Visualization at Brown University.
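For readers less familiar with the metric, ROC-AUC equals the probability that a randomly chosen individual with autism receives a higher score than a randomly chosen control, with ties counted as one half. A minimal rank-based sketch follows. The labels and scores are toy values and not data from the study.

```python
def roc_auc(labels, scores):
    """ROC-AUC from the rank sum of the positive class (ties get average ranks)."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores occupy ranks i+1 .. j; each gets the average of those ranks.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    # Mann-Whitney U statistic of the positive class, scaled to [0, 1].
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


# Toy example: 1 = autism, 0 = control. Evaluates to 0.75.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The same rank statistic underlies the Mann-Whitney-Wilcoxon test, so an AUC of 0.5 corresponds to no separation between the two score distributions.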
Fig 2. Performance of STAR-NN. A. ROC-AUC plot showing that STAR-NN outperformed six traditional machine learning models and a basic deep neural network (DNN) model. Variants of different types were not separated in the traditional machine learning models or the basic DNN. B. ROC-AUC plot showing that STAR-NN with selected gene features outperformed the model using other gene sets as input. C. Density plot of PGS for individuals with autism and non-autistic controls. D. Distribution of the score generated by STAR-NN for individuals with autism and non-autistic controls.
https://doi.org/10.1371/journal.pcbi.1012468.g002
To test the predictive performance of the selected features, we compared model performance using four different groups of gene features as input: the 1487 selected gene features, 1031 SFARI genes, a combination of 2405 selected features and SFARI genes, and the full set of 19117 genes. The model using the selected features had the highest performance (ROC-AUC = 0.7317) (Fig 2B and S4 Table).
We generated a PGS from common variants for each individual (Fig 2C). The PGS shows only a small difference between individuals with autism and non-autistic controls (mean PGS: 0.00156 for individuals with autism and -0.00152 for non-autistic controls). In contrast, our model significantly separated the two groups (mean scores: 0.5784 for individuals with autism and 0.4271 for non-autistic controls; Mann-Whitney-Wilcoxon test, p<2.2e-16, Fig 2D). While a significant difference was observed between the PGS of males with autism and non-autistic male controls, the PGS of females with autism and non-autistic female controls showed no significant difference (Fig 3A and 3B). We also tested the STAR-NN model without PGS as an input: with gender, PTVs, MisAB and MisC as input, the ROC-AUC was 0.7281 (S5 Table). Unlike the PGS, the score generated by STAR-NN differs significantly between individuals with autism and non-autistic controls in both males and females (Fig 3C and 3D).
Fig 3. Score from STAR-NN in male and female population. The density plot of PGS for individuals with autism and non-autistic controls in females (A) and males (B). The density plot of autism score generated from STAR-NN in females (C) and males (D). The dashed line shows the mean value for each distribution.
https://doi.org/10.1371/journal.pcbi.1012468.g003
We tested the individual effects of PTVs, MisAB and MisC on the prediction of autism status. A basic DNN model using PGS, biological sex and the aggregated count of PTVs per gene as input achieved an ROC-AUC of 0.7080. Using PGS, biological sex and the aggregated count of MisAB per gene as input gave an ROC-AUC of 0.7015. The basic DNN model using PGS, biological sex and the aggregated count of MisC per gene as input performed worse (ROC-AUC of 0.6982) than the models using the aggregated count of PTVs or of MisAB per gene (S6 Table). We also tested STAR-NN with a 2-to-1 mapping structure to assess whether MisC is needed as input. Using gender, PGS and the aggregated counts of PTVs and MisAB as input, the 2-to-1 STAR-NN model obtained an ROC-AUC of 0.7157 (S7 Table). In addition, we tested STAR-NN with a 4-to-1 mapping structure, in which PTVs, MisAB, MisC and synonymous variants on the same gene were separated at the input level and then merged into one node. Including synonymous variants slightly decreased model performance (S8 Table). Each ROC-AUC value mentioned above was based on 10 random repeats. The results show that PTVs, MisAB and MisC all contribute to the prediction of autism status. STAR-NN, which incorporates the combined effect of PTVs, MisAB and MisC, performed slightly better than the basic DNN models that used PTVs, MisAB or MisC individually as input. This suggests that the 3-to-1 mapping structure of STAR-NN is important for the prediction of autism status.
[END]
---
[1] Url:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012468
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.