(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:
https://journals.plos.org/plosone/s/licenses-and-copyright
------------
Cardiac risk stratification in cancer patients: A longitudinal patient–patient network analysis
['Yuan Hou', 'Genomic Medicine Institute', 'Lerner Research Institute', 'Cleveland Clinic', 'Cleveland', 'Ohio', 'United States Of America', 'Yadi Zhou', 'Muzna Hussain', 'Robert']
Date: 2021-08
In this study, we demonstrated that the patient–patient network clustering methodology is clinically intuitive and allows more rapid identification of cancer survivors who are at greater risk of cardiac dysfunction. We believe that this study holds great promise for identifying novel cardiac risk subgroups and clinically actionable variables for the development of precision cardio-oncology.
We identified 4 clinically relevant subgroups that are significantly correlated with incidence of cardiac outcomes and mortality. Among the 4 subgroups, subgroup I (n = 625) had the highest risk of de novo CTRCD (28%), with an HR of 3.05 (95% confidence interval (CI) 2.51 to 3.72). Patients in subgroup IV (n = 1,250) had the worst survival probability (HR 4.32, 95% CI 3.82 to 4.88). From longitudinal patient–patient network analyses, the patients in subgroup I had a higher percentage of de novo CTRCD and higher mortality within 5 years after the initiation of cancer therapies compared to long-time exposure (6 to 20 years). Using clinical variable network analyses, we identified that serum levels of N-terminal pro-B-type natriuretic peptide (NT-proBNP) and Troponin T are significantly correlated with patient mortality (NT-proBNP > 900 pg/mL versus NT-proBNP = 0 to 125 pg/mL, HR = 2.95, 95% CI 2.28 to 3.82, p < 0.001; Troponin T > 0.05 μg/L versus Troponin T ≤ 0.01 μg/L, HR = 2.08, 95% CI 1.83 to 2.34, p < 0.001). Study limitations include the lack of independent cardio-oncology cohorts from different healthcare systems to evaluate the generalizability of the models. In addition, confounding factors, such as the use of multiple medications, may influence the findings.
We utilized a topology-based K-means clustering approach for unbiased patient–patient network analyses of data from general demographics, echocardiogram (over 25,000), lab testing, and cardiac factors (cardiac). We performed hazard ratio (HR) and Kaplan–Meier analyses to identify clinically actionable variables. All confounding factors were adjusted by Cox regression models. We performed random-split and time-split training-test validation for our model.
We built a large longitudinal (up to 22 years’ follow-up from March 1997 to January 2019) cardio-oncology cohort of 4,632 cancer patients at Cleveland Clinic with 5 diagnosed cardiac outcomes: atrial fibrillation, coronary artery disease, heart failure, myocardial infarction, and stroke. The entire population includes 84% white Americans and 11% black Americans, and 59% females versus 41% males, with a median age of 63 (interquartile range [IQR]: 54 to 71) years.
Cardiovascular disease is a leading cause of death in the general population and the second leading cause of mortality and morbidity in cancer survivors after recurrent malignancy in the United States. The growing awareness of cancer therapy–related cardiac dysfunction (CTRCD) has led to the emerging field of cardio-oncology; yet, there is limited knowledge on how to predict which patients will experience adverse cardiac outcomes. We aimed to perform unbiased cardiac risk stratification for cancer patients using our large-scale, institutional electronic medical records.
Funding: This work was supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health (NIH) under Award Number K99HL138272 and R00HL138272 to F.C. This work was supported in part by the National Institute of Aging (R01AG066707 and 3R01AG066707-01S1) and by the VeloSano Pilot Program (Cleveland Clinic Taussig Cancer Institute) to F.C. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright: © 2021 Hou et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Recent advances in artificial intelligence [ 25 ] and network science technologies [ 26 – 29 ] offer valuable and increasingly useful network tools for deep phenotyping of patient heterogeneities as seen in patients who developed stroke [ 30 ], pulmonary vascular disease [ 31 ], as well as those seen in cardio-oncology [ 10 , 32 – 34 ]. In this study, we utilized a clinically actionable network-based methodology (called patient–patient similarity network-based risk assessment of CVD or psnCVD) for unbiased cardiac risk stratification for cancer patients with CTRCD using large-scale, longitudinal, heterogeneous patient data, including demographics, echocardiogram, laboratory testing, and cardiac factors. With the aid of psnCVD, patients of unknown status can be classified based on their similarity to patients with known status, offering precision medicine approaches to identify patients that are highly sensitive to CTRCD (and allowing more rapid identification of patients that are at greater risk of CTRCD). Compared to traditional supervised risk methods, we hypothesized that our unsupervised psnCVD can leverage heterogeneous patient data and generate interpretable models to visualize the decision boundary in cardiac risk stratification of cancer patients with CTRCD.
The growing awareness of CTRCD has led to the emerging field of cardio-oncology [ 17 ]. However, there are limited guidelines on how to assess for, prevent, and treat CTRCD in cancer survivors due to a lack of predictive and prognostic assays. Echocardiogram is the most utilized clinical test to assess for CTRCD. The American Society of Echocardiography (ASE) has defined cardiac dysfunction as a reduction in left ventricular ejection fraction (LVEF) >10% below the lower limit of normal [ 18 ]. However, traditional echocardiogram approaches alone have limitations including high false positive rates [ 19 ]. Additionally, it is already late for intervention when decreased LVEF is recognized, as only 42% of patients have partial or full recovery in left ventricular function [ 20 ]. Next-generation machine learning technologies can harness the power of large-scale clinical data and offer new possibilities to predict which patients are at risk and allow for early intervention to prevent risk of CVD. Previously, Samad and colleagues built supervised machine learning models from echocardiogram data and clinical data to predict patient survival [ 21 ]. However, traditional “black box” machine learning methods and statistical risk models have various limitations, reducing their ability to predict clinical outcomes in new scenarios from heterogeneous patients [ 22 – 24 ].
The improvement in early detection and effective oncological treatment has led to an increased number of cancer survivors in the United States [ 1 ]. This number is estimated to increase from 16.9 million in 2019 to 22.1 million by 2030 [ 2 ]. However, improved survival from cancer leads to greater risk from other life-threatening conditions and, in particular, cardiovascular disease (CVD), which is the second leading cause of mortality and morbidity in cancer survivors [ 1 , 3 ]. The increased risk of CVD in cancer survivors is in part associated with cancer therapy–related cardiac dysfunction (CTRCD) [ 4 ], including radiotherapy [ 5 ], cytotoxic chemotherapy [ 6 ], targeted therapies [ 7 – 9 ], and immunotherapy [ 10 – 12 ]. For example, doxorubicin is the first-line anticancer drug for multiple malignancies; however, doxorubicin has adverse short- and long-term cardiovascular effects including heart failure [ 13 ], cardiomyopathy [ 14 ], and left ventricular dysfunction [ 15 , 16 ].
The KM method was used to estimate probabilities of overall survival of the 4 subgroups. Survival time was calculated from the cancer start date to death (all-cause), and the log-rank test was used for comparison among different subgroups with Benjamini and Hochberg (BH) adjustment [ 42 ]. All the survival analyses were performed using the survival and survminer packages in R v3.6.0 ( https://www.r-project.org ). Statistical tests for assessing cardiac outcome enrichment across different subgroups using the χ2 test were performed with SciPy v1.2.1 ( https://docs.scipy.org/doc/scipy/reference/index.html ). The Kolmogorov–Smirnov (KS) test was used to assess continuous variable comparisons, and one-way ANOVA was used to compare the difference of clinical variables among the 4 subgroups. p < 0.05 was considered statistically significant. All confounding factors (including age, sex, tumor types, tumor stages, disease comorbidities [e.g., hypertension and diabetes], and medications) were adjusted by Cox regression models.
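The two core calculations named above, the KM estimate and the BH adjustment, can be sketched in a few lines. This is a dependency-free illustration only, not the R code used in the study; the function names `kaplan_meier` and `benjamini_hochberg` and the input layout (one follow-up time and one event flag per patient) are assumptions made for this example.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed death time.

    durations : follow-up time per patient (hypothetical input layout)
    events    : 1 if death was observed, 0 if the patient was censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For a real analysis, an established survival library (such as the R packages cited above) should be preferred, since it also provides confidence intervals and the log-rank test itself.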
We utilized the Python 3.7 package NetworkX [ 41 ] to investigate the properties of the clinical variable networks and used 2 approaches for evaluation. For clinical variable evaluation, we used node degrees and betweenness centrality to rank the variables in the networks. We then checked whether some clinical variables (nodes) were important to the network. We used a complete linkage hierarchical clustering algorithm to cluster the variables across four subgroups.
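As an illustration of the two ranking measures, the sketch below computes node degree and unnormalized betweenness centrality for a small undirected, unweighted network. The study used NetworkX for this step; the version here is a dependency-free approximation of the same idea (Brandes' algorithm), and the function name and edge-list input are assumptions made for this example.

```python
from collections import deque


def degree_and_betweenness(edges):
    """Return (degree, betweenness) dicts for an undirected, unweighted graph.

    edges : iterable of (u, v) pairs, e.g. correlated clinical variables.
    Betweenness is the raw count of shortest paths through a node
    (not normalized by the number of node pairs).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    betweenness = {node: 0.0 for node in adj}

    for source in adj:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        preds = {node: [] for node in adj}
        sigma = {node: 0 for node in adj}
        dist = {node: -1 for node in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes inward.
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                betweenness[w] += delta[w]

    # Each undirected shortest path was counted once from each endpoint.
    betweenness = {node: b / 2 for node, b in betweenness.items()}
    return degree, betweenness
```

Variables can then be ranked by sorting either dictionary by value in descending order.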
In order to understand the differences among the patient subgroups in terms of the clinical variables, we constructed a clinical variable network for each patient subgroup. For each cluster, PCC values of all pairs of noncategorical variables using their distribution in the patients within a specific subgroup were calculated. For the derived echocardiogram variables, the maximum absolute PCC was used to represent the correlations between these variables and other non-echocardiogram variables. However, because there were a limited number of variables, the network density-based PCC cutoff selection strategy resulted in very sparse networks with too few variables present in the network. Therefore, we adopted a top K percent strategy that uses the K% of connections with the highest PCC for the construction of the network. To determine which K to use, we tested the following percentages: 5%, 10%, 15%, and 20% ( S5 Fig ). For example, using top 5%, all variable pairs with |PCC| greater than the absolute PCC at the top 5% were connected. Too few clinical variables were still present in the network when 5% and 10% were used. When 20% was used, we found an increasing number of correlations with nonsignificant p-values (p > 0.05). Therefore, 15% was used for the final clinical variable network analysis. At this cutoff, the highest p-value among all the correlations in all clusters was 0.008.
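The top K percent strategy can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary input, and the rounding rule (keeping the ceiling of K% of all pairs, and at least one pair) are choices made for this example and are not specified in the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def top_k_percent_edges(variables, k_percent):
    """Connect the K% of variable pairs with the highest |PCC|.

    variables : dict {variable_name: list of values within one subgroup}
    Returns a list of (u, v, abs_pcc), strongest correlation first.
    """
    names = sorted(variables)
    scored = []
    for idx, u in enumerate(names):
        for v in names[idx + 1:]:
            scored.append((abs(pearson(variables[u], variables[v])), u, v))
    scored.sort(reverse=True)
    # Assumed rounding rule: ceiling of K% of all pairs, minimum of one.
    keep = max(1, math.ceil(len(scored) * k_percent / 100))
    return [(u, v, r) for r, u, v in scored[:keep]]
```

Calling `top_k_percent_edges(variables, 15)` would correspond to the 15% setting chosen in the text; checking the p-value of each retained correlation is a separate step not shown here.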
To better visualize the patient–patient networks, we computed the network density at different cutoff values and selected the cutoff that resulted in the lowest network density [ 38 , 39 ]. Network density is defined as the ratio of the number of actual links to the number of all possible links among all the patients. The number of all possible links is calculated as n × (n − 1) / 2, where n is the number of patients in the network. Using this method, we tested the cutoffs in increments of 0.05 and identified that the lowest network density (0.24%; S3 Fig ) was achieved when the cutoff was 0.65. Finally, all patient pairs with cosine similarity >0.62 were considered connected in the network to retain more patients for the network visualization and obtain a lower network density. In addition to cosine similarity, we also tested the Pearson correlation coefficient (PCC), but this measure did not yield more distinguishable clusters ( S4 Fig ). The density minimization procedure was used only to optimize the network layout for visualization; it has no direct effect on the performance of patient network clustering. The patient network with each cluster indicated by a color was visualized using Cytoscape v3.7.1 [ 40 ].
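The density calculation and cutoff scan can be sketched as below. One interpretive assumption is made and should be read as such: n is taken to be the number of patients that still have at least one link at a given cutoff, since with a fixed n the density could only decrease as the cutoff rises. The function names and the pair-keyed dictionary input are also hypothetical.

```python
def density_at_cutoff(similarities, cutoff):
    """Return (density, n) for the network formed at one similarity cutoff.

    similarities : dict {(patient_i, patient_j): similarity}, one entry
                   per unordered patient pair
    n counts only patients with at least one surviving link (assumption).
    """
    links = [pair for pair, s in similarities.items() if s > cutoff]
    nodes = {patient for pair in links for patient in pair}
    n = len(nodes)
    possible = n * (n - 1) / 2
    density = len(links) / possible if possible else 0.0
    return density, n


def lowest_density_cutoff(similarities, cutoffs):
    """Return (cutoff, density) with the lowest density among the cutoffs.

    Cutoffs that leave no links are skipped, because an empty network
    would otherwise win trivially. Returns None if every cutoff is empty.
    """
    best = None
    for c in cutoffs:
        density, n = density_at_cutoff(similarities, c)
        if n < 2:
            continue
        if best is None or density < best[1]:
            best = (c, density)
    return best
```

A grid in steps of 0.05, as described in the text, could be generated with `[round(0.05 * k, 2) for k in range(1, 20)]`.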
Considering that K-means clustering has a stochastic component, which may result in different clusters being produced from the same input data, we computed the adjusted Rand index (ARI) and adjusted mutual information (AMI) to validate the clustering stability [ 36 , 37 ]. For both metrics, a value of 1 indicates perfect agreement, while randomly assigned clusters have scores around 0. Following the workflow ( S2A Fig ), we performed 100 K-means clustering experiments using different random initial states. Among the 100 random experiments, 99 showed high ARI and AMI scores for the clusters, indicating robustness of the clustering results ( S2B Fig ).
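To make the stability metric concrete, the following sketch computes the ARI between two cluster assignments of the same patients using the standard pair-counting formula. It is an illustration only; the function name is an assumption, and a library implementation would normally be used in practice. AMI is not shown.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    total = comb(n, 2)
    if total == 0:
        return 1.0  # fewer than two items: agreement is trivial

    # Count item pairs that share a cluster in both, in A, and in B.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use one cluster
    return (sum_pairs - expected) / (max_index - expected)
```

The score depends only on which patients are grouped together, not on the cluster label values, so two runs that find the same groups under different label numbers score 1.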
To identify patient subgroups, we clustered the 4,632 patients using their cosine similarity network profiles by K-means clustering analysis ( Fig 1 ). We first tried to use the elbow method [ 35 ] to determine the number of clusters. We tested cluster numbers in the range of 3 to 20 using the sum of squared error (SSE):

$$\mathrm{SSE} = \sum_{k=1}^{K} \sum_{X_i \in C_k} \lVert X_i - \bar{X}_k \rVert^2 \qquad (2)$$

where $X_i$ indicates each patient, $C_k$ is the $k$th of $K$ clusters, and $\bar{X}_k$ is the average of the patients within the cluster. However, SSE decreased smoothly as the number of clusters increased. Therefore, we performed the survival analysis and cardiovascular outcome analyses for different numbers of clusters to identify the best K value. In this study, we chose the best cluster number (K = 4) using subject matter expertise based on a combination of factors (log-rank p < 0.05; S1 Fig and S2 Table ): (i) significantly distinguishable survival rate and cardiovascular outcome by Kaplan–Meier (KM) estimator with log-rank test; and (ii) the highest number of clusters to identify more new patient subgroups. For each cluster, we computed the ratio of patients with CVD and the p-value using a χ2 test.
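The SSE used by the elbow method can be computed for any given cluster assignment as sketched below. This example does not run K-means itself; it only evaluates an assignment, and the function name and list-based inputs are assumptions made for illustration.

```python
def sse(points, labels):
    """Sum of squared distances from each point to its cluster mean.

    points : list of equal-length numeric lists (one per patient)
    labels : cluster label for each point, in the same order
    """
    clusters = {}
    for p, k in zip(points, labels):
        clusters.setdefault(k, []).append(p)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Cluster mean, computed coordinate by coordinate.
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        for m in members:
            total += sum((m[d] - centroid[d]) ** 2 for d in range(dim))
    return total
```

In an elbow analysis, this value would be computed for the assignment produced at each candidate number of clusters and plotted against that number; a smooth decline without a clear bend, as reported in the text, gives no obvious choice.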
For the construction of the patient–patient network, we computed the cosine similarity for all pairs of patients ( Fig 1 ). The cosine similarity of patients A and B was calculated as:

$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (1)$$

where n = 112, and $A_i$ and $B_i$ indicate the $i$th variable of patients A and B, respectively. A cosine cutoff was used to determine if 2 patients should be connected in the network for visualization.
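A minimal sketch of the pairwise similarity and cutoff step is given below. The function names and the dictionary of patient profiles are hypothetical, and the toy vectors have 3 variables rather than the 112 used in the study.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for an all-zero vector
    return dot / (norm_a * norm_b)


def similarity_edges(profiles, cutoff):
    """Return (i, j, similarity) for every patient pair above the cutoff.

    profiles : dict {patient_id: list of scaled clinical variables}
    """
    ids = sorted(profiles)
    edges = []
    for idx, i in enumerate(ids):
        for j in ids[idx + 1:]:
            s = cosine_similarity(profiles[i], profiles[j])
            if s > cutoff:
                edges.append((i, j, s))
    return edges
```

Because the variables are z-scored before this step, vectors contain both positive and negative values, so similarities range from -1 to 1.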
The overall study design included 4 steps: (A) data preprocessing; (B) PPN construction and visualization; (C) clinical validation using cardiac outcomes and survival analysis; and (D) clinical variable interpretation. The data preprocessing includes outlier removal, feature scaling by the z-score method, and missing data imputation. With the preprocessed patient-clinical variable matrix, we used the cosine measure as the similarity metric for generating a patient–patient similarity network. Then, we performed K-means clustering to assign patients to different subgroups based on the cosine measure (see Methods ). Patients with similar clinical characteristics are grouped in the same cluster and are visualized as a specific subgroup to form the final PPN. After the patient network construction and visualization, we used 2 clinical outcomes, mortality and CTRCD, to evaluate the performance of network-based clustering. Finally, we performed the clinical variable network analysis to enhance clinical interpretation of each risk subgroup with CTRCD. CTRCD, cancer therapy–related cardiac dysfunction; PPN, patient–patient network.
Since our echocardiogram and partial general demographics data were longitudinal, for each variable, we extracted several features: maximum of all follow-ups, minimum of all follow-ups, slope of the variable versus time of all follow-ups, maximum increase within 3 months, and maximum decrease within 3 months. In total, we obtained 112 variables (including the derived ones). A detailed description for all the variables can be found in the supplemental methods section ( S1 Table ). In this study, 4,632 patients were kept for downstream analysis. Missing values were imputed using the mean method, followed by z-score scaling ( Fig 1 ).
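The feature derivation and preprocessing described above can be sketched as follows. Several details are assumptions made for this example and are not stated in the text: "3 months" is treated as 90 days, the slope is a least-squares fit in units per day, changes are measured between any two visits inside the window, missing values are represented as `None`, and the z-score uses the population standard deviation.

```python
from datetime import date


def longitudinal_features(times, values, window_days=90):
    """Summarize one longitudinal variable for one patient.

    times  : list of datetime.date, one per follow-up
    values : list of measurements taken on those dates
    """
    pairs = sorted(zip(times, values))
    days = [(t - pairs[0][0]).days for t, _ in pairs]
    vals = [v for _, v in pairs]
    n = len(vals)

    # Least-squares slope of value versus time; 0.0 when undefined.
    mean_d, mean_v = sum(days) / n, sum(vals) / n
    sxx = sum((d - mean_d) ** 2 for d in days)
    sxy = sum((d - mean_d) * (v - mean_v) for d, v in zip(days, vals))
    slope = sxy / sxx if sxx > 0 else 0.0

    # Largest rise and drop between any two visits within the window.
    max_inc, max_dec = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if days[j] - days[i] <= window_days:
                change = vals[j] - vals[i]
                max_inc = max(max_inc, change)
                max_dec = max(max_dec, -change)
    return {"max": max(vals), "min": min(vals), "slope": slope,
            "max_increase": max_inc, "max_decrease": max_dec}


def impute_and_zscore(column):
    """Mean-impute missing entries (None), then z-score the column."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if x is None else x for x in column]
    sd = (sum((x - mean) ** 2 for x in filled) / len(filled)) ** 0.5
    return [(x - mean) / sd if sd > 0 else 0.0 for x in filled]
```

For example, three LVEF readings of 60, 50, and 55 taken on 1 January, 1 February, and 1 June of the same year would give a maximum decrease of 10 within the window and no increase, since the later rise occurs more than 90 days after the previous visit.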
All-cause mortality with up to 20 years’ follow-up data (1997 to 2019; median 5.02, interquartile range [IQR] 2.39 to 8.01) was used as the primary outcome. Cardiac outcomes defined by ICD 9/10 codes, including atrial fibrillation (AF), coronary artery disease (CAD), heart failure (HF), myocardial infarction (MI), and stroke, were manually checked for accuracy by reviewing patient charts in Epic. According to the diagnosis date of these 5 cardiac outcomes, we identified the cardiac events diagnosed before cancer therapy as preexisting cardiac events, and those after cancer therapy as de novo CTRCD. All diagnoses defined by ICD 9/10 codes were further confirmed by manual review of all medical records.
Comprehensive clinical information was collected using the institutional electronic medical records (EMR) database by International Classification of Diseases (ICD 9/10) codes after cancer diagnosis. This cohort of patients is seen at Cleveland Clinic and regularly followed up. Although a minority of cases moved to another institution, the EMR at Cleveland Clinic is part of the Care Everywhere Network, which is used in 373 institutions across 48 states in the US. This allowed us to collect the details of visits from any such institution and therefore analyze relevant outcomes for these patients. For each patient, 112 clinical variables commonly collected during cardio-oncology clinical practices were used in this study ( S1 Table ): (a) 43 general demographics; (b) 24 lab testing variables; (c) 7 cardiac variables; and (d) 38 echocardiogram variables. Echocardiogram clinical variables were generated from a total of 23,451 sequential echocardiograms. Detailed clinical characteristics of the entire cohort used are provided in Table 1 .
We included all adult patients with cancer referred to the cardio-oncology service at the Cleveland Clinic from March 1997 up to January 2019. Our retrospective study had no prespecified analysis plan. However, the patient pool in this study represents oncology patients seen by oncology specialists at our institution undergoing cancer treatments and referred for cardiology evaluation/testing based upon cardiac risk factor profile or cardiac comorbidity. Once patients were identified, patient information was collected. This study was reviewed and approved by the Institutional Review Board. In addition, this study is reported as per the STARD 2015 reporting guideline for diagnostic accuracy studies ( S1 Checklist ).
Results
Cohort description
The study cohort contains 4,632 cancer patients with at least 2 follow-up visits from March 1997 to January 2019 at the Cleveland Clinic (Table 1). In addition to the clinical data from each patient, data from a total of 23,451 echocardiograms were collected (including baseline and longitudinal follow-up studies). The overall population is 59% female and 41% male, among which 39% were diagnosed with a hematologic cancer and 61% with solid tumors at their initial cancer diagnosis (Table 1). The median age is 63 (IQR: 54 to 71) years for the overall population. Median body mass index (BMI) is 27 kg/m² (IQR: 23 to 32 kg/m²), and there were 1,610 (35%) patients with BMI ≥30 kg/m² (in the obese range). Overall, 1,799 (39%) patients died during the study period, and 486 (10%) patients died in hospital. In this study, we used 5 types of cardiovascular events defined by ICD 9/10 codes and manually checked for accuracy by reviewing patient charts in Epic, including AF, CAD, HF, MI, and stroke. In total, 1,670 (36%) patients had at least one type of diagnosed cardiac event. Specifically, 784 (17%) patients had preexisting cardiac events before cancer therapy, while 886 (19%) patients developed de novo CTRCD. De novo CTRCD is defined as diagnosed cardiovascular events (AF, CAD, HF, MI, or stroke) after cancer therapy. This number is consistent with previous research in breast cancer populations, in which 18% of patients receiving doxorubicin and trastuzumab developed cardiac dysfunction [43].
[END]
[1] Url:
https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003736
(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL:
https://creativecommons.org/licenses/by/4.0/