(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:
https://journals.plos.org/plosone/s/licenses-and-copyright
------------
Deep learning exoplanets detection by combining real and synthetic data
['Sara Cuéllar', 'Escuela De Ingeniería Eléctrica', 'Pontificia Universidad Católica De Valparaíso', 'Valparaíso', 'Paulo Granados', 'Ernesto Fabregas', 'Departamento De Informática Y Automática', 'Universidad Nacional De Educación A Distancia', 'Madrid', 'Michel Curé']
Date: 2022-06
Abstract Scientists and astronomers attach great importance to the discovery of new exoplanets, even more so if they lie in the habitable zone. To date, NASA has confirmed more than 4300 exoplanets using various discovery techniques, including planetary transits, and various databases provided by space and ground-based telescopes. This article proposes a deep learning system for detecting planetary transits in Kepler Telescope light curves. The approach builds on related work from the literature and extends it to validation with real light curves. A CNN classification model is trained on a mixture of real and synthetic data. The model is then validated only with unseen real data. The best ratio of synthetic data is determined by means of an optimisation technique and a sensitivity analysis. The precision, accuracy and true positive rate of the best model are determined and compared with those of similar works. The results demonstrate that using synthetic data in the training stage can improve transit detection performance on real light curves.
Citation: Cuéllar S, Granados P, Fabregas E, Curé M, Vargas H, Dormido-Canto S, et al. (2022) Deep learning exoplanets detection by combining real and synthetic data. PLoS ONE 17(5): e0268199.
https://doi.org/10.1371/journal.pone.0268199 Editor: Sathishkumar V E, Hanyang University, KOREA, REPUBLIC OF Received: January 25, 2022; Accepted: April 23, 2022; Published: May 25, 2022 Copyright: © 2022 Cuéllar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: All databases are public. Funding: This research was supported in part by the Chilean National Agency for Research and Development (ANID) under Projects FONDECYT 1191188 and 1190486, and PhD Scholarship 21221393. The National University of Distance Education under Project 2021V/-TAJOV/00 and Ministry of Science and Innovation of Spain under Project PID2019-108377RB-C32. Competing interests: The authors have declared that no competing interests exist.
Introduction All the planets in our solar system orbit the sun. Planets orbiting other stars are called exoplanets under NASA’s Exoplanet Exploration Program [1]. Exoplanets are very difficult to see directly with telescopes because they are hidden by the brightness of the star they orbit. The search for planets outside the solar system has been under way for many years. The existence of a possible exoplanet orbiting the white dwarf Van Maanen 2 has been suspected since 1917 [2], but it could not be confirmed with the limited technology of the time. It was not until 1995 that Michel Mayor and Didier Queloz first confirmed an exoplanet, called Dimidium or 51 Pegasi b, with a 4-day orbit around the nearby star Helvetios [3]. They described it as a large ball of gas similar to Jupiter. For this finding they received the 2019 Nobel Prize in Physics [4]. Nowadays, scientists and astronomers attach great importance to the discovery of new exoplanets, even more so if they lie in the habitable zone. Most of the exoplanets discovered so far are found in a relatively small region of our galaxy, the Milky Way. To date, NASA has confirmed 4301 exoplanets using a variety of discovery techniques [5], including planetary transits, radial velocities, gravitational microlensing and direct imaging, applied to databases provided by space and ground-based telescopes, e.g. NASA’s Kepler space telescope [6] and NASA’s Transiting Exoplanet Survey Satellite (TESS) [7]. The Kepler space telescope collected data on a large number of stars (on the order of 200,000) during the 4 years it was operating (2009-2013), amounting to around 678 gigabytes of data [8]. As in many other examples in the literature (see for instance [9–12]), the manual analysis of large databases, such as these light curves, is very time-consuming work. In this context, artificial intelligence methods have emerged as tools for analysing this information.
The main research questions addressed by this work are how artificial intelligence algorithms can contribute to the exoplanet detection field, and whether it is possible to add technical knowledge through synthetic data to improve the performance of the detector. Different approaches that use artificial intelligence techniques to detect exoplanets can be found in the literature. For example, in [13], the authors describe a method for detecting exoplanet transits that applies the k-nearest neighbors (kNN) method to determine whether a given signal is sufficiently similar to known transit signals. In [14], the authors present for the first time the use of Random Forest Classifiers (RFCs) for exoplanet classification. They achieve an overall error rate of 5.85% and an error rate of 2.81% in the classification of exoplanet candidates. The work described in [15] combines RFCs and Convolutional Neural Networks (CNNs) to distinguish between the different types of signals. The authors report that the combination of both methods offers the best approach, correctly identifying exoplanets in the test data approximately 90% of the time. In [16], the authors present another CNN-based approach that is capable of detecting Earth-like exoplanets in noisy time series data with greater accuracy than a least-squares method. The most important disadvantage of this approach is that the model is trained on synthetic data instead of real data, which provides no evidence of its performance against real data. In [17], a method for classifying candidates using Self-Organizing Maps (SOM) is developed on Kepler and K2 confirmed and candidate planets, with a success rate of 87%. More recently, in [18], an Ensemble-CNN model for exoplanet detection is presented with an accuracy of 99.62%. Another approach, [19], reports a 98% cross-validated precision score using RFCs to classify objects of interest in Kepler’s cumulative object of interest table.
However, in this case the authors use only data from the training stage to cross-validate their models, which does not allow the performance of the model on new data to be properly analysed. Despite the good results obtained by the works mentioned above, most of them build and validate their models using, in some cases, light curves of unconfirmed planet candidates, some of which are even false positives. The main contributions of this work are the following:
The development of a system for detecting planetary transits in Kepler Telescope light curves, which includes the generation of synthetic data from estimated parametric models of the planet candidate. This approach allows planetary transits to be found over a wider range of periods.
As far as we know, our approach is the first exoplanet detection model trained by deep learning on a mixture of real and synthetic data. A sensitivity analysis and an optimisation technique are used to determine the best ratio of synthetic data. The model consists of building an image from the folding of light curves. This image is then used to detect planetary transits by means of a CNN.
Unlike other related works, the validation of the model is performed only with real data, different from those used in the training stage. This shows that the performance of the model is better than if only real data are used for training.
This paper is structured as follows. The second section presents some exoplanet detection approaches that can be found in the literature and briefly describes the approach that is the starting point of this work. The third section details the proposed method. The fourth section shows the experimental results and a comparison with previous results. Finally, the fifth section summarises the main conclusions and future work.
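The training-set construction described in the contributions above (all available real curves plus a chosen fraction of synthetic ones, with validation restricted to held-out real data) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `(light_curve, label)` pair layout and the 0.4 ratio are assumptions made for the example, and plain lists are used to keep it dependency-free.

```python
import random

def split_real(real, val_fraction, seed=0):
    """Hold out part of the real data for validation before any mixing."""
    rng = random.Random(seed)
    shuffled = real[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train_real, val_real)

def build_training_set(real, synthetic, synth_ratio, seed=0):
    """Mix all real samples with enough synthetic ones that roughly
    `synth_ratio` of the result is synthetic, then shuffle.

    `real` and `synthetic` are lists of (light_curve, label) pairs
    (hypothetical layout). Returns (light_curve, label, is_synthetic) triples.
    """
    if not 0.0 <= synth_ratio < 1.0:
        raise ValueError("synth_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_synth / (n_real + n_synth) = synth_ratio for n_synth.
    n_synth = round(synth_ratio * len(real) / (1.0 - synth_ratio))
    n_synth = min(n_synth, len(synthetic))  # cannot use more than we have
    picked = rng.sample(synthetic, n_synth)
    mixed = [(x, y, False) for x, y in real] + [(x, y, True) for x, y in picked]
    rng.shuffle(mixed)
    return mixed

# Toy data: 80 "real" and 200 "synthetic" one-point curves.
real = [([float(i)], i % 2) for i in range(80)]
synthetic = [([-float(i)], 1) for i in range(1, 201)]

train_real, val_real = split_real(real, val_fraction=0.25)
train = build_training_set(train_real, synthetic, synth_ratio=0.4)
print(len(train), sum(1 for _, _, s in train if s), len(val_real))
```

Splitting the real data first is the important design choice here: the validation set never sees synthetic curves and never overlaps with the real curves used for training, whatever ratio is being evaluated.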
Exoplanets detection approaches As mentioned above, the discovery of new exoplanets has gained considerable importance during the last few years. Since the amount of data provided by telescopes is enormous, the only practical way to analyze it is with machine learning techniques. A significant amount of research in the literature has focused on the use of machine learning techniques for exoplanet detection. This section reviews the works most significant and relevant to our approach. Table 1 summarises the articles covered in this brief review. The first column contains the reference to the article in the bibliography. The second column shows the names of the telescope and catalogs from which the data were obtained. The third column gives details of the feature extraction used. The fourth column shows the machine learning method used for detection. Finally, the fifth column shows the results obtained by each approach.
Table 1. Machine learning approaches for exoplanet detection.
https://doi.org/10.1371/journal.pone.0268199.t001
In [20], published in 2015, the authors present the Autovetter, a machine learning based classifier. It is used to produce a catalog of Planet Candidates from the Q1-Q17 DR24 Threshold Crossing Events (TCEs) identified by the Kepler Science Operations Center pipeline. The Autovetter classifies 20367 TCEs into three classes: (1) Planet Candidate (PC), which contains 3600 signals that are consistent with transiting planets; (2) Astrophysical False Positive (AFP), which contains 9596 signals of astrophysical origin that could mimic planetary transits; and (3) Non-Transiting Phenomenon (NTP), which contains 2541 signals that are evidently of instrumental origin or are noise artifacts. A set of 114 attributes calculated by the Kepler pipeline is ultimately used to build a random forest classifier that maps the attributes of any TCE to a predicted class label of PC, AFP or NTP. The results evaluated on 4630 TCEs show the following accuracy/error rate for each class: PC (0.971/2.9%), AFP (0.976/2.4%) and NTP (0.968/3.2%). These results are accurate enough that the Autovetter predictions are taken as ground truth in later studies. In [21], published in 2018, the authors present a method for classifying potential planet signals using deep learning, specifically convolutional neural networks (CNNs). Features are extracted by folding each flattened light curve on the TCE period (with the event centered) and binning it to produce a 1D vector. The training and test sets (PC, AFP and NTP) were selected from the Autovetter Planet Candidate Catalog for Q1-Q17 DR24. The result is a CNN model named Astronet that is able to distinguish with good accuracy the subtle differences between genuine transiting exoplanets and false positives such as eclipsing binaries, instrumental artifacts, and stellar variability.
They also compared models based on linear logistic regression (LLR) and a fully connected neural network. The results show that real planets are classified with 95% recall, 90% accuracy and 96% precision. In [22], also published in 2018, the authors present another CNN-based approach, named Exonet. They use a dataset from the same catalog as the previous work (Kepler Q1-Q17 DR24). For the classification process, they use phase-folded light curves and the associated centroid curves (measured by the Kepler pipeline from the same target pixel file, TPF), in both global and local views. They also add normalized stellar parameters (effective temperature, surface gravity, metallicity, radius, mass, and density) to the training set. The results outperformed Astronet, with an accuracy of 97.5% and a precision of 95.5%. In [23], published in 2019, the first deep neural network trained and tested on real TESS data is presented. The model is a modified version of Astronet designed to automatically perform triage and vetting on TESS candidates. In triage mode, it can distinguish transit-like signals (planet candidates and eclipsing binaries) from stellar variability and instrumental noise with an average precision of 97.0% and an accuracy of 97.4%. In vetting mode, the model is trained to identify only planet candidates with the help of newly added scientific domain knowledge, and achieves an average precision of 69.3% and an accuracy of 97.8%. In [19], also published in 2019, the authors study several classification models (SVM, KNN and RF) used to assign a probability that an observation is an exoplanet. A Random Forest Classifier was selected as the optimum machine learning model to classify the data in the Cumulative Kepler Object of Interest (KOI) catalog, which gathers information on all Kepler Objects of Interest in one place.
The Random Forest Classifier, trained using the table attributes as features, obtained a cross-validated accuracy of 98.96%, a precision of 99.55% and a recall of 97.21% on the training set. In [24], published in 2019, the authors present a CNN-based approach for detecting exoplanet transits. A 2D phase folding technique is proposed, which generates a set of images for training. They test the method with five different types of deep learning models, with and without folding. Synthetic light curves were generated as the input to these models. The results indicate that a two-dimensional convolutional neural network combined with folding is the best choice for future transit analysis. All models with folding have an accuracy above 98%, whereas the accuracy of models without folding is about 85%. The precision and recall show a similar trend. The present article builds on this approach; the main difference is that it uses real light curves with transits for both training and testing. In [25], published in 2020, the authors present an approach based on a tree-based classifier, built with the popular machine learning tool LightGBM, to detect exoplanets using the transit method. They use the time-series analysis library TSFresh to extract 789 features from light curves. These features capture information about the characteristics of each light curve. The method was trained and tested on synthetic data and on real Kepler and TESS data. The evaluation on synthetic data showed it to be more effective than conventional box least squares (BLS) fitting. On Kepler data, the method is able to detect a planet transit with an AUC of 94.8% and a recall of 96%. On TESS data, the method is able to classify light curves with an accuracy of 98% and to identify planets with a recall of 82%.
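The idea behind the 2D phase folding of [24] discussed above can be illustrated with a short sketch: the light curve is cut into consecutive segments one period long and the segments are stacked as image rows, so that a periodic transit lines up as a dark vertical stripe. The code below is only an illustration under stated assumptions; the function name `fold_2d`, the mean-per-cell binning and the NaN fill for empty cells are choices made here, not details taken from [24] or from this article.

```python
import math

def fold_2d(time, flux, period, n_bins):
    """Stack a light curve into an image: one row per cycle of length
    `period`, one column per phase bin. Each cell holds the mean flux of
    the samples that fall in it, or NaN if it received no samples.

    Assumes `time` is sorted in ascending order.
    """
    if period <= 0 or n_bins <= 0:
        raise ValueError("period and n_bins must be positive")
    t0 = time[0]
    n_rows = int((time[-1] - t0) // period) + 1
    sums = [[0.0] * n_bins for _ in range(n_rows)]
    counts = [[0] * n_bins for _ in range(n_rows)]
    for t, f in zip(time, flux):
        cycle, offset = divmod(t - t0, period)  # which row, how far into it
        row = int(cycle)
        col = min(int(n_bins * offset / period), n_bins - 1)
        sums[row][col] += f
        counts[row][col] += 1
    return [[s / c if c else math.nan for s, c in zip(srow, crow)]
            for srow, crow in zip(sums, counts)]

# Toy curve: 300 samples, a 1% dip lasting 5 samples every 50 samples.
time = [float(i) for i in range(300)]
flux = [0.99 if 20 <= i % 50 < 25 else 1.0 for i in range(300)]

image = fold_2d(time, flux, period=50.0, n_bins=10)
for row in image:
    print(" ".join(f"{v:.2f}" for v in row))
```

When the folding period matches the orbital period, the dip occupies the same column in every row; with a wrong period it drifts diagonally across the image, which is the kind of spatial pattern a 2D CNN can learn to separate.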
Conclusion This paper presents the development of a deep learning system for detecting planetary transits in Kepler Telescope light curves. The approach builds on related work from the literature and extends it to validation with real light curves. 2D phase folding is used as a feature extraction method that allows real and synthetic light curves with transits to be described by an image distinguishable from those without transits. The model parameters are adjusted to improve the classification performance. The method is evaluated on real light curves from the Kepler catalog and demonstrates superior performance compared with other state-of-the-art approaches. The main contribution of this work is the enhancement of a detection model by including the generation of synthetic light curves with transits from estimated parameters. The best ratio of synthetic data is found using coarse tuning with genetic algorithms and is supported by a sensitivity analysis. The evaluated metrics demonstrate that combining real and synthetic light curves with transits in the training stage adds knowledge to the model and improves its performance on real light curves. Future work will consider extending the study to systems with more than one confirmed planet or planetary candidate, dealing with multi-transit detection in the same light curve. It will also consider applying the method to a different database, such as that of NASA’s Transiting Exoplanet Survey Satellite (TESS), a mission that has already discovered 166 exoplanets and has 4604 planet candidates, or even to data acquired by the James Webb Space Telescope, launched in December 2021.
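To give a concrete picture of what generating a synthetic light curve with transits from estimated parameters can look like, the sketch below builds a flat light curve with periodic box-shaped dips. The box shape is an assumption made purely for illustration: this section does not specify the parametric transit model used by the authors, and realistic models also account for ingress, egress and limb darkening. The function name and all parameter names are hypothetical.

```python
import random

def box_transit_curve(n_points, cadence, period, epoch, duration, depth,
                      noise_sigma=0.0, seed=0):
    """Return (time, flux) for a constant star with periodic box-shaped dips.

    All time-like arguments share one unit (for example days). `epoch` is
    the mid-time of the first transit and `depth` is the fractional flux
    drop. The box shape is a simplifying assumption, not a physical model.
    """
    rng = random.Random(seed)
    time, flux = [], []
    for i in range(n_points):
        t = i * cadence
        # Offset from the nearest transit mid-time, wrapped into
        # [-period/2, period/2).
        dt = (t - epoch + 0.5 * period) % period - 0.5 * period
        level = 1.0 - depth if abs(dt) <= 0.5 * duration else 1.0
        time.append(t)
        # Gaussian white noise; with noise_sigma=0 the curve is noiseless.
        flux.append(level + rng.gauss(0.0, noise_sigma))
    return time, flux

# Noiseless example: period 50, transit centred at t=10, 4 units long, 1% deep.
time, flux = box_transit_curve(n_points=200, cadence=1.0, period=50.0,
                               epoch=10.0, duration=4.0, depth=0.01)
in_transit = [t for t, f in zip(time, flux) if f < 1.0]
print(len(in_transit), min(flux))
```

In practice the period, epoch, duration and depth would come from the estimated parameters of a planet candidate, and the noise level would be matched to the real light curve the synthetic one is meant to resemble.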
[1] Url:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0268199
(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL:
https://creativecommons.org/licenses/by/4.0/