
Enhancing Stellar Temperature Estimation through Machine Learning and Multifaceted Data Exploration


Published April 2024 © 2024. The Author(s). Published by the American Astronomical Society.
Citation: Siddhi Bansal et al 2024 Res. Notes AAS 8 97. DOI: 10.3847/2515-5172/ad3919


Abstract

This paper employs machine learning to estimate stellar temperatures using photometric data, focusing on the ESA Gaia Archive Data Release 3 data set. The study underscores the effectiveness of neural networks in deciphering intricate relationships within the data. Notably, the addition of metallicity improves model accuracy in characterizing stellar properties. The study also investigates outlier removal techniques, specifically favoring the Isolation Forest method, showcasing its efficacy in refining model performance. Automated machine learning, facilitated by the PyCaret Regressor, emerges as a valuable tool, identifying top-performing models and highlighting feature importance. The implications of this research extend beyond the specifics of stellar temperature estimation. In contemplating future directions, this study suggests the exploration of diverse data sources to ensure balanced distributions of stellar temperatures and the potential incorporation of deep learning architectures for heightened accuracy in addressing astrophysical inquiries.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

How can we estimate the temperatures of different stars from photometric data, and can those predictions be improved by incorporating additional data?

Estimating the temperatures of different stars from photometric data is accomplished through the use of color indices, which quantify a star's brightness in different parts of the electromagnetic spectrum. A common method involves measuring a star's brightness in photometric bands such as the U, G, R, I, and Z filters.

The key idea is that hotter stars emit more energy in the shorter, bluer wavelengths and appear brighter in the U and G bands, while cooler stars emit more in the longer, redder wavelengths and are brighter in the I and Z bands. The color index allows astronomers to place the star on a color–temperature diagram, such as the Hertzsprung–Russell diagram, and estimate its temperature based on the observed color. This photometric technique provides a powerful tool for categorizing stars across the universe, aiding in the determination of the characteristics of a star.
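To make the blue-hot/red-cool trend concrete, the empirical Ballesteros (2012) relation for the B − V color index can be sketched in Python. The B − V index is not one of the filters used in this study; it is shown here purely as an illustration of how a single color index maps to an effective temperature:

```python
def ballesteros_temperature(b_minus_v: float) -> float:
    """Effective temperature (K) from the B-V color index,
    using the Ballesteros (2012) empirical fit."""
    return 4600.0 * (1.0 / (0.92 * b_minus_v + 1.7)
                     + 1.0 / (0.92 * b_minus_v + 0.62))

# A bluer (smaller) color index corresponds to a hotter star:
t_blue = ballesteros_temperature(0.0)   # early-type, blue star (~10,000 K)
t_sun  = ballesteros_temperature(0.65)  # Sun-like color (~5800 K)
t_red  = ballesteros_temperature(1.4)   # cool red dwarf (~4000 K)
```

The same principle underlies the color–temperature mapping learned by the models in this study, just with Gaia's color indices in place of B − V.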

Gaia became our primary data source because of its expansive catalog, which includes stars with temperatures above 40,000 K. We used the Gaia Data Release 3 data set (ESA Gaia Collaboration 2022).

2. Data

We chose Gaia as our data source, retrieving photometric data and metallicity for just over 350,000 stars. We queried the stellar features from the gaia_source table. We retrieved stars in temperature steps of 500 K so that the distribution of stellar temperatures would be balanced; however, below 2500 K and above 20,000 K the fixed number of stars per step dropped significantly, since very few stars fall in those ranges. After running some linear regression models on the photometric data alone, we added metallicity as an additional parameter. We also used the Isolation Forest algorithm to remove outliers from our data, which were originally skewing our results. We trained a general neural network on this data set, with outliers included and without metallicity, to predict stellar temperature. Then we used AutoML via PyCaret to survey a space of multiple models and converge on an even more accurate top model.
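The balanced retrieval in 500 K steps can be mimicked with a small pandas sketch. The temperatures, bin width, and per-step count below are synthetic placeholders (the actual retrieval queried the gaia_source table); the point is that capping each 500 K bin at a fixed count flattens a naturally skewed temperature distribution, while sparse bins at the extremes simply contribute fewer stars:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic catalog: temperatures skewed toward cool stars, loosely
# mimicking a raw survey distribution (illustrative values only).
stars = pd.DataFrame({"teff": rng.gamma(2.0, 2500.0, size=100_000) + 2000.0})
stars = stars[(stars.teff >= 2500) & (stars.teff <= 20_000)]

PER_BIN = 500  # hypothetical fixed number of stars per 500 K step
stars["bin"] = (stars.teff // 500).astype(int)

# Sample up to PER_BIN stars from each 500 K step; bins with fewer
# stars (very cool or very hot) keep everything they have.
balanced = (
    stars.groupby("bin", group_keys=False)
         .apply(lambda g: g.sample(min(len(g), PER_BIN), random_state=0))
)
counts = balanced.groupby("bin").size()
```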

As described in the Gaia Data Release Documentation Section 8.3.1, "Effective temperatures," by René Andrae (Gaia Collaboration 2018), previous work predicted effective temperatures from Gaia Data Release 2 using an ExtraTrees regression model. That model was trained on stars' photometric data to predict effective temperature.

Our model, while also predicting effective stellar temperature, differs from that study in multiple ways, including the type of ML model used for training (neural networks), the training parameters (photometric data plus metallicity), and the use of outlier detection.

In addition, the original study used only photometric data to determine stellar effective temperature. By including metallicity as an additional feature, our model estimates stellar temperatures from multiple characteristics. As noted in the Annual Review of Astronomy and Astrophysics, Chapter 5, "an increase in metallicity results in lower effective temperatures" (Conroy 2014). By accounting for metallicity, our model improves on the existing study by incorporating another stellar feature that can affect temperature significantly.

In the original study, an ExtraTrees ensemble regressor was used, which proved to be a fairly potent non-parametric learning algorithm. However, it had its limitations. First, its training was limited to non-synthetic photometry, and the temperature range was restricted to 3000–10,000 K, so its ability to extrapolate was very limited. Our general neural network significantly extends the prediction range because it is trained on stars drawn from a much wider range of temperatures. Additionally, because a neural network updates its weights via back-propagation and stochastic gradient descent, it adapts more readily to changes in the data: synthetic or misleading data that is later corrected poses less of a problem than it does for ExtraTrees. Finally, neural networks have a strong capability for feature extraction and function approximation on an arbitrary data set. This allows scientists both to infer which aspects of a star's data contribute most to its temperature and to apply the model to new features, such as new color indices, that may reveal new patterns.

3. Results

We built a neural network using the TensorFlow library. The final model for the Gaia data set consists of 4 input nodes (three color filters plus metallicity), 4 hidden layers (two 64-node layers followed by two 32-node layers), and 1 output node for the temperature value. The hidden-layer sizes were chosen by trial and error based on validation accuracy, along with tuning the number of training epochs. By applying the Isolation Forest algorithm for outlier detection, 8% of our stars were removed, enhancing the accuracy and reliability of our temperature estimation model: accuracy rose from 61% to 73%. Stellar temperatures toward the lower (2500 K) and higher (20,000 K) ends had limited data, causing our model to predict in undesirable ways there. Temperatures between these extremes show more confident predictions, giving us confidence in predicting stellar temperature within this range.
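The study's model was implemented in TensorFlow; as a dependency-light stand-in (an assumption for illustration, not the authors' code), the same 64-64-32-32 hidden-layer structure can be sketched with scikit-learn's MLPRegressor on synthetic data shaped like the four input features:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# 4 inputs (three color indices + metallicity), 1 output (temperature);
# the data here are synthetic placeholders for the Gaia features.
X = rng.normal(size=(5000, 4))
y = (6000 + 2000 * np.tanh(X[:, 0]) + 500 * X[:, 3]
     + rng.normal(scale=100, size=5000))

# Mirror the reported architecture: hidden layers 64-64-32-32, one output.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64, 32, 32),
                 max_iter=300, random_state=0),
)
model.fit(X, y)
preds = model.predict(X[:5])
```

A Keras implementation would stack the same four Dense layers; the architecture, not the framework, is the point of the sketch.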

Automated machine learning (AutoML) is an emerging yet relatively unexplored technique for optimizing the design of machine learning models. It involves iterating the data set over a series of different machine-learning models, each with unique parameters and structures. This includes not only neural networks, but decision trees, K-neighbors, and many hybrid-resembling models. After iterating over a multitude of models, it ranks each model's performance on a leaderboard so the programmer can focus on the top-performing models. Overall, AutoML helps to reduce time spent on manual model searching and parameter tuning.
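The leaderboard idea behind AutoML can be illustrated with a manual sweep in scikit-learn: fit several candidate regressors and rank them by test-split R². The models and synthetic data below are illustrative stand-ins for the PyCaret search over the Gaia features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Toy stand-in for the star catalog: 4 features -> temperature-like target.
X = rng.normal(size=(3000, 4))
y = 5000 + 1500 * X[:, 0] - 800 * X[:, 1] ** 2 \
    + rng.normal(scale=200, size=3000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "k_neighbors": KNeighborsRegressor(n_neighbors=10),
}

# Fit each candidate and rank by test-split R^2, mimicking an
# AutoML leaderboard.
leaderboard = sorted(
    ((name, r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te)))
     for name, m in candidates.items()),
    key=lambda t: t[1], reverse=True,
)
```

PyCaret automates exactly this loop (plus preprocessing and hyperparameter tuning) across a much larger model zoo.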

We used the PyCaret Regressor library for our implementation of AutoML, run on the Gaia data with outlier removal and metallicity included. The top-performing model was the Random Forest Regressor, with an R² value of 0.822 on the test split. Thus, through AutoML, the top models yielded correlations significantly higher than those of the tuned neural network.

4. Conclusion

Our study has yielded several key findings regarding the estimation of stellar effective temperatures using machine learning algorithms. First, neural network models consistently outperformed linear regression and polynomial regression models, demonstrating their superior ability to capture complex relationships between photometric data and temperature. This highlights the potential of neural networks in astronomical applications involving temperature estimation.

Second, the Isolation Forest outlier detection program proved to be effective at identifying and removing outliers from our data set. Its adaptiveness and robustness to outliers made it well-suited for handling the diverse characteristics of our data.

Third, incorporating metallicity as an additional parameter significantly enhanced the accuracy of our models for all three machine learning algorithms. This finding underscores the importance of metallicity in characterizing stellar properties and its influence on temperature estimation, especially when using photometric data.

Figure 1. Gaia neural network results and AutoML model rankings.

Finally, the U − G color index exhibited the highest degree of skewness and outlier density among the photometric parameters. Applying Isolation Forest outlier detection specifically to the U − G parameter resulted in a substantial improvement in model accuracy, further emphasizing the importance of effective outlier detection for enhancing model performance.

Through AutoML, we were able to run and test other models that would have otherwise been outside the scope of this study, providing us with valuable insights into the relationships between effective star temperature and photometric data.

Overall, through this study we were able to conclude that photometric data and effective stellar temperatures exhibit a clear relationship when an additional stellar characteristic, metallicity, is also considered.

Acknowledgments

We thank Shyamal Mitra, our mentor in the Geometry of Space research group at the University of Texas at Austin, for his support and guidance in developing our research.
