Aim

The project tackles astronomical phenomenon classification by developing and examining specific components of a cognitive vision system. These components comprise a feature selection step that extracts a feature representation from a dataset, a machine learning approach supported by model selection and hyperparameter optimization techniques, and inference capabilities that draw on expert domain knowledge of astronomy to identify galaxies in optical images. Together, these components form a proposed three-stage pipeline. In future, the pipeline should be adaptable to the choice of dataset and able to process large volumes of data from astronomical sky surveys. In building such a system, the project moves towards the final objective of a cognitive vision system for astronomy. See Background and Dataset for more information.

Cognitive vision system pipeline
Figure 1. A proposed pipeline for a cognitive vision system

Background and Dataset

Astronomy is a field that actively accumulates and processes vast amounts of data whose structure and information can only be extracted by computational means. Analysing this data requires automated techniques such as machine learning, both to reduce the data into consumable and useful knowledge and to perform analyses such as astronomical phenomenon classification.

To solve these problems, astronomy has been the focus of many statistical machine learning approaches, such as decision trees, k-means clustering, and simple multi-layer perceptrons for galaxy and supernova classification, and most recently of deep-learning approaches. It has also seen top-down expert knowledge representations applied to efficient dataset storage and retrieval. It is therefore a prime candidate field for a cognitive vision system (CVS) that harnesses expert knowledge representation, feature selection, and machine learning techniques to perform classification tasks such as galaxy classification in optical astronomy.

The construction of a CVS often centers around developing and integrating/coupling various components that solve a given problem. In the case of astronomy, this means that techniques such as feature selection, various machine learning algorithms, and knowledge representation and inference may be applied for automated galaxy identification and classification, with the goal of data reduction and labeling to streamline astronomy observation workflows.

Dataset

The dataset used for this project is the optical galaxy image dataset available from the Galaxy Zoo crowdsourcing initiative.

Feature extraction

Classification relies on a good set of features to discriminate between classes. This paper describes an approach to feature extraction and selection on galaxy images for the task of galaxy classification. The approach not only compares different feature extraction techniques for galaxy classification, but also allows other researchers to compare their own techniques against those explored via feature selection.

The feature extraction and selection component uses feature extraction techniques that are popular in the galaxy classification literature, together with a newly developed hybrid feature selection method that combines univariate and tree-based feature selection to pick the best-performing features for classification. Figure 2 shows an example of the feature extraction process, in which the contour of a galaxy in an image has been successfully traced.

A galaxy with contour
Figure 2: A galaxy showing the contour determined by the feature extraction process.
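
The source does not spell out how the hybrid method combines its univariate and tree-based parts, so the following is only a minimal sketch under stated assumptions: the univariate score is the absolute Pearson correlation with a 0/1 class label, the tree-based score is the best Gini impurity decrease of a single-split decision stump (a simple stand-in for a full tree ensemble), and the two rankings are merged by summing ranks. All function names and the toy data are hypothetical.

```python
import math

def pearson_score(column, labels):
    """Univariate score: |Pearson correlation| between one feature and 0/1 labels."""
    n = len(column)
    mean_x, mean_y = sum(column) / n, sum(labels) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(column, labels))
    var_x = sum((x - mean_x) ** 2 for x in column)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no information
    return abs(cov) / math.sqrt(var_x * var_y)

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def stump_score(column, labels):
    """Tree-based score: best impurity decrease from one threshold split."""
    parent, best = gini(labels), 0.0
    for threshold in sorted(set(column))[:-1]:
        left = [y for x, y in zip(column, labels) if x <= threshold]
        right = [y for x, y in zip(column, labels) if x > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best = max(best, parent - child)
    return best

def ranks(scores):
    """Map each score to its rank, where rank 0 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = [0] * len(scores)
    for position, index in enumerate(order):
        rank[index] = position
    return rank

def hybrid_select(rows, labels, k):
    """Return the indices of the k features with the best summed rank."""
    columns = list(zip(*rows))
    univariate = ranks([pearson_score(c, labels) for c in columns])
    tree_based = ranks([stump_score(c, labels) for c in columns])
    combined = [u + t for u, t in zip(univariate, tree_based)]
    return sorted(range(len(columns)), key=lambda i: combined[i])[:k]

# Toy data: feature 0 separates the classes, feature 1 is noisy, feature 2 is constant.
rows = [(0.1, 5.0, 1.0), (0.2, 1.0, 1.0), (0.3, 4.0, 1.0),
        (0.8, 2.0, 1.0), (0.9, 5.0, 1.0), (1.0, 1.0, 1.0)]
labels = [0, 0, 0, 1, 1, 1]
selected = hybrid_select(rows, labels, k=2)
```

Summing ranks rather than raw scores avoids having to put correlation values and impurity decreases on a common scale; other combination rules are equally plausible.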

Results

The feature extraction module was able to successfully extract the roughly 3000 WND-CHARM features as well as an additional 9 shape descriptors. Feature selection was successfully performed using three known feature selection methods (Random Forest, Pearson's Correlation, and Recursive Feature Elimination) as well as a new hybrid feature selection algorithm.
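
The nine shape descriptors are not listed here, so the sketch below only illustrates the kind of quantity that can be computed once a galaxy contour is available. It assumes the contour is a closed polygon given as a list of (x, y) vertices; the descriptor names (area, perimeter, circularity, aspect ratio, extent) are illustrative choices and not necessarily those used in the project.

```python
import math

def polygon_area(points):
    """Area of a closed polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_perimeter(points):
    """Sum of the edge lengths of a closed polygon."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]))

def shape_descriptors(points):
    """Compute a few illustrative shape descriptors for a contour."""
    area = polygon_area(points)
    perimeter = polygon_perimeter(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return {
        "area": area,
        "perimeter": perimeter,
        # 1.0 for a perfect circle, smaller for elongated or ragged shapes
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
        # width over height of the axis-aligned bounding box
        "aspect_ratio": width / height,
        # fraction of the bounding box covered by the shape
        "extent": area / (width * height),
    }

# A 2 x 2 square stands in for a real galaxy contour.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
descriptors = shape_descriptors(square)
```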

Graph showing normalized feature scores.
Figure 3: Graph showing the feature scores for 10 selected features.

The features were then ranked using the respective selection algorithms, and the softmax function was applied to normalise the scores. Figure 3 shows an example of 10 selected features and their normalised scores. The hybrid selection algorithm delivered promising results in comparison to the other three algorithms.
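
For reference, softmax normalisation of a list of raw feature scores can be sketched as follows. The input scores are made-up values, and subtracting the maximum before exponentiating is a standard numerical-stability step rather than something stated in the source.

```python
import math

def softmax(scores):
    """Map raw scores to positive values that sum to 1, preserving their order."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw importance scores for four features.
raw_scores = [2.0, 1.0, 0.5, 0.1]
normalised = softmax(raw_scores)
```

Because softmax is monotonic, the ranking of the features is unchanged; only the scale becomes comparable across selection algorithms.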

Automated machine learning

The task of phenomenon classification in astronomy provides a novel and challenging setting for state-of-the-art techniques that address the problem of combined algorithm selection and hyperparameter optimization (CASH) of machine learning algorithms. Such techniques have local applications, for example at the data-intensive Square Kilometre Array (SKA). This work uses various CASH algorithms to explore whether hyperparameter optimization can improve the performance of machine learning techniques for astronomy.
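
To make the CASH setting concrete, the sketch below treats the choice of algorithm as one more hyperparameter and searches the joint space with plain random search. Random search is used here only as the simplest baseline, not as the method evaluated in this work. The algorithm names, hyperparameter ranges, and the `toy_evaluate` function are hypothetical placeholders; a real evaluation would train a model and return a validation score.

```python
import random

# Hypothetical joint search space: each algorithm carries its own hyperparameters.
SEARCH_SPACE = {
    "random_forest": {"n_estimators": (10, 500), "max_depth": (2, 30)},
    "svm": {"log10_C": (-3.0, 3.0), "log10_gamma": (-5.0, 1.0)},
    "mlp": {"hidden_units": (8, 256), "log10_learning_rate": (-5.0, -1.0)},
}

def sample_configuration(rng):
    """Pick an algorithm first, then sample only that algorithm's hyperparameters."""
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    params = {}
    for name, (low, high) in SEARCH_SPACE[algorithm].items():
        if isinstance(low, int):
            params[name] = rng.randint(low, high)   # integer-valued range
        else:
            params[name] = rng.uniform(low, high)   # real-valued range
    return algorithm, params

def random_search(evaluate, n_trials, seed=0):
    """Return the best (algorithm, params) pair found and its score."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_configuration(rng)
        score = evaluate(*config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

# Placeholder scores standing in for validation accuracy; NOT real results.
TOY_SCORES = {"random_forest": 0.80, "svm": 0.75, "mlp": 0.85}

def toy_evaluate(algorithm, params):
    """Hypothetical stand-in for training and validating a model."""
    return TOY_SCORES[algorithm]

best_config, best_score = random_search(toy_evaluate, n_trials=20, seed=0)
```

The key structural point is that the hyperparameters are conditional on the chosen algorithm, which is what distinguishes CASH from tuning a single fixed model.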

Focusing on the Galaxy Zoo dataset, these algorithms were used to conduct an in-depth comparison of the state of the art in hyperparameter optimization (HPO), along with techniques that aim to improve performance on large datasets and expensive function evaluations.
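
One common way to keep evaluations cheap on large datasets is to score many configurations on small sub-samples and spend the larger sub-samples only on the promising ones. The sketch below shows successive halving as an example of that idea; it is an assumption that this resembles the techniques compared here, and the `toy_evaluate` objective is a hypothetical placeholder for training on `n_samples` examples and returning a validation score.

```python
def successive_halving(configs, evaluate, min_samples, max_samples):
    """Keep the better half of the configurations each round,
    doubling the sub-sample size given to the survivors."""
    survivors = list(configs)
    n_samples = min_samples
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, n_samples), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]
        n_samples = min(max_samples, n_samples * 2)
    return survivors[0]

def toy_evaluate(config, n_samples):
    """Hypothetical objective: best at log10_lr = -2, slightly better with more data."""
    return -(config["log10_lr"] + 2) ** 2 - 1.0 / n_samples

# Five candidate learning rates, expressed as base-10 exponents.
candidates = [{"log10_lr": exponent} for exponent in (-5, -4, -3, -2, -1)]
winner = successive_halving(candidates, toy_evaluate, min_samples=500, max_samples=8000)
```

With five candidates, the first round uses 500 samples and keeps two configurations, and the second round uses 1000 samples and keeps one, so most of the budget is never spent on poor configurations.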

Finally, the feasibility of integration with a cognitive vision system for astronomy is examined through a brief exploration of different feature extraction and selection methods.

Results

Various HPO algorithms were evaluated on the Galaxy Zoo dataset. They proved effective both in outperforming state-of-the-art solutions, such as the published top-performing Kaggle solutions, and in making methods more efficient on large datasets by training configurations on smaller sub-samples of the full dataset. The performance of these algorithms is shown in the figure below.

Comparison of hyperparameter algorithms
Figure 4: A graph showing the performance of various hyperparameter optimisation algorithms.

Hyperparameter optimization is feasible for the problem domain of astronomy. In this work, HPO was shown to outperform existing methods of optimizing model configurations, such as SKYNET. This opens the door to future work on finding and improving other existing models and algorithm configurations for astronomy. Where HPO selects simpler and computationally cheaper model configurations, it could also reduce computational cost and runtime on these problems. Further, the development of machine learning models for automated data processing in astronomy may be accelerated if these HPO tools are adopted more widely.

Knowledge representation and reasoning

The knowledge representation and reasoning component was implemented as a Bayesian network (BN) using the BayesPy Python library. The network incorporates shape descriptors to identify the galaxy type in terms of the Hubble sequence. It was trained on datasets of various sizes using variational Bayesian approximation and tested on over 10 thousand examples from the Galaxy Zoo dataset.

Sets of training evidence of different sizes were used to investigate the behaviour and performance of variational Bayesian approximation as opposed to an exact inference method.
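
As a point of comparison for the exact case, the sketch below learns a single conditional probability table for a class node whose parents are discretised shape descriptors. It does not use BayesPy or variational inference: with fully observed discrete data and a symmetric Dirichlet prior, the posterior mean is available in closed form by counting. The class labels, descriptor bins, prior count, and the `DiscreteClassNode` name are all hypothetical, and the toy example uses two descriptors and two classes rather than the full network.

```python
from collections import defaultdict

class DiscreteClassNode:
    """P(class | parent descriptor values), learned as a Dirichlet posterior mean."""

    def __init__(self, classes, prior_count=1.0):
        self.classes = list(classes)
        self.prior_count = prior_count  # pseudo-count per class (assumed prior)
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, descriptor_rows, labels):
        """Accumulate one count per (parent configuration, class) observation."""
        for row, label in zip(descriptor_rows, labels):
            self.counts[tuple(row)][label] += 1.0
        return self

    def predict_proba(self, row):
        """Posterior mean of the class distribution for one parent configuration."""
        observed = self.counts.get(tuple(row), {})
        total = sum(observed.values()) + self.prior_count * len(self.classes)
        return {c: (observed.get(c, 0.0) + self.prior_count) / total
                for c in self.classes}

# Toy evidence: two binned shape descriptors per galaxy (hypothetical values).
evidence = [("round", "smooth")] * 4 + [("elongated", "clumpy")] * 2
labels = ["elliptical", "elliptical", "elliptical", "spiral", "spiral", "spiral"]

node = DiscreteClassNode(["elliptical", "spiral"]).fit(evidence, labels)
seen = node.predict_proba(("round", "smooth"))
unseen = node.predict_proba(("round", "clumpy"))
```

For a parent configuration that never appears in the evidence, the prediction falls back to the prior, which is one reason the amount of training evidence matters for networks with many parent combinations.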

Results

The Bayesian network model that was developed is shown in the figure below. It incorporates the shape descriptors as root nodes and the galaxy classes as leaf nodes. The model was given evidence to learn the network parameters and was then tested by comparing the predicted probability of the sixteen classes (marked by their numbers) against those from the network. The model was also tested against a neural network using the same comparison.

Bayesian network
Figure 5: Bayesian network structure used to identify galaxies from five shape descriptors

When compared to a neural network, the Bayesian network performed poorly in predicting the values of the test data regardless of the size of the dataset used to learn the parameters of the network. Implementing the inference engine requires specifying what domain knowledge should be included in the representation, what form the input concepts will take, and what form the output of the Bayesian network will take.

Downloads