- Table of contents
- Aim
- Data set
- Feature extraction
- Automated machine learning
- Knowledge representation and reasoning
- Downloads
Aim
The project tackles astronomical phenomenon classification by developing and examining specific components of a cognitive vision system. These components are a feature selection step that extracts a feature representation from a dataset, a machine learning approach driven by model selection and hyperparameter optimization techniques, and inference capabilities that draw on expert domain knowledge of astronomy to identify galaxies in optical images. Together, these components form a proposed three-stage pipeline. In future, the pipeline should be adaptable to the choice of dataset and able to process large volumes of data from astronomical sky surveys. In building such a system, the project moves towards the final objective of a cognitive vision system for astronomy. See Background and Dataset for more information.

Background and Dataset
Astronomy is a field that actively accumulates and processes vast amounts of data whose structure and information can only be extracted by computational means. Analysing this data requires automated techniques such as machine learning, both to reduce the data into consumable and useful knowledge and to perform analyses such as astronomical phenomenon classification.
To solve these problems, astronomy has been the focus of many statistical machine learning approaches, such as decision trees, k-means clustering, and simple multi-layer perceptrons for galaxy and supernova classification, and most recently of deep-learning approaches. It has also seen top-down expert knowledge representations applied to efficient dataset storage and retrieval. It is therefore a prime candidate field for a cognitive vision system (CVS) that can harness expert knowledge representation, feature selection, and machine learning techniques to perform tasks such as galaxy classification in optical astronomy.
The construction of a CVS often centers around developing and integrating/coupling various components that solve a given problem. In the case of astronomy, this means that techniques such as feature selection, various machine learning algorithms, and knowledge representation and inference may be applied for automated galaxy identification and classification, with the goal of data reduction and labeling to streamline astronomy observation workflows.
Dataset
The dataset used for this project is the optical galaxy image dataset made available by the Galaxy Zoo crowdsourcing initiative.
Feature extraction
Classification often relies on a good set of features to discriminate between classes. This work describes an approach to feature extraction and selection on galaxy images for the task of galaxy classification. The approach not only compares different feature extraction techniques for galaxy classification, but also allows other researchers to compare their own techniques against those explored here via feature selection.
The feature extraction and selection component used feature extraction techniques that are popular for galaxy classification in the literature, and developed a hybrid feature selection method that combines univariate and tree-based feature selection to pick the best-performing features for classification. An example of the feature extraction process can be seen in the figure below, where a galaxy in an image has been successfully contoured out.

Results
The feature extraction module extracted the roughly 3000 WND-CHARM features as well as an additional 9 shape descriptors. Feature selection was performed using three known feature selection methods (Random Forest, Pearson's correlation, and Recursive Feature Elimination) as well as a new hybrid feature selection algorithm.
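How the hybrid method combines its univariate and tree-based scores is not detailed here, so the following is only a minimal sketch of one plausible combination. It assumes the two rankings are merged by summing ranks. The univariate score is the absolute Pearson correlation with the label, and the tree-based score is approximated by the best Gini impurity reduction of a single threshold split (a decision stump standing in for a full tree ensemble). The feature names and values are hypothetical.

```python
import math

def pearson_abs(xs, ys):
    """Univariate score: absolute Pearson correlation between a feature and the label."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_x * var_y)

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def stump_gain(xs, ys):
    """Tree-style score: best impurity reduction from one threshold split."""
    parent = gini(ys)
    best = 0.0
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        best = max(best, parent - child)
    return best

def ranks(scores):
    """Map each feature name to its rank, where 0 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered)}

def hybrid_select(features, labels, k):
    """Keep the k features with the best summed rank (assumed combination rule)."""
    univariate = ranks({n: pearson_abs(col, labels) for n, col in features.items()})
    tree_based = ranks({n: stump_gain(col, labels) for n, col in features.items()})
    combined = {n: univariate[n] + tree_based[n] for n in features}
    return sorted(features, key=combined.get)[:k]

# Hypothetical toy data: two informative descriptors and one noise column.
features = {
    "ellipticity": [0.10, 0.20, 0.15, 0.80, 0.90, 0.85],
    "noise": [0.50, 0.10, 0.90, 0.40, 0.60, 0.20],
    "compactness": [0.30, 0.35, 0.20, 0.70, 0.60, 0.75],
}
labels = [0, 0, 0, 1, 1, 1]
selected = hybrid_select(features, labels, k=2)
```

Summing ranks rather than raw scores keeps the two criteria on a comparable scale. Other combinations, such as intersecting the top-k sets, would be equally reasonable.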

The features were then ranked using the respective selection algorithms and the softmax function was applied in order to normalise the values. An example of 10 features that were selected and their normalised scores is shown in the figure below. The hybrid selection algorithm delivered promising results in comparison to the other three algorithms.
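The softmax normalisation maps raw scores s_1, ..., s_n to exp(s_i) / sum_j exp(s_j), so the normalised values are positive and sum to one. A minimal sketch follows; the feature names and raw scores are hypothetical and are not taken from the project's results.

```python
import math

def softmax(scores):
    """Map raw scores to positive weights that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw importance scores from one selection algorithm.
raw_scores = {"feature_a": 2.0, "feature_b": 1.0, "feature_c": 0.1}
normalised = dict(zip(raw_scores, softmax(list(raw_scores.values()))))
```

Because softmax preserves ordering, the ranking of the features is unchanged. Only the scale becomes comparable across selection algorithms.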
Automated machine learning
Phenomenon classification in astronomy provides a novel and challenging setting for state-of-the-art techniques that address combined algorithm selection and hyperparameter optimization (CASH) for machine learning algorithms. Such techniques have local applications, for example at the data-intensive Square Kilometre Array (SKA). This work uses various CASH algorithms to explore whether, and how effectively, hyperparameter optimization improves the performance of machine learning techniques for astronomy.
Focusing on the Galaxy Zoo dataset, these algorithms were used to conduct an in-depth comparison of the state of the art in hyperparameter optimization (HPO), along with techniques that aim to improve performance on large datasets and with expensive function evaluations.
Finally, the feasibility of integrating this component into a cognitive vision system for astronomy is examined through a brief exploration of different feature extraction and selection methods.
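To make the CASH formulation concrete, the sketch below treats the choice of algorithm and its hyperparameters as a single joint search space and explores it with plain random search, the simplest baseline for this problem. The search space, the algorithm names, and the `toy_validation_loss` function are all hypothetical stand-ins. In practice the loss would come from training and validating a real model, and the algorithms evaluated in this work are more sophisticated than random search.

```python
import random

# Hypothetical joint search space: each algorithm carries its own hyperparameters.
SEARCH_SPACE = {
    "random_forest": {"n_trees": ("int", 10, 500), "max_depth": ("int", 2, 20)},
    "svm": {"log10_C": ("float", -3.0, 3.0)},
}

def sample_configuration(rng):
    """Draw an algorithm, then draw values for that algorithm's hyperparameters."""
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    params = {}
    for name, (kind, low, high) in SEARCH_SPACE[algorithm].items():
        params[name] = rng.randint(low, high) if kind == "int" else rng.uniform(low, high)
    return algorithm, params

def toy_validation_loss(algorithm, params):
    """Stand-in for training and validating a model; not a real benchmark."""
    if algorithm == "random_forest":
        return 0.20 + abs(params["max_depth"] - 12) / 100
    return 0.25 + abs(params["log10_C"] - 1.0) / 10

def random_search(n_trials, seed=0):
    """Return (loss, algorithm, params) for the best configuration sampled."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        algorithm, params = sample_configuration(rng)
        loss = toy_validation_loss(algorithm, params)
        if best is None or loss < best[0]:
            best = (loss, algorithm, params)
    return best

best_loss, best_algorithm, best_params = random_search(n_trials=50)
```

The key point is that the hyperparameters are conditional on the algorithm: `max_depth` only exists once `random_forest` has been chosen. This conditionality is what distinguishes CASH from tuning a single model.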
Results
Various HPO algorithms were evaluated on the Galaxy Zoo dataset. The results show that such algorithms can outperform state-of-the-art solutions, such as the published top-performing Kaggle solutions, and can increase efficiency on large datasets by training configurations on smaller sub-samples of the full dataset. The performance of these algorithms can be seen in the figure below.
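One well-known way to train configurations on smaller sub-samples is successive halving: evaluate many configurations on a small sample, keep the best fraction, and repeat with a larger sample. The sketch below illustrates that idea only and is not necessarily the method used in this work. The candidate configurations and the `toy_validation_loss` learning curve are hypothetical.

```python
import math

def toy_validation_loss(config, n_samples):
    """Stand-in learning curve: loss falls towards final_loss as data grows."""
    return config["final_loss"] + 1.0 / math.sqrt(n_samples)

def successive_halving(configs, min_samples, max_samples, eta=2):
    """Repeatedly keep the best 1/eta of configs while growing the sub-sample."""
    survivors = list(configs)
    n_samples = min_samples
    history = []  # (sub-sample size, number of configurations evaluated)
    while len(survivors) > 1:
        history.append((n_samples, len(survivors)))
        ranked = sorted(survivors, key=lambda c: toy_validation_loss(c, n_samples))
        survivors = ranked[: max(1, len(ranked) // eta)]
        n_samples = min(n_samples * eta, max_samples)
    return survivors[0], history

# Hypothetical candidates whose quality differs only in final_loss.
candidates = [{"name": f"config_{i}", "final_loss": 0.20 + 0.02 * i} for i in range(8)]
best, history = successive_halving(candidates, min_samples=100, max_samples=1600)
```

In this toy run only two of the eight configurations are ever trained on 400 samples, and none on the full 1600. This is where the efficiency gain comes from. With real learning curves, rankings on small samples can be misleading, so a good configuration may be discarded early.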
Knowledge representation and reasoning
The knowledge representation and reasoning component was implemented as a Bayesian network (BN) using the BayesPy Python library. The network incorporates shape descriptors in order to identify the galaxy type in terms of the Hubble sequence. It was trained on datasets of various sizes using variational Bayesian approximation and tested on over 10 thousand examples from the Galaxy Zoo dataset.
Differently sized sets of training evidence were used to investigate the behaviour and performance of variational Bayesian approximation as opposed to an exact inference method.
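As a rough illustration of learning network parameters from evidence, the sketch below estimates a conditional probability table P(class | shape descriptors) by smoothed counting. This is a deliberately simplified stand-in: it does not use BayesPy or variational inference, the descriptors are assumed to be discretised into hypothetical categories, and only two classes are shown.

```python
from collections import Counter, defaultdict

CLASSES = ["elliptical", "spiral"]  # hypothetical, reduced class set

def learn_cpt(evidence, alpha=1.0):
    """Estimate P(class | descriptors) from (descriptors, class) pairs.

    alpha is a Laplace smoothing count, so unseen descriptor combinations
    fall back to a uniform distribution instead of dividing by zero.
    """
    counts = defaultdict(Counter)
    for descriptors, galaxy_class in evidence:
        counts[descriptors][galaxy_class] += 1

    def predict(descriptors):
        seen = counts.get(descriptors, Counter())
        total = sum(seen.values()) + alpha * len(CLASSES)
        return {c: (seen[c] + alpha) / total for c in CLASSES}

    return predict

# Hypothetical training evidence: (discretised shape descriptors, class).
evidence = (
    [(("round", "smooth"), "elliptical")] * 3
    + [(("round", "smooth"), "spiral")] * 1
    + [(("elongated", "armed"), "spiral")] * 4
)
predict = learn_cpt(evidence)
round_smooth = predict(("round", "smooth"))
unseen = predict(("elongated", "smooth"))
```

With little evidence the smoothing term dominates and predictions stay close to uniform. As more evidence is added the estimates move towards the observed frequencies, which is the kind of behaviour that varying the training-set size is meant to expose.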
Results
The Bayesian network model developed can be seen in the figure below and incorporates the shape descriptors as root nodes and the galaxy classes as leaf nodes. This model was given evidence to learn the network parameters and was then tested by comparing the predicted probability of the sixteen classes (marked by their numbers) against those from the network. Furthermore, the model was also tested against a neural network by the same comparison.
When compared to a neural network, the Bayesian network performed poorly in predicting the values of the test data regardless of the size of the dataset used to learn the parameters of the network. Implementing the inference engine requires specifying what domain knowledge should be included in the representation, what form the input concepts will take, and what form the output of the Bayesian network will take.