The Hubble Space Telescope has been observing the cosmos for more than 35 years, amassing hundreds of thousands of observations. Each image was taken with a specific scientific goal, yet every exposure contains far more than its intended target: background galaxies, foreground objects and unexpected phenomena scattered across the field of view. Systematic human inspection of the millions of source cutouts in the Hubble Legacy Archive is impossible – but artificial intelligence has now uncovered more than a thousand astrophysical anomalies hiding in plain sight.
The challenge of identifying rare signals amid overwhelming backgrounds will resonate with CERN Courier readers. At the LHC, experiments increasingly deploy anomaly detection methods to search for new physics beyond the Standard Model without fully specifying the signal in advance. Both fields face a shared problem: isolating rare events from billions of observations with minimal prior assumptions about the target. “Semi-supervised” approaches that marry sparse expert knowledge with vast unlabelled datasets may prove as valuable for collider data as they have for astronomical archives.
A new semi-supervised machine-learning framework developed at the European Space Agency in December 2025 has identified 1339 unique astrophysical anomalies spanning 19 distinct morphological classes (see “Six out of 1339” figure). Approximately 65% of these – some 811 objects – had no prior reference in the scientific literature, despite residing in data that has been publicly available for years. Many of the new finds belong to classes with few known examples, including collisional ring galaxies, galaxy mergers, jellyfish galaxies and gravitational lenses, and are valuable additions to existing catalogues. Forty-three objects defied classification entirely and remain unidentified.
Semi-supervised learning
At the heart of this work lies a fundamental tension in modern astronomy: datasets are growing far faster than our ability to label them. Traditional supervised machine learning requires large, annotated training sets, but expert labelling of millions of images is prohibitively expensive. Semi-supervised learning offers a way forward. In this approach, a model learns simultaneously from a small set of human-labelled examples and a vastly larger pool of unlabelled data, extracting patterns from the abundant unlabelled images to compensate for the scarcity of annotations.
The new code we have developed generates provisional “pseudo-labels” when the model’s confidence exceeds a threshold, then enforces consistent predictions across augmented versions of the same images. The augmentations include cropping and flipping the images and inverting their pixel values. This allows the model to leverage the statistical structure of millions of unlabelled cutouts without requiring a human to inspect each one. The algorithm then couples this semi-supervised backbone with human expertise. After each training cycle, the model ranks all images by anomaly score and a domain expert reviews the highest-ranked candidates, correcting misclassifications and confirming genuine anomalies. These newly labelled images feed the next training cycle. This human-in-the-loop design combines the pattern-recognition capabilities of deep learning with the domain knowledge of an astronomer, achieving an efficiency that neither could match alone.
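The pseudo-labelling step can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the 0.95 confidence threshold and the split into a lightly augmented “weak” view and a heavily augmented “strong” view are assumptions borrowed from common semi-supervised recipes. The model’s prediction on the weak view becomes a pseudo-label only if it is confident enough, and the prediction on the strong view is then penalised for disagreeing with it.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label_loss(weak_logits, strong_logits, threshold=0.95):
    """Consistency loss on unlabelled images (threshold is an assumed value).

    weak_logits / strong_logits: one list of class scores per image, from
    two differently augmented views of the same image.
    """
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        probs = softmax(weak)
        confidence = max(probs)
        if confidence < threshold:
            continue  # prediction too uncertain: no pseudo-label, no loss
        label = probs.index(confidence)            # provisional pseudo-label
        total += -math.log(softmax(strong)[label])  # cross-entropy on strong view
    # Average over the whole batch, so uncertain images count as zero.
    return total / len(weak_logits) if weak_logits else 0.0

# Toy batch of two images and two classes ("normal", "anomaly").
weak = [[0.0, 5.0], [0.1, 0.0]]    # first image confident, second not
strong = [[0.0, 5.0], [3.0, 0.0]]
loss = pseudo_label_loss(weak, strong)
```

In this toy batch only the first image passes the threshold, so the second contributes nothing to the loss however much its two views disagree.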
In our study, the process began with a labelled set of 128 standard astrophysical phenomena and three anomalies of a type for which further examples would be valuable. The chosen anomalies were edge-on protoplanetary disks – young stellar objects in which a protoplanetary disk surrounds the host star, exhibiting strong emission with a direct high-energy jet and secondary emission in a striking butterfly shape. Through successive iterations, the training set grew to 1400 images, at which point the model could flag anomaly types it had never been shown.
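One iteration of such a human-in-the-loop cycle might look like the following sketch. The function `review_round`, the dictionary layout and the `top_k` parameter are hypothetical, and a simple callable stands in for the astronomer; the study’s actual interface is not described here.

```python
def review_round(scores, labelled, expert, top_k=3):
    """Send the highest-scoring unreviewed images to an expert.

    scores:   dict mapping image id -> anomaly score from the model
    labelled: dict mapping image id -> True (anomaly) / False (normal)
    expert:   callable returning the expert's verdict for an image id
    Returns a new labelled set that includes the expert's verdicts.
    """
    updated = dict(labelled)
    candidates = [image_id for image_id in scores if image_id not in updated]
    candidates.sort(key=lambda image_id: scores[image_id], reverse=True)
    for image_id in candidates[:top_k]:
        updated[image_id] = expert(image_id)  # confirm or correct the model
    return updated

# Toy example: four cutouts, one already labelled.
scores = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.7}
labelled = {"a": True}
verdicts = review_round(scores, labelled, expert=lambda i: i == "c", top_k=2)
# verdicts now also holds "c" (confirmed) and "d" (rejected); "b" is unreviewed.
```

Repeating this step after every training cycle is how a small seed set can grow into a larger one without anyone inspecting the full archive.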
Community access
A search of this scale was made possible by ESA Datalabs, a collaborative science platform that provides researchers with direct access to ESA’s mission archives alongside computational resources – including GPU acceleration – through a browser-based environment. Rather than downloading terabytes of Hubble data, we brought our analysis code to where the data already resides. The full inference run across 99.6 million images completed in just 2.5 days on a single GPU, demonstrating that large-scale anomaly detection does not require vast computational resources, a consideration that matters as the community increasingly weighs the sustainability of data-intensive research.
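A quick calculation puts the quoted figures in perspective. It assumes the 2.5 days is wall-clock time for the whole run, so the result is an average that includes data loading as well as the model itself.

```python
# Figures quoted in the text: 99.6 million cutouts, 2.5 days, one GPU.
n_images = 99.6e6
wall_time_s = 2.5 * 24 * 60 * 60  # 2.5 days in seconds

images_per_second = n_images / wall_time_s
ms_per_image = 1000.0 / images_per_second

print(f"{images_per_second:.0f} images per second")  # 461 images per second
print(f"{ms_per_image:.2f} ms per image")            # 2.17 ms per image
```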
The most abundant anomalies were galaxy mergers: 629 systems hosting tidal tails, bridges and other signatures of gravitational interactions that exist at the very limit of our detection power. We also found 140 candidate gravitational lenses and 39 gravitational arcs, where the warping of spacetime by a foreground mass distorts background sources into characteristic arcs and rings. Mergers give us snapshots of hierarchical structure formation, while gravitational lensing provides direct tests of general relativity and enables dark-matter mapping on cosmological scales.
The model also independently recovered five previously catalogued quadruply lensed quasars in the Einstein cross configuration – a fourfold splitting of a distant quasar’s light by a foreground galaxy. That the model identified these without any lensed quasars in its training set validates its ability to generalise beyond the anomaly types it was explicitly taught. Fewer than 50 such systems are known, and each enables an independent “late universe” measurement of the Hubble constant; such measurements are invaluable given the persistent tension between values derived from the cosmic microwave background and the local distance ladder (CERN Courier March/April 2025 p28).
Among the genuinely new discoveries were two collisional ring galaxies – systems in which an interaction was violent enough to drive a shockwave through the galaxy, triggering a burst of star formation as it passes. Thirty-five jellyfish galaxies, shaped by ram-pressure stripping in the intracluster medium, provide an excellent laboratory for studying how a galaxy’s environment affects its internal gas. Finally, 43 sources had morphologies that defied classification entirely – curved, distorted objects that fit none of the established categories and have been released to the community for further investigation.
With the Euclid space telescope now operational, and the Vera C. Rubin Observatory and Square Kilometre Array soon to follow, data volumes will dwarf Hubble’s archive by orders of magnitude. Our work shows that even decades-old data can yield hundreds of new discoveries when the right tools are brought to bear – and that AI-assisted discovery, guided by human expertise, is only just getting started.
Further reading
D O’Ryan and P Gómez 2025 Astronomy & Astrophysics 704 A227.