Inspired by high-dimensional data and the ideals of open science, high-energy physicists are using artificial intelligence to reimagine the statistical technique of ‘unfolding’.
All scientific measurements are affected by the limitations of measuring devices. To make a fair comparison between data and a scientific hypothesis, theoretical predictions must typically be smeared to approximate the known distortions of the detector. Data is then compared with theory at the level of the detector’s response. This works well for targeted measurements, but the detector simulation must be reapplied to the underlying physics model for every new hypothesis.
The alternative is to try to remove detector distortions from the data, and compare with theoretical predictions at the level of the theory. Once detector effects have been “unfolded” from the data, analysts can test any number of hypotheses without having to resimulate or re-estimate detector effects – a huge advantage for open science and data preservation that allows comparisons between datasets from different detectors. Physicists without access to the smearing functions can only use unfolded data.
No simple task
But unfolding detector distortions is no simple task. If the mathematical problem is solved through a straightforward inversion, using linear algebra, noisy fluctuations are amplified, resulting in large uncertainties. Some sort of “regularisation” must be imposed to smooth the fluctuations, but algorithms vary substantively and none is preeminent. Their scope has remained limited for decades. No traditional algorithm is capable of reliably unfolding detector distortions from data for more than a few observables at a time.
In the past few years, a new technique has emerged. Rather than unfolding detector effects from only one or two observables, it can unfold detector effects from multiple observables in a high-dimensional space; and rather than unfolding detector effects from binned histograms, it unfolds detector effects from an unbinned distribution of events. This technique is inspired by both artificial-intelligence techniques and the uniquely sparse and high-dimensional data sets of the LHC.
An ill-posed problem
Unfolding is used in many fields. Astronomers unfold point-spread functions to reveal true sky distributions. Medical physicists unfold detector distortions from CT and MRI scans. Geophysicists use unfolding to infer the Earth’s internal structure from seismic-wave data. Economists attempt to unfold the true distribution of opinions from incomplete survey samples. Engineers use deconvolution methods for noise reduction in signal processing. But in recent decades, no field has had a greater need to innovate unfolding techniques than high-energy physics, given its complex detectors, sparse datasets and stringent standards for statistical rigour.
In traditional unfolding algorithms, analysts first choose which quantity they are interested in measuring. An event generator then creates a histogram of the true values of this observable for a large sample of events in their detector. Next, a Monte Carlo simulation models the detector response, accounting for noise, background modelling, acceptance effects, reconstruction errors, misidentification errors and energy smearing. A matrix is constructed that transforms the histogram of the true values of the observable into the histogram of detector-level events. Finally, analysts “invert” the matrix and apply it to data, to unfold detector effects from the measurement.
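The matrix-building step can be sketched in a few lines of Python. This is a toy illustration, not the procedure of any particular experiment: the uniform truth spectrum, the Gaussian smearing of width 0.5 and the function names bin_index and response_matrix are all assumptions made for the example.

```python
import random

def bin_index(x, edges):
    """Return the index of the bin containing x, or None if x is out of range."""
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    return None

def response_matrix(truth, reco, edges):
    """Estimate R[j][i] = P(reconstructed in bin j | true value in bin i)."""
    n = len(edges) - 1
    counts = [[0.0] * n for _ in range(n)]
    truth_totals = [0.0] * n
    for t, r in zip(truth, reco):
        i = bin_index(t, edges)
        if i is None:
            continue
        truth_totals[i] += 1
        j = bin_index(r, edges)
        if j is not None:  # events smeared out of range count as lost (acceptance)
            counts[j][i] += 1
    return [[counts[j][i] / truth_totals[i] if truth_totals[i] else 0.0
             for i in range(n)] for j in range(n)]

# Toy "event generator" and "detector simulation" (both hypothetical)
rng = random.Random(1)
edges = [0.0, 1.0, 2.0, 3.0, 4.0]
truth = [rng.uniform(0.0, 4.0) for _ in range(20000)]
reco = [t + rng.gauss(0.0, 0.5) for t in truth]

R = response_matrix(truth, reco, edges)
# Each column sums to the efficiency of that truth bin (at most 1)
column_sums = [sum(R[j][i] for j in range(4)) for i in range(4)]
```

Each column of R describes where events from one truth bin end up after smearing. The edge columns sum to less than one because some events migrate out of the measured range, which is how acceptance losses enter the matrix.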
How to unfold traditionally
Diverse algorithms have been invented to unfold distortions from data, with none yet achieving preeminence.
• Developed by Soviet mathematician Andrey Tikhonov in the late 1940s, Tikhonov regularisation (TR) frames unfolding as a minimisation problem with a penalty term added to suppress fluctuations in the solution.
• In the 1950s, statistical physicist Edwin Jaynes took inspiration from information theory to seek solutions with maximum entropy, with the aim of minimising bias beyond the data constraints.
• Between the 1960s and the 1990s, high-energy physicists increasingly drew on the linear algebra of 19th-century mathematicians Eugenio Beltrami and Camille Jordan to develop singular value decomposition as a pragmatic way to suppress noisy fluctuations.
• In the 1990s, Giulio D’Agostini and other high-energy physicists developed iterative Bayesian unfolding (IBU), a technique similar to Lucy–Richardson deconvolution, which was developed independently in astronomy in the 1970s. An explicitly probabilistic approach well suited to complex detectors, IBU may be considered a forerunner of the neural-network-based technique described in this article.
IBU and TR are the most widely used approaches in high-energy physics today, with the RooUnfold tool started by Tim Adye serving countless analysts.
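For readers who prefer code, the iterative Bayesian update can be sketched as follows. This is a minimal pure-Python illustration of the standard D’Agostini-style iteration, not the RooUnfold implementation; the three-bin response matrix, the spectra and the function name are hypothetical.

```python
def iterative_bayesian_unfold(response, data, prior, n_iterations):
    """Repeatedly apply Bayes' theorem to redistribute data counts to truth bins.

    response[j][i] is P(reconstructed in bin j | true value in bin i).
    """
    n_reco, n_true = len(response), len(prior)
    efficiency = [sum(response[j][i] for j in range(n_reco)) for i in range(n_true)]
    estimate = list(prior)
    for _ in range(n_iterations):
        # Detector-level prediction from the current truth-level estimate
        folded = [sum(response[j][k] * estimate[k] for k in range(n_true))
                  for j in range(n_reco)]
        # Share each data bin among truth bins, then correct for efficiency
        estimate = [
            estimate[i] / efficiency[i]
            * sum(response[j][i] * data[j] / folded[j]
                  for j in range(n_reco) if folded[j] > 0)
            for i in range(n_true)
        ]
    return estimate

# Hypothetical toy: columns sum to 1, so no events are lost
response = [[0.8, 0.1, 0.0],
            [0.2, 0.8, 0.2],
            [0.0, 0.1, 0.8]]
true_spectrum = [100.0, 300.0, 200.0]
data = [110.0, 300.0, 190.0]      # response applied to true_spectrum, no noise
flat_prior = [200.0, 200.0, 200.0]

few = iterative_bayesian_unfold(response, data, flat_prior, 1)
many = iterative_bayesian_unfold(response, data, flat_prior, 200)
```

In this noise-free toy, one iteration stays close to the prior while many iterations approach the true spectrum. With real, fluctuating data the number of iterations acts as the regularisation strength: stopping early retains some bias towards the prior in exchange for smaller fluctuations.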
At this point in the analysis, the ill-posed nature of the problem presents a major challenge. A simple matrix inversion seldom suffices as statistical noise produces large changes in the estimated input. Several algorithms have been proposed to regularise these fluctuations. Each comes with caveats and constraints, and there is no consensus on a single method that outperforms the rest (see “How to unfold traditionally” panel).
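A two-bin toy makes the amplification concrete. The sketch below is illustrative only: the response matrix, the size of the fluctuation and the regularisation strength tau are hypothetical, and the penalty on the difference between the two bins is one simple choice of Tikhonov term among many.

```python
def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def tikhonov_unfold(R, y, tau):
    """Minimise |R x - y|^2 + tau * (x[0] - x[1])^2 via the normal equations."""
    RtR = [[sum(R[k][i] * R[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    Rty = [sum(R[k][i] * y[k] for k in range(2)) for i in range(2)]
    LtL = [[1.0, -1.0], [-1.0, 1.0]]  # from the difference operator L = [1, -1]
    A = [[RtR[i][j] + tau * LtL[i][j] for j in range(2)] for i in range(2)]
    return solve2(A, Rty)

R = [[0.6, 0.4],
     [0.4, 0.6]]               # 40% of events migrate to the other bin
truth = [100.0, 100.0]
measured = [110.0, 90.0]       # folded truth [100, 100] with a +/-10 fluctuation

naive = tikhonov_unfold(R, measured, tau=0.0)         # plain matrix inversion
regularised = tikhonov_unfold(R, measured, tau=0.03)  # penalised solution
```

With tau set to zero the solution is the plain inverse, and the 10% fluctuation in the measurement becomes a 50% fluctuation in the unfolded result, approximately [150, 50]. With tau = 0.03 the result is approximately [120, 80]. The price is bias: had the true spectrum really differed between the two bins, the same penalty would have pulled the answer towards a flat one.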
While these approaches have been successfully applied to thousands of measurements at the LHC and beyond, they have limitations. Histogramming is an efficient way to describe the distributions of one or two observables, but the number of bins grows exponentially with the number of observables, which restricts how many can be unfolded simultaneously. When unfolding only a few observables, model dependence can creep in, for example due to acceptance effects. And if another scientist wants to change the bin sizes or measure a different observable, they will have to redo the entire process.
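The scaling is easy to check with a little arithmetic. In the sketch below, the choice of 10 bins per observable and a sample of one million events are arbitrary assumptions.

```python
def n_bins(bins_per_observable, n_observables):
    """Total number of bins in a regular multi-dimensional histogram."""
    return bins_per_observable ** n_observables

n_events = 1_000_000  # hypothetical sample size
for n_observables in (1, 2, 3, 6):
    total = n_bins(10, n_observables)
    # Average occupancy falls by a factor of 10 with each extra observable
    print(n_observables, total, n_events / total)
```

At six observables the histogram already has as many bins as the sample has events, so most bins would be empty or hold a single event.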
New possibilities
AI opens up new possibilities for unfolding particle-physics data. Choosing good parameterisations in a high-dimensional space is difficult for humans, and binning is a way to limit the number of degrees of freedom in the problem, making it more tractable. Machine learning (ML) offers flexibility due to the large number of parameters in a deep neural network. Dozens of observables can be unfolded at once, and unfolded datasets can be published as an unbinned collection of individual events that have been corrected for detector distortions as an ensemble.
One way to represent the result is as a set of simulated events with weights that encode information from the data. For example, if there are 10 times as many simulated events as real events, the average weight would be about 0.1, with the distribution of weights correcting the simulation to match reality, and errors on the weights reflecting the uncertainties inherent in the unfolding process. This approach gives maximum flexibility to future analysts, who can recombine the weighted events into any binning or combination of observables they desire. The weights can be used to build histograms or compute statistics. The full covariance matrix can also be extracted from the weights, which is important for downstream fits.
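A sketch of how an analyst might use such a published sample is given below. The event values and weights are invented, and representing the uncertainty as an ensemble of alternative weight sets ("replicas") is an assumption made for the illustration rather than a description of any published format.

```python
def weighted_histogram(values, weights, edges):
    """Sum the event weights falling into each bin defined by edges."""
    counts = [0.0] * (len(edges) - 1)
    for x, w in zip(values, weights):
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += w
                break
    return counts

def bin_covariance(histograms):
    """Sample covariance between bins across an ensemble of histograms."""
    n = len(histograms)
    n_bins = len(histograms[0])
    mean = [sum(h[i] for h in histograms) / n for i in range(n_bins)]
    return [[sum((h[i] - mean[i]) * (h[j] - mean[j]) for h in histograms) / (n - 1)
             for j in range(n_bins)] for i in range(n_bins)]

# Ten hypothetical simulated events with an average weight of 0.1
values = [0.5, 1.5, 1.7, 2.2, 2.9, 3.1, 3.3, 0.9, 1.1, 2.5]
# Two hypothetical weight replicas; a real analysis would use many more
replicas = [
    [0.08, 0.12, 0.11, 0.10, 0.09, 0.10, 0.11, 0.09, 0.10, 0.10],
    [0.10, 0.10, 0.13, 0.08, 0.11, 0.09, 0.09, 0.11, 0.10, 0.09],
]

# The same events can be histogrammed in any binning after the fact
fine = [weighted_histogram(values, w, [0.0, 1.0, 2.0, 3.0, 4.0]) for w in replicas]
coarse = [weighted_histogram(values, w, [0.0, 2.0, 4.0]) for w in replicas]
covariance = bin_covariance(fine)
```

Because the binning is chosen only at this last step, changing it means re-running a few lines of code rather than repeating the unfolding.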
But how do we know the unfolded values are capturing the truth, and not just “hallucinations” from the AI model?
An important validation step for these analyses is to run tests on synthetic data with a known answer. Analysts take new simulation models, different from the one being used for the primary analysis, and treat them as if they were real data. By unfolding these alternative simulations, researchers are able to compare their results to a known answer. If the biases are large, analysts will need to refine their methods to reduce the model dependence. If the biases are small compared to the other uncertainties, then this remaining difference can be added into the total uncertainty estimate, which is calculated in the traditional way using hundreds of simulations. In unfolding problems, the choice of regularisation method and strength always involves some tradeoff between bias and variance.
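The logic of such a closure test can be sketched independently of the unfolding method. In the toy below, a deliberately naive bin-by-bin correction factor stands in for the real unfolding (it is not the method used in the analyses described here), and the response matrix, both spectra and the 5% threshold are hypothetical.

```python
def fold(response, truth):
    """Apply the detector response to a truth-level spectrum."""
    return [sum(row[i] * truth[i] for i in range(len(truth))) for row in response]

def bin_by_bin_unfold(data, nominal_truth, nominal_reco):
    """Naive stand-in: scale each data bin by the nominal truth/reco ratio."""
    return [d * t / r for d, t, r in zip(data, nominal_truth, nominal_reco)]

def relative_bias(unfolded, known_truth):
    """Fractional difference between the unfolded result and the known answer."""
    return [(u - t) / t for u, t in zip(unfolded, known_truth)]

response = [[0.8, 0.1, 0.0],
            [0.2, 0.8, 0.2],
            [0.0, 0.1, 0.8]]
nominal_truth = [200.0, 200.0, 200.0]      # model used to build the correction
alternative_truth = [100.0, 300.0, 200.0]  # different model, treated as "data"

nominal_reco = fold(response, nominal_truth)
pseudo_data = fold(response, alternative_truth)

unfolded = bin_by_bin_unfold(pseudo_data, nominal_truth, nominal_reco)
bias = relative_bias(unfolded, alternative_truth)

other_uncertainty = 0.05  # hypothetical size of the remaining uncertainties
closure_passes = all(abs(b) < other_uncertainty for b in bias)
```

Here the bias reaches about 22% in the first bin, so the test fails and the method would need refining. The same harness applies unchanged to any unfolding function, including a neural-network one.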
Just as unfolding in two dimensions instead of one with traditional methods can reduce model dependence by incorporating more aspects of the detector response, ML methods use the same underlying principle to include as much of the detector response as possible. Learning differences between data and simulation in high-dimensional spaces is the kind of task that ML excels at, and the results are competitive with established methods (see “Better performance” figure).
Neural learning
In the past few years, AI techniques have proven to be useful in practice, yielding publications from the LHC experiments, the H1 experiment at HERA and the STAR experiment at RHIC. The key idea underpinning the strategies used in each of these results is to use neural networks to learn a function that can reweight simulated events to look like data. The neural network is given a list of relevant features about an event, such as the masses, energies and momenta of reconstructed objects, and trained to output the probability that the event comes from the data rather than from a Monte Carlo simulation. Neural connections that reweight and combine the inputs across multiple layers are iteratively adjusted depending on the network’s performance. The network thereby learns the relative densities of the simulation and data throughout phase space. The ratio of these densities is used to transform the simulated distribution into one that more closely resembles real events (see “OmniFold” figure).
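The reweighting idea can be illustrated with a toy in which a one-feature logistic regression stands in for the neural network. This is only the classifier-reweighting building block, not the full OmniFold procedure; the Gaussian samples, sample sizes, learning rate and function names are all assumptions made for the sketch.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_classifier(sim, data, steps=300, learning_rate=1.0):
    """Fit p(data | x) = sigmoid(a + b * x) by full-batch gradient descent."""
    samples = [(x, 0.0) for x in sim] + [(x, 1.0) for x in data]
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, label in samples:
            residual = sigmoid(a + b * x) - label
            grad_a += residual
            grad_b += residual * x
        a -= learning_rate * grad_a / len(samples)
        b -= learning_rate * grad_b / len(samples)
    return a, b

# Hypothetical toy: the simulation is mismodelled by a shift of 0.5,
# and there are 10 times as many simulated events as data events
rng = random.Random(0)
sim = [rng.gauss(0.0, 1.0) for _ in range(5000)]
data = [rng.gauss(0.5, 1.0) for _ in range(500)]

a, b = train_classifier(sim, data)

# p / (1 - p) equals exp(a + b * x) for a sigmoid output
weights = [math.exp(a + b * x) for x in sim]

mean_weight = sum(weights) / len(weights)
sim_mean = sum(sim) / len(sim)
data_mean = sum(data) / len(data)
reweighted_mean = sum(w * x for w, x in zip(weights, sim)) / sum(weights)
```

The odds p/(1 − p) returned by the classifier estimate the data-to-simulation density ratio multiplied by the ratio of sample sizes, which is why the average weight comes out near 0.1 here. After reweighting, the mean of the simulated sample moves from about zero to close to the mean of the data.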
As this is a recently developed technique, there are plenty of opportunities for new developments and improvements. These strategies are in principle capable of handling significant levels of background subtraction as well as acceptance and efficiency effects, but existing LHC measurements using AI-based unfolding generally have small backgrounds. And as with traditional methods, there is a risk in trying to estimate too many parameters from not enough data. This is typically controlled by stopping the training of the neural network early, combining multiple trainings into a single result, and performing cross validations on different subsets of the data.
Beyond the “OmniFold” methods we are developing, an active community is also working on alternative techniques, including ones based on generative AI. Researchers are also considering creative new ways to use these unfolded results that aren’t possible with traditional methods. One possibility in development is unfolding not just a selection of observables, but the full event. Another intriguing direction could be to generate new events with the corrections learnt by the network built in. At present, the result of the unfolding is a reweighted set of simulated events, but once the neural network has been trained, its reweighting function could be used to simulate the unfolded sample from scratch, simplifying the output.
Further reading
A Andreassen et al. 2020 Phys. Rev. Lett. 124 182001.
H1 Collab. 2023 Phys. Lett. B 844 138101.
LHCb Collab. 2023 Phys. Rev. D 108 L031103.
CMS Collab. 2024 CMS-PAS-SMP-23-008.
ATLAS Collab. 2024 Phys. Rev. Lett. 133 261803.