Experts in data analysis, statistics and machine learning for physics came together from 9 to 12 September at Imperial College London for PHYSTAT’s Statistics meets Machine Learning workshop. The goal of the meeting, which is part of the PHYSTAT series, was to discuss recent developments in machine learning (ML) and their impact on the statistical data-analysis techniques used in particle physics and astronomy.
Particle-physics experiments typically produce large amounts of highly complex data, and extracting information about the properties of fundamental interactions from these data is a non-trivial task. The general availability of simulation frameworks makes it relatively straightforward to model the forward process of data analysis: to go from an analytically formulated theory of nature to a sample of simulated events that describes, in minute detail, the observation of that theory for a given particle collider and detector. The inverse process – inferring from a set of observed data what can be learned about a theory – is much harder, as the detector-level predictions are available only as “point clouds” of simulated events, rather than as the analytically formulated distributions that most statistical-inference methods need.
Traditionally, statistical techniques have found a variety of ways to deal with this problem, mostly centred on simplifying the data via summary statistics that can be modelled empirically in an analytical form. ML algorithms, from neural networks to boosted decision trees trained to classify events as signal- or background-like, have been used over the past 25 years to construct such summary statistics.
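As a minimal illustration of this idea, the sketch below trains a boosted decision tree on toy data to compress several observables into a single signal-versus-background score; the data and settings are invented for illustration, not taken from any experiment.

```python
# Sketch: a boosted decision tree as a learned summary statistic.
# Toy data; the four "observables" are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy "signal" and "background" events with a handful of observables.
n = 10_000
background = rng.normal(loc=0.0, scale=1.0, size=(n, 4))
signal = rng.normal(loc=0.5, scale=1.0, size=(n, 4))

X = np.vstack([signal, background])
y = np.concatenate([np.ones(n), np.zeros(n)])

clf = GradientBoostingClassifier().fit(X, y)

# The classifier output s(x) in [0, 1] compresses the observables into a
# single number that can be binned and modelled like any other observable.
summary_statistic = clf.predict_proba(X)[:, 1]
```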
The broader field of ML has developed very rapidly in recent years, moving from relatively straightforward models capable of describing a handful of observable quantities to neural models with advanced architectures such as normalising flows, diffusion models and transformers. These boast millions to billions of parameters, are potentially capable of describing hundreds to thousands of observables, and can now extract features from the data with order-of-magnitude better performance than traditional approaches.
New generation
These advances are driven by newly available computation strategies that calculate not only the learned functions but also their analytical derivatives with respect to all model parameters, greatly speeding up training times – particularly in combination with modern computing hardware such as graphics processing units (GPUs) that facilitate massively parallel calculations. This new generation of ML models offers great potential for novel uses in physics data analyses, but has not yet found its way into the mainstream of published physics results on a large scale. Nevertheless, the particle-physics community has made significant progress in learning the technology, and many new developments using it were shown at the workshop.
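The derivative machinery at the heart of these strategies is automatic differentiation, illustrated below with JAX; the tiny model and loss function are assumptions made for the sake of the example.

```python
# Sketch: automatic differentiation as provided by modern ML frameworks.
# JAX computes the exact gradient of a function with respect to all of
# its parameters in a single call; illustrative only.
import jax
import jax.numpy as jnp

def loss(params, x):
    # A tiny "model": a linear map followed by a quadratic loss.
    w, b = params
    prediction = jnp.dot(x, w) + b
    return jnp.sum(prediction ** 2)

params = (jnp.ones(3), 0.5)
x = jnp.arange(6.0).reshape(2, 3)

# Analytical derivatives with respect to every model parameter,
# obtained without any hand-written calculus.
grads = jax.grad(loss)(params, x)
```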
Many of these ML developments showcase the ability of modern ML architectures to learn multidimensional distributions from point-cloud training samples to a very good approximation, even when the number of dimensions is large, for example between 20 and 100.
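The sketch below illustrates the idea on a small scale: a minimal affine-coupling normalising flow, written in PyTorch, learns the density of a four-dimensional point cloud by maximum likelihood. The architecture and hyperparameters are illustrative only; real applications use far deeper flows and many more dimensions.

```python
# Sketch: learning a multidimensional density from a point cloud with a
# minimal affine-coupling normalising flow. Assumes an even number of
# dimensions; all settings are illustrative.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Transforms half of the dimensions conditioned on the other half."""
    def __init__(self, dim, flip):
        super().__init__()
        self.flip = flip
        half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(half, 64), nn.ReLU(), nn.Linear(64, 2 * (dim - half)))

    def forward(self, x):
        half = x.shape[1] // 2
        a, b = (x[:, half:], x[:, :half]) if self.flip else (x[:, :half], x[:, half:])
        scale, shift = self.net(a).chunk(2, dim=1)
        scale = torch.tanh(scale)          # keep the Jacobian well behaved
        z = b * torch.exp(scale) + shift   # affine transform of one half
        log_det = scale.sum(dim=1)         # log|det J| of the transform
        out = torch.cat([z, a] if self.flip else [a, z], dim=1)
        return out, log_det

class Flow(nn.Module):
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(dim, flip=i % 2 == 1) for i in range(n_layers)])
        self.base = torch.distributions.MultivariateNormal(
            torch.zeros(dim), torch.eye(dim))

    def log_prob(self, x):
        # Change of variables: log p(x) = log p_base(f(x)) + log|det df/dx|.
        log_det_total = 0.0
        for layer in self.layers:
            x, log_det = layer(x)
            log_det_total = log_det_total + log_det
        return self.base.log_prob(x) + log_det_total

# Toy point cloud: correlated four-dimensional events.
torch.manual_seed(0)
events = torch.randn(5000, 4) @ torch.tensor(
    [[1.0, 0.5, 0.0, 0.0], [0.0, 1.0, 0.3, 0.0],
     [0.0, 0.0, 1.0, 0.2], [0.0, 0.0, 0.0, 1.0]])

flow = Flow(dim=4)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for _ in range(500):                       # maximum-likelihood training
    opt.zero_grad()
    loss = -flow.log_prob(events).mean()
    loss.backward()
    opt.step()
```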
A prime use case of such ML models is an emerging statistical analysis strategy known as simulation-based inference (SBI), in which learned approximations of the probability density of signal and background over the full high-dimensional observable space are used, dispensing with summary statistics to simplify the data. Many examples were shown at the workshop, with applications ranging from particle physics to astronomy, pointing to significant improvements in sensitivity. Work is ongoing on procedures to model systematic uncertainties, and no published results in particle physics exist to date. Examples from astronomy showed that SBI can give results of comparable precision to the default Markov chain Monte Carlo approach for Bayesian computations, but with orders-of-magnitude faster computation times.
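One accessible flavour of SBI is neural ratio estimation, sketched below with a toy one-parameter simulator: a classifier trained to distinguish jointly drawn (θ, x) pairs from independently drawn ones approximates the likelihood-to-evidence ratio, which, combined with the prior, yields the posterior. The simulator and all settings are invented for illustration.

```python
# Sketch: simulation-based inference via the likelihood-ratio trick.
# A classifier separating (theta, x) pairs drawn jointly from pairs
# drawn independently approximates r(x|theta) = p(x|theta) / p(x).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

def simulator(theta):
    # Toy forward model: observable smeared around the parameter.
    return theta + rng.normal(scale=0.5, size=theta.shape)

n = 20_000
theta = rng.uniform(-2.0, 2.0, size=(n, 1))      # draws from the prior
x = simulator(theta)                             # joint pairs (theta, x)
theta_shuffled = rng.permutation(theta)          # marginal pairs

features = np.vstack([np.hstack([theta, x]),
                      np.hstack([theta_shuffled, x])])
labels = np.concatenate([np.ones(n), np.zeros(n)])

clf = MLPClassifier(hidden_layer_sizes=(64, 64),
                    max_iter=300).fit(features, labels)

def log_ratio(theta_grid, x_obs):
    # log r(x_obs | theta), up to the classifier's approximation error;
    # with a flat prior this traces the shape of the posterior.
    pairs = np.hstack([theta_grid,
                       np.repeat(x_obs, len(theta_grid), axis=0)])
    p = clf.predict_proba(pairs)[:, 1]
    return np.log(p) - np.log1p(-p)

x_obs = np.array([[0.8]])
theta_grid = np.linspace(-2, 2, 101).reshape(-1, 1)
posterior_logits = log_ratio(theta_grid, x_obs)
```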
Beyond binning
A commonly used alternative to full theory-parameter inference from observed data is known as deconvolution or unfolding. Here the goal is to publish intermediate results in a form where the detector response has been taken out, while stopping short of interpreting them in a particular theory framework. The classical approach requires estimating a response matrix that captures the smearing effect of the detector on a particular observable, and applying its inverse to obtain an estimate of the theory-level distribution. This approach is challenging and limited in scope, however, as the inversion is numerically unstable and requires a low-dimensional binning of the data. Results on several ML-based approaches were presented, which either model the unfolded distributions outright (the generative approach) or learn classifiers that reweight simulated samples (the discriminative approach). Both show very promising results, free of the binning and dimensionality limitations of the classical response-inversion approach.
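One step of the discriminative approach can be sketched in a few lines, in the spirit of methods such as OmniFold, with toy one-dimensional data: a classifier separates data from simulation at detector level, and the implied likelihood ratio reweights the simulated events and hence their matched truth-level counterparts.

```python
# Sketch: one reweighting step of the discriminative approach to
# unfolding. Toy distributions; illustrative only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)

n = 50_000
truth_sim = rng.exponential(scale=1.0, size=n)          # simulated truth level
reco_sim = truth_sim + rng.normal(scale=0.3, size=n)    # after detector smearing
truth_data = rng.exponential(scale=1.2, size=n)         # "nature" (unknown)
reco_data = truth_data + rng.normal(scale=0.3, size=n)  # observed data

X = np.concatenate([reco_data, reco_sim]).reshape(-1, 1)
y = np.concatenate([np.ones(n), np.zeros(n)])
clf = GradientBoostingClassifier().fit(X, y)

# Likelihood-ratio trick: w(x) = p_data(x) / p_sim(x).
p = clf.predict_proba(reco_sim.reshape(-1, 1))[:, 1]
weights = p / (1.0 - p)

# The same per-event weights, applied at truth level, give an unbinned
# estimate of the unfolded distribution, with no response-matrix inversion.
unfolded_hist, edges = np.histogram(truth_sim, bins=40, weights=weights)
```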
A third domain where ML is enabling great progress is anomaly searches, where an anomaly can be either a single observation that doesn’t fit the distribution (mostly in astronomy) or a collection of events that together don’t fit the distribution (mostly in particle physics). Several analyses highlighted both the power of ML models in such searches and the limits imposed by statistical theory: it is impossible to optimise sensitivity for single-event anomalies without knowing the outlier distribution, and unsupervised anomaly detectors require a semi-supervised statistical model to interpret ensembles of outliers.
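As a small illustration of the unsupervised single-event case, the sketch below scores toy events with an isolation forest; the caveats above apply, as the score cannot be optimal for every outlier distribution and an ensemble of flagged events still needs a statistical model to interpret.

```python
# Sketch: unsupervised single-event anomaly scoring with an isolation
# forest. Toy data; illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
bulk = rng.normal(size=(10_000, 5))            # "ordinary" events
outliers = rng.normal(loc=4.0, size=(20, 5))   # a small anomalous cluster
events = np.vstack([bulk, outliers])

detector = IsolationForest(random_state=0).fit(events)
scores = detector.score_samples(events)        # lower = more anomalous
candidates = events[scores < np.quantile(scores, 0.001)]
```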
A final application of machine-learned distributions that was much discussed is data augmentation – sampling a new, larger data sample from a learned distribution. If the synthetic sample is significantly larger than the training sample, its statistical power will be greater, but that power derives from the smooth interpolation of the model, which can introduce so-called inductive bias. The validity of the assumed smoothness depends on its realism in each particular setting, for which there is no generic validation strategy: using a generative model amounts to a tradeoff between bias and variance.
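The tradeoff can be made concrete with a toy example, sketched below: a Gaussian mixture stands in for the learned generative model, and the oversized synthetic sample owes its statistical power entirely to the mixture’s smooth, and possibly biased, interpolation.

```python
# Sketch: data augmentation with a generative model. A Gaussian mixture
# stands in for the learned density; all settings are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
train = rng.standard_t(df=5, size=(2_000, 2))   # true density is not a GMM

model = GaussianMixture(n_components=8, random_state=0).fit(train)
synthetic, _ = model.sample(100_000)            # 50x the training sample

# Any estimate computed on `synthetic` has small variance but inherits
# whatever bias the mixture's smooth interpolation introduced, e.g. in
# the tails of the distribution.
tail_fraction = np.mean(np.abs(synthetic).max(axis=1) > 4.0)
```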
Interpretable and explainable
Beyond the various novel applications of ML, there were lively discussions on the more fundamental aspects of artificial intelligence (AI), notably on the notion of, and need for, AI to be interpretable or explainable. Explainable AI aims to elucidate what input information was used and how important it was, but this goal has no unambiguous definition. The discussion on the need for explainability centres to a large extent on trust: would you trust a discovery if it is unclear what information the model used, and how? Can you convince peers of the validity of your result? The notion of interpretable AI goes beyond that. It is a quality often desired by scientists, since human knowledge resulting from AI-based science is generally wanted in interpretable form – for example as theories based on symmetries, or as structures that are simple, or “low-rank”. However, interpretability has no formal criteria, which makes it an impractical requirement. Beyond practicality there is also a fundamental point: why should nature be simple? Why should models that describe it be restricted to being interpretable? The almost philosophical nature of these questions made the discussion on interpretability one of the liveliest of the workshop, though for now it remains without conclusion.
For the longer term, several interesting developments are in the pipeline. In the design and training of new neural models, two techniques were shown to hold great promise. The first is the concept of foundation models: very large models that are pre-trained on very large datasets to learn generic features of the data. When these pre-trained generic models are fine-tuned to perform a specific task, they have been shown to outperform purpose-trained models for that same task. The second is encoding domain knowledge in the network: networks that have known symmetry principles encoded in the model can significantly outperform models trained generically on the same data.
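The fine-tuning step of the foundation-model recipe is sketched below in PyTorch, with an untrained stand-in for the large pre-trained backbone; all shapes and settings are illustrative assumptions.

```python
# Sketch: the foundation-model recipe of pre-train then fine-tune.
# A generic backbone is frozen and only a small task-specific head is
# retrained; the backbone here is an untrained stand-in.
import torch
import torch.nn as nn

backbone = nn.Sequential(            # stands in for a large pre-trained model
    nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False          # keep the generic features fixed

head = nn.Linear(256, 2)             # small task-specific classifier
model = nn.Sequential(backbone, head)

opt = torch.optim.Adam(head.parameters(), lr=1e-3)
x, y = torch.randn(512, 128), torch.randint(0, 2, (512,))
for _ in range(100):                 # fine-tuning loop on the new task
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()
```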
The evaluation of systematic effects is still mostly handled in the statistical post-processing step. Future ML techniques may integrate systematic uncertainties more fully, for example by reducing sensitivity to them through adversarial training or pivoting methods. Beyond that, future methods may also integrate the currently separate step of propagating systematic uncertainties (“learning the profiling”) into the training procedure. A truly global end-to-end optimisation of the full analysis chain may ultimately become feasible and computationally tractable for models that provide analytical derivatives.
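One such pivoting scheme, in the spirit of the adversarial approach of Louppe, Kagan and Cranmer, is sketched below on toy data: an adversary tries to predict the nuisance parameter from the classifier output, and the classifier is penalised whenever it succeeds. The coefficients and data are assumptions for illustration.

```python
# Sketch: reducing sensitivity to a nuisance parameter by adversarial
# training ("pivoting"). Toy data; all settings are illustrative.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
adversary = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

opt_c = torch.optim.Adam(classifier.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)

torch.manual_seed(0)
x = torch.randn(2048, 3)                 # observables
y = (x[:, 0] > 0).float().unsqueeze(1)   # signal/background labels
z = x[:, 2].unsqueeze(1)                 # nuisance parameter (e.g. a shift)

bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()
for _ in range(200):
    # 1) Adversary learns to predict the nuisance from the classifier output.
    opt_a.zero_grad()
    mse(adversary(classifier(x).detach()), z).backward()
    opt_a.step()
    # 2) Classifier trades classification power against insensitivity to z.
    opt_c.zero_grad()
    s = classifier(x)
    loss = bce(s, y) - 1.0 * mse(adversary(s), z)   # lambda = 1.0, assumed
    loss.backward()
    opt_c.step()
```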