Second Banff workshop: discoveries and statistics

At a time when the first discoveries at the LHC are expected, particle physicists, astrophysicists and statisticians met in Banff in July for a workshop on the link between claiming, or not claiming, the discovery of an object and various statistical effects. This was the second meeting of its kind in Banff, and the ninth in a series of "PHYSTAT" workshops and conferences launched at CERN in January 2000. Topics included model selection, with the difficulties created by systematic uncertainties, measurement sensitivity, and uncertainties in parton density functions. Overall, the workshop raised questions that call for further research and highlighted new tools for analysing and interpreting observations.

On 11–16 July, the Banff International Research Station in the Canadian Rockies hosted a workshop for high-energy physicists, astrophysicists and statisticians to debate statistical issues related to the significance of discovery claims. This was the second such meeting at Banff (CERN Courier November 2006 p34) and the ninth in a series of so-called "PHYSTAT" workshops and conferences that started at CERN in January 2000 (CERN Courier May 2000 p17). The latest meeting was organized by Richard Lockhart, a statistician from Simon Fraser University, together with two physicists, Louis Lyons of Imperial College and Oxford, and James Linnemann of Michigan State University.

The 39 participants, of whom 12 were statisticians, prepared for the workshop by studying a reading list compiled by the organizers and by trying their hand at three simulated search problems inspired by real data analyses in particle physics. These problems are collectively referred to as the "second Banff Challenge" and were put together by Wade Fisher of Michigan State University and Tom Junk of Fermilab.

Significant issues

Although the topic of discovery claims may seem rather specific, it intersects many difficult issues that physicists and statisticians have been struggling with over the years. Particularly prominent at the workshop were the topics of model selection, with the attendant difficulties caused by systematic uncertainties and the "look-elsewhere" effect; measurement sensitivity; and parton density function uncertainties. To bring everyone up to date on the terminology and main problems of searches, three introductory speakers surveyed the relevant aspects of their respective fields: Lyons for particle physics, Tom Loredo of Cornell University for astrophysics and Lockhart for statistics.

Bob Cousins of the University of California, Los Angeles, threw the question of significance into sharp relief by discussing a famous paradox in the statistics literature, originally noted by Harold Jeffreys and later developed by Dennis Lindley, both statisticians. The paradox demonstrates with a simple measurement example that it is possible for a frequentist significance test to reject a hypothesis, whereas a Bayesian analysis indicates evidence in favour of that hypothesis. Perhaps even more disturbing is that the frequentist and Bayesian answers scale differently with sample size (CERN Courier September 2007 p39). Although there is no clean solution to this paradox, it yields several important lessons about the pitfalls of testing hypotheses.
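The paradox can be illustrated numerically. The sketch below assumes a setup that the article does not spell out: n Gaussian measurements with known standard deviation sigma, a point null hypothesis theta = 0 and, under the alternative, a Gaussian prior of width tau on theta. The function names and the parameter values are illustrative assumptions only.

```python
import math

def bayes_factor_01(z, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0 (theta = 0) over H1 (theta ~ N(0, tau^2)).

    z is the observed sample mean in units of sigma / sqrt(n).
    Values above 1 favour H0. The prior width tau is an assumption.
    """
    r = n * tau**2 / sigma**2  # prior variance relative to the variance of the mean
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

def two_sided_p_value(z):
    """Frequentist two-sided p-value for a Gaussian z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hold the frequentist significance fixed at 3 sigma and grow the sample.
for n in (10, 1000, 1000000):
    print(n, two_sided_p_value(3.0), bayes_factor_01(3.0, n))
```

With the z-score held fixed, the p-value does not change, whereas the Bayes factor grows roughly as the square root of n. For a large enough sample it therefore favours the very hypothesis that the significance test rejects.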

One of these is that the current emphasis in high-energy physics on a universal "5σ" threshold for claiming discovery is without much foundation. Indeed, the evidence provided by a measurement against a hypothesis depends on the size of the data sample. In addition, the decision to reject a hypothesis is typically affected by one's prior belief in it. Thus one could argue, for example, that to claim observation of a phenomenon predicted by the Standard Model of elementary particles, it is not necessary to require the same level of evidence as for the discovery of new physics. Furthermore, as Roberto Trotta of Imperial College pointed out in his summary talk, the emphasis on 5σ is not practised in other fields, in particular cosmology. For example, Einstein's theory of gravity passed the test of Eddington's measurement of the deflection of light by the Sun with rather weak evidence when judged by today's standards.

Statistician David van Dyk, of the University of California, Irvine, came back to the 5σ issue in his summary talk, wondering if we are really worried about one false discovery claim in 3.5 million tries. His answer, based on discussions during the workshop, was that physicists are more concerned about systematic errors and the "look-elsewhere" effect (i.e. the effect by which the significance of an observation decreases because one has been looking in more than one place). According to van Dyk, the 5σ criterion is a way to sweep the real problem under the rug. His recommendation: "Honest frequentist error rates, or a calibrated Bayesian procedure."
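The figure of one false claim in 3.5 million tries follows from the one-sided Gaussian tail probability beyond 5σ. A minimal check, assuming the usual one-sided convention and using only the Python standard library (the function name is illustrative):

```python
import math

def one_sided_p_value(n_sigma):
    """Upper-tail probability of a standard Gaussian beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))

p5 = one_sided_p_value(5.0)
tries_per_false_claim = 1.0 / p5  # about 3.5 million
print(p5, tries_per_false_claim)
```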

Many workshop participants commented on the look-elsewhere effect. Taking this effect properly into account usually requires long and difficult numerical simulations, so that techniques to simplify or speed up the latter are eagerly sought. Eilam Gross, of the Weizmann Institute of Science, presented the work that he did on this subject with his student Ofer Vitells. Using computer studies and clever guesswork, they obtained a simple formula to correct significances for the look-elsewhere effect. In his summary talk, Luc Demortier of Rockefeller University showed how this formula could be derived rigorously from results published by statistician R B Davies in 1987. Statistician Jim Berger of Duke University explained that in the Bayesian paradigm the look-elsewhere effect is handled by a multiplicity adjustment: one assigns prior probabilities to the various hypotheses or models under consideration, and then averages over these.
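To get a feel for the size of the effect, the sketch below uses the crudest possible approximation: it treats the search as a number of independent windows, each with the same local p-value. This is not the Gross–Vitells formula or the Davies result mentioned above, and the function name and the number of windows are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def global_significance(local_sigma, n_windows):
    """Return (global p-value, global significance) for independent windows."""
    p_local = 0.5 * math.erfc(local_sigma / math.sqrt(2.0))
    # 1 - (1 - p_local)**n_windows, written to stay accurate for tiny p_local
    p_global = -math.expm1(n_windows * math.log1p(-p_local))
    z_global = -NormalDist().inv_cdf(p_global)
    return p_global, z_global

p_global, z_global = global_significance(3.0, 100)
print(p_global, z_global)
```

Under this assumption, a local 3σ excess found after looking in 100 places corresponds to a global significance of only a little over 1σ.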

Likelihoods and measurement sensitivity

Systematic uncertainties, the second "worry" mentioned by van Dyk, also came under discussion several times. From a statistical point of view, these uncertainties typically appear in the form of "nuisance parameters" in the physics model, for example a detector energy scale. Glen Cowan, of Royal Holloway, University of London, described a set of procedures for searching for new physics, in which nuisance parameters are eliminated by maximizing them out of the likelihood function, thus yielding the so-called "profile likelihood". An alternative treatment of these parameters is to elicit a prior density for them and integrate the likelihood weighted by this density; the resulting marginal likelihood was shown by Loredo to take better account of parameter uncertainties in some unusual situations.
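The two ways of eliminating a nuisance parameter can be compared on a toy problem. The sketch below assumes a counting experiment that is not taken from the talks: n events observed in a signal region with mean s + b, and m events in a control region with mean tau * b, where the background b is the nuisance parameter. The grids, the flat prior on b and all names are illustrative assumptions.

```python
import math

B_STEP = 0.05
B_GRID = [B_STEP * k for k in range(1, 801)]  # background values 0.05 .. 40
S_GRID = [0.1 * k for k in range(0, 101)]     # signal values 0 .. 10

def log_likelihood(s, b, n, m, tau):
    """Log-likelihood, constants dropped, for n ~ Poisson(s + b), m ~ Poisson(tau * b)."""
    return n * math.log(s + b) - (s + b) + m * math.log(tau * b) - tau * b

def profile_log_likelihood(s, n, m, tau):
    """Eliminate b by maximizing the likelihood over it."""
    return max(log_likelihood(s, b, n, m, tau) for b in B_GRID)

def marginal_log_likelihood(s, n, m, tau):
    """Eliminate b by integrating over it with a flat prior (rectangle rule)."""
    logs = [log_likelihood(s, b, n, m, tau) for b in B_GRID]
    peak = max(logs)  # factor out the peak to avoid overflow
    return peak + math.log(B_STEP * sum(math.exp(v - peak) for v in logs))

def best_signal(log_like, n, m, tau):
    """Grid value of s that maximizes the given log-likelihood."""
    return max(S_GRID, key=lambda s: log_like(s, n, m, tau))

s_profile = best_signal(profile_log_likelihood, 12, 30, 3.0)
s_marginal = best_signal(marginal_log_likelihood, 12, 30, 3.0)
print(s_profile, s_marginal)
```

In a well-behaved case such as this one, the two treatments give nearly the same answer; the differences discussed at the workshop arise in less regular situations.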

While the marginal likelihood is essentially a Bayesian construct, some statisticians have advocated combining a Bayesian handling of nuisance parameters with a frequentist handling of parameters of interest. Kyle Cranmer of New York University showed how this hybrid approach could be implemented in general within the framework of the RooFit/RooStats extension of CERN's ROOT package. Unfortunately, systematic effects are not always identified at the beginning of an analysis. Henrique Araújo of Imperial College illustrated this with a search for weakly interacting massive particles that was conducted blindly until the discovery of an unforeseen systematic bias. The analysis had to be redone after taking this bias into account – and was no longer completely blind.

In searches for new physics, the opposite of claiming discovery of a new object is excluding that it was produced at a rate high enough to be detected. This can be quantified with the help of a confidence limit statement. For example, if we fail to observe a Higgs boson of given mass, we can state with a pre-specified level of confidence that its rate of production must be lower than some upper limit. Such a statement is useful to constrain theoretical models and to set the design parameters of the next search and/or the next detector. Therefore, in calculating upper limits, it is of crucial importance to take into account the finite resolution of the measuring apparatus.

How exactly to do this is far from trivial. Bill Murray of Rutherford Appleton Laboratory reviewed how the collaborations at the Large Electron–Positron collider solved this problem with a method known as CLs. He concluded that although this method works for the simplest counting experiment, it does not behave as desired in other cases. Murray recommended taking a closer look at an approach suggested by ATLAS collaborators Gross, Cowan and Cranmer, in which the calculated upper limit is replaced by a sensitivity bound whenever the latter is larger. Interestingly, van Dyk and collaborators had recently (and independently) recommended a somewhat similar approach in astrophysics.
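For the simplest counting experiment, the method that Murray reviewed, usually written CLs, divides the probability of observing at most the observed count under signal plus background by the same probability under background alone. A minimal sketch follows, assuming a known background b and a single Poisson count; the function names and the bisection search are illustrative choices, not the LEP implementation.

```python
import math

def poisson_cdf(n_obs, mean):
    """P(N <= n_obs) for a Poisson variable with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, n_obs + 1):
        term *= mean / k
        total += term
    return total

def cls(s, b, n_obs):
    """CLs ratio for signal s, known background b and observed count n_obs."""
    return poisson_cdf(n_obs, s + b) / poisson_cdf(n_obs, b)

def cls_upper_limit(b, n_obs, cl=0.95):
    """Signal value at which CLs falls to 1 - cl, found by bisection."""
    lo, hi = 0.0, 1.0
    while cls(hi, b, n_obs) > 1.0 - cl:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cls(mid, b, n_obs) > 1.0 - cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(cls_upper_limit(3.0, 0), cls_upper_limit(3.0, 5))
```

With zero observed events the ratio reduces to exp(-s) whatever the background, so the 95% limit is -ln(0.05), about 3.0 signal events. The limit therefore cannot become arbitrarily small when the observed count fluctuates below the expected background.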

Parton density uncertainties

As Lyons pointed out in his introductory talk, parton distribution functions (PDFs) are crucial for predicting particle-production rates, and their uncertainties affect the background estimates used in significance calculations in searches for new physics. It is therefore important to understand how these uncertainties are obtained and how reliable they are. John Pumplin of Michigan State University and Robert Thorne of University College London reviewed the state of the art in PDF fits. These fits use about 35 experimental datasets, with a total of approximately 3000 data points. A typical parametrization of the PDFs involves 25 floating parameters, and the fit quality is determined by a sum of squared residuals. Although individual datasets exhibit good fit quality, they tend to be inconsistent with the rest of the datasets. As a result, the usual rule for determining parameter uncertainties (Δχ² = 1) is inappropriate, as Thorne illustrated with measurements of the production rate of W bosons.

The solution proposed by PDF fitters is to determine parameter uncertainties using a looser rule, such as Δχ² = 50. Unfortunately, there is no statistical justification for such a rule. It clearly indicates that the assumption of Gaussian statistics badly underestimates the uncertainties, but it is not yet understood whether this is the result of unreported systematic errors in the data, systematic errors in the theory or the choice of PDF parametrization.
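The effect of the two rules can be seen in a one-parameter toy fit. The sketch below fits a single constant to four mutually inconsistent measurements; the data values are invented for illustration and have nothing to do with real PDF fits. For a parabolic chi-squared, the half-width of the interval scales as the square root of the chosen Δχ².

```python
import math

def weighted_mean_fit(values, errors):
    """Least-squares fit of one constant: returns (best value, total weight, chi2_min)."""
    weights = [1.0 / e**2 for e in errors]
    total = sum(weights)
    best = sum(w * v for w, v in zip(weights, values)) / total
    chi2_min = sum(w * (v - best)**2 for w, v in zip(weights, values))
    return best, total, chi2_min

def half_width(total_weight, delta_chi2):
    """Interval half-width, from chi2(theta) = chi2_min + total_weight * (theta - best)**2."""
    return math.sqrt(delta_chi2 / total_weight)

# Invented toy measurements that disagree by much more than their quoted errors.
values = [1.0, 1.6, 0.3, 2.2]
errors = [0.1, 0.1, 0.2, 0.2]

best, total_weight, chi2_min = weighted_mean_fit(values, errors)
print(best, chi2_min / (len(values) - 1))
print(half_width(total_weight, 1.0), half_width(total_weight, 50.0))
```

In this toy case, the chi-squared per degree of freedom is far above one, which signals the inconsistency, and the Δχ² = 50 interval is wider than the Δχ² = 1 interval by a factor of √50, roughly 7.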

Statistician Steffen Lauritzen of the University of Oxford proposed a random-effects model to separate the experimental variability of the individual datasets from the variance arising from systematic differences. The idea is to assume that the theory parameter is slightly different for each dataset and that all of these individual parameters are constrained to the formal parameter of the theory via some distributional assumptions (a multivariate t prior, for example). Another suggestion was to perform a "closure test", i.e. to check to what extent one could reproduce the PDF uncertainties by repeatedly fluctuating the individual data points by their uncertainties before fitting them.
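The closure test can be sketched on a toy fit of a single constant. The data values, the number of replicas and the function names below are assumptions made for illustration; a real PDF fit has many parameters and far more data points.

```python
import random
import statistics

def weighted_mean(values, errors):
    """Least-squares estimate of a single constant."""
    weights = [1.0 / e**2 for e in errors]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def closure_test(values, errors, n_replicas=4000, seed=1):
    """Fluctuate each point by its quoted error, refit, and return
    (spread of the refitted values, formal fit uncertainty)."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_replicas):
        replica = [rng.gauss(v, e) for v, e in zip(values, errors)]
        fits.append(weighted_mean(replica, errors))
    formal = sum(1.0 / e**2 for e in errors) ** -0.5
    return statistics.stdev(fits), formal

# Invented toy measurements, as an example only.
spread, formal = closure_test([1.0, 1.6, 0.3, 2.2], [0.1, 0.1, 0.2, 0.2])
print(spread, formal)
```

In this toy case, the spread of the replica fits reproduces only the formal uncertainty implied by the quoted errors. Any larger uncertainty assigned to the fit would therefore have to come from effects that the quoted errors do not describe.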

In addition to raising issues that require further thought, the workshop provided an opportunity to discuss the potential usefulness of statistical techniques that are not well known in the physics community. Chad Schafer of Carnegie Mellon University presented an approach to constructing confidence regions and testing hypotheses that is optimal with respect to a user-defined performance criterion. This approach is based on statistical decision theory and is therefore general: it can be applied to complex models without relying on the usual asymptotic approximations. Schafer described how such an approach could help solve the Banff Challenge problems and quantify the uncertainty in estimates of the parton densities.

Harrison Prosper of Florida State University criticized the all-too-frequent use of flat priors in Bayesian analyses in high-energy physics, and proposed that these priors be replaced by the so-called "reference priors" developed by statisticians José Bernardo, Jim Berger and Dongchu Sun over the past 30 years. Reference priors have several properties that should make them attractive to physicists; in particular their definition is very general, they are covariant under parameter transformations and they have good frequentist sampling behaviour. Jeff Scargle, of NASA's Ames Research Center, dispelled some old myths about data binning and described an optimal data-segmentation algorithm known as "Bayesian blocks", which he applied to the Banff Challenge problems. Finally, statistician Michael Woodroofe of the University of Michigan presented an importance-sampling algorithm to calculate significances under nonasymptotic conditions. This algorithm can be generalized to cases involving a look-elsewhere effect.
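The general idea behind importance sampling for very small tail probabilities can be shown with a textbook example. The sketch below is not Woodroofe's algorithm, whose details are not given here; it estimates a Gaussian tail probability by sampling from a proposal shifted to the threshold and reweighting. The function name, sample size and seed are assumptions.

```python
import math
import random

def tail_probability_importance(threshold, n_samples=50000, seed=7):
    """Estimate P(Z > threshold) for a standard Gaussian Z by importance sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # proposal centred on the threshold
        if x > threshold:
            # weight = N(x; 0, 1) / N(x; threshold, 1)
            total += math.exp(-threshold * x + 0.5 * threshold**2)
    return total / n_samples

estimate = tail_probability_importance(5.0)
exact = 0.5 * math.erfc(5.0 / math.sqrt(2.0))
print(estimate, exact)
```

Plain Monte Carlo would need many millions of samples to see even a handful of 5σ fluctuations, whereas the shifted proposal places about half of the samples in the tail.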

After the meeting, many participants expressed their enthusiasm for the workshop, which raised issues that need further research and pointed to new tools for analysing and interpreting observations. The discussions between sessions provided a welcome opportunity to deepen understanding of some topics and exchange ideas. That the meeting took place in the magical surroundings of the Banff National Park could only help its positive effect.

Further reading

The most recent PHYSTAT conference was held at CERN in 2007; see http://phystat-lhc.web.cern.ch/phystat-lhc/. Links to the earlier meetings can be found at www.physics.ox.ac.uk/phystat05/reading.htm. Details about the 2010 Banff meeting are available at www.birs.ca/events/2010/5-day-workshops/10w5068.