The Large Hadron Collider (LHC) experiments will generate huge amounts of data, and special methods will be needed to analyse them efficiently. No single computer will be able to crunch the data in a reasonable amount of time.

The only way to cut analysis time to an acceptable level is to use parallelism. Depending on the amount of data, parallelism can be employed on multicore laptops, local clusters or global distributed clusters, i.e. the Grid. The challenge will be to provide this parallelism transparently so that users won't have to worry about how to access all of these resources in parallel.

The ROOT framework

Since the mid-1990s, the ROOT system has been developed to offer the functionality needed to handle and analyse large amounts of data. ROOT, which is an object-oriented framework, offers an extensive set of tools for data storage, analysis and presentation. The data are defined as objects, and special storage methods are used to get direct access to the separate attributes of the selected objects, without having to touch the bulk of the data. The system was designed to query its databases in parallel on parallel-processing machines or on computer clusters.
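
A minimal sketch of this storage model is shown below (the file, tree and branch names are invented for illustration): events are written to a ROOT tree with two branches, and on reading only the branch of interest is enabled, so the rest of the data on disk is never touched.

#include "TFile.h"
#include "TTree.h"
#include "TRandom.h"
#include <cstdio>

void events_demo()
{
   // Write: one tree with two independent branches per event.
   TFile *fout = TFile::Open("events.root", "RECREATE");
   TTree *tree = new TTree("Events", "toy events");
   Float_t energy = 0, position[3] = {0, 0, 0};
   tree->Branch("energy", &energy, "energy/F");
   tree->Branch("position", position, "position[3]/F");
   for (Int_t i = 0; i < 100000; ++i) {
      energy = gRandom->Exp(10.);
      for (Int_t j = 0; j < 3; ++j) position[j] = gRandom->Gaus();
      tree->Fill();
   }
   fout->Write();
   fout->Close();

   // Read: enable only the "energy" branch; "position" stays on disk.
   TFile *fin = TFile::Open("events.root");
   TTree *t = (TTree *) fin->Get("Events");
   t->SetBranchStatus("*", 0);
   t->SetBranchStatus("energy", 1);
   t->SetBranchAddress("energy", &energy);
   Double_t sum = 0;
   for (Long64_t i = 0; i < t->GetEntries(); ++i) {
      t->GetEntry(i);
      sum += energy;
   }
   printf("mean energy: %g\n", sum / t->GetEntries());
   fin->Close();
}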

ROOT is the preferred data analysis environment for the LHC experiments, so the ROOT team has increased its efforts to provide a system that will give efficient access to many computers in parallel.

PROOF provides transparency

The Parallel ROOT Facility (PROOF) enables distributed data sets to be analysed interactively in a transparent way. It exploits the inherent parallelism in data from uncorrelated events via a multi-tier architecture that optimizes I/O and CPU use in heterogeneous clusters with distributed storage. Being part of the ROOT framework, PROOF inherits the benefits of ROOT's high-performance object-oriented storage system and of its statistical and visualization tools.

PROOF requires the analysis to be written as "selectors", which contain the user code that is called by the PROOF worker nodes. PROOF opens the ROOT files that are to be analysed and tries to optimize the data-access patterns so that workers read data locally or via the high-performance xrootd file servers. Since PROOF selectors are typically I/O bound, the efficiency of the data-access infrastructure largely determines the efficiency of the system.
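
A hedged sketch of a minimal hand-written selector is shown below; the tree and branch names and the histogram are invented for this example, and in practice the skeleton is usually generated with TTree::MakeSelector. The histogram is booked on each worker and added to the selector's output list, which PROOF merges and sends back to the client.

#include "TSelector.h"
#include "TTree.h"
#include "TH1F.h"

class EnergySelector : public TSelector {
public:
   TTree  *fChain;    // tree (or chain) being processed
   TH1F   *fHist;     // result histogram, merged across workers by PROOF
   Float_t fEnergy;

   EnergySelector() : fChain(0), fHist(0), fEnergy(0) {}

   virtual void Init(TTree *tree) {
      // Called each time a new tree is attached: set the branch addresses.
      fChain = tree;
      fChain->SetBranchAddress("energy", &fEnergy);
   }
   virtual void SlaveBegin(TTree *) {
      // Called on each worker: book the output objects.
      fHist = new TH1F("hEnergy", "energy;E [GeV];events", 100, 0., 100.);
      fOutput->Add(fHist);
   }
   virtual Bool_t Process(Long64_t entry) {
      // Called for each event on the worker that holds the data.
      fChain->GetTree()->GetEntry(entry);
      fHist->Fill(fEnergy);
      return kTRUE;
   }
   virtual void Terminate() {
      // Back on the client: pick up the merged histogram and draw it.
      fHist = (TH1F *) fOutput->FindObject("hEnergy");
      if (fHist) fHist->Draw();
   }
   virtual Int_t Version() const { return 2; }

   ClassDef(EnergySelector, 0);
};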

Selectors are a universal way of processing ROOT files and can be run transparently in a local ROOT session or on a PROOF cluster. Selectors can be interpreted with CINT or compiled on the fly with the ACLiC mechanism (ROOT's transparent C++ compiler interface). The user code in the selectors can access experiment or user shared libraries, which can be uploaded, compiled and installed with the PROOF package manager. This mechanism enables PROOF cluster nodes to have a different architecture, operating system or compiler from the platform on which the user developed the code.
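
For instance, the hypothetical selector sketched above could be run with essentially the same commands in a local session and on a PROOF cluster (the input files and the master name are placeholders):

#include "TChain.h"
#include "TProof.h"

void run_selector()
{
   TChain chain("Events");
   chain.Add("events*.root");               // placeholder input files

   // Local run: the trailing "+" asks ACLiC to compile the selector.
   chain.Process("EnergySelector.C+");

   // PROOF run: open a session and tell the chain to use it.
   TProof::Open("proofmaster.example.org"); // placeholder master
   chain.SetProof();
   chain.Process("EnergySelector.C+");
}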

Suitable for long-running jobs

Initially PROOF was developed purely for interactive analysis, with queries lasting no more than several tens of minutes and operating in synchronous mode - i.e. the lifetime of a PROOF session was determined by that of the client session. We have recently extended PROOF to support the "interactive batch" mode, in which a user can close the local client session while the PROOF query continues to run on the remote cluster. The user can later reconnect to the PROOF cluster to inspect progress or get the results of finished queries.
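
A sketch of this "interactive batch" workflow is given below, assuming the TProof client calls (Open, ShowQueries, Retrieve) and the asynchronous "ASYN" processing option; the master name, input file and query number are placeholders.

#include "TChain.h"
#include "TProof.h"

void submit_and_reconnect()
{
   TProof *p = TProof::Open("proofmaster.example.org");        // placeholder master

   TChain chain("Events");
   chain.Add("root://storage.example.org//data/events.root");  // placeholder file
   chain.SetProof();

   // Submit asynchronously: the call returns at once and the query keeps
   // running on the cluster after the local client session is closed.
   chain.Process("EnergySelector.C+", "ASYN");

   // ...later, from a new client session: reconnect, list the queries and
   // retrieve the output of a finished one.
   p = TProof::Open("proofmaster.example.org");
   p->ShowQueries();
   p->Retrieve(1);
}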

The advantage of using PROOF for long-running jobs (instead of splitting the analysis into multiple batch jobs, each running over part of a data set) is that the selector and analysis model does not change with the length of the query or the size of the data set. This also makes more dynamic and interactive monitoring of the intermediate results possible.

PROOF can also be distributed over several sites if data sets span more than one site. In this case, PROOF has to interoperate with the Grid file and metadata catalogues, and with the Grid authentication and job-submission systems.

Typically, a user will query the file catalogues for a data set to be analysed. The location of the files in the data set determines where the PROOF worker nodes have to be started by the Grid job-submission system. Once the PROOF worker agents are started, the user's client session connects to the master and the queries can be executed (see figure). A prototype of this model, which combines AliEn (the ALICE file catalogue and job-submission system) with PROOF, was demonstrated at the 2005 Supercomputing conference.

This year the ROOT team will improve PROOF and help the LHC experiments to integrate it into their analysis and Grid environments.

Further information
• ROOT - http://root.cern.ch;
• xrootd - http://xrootd.slac.stanford.edu;
• CINT - http://root.cern.ch/root/Cint.html;
• AliEn - http://alien.cern.ch.