Today the mainstream use of Grids resembles a large batch system: the goal is to maximize the computational throughput over long periods of time.
This fits many applications, in particular large data productions of the Large Hadron Collider (LHC) experiments, where the production manager puts thousands of jobs into the system and after several days they come out with the result. However, this model does not support other scenarios well. For example, in an interactive analysis the response of the system should be much faster and aligned with the activity of the user; and life-science applications often involve short-deadline jobs, that is many short jobs that must finish within a certain time limit. In general, such quality of service (QoS) characteristics are not present in today's Grid systems.
The scale and complexity of the Grid also has implications. The LHC Computing Grid/Enabling Grids for E-sciencE (LCG/EGEE) is the world's largest Grid system to date, comprising more than 20,000 worker nodes, some 200 computing sites and petabytes of storage. Such an impressive enterprise, which connects heterogeneous computing environments and organizations, comes at a cost: from the end-user's perspective, tracking problems can be time consuming and the system may sometimes be less efficient.
The DIANE project
User-level scheduling is a light software technique that enables new capabilities to be added and QoS characteristics and reliability to be improved, on top of the existing Grid middleware and infrastructure.
DIANE (DIstributed ANalysis Environment) is such a tool. It is an R&D project for parallel scientific applications in the master–worker model that was started at CERN in 2001. At the beginning the target was to investigate distributed ntuple analysis for particle physics. However, with time DIANE has become an application-independent user-scheduling tool on the Grid (see http://cern.ch/diane). It has been interfaced to a number of applications in high-energy physics, medical physics, life sciences and other fields.
DIANE is a python framework based on a master–worker processing model that is used on top of regular Grid middleware in a transparent way. Worker agents are sent to the Grid as regular Grid jobs. By opening a TCP/IP connection they register to the master agent that runs on the user's desktop computer and is the coordination point for the virtual worker pool. Workers may dynamically join and leave the pool, without disrupting the processing as a whole. The units of computation are many short tasks, which the master allocates to workers directly, bypassing the middleware scheduling layer.
This makes it possible to reduce the total job turn-around time and to react much faster to errors in task execution by reallocating them to other workers. Splitting the processing into many fine-grained tasks improves the load balancing and ensures that the workers are used efficiently. As a result the computing resources may be returned to the Grid faster, because worker agents are automatically terminated when the processing reaches its end.
DIANE's python framework enables existing applications to be integrated quickly, even those as complex as Athena, the analysis framework of the ATLAS experiment. Studies performed by members of the ATLAS collaboration showed that DIANE can be used to integrate local and Grid resources, and even resources from different Grid infrastructures at the same time. The DIANE-based parallel Athena prototype has been shown at EGEE conferences and has been included in the ATLAS Technical Design Report (TDR 2005).
DIANE has also been interfaced to Ganga, a user-friendly Grid interface created in the context of ATLAS and LHCb experiments at CERN. In future, physicists using Ganga will be able to choose the DIANE optimizer, which will be attached transparently to their jobs.
The DIANE scheduler will also be used to operate the statistical regression testing part of the Geant4 release validation procedure. It enables turnaround time to be reduced and provides a more stable and predictable job output rate. This is because the worker agents acquired at the beginning of processing are held inside the pool and are shielded from the instabilities in the Grid brokering. Stable job output rate is an important QoS feature because it enables testing operations on the Grid to be planned with more reliability.
Practical applications
Earlier this year DIANE was used to perform a sizeable fraction of an in silico drug discovery application using the EGEE and other Grid infrastructures. The challenge was to analyse possible drug components against the avian flu virus H5N1.
This activity showed that a user-level scheduler like DIANE can improve the distribution efficiency on the Grid from below 40% to above 80% by optimizing the allocation of the fine-grained computing tasks. Automatic error-recovery mechanisms proved to be efficient in extended periods of continuous work: the part performed with DIANE lasted around 30 days.
In May and June, CERN successfully supported a series of large-scale data-processing activities carried out by the International Telecommunications Union (ITU) as part of the ITU's Regional Radiocommunication Conference. Several sites of the EGEE infrastructure provided a computing Grid of more than 400 PCs to work on each analysis in parallel, and the processing was conducted using the DIANE scheduling layer.
The system completed more than 200,000 very short frequency analysis jobs (clustered in around 40,000 processing tasks) in around one hour, proving that on-demand computing with a short deadline is possible on the Grid. The frequency allocation plan that was optimized with the help of the Grid enabled more than 1000 delegates from 104 countries to adopt the treaty agreement that will replace the analogue broadcasting plans that have existed since 1961 for Europe and since 1989 for Africa.
In the future, closer integration with Ganga will enable access to all of DIANE's capabilities. Ongoing PhD research is aimed at supporting hard QoS requirements with novel techniques such as a floating worker pool, extending scalability above 500 worker agents, and supporting inter-dependent tasks for workflow applications.