Summary
CASTOR2 rises to the LHC data storage challenge
CERN's mass-storage infrastructure has developed progressively since the early 1990s, when Unix systems began to replace mainframes. CERN's advanced storage manager, CASTOR, was developed in 1999 and proved itself with the COMPASS and NA48 experiments. By 2003, however, it had become clear that CASTOR would not be able to handle the data rates and volumes, or the number of files, predicted for the LHC experiments. The new version of the manager, CASTOR2, in use since 2006, was designed to meet these requirements in today's computing environment based on PC servers.
CERN's mass storage infrastructure has developed progressively since the early 1990s when Unix-based systems started to replace the mainframes. A significant step was the development in 1999 of CERN's Advanced STORage system (CASTOR), which allowed users to refer to files using logical names in a hierarchical namespace, replacing references to tape numbers. This was a great help in managing the data from post-LEP experiments such as COMPASS and NA48, but by 2003 it was clear that CASTOR would not be able to cope with the data rates, data volumes and number of files predicted for the LHC experiments. The latest incarnation, CASTOR2, deployed in production from 2006, was designed to address these issues, to improve scalability and resilience in today's environment based on PC servers and to improve efficiency of resource usage. In particular, it is designed both to exploit the capabilities of today's high-performance tape drives and to match the I/O load on disk servers to hardware capabilities, so guaranteeing consistent performance for users.
The architecture
A key aspect of CASTOR2 is the database-centred design (figure 1). The overall system state, as well as the status of all user requests, is held in a central relational database. A set of stateless daemons queries the database for the next operation to perform, e.g. to schedule the next transfer for a client, or to issue a tape recall. By making all of these daemons stateless, the design improves scalability and fault tolerance as the daemons can be replicated on different machines. It also simplifies operation by allowing the updating and restarting of daemons while another instance is supporting the load. Isolating key components, such as the stager, from direct user access (all user queries interact with a request-handler daemon) improves overall resilience – one of the design goals for CASTOR2. Of course, a single central database is potentially both a performance bottleneck and a single point of failure. At CERN, however, the central Oracle servers are configured for redundancy (exploiting Oracle's RAC and Data Guard features); tests have shown them to be more than adequate for handling the expected load.
Some of the more important CASTOR daemons are the nameserver, the request handler, the stager and the scheduler. The nameserver manages a global hierarchical namespace that allows users to name files in a Unix-like manner under a directory hierarchy starting "/castor". As there is only one CASTOR nameserver for each site, speed and efficiency are paramount and – since it is developed at CERN – CASTOR, as opposed to commercial mass storage systems, can focus on optimizing the functions most needed by high-energy physics users. All user interaction passes via the request handler, which is a lightweight gateway that stores requests in the central database, handling peaks of more than 100 requests a second without any service degradation. In today's Grid-enabled world, however, many clients will interact with CASTOR via the storage resource manager (SRM), a generic interface to mass-storage systems. Today's interface can scale up to 1.7 million requests a day. A new interface developed at the UK's Rutherford Appleton Laboratory, implementing version 2.2 of the SRM specification, is expected to be deployed in production in the near future.
At the heart of CASTOR lies the stager, the daemon responsible for handling user requests. One of the more innovative features of CASTOR is that the stager does not itself decide which disk server to use when fulfilling a request. Instead, the stager passes processed requests to a scheduler, which will pick the most suitable disk server based on information collected by the resource monitor. A request from an experiment production manager can, for example, take priority over requests from other users. Unfortunately, the first interface to the scheduler daemon turned out to be rather inefficient, which limited the overall performance and prevented request prioritization – a situation that caused major problems for an ATLAS experiment test earlier this year. Fortunately, the redesigned scheduler interface (which caches disk server status in shared memory) works well and can support more than 50,000 user requests simultaneously. This is comfortably more than we expect to handle.
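The split between stager and scheduler can be sketched as follows. This is an illustration under stated assumptions, not the CASTOR2 scheduler: the DiskServer fields, the "least loaded, then most free space" rule in pick_server and the numeric priorities in RequestQueue are all hypothetical choices made for the example. The sketch shows the two ideas in the text: server selection driven by monitoring data, and higher-priority requests being served first.

```python
import heapq
from dataclasses import dataclass


@dataclass
class DiskServer:
    """Snapshot of one disk server as a resource monitor might report it."""
    name: str
    free_bytes: int
    active_streams: int
    max_streams: int
    online: bool = True


def pick_server(servers, file_size):
    """Return the least-loaded online server with room for the file, or None."""
    candidates = [
        s for s in servers
        if s.online and s.free_bytes >= file_size and s.active_streams < s.max_streams
    ]
    if not candidates:
        return None
    # Lowest relative load wins; ties go to the server with more free space.
    return min(candidates, key=lambda s: (s.active_streams / s.max_streams, -s.free_bytes))


class RequestQueue:
    """Priority queue: higher priority first, first-come first-served within a level."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # preserves arrival order among equal priorities

    def push(self, priority, request):
        heapq.heappush(self._heap, (-priority, self._counter, request))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Keeping the placement rule in one small function is the design point: the policy can be changed without touching the code that accepts and records requests.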
Four important daemons look after the interface with the tape layer – both tape drives and automated tape libraries. Apart from the principal tape daemon, which supports a variety of tape drives and related robots, the volume manager and the volume and drive queue manager handle the status of tapes and tape drives, allocating tape space to files as necessary and orchestrating the mounting of tape cartridges onto a suitable tape drive. The remote tape copy (rtcopy) daemon handles the transfer of data between tape and disk. The rtcopy daemon also computes a checksum of the files while they are transferred and compares this against a reference value that is stored in the nameserver to detect any transfer errors. This is in addition to arranging high-speed streaming of data to support today's high-performance tape drives (with bandwidths of up to 120 MB/s) and dynamically selecting files for migration to or from tape to maximize utilization of the precious tape drives – which was a design requirement for CASTOR2.
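Computing a checksum while the data stream past, rather than in a second pass, can be sketched as below. The article does not say which checksum algorithm rtcopy uses; Adler-32 from Python's standard zlib module is used here only as an assumption for illustration, and the function names copy_with_checksum and verify_transfer are hypothetical.

```python
import zlib


def copy_with_checksum(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst chunk by chunk, returning the Adler-32 of the data."""
    checksum = 1  # Adler-32 starting value
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # Update the running checksum with this chunk, then pass it on.
        checksum = zlib.adler32(chunk, checksum)
        dst.write(chunk)
    return checksum & 0xFFFFFFFF


def verify_transfer(src, dst, reference):
    """Copy the data and raise IOError if the checksum differs from the reference."""
    actual = copy_with_checksum(src, dst)
    if actual != reference:
        raise IOError(
            "checksum mismatch: got %08x, expected %08x" % (actual, reference)
        )
    return actual
```

Here reference plays the role of the value stored in the nameserver. Because the checksum is updated per chunk, the check adds no extra read of the file.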
Understanding what is happening in such a distributed system is both crucial and difficult, so CASTOR2 comes with a distributed logging facility (DLF) that centralizes recording of logging and accounting information from the various daemons. The DLF comes with an intuitive and easy-to-use web interface to browse the data, and can process and store thousands of events a second. Through regular monitoring of the DLF data, together with status and load information from the disk and tape servers, any problems that threaten service stability can be recognized early on. The modular design of CASTOR2 allows, in many cases, automatic corrective action. For example, individual disk servers can be quiesced when disk errors occur, with files being replicated as necessary so that user requests can be satisfied from other servers. This enables service quality for end users to be maintained despite inevitable hardware failures – another design requirement for CASTOR2.
In terms of performance, CASTOR must cope with the aggregate data rates from the LHC experiments – up to 1 GB/s in proton mode or up to 2 GB/s during heavy-ion running – and the usual load for reconstruction and analysis. The system must also cope with the simultaneous export of data to the Tier-1 sites, leading to the overall bandwidth requirements shown in figure 2.
Bottlenecks and performance problems have been identified and removed thanks to a series of data challenges during the past two years, driven largely by the ALICE collaboration with their enormous central data recording requirements during heavy-ion running. As figure 3 shows, a single CASTOR instance at CERN can easily support sustained data rates up to 2 GB/s and accept incoming data at up to 3 GB/s (the data rate to tape being limited by the number of tape drives available).
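To see why the number of tape drives limits the rate to tape, it helps to combine two figures quoted in this article: drive bandwidths of up to 120 MB/s and aggregate rates of 1 to 2 GB/s. The short calculation below is a back-of-the-envelope sketch only. The function drives_needed and its efficiency parameter are hypothetical, decimal units (1 GB = 1000 MB) are assumed, and the result ignores mount, positioning and rewind time, so it is a lower bound rather than a sizing rule.

```python
import math


def drives_needed(rate_mb_s, drive_mb_s=120.0, efficiency=1.0):
    """Minimum number of drives to sustain rate_mb_s.

    efficiency is the assumed fraction of a drive's peak bandwidth that is
    actually achieved (1.0 means perfect streaming with no idle time).
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return math.ceil(rate_mb_s / (drive_mb_s * efficiency))


proton_mode = drives_needed(1000)       # 1 GB/s at full drive speed
heavy_ion = drives_needed(2000)         # 2 GB/s at full drive speed
heavy_ion_half = drives_needed(2000, efficiency=0.5)  # drives busy half the time
```

Under these assumptions, 2 GB/s needs at least 17 drives streaming continuously at full speed, and twice as many if each drive achieves only half its peak rate. This is why keeping the drives streaming matters as much as their number.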
With dedicated instances for each of the four LHC experiments, the CASTOR service at CERN looks set to meet their data recording and export needs. Admittedly, CASTOR has yet to demonstrate performance, reliability and robustness in the face of both data acquisition and an unquantifiable analysis load as physicists seek to exploit the initial LHC data. However, experience over the past year shows that the software and design are flexible and robust, so the CASTOR development and operations teams look forward with confidence to the forthcoming full dress rehearsals, the Combined Computing Readiness Challenge and, above all, first LHC data.