
Flooded LHC data centre back in business

19 April 2018

Following severe damage caused by flooding on 9 November 2017, the INFN-CNAF Tier-1 data centre of the Worldwide LHC Computing Grid (WLCG) in Bologna, Italy, has been fully repaired and is back in business crunching LHC data. The incident was caused by the burst of a large high-pressure water pipe in a nearby street, which rapidly flooded the area where the data centre is located. Although the centre was designed to be waterproof against natural events, the volume of water was overwhelming: some 500 m³ of water and mud entered the various rooms, seriously damaging electronic appliances, computing servers, and network and storage equipment. A room hosting four 1.4 MW electrical-power panels was filled first, leaving the centre without electricity.

The Bologna centre, which is one of 14 Tier-1 WLCG centres located around the world, hosts a good fraction of LHC data and associated computing resources. It is equipped with around 20,000 CPU cores, 25 PB of disk storage, and a tape library presently filled with about 50 PB of data. Offline computing activities for the LHC experiments were immediately affected. About 10% of the servers, disks, tape cartridges and computing nodes were reached by floodwater, and the mechanics of the tape library were also affected.

Despite the scale of the damage, INFN-CNAF personnel were not discouraged: they quickly defined a roadmap to recovery and then tackled the affected subsystems one by one. First, the rooms at the centre had to be dried and then meticulously cleaned to remove residual mud. Then, within a few weeks, new electrical panels were installed to allow subsystems to be turned back on.

Although all LHC disk-storage systems were reached by the water, the INFN-CNAF personnel were able to recover the data in their entirety, without losing a single bit. This was thanks in part to the available level of redundancy of the disk arrays and to their vertical layout. Wet tape cartridges hosting critical LHC data had to be sent to a specialised laboratory for data recovery.

A dedicated computing farm was set up very quickly at the nearby Cineca computing centre and connected to INFN-CNAF via a high-speed 400 Gbps link to enable the centre to reach the required LHC capacity for 2018. During March, three months after the incident, all LHC experiments were progressively put back online. Following the successful recovery, INFN is planning to move the centre to a new site in the coming years.
