
Data preservation is a journey

20 May 2016

Taking on the challenge of preserving “digital memory”.

 

As an organisation with more than 60 years of history, CERN has created large volumes of “data” of many different kinds. This comprises not only scientific data – by far the largest in terms of volume – but also photographs, videos, minutes, memoranda, web pages and so forth. Sadly, some of this information from as recently as the 1990s, such as the first CERN web pages, has been lost, as has, more notably, much of the data from numerous pre-LEP experiments. Today, things look rather different, with concerted efforts across the laboratory to preserve its “digital memory”. This concerns not only “born-digital” material but also what is still available from the pre-digital era. Whereas the latter often existed (and luckily often still exists) in multiple physical copies, the fate of digital data can be more precarious. This led Vint Cerf, vice-president of Google and an early internet pioneer, to declare in February 2015: “We are nonchalantly throwing all of our data into what could become an information black hole without realising it.” This is a situation that we have to avoid for all CERN data – it’s our legacy.

Interestingly, many of the tools that are relevant for preserving data from the LHC and other experiments are also suitable for other types of data. Furthermore, there are models that are widely accepted across numerous disciplines for how data preservation should be approached and how success against agreed metrics can be demonstrated.

Success, however, is far from guaranteed: the tools involved have had a lifetime that is much shorter than the desired retention period of the current data, and so constant effort is required. Data preservation is a journey, not a destination.

The basic model that more or less all data-preservation efforts worldwide adhere to – or at least refer to – is the Open Archival Information System (OAIS) model, for which there is an ISO standard (ISO 14721:2012). Related to this are a number of procedures for auditing and certifying “trusted digital repositories”, including another ISO standard – ISO 16363.

This certification requires, first and foremost, a commitment by “the repository” (CERN in this case) to “the long-term retention of, management of, and access to digital information”.

In conjunction with numerous more technical criteria, certification is therefore a way of demonstrating that specific goals regarding data preservation are being, and will be, met. For example, will we still be able to access and use data from LEP in 2030? Will we be able to reproduce analyses on LHC data up until the “FCC era”?

In the context of the Worldwide LHC Computing Grid (WLCG), self-certification, initially of the Tier0 site, is currently under way. This is a first step prior to possible formal certification, to certification of other WLCG sites (e.g. the Tier1s), and even to certification of CERN as a whole. This could cover not only current and future experiments but also the “digital memory” of non-experiment data.

What would this involve and what consequences would it have? Fortunately, many of the metrics that make up ISO 16363 are part of CERN’s current practices. To pass an audit, quite a few of these would have to be formalised into official documents (stored in a certified digital repository with a digital object identifier): there are no technical difficulties here, but it would require effort and commitment to complete. In addition, it is likely that the ongoing self-certification will uncover some weak areas. Addressing these can be expected to help ensure that all of our data remains accessible, interpretable and usable for long periods of time: several decades and perhaps even longer. Increasingly, funding agencies are requiring not only the preservation of data generated by projects that they fund, but also details of how reproducibility of results will be addressed and how data will be shared beyond the initial community that generated it. These are therefore issues that we need to address in any event.

A reasonable target would be to achieve certification before the next update of the European Strategy for Particle Physics (ESPP), and subsequent updates of the strategy would offer a natural interval at which to check that the policies and procedures remain effective.

The current status of scientific data preservation in high-energy physics owes much to the Study Group that was initiated at DESY in late 2008/early 2009. This group published a “Blueprint document” in May 2012, and a summary of this was input to the 2012 ESPP update process. Since that time, effort has continued worldwide, with a new status report published at the end of 2015.

In 2016, we will profit from the first-ever international data-preservation conference to be held in Switzerland (iPRES, Bern, 3–6 October) to discuss our status and plans with the wider data-preservation community. Not only do we have services, tools and experiences to offer, but we also have much to gain, as witnessed by our adoption of OAIS – developed in the space community – and of related standards and practices.

High-energy physics is recognised as a leader in the open-access movement, and the tools in use for this, based on Invenio Digital Library software, have been key to our success. They also underpin more recent offerings, such as the CERN Open Data and Analysis Portals. We are also recognised as world leaders in “bit preservation”, where the 100+ PB of LHC (and other) data are proactively curated with increasing reliability (that is, ever fewer occurrences of the rare but inevitable loss of data), despite ever-growing data volumes. Finally, CERN’s work on virtualisation and versioned file systems, through CernVM and CernVM-FS, has already demonstrated great potential for the highly complex task of “software preservation”.

• For further reading, visit arxiv.org/pdf/1205.4667 and dx.doi.org/10.5281/zenodo.46158.
