New tools help libraries to harvest literature

29 May 2002

Electronic publishing makes material accessible to more people, but this is not always a problem-free process. Ingrid Geretschlager and Jocelyne Jerdelet describe how the CERN library got to grips with a bewildering array of formats and approaches.


For more than 40 years, CERN’s library has collaborated with institutes and universities worldwide to collect carefully documented results of scientific research. Initially, this prodigious output was all on paper, and the CERN library regularly received papers from scientists at these institutes and universities via mailing lists. Because of its visibility, CERN received far more of this material than most institutes, and a major attraction of a visit to CERN was to peruse the latest pre-prints on view in the library.

With the advent of electronic publishing, more and more documents became accessible online. To complete the picture, documents still received on paper were scanned to offer Web access. Today this practice is diminishing as grey literature (library-speak for pre-prints and other material not published by a publishing house) in science, particularly in physics, is more widely available in electronic form.

Saving time and money

Having distributed documents for some years both on paper and electronically, many institutes have now chosen to use only the electronic route. This offers undeniable advantages, among them cost savings, quick and easy distribution, full-text availability at a distance, the possibility of enriching the catalogue, and cheap online access. The virtual library has become a reality. Paper documents are increasingly rare, and authors generally prefer to submit their papers electronically. Most major research centres also offer Web access to their documents and have ceased to send out paper copies via mailing lists, encouraging other scientific libraries and the researchers themselves to consult their Web pages and databases.

Faced with this evolution, library acquisition policies must be reconsidered and adapted to the new standards of scientific information dissemination.

The problem in this new context is the multiple consultation of databases. To find a document, a researcher must consult many resources, which is a time-consuming and tedious task with often dubious results. To facilitate searching and to offer users a single search interface, the CERN library chose to import as many electronic documents as possible. In 1999 the information support team introduced its Uploader program, which allows automatic importation of bibliographic records extracted from several sources. This has led to three main advantages: papers can now be found directly from institutes’ sites; the number of documents received from different research centres has increased; and new databases have been explored.

From any database or Web page, Uploader formats the records and adapts them to the cataloguing format used at CERN – Machine Readable Cataloguing (MARC). The program also updates existing records, searching for duplicates before importation. Which databases to explore was a difficult choice. First, the websites of all institutes from which CERN still received paper documents were consulted to see if the institutes offered the same documents online. This showed that more or less all institutes offer their publications on the Web in some form.
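The import step described above (map each harvested record to the catalogue format, then look for a duplicate before adding it) can be sketched in a few lines of Python. This is only an illustration, not the Uploader code: the function names, the input field names and the choice of the report number as the duplicate key are assumptions, and the MARC 21-style tags (088 for report number, 245 for title, 100 and 700 for authors) were picked for the example because the article does not say which fields Uploader fills.

```python
def to_marc_like(raw):
    """Map a harvested record (hypothetical input keys) to a few MARC-style fields."""
    authors = raw["authors"]
    return {
        "088a": raw["report_number"].strip().upper(),  # report number, used as match key
        "245a": " ".join(raw["title"].split()),        # title with whitespace collapsed
        "100a": authors[0] if authors else "",         # first author
        "700a": authors[1:],                           # remaining authors
    }


def upload(catalogue, raw_records):
    """Add new records and update existing ones; return (added, updated) counts.

    `catalogue` is a plain dict keyed by report number, standing in for the
    real library database (an assumption made for this sketch).
    """
    added = updated = 0
    for raw in raw_records:
        record = to_marc_like(raw)
        key = record["088a"]
        if key in catalogue:
            # Duplicate found: refresh the existing record instead of adding a copy.
            catalogue[key].update(record)
            updated += 1
        else:
            catalogue[key] = record
            added += 1
    return added, updated
```

Matching on a single normalised key keeps the sketch short; a real catalogue would probably compare several fields before deciding that two records describe the same document.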

This study also revealed that CERN received, via mailing lists, only a third of the documents available on the Web. There are two possible explanations for this: research centres may, perhaps for economic reasons, select which documents to send out; and mailing lists are not always kept up to date. The need for automatic importation of these documents from websites became obvious, but there were technical problems to overcome.

Diverse sources


Sources can be divided into two types, Web pages and online databases, which are handled differently. Medium-sized research centres and information sources that do not offer online databases generally offer Web pages presenting the work of their researchers (usually theses). Searching can be primitive if no real search engine is implemented. The number of documents is also often limited. This means that manual submission of the full text is the most efficient way of acquiring these documents. The constant evolution of Web pages also argues against automatic importation. Since alerting services for such sites are rare, the CERN library set up its own alert system for some 80 information sources at 30 institutes. This tells the librarians when the available information changes, allowing them to acquire new documents as they become available.
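One simple way to build such an alert is to keep a fingerprint of each monitored page and compare it with a fresh one at every check. The Python sketch below shows the idea only; the article does not describe how the CERN alert system works, so the hashing approach, the function names and the data layout are all assumptions. Fetching the pages is left out, and the functions take page text that has already been downloaded.

```python
import hashlib


def fingerprint(page_text):
    """Return a digest of the page text, ignoring differences in whitespace."""
    normalised = " ".join(page_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def changed_sources(previous, current_pages):
    """Return the URLs whose content is new or has changed since the last check.

    `previous` maps URL -> fingerprint from the last run and is updated in place;
    `current_pages` maps URL -> page text fetched in this run.
    """
    changed = []
    for url, text in current_pages.items():
        digest = fingerprint(text)
        if previous.get(url) != digest:
            changed.append(url)
            previous[url] = digest
    return changed
```

A librarian would then only need to visit the URLs returned by `changed_sources`, rather than checking every source by hand.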

Online databases often allow multicriteria searching. In contrast to Web pages, however, it is usually impossible to put an alert on the search results. This means that for online databases that do not offer an alert system, a different approach is needed. The method adopted by the CERN library is a monthly or annual search.

The Uploader program helps CERN’s librarians to manage an effective document supply service, but the huge diversity of online information sources means that there is no shortage of work for the librarians. Document structure can vary from page to page, or even within the same page. In the majority of cases the pages are therefore presented as free text with no common structure. With virtually no constraints imposed by databases, no common import protocol is possible, and material must be input manually. Inconsistencies can arise when Web pages are not handled rigorously, causing confusion in bibliographic cataloguing – most frequently for authors’ names. Some databases allow external submission of documents and bibliographies, which results in many irregularities and loss of homogeneity in the presentation of the documents. Information can be presented in multiple forms. Pre-print numbers, for example, can appear as IUAP-00-xxx (number not yet attributed), CERN-TH-2K-1 (instead of CERN-TH-2000-1) or MPS15600 (instead of MPS-2000-156). Vital pieces of information, such as collaboration lists, are sometimes missing. All of these problems require traditional librarianship skills. CERN’s library aims to offer a coherent and homogeneous database with validated and improved metadata. Knowledge databases recognize retrievable work and provide links to relevant articles on the Web, while a computer program appends and corrects bibliographic data, keeping manual checking to a minimum.
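The three pre-print numbers quoted above give a feel for the kind of correction a program can make on its own and the kind it must hand back to a librarian. The Python sketch below covers just those three cases. The rules are inferred from the examples and are assumptions, not the library's actual logic: in particular, reading MPS15600 as three digits of serial number followed by a two-digit year, and the cut-off used to expand two-digit years, are guesses made for illustration.

```python
import re


def expand_year(two_digits):
    """Expand a two-digit year; the pivot at 50 is an assumption for this sketch."""
    n = int(two_digits)
    return str(2000 + n if n < 50 else 1900 + n)


def normalise_report_number(raw):
    """Return a cleaned report number, or None if it needs manual attention."""
    value = raw.strip().upper()
    parts = value.split("-")

    # A part made only of X's (as in IUAP-00-xxx) means no number was attributed yet.
    if any(part and set(part) == {"X"} for part in parts):
        return None

    # Compact form such as MPS15600: prefix, 3-digit number, 2-digit year (assumed layout).
    compact = re.fullmatch(r"([A-Z]+)(\d{3})(\d{2})", value)
    if compact:
        prefix, number, year = compact.groups()
        return f"{prefix}-{expand_year(year)}-{number}"

    # Shorthand year such as 2K in CERN-TH-2K-1.
    return "-".join("2000" if part == "2K" else part for part in parts)
```

Returning `None` for an unattributed number, instead of guessing, reflects the point made in the text: some irregularities still call for a librarian's judgement.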

Electronic advantage


There is no doubt that electronic uploading saves a considerable amount of time compared with manual submission. It has also greatly increased the number of documents made accessible and available at CERN. However, source databases must be carefully selected. The richer the database, the more time-consuming the procedure becomes. In addition, the volatility of Web pages requires close follow-up. Automatic importation has taken over from manual submission, but specialist monitoring remains essential.

The electronic approach was initially investigated at CERN on a test basis, to ascertain technical feasibility and to judge what the advantages would be. Since then, its use has spread and the laboratory has reached agreements with Cornell, Fermilab and several other information sources. Today, more than 90% of the material entering the CERN library database is imported or created electronically. Of this, only 8% comes from CERN.

