The big experiments in high-energy physics are fertile ground for new developments in computers and communications – witness the World Wide Web, which was developed at CERN. Computer specialist Ben Segal recalls how distributed UNIX boxes took over from CERN’s all-powerful IBM and Cray mainframe workhorses.
I don’t remember exactly who first proposed running physics batch jobs on a UNIX workstation, rather than on the big IBM or Cray mainframes that were doing that kind of production work at CERN in 1989. The workstation in question was to be an Apollo DN10000, the hottest thing in town, with reduced-instruction-set computing (RISC) CPUs of a formidable five CERN Units each (a CERN Unit was defined as one IBM 370/168, equivalent to four VAX 11/780s) and costing around SwFr 150 000 for a 4-CPU box.
It must have been the combined idea of Les Robertson, Eric McIntosh, Frederic Hemmer, Jean-Philippe Baud, myself and perhaps some others who were working at that time around the biggest UNIX machine that had ever crossed the threshold of the Computer Centre – a Cray XMP-48, running UNICOS.
At any rate, when we spoke to the Apollo salespeople about our idea, they liked it so much that they lent us the biggest box they had, a DN10040 with four CPUs plus a staggering 64 MB of memory and 4 GB of disk space. Then, to round it off, they offered to hire a person of our choice for three years to work on the project at CERN.
In January 1990 the machine was installed and our new “hireling”, Erik Jagel, an Apollo expert after his time managing the Apollo farm for the L3 experiment, coined the name “HOPE” for the new project. (Hewlett-Packard had bought Apollo and OPAL had expressed interest, so it was to be the “HP OPAL Physics Environment”).
We asked where we could find the space to install HOPE in the Computer Centre. We just needed a table with the DN10040 underneath and an Ethernet connection to the Cray, to give us access to the tape data. The reply was: “Oh, there’s room in the middle” – where the recently obsolete round tape units had been – so that was where HOPE went, looking quite lost in the huge computer room, with the IBM complex on one side and the Cray supercomputer on the other.
Soon the HOPE cycles were starting to flow. The machine was surprisingly reliable, and porting the big physics FORTRAN programs was easier than we had expected. After around six months, the system was generating 25 per cent of all CPU cycles in the centre. Management began to notice the results when we included HOPE’s accounting files in the weekly report we produced, which plotted such things in easy-to-read histograms.
We were encouraged by this success and went to work on a proposal to extend HOPE. The idea was to build a scalable version from interchangeable components: CPU servers, disk servers and tape servers, all connected by a fast network and by software that would bind them into a distributed mainframe. “Commodity” became the keyword – we would use the cheapest building blocks available, choosing for each function the manufacturer that gave the best price/performance.
How large a system could we build, and what would it cost? We asked around, and received help from some colleagues who treated it as a design study. The workflow through such a system was simulated, the bandwidth requirements of the fast network “backplane” needed to connect everything were estimated, prices were calculated, the essential software was sketched out and the manpower required for development and operation was predicted.
Software development would be a challenge. Fortunately, some of us had been working with Cray at CERN, adding some facilities to UNIX that were vital for mainframe computing: a proper batch scheduler and a tape-drive reservation system, for example. These could be reused quite easily.
Other new functions would include a distributed “stager” and a “disk-pool manager”. These would allow the pre-assembly of each job’s tape data (read from drives on the tape servers) into efficiently managed disk pools located on the disk servers, ready to be accessed by the jobs running on the CPU servers. Also new would be “RFIO”, a remote file input-output package that would offer a unified and optimized data-transfer service between all of the servers via the backplane. It looked like Sun’s Network File System (NFS), but was much more efficient.
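To give a flavour of what such a remote file service means in practice, here is a minimal sketch in C of a client on a CPU server asking a disk server for a file over TCP/IP and streaming the bytes back – roughly the kind of server-to-server transfer the backplane carried. The host name, port and one-line request format are invented for illustration; this is not the actual RFIO protocol or API.

/*
 * Minimal sketch of a remote file read over TCP/IP, in the spirit of the
 * RFIO idea described above: a CPU server asks a disk server for a file
 * and the bytes are streamed back over the network "backplane".
 *
 * The request format ("GET <path>\n") and the server are hypothetical
 * illustrations, not the real RFIO wire protocol.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

static int remote_read(const char *host, int port, const char *path)
{
    struct hostent *he = gethostbyname(host);
    if (he == NULL) {
        fprintf(stderr, "unknown host %s\n", host);
        return -1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short)port);
    memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return -1;
    }

    /* Ask the (hypothetical) disk server for one file. */
    char request[1024];
    snprintf(request, sizeof(request), "GET %s\n", path);
    if (write(fd, request, strlen(request)) < 0) {
        perror("write");
        close(fd);
        return -1;
    }

    /* Stream the reply to stdout, as a physics job would read its data. */
    char buf[65536];
    ssize_t n;
    long total = 0;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        fwrite(buf, 1, (size_t)n, stdout);
        total += (long)n;
    }

    close(fd);
    fprintf(stderr, "transferred %ld bytes of %s from %s\n", total, path, host);
    return (n < 0) ? -1 : 0;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <disk-server> <port> <path>\n", argv[0]);
        return 1;
    }
    return remote_read(argv[1], atoi(argv[2]), argv[3]) == 0 ? 0 : 1;
}

The real package went much further, of course – it presented the jobs with open/read/write-style calls and optimized the transfers between server types – but the sketch shows why a dedicated, lightweight transfer service could beat a general-purpose network file system for this workload.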
SHIFT in focus
Finally, a suitable name was coined, again by Erik Jagel: “SHIFT”, for “Scalable Heterogeneous Integrated FaciliTy”, suggesting the paradigm shift that was taking place in large-scale computing: away from mainframes and towards a distributed low-cost approach.
The “SHIFT” proposal report was finished in July 1990. It carried 10 names, including colleagues from several groups who had offered their ideas and worked on the document.
“Were 10 people working on this?” and “How many Cray resources were being used and/or counted?” came the stern reply when the report was circulated higher up. In response, we pointed out that most of the 10 people had contributed only small fractions of their time, and that the Cray had been used simply as a convenient tape server: it was the only UNIX machine in the Computer Centre with access to the standard tape drives, all of which were physically connected to the IBM mainframe at that time.
Closer to home, the idea fell on more fertile ground, and we were told that if we could persuade at least one of the four LEP experiments to invest in our idea, we could have matching support from the Division. The search began. We spoke to ALEPH, but they replied, “No, thank you, we’re quite happy with our all-VAX VMS approach.” L3 replied, “No thanks, we have all the computing power we need.” DELPHI replied, “Sorry, we’ve no time to look at this as we’re trying to get our basic system running.”
Only OPAL took a serious look. They had already been our partner in HOPE and also had a new collaborator from Indiana with some cash to invest and some small computer system interface (SCSI) disks for a planned storage enhancement to their existing VMS-based system. They would give us these contributions until March 1991, the next LEP start-up – on the condition that everything was working by then, or we’d have to return their money and disks. It was September 1990, and there was a lot of work to do.
Our modular approach and use of the UNIX, C language, TCP/IP and SCSI standards were the keys to the very short timescale we achieved. The design studies had included technical evaluations of various workstation and networking products.
Code development could then begin, and the orders for hardware went out. The first tests on site, with SGI Power Series servers connected via UltraNet, took place at the end of December 1990. A full production environment was in place by March 1991, the date set by OPAL.
And then we hit a problem. The disk server system began crashing repeatedly with unexplained errors. Our design evaluations had led us to choose a “high-tech” approach: the use of symmetric multiprocessor machines from Silicon Graphics for both CPU and disk servers, connected by the sophisticated “UltraNet” Gigabit network backplane. One supporting argument had been that if the UltraNet failed or could not be made to work in time, then we could put all the CPUs and disks together in one cabinet and ride out the OPAL storm. We hadn’t expected any problems in the more conventional area of the SCSI disk system.
Our disks were mounted in trays inside the disk server, connected via high-performance SCSI channels. It looked standard, but we had the latest models of everything. Like a performance car, it was a marvel of precision but impossible to keep in tune. We tried everything, but still it went on crashing and we finally had to ask SGI to send an engineer. He found the problem: inside our disk trays was an extra metre of flat cable which had not been taken into account in our system configuration. We had exceeded the strict limit of 6 m for single-ended SCSI, and in fact it was our own fault. Rather than charging us penalties and putting the blame where it belonged, SGI lent us two extra CPUs to help us to make up the lost computing time for OPAL and ensure the success of the test period!
At the end of November 1991, a satisfied OPAL doubled its investment in CPU and disk capacity for SHIFT. At the same time, 16 of the latest HP 9000/720 machines, each worth 10 CERN Units of CPU, arrived to form the first Central Simulation Facility or “Snake Farm”. The stage was set for the exit of the big tidy mainframes at CERN, and the beginning of the much less elegant but evolving scene we see today on the floor of the CERN Computer Centre. SHIFT became the basis of LEP-era computing and its successor systems are set to perform even more demanding tasks for the LHC, scaled this time to the size of a worldwide grid.