Processors for LHC physics examined from every angle

When the LHC comes into service, physicists will rely mainly on industry-standard PC servers to analyse the data from high-energy collisions. This means that the characteristics of the processors, and the behaviour of the software used for data collection, will have to be thoroughly understood. Sverre Jarp, technical lead of the CERN openlab programme, which collaborates with industry to test the latest developments in computing hardware, looks back at how processors have evolved over the past ten years as it has become possible to integrate more and more transistors (the basic building blocks) on a chip. He also discusses the challenges this poses for software in the LHC era.

In the past few decades, largely thanks to Moore's Law, the world has witnessed an unprecedented race to higher and higher densities in integrated circuits. When we talked about "PC as Physics Computer for LHC?" at the CHEP '95 conference (figure 1) the x86-architecture based (micro)processors were already well established in other market segments. They had, at the time, typically 3–10 million transistors (figure 2). Today, they consist of more than 300 million transistors on a surface that is as big as a fingernail and the prediction is that this will continue to double every two years for at least another decade, or in other words, through most of the LHC era.

Making use of transistors

Over time, the transistors on a processor die have been put to use in several ways, some of which have been extremely advantageous to high-energy physics (HEP) software, others less so. Roughly speaking, a processor consists of the execution logic (including the register files), the cache hierarchy (typically 2–3 levels) and miscellaneous items, such as communications logic for interacting with the external world (figure 3).

In the era of the 386 and 486 chips, the processor executed a single stream of instructions in order, that is, exactly according to the way the code was laid out by the compilers. With the availability of additional transistors, engineers decided to change this simple execution scheme in two ways. First, they allowed instructions to be scheduled on multiple issue ports in parallel, like passing orders to several chefs who work in parallel in a kitchen. If neighbouring instructions are independent of each other, there is no need to wait for the first instruction to be completed before launching the second one. This strategy has taken us from the first Pentium with two ports (or pipes), via the Pentium Pro with three parallel ports, to today's Intel Core 2 micro-architecture with as many as six ports (figure 4).

Beyond the increase in the number of ports, designers also extended the instruction set so that instructions could operate on multiple data elements in one go. This is referred to as Single Instruction Multiple Data (SIMD) and can be found, for instance, in the x86 Streaming SIMD Extensions (SSE). Typically these instructions operate on data that is 128 bits wide, which means either four 32-bit numbers (integers or floats) or two 64-bit numbers (long integers or doubles), but other combinations are also possible. These instructions can be seen as "vector instructions" and usually achieve maximum efficiency in matrix or vector-based programs.
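As a minimal sketch of the SIMD idea, the function below adds two arrays of 32-bit floats four elements at a time using SSE intrinsics. The function name addArrays is hypothetical, and the code assumes a compiler that defines the __SSE__ macro when SSE is available (as GCC and Clang do on x86); elsewhere it falls back to an ordinary scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics, x86 only
#endif

// Adds a[i] + b[i] into out[i]. Illustrative helper, not taken from any HEP package.
// Precondition: n is a multiple of 4, so every 128-bit register is completely filled.
void addArrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 4 == 0);
#if defined(__SSE__)
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // load four 32-bit floats
        __m128 vb = _mm_loadu_ps(b + i);             // load four more
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // one instruction, four additions
    }
#else
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];                        // scalar fallback, one addition at a time
    }
#endif
}
```

The unaligned load and store variants are used here so that the sketch works for any array address; aligned data would allow the aligned forms instead.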

To feed the multiple-execution ports mentioned earlier, there is a need to identify as much parallelism as possible, and the chip engineers consequently added a mechanism for out-of-order (OOO) execution. This scheme allows the processor to search for independent instructions inside an "instruction window" of typically 100 instructions and execute additional, independent work on the fly whenever possible.
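To illustrate what "independent instructions" means in practice, here is a small sketch with two ways of summing an array. The function names are hypothetical. In the first version every addition depends on the previous one, so there is little for parallel ports to do; in the second, the two running sums are independent of each other and may be executed side by side. Whether this is actually faster depends on the processor and compiler, and floating-point rounding can differ slightly between the two orderings.

```cpp
#include <cassert>
#include <cstddef>

// One long dependency chain: each addition must wait for the result of the previous one.
double sumChained(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

// Two independent chains: the additions into s0 and s1 do not depend on each other,
// so hardware with several execution ports has the option of overlapping them.
// Precondition: n is even.
double sumTwoChains(const double* x, std::size_t n) {
    assert(n % 2 == 0);
    double s0 = 0.0;
    double s1 = 0.0;
    for (std::size_t i = 0; i < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    return s0 + s1;
}
```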

As engineers introduced even denser silicon circuits, more and more transistors became available on one chip. This allowed – as a third trick – the extension of cache sizes. Whereas sub-megabyte caches were standard for a long time, now it is not rare to find processors with more than 10 MB of cache. The latest Xeon MP processor, for instance, features up to 16 MB of Level-3 cache, and the latest Itanium-2 processor sports 24 MB.

The advantage of using more transistors for cache is easy to understand. First, caches are easy to design and implement on the die; second, they run cool, consuming on average much less power than the execution logic; and third, the time to access data in a cache is typically 10–100 times shorter than accessing main memory off the chip (figure 3).

Nevertheless, the world needed more ideas for transistor usage and engineers invented chip-level "multi-threading". This is a scheme whereby additional transistors are used to keep the "state" of two software processes (or "threads") inside the execution logic simultaneously, while sharing the executing units and the caches. The chip's control logic switches between the two threads according to a pre-determined algorithm, typically either "round-robin" or "switch on long waits". In the latter case, if data are not found in the cache, we obtain a cache miss, which forces instructions or data to be read from main memory. This opens a gap of hundreds of cycles for the other thread to use (as long as it does not itself create another cache miss). This scheme is by no means limited to two threads and some suppliers already operate with higher numbers.

Having observed multi-threading and still being blessed with an ever-increasing transistor budget, it was easy to guess the next step. Rather than retaining the execution units and caches as a single resource, multi-core processors replicate everything, leading to a chip with multiple independent processing units inside (figure 5). It is easy to see how this scheme (in addition to the cache expansion mentioned above) can be used to keep pace with transistor growth in the future.

On the other hand, since multi-core technology forces a rethink of the overall hardware design and more importantly the overall software-programming model, there is currently no agreement in the industry on what the "sweet spot" is. Sun Microsystems, on its T1 "Niagara" processor, currently offers eight single-port, in-order cores, and integrates support for four threads in each core. Intel already ships its first quad-core processor (launched recently at CERN; see "Chip may boost particle-physics Grid power"), although some purists will point out that this processor is built with two dual-core components for the time being. AMD has announced its single-die quad-core processor for availability during the summer. At the extreme end, Intel recently demonstrated an 80-core teraflop research processor chip, so it seems clear that double-digit core numbers are not too far into the future.

What's best for HEP?

In the community at large and also inside the CERN openlab, we have spent considerable time looking at the execution characteristics of HEP/LHC software. Our work has covered compiler investigations, benchmarking, code profiling using hardware monitors, information exchange with chip designers, and so on. In the following I will comment on the suitability of the various ways of deploying transistors for HEP software.

In openlab we were initially mandated to look at the advantages and disadvantages of running HEP codes on the in-order, 64-bit Itanium processor, which ever since its inception in 2000 has offered six parallel execution slots – the same number of ports as today's Core 2. The sad truth for both processor families is that at this level our programs express too little parallelism. When we measure the average number of instructions per cycle we typically find values that hover around 1 (or lower in certain programs). This is far from the maximum parallelism, especially when we also take SSE instructions into account, and is caused by the sequential ("perform one thing at a time") manner in which much of HEP software is written.

A standard way of expressing parallelism at the instruction level is to write small to medium-sized loops from which compilers may extract parallel components. Of course, in HEP we do have event loops, but these are simply too big and are not "seen" by the compilers, which typically scrutinize only small chunks of code in one go. As a result the compilers see only reams of mainly sequential code.

Let's look at a simple, rather typical example in which we test whether a point is inside or outside a box (in the x-direction); note that the hardware generates three load micro-operations on the fly:

Load point[0]; Load origin[0]; Subtract; Load a mask; Obtain the absolute value via an and instruction; Load the half-size; Compare; Branch conditionally;

The parallel hardware could cope with these instructions in a couple of cycles. The sequential nature of the code, however, together with the latency incurred by the loads from cache and the floating-point subtract and compare, results in a sequence that takes around 10 cycles. In other words, the program sequence only exploits 10–20% of the available execution resources (figure 6). Fortunately, other transistor deployment schemes work better for us.
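For readers who prefer source code to instruction listings, the test could look roughly like the sketch below. The function name, the argument layout and the choice to count the boundary as "inside" are all assumptions made for illustration; the comments indicate which of the steps listed above each line would typically give rise to.

```cpp
#include <cassert>
#include <cmath>

// Returns true if the point lies within the box along x (boundary counted as inside).
// Hypothetical helper: names and array layout are illustrative only.
bool insideBoxX(const double point[3], const double origin[3], const double halfSize[3]) {
    assert(halfSize[0] >= 0.0);                  // a half-size cannot be negative
    const double dx = point[0] - origin[0];      // load point[0], load origin[0], subtract
    const double absDx = std::fabs(dx);          // absolute value (load a mask, "and")
    return absDx <= halfSize[0];                 // load the half-size, compare, branch
}
```

Each line needs the result of the line before it, which is why the hardware cannot overlap the steps. As a rough reading of the numbers, eight operations spread over about ten cycles on a machine with six ports fill 8 of 60 issue slots, or about 13%, which is in line with the 10–20% quoted above.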

HEP codes do benefit from OOO execution. This means that even when the compilers have laid out the code sequentially, the OOO hardware engine is able to shorten the execution time by finding work that can be done in parallel. This has, for instance, been seen when the test mentioned above is expanded to test x, then y, then z. The compilers lay out the tests sequentially, but the OOO engine overlaps the execution and minimizes the time used to compute the test for the two additional directions by more than 50% compared with the initial one. This is definitely good news for the day when we need to cope with more than three dimensions (!), but already today we see a clear gain.
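A sketch of the expanded test, under the same assumptions as before (hypothetical names, boundary counted as inside), is shown below. The source expresses the tests one after the other, x, then y, then z, but the loads and subtractions for y and z do not need any result from x, which is what gives the OOO engine room to start them early.

```cpp
#include <cassert>
#include <cmath>

// Returns true if the point lies within the box along all three axes.
// Hypothetical helper; the tests are written sequentially: x, then y, then z.
bool insideBox(const double point[3], const double origin[3], const double halfSize[3]) {
    for (int axis = 0; axis < 3; ++axis) {
        assert(halfSize[axis] >= 0.0);  // a half-size cannot be negative
        // The data for each axis is independent of the other axes.
        if (std::fabs(point[axis] - origin[axis]) > halfSize[axis]) {
            return false;               // outside along this axis
        }
    }
    return true;
}
```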

As far as caches are concerned, HEP programs do not seem to need huge sizes. Our programs exhibit good cache locality with cache misses limited to around 1% of all loads. This is still not without consequences, since, as already mentioned, latency to main memory amounts to a few hundred cycles. Modern processors allow data to be pre-fetched, either via a hardware feature or software-controlled instructions, but we have not seen much evidence that execution paths in HEP software are regular enough to profit significantly.
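The reason a 1% miss rate still matters can be seen from a simple back-of-the-envelope model: the average cost of a load is the cache-hit latency plus the miss rate times the miss penalty. The sketch below encodes that formula; the function name is hypothetical, and the model ignores any overlap that OOO execution or pre-fetching might achieve.

```cpp
#include <cassert>

// Average cycles per load = hit latency + miss rate * miss penalty.
// Deliberately simple model: no overlap between misses and other work is assumed.
double averageLoadCycles(double hitCycles, double missRate, double missPenaltyCycles) {
    assert(missRate >= 0.0 && missRate <= 1.0);
    return hitCycles + missRate * missPenaltyCycles;
}
```

With assumed values of 3 cycles for a cache hit and 300 cycles for a trip to main memory, averageLoadCycles(3.0, 0.01, 300.0) gives about 6 cycles, so in this model even a 1% miss rate doubles the average cost of a load.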

Chip-level multi-threading has not received much attention in our community. This is probably linked to the fact that our jobs are CPU-bound with only a few cache misses, and the potential gain is therefore modest. It may even be limited to single-digit percentage numbers in terms of throughput gains. On the other hand, more jobs need to run simultaneously, which increases the memory requirements and consequently the cost of the computer. The price/performance gain is therefore somewhat unclear.

With multi-core processors, however, we finally get to a scheme where the HEP execution profile shines. Thanks to the fact that our jobs are embarrassingly parallel (each physics event is independent of the others) we can launch as many processes (or jobs) as there are cores on a die. However, this requires that the memory size is increased to accommodate the extra processes (making the computers more expensive). As long as the memory traffic does not become a bottleneck, we see a practically linear increase in the throughput on such a system. A ROOT/PROOF analysis demo elegantly demonstrated this during the launch of the quad-core chip at CERN.

If memory size and the related bus traffic do become bottlenecks, we can easily alleviate the problem by exploiting our event loops. The computing related to each event can be dispatched as a thread in a shared-memory model where only one process occupies all the cores inside a chip. This should allow easy scaling with the increase of the number of cores, and permit us to enjoy Moore's Law for many years to come.
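As a minimal sketch of this shared-memory model, the code below processes a list of independent events with a fixed number of standard C++ threads inside one process. Everything here is illustrative: processEvent is a hypothetical stand-in for real per-event work, and handing out events in a strided pattern to a fixed set of threads is just one simple way of dispatching them, not the scheme used by any particular framework.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the computation performed on one event.
double processEvent(double rawValue) {
    return 2.0 * rawValue;
}

// Processes all events inside one process using nThreads threads.
// Thread t handles events t, t + nThreads, t + 2 * nThreads, ...
// Each thread writes only to its own result slots, so no locking is needed.
std::vector<double> processAllEvents(const std::vector<double>& events, unsigned nThreads) {
    assert(nThreads > 0);
    std::vector<double> results(events.size());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&events, &results, t, nThreads] {
            for (std::size_t i = t; i < events.size(); i += nThreads) {
                results[i] = processEvent(events[i]);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();  // wait until every thread has finished its share
    }
    return results;
}
```

A natural choice for nThreads is the number of cores reported by std::thread::hardware_concurrency(), bearing in mind that this call is allowed to return 0 when the value cannot be determined. Because the input data and the results live in one address space, they are held in memory once rather than once per core.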

Compiler optimization

Our community writes almost all of the large software packages. Whether we think of event generators, simulation packages, reconstruction frameworks or analysis toolkits, we realize that they all have one thing in common: they are all in source format.

The most obvious way to optimize this software is to "tune" each package by using a tool that shows in a functional profile where the execution time is spent. Once a hot-spot is found, the source code is tweaked to see if performance improves. In rare cases, even the program design has to be revisited to correct severe performance issues.

In openlab, we have also taken another approach by working with compiler writers to improve the backend of the compiler, i.e. the part that is generating the binary code to be executed. The approach has great potential, because improvements in the code generator can lead to better performance across a range of applications – all the ones that exploit a given language feature, for instance. The approach presents a couple of tough challenges, though. The first is that you must master the "ancient" language "spoken" by the processor, which is called assembler or machine code. The second challenge is related to OOO execution, which makes the interpretation of execution speeds difficult: even if you believe that the compiler does something superfluously or inefficiently, you cannot assume that removal or simplification will result in a corresponding increase in speed.

It is beyond the scope of this article to cover this complex area in full detail, but let me list a few of the areas where programmers should try hard to assist the compilers, by paying attention to:
Memory disambiguation of data pointers or references. For humans it is often clear that pointers such as *in and *out refer to completely different memory areas. For a compiler with limited visibility of the code, this may not be at all obvious, which forces it to generate code sequences that are too "conservative".
Optimizable loop constructs. Compilers and "wide" processors really shine when loop constructs are exposed in all of the important portions of a program (see box 1).
Minimization of if and switch statements. Complex programs with nested if-else structures can easily limit the compiler's ability to create efficient code. If such a structure cannot be simplified, one should at least ensure that the most frequently executed code is at the top and not at the bottom of the construct.
Mathematical functions. Unnecessary calls should always be avoided since these functions are in general very expensive to calculate, as the compilers tend to lay them out as a single execution stream.
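Several of these points can be illustrated together in one small sketch. The function below is hypothetical. It uses the __restrict qualifier, which is a compiler extension supported by GCC, Clang and MSVC rather than standard C++, to promise that in and out do not overlap; it keeps the work in a short, simple loop; and it calls the cosine function once before the loop instead of once per element.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Multiplies every input element by cos(angle). Illustrative helper only.
// __restrict tells the compiler that in and out never refer to the same memory,
// so it does not have to generate "conservative" code for possible overlap.
void scaleByCos(const double* __restrict in, double* __restrict out,
                std::size_t n, double angle) {
    assert(in != out);                  // the no-overlap promise must really hold
    const double c = std::cos(angle);   // expensive call made once, not n times
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = c * in[i];             // small, regular loop that a compiler can optimize
    }
}
```

The promise made with __restrict is not checked by the compiler, so passing overlapping arrays leads to undefined behaviour; the assertion only catches the most obvious case.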

To improve profiling of the software execution, CERN openlab is actively working with the author of a powerful software package, called perfmon2. This package will soon become the universal interface in the Linux kernel to monitor performance of all supported processors.

Into the LHC era

For more than a decade HEP has been riding on the "commodity wave" of PC technologies. This has made gigantic computing resources available to our community. In January 2007, the count for the LHC Computing Grid showed that more than 30,000 processors are interconnected and this number will continue to grow.

There are, however, some worries to keep in mind. One is that we exploit the execution hardware at the 10–20% level, but given the cost ratio between "expensive" programmers and "cheap" hardware I do not expect that anybody is keen to revisit our program structures in a fundamental way.

Another worry is the megawatt ceiling on our computer centres. At CERN, we expect to saturate the cooling capability of our Computing Centre in a few years from now. It will then be impossible to add more computers to cope with the expected increase in demand and the only solution may be to optimize more to increase the efficiency of what is already installed.

On the positive side, multi-core systems that increase the number of available execution cores from generation to generation (on a constant power budget with increased energy efficiency) will definitely be in our favour. Vendors, such as Intel, tell us that we are seen as "ideal" customers for such systems. Let's just hope that the Googles and Yahoos of this world will also help create a strong demand. This is absolutely vital since it is somewhat unlikely that normal PC users at home will be equally enthusiastic about many-core systems where they only see advantages if they are running multiple processes in parallel, but little or no speed-up in the case of a single process.

Finally, it is important to remember what happened during the LEP era. We started with mainframes and supercomputers, moved on to RISC workstations and ended up with x86 PCs. All in all, this gave us more than a thousand times the computing power with which we started. I sincerely hope that the computer industry will help us perform the same miracles in the LHC era.