Summary

Computer clusters for lattice QCD

In December 2005, Fermilab brought the new "Pion" computer cluster into service for lattice quantum chromodynamics (QCD) calculations. The cluster, which comprises 520 single-processor Intel Pentium computers interconnected by a high-performance network, Infiniband, takes advantage of the latest processors, memory technologies and networks. Lattice QCD supercomputers built with this technology can be produced relatively quickly and at lower cost than purpose-built systems, whose design, development and fabrication span several years.

In December 2005, Fermilab brought online a new cluster devoted to lattice quantum chromodynamics (lattice QCD) calculations. "Pion" consists of 520 single-processor Intel Pentium computers connected with Infiniband, and is the latest lattice QCD cluster funded by the US Department of Energy (DOE).

Lattice QCD is the numerical technique used to study QCD, the theory that describes the strong force. Because lattice QCD calculations require such enormous computing power, simplifying assumptions (such as the quenched approximation) have been required in the past to make progress on the supercomputers available (see CERN Courier June 2004 p23). In recent years, improvements in algorithms and a steady increase in the capabilities of computers have led to more complete lattice QCD simulations. This has enabled lattice theorists to make a number of predictions of physical quantities that were matched by new experimental results with equal precision (Aubin et al. 2005a and 2005b, Allison et al. 2005, Kronfeld et al. 2005 and CERN Courier July/August 2005 p13).

In lattice QCD computations, a finite volume of the space-time continuum is represented within a supercomputer by a four-dimensional lattice of sites. At each site, one or more SU(3) vectors, i.e. 3 × 1 arrays of complex numbers, represent the quark fields on the lattice. SU(3) matrices, i.e. 3 × 3 complex arrays, reside on the links between sites and represent the gluon fields that interact with the quark fields. Performing a Monte Carlo calculation to generate representative configurations of the QCD vacuum involves repeated sweeps through all sites, multiplying the link matrices by neighbouring site vectors and updating their values.
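To make these data structures concrete, the following sketch shows, in C, one plausible layout for a site vector and a link matrix, together with the 3 × 3 complex matrix times 3 × 1 vector product at the heart of each sweep. The type and function names are illustrative, not taken from any particular lattice QCD code.

```c
#include <complex.h>

/* Illustrative types: an SU(3) colour vector at a lattice site
   (3 complex doubles = 48 B) and an SU(3) link matrix on the link
   between two neighbouring sites (9 complex doubles = 144 B). */
typedef struct { double complex c[3];    } su3_vector;
typedef struct { double complex e[3][3]; } su3_matrix;

/* w = U * v: multiply a link matrix by a neighbouring site vector.
   The 9 complex multiplications (6 flops each) and 6 complex additions
   (2 flops each) give the 66 floating-point operations per
   matrix-vector product quoted later in the article. */
static void su3_mat_vec(const su3_matrix *U, const su3_vector *v, su3_vector *w)
{
    for (int i = 0; i < 3; i++) {
        double complex sum = U->e[i][0] * v->c[0];
        for (int j = 1; j < 3; j++)
            sum += U->e[i][j] * v->c[j];
        w->c[i] = sum;
    }
}
```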

In current lattice QCD problems, the representation of the lattice requires around 10 GB of memory and calculation rates of around 1 Tflops (10¹² floating-point operations per second) to make reasonable progress. A typical desktop PC is capable of about 1 Gflops.
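To give a sense of where a figure like 10 GB comes from, the estimate below is a rough back-of-the-envelope calculation; the lattice size used is an assumption for illustration, not a number from the article.

```latex
% Purely illustrative: assume a 32^3 x 96 lattice (about 3.1 million sites)
% with four 144 B link matrices per site for the gauge field.
\[
  32^{3} \times 96 \times 4 \times 144\,\mathrm{B} \approx 1.8\,\mathrm{GB}.
\]
% Quark fields (48 B per site each) and the workspace of the iterative
% solvers add several further lattice-sized arrays, bringing the total
% to the ~10 GB scale.
```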

In recent years, two approaches have been taken in the design of dedicated supercomputers for lattice QCD. The first uses custom processors that include on-chip connections to the neighbouring processors in three or more dimensions. Each processor chip also includes several megabytes or more of on-chip memory. The QCDOC (see CERN Courier September 2004 p17) and apeNEXT (see CERN Courier September 2004 p18) machines are examples of this system-on-a-chip approach. The second approach, used for Pion, relies on commodity computers connected via a high-performance network. Although such clusters resemble the large Linux farms used for reconstruction processing by experiments at Fermilab and CERN, it is the high-performance network that distinguishes the computer-cluster approach. Moreover, while the computers in reconstruction farms operate independently, those in a lattice QCD cluster are tightly coupled, constantly exchanging data during the computation.

Purpose-built systems like QCDOC and apeNEXT require several years of design, development and fabrication before they are ready for production. When new, these supercomputers provide the largest capability, with the fastest calculation for a single problem. By contrast, commodity clusters take advantage of each year's newest processors, memory technologies and networks; lattice QCD supercomputers based on clusters can therefore be built relatively quickly and frequently, and at a lower cost. In terms of performance, the purpose-built US QCDOC machine sustains 5 Tflops, compared with about 1 Tflops for Fermilab's Pion cluster. By the end of next summer, the successor to Pion at Fermilab will sustain about 2.25 Tflops.

The design of lattice QCD clusters requires careful attention to the balance between calculation speed, memory bandwidth and network performance to achieve the most cost-effective system. Lattice QCD codes require high memory bandwidth and strong floating-point performance. A double-precision SU(3) matrix-vector multiplication, for example, consumes 192 B of operands, produces 48 B of results and uses 66 floating-point operations. This bytes-to-flops ratio of more than 3.6 stresses memory subsystems. Typical Intel and AMD microprocessors, for example, are capable of more than 10 Gflops when their parallel floating-point instructions (Streaming SIMD Extensions, or SSE) are used. However, their typical peak memory bandwidths of about 6 GB/s fall far short of the 40 GB/s that would be required to feed and consume the data for SU(3) matrix-vector multiplications at the peak speed of the SSE floating-point unit. The calculations also stress the networks used to communicate data between the nodes of a cluster, requiring bandwidths and latencies superior to those provided by TCP/IP running over Gigabit Ethernet.
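For readers who wish to check these numbers, the arithmetic is sketched below, using the 16 B size of a double-precision complex number.

```latex
% Operands (one 3x3 matrix plus one 3-vector) and results (one 3-vector):
\[
  9 \times 16\,\mathrm{B} + 3 \times 16\,\mathrm{B} = 192\,\mathrm{B},
  \qquad
  3 \times 16\,\mathrm{B} = 48\,\mathrm{B}.
\]
% Flops: 9 complex multiplications (6 flops each) plus 6 complex additions
% (2 flops each), giving the bytes-to-flops ratio:
\[
  9 \times 6 + 6 \times 2 = 66,
  \qquad
  \frac{192 + 48}{66} \approx 3.6\ \mathrm{B/flop}.
\]
% At the ~10 Gflops peak of the SSE unit, 3.6 B/flop corresponds to roughly
% 36 GB/s of memory traffic, i.e. the ~40 GB/s quoted above.
```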

The networking requirements of lattice QCD on clusters have been met either by using high-performance switched networks, such as Myrinet, Quadrics or, more recently, Infiniband, or by using multiple Gigabit Ethernet networks running a specialized non-TCP/IP communications stack. The latter approach was used on two clusters built at the DOE's Jefferson Lab. The 2003 machine, "3G", uses six Gigabit Ethernet interfaces connected as a 3D toroidal mesh in which each computer can communicate directly only with its neighbours in the positive and negative x, y and z directions. The 2004 machine, "4G", uses a 5D toroidal mesh. The toroidal Gigabit Ethernet mesh machines, pioneered by Zoltan Fodor of Eötvös University, Budapest, and colleagues, were very cost-effective because of their use of commodity network interfaces (Z Fodor et al. 2003). However, in the past year the newer Infiniband network technology has become more favourable, offering greater performance and flexibility at a low cost. These factors led Fermilab to choose Infiniband for the Pion cluster. One advantage the switched Infiniband network provides for Pion is that any computer in the cluster may communicate directly with any other computer on the network.
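As an illustration of the tightly coupled, nearest-neighbour communication pattern that these networks must carry, the sketch below shows, in C with MPI, a halo exchange in the ±x direction on a periodic 3D process grid; the y and z directions are handled in the same way. It is a generic example, not code from 3G, 4G or Pion.

```c
#include <mpi.h>

/* Illustrative halo exchange on a periodic 3D process grid. Each process
   owns a sub-volume of the lattice and must swap its boundary ("halo")
   data with its nearest neighbours on every iteration. The Cartesian
   communicator is assumed to have been created with
   MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart_comm)
   with all periods set to 1 (a torus). */
void exchange_x_halo(double *send_minus, double *recv_plus,
                     double *send_plus, double *recv_minus,
                     int halo_count, MPI_Comm cart_comm)
{
    int left, right;  /* ranks of the -x and +x neighbours on the torus */
    MPI_Cart_shift(cart_comm, 0, 1, &left, &right);

    /* Send our -x face to the left neighbour while receiving the
       +x halo from the right neighbour. */
    MPI_Sendrecv(send_minus, halo_count, MPI_DOUBLE, left, 0,
                 recv_plus,  halo_count, MPI_DOUBLE, right, 0,
                 cart_comm, MPI_STATUS_IGNORE);

    /* And the opposite direction. */
    MPI_Sendrecv(send_plus,  halo_count, MPI_DOUBLE, right, 1,
                 recv_minus, halo_count, MPI_DOUBLE, left, 1,
                 cart_comm, MPI_STATUS_IGNORE);
}
```

On the toroidal Gigabit Ethernet machines only such nearest-neighbour transfers map directly onto the hardware, whereas Pion's switched Infiniband fabric also lets any pair of nodes communicate directly.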

Pion uses Intel Pentium 640 processors, each running at 3.2 GHz. With 800 MHz memory buses, these machines have one of the highest memory bandwidths per processor available. Pion was built in two halves: the first 260 nodes were installed last May and the second 260 in November. The price per node dropped from $1970 in the spring to $1550 in the autumn, of which approximately $840 was for the computer and $650 for the Infiniband hardware. Pion exceeds 1.7 Gflops per processor on lattice QCD codes, or roughly $0.90 per Mflops for the second half.
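The quoted price-to-performance figure follows directly from these numbers:

```latex
% Price-to-performance of the second half of Pion:
\[
  \frac{\$1550\ \mathrm{per\ node}}{1700\ \mathrm{Mflops\ per\ node}}
  \approx \$0.91\ \mathrm{per\ Mflops}.
\]
```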

On 1 October 2005 the DOE launched a four-year project to support lattice QCD. This follows five years of support for lattice QCD by the DOE Office of Science via the SciDAC Lattice Gauge Theory Computational Infrastructure project, as well as grants used to purchase the US QCDOC in 2004 and 2005. The project will operate the 5 Tflops US QCDOC machine, which resides at Brookhaven National Laboratory, as well as the clusters at Jefferson Lab (3G and 4G, which total 0.55 Tflops) and Fermilab ("QCD", a 0.15 Tflops, 128-node Myrinet cluster built in 2004, and the 1 Tflops Pion).

The project will also support the construction of new machines. In the first year, two clusters will be constructed: a 0.5 Tflops Infiniband cluster based on 256 dual-core Intel processors at Jefferson Lab will come online at the end of March, and a 2.25 Tflops Infiniband cluster based on 1000 dual-core Intel processors at Fermilab will come online by the end of September. Other systems planned for 2007-2009 will provide an additional 11 Tflops of computing power.

Further reading


C Aubin et al. 2005a Phys. Rev. Lett. 94 011601.
C Aubin et al. 2005b Phys. Rev. Lett. 95 122002.
I F Allison et al. 2005 Phys. Rev. Lett. 94 172001.
A S Kronfeld et al. 2005 Proc. Sci. LAT2005 206.
Z Fodor et al. 2003 Comput. Phys. Commun. 152 121.