CTWatch
February 2005
Trends in High Performance Computing
Susan L. Graham, University of California at Berkeley
Marc Snir, University of Illinois at Urbana-Champaign

Problems in the Offing

Memory latency continues to increase, relative to processor speed: An extrapolation of current trends would lead to the conclusion that by 2020 a processor will execute about 800 loads and 90,000 floating point operations while waiting for one memory access to complete — an untenable differential. While the problem affects all processors, it affects scientific computing and high-performance computing earlier, as commercial codes can usually take better advantage of caches.
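The gap compounds geometrically: if processor speed improves much faster each year than memory latency, the number of operations completed per memory access multiplies year over year. A minimal back-of-the-envelope sketch of such an extrapolation, in which the annual improvement rates and the 2005 baseline ratio are illustrative assumptions rather than figures from the article:

```python
# Illustrative extrapolation of the processor/memory gap.
# ASSUMPTIONS (not from the article): processor speed improves ~50%/yr,
# memory latency improves only ~7%/yr, and a 2005 processor completes
# roughly 100 flops during one memory access.
PROC_GROWTH = 1.50   # assumed annual processor speed improvement
LAT_GROWTH = 1.07    # assumed annual memory-latency improvement

def flops_per_memory_access(years, base_ratio=100.0):
    """Flops a processor completes while one memory access is in flight,
    `years` after the baseline."""
    return base_ratio * (PROC_GROWTH / LAT_GROWTH) ** years
```

Under these assumed rates, the ratio grows by more than two orders of magnitude over fifteen years; the article's figures (800 outstanding loads, 90,000 flops per access by 2020) follow from the same kind of compounding, with its own assumed rates.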

Global communication latency continues to increase and global bandwidth continues to decrease, relative to node speed. Again, an extrapolation of current trends would lead by 2020 to a global bandwidth of about 0.001 word per flop and a global latency equivalent to the time of almost one million floating-point operations. The problem affects tightly coupled HPC applications much more than loosely coupled commercial workloads.
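These two parameters combine in the standard linear communication-cost model, T = latency + words / bandwidth, here expressed in flop-equivalents so the cost of one message can be read as "flops the node could have executed instead." A hedged sketch, where the 2020 values restate the article's extrapolation and the 2005-era baseline is an illustrative assumption:

```python
# Linear communication-cost model in flop-equivalents.
def message_cost_in_flops(words, latency_flops, words_per_flop):
    """Flops a node could have executed while one message of `words`
    is in flight, given latency (in flop-times) and bandwidth
    (in words per flop)."""
    return latency_flops + words / words_per_flop

# Assumed 2005-era baseline: latency ~1e4 flop-times, 0.05 word/flop.
cost_2005 = message_cost_in_flops(1000, 1e4, 0.05)
# The article's 2020 extrapolation: ~1e6 flop-times, 0.001 word/flop.
cost_2020 = message_cost_in_flops(1000, 1e6, 0.001)
```

Even for a modest 1000-word message, the extrapolated 2020 parameters make a single exchange cost millions of flop-equivalents, which is why tightly coupled applications suffer first.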

Improvement in single-processor performance is slowing down: it is hard to further increase pipeline depth or instruction-level parallelism, so growing chip gate counts contribute little to single-processor performance. To stay on Moore's curve of microprocessor performance, vendors need to use increasing levels of on-chip multiprocessing. This is not a major problem for many commercial applications, which can cope with modest levels of parallelism, but it will be a problem for high-end supercomputers, which will need to cope with hundreds of thousands of concurrent threads.

As circuit size shrinks and the number of circuits in a large supercomputer grows, mean time to failure decreases. The largest computer systems are more affected by this problem than modest-size computers.

It's the Software, Stupid

Although clusters have reduced the hardware cost of supercomputing, they have increased the programming effort needed to implement large parallel codes. Scientific codes, and the platforms these codes run on, have become more complex, while the programming environments used to develop them have seen little progress. As a result, software productivity is low. Programming is done using message-passing libraries that are low-level and incur large communication overheads. No higher-level programming notation has emerged that adequately captures parallelism and locality, the two main algorithmic concerns of parallel programming. The application development environments and tools used to program complex parallel scientific codes are generally less advanced and less robust than those used for general commercial computing.

Hybrid or custom systems could support more efficient parallel programming models, e.g., models that use global memory. But this potential is largely unrealized, because of the very low investment in supercomputing software such as compilers, the desire to maintain compatibility with the prevalent cluster architecture, and the fear of investing in software that runs only on architectures that may disappear in a few years. The software problem will worsen as higher levels of parallelism are required and as global communication becomes relatively slower.
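The gap between the two styles can be illustrated with a toy one-dimensional stencil written both ways. This is a hedged sketch in plain Python, not real MPI or PGAS code: the "ranks," ghost-cell exchange, and function names are all illustrative stand-ins for what the two programming models make the programmer write.

```python
def stencil_message_passing(chunks):
    """Message-passing style (MPI-like, simulated): the data is split
    across 'ranks', and each rank must explicitly obtain ghost cells
    from its neighbors before it can compute its local piece."""
    n = len(chunks)
    out = []
    for rank, chunk in enumerate(chunks):
        # Simulated receive of boundary values from neighboring ranks.
        left = chunks[rank - 1][-1] if rank > 0 else 0.0
        right = chunks[rank + 1][0] if rank < n - 1 else 0.0
        padded = [left] + chunk + [right]
        # Local stencil: each point becomes the sum of its neighbors.
        out.extend(padded[i - 1] + padded[i + 1]
                   for i in range(1, len(padded) - 1))
    return out

def stencil_global_memory(x):
    """Global-memory style: the programmer writes one whole-array
    expression; distribution and communication become the problem of
    the compiler and runtime."""
    padded = [0.0] + x + [0.0]
    return [padded[i - 1] + padded[i + 1] for i in range(1, len(padded) - 1)]
```

Both functions compute the same result, but the message-passing version forces the programmer to manage decomposition and boundary exchange by hand; a global-memory model moves that burden into the system software, which is exactly the software investment the passage argues has not been made.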


Reference this article
Graham, S., Snir, M. "The NRC Report on the Future of Supercomputing," CTWatch Quarterly, Volume 1, Number 1, February 2005. http://www.ctwatch.org/quarterly/articles/2005/02/nrc-report/
