IEEE Spectrum:

The global race to build more powerful supercomputers is focused on the next big milestone: a supercomputer capable of performing 1 million trillion floating-point operations per second (1 exaflops). Such a system will require a big overhaul of how these machines compute, how they move data, and how they’re programmed. It’s a process that might not reach its goal for eight years. But the seeds of future success are being designed into two machines that could arrive in just two years.

China and Japan each seem focused on building an exascale supercomputer by 2020. But the United States probably won’t build its first practical exascale supercomputer until 2023 at the earliest, experts say. To hit that target, engineers will need to do three things. First, they’ll need new computer architectures capable of combining tens of thousands of CPUs and graphics-processor-based accelerators. They’ll also need to deal with the growing energy cost of moving data from a supercomputer’s memory to its processors. Finally, software developers will have to learn how to build programs that can make use of the new architecture.
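As a rough sketch of what "programs that make use of the new architecture" means at the smallest scale, here is a minimal C++ example using OpenMP target-offload directives, one of several directive-based models discussed for hybrid CPU/GPU nodes. This is purely illustrative, not the toolchain of any particular machine; note how the map clauses make moving data between host memory and the accelerator an explicit, and costly, part of the program.

    // Minimal sketch: offloading a vector update (y = a*x + y) to an attached
    // accelerator with OpenMP "target" directives. Illustrative only; real
    // supercomputer codes layer this inside far larger frameworks.
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 20;
        const double a = 2.0;
        std::vector<double> x(n, 1.0), y(n, 3.0);
        double* xp = x.data();
        double* yp = y.data();

        // map(to:...) copies x to the device; map(tofrom:...) copies y over
        // and back. Every such transfer costs time and energy, which is why
        // data movement dominates the exascale discussion.
        #pragma omp target teams distribute parallel for \
            map(to: xp[0:n]) map(tofrom: yp[0:n])
        for (std::size_t i = 0; i < n; ++i)
            yp[i] = a * xp[i] + yp[i];

        std::printf("y[0] = %f\n", yp[0]);  // expect 5.0
        return 0;
    }

Compiled without an offloading compiler, the same loop simply runs on the host CPUs, which is part of the appeal of directive-based approaches.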

[...]

The U.S. Department of Energy recently announced that it will invest $325 million in a pair of supercomputers—capable of performing one-tenth of an exaflops or more—being developed by IBM, Mellanox, Nvidia Corp., and other companies for a 2017 debut. The planned supercomputers, named Summit and Sierra, will rely on a new computer architecture that stacks memory near the Nvidia GPU accelerators and IBM CPUs. That architecture’s method of minimizing the energy cost of moving data between memory and the processors is a big step toward exaflops supercomputers, experts say.
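The same concern shows up on the software side. In this hedged sketch (again C++ with OpenMP directives, chosen for illustration rather than taken from Summit or Sierra's actual software stack), arrays stay resident on the accelerator for an entire time loop, so the transfer cost is paid once instead of on every step, which is the programming analogue of placing memory physically next to the processor.

    // Sketch: amortizing data movement by keeping arrays resident on the
    // device for the whole time loop instead of copying them every step.
    #include <cstddef>
    #include <vector>

    void run(std::vector<double>& u, int steps) {
        const std::size_t n = u.size();
        std::vector<double> tmp(n);
        double* up = u.data();
        double* tp = tmp.data();

        // The target-data region transfers u once on entry and once on exit;
        // tmp is allocated on the device and never copied back to the host.
        #pragma omp target data map(tofrom: up[0:n]) map(alloc: tp[0:n])
        {
            for (int s = 0; s < steps; ++s) {
                // One Jacobi-style smoothing sweep, computed on the device.
                #pragma omp target teams distribute parallel for
                for (std::size_t i = 1; i + 1 < n; ++i)
                    tp[i] = 0.5 * (up[i - 1] + up[i + 1]);

                // Copy the interior back into u on the device (no host traffic).
                #pragma omp target teams distribute parallel for
                for (std::size_t i = 1; i + 1 < n; ++i)
                    up[i] = tp[i];
            }
        }
    }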

Practical exascale computing will need additional development of stacked memory and faster, more energy-efficient interconnects to boost the performance of densely packed supercomputer chips, Simon explains. But he anticipates the need for other technological tricks too. One such technology—silicon photonics—would use lasers to provide low-power data links within the system.

Power and cost aren’t the only problems preventing practical exascale systems. The risk of hardware failures grows as new supercomputers pack in a greater number of components, says Bronis de Supinski, chief technical officer for Livermore Computing at Lawrence Livermore National Laboratory, in California. His lab’s IBM Blue Gene/Q supercomputer, named Sequoia, currently has a mean time between failures of 3.5 to 7 days. Such a window could shrink to just 30 minutes for an exascale system.
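The arithmetic behind that shrinking window is easy to see under a common simplifying assumption: if component failures are independent and roughly exponential, the whole system's mean time between failures falls in proportion to 1/N as the component count N grows. The numbers in this C++ sketch are invented for illustration; they are not Sequoia's actual component counts or failure rates.

    // Back-of-the-envelope MTBF scaling: with independent, roughly exponential
    // failures, system MTBF is about (per-component MTBF) / (component count).
    // All figures below are illustrative assumptions, not measured data.
    #include <cstdio>

    int main() {
        const double component_mtbf_hours = 5.0e6;  // hypothetical: ~570 years per part
        const double component_counts[] = {1.0e5, 1.0e6, 1.0e7};

        for (double parts : component_counts) {
            double system_mtbf_hours = component_mtbf_hours / parts;
            std::printf("%10.0f parts -> system MTBF ~ %5.2f hours\n",
                        parts, system_mtbf_hours);
        }
        return 0;
    }

With these made-up numbers, growing the part count a hundredfold drags the system from a couple of days between failures down to about half an hour, the same order of magnitude quoted above.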

That’s hardly enough time for researchers to run complex simulations or other applications. But software capable of automatically restarting programs could help supercomputing systems recover from some hardware errors. “This is an instance in which the physical realities of hardware...end up creating challenges which we have to handle in software,” de Supinski says.
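What "automatically restarting programs" typically looks like is classic checkpoint/restart: the application periodically writes its state to stable storage, and after a failure it resumes from the newest checkpoint rather than from the beginning. Below is a minimal, self-contained sketch; the file name and state layout are invented for illustration and this is not Livermore's actual resilience software.

    // Minimal checkpoint/restart sketch: save the simulation state every few
    // steps; on startup, resume from the last checkpoint if one exists.
    #include <cstdio>
    #include <vector>

    const char* kCheckpoint = "state.ckpt";  // hypothetical checkpoint file

    struct State {
        int step = 0;
        std::vector<double> data = std::vector<double>(1024, 0.0);
    };

    bool load(State& s) {
        std::FILE* f = std::fopen(kCheckpoint, "rb");
        if (!f) return false;
        bool ok = std::fread(&s.step, sizeof s.step, 1, f) == 1 &&
                  std::fread(s.data.data(), sizeof(double),
                             s.data.size(), f) == s.data.size();
        std::fclose(f);
        return ok;
    }

    void save(const State& s) {
        std::FILE* f = std::fopen(kCheckpoint, "wb");
        if (!f) return;
        std::fwrite(&s.step, sizeof s.step, 1, f);
        std::fwrite(s.data.data(), sizeof(double), s.data.size(), f);
        std::fclose(f);
    }

    int main() {
        State s;
        if (load(s))
            std::printf("resuming from step %d\n", s.step);

        const int total_steps = 1000, checkpoint_every = 100;
        for (; s.step < total_steps; ++s.step) {
            // ... one step of the real computation would go here ...
            s.data[0] += 1.0;

            // If the job dies after this point, at most checkpoint_every
            // steps of work are lost.
            if (s.step % checkpoint_every == 0)
                save(s);
        }
        save(s);
        return 0;
    }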


(Emphasis added.)

Interesting stuff.

Cheers,
Scott.