CTWatch
August 2005
The Coming Era of Low Power, High-Performance Computing — Trends, Promises, and Challenges
Jose Castanos, George Chiu, Paul Coteus, Alan Gara, Manish Gupta, Jose Moreira, IBM T.J. Watson Research Center

4
System Software

The system software for Blue Gene/L was designed with two key goals: familiarity and scalability. First, we wanted high performance computing users to be able to migrate their parallel application codes to the Blue Gene/L platform with relative ease. Second, we wanted the operating environment to allow parallel applications to scale to the unprecedented level of 64K nodes (128K processors). This requires scaling not only in performance but also in reliability. A simple mean-time-between-failures calculation shows why: if the software on a compute node fails about once a month, and failures on different nodes are independent, then across 64K nodes a failure would be expected once every 40 seconds (roughly 2.6 million seconds in a month divided by 65,536 nodes)! Compute node software must therefore be highly reliable.
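The 40-second figure can be checked with a short calculation. The sketch below assumes a 30-day month and treats node failures as independent with a constant rate, so the system-wide interval is the per-node interval divided by the node count; the function and variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def system_failure_interval(node_interval_s: float, node_count: int) -> float:
    """Expected time between failures anywhere in the machine.

    Assumes every node fails independently at the same constant rate,
    so the failure rates add and the interval shrinks by node_count.
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return node_interval_s / node_count


# "About once a month" is taken here as 30 days (an assumption).
node_interval = 30 * SECONDS_PER_DAY
nodes = 64 * 1024  # 64K compute nodes

interval = system_failure_interval(node_interval, nodes)
print(f"{interval:.1f} s between failures")  # prints "39.6 s between failures"
```

Under these assumptions the result is just under 40 seconds, matching the estimate in the text.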

We have developed a programming environment based on familiar programming languages (Fortran, C, and C++) and the single program multiple data (SPMD) programming model, with message passing supported through the Message Passing Interface (MPI) library. This has allowed several large scientific applications to be ported to Blue Gene/L with modest effort (often within a day).

We have relied on simplicity and a hierarchical organization to achieve scalability of software in terms of both performance and reliability. Two major design simplifications that we have imposed are:

  • Strict space sharing: only one parallel job can run at a time on a Blue Gene/L partition. We go one step further and support only one thread of execution per processor. This allows us to use efficient, user-space communication without protection problems (Blue Gene/L partitions are electrically isolated). Furthermore, having a dedicated processor behind every application-level thread leads to more deterministic execution and higher scalability.
  • No demand paging support: the virtual memory available on a node is limited to the physical memory size. Besides simplifying the compute node kernel, this restriction means there are no page faults or translation lookaside buffer misses during program execution, which yields higher and more deterministic performance.

The software for Blue Gene/L is organized as a three-tier hierarchy. On the compute node, a lightweight kernel, together with the runtime library that supports user applications, constitutes the programming environment. Each I/O node, which can be viewed as the parent of a set of compute nodes (referred to as a processing set, or pset), runs Linux and provides a more complete range of operating system services, including file I/O and sockets, to applications through requests offloaded from the compute nodes. The Linux kernel on the I/O nodes also provides support for job launch. Finally, the control system services run on a service node, which is connected to the Blue Gene/L computational core via a control network.
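The parent relationship between an I/O node and its pset can be pictured as a simple mapping. The sketch below is illustrative only: it assumes compute nodes are numbered contiguously and that every pset has the same size, and the pset size of 64 is a hypothetical parameter, not a value taken from this article.

```python
def io_node_for(compute_node: int, pset_size: int) -> int:
    """Index of the I/O node that serves a given compute node.

    Assumes contiguous numbering: compute nodes 0..pset_size-1 form
    pset 0, the next pset_size nodes form pset 1, and so on.
    """
    if compute_node < 0 or pset_size <= 0:
        raise ValueError("compute_node must be >= 0 and pset_size > 0")
    return compute_node // pset_size


def pset_members(io_node: int, pset_size: int) -> range:
    """Compute nodes whose offloaded I/O requests go to this I/O node."""
    if io_node < 0 or pset_size <= 0:
        raise ValueError("io_node must be >= 0 and pset_size > 0")
    start = io_node * pset_size
    return range(start, start + pset_size)


PSET_SIZE = 64  # hypothetical compute-node-to-I/O-node ratio

# A file I/O request from compute node 130 would be handled by I/O node 2.
print(io_node_for(130, PSET_SIZE))
```

The service node sits above this mapping and is not modeled here; the sketch covers only the compute node and I/O node tiers.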

Reference this article
Castanos, J., Chiu, G., Coteus, P., Gara, A., Gupta, M., Moreira, J. "Lilliputians of Supercomputing Have Arrived!," CTWatch Quarterly, Volume 1, Number 3, August 2005. http://www.ctwatch.org/quarterly/articles/2005/08/lilliputians-of-supercomputing-have-arrived/
