CTWatch
November 2007
Software Enabling Technologies for Petascale Science
Fred Johnson, Acting Director, Computational Science Research & Partnerships (SciDAC) Division
Office of Advanced Scientific Computing Research
DOE Office of Science

We begin with a discussion (Gibson et al.) of the future requirements for fault tolerant computing from the leaders of the Petascale Data Storage Institute (PDSI). Given the surprising consequences that scaling up often introduces, it seems to strike an appropriate note – sobriety based on experience. The PDSI team has been collecting and analyzing failure-rate data from contemporary HPC systems in an effort to understand the impact that scaling up to systems with millions of hardware elements will have on successful application execution in general, and on the requirements for next-generation storage systems in particular. The results of their timely analysis are thought-provoking. They show that, as systems scale up, conventional approaches to fault tolerance based on familiar checkpoint and restart may break down along various fronts, because the size and frequency of the checkpoints that must be taken on massive systems make the process unsustainable. Their analysis makes it clear that systems research in this area is destined to become more and more critical.
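
To make the scaling argument concrete, the short sketch below (not drawn from the PDSI analysis itself) applies Young's well-known approximation for the optimal checkpoint interval; the per-node mean time between failures and the checkpoint-write time are purely illustrative assumptions.

    # Back-of-the-envelope model of checkpoint/restart efficiency at scale.
    # Young's approximation gives the optimal checkpoint interval as
    #   tau_opt ~ sqrt(2 * delta * M),
    # where delta is the time to write one checkpoint and M is the
    # system-wide mean time between failures. The fraction of machine time
    # lost to checkpointing plus rework is then roughly
    #   delta / tau + tau / (2 * M).
    # All figures below are illustrative assumptions, not measurements.
    import math

    COMPONENT_MTBF_HOURS = 5.0e5   # assumed per-node mean time between failures
    CHECKPOINT_HOURS = 0.5         # assumed time to write one full checkpoint

    for nodes in (1_000, 10_000, 100_000, 1_000_000):
        system_mtbf = COMPONENT_MTBF_HOURS / nodes          # failures accumulate with node count
        tau_opt = math.sqrt(2 * CHECKPOINT_HOURS * system_mtbf)
        lost_fraction = CHECKPOINT_HOURS / tau_opt + tau_opt / (2 * system_mtbf)
        print(f"{nodes:>9} nodes: system MTBF {system_mtbf:8.2f} h, "
              f"optimal interval {tau_opt:5.2f} h, "
              f"useful work {100 * (1 - min(lost_fraction, 1.0)):5.1f}%")

Under these assumed numbers, a thousand-node machine still spends roughly 95% of its time on useful work, while at a million nodes the system MTBF drops below the checkpoint-write time and useful work collapses toward zero – the breakdown the PDSI analysis warns about.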

Three of the CETs focus on issues of software development and maintenance that are raised by the extreme demands of next-generation applications and the requirements of the HPC systems on which they must run. The scope of the Center for Technology for Advanced Scientific Component Software (TASCS), presented in Parker et al., is the most general. For the TASCS group, the increasing scale and complexity of SciDAC applications and systems software is itself a critical problem. They argue that a far higher degree of modularity is required in the software that implements the multi-physics, multi-scale simulations now being developed. The more stove-piped these applications are, the less smoothly and intelligently they will be able to adapt and innovate to meet the conditions that we know are coming – more parallelism, more data intensity, shorter mean time to failure, and so on. The core techniques, tools, components and best practices of the Common Component Architecture (CCA) that they survey in their article are designed to help solve this aspect of the scalability problem for the broad SciDAC community.

The other two code-oriented CETs – the Performance Engineering Research Institute (PERI) and the Center for Scalable Application Development Software (CScADS) – focus on application performance and programmer productivity in the context of systems designed with thousands or millions of multicore and/or heterogeneous processors. They share the common goal of providing a tool set for achieving high performance that is as automated and easy to use as possible, allowing researchers to keep their attention focused on the domain science questions at hand. Both have made concerted efforts, through sponsored workshops and direct contact, to engage with and leverage the experience of the SciDAC developer community, with initial emphasis in the areas of Fusion Energy and Combustion. Yet their work emphasizes different but complementary aspects of the problem. The PERI group (Bailey et al.) builds on a foundation of performance modeling, endeavoring to understand, through systematic empirical testing and analysis, how real-world applications behave on real-world systems. The knowledge gained thereby is then used to help guide the application design and development process through a variety of techniques, the more automated the better. By contrast, the CScADS group (Mellor-Crummey et al.) is exploring programming models that make the process of developing well-tuned, highly parallel software as easy and efficient as possible by innovatively combining high-level languages, scripting languages, compilers and other software tools. As these efforts converge, their collective results hold tremendous promise for the HPC developer community.

The CETs dedicated to scientific visualization have to confront the problem of petascale science from a uniquely important point of view, namely, where the bits meet the mind and the bandwidth is inherently limited. Their task is to find ways to enable scientists to fruitfully apply their observational capabilities, constrained as they are by nature, to some of the world’s largest and most complex datasets, using some of the world’s most massive and sophisticated computational platforms.
