The e-Science Challenge: Creating a Reusable e-Infrastructure for Collaborative Multidisciplinary Science
Tony Hey, Corporate Vice President for Technical Computing – Microsoft Corporation
Anne Trefethen, Director – UK e-Science Core Programme, EPSRC
CTWatch Quarterly
November 2005

1. Introduction

This issue of CTWatch Quarterly contains four articles that provide an overview of some of the major Grid projects in Europe. All these projects aim to provide scientists with distributed, collaborative research capabilities built on a persistent middleware infrastructure deployed on top of the high-bandwidth research networks. This combination of middleware services running on high-speed networks is called ‘e-Infrastructure’ in Europe and ‘Cyberinfrastructure’ in the USA. In this brief article we abstract the key elements of such an e-Infrastructure from these projects and from our experience in the UK e-Science program. We then look at the problems of creating and implementing a sustainable, global e-Infrastructure that will enable multidisciplinary and collaborative research across a wide range of disciplines and communities.

2. Background

The UK e-Science Initiative began in April 2001, and over the last four years more than £250M has been invested in science applications and middleware development. In addition, the program has created a pipeline from the science base to genuine industrial applications of this technology and, most importantly, has enabled the creation of a vibrant, multidisciplinary e-Science community. This community comes together in its totality at the UK’s annual e-Science All Hands Meeting, held each September, which now attracts over 650 participants who share experience and technologies. These meetings have brought together an exciting mix of scientists, computer scientists, IT professionals, industrial collaborators and, more recently, social scientists and researchers in the arts and humanities. Research scientists from all domains of science and engineering (particle physics, astronomy, chemistry, physics, all flavours of engineering, environmental science, bioinformatics, medical informatics and social science), as well as the arts and humanities, are beginning to appreciate the need for e-Science technologies that will allow them to make progress with the next generation of research problems. In most cases, researchers now face an increasingly difficult burden: managing and storing vast amounts of data, and analyzing, combining and mining those data to extract useful information and knowledge. Often this involves automating the annotation of data with relevant metadata, as well as constructing search engines and workflows that capture complex usage patterns of distributed data and compute resources. Most of these problems, and the tools and techniques to tackle them, are similar across many different types of application. It makes no sense for each community to develop these basic tools in isolation. We need to identify and capture a set of generic middleware services and deploy them on top of the high-bandwidth research networks to constitute a reusable e-Infrastructure. In the UK e-Science Initiative, this task of identifying and implementing the key features of a national e-Infrastructure was the remit of the Core Program.

The phrase e-Infrastructure–or Cyberinfrastructure in the US–is used to emphasize that these applications will be facilitated by a set of services that permit easy but controlled access to the traditional infrastructure of science–supercomputers, high performance clusters, networks, databases and experimental facilities. The e-Science challenge is to provide a set of Grid middleware services that are sufficiently robust, powerful and easy to use that application scientists are freed from re-inventing such low-level ‘plumbing’ and can concentrate on their science. A second challenge is to make this combination of middleware and hardware into a truly sustainable e-Infrastructure in much the same way as we take for granted the research networks of today.

3. Requirements for a Sustainable e-Infrastructure

The Grid projects referred to in these articles, as well as the national e-Science programmes in Europe, give us a good idea of what is required to create such a persistent, global e-Infrastructure for e-Science. In the UK the key elements have been identified as:1

  1. A competitive network of National Research and Education Networks (NRENs) together with their ‘CERT’ teams for security monitoring and emergency response. In the UK, the SuperJANET5 network and the CERT are run by UKERNA. Across Europe, the EU has funded the Dante organization to manage the GEANT2 network that connects the European NRENs.
  2. A secure national and internationally accepted framework for multiple levels of authentication and authorisation. This must support both access within individual institutions and dynamic, cross-boundary ‘Virtual Organizations’ of research groups from different institutions (a minimal illustrative sketch follows this list).
  3. A collection of software centres and repositories for open source reference implementations of open-standards-compliant infrastructure middleware. This will require the participation or creation of organizations with serious software engineering capability to research, support and maintain this middleware.
  4. A national focus on digital ‘curation’ to provide scientists with support and guidance on the long-term preservation of both research data and traditional publications. By curation we mean the annotation of data with metadata to enable efficient searching and provenance tracking.
  5. Integrated access to national data sets and publications, which is emerging via the developing Open Access subject and institutional repositories. Examples in the UK include the Arts and Humanities Data Service (AHDS), the Economic and Social Data Service (ESDS), the JISC-funded EDINA and MIMAS services that offer national data resources for education and research, and the resources of the British Library.
  6. Remote access to large scale facilities such as Diamond and ISIS in the UK and the LHC, VLT and ITER internationally. Increasingly, scientists will have to pool their financial resources and perform experiments on facilities procured at a global level. For example, the particle physicists intend to use middleware developed in the EGEE project to create an LHC Grid for the distribution and analysis of data from the machine in Geneva.
  7. A set of national services for high-end computing (HEC) and Grid computing, as well as for data services and long-term data archiving. High-end supercomputers are clearly an important resource for computational scientists, but there is also a need for more modest cluster resources.
  8. National and international centres to foster a strong culture of multidisciplinary research and to provide training in new informatics technologies. Much of the new e-Science will be international, and there is a need for a strong program of activity dedicated to building and educating a new multidisciplinary community of scientists.
  9. Strong involvement in international standards activities, both for infrastructure and for each of the global research communities. The Global Grid Forum (GGF) is focussing on developing a set of standards for infrastructure services, while community organizations such as the International Virtual Observatory Alliance are delivering interoperability standards for the astronomy community.
  10. Development of tools and services to support multidisciplinary and collaborative environments. These include portals providing access to quality data and services, national service and ontology registries and tools to support workflow and track provenance.
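
To make requirement 2 above more concrete, the sketch below is a deliberately simplified illustration of authorisation within a ‘Virtual Organization’: a resource at one institution admits a researcher from another purely on the basis of VO membership and role. It is written in plain Python with made-up class and member names; real deployments rely on X.509/GSI credentials and VO management services rather than anything like this toy model.

    # Illustrative sketch only: a toy model of VO-based authorisation.
    from dataclasses import dataclass, field

    @dataclass
    class VirtualOrganisation:
        """A cross-institutional research collaboration and its membership."""
        name: str
        members: set = field(default_factory=set)   # certificate subject names
        roles: dict = field(default_factory=dict)   # subject -> set of granted roles

        def add_member(self, subject: str, role: str = "member") -> None:
            self.members.add(subject)
            self.roles.setdefault(subject, set()).add(role)

        def is_authorised(self, subject: str, required_role: str) -> bool:
            """Authorise a request if the certificate subject holds the required role."""
            return subject in self.members and required_role in self.roles.get(subject, set())

    # A resource admits a researcher from another institution on the basis of
    # VO membership and role, not on a local account.
    vo = VirtualOrganisation("climate-modelling-vo")
    vo.add_member("/C=UK/O=eScience/OU=Oxford/CN=a.researcher", role="job-submit")

    print(vo.is_authorised("/C=UK/O=eScience/OU=Oxford/CN=a.researcher", "job-submit"))  # True
    print(vo.is_authorised("/C=UK/O=eScience/OU=Leeds/CN=someone.else", "job-submit"))   # False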

4. An Example: The Emerging UK e-Infrastructure

With funding from the e-Science Core Program, the UK is now in the process of implementing a prototype national e-Infrastructure. Several key components have been developed and these include:

4.1 An Open Middleware Infrastructure Institute

The e-Science Core Program has recently funded the continuation of an Open Middleware Infrastructure Institute (OMII-UK). In its second phase, the Institute is built on three existing centres and leverages their joint user groups and the different competencies of the three teams. The lead partner is the original OMII at the University of Southampton, which was set up in 2004 to provide well-engineered e-Science middleware sourced from the e-Science community.2 They have now been joined by the OGSA-DAI team in Edinburgh, which has developed middleware to support data access and integration that is now used worldwide,3 and by the myGrid project, which since 2001 has developed a set of workflow-based tools that have been widely adopted to support researchers in the bioinformatics community.4

By combining the expertise of these groups in OMII-UK, the e-Science Core Program has established a powerful source of well-engineered software, which should enable an integrated approach to the provision of higher level and more advanced tools. A dialogue is taking place with similar organizations in other countries, such as the NSF Middleware Initiative (NMI) in the US and a new organization, OMII-China, in Beijing.
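
The kind of capability this software provides can be illustrated with a small sketch. The Python below mimics, in a deliberately simplified form, the style of bioinformatics workflow that myGrid’s tools help researchers compose: a chain of remote services in which each step consumes the previous step’s output. The function names and service behaviour are hypothetical placeholders written for this article, not the myGrid or OGSA-DAI APIs.

    # Illustrative sketch only: a workflow expressed as a chain of service calls.
    def fetch_sequence(accession: str) -> str:
        """Stand-in for a call to a remote sequence-retrieval service."""
        return "MKTAYIAKQR"          # dummy protein sequence

    def run_similarity_search(sequence: str) -> list:
        """Stand-in for a call to a remote similarity-search service."""
        return ["hit-001", "hit-002"]

    def annotate(hits: list) -> dict:
        """Stand-in for a call to a remote annotation service."""
        return {hit: "putative kinase" for hit in hits}

    def workflow(accession: str) -> dict:
        # Each step consumes the previous step's output; a real workflow engine
        # would also record provenance: who ran what, when, and on which data.
        sequence = fetch_sequence(accession)
        hits = run_similarity_search(sequence)
        return annotate(hits)

    print(workflow("P12345"))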

4.2 A National Grid Service

Established in April 2004, the National Grid Service (NGS) builds on the experiences of the UK e-Science community.5 At the core of the service are two compute clusters and two data clusters, located at the Universities of Manchester, Oxford, and Leeds and at the Rutherford Appleton Laboratory. These are supported by the Grid Operations Support Centre (GOSC), which maintains the UK e-Science Certificate Authority, runs a help desk, and provides training for administrators and IT professionals.

The NGS has now connected compute resources at several partner sites. Presently there are three such associate sites: Cardiff, Lancaster, and Bristol Universities. In addition to this production service, other UK e-Science Centres play an important role in evaluating and testing Grid middleware as part of the software appraisal process for the NGS and OMII-UK. Since the NGS entered production mode, the number of registered users has risen to over 300, spanning a broad range of application areas.

At present, the core middleware of the NGS production grid is based on the Globus GT2 Toolkit and the SDSC Storage Resource Broker (SRB). As the Web Services versions of Grid middleware mature, it is expected that the NGS will migrate to a set of middleware services compliant with the GGF’s Open Grid Services Architecture (OGSA). It is also expected that this set will include software from OMII-UK as well as from the NMI and the EGEE project described in this publication.
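
For readers unfamiliar with this middleware, the following sketch shows how a user’s script might drive the GT2-era command-line tools on which the NGS core middleware is based: obtain a proxy credential, stage data with GridFTP, and run a job through a remote job manager. The host names and file paths are hypothetical placeholders, and the sketch assumes a Globus Toolkit 2 client installation and a valid UK e-Science certificate.

    # Illustrative sketch only: driving GT2 command-line tools from Python.
    import subprocess

    # 1. Create a short-lived proxy credential from the user's X.509 certificate
    #    (grid-proxy-init prompts for the certificate passphrase).
    subprocess.run(["grid-proxy-init"], check=True)

    # 2. Stage input data onto a remote cluster with GridFTP.
    subprocess.run(["globus-url-copy",
                    "file:///home/user/input.dat",
                    "gsiftp://grid-compute.example.ac.uk/scratch/user/input.dat"],
                   check=True)

    # 3. Run the analysis through the remote cluster's job manager and print its output.
    job = subprocess.run(["globus-job-run",
                          "grid-compute.example.ac.uk/jobmanager-pbs",
                          "/usr/local/bin/analyse", "/scratch/user/input.dat"],
                         check=True, capture_output=True, text=True)
    print(job.stdout)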

4.3 A Digital Curation Centre

The Digital Curation Centre (DCC) has been established in Edinburgh by the e-Science Core Program and the JISC. Its role is to support best practice and to pursue research in data curation and digital preservation.6 In particular, it is working with different application communities to understand their specific challenges and identify best practice. The Centre will provide advice and support services to UK researchers and institutions. It is clear that over the next five years many scientists are likely to be swamped with data. Managing the whole data chain, from acquisition and annotation through to integration and preservation, will be a major challenge. Tools to support collaborative working, workflow, provenance and high performance visualization will be needed. In some communities there are also business or legal requirements for long-term data preservation and access, for example with engineering drawings and clinical records.
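
As a small illustration of what curation involves in practice, the sketch below bundles descriptive metadata, a provenance trail and a content checksum alongside a data file so that the data remain searchable and interpretable over the long term. The record layout is a made-up example for this article only, not a DCC or community standard.

    # Illustrative sketch only: a toy curation record for a small data file.
    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def curation_record(data_file: Path, metadata: dict, history: list) -> dict:
        """Bundle descriptive metadata and a provenance trail with a content checksum."""
        return {
            "file": data_file.name,
            "sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
            "metadata": metadata,      # what the data are: title, units, instrument, ...
            "provenance": history,     # how the data were produced, and by whom
            "archived": datetime.now(timezone.utc).isoformat(),
        }

    # A toy data file and the record that would accompany it into an archive.
    data = Path("survey_2005.csv")
    data.write_text("site,temperature_c\nA,11.2\nB,10.7\n")

    record = curation_record(
        data,
        metadata={"title": "Field survey 2005", "units": {"temperature_c": "degrees Celsius"}},
        history=[{"step": "calibration", "tool": "calibrate.py", "by": "a.researcher"}],
    )
    Path("survey_2005.metadata.json").write_text(json.dumps(record, indent=2))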

5. Conclusions: Embedding e-Science

At present, the e-Science research agenda, both in technology and in applications, is largely being driven by leading-edge scientists and researchers who are prepared to engage with immature, ‘bleeding edge’ software and technologies. Engaging a broader spectrum of the scientific community requires that the learning curve be made much less steep and that e-Science tools and technologies be integrated into well-known, familiar environments. Supportive, collaborative ‘virtual organizations’ must be easy to establish and must provide an adequate level of security and an acceptable user interface. Only with stable and robust middleware services will scientists be able routinely to construct the types of Grid that they need for their research.

Several other activities are underway in the UK that attempt to advance this agenda of embedding e-Science into the fabric of research. These include:

  1. A research and development programme in security for e-Science infrastructures and applications. Issues range from GSI-style digital certificates for VO membership to Shibboleth-mediated trust networks between institutions.
  2. A programme of research into usability issues related to tools, applications, e-Infrastructure and general methodologies.
  3. A program to develop flexible, easy-to-use Virtual Research Environments (VREs). The goal is to lower the barrier to adoption of the new e-Infrastructure services in several domains by using portals to provide transparent access to resources.
  4. Teaching and training courses to educate the next generation of e-Science researchers. Several universities now have Masters level programmes or components within such programmes that address some of the issues in e-Science. The National e-Science Centre (NeSC) also provides training for application scientists in new technologies as they emerge.7

There are also many other EU R&D projects addressing a similar set of issues, as well as a number of other national e-Science programs.

Given the large investment that the UK has made in e-Science since 2001, we are now beginning to see real benefits emerge for some application communities. This is true for projects both in the UK and in the rest of the EU. Although other application communities are still at an early stage in their exploration of e-Science technologies, the potential benefits for their particular areas of research are already becoming clear. The use of these technologies will profoundly change the methodology and processes that researchers have traditionally employed to do their science. With the advent of very large data sets, a new form of data-centric, collections-based science is beginning to emerge to complement the traditional experimental, theoretical and computational approaches. The change in social behaviour will be as significant as the change in technology.

1 Geddes, N., Hey, T., Trefethen, A., Read, M., Robiette, A., "A National e-Infrastructure for Research and Innovation," Discussion Paper for the UK e-Science Steering Committee, 2004.
2 OMII: http://www.omii.ac.uk/
3 The OGSA-DAI project: http://www.ogsadai.org.uk/
4 The myGrid project: http://www.mygrid.org.uk/
5 The National Grid Service: http://www.ngs.ac.uk/
6 The Digital Curation Centre: http://www.dcc.ac.uk/
7 The National e-Science Centre: http://www.nesc.ac.uk/