The e-Science Challenge: Creating a Reusable e-Infrastructure for Collaborative Multidisciplinary Science
November 2005
This issue of CTWatch Quarterly contains four articles that provide an overview of some of the major Grid projects in Europe. All of these projects aim to give scientists distributed, collaborative research capabilities built on a persistent middleware infrastructure deployed on top of high-bandwidth research networks. This combination of middleware services running on top of high-speed networks is called ‘e-Infrastructure’ in Europe and ‘Cyberinfrastructure’ in the USA. In this brief article we abstract the key elements of such an e-Infrastructure from these projects and from our experience in the UK e-Science program. We look at the problems of creating and implementing a sustainable, global e-Infrastructure that will enable multidisciplinary and collaborative research across a wide range of disciplines and communities.
The UK e-Science Initiative began in April 2001, and over the last four years more than £250M has been invested in science applications and middleware development. In addition, the program has created a pipeline from the science base to genuine industrial applications of this technology and, most importantly, has enabled the creation of a vibrant, multidisciplinary e-Science community. The whole community comes together at the UK’s annual e-Science All Hands Meeting, held each September, which now draws over 650 attendees who share experience and technologies. These meetings have brought together an exciting mix of scientists, computer scientists, IT professionals, industrial collaborators and, more recently, social scientists and researchers in the arts and humanities. Research scientists from all domains of science and engineering (particle physics, astronomy, chemistry, physics, all flavours of engineering, environmental science, bioinformatics, medical informatics and social science), as well as from the arts and humanities, are beginning to appreciate the need for e-Science technologies that will allow them to make progress with the next generation of research problems.
In most cases, researchers now face an increasingly difficult burden: managing and storing vast amounts of data while also analyzing, combining and mining that data to extract useful information and knowledge. Often this involves automating the annotation of data with relevant metadata, as well as constructing complex search engines and workflows that capture complex usage patterns of distributed data and compute resources. Most of these problems, and the tools and techniques to tackle them, are similar across many different types of application. It makes no sense for each community to develop these basic tools in isolation.
We need to identify and capture a set of generic middleware services and deploy them on top of the high-bandwidth research networks to constitute a reusable e-Infrastructure. In the UK e-Science Initiative, this task–of identifying and implementing the key features of a national e-Infrastructure–was the remit of the Core Program.
The phrase e-Infrastructure–or Cyberinfrastructure in the US–is used to emphasize that these applications will be facilitated by a set of services that permit easy but controlled access to the traditional infrastructure of science–supercomputers, high performance clusters, networks, databases and experimental facilities. The e-Science challenge is to provide a set of Grid middleware services that are sufficiently robust, powerful and easy to use that application scientists are freed from re-inventing such low-level ‘plumbing’ and can concentrate on their science. A second challenge is to make this combination of middleware and hardware into a truly sustainable e-Infrastructure in much the same way as we take for granted the research networks of today.
The Grid projects referred to in these articles as well as the national e-Science programmes in Europe give us a good idea of what is required to create such a persistent, global, e-Science e-Infrastructure. In the UK the key elements have been identified as:1
With funding from the e-Science Core Program, the UK is now in the process of implementing a prototype national e-Infrastructure. Several key components have been developed and these include:
The e-Science Core Program has recently funded the continuation of an Open Middleware Infrastructure Institute (OMII-UK). In its second phase, the Institute is built on three existing centres and leverages their joint user groups and the different competencies of the three teams. The lead partner is the original OMII at the University of Southampton, which was set up in 2004 to provide well-engineered e-Science middleware sourced from the e-Science community.2 It has now been joined by the OGSA-DAI team in Edinburgh, which has developed middleware to support data access and integration that is now used worldwide;3 and by the myGrid project, which since 2001 has developed a set of workflow-based tools that have been widely adopted to support researchers in the bioinformatics community.4
By combining the expertise of these groups in OMII-UK, the e-Science Core Program has established a powerful source of well-engineered software, which should enable an integrated approach to the provision of higher level and more advanced tools. A dialogue is taking place with similar organizations in different countries, such as the NMI in the US and a new organization, OMII-China in Beijing.
Established in April 2004, the National Grid Service (NGS) builds on the experiences of the UK e-Science community.5 At the core of the service there are two compute and two data clusters located at the Universities of Manchester, Oxford, and Leeds, and at the Rutherford Appleton Laboratory. These are supported by the Grid Operations Support Centre (GOSC), which maintains the UK e-Science Certificate Authority, runs a help desk, and provides training for administrators and IT professionals.
The NGS has now connected compute resources at several partner sites. Presently there are three such associate sites: Cardiff, Lancaster, and Bristol Universities. In addition to this production service, other UK e-Science Centres play an important role in evaluating and testing Grid middleware as part of the software appraisal process for the NGS and OMII-UK. Since the NGS entered production mode, the number of registered users has risen to over 300 in a broad range of application areas.
At present, the core middleware of the NGS production grid is based on the Globus GT2 Toolkit and the SDSC Storage Resource Broker (SRB). As the Web Services versions of Grid middleware mature, it is expected that the NGS will migrate to a set of middleware services compliant with the GGF OGSA architecture. It is also expected that this set will include software from the OMII-UK as well as from the NMI and the EGEE project described in this publication.
The Digital Curation Centre (DCC) has been established in Edinburgh by the e-Science Core Program and the JISC. Its role is to support best practice and to pursue research in data curation and digital preservation.6 In particular, it is working with different application communities to understand their specific challenges and identify best practice. The Centre will provide advice and support services to UK researchers and institutions. It is clear that, in the next five years, many scientists are likely to be swamped with data. Managing the whole data chain, from acquisition and annotation through to integration and preservation, will be a major challenge. Tools to support collaborative working, workflow, provenance and high performance visualization will be needed. In some communities, there are business or legal requirements for long-term data preservation and access, as, for example, with engineering drawings and clinical records.
At present, the e-Science research agenda, in both technology and applications, is largely being driven by leading-edge scientists and researchers who are prepared to engage with immature, ‘bleeding edge’ software and technologies. Engaging a broader spectrum of the scientific community requires a much gentler learning curve and e-Science tools and technologies that are integrated into well-known, familiar environments. Supportive, collaborative ‘virtual organizations’ must be easy to establish and must provide an adequate level of security and an acceptable user interface. Only with stable and robust middleware services will scientists be able to routinely construct the types of Grid that they need for their type of research.
Several other activities are underway in the UK that are attempting to move forward in this agenda of embedding e-Science into the fabric of research. These include:
Similarly, many other EU R&D projects, along with a number of other national e-Science programs, are addressing much the same set of issues.
Given the large investment that the UK has made in e-Science since 2001, we are now beginning to see real benefits emerging for some application communities. This is true for projects both in the UK and in the rest of the EU. Although some other application communities are still at an early stage of exploring e-Science technologies, the potential benefits for their particular areas of research are already becoming clear. The use of these technologies will profoundly change the methodology and processes that researchers have traditionally employed to do their science. With the advent of very large data sets, we are seeing a new form of data-centric, collections-based science begin to emerge to complement the traditional experimental, theoretical and computational approaches. There will be as much a change in social behaviour as a change in technology.