Resource Cataloging and
Distribution Service (RCDS)

Keith Moore
Shirley Browne
Stan Green
Reed Wade
 

Netlib Development Group
University of Tennessee

June 20, 1996

Abstract

We describe an architecture for cataloging the characteristics of Internet-accessible resources, for replicating such resources to improve their accessibility, and for cataloging the current locations of the resources so replicated. Message digests and public-key authentication are used to ensure the integrity of the files provided to users. The service is designed to provide increased functionality with only minimal changes to either a client or a server. Resources can be named either by URNs or by existing URLs, and the service is designed to facilitate long-term resolution of resource names.

1. Introduction

Almost any user of the World Wide Web will be familiar with the following problems: We therefore propose an architecture for a system which attempts to address these problems.

1.1 Design Goals

The goals of our system include: These goals have certain implications for our design:

1.2 Issues

The following issues must be considered: The assumed significance of transition issues on the success of the project influenced our design in the following ways: we allow ordinary URLs as one kind of resource name, we use existing file servers and file access protocols, and we employ DNS as a component of the system rather than building a new distributed database from the ground up. The need for reliable authentication and integrity assurances, coupled with the difficulty of providing secure servers, influenced us to use end-to-end (between information provider and user) authentication, consisting of public-key signatures and cryptographically signed certificates, rather than depending on the security of resource catalog servers or file servers (though reasonable security for these is still required to thwart denial-of-service attacks). Finally, some of the inherent limitations of DNS and the desire to separate administration of ``naming authority'' names from administration of resource names for a particular naming authority, led us to use DNS only as a means to identify one or more resource catalog servers for a particular resource naming authority, rather than to provide actual location or catalog information directly through DNS.

1.3 Non-Goals

The following were deliberately omitted from our design goals:

2. Description of RCDS

The Resource Cataloging and Distribution System (RCDS) consists of the following components:

2.1 Resource names

RCDS uses three kinds of resource names: URLs, URNs, and LIFNs. Web users will already be familiar with the syntax of URLs and how they are used. URNs and LIFNs are described below.

2.1.1 URNs and LIFNs

URNs (Uniform Resource Names) are used to provide stable names for resources whose characteristics may vary over time. For instance, a URN may be used as a stable reference to a web page. The web page can then move, be replicated, or change its contents and still remain accessible through the URN (even though the URN itself has not changed). In contrast to URLs which have wired-in location information, the location information and other characteristics of a URN are provided by external resolution servers.

A LIFN (Location-Independent File Name) is similar to a URN in that it is a stable name and that it can be resolved to find locations of resources that it names. However, unlike a URN, a LIFN is constrained to name a specific instance of a resource, which is location- and time-independent. Thus all copies of a file named by a LIFN are byte-for-byte identical. The meaning of a LIFN also does not change over time. Once a LIFN is used to refer to a particular file, it must always refer to that same sequence of octets. LIFNs are intended to refer to files, though they can also refer to services as long as those services do not change over time (a difficult constraint!) and all locations of those services are identical. A URN is associated with a description of the resource it names, while a LIFN is associated with one or more locations of identical copies of that resource.

LIFNs have two purposes in RCDS:

  1. First, they serve as a link between a catalog record (description) that describes a resource and the locations of a particular instance of that resource. The description associated with a URN will normally contain one or more LIFNs, which describe particular instances of that resource and the differences between them. For instance, if the resource named by a particular URN exists in several different data formats (e.g. plain text, PostScript, PDF, HTML), the description for that URN will list each of these, along with a LIFN for that specific instance. Similarly, if the resource associated with a URN has changed over time, and multiple versions of the resource are still accessible, the description of that resource might contain a list of the current and previous versions along with the LIFNs for each. Since the LIFN can then be used to find the current locations of a resource, it serves as a ``link'' or ``file handle'' from the description of a resource to the list of its current locations.
  2. Second, they are used by replication daemons (mirroring tools) which create new replicas by copying files across a network. The replication daemons use LIFNs to refer to the files being replicated, so that there is no ambiguity about which version of a file is being copied. This also allows the locations of all replicas created by such daemons to be associated with the same identifier.
The distinction between URNs and LIFNs was crafted for several reasons:

2.1.2 Format of URNs and LIFNs

A URN consists of three parts:
  1. The fixed prefix string URN:.
  2. A namespace identifier (NSI), which identifies the format of the remaining portion of the URN.
  3. A namespace specific suffix (NSS) string, which is an identifier assigned according to rules of that particular name space.
The namespace specific suffix may also contain a naming authority, which indicates further delegation of assignment within the name space. The location of the naming authority within the NSS varies from one name space to another. This allows URNs to serve as an ``umbrella'' for other naming schemes (e.g., ISBNs, SGML Formal Public Identifiers, Usenet Message-IDs) that have a variety of structures.

So URN:inet:foo.bar:mumblefrotz might be a URN within the inet name space, that was assigned by the naming authority foo.bar.

A LIFN is a URN with a name space identifier of lifn. For LIFNs, the naming authority appears at the left of the namespace specific suffix and is separated from the remainder of the LIFN by a ``/''. An example LIFN is URN:lifn:foo.bar/199612250000.314159.
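The format described above can be illustrated with a small parser. This is a minimal sketch rather than part of the RCDS prototype: the names ParsedURN and parse_urn are hypothetical, the prefix is matched exactly as URN, and the naming authority is extracted only for the lifn name space, since its position in other name spaces is not fixed.

```python
from typing import NamedTuple, Optional


class ParsedURN(NamedTuple):
    nsi: str                  # namespace identifier, e.g. "inet" or "lifn"
    nss: str                  # namespace specific suffix
    authority: Optional[str]  # naming authority; only extracted for LIFNs here


def parse_urn(urn: str) -> ParsedURN:
    """Split a URN into its prefix, NSI and NSS (hypothetical helper)."""
    prefix, sep, rest = urn.partition(":")
    if sep == "" or prefix != "URN":
        raise ValueError("missing fixed prefix 'URN:'")
    nsi, sep, nss = rest.partition(":")
    if sep == "" or not nsi or not nss:
        raise ValueError("expected URN:<nsi>:<nss>")
    authority = None
    if nsi == "lifn":
        # For LIFNs the naming authority is the part of the NSS left of "/".
        authority, slash, remainder = nss.partition("/")
        if slash == "" or not authority or not remainder:
            raise ValueError("expected URN:lifn:<authority>/<identifier>")
    return ParsedURN(nsi, nss, authority)
```

For the example LIFN above, parse_urn yields the name space lifn and the naming authority foo.bar. For the inet example the authority is left unset, because locating it requires the name-space-specific instructions obtained during resolution.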

2.1.3 Resolution of URNs and LIFNs

The structure of URNs provides just enough of a toe-hold to facilitate scalable, distributed resolution. Resolution is a process by which a client may identify services pertaining to the resource named by a URN. It works like this:
  1. The client queries a well-known registry for information about the namespace identifier. The information returned will either indicate services and locations for all URNs with that identifier, or it will contain instructions, specific to that name space, which indicate the location of the naming authority within the NSS, and where to find the registries for that naming authority.
  2. In the latter case, the naming authority is extracted from the URN and one of the registries for that naming authority is queried.
  3. Each registry queried returns either (a) referral instructions for further queries (for a narrower portion of the name space) or (b) a list of services, locations of those services, and protocols which may be used when communicating with those services.
  4. When the latter is found, the resolution process stops, and the client chooses one of the available services and locations. Among the services provided by RCDS are: URN-to-catalog record mapping and LIFN-to-location mapping.
The resolution process allows for multiple servers at each registry, so it is both scalable (allowing the load to be split across multiple servers) and fault-tolerant (if a query fails at one server, the same query may be submitted to a different server).

RCDS clients may also be configured to consult ``proxy'' resolution servers (which perform queries on behalf of clients and cache results) as well as ``fallback'' resolution servers (which can be consulted when there are no ``official'' servers for a domain or when the ``official'' servers do not respond).
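The referral-following loop with failover between servers can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: Services, Referral, resolve, and the query callback are hypothetical names, the network is abstracted into a caller-supplied function, and the extraction of the naming authority is assumed to happen inside the registry being queried.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Union


@dataclass
class Services:
    entries: List[str]   # services/locations/protocols, kept as opaque strings


@dataclass
class Referral:
    servers: List[str]   # registries for a narrower portion of the name space


Reply = Union[Services, Referral]
# query(server, urn) returns a Reply or raises ConnectionError (assumed interface)
Query = Callable[[str, str], Reply]


def resolve(urn: str, root_servers: Sequence[str], query: Query,
            max_hops: int = 8) -> List[str]:
    """Follow referrals until some registry returns a list of services."""
    servers = list(root_servers)
    for _ in range(max_hops):
        reply = None
        for server in servers:          # fault tolerance: try each server in turn
            try:
                reply = query(server, urn)
                break
            except ConnectionError:
                continue
        if reply is None:
            raise LookupError("no registry server answered")
        if isinstance(reply, Services):
            return reply.entries        # resolution stops here
        servers = list(reply.servers)   # referral: continue with narrower registry
    raise LookupError("too many referrals")
```

The max_hops bound is our own safeguard against referral loops; the text above does not specify one.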

2.2 Publishing and Distribution

Figure 1 illustrates how files are published in RCDS.
  1. An author submits a file to RCDS using a publication tool. If this is a new file, a new description (containing catalog information) of that file is created and a new URN is assigned; otherwise, the description of the old URN is updated to reflect the new version of the file. A LIFN is assigned to the new file, and this LIFN is included in the description of that file. The part of the description containing the LIFN and file fingerprint (and perhaps other parts of it) is cryptographically signed by the author using the publication tool.
  2. The publication tool deposits a copy of the file on a file server, and a copy of the description on a ``master'' resource catalog server. It also sends a copy of the description of the new file to interested parties, which might include file servers and search services.
  3. The ``master'' resource catalog server updates its slave servers with the new description.
  4. The ``master'' file server informs a location server that it has a copy of the file with that particular LIFN.
  5. As other file servers find out about the existence of the new file, their collections managers decide whether to acquire it. When a file server acquires the new file and makes it accessible, it informs a location server about it.
  6. The location servers propagate new file location information to one another.

2.3 Access and Retrieval

Figure 2 illustrates how files are accessed or retrieved in RCDS.
  1. A user acquires a URN of a resource that seems to suit his needs from a search service, hypertext link, or other means. This URN is resolved using DNS (see below) to find the network addresses of one or more resource catalog servers. One of those servers is selected by the client, perhaps based on network proximity estimates provided by SONAR.
  2. The resource catalog server is queried for a description of the resource named by the URN. The description may contain multiple LIFNs, each describing a different version of the resource. The client selects a particular LIFN from those available.
  3. The client resolves the LIFN using DNS to find the network addresses of one or more location servers. One of those location servers is then queried for locations of the file named by that LIFN.
  4. The location server returns one or more URLs at which the file can be obtained.
  5. The client chooses one of those file servers (again, perhaps based on network proximity estimates) and fetches the file from that server.
The interaction with RCDS may be accomplished either directly by a client, or via a proxy server which communicates with the client via HTTP. This arrangement is shown in Figure 3.
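The five retrieval steps can be strung together as in the sketch below. All names are hypothetical and every network operation is passed in as a function, so the example shows only the order of operations. We also assume, for simplicity, that proximity estimates are available as a table of round-trip times keyed by server name or URL, and that the client takes the first LIFN offered unless told otherwise.

```python
from typing import Callable, Dict, List, Sequence


def nearest(candidates: Sequence[str], rtt_ms: Dict[str, float]) -> str:
    """Pick the candidate with the lowest estimated round-trip time."""
    return min(candidates, key=lambda c: rtt_ms.get(c, float("inf")))


def retrieve(urn: str,
             dns_lookup: Callable[[str], List[str]],
             query_name: Callable[[str, str], List[str]],
             query_lifn: Callable[[str, str], List[str]],
             fetch: Callable[[str], bytes],
             rtt_ms: Dict[str, float],
             pick_lifn: Callable[[List[str]], str] = lambda lifns: lifns[0],
             ) -> bytes:
    # Step 1: resolve the URN to catalog servers and choose a nearby one.
    catalog_server = nearest(dns_lookup(urn), rtt_ms)
    # Step 2: fetch the description (reduced here to its LIFNs) and pick one.
    lifn = pick_lifn(query_name(catalog_server, urn))
    # Step 3: resolve the LIFN to location servers and choose a nearby one.
    location_server = nearest(dns_lookup(lifn), rtt_ms)
    # Step 4: ask that location server for URLs of identical copies.
    urls = query_lifn(location_server, lifn)
    # Step 5: fetch the file from the closest listed location.
    return fetch(nearest(urls, rtt_ms))
```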

3. Protocols

Because an understanding of some of the protocol details is important for judging how well RCDS achieves its goals, this section outlines important aspects of the protocols used by the current prototype.

RCDS currently uses a lightweight query-response protocol based on Sun's Open Network Computing Remote Procedure Call technology. Either UDP datagrams or TCP streams may be used. Unlike normal RPC applications which use a separate binding protocol to associate an RPC function to a TCP or UDP port, RCDS requests are sent to a ``well-known port'' on the server machine, cutting the network overhead in half.

There are currently four function calls:

  1. update_name adds zero or more assertions and/or certificates to the catalog record for a URN or URL.
  2. update_lifn adds a resource location (URL) to the list of locations associated with a LIFN.
  3. query_name allows a client to obtain the catalog record (or portions thereof) associated with a URN or URL.
  4. query_lifn allows a client to obtain the current list of resource locations for a particular LIFN.
Although the catalog records and the locations are maintained separately, the server optimizes for the common case where a catalog record contains a LIFN whose locations are kept on the same server. In this case, the URLs associated with that LIFN are returned in the same response that contains the catalog record, space permitting.
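The division of labor among the four calls can be illustrated with an in-memory stand-in for a server. This is a sketch only: the class name InMemoryRCDS and the Python signatures are hypothetical, assertions are reduced to (attribute, value) pairs, and authentication, certificates, and the RPC transport are omitted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]   # simplified assertion: (attribute, value)


class InMemoryRCDS:
    """Toy server keeping catalog records and locations in separate tables."""

    def __init__(self) -> None:
        self._catalog: Dict[str, List[Pair]] = defaultdict(list)
        self._locations: Dict[str, List[str]] = defaultdict(list)

    def update_name(self, name: str, assertions: List[Pair]) -> None:
        # Adds zero or more assertions to the record for a URN or URL.
        self._catalog[name].extend(assertions)

    def update_lifn(self, lifn: str, url: str) -> None:
        # Adds one location to the list kept for a LIFN (no duplicates).
        if url not in self._locations[lifn]:
            self._locations[lifn].append(url)

    def query_name(self, name: str) -> List[Pair]:
        return list(self._catalog.get(name, []))

    def query_lifn(self, lifn: str) -> List[str]:
        return list(self._locations.get(lifn, []))
```

A publisher would call update_name once per new description and each file server would call update_lifn as it acquires a copy; clients use only the two query calls.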

For the update calls, authentication is accomplished by the use of a secret shared by client and server. The request, along with the shared secret and a timestamp, is used to calculate a 128-bit digest using a variant of the ``keyed MD5'' algorithm. The client transmits this digest along with the request, but does not transmit the shared secret. The server computes the same digest using its copy of the shared secret, and compares the result it obtained with the digest included in the request. Only if the two digests match is the client considered to be authenticated. The client must still have appropriate permissions to perform the request.
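The digest check can be sketched as follows. The exact ``keyed MD5'' variant is not specified above, so the layout used here (secret, timestamp, request, secret) is an assumption chosen for illustration, as are the function names and the 64-bit timestamp encoding.

```python
import hashlib
import hmac
import struct


def request_digest(secret: bytes, request: bytes, timestamp: int) -> bytes:
    """128-bit digest over secret, timestamp and request (assumed layout)."""
    h = hashlib.md5()
    h.update(secret)
    h.update(struct.pack("!Q", timestamp))  # 64-bit big-endian timestamp
    h.update(request)
    h.update(secret)
    return h.digest()                       # 16 octets = 128 bits


def verify(secret: bytes, request: bytes, timestamp: int, digest: bytes) -> bool:
    """Server side: recompute with the local secret and compare."""
    expected = request_digest(secret, request, timestamp)
    return hmac.compare_digest(expected, digest)
```

The client sends the request, the timestamp, and the output of request_digest; the secret itself never crosses the network. A successful verify establishes only who sent the request, and the permission check remains a separate step.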

Replay attacks are thwarted (while allowing for duplicated UDP datagrams) as follows: the server keeps a copy of the last request id and the last result of any update call from a particular client. If the last request was repeated, the server repeats the response obtained from the previous call (which consists of a single integer indicating success or failure), without modifying the database. The client must increase the request id on each call; if the request id from a particular client is less than the previous request id, the request is considered to be a delayed duplicate and ignored.
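The duplicate-handling rule can be sketched as below. The class name ReplayGuard and its interface are hypothetical; the update itself is passed in as a function returning the integer result, and None stands for ``no response sent''.

```python
from typing import Callable, Dict, Optional, Tuple


class ReplayGuard:
    """Remembers the last request id and result for each client."""

    def __init__(self) -> None:
        self._last: Dict[str, Tuple[int, int]] = {}

    def handle(self, client: str, request_id: int,
               apply_update: Callable[[], int]) -> Optional[int]:
        last = self._last.get(client)
        if last is not None:
            last_id, last_result = last
            if request_id == last_id:
                # Duplicated datagram: repeat the answer, leave the database alone.
                return last_result
            if request_id < last_id:
                # Delayed duplicate (or replay) of an older request: ignore it.
                return None
        result = apply_update()
        self._last[client] = (request_id, result)
        return result
```

Because only the most recent request per client is remembered, the server's state for replay protection stays constant in size per client.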

Catalog records used by the update_name and query_name functions are composed of assertions and certificates. An assertion is a data structure composed of:

which is interpreted to mean: ``A states that, as of time Ta, the attribute named N of the resource named U had value V, and that this value is expected to be valid until time Te.''

A certificate is a data structure containing:

and which is interpreted to mean: ``C warrants that, as of time Tc, assertions A1, A2, ..., An are valid.'' A certificate also contains C's cryptographic signature, computed over Tc and the contents of assertions A1, A2, ..., An.
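The two interpretations above suggest record layouts along the following lines. This is a sketch: the field names are ours, derived from the symbols A, Ta, N, U, V, Te, C and Tc in the text, times are assumed to be seconds since an epoch, and the signature and its algorithm identifier are carried as opaque values without being checked.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Assertion:
    asserter: str        # A: who makes the statement
    asserted_at: float   # Ta: time the statement was made
    resource: str        # U: URN or URL of the resource
    attribute: str       # N: attribute name
    value: str           # V: attribute value
    expires_at: float    # Te: value expected to be valid until this time

    def expected_valid(self, now: float) -> bool:
        """True while 'now' lies between Ta and Te."""
        return self.asserted_at <= now <= self.expires_at


@dataclass(frozen=True)
class Certificate:
    certifier: str                      # C: who warrants the assertions
    certified_at: float                 # Tc
    assertions: Tuple[Assertion, ...]   # A1 ... An
    algorithm: str                      # hypothetical signature algorithm id
    signature: bytes                    # opaque; computed over Tc and A1 ... An
```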

Location records associated with a LIFN (and used by the update_lifn and query_lifn functions) consist of a URL, a cache retention time-to-live Tr, and an expiration date Tx. The cache retention time-to-live is a contract by the file server that supplied the binding, that it will notify the server Tr seconds before making that file inaccessible to clients. Clients and caches should not use that LIFN to URL binding after Tr has expired. The expiration date is an indication that the resource is expected to be made inaccessible at time Tx. This system is intended to allow both client and file server to avoid the overhead of requests for files that are no longer accessible.
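A client-side check of a location record might look like the sketch below. The names are hypothetical, and we assume that Tr is counted in seconds from the moment the client obtained the binding while Tx is an absolute time on the same clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocationRecord:
    url: str
    retention_ttl: float   # Tr: seconds the binding may be cached and used
    expires_at: float      # Tx: time the file is expected to become inaccessible

    def usable(self, obtained_at: float, now: float) -> bool:
        """True if the LIFN-to-URL binding may still be used at time 'now'."""
        within_ttl = now < obtained_at + self.retention_ttl
        before_expiry = now < self.expires_at
        return within_ttl and before_expiry
```

A cache holding a binding for which usable returns false would re-query the location server rather than contact the file server.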

4. How the system meets its goals

4.1 Ease of deployment

Deployment of RCDS requires no new infrastructure other than that which can be provided directly by existing publishers. Clients, information providers, and mirror servers can each begin supporting RCDS independently of one another, except that ``official'' mirror servers need to obtain credentials in order to add location information to the resource owner's RCDS server. RCDS servers are associated (via the URI resolution process) with a particular subset of URI-space; it is assumed that the official RCDS servers for a resource are maintained or authorized by the owner of that resource. Owners must therefore provide permission and authentication credentials to a party before it can add assertions, certificates, or locations to an ``official'' RCDS database. This provides a mechanism to ensure that only ``authorized'' catalog information or replicas are listed in the official servers. Nothing prevents an unauthorized third party from establishing its own RCDS servers, but (barring attack of the resolution system) those servers must be explicitly configured by the user -- they will not be found by a normal URI resolution process.

4.2 Reliability and Fault-tolerance

Reliability is achieved by robust construction of the components and by allowing redundancy at every step. Multiple DNS servers may exist for a particular domain, multiple RCDS servers for any portion of URI-space may be registered in DNS, and RCDS servers may list multiple locations for a particular resource. Fault-tolerance is achieved by having clients attempt to reach multiple servers before declaring failure.

4.3 Efficient use of the network

RCDS promotes efficient use of the network by providing a lightweight protocol for queries and updates. In most cases, the resolution process is expected to cost one extra long-distance round-trip (to an RCDS server), and one extra local round-trip (to a SONAR server), as compared to the name-to-address lookup for a URL. The benefit is that the client can then choose to access the resource from a nearby server (rather than the one explicitly listed in a URL); the client can also avoid fetching the resource at all if it can tell by the catalog record that it is not needed. Finally, since RCDS allows listing of multiple URLs for a resource, it allows the client to choose not only a nearby server, but also the best access protocol that it supports. RCDS thus provides a means to transition from ftp and http to more efficient protocols; for instance, streaming protocols for real-time audio and video, or multicast-based protocols for information which is transmitted simultaneously to many users.

4.4 Security

RCDS can provide authenticity and integrity assurance for ordinary files through the use of message digests such as SHA or MD5, and public-key signatures for assertions that a particular message digest represents a particular version of a file. Such assurances do not extend to other kinds of resources such as database search services. The authentication required to update an RCDS server does not ensure the authenticity or integrity of a resource listed by that server; it is intended to provide some protection against denial-of-service attacks. Of course, an RCDS server can still be compromised if the host it resides on is compromised by other means; hosts running RCDS servers need to have fairly strict security policies not only to minimize the likelihood of such attacks, but also to facilitate the detection of attempts to breach security, analysis of threats, and recovery from successful attacks.

Public keys are of limited utility without a means of verifying signatures. In addition to the functions already provided by RCDS, it may be possible to use it to transmit certificates of public keys -- signed assertions from a well-known party that a particular public key is associated with a particular RCDS asserter or certifier. The difficulty is not in RCDS itself, but in finding an intermediary (or chain of intermediaries) that certifies that asserter's key, and that the user believes will not certify the key of any untrustworthy party. Even if such a party exists, if there are many signatures on a particular asserter's key, each of which is certified by several other parties, etc., it might be infeasible to find a certificate chain. In general, this is an unsolved problem, except in environments (such as military organizations) where the user can be required to trust a particular key certificate hierarchy.

4.5 Flexibility

RCDS attempts to provide flexibility by having few wired-in assumptions about the structure of catalog information. No particular data model is assumed, and any party can (subject to permissions) potentially add any kind of assertion. Catalog information stored in an RCDS server is currently assumed to be ``flat'' lists of (attribute, value) pairs, but this model can also be used to represent hierarchical structures. Clients can retrieve portions of the catalog information for a resource by the attribute's full name or by the prefix of a name.
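Retrieval by full name or by prefix over a flat list can be sketched in a few lines. The function name select and the dotted attribute names used in the example are assumptions for illustration; no particular naming convention is prescribed above.

```python
from typing import List, Tuple

Pair = Tuple[str, str]   # (attribute name, value)


def select(record: List[Pair], prefix: str) -> List[Pair]:
    """Return the pairs whose attribute name starts with 'prefix'.

    A full attribute name is simply the longest possible prefix, so the
    same function serves both kinds of query.
    """
    return [(name, value) for name, value in record if name.startswith(prefix)]


# Example: a flat list encoding a two-level hierarchy with dotted names.
record = [
    ("title", "RCDS overview"),
    ("version.1.format", "text/plain"),
    ("version.1.lifn", "URN:lifn:foo.bar/199612250000.314159"),
    ("version.2.format", "application/postscript"),
]
```

Here select(record, "version.1.") returns the two pairs describing the first version, which is how a flat list can stand in for a nested structure.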

Likewise, RCDS servers know nothing about particular message digest or certificate algorithms. The server knows about certificates only in that:

Experience with several other protocols (X.400, MIME, xxx) indicates that it is difficult to retro-fit security into a protocol after it is deployed, especially if there are multiple ways in which a protocol element may be represented. RCDS therefore supports cryptographic authentication without specifying any particular algorithm for signing certificates. In particular, it does not specify the representation of a set of assertions before signing. Such representation is bound to the signature algorithm identifier.

In general, RCDS servers do not interpret assertions or certificates. They merely serve as repositories at which they can be stored and retrieved.

4.6 Scalability

Scalability is provided by allowing multiple instances of any particular server (to distribute the load), and by minimizing the overhead necessary to propagate updates between RCDS servers. In particular, there is no requirement that all RCDS servers for a particular portion of LIFN-space contain the same set of locations, so long as a client can query multiple RCDS servers until it finds an accessible location for the desired resource. Location updates can thus be submitted to any RCDS server and propagated to the others without needing a commit protocol. Somewhat more constraints are placed on updates to catalog information, which (for now) are assumed to be master-slave.

Scalability of RCDS is also enhanced by limiting its scope (for instance, it does not do searching) and allowing it to be optimized for a small number of simple functions.

5. Future work