Project Number: | RE 4004 (RE) |
Project Title: | DESIRE II - Development of a European Service for Information on Research and Education II |
Deliverable Type: | PU |
Deliverable Number: | D4.1 |
Contractual Date of Delivery: | January 1999 |
Actual Date of Delivery: | July 1999 |
Title of Deliverable: | Review of Caching Technologies & New Opportunities |
Workpackage(s) contributing to the Deliverable: | WP4 |
Nature of the Deliverable: | RE |
Author: | Ingrid Melve, UNINETT |
Contact Details: | UNINETT Tel: +47 73 55 79 07 Email: Ingrid.Melve@uninett.no |
Other Authors: | Tim Dixon, TDC Networking Consultancy Limited |
URL |
Abstract | Web caching is a process which attempts to improve the efficiency with which information on the World Wide Web can be accessed by keeping copies of frequently-requested information close to the information user; the set of copies of information is known as the cache. Much in the same way that a microprocessor cache can reduce memory-access latency and reduce the bus bandwidth required to retrieve data from main memory, a web cache can reduce the delay in retrieving documents and reduce the communications bandwidth required for retrieval. The web caching process is implemented by a caching proxy. This document reviews the current and emerging web caching technologies which will be explored during the execution of the DESIRE II Contract. |
Keywords | WWW |
Distribution List: | DESIRE Project Team; European Commission |
Issue: | V1.1 |
Reference: | Review of Caching Technologies & New Opportunities |
Total Number of Pages: | 14 |
Issue Number | Issue Date | Reason for Change |
0.1 | 27/06/99 | (Initial Draft) |
0.2 | 28/06/99 | Early draft issued for comment |
0.3 | 06/07/99 | Final draft issued for comment |
1.0 | 16/07/99 | Incorporates comment on draft; version released for peer review |
1.1 | 30/07/99 | Incorporates comments from peer review |
Web caching is a process which attempts to improve the efficiency with which information on the World Wide Web can be accessed by keeping copies of frequently-requested information close to the information user; the set of copies of information is known as the cache.
Much in the same way that a microprocessor cache can reduce memory-access latency and reduce the bus bandwidth required to retrieve data from main memory, a web cache can reduce the delay in retrieving documents and reduce the communications bandwidth required for retrieval.
The web caching process is implemented by a caching proxy.
This document reviews the current and emerging web caching technologies which will be explored during the execution of the DESIRE II Contract.
Opportunities exist to improve the level of automatic configuration that is possible of web client software and of caching proxies themselves. Since many of the protocols proposed for this purpose are in competition, pursuit of standards and interoperability will be of great importance.
The increasing use of caching raises issues of privacy (of the information user) and security (of all parties to the information exchange) as well as legal concerns over intellectual property rights. Careful consideration must be given to the design and implementation of caching systems to achieve the available performance benefits without compromising confidentiality or infringing copyright.
This is an introductory document, exploring the technology context in which Work Package 4 of the DESIRE II project will be conducted.
Caching Proxy
A system for optimising the retrieval of information objects with HTTP, by keeping copies of objects (chosen algorithmically) close to the information user.
CARP
Cache Array Routing Protocol. A proposed protocol for routing requests to caching proxies.
FTP
File Transfer Protocol. A protocol used to retrieve a named file from a remote computer system.
HTTP
Hypertext Transfer Protocol. The application-level protocol used to retrieve information objects from the World Wide Web.
ICP
Internet Cache Protocol. A protocol for communication between caching proxies.
Internet Draft
A working document of the Internet Engineering Task Force.
PAC
Proxy Auto-Configuration. A Netscape Communications Corporation scheme for semi-automatic configuration of web browsers, adopted by many manufacturers.
Replication
Making copies of a human-selected set of information objects available at strategic points on a network to optimise retrieval times and reduce network load.
SSL
Secure Socket Layer. A means of securing a connection between a client and a web server using encryption at the transport layer.
URI
Uniform Resource Identifier. A generic term referring to any of the different possible ways of identifying an information object in the World Wide Web (for example, a URL or a URN, q.v.).
URL
Uniform Resource Locator. The representation of the location in the World Wide Web from which an information object can be retrieved.
URN
Uniform Resource Name. A name that uniquely identifies an information object, regardless of its location in the World Wide Web.
WPAD
Web Proxy Auto-Discovery. A proposal to further automate Proxy Auto-Configuration (PAC, q.v.).
This section introduces some of the terminology used in the rest of the document.
A proxy is a system which is interposed between the real user of a service and the service itself, performing some useful intermediary function. Web proxies act at the application level and use the HTTP protocol to communicate with at least one of the parties to the information-request transaction.
In addition to caching, web proxies can be used for a number of purposes:
A caching proxy is interposed between an information user and an information provider. When the end user requests an information object using HTTP, the request is sent to the caching proxy instead of to the original information source identified in the HTTP URL.
If the caching proxy already has a current copy of the information object identified by the URL, it returns it directly to the information user. If it has no copy, or the copy is out of date, then the object is retrieved from the information source, a local copy may be made, and the object is returned to the information user.
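The lookup just described can be sketched as follows. This is a minimal illustration in Python, not the behaviour of any particular product: the class name CachingProxy, the fetch_from_origin callback and the fixed time-to-live used to judge currency are all assumptions introduced for the example.

```python
import time
from typing import Callable, Dict, Tuple


class CachingProxy:
    """Toy in-memory caching proxy (hypothetical; illustrative only)."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch_from_origin      # retrieves an object from its source
        self._ttl = ttl_seconds              # assumed fixed lifetime of a copy
        self._clock = clock                  # injectable so the sketch is testable
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str) -> Tuple[bytes, str]:
        """Return (object, status); status is HIT, MISS or EXPIRED."""
        now = self._clock()
        status = "MISS"
        entry = self._store.get(url)
        if entry is not None:
            stored_at, body = entry
            if now - stored_at < self._ttl:
                return body, "HIT"           # current copy: answer locally
            status = "EXPIRED"               # copy exists but is out of date
        body = self._fetch(url)              # go back to the information source
        # A real proxy may decide not to keep a copy; this sketch always does.
        self._store[url] = (now, body)
        return body, status
```

A second request for the same URL within the assumed lifetime is answered from the local store without contacting the information source.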
For this process to work, three conditions must be met. First, the information user’s web browser (or other retrieval tool) must be redirected to the caching proxy. Second, the proxy must be able to make an effective judgement about the currency of information objects in its local cache, and to decide whether to keep a copy of newly retrieved information or to discard a copy of information that may no longer be needed. Third, the user and the information source must be willing for the information they are exchanging to be copied. These criteria can be difficult to meet in all circumstances, and some of the issues involved are discussed below.
Replication (informally referred to as mirroring) is a process in which a conscious choice is made to locate copies of frequently-accessed information at strategic points around the Internet – for example the FTP archives at FUNET in Finland or Imperial College in London. Information providers have control over where their content is replicated and all of the information they choose to replicate is copied. In principle, replicas could be used for secure or paid-for information resources, since the information provider could set up a security infrastructure to authenticate and regulate access. By contrast, the contents of a cache depend on usage patterns – an information object will not be present in the cache unless it has been requested at some point and may be removed from the cache if it has not been requested for some time. Using a cache to store secure or paid-for information is fraught with difficulties.
A cache can acquire some of the characteristics of a replica – for example, it may be possible to preload the cache with selected information objects, or redirect requests for certain information objects to local replicas.
One of the problems linking caching and replication is that the only practical means at present of identifying documents in the World Wide Web is the URL. URLs give the location of information resources, but imply nothing about the content. If a document is replicated in two locations, the URLs will be different, even if the replicas are identical in content. A proxy cannot therefore easily substitute a request for one copy with a request for another without extensive manual configuration which, in practice, is likely to be infeasible. If URNs (Uniform Resource Names) were in general use, this problem would be largely solved, since the same URN would be assigned to all copies of an information resource, regardless of its current location.
The use of URNs to identify requested resources would require an additional processing step in which the URN is resolved into a URL identifying the location from which the resource can be fetched. The resolution process could take into account the location of the client and return the URL of the closest replica of the information object (or, at least, the one that can most efficiently be retrieved).
In the absence of widespread adoption of URNs, well known URLs can serve as informal URNs (for example http://www.apache.org/ could be used as a convenient synonym for all of the various mirror sites at which the same information is stored). A proxy can then perform a “resolution” process in which portions of the URL in the HTTP request are substituted in such a way that the correct information is returned, but from a different place (replacing, for example, http://www.apache.org/ with http://www.apache.org.uk/).
Popular caching proxies provide mechanisms for rewriting or redirecting requests in this fashion. However, the rewriting rules can be quite complicated and the administrator must know a priori which sites are mirrored and the location of the most convenient local replica.
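A prefix-substitution rule of this kind might look like the following Python sketch. The table and the function name rewrite_to_mirror are hypothetical and do not reproduce the rule syntax of any particular proxy; the single entry reuses the example given above.

```python
from typing import Dict

# Hypothetical, administrator-maintained table:
# well-known URL prefix -> prefix of the preferred local replica.
MIRRORS: Dict[str, str] = {
    "http://www.apache.org/": "http://www.apache.org.uk/",
}


def rewrite_to_mirror(url: str, mirrors: Dict[str, str] = MIRRORS) -> str:
    """Substitute the leading part of a URL if a local replica is known."""
    # Try longer prefixes first so that the most specific rule wins.
    for prefix in sorted(mirrors, key=len, reverse=True):
        if url.startswith(prefix):
            return mirrors[prefix] + url[len(prefix):]
    return url  # no known replica: leave the request untouched
```

Even in this simple form, the table must be maintained by hand, which illustrates why such rules require prior knowledge of which sites are mirrored.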
A cache mesh is an intercommunicating set of caching proxies serving different information users. The proxy servers communicate amongst themselves using a protocol such as the Internet Cache Protocol (ICP). If an information object requested by a user is not available, the proxy receiving the request may check (for example, using ICP) if one of the other proxies in its mesh has a copy before resorting to the information source.
A mesh may be created simply for load balancing: a single proxy may not have the resources to cope with a large user community. However, caching proxies tend to be most effective when they serve a community of interest (increasing the likelihood that users will request the same information resources), and by linking caches across groups of such communities, their efficiency can be increased.
Current caching products permit two types of relationship between proxy servers in a caching mesh: siblings and parents. A proxy in a caching mesh which receives an information request first checks whether it has the requested object in its own cache. If not, it checks if any of its siblings has the object cached. If a sibling has a copy of the object, then the object is requested from the sibling. If none of the siblings has the object, then the server which received the request forwards it either to the original information source, or to a parent proxy server which may in turn have sibling relationships of its own.
By this means, a hierarchy of caches can be built which ensures that requests for frequently-requested information are aggregated at locations where caching can provide the most benefit.
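The resolution order (own cache, then siblings, then parent or original source) can be sketched in Python as below. The class MeshProxy and its has() method are hypothetical; has() merely stands in for an inter-cache query such as ICP, and every retrieved object is kept locally, which a real proxy need not do.

```python
from typing import Callable, Dict, List, Optional, Tuple


class MeshProxy:
    """Toy member of a cache mesh (hypothetical; illustrative only)."""

    def __init__(self, name: str, fetch_from_origin: Callable[[str], bytes],
                 parent: Optional["MeshProxy"] = None) -> None:
        self.name = name
        self.parent = parent
        self.siblings: List["MeshProxy"] = []
        self.cache: Dict[str, bytes] = {}
        self._fetch = fetch_from_origin

    def has(self, url: str) -> bool:
        """Stand-in for an inter-cache query such as ICP."""
        return url in self.cache

    def get(self, url: str) -> Tuple[bytes, str]:
        """Return (object, where it was found)."""
        # 1. Own cache.
        if url in self.cache:
            return self.cache[url], "local:" + self.name
        # 2. Siblings: ask each in turn whether it holds the object.
        for sibling in self.siblings:
            if sibling.has(url):
                body = sibling.cache[url]
                self.cache[url] = body
                return body, "sibling:" + sibling.name
        # 3. Parent if one is configured, otherwise the original source.
        if self.parent is not None:
            body, _ = self.parent.get(url)
            found = "parent:" + self.parent.name
        else:
            body = self._fetch(url)
            found = "origin"
        self.cache[url] = body
        return body, found
```

With two siblings sharing one parent, the first request travels up to the parent and on to the source; a later request at the other sibling is satisfied by the first.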
Caching which occurs close to the information user is generally most effective, while caches at higher layers in the hierarchy tend to show lower efficiencies. Cache misses (requests for information objects which are not found in the cache) propagate up the hierarchy and typically the top-level cache has to retrieve the document. Higher-level caches also have a smaller user community (their subordinate caches) whereas a first-level cache might serve a user community of thousands – the likelihood of the same document being requested many times is therefore significantly reduced.
The web client communicates with the caching proxy using the same HTTP protocol it would use to access the original source of the information object. Minor additions to Version 1.1 of the protocol allow the client to specify the host name of the original information source (a feature missing from early versions of HTTP).
In order for the information user’s retrieval software (the client) to make use of a caching proxy, the client must be configured to use the proxy rather than requesting the information from the source specified by the information object’s URL. In fact, it is usually possible to configure the client to use different proxies for different information objects, which can make configuration quite complex. The two most common ways of configuring a client to use a proxy are manual configuration and the semi-automatic Proxy Auto-Configuration (PAC), which are discussed below along with some of the less frequently encountered alternatives.
Manual configuration of the preferences settings in web browsers is still the most common way to configure the use of a proxy. Although manufacturers may provide pre-configuration tools for wide-scale rollouts of identical systems for business users, in the research community it is unlikely that there will be such central control over the users’ desktop.
PAC[PAC] was originally developed by Netscape Communications Corporation for use with their Netscape Navigator® web browser, but has been widely adopted. A web page containing a JavaScript function which the browser interprets to configure its use of proxy servers is published by the system administrator. Users must enter the web address of the page containing the function (the PAC URL) into their browser configuration. Although changes to the configuration are propagated automatically, which is an advance over totally manual configuration, the address of the page must first be entered manually by every user, obviating some of the potential benefits.
The system is not wholly interoperable across different platforms: popular browsers interpret the script contents in slightly different ways.
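A real PAC file is written in JavaScript and defines a function named FindProxyForURL(url, host) whose return value is a string such as "DIRECT" or "PROXY host:port", with alternatives separated by semicolons. The Python sketch below only models that decision and the interpretation of its result; it is not a PAC file, and the domain example.org, the proxy name and the port number are invented for the illustration.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple


def find_proxy_for_url(url: str, host: str) -> str:
    """Python model of the decision a PAC function makes (illustrative)."""
    host = host.lower()
    # Assumption: unqualified host names are local and fetched directly.
    if "." not in host:
        return "DIRECT"
    # Assumption: the organisation's own servers bypass the proxy.
    if fnmatchcase(host, "*.example.org"):
        return "DIRECT"
    # Everything else over HTTP or FTP goes via the proxy,
    # falling back to a direct connection if the proxy is unavailable.
    if url.startswith(("http:", "ftp:")):
        return "PROXY cache.example.org:3128; DIRECT"
    return "DIRECT"


def parse_pac_result(result: str) -> List[Tuple[str, Optional[str]]]:
    """Split a PAC-style result into an ordered list of (method, address)."""
    choices: List[Tuple[str, Optional[str]]] = []
    for item in result.split(";"):
        fields = item.split()
        if not fields:
            continue
        address = fields[1] if len(fields) > 1 else None
        choices.append((fields[0].upper(), address))
    return choices
```

The client tries the alternatives in the order returned, which is how a single published function can express both a preferred proxy and a fallback.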
Web Proxy Auto-Discovery[WPAD] (currently an Internet Draft) is a proposal which would allow clients to automatically discover the PAC URL described above, eliminating the remaining manual configuration step.
WPAD uses a collection of pre-existing Internet resource discovery mechanisms to perform web proxy auto-discovery. Clients may use a variety of techniques:
CARP[CARP] (also at present an Internet Draft) is aimed at permitting clients to select proxies dynamically.
CARP is not so much a protocol as an algorithm to distribute load across an array of proxy servers. The script located at the PAC URL contains code specific to the mesh of CARP-compliant proxy servers and causes the client to select a different server for each URL it requests, based on a hash function that takes into account the URL and the capabilities and configuration of the proxy servers themselves. The hash function is designed to repeatably target requests for the same URL to the same proxy server, ensuring clients are always directed to the proxy most likely to have the information object in its cache and that information objects are well distributed across the participating servers.
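The general idea of deterministic, hash-based selection can be illustrated with "highest score wins" hashing, as in the Python sketch below. This is not the hash function defined in the CARP draft: SHA-256 from the standard library is substituted for convenience, server capabilities (load factors) are ignored, and the proxy names are hypothetical.

```python
import hashlib
from typing import Sequence


def _score(proxy: str, url: str) -> int:
    """Combine proxy name and URL into a single pseudo-random score."""
    digest = hashlib.sha256((proxy + "\n" + url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def select_proxy(url: str, proxies: Sequence[str]) -> str:
    """Pick the proxy with the highest score for this URL.

    The choice depends only on the URL and the set of proxy names, so every
    client (and every proxy) that knows the same set reaches the same answer
    without exchanging any messages.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    return max(proxies, key=lambda proxy: _score(proxy, url))
```

A useful property of this style of selection is that removing a proxy from the array only changes the destination of the URLs that were assigned to that proxy; all other URLs continue to map to the same server.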
All the mechanisms discussed above cause the client explicitly to send its HTTP request to the Internet address of the caching proxy rather than to the original information source. However, by intercepting HTTP traffic at the Transport or Network layers, it is possible to relay HTTP requests through a caching proxy without the client being aware of it, or needing any special configuration.
A transparent proxy works in much the same way as an Internet Firewall – by either implicit or explicit routing (or by using a transport-layer switch), traffic associated with selected TCP ports is passed through an application relay which either passes on the request to the information source, or simply mimics it if the information object is already to hand.
Depending on the design of the proxy, it may use its own network address as the source of requests it makes to original information sources - which may break any authorisation schemes that rely on receiving the network address of the client. Also, the transparent proxy can only recognise the use of the HTTP protocol from the TCP port number to which the transport-layer connection is being directed. Perfectly legitimate HTTP requests using a port number other than the default (port 80) may not be recognised by the proxy.
The interposition of a caching proxy between a client and the source of the original information object poses many problems of security and privacy.
From the point of view of the information source, the true identity of the information client is disguised by the proxy. From the point of view of the information user, the proxy is able to construct an entire history of the information objects retrieved by that user, which is an issue in itself, and to inspect their contents (which may include sensitive information, such as passwords, bank account or credit card numbers).
Successful caching currently depends on the public nature of much of the material on the Internet and a certain amount of trust on the part of information users in their Internet Service Providers.
Once the information requester has located a caching proxy, the proxy must communicate with any other proxies that are part of its caching mesh in order to see if a local copy of the information object is available. Communication about the availability of an object takes place using a protocol distinct from the HTTP protocol used to transfer the object itself. Protocols in common use are described in the following sections.
ICP[ICP] is used by caches to query other caches about information objects, to see if the object is present in the other cache. Most current cache implementations support ICP in one form or another. In outline, it allows one proxy to ask a neighbour whether it has a given information object, receiving a simple yes/no answer.
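To give a feel for how small such a query is, the Python sketch below assembles an ICP version 2 query message following the layout given in RFC 2186 [ICP]: a 20-byte header (opcode, version, message length, request number, options, option data, sender host address) followed, for a query, by the requester host address and the null-terminated URL. The function names are hypothetical, the address fields are simply left as zeros, and [ICP] remains the authoritative description.

```python
import struct
from typing import Tuple

ICP_VERSION = 2
ICP_OP_QUERY = 1   # "do you have this URL?"
ICP_OP_HIT = 2     # neighbour holds the object
ICP_OP_MISS = 3    # neighbour does not hold the object

# opcode, version, message length, request number,
# options, option data, sender host address (network byte order)
_HEADER = struct.Struct("!BBHIII4s")


def build_icp_query(url: str, request_number: int) -> bytes:
    """Build the bytes of an ICP query for one URL."""
    requester = b"\x00\x00\x00\x00"                  # left unset in this sketch
    payload = requester + url.encode("ascii") + b"\x00"
    length = _HEADER.size + len(payload)
    header = _HEADER.pack(ICP_OP_QUERY, ICP_VERSION, length,
                          request_number, 0, 0, b"\x00\x00\x00\x00")
    return header + payload


def parse_icp_reply(message: bytes) -> Tuple[int, int]:
    """Return (opcode, request number) from the header of a reply."""
    opcode, _version, _length, request_number, _opts, _data, _sender = \
        _HEADER.unpack(message[:_HEADER.size])
    return opcode, request_number
```

Note that the URL is the only description of the object carried in the query.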
One of the major problems of ICP is that information objects are identified simply by their URL. Although it was once a reasonable assumption that the contents of an information object were entirely determined by its URL, this is far from being the case today. The contents of an information object may depend on elements of the HTTP protocol (for example, the languages deemed acceptable in the HTTP request) but, more likely, will have been determined by the information origin on the basis of the type of web browser used to request the URL and state information (perhaps encoded in form fields or in “cookies” previously sent to the browser). HTTP/1.1 also allows the original information source to annotate the information object with instructions on when a cached copy might be valid. ICP does not allow for any of this contextual information to be queried, so it is quite possible that a neighbour will appear to have a match for a URL, even though the contents of the information object would be incorrect for the client requesting it.
Although not mandatory, all current implementations of ICP use the unreliable datagram service, UDP.
Since UDP is unreliable, the rate of ICP packet loss can be used as an estimate of network congestion and availability. Together with round-trip times, this rudimentary loss measurement provides a load-balancing method for caches. An estimate of network capacity based on short datagrams is not, however, a particularly good measure of the likely performance when transferring larger documents.
UDP is extremely vulnerable to malicious injection of network traffic and communication with caches can be disrupted by a hostile user.
Currently issued as an Internet Draft and implemented, at least partially, in a small number of products, HTCP[HTCP] permits members of caching meshes to check whether their neighbours have information objects which match not only a URL, but also a set of HTTP request headers.
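The difference between matching on the URL alone and matching on the URL together with request headers can be shown with two lookup keys, as in the Python sketch below. This illustrates the matching problem only and says nothing about the HTCP message format; the function names and the choice of which headers are treated as significant are assumptions.

```python
from typing import Mapping, Sequence, Tuple


def url_key(url: str) -> Tuple[str, ...]:
    """Identify an object by URL only."""
    return (url,)


def variant_key(url: str, request_headers: Mapping[str, str],
                significant: Sequence[str] = ("accept-language", "user-agent"),
                ) -> Tuple[str, ...]:
    """Identify an object by URL plus selected request headers."""
    # Header names are case-insensitive in HTTP, so normalise them first.
    lowered = {name.lower(): value.strip()
               for name, value in request_headers.items()}
    return (url,) + tuple(lowered.get(name, "") for name in significant)
```

Two requests for the same URL that differ only in their Accept-Language header share a url_key but have different variant_key values, so a neighbour holding one language variant would not be mistaken for a source of the other.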
Another feature of HTCP is that it supports “cache push”, that is a cache can inform its neighbours about significant events (such as the expiry of a cached copy) without waiting for a request for the information object concerned.
CARP, mentioned in the previous section, is not, strictly speaking, a means of communicating between caching proxies. Its prime purpose is to avoid communication between proxies by sharing traffic between them in a way that can be determined a priori.
It is observed that a mesh of caches communicating using ICP may not scale very well (there is a tendency for each member to duplicate the information objects held by the others and for the ICP traffic to increase disproportionately as the mesh size increases).
By contrast, CARP is essentially free of protocol overhead and distributes load across an array of proxy servers. As described above, CARP uses program code specific to the mesh of CARP-compliant proxy servers which causes the client to select a different server for each URL it requests, based on a hash function that takes into account the URL and the capabilities and configuration of the proxy servers themselves. Provided that the proxy servers are configured with a compatible algorithm, they can also infer from a given URL which other server is most appropriate to serve the specified information object.
Cache Digests[CDIG] replace the query/response communication for individual information objects by a mechanism for members of the cache mesh to inform their neighbours about the identities of the information objects in their local cache. To use a network-layer comparison, Cache Digests are to ICP rather as a routing protocol is to ARP.
Cache Digests allow proxies to make information about their cache content available to peers in a compact format. A peer uses digests to identify co-operating caches that are likely to have a given web object. To use Cache Digests, two components are needed; the first is a specification of the construction and interpretation of the digest; the second is a protocol to exchange and query digests between mesh neighbours.
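One compact set representation that could serve as such a digest is a Bloom filter: a bit array in which each object identity sets a few bits chosen by hash functions. The Python sketch below shows the principle only; the class name, the array size, the number of hash functions and the use of SHA-256 are assumptions, and [CDIG] should be consulted for the actual digest construction.

```python
import hashlib
from typing import Iterator


class Digest:
    """Toy Bloom-filter summary of a cache's contents (illustrative only)."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        """Record that an object with this identity is in the cache."""
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A peer holding a copy of the bit array can test for a URL locally, at the cost of occasional false positives and of the digest becoming stale between exchanges.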
Although implemented in one popular caching proxy, and documented in a draft form, cache digests depend on intellectual property owned by the University of Wisconsin which is not currently widely licensed.
The Web Cache Control Protocol[WCCP] is developed by Cisco Systems, Inc. and is used in that manufacturer’s products to transparently redirect web traffic from a router to a caching proxy. The protocol specification has been released as an Internet draft.
Using the Web Cache Control Protocol, a router redirects Web requests to a Cache Engine (rather than to the intended Web server). The router also determines Cache Engine availability and redirects requests to new Cache Engines as they are added to the caching mesh.
Several different trials have been made with multicast inter-cache communication. Multicast ICP is implemented, but not in widespread use.
Although the caching proxy makes use of the same HTTP protocol to retrieve documents from the original information source as a client would use directly, extensions to the protocol permit a more efficient caching regime to be implemented.
Version 1.0 of HTTP[HTTP] returned the date of last modification of each information object, allowing simple caching on the basis that if the object in the cache was no older than the reported modification date, then the cache copy could be assumed to be “fresh”. An expiration date could also be added to an information object so that it could be automatically purged from a cache without reference to the information source to see if a newer version of the object existed.
Version 1.1 of the protocol adds “Cache Control Headers” to the information object. These extensions allow better control over caching:
Although most client software maintains its own local cache of recently-accessed information objects, once the cache is stored on a third-party caching proxy, information providers begin to have concerns about the level of control they have over their intellectual property and its exploitation.
Information in a cache may not always be up to date. Although HTTP version 1.1 gives content providers considerable flexibility in determining how long information can remain cached, it is not always easy to predict the amount of time for which information may remain valid. Misconfiguration at the information source or at the proxy can compound the problem.
However, where the lifetime of information can be predicted and the information source provides the right information in its HTTP response, the cache will always ensure the most recent information is served to the client.
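As an illustration of the information a source can supply in its HTTP/1.1 response, the Python sketch below reads a Cache-Control header value and derives how long a shared cache might keep the object. The function names are hypothetical, the parser is deliberately naive (it does not handle commas inside quoted strings), and treating no-cache as a zero lifetime is a simplification of "revalidate before reuse".

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):        # naive: ignores quoted commas
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def shared_cache_lifetime(value: str) -> Optional[int]:
    """Seconds a shared cache may serve the object, or None if unspecified."""
    directives = parse_cache_control(value)
    # The provider has forbidden storage, or reuse without checking back.
    if any(name in directives for name in ("no-store", "private", "no-cache")):
        return 0
    max_age = directives.get("max-age")
    if max_age is not None and max_age.isdigit():
        return int(max_age)
    return None                          # fall back to other freshness rules
```

For example, a response carrying "public, max-age=3600" may be served from the cache for an hour, whereas one carrying "private" should not be stored by a shared proxy at all.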
If sensitive information or information of commercial value is stored in a cache, the information provider may not trust the cache operator to ensure that copies are delivered only to authorised users.
The information provider may also not wish to trust the cache to serve up the information in its original form without unauthorised modification.
The effect of a cache is to funnel all information through a single proxy, or through an international hierarchy of proxies co-operating in a caching mesh.
The possibility exists to monitor all of a user’s information requests, or to cause one user to appear to impersonate another.
There is plenty of potential for a complex international legal tangle. Different laws may on the one hand require information about a user to be kept confidential for reasons of privacy in one country, while requiring information to be divulged to law-enforcement agencies in another on either a routine or case-by-case basis. The information provider may exist in yet another jurisdiction and operate within yet other legal constraints.
If information requests are not passed back to the information source, it becomes more difficult to track users, monitor the frequency of access to documents, or ensure that advertising is delivered to the information user.
There has been some recent discussion about the legality of caching, particularly in respect of copyright. Worries among information providers are voiced more strongly in places where web caching is not yet widely deployed – particularly in the USA. Much of this has been prompted by the perceived threat to the music industry of web sites containing compressed audio CDs leading to a general review of network copyright issues. A draft proposal from the European Commission would have made web caching illegal. However, it seems that the EC has been persuaded that the added benefits of caching outweigh the concerns of information providers.
Many information “sites” are in fact front ends to complex applications; the operators must ensure that critical pages are not cached so that information requests that require an interaction with the application are passed back to the information source.
State information is often maintained by storing “cookies” (small pieces of contextual data) at the client and adding them to the HTTP headers in the information request. Caching proxies that use version 1.0 of the HTTP protocol cannot handle cookies properly (the protocol contains insufficient information) and generally do not cache pages with associated cookies at all.
Cachebusting[BUST] is a term that has been applied to practices which defeat caching. Information providers may feel driven by some of the issues raised above to attempt to bypass caching proxies as much as possible. Others may simply do so by accident – not appreciating the effect on downstream caching proxies of local configuration decisions.
Content providers and server administrators can take steps to improve the extent to which information objects can be cached (for example, by performing as much client-side processing as possible through the use of client-side image maps, and locally-executed scripts). In reality, this is likely to be of concern only to the developers of popular sites (who most appreciate the benefits of downstream caching) or the determinedly good citizen.
Where sensitive content – for example a credit card number – is passed through a proxy, the end-user may not be able to trust that the cache-operator will not use the information for nefarious purposes, or disclose it to others. This is clearly one case in which “cachebusting” is justified and can be assured using an encryption technique such as SSL.
Proxies have less obvious effects on information security, however. The client’s network address is usually not presented to the information provider (that of the proxy server being given instead), so access controls based on network address will not work or can easily be subverted by re-routing requests from unauthorised domains via the proxy. As basic authentication cannot be trusted (especially if the essentially plain text password is relayed through the proxy), the use of caching strips away the vestiges of information security that these weak authentication mechanisms provide. Strong (cryptographic) authentication has the disadvantage of contributing to “cachebusting”.
A useful reading list can be found online at: http://www-sor.inria.fr/projects/relais/reading-list.html
[BUST] M. Hamilton. Work in Progress: Cachebusting – Cause & Prevention
http://www.ietf.org/internet-drafts/draft-hamilton-cachebusting-01.txt
[CARP] V. Valloppillil & K.W. Ross. Work in Progress: Cache Array Routing Protocol
http://msdn.microsoft.com/LIBRARY/BACKGRND/HTML/CARP.HTM
[CDIG] M. Hamilton, A. Rousskov & D. Wessels. Cache Digest Specification
http://squid.nlanr.net/Squid/CacheDigest/cache-digest-v5.txt
[HTCP] P. Vixie & D. Wessels. Work in Progress: Hypertext Caching Protocol
http://www.ietf.org/internet-drafts/draft-vixie-htcp-proto-04.txt
[HTTP] T. Berners-Lee, et al. Hypertext Transfer Protocol Version 1.0: RFC 1945
http://www.ietf.org/rfc/rfc1945.txt
R. Fielding, et al. Hypertext Transfer Protocol Version 1.1: RFC 2068
http://www.ietf.org/rfc/rfc2068.txt
[ICP] D. Wessels & K. Claffy. Internet Cache Protocol, Version 2: RFC 2186
http://www.ietf.org/rfc/rfc2186.txt
D. Wessels & K. Claffy. Application of Internet Cache Protocol, Version 2: RFC 2187
http://www.ietf.org/rfc/rfc2187.txt
[PAC] Netscape Communications Corporation. Proxy Auto-Configuration File Format
http://home.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html
[WCCP] D. Forster. Work in progress: Cisco Web Cache Control Protocol
[no reference currently extant]
[WPAD] P. Gauthier, et al. Work in Progress: Web Proxy Auto-Discovery
http://eggplant.rte.microsoft.com/wpad/wpad.txt