XML Entity and URI Resolvers
It's very common for web resources to be related to other resources: documents rely on DTDs and schemas, schemas are derived from other schemas, stylesheets are often customizations of other stylesheets, documents refer to the schemas and stylesheets with which they expect to be processed, etc. These relationships are expressed using URIs, most often URLs.
Relying on URLs to directly identify resources to be retrieved often causes problems for end users:
- If they're absolute URLs, they only work when you can reach them[1]. Relying on remote resources makes XML processing susceptible to both planned and unplanned network downtime.
  The URL “http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd” isn't very useful if I'm on an airplane at 35,000 feet.
- If they're relative URLs, they're only useful in the context where they were initially created.
  The URL “../../xml/dtd/docbookx.xml” isn't useful _anywhere_ on my system. Neither, for that matter, is “/export/home/fred/docbook412/docbookx.xml”.
One way to avoid these problems is to use an entity resolver (a standard part of SAX) or a URI Resolver (a standard part of JAXP). A resolver can examine the URIs of the resources being requested and determine how best to satisfy those requests.
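As a rough illustration of the entity resolver idea, here is a minimal sketch of a SAX `EntityResolver` that redirects known public identifiers to local copies. The class name `LocalDtdResolver`, the in-memory map, and the local file path are assumptions made for this example; they are not part of the Resolver classes described in this article.

```java
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver: maps known public identifiers to local copies. */
final class LocalDtdResolver implements EntityResolver {
    private final Map<String, String> localCopies;

    LocalDtdResolver(Map<String, String> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = (publicId == null) ? null : localCopies.get(publicId);
        // Returning null tells the parser to fall back to the system identifier.
        return (local == null) ? null : new InputSource(local);
    }

    public static void main(String[] args) {
        // The local path is an assumption; substitute wherever the DTD really lives.
        LocalDtdResolver resolver = new LocalDtdResolver(Map.of(
                "-//OASIS//DTD DocBook XML V4.1.2//EN",
                "file:///usr/local/share/docbook/docbookx.dtd"));
        InputSource hit = resolver.resolveEntity(
                "-//OASIS//DTD DocBook XML V4.1.2//EN",
                "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd");
        System.out.println(hit.getSystemId());
    }
}
```

A resolver like this would typically be installed with `XMLReader.setEntityResolver` before parsing. Hard-coding the map in Java is exactly what catalog files are meant to avoid.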
The best way to make this function in an interoperable way is to define a standard format for mapping system identifiers and URIs. The OASIS Entity Resolution Technical Committee is defining an XML representation for just such a mapping. These “catalog files” can be used to map public and system identifiers and other URIs to local files (or just other URIs).
The Resolver classes that are described in this article greatly simplify the task of using Catalog files to perform entity resolution. Many users will want to simply use these classes directly “out of the box” with their applications (such as Xalan and Saxon), but developers may also be interested in the JavaDoc API Documentation. The full documentation, current source code, and discussion mailing list are available from the Apache XML Commons project.
See the release notes.
The most important change in this release is the availability of both source and binary forms under a generous license agreement.
Other than that, there have been a number of minor bug fixes and the introduction of system properties in addition to the `CatalogManager.properties` file to control the resolver.
The problems associated with system identifiers (and URIs in general) arise in several ways:
1. I have an XML document that I want to publish on the web or include in the distribution of some piece of software. On my system, I keep the doctype of the document in some local directory, so my doctype declaration reads:
As soon as I distribute this document, I immediately begin getting error reports from customers who can't read the document because they don't have DocBook installed at the location identified by the URI in my document.
2. Or I remember to change the URI before I publish the document:
And the next time I try to edit the document, _I get errors_ because I happen to be working on my laptop on a plane somewhere and can't get to the net.
3. Just as often, I get tripped up this way: I'm working collaboratively with a colleague. She's created initial drafts of some documents that I'm supposed to review and edit. So I grab them and find that I can't open or publish them because I don't have the same network connections she has or I don't have my applications installed in the same place. And if I change the system identifiers so they work on my system, she has the same problems when I send them back to her.
4. These problems aren't limited to editing applications. If I write a special stylesheet for formatting our collaborative document, it will include some reference to the “main” stylesheet:
<xsl:import href="/path/to/real/stylesheet.xsl"/>
But this won't work on my colleague's machine because she has the main stylesheet installed somewhere else.
Public identifiers offer an effective solution to this problem, at least for documents. They provide global, unique names for entities independent of their storage location. Unfortunately, public identifiers aren't used very often because many users find that they cannot rely on applications resolving them in an interoperable manner.
For XSLT, XML Schemas, and other applications that rely on URIs without providing a mechanism for associating public identifiers with them, the situation is a little more irksome, but it can still be addressed using a URI Resolver.
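For the stylesheet case, a JAXP `URIResolver` can play the same role. The following is a minimal sketch, assuming a hypothetical class `MappedUriResolver` that looks up each `href` in a simple map; the target location is invented for the example.

```java
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical resolver: redirects selected hrefs to alternate locations. */
final class MappedUriResolver implements URIResolver {
    private final Map<String, String> mappings;

    MappedUriResolver(Map<String, String> mappings) {
        this.mappings = mappings;
    }

    @Override
    public Source resolve(String href, String base) {
        String target = (href == null) ? null : mappings.get(href);
        // Returning null asks the processor to resolve the href itself.
        return (target == null) ? null : new StreamSource(target);
    }

    public static void main(String[] args) {
        // The target location is an assumption made for illustration.
        MappedUriResolver resolver = new MappedUriResolver(Map.of(
                "/path/to/real/stylesheet.xsl",
                "file:///home/colleague/xsl/stylesheet.xsl"));
        Source source = resolver.resolve("/path/to/real/stylesheet.xsl", null);
        System.out.println(source.getSystemId());
    }
}
```

Such a resolver would normally be registered with `TransformerFactory.setURIResolver`, so that `xsl:import` and `xsl:include` references pass through it.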
The OASIS Entity Resolution Technical Committee is actively defining the next generation XML-based catalog file format. When this work is finished, it is expected to become the official XML Catalog format. In the meantime, the existing OASIS Technical Resolution TR9401 format is the standard.
OASIS XML Catalogs are being defined by the Entity Resolution Technical Committee. This article describes the 01 Aug 2001 draft. Note that this draft is labelled to reflect that it is “not an official committee work product and may not reflect the consensus opinion of the committee.”
The document element for OASIS XML Catalogs is `catalog`. The official namespace name for OASIS XML Catalogs is “urn:oasis:names:tc:entity:xmlns:xml:catalog”.
There are eight elements that can occur in an XML Catalog: `group`, `public`, `system`, `uri`, `delegatePublic`, `delegateSystem`, `delegateURI`, and `nextCatalog`:
<catalog _`prefer="public|system"`_ _`xml:base="uri-reference"`_>
The `catalog` element is the root of an XML Catalog.
The `prefer` setting determines whether or not public identifiers specified in the catalog are to be used in favor of system identifiers supplied in the document. Suppose you have an entity in your document for which both a public identifier and a system identifier have been specified, and the catalog only contains a mapping for the public identifier (e.g., a matching `public` catalog entry). If the current value of `prefer` is “public”, the URI supplied in the matching `public` catalog entry will be used. If it is “system”, the system identifier in the document will be used. (If the catalog contained a matching `system` catalog entry giving a mapping for the system identifier, that mapping would have been used, the public identifier would never have been considered, and the setting of `prefer` would have been irrelevant.)
Generally, the purpose of catalogs is to override the system identifiers in XML documents, so `prefer` should usually be “public” in your catalogs.
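The decision described above can be sketched in a few lines of Java. This is only an illustration of the rule as stated here, not the code used by the Resolver classes; the class `PreferSketch`, its two maps standing in for `system` and `public` catalog entries, and the local file paths are all hypothetical.

```java
import java.util.Map;

/** Hypothetical model of how prefer affects resolution. */
final class PreferSketch {
    enum Prefer { PUBLIC, SYSTEM }

    static String resolve(String publicId, String systemId, Prefer prefer,
                          Map<String, String> systemEntries,
                          Map<String, String> publicEntries) {
        // A matching system entry always wins; prefer is irrelevant here.
        if (systemId != null && systemEntries.containsKey(systemId)) {
            return systemEntries.get(systemId);
        }
        // A matching public entry is used only when prefer is "public".
        if (publicId != null && prefer == Prefer.PUBLIC
                && publicEntries.containsKey(publicId)) {
            return publicEntries.get(publicId);
        }
        // Otherwise fall back to the system identifier from the document.
        return systemId;
    }

    public static void main(String[] args) {
        String pub = "-//OASIS//DTD DocBook XML V4.1.2//EN";
        String sys = "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd";
        // The local path is an assumption made for illustration.
        Map<String, String> publicEntries =
                Map.of(pub, "file:///usr/local/share/docbook/docbookx.dtd");
        System.out.println(resolve(pub, sys, Prefer.PUBLIC, Map.of(), publicEntries));
        System.out.println(resolve(pub, sys, Prefer.SYSTEM, Map.of(), publicEntries));
    }
}
```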
The `xml:base` URI is used to resolve relative URIs in the catalog as described in the XML Base specification.
<group _`prefer="public|system"`_ _`xml:base="uri-reference"`_>
The `group` element serves merely as a wrapper around one or more other entries for the purpose of establishing the preference and base URI settings for those entries.
<public publicId="_`pubid`_" uri="_`systemuri`_"/>
Maps the public identifier _`pubid`_ to the system identifier _`systemuri`_.
<system systemId="_`sysid`_" uri="_`systemuri`_"/>
Maps the system identifier _`sysid`_ to the alternate system identifier _`systemuri`_.
<uri name="_`uri`_" uri="_`alternateuri`_"/>
The `uri` entry maps a _`uri`_ to an _`alternateuri`_. This mapping, as might be performed by a JAXP URIResolver, for example, is independent of system and public identifier resolution.
<delegatePublic publicIdStartString="_`pubid-prefix`_" catalog="_`cataloguri`_"/>
<delegateSystem systemIdStartString="_`sysid-prefix`_" catalog="_`cataloguri`_"/>
<delegateURI uriStartString="_`uri-prefix`_" catalog="_`cataloguri`_"/>
The delegate entries specify that identifiers beginning with the matching prefix should be resolved using the catalog specified by the _`cataloguri`_. If multiple delegate entries of the same kind match, they will each be searched, starting with the longest prefix and continuing with the next longest to the shortest.
The delegate entries differ from the `nextCatalog` entry in the following way: alternate catalogs referenced with a `nextCatalog` entry are parsed and included in the current catalog. Delegated catalogs are only considered, and consequently only loaded and parsed, if necessary. Delegated catalogs are also used instead of the current catalog, not as part of the current catalog.
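The search order for matching delegate entries can be sketched as follows. This is a simplified model, not the Resolver implementation: the class `DelegateOrderSketch`, the map from prefix to catalog URI, and the catalog file names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical model of delegate matching: longest matching prefix first. */
final class DelegateOrderSketch {
    /** Returns the delegated catalogs to consult for id, in search order. */
    static List<String> catalogsFor(String id, Map<String, String> delegates) {
        List<Map.Entry<String, String>> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : delegates.entrySet()) {
            if (id.startsWith(entry.getKey())) {
                matches.add(entry);
            }
        }
        // Sort by prefix length, longest first.
        matches.sort(Comparator.comparingInt(
                (Map.Entry<String, String> e) -> e.getKey().length()).reversed());
        List<String> catalogs = new ArrayList<>();
        for (Map.Entry<String, String> entry : matches) {
            catalogs.add(entry.getValue());
        }
        return catalogs;
    }

    public static void main(String[] args) {
        // Prefixes and catalog names are invented for this example.
        Map<String, String> delegates = Map.of(
                "-//OASIS//", "oasis.xml",
                "-//OASIS//DTD DocBook", "docbook.xml",
                "-//W3C//", "w3c.xml");
        System.out.println(
                catalogsFor("-//OASIS//DTD DocBook XML V4.1.2//EN", delegates));
    }
}
```

Note that the result is a list of catalogs to load only when needed, which reflects the lazy behaviour that distinguishes delegation from `nextCatalog`.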
<rewriteSystem systemIdStartString="_`sysid-prefix`_" rewritePrefix="_`new-prefix`_"/>
<rewriteURI uriStartString="_`uri-prefix`_" rewritePrefix="_`new-prefix`_"/>
Supports generalized rewriting of system identifiers and URIs. This allows all of the URI references to a particular document (which might include many different fragment identifiers) to be remapped to a different resource.
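Prefix rewriting amounts to replacing the matched start string and keeping the rest of the identifier. The sketch below shows the idea. The class `RewriteSketch` and the local prefix are hypothetical; choosing the longest matching prefix when several rules match, and returning the input unchanged when none match, are assumptions of this example rather than statements from this article.

```java
import java.util.Map;

/** Hypothetical model of rewriteSystem / rewriteURI prefix replacement. */
final class RewriteSketch {
    static String rewrite(String uri, Map<String, String> rules) {
        String bestPrefix = null;
        for (String prefix : rules.keySet()) {
            // Assumption: the longest matching start string wins.
            if (uri.startsWith(prefix)
                    && (bestPrefix == null || prefix.length() > bestPrefix.length())) {
                bestPrefix = prefix;
            }
        }
        if (bestPrefix == null) {
            return uri; // no rule applies; a real catalog would try other entries
        }
        // Swap the prefix and keep the remainder, including any fragment.
        return rules.get(bestPrefix) + uri.substring(bestPrefix.length());
    }

    public static void main(String[] args) {
        // The local prefix is an assumption made for illustration.
        Map<String, String> rules = Map.of(
                "http://www.oasis-open.org/docbook/",
                "file:///usr/local/share/docbook/");
        System.out.println(rewrite(
                "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd", rules));
    }
}
```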
<nextCatalog catalog="_`cataloguri`_"/>
Adds the catalog file specified by the _`cataloguri`_ to the end of the current catalog. This allows one catalog to refer to another.
These catalogs are officially defined by OASIS Technical Resolution TR9401.
A Catalog is a text file that contains a sequence of entries. Of the 13 types of entries that are possible, only six are commonly applicable in XML systems: BASE, CATALOG, OVERRIDE, DELEGATE, PUBLIC, and SYSTEM:
BASE uri
Catalog entries can contain relative URIs. The BASE entry changes the base URI for subsequent relative URIs. The initial base URI is the URI of the catalog file.
In XML Catalogs, this functionality is provided by the closest applicable `xml:base` attribute, usually on the surrounding `catalog` or `group` element.
CATALOG cataloguri
This entry serves the same purpose as the `nextCatalog` entry in XML Catalogs.
OVERRIDE YES|NO
This entry enables or disables overriding of system identifiers for subsequent entries in the catalog file.
In XML Catalogs, this functionality is provided by the closest applicable `prefer` attribute on the surrounding `catalog` or `group` element.
An override value of “yes” is equivalent to “prefer="public"”.
DELEGATE pubid-prefix cataloguri
This entry serves the same purpose as the `delegatePublic` entry in XML Catalogs.
PUBLIC pubid systemuri
This entry serves the same purpose as the `public` entry in XML Catalogs.
SYSTEM sysid systemuri
This entry serves the same purpose as the `system` entry in XML Catalogs.
The Resolver classes use either Java system properties or a standard Java properties file to establish an initial environment. The property file, if it is used, must be called `CatalogManager.properties` and must be somewhere on your `CLASSPATH`. The following properties are supported:
System property `xml.catalog.files`; CatalogManager property `catalogs`
A semicolon-delimited list of catalog files. These are the catalog files that are initially consulted for resolution.
Unless you are incorporating the resolver classes into your own applications, and subsequently establishing an initial set of catalog files through some other means, at least one file must be specified, or all resolution will fail.
System property `xml.catalog.prefer`; CatalogManager property `prefer`
The initial prefer setting, either `public` or `system`.
System property `xml.catalog.verbosity`; CatalogManager property `verbosity`
An indication of how much status/debugging information you want to receive. The value is a number; the larger the number, the more information you will receive. A setting of 0 turns off all status information.
System property `xml.catalog.staticCatalog`; CatalogManager property `static-catalog`
In the course of processing, an application may parse several XML documents. If you are using the built-in `CatalogResolver`, this option controls whether or not a new instance of the resolver is constructed for each parse. For performance reasons, using a value of `yes`, indicating that a static catalog should be used for all parsing, is probably best.
System property `xml.catalog.allowPI`; CatalogManager property `allow-oasis-xml-catalog-pi`
This setting allows you to toggle whether or not the resolver classes obey the `<?oasis-xml-catalog?>` processing instruction.
System property `xml.catalog.className`; CatalogManager property `catalog-class-name`
If you're using the convenience classes (`org.apache.xml.resolver.tools.*`), this setting allows you to specify an alternate class name to use for the underlying catalog.
CatalogManager property `relative-catalogs`
If `relative-catalogs` is `yes`, relative catalogs in the `catalogs` property will be left relative; otherwise they will be made absolute with respect to the base URI of the `CatalogManager.properties` file. This setting has no effect on catalogs loaded from the `xml.catalog.files` system property (which are always returned unchanged).
System property `xml.catalog.ignoreMissing`
By default, the resolver will issue warning messages if it cannot find a `CatalogManager.properties` file, or if resources are missing in that file. However, if either `xml.catalog.ignoreMissing` is `yes`, or catalog files are specified with the `xml.catalog.files` system property, this warning will be suppressed.
My `CatalogManager.properties` file looks like this:
The Resolver distribution includes a couple of test programs, `resolver` and `xparse`, that you can use to see how this all works.
The `xparse` command simply sets up a catalog resolver and then parses a document. Any external entities encountered during the parse are resolved appropriately using the catalogs provided.
In order to use the program, you must have the `resolver.jar` file on your `CLASSPATH` and you must be using JAXP. In the examples that follow, I've already got these files on my `CLASSPATH`.
The file we'll be parsing is shown in Example 6, “An xparse Example File”.
First let's look at what happens if you try to parse this document without any catalogs. For this example, I deleted the `catalogs` entry in my `CatalogManager.properties` file. As expected, the parse fails:
With an appropriate catalog file, we can map the public identifier to a local copy of the DTD. We could have mapped the system identifier instead (or as well), but the public identifier is probably more stable.
Using a command-line option to specify the catalog, I can now successfully parse the document:
The additional messages in each of these examples arise as a consequence of the debugging option, `-d 2`. In practice, you can make resolution silent.