This paper is also available in Postscript and PDF for printing.



DSD: A Schema Language for XML

Nils Klarlund
AT&T Labs Research
klarlund@research.att.com

Anders Møller & Michael I. Schwartzbach
BRICS, Aarhus University
{amoeller,mis}@brics.dk

Abstract

We present DSD (Document Structure Description), which is a schema language for XML documents. A DSD is itself an XML document, which describes a family of XML application documents.

The expressiveness of DSD goes far beyond the simple DTD concept, covering advanced features such as conditional constraints, multiple nonterminals for each element, gradual declaration of attributes, support for both ordered and unordered contents, and constraints on reference targets. We also support a declarative mechanism for inserting default elements and attributes that is reminiscent of cascading style sheets. Finally, we include a simple technique for evolving DSD documents through selective redefinitions.

DSD is completely self-describable, meaning that the syntax of legal DSD documents together with all static requirements are captured in a special DSD document, the meta-DSD of less than 500 lines.

We relate DSD to other recent XML schema languages. In particular, we provide a critique of and comparison with the proposal from the W3C XML Schema Working Group.

DSD is fully implemented and is available in an open source distribution. This prototype is guaranteed to process any application document in linear time.

1. Introduction

A Document Structure Description (DSD) is a specification of a class of XML documents. A DSD defines a grammar for XML documents, default element attributes and content, and documentation of the class. A DSD is itself an XML document. We have five major goals for the descriptive power of DSDs, namely that they should:

[...]

2. XML Concepts

The reader is assumed familiar with standard XML concepts, such as those defined in  [] . The following provides a brief description of the XML object model used in DSDs.

A well-formed XML document is represented as a tree. The leaf nodes correspond to empty elements, chardata, processing instructions, and comments. The internal nodes correspond to non-empty elements. DTD information is not represented. Each element is labelled with a name and a set of attributes, each consisting of a name and a value. Names, values, and chardata are strings.

Child nodes are ordered. Thecontent of an element is the sequence of its child nodes. The context of a node is the path of nodes from the root of the tree to the node itself. Element nodes are ordered: An element v is before an element w if the start tag of v occurs before the start tag of w in the usual textual representation of the XML tree.

Processing instructions with target dsd or include as well as elements and attributes with namespace http://www.brics.dk/DSD contain information relevant to the DSD processing. All other processing instructions and also chardata consisting of white-space only and comments are ignored.

3. The DSD Language

A DSD defines the syntax of a family of conforming XML documents. An application document is an XML document intended to conform to a given DSD. It is the job of a DSD processor to determine whether an application document is conforming or not.

A DSD is itself an XML document. This section describes the main aspects of the DSD notation and its meaning. For a complete definition, we refer to  [] .

A DSD is associated to an application document by placing a special processing instruction in the document prolog. This processing instruction has the form


     <?dsd URI="URI"?>


where URI is the location of the DSD. A DSD processor basically performs one top-down traversal of the application document in order to check conformance. During this traversal, nodes are assigned constraints by the DSD. A constraint specifies a requirement of a node relative to its context and content that must be fulfilled in order for the document to conform. Initially, a constraint is assigned to the root node. During the checking of a node, its child nodes are assigned new constraints, which are checked later in the traversal. Also, checking a constraint may cause default attributes and child nodes to be inserted. The term the current element is used in the following to refer to the node in the application document that is being processed at a given moment during the traversal.

If no constraints have been violated upon termination, then the original document conforms to the DSD and the resulting document with defaults inserted is output.

A DSD consists of a number of definitions, each associated with an ID allowing reference for reuse and redefinition. In the following, the various kinds of definitions are described. We use a number of small examples, some inspired by the XHTML language  [] and some that are fragments of the book example described in Section 4.

3.1 Element constraints

The central kind of definition is the element definition. An element definition defines a pair consisting of an element name and a constraint. During the application document processing, the elements in the application documents are assigned IDs of such element definitions. An element can only be assigned the ID of an element definition of the same name.

The IDs of element definitions are reminiscent of nonterminals in context-free grammars. Each ID determines the requirements imposed on the contents, attributes, and context of the element to which it is assigned. We allow several different element definitions with the same name; thus, element names are not used as nonterminals. This distinction allows several versions of an element to coexist.

As an example, consider a DSD describing a simple database containing information about books, such as, their titles, authors, ISBN numbers, and so on. Imagine that both the whole database and each book entry should contain a title tag, but with different structure. Book entry titles may only contain chardata without markup, but may optionally contain a style attribute; also, defaults may be specified for book titles. Database titles may contain arbitrary contents and no attributes. These two kinds of title elements can be defined as follows:


<ElementDef ID="book-title" Name="title"
            Defaultable="yes">
  <Content><StringType/></Content>
</ElementDef>

<ElementDef ID="database-title" Name="title">
  <ZeroOrMore>
    <Union>
      <StringType/><AnyElement/>
    </Union>
  </ZeroOrMore>
</ElementDef>

[...]

3.2 Attribute declarations

During evaluation of a constraint, attributes are declared gradually. Only attributes that have been declared are allowed in an element. Since constraints can be conditional and attributes are declared inside constraints, this allows hierarchical structures of attributes to be defined. For instance, in a input element, the length attribute may only be present if the type attribute is present and has value text or password.

An attribute declaration consists of a name and a string type. The name specifies the name of the attribute, and the string type specifies the set of its allowed values. It is an error if an attribute being declared is not present in the current element, unless it is declared as "optional".

[...]

3.3 String types

[...]

4. The Book Example

We now present a small example of a complete DSD. It describes an XML syntax for databases of books. Such a description could be arbitrarily detailed; we have settled for title, ISBN number, authors (with homepages), publisher (with homepage), publication year, and reviews. The main structure of the DSD is as follows:

References

[...]



klarlund@research.att.com Nils Klarlund received his Ph.D. (Liberal Arts) from Finkelstein Mail-Order College in 1989. He has bungled through life since then, before remarkably landing a real job a AT&T, whose stock value has subsequently plunged 43%. By the generosity of numerous co-authors, his name appears on several publications.

Homepage: http://www.research.att.com/~klarlund/
amoeller@brics.dk Anders Møller is a Ph.D. student at BRICS at the University of Aarhus, Denmark. His main research interests include programming languages, formal verification, and Web technology. In addition to the DSD project, he is involved in the BRICS MONA project and the <bigwig> project.

Homepage: http://www.brics.dk/~amoeller/
Michael I. Schwartzbach received his Ph.D. (Computer Science) from Cornell University in 1987. He is an associate professor at the Aarhus University and a kernel researcher at the BRICS Research Center. Michael I. Schwartzbach has studied design and implementation of programming languages, type systems, static analysis, and applications of logic.

Homepage: http://www.brics.dk/~mis/