<?xml version="1.0" encoding="iso-8859-1"?>
<?dsd URI="article.dsd"?>
<?xml:stylesheet type="text/xsl" href="article2html.xsl"?>

<!--this document is available at http://www.brics.dk/DSD/examples/dsd.xml-->

<article postscript="http://www.brics.dk/DSD/dsd.ps"
         pdf="http://www.brics.dk/DSD/dsd.pdf">

<title>DSD: A Schema Language for XML</title>

<!--
  (note: this is an old version of the DSD paper - see the new version at 
   http://www.brics.dk/DSD/dsd.html)
-->

<authors>
  <names>Nils Klarlund</names>
  <affiliation>AT&amp;T Labs Research</affiliation>
  <link href="mailto:klarlund@research.att.com">klarlund@research.att.com</link>
</authors>
<authors>
  <names>Anders Møller <amp/> Michael I.<nbsp/>Schwartzbach</names>
  <affiliation>BRICS, University of Aarhus</affiliation>
  <link href="mailto:amoeller@brics.dk,mis@brics.dk">{amoeller,mis}@brics.dk</link>
</authors>

<abstract>

<p>
We present DSD (Document Structure Description), which is a schema
language for XML documents. A DSD is itself an XML document, which
describes a family of XML application documents.
</p>
<p>
The expressiveness of DSD goes far beyond the simple DTD concept,
covering advanced features such as conditional constraints, multiple
nonterminals for each element, gradual declaration of attributes,
support for both ordered and unordered contents, and constraints on
reference targets. We also support a declarative mechanism for
inserting default elements and attributes that is reminiscent of
cascading style sheets. Finally, we include a simple technique for
evolving DSD documents through selective redefinitions.
</p>
<p>
DSD is completely self-describable, meaning that the syntax of
legal DSD documents together with all static requirements are captured
in a special DSD document, the meta-DSD of less than 500 lines.
</p>
<p>
We relate DSD to other recent XML schema languages.  In particular,
we provide a critique of and comparison with the proposal from the W3C
XML Schema Working Group.
</p>
<p>
DSD is fully implemented and is available in an open source
distribution.  This prototype is guaranteed to process any application
document in linear time.
</p>

</abstract>

<section id="introduction">
<title>Introduction</title>

<p>
A Document Structure Description (DSD) is a specification of a
class of XML documents. A DSD defines a grammar for XML documents,
default element attributes and content, and documentation of the
class. A DSD is itself an XML document.  We have five major goals for the
descriptive power of DSDs, namely that they should:

<itemize>
<item>
allow context dependent descriptions of content and attributes, since
the context of a node, such as ancestors and attribute values, often
govern what is legal syntax;
</item>
<item>
generalize CSS<nbsp/><cite article="bos98:_cascad_style_sheet_css2_specif"/> 
(Cascading Style Sheets) so that readable, CSS-like rules for default
attribute values and default content can be defined for arbitrary XML
domains, not only predefined user formatting models;
</item>
<item>
complement XSLT<nbsp/><cite article="clark99:_xsl_trans_xslt_specif"/> in the sense
that the expressive power of DSDs should be close to that of XSLT, so
that assumptions made by XSLT style sheets can be made explicit in a
DSD;
</item>
<item>
permit the description of semi-structured data, that is, the
description of what references may point to; and
</item>
<item>
enable the redefinitions of syntactic classes, so that language
extensions can be expressed in terms of existing DSDs.
</item>
</itemize>
</p>

<p>
[...]
</p>

</section>

<section id="xmlconcepts">
<title>XML Concepts</title>

<p>
The reader is assumed familiar with standard XML concepts, such as
those defined in<nbsp/><cite article="bray98:_exten_markup_languag_xml"/>. 
The following provides a brief description of the XML object model
used in DSDs.
</p>
<p>
A well-formed XML document is represented as a tree. The leaf nodes
correspond to empty elements, chardata, processing instructions, and
comments. The internal nodes correspond to non-empty elements. DTD
information is not represented. Each element is labelled with a name
and a set of attributes, each consisting of a name and a value. Names,
values, and chardata are strings.
</p>
<p>
Child nodes are ordered. The<it>content</it> of an element is the
sequence of its child nodes. The <it>context</it> of a node is the
path of nodes from the root of the tree to the node itself. Element
nodes are ordered: An element <it>v</it> is <it>before</it> an element <it>w</it> if
the start tag of <it>v</it> occurs before the start tag of <it>w</it> in the usual
textual representation of the XML tree.
</p>
<p>
Processing instructions with target <tt>dsd</tt> or <tt>include</tt>
as well as elements and attributes with namespace
<tt>http://www.brics.dk/DSD</tt> contain information relevant to the
DSD processing. All other processing instructions and also chardata
consisting of white-space only and comments are ignored.
</p>

</section>

<section id="sec:dsdlanguage">
<title>The DSD Language</title>

<p>
A DSD defines the syntax of a family of conforming XML
documents. An <it>application document</it> is an XML document
intended to conform to a given DSD. It is the job of a <it>DSD
processor</it> to determine whether an application document is
conforming or not.
</p>
<p>
A DSD is itself an XML document. This section describes the main
aspects of the DSD notation and its meaning. For a complete
definition, we refer to<nbsp/><cite article="dsddoc99"/>.<br/>
</p>
<p>
A DSD is associated to an application document by placing a special
processing instruction in the document prolog. This processing
instruction has the form
<oneline>
<tt>&lt;?dsd URI="</tt><it>URI</it><tt>"?&gt;</tt>
</oneline>
where <it>URI</it> is the location of the DSD. A DSD processor
basically performs one top-down traversal of the application document
in order to check conformance. During this traversal, nodes are
assigned <it>constraints</it> by the DSD. A constraint specifies a
requirement of a node relative to its context and content that must be
fulfilled in order for the document to conform.  Initially, a
constraint is assigned to the root node. During the checking of a
node, its child nodes are assigned new constraints, which are checked
later in the traversal. Also, checking a constraint may cause default
attributes and child nodes to be inserted.  The term <it>the current
  element</it> is used in the following to refer to the node in the
application document that is being processed at a given moment during
the traversal.
</p>
<p>
If no constraints have been violated upon termination, then the
original document conforms to the DSD and the resulting document with
defaults inserted is output.
</p>
<p>
A DSD consists of a number of definitions, each associated with an ID
allowing reference for reuse and redefinition.  In the following, the
various kinds of definitions are described. We use a number of small
examples, some inspired by the XHTML language<nbsp/><cite
article="pemberton99:_xhtml"/> and some that are fragments of the book
example described in Section<nbsp/><ref section="sec:bookexample"/>.
</p>

<subsection>
<title>Element constraints</title>

<p>
The central kind of definition is the <it>element
definition</it>. An element definition defines a pair consisting of
an element name and a constraint. During the application document
processing, the elements in the application documents are assigned IDs
of such element definitions. An element can only be assigned the ID of
an element definition of the same name.
</p>
<p>
The IDs of element definitions are reminiscent of nonterminals in
context-free grammars. Each ID determines the requirements imposed on
the contents, attributes, and context of the element to which it is
assigned.  We allow several different element definitions with the
same name; thus, element names are not used as nonterminals. This
distinction allows several versions of an element to coexist.
</p>
<p>
As an example, consider a DSD describing a simple database
containing information about books, such as, their titles, authors,
ISBN numbers, and so on. Imagine that both the whole database and each
book entry should contain a <tt>title</tt> tag, but with different
structure.  Book entry titles may only contain chardata without
markup, but may optionally contain a <tt>style</tt> attribute; also,
defaults may be specified for book titles. Database titles may contain
arbitrary contents and no attributes. These two kinds of
<tt>title</tt> elements can be defined as follows:
<example><![CDATA[
<ElementDef ID="book-title" Name="title"
            Defaultable="yes">
  <Content><StringType/></Content>
</ElementDef>

<ElementDef ID="database-title" Name="title">
  <ZeroOrMore>
    <Union>
      <StringType/><AnyElement/>
    </Union>
  </ZeroOrMore>
</ElementDef>
]]></example>
</p>
<p>
[...]
</p>

</subsection>

<subsection>
<title>Attribute declarations</title> 

<p>
During evaluation of a constraint, attributes are declared
gradually.  Only attributes that have been declared are allowed in an
element.  Since constraints can be conditional and attributes are
declared inside constraints, this allows hierarchical structures of
attributes to be defined. For instance, in a <tt>input</tt> element,
the <tt>length</tt> attribute may only be present if the <tt>type</tt>
attribute is present and has value <tt>text</tt> or <tt>password</tt>.
</p>
<p>
An attribute declaration consists of a name and a string type. The
name specifies the name of the attribute, and the string type
specifies the set of its allowed values. It is an error if an
attribute being declared is not present in the current element, unless
it is declared as "optional".
</p>
<p>
[...]
</p>

</subsection>

<subsection id="stringtypes">
<title>String types</title>

<p>
[...]
</p>

</subsection>

<!-- [...] -->

</section>

<section id="sec:bookexample">
<title>The Book Example</title>

<p>
We now present a small example of a complete DSD. It describes an XML
syntax for databases of books. Such a description could be arbitrarily
detailed; we have settled for title, ISBN number, authors (with
homepages), publisher (with homepage), publication year, and
reviews. The main structure of the DSD is as follows:
</p>
<!-- [...] -->

</section>

<!-- [...] -->

<references>
<item id="dsddoc99">
  <authors>Nils Klarlund, Anders Møller, and Michael I. Schwartzbach</authors>
  <title>Document Structure Description 1.0</title>
  <publisher>AT<amp/>T <amp/> BRICS</publisher>
  <note>URL:<nbsp/><nbsp/><link>http://www.brics.dk/DSD/specification.html</link></note>
  <month>October</month>
  <year>1999</year>
</item>
<item id="bray98:_exten_markup_languag_xml">
  <authors>Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen</authors>
  <title>Extensible Markup Language (XML) 1.0</title>
  <publisher>W3C</publisher>
  <note>URL:<nbsp/><nbsp/><link>http://www.w3.org/TR/REC-xml</link></note>
  <year>1998</year>
</item>
<item id="pemberton99:_xhtml">
  <authors>Steven Pemberton et al.</authors>
  <title>XHTML 1.0: The Extensible HyperText Markup Language</title>
  <publisher>W3C</publisher>
  <note>URL:<nbsp/><nbsp/><link>http://www.w3.org/TR/WD-html-in-xml</link></note>
  <year>1999</year>
</item>
</references>

<vitae>

<person img="http://www.brics.dk/DSD/examples/klarlund.jpg" alt="klarlund@research.att.com" width="100">

<bf>Nils Klarlund</bf> received his Ph.D.<nbsp/>(Liberal Arts) from
Finkelstein Mail-Order College in 1989. He has bungled through life
since then, before remarkably landing a real job a AT<amp/>T, whose
stock value has subsequently plunged 43<percent/>. By the generosity of
numerous co-authors, his name appears on several publications.
<br/><br/> 
<it>Homepage:</it> <link href="http://www.research.att.com/~klarlund/">http://www.research.att.com/<tilde/>klarlund/</link>

</person>

<person img="http://www.brics.dk/DSD/examples/amoeller.jpg" alt="amoeller@brics.dk" width="100">

<bf>Anders Møller</bf> is a Ph.D.<nbsp/>student at BRICS at the University of
Aarhus, Denmark. His main research interests include programming
languages, formal verification, and Web technology.  In addition to
the DSD project, he is involved in the BRICS MONA project and the
<tt>&lt;bigwig&gt;</tt> project.
<br/><br/>
<it>Homepage:</it> <link href="http://www.brics.dk/~amoeller/">http://www.brics.dk/<tilde/>amoeller/</link>

</person>

<person img="http://www.brics.dk/DSD/examples/mis.jpg" alt="" width="100">

<bf>Michael I.<nbsp/>Schwartzbach</bf> received his Ph.D.<nbsp/>(Computer Science)
from Cornell University in 1987. He is an associate professor at the
University of Aarhus and a kernel researcher at the BRICS Research
Center. Michael I.<nbsp/>Schwartzbach has studied design and implementation
of programming languages, type systems, static analysis, and
applications of logic.

<br/><br/>
<it>Homepage:</it> <link href="http://www.brics.dk/~mis">http://www.brics.dk/<tilde/>mis/</link>
</person>

</vitae>

</article>

