Digital Objects

A Brief Introduction to XML (eXtensible Markup Language)

XML is a flexible syntax for structuring data. It was intentionally designed to be human-readable and easy to create (see: XML Recommendation - 1.1 Origin and Goals). It has become a de facto standard for a wide variety of uses, including data exchange, configuration files, and document formats.

 

TAG SYNTAX

The syntax of XML will be familiar to anyone who has experience with HTML. Unlike in HTML, however, YOU decide what kinds of elements your document uses and what they mean (hence, "extensible"). While HTML consists of a predefined set of tags that browsers interpret in order to display content, XML defines the logical structure of content and has no inherent display properties.

 

NODES

The most basic concept is that of a NODE or ELEMENT. There are two basic types of NODES: one contains content; the other is an "empty" node.

A containing element must have a START-TAG and a matching END-TAG, which delimit its content. The type of the element is identified by its NAME (which you are free to define).

ex: <name>...</name>
note the format: the START-TAG consists of the NODE's NAME enclosed in chevrons (angle brackets); the END-TAG is exactly like the START-TAG except that the NAME is preceded by a forward slash.

This syntax is identical to that of HTML. For example, the paragraph element looks like this:

<p>...</p>

An EMPTY-ELEMENT TAG looks like this:
ex: <name/>
note the format: the EMPTY-ELEMENT TAG consists of the NODE's NAME followed by a forward slash enclosed in chevrons.
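
The tag rules above can be checked mechanically. The following is a minimal sketch, assuming Python and its standard-library xml.etree.ElementTree parser (neither is part of this handout); the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as a single XML element."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

containing = "<name>some content</name>"   # START-TAG, content, matching END-TAG
empty = "<name/>"                          # EMPTY-ELEMENT TAG
mismatched = "<name>some content</nom>"    # END-TAG does not match the START-TAG

print(is_well_formed(containing))
print(is_well_formed(empty))
print(is_well_formed(mismatched))
```

Only the third string is rejected, because its END-TAG carries a different NAME than its START-TAG.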

 

ATTRIBUTES

A NODE may have any number of ATTRIBUTES. These are contained within the START-TAG (or within the EMPTY-ELEMENT TAG). Every ATTRIBUTE must have a NAME and a VALUE, written in the pattern name="value" (note that the value is enclosed within quotation marks).
ex: <nodeName attributeName="value" otherAttributeName="another value" />

A NODE NAME and any ATTRIBUTE NAME may not contain spaces, but an ATTRIBUTE VALUE can contain almost any character (the exceptions are < and &, which must be written as &lt; and &amp;).
ex: <soda type="Coke" percentSugar="82.75" carbonated="yes"/>
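
To see how a program reads these attributes, here is a small sketch, again assuming Python's standard-library xml.etree.ElementTree (an assumption, not something this handout prescribes). Note that every attribute value arrives as text; turning "82.75" into a number or "yes" into a true/false value is left to the program.

```python
import xml.etree.ElementTree as ET

soda = ET.fromstring('<soda type="Coke" percentSugar="82.75" carbonated="yes"/>')

print(soda.tag)     # the NODE NAME
print(soda.attrib)  # the ATTRIBUTES as a dictionary of name -> value strings

# Attribute values are always strings; conversion is the program's job.
sugar = float(soda.get("percentSugar"))
is_carbonated = soda.get("carbonated") == "yes"
print(sugar, is_carbonated)
```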

 

NESTING

A node may have CHILDREN: nodes nested inside it, of which it is the PARENT. In other words, any containing node may hold more nodes or text:
ex: <house><room><furniture/></room><room><furniture/><furniture/></room></house>

Notice that the furniture contains nothing (has no children) while the house explicitly contains 2 rooms. The example expanded:

<house>
	<room type="kitchen">
		<furniture type="table"/>
	</room>
	<room type="living room">
		<furniture type="sofa"/>
		<furniture type="television" note="no cable"/>
	</room>
</house>

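The indentation makes the hierarchy explicit here, although it is only there for the reader; the nesting of the tags is what defines the structure. In some cases the hierarchy is less evident than it is for a house and its rooms, as in the case of a book and its author (see below).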
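
The house example can be walked programmatically to recover the same PARENT/CHILDREN hierarchy from the tags. This is a sketch under the assumption that Python's standard-library xml.etree.ElementTree is used; the function name outline is hypothetical.

```python
import xml.etree.ElementTree as ET

HOUSE = """
<house>
    <room type="kitchen">
        <furniture type="table"/>
    </room>
    <room type="living room">
        <furniture type="sofa"/>
        <furniture type="television" note="no cable"/>
    </room>
</house>
"""

def outline(node, depth=0):
    """Return one line per node, indented by its depth in the tree."""
    label = node.get("type", "")
    lines = ["  " * depth + node.tag + (f" ({label})" if label else "")]
    for child in node:  # iterating a node yields its direct CHILDREN
        lines.extend(outline(child, depth + 1))
    return lines

house = ET.fromstring(HOUSE)
print("\n".join(outline(house)))
print(len(house))  # number of direct children of the house
```

The furniture nodes produce no further lines beneath them because they have no children.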

 

FLEXIBILITY

There are many ways to structure any data. Do we describe an author and her books like this?

<author>
	<name>Pam Lee</name>
	<book> 
		<title>Chronophobia</title>
	</book> 
	<book>
		<title>Object to Be Destroyed</title> 
	</book>
</author>

Or?

<book title="Chronophobia">
	<author name="Pam Lee"/>
</book>
<book title="Object to Be Destroyed">
	<author name="Pam Lee"/>
</book>

Or?

<author name="Pam Lee">
	<book title="Chronophobia"/>
	<book title="Object to Be Destroyed"/>
</author>

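All three descriptions are legitimate ways to model the data. (Note that the second one lists two top-level book nodes; a complete XML document needs a single outermost node, so in practice they would be wrapped in one enclosing element.)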
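
The choice of structure matters mostly to the program that reads it. As a rough sketch (assuming Python's standard-library xml.etree.ElementTree; the function names and the enclosing library element are hypothetical additions, the latter needed only so the second form has a single outermost node), each structure needs slightly different code to yield the same author/title pairs:

```python
import xml.etree.ElementTree as ET

NESTED = """<author><name>Pam Lee</name>
  <book><title>Chronophobia</title></book>
  <book><title>Object to Be Destroyed</title></book>
</author>"""

# <library> is a hypothetical wrapper, not part of the original example.
BOOK_FIRST = """<library>
  <book title="Chronophobia"><author name="Pam Lee"/></book>
  <book title="Object to Be Destroyed"><author name="Pam Lee"/></book>
</library>"""

ATTRS = """<author name="Pam Lee">
  <book title="Chronophobia"/>
  <book title="Object to Be Destroyed"/>
</author>"""

def pairs_nested(text):
    """Author name and titles are stored as text inside child nodes."""
    author = ET.fromstring(text)
    name = author.findtext("name")
    return [(name, book.findtext("title")) for book in author.findall("book")]

def pairs_book_first(text):
    """Each book carries its title as an attribute and its author as a child."""
    root = ET.fromstring(text)
    return [(book.find("author").get("name"), book.get("title"))
            for book in root.findall("book")]

def pairs_attrs(text):
    """Everything is stored in attributes."""
    author = ET.fromstring(text)
    return [(author.get("name"), book.get("title")) for book in author.findall("book")]

print(pairs_nested(NESTED))
print(pairs_book_first(BOOK_FIRST))
print(pairs_attrs(ATTRS))
```

All three calls produce the same list of pairs; only the lookup code differs.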

The second example repeats Pam's name in every book, and duplicated data is not ideal. If the second example were the desired structure, we might trick it out a little with a reference.

<author id="plee" firstName="Pam" lastName="Lee" /> 
<book title="Chronophobia" author="plee" /> 
<book title="Object to Be Destroyed" author="plee" />

In this case we reference the author by 'id'. The author's details are stored once, so a change to the author tag applies to every book that references it. (Following the reference is the job of the program that reads the file.)
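
Here is one way a program might follow the 'id' reference. It is a sketch only, assuming Python's standard-library xml.etree.ElementTree; the enclosing catalog element and the function name books_with_authors are hypothetical and not part of the original example.

```python
import xml.etree.ElementTree as ET

# <catalog> is a hypothetical wrapper so the three nodes share one outermost node.
CATALOG = """<catalog>
  <author id="plee" firstName="Pam" lastName="Lee"/>
  <book title="Chronophobia" author="plee"/>
  <book title="Object to Be Destroyed" author="plee"/>
</catalog>"""

def books_with_authors(root):
    """Pair each book title with the full name of the author it references."""
    authors = {a.get("id"): a for a in root.findall("author")}
    result = []
    for book in root.findall("book"):
        a = authors[book.get("author")]  # look the author up by 'id'
        result.append((book.get("title"), f'{a.get("firstName")} {a.get("lastName")}'))
    return result

root = ET.fromstring(CATALOG)
before = books_with_authors(root)

# Change the author in ONE place; every book that references 'plee' sees it.
root.find("author").set("firstName", "Pamela")
after = books_with_authors(root)

print(before)
print(after)
```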

 

For more help with XML, see Links, Sept. 22