XML: Introduction


Last updated at 3:26 pm UTC on 21 October 2006

See also XML: Parsers

XML is a method of specifying structured information. The information can be specified without any relation to context or formatting or, if so desired, with contextual and format-related data included. The goal of XML is to provide a much simpler SGML for the masses, one that does not require a gigantic parser such as SP to account for every special case in SGML. XML is text-based, to avoid the problems of binary interchange.

The Document Object Model (DOM; see http://www.w3.org/DOM/) is, supposedly, a language-neutral environment for working with XML and HTML documents. It defines a set of interfaces for viewing the document logically as a tree. The DOM is not the be-all and end-all of programming support for XML and HTML, but with a DTD one can make sure that a valid XML document transformed using the DOM produces another valid XML document.
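To make the tree view described above concrete, here is a minimal sketch in Python, using the standard library's xml.dom.minidom as a stand-in for a DOM implementation. It is not the Squeak code discussed on this page; the sample document and the outline helper are assumptions made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document, invented for this illustration.
SOURCE = '<note id="n1"><to>Squeak</to><body>Hello <em>world</em></body></note>'

def outline(node, depth=0):
    """Return an indented outline of the element and text nodes under node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            lines.extend(outline(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        # Show non-blank text nodes as quoted strings.
        lines.append("  " * depth + repr(node.data))
    return lines

document = minidom.parseString(SOURCE)
tree = outline(document.documentElement)
print("\n".join(tree))
```

Each element becomes a node whose children are its nested elements and text, which is the logical tree the DOM interfaces expose.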

This pair of related projects, a parser and a DOM, could become the foundation for generalized XML support in Squeak. There are a few problems to be solved before I can go ahead and produce the whole thing, and there is a bit of coding that I don't have enough experience to do correctly, so collaborators are very much welcome.

Here is the comment I currently have in my source:

XML is a W3C recommendation (http://www.w3.org/XML/) for the interchange of marked-up data. It is a simplified proper subset of SGML designed to have minimal requirements for comprehension and processing. Thus, many aspects of SGML have been removed or canonicalized. Unfortunately, this means that XML cannot express HTML as it stands, but it also means that XML can express a reformulation of HTML called XHTML 1.0 (http://www.w3.org/TR/xhtml1/).

Some of the major features/deficiencies (depending on standpoint) of XML are the lack of optional end tags (every end tag must be written out), the explicit denotation of empty elements, simplified DTDs, support for a single kind of unescaped character data section (CDATA), and the specification of the requirements for validating and non-validating processors. The effect is that XML is especially easy to parse and manipulate, and it has very few special forms. It also maintains enough similarity to SGML that it can be processed by SGML processors such as the Jade DSSSL processor.
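The features listed above can be checked with any non-validating parser. The following is a rough sketch using Python's standard xml.parsers.expat bindings; the is_well_formed helper and the sample strings are hypothetical and exist only for this illustration.

```python
import xml.parsers.expat

def is_well_formed(text):
    """Return True if a non-validating parser accepts text as XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True

# Hypothetical samples, one per feature named in the paragraph above.
samples = {
    "omitted end tag": "<list><item>one<item>two</list>",
    "explicit end tags": "<list><item>one</item><item>two</item></list>",
    "HTML-style empty element": "<p>line<br></p>",
    "explicit empty element": "<p>line<br/></p>",
    "CDATA section": "<code><![CDATA[if (a < b && b < c) ...]]></code>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

The two samples written in SGML/HTML style are rejected, while their explicit counterparts and the CDATA section are accepted without any DTD being consulted.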

XML is intended to express all kinds of information: existing XML languages already describe mathematical formulas and vector graphics, as well as documents. XML is general enough that virtually any existing dependence on SGML for markup can be mechanically reformulated into XML without loss of information.

Like SGML, XML is content-based markup. It separates content from other aspects of information processing. This is meant to differ from other systems where content, formatting, and presentation are interrelated. Where, in HTML, there would be a link, in XML it might be a reference, footnote, sidebar, ancillary documentation, or image. It is good to begin to think of HTML as something that looks like SGML but is really not in the same vein. It is instead a convenient way of laying out documents without dependence on platforms.

XML support for Squeak is being added on several levels. All support here is being designed to operate on the unabridged text of input files. Thus, in all uses of the DOM that comes paired with this parsing framework, text that has not been manipulated is expected to be identical to the text that was sent to the processor.

With the support given here, you can process an XML document in three ways: you can tokenize it as a sequence of elementary tokens such as '' and 'id'; you can tokenize it as a sequence of compound tokens, where, for example, start tags and empty tags have their attributes within; or you can do a full parse and walk through the document in the DOM.
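The levels of processing described above can be sketched as follows. This is Python rather than Squeak, and it does not reproduce the framework discussed here: the regular expression, the function names, the sample document, and the exact token granularity are all assumptions made for illustration. The compound level uses Python's standard Expat bindings, and the full parse uses xml.dom.minidom.

```python
import re
import xml.parsers.expat
from xml.dom import minidom

# Hypothetical sample document.
SOURCE = '<doc><item id="a1">text</item><item id="a2"/></doc>'

# Level 1: elementary tokens. An illustrative regex only; it ignores
# comments, CDATA sections, entities, and single-quoted attribute values.
ELEMENTARY = re.compile(r'</|/>|<|>|=|"[^"]*"|[^<>=/"\s]+')

def elementary_tokens(text):
    """Split text into delimiters, names, and quoted values."""
    return ELEMENTARY.findall(text)

# Level 2: compound tokens. Each start tag arrives with its attributes.
# Expat reports an empty element as a start event followed by an end event.
def compound_tokens(text):
    tokens = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: tokens.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: tokens.append(("end", name))
    parser.CharacterDataHandler = lambda data: tokens.append(("text", data))
    parser.Parse(text, True)
    return tokens

# Level 3: full parse, then walk the resulting DOM tree.
def item_ids(text):
    document = minidom.parseString(text)
    return [e.getAttribute("id") for e in document.getElementsByTagName("item")]

elementary = elementary_tokens(SOURCE)
compound = compound_tokens(SOURCE)
ids = item_ids(SOURCE)
print(elementary)
print(compound)
print(ids)
```

Each level does more work on the caller's behalf: the first leaves all structure to the caller, the second groups attributes with their tags, and the third hands back a complete tree.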

Since virtually anything can be expressed as XML, the DOM can be used to manipulate virtually anything. Thus, if a DTD is written to express a Smalltalk system then Smalltalk content can be manipulated using a plug-in to the DOM. The power of content-based markup is tremendous, and it is my belief that its incorporation into Squeak will allow this power to be tapped.

The parser is based loosely on Expat by James Clark. The DOM source will be original.

If there is interest in participation (i.e., at least one person), I'll put the source that I have up on my website.

John Duncan