XML Library

July 26, 2013

In previous essays, we discussed the value of using PHP libraries for code modularity. Since most of the communication in a web application is HTML-based, one of the first libraries to consider writing is an (X)HTML library.

Aside: XHTML vs. HTML

We prefer XHTML to plain HTML (even in "version 5") because of its well-formed structure. XHTML documents are always well-formed XML documents, and this property alone makes them much easier to parse, not just by the browser, but by client scripts. This, in turn, becomes an invaluable advantage when the time inevitably comes for migration, resource-scouring, etc.

Common approach to producing HTML code

Many PHP-coding fora and beginner guides—most of them at the top of search results—recommend structuring the HTML elements of the page into text files that are included (or, if the program is robustly created, required) at various places throughout the script.

The code looks something like this (taken from WordPress):

<?php
/**
 * The main template file.
 *
 * This is the most generic template file in a WordPress theme
 * and one of the two required files for a theme (the other being style.css).
 * It is used to display a page when nothing more specific matches a query.
 * E.g., it puts together the home page when no home.php file exists.
 * Learn more: http://codex.wordpress.org/Template_Hierarchy
 *
 * @package WordPress
 * @subpackage Twenty_Eleven
 */

get_header(); ?>

        <div id="primary">
            <div id="content" role="main">

            <?php if ( have_posts() ) : ?>

                <?php twentyeleven_content_nav( 'nav-above' ); ?>

                <?php /* Start the Loop */ ?>
                <?php while ( have_posts() ) : the_post(); ?>

                    <?php get_template_part( 'content', get_post_format() ); ?>

                <?php endwhile; ?>

                <?php twentyeleven_content_nav( 'nav-below' ); ?>

            <?php else : ?>

                <article id="post-0" class="post no-results not-found">
                    <header class="entry-header">
                        <h1 class="entry-title"><?php _e( 'Nothing Found', 'twentyeleven' ); ?></h1>
                    </header><!-- .entry-header -->

                    <div class="entry-content">
                        <p><?php _e( 'Apologies, but no results were found for the requested archive. Perhaps searching will help find a related post.', 'twentyeleven' ); ?></p>
                        <?php get_search_form(); ?>
                    </div><!-- .entry-content -->
                </article><!-- #post-0 -->

            <?php endif; ?>

            </div><!-- #content -->
        </div><!-- #primary -->

<?php get_sidebar(); ?>
<?php get_footer(); ?>

The above snippet exhibits several flaws:

Mixing HTML and PHP in the same file.
Piecing the page together "linearly" (more on this later).
Other errors beyond the scope of this discussion, such as improper namespacing of functions.

The problem with "linear" page creation

An HTML document is structured as a tree, with elements (or nodes) nested inside other elements to arbitrary depths. The problem with assembling the page linearly is that the individual snippets are unaware of their place in the tree, which invariably results in malformed HTML documents and abominable tag soup.

As an example, consider the following typical web-site composition:

<html><head><title>Welcome to my Blog
</title>
<!--  other elements like script, links, etc  -->
</head>
<body><div id="header"><h1>Blog Entry Title
</h1>
</div>
<div id="content"><!--  body of the blog entry  -->
</div>
<div id="footer"></div>
</body>
</html>

With this setup, we can imagine that the get_header() function in the snippet above produces (that is, outputs to standard output) the part of the document from the DOCTYPE declaration down to the opening of the <div id="content"> tag. Meanwhile, get_footer() contains the closing tag for div#content, as well as the div#footer element, and the closing </body> and </html> tags.

Now suppose that the entire page needs to be wrapped inside another "wrapper" div element. Naturally, the new DIV declaration is added to the get_header() function, but what is not so obvious is that the closing DIV must be added, in the corresponding place, inside get_footer().
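The split can be sketched as follows. The function names and markup here are hypothetical stand-ins, not WordPress's actual implementation; the only point is that the opening and closing tags of the wrapper end up in two different functions, and nothing keeps them in sync.

```php
<?php
// Hypothetical header fragment: opens elements that it cannot close itself.
function sketch_header() {
  return '<html><head><title>Welcome to my Blog</title></head>'
       . '<body><div id="wrapper">'           // the new wrapper opens here...
       . '<div id="header"><h1>Blog Entry Title</h1></div>'
       . '<div id="content">';
}

// Hypothetical footer fragment: must close, in reverse order, everything
// the header fragment left open.
function sketch_footer() {
  return '</div>'                             // closes div#content
       . '<div id="footer"></div>'
       . '</div>'                             // ...and must be closed here
       . '</body></html>';
}

$page = sketch_header() . '<p>Body of the blog entry</p>' . sketch_footer();

// With string fragments, counting tags is about the only sanity check available.
$opened = substr_count($page, '<div');
$closed = substr_count($page, '</div>');
echo $opened === $closed ? "balanced\n" : "unbalanced\n";
```

Removing the second `'</div>'` from sketch_footer() makes the check report "unbalanced", yet PHP itself raises no error, which is how such mistakes reach production.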

Survey the typical structure of most pages produced by WordPress and similar Content Management Systems and what you'll notice is a sea of closing </div>s: orphaned, abandoned, and completely out of context; strewn about haphazardly with only a faint hope that they might be correct.

But wait, there's more

There are other drawbacks, too. For example, there's no way to access and edit the contents of the <head> element after it has been produced by get_header(). In addition, the structure epitomized by the sample above forces an unnatural segregation of processing logic and page preparation: all the checks must be executed before any output is sent.

Fortunately, there's a better way, and it involves using an XML library.

HTML page as a tree-structure

The best way to deal with a document that is naturally structured as a tree is to use a tree data structure in the first place. For this reason, we recommend leveraging PHP's objects to build up a webpage.

Parsing XML is a very difficult endeavor, and one better left for established libraries. But writing XML (and, therefore, XHTML) is extremely easy.

Rather than incurring the overhead cost of bulky libraries (keep-it-simple principle in play), you can easily create your own object-oriented XML library. Start with the appropriate interface:

<?php
/**
 * @package xml
 * @version 0.9
 *
 * A library of XML goodies. A set of tools to CREATE, NOT PARSE Xml
 * documents in a very easy and straightforward way. In particular,
 * entire trees can be created on the fly.
 *
 */

/**
 * Interface for XML objects requires toXML method
 *
 * @author OpenWeb Solutions
 * @date   2010-03-16
 */
interface Xmlable {

  /**
   * Returns a textual representation of the XML
   *
   * @return String xml
   */
  public function toXML();

  /**
   * Echoes the XML representation to standard output
   *
   */
  public function printXML();
}
?>

Aside: why both `toXML` and `printXML`

As will become obvious, most documents are prepared in memory and then offloaded to standard output once completed. The printXML method can reduce peak memory use at that point, because it echoes each node as it is visited instead of first serializing the entire document into a single String.

XElem object

An XML element is nothing more than a tag, a map (PHP calls these associative arrays) of attribute name-value pairs, and a list of children, which must in turn be Xmlable. We purposefully keep the class as simple as possible. After all, this is merely the foundation for more specialized subclasses.

<?php
/**
 * Basic parent class for XML objects
 *
 * @author OpenWeb Solutions
 * @date   2010-03-16
 */
class XElem implements Xmlable {

  protected $name;
  protected $child;
  protected $attrs;

  /**
   * Should never be empty, i.e. <td></td> vs. <td/>. Default false
   *
   * @var boolean
   */
  public $non_empty = false;

  /**
   * Creates the named XML object and optional attributes and children
   *
   * @param String $tag the tagname
   * @param Array<String,String> $attrs a map of attribute names to values
   * @param Array<Xmlable> $child list of children
   * @throws InvalidArgumentException should something go wrong
   */
  public function __construct($tag, Array $attrs = array(), Array $child = array()) {
    $this->name = (string)$tag;
    $this->child = array();
    $this->attrs = array();
    
    foreach ($attrs as $key => $value)
      $this->set($key, $value);
    foreach ($child as $c)
      $this->add($c);
  }

  /**
   * Sets the given attribute
   *
   * @param String $key the key
   * @param String $val the value
   */
  public function set($key, $val) {
    $this->attrs[(string)$key] = (string)$val;
  }

  /**
   * Appends the given child
   *
   * @param Xmlable $child the child
   */
  public function add($child) {
    if (!($child instanceof Xmlable))
      throw new InvalidArgumentException("Child must be instance of Xmlable");
    $this->child[] = $child;
  }

  /**
   * Retrieves an array of the children for this object.  This needs
   * to be overridden by those objects which delay their creation
   * until either <pre>toXML</pre> or <pre>printXML</pre> is called.
   *
   * @return Array<Xmlable> children
   */
  public function children() {
    return $this->child;
  }

  /**
   * Fetches the tag name for this element
   *
   * @return String the name
   */
  public function name() {
    return $this->name;
  }

  /**
   * Implementation of Xmlable function
   *
   * @return String the XML object
   */
  public function toXML() {
    $str = "";
    // ...
    return $str;
  }

  /**
   * Implementation of Xmlable function: echoes the XML representation
   *
   * @return void
   */
  public function printXML() {
    echo "";
    // ...
  }
}
?>
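The bodies of toXML and printXML are elided in the listing above. What follows is only a minimal sketch of how they might look, not the library's actual code: SketchElem is a hypothetical stand-in for XElem, and the use of htmlspecialchars for attribute values is an assumption.

```php
<?php
interface Xmlable {
  public function toXML();
  public function printXML();
}

// Hypothetical stand-in for XElem with the two serializers filled in.
class SketchElem implements Xmlable {
  protected $name;
  protected $attrs = array();
  protected $child = array();
  public $non_empty = false;

  public function __construct($tag, array $attrs = array(), array $child = array()) {
    $this->name = (string)$tag;
    foreach ($attrs as $key => $value)
      $this->attrs[(string)$key] = (string)$value;
    foreach ($child as $c) {
      if (!($c instanceof Xmlable))
        throw new InvalidArgumentException("Child must be instance of Xmlable");
      $this->child[] = $c;
    }
  }

  // Builds '<name key="value"' with attribute values escaped (assumed approach).
  protected function openTag() {
    $str = '<' . $this->name;
    foreach ($this->attrs as $key => $value)
      $str .= ' ' . $key . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
    return $str;
  }

  // Serializes the whole subtree into one string.
  public function toXML() {
    $str = $this->openTag();
    if (!$this->child && !$this->non_empty)
      return $str . '/>';
    $str .= '>';
    foreach ($this->child as $c)
      $str .= $c->toXML();
    return $str . '</' . $this->name . '>';
  }

  // Echoes the subtree node by node, without building the full string.
  public function printXML() {
    echo $this->openTag();
    if (!$this->child && !$this->non_empty) {
      echo '/>';
      return;
    }
    echo '>';
    foreach ($this->child as $c)
      $c->printXML();
    echo '</', $this->name, '>';
  }
}

$div = new SketchElem('div', array('id' => 'header'));
$div->non_empty = true;   // force <div ...></div> rather than <div .../>
$doc = new SketchElem('html', array(), array(
  new SketchElem('head'),
  new SketchElem('body', array('onload' => 'init()'), array($div)),
));
echo $doc->toXML(), "\n";
// <html><head/><body onload="init()"><div id="header"></div></body></html>
```

Both methods walk the tree the same way; the only difference is whether each piece is appended to a string or echoed immediately.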

As the listing illustrates, the XElem class is very shallow, exposing just a handful of methods:

add: appends an element as a child.
set: sets the named attribute, replacing any existing value.
toXML: part of the Xmlable interface (body omitted for brevity).
printXML: likewise part of the Xmlable interface (body omitted for brevity).
children: included merely for completeness, this method provides third-party code an avenue to access the sub-nodes of the XElem object. This method is almost never used.
name: as with children, only included for completeness, but rarely used.

Note the lack of "advanced" methods (well known to DOM hackers), such as:

insertBefore
nextSibling
etc

The goal here is clear: create a lightweight container for XML documents, and not a full-fledged XML library.

Aside: the `$non_empty` property

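The purpose of this property is to satisfy the HTML compatibility guidelines of the XHTML 1.0 specification (Appendix C), which advise against the minimized form for elements whose content model is not empty (like DIV). That is, an empty div should be written <div></div>, never <div/>, because browsers that parse the page as HTML treat <div/> as an unclosed opening tag.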

Creating entire trees at a time

Notice how the constructor allows for attributes and children to be optionally passed in. With this setup, it is possible to create an entire tree in one go:

<?php
/*
 * Generate an entire XML tree
 *
 * @author OpenWeb Solutions
 * @created 2013-07-26
 */

require_once('XML/XmlLib.php');

// We use the short array syntax (PHP 5.4+) for elegance
//
// $doc contains the following structure:
//   html
//    head
//    body (onload="init()")
//     div (id="header")

$doc = new XElem('html', [],
                 [new XElem('head'),
                  new XElem('body', ['onload'=>'init()'],
                            [new XElem('div', ['id'=>'header'])])]);

$doc->printXML();
// Result:
// <html><head/><body onload="init()"><div id="header"/></body></html>
?>

Other needed subclasses

Just a couple more Xmlable classes will round out this library:

XText, for encapsulating textual entries,
XHeader, to create the prologue entries, such as <?xml ... ?>.
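Neither class is listed in this essay, so the following is only a sketch of what they might look like. SketchText and SketchHeader are hypothetical names, and both the constructor parameters and the choice of htmlspecialchars for escaping are assumptions.

```php
<?php
interface Xmlable {
  public function toXML();
  public function printXML();
}

// Hypothetical stand-in for XText: a leaf node holding character data.
class SketchText implements Xmlable {
  protected $text;

  public function __construct($text) {
    $this->text = (string)$text;
  }

  public function toXML() {
    // Escapes &, < and >; quotes are left alone, being harmless in character data.
    return htmlspecialchars($this->text, ENT_NOQUOTES, 'UTF-8');
  }

  public function printXML() {
    echo $this->toXML();
  }
}

// Hypothetical stand-in for XHeader: emits the XML declaration.
class SketchHeader implements Xmlable {
  protected $version;
  protected $encoding;

  public function __construct($version = '1.0', $encoding = 'UTF-8') {
    $this->version = (string)$version;
    $this->encoding = (string)$encoding;
  }

  public function toXML() {
    return '<?xml version="' . $this->version
         . '" encoding="' . $this->encoding . '"?>' . "\n";
  }

  public function printXML() {
    echo $this->toXML();
  }
}

$header = new SketchHeader();
$text = new SketchText('Tom & Jerry <3 "XML"');
echo $header->toXML(), $text->toXML(), "\n";
```

Because both classes implement Xmlable, either can be passed to add on an element just like any other child.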

The convenience of OOP

With this basic class as a starting point, it is possible to create entire sublibraries that provide for a more concise vocabulary. For instance, a collection of subclasses, one for each appropriate element defined in (X)HTML, comprises the HtmlLib (or XhtmlLib). A separate file with a different set of subclasses can be used for creating SVG documents.

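The following snippet defines XElem subclasses for common HTML tags, by way of an intermediate XAbstractHtml base class that is not shown here. The extra X in the class name acts as a rudimentary namespace (for PHP < 5.3, which lacks namespace support). We encourage the use of PHP namespaces where possible.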

<?php
/**
 * An anchor
 */
class XA extends XAbstractHtml {
  /**
   * Creates a new anchor
   *
   * @param String $href where the anchor links to
   * @param String|Xmlable the content of the anchor
   * @param Array $attrs the attributes
   */
  public function __construct($href, $link, Array $attrs = array()) {
    parent::__construct("a", $attrs, array($link));
    $this->set("href", $href);
  }
}

/**
 * Blockquotes
 *
 */
class XBlockQuote extends XAbstractHtml {
  public function __construct($text, Array $attrs = array()) {
    parent::__construct("blockquote", $attrs, array($text));
  }
}

/**
 * Division element
 *
 */
class XDiv extends XAbstractHtml {
  public function __construct(Array $attrs = array(), $elems = array()) {
    parent::__construct("div", $attrs, $elems);
    $this->non_empty = true;
  }
}

/**
 * Input element
 *
 */
class XInput extends XAbstractHtml {
  /**
   * Creates a new input element with the given type, name, and value,
   * and optional attributes
   *
   * @param String $type the type of the input element
   * @param String $name the name
   * @param String $value the content of the value attribute
   * @param Array $attrs the optional attributes
   */
  public function __construct($type, $name, $value, Array $attrs = array()) {
    parent::__construct("input", $attrs);
    $this->set("type", $type);
    $this->set("name", $name);
    $this->set("value", $value);
  }
}

/**
 * A file input
 *
 */
class XFileInput extends XInput {
  public function __construct($name, Array $attrs = array()) {
    parent::__construct("file", $name, "", $attrs);
  }
}
?>

Notice how small these shell classes are. The key is to keep it simple, predictable, and clean.
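For readers on PHP 5.3 or later, the X prefix can give way to a real namespace. The sketch below uses a hypothetical namespace name (SketchXml) and a pared-down base class of its own; it is not how the library above is organized, only an illustration of the idea.

```php
<?php
namespace SketchXml;   // hypothetical namespace name

// Stand-in base element, reduced to what this sketch needs.
class Elem {
  protected $name;
  protected $attrs;

  public function __construct($tag, array $attrs = array()) {
    $this->name = (string)$tag;
    $this->attrs = $attrs;
  }

  public function toXML() {
    $str = '<' . $this->name;
    foreach ($this->attrs as $key => $value)
      $str .= ' ' . $key . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
    return $str . '></' . $this->name . '>';
  }
}

// "Div" instead of "XDiv": the namespace does the disambiguation.
class Div extends Elem {
  public function __construct(array $attrs = array()) {
    parent::__construct('div', $attrs);
  }
}

$div = new Div(array('id' => 'header'));
echo $div->toXML(), "\n";     // <div id="header"></div>
echo get_class($div), "\n";   // SketchXml\Div
```

Code outside the namespace would refer to the class as \SketchXml\Div, or import it with a use statement.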

Advantages of this structure

Armed with this reusable library, your web applications can consistently produce valid, properly escaped output while making the code easier to manipulate! For instance, the entire page template can be encapsulated in an XElem subclass called WebPage, and this class can expose whatever methods are deemed necessary to fulfill its unique purpose:

setDescription (adds a <meta> tag to the head element),
addContent (adds an element at a specific point in the page, like the <div id="header">, for example).

The true structure of the HTML document is naturally preserved by the Xmlable objects, and all output can be suppressed until printXML is ready to be called. There is no need to mess around with output buffering, as the objects in memory naturally provide the functionality. In fact, if done correctly, storing the page as objects can require less memory than working with the equivalent document as a String.
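A page class along these lines might be sketched as follows. Everything here is an assumption made for illustration: SketchWebPage, SketchElem and SketchText are hypothetical stand-ins for WebPage, XElem and XText, and the choice of div#content as the insertion point is arbitrary.

```php
<?php
interface Xmlable {
  public function toXML();
  public function printXML();
}

// Stand-in for XText: escapes character data.
class SketchText implements Xmlable {
  protected $text;
  public function __construct($text) { $this->text = (string)$text; }
  public function toXML() { return htmlspecialchars($this->text, ENT_NOQUOTES, 'UTF-8'); }
  public function printXML() { echo $this->toXML(); }
}

// Stand-in for XElem, reduced to what this sketch needs.
class SketchElem implements Xmlable {
  protected $name;
  protected $attrs;
  protected $child;
  public function __construct($tag, array $attrs = array(), array $child = array()) {
    $this->name = (string)$tag;
    $this->attrs = $attrs;
    $this->child = $child;
  }
  public function add(Xmlable $c) { $this->child[] = $c; }
  public function toXML() {
    $str = '<' . $this->name;
    foreach ($this->attrs as $key => $value)
      $str .= ' ' . $key . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
    if (!$this->child)
      return $str . '/>';
    $str .= '>';
    foreach ($this->child as $c)
      $str .= $c->toXML();
    return $str . '</' . $this->name . '>';
  }
  public function printXML() { echo $this->toXML(); }
}

// The page keeps handles to the nodes it needs to reach later.
class SketchWebPage extends SketchElem {
  protected $head;
  protected $content;
  public function __construct($title) {
    $this->head = new SketchElem('head', array(), array(
      new SketchElem('title', array(), array(new SketchText($title)))));
    $this->content = new SketchElem('div', array('id' => 'content'));
    parent::__construct('html', array(), array(
      $this->head,
      new SketchElem('body', array(), array($this->content))));
  }
  public function setDescription($text) {
    $this->head->add(new SketchElem('meta',
      array('name' => 'description', 'content' => $text)));
  }
  public function addContent(Xmlable $elem) {
    $this->content->add($elem);
  }
}

$P = new SketchWebPage('Welcome');
$P->addContent(new SketchElem('p', array(), array(new SketchText('First post'))));
$P->setDescription('Posts & notes');   // the head is still editable at this point
$P->printXML();
```

Note that the description is set after content has been added: since nothing is sent until printXML is called, the order of processing no longer has to follow the order of the markup.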

But wait, there's more! The toXML and printXML methods handle character escaping. This means that you can work with the system inputs exactly as they are, without resorting to blanket uses of hacks like addslashes.
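The difference can be seen with a hostile input value. This sketch assumes the serializers escape with PHP's built-in htmlspecialchars, which the essay does not confirm; the input string is made up for the demonstration.

```php
<?php
// A value that tries to break out of an attribute and inject a script.
$input = '"><script>alert(1)</script>';

// addslashes only backslash-escapes quotes, backslashes and NUL bytes,
// so the markup survives intact.
$slashed = addslashes($input);

// htmlspecialchars turns the markup-significant characters into entities.
$escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo '<input value="' . $slashed . '"/>', "\n";
echo '<input value="' . $escaped . '"/>', "\n";
```

The first line of output still contains a live script element; the second carries the same text as inert character data. When the escaping lives inside the serializer, calling code never has to remember to do it.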

Example, revisited

As a concrete example, consider the WordPress snippet above re-imagined in this new paradigm:

<?php
/*
 * The main template file.
 *
 * @package WordPress, Improved
 */

require_once('XML/HtmlLib.php');

$P = new WebPage("Welcome");

while (have_posts())
  $P->addContent(the_post());
// Other manipulations

$P->printXML();
?>

Conclusion

With a little bit of work behind the scenes, it is possible to create a reusable XML library that can virtually eliminate malformed output and cross-site scripting while reducing memory consumption. What's more, the resulting code is cleaner and easier to maintain.

You can very quickly write your own library. But if you're interested, feel free to derive inspiration from the XML-DP-1.0 library downloadable from the PHPLIB repository on BitBucket.