Wednesday, June 29, 2011

SimpleXML and Namespace Quirks

The knowledge that the Talking Owls have in The Talking Owl Project is represented as RDF Triples. That is, the most basic unit of knowledge is a structure with this format: {Subject, Predicate, Object}. The richest framework for representing knowledge as RDF Triples is the Web Ontology Language, or "OWL". That's actually where the pun in the name comes from: the chatbots learn and talk using knowledge that is represented using the "OWL" framework, and thus are "Talking Owls".

RDF Triples are represented in a standard XML format that looks something like this:

<rdf:rdf 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:id="#owl42">
<rdfs:label>Pigwidgeon</rdfs:label>
<rdf:type rdf:resource="urn:concept:TalkingOwl" />
</rdf:Description>
</rdf:rdf>

This bit of XML contains two triples of information about an entity that I have whimsically given an ID of #owl42. The "knowledge" about #owl42 expressed in this XML is:

  • #owl42 has the label "Pigwidgeon"
  • #owl42 is an entity of the type referred to by a URI urn:concept:TalkingOwl

I'm not going to go into any detail explaining RDF and OWL and XML here, so from this point on I'm going to assume that this is basic stuff that you are familiar with.

The thing is, one of the basic operations that we have to do in the Talking Owl Project is to take bits of knowledge serialized as XML strings like the one above and convert them into an "object" that represents the triples. The first step in doing this is to parse the string into an XML object tree.

Enter: SimpleXML

This is a very lightweight PHP extension that allows you to manipulate XML object trees. Just a short bit of code will take an XML string and convert it into an object representation of the tree.

According to the documentation, the function simplexml_load_string($xml) will return a SimpleXMLElement object if it is able to parse the XML, and false if there is a problem with the XML string.

So I can do this, right?

$xmlobj = simplexml_load_string($xml)
    or die('Bad XML!');

WRONG.

Use this code with the string of XML provided above and you will get the "Bad XML!" message, even though the XML isn't bad.

I know the XML isn't bad, because I do not get the "Bad XML!" message if I do this:

$xmlobj = simplexml_load_string($xml);
if ($xmlobj===false)
    die('Bad XML!');

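The function is not actually returning false. It is returning an object that evaluates to false in a boolean context, which is all that the "or" operator looks at. The === comparison, on the other hand, only matches a genuine false, so the object passes that check.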
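To make the identity check harder to forget, the load can be wrapped in a small helper. This is only a sketch: the name loadXmlStrict() is my own invention and not part of SimpleXML, and it assumes PHP 7 or later. The libxml_* calls are used to collect the parser's messages so that a real failure says what went wrong.

```php
<?php
// Hypothetical helper (not part of SimpleXML): parse a string and fail only
// when the parser really returns false, never on a "falsy" object.
function loadXmlStrict(string $xml): SimpleXMLElement
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xmlobj = simplexml_load_string($xml);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Strict comparison: an object that merely casts to false is accepted.
    if ($xmlobj === false) {
        $messages = [];
        foreach ($errors as $error) {
            $messages[] = trim($error->message);
        }
        throw new RuntimeException('Bad XML: ' . implode('; ', $messages));
    }
    return $xmlobj;
}

$xml = <<<'XML'
<rdf:rdf
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:id="#owl42">
<rdfs:label>Pigwidgeon</rdfs:label>
<rdf:type rdf:resource="urn:concept:TalkingOwl" />
</rdf:Description>
</rdf:rdf>
XML;

$xmlobj = loadXmlStrict($xml);
echo get_class($xmlobj), "\n";
```

Throwing an exception instead of calling die() is a design choice of the sketch, so that the caller can decide what to do with a genuinely broken string.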

So, next, I try this:

$xmlobj = simplexml_load_string($xml);
if ($xmlobj===false)
    die('Bad XML!');
print_r($xmlobj);

The result is a completely empty object:

SimpleXMLElement Object
(
)

But now I will show you what was really making me tear my hair out. The object is not actually empty. I can use methods on the SimpleXML object to get the namespaces and even to re-construct an XML string from the object, and both show that the contents of the XML are correctly represented.

This is the code:

$xmlobj = simplexml_load_string($xml);
if ($xmlobj===false)
    die('Bad XML!');
print_r($xmlobj->getNamespaces());
print_r($xmlobj->asXML());

This is the result:

Array
( 
[rdf] => http://www.w3.org/1999/02/22-rdf-syntax-ns# 
[rdfs] => http://www.w3.org/2000/01/rdf-schema#
)

<?xml version="1.0"?>
<rdf:rdf 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:id="#owl42">
<rdfs:label>Pigwidgeon</rdfs:label>
<rdf:type rdf:resource="urn:concept:TalkingOwl" />
</rdf:Description>
</rdf:rdf>

It's obvious that the data is there. But the object still evaluates to false, and print_r shows it as an empty object.

So how do I get at the contents of the object?

The PHP documentation discusses a few methods to get inside an object, including the children() method. So let's try viewing the children.

print_r($xmlobj->children());

No dice: the result is still an empty object:

SimpleXMLElement Object
(
)

What if we don't trust print_r() so we iterate over the children instead?

foreach ($xmlobj->children() as $node)
{
    print($node->getName().'<br />');
}

Also no dice! This returns nothing at all.

So this still looks like an empty object.

Now, if you look at the documentation that specifically talks about namespaces, you will find that you can use the children() method with a parameter, to get the children that belong to a specific namespace. So, since my little example document above only has one child (rdf:Description), I can try it with the namespace parameter:

print_r($xmlobj->children('rdf',true));

Now this gives us the result we want!

SimpleXMLElement Object
(
 [Description] => SimpleXMLElement Object
  (
   [@attributes] => Array
   (
    [id] => owl42
   )
   [type] => SimpleXMLElement Object
   (
    [@attributes] => Array
    (
     [resource] => urn:concept:TalkingOwl
    )
   )
  )
)

This is much better: it returns a node representing the rdf:Description node and its children!

No, wait.

....and some of its children.

Notice that the rdfs:label child is not present in the structure, because it is from a different namespace. We told the children() method that we wanted only the direct children in one specific namespace, and apparently the same filter carries over to every subtree as well.
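The label is not gone, though; it has to be requested under its own namespace. Here is a short sketch of reaching each piece of the sample document one namespace at a time. The variable names are mine, and the attributes() calls follow the same prefix convention as children().

```php
<?php
$xml = <<<'XML'
<rdf:rdf
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:id="#owl42">
<rdfs:label>Pigwidgeon</rdfs:label>
<rdf:type rdf:resource="urn:concept:TalkingOwl" />
</rdf:Description>
</rdf:rdf>
XML;

$xmlobj = simplexml_load_string($xml);

// The Description element lives in the rdf namespace.
$desc = $xmlobj->children('rdf', true)->Description;

// rdfs:label has to be asked for under the rdfs prefix...
$label = (string) $desc->children('rdfs', true)->label;

// ...while rdf:type and the namespaced attributes come from rdf.
$type = (string) $desc->children('rdf', true)->type
    ->attributes('rdf', true)->resource;
$id = (string) $desc->attributes('rdf', true)->id;

echo "$id | $label | $type\n";
```

Every step names its namespace explicitly, which is exactly the bookkeeping that makes mixed-namespace documents tedious.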

This is interesting, not necessarily intuitive, and a huge pain when using SimpleXML to parse RDF and OWL strings in particular. With OWL/RDF, there is commonly a mixture of namespaces in the children of any given element.

Despite the fact that some sources on the web (which out of politeness will remain unnamed) claim that the children() method with no parameter will return all children regardless of namespace, this is false: children() with no parameter returns only elements in the default namespace.
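A tiny experiment makes the point. The document below is made up for the purpose (the urn:example URIs and element names are not from the project): one child is in the default namespace and one carries a prefix.

```php
<?php
// Made-up document: <x> is in the default namespace, <b:y> is prefixed.
$xml = '<root xmlns="urn:example:a" xmlns:b="urn:example:b"><x/><b:y/></root>';
$root = simplexml_load_string($xml);

// children() with no parameter: unprefixed elements only.
$unprefixed = [];
foreach ($root->children() as $child) {
    $unprefixed[] = $child->getName();
}

// children('b', true): elements carrying the "b" prefix only.
$prefixed = [];
foreach ($root->children('b', true) as $child) {
    $prefixed[] = $child->getName();
}

echo implode(',', $unprefixed), "\n"; // x
echo implode(',', $prefixed), "\n";   // y
```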

Since it is a standard convention in RDF/OWL XML to have no default namespace (i.e. to have every element identified with a namespace prefix), that means that if you create a SimpleXML object from an OWL/RDF XML string, you will most likely get what looks like an empty object, unless you deliberately iterate over every namespace declared in the object. And then, you have to deliberately iterate over every namespace when retrieving the children of each child (because the children of a particular object are often from mixed namespaces). And then you have to iterate over all namespaces for each of their children... and so on.
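The looping described above can be sketched as a recursive function. This is one possible shape, not the project's actual code: listAllElements() is a name I made up, and it relies on getDocNamespaces(true) to collect every prefix declared anywhere in the document. Because it asks for one namespace at a time, siblings come back grouped by namespace rather than in document order.

```php
<?php
// Hypothetical helper (the name is my own): list every element below $node
// as "prefix:name", asking for each declared namespace at every level.
function listAllElements(SimpleXMLElement $node, array $namespaces, int $depth = 0): array
{
    // Unprefixed children first, then one group per declared prefix.
    $groups = ['' => $node->children()];
    foreach ($namespaces as $prefix => $uri) {
        if ($prefix !== '') { // the default namespace is already covered above
            $groups[$prefix] = $node->children($uri);
        }
    }

    $lines = [];
    foreach ($groups as $prefix => $children) {
        foreach ($children as $child) {
            $name = ($prefix === '' ? '' : $prefix . ':') . $child->getName();
            $lines[] = str_repeat('  ', $depth) . $name;
            // Recurse: the grandchildren may use any namespace again.
            foreach (listAllElements($child, $namespaces, $depth + 1) as $line) {
                $lines[] = $line;
            }
        }
    }
    return $lines;
}

$xml = <<<'XML'
<rdf:rdf
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description rdf:id="#owl42">
<rdfs:label>Pigwidgeon</rdfs:label>
<rdf:type rdf:resource="urn:concept:TalkingOwl" />
</rdf:Description>
</rdf:rdf>
XML;

$xmlobj = simplexml_load_string($xml);
$lines = listAllElements($xmlobj, $xmlobj->getDocNamespaces(true));
echo implode("\n", $lines), "\n";
```

For the sample document this lists rdf:Description, then rdf:type and rdfs:label indented beneath it, so both namespaces show up at the second level.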

This isn't particularly difficult to code, per se. Merely cumbersome. And it took me a while to figure out, so I thought I would share the discovery. Apparently, there are no shortcuts for those of you who want to use SimpleXML to parse RDF/OWL documents with multiple namespaces. You will just have to do a lot of namespace looping.
