* * * * *

                                Playing a SAX

There's a project that might start up at The Company involving lots of XML
(eXtensible Markup Language) and C programming, so I've been poking around
libxml [1]. I'm thinking I might even want to use this for mod_blog to
validate HTML (HyperText Markup Language) (since libxml has an HTML parser,
and about a quarter of the time I blow the coding on an entry and have to fix
it).

One problem that crops up is the difficulty in getting errors as libxml is
reading the document into memory. Sure, I can suck the HTML in with one call:

> htmlDocPtr doc = htmlParseFile(filename,NULL);
>

(yes, it is that simple). But not seeing how to change the underlying
reporting mechanism (not that I looked all that hard), I decided to switch to
the SAX (Simple API for XML) interface for parsing. The SAX interface allows
you to register functions to be called during portions of the HTML (or even
XML) parsing. Yes, I can grab the errors as they happen, but now I have to
resort to building the document into memory myself (more or less). But that's
okay, since in theory, this will allow me to not only capture the errors, but
filter the HTML as I see fit.

Two things popped right out at me.

First, the callback when a tag is found:

> -----[ C ]-----
> void startElement(void           *user_data,
>                   const xmlChar  *name,
>                   const xmlChar **attrs);
>
> void endElement(void          *user_data,
>                 const xmlChar *name);
> -----[ END OF LINE ]-----
>
> In these callbacks, the name parameter is the name of the element. The
> attrs parameter contains the attributes for the start tag. The even
> indices in the array will be attribute names, the odd indices are the
> values, and the final index will contain a NULL.
>

“Using the SAX Interface of LibXML [2]” (a tutorial)

Okay, seems simple enough. I write some code:

-----[ C ]-----
static void start_tag(void *data,const xmlChar *name,const xmlChar **attr)
{
 int i;

 /*--------------------------------------
 ; similar to printf() but functionally
 ; a bit better.
 ;
 ; And yes, this is how I format comments
 ; in C.
 ;--------------------------------------*/

 LineSFormat(StdoutStream,"$","<%a",name);

 for (i = 0 ; attr[i] != NULL ; i+= 2)
 {
   LineSFormat(StdoutStream,"$ $"," %a=\"%b\"",attr[i],attr[i+1]);
 }
}
-----[ END OF LINE ]-----

And the first time this code runs it crashes.

It seems that the documentation is a bit misleading—attr is only valid if
there are attributes. Otherwise a NULL is passed in, which means you have to
explicitly check attr for NULL!

Aaaaah!

Would it have been that difficult for the authors of libxml to always pass in
a valid attr, even if it's two elements long that both contain NULL? (I
suppose most programmers would check anyway just because, and the bloat
continues)
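
For the record, a guarded version of the callback might look something like
the following. This is only a sketch: the xmlChar typedef is a stand-in for
the one in the libxml headers so the code compiles by itself, and struct
tagbuf and append() are names I made up to take the place of LineSFormat().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the typedef in the libxml headers, so this compiles alone. */
typedef unsigned char xmlChar;

/* Hypothetical user data: a fixed buffer that collects the output. */
struct tagbuf
{
  char   text[256];
  size_t len;
};

/* Append s to the buffer, silently truncating if it won't fit. */
static void append(struct tagbuf *buf,const char *s)
{
  size_t room = sizeof(buf->text) - 1 - buf->len;
  size_t n    = strlen(s);

  if (n > room) n = room;
  memcpy(&buf->text[buf->len],s,n);
  buf->len += n;
  buf->text[buf->len] = '\0';
}

static void start_tag(void *data,const xmlChar *name,const xmlChar **attr)
{
  struct tagbuf *buf = data;
  int            i;

  assert(buf  != NULL);
  assert(name != NULL);

  append(buf,"<");
  append(buf,(const char *)name);

  /* No attributes means attr itself is NULL, not an empty list. */
  if (attr == NULL)
    return;

  for (i = 0 ; attr[i] != NULL ; i += 2)
  {
    append(buf," ");
    append(buf,(const char *)attr[i]);
    append(buf,"=\"");
    /* Defensive: treat a NULL value as empty rather than trust it's there. */
    if (attr[i + 1] != NULL)
      append(buf,(const char *)attr[i + 1]);
    append(buf,"\"");
  }
}
```

The only line that really matters is the early return when attr is NULL;
everything else is the same loop as before, writing into a buffer instead of
a stream.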

The second thing. Catching the errors. Yeah. The callbacks for those?

> void sax_error(void *data,const char *msg, ... );
>

The errors (and warnings, and fatal errors) are passed back as a printf()
style message.

So forget about intelligently handling the errors unless you want to parse
the actual error messages.
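
About the best I can come up with is to format the message into a buffer and
keep a count. A sketch, assuming the first parameter really is the user data
pointer I handed to the parser (struct errlog and its fields are my own
invention, not anything from libxml):

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user data: how many errors so far, and the latest text. */
struct errlog
{
  int  count;
  char last[256];
};

static void sax_error(void *data,const char *msg, ... )
{
  struct errlog *elog = data;
  va_list        args;
  size_t         len;

  assert(elog != NULL);
  assert(msg  != NULL);

  /* Expand the printf()-style message; vsnprintf() truncates if too long. */
  va_start(args,msg);
  vsnprintf(elog->last,sizeof(elog->last),msg,args);
  va_end(args);

  /* Drop a trailing newline, if any, so the message stores cleanly. */
  len = strlen(elog->last);
  if ((len > 0) && (elog->last[len - 1] == '\n'))
    elog->last[len - 1] = '\0';

  elog->count++;
}
```

That gets me the text and a tally, but nothing structured. Telling one kind
of error from another still means picking apart the string.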

Aaaaaaaarg!

[1] http://xmlsoft.org/
[2] http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html#start-end-element
