* * * * *

                           Adventures in Formatting

If you are reading this via Gopher and it looks a bit different, that's
because I spent the past few hours (months?) working on a new method to
render HTML (HyperText Markup Language) into plain text. When I first set
this up [1] I used Lynx [2] because it was easy and I didn't feel like
writing the code to do so at the time. But I've never been fully satisfied
with the results [Yeah, I was never a fan of that either –Editor]. So I
finally took the time to tackle the issue (one of the reasons I was timing
LPEG (Lua Parsing Expression Grammar) expressions [3] [DELETED-the other day-
DELETED] [Nope. –Editor][DELETED- … um … the other week-DELETED] [Still nope.
–Editor][DELETED- … um … a few years ago?-DELETED] [Last month. –Editor]
[Last month? –Sean] [Last month. –Editor] [XXXX this timeless time of COVID-
19 –Sean] last month).

The first attempt sank in the swamp. I wrote some code to parse the next bit
of HTML (it would return either a string, or a Lua table containing the tag
information). And that was fine for recent posts where I bother to close all
the tags (taking into account only the tags that can appear in the body of
the document, <P>, <DT>, <DD>, <LI>, <THEAD>, <TFOOT>, <TBODY>, <TR>, <TH>,
and <TD> do not require a closing tag), but earlier posts, say, 1999 through
2002, don't follow that convention. So I was faced with two choices: fix the
code to recognize when an optional closing tag was missing, or fix over a
thousand posts.

It says something about the code that I started fixing the posts first …

I then decided to change my approach and rewrite the HTML parser from scratch.
Starting from the DTD (Document Type Definition) for HTML 4.01 strict [4] I
used the re module [5] to write the parser, but I'm guessing I hit some form
of internal limit, because that one burned down, fell over, and then sank
into the swamp.

I decided to go back to straight LPEG, again following the DTD to write the
parser, and this time, it stayed up.

It ended up being a bit under 500 lines of LPEG code [6], but it does a
wonderful job of being correct (for the most part—there are three posts I've
made that aren't HTML 4.01 strict, so I made some allowances for those). It
not only handles optional ending tags, but also the one optional opening tag
I have to deal with, <TBODY> (yup, both the opening and closing tag are
optional). It also enforces that <PRE> tags cannot contain <IMG> tags,
preserves whitespace inside <PRE> (it isn't preserved in other tags), and
checks for the proper attributes for each tag.
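
The optional end tag handling can be sketched in plain Lua. This is not the
LPEG parser from this post [6], just a minimal illustration under some loud
assumptions: the names implied_close, tokens and parse are made up, only <P>
and <LI> are covered, attributes are thrown away, and tags with no end tag
at all (such as <IMG> or <BR>) are not handled.

```lua
-- Hypothetical sketch: imply a missing </p> or </li> when a tag that
-- cannot appear inside it is opened. Plain Lua string patterns, not LPEG.

-- For each tag with an optional end tag, the opening tags that close it.
local implied_close =
{
  p  = { p = true , ul = true , ol = true , blockquote = true },
  li = { li = true },
}

-- Split HTML into tokens: strings for text, tables for tags.
-- Attributes are skipped to keep the sketch short.
local function tokens(html)
  local list = {}
  local pos  = 1
  while pos <= #html do
    local s,e,close,name = html:find("<(/?)(%a+)[^>]*>",pos)
    if not s then
      list[#list + 1] = html:sub(pos)
      break
    end
    if s > pos then
      list[#list + 1] = html:sub(pos,s - 1)
    end
    list[#list + 1] = { tag = name:lower() , close = close == "/" }
    pos = e + 1
  end
  return list
end

-- Build a tree of { tag = name , child , child , ... } tables.
local function parse(html)
  local root  = { tag = "#root" }
  local stack = { root }
  for _,tok in ipairs(tokens(html)) do
    local top = stack[#stack]
    if type(tok) == "string" then
      -- whitespace-only text between tags is dropped
      if tok:find("%S") then top[#top + 1] = tok end
    elseif tok.close then
      -- pop back to the matching open tag, closing anything left open
      for i = #stack , 2 , -1 do
        if stack[i].tag == tok.tag then
          for j = #stack , i , -1 do stack[j] = nil end
          break
        end
      end
    else
      -- an opening tag may imply the end of the tag currently open
      local closes = implied_close[top.tag]
      if closes and closes[tok.tag] then
        stack[#stack] = nil
        top = stack[#stack]
      end
      local node = { tag = tok.tag }
      top[#top + 1]     = node
      stack[#stack + 1] = node
    end
  end
  return root
end

-- Neither <p> is closed, yet doc[1] and doc[2] are sibling <p> nodes.
local doc = parse("<p>one<p>two")
```

A table like implied_close keeps the knowledge of which tags end which in
data rather than in control flow, so covering more of the DTD in a sketch
like this would mostly mean adding rows.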

Great! I can now parse something like this:

-----[ HTML ]-----
<p>This is my <a href="http://boston.conman.org/">blog</a>.
Is this not <em>nifty?</em>

<p>Yeah, I thought so.
-----[ END OF LINE ]-----

into this:

-----[ Lua ]-----
tag =
{
 [1] =
 {
   tag = "p",
   attributes =
   {
   },
   block = true,
   [1] = "This is my ",
   [2] =
   {
     tag = "a",
     attributes =
     {
       href = "http://boston.conman.org/",
     },
     inline = true,
     [1] = "blog",
   },
   [3] = ". Is this not ",
   [4] =
   {
     tag = "em",
     attributes =
     {
     },
     inline = true,
     [1] = "nifty?",
   },
 },

 [2] =
 {
   tag = "p",
   attributes =
   {
   },
   block = true,
   [1] = "Yeah, I thought so.",
 },
}
-----[ END OF LINE ]-----

I then began the process of writing the code to render the resulting data
into plain text. I took the classifications that the HTML 4.01 strict DTD
uses for each tag (you can see the <P> tag above is of type block and the
<EM> and <A> tags are type inline) and used those to write functions to
handle the appropriate type of content: <P> can only have inline tags,
<BLOCKQUOTE> only allows block type tags, and <LI> can have both. The
rendering for inline and block types is a bit different, and handling both
types is a bit more complex yet.
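
A rough sketch of that dispatch, assuming nodes shaped like the table shown
earlier (a tag field, a block or inline flag, and children in the array
part). The function names render_inline, render_block and render are
hypothetical, not taken from the actual code, and the sketch does no word
wrapping or link handling.

```lua
-- Hypothetical sketch: walk a parsed tree and dispatch on the
-- block/inline classification of each node.

local render_block -- forward declaration, defined below

-- Inline content collapses to a single run of text.
local function render_inline(node)
  local out = {}
  for _,child in ipairs(node) do
    if type(child) == "string" then
      out[#out + 1] = child
    else
      out[#out + 1] = render_inline(child)
    end
  end
  return table.concat(out)
end

-- Block content appends paragraphs to `paras`. Loose text and inline
-- children are gathered into a pending run, which is flushed whenever a
-- block child shows up; that covers the <LI> case where both may appear.
render_block = function(node,paras)
  local run = {}

  local function flush()
    if #run > 0 then
      paras[#paras + 1] = table.concat(run)
      run = {}
    end
  end

  for _,child in ipairs(node) do
    if type(child) == "string" then
      run[#run + 1] = child
    elseif child.inline then
      run[#run + 1] = render_inline(child)
    else
      flush()
      render_block(child,paras)
    end
  end

  flush()
  return paras
end

-- Paragraphs are separated by a blank line.
local function render(tree)
  return table.concat(render_block(tree,{}),"\n\n")
end

-- prints "Is this not nifty?"
print(render {
  { tag = "p" , block = true ,
    "Is this not " ,
    { tag = "em" , inline = true , "nifty?" } ,
  }
})
```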

The hard part here is ensuring that the leading characters of <BLOCKQUOTE>
(wherein each line of the rendered text starts with a “| ”) and of the
various types of lists (definition, unordered and ordered lists) are handled
correctly. I think there are still a few spots where it isn't quite correct.
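
To make the problem concrete, here is a minimal sketch of wrapping text
behind a leading prefix. The function wrap and its parameters are
hypothetical and not how the actual renderer works. It assumes ASCII text
(lengths are counted in bytes) and that the first-line prefix is as wide as
the prefix used on continuation lines.

```lua
-- Hypothetical sketch: wrap `text` to `width` columns. Every line starts
-- with `prefix` (for example "| " inside a <BLOCKQUOTE>), except that the
-- first line starts with `first` when given (for example "* " for a list
-- item whose continuation lines are indented with "  ").
local function wrap(text,width,prefix,first)
  local lines = {}
  local line  = first or prefix
  local start = #line -- length of the line before any word is added

  for word in text:gmatch("%S+") do
    -- start a new line if this word would run past the width,
    -- unless the current line has no words yet
    if #line > start and #line + 1 + #word > width then
      lines[#lines + 1] = line
      line  = prefix
      start = #prefix
    end
    if #line > start then
      line = line .. " "
    end
    line = line .. word
  end

  if #line > start then
    lines[#lines + 1] = line
  end
  return table.concat(lines,"\n")
end

-- Nesting is a matter of concatenating prefixes: a blockquote inside a
-- blockquote would pass "| | " as the prefix.
local quoted = wrap("one two three",9,"| ")
local item   = wrap("alpha beta gamma",12,"  ","* ")
```

Here quoted comes out as the two lines "| one two" and "| three", and item
as "* alpha beta" followed by "  gamma". The fiddly cases are the
combinations, such as a list inside a blockquote, where the first-line
marker belongs to the innermost element only.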

Overall, I'm happy with the text rendering I did, but I was left with one
big surprise [7] …

[1] gopher://gopher.conman.org/0Phlog:2018/01/09.1
[2] http://lynx.browser.org/
[3] gopher://gopher.conman.org/0Phlog:2020/06/05.1
[4] https://www.w3.org/TR/html4/strict.dtd
[5] http://www.inf.puc-rio.br/~roberto/lpeg/re.html
[6] gopher://gopher.conman.org/0Phlog:2020/07/04/html.lua
[7] gopher://gopher.conman.org/0Phlog:2020/07/04.2
