* * * * *
Adventures in Formatting
If you are reading this via Gopher and it looks a bit different, that's
because I spent the past few hours (months?) working on a new method to
render HTML (HyperText Markup Language) into plain text. When I first set
this up [1] I used Lynx [2] because it was easy and I didn't feel like
writing the code to do so at the time. But I've never been fully satisfied
with the results [Yeah, I was never a fan of that either –Editor]. So I finally
took the time to tackle the issue (and it is one of the reasons I was timing
LPEG (Lua Parsing Expression Grammar) expressions [3] [DELETED-the other day-
DELETED] [Nope. –Editor][DELETED- … um … the other week-DELETED] [Still nope.
–Editor][DELETED- … um … a few years ago?-DELETED] [Last month. –Editor]
[Last month? –Sean] [Last month. –Editor] [XXXX this timeless time of COVID-
19 –Sean] last month).
The first attempt sank in the swamp. I wrote some code to parse the next bit
of HTML (it would return either a string, or a Lua table containing the tag
information). And that was fine for recent posts where I bother to close all
the tags (taking into account only the tags that can appear in the body of
the document, <P>, <DT>, <DD>, <LI>, <THEAD>, <TFOOT>, <TBODY>, <TR>, <TH>,
and <TD> do not require a closing tag), but earlier posts, say, 1999
through 2002, don't follow that convention. So I was faced with two choices:
fix the code to recognize when an optional closing tag was missing, or fix
over a thousand posts.
It says something about the code that I started fixing the posts first …
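To make the "next bit of HTML" idea concrete, here is a minimal sketch in
plain Lua string patterns. It is not the code from that first attempt; the
names next_token and tokenize are hypothetical, and it assumes quoted
attribute values and no comments. Each call hands back either a string
(text) or a table (tag information), plus where to resume.

```lua
-- Hypothetical helper: return the next piece of HTML starting at pos, as
-- either a string (text) or a table (tag), plus the position to resume at.
-- Assumes attribute values are double-quoted and contain no ">".
local function next_token(html, pos)
  if pos > #html then return nil end

  if html:sub(pos, pos) == "<" then
    local close, name, rest, stop = html:match("^<(/?)(%a%w*)([^>]*)>()", pos)
    if name then
      local attributes = {}
      for key, value in rest:gmatch('(%a[%w%-]*)%s*=%s*"([^"]*)"') do
        attributes[key:lower()] = value
      end
      return {
        tag        = name:lower(),
        closing    = close == "/",
        attributes = attributes,
      }, stop
    end
    -- A "<" that does not start a tag is handed back as plain text
    return "<", pos + 1
  end

  -- Everything up to the next "<" is text
  local text, stop = html:match("^([^<]+)()", pos)
  return text, stop
end

-- Hypothetical driver: collect every token into a list
local function tokenize(html)
  local tokens, pos = {}, 1
  while true do
    local token, stop = next_token(html, pos)
    if not token then break end
    tokens[#tokens + 1] = token
    pos = stop
  end
  return tokens
end
```

Note that nothing here knows whether a closing tag is missing; the tokens
come out in document order and whatever consumes them has to work out where
each element ends.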
I then decided to change my approach and rewrite the HTML parser from scratch.
Starting from the DTD (Document Type Definition) for HTML 4.01 strict [4] I
used the re module [5] to write the parser, but I'm guessing I hit some form
of internal limit, because that one burned down, fell over, and then sank
into the swamp.
I decided to go back to straight LPEG, again following the DTD to write the
parser, and this time, it stayed up.
It ended up being a bit under 500 lines of LPEG code [6], but it does a
wonderful job of being correct (for the most part—there are three posts I've
made that aren't HTML 4.01 strict, so I made some allowances for those). It
not only handles optional ending tags, but also the one optional opening tag I
have to deal with: <TBODY> (yup, both the opening and closing tags are
optional). It also knows that <PRE> tags cannot contain <IMG> tags, and it
preserves whitespace inside <PRE> (whitespace is not preserved in other
tags). And it checks for the proper attributes for each tag.
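For a feel of what handling optional ending tags involves, here is a small
sketch in plain Lua. It is not the LPEG grammar linked at [6]; it swaps in
a stack plus a hand-written lookup table, and the closed_by table is a
simplified assumption rather than a transcription of the DTD. The name
build_tree and the token shape (a string, or a table with tag and closing
fields) are hypothetical.

```lua
-- Which newly opened tags implicitly end a still-open element.
-- A simplified, hand-written subset; the HTML 4.01 DTD is the authority.
local closed_by = {
  p  = { p = true, ul = true, ol = true, dl = true, blockquote = true,
         pre = true, table = true, li = true, dt = true, dd = true },
  li = { li = true },
  dt = { dt = true, dd = true },
  dd = { dt = true, dd = true },
  tr = { tr = true },
  td = { td = true, th = true, tr = true },
  th = { td = true, th = true, tr = true },
}

-- Hypothetical builder: tokens are strings (text) or tables of the form
-- { tag = "p", closing = false, attributes = {} }.
local function build_tree(tokens)
  local root  = { tag = "#root" }
  local stack = { root }

  for _, token in ipairs(tokens) do
    local top = stack[#stack]

    if type(token) == "string" then
      top[#top + 1] = token
    elseif token.closing then
      -- Find the matching open element; everything above it on the
      -- stack is closed implicitly. An unmatched closing tag is ignored.
      for i = #stack, 2, -1 do
        if stack[i].tag == token.tag then
          for j = #stack, i, -1 do stack[j] = nil end
          break
        end
      end
    else
      -- Close any open elements that this start tag implicitly ends
      while #stack > 1 and closed_by[top.tag] and closed_by[top.tag][token.tag] do
        stack[#stack] = nil
        top = stack[#stack]
      end
      local node = { tag = token.tag, attributes = token.attributes or {} }
      top[#top + 1] = node
      stack[#stack + 1] = node
    end
  end

  return root
end
```

Fed the tokens for "<p>one<p>two", this yields two sibling <P> nodes rather
than one nested inside the other. It does not attempt the optional opening
tag for <TBODY>, whitespace rules, or attribute checking.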
Great! I can now parse something like this:
-----[ HTML ]-----
<p>This is my <a href="http://boston.conman.org/">blog</a>.
Is this not <em>nifty?</em>
<p>Yeah, I thought so.
-----[ END OF LINE ]-----
into this:
-----[ Lua ]-----
tag =
{
  [1] =
  {
    tag = "p",
    attributes =
    {
    },
    block = true,
    [1] = "This is my ",
    [2] =
    {
      tag = "a",
      attributes =
      {
        href = "http://boston.conman.org/",
      },
      inline = true,
      [1] = "blog",
    },
    [3] = ". Is this not ",
    [4] =
    {
      tag = "em",
      attributes =
      {
      },
      inline = true,
      [1] = "nifty?",
    },
  },
  [2] =
  {
    tag = "p",
    attributes =
    {
    },
    block = true,
    [1] = "Yeah, I thought so.",
  },
}
-----[ END OF LINE ]-----
I then began the process of writing the code to render the resulting data
into plain text. I took the classifications that the HTML 4.01 strict DTD
uses for each tag (you can see the <P> tag above is of type block and the
<EM> and <A> tags are type inline) and used those to write functions to
handle the appropriate type of content: <P> can only have inline tags,
<BLOCKQUOTE> only allows block type tags, and <LI> can have both. The
rendering for inline and block types is a bit different, and handling both
types is a bit more complex yet.
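The split between inline, block, and mixed content might be sketched like
this, using the node shape from the table dump above (block and inline
flags, children in the array part). The function names are hypothetical,
and the sketch only gathers text: it does no line wrapping, link
footnotes, or emphasis markup, so it should not be read as the actual
renderer.

```lua
local render_block -- forward declaration; render_flow needs it

-- Inline content collapses to a single string
local function render_inline(node)
  if type(node) == "string" then return node end
  local parts = {}
  for _, child in ipairs(node) do
    parts[#parts + 1] = render_inline(child)
  end
  return table.concat(parts)
end

-- Mixed content (an <LI>, say): runs of text and inline nodes are
-- gathered into one paragraph, block children are rendered on their own
local function render_flow(node, out)
  local run = {}
  local function flush()
    if #run > 0 then
      out[#out + 1] = table.concat(run)
      run = {}
    end
  end
  for _, child in ipairs(node) do
    if type(child) == "table" and child.block then
      flush()
      render_block(child, out)
    else
      run[#run + 1] = render_inline(child)
    end
  end
  flush()
end

-- Block content appends paragraphs (one string each) to the list out
function render_block(node, out)
  if node.tag == "p" then
    out[#out + 1] = render_inline(node) -- <P> holds inline content only
  else
    render_flow(node, out)              -- containers and mixed content
  end
  return out
end
```

Calling render_block on the first <P> node from the dump above would give
a one-element list holding "This is my blog. Is this not nifty?".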
The hard part here is ensuring that the leading characters of <BLOCKQUOTE>
(wherein each line of the rendered text starts with a “| ”) and of the various
types of lists (definition, unordered and ordered lists) are handled
correctly. I think there are still a few spots where it isn't quite correct.
Overall, I'm happy with the text rendering I did, but I was left with one
big surprise [7] …
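One way to think about the leading characters is as a prefix applied to
already-rendered lines, with a different prefix for the first line than for
the rest. The sketch below assumes that approach; the helper names and the
list markers ("* ", "1. ") are illustrative guesses, and only the “| ” for
<BLOCKQUOTE> comes from the description above.

```lua
-- Prepend one prefix to the first line and another to the remaining lines
local function indent(lines, first, rest)
  local out = {}
  for i, line in ipairs(lines) do
    out[i] = (i == 1 and first or rest) .. line
  end
  return out
end

-- Every line of a block quote starts with "| "
local function blockquote(lines)
  return indent(lines, "| ", "| ")
end

-- A list item shows its marker once; later lines are padded to line up
local function list_item(lines, marker)
  return indent(lines, marker, string.rep(" ", #marker))
end
```

Nesting then falls out of composition: blockquote(list_item(lines, "1. "))
prefixes the quote bar onto lines that already carry the list marker and
its padding. Getting line wrapping to account for the combined prefix
width is where the tricky cases tend to hide.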
[1]
gopher://gopher.conman.org/0Phlog:2018/01/09.1
[2]
http://lynx.browser.org/
[3]
gopher://gopher.conman.org/0Phlog:2020/06/05.1
[4]
https://www.w3.org/TR/html4/strict.dtd
[5]
http://www.inf.puc-rio.br/~roberto/lpeg/re.html
[6]
gopher://gopher.conman.org/0Phlog:2020/07/04/html.lua
[7]
gopher://gopher.conman.org/0Phlog:2020/07/04.2