* * * * *
Parsers vs. regular expressions? No contest
I'm finding that where once I turned to Lua [1] and its regular expressions
for parsing, I am now turning to LPeg [2], or rather, the re [3] module, as I
find it easier to understand the code once written.
For instance, the regression test program I wrote for work outputs the
results of each test:
> 1.a.8 0-0 16-17 scp: ASREQ (1) ASRESP (1) LNPHIT (1) SS7COMP (1) SS7XACT (1) tps: ack-cmpl (1) cache-searches (1) cache-updates (1) termreq (1)
>
Briefly, the first three fields are the test case ID and indications of
whether certain data files changed. The scp field indicates which variables
of the SCP (Service Control Point; you can think of this as a service on a
phone switch) were modified (these just happen to be in uppercase), and the
tps field indicates which variables of the TPS (Transaction Processor
Service; our lead developer does have a sense of humor [4]) were modified.
But if a variable is added (or removed, which happens), the order can change,
and that makes checking the results against the expected results a bit of a
challenge.
The result is some code to parse the output and check it against the
expected results. And for that, I find using the re module for parsing:
> local re = require "re"
>
> G = [[
>       line   <- entry -> {}
>       entry  <- {:id: id :}          %s
>                 {:seriala: serial :} %s
>                 {:serialb: serial :} %s
>                 'scp:' {:scp: items* -> {} :} %s
>                 'tps:' {:tps: items* -> {} :}
>       id     <- %d+ '.' [a-z] '.' %d+
>       serial <- %d+ '-' %d+
>       items  <- %s* { ([0-9A-Za-z] / '-')+ %s '(' %d+ ')' }
> ]]
>
> parser = re.compile(G)
>
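A sketch of the parser in use (the grammar is repeated so the snippet stands
alone, and the sample line is shortened from the one above):

```lua
-- Sketch: using the compiled re grammar.  Requires LPeg (which provides
-- the re module).  parser:match() returns the table built by the
-- captures, or nil if the line fails to parse.
local re = require "re"

local G = [[
      line   <- entry -> {}
      entry  <- {:id: id :}          %s
                {:seriala: serial :} %s
                {:serialb: serial :} %s
                'scp:' {:scp: items* -> {} :} %s
                'tps:' {:tps: items* -> {} :}
      id     <- %d+ '.' [a-z] '.' %d+
      serial <- %d+ '-' %d+
      items  <- %s* { ([0-9A-Za-z] / '-')+ %s '(' %d+ ')' }
]]

local parser = re.compile(G)

local line = "1.a.8 0-0 16-17 scp: ASREQ (1) ASRESP (1) tps: ack-cmpl (1)"
local res  = parser:match(line)

print(res.id)      -- 1.a.8
print(res.scp[1])  -- ASREQ (1)
print(res.tps[1])  -- ack-cmpl (1)
```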
to be more understandable than using Lua-based regular expressions:
> function parse(line)
>   local res = {}
>
>   local id,seriala,serialb,tscp,ttps = line:match("^(%S+)%s+(%S+)%s+(%S+)%s+scp%:%s+(.*)tps%:%s+(.*)")
>
>   res.id      = id
>   res.seriala = seriala
>   res.serialb = serialb
>
>   res.scp = {}
>   res.tps = {}
>
>   for item in tscp:gmatch("%s*(%S+%s%(%d+%))%s*") do
>     res.scp[#res.scp + 1] = item
>   end
>
>   for item in ttps:gmatch("%s*(%S+%s%(%d+%))%s*") do
>     res.tps[#res.tps + 1] = item
>   end
>
>   return res
> end
>
with both returning the same results:
> {
>   scp =
>   {
>     [1] = "ASREQ (1)",
>     [2] = "ASRESP (1)",
>     [3] = "LNPHIT (1)",
>     [4] = "SS7COMP (1)",
>     [5] = "SS7XACT (1)",
>   },
>   id = "1.a.8",
>   tps =
>   {
>     [1] = "ack-cmpl (1)",
>     [2] = "cache-searches (1)",
>     [3] = "cache-updates (1)",
>     [4] = "termreq (1)",
>   },
>   serialb = "16-17",
>   seriala = "0-0",
> }
>
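And once the output is parsed into tables like this, the ordering problem
mentioned earlier largely goes away: checking against the expected results
becomes a set comparison rather than a string comparison. A sketch, with
helper names of my own invention (not from the actual test program):

```lua
-- Build a set (keys are the items) from a list of captured items.
local function to_set(list)
  local set = {}
  for _,item in ipairs(list) do
    set[item] = true
  end
  return set
end

-- True if both lists hold the same items, regardless of order.  This
-- assumes the variable names are unique, which they are in this output.
local function same_items(expected,actual)
  if #expected ~= #actual then
    return false
  end
  local set = to_set(actual)
  for _,item in ipairs(expected) do
    if not set[item] then
      return false
    end
  end
  return true
end

-- The same variables in a different order still compare equal.
print(same_items(
        { "ASREQ (1)" , "ASRESP (1)" , "LNPHIT (1)" },
        { "LNPHIT (1)" , "ASREQ (1)" , "ASRESP (1)" }
      ))  -- true
```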
Personally, I find regular expressions to be an incomprehensible mess of
random punctuation and letters, whereas the re module at least lets me label
the parts of the text I'm parsing. I also find it easier to see what is
happening six months later if I have to revisit the code.
Even more importantly, this is a real parser. Would you rather debug a
regular expression that just validates an email address [5], or a grammar
that validates all defined email headers [6] (email address validation
starts at line 464)?
[1]
http://www.lua.org/
[2]
http://www.inf.puc-rio.br/~roberto/lpeg/
[3]
http://www.inf.puc-rio.br/~roberto/lpeg/re.html
[4]
http://www.youtube.com/watch?v=Fy3rjQGc6lA
[5]
http://ex-parrot.com/~pdw/Mail-RFC822-Address.html
[6]
https://github.com/spc476/LPeg-Parsers/blob/master/email.lpeg
Email author at
[email protected]