* * * * *
And I still haven't found what I'm looking for
If I have any text processing to do, I pretty much gravitate towards using
LPeg (Lua Parsing Expression Grammar) [1]. Sure, it might take a bit longer
to generate code to parse some text, but it tends to be less “write only”
than regular expressions.
Besides, you can do some pretty cool things with it. I have some LPeg code
that will parse the following strftime() [2] format string:
> %A, %d %B %Y @ %H:%M:%S
>
and generate LPeg code that will parse:
> Tuesday, 03 February 2015 @ 20:59:51
>
into:
> date =
> {
> min = 59.000000,
> wday = 3.000000,
> day = 3.000000,
> month = 2.000000,
> sec = 51.000000,
> hour = 20.000000,
> year = 2015.000000,
> }
>
Or, if I set my locale [3] correctly, I can turn this:
> maŋŋebarga, 03 guovvamánu 2015 @ 21:00:21
>
into:
> date =
> {
> min = 0,000000,
> wday = 3,000000,
> day = 3,000000,
> month = 2,000000,
> sec = 21,000000,
> hour = 21,000000,
> year = 2015,000000,
> }
>
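As a rough illustration of the idea (not the actual code described above), here is a minimal sketch that turns a strftime()-style format into an LPeg pattern. It assumes only the numeric directives %d, %m, %Y, %H, %M and %S; the locale-dependent names (%A, %B) are left out, and the helper names number() and compile() are made up for this example. It also walks the format string with a plain loop rather than with LPeg itself.

```lua
local lpeg = require "lpeg"

local digit = lpeg.R"09"

-- Build a pattern that matches exactly n digits, converts them to a
-- number, and stores the result under the given field name.
local function number(n, field)
  local p = lpeg.P(true)
  for _ = 1, n do p = p * digit end
  return lpeg.Cg(p / tonumber, field)
end

-- Supported directives (an assumption of this sketch, not a full list).
local directive = {
  d = number(2, "day"),
  m = number(2, "month"),
  Y = number(4, "year"),
  H = number(2, "hour"),
  M = number(2, "min"),
  S = number(2, "sec"),
}

-- Walk the format string; directives become captures, everything else
-- must match literally.
local function compile(format)
  local pattern = lpeg.P(true)
  local i = 1
  while i <= #format do
    local c = format:sub(i, i)
    if c == "%" then
      local d = format:sub(i + 1, i + 1)
      local p = directive[d]
      if not p then error("unsupported directive %" .. d) end
      pattern = pattern * p
      i = i + 2
    else
      pattern = pattern * lpeg.P(c)
      i = i + 1
    end
  end
  return lpeg.Ct(pattern)
end

local parse = compile("%d/%m/%Y @ %H:%M:%S")
local date  = parse:match("03/02/2015 @ 20:59:51")
```

The resulting table has the fields day, month, year, hour, min and sec; input that does not fit the format makes match() return nil.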
But one annoyance that hits from time to time—named captures require a
constant name. For instance, this pattern:
> pattern = lpeg.Ct(
> lpeg.Cg(lpeg.P "A"^1,"class_a")
> * lpeg.P":"
> * lpeg.Cg(lpeg.P "B"^1,"class_b")
> )
>
(translated: when matching a string like AAAA:BBB, return a Lua table [4]
(lpeg.Ct()) with the As (lpeg.P()) in field class_a (lpeg.Cg()) and the Bs in
field class_b)
applied to this string:
> AAAA:BBB
>
returns this table:
> {
> class_a = "AAAA",
> class_b = "BBB"
> }
>
The field names are constant—class_a and class_b. I'd like a field name based
on the input. Now, there is a function lpeg.Cb() that is described as:
> Creates a back capture. This pattern matches the empty string and produces
> the values produced by the most recent group capture [5] named name.
>
> Most recent means the last complete outermost group capture with the given
> name. A Complete capture means that the entire pattern corresponding to the
> capture has matched. An Outermost capture means that the capture is not
> inside another complete capture.
>
“LPeg - Parsing Expression Grammars For Lua [6]”
A quick reading (and I'm guilty of this) leads me to think this:
> pattern = lpeg.Cg(lpeg.P"A"^1,"name")
> * lpeg.P":"
> * lpeg.Ct(lpeg.P "B"^1,lpeg.Cb("name"))
>
applied to the string:
> AAAA:BBB
>
returns
> {
> AAAA = "BBB"
> }
>
But sadly, no. The only example of lpeg.Cb() in the manual, used to parse Lua
long strings (which start with a “[”, zero or more “=”, another “[”, then
text, ended with a “]”, zero or more “=” (but the number of “=” must equal the
number of “=” between the two “[”) and a final “]”):
> equals = lpeg.P"="^0
> open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
> close = "]" * lpeg.C(equals) * "]"
> closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
> string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1
>
shows that lpeg.Cb() was designed with this particular use case in mind:
matching one pattern against the same pattern later on, which is not what I want.
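To make the difference concrete, the following sketch shows what a back capture actually produces. The first pattern is an illustration written for this example (it is not from the manual); the second is the manual's long-string matcher, with the variable `string` renamed to `longstr` here so it does not shadow Lua's string library.

```lua
local lpeg = require "lpeg"

-- lpeg.Cb("name") yields the *value* of the group called "name".
-- Inside a table capture that value becomes just another array entry;
-- it is never used as a key.
local pattern = lpeg.Cg(lpeg.P"A"^1, "name")
              * lpeg.P":"
              * lpeg.Ct(lpeg.C(lpeg.P"B"^1) * lpeg.Cb("name"))

local t = pattern:match("AAAA:BBB")
-- t[1] holds the Bs, t[2] holds the As; there is no field t.AAAA

-- The manual's example: the closing bracket only counts when its run
-- of "=" equals the run captured by the group "init".
local equals  = lpeg.P"="^0
local open    = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
local close   = "]" * lpeg.C(equals) * "]"
local closeeq = lpeg.Cmt(close * lpeg.Cb("init"),
                         function(s, i, a, b) return a == b end)
local longstr = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

local body = longstr:match("[==[one ]] two]==]")
-- the "]]" in the middle is skipped because it has no "=" signs
```

In both cases the back capture is only a way of getting at an earlier value again, either to store it or to compare against it.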
I can do what I want (a field name based upon the input) but the way to go
about it is very klunky (in my opinion):
> pattern = lpeg.Cf(
>   lpeg.Ct("")
>   * lpeg.Cg(
>       lpeg.C(lpeg.P"A"^1)
>       * lpeg.P":"
>       * lpeg.C(lpeg.P"B"^1)
>     )
>   ,function(acc,name,value)
>     acc[name] = value
>     return acc
>   end
> )
>
This is a “folding capture [7]” (lpeg.Cf()) where we are accumulating our
results (even though it's only one result, we have to do it this way) in a
table (lpeg.Ct()) where each “value” is a group (lpeg.Cg(), where the name is
optional) consisting of a capture (lpeg.C()) of As (lpeg.P()) followed by a
colon (ignored), followed by a capture of Bs, all of which (except for the
colon; remember, it's ignored) are passed to a function that assigns the
string of Bs to a field name based on the string of As.
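The fold earns its keep once there is more than one pair. As a sketch, assume a comma-separated list of name:value pairs (the input format and the names `name`, `value` and `pair` are invented for this example); each group hands its two captures to the accumulating function in turn.

```lua
local lpeg = require "lpeg"

local name  = lpeg.C(lpeg.R("AZ", "az")^1)           -- field name from the input
local value = lpeg.C((lpeg.P(1) - lpeg.P",")^0)      -- everything up to a comma
local pair  = lpeg.Cg(name * lpeg.P":" * value) * lpeg.P","^-1

-- Start with an empty table, then fold every (name, value) group into it.
local pattern = lpeg.Cf(lpeg.Ct("") * pair^0, function(acc, k, v)
  acc[k] = v
  return acc
end)

local t = pattern:match("AAAA:BBB,CC:DDDD")
```

Here t ends up with one field per pair, named after whatever appeared before each colon.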
It gets even messier when you mix fixed field names with ones based upon the
input. If all the field names are defined, it's easy to do something like:
> eoln = lpeg.P"\n" -- match end of line
> text = lpeg.P(1) - eoln -- match any one character except an end of line
>
> pattern = lpeg.Ct(
>     lpeg.P"field_one: " * lpeg.Cg(text^0,"field_one") * eoln
>   * lpeg.P"field_two: " * lpeg.Cg(text^0,"field_two") * eoln
>   * lpeg.P"field_three: " * lpeg.Cg(text^0,"field_three") * eoln
> )
>
against data like this:
> field_one: Lorem ipsum dolor sit amet
> field_two: consectetur adipiscing elit
> field_three: Cras aliquet enim elit
>
to get this:
> {
> field_one = "Lorem ipsum dolor sit amet",
> field_two = "consectetur adipiscing elit",
> field_three = "Cras aliquet enim elit"
> }
>
But if we have some defined fields, but want to accept non-defined field
names, then … well … yeah … I haven't found a nice way of doing it. And I
find it annoying that I haven't found what I'm looking for.
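For what it's worth, the folding capture from earlier can be stretched to cover the mixed case, though it is arguably no nicer. The sketch below assumes one hypothetical fixed field, `count`, whose value should become a number, and treats every other `name: text` line generically; lpeg.Cc() supplies the constant name so both kinds of line hand a (name, value) pair to the same function. The field names `count` and `extra` and the sample data are invented for illustration.

```lua
local lpeg = require "lpeg"
local P, R, C, Cc, Cg, Cf, Ct =
  lpeg.P, lpeg.R, lpeg.C, lpeg.Cc, lpeg.Cg, lpeg.Cf, lpeg.Ct

local eoln = P"\n"                          -- match end of line
local text = P(1) - eoln                    -- any character but end of line
local name = (R("az", "AZ", "09") + P"_")^1 -- a field name taken from the input

-- Fixed field: constant name, value converted to a number.
local count = Cg(Cc"count" * P"count: " * (R"09"^1 / tonumber) * eoln)

-- Any other field: both name and value come from the input.
local other = Cg(C(name) * P": " * C(text^0) * eoln)

local pattern = Cf(Ct"" * (count + other)^1, function(acc, k, v)
  acc[k] = v
  return acc
end)

local data = "field_one: Lorem ipsum dolor sit amet\n"
          .. "count: 3\n"
          .. "extra: Cras aliquet enim elit\n"

local t = pattern:match(data)
```

The cost is that every fixed field has to be wrapped in its own Cg(Cc"..." * ...) instead of a plain named capture, which is the klunkiness complained about above.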
[1]
http://www.inf.puc-rio.br/~roberto/lpeg/
[2]
http://pubs.opengroup.org/onlinepubs/009695399/functions/strftime.html
[3]
http://en.wikipedia.org/wiki/Locale
[4]
http://www.lua.org/manual/5.3/manual.html#2.1
[5]
http://lucy/~spc/docs/LPeg-0.12/lpeg.html#cap-
[6]
http://lucy/~spc/docs/LPeg-
[7]
http://lucy/~spc/docs/LPeg-