* * * * *
An annotated example of using LPeg to parse a string to generate LPeg to
parse other strings
A message on the Lua email list was asking about the best way to parse MQTT
(does not have a meaning) topics, specifically, how to handle the multilevel
wildcard character [1]. I answered that LPeg (Lua Parsing expression grammar)
[2] would be good for this, and gave annotated source code to show how it
works. I thought I might also post about it for better visibility.
So, here's the code:
-----[ Lua ]-----
local lpeg = require "lpeg"
local Cc = lpeg.Cc
local Cf = lpeg.Cf
local P = lpeg.P
local R = lpeg.R
local filter do
local separator = P'/'
local topic = R("AZ","az","09")^1 * (#separator + P(-1))
local single = P'+' * (#separator + P(-1))
local multi = (P"/#" + P'#') * P(-1)
local csep = separator / function() return separator end
local ctopic = topic / function(c) return P(c) end
local csingle = single / function() return topic^-1 end
local cmulti = multi
/ function() return (separator * topic)^0 * P(-1) end
filter = (P"#" * P(-1))
/ function() return (separator^-1 * topic)^0 * P(-1) end
+ Cf(
(-P"/#" * (ctopic + csingle + csep))^0 * cmulti^-1 * Cc(P(-1)),
function(a,r)
return a * r
end
) * P(-1)
end
-----[ END OF LINE ]-----
And now the annotations—code fragment first, then annnotation.
-----[ Lua ]-----
local lpeg = require "lpeg"
local Cc = lpeg.Cc
local Cf = lpeg.Cf
local P = lpeg.P
local R = lpeg.R
-----[ END OF LINE ]-----
This loads the LPeg module into Lua. I also grab the functions I'll be using
from the module into locals. I do this not for speed purposes (although it
will be slightly faster) but to reduce code clutter—there will be less lpeg.
littered about the code, and I find that easier to read personally. It's not
required that this be done.
-----[ Lua ]-----
local filter do
-- ...
end
-----[ END OF LINE ]-----
filter will contain the resulting LPeg expression. I create a new scope since
the variables I'll be declaring won't be used outside of the definition for
filter and it seems cleaner to me to reduce variable visibility as much as
possible. It will also mean that over time (if this is intended for code that
runs for a long time) the local variables created in this scope will be
reclaimed as garbage. It's just a stylistic choice I do for Lua.
-----[ Lua ]-----
local separator = P'/'
-----[ END OF LINE ]-----
This defines an LPeg expression that matches a literal slash character. The
P() [3] function can do a bit more than match literal strings, but we'll be
mostly using it for literal string matches, as well as matching the end of
the input string.
-----[ Lua ]-----
local topic = R("AZ","az","09")^1 * (#separator + P(-1))
-----[ END OF LINE ]-----
This expression will match a “topic.” I'm using R() [4] to match a range of
characters (in this case, letters and digits). The multiplication sign (okay,
an asterisk, but it's used to designate multiplication in Lua) here is used
as an “AND” clause [5]—a topic is a range of characters “AND” something else.
That “something else” is either a separator (and the “#” mark is used to look
ahead [6] in the input without consuming it) or (the plus sign is read as “OR
[7]”) end of the input string.
-----[ Lua ]-----
local single = P'+' * (#separator + P(-1))
-----[ END OF LINE ]-----
This expression will match a plus sign, which is used to indicate a single
topic wildcard character. And again, we're expecting this to be followed by a
separator character or the end of the string.
-----[ Lua ]-----
local multi = (P"/#" + P'#') * P(-1)
-----[ END OF LINE ]-----
The “#” charcter is a multiple topic wildcard character and it must appear at
the end of the string. I check for both “/#” and “#” because of a way I
process the input later on. I might have found a better way to deal with
this, but for a “proof-of-concept” this is good enough for now.
Now we get to the mind bending bit of this—I'm writing LPeg to parse a “topic
filter” and generate an LPeg expression that will see if a “topic name”
matches the “topic filter.”
-----[ Lua ]-----
local csep = separator / function() return separator end
local ctopic = topic / function(c) return P(c) end
local csingle = single / function() return topic^-1 end
local cmulti = multi
/ function() return (separator * topic)^0 * P(-1) end
-----[ END OF LINE ]-----
These four expressions all do similar things—they match an existing pattern
and pass the matching text to a function [8] which returns an LPeg
expression. csep returns an expression that matches the separator; ctopic
returns an expression that matches the literal topic just parsed; csingle
returns an expression that matches an alphanumeric string that represents a
topic; and finally cmulti returns an expression that matches the remaining
input [9].
And the final bit of code:
-----[ Lua ]-----
filter = (P"#" * P(-1))
/ function() return (separator^-1 * topic)^0 * P(-1) end
+ Cf(
(-P"/#" * (ctopic + csingle + csep))^0 * cmulti^-1 * Cc(P(-1)),
function(a,r)
return a * r
end
) * P(-1)
-----[ END OF LINE ]-----
The first line just matches a single multiple topic character and returns a
pattern that will match the input. If that doesn't match (remember, “+” is
read as an “OR”) we do a folding capture [10] (Cf())—the code parses through
the “topic filter” and builds an LPeg expression using a folding capture that
will parse “topic names“ per the filter. Each piece that does match and
return a capture will be “accumulated” into a single expression, the
“folding” being done by the anonymous function passed in. The -P("/#") bit
there looks ahead in the input to make sure it isn't a multiple topic
wildcard character at the end of the string, and if that is the case, then it
will compile a match for a literal topic, a non-specified topic (which
fulfills the “single wildcard character match”) or a separator (but as long
as the separator isn't itself followed by a multiple topic wildcard
character, which is why we peek forward into the input). If we get to a point
in the input where we either hit the end of input, or a multiple topic
wildcard character, we handle that and cap the LPeg expression we're building
with checking for end of the input (all on lines 4–7).
The point of all this is to turn a string like “+/tennis/+/#” into the
following LPeg expression:
-----[ Lua ]-----
topic * separator * P"tennis" * separator * topic * (separator * topic)^0 * P(-1)
-----[ END OF LINE ]-----
which can then be used to match “topic names:”
-----[ Lua ]-----
local topics =
{
"news/tennis", -- won't match
"news/tennis/mcenroe", -- will match
"news/football/dolphins", -- won't match
"news/baseball/marlins", -- won't match
"sports/tennis/williams/ranking", -- will match
}
local the_topic = filter:match("+/tennis/+/#")
for _,topic in ipairs(topics) do
if the_topic:match(topic) then
report_on_it(topic)
end
end
-----[ END OF LINE ]-----
Yes, there is a learning curve (okay, maybe a cliff) to LPeg. But once you
get used to it, it is quite powerful and allows you to transform data in ways
that you can't with regular expressions. In fact, instead of returning the
default value (which is one past the position of the match in the string, or
nil if it failed to parse) I could have instead returned an array of topics
(or nil if it failed to parse)—but I will leave such changes as an exercise
to the reader.
There's also a bit about dollar signs further down in the MQTT document I
linked to, but again, handling that is left as an exercise for the reader.
[1]
http://docs.oasis-open.org/mqtt/mqtt/v3.1.1/os/mqtt-v3.1.1-os.html#_Toc398718107
[2]
http://www.inf.puc-rio.br/~roberto/lpeg/
[3]
http://www.inf.puc-rio.br/~roberto/lpeg/#op-p
[4]
http://www.inf.puc-rio.br/~roberto/lpeg/#op-r
[5]
http://www.inf.puc-rio.br/~roberto/lpeg/#op-mul
[6]
http://www.inf.puc-rio.br/~roberto/lpeg/#op-len
[7]
http://www.inf.puc-rio.br/~roberto/lpeg/#op-add
[8]
http://www.inf.puc-rio.br/~roberto/lpeg/#cap-func
[9]
http://www.inf.puc-rio.br/~roberto/lpeg/#op-pow
[10]
http://www.inf.puc-rio.br/~roberto/lpeg/#cap-f
---
Discussions about this page
An annotated example of using LPeg to parse a string to generate LPeg to parse other strings | Lobsters
https://lobste.rs/s/cxzmbn
Email author at
[email protected]