* * * * *
Timing LPEG expressions
One pattern that is seemingly missing from LPEG (Lua Parsing Expression
Grammar) [1] is a case-insensitive match, a pattern where you can specify a
pattern to match “foo”, “FOO”, “Foo” or “fOo”. I have to think this was
intentional, as doing case insensitive matching on non-English words is … a
complex topic (for a long time, the German letter “ß [2]” upper case form was
“SS” but not all “SS” were an upper case “ß”). So it doesn't surprise me
there's no pattern for it in LPEG. But I wish it did, as a lot of Internet
text-based protocols require case-insensitive matching.
There are two ways around the issue. One way is this:
-----[ Lua ]-----
local lpeg = require "lpeg"
lcoal P = lepg.P
local patt = (P"S" + P"s")
* (P"A" + P"a")
* (P"M" + P"m")
* P"-"
* (P"I" + P"i")
* P"-"
* (P"A" + P"a")
* (P"M" + P"m")
-----[ END OF LINE ]-----
But this would seem to produce a lot of branching code that would be slow
(LPEG has its own parsing-specific VM (Virtual Machine)). Of course, there's
this solution:
-----[ Lua ]-----
local lpeg = require "lpeg"
local P = lpeg.P
local S = lpeg.S
local impS = S"Ss" * S"Aa" * S"Mm"
* P"-"
* S"Ii"
* P"-"
* S"Aa" * S"Mm"
-----[ END OF LINE ]-----
But each lpeg.S() uses 32 bytes to store the set of characters it matches,
and that seems like a large waste of memory for just two characters. A third
way occured to me:
-----[ Lua ]-----
local lpeg = require "lpeg"
local Cmt = lpeg.Cmt
local R = lpeg.R
local impC = Cmt(
S"SsAaMmIi-"^1,
function(subject,position,capture)
if capture:lower() == 'sam-i-am" then
return position
end
end
)
-----[ END OF LINE ]-----
This just looks for all the characters in “Sam-I-am” and then calls a
function to do an actual case-insensitive comparison, but at the cost of
doing it at the actual match time, instead of potentially doing it lazily (as
the manual puts it, “this one is evaluated immediately when a match occurs
(even if it is part of a larger pattern that fails later)”). And it might be
a bit faster than the one that just uses lpeg.P(), and with less memory than
the one using lpeg.S().
So before going to work on a custom case-insensitive pattern for LPEG (where
lpeg.P() is pretty much the case-sensitive pattern), I thought I might want
to profile the existing approaches first just to get a feeling for how long
each approach takes.
-----[ Lua ]-----
local lpeg = require "lpeg"
local rdtsc = require "rdtsc" -- this is another post
local Cmt = lpeg.Cmt
local Cf = lpeg.Cf
local P = lpeg.P
local R = lpeg.R
local S = lpeg.S
local test = "Sam-I-Am"
local base = P"Sam-I-Am"
local impP = (P"S" + P"s")
* (P"A" + P"a")
* (P"M" + P"m")
* P"-"
* (P"I" + P"i")
* P"-"
* (P"A" + P"a")
* (P"M" + P"m")
local impS = S"Ss" * S"Aa" * S"Mm"
* P"-"
* S"Ii"
* P"-"
* S"Aa" * S"Mm"
local impC = Cmt(
S"SsAaMmIi-"^1,
function(subject,position,capture)
if capture:lower() == "sam-i-am" then
return position
end
end
)
local function testf(patt)
local res = {}
for i = 1 , 10000 do
local zen = rdtsc()
patt:match(test)
local tao = rdtsc()
table.insert(res,tao - zen)
end
table.sort(res)
return res[1]
end
print('base',testf(base))
print('impP',testf(impP))
print('impS',testf(impS))
print('impC',testf(impC))
-----[ END OF LINE ]-----
I'm testing the normal case-sensitive pattern, and the three case-insensitive
patterns. I run each test 10,000 times and return the lowest value (“lowest”
means “fastest”). The rdtsc() function … that's another post [3] (but a pre-
summary—it represents the number of clock cycles the CPU has been running and
on the test system there are 2,660,000,000 cycles per second).
Anyway, on to the results:
Table: Timing some LPEG patters
base 2800
impP 3020
impS 3020
impC 5190
I'm honestly surprised. First, I thought the last method would do better than
it did, but it's nearly twice as slow. The other two are pretty much the
same, time wise (which leads me to wonder if the pattern
lpeg.P(single_letter) + lpeg.P(single_letter) is internally converted to
lpeg.S(letters)—it could very well be). And they aren't that much slower than
the case-sensitive pattern. Well, not enough for me to worry about it. Even a
much longer string, like “Access-Control-Allow-Credentials” gave similar
results.
And no, I did not write out by hand the expression to match “Access-Control-
Allow-Credentials” case-insensitively, but wrote an LPEG expression to
generate the LPEG expression to do the match:
-----[ Lua ]-----
local lpeg = require "lua"
local Cf = lpeg.Cf -- a folding capture
local P = lpeg.P
local R = lpeg.R
local char = R("AZ","az")
/ function(c)
return P(c:lower()) + P(c:upper())
end
+ P(1)
/ function(c)
return P(c)
end
Hp = Cf(char^1,function(a,b) return a * b end)
-----[ END OF LINE ]-----
It's a powerful technique, but one that can take a while to wrap your brain
around. It's just one of the reasons why I like using LPEG.
[1]
http://www.inf.puc-rio.br/~roberto/lpeg/
[2]
https://en.wikipedia.org/wiki/ß
[3]
gopher://gopher.conman.org/0Phlog:2020/06/05.2
Email author at
[email protected]