* * * * *
Parsing—it's not just for compilers anymore
I've been playing around with LuaRocks [1] and while I've made a rock of all
my modules [2], I've been thinking that it would be better if I made the
modules individual rocks. That way, you can install just the modules you want
(perhaps you want to embed a C compiler in your Lua program [3]) instead of a
bunch of modules most of which you won't use.
And that's fine. But I like the ability to pull the source code right out of
the repository when making a rock. Now, given that the majority of my modules
are single files (either in Lua [4] or C) and the fact that it's difficult to
check out a single file with git (or with svn, for that matter), I think I'd be
better served having each module be its own repository.
And that's fine, but now I have a larger problem—how do I break out the
individual files into their own repositories and keep the existing revision
history? This doesn't seem to be an easy problem to solve.
Sure, git now has the concept of “submodules”—external repositories
referenced in an existing repository, but that doesn't help me here (and
git's handling of “submodules” is quirky at best). There's git-filter-branch
but that's if I want to break a directory into its own repository, not a
single file. But there's also git-fast-export, which dumps an existing
repository in a text format, supposedly to help export repositories into
other version control systems.
I think I can work with this.
The resulting output is simple and easy to parse [5], so my thought is to
look only at the bits involving the file I'm interested in and generate a new
file that can then be imported into a fresh repository with git-fast-import.
I used LPeg [6] to parse the exported output (why not? The git export format
is documented with BNF (Backus-Naur Form), which is directly translatable
into LPeg), and the only difficult portion was handling this bit of syntax:
> 'data' SP <count> LF
> <raw> LF?
>
A datablock contains the number of bytes to read starting with the next line.
Defining this in LPeg took some thinking. An early approach was something
like:
> data = Ct(                       -- return parse results in table
>     P'data '                     -- match 'data' SP
>   * Cg(R"09"^1,'size')           -- get size, save for later reference
>   * P'\n'                        -- match LF
>   * Cg(                          -- named capture
>       P(tonumber(Cb('size')))    -- of 'size' bytes
>       ,'data'                    -- store as 'data'
>     )
>   * P'\n'^-1                     -- parse optional LF
> )
>
lpeg.P(n) is documented to match exactly n characters, but in my case, n wasn't
constant. You can do named captures, so I figured I could capture the size,
then retrieve it by name, passing the value to lpeg.P(), but no, that didn't
work. It generates “bad argument #1 to 'P' (lpeg-pattern expected, got nil)”—
in other words, an error.
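As far as I can tell, the reason is that an LPeg pattern is built before any input is seen. lpeg.Cb('size') doesn't hand back the captured text; it hands back another pattern object, and tonumber() of that is nil. A minimal sketch that reproduces the failure, assuming the lpeg module is installed (the variable names are mine, and the exact error text may differ between LPeg versions):

```lua
local lpeg = require "lpeg"

-- lpeg.Cb() returns a pattern object, not the text captured under 'size'.
local size_ref  = lpeg.Cb('size')
local kind      = lpeg.type(size_ref)   -- "pattern"
local as_number = tonumber(size_ref)    -- nil: a pattern is not a number

-- So lpeg.P() is called with nil while the grammar is being constructed,
-- long before any input has been parsed.
local ok, err = pcall(lpeg.P, as_number)

print(kind, as_number, ok)
print(err)
```

The size only exists while a match is running, so whatever consumes it has to run at match time as well.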
It took quite a bit of playing around, and close reading of the LPeg manual
before I found the solution:
> function immdata(subject,position,capture)
>   local size  = tonumber(capture)      -- capture is "<count>\n"
>   local range = position + size - 1    -- index of the last data byte
>   local data  = subject:sub(position,range)
>   return range + 1,data                -- resume just past the data
> end
>
> data = Ct(
>     P'data '
>   * Cg(Cmt(R"09"^1 * P"\n",immdata),'data')
>   * P'\n'^-1
> )
>
It's the lpeg.Cmt() that does it. It calls the given function as soon as the
given pattern is matched. The function is given the entire object being
parsed (one huge string, in this case the subject parameter), the position
after the match (the position parameter), and the actual string that was
matched (the capture parameter). From there, we can parse the size
(tonumber(), a standard Lua function, ignores the included line feed
character), then we return what we want as the capture (the variable amount
of data) and the new position where LPeg should resume parsing.
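To make the match-time capture concrete, here is a small self-contained sketch run against a few made-up inputs. It assumes the lpeg module is installed; the names counted and datablock are mine, and the check for truncated input is an addition that the description above doesn't mention. The function returns the position just past the data, since the number returned to lpeg.Cmt() becomes the new current position.

```lua
local lpeg = require "lpeg"
local P, R, Cg, Ct, Cmt = lpeg.P, lpeg.R, lpeg.Cg, lpeg.Ct, lpeg.Cmt

-- Match-time function. `position` is the index just past "<count>\n",
-- and `capture` is the matched text itself, e.g. "5\n".
local function counted(subject, position, capture)
  local size = tonumber(capture)           -- trailing LF is ignored
  local stop = position + size - 1         -- index of the last data byte
  if stop > #subject then
    return false                           -- truncated input: fail the match
  end
  return stop + 1, subject:sub(position, stop)
end

local datablock = Ct(
    P'data '
  * Cg(Cmt(R"09"^1 * P"\n", counted), 'data')
  * P'\n'^-1                               -- optional LF after the raw bytes
)

local simple    = datablock:match("data 5\nhello\nrest of the stream")
local multiline = datablock:match("data 3\na\nb\n")
local truncated = datablock:match("data 9\nshort")

print(simple.data)        -- hello
print(#multiline.data)    -- 3
print(truncated)          -- nil
```

The second input shows why the count matters: the data itself may contain line feeds, so scanning for the next LF would cut it short.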
And this was the hardest part of the entire project: matching a
variable number of unknown characters. Once I had this, I could read the
exported repository into memory, find the parts relating to an individual
file and generate output that had the history of that one file (excluding the
bits where the file may have moved from directory to directory; those weren't
needed) which could then be imported into a clean git repository.
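The filtering step can be sketched roughly as follows. This is only an illustration in plain Lua: the table layout (a list of commits, each with a message and a list of changes carrying a path and a mark) is a hypothetical stand-in for whatever the parser actually produces, and the path src/env.c and the commit messages are invented for the example.

```lua
-- Keep only the commits that touch `path`, and within each of those
-- commits keep only the changes to that file.
local function commits_touching(commits, path)
  local kept = {}
  for _, commit in ipairs(commits) do
    local changes = {}
    for _, change in ipairs(commit.changes) do
      if change.path == path then
        changes[#changes + 1] = change
      end
    end
    if #changes > 0 then
      kept[#kept + 1] = { message = commit.message, changes = changes }
    end
  end
  return kept
end

-- Hypothetical parsed history: three commits, two of which touch tcc.c.
local history = commits_touching({
  { message = "add tcc",   changes = { { path = "src/tcc.c", mark = ":1" } } },
  { message = "fix env",   changes = { { path = "src/env.c", mark = ":2" } } },
  { message = "tcc + env", changes = { { path = "src/tcc.c", mark = ":3" },
                                       { path = "src/env.c", mark = ":4" } } },
}, "src/tcc.c")

for _, commit in ipairs(history) do
  print(commit.message, commit.changes[1].mark)
end
```

The surviving marks say which blobs still need to be written out ahead of the commits in the stream handed to git-fast-import.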
[1]
http://www.luarocks.org/
[2]
https://github.com/spc476/lua-conmanorg
[3]
https://github.com/spc476/lua-conmanorg/blob/master/src/tcc.c
[4]
http://www.lua.org/
[5]
http://kernel.org/pub/software/scm/git/docs/git-fast-import.html
[6]
http://www.inf.puc-rio.br/~roberto/lpeg/