Wak Awk Implementation
======================
* An awk implementation
* Writing an awk
* Parsing awk is tricky
* Parsing awk
* Runtime internals
* Implementing regex RS
* Testing awk
An awk implementation
=====================
Rob Landley's toybox project provides a variety of Linux command-line
tools, similar to busybox. I have written a compact but fairly
complete awk implementation intended to integrate with toybox, but it
can also build standalone.
<https://landley.net/toybox/>
<https://github.com/landley/toybox>
<https://busybox.net/>
This implementation is named wak, because all the good awk names are
taken. But when used in toybox, it's just awk, or toybox awk.
wak is coded in C99. It uses mostly C standard library, aka libc, but
also needs some POSIX support, mostly for regular expressions.
It is not public domain but does have Landley's very liberal 0BSD
license.
<https://landley.net/toybox/license.html>
These pages describe my awk implementation, but are not complete.
The implementation is on GitHub.
<https://github.com/raygard/wak/>
Writing an awk
==============
Introduction
------------
I have written a compact but fairly complete awk implementation
intended to integrate with Rob Landley's toybox project, but it can
also build standalone.
These pages document some aspects of this effort. This implementation
is named wak, because all the good awk names are taken. But when used
in toybox, it's just awk, or toybox awk.
It is not public domain but does have Landley's very liberal 0BSD
license.
Who is this written for?
------------------------
This is written for anyone who wants to understand how wak works
internally. It's partly for my own use, to document what I've done so
I can find it later. It's also for anyone curious about how awk can
be implemented, including some of the problems I encountered and how
I dealt with them.
To understand this, it helps to know awk pretty well and know a bit
about how a lexical analyzer (aka lexer / scanner / tokenizer) and a
recursive-descent parser work.
There is a POSIX specification for awk, and wak should conform to it
with some exceptions (mostly) documented here. The reader should
probably be familiar with the POSIX spec to read this, or the wak code.
<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html>
Other implementations
---------------------
There are several implementations of awk readily available, which I
have been testing my implementation against.
The original awk, written by Al Aho, Peter Weinberger, and Brian
Kernighan, is still being maintained and even updated recently with
new features. Kernighan (bwk) calls it the One True Awk, and it's
often referred to as nawk (new awk, as compared with the pre-1985
version, sometimes called old awk, which lacked many essential
features).
<https://github.com/onetrueawk/awk>
Around September 2023, nawk was updated with some new features,
primarily around UTF-8 handling and the ability to read CSV
(comma-separated values) files. I have used this later version in my
testing, and refer to it here as nnawk (new new awk).
Many Linux distros include gawk (GNU awk, maintained for decades by
Arnold Robbins). Some distros (like Raspberry Pi's Raspbian) have mawk
(Mike Brennan's awk, now maintained by Thomas Dickey).
<https://www.gnu.org/software/gawk/>
<https://invisible-island.net/mawk/>
Busybox has an awk implementation by Dmitry Zakharov, written in 2002
and currently maintained.
<https://git.busybox.net/busybox/tree/editors/awk.c>
More recently, Ben Hoyt wrote goawk, an implementation in the Go language.
<https://github.com/benhoyt/goawk>
See also <https://en.wikipedia.org/wiki/AWK>.
Documents
---------
The 1985 version of nawk is documented in The AWK Programming
Language (1988), by Aho, Kernighan, and Weinberger. I'll refer to
that edition as the AWK book. Since nnawk (aka One True Awk or OTA)
was enhanced by Kernighan in 2023, the book has been issued in a
second edition, available at Amazon. I'll refer to that as the AWK
book 2nd Ed.
<https://awk.dev/>
In addition to The AWK Programming Language (1st and 2nd editions)
and the POSIX spec mentioned above, the gawk manual by Arnold
Robbins, titled GAWK: Effective AWK Programming is a great resource.
It documents gawk in detail, including "standard" awk and gawk-only
extensions, and it also explains how to use awk. The explanations of
awk's "dark corners" and many details omitted from the AWK book
helped immensely in developing wak.
<https://www.gnu.org/software/gawk/manual/>
What makes a proper awk?
------------------------
None of the existing implementations follow POSIX to the letter, and
no two of them behave exactly the same. As I understand it, the POSIX
awk spec is largely based on the description in the AWK book, but
there are some differences. The POSIX spec is more precise than the
description in the AWK book, but not as detailed as, for example, the
language spec for C. The grammar in the POSIX spec reflects the
actual nawk implementation more so than the book does, but is not
based directly on the yacc/Bison grammar used by nawk. Where POSIX
differs from nawk, it seems most implementers tend to follow the
actual behavior of nawk rather than POSIX.
wak intends to follow the POSIX spec as well, but there are
departures. For example, wak does not handle locales except to the
extent that toybox does.
Some restrictions imposed by toybox
-----------------------------------
Toybox is licensed 0BSD, Rob Landley's approach to have copyrighted
code that is as free and close to public domain as possible; see the
link for more information. Rob does not accept code that has more
restrictions than the 0BSD license, so he can't adopt any other
current awk implementations. wak uses the 0BSD license.
Rob is adamant that it not have any outside dependencies except
"standard" libc (plus some POSIX elements such as regular expression
functions), so using yacc/Bison or lex/flex is out, and toybox awk
will have to use a hand-written lexical analyzer and parser.
In order to avoid any "contamination" of wak code by other
implementations, I did not look at the code for other
implementations. I did look at the yacc/Bison grammar for nawk, only
to attempt to understand the actual language accepted by the original
awk.
Program structure
-----------------
The design is a classic "scan, compile to internal format, interpret"
interpreter.
The compiler is a recursive descent parser with code generation
rolled directly into it. It uses a Pratt/precedence-climbing approach
for parsing expressions. It is a one-pass design, with a few "fixups"
needed to deal with awk allowing functions to be called before being
defined, and also to handle a few difficulties with the awk grammar.
The internal format is a "virtual machine" where the instructions are
each a "word" (int). For no special reason, I call my generated
format "zcode". The zcode machine is a straight stack machine.
Lexical analysis
----------------
The current wak lexical analyzer, scan.c, is about 300 LOC (counting
non-blank, non-comment lines). It's a fairly typical scanner design,
with functions to get tokens (numbers, strings, variables, keywords,
regular expressions), look up keywords, etc. It also handles reading
the program text given on the command line or stepping through the
files of awk code specified on the command line.
The state of the scanner is maintained in a global of type struct
scanner_state, updated as each token is scanned. This includes the
state of the file being read (name, line number, etc) and the
particular token scanned and information about the token. The tokens
are signified by integer values from an enum with names of the form
tk... such as tkfor, tkwhile, tklt (<), tknumber, tkstring, etc.
One interesting detail is the handling of regular expressions
(hereafter just "regex"). The language allows literal regex specified
as / ... regex here .../. The / symbol also indicates division, and
/= is the divide-and-assign operator, just as in C. To disambiguate,
the POSIX spec says:
> When an input sequence begins with a character in any syntactic
> context where the token '/' or DIV_ASSIGN could appear as the next
> token in a valid program, the longer of those two tokens that can
> be recognized shall be recognized. In any other syntactic context
> where the token ERE could appear as the next token in a valid
> program, the token ERE shall be recognized.
(DIV_ASSIGN is the `/=' token, and ERE means a POSIX Extended Regular
Expression token.)
I asked Arnold Robbins and Ozan Yigit (the current OneTrueAwk
maintainer) if I am correct that a `/' or `/=' token can appear only
after a number, string, variable, getline keyword, right parenthesis
`)', right bracket `]', or an increment or decrement operator `++' or
`--'. I intended to have the scanner assume a `/' character after any
of these means divide, and a `/' in any other context indicates a
regex literal.
Neither gave me a definitive answer to that, but Ozan did not think
it a good idea to have the scanner make the decision. He said
"letting the parser inform the scanner what to collect is a
reasonable thing to do" and that "both ota and gawk do exactly that
for collecting regular expressions." That implied to me that the
parser would have to tell the scanner when it is expecting a regex,
which seemed to me to be trickier than having the scanner do as I
suggested. Perhaps I was wrong, but I did not want to look at the
code for OTA (nawk) or gawk to see just how that is done. I have
taken the approach I described, and have not run into a problem yet
with valid awk code being scanned or parsed wrongly as a result.
Internal format and zcode machine
---------------------------------
Each awk value is a zvalue struct, that can be a scalar, literal
regular expression (regex for short), or an array. A scalar can have
a numeric value (C double) or a string value or both. A string
(zstring) is a struct that is reference counted and can hold an
arbitrary number of bytes. An array is a pointer to a zmap struct
that holds a pointer to a hash table. (I often refer to the awk array
structure as a map.) A regex is a pointer to a regex_t compiled POSIX
regex.
Because the string, array, and regex values are mutually exclusive
within an awk variable, I use a union to hold them. Here are the
zvalue and zstring structures:
// zvalue: the main awk value type.
// Can be number or string or both, or else map (array) or regex
struct zvalue {
  unsigned flags;
  double num;
  union { // anonymous union not in C99
    struct zstring *vst;
    struct zmap *map;
    regex_t *rx;
  };
};

// zstring: flexible string type.
// Capacity must be > size because we insert a NUL byte.
struct zstring {
  int refcnt;
  unsigned size;
  unsigned capacity;
  char str[]; // C99 flexible array member
};
The anonymous union in the struct zvalue is not valid C99, but gcc
accepts it, giving a warning with -Wpedantic. I may fix this, but
maybe it just clutters up the code with extra .z characters for no
really good reason. This is the only non-standard aspect of wak code,
as far as I know. With all gcc warnings turned on, it compiles with
no other warnings.
The stack, several compiler tables, and also the actual value
structures of the arrays are held in expanding sequential list
structures called zlist:
// zlist: expanding sequential list
struct zlist {
  char *base, *limit, *avail;
  size_t size;
};
The size member holds the size of each entry in the list (also
referred to as slots). base points to the base of the list; limit
points to just past the end of the list, and avail points to the
first unused slot. The list is reallocated to be 50% larger as needed
as it fills up.
The map (awk array) structure is the zmap hash table:
// zmap: Mapping data type for arrays; a hash table. Values in
// hash are either 0 (unused), -1 (marked deleted), or one plus
// the number of the zmap slot containing a key/value pair. The
// zlist slot entries are numbered from 0 to count-1, so need to
// add one to distinguish from unused. The probe sequence is
// borrowed from Python dict, using the "perturb" idea to mix in
// upper bits of the original hash value.
struct zmap {
  unsigned mask;     // tablesize - 1; tablesize is 2 ** n
  int *hash;         // (mask + 1) elements
  int limit;         // 80% of table size ((mask+1)*8/10)
  int count;         // number of occupied slots in hash
  int deleted;       // number of deleted slots
  struct zlist slot; // expanding list of zmap_slot elements
};
The zlist slot member holds the actual values of the hash table:
// Elements of the hash table (key/value pairs)
struct zmap_slot {
  int hash; // store hash key to speed hash table expansion
  struct zstring *key;
  struct zvalue val;
};
The zmap hash member points to 2**n int values that in turn index the
zmap_slot structs, each of which holds a zvalue, a key as a zstring,
and its hash (to speed rehashing, a la Python). The hash table grows
by doubling as needed, starting at 8 slots, and uses a probe sequence
borrowed from Python: (*probe = (*probe * 5 + 1 + (perturb >>=
PSHIFT)) & m->mask;). See comments in the Python dict source code for
details; wak uses a simplified version of this approach.
<https://github.com/python/cpython/blob/main/Objects/dictobject.c>
Parsing awk is tricky
=====================
What is the grammar and meaning of awk code? Maybe a yacc expert
could know.
If you only know what the Awk book (by A, K, & W, 1st or 2nd Ed.)
says, you'd be missing some details. For example, the Awk book and
many other documents say a for statement is:
for (expression [1]; expression [2]; expression [3])
    statement
(or any of the three expressions can be empty).
But try this: awk 'BEGIN{for (; x++ < 3; print x);}' (using gawk,
nawk, nnawk, or goawk; mawk and bbawk (busybox awk) fail.)
This works because expression [1] and expression [3] are actually
simple_statement which can be an expression, delete statement, or
print statement. (This is in both the yacc grammar for nawk/nnawk and
in the POSIX reference grammar.) Why? Maybe no one knows. Maybe it
somehow made the syntax better for yacc. I asked gawk maintainer
Arnold Robbins about this, and he suggested
> I can imagine something like
>
>     for (printf("Enter data: ");
>         getline data > 0;
>         printf("Enter data: ")) {
>         ## stuff with data
>     }
But even if it could be useful, very few users would ever know about
it because it's undocumented outside the formal grammar.
A few interesting cases
-----------------------
Case 1:
Consider this: awk 'BEGIN{if (1 < 2 && x = 3) print x}'. According to
the POSIX grammar this should be a syntax error, but it prints 3.
<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_16>
The spec says:
> This grammar has several ambiguities that shall be resolved as
> follows: Operator precedence and associativity shall be as
> described in Expressions in Decreasing Precedence in awk.
Assignment = has lower precedence than &&, which has lower precedence
than <. The statement should parse as: if (((1 < 2) && x) = 3), and
the expression to the left of = must be an lvalue, which it clearly
is not. (The additional parentheses are notional; no parentheses are
allowed around an lvalue.) The rules are similar in C, and a
statement like this is a syntax error in C.
Case 2:
Another odd one, adapted from gawk test getline.awk: running this in
bash (as e.g. ./getline_test1.sh gawk):
#!/usr/bin/bash
awk=$1
echo -e 'A\nB\nC' | $awk 'BEGIN {
    x = y = "!"
    a = (getline x y); print a, x
    a = (getline x + 1); print a, x
    a = (getline x - 2); print a, x
}'
with any of gawk, nawk, nnawk, goawk, bbawk gives the same result:
1! A
2 B
-1 C
But running this (as getline_test2.awk):
BEGIN {
    x = y = "!"
    cmd = "echo A"; a = (cmd | getline x y); close(cmd); print a, x
    cmd = "echo B"; a = (cmd | getline x + 1); close(cmd); print a, x
    cmd = "echo C"; a = (cmd | getline x - 2); close(cmd); print a, x
}
gives the same result as above in gawk, mawk, and bbawk, but goawk
gets a parsing error on the "echo A" line, and nawk/nnawk gives:
1! A
11 B
1-2 C
This seems really odd to me. The original awk (nawk/nnawk -
Kernighan's OneTrueAwk) apparently parses (cmd | getline x y) as
((cmd | getline x) y), but parses (cmd | getline x + 1) as
((cmd | getline x) (+ 1)) and (cmd | getline x - 2) as
((cmd | getline x) (- 2)). (That is, the y, the + 1, and the - 2 are
treated as strings concatenated with the result of
(cmd | getline x).) This is despite the fact that the right side of a
concatenation excludes an expression beginning with + or -, according
to POSIX, as it must because it otherwise makes a + b ambiguous as to
whether it's (a + b) or (a) (+b).
Case 3:
BEGIN {$0 = "2 3 4"; $$0++; print} prints 3, except bbawk says
"Unexpected token", and my awk as of this writing prints 2 4 4.
Which is "correct"? Apparently nawk, gawk, mawk, and goawk work as
follows (bbawk gets a syntax error): in $$0++, the $0 is "2 3 4" but
the ++ forces it numeric, as 2, so the post-increment applies to $0
which sets the entire record to 3. The subsequent outer $ selects
field 3 of $0, which does not exist, so the value of $$0++ is the
empty string.
My awk tries to obey the precedence rules in POSIX, which have $ at
higher precedence than ++, so evaluates $$0 first as $2, selecting
the second field, and then the ++ increments it to 4.
But wait, there's more! Try BEGIN {a = 2; b[1] = 2; $0 = "33 44";
print $a; $a++; print; print $b[1]; $b[1]++; print b[1]; print} In
gawk, mawk, goawk, bbawk, and my awk, it prints
44
33 45
45
2
33 46
But in nawk/nnawk, it prints
44
33 45
45
3
33 45
What's happening in nawk/nnawk is: $a++ increments the field $a (i.e.
$2), but $b[1]++ increments the array value b[1] and does not change
the field. All the other awks increment field 2 in both cases, as one
might expect. Once again, the original One True Awk's behavior seems
inconsistent and not easily discernible from the yacc grammar, nor
from the documentation.
Case 4:
BEGIN {$0="3 4 5 6 7 8 9"; a=3; print $$a++++; print} prints
7
3 4 6 6 8 8 9
except bbawk and my awk currently indicate a syntax error.
Similar to case 3 above; $a selects field 3, which is 5, increments
it, but then $$a++ is a reference to field 5, which is 7, and that is
printed as the value of $$a++++, but then that field is incremented,
giving $0 the value "3 4 6 6 8 8 9".
My awk (and apparently bbawk?) parses $$a++++ as if it were
(($($a))++)++ and rejects the outer ++ because (($($a))++) is a value,
not an lvalue. The parentheses here are again notional; an lvalue
cannot be in parentheses.
Case 5:
BEGIN {a[y] = 1; x = y in a + 2; print x} prints:
* an empty line in bbawk
* 3 in wak
* 12 in nnawk/nawk (!)
* a syntax error message at `+' in gawk, mawk, goawk
Apparently nnawk/nawk treats the + 2 as a string to be concatenated
with the 1 from evaluating y in a as true.
What causes these peculiarities?
Aho, Weinberger, and Kernighan have written that
> The development of awk was significantly shortened by using UNIX
> tools. The grammar is specified with yacc; the lexical analysis is
> done by lex. Using these tools made it easy to vary the syntax of
> the language during development.
<https://awk.dev/awk.spe.pdf>
In The Unix Programming Environment, Kernighan and Rob Pike wrote
(footnote, p. 254):
> The yacc message "reduce/reduce conflict" indicates a serious
> problem, more often the symptom of an outright error in the grammar
> than an intentional ambiguity.
<https://scis.uohyd.ac.in/~apcs/itw/UNIXProgrammingEnvironment.pdf>
Compiling One True Awk (nawk/nnawk) gets 85 reduce/reduce conflicts
in addition to 44 shift/reduce conflicts.
The original yacc grammar for awk has many ambiguities. I don't know
much about yacc yet, but I understand it's basically an LALR(1)
parser generator, but with additional features to resolve what would
otherwise be ambiguities. I suspect that these features make it hard
to understand exactly what the parser does if you are not a yacc
expert. The actual awk language is determined by what the parser
accepts and what it does with that. My guess is that there may not be
any true LALR(1) grammar for awk, maybe no LR(1) grammar, and
probably no LL(1) grammar either. This makes writing a
recursive-descent parser by hand somewhat difficult.
The POSIX grammar is in a yacc-like format and was obviously written
with some reference to the original yacc grammar for awk. I can say
that with some confidence because it includes the for
(simple_statement ...) business discussed above, which one would
never know without studying the original yacc grammar. But as shown
above, existing implementations do not always follow POSIX grammar,
and when they differ they usually follow the original awk behavior.
I asked Arnold Robbins (gawk maintainer and contributor to One True
Awk) about case 1. He replied "The answer, as unsatisfying as it may
be, is that Awk has always worked this way." Also, regarding case 1
above, Ben Hoyt had the same issue in goawk. He resolved it with the
commit that included this comment:
> The other awks support this by using a yacc grammar which supports
> backtracking, and as Vitus13 said on reddit: "If there are two
> syntactically valid parsings and one is a semantic error, the error
> handling may resolve the ambiguity towards the valid parsing. In
> this case, you can only assign to L values, so trying to assign to
> (1&&x) doesn't make any sense."
I believe this is a misconception; yacc does not support backtracking
(though there is a different program, btyacc, that does). I think Ben
and Vitus13 are wrong here; yacc does not see multiple syntactically
valid parsings, it sees only the parsing that it performs
deterministically, and many of the problems understanding what
nawk/nnawk does are due to the way yacc resolves conflicts.
Parsing awk
===========
I wanted to try parsing awk using recursive descent along with
Pratt's method of handling expression syntax. Pure recursive descent
without backtracking requires a grammar meeting certain requirements,
specifically LL(1). The reference grammar for POSIX awk does not seem
to meet this, and I am doubtful a correct LL(1) awk grammar exists.
There are some difficulties discussed below, so a bit of finagling in
the parser is needed.
If you are not familiar with Pratt parsing, there are several
tutorials and discussions around on the Web. Vaughan Pratt introduced
the idea in a 1973 paper, then Douglas Crockford sort of rediscovered
and popularized it in the first paper linked below.
<https://crockford.com/javascript/tdop/tdop.html>
<https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/>
<https://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing>
<https://dl.acm.org/doi/10.1145/512927.512931>
The basic idea is that instead of having a separate routine to parse
each operator or precedence level as in pure recursive descent, you
have a main "expression" routine with a loop to run through the
tokens of the expression that calls "nud" and "led" routines for each
operator, and those routines can recursively call the main expression
routine to process subexpressions with the same or higher operator
precedences.
The "led" and "nud" (left denotation and null denotation, in Pratt's
terms) routines are for operators that take a left operand (e.g.
infix and postfix operators) or no left operand (prefix operators),
respectively.
I was blocked for a time by the concatenation operator being present
by its absence. That is, in awk, there is no explicit operator for
string concatenation; it happens when two expressions appear adjacent
with no operator between them. I figured I had to determine it and
insert it by looking for places where a symbol that can end an
expression is followed by a symbol that can start an expression. But
the ++ and -- operators threw me. They can either end or begin an
expression. So I could not see how to have ++ and -- as both prefix
and postfix operators, that is, handled by both the "nud" and "led"
routines.
Then I saw that if they followed an lvalue they were always taken as
postfix. I modified my "nud" routine, which was handling prefix
operators, to also consume ++ / -- operators if they followed an
lvalue.
I also had my main expression routine expr() looking for the start of
an expression where it could be the right side of a concatenation.
Note that the right side cannot start with + or -, or there will be
confusion between a + b and a (concatenated with) + b. If you look at
the POSIX grammar for awk, you'll see a long list of nearly identical
productions for unary_expr and non_unary_expr, that are needed solely
to exclude the right side of a concatenation beginning with unary +
or -. There is nothing like that in the yacc grammar for
nawk/nnawk/OneTrueAwk, and I'm not clear on how yacc handles this.
<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_16>
But now my "nud" routine was processing not only prefix operators and
plain variables, but also subscripted variables (array references),
as well as postfix ++ and --. By the time I got it working, the "nud"
routine was also handling the $ field reference operator and function
calls, and the parenthesized expression list normally seen with the
array-subscript test expression (expr, expr,...) in arrayname. The
supposed Pratt parser was looking more similar to a precedence
climbing parser, though it still has a Pratt-like main loop routine
for expressions and still uses "left binding power" and "right
binding power" concepts from Pratt parsing. Consequently, I renamed
the nud function to primary() and led function to binary_op().
<https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing>
<https://eli.thegreenplace.net/2012/08/02/parsing-expressions-by-precedence-climbing>
I hacked code into the expression routine expr() to handle the if (1
< 2 && x = 3) case mentioned in the previous section. (This is also
an issue for some similar cases, such as 1 && x = 3 and 2 < x = 3.)
It's not elegant, but it seems to work correctly.
Statement termination
---------------------
Another feature of awk is that statements normally end at the end of
a line and do not require a semicolon to end the statement. Some
constructs require a semicolon in certain places: in if (condition)
statement [1]; else statement [2], where the else appears on the same
line as statement [1] and statement [1] is a single statement (i.e.
not a compound statement in {braces}), and in do statement; while
(condition), where the while appears on the same line as the statement
and the statement is not enclosed in braces. Multiple statements on
the same line must be separated with semicolons.
The awk(1) man page says "Statements are terminated by semicolons,
newlines or right braces." Actually, all statements must be
terminated with a newline or semicolon, except the last statement
before a closing right brace in brace-enclosed statement list. The
POSIX grammar has complications due to this, in that it has
productions for terminated_statement_list,
unterminated_statement_list, terminated_statement,
unterminated_statement, and terminatable_statement.
The nawk/nnawk grammar avoids all this, via a lexical trick. Since it
is never wrong to put a semicolon before a closing brace, and it
cannot change the meaning of a correct program, the nawk/nnawk
lexical analyzer returns a faked semicolon before each closing brace.
Then it can use a much simpler grammar structure. But you can detect
this trick, because it can also make an incorrect program correct!
Try awk 'BEGIN{while(n-->0);}' with any awk and it will do nothing,
silently. But awk 'BEGIN{while(n-->0)}' (note the missing semicolon,
i.e. empty statement) works in nawk/nnawk but gets a syntax error in
gawk, goawk, bbawk, and wak. (It also works in mawk, indicating that
mawk uses the same lexical hack.)
I initially intended to do the same, but it seemed too hacky.
Instead, when I encounter an else or the while of a do ... while
statement, I look at the previous token to see if it's any of
newline, semicolon, or closing brace. This has the equivalent effect
and also avoids overly complicating the grammar and parsing logic.
Some other parsing-related problems
-----------------------------------
print and printf:
The print and printf statements have some issues. These can be
followed by one or more expressions, which may or may not be enclosed
in parentheses. So, you may encounter
print          # by itself, means print $0, the entire record
print 3, 4     # prints 3 4    # case 2
print(3, 4)    # same effect   # case 3
print(1+2), 4  # same effect   # case 4
When all the parser has seen is print ( it cannot know if it's case 3
or 4 above, so it cannot know whether to look for a single expression
(case 4) or a list of expressions (case 3). Also, when it sees a
parenthesized list of expressions (case 3), it still cannot decide
whether it's a list to be printed or the beginning of a subscripted
in expression, as in print(3, 4) in a.
Any of the above print statements can be followed by redirection, as
print 3, 4 > "some_file". This would make print "def" > "abc"
ambiguous, so the language spec says that any comparison operator in
a print statement must be inside parentheses, as in
print("def" > "abc"), or print("def" > "abc"), 123 > "outfile".
Actually, only the original nawk/nnawk enforces that against
operators other than >. Other awks allow
print "abc" < "def" > "outfile", for example.
I had a nasty kluge of switches and state variables in place to deal
with all this, but now have it simplified somewhat. In the getlbp()
function (that returns the Pratt left binding power), I have logic to
say if the parser is inside a print/printf statement, and not inside
parentheses, and has a > token, return 0, which will end any
expression. Then the > will be correctly treated as a redirection
operator.
The primary() function returns the number of expressions in a
parentheses-enclosed expression list, or 0, or -1 for a potential
lvalue (used to help with the (1 && a = 2) problem mentioned
earlier).
The print_stmt() function calls the expression parser as
expr(CALLED_BY_PRINT), where CALLED_BY_PRINT is a "magic" value that
flags the expr() function to see if the initial primary() call
returns a statement list (>= 2) followed immediately by a token that
can end a print statement (`>', `>>', `|', `;', `}', or newline). If
so, it returns the expression count from primary() to print_stmt(),
otherwise it continues parsing what is expected to be the first (or
possibly only) expression for the print statement.
I believe this correctly handles the print/printf statement.
Data types and scope
--------------------
There are two main data types in awk: scalar and array, maybe three
if you count regex literals (/regular expression here/) as a separate
type. Scalars can be numeric (always floating point), string, or
both, which might be considered sub-types. You may consider some
special values as numeric string (defined in POSIX) or uninitialized
as sub-types as well. Arrays are associative, "subscripted" by string
values. I often refer to them as maps; they are implemented with hash
tables in wak.
The scope rules are pretty simple. Function formal parameters
(variables supplied in the function definition
function _name_(param1, param2,...)) are local to the function, as
far as name scoping is considered. All other variables are global.
When a function is called with a scalar variable, the call is by
value and changes made to it within the function have no effect
outside the function. But when a function is called with an array
variable, changes made to the array are visible outside the function.
A function can be called with fewer arguments than it has defined as
formal parameters. The "extra" parameters will then be uninitialized
local variables inside the function, and this is the only mechanism
for defining local variables in a function.
This presents some problems for a one-pass parser. When the parser
first encounters a variable, the type (scalar or array) is noted
depending on whether or not it is followed by a subscript, or if it
appears after the in operator.
With this: BEGIN{f(a); g(a)} a one-pass parser can't know if a is
scalar or array. If it's an array, it needs to be set up before
calling f(a) so awk can pass the array to g().
In BEGIN{f(a); g(a)}; function f(x){x[1]=3}; function g(x){print x[1]},
goawk / gawk / mawk / nnawk (One True Awk) and bbawk all handle this
correctly.
But worse, it is possible for a function to be called with a scalar
sometimes and an array (as the same parameter) other times.
For example:
BEGIN{f(1); f(0)}
function f(a, b){if (a) g(b); else {h(b); i(b)}}
function g(x){print x}
function h(x){x[1] = 37}
function i(x){print x[1]}
Here, gawk / nnawk / bbawk all print an empty line and then 37; mawk
and goawk complain about type errors. POSIX says "The same name shall
not be used within the same scope both as a scalar variable and as an
array." But I don't see how that can be determined during compilation
of f(a, b) in a one-pass parser. I'm curious how this works inside
gawk, nnawk, and bbawk, but I'm not going to look.
In compiling f(a, b), g(b) will pass a scalar and h(b) and i(b) will
pass an array. In my parser, if a variable appears only as a "bare"
variable name in a function call, it is left "untyped" (actually gets
a ZF_MAYBEMAP flag). When it is pushed on the stack at runtime, an
empty map is attached to it. Then if it is used as a scalar, it is
converted to scalar and if it is used as an array it is converted to
an array.
Error recovery
--------------
I am using D.A. Turner's error recovery technique to handle syntax
errors and continue parsing.
Dick Grune's Annotated Bibliography of papers on parsing includes
this reference for Turner's method:
> Turner, D. A. Error diagnosis and recovery in one pass compilers.
> Inform. Process. Lett., 6:113-115, 1977. Proposes an extremely
> simple(minded) error recovery method for recursive descent parsers:
> when an error occurs, the parser enters a recovering state. While
> in this recovering state, error messages are inhibited. Apart from
> that, the parser proceeds until it requires a definite symbol.
> Then, symbols are skipped until this symbol is found or the end of
> the input is reached. Because this method can result in a lot of
> skipping, some fine-tuning can be applied.
<
https://dickgrune.com/Books/PTAPG_2nd_Edition/CompleteList.pdf>
Code generation
---------------
Code is generated directly in the parser, using functions gencd() and
gen2cd(). These take one or two arguments, an opcode and (for
gen2cd()) an operand. These are integers that are simply appended to
the zcode zlist, which expands as needed. The opcodes are usually the
same as the tokens representing the program element involved, for
example, tkmul to multiply the top two values on the stack and
replace them with the product. There are also about 25 additional
opcodes that don't correspond to an awk token, such as oppush, opdrop
(stack manipulation), opjump, opjumptrue (unconditional and
conditional jump in the zcode program), etc. The zcode machine and
its operations are kept simple and no attempt is made to do code
optimization, in order to keep the implementation small.
Runtime internals
=================
Much more needed here...
Stack
-----
The runtime stack is a zlist of zvalue structs. All variables and
temporary values are stored on the stack. The "special variables"
(FS, NF, RS, CONVFMT, etc.) are stored at the bottom of the stack.
Program globals are stored above that. As the program runs, values
are pushed, popped, operated on, etc. Function call arguments and
other related information, including variables local to functions,
are put on the stack as well, as discussed below.
In earlier versions, the zlist_append() function was called for every
stack push. This ensured that the stack would never overflow unless
memory is exhausted, but profiling showed that this was expensive.
Now, a healthy margin of stack is checked (only) at every function
call, where the stack will be expanded as needed to ensure the margin
is maintained. Nearly all awk statements will be "stack neutral", so
the stack is at the same point before and after a statement except
for function calls. One exception is for (index in array)..., which
maintains some array iteration information on the stack inside the
for loop. The margin MIN_STACK_LEFT (currently 1024) ensures that
only a pathological statement will be able to overflow the stack, but
it is possible. It is possible to analyze the maximum stack usage for
each statement during compilation, and avoid any possibility of
overflow, but I believe it is not worth the added complexity.
Function definition, call, return, stack frame design
-----------------------------------------------------
A function definition function f(a, b, c,...) { ... } generates:
tkfunction function_number
(code for function body)
tknumber uninitvalue
tkreturn number_of_params
As each parameter is parsed, it is added to a table of local
variables for this function. The tkreturn is added to return an
uninitialized value if the code falls off the end of the function.
When a return keyword is encountered, the expression following it is
compiled and its value is left on the stack. If no expression follows
the return, then a tknumber 0 is compiled to push a zero on the stack.
A function call f(a, b, c,...) generates:
opprepcall function_number
(code to push args)
tkfunc number_of_args
At runtime, these work as follows:
The runtime keeps an index into the stack called parmbase (a local
variable of the main interpreter function interpx(), initially 0)
that points into the current call stack frame as follows:
return_value      parmbase-4
return_addr       parmbase-3
prev_parmbase     parmbase-2
arg_count         parmbase-1
function_number   parmbase
arg1              parmbase+1
arg2              parmbase+2
...
A function call executes as follows:
opprepcall pushes placeholder 0 values for the return value, return
address, previous parmbase, and argument count. Then it pushes the
function number (from the opprepcall function_number code sequence).
The arguments are pushed onto the stack after that.
tkfunc retrieves the number of arguments (arg count) from the tkfunc
number_of_args code sequence, calculates where the parmbase should be
by subtracting the arg count from the current stack top index, then
fills in the return address (offset of next zcode word in the zcode
table) and the arg count in the stack frame. Finally, the tkfunc op
pushes the arg count on the stack and sets the ip (zcode instruction
pointer) to go to the tkfunction op of the function definition.
tkfunction finds the local variable table for the function and gets
the number of parameters. It then pops the arg count (number of
actual arguments) from the stack, calculates the new parmbase (stack
top index minus arg count), stores the previous parmbase in the stack
frame, and sets parmbase to the new parmbase. Next, it loops to drop
excess calling arguments if more args have been supplied in the call
than there are params defined for the function. (NOTE: This is an
error and should be caught at compile time! FIXME) Then, if the
number of supplied args is less than the number of defined params,
additional "args" are pushed to be used as local variables by the
function. This is where the "maybe map" variables may have to be
created, as explained in "Parsing awk".
When the function returns, either via a return keyword or "falling
off the end" of the function (where a tkreturn op has been compiled),
the tkreturn op picks up the param count (from the tkreturn
number_of_params code sequence) [NOTE this is unused; remove it?
FIXME], gets the arg count from the stack frame, and copies the
return value from the stack into the return value slot in the stack
frame, and drops the return value from the stack. At this point, the
stack should have on it just the arguments, including the locals
created beyond the args supplied. The tkreturn op loops through the
locals not supplied by the caller, releasing any map (array) data and
dropping the local "arg" from the stack. Now, only the args actually
supplied by the caller remain on the stack, and these are dropped.
Finally, the ip instruction pointer is set from the return address in
the stack frame, the previous parmbase value is restored from the
stack frame, and the execution continues with the next instruction.
Implementing regex RS
=====================
The POSIX spec (and the original AWK book) only accepts a single
character as the record separator, the default being <newline>. And
POSIX says "If RS contains more than one character, the results are
unspecified." I noticed that all the awk implementations I have
available have implemented the capability of using a regular
expression for RS, the record separator string. This includes nawk
(the original Kernighan "One True Awk" in the 2022 version) and nnawk
(the latest version), gawk, mawk, goawk, and Busybox awk (bbawk). I
believe mawk was the first to implement this feature; it was added to
nawk in 2019. Maybe it will be in the next release of the POSIX spec
for awk.
So I decided to try adding it to my version of awk. This proved a bit
trickier than it looked at first.
I had noticed that Busybox awk fails the rebuf.awk test in the gawk
test set. The problem in the bbawk output looked like the same sort
of problem that the rebuf.awk test (a bug reported in 2002)
originally revealed in gawk. Looking at the nature of the output,
where the regex RS="ti1\n(dwv,)?" was not being matched on some
records, causing "dwv," to appear spuriously in the output, I guessed
that the problem was that the data at the end of the buffer contained
only a partial match for the regex. That is, if the buffer ended with
"ti1\n", the RS would match, but the "dwv," would be at the start of
the next buffer. But when I tried to implement it, I found my code
was getting similar results. I had logic such that if I got a match
ending exactly at the end of the buffer, then expand the buffer, read
more data, and try again. The problem was that I could get a partial
match near the end of buffer, and have the same problem. For example,
if the buffer ends with "ti1\ndw", the RS will match the "ti1\n". So
I have to expand the buffer whenever the match is near the end of the
buffer. How near, though? After some experimentation, I settled
tentatively on a buffer with an initial size of 8192 bytes and a "margin"
1024 bytes from the end.
I wrote a program to stress test the awk versions on hand. I
generated 8500 lines of data that looks like:
1wxyz
2wxyyz
3wxyyyz
...
And a test:
BEGIN{ RS = "w(x[^z]*z\n)?" }
1
This did stress them. Correct behavior would be to print 8500 lines
containing just the numbers 1 through 8500.
Version   # lines printed   # initial lines correct
-------   ---------------   -----------------------
nnawk                8501                      8180
wak                 11048                      1170
mawk                 8639                       717
goawk                9078                       355
gawk                12007                        40
bbawk               16828                        25
nnawk did the best, running for 8180 lines before printing a couple
of anomalous lines, then ran OK for the rest of the input.
Testing awk
===========
I've been testing wak against the other awk implementations I've been
able to obtain. I have recent versions of nnawk (Kernighan's One True
Awk, the original Unix awk updated), gawk, mawk, goawk, and bbawk
(busybox awk).
As of this writing, the versions are:
* nnawk: awk version 20231228 (compiled 2024-01-23)
* gawk: GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1
(source 2023-11-02; compiled 2023-11-19)
* mawk: mawk 1.3.4 20231102 (compiled 2023-11-16)
* goawk: V1.25.0 (compiled 2024-01-23)
* bbawk: 2023-12-31 (compiled 2024-01-23)
I'm sure there are plenty of bugs. Brian Kernighan maintained the
original Unix awk from its start, almost 40 years ago, until recently;
its "FIXES" file has over 200 entries from 1987 to 2023, continued to
the present in a separate file for the "second edition" of One True
Awk. If Kernighan has been fixing bugs for 35+ years, I doubt I can
make a bug-free awk.
Some "bugs" are incompatible interpretations of awk compared with
what other implementations do with certain features. No two versions
of awk (original awk, gawk, mawk, goawk, Busybox awk, etc.) agree
completely.
Please report bugs to raygard at gmail.com.
Testing strategy
----------------
I have used the test files that come with existing awk
implementations, plus some I've written. The original One True Awk
comes with a folder testdir of about 315 files. Kernighan's
README.TESTS file says there are about 60 small tests p.* from the
first two chapters of The AWK Programming Language (1st ed.) that are
basic stuff; about 160 small tests t.* that are "a random sampling of
awk constructions collected over the years. Not organized, but they
touch almost everything."; about 20 tt.* files that are timing tests.
The testdir folder has also about 30 T.* files that are "more
systematic tests of specific language features", but unfortunately
these are shell scripts that can test a single awk program to see if
it computes correct output as compared with known good data built
into the scripts. This makes them difficult to use for comparing my
implementation against all the others in one pass, but I can run the
scripts on wak separately.
gawk also comes with a folder of about 1475 files, and most of these
are sets of foo.awk, foo.in, and foo.ok files. In each case, the
foo.awk file is run with foo.in input and the result can be compared
with foo.ok. Some are standalone tests that do not need an input
file, so there is sometimes no foo.in file.
I have a not-very-neat test driver test_awk.py that I can use to run
a batch of tests, such as all t.* in testdir, at one time against
several awk implementations, and see how they compare. In the case of
testdir's p.* and t.* files, they are intended to use certain input
files (test.countries and test.data), and the outputs are compared
via MD5 hashes. Each unique output is saved for later examination.
For the gawk-style tests, the program can compare the output against
the foo.ok file and give a pass/fail result. If there is a non-zero
exit code or an exception, that is noted on the test_awk.py output.
The output looks like this:
======= ======= ======= ======= ======= ======= =======
=versions=> gawk nnawk mawk goawk bbawk tbawk muwak
==== ==== ==== ==== ==== ==== ====
[...]
Test delarpm2.awk dd8e2e5 8841567 NNAWK c992867 gawk GOAWK GOAWK
Test dfacheck1.awk 03a19ad 0000000 NNAWK NNAWK gawk !3a19ad TBAWK
ERR: tbawk: awk: file tests/gawktests/dfacheck1.awk line 1:
warning: '\<' -- unknown regex escape
ERR: muwak: muwak: file tests/gawktests/dfacheck1.awk line 1:
warning: '\<' -- unknown regex escape
Test double1.awk 8c7dbdf 819e6db gawk 351564a d97b6e5 BBAWK BBAWK
Test double2.awk 4dbdb44 4941b67 gawk 7a665a2 acfcfe0 0124355 TBAWK
Test dtdgport.awk a916caa gawk gawk gawk !!00000 gawk gawk
RET: bbawk: 1
ERR: bbawk: awk: tests/gawktests/dtdgport.awk:37:
%*x formats are not supported
The hex values are the first 7 digits of the MD5 of the output file.
If the output is an empty file, the MD5 is replaced with all zeroes
to make it easier to spot. If any stderr output occurs, the first
digit is replaced with a `!'; if a non-zero exit code occurs, the
second digit is replaced with `!'. In either case, the stderr output
(ERR:) and exit code (RET:) are printed. These (non-pass-fail,
non-timing) reports always display a hash value of the output for the
first column (i.e. the first awk version tested). In subsequent
columns, if the hash is different from the first column, that hash is
listed; but if the hash matches a hash from a previous column, then
the awk version name of that column is listed, upper-cased if it
differs from the first column's hash.
So for example, for delarpm2.awk, nnawk has a different output from
gawk, mawk matches nnawk, goawk has yet another different output,
bbawk matches gawk, and both tbawk (toybox awk - my awk for toybox)
and muwak (my awk compiled with musl libc) match goawk. For
dfacheck1.awk, gawk gave some output, nnawk produced no output, mawk
and goawk also produced no output (matching nnawk), bbawk matched
gawk, tbawk produced the same output as gawk but had stderr output,
and muwak matched tbawk, including having stderr output.
The gawk tests were originally intended to be run via the supplied
Makefile, and some of them use special gawk options, environment
setup, etc., so that when the foo.awk file is run by test_awk.py it
may not produce correct foo.ok output even from gawk. Because of
this, I sifted the output from all the gawk tests against all the awk
versions into several parts and moved the tests into corresponding
folders: gawktests/allfail has tests that fail for all versions,
including gawk; gawktests/allpass has tests that pass for all
versions; gawktests/gawkonly has tests that pass for gawk and fail
for all others (usually because they use gawk-only features); and
gawktests has all the remaining tests.
I also wrote a shell script and awk script to sift the resulting test
output files into several categories. I usually have test results in
columns for gawk, nnawk, mawk, goawk, bbawk, my awk within toybox
(tbawk), and my awk standalone (may be compiled with ASAN sanitizer,
or with musl lib, or some other version). The order is significant
because I consider my result golden if it matches both gawk and
nnawk, still good if it matches gawk or nawk, less good if it matches
(only) mawk, goawk or bbawk.
If my awks (last two columns) differ, they go into a set_mismatch
file; that's usually a result of the difference in random generators,
or due to differences (bugs?) in the musl regex functions.
If any "allfail" tests do not all fail, or any "allpass" tests do not
all pass, or any "gawkonly" tests do not pass only for gawk, they go
into separate files.
If a test is pass/fail, then I put tests that my awk fails into a
set_fail file; if it passes and both gawk and nawk pass, it goes into
a set_good file; if it matches gawk or nawk it goes into a set_gawk
or set_nawk file, otherwise it goes into a general set_pass file (or
set_passx if it passed but had stderr output or non-zero exit code).
If it's not pass/fail, then if my result matches gawk and nawk, it
goes into a set_good file, else if it matches gawk it goes into the
set_gawk file; else if it matches nawk it goes into the set_nawk file,
else if it matches mawk, goawk, or bbawk it goes into a set_mawk,
set_goawk, or set_bbawk file respectively.
If it doesn't fit into any of those buckets, then it doesn't match
any other implementation, and goes into a set_odd file. The set_odd
file needs the closest scrutiny, as those are possible bugs in my
implementation, though some differ from gawk and/or nawk only in that
they have stderr output, usually warnings. Some others are due to
different implementations iterating over arrays in different orders.
(An annoyance for testing is that goawk doesn't usually traverse
arrays the same way on different runs due to Go's intentionally
random map iteration order, which apparently cannot be turned off.)
Currently, I have 30 tests in the set_odd category out of 1182 tests
run. I believe relatively few of these are actual bugs.
Here is an approximate breakdown of the current test results:
category          count
---------------   -----
set_good            736
set_gawk             87
set_nawk            158
set_mawk             50
set_goawk             8
set_bbawk            10
set_pass              2
set_passx            11
set_fail             47
set_badpass           2
set_badfail           2
set_badgawkonly       1
set_mismatch         38
set_odd              30
To put the 47 set_fail results in perspective, all but two of those
are also failed by at least one of gawk, nnawk, mawk, or goawk. Of
those two others, both are also failed by bbawk. Still, I would like
to make wak/toybox awk work on more of those cases as well.
Also, the tests need some cleanup; there is some overlap and
duplication.
From: <
https://www.raygard.net/awkdoc/>