(2023-04-29) AWK is underrated, even in the POSIX variant
---------------------------------------------------------
I want to take a little break from writing new stuff. But still, there is one
thing that bothers me a lot. Whenever I search for information on how to do
this or that with AWK, especially on StackOverflow-like forums, I constantly
stumble upon "solutions" using Bash, Coreutils, sed, Perl and even Python or
Ruby. Anything but the AWK the question authors actually asked about. I don't
know, maybe forum know-it-alls think it's a kind of "XY problem" (which
bears a bag of bullshit on its own, but that's another topic) and whoever
asked the question chose the wrong tool for the job and the tool they offer
is better and so on, but damn! I'm fluent in Bash, Dash, Python 3.6+, JS
(from ES3 to ES6 and whatever was next), C89 and VTL-2, and as such, I have
a lot of options to choose from when writing new stuff, but I want to get
fluent in AWK as well. So, if I (hypothetically) ask about how to do
something in AWK, I want an answer about AWK, not about Bash or Python which
I already can write just about everything in, or about Perl which honestly
must already die. The know-it-alls can't even consider that someone might be
left with Busybox and nothing else, and wants to learn how to solve problems
with AWK alone for that very reason (it's the only proper programming
language available on some systems, and Busybox sed is much more limited
than GNU sed too), not because they don't know Perl or whatever.

This is why I have given up on trying to find answers on forums and turned to
the sole point of authority: POSIX.1-2017, 2018 edition ([1]). It has some
external links (e.g. for printf/sprintf format specifiers ([2]) or for
extended regular expressions format ([3])) but this is where everything
becomes crystal clear in terms of features we can use: anything not in there
is some non-standard extension. Compared to the real-life AWK versions I'm
using right now (Busybox and GAWK), I'm still missing bitwise operations
but, to be honest, they are not necessary everywhere and can be emulated
with normal integer arithmetic if required, although it would definitely be
slower. To make sure you're on the safe side (mostly), GAWK even has a
--posix (or -P) flag to turn on the POSIX compatibility mode. I say "mostly"
because no matter which options you set, different implementations handle
null bytes in strings differently, and POSIX states the behavior is
undefined in this case, so no one is to blame. For instance, in Busybox, you
can't have null bytes inside any string as they automatically truncate its
contents, while in GAWK they are handled normally even if you don't
explicitly pass the -b flag (treat all characters as raw bytes regardless of
locale). The POSIX specification is also missing GAWK's epic TCP/UDP socket
pseudo-filenames (starting with /inet) and bidirectional process
communication operator (|&). Yet, despite all this, I consider even the
standard AWK criminally underrated.
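
As a rough illustration of the emulation mentioned above, here is a minimal
sketch of bitwise AND and XOR built from division and modulo only. The names
bit_and, bit_xor and the bitwise_demo shell wrapper are placeholders made up
for this example, not anything from the standard, and the sketch assumes
non-negative integers small enough to stay exact in floating point:

```shell
# Hypothetical wrapper so the demo can be called by name.
bitwise_demo() {
    awk '
    # r and p are never passed by callers, so they act as locals.
    function bit_and(a, b,    r, p) {
        r = 0; p = 1
        while (a > 0 && b > 0) {
            if (a % 2 == 1 && b % 2 == 1) r += p
            a = int(a / 2); b = int(b / 2); p *= 2
        }
        return r
    }
    function bit_xor(a, b,    r, p) {
        r = 0; p = 1
        while (a > 0 || b > 0) {
            if (a % 2 != b % 2) r += p
            a = int(a / 2); b = int(b / 2); p *= 2
        }
        return r
    }
    BEGIN {
        print bit_and(12, 10)     # 1100 AND 1010 = 1000
        print bit_xor(12, 10)     # 1100 XOR 1010 = 0110
        print bit_and(255, 1039)  # low eight bits of 1039
    }'
}
bitwise_demo
```

Each call loops once per bit, which is where the slowdown compared to a
built-in extension comes from.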

Why? Well, think about how much programming around us really boils down to
processing text in one way or another. Rendering templates, parsing logs,
scraping web pages, collecting reports, emulating terminals, marshalling
objects between client and server, most popular client-server protocols and
APIs themselves... Not to mention how much smaller Bopher-NG could become if
rewritten in AWK, but first, it couldn't be called Bopher anymore, and second,
I don't have time for this effort for now. But you get the idea, right?
Any task involving text where using C is too tedious is a job for AWK
with its record- and field-oriented engine with extended regular expressions
available out of the box. And, if you really need it, basic math is already
there too, up to square roots, logarithms, sines, cosines and arctangents,
as well as your basic built-in PRNG with rand() and srand(). I don't really
know what prevented them from adding bitwise operations to the standard, but
it's already pretty functional for such a tiny package (and I already
mentioned that even Busybox AWK, which has them, is just under 3K SLOC long).
Of course, this tininess comes at the cost of some convenience: no way of
explicitly declaring variables as local (only implicitly, as extra function
parameters that callers don't pass), 1-based string indexing (as opposed to
C-like languages where 0-based indexing is commonplace), no multi-assignment
in the initializing clause of for loops (Busybox supports it, but even GAWK
doesn't), a single format for numbers (stored as floating-point, even when
explicitly truncated to integers with int()), a single format for arrays
(strictly associative, all keys cast to strings), but all these are minor
quirks compared to what this language is really capable of.
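
The quirks listed above are easier to see in a short program. This is only a
sketch: the three log lines, the repeat() helper and the field_demo shell
wrapper are invented for the example and stand in for whatever real input
you have.

```shell
# Hypothetical wrapper: feeds three made-up log lines to a POSIX AWK program.
field_demo() {
    printf '%s\n' 'GET /index.html 200' 'GET /missing 404' 'POST /form 200' |
    awk '
    # i and out are never passed by callers, so they act as locals.
    function repeat(s, n,    i, out) {
        out = ""
        for (i = 1; i <= n; i++) out = out s
        return out
    }
    # Runs once per record; field 3 is the status code.
    { hits[$3]++ }
    END {
        # Keys are strings: numeric 200 and "200" name the same element.
        printf "200: %d %s\n", hits[200], repeat("#", hits["200"])
        printf "404: %d %s\n", hits[404], repeat("#", hits["404"])
        print substr("AWK", 1, 1)   # indexing starts at 1
        print 7 / 2                 # numbers are floating-point
    }'
}
field_demo
```

With the sample input this should print "200: 2 ##", "404: 1 #", "A" and
"3.5", one per line. The keys are printed explicitly rather than with a
for-in loop because the iteration order of arrays is not specified.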

Another thing I'd like to mention is that the AWK specification, while getting
some minor updates to clarify things from time to time, has stayed like this
for a good 35 years or so, and this means that as long as you adhere to
POSIX, your programs will run on some ancient systems just as successfully
as on the current ones. Yes, you may struggle to replicate the behavior of
old C compilers and runtime libraries, you may find incompatibilities across
various versions of Perl (not to mention Bash, Lua and Python), you
might have issues with compiling J2ME or other old Java 2/3 code on OpenJDK
higher than 8 or running REXX on anything modern non-IBM, you can find your
entire JS code not working on KaiOS 2.x because of some ES6 feature not yet
present in Gecko 48 back then... but as long as you have an AWK there and an
AWK here and you're not using any non-standard extensions and null bytes in
your strings, you can be sure your program will be fully portable to any
standard-compatible implementation from 35 years ago and probably to one 35
years from now. And this is probably where the lack of big-market interest is
even somewhat good: no one is going to try to shove in fancy useless
"features" like OOP, template-based programming, decorators and other BS
that breaks all compatibility and makes the codebase even slower and much
bulkier.

And, as a good example of "don't try to fix what's not broken", AWK is
definitely worth learning and using as an everyday tool.

--- Luxferre ---

[1]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition
    /utilities/awk.html
[2]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition
    /basedefs/V1_chap05.html#tag_05
[3]: https://pubs.opengroup.org/onlinepubs/9699919799.2018edition
    /basedefs/V1_chap09.html#tag_09_04