MAR 6, 1989  8:34 PM  File AWK2.Rev  Page 1





        Three AWK Implementations for MS-DOS - How Do They Compare?
                 Copyright (c) 1989, by George A. Theall




      In the fall of 1988, I was introduced to a relatively unknown
 programming language named AWK.  Its main feature is undoubtedly the speed
 with which programs can be developed.  AWK has been available now on Unix
 systems for about 10 years but only recently crossed over to the MS-DOS
 world.  Despite this slow start, AWK's power, versatility, and flexibility
 should make it a hit for anyone who is serious about using their PC.


      When I first discovered AWK, it seemed perfectly suited to the type of
 data manipulation which much of my research work involves.  Two companies
 were then marketing implementations of AWK for MS-DOS: Mortice Kern Systems
 (MKS) and Polytron Corp.  Both claimed to have a complete implementation of
 AWK as described in _The AWK Programming Language_ by Aho, Kernighan, and
Weinberger, the language's developers.  I decided on MKS' product - it was
 the cheaper of the two, and support was available via electronic mail.


      At first, I was a bit uncomfortable with my decision: Although both
 companies have fine reputations, I had not seen any comparisons of the two
 AWKs.  Since then, I have worked with the implementations from MKS and
 Polytron as well as a non-commercial version written by Rob Duff.  While
 lacking a few of the features listed in the Aho, Kernighan, and Weinberger
 book, Duff's implementation may be distributed freely for non-commercial
 uses.  Yet through it all, my enthusiasm for the AWK language itself has
 not dimmed.


      Given my experiences, two questions come to mind: How do the three
 implementations differ? And more importantly, why spend roughly $100 for a
 commercial version when you can download Duff's AWK from a local BBS for
 just the cost of a phone call? (Of course, the second question applies only
 to non-commercial users.)


      To compare the three, I devised some programs based on tasks which I
 commonly perform in AWK.  Each processes three input files, constructed of
 lines such as:

      PD1:<MSDOS.APL>
         SAPLPC-A.ARC  208K  BINARY  04/02/88
         SAPLPC-B.ARC  225K  BINARY  04/02/88

      PD1:<MSDOS.ARC-LBR>
         ADIR103.ARC     8K  BINARY  05/24/87  5-col .ARC file ...
         ADIR140.ARC    10K  BINARY  02/05/88  Dave Rand's ARC ...

 The input files themselves differ only in their sizes, which, as reported
 by the word-counting task, are:


           112     936    7449 SMALL.FIL
          1006    7199   60711 MEDIUM.FIL
         10569   74685  631238 LARGE.FIL

 [Fields are: number of lines, number of words, number of characters, and
 file name.] The tasks do not purport to represent all of AWK's capabilities
 nor is there much justification for selecting them.  Nevertheless, they do
 point to some interesting differences.


      All tasks were run on an NEC PowerMate SX machine at 16MHz with DOS
 v3.30 and a fast (28ms) hard disk.  Roughly 575K of conventional RAM was
 available, and no TSR's had been installed.  The disk had been optimized so
 that fragmentation would not affect the results.  For each implementation,
 six tasks were performed on the input files, results were tabulated, and
 then the executable and output files were deleted from the disk.  Execution
 times were calculated with Brant Cheikes' TM utility, which rounds to the
 nearest second.  In this way, times are not subject to _human_ measurement
 inaccuracies.  Admittedly, rounding to the nearest second can produce some
 misleading results so care must be taken when interpreting the actual
 execution times.


      Results are presented in Table 1 below.  For each version and each
 task, three execution times are reported - the times required to process
 SMALL.FIL, MEDIUM.FIL, and LARGE.FIL respectively.  The actual AWK programs
 appear at the end of this document.

                   TABLE 1.  AWK Program Execution Times
                                (in seconds)

 ------------------------------------------------------------------------------
 Task                     MKS_AWK        MKS_AWKL       POLY_AWK       DUFF_AWK
 -------------            ---------      ---------      ---------      --------
 Record Counting          0/1/12         1/3/23         1/2/15         3/22/230
 Word Counting            1/5/44         1/6/53         1/6/42         4/35/359
 Line Numbering           2/6/58         1/7/67         2/7/59         5/30/305
 Regular Expressions      3/15/150       3/23/230       1/5/42         4/30/314
 Column sums              1/5/44         1/6/56         2/5/45         4/33/336
 Spelling                 5/*/*          8/71/1036      4/26/128       19/*/*
 ------------------------------------------------------------------------------
 * indicates the program ran out of memory.

 MKS_AWK and MKS_AWKL denote versions 2.3 of the small and large models from
 MKS; POLY_AWK refers to version 1.3 of Polytron's product; and DUFF_AWK
 represents version 2.10 of Duff's implementation.  [While MKS also supplies
 versions with 80x87 support for both memory models, I'm not able to test
 them: my machine does not have a math chip.]


      Note that while _actual_ execution times will vary from one situation
 or machine to another, _relative_ times are useful when making comparisons.
 The figures reported above are from a single run rather than averages of
 multiple runs.  I did perform three earlier sets of runs, with much the
 same results.  The problem with multiple runs is one of time: it takes
 about 1.5 hours for a single set of runs on the SX!


      Among the commercial products, there is no clear-cut leader.  For
 tasks using SMALL.FIL, execution times for the three implementations are
 all within a few seconds of each other so that any differences are probably
 due largely to TM's rounding to the nearest second.  Moving to MEDIUM.FIL,
 it becomes clear that POLY_AWK excels at handling regular expressions while
 MKS_AWK is, at best, only marginally faster at disk input (as measured by
 the first two tasks).  The comparative advantages become more accentuated
 with LARGE.FIL.  The amazing difference of 700% reported between POLY_AWK
 and MKS_AWKL for the spelling task is probably attributable to the former's
 speedy handling of regular expressions, used in gsub() to remove
 non-alphanumerics from the input stream.


      When compared with DUFF_AWK, though, the commercial implementations
 offer a clear performance advantage.  In every case, Duff's version turned
 in the slowest execution times.  These differences range from a low of
 around 30% (Regular Expressions; MKS_AWKL; MEDIUM.FIL) to a high of 2100%
 (Record Counting; MKS_AWK; MEDIUM.FIL), though these numbers should be
 taken with extreme caution.  Apparently, DUFF_AWK's performance is hobbled
 by poor disk I/O.


      In terms of how the language is implemented by each package, I did
 find some interesting differences while devising these tasks.  These arise
 because several areas of the language are left up to the implementors
 themselves and do not indicate any lack of compliance with the de facto
 standard of _The AWK Programming Language_.  [NB: The two versions from MKS
 differ only in execution speed and available storage area; therefore, what
 is said below about MKS_AWK applies to MKS_AWKL as well.]


      The most disturbing difference concerns the function printf() in
 DUFF_AWK: although the docs make no mention of it, printf("%d", i)
 correctly displays only integers in the range [-32768, 32767]! Note that
 this aberrant behaviour disappears if the floating-point format (%f) is
 used.
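
 Until that's fixed, the simplest workaround is "%.0f" - floating point
 with no digits after the decimal point - which prints any whole number
 intact.  A sketch (nothing here is specific to any one implementation):

```shell
# Workaround sketch for a 16-bit %d: print integers with "%.0f"
# so values outside [-32768, 32767] come out intact.
awk 'BEGIN { i = 100000; printf("%.0f\n", i) }'
# prints 100000
```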


      Also annoying is the treatment of associative array indices: both
 POLY_AWK and DUFF_AWK alphabetize them while MKS_AWK merely reverses them.
 This is a personal nit-pick since the standard clearly says treatment of
 indices is implementation-dependent.  Yet I often want to output them in
 the proper order - with the versions from Polytron and Rob Duff, it's
 basically impossible; with MKS', it's just a hassle.
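
 One workaround - a sketch of my own, not a feature of any of the three
 packages - is to copy the indices into an ordinary numerically-indexed
 array and sort that copy yourself before printing:

```shell
# Copy the indices of an associative array into idx[1..n], then
# insertion-sort idx so the output order no longer depends on how
# the implementation happens to store the indices.
awk 'BEGIN {
     Words["cherry"] = 3; Words["apple"] = 1; Words["banana"] = 2
     n = 0
     for (w in Words)               # gather the indices ...
          idx[++n] = w
     for (i = 2; i <= n; i++) {     # ... insertion-sort them ...
          v = idx[i]
          for (j = i - 1; j >= 1 && idx[j] > v; j--)
               idx[j + 1] = idx[j]
          idx[j + 1] = v
     }
     for (i = 1; i <= n; i++)       # ... and print them in order
          print idx[i], Words[idx[i]]
}'
```

 For large arrays an insertion sort is slow, but for the typical AWK
 program the simplicity is worth it.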


      The maximum size of any single record varies from one implementation
 to another: 1024 for DUFF_AWK, 2048 for MKS_AWK, and a whopping 32000 for
 POLY_AWK! [NB: The documentation for MKS claims the limit is 1024, but my
 experience shows it's actually 2K.] For each, records exceeding the limit
 are simply split into several smaller ones.  I'm currently thinking about
 devising a free-form database in which the records span multiple lines and
 can easily envision a record taking up 1, perhaps 2K.
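
 Such a database would use the multi-line record mechanism from the AWK
 book - set RS to the empty string so blank lines separate records, and FS
 to newline so each line becomes a field - and that is exactly where these
 limits start to bite.  A quick sketch (the record layout is invented for
 illustration):

```shell
# Multi-line records: RS = "" makes each blank-line-separated
# paragraph one record; FS = "\n" makes each line one field.  An
# entire record must still fit within the implementation's limit.
printf 'name: Smith\ncity: Phila\n\nname: Jones\ncity: Boston\n' |
awk 'BEGIN { RS = ""; FS = "\n" }
     { print NR ": " $1 }'
```

 This prints "1: name: Smith" and "2: name: Jones" - one line per record,
 however many lines each record spans in the file.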


      It's also worth mentioning that MKS_AWK, unlike POLY_AWK and DUFF_AWK,
 does not regard ^Z as an end-of-file marker; whether or not it should is
 unclear.  This caused some initial consternation when I was comparing the
 results of the word-counting task, but otherwise seems of little import.
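
 If the stray ^Z does matter, it's easy enough to strip within the program
 itself - a sketch, using the string escape "\032" (octal for Ctrl-Z):

```shell
# Delete a trailing Ctrl-Z (octal \032) from a record so counts
# agree with implementations that stop reading at ^Z.
printf 'last line\032' | awk '{ sub("\032$", ""); print length($0) }'
# prints 9
```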


      Documentation for each version is sparse, but this is because all
 follow the standard closely.  MKS supplies an additional reference manual
 which describes not only AWK but several other utilities included in its
 package.  I've found one apparent mistake (about maximum record size) and
 another omission (concerning placement of temporary files), but overall
 it's quite adequate.  Polytron's documentation consists of a single README
 file.  It mentions first those examples in the AWK book which don't work
 because of shortcomings in MS-DOS, then Polytron's own extensions to the
 language.  If you use AWK just on MS-DOS machines, you'll appreciate the
 extensions; otherwise, you'll likely be bothered by portability problems.
 As for DUFF_AWK, there are two primary documentation files: one, a 1978
 abstract by Aho, Kernighan, and Weinberger describing the language; the
 other, a unix-style man page.  For those new to AWK, Rob Duff has also
 included a **large** collection of sample programs in his distribution;
 you'd do well to get a hold of it if only for these examples.


      What conclusions can be made from these comparisons? Well, as the old
 adage says, "you don't get something for nothing"; i.e., choosing Duff's
 implementation over a commercial version will save you money initially, but
 every time you use it you'll "pay" a price in terms of slower performance.
 Whether this is worth $100 depends on _your_ particular situation.  On one
 hand, if you're interested in learning about AWK, use it only infrequently,
 or process small files (and don't work in a commercial environment) you'll
 probably be quite happy with Duff's implementation.  On the other, if you
 can justify spending the money, you'll face a tough choice between the
 offerings from MKS and Polytron, with Polytron's version looking slightly
 better.  Nevertheless, all three implementations are well worth your
 consideration, and I have no qualms about recommending them as effective
 tools for users of DOS-based computers.


      Disclaimer: Apart from being a satisfied owner of Mortice Kern
 Systems' AWK and Polytron's PolyShell, I have no direct connection with the
 companies mentioned above.


      If you have any comments about this article, or the AWK language in
 general, please get in touch.  For those with email access, I can be
 reached as GTHEALL@PENNDRLS (BITNET) or [email protected] (ARPA
 Internet).  Otherwise, give me a call at 215-898-6741.



                        Tasks Used in Comparisons



 1. Record-counting task
      END {print NR}


 2. Word-counting task
      FName != FILENAME {
           if (FName)
                printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
           FName = FILENAME
           cc = wc = lc = 0
      }

      {
           cc += length($0) + 1     # don't forget LF!
           wc += NF
           lc ++
      }

      END {
           printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
      }


 3. Line-numbering task
      {print NR ": ", $0}


 4. Regular-expressions task
      /test|Test|TEST/


 5. Column-sums task
      $2 ~ /[0-9]+K$/ {
           sub(/K/, "", $2)         # strip the trailing K ...
           sum += $2                #    ... then add the size itself
      }

      END {
           printf("Sum of 2nd column: %5dK\n", sum)
      }


 6. Spelling task
      # List words occurring only once in a document.
      #    A "word" is defined as a sequence of alphanumerics
      #    or underscores.

      # Scan thru each line and compute word frequencies.
      # The associative array Words[] holds these frequencies.
      {
           # replace non-alphanumerics with blanks throughout line
           gsub(/[^A-Za-z0-9_]/, " ")

           # count how many times each word used

           for (i = 1; i <= NF; i++)          # scan all fields ...
                Words[$i]++                   #    increment word count
      }

      # Print out infrequently-used words.
      END {
           for (w in Words)                   # scan over all words ...
           if (Words[w] == 1)            #    if word appears only once ...
                print w                  #         print it
      }