MAR 6, 1989  8:34 PM  File AWK2.Rev  Page 1





        Three AWK Implementations for MS-DOS - How Do They Compare?
                 Copyright (c) 1989, by George A. Theall




      In the fall of 1988, I was introduced to a relatively unknown
 programming language named AWK.  Its main feature is undoubtedly the speed
 with which programs can be developed.  AWK has been available now on Unix
 systems for about 10 years but only recently crossed over to the MS-DOS
 world.  Despite this slow start, AWK's power, versatility, and flexibility
 should make it a hit for anyone who is serious about using their PC.


      When I first discovered AWK, it seemed perfectly suited to the type of
 data manipulation which much of my research work involves.  Two companies
 were then marketing implementations of AWK for MS-DOS: Mortice Kern Systems
 (MKS) and Polytron Corp.  Both claimed to have a complete implementation of
 AWK as described in _The AWK Programming Language_ by Aho, Kernighan, and
Weinberger, the language's developers.  I decided on MKS' product - it was
 the cheaper of the two, and support was available via electronic mail.


      At first, I was a bit uncomfortable with my decision: Although both
 companies have fine reputations, I had not seen any comparisons of the two
 AWKs.  Since then, I have worked with the implementations from MKS and
 Polytron as well as a non-commercial version written by Rob Duff.  While
 lacking a few of the features listed in the Aho, Kernighan, and Weinberger
 book, Duff's implementation may be distributed freely for non-commercial
 uses.  Yet through it all, my enthusiasm for the AWK language itself has
 not dimmed.


      Given my experiences, two questions come to mind: How do the three
 implementations differ? And more importantly, why spend roughly $100 for a
 commercial version when you can download Duff's AWK from a local BBS for
 just the cost of a phone call? (Of course, the second question applies only
 to non-commercial users.)


      To compare the three, I devised some programs based on tasks which I
 commonly perform in AWK.  Each processes three input files, constructed of
 lines such as:

      PD1:<MSDOS.APL>
         SAPLPC-A.ARC  208K  BINARY  04/02/88
         SAPLPC-B.ARC  225K  BINARY  04/02/88

      PD1:<MSDOS.ARC-LBR>
         ADIR103.ARC     8K  BINARY  05/24/87  5-col .ARC file ...
         ADIR140.ARC    10K  BINARY  02/05/88  Dave Rand's ARC ...

 The input files themselves differ only in their sizes, which, as reported
 by the word-counting task, are:


           112     936    7449 SMALL.FIL
          1006    7199   60711 MEDIUM.FIL
         10569   74685  631238 LARGE.FIL

 [Fields are: number of lines, number of words, number of characters, and
 file name.] The tasks do not purport to represent all of AWK's capabilities
 nor is there much justification for selecting them.  Nevertheless, they do
 point to some interesting differences.


      All tasks were run on an NEC PowerMate SX machine at 16MHz with DOS
 v3.30 and a fast (28ms) hard disk.  Roughly 575K of conventional RAM was
 available, and no TSR's had been installed.  The disk had been optimized so
 that fragmentation would not affect the results.  For each implementation,
 six tasks were performed on the input files, results were tabulated, and
 then the executable and output files were deleted from the disk.  Execution
 times were calculated with Brant Cheikes' TM utility, which rounds to the
 nearest second.  In this way, times are not subject to _human_ measurement
 inaccuracies.  Admittedly, rounding to the nearest second can produce some
 misleading results so care must be taken when interpreting the actual
 execution times.


      Results are presented in Table 1 below.  For each version and each
 task, three execution times are reported - the times required to process
 SMALL.FIL, MEDIUM.FIL, and LARGE.FIL respectively.  The actual AWK programs
 appear at the end of this document.

                   TABLE 1.  AWK Program Execution Times
                                (in seconds)

 ------------------------------------------------------------------------------
 Task                     MKS_AWK        MKS_AWKL       POLY_AWK       DUFF_AWK
 -------------            ---------      ---------      ---------      --------
 Record Counting          0/1/12         1/3/23         1/2/15         3/22/230
 Word Counting            1/5/44         1/6/53         1/6/42         4/35/359
 Line Numbering           2/6/58         1/7/67         2/7/59         5/30/305
 Regular Expressions      3/15/150       3/23/230       1/5/42         4/30/314
 Column sums              1/5/44         1/6/56         2/5/45         4/33/336
 Spelling                 5/*/*          8/71/1036      4/26/128       19/*/*
 ------------------------------------------------------------------------------
 * indicates the program ran out of memory.

 MKS_AWK and MKS_AWKL denote versions 2.3 of the small and large models from
 MKS; POLY_AWK refers to version 1.3 of Polytron's product; and DUFF_AWK
 represents version 2.10 of Duff's implementation.  [While MKS also supplies
 versions with 80x87 support for both memory models, I'm not able to test
 them: my machine does not have a math chip.]


      Note that while _actual_ execution times will vary from one situation
 or machine to another, _relative_ times are useful when making comparisons.
 The figures reported above are from a single run rather than averages of
 multiple runs.  I did perform three earlier sets of runs, with much the
 same results.  The problem with multiple runs is one of time: it takes
 about 1.5 hours for a single set of runs on the SX!


      Among the commercial products, there is no clear-cut leader.  For
 tasks using SMALL.FIL, execution times for the three implementations are
 all within a few seconds of each other so that any differences are probably
 due largely to TM's rounding to the nearest second.  Moving to MEDIUM.FIL,
 it becomes clear that POLY_AWK excels at handling regular expressions while
 MKS_AWK is, at best, only marginally faster at disk input (as measured by
 the first two tasks).  The comparative advantages become more accentuated
 with LARGE.FIL.  The amazing difference of 700% reported between POLY_AWK
 and MKS_AWKL for the spelling task is probably attributable to the former's
 speedy handling of regular expressions, used in gsub() to remove
 non-alphanumerics from the input stream.


      When compared with DUFF_AWK, though, the commercial implementations
 offer a clear performance advantage.  In every case, Duff's version turned
 in the slowest execution times.  These differences range from a low of
 around 30% (Regular Expressions; MKS_AWKL; MEDIUM.FIL) to a high of 2100%
 (Record Counting; MKS_AWK; MEDIUM.FIL), though these numbers should be
 taken with extreme caution.  Apparently, DUFF_AWK's performance is hobbled
 by poor disk I/O.


      In terms of how the language is implemented by each package, I did
 find some interesting differences while devising these tasks.  These arise
 because several areas of the language are left up to the implementors
 themselves and do not indicate any lack of compliance with the de facto
 standard of _The AWK Programming Language_.  [NB: The two versions from MKS
 differ only in execution speed and available storage area; therefore, what
 is said below about MKS_AWK applies to MKS_AWKL as well.]


      The most disturbing difference concerns the function printf() in
 DUFF_AWK: although the docs make no mention of it, printf("%d", i)
 correctly displays only integers in the range [-32768, 32767]! Note that
 this aberrant behaviour disappears if the floating-point format (%f) is
 used.
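
 Until that's fixed, the simplest workaround is "%.0f" - floating point
 with no digits after the decimal point - which prints any whole number
 intact.  A sketch (nothing here is specific to any one implementation):

```shell
# Workaround sketch for a 16-bit %d: print integers with "%.0f"
# so values outside [-32768, 32767] come out intact.
awk 'BEGIN { i = 100000; printf("%.0f\n", i) }'
# prints 100000
```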


      Also annoying is the treatment of associative array indices: both
 POLY_AWK and DUFF_AWK alphabetize them while MKS_AWK merely reverses them.
 This is a personal nit-pick since the standard clearly says treatment of
 indices is implementation-dependent.  Yet I often want to output them in
 the proper order - with the versions from Polytron and Rob Duff, it's
 basically impossible; with MKS', it's just a hassle.
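
 One workaround - a sketch of my own, not a feature of any of the three
 packages - is to copy the indices into an ordinary numerically-indexed
 array and sort that copy yourself before printing:

```shell
# Copy the indices of an associative array into idx[1..n], then
# insertion-sort idx so the output order no longer depends on how
# the implementation happens to store the indices.
awk 'BEGIN {
     Words["cherry"] = 3; Words["apple"] = 1; Words["banana"] = 2
     n = 0
     for (w in Words)               # gather the indices ...
          idx[++n] = w
     for (i = 2; i <= n; i++) {     # ... insertion-sort them ...
          v = idx[i]
          for (j = i - 1; j >= 1 && idx[j] > v; j--)
               idx[j + 1] = idx[j]
          idx[j + 1] = v
     }
     for (i = 1; i <= n; i++)       # ... and print them in order
          print idx[i], Words[idx[i]]
}'
```

 For large arrays an insertion sort is slow, but for the typical AWK
 program the simplicity is worth it.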


      The maximum size of any single record varies from one implementation
 to another: 1024 for DUFF_AWK, 2048 for MKS_AWK, and a whopping 32000 for
 POLY_AWK! [NB: The documentation for MKS claims the limit is 1024, but my
 experience shows it's actually 2K.] For each, records exceeding the limit
 are simply split into several smaller ones.  I'm currently thinking about
 devising a free-form database in which the records span multiple lines and
 can easily envision a record taking up 1, perhaps 2K.
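
 Such a database would use the multi-line record mechanism from the AWK
 book - set RS to the empty string so blank lines separate records, and FS
 to newline so each line becomes a field - and that is exactly where these
 limits start to bite.  A quick sketch (the record layout is invented for
 illustration):

```shell
# Multi-line records: RS = "" makes each blank-line-separated
# paragraph one record; FS = "\n" makes each line one field.  An
# entire record must still fit within the implementation's limit.
printf 'name: Smith\ncity: Phila\n\nname: Jones\ncity: Boston\n' |
awk 'BEGIN { RS = ""; FS = "\n" }
     { print NR ": " $1 }'
```

 This prints "1: name: Smith" and "2: name: Jones" - one line per record,
 however many lines each record spans in the file.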


      It's also worth mentioning that MKS_AWK, unlike POLY_AWK and DUFF_AWK,
 does not regard ^Z as an end-of-file marker; whether or not it should is
 unclear.  This caused some initial consternation when I was comparing the
 results of the word-counting task, but otherwise seems of little import.
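
 If the stray ^Z does matter, it's easy enough to strip within the program
 itself - a sketch, using the string escape "\032" (octal for Ctrl-Z):

```shell
# Delete a trailing Ctrl-Z (octal \032) from a record so counts
# agree with implementations that stop reading at ^Z.
printf 'last line\032' | awk '{ sub("\032$", ""); print length($0) }'
# prints 9
```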


      Documentation for each version is sparse, but this is because all
 follow the standard closely.  MKS supplies an additional reference manual
 which describes not only AWK but several other utilities included in its
 package.  I've found one apparent mistake (about maximum record size) and
 another omission (concerning placement of temporary files), but overall
 it's quite adequate.  Polytron's documentation consists of a single README
 file.  It mentions first those examples in the AWK book which don't work
 because of shortcomings in MS-DOS, then Polytron's own extensions to the
 language.  If you use AWK just on MS-DOS machines, you'll appreciate the
 extensions; otherwise, you'll likely be bothered by portability problems.
 As for DUFF_AWK, there are two primary documentation files: one, a 1978
 abstract by Aho, Kernighan, and Weinberger describing the language; the
 other, a unix-style man page.  For those new to AWK, Rob Duff has also
 included a **large** collection of sample programs in his distribution;
 you'd do well to get a hold of it if only for these examples.


      What conclusions can be made from these comparisons? Well, as the old
 adage says, "you don't get something for nothing"; i.e., choosing Duff's
 implementation over a commercial version will save you money initially, but
 every time you use it you'll "pay" a price in terms of slower performance.
 Whether this is worth $100 depends on _your_ particular situation.  On one
 hand, if you're interested in learning about AWK, use it only infrequently,
 or process small files (and don't work in a commercial environment) you'll
 probably be quite happy with Duff's implementation.  On the other, if you
 can justify spending the money, you'll face a tough choice between the
 offerings from MKS and Polytron, with Polytron's version looking slightly
 better.  Nevertheless, all three implementations are well worth your
 consideration, and I have no qualms about recommending them as effective
 tools for users of DOS-based computers.


      Disclaimer: Apart from being a satisfied owner of Mortice Kern
 Systems' AWK and Polytron's PolyShell, I have no direct connection with the
 companies mentioned above.


      If you have any comments about this article, or the AWK language in
 general, please get in touch.  For those with email access, I can be
 reached as GTHEALL@PENNDRLS (BITNET) or [email protected] (ARPA
 Internet).  Otherwise, give me a call at 215-898-6741.



                        Tasks Used in Comparisons



 1. Record-counting task
      END {print NR}


 2. Word-counting task
      FName != FILENAME {
           if (FName)
                printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
           FName = FILENAME
           cc = wc = lc = 0
      }

      {
           cc += length($0) + 1     # don't forget LF!
           wc += NF
           lc ++
      }

      END {
           printf("%8.0f%8.0f%8.0f %s\n", lc, wc, cc, FName)
      }


 3. Line-numbering task
      {print NR ": ", $0}


 4. Regular-expressions task
      /test|Test|TEST/


 5. Column-sums task
      $2 ~ /[0-9]+K$/ {
           sub(/K/, "", $2)         # strip the trailing K ...
           sum += $2                #    ... then add the size itself
      }

      END {
           printf("Sum of 2nd column: %5dK\n", sum)
      }


 6. Spelling task
      # List words occurring only once in a document.
      #    A "word" is defined as a sequence of alphanumerics
      #    or underscores.

      # Scan thru each line and compute word frequencies.
      # The associative array Words[] holds these frequencies.
      {
           # replace non-alphanumerics with blanks throughout line
           gsub(/[^A-Za-z0-9_]/, " ")

           # count how many times each word used

           for (i = 1; i <= NF; i++)          # scan all fields ...
                Words[$i]++                   #    increment word count
      }

      # Print out infrequently-used words.
      END {
           for (w in Words)                   # scan over all words ...
           if (Words[w] == 1)            #    if word appears only once ...
                print w                  #         print it
      }