PROGRAMMING FOR PERFORMANCE
                                 PART 2

                              by Lee A. Hart
                       The Computer Journal, Issue 40
                         Reproduced with permission
                          of author and publisher

                            Know The Hardware

  Truly efficient software has an intimate, almost incestuous relationship
with its hardware.  They merge so thoroughly as to become inseparable;
neither makes any sense without the other.

  This requires that you, the programmer, TOTALLY understand the hardware.
I cannot stress this point too strongly.  The strengths and weaknesses of
the hardware influence program structure at every level, not just the
low-level drivers.  A system with weak disk I/O will be slow and
unresponsive if your program relies on overlays.  A shallow keyboard buffer
requires frequent checks to avoid missing keys.  The characteristics of the
console device determine your whole approach to data displays.  If you try
to hide from these limitations in a high-level language, your program will
work as if it were written in BASIC 101.  Let's consider some actual case
histories of what can be gained by paying attention to the hardware.

CASE #1

  A customer needed a faster way to transfer data between two computers.
He had been using a serial port at 9600 baud but complained that this was
too slow and tied up the computer's serial port.  Hardware mods were ruled
out.

  After study, I found that each computer had unused handshake lines in its
RS-232 port.  A special "Y" cable was built to cross-connect two of these
lines, providing one bit of serial I/O in each direction.  A "software UART"
program was then written to transfer data between the two machines.  This
worked to about 30K bits per second before timing dither (due to
interrupts, memory refresh, etc.) caused errors.

  The serial port's UART could be programmed to generate an interrupt when
the handshake line went low.  Therefore, an interrupt-driven protocol with
handshaking was devised.  A '0' was sent by pulling the output low until the
other computer echoed the low on its output.  A '1' was sent by pulsing the
output low and immediately back high, then waiting until the other system
echoed it.  The data rate increased to over 100K bits per second, and
transfers were now unaffected by disk I/O, keyboard activity, etc.
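
  A sketch of the send-one-bit logic, with assumed names: PULLLOW and
PULLHIGH are hypothetical subroutines that set our handshake output through
the UART's control register, and READIN returns the zero flag set while the
other machine's line reads low.  This only illustrates the protocol
described above (it is not the actual project code), and the return-to-idle
phase between bits is omitted:

    sendbit: jr   c,send1    ; carry flag holds the bit to send
    send0:   call PULLLOW    ; '0': pull our line low...
    wait0:   call READIN     ; ...and hold it there until the
             jr   nz,wait0   ;   other computer echoes the low
             call PULLHIGH   ; then release the line
             ret
    send1:   call PULLLOW    ; '1': pulse our line low
             call PULLHIGH   ;   and immediately back high
    wait1:   call READIN     ; wait until the other system
             jr   nz,wait1   ;   echoes the pulse
             ret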

CASE #2

  The firmware for a CRT terminal was to be upgraded to run at 38400 bits
per second without handshaking.  Now, 38400 bps is fast: only 260
microseconds per character (about 75 instructions for a 3 MHz Z80).

  The slowest routines need the most attention.  For example, clear-line
was accomplished by moving the stack pointer to the end of the line and
executing 36 PUSH HL instructions.  The interrupt handler needed a 4-level
stack, so the last 8 bytes were cleared normally.  Clear-screen used 25
iterations of clear-line.
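
  Here is a sketch of that clear-line, with assumed names (ROW for the
line's first byte, SPSAVE for a scratch word); it is not the terminal's
actual code.  Each PUSH HL blanks 2 bytes in 11 T-states, far faster than a
load-and-increment loop.  While SP points into video RAM, any interrupt
pushes into the bytes below it, so the first 8 bytes of the row serve as
the interrupt handler's 4-level stack and are cleared the slow way at the
end:

    clrlin: ld   (SPSAVE),sp ; save the real stack pointer
            ld   sp,ROW+80   ; point SP just past the line
            ld   hl,2020h    ; two blank characters
            push hl          ; 36 PUSHes blank bytes 8-79
            push hl          ;   of the 80-byte line
            push hl
            ; ...33 more PUSH HL instructions...
            ld   sp,(SPSAVE) ; restore the real stack
            ld   hl,ROW      ; now clear bytes 0-7, which
            ld   b,8         ;   were the interrupt stack
    clr8:   ld   (hl),20h    ;   during the PUSHes
            inc  hl
            djnz clr8
            ret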

  This still isn't fast enough to complete every ESC sequence before the
next one is received.  That calls for an interrupt-driven system.  Each
character received generates an interrupt.  The interrupt handler pushes the
character into one end of a FIFO (First-In-First-Out) buffer in memory.  The
main program pops characters out the other end and processes them.  The FIFO
fills while we process slow commands like clear-screen and empties back out
during fast commands.
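
  A minimal sketch of such a FIFO, assuming a 256-byte page-aligned buffer
and illustrative names (FIFO, HEAD, TAIL, DATA) rather than the terminal's
actual symbols.  The one-byte indexes wrap around the page automatically.
Since the handler writes only HEAD and the main program writes only TAIL,
the main-program side needs no interrupt lockout:

    rxint:  push af          ; interrupt: a character arrived
            push hl
            ld   hl,FIFO     ; H = buffer page (page-aligned)
            ld   a,(HEAD)
            ld   l,a         ; L = head index
            in   a,(DATA)    ; read character from the UART
            ld   (hl),a      ; push it into the FIFO
            inc  l           ; bump head, wrapping mod 256
            ld   a,l
            ld   (HEAD),a
            pop  hl
            pop  af
            ei
            reti

    getch:  ld   a,(TAIL)    ; main program: pop a character
            ld   hl,HEAD
            cp   (hl)        ; buffer empty while head = tail
            jr   z,getch
            ld   hl,FIFO
            ld   l,a         ; L = tail index
            inc  a
            ld   (TAIL),a    ; bump tail, wrapping mod 256
            ld   a,(hl)      ; return the character in A
            ret

  Note that this sketch does nothing when the head catches up to the tail;
that is the overflow problem discussed next.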

  But what if some idiot sends a long string of slow commands (like 100
clear-screens in a row)?  The FIFO would eventually overflow, and data would
be lost.  I prevented this with "look-ahead" logic.  When the interrupt
handler spots a clear-screen command, it sets a flag so MAIN expects it.
MAIN can then ignore unnecessary commands (no sense altering a screen that's
about to be cleared).
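
  A sketch of the look-ahead test, grafted onto the receive interrupt
above.  For simplicity it assumes clear-screen is a single byte CLS (a real
terminal must track multi-byte ESC sequences in the handler); CLSFLAG is an
assumed variable that MAIN tests before any screen update, and clears when
it finally pops the clear-screen itself from the FIFO:

    ; inside rxint, after reading the character into A:
            cp   CLS         ; clear-screen on the way?
            jr   nz,notcls
            ld   (CLSFLAG),a ; yes: warn MAIN (CLS is nonzero)
    notcls:                  ; ...file character in FIFO as before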

  Scrolling is one of the most difficult actions.  The obvious algorithm is
to block move lines 2-24 up 1, then clear line 24.  But that's what IBM did
on the PC, and we all know how well that worked.  So examine the 6845 CRT
controller.  The Start-Address register holds the address of the first
character on the screen, the one displayed in the top left corner.  If we
add 80 to it, line 2 instantly becomes the top line, and we've scrolled the
whole screen up a line.  All that remains is to clear the 80 bytes that
form the new 24th line, for which we have a fast routine.
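
  A sketch of the hardware scroll.  On the 6845, the start address lives in
registers R12 (high byte) and R13 (low byte), written through the chip's
address and data ports.  CRTC_A, CRTC_D, and TOPLIN are assumed names, and
clrbot stands in for the fast clear-line routine above; the wrap-around
check discussed next is omitted, and remember that the 6845 counts in
characters, not CPU bytes:

    scroll: ld   hl,(TOPLIN) ; current start address
            ld   de,80       ; one 80-character line
            add  hl,de
            ld   (TOPLIN),hl ; (wrap check omitted here)
            ld   a,12        ; select R12 = start address high
            out  (CRTC_A),a
            ld   a,h
            out  (CRTC_D),a
            ld   a,13        ; select R13 = start address low
            out  (CRTC_A),a
            ld   a,l
            out  (CRTC_D),a
            jp   clrbot      ; blank the new bottom line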

  Each scroll moves the start address up another 80 bytes.  This obviously
can't go on indefinitely, so the original program spent a great deal of time
checking that addresses hadn't overflowed out of its 2K block of screen RAM
(F800-FFFF).  For instance, the old code read:

    ld   (hl),a    ; put character on screen
    inc  hl        ; advance to next
    ld   a,h       ; get new address
    or   0F8h      ; if overflow to 0000,
    ld   h,a       ;   force it to F800-FFFF

  But is this really necessary?  The schematic revealed that the 2K RAM was
partially decoded and actually occupied 16K in the Z80's address space
(C000-FFFF).  It's far easier to ensure that an address lies within this
range:

    ld   (hl),a    ; put character on screen
    res  6,h       ; ensure we don't wrap to 0000
    inc  hl        ; advance to next

CASE #3

  Fast Disk I/O.  Way back in 8 B.C. (eight years Before Clones) I had an
S-100 system.  Its 8080 CPU blazed along at 1.843 MHz, through 32K of RAM
spread over half a dozen furnace boards.  Two Shugart SA-801R single-sided
8" drives provided disk storage, with CP/M 1.4 tying it all together.  That
old war horse and I fought many battles together, until it finally died the
Death-of-1000-Intermittents.

  Many of its "features" I'd rather forget, but it had one outstanding
attribute: the fastest floppies I've ever seen.  Warm boots were done before
your fingers were off the keys; WordStar loaded in under a second; PIP
copied files at 10K bytes/sec.  All without a fast CPU, DMA, vectored
interrupts, or even a disk controller IC.  The "controller" was just a bunch
of TTL chips implementing a parallel port, an 8-bit shift register, and a
CRC checkcode generator.  The real work was done by the CPU, byte-banging
out the IBM 3740 SD/DD format in software.

  How good was it?  An 8" disk spins at 360 rpm, or 6 revs/sec.  Each track
held 6.5K (26 double-density sectors of 256 bytes each).  That makes the
theoretical maximum transfer rate 6.5K x 6 = 39K bytes/sec.  It actually
achieved 50% of this, or 20K bytes/sec.

  Few modern micros come anywhere near this level of performance.  The
Kaypro I wrote this article on creeps through the disk at 4K/sec.  My PC
clone is closer, at 12K/sec.  The problem is that the CPU spends most of its
time in wait loops: waiting for the drive motor to start, for the head to
load, for an index hole, for a certain sector to come around on the disk.
The capabilities of fast CPUs, elaborate interrupt systems, DMA, and fancy
disk controllers are thrown away by crude software.

  The CPU has better things to do.  If the disk isn't ready when an
application program needs it, the BIOS should start the task, save the data
in a buffer, and set up an interrupt to finish the task later when the disk
is REALLY ready.  The time lost to wait loops is thus reclaimed to run your
application programs.

  That's how my antique worked.  The BIOS maintained a track buffer in RAM.
The first read from a particular track moved the head to the desired track
and read the whole thing into the buffer.  Further reads from that track
simply came from RAM, taking virtually no time at all.

  Similarly, writes to a sector on the current track just put data in the
buffer and marked it as changed.  The actual write was performed later, when
a new track was selected for read/write, or just before the drive timed out
from a lack of disk activity.
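
  In outline, the read side of such a BIOS might look like the sketch
below.  All names here (SEEKTRK, BUFTRK, FLUSH, RDTRACK, SECADR, DMAADR)
are illustrative, not the original BIOS's, and CP/M's 128-byte logical
sectors are assumed:

    readsec: ld   a,(SEEKTRK) ; track the BDOS asked for
             ld   hl,BUFTRK   ; track now held in the buffer
             cp   (hl)
             jr   z,copyout   ; hit: no disk I/O at all
             call FLUSH       ; miss: write old track if dirty,
             call RDTRACK     ;   then read the new track whole
             ld   a,(SEEKTRK)
             ld   (BUFTRK),a  ; note what the buffer now holds
    copyout: call SECADR      ; HL -> sector's slot in buffer
             ld   de,(DMAADR) ; DE -> caller's DMA address
             ld   bc,128
             ldir             ; the "disk read" is a RAM copy
             ret

  The write side is the mirror image: copy the caller's sector into the
buffer, set the dirty flag, and let FLUSH do the physical write later.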

  Physical track reads/writes were fast as well.  The key was simply to
begin wherever the head was.  After seeking to the desired track, it read
the ID# of each sector encountered and transferred that sector to/from the
appropriate place in the RAM buffer.  No need to find the index hole, wait
for a particular sector ID#, or worry about interleave; one revolution got
it all.
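
  The track read itself might be sketched as below, assuming the 26 sectors
of 256 bytes mentioned earlier and two hypothetical primitives: READID,
which returns in A the number (1-26) of the next sector ID to pass the
head, and RDDATA, which reads that sector's 256 data bytes to the address
in HL.  TRKBUF is the assumed track buffer, and both primitives are assumed
to preserve the loop counter in B:

    rdtrack: ld   b,26        ; 26 sectors, in whatever order
    rdloop:  call READID      ; A = next sector to pass the
             dec  a           ;   head (1-26 -> 0-25)
             ld   h,a         ; slot offset = sector# x 256,
             ld   l,0         ;   so sector# is the high byte
             ld   de,TRKBUF
             add  hl,de       ; HL -> its slot in the buffer
             call RDDATA      ; read its 256 data bytes there
             djnz rdloop      ; one revolution gets them all
             ret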

  Such a system must be implemented carefully.  CP/M does not expect
delayed error messages, which can produce some odd results.  For instance, a
BDOS read error might be reported when the real cause was a write error in
flushing the previous track buffer to disk.  Also, modern drives do not have
door locks to prevent disk removal if unwritten data remains in the track
buffer.

  The main factor limiting my S-100 system's performance was the slow CPU
and lack of DMA.  A double-density 8" disk has a peak data transfer rate of
500K bits/sec, which only allows 16 microseconds between bytes.  This
required polled I/O, with the CPU 100% devoted to the disk during actual
reads/writes.

  5-1/4" disks have a slower maximum transfer rate, but modern hardware has�
advantages that can make up for it.  A normal 5-1/4" disk spins at 300 rpm,�
or 5 rev/sec. Assuming 9 sectors of 512 bytes per track, the maximum�
transfer rate is 22.5K bytes/sec.  The peak data rate is 250K bits/sec, or�
32 microseconds per byte. This is slow enough for a 4 MHz Z80 to (barely)�
handle it on an interrupt basis.  Here's an interrupt handler to read 256�
bytes from a disk controller chip at 32 microseconds max. per byte:

T-states
  23                       ; time to finish longest instruction
  13                       ; Z80 interrupt mode 0 or 1 response
  11 int:  push af         ; save registers used
  11       in   a,(data)   ; read data byte from disk controller
  13 next: ld   (buffer),a ; store it in buffer (a variable)
  13       ld   a,(next+1) ; get buffer address
   4       inc  a          ;   increment
  13       ld   (next+1),a ;   save for next time
   7       jr   z,done     ; if end of page, done
  10       pop  af         ;   else restore registers
  10       ret             ;   and return
 ----
 128 T-states max = 32 microseconds with a 4 MHz Z80

  But this routine barely squeaks by.  It can't use interrupt mode 2 (which
adds 6 T-states to the response time) or signal Z80 peripherals that the
interrupt is complete with an RETI (which adds 4 T-states).  It's limited to
a 256-byte sector.  Worse, some disk controller chips need processing time
of their own.  The popular Western Digital FD179x series only allows 27.5
microseconds for each byte.

  So we have to get clever again.  The following example reads pairs of
bytes, the first on an interrupt and the second by polled I/O.  This
improves performance enough to allow interrupt mode 2, larger sector sizes,
and the slow response time of an FD179x chip:

T-states
  23                        ; time to finish longest instruction
  19                        ; Z80 interrupt mode 2 response time
  11 int:   push af         ; save A and flags
  11        in   a,(data)   ; read 1st byte from disk controller
  11        push hl         ; save HL
  10 next:  ld   hl,buffer  ; get buffer address (a variable)
   7        ld   (hl),a     ; store byte in buffer
   6        inc  hl         ; advance buffer pointer
   6        inc  hl         ;   for next interrupt
  16        ld   (next+1),hl;   & store it
   6        dec  hl         ;   point to current address
11+11 check: in  a,(status) ; check disk controller status
 4+ 4       rra             ; if not busy (bit 0 = 0),
 7+ 7       jr   nc,done    ;   then we're done
 4+ 4       rra             ; if next byte not ready (bit 1 = 0),
12+ 7       jr   nc,check   ;   then loop until it is
  11        in   a,(data)   ; get 2nd byte from disk controller
   7        ld   (hl),a     ;   & store it in buffer
  10        pop  hl         ; restore registers
  10        pop  af
  14        reti            ; return
 ----
 188 or 226 T-states (for 1 or 2 passes through status loop)

  This routine reads bytes from the controller chip within 17.75
microseconds worst-case.  Interrupt overhead averages 80% for a 4 MHz Z80,
leaving 20% for main program execution.  The peculiar way of incrementing
the address pointer minimizes the worst-case delay from an interrupt or
status flag change until the byte is read.  We want to maximize the chance
that the second character is ready the first time the status is checked.

  Why improve your disk system?  Because, as a practical matter, there's
more to be gained by improving it than by any other change you could make.
It's disk I/O that sets the pace, not CPU speed or memory size.  Users
almost never wait on CPU speed; it's the disk that keeps you twiddling your
thumbs with the keyboard ignored, the screen frozen, and the disk drive
emitting Bronx cheers.  Put a Commodore 64's tinkertoy disk system on an AT
clone, and you'd have high-priced junk that only a masochist would use.
Conversely, the AT's DMA-based disk I/O would transform a C64 into a
fire-breathing dragon that would eat its competition alive.


                                Algorithms

  When a hardware engineer sits down to design a circuit, he doesn't begin
with a blank sheet of paper.  He has a vast library of textbooks, data
sheets, and catalogs of standard circuits to choose from.  Most of the task
is simply connecting off-the-shelf components into one of these standard
configurations, modifying them as necessary to satisfy any unique
requirements.

  Algorithms are to programmers what IC chips are to hardware designers.
Just as the engineer builds a library of standard parts and circuits, every
programmer must continually build his own algorithm collection.  Whether
it's a shoebox full of magazine clippings or a carefully indexed series of
notebooks, start NOW.

  Programming textbooks tend to concentrate on traditional computer
algorithms for floating-point math, transcendental functions, and sorting
routines.  The old standby is Knuth's "The Art of Computer Programming".
Hamming's "Numerical Methods for Scientists and Engineers" explains the
basics of iterative calculations.  "Digital Computation and Numerical
Methods" by Southworth and DeLeeuw provides detailed flowcharts and sample
code as well.

  Magazines are a great source and tend to be more down-to-earth and closer
to the state of the art.  Read carefully!  Good algorithms may be expressed
as BASIC listings, assembly code for some obscure processor, or pocket
calculator key sequences, or even disguised as circuit diagrams.
Professional journals like EDN or Computer Design are often better than the
popular magazines, which have pretty much abandoned education in favor of
marketing.  Especially check out back issues.  The cruder the hardware, the
trickier the algorithms had to be to make up for it.

  Manufacturers' technical literature is a gold mine.  Get the
manufacturers' own manuals, not some boiled-down paperback from the
bookstore.  They won't be models of clarity, but they are full of hidden
gold.  Read everything: hardware and software manuals, data sheets,
application notes, etc.

  User groups are the traditional source of solutions to specific
problems.  Even better, they provide actual implementations in printed
listings, on disk, or even by modem.

  Don't waste time reinventing the wheel.  Learn from others what works,
and what doesn't.  Some of the best (and worst) algorithms I know were found
by disassembling existing programs.  And once you find a good algorithm,
recycle it.  That clever sort routine for an antique 8008 may be the
foundation of the fastest '386 sort yet!


                                Conclusion

  These techniques are not new; in fact old-timers will recognize many of
them from the early days of computing when hardware limitations were more
severe.  However, they have fallen into disuse.  A whole generation of
programmers has been taught that such techniques have no place in modern
structured programming.

  The theory goes something like this: Programs written in a high-level
language are faster and easier to write, debug, document, and maintain.
Memory and speed are viewed as infinite resources, so the performance loss
is unimportant.  Programs should be totally generic; it is the compiler's or
run-time library's job to worry about the hardware interface.

  These rules make sense in a mainframe environment, where the hardware
resources are truly awesome, and teams of programmers spend years working on
one application.  But they impose severe penalties on a microcomputer
system.  The user must pay for the programmer's luxuries with higher
hardware cost and lackluster performance.

  It's easy to forget that "microcomputer" literally means "one millionth
of a computer".  Microprocessors make abysmally bad CPUs.  Build a computer
with one, and you'll wind up needing $5000 worth of memory and peripherals
to support a $5 CPU chip.

  But micros make superlative controllers.  That's what they were designed
for, and what they do best.  A single microcomputer can replace dozens of
boards and hundreds of ICs with as little as a single chip.  That's why 90%
of all microprocessors go into non-computer uses: calculators, auto emission
controls, home entertainment equipment, industrial controls, and the like.
Of 30 million Z80s sold last year, fewer than 1 million went into computers.

  Programming a controller is different from programming a computer.  Most
applications demand real-time multi-tasking capabilities, and there is never
enough speed or memory.  Inputs and outputs are physical hardware devices,
not abstract data structures, so the code must inevitably be
hardware-dependent.  Computer languages are just not cut out for this sort
of thing.

  The question is not, "How do I write a computer program to handle this
data?"  Instead, you should ask yourself, "How must I manipulate this
hardware to do the job?"  The techniques in this article may be out of place
in the bureaucracy of a large computer but are right at home in the wild
west world of a microcomputer.

  Lest you think this has nothing to do with a "real" computer like your PC
clone, consider this.  Instead of a '286 with 1 meg of memory, suppose it
contained ten Z80 controller boards, each with 64K of memory and a fast
network to tie them together.  Each Z80 handles a different device:
keyboard, screen, printer, modem, and one for each disk.  The rest are free
to run application programs, several at a time!

  Suppose you're doing word processing on this system.  The keyboard Z80
does spelling correction on data entry.  The screen Z80 displays your text
in bit-mapped fonts to match the printer's Z80, which is simultaneously
printing a file.  The Z80 running the word processor itself suffers no
annoying pauses or hesitations, since disk I/O is handled instantaneously
via each drive's Z80 track buffer.  Meanwhile, the modem's Z80 is
downloading a file while another assembles a program.  Pop-up utilities are
ready and waiting in still other Z80s in case they're needed.

  Such a system would clearly have half the hardware cost of a PC, yet
would outperform it by a wide margin.  True multi-tasking becomes child's
play with multiple processors.  More processors can be readily added for
even higher performance, or removed to save cost (or continue operation
while waiting for a replacement).

  If the computer scientists really want to further the micro-revolution,
they should stop trying to force antiquated mainframe languages onto micros
and concentrate on developing tools to maximize the use of micros as they
are!

[This article was originally published in issue 40 of The Computer Journal,
P.O. Box 12, South Plainfield, NJ 07080-0012 and is reproduced with the
permission of the author and the publisher. Further reproduction for non-
commercial purposes is authorized. This copyright notice must be retained.
(c) Copyright 1989, 1991 Socrates Press and respective authors]