PROGRAMMING FOR PERFORMANCE
                                 PART 2

                              by Lee A. Hart
                       The Computer Journal, Issue 40
                         Reproduced with permission
                          of author and publisher

                            Know The Hardware

  Truly efficient software has an intimate, almost incestuous relationship
with its hardware.  They merge so thoroughly as to become inseparable;
neither makes any sense without the other.

  This requires that you, the programmer, TOTALLY understand the hardware.
I cannot stress this point too strongly.  The strengths and weaknesses of
the hardware influence program structure at every level, not just the
low-level drivers.  A system with weak disk I/O will be slow and
unresponsive if your program relies on overlays.  A shallow keyboard buffer
requires frequent checks to avoid missing keys.  The characteristics of the
console device determine your whole approach to data displays.  If you try
to hide from these limitations in a high-level language, your program will
work as if it were written in BASIC 101.  Let's consider some actual case
histories of what can be gained by paying attention to the hardware.

CASE #1

  A customer needed a faster way to transfer data between two computers.
He had been using a serial port at 9600 baud but complained that this was
too slow and tied up the computer's serial port.  Hardware mods were ruled
out.

  After study, I found that each computer had unused handshake lines in its
RS-232 port.  A special "Y" cable was built to cross-connect two of these
lines, providing one bit of serial I/O in each direction.  A "software UART"
program was then written to transfer data between the two machines.  This
worked to about 30K bits per second before timing dither (due to
interrupts, memory refresh, etc.) caused errors.

  The serial port's UART could be programmed to generate an interrupt when
the handshake line went low.  Therefore, an interrupt-driven protocol with
handshaking was devised.  A '0' was sent by pulling the output low until the
other computer echoed the low on its output.  A '1' was sent by pulsing the
output low and immediately back high, then waiting until the other system
echoed it.  The data rate increased to over 100K bits per second, and
transfers were now unaffected by disk I/O, keyboard activity, etc.
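
  A sketch of the send-one-bit logic, with assumed names: PULLLOW and
PULLHIGH are hypothetical subroutines that set our handshake output through
the UART's control register, and READIN returns the zero flag set while the
other machine's line reads low.  This only illustrates the protocol
described above (it is not the actual project code), and the return-to-idle
phase between bits is omitted:

    sendbit: jr   c,send1    ; carry flag holds the bit to send
    send0:   call PULLLOW    ; '0': pull our line low...
    wait0:   call READIN     ; ...and hold it there until the
             jr   nz,wait0   ;   other computer echoes the low
             call PULLHIGH   ; then release the line
             ret
    send1:   call PULLLOW    ; '1': pulse our line low
             call PULLHIGH   ;   and immediately back high
    wait1:   call READIN     ; wait until the other system
             jr   nz,wait1   ;   echoes the pulse
             ret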

CASE #2

  The firmware for a CRT terminal was to be upgraded to run at 38400 bits
per second without handshaking.  Now, 38400 bps is fast: only 260
microseconds per character (about 75 instructions for a 3 MHz Z80).

  The slowest routines need the most attention.  For example, clear-line
was accomplished by moving the stack pointer to the end of the line and
executing 36 PUSH HL instructions.  The interrupt handler needed a 4-level
stack, so the last 8 bytes were cleared normally.  Clear-screen used 25
iterations of clear-line.
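
  Here is a sketch of that clear-line, with assumed names (ROW for the
line's first byte, SPSAVE for a scratch word); it is not the terminal's
actual code.  Each PUSH HL blanks 2 bytes in 11 T-states, far faster than a
load-and-increment loop.  While SP points into video RAM, any interrupt
pushes into the bytes below it, so the first 8 bytes of the row serve as
the interrupt handler's 4-level stack and are cleared the slow way at the
end:

    clrlin: ld   (SPSAVE),sp ; save the real stack pointer
            ld   sp,ROW+80   ; point SP just past the line
            ld   hl,2020h    ; two blank characters
            push hl          ; 36 PUSHes blank bytes 8-79
            push hl          ;   of the 80-byte line
            push hl
            ; ...33 more PUSH HL instructions...
            ld   sp,(SPSAVE) ; restore the real stack
            ld   hl,ROW      ; now clear bytes 0-7, which
            ld   b,8         ;   were the interrupt stack
    clr8:   ld   (hl),20h    ;   during the PUSHes
            inc  hl
            djnz clr8
            ret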

  This still isn't fast enough to complete every ESC sequence before the
next one is received.  That calls for an interrupt-driven system.  Each
character received generates an interrupt.  The interrupt handler pushes the
character into one end of a FIFO (First-In-First-Out) buffer in memory.  The
main program pops characters out the other end and processes them.  The FIFO
fills while we process slow commands like clear-screen and empties back out
during fast commands.
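
  A minimal sketch of such a FIFO, assuming a 256-byte page-aligned buffer
and illustrative names (FIFO, HEAD, TAIL, DATA) rather than the terminal's
actual symbols.  The one-byte indexes wrap around the page automatically.
Since the handler writes only HEAD and the main program writes only TAIL,
the main-program side needs no interrupt lockout:

    rxint:  push af          ; interrupt: a character arrived
            push hl
            ld   hl,FIFO     ; H = buffer page (page-aligned)
            ld   a,(HEAD)
            ld   l,a         ; L = head index
            in   a,(DATA)    ; read character from the UART
            ld   (hl),a      ; push it into the FIFO
            inc  l           ; bump head, wrapping mod 256
            ld   a,l
            ld   (HEAD),a
            pop  hl
            pop  af
            ei
            reti

    getch:  ld   a,(TAIL)    ; main program: pop a character
            ld   hl,HEAD
            cp   (hl)        ; buffer empty while head = tail
            jr   z,getch
            ld   hl,FIFO
            ld   l,a         ; L = tail index
            inc  a
            ld   (TAIL),a    ; bump tail, wrapping mod 256
            ld   a,(hl)      ; return the character in A
            ret

  Note that this sketch does nothing when the head catches up to the tail;
that is the overflow problem discussed next.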

  But what if some idiot sends a long string of slow commands (like 100
clear-screens in a row)?  The FIFO would eventually overflow, and data would
be lost.  I prevented this with "look-ahead" logic.  When the interrupt
handler spots a clear-screen command, it sets a flag so MAIN expects it.
MAIN can then ignore unnecessary commands (no sense altering a screen that's
about to be cleared).
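
  A sketch of the look-ahead test, grafted onto the receive interrupt
above.  For simplicity it assumes clear-screen is a single byte CLS (a real
terminal must track multi-byte ESC sequences in the handler); CLSFLAG is an
assumed variable that MAIN tests before any screen update, and clears when
it finally pops the clear-screen itself from the FIFO:

    ; inside rxint, after reading the character into A:
            cp   CLS         ; clear-screen on the way?
            jr   nz,notcls
            ld   (CLSFLAG),a ; yes: warn MAIN (CLS is nonzero)
    notcls:                  ; ...file character in FIFO as before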

  Scrolling is one of the most difficult actions.  The obvious algorithm is
to block move lines 2-24 up 1, then clear line 24.  But that's what IBM did
on the PC, and we all know how well that worked.  So examine the 6845 CRT
controller.  The Start-Address register holds the address of the first
character on the screen, the one displayed in the top left corner.  If we
add 80 to it, line 2 instantly becomes the top line, and we've scrolled the
whole screen up a line.  All that remains is to clear the 80 bytes that
form the new 24th line, for which we have a fast routine.
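
  A sketch of the hardware scroll.  On the 6845, the start address lives in
registers R12 (high byte) and R13 (low byte), written through the chip's
address and data ports.  CRTC_A, CRTC_D, and TOPLIN are assumed names, and
clrbot stands in for the fast clear-line routine above; the wrap-around
check discussed next is omitted, and remember that the 6845 counts in
characters, not CPU bytes:

    scroll: ld   hl,(TOPLIN) ; current start address
            ld   de,80       ; one 80-character line
            add  hl,de
            ld   (TOPLIN),hl ; (wrap check omitted here)
            ld   a,12        ; select R12 = start address high
            out  (CRTC_A),a
            ld   a,h
            out  (CRTC_D),a
            ld   a,13        ; select R13 = start address low
            out  (CRTC_A),a
            ld   a,l
            out  (CRTC_D),a
            jp   clrbot      ; blank the new bottom line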

  Each scroll moves the start address up another 80 bytes.  This obviously
can't go on indefinitely, so the original program spent a great deal of time
checking that addresses hadn't overflowed out of its 2K block of screen RAM
(F800-FFFF).  For instance, the old code read:

    ld   (hl),a    ; put character on screen
    inc  hl        ; advance to next
    ld   a,h       ; get new address
    or   0F8h      ; if overflow to 0000,
    ld   h,a       ;   force it to F800-FFFF

  But is this really necessary?  The schematic revealed that the 2K RAM was
partially decoded and actually occupied 16K in the Z80's address space
(C000-FFFF).  It's far easier to ensure that an address lies within this
range:

    ld   (hl),a    ; put character on screen
    res  6,h       ; ensure we don't wrap to 0000
    inc  hl        ; advance to next

CASE #3

  Fast Disk I/O.  Way back in 8 B.C. (eight years Before Clones) I had an
S-100 system.  Its 8080 CPU blazed along at 1.843 MHz, through 32K of RAM
spread over half a dozen furnace boards.  Two Shugart SA-801R single-sided
8" drives provided disk storage, with CP/M 1.4 tying it all together.  That
old war horse and I fought many battles together, until it finally died the
Death-of-1000-Intermittents.

  Many of its "features" I'd rather forget, but it had one outstanding
attribute: the fastest floppies I've ever seen.  Warm boots were done before
your fingers were off the keys; WordStar loaded in under a second; PIP
copied files at 10K bytes/sec.  All without a fast CPU, DMA, vectored
interrupts, or even a disk controller IC.  The "controller" was just a bunch
of TTL chips implementing a parallel port, an 8-bit shift register, and a
CRC checkcode generator.  The real work was done by the CPU, byte-banging
out the IBM 3740 SD/DD format in software.

  How good was it?  An 8" disk spins at 360 rpm, or 6 revs/sec.  Each track
held 6.5K (26 double-density sectors of 256 bytes each).  That makes the
theoretical maximum transfer rate 6.5K x 6 = 39K bytes/sec.  It actually
achieved 50% of this, or 20K bytes/sec.

  Few modern micros come anywhere near this level of performance.  The
Kaypro I wrote this article on creeps through the disk at 4K/sec.  My PC
clone is closer, at 12K/sec.  The problem is that the CPU spends most of its
time in wait loops: waiting for the drive motor to start, for the head to
load, for an index hole, for a certain sector to come around on the disk.
The capabilities of fast CPUs, elaborate interrupt systems, DMA, and fancy
disk controllers are thrown away by crude software.

  The CPU has better things to do.  If the disk isn't ready when an
application program needs it, the BIOS should start the task, save the data
in a buffer, and set up an interrupt to finish the task later when the disk
is REALLY ready.  The time lost to wait loops is thus reclaimed to run your
application programs.

  That's how my antique worked.  The BIOS maintained a track buffer in RAM.
The first read from a particular track moved the head to the desired track
and read the whole thing into the buffer.  Further reads from that track
simply came from RAM, taking virtually no time at all.

  Similarly, writes to a sector on the current track just put data in the
buffer and marked it as changed.  The actual write was performed later, when
a new track was selected for read/write, or just before the drive timed out
from a lack of disk activity.
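
  In outline, the read side of such a BIOS might look like the sketch
below.  All names here (SEEKTRK, BUFTRK, FLUSH, RDTRACK, SECADR, DMAADR)
are illustrative, not the original BIOS's, and CP/M's 128-byte logical
sectors are assumed:

    readsec: ld   a,(SEEKTRK) ; track the BDOS asked for
             ld   hl,BUFTRK   ; track now held in the buffer
             cp   (hl)
             jr   z,copyout   ; hit: no disk I/O at all
             call FLUSH       ; miss: write old track if dirty,
             call RDTRACK     ;   then read the new track whole
             ld   a,(SEEKTRK)
             ld   (BUFTRK),a  ; note what the buffer now holds
    copyout: call SECADR      ; HL -> sector's slot in buffer
             ld   de,(DMAADR) ; DE -> caller's DMA address
             ld   bc,128
             ldir             ; the "disk read" is a RAM copy
             ret

  The write side is the mirror image: copy the caller's sector into the
buffer, set the dirty flag, and let FLUSH do the physical write later.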

  Physical track reads/writes were fast as well.  The key was simply to
begin wherever the head was.  After seeking to the desired track, it read
the ID# of each sector encountered and transferred that sector to/from the
appropriate place in the RAM buffer.  No need to find the index hole, wait
for a particular sector ID#, or worry about interleave; one revolution got
it all.
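
  The track read itself might be sketched as below, assuming the 26 sectors
of 256 bytes mentioned earlier and two hypothetical primitives: READID,
which returns in A the number (1-26) of the next sector ID to pass the
head, and RDDATA, which reads that sector's 256 data bytes to the address
in HL.  TRKBUF is the assumed track buffer, and both primitives are assumed
to preserve the loop counter in B:

    rdtrack: ld   b,26        ; 26 sectors, in whatever order
    rdloop:  call READID      ; A = next sector to pass the
             dec  a           ;   head (1-26 -> 0-25)
             ld   h,a         ; slot offset = sector# x 256,
             ld   l,0         ;   so sector# is the high byte
             ld   de,TRKBUF
             add  hl,de       ; HL -> its slot in the buffer
             call RDDATA      ; read its 256 data bytes there
             djnz rdloop      ; one revolution gets them all
             ret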

  Such a system must be implemented carefully.  CP/M does not expect
delayed error messages, which can produce some odd results.  For instance, a
BDOS read error might be reported when the real cause was a write error in
flushing the previous track buffer to disk.  Also, modern drives do not have
door locks to prevent disk removal if unwritten data remains in the track
buffer.

  The main factor limiting my S-100 system's performance was the slow CPU
and lack of DMA.  A double-density 8" disk has a peak data transfer rate of
500K bits/sec, which only allows 16 microseconds between bytes.  This
required polled I/O, with the CPU 100% devoted to the disk during actual
reads/writes.

  5-1/4" disks have a slower maximum transfer rate, but modern hardware has�
advantages that can make up for it.  A normal 5-1/4" disk spins at 300 rpm,�
or 5 rev/sec. Assuming 9 sectors of 512 bytes per track, the maximum�
transfer rate is 22.5K bytes/sec.  The peak data rate is 250K bits/sec, or�
32 microseconds per byte. This is slow enough for a 4 MHz Z80 to (barely)�
handle it on an interrupt basis.  Here's an interrupt handler to read 256�
bytes from a disk controller chip at 32 microseconds max. per byte:

T-states
  23                       ; time to finish longest instruction
  13                       ; Z80 interrupt mode 0 or 1 response
  11 int:  push af         ; save registers used
  11       in   a,(data)   ; read data byte from disk controller
  13 next: ld   (buffer),a ; store it in buffer (a variable)
  13       ld   a,(next+1) ; get buffer address
   4       inc  a          ;   increment
  13       ld   (next+1),a ;   save for next time
   7       jr   z,done     ; if end of page, done
  10       pop  af         ;   else restore registers
  10       ret             ;   and return
 ----
 128 T-states max = 32 microseconds with a 4 MHz Z80

  But this routine barely squeaks by.  It can't use interrupt mode 2 (which
adds 6 T-states to the response time) or signal Z80 peripherals that the
interrupt is complete with an RETI (which adds 4 T-states).  It's limited to
a 256-byte sector.  Worse, some disk controller chips need processing time
of their own.  The popular Western Digital FD179x series only allows 27.5
microseconds for each byte.

  So we have to get clever again.  The following example reads pairs of
bytes, the first on an interrupt and the second by polled I/O.  This
improves performance enough to allow interrupt mode 2, larger sector sizes,
and the slow response time of an FD179x chip:

T-states
  23                        ; time to finish longest instruction
  19                        ; Z80 interrupt mode 2 response time
  11 int:   push af         ; save A and flags
  11        in   a,(data)   ; read 1st byte from disk controller
  11        push hl         ; save HL
  10 next:  ld   hl,buffer  ; get buffer address (a variable)
   7        ld   (hl),a     ; store byte in buffer
   6        inc  hl         ; advance buffer pointer
   6        inc  hl         ;   for next interrupt
  16        ld   (next+1),hl;   & store it
   6        dec  hl         ;   point to current address
11+11 check: in  a,(status) ; check disk controller status
 4+ 4       rra             ; if not busy (bit 0 = 0),
 7+ 7       jr   nc,done    ;   then we're done
 4+ 4       rra             ; if next byte not ready (bit 1 = 0),
12+ 7       jr   nc,check   ;   then loop until it is
  11        in   a,(data)   ; get 2nd byte from disk controller
   7        ld   (hl),a     ;   & store it in buffer
  10        pop  hl         ; restore registers
  10        pop  af
  14        reti            ; return
 ----
 188 or 226 T-states (for 1 or 2 passes through status loop)

  This routine reads bytes from the controller chip within 17.75
microseconds worst-case.  Interrupt overhead averages 80% for a 4 MHz Z80,
leaving 20% for main program execution.  The peculiar way of incrementing
the address pointer minimizes the worst-case delay from an interrupt or
status flag change until the byte is read.  We want to maximize the chance
that the second character is ready the first time the status is checked.

  Why improve your disk system?  Because, as a practical matter, there's
more to be gained by improving it than by any other change you could make.
It's disk I/O that sets the pace, not CPU speed or memory size.  Users
almost never wait on CPU speed; it's the disk that keeps you twiddling your
thumbs with the keyboard ignored, the screen frozen, and the disk drive
emitting Bronx cheers.  Put a Commodore 64's tinkertoy disk system on an AT
clone, and you'd have high-priced junk that only a masochist would use.
Conversely, the AT's DMA-based disk I/O would transform a C64 into a
fire-breathing dragon that would eat its competition alive.


                                Algorithms

  When a hardware engineer sits down to design a circuit, he doesn't begin
with a blank sheet of paper.  He has a vast library of textbooks, data
sheets, and catalogs of standard circuits to choose from.  Most of the task
is simply connecting off-the-shelf components into one of these standard
configurations, modifying them as necessary to satisfy any unique
requirements.

  Algorithms are to programmers what IC chips are to hardware designers.
Just as the engineer builds a library of standard parts and circuits, every
programmer must continually build his own algorithm collection.  Whether
it's a shoebox full of magazine clippings or a carefully indexed series of
notebooks, start NOW.

  Programming textbooks tend to concentrate on traditional computer
algorithms for floating-point math, transcendental functions, and sorting
routines.  The old standby is Knuth's "The Art of Computer Programming".
Hamming's "Numerical Methods for Scientists and Engineers" explains the
basics of iterative calculations.  "Digital Computation and Numerical
Methods" by Southworth and DeLeeuw provides detailed flowcharts and sample
code as well.

  Magazines are a great source and tend to be more down-to-earth and closer
to the state of the art.  Read carefully!  Good algorithms may be expressed
as BASIC listings, assembly code for some obscure processor, or pocket
calculator key sequences, or even disguised as circuit diagrams.
Professional journals like EDN or Computer Design are often better than the
popular magazines, which have pretty much abandoned education in favor of
marketing.  Especially check out back issues.  The cruder the hardware, the
trickier the algorithms had to be to make up for it.

  Manufacturers' technical literature is a gold mine.  Get the
manufacturers' own manuals, not some boiled-down paperback from the
bookstore.  They won't be models of clarity, but they are full of hidden
gold.  Read everything: hardware and software manuals, data sheets,
application notes, etc.

  User groups are the traditional source of solutions to specific
problems.  Even better, they provide actual implementations in printed
listings, on disk, or even by modem.

  Don't waste time reinventing the wheel.  Learn from others what works,
and what doesn't.  Some of the best (and worst) algorithms I know were found
by disassembling existing programs.  And once you find a good algorithm,
recycle it.  That clever sort routine for an antique 8008 may be the
foundation of the fastest '386 sort yet!


                                Conclusion

  These techniques are not new; in fact old-timers will recognize many of
them from the early days of computing when hardware limitations were more
severe.  However, they have fallen into disuse.  A whole generation of
programmers has been taught that such techniques have no place in modern
structured programming.

  The theory goes something like this: Programs written in a high-level
language are faster and easier to write, debug, document, and maintain.
Memory and speed are viewed as infinite resources, so the performance loss
is unimportant.  Programs should be totally generic; it is the compiler's or
run-time library's job to worry about the hardware interface.

  These rules make sense in a mainframe environment, where the hardware
resources are truly awesome, and teams of programmers spend years working on
one application.  But they impose severe penalties on a microcomputer
system.  The user must pay for the programmer's luxuries with higher
hardware cost and lackluster performance.

  It's easy to forget that "microcomputer" literally means "one millionth
of a computer".  Microprocessors make abysmally bad CPUs.  Build a computer
with one, and you'll wind up needing $5000 worth of memory and peripherals
to support a $5 CPU chip.

  But micros make superlative controllers.  That's what they were designed
for, and what they do best.  A single microcomputer can replace dozens of
boards and hundreds of ICs with as little as a single chip.  That's why 90%
of all microprocessors go into non-computer uses: calculators, auto emission
controls, home entertainment equipment, industrial controls, and the like.
Of 30 million Z80s sold last year, fewer than 1 million went into computers.

  Programming a controller is different from programming a computer.  Most
applications demand real-time multi-tasking capabilities, and there is never
enough speed or memory.  Inputs and outputs are physical hardware devices,
not abstract data structures, so the code must inevitably be
hardware-dependent.  Computer languages are just not cut out for this sort
of thing.

  The question is not, "How do I write a computer program to handle this
data?"  Instead, you should ask yourself, "How must I manipulate this
hardware to do the job?"  The techniques in this article may be out of place
in the bureaucracy of a large computer but are right at home in the wild
west world of a microcomputer.

  Lest you think this has nothing to do with a "real" computer like your PC
clone, consider this.  Instead of a '286 with 1 meg of memory, suppose it
contained ten Z80 controller boards, each with 64K of memory and a fast
network to tie them together.  Each Z80 handles a different device:
keyboard, screen, printer, modem, and one for each disk.  The rest are free
to run application programs, several at a time!

  Suppose you're doing word processing on this system.  The keyboard Z80
does spelling correction on data entry.  The screen Z80 displays your text
in bit-mapped fonts to match the printer's Z80, which is simultaneously
printing a file.  The Z80 running the word processor itself suffers no
annoying pauses or hesitations, since disk I/O is handled instantaneously
via each drive's Z80 track buffer.  Meanwhile, the modem's Z80 is
downloading a file while another assembles a program.  Pop-up utilities are
ready and waiting in still other Z80s in case they're needed.

  Such a system would clearly have half the hardware cost of a PC, yet
would outperform it by a wide margin.  True multi-tasking becomes child's
play with multiple processors.  More processors can be readily added for
even higher performance, or removed to save cost (or continue operation
while waiting for a replacement).

  If the computer scientists really want to further the micro-revolution,
they should stop trying to force antiquated mainframe languages onto micros
and concentrate on developing tools to maximize the use of micros as they
are!

[This article was originally published in issue 40 of The Computer Journal,
P.O. Box 12, South Plainfield, NJ 07080-0012 and is reproduced with the
permission of the author and the publisher. Further reproduction for non-
commercial purposes is authorized. This copyright notice must be retained.
(c) Copyright 1989, 1991 Socrates Press and respective authors]