Zoology Computer Systems
25 Harbord St.
University of Toronto
Toronto, Ont. M5S1A1 Canada
{allegra,ihnp4,decvax,utai}!utzoo!henry
ABSTRACT
Much is said about ``standing on other
people's shoulders, not their toes'', but in fact
the wheel is re-invented every day in the Unix/C
community. Worse, often it is re-invented badly,
with bumps, corners, and cracks. There are ways
of avoiding this: some of them bad, some of them
good, most of them under-appreciated and under-
used.
Introduction
``Everyone knows'' that the UNIX/C† community and its
programmers are the very paragons of re-use of software. In
some ways this is true. Brian Kernighan [1] and others have
waxed eloquent about how outstanding UNIX is as an environ-
ment for software re-use. Pipes, the shell, and the design
of programs as `filters' do much to encourage programmers to
build on others' work rather than starting from scratch.
Major applications can be, and often are, written without a
line of C. Of course, there are always people who insist on
doing everything themselves, often citing `efficiency' as
the compelling reason why they can't possibly build on the
work of others (see [2] for some commentary on this). But
surely these are the lamentable exceptions, rather than the
rule?
Well, in a word, no.
At the level of shell programming, yes, software re-use is
widespread in the UNIX/C community. Not quite as widespread
_________________________
† UNIX is a trademark of Bell Laboratories.
February 21, 1989
or as effective as it might be, but definitely common. When
the time comes to write programs in C, however, the situa-
tion changes. It took a radical change in directory format
to make people use a library to read directories. Many new
programs still contain hand-crafted code to analyze their
arguments, even though prefabricated help for this has been
available for years. C programmers tend to think that
``re-using software'' means being able to take the source
for an existing program and edit it to produce the source
for a new one. While that is a useful technique, there are
better ways.
Why does it matter that re-invention is rampant? Apart from
the obvious, that programmers have more work to do, I mean?
Well, extra work for the programmers is not exactly an
unmixed blessing, even from the programmers' viewpoint!
Time spent re-inventing facilities that are already avail-
able is time that is not available to improve user inter-
faces, or to make the program run faster, or to chase down
the proverbial Last Bug. Or, to get really picky, to make
the code readable and clear so that our successors can
understand it.
Even more seriously, re-invented wheels are often square.
Every re-typing of a line of code is a new chance
for bugs to be introduced. There will always be the tempta-
tion to take shortcuts based on how the code will be used;
shortcuts that may turn around and bite the programmer when
the program is modified or used for something unexpected.
An inferior algorithm may be used because it's ``good
enough'' and the better algorithms are too difficult to
reproduce on the spur of the moment... but the definition of
``good enough'' may change later. And unless the program is
well-commented [here we pause for laughter], the next person
who works on it will have to study the code at length to
dispel the suspicion that there is some subtle reason for
the seeming re-invention. Finally, to quote [2], if you
re-invent the square wheel, you will not benefit when some-
body else rounds off the corners.
In short, re-inventing the wheel ought to be a rare event,
occurring only for the most compelling reasons. Using an
existing wheel, or improving an existing one, is usually
superior in a variety of ways. There is nothing dishonor-
able about stealing code* to make life easier and better.
Theft via the Editor
UNIX historically has flourished in environments in which
full sources for the system are available. This led to the
_________________________
* Assuming no software licences, copyrights, patents,
etc. are violated!
most obvious and crudest way of stealing code: copy the
source of an existing program and edit it to do something
new.
This approach does have its advantages. By its nature, it
is the most flexible method of stealing code. It may be the
only viable approach when what is desired is some variant of
a complex algorithm that exists only within an existing pro-
gram; a good example was V7 dumpdir (which printed a table
of contents of a backup tape), visibly a modified copy of V7
restor (the only other program that understood the obscure
format of backup tapes). And it certainly is easy.
On the other hand, this approach also has its problems. It
creates two subtly-different copies of the same code, which
have to be maintained separately. Worse, they often have to
be maintained ``separately but simultaneously'', because the
new program inherits all the mistakes of the original. Fix-
ing the same bug repeatedly is so mind-deadening that there
is great temptation to fix it in only the program that is
actually giving trouble... which means that when the other
gives trouble, re-doing the cure must be preceded by re-
doing the investigation and diagnosis. Still worse, such
non-simultaneous bug fixes cause the variants of the code to
diverge steadily. This is also true of improvements and
cleanup work.
A program created in this way may also be inferior, in some
ways, to one created from scratch. Often there will be ves-
tigial code left over from the program's evolutionary ances-
tors. Apart from consuming resources (and possibly harbor-
ing bugs) without a useful purpose, such vestigial code
greatly complicates understanding the new program in isola-
tion.
There is also the possibility that the new program has
inherited a poor algorithm from the old one. This is actu-
ally a universal problem with stealing code, but it is espe-
cially troublesome with this technique because the original
program probably was not built with such re-use in mind.
Even if its algorithms were good for its intended purpose,
they may not be versatile enough to do a good job in their
new role.
One relatively clean form of theft via editing is to alter
the original program's source to generate either desired
program by conditional compilation. This eliminates most of
the problems. Unfortunately, it does so only if the two
programs are sufficiently similar that they can share most
of the source. When they diverge significantly, the result
can be a maintenance nightmare, actually worse than two
separate sources. Given a close similarity, though, this
method can work well.
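A minimal sketch of the technique (the DUMPDIR flag and the banner strings are invented for illustration, loosely echoing the dumpdir/restor pair mentioned earlier):

```c
/* Hypothetical sketch: one source file that yields either of two
 * near-identical programs, selected at compile time.  DUMPDIR is an
 * invented flag name, standing in for whatever the real pair needs. */
#ifdef DUMPDIR
#define BANNER "table of contents"
#else
#define BANNER "restoring files"
#endif

const char *banner(void)
{
    return BANNER;
}
```

Building with and without -DDUMPDIR then produces the two variants from one maintained source.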
Theft via Libraries
The obvious way of using somebody else's code is to call a
library function. Here, UNIX has had some success stories.
Almost everybody uses the stdio library rather than invent-
ing their own buffered-I/O package. (That may sound trivial
to those who never programmed on a V6 or earlier UNIX, but
in fact it's a great improvement on the earlier state of
affairs.) The simpler sorts of string manipulations are
usually done with the strxxx functions rather than by hand-
coding them, although efficiency issues and the wide diver-
sity of requirements have limited these functions to less
complete success. Nobody who knows about qsort bothers to
write his own sorting function.
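For instance, sorting an array of strings through qsort(3) takes only a comparison function, not a hand-rolled sort; a sketch:

```c
#include <stdlib.h>
#include <string.h>

/* Comparison function in the form qsort(3) expects: it receives
 * pointers to the array elements, here pointers to char pointers. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

void sort_strings(const char **v, size_t n)
{
    qsort(v, n, sizeof(v[0]), cmp_str);
}
```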
However, these success stories are pleasant islands in an
ocean of mud. The fact is that UNIX's libraries are a dis-
grace. They are well enough implemented, and their design
flaws are seldom more than nuisances, but there aren't
enough of them! Ironically, UNIX's ``poor cousin'', the
Software Tools community [3,4], has done much better at
this. Faced with a wild diversity of different operating
systems, they were forced to put much more emphasis on iden-
tifying clean abstractions for system services.
For example, the Software Tools version of ls runs
unchanged, without conditional compilation, on dozens of
different operating systems [4]. By contrast, UNIX programs
that read directories invariably dealt with the raw system
data structures, until Berkeley turned this cozy little
world upside-down with a change to those data structures.
The Berkeley implementors were wise enough to provide a
library for directory access, rather than just documenting
the new underlying structure. However, true to the UNIX
pattern, they designed a library which quietly assumed (in
some of its naming conventions) that the underlying system
used their structures! This particular nettle has finally
been grasped firmly by the IEEE POSIX project [5], at the
cost of yet another slightly-incompatible interface.
The adoption of the new directory libraries is not just a
matter of convenience and portability: in general the
libraries are faster than the hand-cooked code they replace.
Nevertheless, Berkeley's original announcement of the change
was greeted with a storm of outraged protest.
Directories, alas, are not an isolated example. The UNIX/C
community simply hasn't made much of an effort to identify
common code and package it for re-use. One of the two major
variants of UNIX still lacks a library function for binary
search, an algorithm which is notorious for both the perfor-
mance boost it can produce and the difficulty of coding a
fully-correct version from scratch. No major variant of
UNIX has a library function for either one of the following
code fragments, both omnipresent (or at least, they should
be omnipresent [6]) in simple* programs that use the
relevant facilities:
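The fragments themselves are missing from this copy of the paper; judging from the discussion that follows and from the footnote describing efopen and emalloc, they were approximately the checked open and the checked allocation, sketched here in packaged form (progname is an assumed global):

```c
#include <stdio.h>
#include <stdlib.h>

/* Reconstructed sketch of the two fragments the text refers to,
 * packaged as the efopen/emalloc functions the footnote describes:
 * same interface as fopen/malloc, but they complain and exit
 * instead of returning NULL. */
static const char *progname = "prog";   /* assumed global */

FILE *efopen(const char *file, const char *mode)
{
    FILE *f = fopen(file, mode);

    if (f == NULL) {
        fprintf(stderr, "%s: can't open file %s\n", progname, file);
        exit(1);
    }
    return f;
}

void *emalloc(size_t amount)
{
    void *p = malloc(amount);

    if (p == NULL) {
        fprintf(stderr, "%s: out of memory\n", progname);
        exit(1);
    }
    return p;
}
```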
These may sound utterly trivial, but in fact programmers
almost never produce as good an error message for fopen as
ten lines of library code can, and half the time the return
value from malloc isn't checked at all!
These examples illustrate a general principle, a side bene-
fit of stealing code: the way to encourage standardization†
and quality is to make it easier to be careful and standard
than to be sloppy and non-standard. On systems with library
functions for error-checked fopen and malloc, it is easier
to use the system functions, which take some care to do ``the
right thing'', than to kludge it yourself. This makes con-
verts very quickly.
These are not isolated examples. Studying the libraries of
most any UNIX system will yield other ideas for useful
library functions (as well as a lot of silly nonsense that
UNIX doesn't need, usually!). A few years of UNIX systems
programming also leads to recognition of repeated needs.
Does your* UNIX have library functions to:
• decide whether a filename is well-formed (contains no
control characters, shell metacharacters, or white
space, and is within any name-length limits your
_________________________
* I include the qualification ``simple'' because
complex programs often want to do more intelligent
error recovery than these code fragments suggest.
However, most of the programs that use these functions
don't need fancy error recovery, and the error
responses indicated are better than the ones those
programs usually have now!
† Speaking of encouraging standardization: we use the
names efopen and emalloc for the checked versions of
fopen and malloc, and arguments and returned values are
the same as the unchecked versions except that the
returned value is guaranteed non-NULL if the function
returns at all.
* As you might guess, my system has all of these. Most
of them are trivial to write, or are available in
public-domain forms.
system sets)?
• close all file descriptors except the standard ones?
• compute a standard CRC (Cyclic Redundancy Check
``checksum'')?
• operate on malloc'ed unlimited-length strings?
• do what access(2) does but using the effective
userid?
• expand metacharacters in a filename the same way the
shell does? (the simplest way to make sure that the
two agree is to use popen and echo for anything com-
plicated)
• convert integer baud rates to and from the speed
codes used by your system's serial-line ioctls?
• convert integer file modes to and from the rwx
strings used† to present such modes to humans?
• do a binary search through a file the way look(1)
does?
The above are fairly trivial examples of the sort of things
that ought to be in UNIX libraries. More sophisticated
libraries can also be useful, especially if the language
provides better support for them than C does; C++ is an
example [7]. Even in C, though, there is much room for
improvement.
Adding library functions does have its disadvantages. The
interface to a library function is important, and getting it
right is hard. Worse, once users have started using one
version of an interface, changing it is very difficult even
when hindsight clearly shows mistakes; the near-useless
return values of some of the common UNIX library functions
are obvious examples. Satisfactory handling of error condi-
tions can be difficult. (For example, the error-checking
malloc mentioned earlier is very handy for programmers, but
invoking it from a library function would be a serious mis-
take, removing any possibility of more intelligent response
to that error.) And there is the perennial headache of try-
ing to get others to adopt your pet function, so that pro-
grams using it can be portable without having to drag the
source of the function around too. For all this, though,
libraries are in many ways the most satisfactory way of
_________________________
† If you think only ls uses these, consider that rm and
some similar programs ought to use rwx strings, not
octal modes, when requesting confirmation!
encouraging code theft.
Alas, encouraging code theft does not guarantee it. Even
widely-available library functions often are not used nearly
as much as they should be. A conspicuous example is getopt,
for command-line argument parsing. Getopt supplies only
quite modest help in parsing the command line, but the stan-
dardization and consistency that its use produces is still
quite valuable; there are far too many pointless variations
in command syntax in the hand-cooked argument parsers in
most UNIX programs. Public-domain implementations of getopt
have been available for years, and AT&T has published (!)
the source for the System V implementation. Yet people con-
tinue to write their own argument parsers. There is one
valid reason for this, to be discussed in the next section.
There are also a number of excuses, mostly the standard ones
for not using library functions:
• ``It doesn't do quite what I want.'' But often it is
close enough to serve, and the combined benefits of
code theft and standardization outweigh the minor
mismatches.
• ``Calling a library function is too inefficient.''
This is mostly heard from people who have never pro-
filed their programs and hence have no reliable
information about what their code's efficiency prob-
lems are [2].
• ``That whole concept is ugly, and should be
redesigned.'' (Often said of getopt, since the usual
UNIX single-letter-option syntax that getopt imple-
ments is widely criticized as user-hostile.) How
likely is it that the rest of the world will go along
with your redesign (assuming you ever finish it)?
Consistency and a high-quality implementation are
valuable even if the standard being implemented is
suboptimal.
• ``I would have done it differently.'' The triumph of
personal taste over professional programming.
Theft via Templates
Templates are a major and much-neglected approach to code
sharing: ``boilerplate'' programs which contain a
carefully-written skeleton for some moderately stereotyped
task, which can then be adapted and filled in as needed.
This method has some of the vices of modifying existing pro-
grams, but the template can be designed for the purpose,
with attention to quality and versatility.
Templates can be particularly useful when library functions
are used in a stereotyped way that is a little complicated
to write from scratch; getopt is an excellent example. The
one really valid objection to getopt is that its invocation
is not trivial, and typing in the correct sequence from
scratch is a real test of memory. The usual getopt manual
page contains a lengthy example which is essentially a tem-
plate for a getopt-using program.
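A stripped-down sketch of such a template (the option letters are invented; the getopt(3) calling sequence is the standard one, and the blank is marked `xxx' in the usual template style):

```c
#include <stdio.h>
#include <unistd.h>

/* Skeleton argument parser in the usual getopt(3) style.
 * Returns the count of -v flags seen, or -1 on a bad option;
 * after it returns, argv[optind..] are the file arguments. */
int parse_args(int argc, char **argv)
{
    int c;
    int vflag = 0;

    while ((c = getopt(argc, argv, "vo:")) != -1) {
        switch (c) {
        case 'v':               /* simple flag */
            vflag++;
            break;
        case 'o':               /* option taking an argument */
            /* optarg points at the argument; xxx use it here */
            break;
        default:                /* getopt has already complained */
            return -1;
        }
    }
    return vflag;
}
```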
When the first public-domain getopt appeared, it quickly
became clear that it would be convenient to have a template
for its use handy. This template eventually grew to incor-
porate a number of other things: a useful macro or two,
definition of main, opening of files in the standard UNIX
filter fashion, checking for mistakes like opening a direc-
tory, filename and line-number tracking for error messages,
and some odds and ends. The full current version can be
found in the Appendix; actually it diverged into two dis-
tinct versions when it became clear that some filters wanted
the illusion of a single input stream, while others wanted
to handle each input file individually (or didn't care).
The obvious objection to this line of development is ``it's
more complicated than I need''. In fact, it turns out to be
surprisingly convenient to have all this machinery presup-
plied. It is much easier to alter or delete lines of code
than to add them. If directories are legitimate input, just
delete the code that catches them. If no filenames are
allowed as input, or exactly one must be present, change one
line of code to enforce the restriction and a few more to
deal with the arguments correctly. If the arguments are not
filenames at all, just delete the bits of code that assume
they are. And so forth.
The job of writing an ordinary filter-like program is
reduced to filling in two or three blanks* in the template,
and then writing the code that actually processes the data.
Even quick improvisations become good-quality programs,
doing things the standard way with all the proper amenities,
because even a quick improvisation is easier to do by start-
ing from the template. Templates are an unmixed blessing;
anyone who types a non-trivial program in from scratch is
wasting his time and his employer's money.
Templates are also useful for other stereotyped files, even
ones that are not usually thought of as programs. Most ver-
sions of UNIX have a simple template for manual pages hiding
somewhere (in V7 it was /usr/man/man0/xx). Shell files that
want to analyze complex argument lists have the same getopt
problem as C programs, with the same solution. There is
_________________________
* All marked with the string `xxx' to make them easy
for a text editor to find.
enough machinery in a ``production-grade'' make file to make
a template worthwhile, although this one tends to get
altered fairly heavily; our current one is in the Appendix.
Theft via Inclusion
Source inclusion (#include) provides a way of sharing both
data structures and executable code. Header files (e.g.
stdio.h) in particular tend to be taken for granted. Again,
those who haven't been around long enough to remember V6
UNIX may have trouble grasping what a revolution it was when
V7 introduced systematic use of header files!
However, even mundane header files could be rather more use-
ful than they normally are now. Data structures in header
files are widely accepted, but there is somewhat less use of
them to declare the return types of functions. One or two
common header files like stdio.h and math.h do this, but
programmers are still used to the idea that the type of
(e.g.) atol has to be typed in by hand. Actually, all too
often the programmer says ``oh well, on my machine it works
out all right if I don't bother declaring atol'', and the
result is dirty and unportable code. The X3J11 draft ANSI
standard for C addresses this by defining some more header
files and requiring their use for portable programs, so that
the header files can do all the work and do it right.
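A sketch of the kind of header meant here (the file name decls.h is invented; under ANSI C these declarations live in stdlib.h):

```c
/* decls.h (hypothetical): declare library return types once,
 * so no source file ever hand-types "extern long atol();" again.
 * Pre-ANSI systems had no standard header for these, hence
 * private files like this one. */
#ifndef DECLS_H
#define DECLS_H

extern long atol(const char *);
extern double atof(const char *);
extern char *getenv(const char *);

#endif
```

A source file that says #include "decls.h" then gets the correct types everywhere, instead of whatever the programmer happened to remember.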
In principle, source inclusion can be used for more than
just header files. In practice, almost anything that can be
done with source inclusion can be done, and usually done
more cleanly, with header files and libraries. There are
occasional specialized exceptions, such as using macro
definitions and source inclusion to fake parameterized data
types.
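For instance, macros can fake a parameterized stack type; a sketch of the trick (all names invented), not a recommendation of the style:

```c
/* Faking a parameterized data type with macros: each expansion of
 * DECLARE_STACK creates a fixed-size stack of a specific element type. */
#define DECLARE_STACK(name, type, size) \
    struct name {                       \
        type items[size];               \
        int top;                        \
    }

#define STACK_INIT(s)     ((s)->top = 0)
#define STACK_PUSH(s, v)  ((s)->items[(s)->top++] = (v))
#define STACK_POP(s)      ((s)->items[--(s)->top])

DECLARE_STACK(intstack, int, 100);
```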
Theft via Invocation
Finally, it is often possible to steal another program's
code simply by invoking that program. Invoking other pro-
grams via system or popen for things that are easily done in
C is a common beginner's error. More experienced program-
mers can go too far the other way, however, insisting on
doing everything in C, even when a leavening of other
methods would give better results. The best way to sort a
large file is probably to invoke sort(1), not to do it your-
self. Even invoking a shell file can be useful, although a
bit odd-seeming to most C programmers, when elaborate file
manipulation is needed and efficiency is not critical.
Aside from invoking other programs at run time, it can also
be useful to invoke them at compile time. Particularly when
dealing with large tables, it is often better to dynamically
generate the C code from some more compact and readable
notation. Yacc and lex are familiar examples of this on a
large scale, but simple _s_e_d and _a_w_k programs can build
tables in more specialized, application-specific ways.
Whether this is really theft is debatable, but it's a valu-
able technique all the same. It can neatly bypass a lot of
objections that start with ``but C won't let me write...''.
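A sketch of what such a generated table might look like (the speed codes shown are the traditional V7/termio values, an assumption; check your system's headers, and the awk command in the comment is merely illustrative):

```c
#include <string.h>

/* Sketch: a table better maintained as a small text file, with the C
 * generated at build time by something like (hypothetical):
 *     awk '{ printf "    { \"%s\", %s },\n", $1, $2 }' speeds.txt
 * The entries below are what such a generator might emit. */
struct speed {
    const char *name;
    int code;           /* traditional B300/B1200/B9600 values */
};

static const struct speed speeds[] = {
    { "300",  7 },
    { "1200", 9 },
    { "9600", 13 },
};

int speed_code(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(speeds) / sizeof(speeds[0]); i++)
        if (strcmp(speeds[i].name, name) == 0)
            return speeds[i].code;
    return -1;
}
```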
An Excess of Invention
With all these varied methods, why is code theft not more
widespread? Why are so many programs unnecessarily invented
from scratch?
The most obvious answer is the hardest to counter: theft
requires that there be something to steal. Use of library
functions is impossible unless somebody sets up a library.
Designing the interfaces for library functions is not easy.
Worse, doing it well requires insight, which generally isn't
available on demand. The same is true, to varying degrees,
for the other forms of theft.
Despite its reputation as a hotbed of software re-use, UNIX
is actually hostile to some of these activities. If UNIX
directories had been complex and obscure, directory-reading
libraries would have been present from the beginning. As it
is, it was simply too easy to do things ``the hard way''.
There still is no portable set of functions to perform the
dozen or so useful manipulations of terminal modes that a
user program might want to do, a major nuisance because
changing those modes ``in the raw'' is simple but highly
unportable.
Finally, there is the Not Invented Here syndrome, and its
relatives, Not Good Enough and Not Understood Here. How
else to explain AT&T UNIX's persistent lack of the dbm
library for hashed databases (even though it was developed
at Bell Labs and hence is available to AT&T), and Berkeley
UNIX's persistent lack of the full set of strxxx functions
(even though a public-domain implementation has existed for
years)? The X3J11 and POSIX efforts are making some pro-
gress at developing a common nucleus of functionality, but
they are aiming at a common subset of current systems, when
what is really wanted is a common superset.
Conclusion
In short, never build what you can (legally) steal! Done
right, it yields better programs for less work.
References
[1] Brian W. Kernighan, The Unix System and Software Reusa-
bility, IEEE Transactions on Software Engineering, Vol.
SE-10, No. 5, Sept. 1984, pp. 513-8.
[2] Geoff Collyer and Henry Spencer, News Need Not Be Slow,
Usenix Winter 1987 Technical Conference, pp. 181-190.
[3] Brian W. Kernighan and P.J. Plauger, Software Tools,
Addison-Wesley, Reading, Mass. 1976.
[4] Mike O'Dell, UNIX: The World View, Usenix Winter 1987
Technical Conference, pp. 35-45.
[5] IEEE, IEEE Trial-Use Standard 1003.1 (April 1986): Port-
able Operating System for Computer Environments, IEEE
and Wiley-Interscience, New York, 1986.
[6] Ian Darwin and Geoff Collyer, Can't Happen or /*
NOTREACHED */ or Real Programs Dump Core, Usenix Winter
1985 Technical Conference, pp. 136-151.