HTML "The Use of Name Spaces in Plan 9"
TL
The Use of Name Spaces in Plan 9
AU
Rob Pike
Dave Presotto
Ken Thompson
Howard Trickey
Phil Winterbottom
AI
MH
USA
AB
FS
Appeared in
I
Operating Systems Review,
R
Vol. 27, #2, April 1993, pp. 72-76
(reprinted from
I
Proceedings of the 5th ACM SIGOPS European Workshop,
R
Mont Saint-Michel, 1992, Paper nº 34).
FE
Plan 9 is a distributed system built at the Computing Sciences Research
Center of AT&T Bell Laboratories (now Lucent Technologies, Bell Labs) over the last few years.
Its goal is to provide a production-quality system for software
development and general computation using heterogeneous hardware
and minimal software.  A Plan 9 system comprises CPU and file
servers in a central location connected together by fast networks.
Slower networks fan out to workstation-class machines that serve as
user terminals.  Plan 9 argues that given a few carefully
implemented abstractions
it is possible to
produce a small operating system that provides support for the largest systems
on a variety of architectures and networks. The foundations of the system are
built on two ideas: a per-process name space and a simple message-oriented
file system protocol.
AE
PP
The operating system for the CPU servers and terminals is
structured as a traditional kernel: a single compiled image
containing code for resource management, process control,
user processes,
virtual memory, and I/O.  Because the file server is a separate
machine, the file system is not compiled in, although the management
of the name space, a per-process attribute, is.
The entire kernel for the multiprocessor SGI Power Series machine
is 25000 lines of C,
the largest part of which is code for four networks including the
Ethernet with the Internet protocol suite.
Fewer than 1500 lines are machine-specific, and a
functional kernel with minimal I/O can be put together from
source files totaling 6000 lines. [Pike90]
PP
The system is relatively small for several reasons.
First, it is all new: it has not had time to accrete as many fixes
and features as other systems.
Also, other than the network protocol, it adheres to no
external interface; in particular, it is not Unix-compatible.
Economy stems from careful selection of services and interfaces.
Finally, wherever possible the system is built around
two simple ideas:
every resource in the system, either local or remote,
is represented by a hierarchical file system; and
a user or process
assembles a private view of the system by constructing a file
I
name space
R
that connects these resources. [Needham]
SH
File Protocol
PP
All resources in Plan 9 look like file systems.
That does not mean that they are repositories for
permanent files on disk, but that the interface to them
is file-oriented: finding files (resources) in a hierarchical
name tree, attaching to them by name, and accessing their contents
by read and write calls.
There are dozens of file system types in Plan 9, but only a few
represent traditional files.
At this level of abstraction, files in Plan 9 are similar
to objects, except that files are already provided with naming,
access, and protection methods that must be created afresh for
objects.  Object-oriented readers may approach the rest of this
paper as a study in how to make objects look like files.
PP
The interface to file systems is defined by a protocol, called 9P,
analogous but not very similar to the NFS protocol.
The protocol talks about files, not blocks; given a connection to the root
directory of a file server,
the 9P messages navigate the file hierarchy, open files for I/O,
and read or write arbitrary bytes in the files.
9P contains 17 message types: three for
initializing and
authenticating a connection and fourteen for manipulating objects.
The messages are generated by the kernel in response to user- or
kernel-level I/O requests.
Here is a quick tour of the major message types.
The
CW auth
and
CW attach
messages authenticate a connection, established by means outside 9P,
and validate its user.
The result is an authenticated
I channel
that points to the root of the
server.
The
CW clone
message makes a new channel identical to an existing channel,
which may be moved to a file on the server using a
CW walk
message to descend each level in the hierarchy.
The
CW stat
and
CW wstat
messages read and write the attributes of the file pointed to by a channel.
The
CW open
message prepares a channel for subsequent
CW read
and
CW write
messages to access the contents of the file, while
CW create
and
CW remove
perform, on the files, the actions implied by their names.
The
CW clunk
message discards a channel without affecting the file.
None of the 9P messages consider caching; file caches are provided,
when needed, either within the server (centralized caching)
or by implementing the cache as a transparent file system between the
client and the 9P connection to the server (client caching).
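The channel discipline described above can be sketched in a few lines of C. This is an in-memory model, not the real protocol: a small tree stands in for a server, and clone/walk/read mimic how a channel is duplicated, moved down the hierarchy, and finally used for I/O. All structure and function names here are illustrative, not taken from the Plan 9 source.

```c
/* A minimal sketch of the 9P channel model: an in-memory tree
 * stands in for a server; clone/walk/read mimic how the kernel
 * navigates to a file before doing I/O.  Illustrative only. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct File File;
struct File {
	const char *name;
	const char *data;	/* NULL for directories */
	File *child, *sibling;
};

typedef struct { File *f; } Chan;	/* a channel points at one file */

/* clone: a new channel identical to an existing one */
Chan
clonechan(Chan c)
{
	return c;
}

/* walk: move a channel to a named entry one level down; 0 on success */
int
walk(Chan *c, const char *name)
{
	File *k;

	for(k = c->f->child; k != NULL; k = k->sibling)
		if(strcmp(k->name, name) == 0){
			c->f = k;
			return 0;
		}
	return -1;
}

/* read: copy file contents, standing in for a read message */
long
readchan(Chan c, char *buf, long n)
{
	long len;

	if(c.f->data == NULL)
		return -1;	/* directory */
	len = (long)strlen(c.f->data);
	if(n < len)
		len = n;
	memcpy(buf, c.f->data, (size_t)len);
	return len;
}
```

Note that walking a clone leaves the original channel pointing at the root, just as a clone followed by walk messages leaves the cloned channel untouched.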
PP
For efficiency, the connection to local
kernel-resident file systems, misleadingly called
I devices,
is by regular rather than remote procedure calls.
The procedures map one-to-one with 9P message types.
Locally each channel has an associated data structure
that holds a type field used to index
a table of procedure calls, one set per file system type,
analogous to selecting the method set for an object.
One kernel-resident file system, the
I
mount device,
R
translates the local 9P procedure calls into RPC messages to
remote services over a separately provided transport protocol
such as TCP or IL, a new reliable datagram protocol, or over a pipe to
a user process.
Write and read calls transmit the messages over the transport layer.
The mount device is the sole bridge between the procedural
interface seen by user programs and remote and user-level services.
It does all associated marshaling, buffer
management, and multiplexing and is
the only integral RPC mechanism in Plan 9.
The mount device is in effect a proxy object.
There is no RPC stub compiler; instead the mount driver and
all servers just share a library that packs and unpacks 9P messages.
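The packing such a shared library does can be sketched as follows: fixed-size integers are laid into a byte buffer in a defined order (little-endian here), so both ends agree regardless of machine byte order. The field layout and type code below are illustrative, not the real 9P wire format.

```c
/* A sketch of message marshaling in the style of a shared 9P
 * pack/unpack library.  Layout is illustrative, not real 9P. */
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;
typedef unsigned int uint;

enum { Twalk = 110 };	/* illustrative message type code */

typedef struct {
	uchar type;
	uint fid;	/* channel identifier */
	uint count;
} Msg;

/* put a 32-bit value into the buffer, low byte first */
static void
put4(uchar *p, uint v)
{
	p[0] = v; p[1] = v>>8; p[2] = v>>16; p[3] = v>>24;
}

static uint
get4(const uchar *p)
{
	return p[0] | (uint)p[1]<<8 | (uint)p[2]<<16 | (uint)p[3]<<24;
}

/* pack/unpack return the number of bytes consumed */
size_t
packmsg(uchar *buf, const Msg *m)
{
	buf[0] = m->type;
	put4(buf+1, m->fid);
	put4(buf+5, m->count);
	return 9;
}

size_t
unpackmsg(const uchar *buf, Msg *m)
{
	m->type = buf[0];
	m->fid = get4(buf+1);
	m->count = get4(buf+5);
	return 9;
}
```

Because the byte order on the wire is fixed by the pack routines rather than by the host, a big-endian server and a little-endian client exchange messages without conversion code anywhere else.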
SH
Examples
PP
One file system type serves
permanent files from the main file server,
a stand-alone multiprocessor system with a
350-gigabyte
optical WORM jukebox that holds the data, fronted by a two-level
block cache comprising 7 gigabytes of
magnetic disk and 128 megabytes of RAM.
Clients connect to the file server using any of a variety of
networks and protocols and access files using 9P.
The file server runs a distinct operating system and has no
support for user processes; other than a restricted set of commands
available on the console, all it does is answer 9P messages from clients.
PP
Once a day, at 5:00 AM,
the file server sweeps through the cache blocks and marks dirty blocks
copy-on-write.
It creates a copy of the root directory
and labels it with the current date, for example
CW 1995/0314 .
It then starts a background process to copy the dirty blocks to the WORM.
The result is that the server retains an image of the file system as it was
early each morning.
The set of old root directories is accessible using 9P, so a client
may examine backup files using ordinary commands.
Several advantages stem from having the backup service implemented
as a plain file system.
Most obviously, ordinary commands can access them.
For example, to see when a bug was fixed
P1
grep 'mouse bug fix' 1995/*/sys/src/cmd/8½/file.c
P2
The owner, access times, permissions, and other properties of the
files are also backed up.
Because it is a file system, the backup
still has protections;
it is not possible to subvert security by looking at the backup.
PP
The file server is only one type of file system.
A number of unusual services are provided within the kernel as
local file systems.
These services are not limited to I/O devices such
as disks.  They include network devices and their associated protocols,
the bitmap display and mouse,
a representation of processes similar to
CW /proc
[Killian], the name/value pairs that form the `environment'
passed to a new process, profiling services,
and other resources.
Each of these is represented as a file system \(em
directories containing sets of files \(em
but the constituent files do not represent permanent storage on disk.
Instead, they are closer in properties to UNIX device files.
PP
For example, the
I console
device contains the file
CW /dev/cons ,
similar to the UNIX file
CW /dev/console :
when written,
CW /dev/cons
appends to the console typescript; when read,
it returns characters typed on the keyboard.
Other files in the console device include
CW /dev/time ,
the number of seconds since the epoch,
CW /dev/cputime ,
the computation time used by the process reading the device,
CW /dev/pid ,
the process id of the process reading the device, and
CW /dev/user ,
the login name of the user accessing the device.
All these files contain text, not binary numbers,
so their use is free of byte-order problems.
Their contents are synthesized on demand when read; when written,
they cause modifications to kernel data structures.
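The benefit of the textual representation can be shown concretely. Reading a file like /dev/time yields decimal text, which any machine parses identically; a raw 4-byte integer would depend on byte order. The helper functions below are illustrative, not Plan 9 library routines.

```c
/* Why text beats binary for device files: decimal strings parse
 * the same way on every architecture.  Helpers are illustrative. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* format seconds-since-epoch as the text a read of /dev/time
 * might return */
int
timetostr(char *buf, size_t n, long sec)
{
	return snprintf(buf, n, "%ld", sec);
}

/* parse such text back; identical code works on any machine */
long
strtotime(const char *buf)
{
	return strtol(buf, NULL, 10);
}
```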
PP
The
I process
device contains one directory per live local process, named by its numeric
process id:
CW /proc/1 ,
CW /proc/2 ,
etc.
Each directory contains a set of files that access the process.
For example, in each directory the file
CW mem
is an image of the virtual memory of the process that may be read or
written for debugging.
The
CW text
file is a sort of link to the file from which the process was executed;
it may be opened to read the symbol tables for the process.
The
CW ctl
file may be written textual messages such as
CW stop
or
CW kill
to control the execution of the process.
The
CW status
file contains a fixed-format line of text containing information about
the process: its name, owner, state, and so on.
Text strings written to the
CW note
file are delivered to the process as
I notes,
analogous to UNIX signals.
By providing these services as textual I/O on files rather
than as system calls (such as
CW kill )
or special-purpose operations (such as
CW ptrace ),
the Plan 9 process device simplifies the implementation of
debuggers and related programs.
For example, the command
P1
cat /proc/*/status
P2
is a crude form of the
CW ps
command; the actual
CW ps
merely reformats the data so obtained.
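The reason such a reformatting tool stays small is that a fixed-format textual status line needs only ordinary string code to parse. The three-field layout below (name, owner, state) is a simplification of the real status file, which carries more columns.

```c
/* A sketch of parsing a /proc status line, as a ps-like tool
 * might.  The three-field layout is a simplification. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
	char name[28];
	char owner[28];
	char state[28];
} Status;

/* parse one status line; 0 on success, -1 on bad format */
int
parsestatus(const char *line, Status *s)
{
	if(sscanf(line, "%27s %27s %27s", s->name, s->owner, s->state) != 3)
		return -1;
	return 0;
}
```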
PP
The
I bitmap
device contains three files,
CW /dev/mouse ,
CW /dev/screen ,
and
CW /dev/bitblt ,
that provide an interface to the local bitmap display (if any) and pointing device.
The
CW mouse
file returns a fixed-format record containing
1 byte of button state and 4 bytes each of
I x
and
I y
position of the mouse.
If the mouse has not moved since the file was last read, a subsequent read will
block.
The
CW screen
file contains a memory image of the contents of the display;
the
CW bitblt
file provides a procedural interface.
Calls to the graphics library are translated into messages that are written
to the
CW bitblt
file to perform bitmap graphics operations.  (This is essentially a nested
RPC protocol.)
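Reading the fixed-format mouse record described above amounts to decoding 9 bytes: 1 byte of buttons followed by 4 bytes each of x and y. The byte order within the 4-byte fields is an assumption here (low byte first); the text does not specify it.

```c
/* A sketch of decoding the 9-byte /dev/mouse record: 1 button
 * byte, then 4 bytes each of x and y.  Little-endian fields and
 * a 32-bit int are assumptions, not stated in the text. */
#include <assert.h>

typedef unsigned char uchar;

typedef struct {
	int buttons;
	long x, y;
} Mouse;

/* signed 32-bit value, low byte first; assumes 32-bit int */
static long
get4s(const uchar *p)
{
	unsigned long v;

	v = (unsigned long)p[0] | (unsigned long)p[1]<<8
		| (unsigned long)p[2]<<16 | (unsigned long)p[3]<<24;
	return (long)(int)v;
}

/* decode the record a read of /dev/mouse would return */
void
decodemouse(const uchar buf[9], Mouse *m)
{
	m->buttons = buf[0];
	m->x = get4s(buf+1);
	m->y = get4s(buf+5);
}
```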
PP
The various services being used by a process are gathered together into the
process's
I
name space,
R
a single rooted hierarchy of file names.
When a process forks, the child process shares the name space with the parent.
Several system calls manipulate name spaces.
Given a file descriptor
CW fd
that holds an open communications channel to a service,
the call
P1
mount(int fd, char *old, int flags)
P2
authenticates the user and attaches the file tree of the service to
the directory named by
CW old .
The
CW flags
specify how the tree is to be attached to
CW old :
replacing the current contents or appearing before or after the
current contents of the directory.
A directory with several services mounted is called a
I union
directory and is searched in the specified order.
The call
P1
bind(char *new, char *old, int flags)
P2
takes the portion of the existing name space visible at
CW new ,
either a file or a directory, and makes it also visible at
CW old .
For example,
P1
bind("1995/0301/sys/include", "/sys/include", REPLACE)
P2
causes the directory of include files to be overlaid with its
contents from the dump on March first.
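Union-directory search order can be sketched with a small model: a name is looked up in each mounted tree in mount sequence, and the first hit wins, so a tree bound before /bin shadows it. The table-of-strings representation is purely illustrative; the kernel of course searches real directories.

```c
/* A sketch of union-directory lookup: trees are searched in
 * mount order and the first match wins.  Illustrative model. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
	const char **names;	/* files visible in one mounted tree */
	int n;
} Dir;

/* search the union in mount order; returns the index of the
 * tree that provides the name, or -1 if no tree does */
int
unionlookup(const Dir *u, int ndirs, const char *name)
{
	int i, j;

	for(i = 0; i < ndirs; i++)
		for(j = 0; j < u[i].n; j++)
			if(strcmp(u[i].names[j], name) == 0)
				return i;
	return -1;
}
```

With a private bin unioned before /bin, a private copy of a command shadows the system one, which is exactly the effect a PATH search achieves, but expressed in the name space itself.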
PP
A process is created by the
CW rfork
system call, which takes as argument a bit vector defining which
attributes of the process are to be shared between parent
and child instead of copied.
One of the attributes is the name space: when shared, changes
made by either process are visible in the other; when copied,
changes are independent.
PP
Although there is no global name space,
for a process to function sensibly the local name spaces must adhere
to global conventions.
Nonetheless, the use of local name spaces is critical to the system.
Both these ideas are illustrated by the use of the name space to
handle heterogeneity.
The binaries for a given architecture are contained in a directory
named by the architecture, for example
CW /mips/bin ;
in use, that directory is bound to the conventional location
CW /bin .
Programs such as shell scripts need not know the CPU type they are
executing on to find binaries to run.
A directory of private binaries
is usually unioned with
CW /bin .
(Compare this to the
I
ad hoc
R
and special-purpose idea of the
CW PATH
variable, which is not used in the Plan 9 shell.)
Local bindings are also helpful for debugging, for example by binding
an old library to the standard place and linking a program to see
if recent changes to the library are responsible for a bug in the program.
PP
The window system,
CW 8½
[Pike91], is a server for files such as
CW /dev/cons
and
CW /dev/bitblt .
Each client sees a distinct copy of these files in its local
name space: there are many instances of
CW /dev/cons ,
each served by
CW 8½
to the local name space of a window.
Again,
CW 8½
implements services using
local name spaces plus the use
of I/O to conventionally named files.
Each client just connects its standard input, output, and error files
to
CW /dev/cons ,
with analogous operations to access bitmap graphics.
Compare this to the implementation of
CW /dev/tty
on UNIX, which is done by special code in the kernel
that overloads the file, when opened,
with the standard input or output of the process.
Special arrangement must be made by a UNIX window system for
CW /dev/tty
to behave as expected;
CW 8½
instead makes the provision of the corresponding file its
central idea, which depends critically on local name spaces to succeed.
PP
The environment
CW 8½
provides its clients is exactly the environment under which it is implemented:
a conventional set of files in
CW /dev .
This permits the window system to be run recursively in one of its own
windows, which is handy for debugging.
It also means that if the files are exported to another machine,
as described below, the window system or client applications may be
run transparently on remote machines, even ones without graphics hardware.
This mechanism is used for Plan 9's implementation of the X window
system: X is run as a client of
CW 8½ ,
often on a remote machine with lots of memory.
In this configuration, using Ethernet to connect
MIPS machines, we measure only a 10% degradation in graphics
performance relative to running X on
a bare Plan 9 machine.
PP
An unusual application of these ideas is a statistics-gathering
file system implemented by a command called
CW iostats .
The command encapsulates a process in a local name space, monitoring 9P
requests from the process to the outside world \(em the name space in which
CW iostats
is itself running.  When the command completes,
CW iostats
reports usage and performance figures for file activity.
For example
P1
iostats 8½
P2
can be used to discover how much I/O the window system
does to the bitmap device, font files, and so on.
PP
The
CW import
command connects a piece of name space from a remote system
to the local name space.
Its implementation is to dial the remote machine and start
a process there that serves the remote name space using 9P.
It then calls
CW mount
to attach the connection to the name space and finally dies;
the remote process continues to serve the files.
One use is to access devices not available
locally.  For example, to write a floppy one may say
P1
import lab.pc /a: /n/dos
cp foo /n/dos/bar
P2
The call to
CW import
connects the file tree from
CW /a:
on the machine
CW lab.pc
(which must support 9P) to the local directory
CW /n/dos .
Then the file
CW foo
can be written to the floppy just by copying it across.
PP
Another application is remote debugging:
P1
import helix /proc
P2
makes the process file system on machine
CW helix
available locally; commands such as
CW ps
then see
CW helix 's
processes instead of the local ones.
The debugger may then look at a remote process:
P1
db /proc/27/text /proc/27/mem
P2
allows breakpoint debugging of the remote process.
Since
CW db
infers the CPU type of the process from the executable header on
the text file, it supports
cross-architecture debugging, too.
Care is taken within
CW db
to handle issues of byte order and floating point; it is possible to
breakpoint debug a big-endian MIPS process from a little-endian i386.
PP
Network interfaces are also implemented as file systems [Presotto].
For example,
CW /net/tcp
is a directory somewhat like
CW /proc :
it contains a set of numbered directories, one per connection,
each of which contains files to control and communicate on the connection.
A process allocates a new connection by accessing
CW /net/tcp/clone ,
which evaluates to the directory of an unused connection.
To make a call, the process writes a textual message such as
CW 'connect
CW 135.104.53.2!512'
to the
CW ctl
file and then reads and writes the
CW data
file.
An
CW rlogin
service can be implemented in a few lines of shell code.
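The control messages involved are plain text, so building and parsing them needs no binary marshaling. The connect message follows the form shown above; the helper functions are illustrative, not Plan 9 library routines.

```c
/* A sketch of the textual control protocol for /net/tcp: the
 * dial string is plain text in the form "connect addr!port".
 * Helpers are illustrative, not library routines. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* build the message a client writes to the connection's ctl file */
int
mkconnect(char *buf, size_t n, const char *addr, int port)
{
	return snprintf(buf, n, "connect %s!%d", addr, port);
}

/* a server-side parse of the same message; 0 on success */
int
parseconnect(const char *msg, char addr[64], int *port)
{
	/* %63[^!] reads the address up to the ! separator */
	if(sscanf(msg, "connect %63[^!]!%d", addr, port) != 2)
		return -1;
	return 0;
}
```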
PP
This structure makes network gatewaying easy to provide.
We have machines with Datakit interfaces but no Internet interface.
On such a machine one may type
P1
import helix /net
telnet tcp!ai.mit.edu
P2
The
CW import
uses Datakit to pull in the TCP interface from
CW helix ,
which can then be used directly; the
CW tcp!
notation is necessary because we routinely use multiple networks
and protocols on Plan 9\(emit identifies the network in which
CW ai.mit.edu
is a valid name.
PP
In practice we do not use
CW rlogin
or
CW telnet
between Plan 9 machines.  Instead a command called
CW cpu
in effect replaces the CPU in a window with that
on another machine, typically a fast multiprocessor CPU server.
The implementation is to recreate the
name space on the remote machine, using the equivalent of
CW import
to connect pieces of the terminal's name space to that of
the process (shell) on the CPU server, making the terminal
a file server for the CPU.
CPU-local devices such as fast file system connections
are still local; only terminal-resident devices are
imported.
The result is unlike UNIX
CW rlogin ,
which moves into a distinct name space on the remote machine,
or file sharing with
CW NFS ,
which keeps the name space the same but forces processes to execute
locally.
Bindings in
CW /bin
may change because of a change in CPU architecture, and
the networks involved may be different because of differing hardware,
but the effect feels like simply speeding up the processor in the
current name space.
SH
Position
PP
These examples illustrate how the ideas of representing resources
as file systems and per-process name spaces can be used to solve
problems often left to more exotic mechanisms.
Nonetheless there are some operations in Plan 9 that are not
mapped into file I/O.
An example is process creation.
We could imagine a message to a control file in
CW /proc
that creates a process, but the details of
constructing the environment of the new process \(em its open files,
name space, memory image, etc. \(em are too intricate to
be described easily in a simple I/O operation.
Therefore new processes on Plan 9 are created by fairly conventional
CW rfork
and
CW exec
system calls;
CW /proc
is used only to represent and control existing processes.
PP
Plan 9 does not attempt to map network name spaces into the file
system name space, for several reasons.
The different addressing rules for various networks and protocols
cannot be mapped uniformly into a hierarchical file name space.
Even if they could be,
the various mechanisms to authenticate,
select a service,
and control the connection would not map consistently into
operations on a file.
PP
Shared memory is another resource not adequately represented by a
file name space.
Plan 9 takes care to provide mechanisms
to allow groups of local processes to share and map memory.
Memory is controlled
by system calls rather than special files, however,
since a representation in the file system would imply that memory could
be imported from remote machines.
PP
Despite these limitations, file systems and name spaces offer an effective
model around which to build a distributed system.
Used well, they can provide a uniform, familiar, transparent
interface to a diverse set of distributed resources.
They carry well-understood properties of access, protection,
and naming.
The integration of devices into the hierarchical file system
was the best idea in UNIX.
Plan 9 pushes the concepts much further and shows that
file systems, when used inventively, have plenty of scope
for productive research.
SH
References
LP
[Killian] T. Killian, ``Processes as Files'', USENIX Summer Conf. Proc., Salt Lake City, 1984
br
[Needham] R. Needham, ``Names'', in
I
Distributed systems,
R
S. Mullender, ed.,
Addison Wesley, 1989
br
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
``Plan 9 from Bell Labs'',
UKUUG Proc. of the Summer 1990 Conf.,
London, England,
1990
br
[Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
UKUUG Proc. of the Summer 1990 Conf.,
London, England,
1990
br
[Pike91] R. Pike, ``8½, The Plan 9 Window System'', USENIX Summer
Conf. Proc., Nashville, 1991