The Unix and Internet Fundamentals HOWTO
 by Eric S. Raymond
 v1.4, 25 September 1999

 This document describes the working basics of PC-class computers,
 Unix-like operating systems, and the Internet in non-technical lan-
 guage.
 ______________________________________________________________________

 Table of Contents


 1. Introduction

    1.1 Purpose of this document

 2. What's new

    2.1 Related resources
    2.2 New versions of this document
    2.3 Feedback and corrections

 3. Basic anatomy of your computer

 4. What happens when you switch on a computer?

 5. What happens when you log in?

 6. What happens when you run programs from the shell?

 7. How do input devices and interrupts work?

 8. How does my computer do several things at once?

 9. How does my computer keep processes from stepping on each other?

 10. How does my computer store things in memory?

    10.1 Numbers
    10.2 Characters

 11. How does my computer store things on disk?

    11.1 Low-level disk and file system structure
    11.2 File names and directories
    11.3 Mount points
    11.4 How a file gets looked up
    11.5 File ownership, permissions and security
    11.6 How things can go wrong

 12. How do computer languages work?

    12.1 Compiled languages
    12.2 Interpreted languages
    12.3 P-code languages

 13. How does the Internet work?

    13.1 Names and locations
    13.2 Packets and routers
    13.3 TCP and IP
    13.4 HTTP, an application protocol


 ______________________________________________________________________

 1.  Introduction



 1.1.  Purpose of this document

 This document is intended to help Linux and Internet users who are
 learning by doing.  While this is a great way to acquire specific
 skills, sometimes it leaves peculiar gaps in one's knowledge of the
 basics -- gaps which can make it hard to think creatively or
 troubleshoot effectively, from lack of a good mental model of what is
 really going on.

 I'll try to describe in clear, simple language how it all works.  The
 presentation will be tuned for people using Unix or Linux on PC-class
 hardware.  Nevertheless I'll usually refer simply to `Unix' here, as
 most of what I will describe is constant across platforms and across
 Unix variants.

 I'm going to assume you're using an Intel PC.  The details differ
 slightly if you're running an Alpha or PowerPC or some other Unix box,
 but the basic concepts are the same.

 I won't repeat things, so you'll have to pay attention, but that also
 means you'll learn from every word you read.  It's a good idea to just
 skim when you first read this; you should come back and reread it a
 few times after you've digested what you have learned.

 This is an evolving document.  I intend to keep adding sections in
 response to user feedback, so you should come back and review it
 periodically.


 2.  What's new

 New in 1.2: The section `How does my computer store things in
 memory?'.  New in 1.3: The sections `What happens when you log in?'
 and `File ownership, permissions and security'.


 2.1.  Related resources

 If you're reading this in order to learn how to hack, you should also
 read the How To Become A Hacker FAQ
 <http://www.tuxedo.org/~esr/faqs/hacker-howto.html>.  It has links to
 some other useful resources.


 2.2.  New versions of this document

 New versions of the Unix and Internet Fundamentals HOWTO will be
 periodically posted to the newsgroups comp.os.linux.help,
 comp.os.linux.announce and news.answers.  They will also be
 uploaded to various Linux WWW and FTP sites, including the LDP home
 page.

 You can view the latest version of this on the World Wide Web via the
 URL <http://metalab.unc.edu/LDP/HOWTO/Unix-Internet-Fundamentals-
 HOWTO.html>.


 2.3.  Feedback and corrections


 If you have questions or comments about this document, please feel
 free to mail Eric S. Raymond, at [email protected]. I welcome any
 suggestions or criticisms. I especially welcome hyperlinks to more
 detailed explanations of individual concepts.  If you find a mistake
 with this document, please let me know so I can correct it in the next
 version. Thanks.


 3.  Basic anatomy of your computer

 Your computer has a processor chip inside it that does the actual
 computing.  It has internal memory (what DOS/Windows people call
 ``RAM'' and Unix people often call ``core'').  The processor and
 memory live on the _m_o_t_h_e_r_b_o_a_r_d which is the heart of your computer.

 Your computer has a screen and keyboard.  It has hard drives and
 floppy disks.  The screen and your disks have _c_o_n_t_r_o_l_l_e_r _c_a_r_d_s that
 plug into the motherboard and help the computer drive these outboard
 devices.  (Your keyboard is too simple to need a separate card; the
 controller is built into the keyboard chassis itself.)

 We'll go into some of the details of how these devices work later.
 For now, here are a few basic things to keep in mind about how they
 work together:

 All the inboard parts of your computer are connected by a _b_u_s.
 Physically, the bus is what you plug your controller cards into (the
 video card, the disk controller, a sound card if you have one).  The
 bus is the data highway between your processor, your screen, your
 disk, and everything else.

 The processor, which makes everything else go, can't actually see any
 of the other pieces directly; it has to talk to them over the bus.
 The only other subsystem it has really fast, immediate access to is
 memory (the core).  In order for programs to run, then, they have to
 be _i_n _c_o_r_e.

 When your computer reads a program or data off the disk, what actually
 happens is that the processor uses the bus to send a disk read request
 to your disk controller.  Some time later the disk controller uses the
 bus to signal the computer that it has read the data and put it in a
 certain location in core.  The processor can then use the bus to look
 at that memory.

 Your keyboard and screen also communicate with the processor via the
 bus, but in simpler ways.  We'll discuss those later on.  For now, you
 know enough to understand what happens when you turn on your computer.


 4.  What happens when you switch on a computer?

 A computer without a program running is just an inert hunk of
 electronics.  The first thing a computer has to do when it is turned
 on is start up a special program called an _o_p_e_r_a_t_i_n_g _s_y_s_t_e_m.  The
 operating system's job is to help other computer programs to work by
 handling the messy details of controlling the computer's hardware.

 The process of bringing up the operating system is called _b_o_o_t_i_n_g
 (originally this was _b_o_o_t_s_t_r_a_p_p_i_n_g and alluded to the difficulty of
 pulling yourself up ``by your bootstraps'').  Your computer knows how
 to boot because instructions for booting are built into one of its
 chips, the BIOS (or Basic Input/Output System) chip.

 The BIOS chip tells it to look in a fixed place on the lowest-numbered
 hard disk (the _b_o_o_t _d_i_s_k) for a special program called a _b_o_o_t _l_o_a_d_e_r
 (under Linux the boot loader is called LILO).  The boot loader is
 pulled into core and started.  The boot loader's job is to start the
 real operating system.
 The loader does this by looking for a _k_e_r_n_e_l, loading it into core,
 and starting it.  When you boot Linux and see "LILO" on the screen
 followed by a bunch of dots, it is loading the kernel.  (Each dot
 means it has loaded another _d_i_s_k _b_l_o_c_k of kernel code.)

 (You may wonder why the BIOS doesn't load the kernel directly -- why
 the two-step process with the boot loader?  Well, the BIOS isn't very
 smart.  In fact it's very stupid, and Linux doesn't use it at all
 after boot time.  It was originally written for primitive 8-bit PCs
 with tiny disks, and literally can't access enough of the disk to load
 the kernel directly.  The boot loader step also lets you start one of
 several operating systems off different places on your disk, in the
 unlikely event that Unix isn't good enough for you.)

 Once the kernel starts, it has to look around, find the rest of the
 hardware, and get ready to run programs.  It does this by poking not
 at ordinary memory locations but rather at _I_/_O _p_o_r_t_s -- special bus
 addresses that are likely to have device controller cards listening at
 them for commands.  The kernel doesn't poke at random; it has a lot of
 built-in knowledge about what it's likely to find where, and how
 controllers will respond if they're present.  This process is called
 _a_u_t_o_p_r_o_b_i_n_g.

 Most of the messages you see at boot time are the kernel autoprobing
 your hardware through the I/O ports, figuring out what it has
 available to it and adapting itself to your machine.  The Linux kernel
 is extremely good at this, better than most other Unixes and _m_u_c_h
 better than DOS or Windows.  In fact, many Linux old-timers think the
 cleverness of Linux's boot-time probes (which made it relatively easy
 to install) was a major reason it broke out of the pack of free-Unix
 experiments to attract a critical mass of users.

 But getting the kernel fully loaded and running isn't the end of the
 boot process; it's just the first stage (sometimes called _r_u_n _l_e_v_e_l
 _1).  After this first stage, the kernel hands control to a special
 process called `init' which spawns several housekeeping processes.

 The init process's first job is usually to check to make sure your
 disks are OK.  Disk file systems are fragile things; if they've been
 damaged by a hardware failure or a sudden power outage, there are good
 reasons to take recovery steps before your Unix is all the way up.
 We'll go into some of this later on when we talk about ``how file
 systems can go wrong''.

 Init's next step is to start several _d_a_e_m_o_n_s.  A daemon is a program
 like a print spooler, a mail listener or a WWW server that lurks in
 the background, waiting for things to do.  These special programs
 often have to coordinate several requests that could conflict.  They
 are daemons because it's often easier to write one program that runs
 constantly and knows about all requests than it would be to try to
 make sure that a flock of copies (each processing one request and all
 running at the same time) don't step on each other.  The particular
 collection of daemons your system starts may vary, but will almost
 always include a print spooler (a gatekeeper daemon for your printer).

 Once all daemons are started, we're at _r_u_n _l_e_v_e_l _2.  The next step is
 to prepare for users.  Init starts a copy of a program called getty to
 watch your console (and maybe more copies to watch dial-in serial
 ports).  This program is what issues the login prompt to your console.
 We're now at _r_u_n _l_e_v_e_l _3 and ready for you to log in and run programs.


 5.  What happens when you log in?

 When you log in (give a name and password) you identify yourself to
 getty and the computer.  It then runs a program called (naturally
 enough) login, which checks to see if you are authorized to be using
 the machine.  If you aren't, your login attempt will be rejected.  If
 you are, login does a few housekeeping things and then starts up a
 command interpreter, the _s_h_e_l_l.  (Yes, getty and login could be one
 program.  They're separate for historical reasons not worth going into
 here.)

 Here's a bit more about what the system does before giving you a
 shell; you'll need to understand them later when we talk about file
 permissions.  You identify yourself with a login name and password.
 That login name is looked up in a file called /etc/passwd, which is
 a sequence of lines each describing a user account.

 One of these fields is an encrypted version of the account password.
 What you enter as an account password is encrypted in exactly the same
 way, and the login program checks to see if they match.  The security
 of this method depends on the fact that, while it's easy to go from
 your clear password to the encrypted version, the reverse is very
 hard.  Thus, even if someone can see the encrypted version of your
 password, they can't use your account.  (It also means that if you
 forget your password, there's no way to recover it, only to change it
 to something else you choose.)

 Once you have successfully logged in, you get all the privileges
 associated with the individual account you are using.  You may also be
 recognized as part of a _g_r_o_u_p.  A group is a named collection of users
 set up by the system administrator.  Groups can have privileges
 independently of their members' privileges.  A user can be a member of
 multiple groups.  (For details about how Unix privileges work, see the
 section below on ``File ownership, permissions and security''.)

 (Note that although you will normally refer to users and groups by
 name, they are actually stored internally as numeric IDs.  The
 password file maps your account name to a user ID; the /etc/group file
 maps group names to numeric group IDs.  Commands that deal with
 accounts and groups do the translation automatically.)

 Your account entry also contains your _h_o_m_e _d_i_r_e_c_t_o_r_y, the place in the
 Unix file system where your personal files will live.  Finally, your
 account entry also sets your _s_h_e_l_l, the command interpreter that login
 will start up to accept your commands.
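
 To make this concrete, here is a minimal C sketch of the kind of
 check login performs, using the standard getpwnam(3) and crypt(3)
 library calls.  It is only an illustration: on many modern systems
 the real hash lives in /etc/shadow (readable only by root), so an
 unprivileged run of this program will see only a placeholder in the
 password field.


      /* passcheck.c -- a sketch of what login does with /etc/passwd.
         Compile with something like:  cc passcheck.c -o passcheck -lcrypt */
      #define _XOPEN_SOURCE 700   /* exposes crypt() in <unistd.h> on glibc */
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>         /* crypt() */
      #include <pwd.h>            /* getpwnam(), struct passwd */

      int main(int argc, char **argv)
      {
          if (argc != 3) {
              fprintf(stderr, "usage: passcheck NAME PASSWORD\n");
              return 2;
          }

          struct passwd *pw = getpwnam(argv[1]);   /* look up the account */
          if (pw == NULL) {
              fprintf(stderr, "no such user\n");
              return 1;
          }

          printf("uid=%d gid=%d home=%s shell=%s\n",
                 (int)pw->pw_uid, (int)pw->pw_gid, pw->pw_dir, pw->pw_shell);

          /* Encrypt what was typed with the stored salt and compare. */
          char *hashed = crypt(argv[2], pw->pw_passwd);
          if (hashed && strcmp(hashed, pw->pw_passwd) == 0)
              printf("password matches\n");
          else
              printf("no match (or the hash is hidden in /etc/shadow)\n");
          return 0;
      }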


 6.  What happens when you run programs from the shell?

 The normal shell gives you the '$' prompt that you see after logging
 in (unless you've customized it to something else).  We won't talk
 about shell syntax and the easy things you can see on the screen here;
 instead we'll take a look behind the scenes at what's happening from
 the computer's point of view.

 After boot time and before you run a program, you can think of your
 computer as containing a zoo of processes that are all waiting for
 something to do.  They're all waiting on _e_v_e_n_t_s. An event can be you
 pressing a key or moving a mouse.  Or, if your machine is hooked to a
 network, an event can be a data packet coming in over that network.

 The kernel is one of these processes.  It's a special one, because it
 controls when the other _u_s_e_r _p_r_o_c_e_s_s_e_s can run, and it is normally the
 only process with direct access to the machine's hardware.  In fact,
 user processes have to make requests to the kernel when they want to
 get keyboard input, write to your screen, read from or write to disk,
 or do just about anything other than crunching bits in memory.  These
 requests are known as _s_y_s_t_e_m _c_a_l_l_s.
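
 Even something as humble as printing a line of text ends up as a
 system call.  Here's a minimal C sketch that skips the usual library
 buffering and asks the kernel directly to write bytes to standard
 output (file descriptor 1, normally your screen):


      /* syscall.c -- ask the kernel to write bytes to the screen.    */
      #include <unistd.h>   /* write() is a thin wrapper around the
                               kernel's write system call             */

      int main(void)
      {
          write(1, "hello from a system call\n", 25);
          return 0;
      }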


 Normally all I/O goes through the kernel so it can schedule the
 operations and prevent processes from stepping on each other.  A few
 special user processes are allowed to slide around the kernel, usually
 by being given direct access to I/O ports.  X servers (the programs
 that handle other programs' requests to do screen graphics on most
 Unix boxes) are the most common example of this.  But we haven't
 gotten to an X server yet; you're looking at a shell prompt on a
 character console.

 The shell is just a user process, and not a particularly special one.
 It waits on your keystrokes, listening (through the kernel) to the
 keyboard I/O port.  As the kernel sees them, it echoes them to your
 screen, then passes them to the shell.  When the kernel sees an `Enter'
 it passes your line of text to the shell. The shell tries to interpret
 those keystrokes as commands.

 Let's say you type `ls' and Enter to invoke the Unix directory lister.
 The shell applies its built-in rules to figure out that you want to
 run the executable command in the file `/bin/ls'.  It makes a system
 call asking the kernel to start /bin/ls as a new _c_h_i_l_d process and
 give it access to the screen and keyboard through the kernel.  Then
 the shell goes to sleep, waiting for ls to finish.

 When /bin/ls is done, it tells the kernel it's finished by issuing an
 _e_x_i_t system call.  The kernel then wakes up the shell and tells it it
 can continue running.  The shell issues another prompt and waits for
 another line of input.
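
 Here is a hedged sketch, in C, of that cycle -- in effect a
 one-command shell hard-wired to run ls.  A real shell does much more,
 but the system calls involved (fork, exec, wait, exit) are the same
 ones it uses:


      /* minish.c -- run /bin/ls the way a shell would.               */
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      int main(void)
      {
          pid_t child = fork();          /* clone this process        */
          if (child == 0) {
              /* In the child: replace ourselves with /bin/ls.        */
              execl("/bin/ls", "ls", (char *)NULL);
              perror("execl");           /* only reached on failure   */
              exit(127);
          }
          /* In the parent (the "shell"): sleep until ls exits.       */
          int status;
          waitpid(child, &status, 0);
          printf("ls finished; back to the prompt\n");
          return 0;
      }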

 Other things may be going on while your `ls' is executing, however
 (we'll have to suppose that you're listing a very long directory).
 You might switch to another virtual console, log in there, and start a
 game of Quake, for example.  Or, suppose you're hooked up to the
 Internet.  Your machine might be sending or receiving mail while
 /bin/ls runs.


 7.  How do input devices and interrupts work?

 Your keyboard is a very simple input device; simple because it
 generates small amounts of data very slowly (by a computer's
 standards).  When you press or release a key, that event is signalled
 up the keyboard cable to raise a _h_a_r_d_w_a_r_e _i_n_t_e_r_r_u_p_t.

 It's the operating system's job to watch for such interrupts.  For
 each possible kind of interrupt, there will be an _i_n_t_e_r_r_u_p_t _h_a_n_d_l_e_r, a
 part of the operating system that stashes away any data associated
 with them (like your keypress/keyrelease value) until it can be
 processed.

 What the interrupt handler for your keyboard actually does is post the
 key value into a system area near the bottom of core.  There, it will
 be available for inspection when the operating system passes control
 to whichever program is currently supposed to be reading from the
 keyboard.

 More complex input devices like disk or network cards work in a
 similar way.  Above, we referred to a disk controller using the bus to
 signal that a disk request has been fulfilled.  What actually happens
 is that the disk raises an interrupt.  The disk interrupt handler then
 copies the retrieved data into memory, for later use by the program
 that made the request.

 Every kind of interrupt has an associated _p_r_i_o_r_i_t_y _l_e_v_e_l.  Lower-
 priority interrupts (like keyboard events) have to wait on higher-
 priority interrupts (like clock ticks or disk events).  Unix is
 designed to give high priority to the kinds of events that need to be
 processed rapidly in order to keep the machine's response smooth.

 In your OS's boot-time messages, you may see references to _I_R_Q
 numbers.  You may know that one of the common ways to misconfigure
 hardware is to have two different devices try to use the same IRQ,
 without quite understanding why that causes trouble.

 Here's the answer.  IRQ is short for "Interrupt Request".  The
 operating system needs to know at startup time which numbered
 interrupts each hardware device will use, so it can associate the
 proper handlers with each one.  If two different devices try to use the
 same IRQ, interrupts will sometimes get dispatched to the wrong
 handler.  This will usually at least lock up the device, and can
 sometimes confuse the OS badly enough that it will flake out or crash.


 8.  How does my computer do several things at once?

 It doesn't, actually.  Computers can only do one task (or _p_r_o_c_e_s_s) at
 a time.  But a computer can change tasks very rapidly, and fool slow
 human beings into thinking it's doing several things at once.  This is
 called _t_i_m_e_s_h_a_r_i_n_g.

 One of the kernel's jobs is to manage timesharing.  It has a part
 called the _s_c_h_e_d_u_l_e_r which keeps information inside itself about all
 the other (non-kernel) processes in your zoo.  Every 1/60th of a
 second, a timer goes off in the kernel, generating a clock interrupt.
 The scheduler stops whatever process is currently running, suspends it
 in place, and hands control to another process.

 1/60th of a second may not sound like a lot of time.  But on today's
 microprocessors it's enough to run tens of thousands of machine
 instructions, which can do a great deal of work.  So even if you have
 many processes, each one can accomplish quite a bit in each of its
 timeslices.

 In practice, a program may not get its entire timeslice. If an
 interrupt comes in from an I/O device, the kernel effectively stops
 the current task, runs the interrupt handler, and then returns to the
 current task.  A storm of high-priority interrupts can squeeze out
 normal processing; this misbehavior is called _t_h_r_a_s_h_i_n_g and is
 fortunately very hard to induce under modern Unixes.

 In fact, the speed of programs is only very seldom limited by the
 amount of machine time they can get (there are a few exceptions to
 this rule, such as sound or 3-D graphics generation).  Much more
 often, delays are caused when the program has to wait on data from a
 disk drive or network connection.

 An operating system that can routinely support many simultaneous
 processes is called "multitasking".  The Unix family of operating
 systems was designed from the ground up for multitasking and is very
 good at it -- much more effective than Windows or the Mac OS, which
 have had multitasking bolted on as an afterthought and do it
 rather poorly.  Efficient, reliable multitasking is a large part of
 what makes Linux superior for networking, communications, and Web
 service.


 9.  How does my computer keep processes from stepping on each other?

 The kernel's scheduler takes care of dividing processes in time.  Your
 operating system also has to divide them in space, so that processes
 can't step on each others' working memory.  Even if you assume that
 all programs are trying to be cooperative, you don't want a bug in one
 of them to be able to corrupt others.  The things your operating
 system does to solve this problem are called _m_e_m_o_r_y _m_a_n_a_g_e_m_e_n_t.

 Each process in your zoo needs its own area of core memory, as a place
 to run its code from and keep variables and results in.  You can think
 of this area as consisting of a read-only _c_o_d_e _s_e_g_m_e_n_t (containing the
 process's instructions) and a writeable _d_a_t_a _s_e_g_m_e_n_t (containing all
 the process's variable storage).  The data segment is truly unique to
 each process, but if two processes are running the same code Unix
 automatically arranges for them to share a single code segment as an
 efficiency measure.

 Efficiency is important, because core memory is expensive.  Sometimes
 you don't have enough to hold the entirety of all the programs the
 machine is running, especially if you are using a large program like
 an X server.  To get around this, Unix uses a strategy called _v_i_r_t_u_a_l
 _m_e_m_o_r_y.  It doesn't try to hold all the code and data for a process in
 core.  Instead, it keeps around only a relatively small _w_o_r_k_i_n_g _s_e_t;
 the rest of the process's state is left in a special _s_w_a_p _s_p_a_c_e area
 on your hard disk.

 As the process runs, Unix tries to anticipate how the working set will
 change and have only the pieces that are needed in core.  Doing this
 effectively is both complicated and tricky, so I won't try and
 describe it all here -- but it depends on the fact that code and data
 references tend to happen in clusters, with each new one likely to
 refer to somewhere close to an old one.  So if Unix keeps around the
 code or data most frequently (or most recently) used, it will usually
 succeed in saving time.

 Note that in the past, that ``Sometimes'' two paragraphs ago was ``Almost
 always'' -- the size of core was typically small relative to the size
 of running programs, so swapping was frequent.  Memory is far less
 expensive nowadays and even low-end machines have quite a lot of it.
 On modern single-user machines with 64MB of core and up, it's possible
 to run X and a typical mix of jobs without ever swapping.

 Even in this happy situation, the part of the operating system called
 the _m_e_m_o_r_y _m_a_n_a_g_e_r still has important work to do.  It has to make
 sure that programs can only alter their own data segments -- that is,
 prevent erroneous or malicious code in one program from garbaging the
 data in another.  To do this, it keeps a table of data and code
 segments.  The table is updated whenever a process either requests
 more memory or releases memory (the latter usually when it exits).

 This table is used to pass commands to a specialized part of the
 underlying hardware called an _M_M_U or _m_e_m_o_r_y _m_a_n_a_g_e_m_e_n_t _u_n_i_t.  Modern
 processor chips have MMUs built right onto them.  The MMU has the
 special ability to put fences around areas of memory, so an out-of-
 bound reference will be refused and cause a special interrupt to be
 raised.

 If you ever see a Unix message that says "Segmentation fault", "core
 dumped" or something similar, this is exactly what has happened; an
 attempt by the running program to access memory outside its segment
 has raised a fatal interrupt.  This indicates a bug in the program
 code; the _c_o_r_e _d_u_m_p it leaves behind is diagnostic information
 intended to help a programmer track it down.
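
 If you want to see this protection in action, the following
 deliberately buggy C fragment writes through an invalid pointer; the
 MMU refuses the out-of-bounds reference and the kernel kills the
 program with a segmentation fault:


      /* crashme.c -- trigger a segmentation fault on purpose.        */
      int main(void)
      {
          char *p = 0;     /* address 0 is not mapped for this process */
          *p = 'x';        /* the MMU refuses this write; the kernel
                              delivers a fatal signal and we die       */
          return 0;
      }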

 There is another aspect to protecting processes from each other
 besides segregating the memory they access.  You also want to be able
 to control their file accesses so a buggy or malicious program can't
 corrupt critical pieces of the system.  This is why Unix has ``file
 permissions'', which we'll discuss later.



 10.  How does my computer store things in memory?

 You probably know that everything on a computer is stored as strings
 of bits (binary digits; you can think of them as lots of little on-off
 switches).  Here we'll explain how those bits are used to represent
 the letters and numbers that your computer is crunching.

 Before we can go into this, you need to understand the _w_o_r_d
 _s_i_z_e of your computer.  The word size is the computer's preferred size
 for moving units of information around; technically it's the width of
 your processor's _r_e_g_i_s_t_e_r_s, which are the holding areas your processor
 uses to do arithmetic and logical calculations.  When people write
 about computers having bit sizes (calling them, say, ``32-bit'' or
 ``64-bit'' computers), this is what they mean.

 Most computers (including 386, 486, Pentium and Pentium II PCs) have a
 word size of 32 bits.  The old 286 machines had a word size of 16.
 Old-style mainframes often had 36-bit words.  A few processors (like
 the Alpha from what used to be DEC and is now Compaq) have 64-bit
 words.  The 64-bit word will become more common over the next five
 years; Intel is planning to replace the Pentium II with a 64-bit chip
 called `Merced'.

 The computer views your core memory as a sequence of words numbered
 from zero up to some large value dependent on your memory size. That
 value is limited by your word size, which is why older machines like
 286s had to go through painful contortions to address large amounts of
 memory.  I won't describe them here; they still give older programmers
 nightmares.
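
 If you're curious what your own machine looks like, here's a tiny C
 sketch that prints the sizes of a few basic types (C reports sizes in
 8-bit bytes, so a 32-bit word shows up as 4):


      /* wordsize.c -- report the sizes of some basic C types.        */
      #include <stdio.h>

      int main(void)
      {
          printf("int:      %d bytes\n", (int)sizeof(int));
          printf("long:     %d bytes\n", (int)sizeof(long));
          printf("pointer:  %d bytes\n", (int)sizeof(void *));
          return 0;
      }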


 10.1.  Numbers

 Numbers are represented as either words or pairs of words, depending
 on your processor's word size.  One 32-bit machine word is the most
 common size.

 Integer arithmetic is close to but not actually mathematical base
 two.  The low-order bit is 1, next 2, then 4 and so forth as in pure
 binary.  But signed numbers are represented in _t_w_o_s_-_c_o_m_p_l_e_m_e_n_t
 notation.  The highest-order bit is a _s_i_g_n _b_i_t which makes the
 quantity negative, and every negative number can be obtained from the
 corresponding positive value by inverting all the bits and adding
 one.  This is why integers on a 32-bit machine have the range -2^31
 to 2^31 - 1 (where ^ is the `power' operation, 2^3 = 8).  That 32nd
 bit is being used for sign.

 Some computer languages give you access to _u_n_s_i_g_n_e_d arithmetic,
 which is straight base 2 with zero and positive numbers only.

 Most processors and some languages can do arithmetic in floating-
 point numbers (this capability is built into all recent processor
 chips).  Floating-point numbers give you a much wider range of values
 than integers and let you express fractions.  The ways this is done
 vary and are rather too complicated to discuss in detail here, but
 the general idea is much like so-called `scientific notation', where
 one might write (say) 1.234 * 10^23; the encoding of the number is
 split into a mantissa (1.234) and the exponent part (23) for the
 power-of-ten multiplier.
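
 A short C sketch makes the twos-complement and unsigned behavior
 visible (the limits shown in the comments assume a 32-bit int, the
 common case described above):


      /* twoscomp.c -- poke at integer representation.                */
      #include <stdio.h>
      #include <limits.h>

      int main(void)
      {
          printf("INT_MIN  = %d\n", INT_MIN);    /* -2147483648       */
          printf("INT_MAX  = %d\n", INT_MAX);    /*  2147483647       */
          printf("UINT_MAX = %u\n", UINT_MAX);   /*  4294967295       */

          int x = -1;
          /* In twos complement, -1 is all bits set, so reinterpreting
             it as unsigned gives the largest unsigned value.          */
          printf("-1 as unsigned = %u\n", (unsigned int)x);

          /* Floating point: a mantissa and an exponent, much like
             scientific notation.                                      */
          double big = 1.234e23;
          printf("1.234e23 = %e\n", big);
          return 0;
      }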

 10.2.  Characters


 Characters are normally represented as strings of seven bits each in
 an encoding called ASCII (American Standard Code for Information
 Interchange).  On modern machines, each of the 128 ASCII characters is
 the low seven bits of an 8-bit _o_c_t_e_t; octets are packed into memory
 words so that (for example) a six-character string only takes up two
 memory words.  For an ASCII code chart, type `man 7 ascii' at your
 Unix prompt.

 The preceding paragraph was misleading in two ways.  The minor one is
 that the term `octet' is formally correct but seldom actually used;
 most people refer to an octet as a _b_y_t_e and expect bytes to be eight
 bits long.  Strictly speaking, the term `byte' is more general; there
 used to be, for example, 36-bit machines with 9-bit bytes (though
 there probably never will be again).

 The major one is that not all the world uses ASCII.  In fact, much of
 the world can't -- ASCII, while fine for American English, lacks many
 accented and other special characters needed by users of other
 languages.  Even British English has trouble with the lack of a pound-
 currency sign.

 There have been several attempts to fix this problem.  All use the
 extra high bit that ASCII doesn't, making it the low half of a
 256-character set.  The most widely-used of these is the so-called
 `Latin-1' character set (more formally called ISO 8859-1).  This is
 the default character set for Linux, HTML, and X.  Microsoft Windows
 uses a mutant version of Latin-1 that adds a bunch of characters such
 as right and left double quotes in places proper Latin-1 leaves
 unassigned for historical reasons (for a scathing account of the
 trouble this causes, see the demoroniser
 <http://www.fourmilab.ch/webtools/demoroniser/> page).

 Latin-1 handles the major European languages, including English,
 French, German, Spanish, Italian, Dutch, Norwegian, Swedish, and Danish.
 However, this isn't good enough either, and as a result there is a
 whole series of Latin-2 through -9 character sets to handle things
 like Greek, Arabic, Hebrew, and Serbo-Croatian.  For details, see the
 ISO alphabet soup
 <http://www.utia.cas.cz/user_data/vs/documents/ISO-8859-X-
 charsets.html> page.

 The ultimate solution is a huge standard called Unicode (and its
 identical twin ISO/IEC 10646-1:1993).  Unicode is identical to Latin-1
 in its lowest 256 slots.  Above these in 16-bit space it includes
 Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali,
 Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai,
 Lao, Georgian, Tibetan, Japanese Kana, the complete set of modern
 Korean Hangul, and a unified set of Chinese/Japanese/Korean (CJK)
 ideographs. For details, see the Unicode Home Page
 <http://www.unicode.org/>.
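
 To make the byte-to-character mapping concrete, here's a minimal C
 sketch that prints the numeric code behind each character of a short
 string (plain ASCII is assumed; in a multibyte Unicode encoding one
 character may occupy several bytes):


      /* asciidump.c -- show the numeric codes behind a string.       */
      #include <stdio.h>

      int main(void)
      {
          const char *s = "Unix";
          const char *p;

          for (p = s; *p != '\0'; p++)
              printf("'%c' = %d (hex %x)\n", *p, *p, *p);
          return 0;
      }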


 11.  How does my computer store things on disk?

 When you look at a hard disk under Unix, you see a tree of named
 directories and files.  Normally you won't need to look any deeper
 than that, but it does become useful to know what's going on
 underneath if you have a disk crash and need to try to salvage files.
 Unfortunately, there's no good way to describe disk organization from
 the file level downwards, so I'll have to describe it from the
 hardware up.


 11.1.  Low-level disk and file system structure

 The surface area of your disk, where it stores data, is divided up
 something like a dartboard -- into circular tracks which are then pie-
 sliced into sectors.  Because tracks near the outer edge have more
 area than those close to the spindle at the center of the disk, the
 outer tracks have more sector slices in them than the inner ones.
 Each sector (or _d_i_s_k _b_l_o_c_k) has the same size, which under modern
 Unixes is generally 1 binary K (1024 8-bit bytes).  Each disk block
 has a unique address or _d_i_s_k _b_l_o_c_k _n_u_m_b_e_r.

 Unix divides the disk into _d_i_s_k _p_a_r_t_i_t_i_o_n_s.  Each partition is a
 continuous span of blocks that's used separately from any other
 partition, either as a file system or as swap space.  The lowest-
 numbered partition is often treated specially, as a _b_o_o_t _p_a_r_t_i_t_i_o_n
 where you can put a kernel to be booted.

 Each partition is either _s_w_a_p _s_p_a_c_e (used to implement ``virtual
 memory'') or a _f_i_l_e _s_y_s_t_e_m used to hold files.  Swap-space partitions
 are just treated as a linear sequence of blocks.  File systems, on the
 other hand, need a way to map file names to sequences of disk blocks.
 Because files grow, shrink, and change over time, a file's data blocks
 will not be a linear sequence but may be scattered all over its
 partition (from wherever the operating system can find a free block
 when it needs one).


 11.2.  File names and directories

 Within each file system, the mapping from names to blocks is handled
 through a structure called an _i_-_n_o_d_e.  There's a pool of these things
 near the ``bottom'' (lowest-numbered blocks) of each file system (the
 very lowest ones are used for housekeeping and labeling purposes we
 won't describe here).  Each i-node describes one file.  File data
 blocks live above the inodes.

 Every i-node contains a list of the disk block numbers in the file it
 describes.  (Actually this is a half-truth, only correct for small
 files, but the rest of the details aren't important here.)  Note that
 the i-node does _n_o_t contain the name of the file.

 Names of files live in _d_i_r_e_c_t_o_r_y _s_t_r_u_c_t_u_r_e_s.  A directory structure
 just maps names to i-node numbers.  This is why, in Unix, a file can
 have multiple true names (or _h_a_r_d _l_i_n_k_s); they're just multiple
 directory entries that happen to point to the same inode.
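
 You can watch this machinery from C with the stat(2) call, which
 returns the information kept in a file's i-node -- including the
 i-node number and the count of directory entries (hard links) that
 point at it.  A minimal sketch:


      /* inode.c -- print i-node information for a file.
         Usage: ./inode somefile                                       */
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>

      int main(int argc, char **argv)
      {
          struct stat st;

          if (argc != 2) {
              fprintf(stderr, "usage: inode FILE\n");
              return 2;
          }
          if (stat(argv[1], &st) != 0) {
              perror("stat");
              return 1;
          }
          printf("i-node number: %lu\n", (unsigned long)st.st_ino);
          printf("hard links:    %lu\n", (unsigned long)st.st_nlink);
          printf("size in bytes: %lu\n", (unsigned long)st.st_size);
          return 0;
      }


 Running `ls -i' on two hard links to the same file will show the same
 i-node number, which is exactly the point made above.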


 11.3.  Mount points

 In the simplest case, your entire Unix file system lives in just one
 disk partition.  While you'll see this arrangement on some small
 personal Unix systems, it's unusual.  More typical is for it to be
 spread across several disk partitions, possibly on different physical
 disks.   So, for example, your system may have one small partition
 where the kernel lives, a slightly larger one where OS utilities live,
 and a much bigger one where user home directories live.

 The only partition you'll have access to immediately after system boot
 is your _r_o_o_t _p_a_r_t_i_t_i_o_n, which is (almost always) the one you booted
 from.  It holds the root directory of the file system, the top node
 from which everything else hangs.

 The other partitions in the system have to be attached to this root in
 order for your entire, multiple-partition file system to be
 accessible.  About midway through the boot process, your Unix will
 make these non-root partitions accessible.  It will _m_o_u_n_t each one
 onto a directory on the root partition.

 For example, if you have a Unix directory called `/usr', it is
 probably a mount point to a partition that contains many programs
 installed with your Unix but not required during initial boot.


 11.4.  How a file gets looked up

 Now we can look at the file system from the top down.  When you open a
 file (such as, say, /home/esr/WWW/ldp/fundamentals.sgml) here is what
 happens:

 Your kernel starts at the root of your Unix file system (in the root
 partition).  It looks for a directory there called `home'.  Usually
 `home' is a mount point to a large user partition elsewhere, so it
 will go there.  In the top-level directory structure of that user
 partition, it will look for an entry called `esr' and extract an inode
 number.  It will go to that i-node, notice it is a directory
 structure, and look up `WWW'.  Extracting _t_h_a_t i-node, it will go to
 the corresponding subdirectory and look up `ldp'.  That will take it
 to yet another directory inode.  Opening that one, it will find an i-
 node number for `fundamentals.sgml'.  That inode is not a directory,
 but instead holds the list of disk blocks associated with the file.
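
 The following C sketch imitates that walk in a crude way from user
 space: it stats each successive prefix of the path and prints the
 i-node number the kernel found for it.  (The kernel does the real
 walk internally, component by component, as just described; the
 example path is the one used in the text and probably doesn't exist
 on your system.)


      /* namewalk.c -- show the i-node at each step of a path lookup. */
      #include <stdio.h>
      #include <string.h>
      #include <sys/types.h>
      #include <sys/stat.h>

      int main(void)
      {
          const char *path = "/home/esr/WWW/ldp/fundamentals.sgml";
          char prefix[256];
          struct stat st;
          size_t i;

          for (i = 1; i <= strlen(path); i++) {
              if (path[i] != '/' && path[i] != '\0')
                  continue;             /* not at a component boundary */
              memcpy(prefix, path, i);
              prefix[i] = '\0';
              if (stat(prefix, &st) != 0) {
                  perror(prefix);
                  return 1;
              }
              printf("%-40s i-node %lu\n", prefix,
                     (unsigned long)st.st_ino);
          }
          return 0;
      }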


 11.5.  File ownership, permissions and security

 To keep programs from accidentally or maliciously stepping on data
 they shouldn't, Unix has _p_e_r_m_i_s_s_i_o_n features.  These were originally
 designed to support timesharing by protecting multiple users on the
 same machine from each other, back in the days when Unix ran mainly on
 expensive shared minicomputers.

 In order to understand file permissions, you need to recall our
 description of users and groups in the section ``What happens when
 you log in?''.  Each file has an owning user and an owning group.
 These are initially those of the file's creator; they can be changed
 with the programs chown(1) and chgrp(1).

 The basic permissions that can be associated with a file are `read'
 (permission to read data from it), `write' (permission to modify it)
 and `execute' (permission to run it as a program).  Each file has
 three sets of permissions; one for its owning user, one for any user
 in its owning group, and one for everyone else.  The `privileges' you
 get when you log in are just the ability to do read, write, and
 execute on those files for which the permission bits match your user
 ID or one of the groups you are in.

 To see how these may interact and how Unix displays them, let's look
 at some file listings on a hypothetical Unix system.  Here's one:



      snark:~$ ls -l notes
      -rw-r--r--   1 esr      users         2993 Jun 17 11:00 notes




 This is an ordinary data file.  The listing tells us that it's owned
 by the user `esr' and was created with the owning group `users'.
 Probably the machine we're on puts every ordinary user in this group
 by default; other groups you commonly see on timesharing machines are
 `staff', `admin', or `wheel' (for obvious reasons, groups are not very
 important on single-user workstations or PCs).  Your Unix may use a
 different default group, perhaps one named after your user ID.

 The string `-rw-r--r--' represents the permission bits for the file.
 The very first dash is the position for the directory bit; it would
 show `d' if the file were a directory.  After that, the first three
 places are user permissions, the second three group permissions, and
 the last three are permissions for others (often called `world'
 permissions).  On this file, the owning user `esr' may read or write
 the file, other people in the `users' group may read it, and everybody
 else in the world may read it.  This is a pretty typical set of
 permissions for an ordinary data file.


 Now let's look at a file with very different permissions.  This file
 is GCC, the GNU C compiler.



      snark:~$ ls -l /usr/bin/gcc
      -rwxr-xr-x   3 root     bin         64796 Mar 21 16:41 /usr/bin/gcc




 This file belongs to a user called `root' and a group called `bin'; it
 can be written (modified) only by root, but read or executed by
 anyone.  This is a typical ownership and set of permissions for a pre-
 installed system command.  The `bin' group exists on some Unixes to
 group together system commands (the name is a historical relic, short
 for `binary').  Your Unix might use a `root' group instead (not quite
 the same as the `root' user!).

 The `root' user is the conventional name for numeric user ID 0, a
 special, privileged account that can override all privileges.  Root
 access is useful but dangerous; a typing mistake while you're logged
 in as root can clobber critical system files that the same command
 executed from an ordinary user account could not touch.

 Because the root account is so powerful, access to it should be
 guarded very carefully.  Your root password is the single most
 critical piece of security information on your system, and it is what
 any crackers and intruders who ever come after you will be trying to
 get.

 (About passwords: Don't write them down -- and don't pick a password
 that can easily be guessed, like the first name of your
 girlfriend/boyfriend/spouse.  This is an astonishingly common bad
 practice that helps crackers no end...)

 Now let's look at a third case:



      snark:~$ ls -ld ~
      drwxr-xr-x  89 esr      users          9216 Jun 27 11:29 /home2/esr
      snark:~$




 This file is a directory (note the `d' in the first permissions slot).
 We see that it can be written only by esr, but read and executed by
 anybody else.  Permissions are interpreted in a special way on
 directories; they control access to the files below them in the
 directory.

 Read permission on a directory gives you the ability to list it --
 that is, to see the names of the files and directories it contains.
 Write permission gives you the ability to create and delete files in
 the directory.  Execute permission (often called _s_e_a_r_c_h permission on
 a directory) means you can get through the directory to open the
 files and directories below it.  Occasionally you'll see a directory
 that is world-executable but not world-readable; this means a random
 user can get to files and directories beneath it, but only by knowing
 their exact names.

 Finally, let's look at the permissions of the login program itself.


      snark:~$ ls -l /bin/login
      -rwsr-xr-x   1 root     bin         20164 Apr 17 12:57 /bin/login




 This has the permissions we'd expect for a system command -- except
 for that 's' where the owner-execute bit ought to be.  This is the
 visible manifestation of a special permission called the `set-user-id'
 or _s_e_t_u_i_d _b_i_t.

 The setuid bit is normally attached to programs that need to give
 ordinary users the privileges of root, but in a controlled way.  When
 it is set on an executable program, you get the privileges of the
 owner of that program file while the program is running on your
 behalf, whether or not they match your own.

 Like the root account itself, setuid programs are useful but
 dangerous.  Anyone who can subvert or modify a setuid program owned by
 root can use it to spawn a shell with root privileges.  For this
 reason, opening a file to write it automatically turns off its setuid
 bit on most Unixes.  Many attacks on Unix security try to exploit bugs
 in setuid programs in order to subvert them.  Security-conscious
 system administrators are therefore extra-careful about these programs
 and reluctant to install new ones.

 There are a couple of important details we glossed over when
 discussing permissions above; namely, how the owning group and
 permissions are assigned when a file is first created.  The group is
 an issue because users can be members of multiple groups, but one of
 them (specified in the user's /etc/passwd entry) is the user's _d_e_f_a_u_l_t
 _g_r_o_u_p and will normally own files created by the user.

 The story with initial permission bits is a little more complicated.
 A program that creates a file will normally specify the permissions it
 is to start with.  But these will be modified by a variable in the
 user's environment called the _u_m_a_s_k.  The umask specifies which
 permission bits to _t_u_r_n _o_f_f when creating a file; the most common
 value, and the default on most systems, is -------w- or 002, which
 turns off the world-write bit.  See the documentation of the umask
 command on your shell's manual page for details.
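
 From C, those permission bits show up as a bit mask in the st_mode
 field returned by stat(2), and the POSIX macros S_IRUSR, S_IWGRP and
 friends pick out individual bits.  Here is a hedged sketch that
 rebuilds an ls-style permission string for a file:


      /* perms.c -- print an ls-style permission string for a file.
         Usage: ./perms somefile                                       */
      #include <stdio.h>
      #include <sys/types.h>
      #include <sys/stat.h>

      int main(int argc, char **argv)
      {
          struct stat st;
          char bits[10] = "---------";

          if (argc != 2 || stat(argv[1], &st) != 0) {
              fprintf(stderr, "usage: perms FILE (file must exist)\n");
              return 1;
          }
          if (st.st_mode & S_IRUSR) bits[0] = 'r';
          if (st.st_mode & S_IWUSR) bits[1] = 'w';
          if (st.st_mode & S_IXUSR) bits[2] = 'x';
          if (st.st_mode & S_IRGRP) bits[3] = 'r';
          if (st.st_mode & S_IWGRP) bits[4] = 'w';
          if (st.st_mode & S_IXGRP) bits[5] = 'x';
          if (st.st_mode & S_IROTH) bits[6] = 'r';
          if (st.st_mode & S_IWOTH) bits[7] = 'w';
          if (st.st_mode & S_IXOTH) bits[8] = 'x';
          if (st.st_mode & S_ISUID) bits[2] = 's';   /* the setuid bit */

          printf("%s%s owner uid %d, group gid %d\n",
                 S_ISDIR(st.st_mode) ? "d" : "-", bits,
                 (int)st.st_uid, (int)st.st_gid);
          return 0;
      }


 The chmod(1) command and the chmod(2) and umask(2) calls manipulate
 this same bit mask; the octal notation you'll see in commands like
 `chmod 644 notes' is just the nine permission bits written three to a
 digit.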


 11.6.  How things can go wrong

 Earlier we hinted that file systems can be fragile things.  Now we
 know that to get to a file you have to hopscotch through what may be an
 arbitrarily long chain of directory and i-node references.  Now
 suppose your hard disk develops a bad spot?

 If you're lucky, it will only trash some file data.  If you're
 unlucky, it could corrupt a directory structure or i-node number and
 leave an entire subtree of your system hanging in limbo -- or, worse,
 result in a corrupted structure that points multiple ways at the same
 disk block or inode.  Such corruption can be spread by normal file
 operations, trashing data that was not in the original bad spot.

 Fortunately, this kind of contingency has become quite uncommon as
 disk hardware has become more reliable.  Still, it means that your
 Unix will want to integrity-check the file system periodically to make
 sure nothing is amiss.  Modern Unixes do a fast integrity check on
 each partition at boot time, just before mounting it.  Every few
 reboots they'll do a much more thorough check that takes a few minutes
 longer.


 If all of this sounds like Unix is terribly complex and failure-prone,
 it may be reassuring to know that these boot-time checks typically
 catch and correct normal problems _b_e_f_o_r_e they become really
 disastrous.  Other operating systems don't have these facilities,
 which speeds up booting a bit but can leave you much more seriously
 screwed when attempting to recover by hand (and that's assuming you
 have a copy of Norton Utilities or whatever in the first place...).


 12.  How do computer languages work?

 We've already discussed ``how programs are run''.  Every program
 ultimately has to execute as a stream of bytes that are instructions
 in your computer's _m_a_c_h_i_n_e _l_a_n_g_u_a_g_e.  But human beings don't deal with
 machine language very well; doing so has become a rare, black art even
 among hackers.

 Almost all Unix code except a small amount of direct hardware-
 interface support in the kernel itself is nowadays written in a _h_i_g_h_-
 _l_e_v_e_l _l_a_n_g_u_a_g_e.  (The `high-level' in this term is a historical relic
 meant to distinguish these from `low-level' _a_s_s_e_m_b_l_e_r _l_a_n_g_u_a_g_e_s, which
 are basically thin wrappers around machine code.)

 There are several different kinds of high-level languages.  In order
 to talk about these, you'll find it useful to bear in mind that the
 _s_o_u_r_c_e _c_o_d_e of a program (the human-created, editable version) has to
 go through some kind of translation into machine code that the machine
 can actually run.


 12.1.  Compiled languages

 The most conventional kind of language is a _c_o_m_p_i_l_e_d _l_a_n_g_u_a_g_e.
 Compiled languages get translated into runnable files of binary
 machine code by a special program called (logically enough) a
 _c_o_m_p_i_l_e_r.  Once the binary has been generated, you can run it directly
 without looking at the source code again.  (Most software is delivered
 as compiled binaries made from code you don't see.)

 Compiled languages tend to give excellent performance and have the
 most complete access to the OS, but also to be difficult to program
 in.

 C, the language in which Unix itself is written, is by far the most
 important of these (with its variant C++).  FORTRAN is another
 compiled language still used among engineers and scientists but years
 older and much more primitive.  In the Unix world no other compiled
 languages are in mainstream use.  Outside it, COBOL is very widely used
 for financial and business software.

 There used to be many other compiled languages, but most of them have
 either gone extinct or are strictly research tools.  If you are a new
 Unix developer using a compiled language, it is overwhelmingly likely
 to be C or C++.
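
 Here's the compile-once, run-many-times cycle in miniature.  Given
 this C source in a file hello.c:


      /* hello.c -- the classic minimal C program.                    */
      #include <stdio.h>

      int main(void)
      {
          printf("Hello, world\n");
          return 0;
      }


 A command like `cc hello.c -o hello' runs the compiler once and
 produces a binary file `hello' full of machine instructions; after
 that you can run ./hello as often as you like without the compiler,
 or even the source code, being present at all.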


 12.2.  Interpreted languages

 An _i_n_t_e_r_p_r_e_t_e_d _l_a_n_g_u_a_g_e depends on an interpreter program that reads
 the source code and translates it on the fly into computations and
 system calls.  The source has to be re-interpreted (and the
 interpreter present) each time the code is executed.

 Interpreted languages tend to be slower than compiled languages, and
 often have limited access to the underlying operating system and
 hardware.  On the other hand, they tend to be easier to program and
 more forgiving of coding errors than compiled languages.

 Many Unix utilities, including the shell and bc(1) and sed(1) and
 awk(1), are effectively small interpreted languages.  BASICs are
 usually interpreted.  So is Tcl.  Historically, the most important
 interpretive language has been LISP (a major improvement over most of
 its successors).  Today Perl is very widely used and steadily growing
 more popular.


 12.3.  P-code languages

 Since 1990 a kind of hybrid language that uses both compilation and
 interpretation has become increasingly important.  P-code languages
 are like compiled languages in that the source is translated to a
 compact binary form which is what you actually execute, but that form
 is not machine code.  Instead it's _p_s_e_u_d_o_c_o_d_e (or _p_-_c_o_d_e), which is
 usually a lot simpler but more powerful than a real machine language.
 When you run the program, a p-code interpreter executes it.

 P-code can run nearly as fast as a compiled binary (p-code
 interpreters can be made quite simple, small and speedy).  But p-code
 languages can keep the flexibility and power of a good interpreter.

 Important p-code languages include Python and Java.


 13.  How does the Internet work?

 To help you understand how the Internet works, we'll look at the
 things that happen when you do a typical Internet operation --
 pointing a browser at the front page of this document at its home on
 the Web at the Linux Documentation Project.  This document is


 http://metalab.unc.edu/LDP/HOWTO/Fundamentals.html



 which means it lives in the file LDP/HOWTO/Fundamentals.html under the
 World Wide Web export directory of the host metalab.unc.edu.


 13.1.  Names and locations


 The first thing your browser has to do is to establish a network
 connection to the machine where the document lives.  To do that, it
 first has to find the network location of the _h_o_s_t metalab.unc.edu
 (`host' is short for `host machine' or `network host'; metalab.unc.edu
 is a typical _h_o_s_t_n_a_m_e).  The corresponding location is actually a
 number called an _I_P _a_d_d_r_e_s_s (we'll explain the `IP' part of this term
 later).

 To do this, your browser queries a program called a _n_a_m_e _s_e_r_v_e_r.  The
 name server may live on your machine, but it's more likely to run on a
 service machine that yours talks to.  When you sign up with an ISP,
 part of your setup procedure will almost certainly involve telling
 your Internet software the IP address of a nameserver on the ISP's
 network.

 The name servers on different machines talk to each other, exchanging
 and keeping up to date all the information needed to resolve hostnames
 (map them to IP addresses).  Your nameserver may query three or four
 different sites across the network in the process of resolving
 metalab.unc.edu, but this usually happens very quickly (as in less
 than a second).

 The nameserver will tell your browser that Metalab's IP address is
 152.2.22.81; knowing this, your machine will be able to exchange bits
 with metalab directly.
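
 Programs do this lookup through a resolver library.  Here is a
 minimal C sketch using the traditional gethostbyname(3) call; it asks
 your nameserver for the address and prints it in dotted-quad form
 (the answer may of course differ from the 152.2.22.81 quoted above if
 the host's address has changed since this was written):


      /* resolve.c -- map a hostname to an IP address.                */
      #include <stdio.h>
      #include <string.h>
      #include <netdb.h>          /* gethostbyname()                  */
      #include <netinet/in.h>     /* struct in_addr                   */
      #include <arpa/inet.h>      /* inet_ntoa()                      */

      int main(void)
      {
          struct hostent *he = gethostbyname("metalab.unc.edu");
          if (he == NULL) {
              fprintf(stderr, "lookup failed\n");
              return 1;
          }
          struct in_addr addr;
          memcpy(&addr, he->h_addr_list[0], sizeof addr);
          printf("metalab.unc.edu is %s\n", inet_ntoa(addr));
          return 0;
      }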


 13.2.  Packets and routers


 What the browser wants to do is send a command to the Web server on
 Metalab that looks like this:


 GET /LDP/HOWTO/Fundamentals.html HTTP/1.0



 Here's how that happens.  The command is made into a _p_a_c_k_e_t, a block
 of bits like a telegram that is wrapped with three important things;
 the _s_o_u_r_c_e _a_d_d_r_e_s_s (the IP address of your machine), the _d_e_s_t_i_n_a_t_i_o_n
 _a_d_d_r_e_s_s (152.2.22.81), and a _s_e_r_v_i_c_e _n_u_m_b_e_r or _p_o_r_t _n_u_m_b_e_r (80, in
 this case) that indicates that it's a World Wide Web request.

 Your machine then ships the packet down the wire (modem connection to
 your ISP, or local network) until it gets to a specialized machine
 called a _r_o_u_t_e_r.  The router has a map of the Internet in its memory
 -- not always a complete one, but one that completely describes your
 network neighborhood and knows how to get to the routers for other
 neighborhoods on the Internet.

 Your packet may pass through several routers on the way to its
 destination.  Routers are smart.  They watch how long it takes for
 other routers to acknowledge having received a packet.  They use that
 information to direct traffic over fast links.  They use it to notice
 when another router (or a cable) has dropped off the network, and
 compensate if possible by finding another route.

 There's an urban legend that the Internet was designed to survive
 nuclear war.  This is not true, but the Internet's design is extremely
 good at getting reliable performance out of flaky hardware in an
 uncertain world.  This is directly due to the fact that its
 intelligence is distributed through thousands of routers rather than a
 few massive switches (like the phone network).  This means that
 failures tend to be well localized and the network can route around
 them.

 Once your packet gets to its destination machine, that machine uses
 the service number to feed the packet to the web server.  The web
 server can tell where to reply to by looking at the command packet's
 source IP address. When the web server returns this document, it will
 be broken up into a number of packets.  The size of the packets will
 vary according to the transmission media in the network and the type
 of service.


 13.3.  TCP and IP

 To understand how multiple-packet transmissions are handled, you need
 to know that the Internet actually uses two protocols, stacked one on
 top of the other.

 The lower level, _I_P (Internet Protocol), knows how to get individual
 packets from a source address to a destination address (this is why
 these are called IP addresses).  However, IP is not reliable; if a
 packet gets lost or dropped, the source and destination machines may
 never know it.  In network jargon, IP is a _c_o_n_n_e_c_t_i_o_n_l_e_s_s protocol;
 the sender just fires a packet at the receiver and doesn't expect an
 acknowledgement.

 IP is fast and cheap, though.  Sometimes fast, cheap and unreliable is
 OK.  When you play networked Doom or Quake, each bullet is represented
 by an IP packet.  If a few of those get lost, that's OK.

 The upper level, _T_C_P (Transmission Control Protocol), gives you
 reliability.  When two machines negotiate a TCP connection (which they
 do using IP), the receiver knows to send acknowledgements of the
 packets it sees back to the sender.  If the sender doesn't see an
 acknowledgement for a packet within some timeout period, it resends
 that packet.  Furthermore, the sender gives each TCP packet a sequence
 number, which the receiver can use to reassemble packets in case they
 show up out of order.  (This can happen if network links go up or down
 during a connection.)

 TCP/IP packets also contain a checksum to enable detection of data
 corrupted by bad links.  So, from the point of view of anyone using
 TCP/IP and nameservers, it looks like a reliable way to pass streams
 of bytes between hostname/service-number pairs.  People who write
 network protocols almost never have to think about all the
 packetizing, packet reassembly, error checking, checksumming, and
 retransmission that goes on below that level.


 13.4.  HTTP, an application protocol

 Now let's get back to our example.  Web browsers and servers speak an
 _a_p_p_l_i_c_a_t_i_o_n _p_r_o_t_o_c_o_l that runs on top of TCP/IP, using it simply as a
 way to pass strings of bytes back and forth.  This protocol is called
 _H_T_T_P (Hyper-Text Transfer Protocol) and we've already seen one command
 in it -- the GET shown above.

 When the GET command goes to metalab.unc.edu's webserver with service
 number 80, it will be dispatched to a _s_e_r_v_e_r _d_a_e_m_o_n listening on port 80.
 Most Internet services are implemented by server daemons that do
 nothing but wait on ports, watching for and executing incoming
 commands.

 If the design of the Internet has one overall rule, it's that all the
 parts should be as simple and human-accessible as possible.  HTTP, and
 its relatives (like the Simple Mail Transfer Protocol, _S_M_T_P, that is
 used to move electronic mail between hosts) tend to use simple
 printable-text commands that end with a carriage-return/line feed.

 This is marginally inefficient; in some circumstances you could get
 more speed by using a tightly-coded binary protocol.  But experience
 has shown that the benefits of having commands be easy for human
 beings to describe and understand outweigh any marginal gain in
 efficiency that you might get at the cost of making things tricky and
 opaque.

 Therefore, what the server daemon ships back to you via TCP/IP is also
 text.  The beginning of the response will look something like this (a
 few headers have been suppressed):


 HTTP/1.1 200 OK
 Date: Sat, 10 Oct 1998 18:43:35 GMT
 Server: Apache/1.2.6 Red Hat
 Last-Modified: Thu, 27 Aug 1998 17:55:15 GMT
 Content-Length: 2982
 Content-Type: text/html

 These headers will be followed by a blank line and the text of the web
 page (after which the connection is dropped).  Your browser just
 displays that page.  The headers tell it how to do so (in particular, the
 Content-Type header tells it the returned data really is HTML).
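
 To tie sections 13.1 through 13.4 together, here is a hedged C sketch
 of what the browser is doing at this level: resolve the name, open a
 TCP connection to port 80, send the GET command, and print whatever
 text comes back.  It speaks bare HTTP/1.0, skips all error recovery,
 and assumes the page is still at the location used in this example:


      /* fetch.c -- fetch one web page the hard way, over raw TCP.    */
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <netdb.h>
      #include <sys/socket.h>
      #include <netinet/in.h>

      int main(void)
      {
          struct hostent *he = gethostbyname("metalab.unc.edu");
          if (he == NULL)
              return 1;

          /* Build the destination address: IP address plus port 80.  */
          struct sockaddr_in dest;
          memset(&dest, 0, sizeof dest);
          dest.sin_family = AF_INET;
          dest.sin_port = htons(80);              /* the WWW service  */
          memcpy(&dest.sin_addr, he->h_addr_list[0],
                 sizeof dest.sin_addr);

          int fd = socket(AF_INET, SOCK_STREAM, 0);   /* a TCP socket */
          if (fd < 0 || connect(fd, (struct sockaddr *)&dest,
                                sizeof dest) != 0)
              return 1;

          /* The application protocol: one HTTP command, then read.   */
          const char *req =
              "GET /LDP/HOWTO/Fundamentals.html HTTP/1.0\r\n"
              "Host: metalab.unc.edu\r\n\r\n";
          write(fd, req, strlen(req));

          char buf[4096];
          ssize_t n;
          while ((n = read(fd, buf, sizeof buf)) > 0)
              fwrite(buf, 1, (size_t)n, stdout);
          close(fd);
          return 0;
      }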