3. TECHNICAL

3.1. More detailed explanation of basic sed

Sed takes a script of editing commands and applies each command, in
order, to each line of input. After all the commands have been
applied to the first line of input, that line is output. A second
input line is taken for processing, and the cycle repeats. Sed
scripts can address a single line by line number or by matching a
/RE pattern/ on the line. An exclamation mark '!' after a regex
('/RE/!') or line number will select all lines that do NOT match
that address. Sed can also address a range of lines in the same
manner, using a comma to separate the 2 addresses.

$d # delete the last line of the file
/[0-9]\{3\}/p # print lines with 3 consecutive digits
5!s/ham/cheese/ # except on line 5, replace 'ham' with 'cheese'
/awk/!s/aaa/bb/ # unless 'awk' is found, replace 'aaa' with 'bb'
17,/foo/d # delete all lines from line 17 up to 'foo'
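
As a quick check of these address forms, the sketch below runs three
of them against a made-up five-line input. The sample text and the
variable names are illustrative only; they are not part of the FAQ.

```shell
# Hypothetical sample input: five lines, each containing 'ham'.
input=$(printf 'ham 1\nham 2\nham 3\nham 4\nham 5')

# $d : delete the last line only.
no_last=$(printf '%s\n' "$input" | sed '$d')

# 5!s/ham/cheese/ : substitute on every line except line 5.
not_five=$(printf '%s\n' "$input" | sed '5!s/ham/cheese/')

# 2,/4/d : delete from line 2 through the next line matching '4'.
range=$(printf '%s\n' "$input" | sed '2,/4/d')

printf '%s\n' "$no_last" "--" "$not_five" "--" "$range"
```

In the last command only 'ham 1' and 'ham 5' survive, because the
range runs from line 2 up to and including the line matching /4/.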

Following an address or address range, sed accepts curly braces
'{...}' so several commands may be applied to that line or to the
lines matched by the address range. On the command line, semicolons
';' separate each instruction and must precede the closing brace.

sed '/Owner:/{s/yours/mine/g;s/your/my/g;s/you/me/g;}' file
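
To see the effect of the braces, here is the same command applied to
a hypothetical two-line input (the 'Guest:' line is invented for the
demonstration). Only the line matching /Owner:/ should change.

```shell
# Hypothetical input; only the first line matches the address.
out=$(printf 'Owner: you and yours\nGuest: you and yours\n' |
  sed '/Owner:/{s/yours/mine/g;s/your/my/g;s/you/me/g;}')

printf '%s\n' "$out"
# Owner: me and mine
# Guest: you and yours
```

Note that the order of the three substitutions matters: the longest
string ('yours') is replaced first, so that 's/you/me/g' cannot
damage it beforehand.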

Range addresses operate differently depending on which version of
sed is used (see section 6.8.5, below). For further information on
using sed, consult the references in section 2.3, above. The online
manual ("man pages") on Unix/Linux systems may be helpful (try "man
sed"), but man pages are notoriously obscure for first-time users.

3.2. Common one-line sed scripts

A separate document of over 70 handy "one-line" sed commands is
available at <http://www.cornerstonemag.com/sed/sed1line.txt>. Here
are fourteen of the most common sed commands for one-line use.
MS-DOS users should replace single quotes ('...') with double
quotes ("...") in these examples. A specific filename ("file")
usually follows the script, though the input may also come via
piping ("sort somefile | sed 'somescript'").

# 1. Double space a file
sed G file

# 2. Triple space a file
sed 'G;G' file

# 3. Under UNIX: convert DOS newlines (CR/LF) to Unix format
sed 's/.$//' file # assumes that all lines end with CR/LF
sed 's/^M$//' file # in bash/tcsh, press Ctrl-V then Ctrl-M

# 4. Under DOS: convert Unix newlines (LF) to DOS format
sed 's/$//' file # method 1
sed -n p file # method 2

# 5. Delete leading whitespace (spaces/tabs) from front of each line
# (this aligns all text flush left). '^t' represents a true tab
# character. Under bash or tcsh, press Ctrl-V then Ctrl-I.
sed 's/^[ ^t]*//' file

# 6. Delete trailing whitespace (spaces/tabs) from end of each line
sed 's/[ ^t]*$//' file # see note on '^t', above

# 7. Delete BOTH leading and trailing whitespace from each line
sed 's/^[ ^t]*//;s/[ ^t]*$//' file # see note on '^t', above

# 8. Substitute "foo" with "bar" on each line
sed 's/foo/bar/' file # replaces only 1st instance in a line
sed 's/foo/bar/4' file # replaces only 4th instance in a line
sed 's/foo/bar/g' file # replaces ALL instances within a line

# 9. Substitute "foo" with "bar" ONLY for lines which contain "baz"
sed '/baz/s/foo/bar/g' file

# 10. Delete all CONSECUTIVE blank lines from file except the first.
# This method also deletes all blank lines from top and end of file.
# (emulates "cat -s")
sed '/./,/^$/!d' file # this allows 0 blanks at top, 1 at EOF
sed '/^$/N;/\n$/D' file # this allows 1 blank at top, 0 at EOF

# 11. Delete all leading blank lines at top of file (only).
sed '/./,$!d' file

# 12. Delete all trailing blank lines at end of file (only).
sed -e :a -e '/^\n*$/{$d;N;};/\n$/ba' file

# 13. If a line ends with a backslash, join the next line to it.
sed -e :a -e '/\\$/N; s/\\\n//; ta' file

# 14. If a line begins with an equal sign, append it to the
# previous line (and replace the "=" with a single space).
sed -e :a -e '$!N;s/\n=/ /;ta' -e 'P;D' file
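
A few of the one-liners above can be tried without creating a file
by feeding them input from printf or echo. The sample strings and
variable names below are made up for illustration, and the sketch
assumes a POSIX shell.

```shell
# 3. Strip the trailing carriage return (every line ends in CR/LF).
unix=$(printf 'one\r\ntwo\r\n' | sed 's/.$//')

# 8. Replace the 1st, the 2nd, or every 'foo' within a line.
first=$(echo 'foo foo foo' | sed 's/foo/bar/')
second=$(echo 'foo foo foo' | sed 's/foo/bar/2')
every=$(echo 'foo foo foo' | sed 's/foo/bar/g')

# 10. Squeeze runs of blank lines (0 blanks at top, 1 at EOF).
squeezed=$(printf '\n\na\n\n\n\nb\n' | sed '/./,/^$/!d')

# 13. Join a line ending in a backslash with the next line.
joined=$(printf 'long \\\nline\nnext\n' |
  sed -e :a -e '/\\$/N; s/\\\n//; ta')

printf '%s\n' "$first" "$second" "$every" "$joined"
# bar foo foo
# foo bar foo
# bar bar bar
# long line
# next
```

In the "squeezed" case the two leading blank lines disappear and the
three blank lines between 'a' and 'b' collapse to one.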

3.3. Addressing and address ranges

Sed commands may have an optional "address" or "address range"
prefix. If there is no address or address range given, then the
command is applied to all the lines of the input file or text
stream. Three commands cannot take an address prefix:

- labels, used to branch or jump within the script
- the close brace, '}', which ends the '{' "command"
- the '#' comment character, also technically a "command"

An address can be a line number (such as 1, 5, 37, etc.), a regular
expression (written in the form /RE/ or \xREx where 'x' is any
character other than '\' and RE is the regular expression), or the
dollar sign ($), representing the last line of the file. An
exclamation mark (!) after an address or address range will apply
the command to every line EXCEPT the ones named by the address. A
null regex ("//") will be replaced by the last regex which was
used. Also, some seds do not support \xREx as regex delimiters.

5d # delete line 5 only
5!d # delete every line except line 5
/RE/s/LHS/RHS/g # substitute only if RE occurs on the line
/^$/b label # if the line is blank, branch to ':label'
/./!b label # ... another way to write the same command
\%.%!b label # ... yet another way to write this command
$!N # on all lines but the last, get the Next line
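
The following sketch exercises three of these forms on a made-up
four-line input whose second line is blank. It assumes a sed that
accepts the \xREx delimiter form (GNU sed does); the input and the
variable names are illustrative.

```shell
# Hypothetical input: 'a', an empty line, 'b', 'c'.
input=$(printf 'a\n\nb\nc')

# 3!d : delete every line except line 3.
only_3=$(printf '%s\n' "$input" | sed '3!d')

# \%.%!d : same as /./!d, so every blank line is deleted.
no_blank=$(printf '%s\n' "$input" | sed '\%.%!d')

# $!N : on all lines but the last, append the Next line; the
# embedded newline is then replaced with '+'.
paired=$(printf '%s\n' "$input" | sed '$!N;s/\n/+/')

printf '%s\n' "$only_3" "--" "$no_blank" "--" "$paired"
```

The last command pairs the lines up two by two, printing 'a+' (line 1
joined to the blank line 2) and then 'b+c'.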

Note that an embedded newline can be represented in an address by
the symbol \n, but this syntax is needed only if the script puts 2
or more lines into the pattern space via the N, G, or other
commands. The \n symbol does not match the newline at an
end-of-line because when sed reads each line into the pattern space
for processing, it strips off the trailing newline, processes the
line, and adds a newline back when printing the line to standard
output. To match the end-of-line, use the '$' metacharacter, as
follows:

/tape$/ # matches the word 'tape' at the end of a line
/tape$deck/ # matches the word 'tape$deck' with a literal '$'
/tape\ndeck/ # matches 'tape' and 'deck' with a newline between
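
A small experiment makes the difference visible. The two-line input
below is hypothetical; the point is that \n can match only after a
command such as N has put a second line into the pattern space.

```shell
# One line per cycle: the pattern space never contains \n,
# so nothing is printed.
single=$(printf 'tape\ndeck\n' | sed -n '/tape\ndeck/p')

# After N the pattern space holds both lines, so \n matches.
double=$(printf 'tape\ndeck\n' | sed -n 'N;/tape\ndeck/p')

# A '$' in the middle of the regex is a literal dollar sign.
literal=$(printf 'tape$deck\n' | sed -n '/tape$deck/p')

printf '[%s]\n' "$single" "$double" "$literal"
```

Here "single" is empty, "double" holds both lines, and "literal"
holds the line 'tape$deck'.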

The following sed commands usually accept only a single address.
All other commands (except labels, '}', and '#') accept both single
addresses and address ranges.

= print to stdout the line number of the current line
a after printing the current line, append "text" to stdout
i before printing the current line, insert "text" to stdout
q quit after the current line is matched
r file prints contents of "file" to stdout after line is matched

Note that we said "usually." If you need to apply the '=', 'a',
'i', or 'r' commands to each and every line within an address
range, this behavior can be coerced by the use of braces. Thus,
"1,9=" is an invalid command, but "1,9{=;}" will print each line
number followed by its line for the first 9 lines (and then print
the rest of the file normally).
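
The braces technique can be checked on a short input. The sketch
below uses a hypothetical three-line input and the range 1,2 instead
of 1,9, so that the unnumbered remainder is visible.

```shell
# Number only the first two lines; line 3 is printed normally.
out=$(printf 'a\nb\nc\n' | sed '1,2{=;}')

printf '%s\n' "$out"
# 1
# a
# 2
# b
# c
```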

Address ranges occur in the form

<address1>,<address2> or <address1>,<address2>!

where the address can be a line number or a standard /regex/.
<address2> can also be a dollar sign, indicating the end of file.
Under HHsed and gsed302a, <address2> may also be a notation of the
form +num, indicating the next num lines after <address1> is
matched.
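
A minimal sketch of the +num form follows. It assumes a sed that
supports this extension (current GNU sed does; a strictly POSIX sed
may reject the script), and the five-line input is made up.

```shell
# Delete the line matching /start/ and the 2 lines after it.
# NOTE: '+2' as <address2> is an extension, not POSIX sed.
out=$(printf 'a\nstart\nb\nc\nd\n' | sed '/start/,+2d')

printf '%s\n' "$out"
# a
# d
```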

Address ranges are:

(1) Inclusive. The range "/From here/,/eternity/" matches all the
lines containing "From here" up to and including the line
containing "eternity". It will not stop on the line just prior to
"eternity". (If you don't like this, see section 4.15.)

(2) Plenary. They always match full lines, not just parts of lines.
In other words, a command to change or delete an address range will
change or delete whole lines; it won't stop in the middle of a
line.

(3) Multilinear. Address ranges normally match 2 lines or more. The
second address will never match the same line the first address
did; therefore a valid address range always spans at least two
lines, with these exceptions which match only one line:

- if the first address matches the last line of the file
- if using the syntax "/RE/,3" and /RE/ occurs only once in the
file, on line 3 or later
- if using HHsed v1.5. See section 6.8.5.

(4) Minimalist. In address ranges with /regex/ as <address2>, the
range "/foo/,/bar/" will stop at the first "bar" it finds, provided
that "bar" occurs on a line below "foo". If the word "bar" occurs
on several lines below the word "foo", the range will match all the
lines from the first "foo" up to the first "bar". It will not
continue hopping ahead to find more "bar"s. In other words, address
ranges are not "greedy," like regular expressions.

(5) Repeating. An address range will try to match more than one
block of lines in a file. However, the blocks cannot nest. In
addition, a second match will not "take" the last line of the
previous block. For example, given the following text,

start
stop start
stop

the sed command '/start/,/stop/d' will only delete the first two
lines. It will not delete all 3 lines.

(6) Relentless. If the address range finds a "start" match but
doesn't find a "stop", it will match every line from "start" to the
end of the file. Thus, beware of the following behaviors:

/RE1/,/RE2/ # if /RE2/ is not found, matches from /RE1/ to the
# end-of-file

20,/RE/ # if /RE/ is not found, matches from line 20 to the
# end-of-file

/RE/,30 # if /RE/ occurs any time after line 30, each
# occurrence will be matched in HHsed, sedmod, and
# gsed302. GNU sed v2.05 and 1.18 will match from
# the 2nd occurrence of /RE/ to the end-of-file.
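
Several of the properties listed above can be verified with short
pipelines. The inputs below are invented for the purpose (except the
start/stop text, which is taken from item 5), and the results shown
are what a current GNU sed produces.

```shell
# Inclusive and minimalist: the range ends at the FIRST 'bar'
# and includes that line.
incl=$(printf 'a\nfoo\nb\nbar\nc\nbar\n' | sed -n '/foo/,/bar/p')

# One-line range: /x/ first matches on line 4, past the '3'.
one_line=$(printf '1\n2\n3\nx\n5\n' | sed -n '/x/,3p')

# Repeating, without reusing the last line of the previous block.
rep=$(printf 'start\nstop start\nstop\n' | sed '/start/,/stop/d')

# Relentless: no 'bar' exists, so the range runs to end-of-file.
rel=$(printf 'a\nfoo\nb\nc\n' | sed '/foo/,/bar/d')

printf '%s\n' "$incl" "--" "$one_line" "--" "$rep" "--" "$rel"
```

The four results are 'foo', 'b', 'bar' for the first command, 'x'
alone for the second, 'stop' for the third, and 'a' for the fourth.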

If these behaviors seem strange, remember that they occur because
sed does not look "ahead" in the file. Doing so would stop sed from
being a stream editor and have adverse effects on its efficiency.
If these behaviors are undesirable, they can be circumvented or
corrected by the use of nested testing within braces. The following
scripts work under GNU sed 3.02:

# Execute your_commands on range "/RE1/,/RE2/", but if /RE2/ is
# not found, do nothing.
/RE1/{:a;N;/RE2/!ba;your_commands;}

# Execute your_commands on range "20,/RE/", but if /RE/ is not
# found, do nothing.
20{:a;N;/RE/!ba;your_commands;}

As a side note, once we've used N to "slurp" lines together to test
for the ending expression, the pattern space will have gathered
many lines (possibly thousands) together and concatenated them as a
single expression, with the \n sequence marking line breaks. The
REs within the pattern space may have to be modified (e.g., you
must write '/\nStart/' instead of '/^Start/' and '/[^\n]*/' instead
of '/.*/') and other standard sed commands will be unavailable or
difficult to use.

# Execute your_commands on range "/RE/,30", but if /RE/ occurs
# on line 31 or later, do not match it.
1,30{/RE/,$ your_commands;}
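
Two of these scripts are sketched below with concrete stand-ins:
BEGIN/END replace /RE1/ and /RE2/, 's/\n/ /g' and 's/^/>/' replace
your_commands, and the range 1,3 replaces 1,30. All of these, and the
sample inputs, are illustrative choices. The script is split into
several -e pieces so that it does not depend on a semicolon being
accepted after a label.

```shell
# Slurp the block from BEGIN to END into the pattern space, then
# act on it as a whole (here: join its lines with spaces).
slurp=$(printf 'a\nBEGIN\nx\nEND\nb\n' |
  sed -e '/BEGIN/{' -e ':a' -e 'N' -e '/END/!ba' \
      -e 's/\n/ /g' -e '}')

# Start a range at /x/ only if /x/ occurs within lines 1-3; the
# 'x' on line 5 is never tested, and the range cannot run past
# line 3.
nested=$(printf 'a\nx\nb\nc\nx\n' |
  sed -e '1,3{' -e '/x/,$ s/^/>/' -e '}')

printf '%s\n' "$slurp" "--" "$nested"
```

The first command prints 'a', 'BEGIN x END', and 'b'. The second
prints 'a', '>x', '>b', 'c', and 'x'.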

For related suggestions on using address ranges, see sections 4.2,
4.15, and 4.19 of this FAQ. Note that HHsed contains a bug or
nonstandard feature in how it implements address ranges; also, GNU
sed 3.02a supports a zero (0) in addressing. For more details, see
section 6.8.5 ("Range addressing with GNU sed and HHsed").

3.4. [reserved]

3.5. [reserved]

3.6. Notes about s2p, the sed-to-perl translator

s2p (sed to perl) is a Perl program to convert sed scripts into the
Perl programming language; it is included with many versions of
Perl. These problems have been found when using s2p:

(1) Doesn't recognize the semicolon properly after s/// commands.

s/foo/bar/g;

(2) Doesn't trim trailing whitespace after s/// commands. Even lone
trailing spaces, without comments, produce an error.

(3) Doesn't handle multiple commands within braces. E.g.,

1,4{=;G;}

will produce Perl code that omits the "G" command. In fact, every
command after the first one is dropped from the Perl output script,
and the output script will also contain mismatched braces.

3.7. GNU/POSIX extensions to regular expressions

GNU sed supports "character classes" in addition to regular
character sets, such as [0-9A-F]. Like regular character sets,
character classes represent any single character within a set.

"Character classes are a new feature introduced in the POSIX
standard. A character class is a special notation for describing
lists of characters that have a specific attribute, but where the
actual characters themselves can vary from country to country
and/or from character set to character set. For example, the notion
of what is an alphabetic character differs in the USA and in
France." [quoted from the docs for GNU awk v3.0.3]

Though character classes don't generally conserve space on the
line, they help make scripts portable for international use. The
equivalent character sets *for U.S. users* follow:

[[:alnum:]] - [A-Za-z0-9] Alphanumeric characters
[[:alpha:]] - [A-Za-z] Alphabetic characters
[[:blank:]] - [ \x09] Space or tab characters only
[[:cntrl:]] - [\x00-\x1F\x7F] Control characters
[[:digit:]] - [0-9] Numeric characters
[[:graph:]] - [!-~] Printable and visible characters
[[:lower:]] - [a-z] Lower-case alphabetic characters
[[:print:]] - [ -~] Printable (non-Control) characters
[[:punct:]] - [!-/:-@[-`{-~] Punctuation characters
[[:space:]] - [ \t\n\v\f\r] All whitespace chars
[[:upper:]] - [A-Z] Upper-case alphabetic characters
[[:xdigit:]] - [0-9a-fA-F] Hexadecimal digit characters

Note that [[:graph:]] does not match the space " ", but [[:print:]]
does. Some character classes may (or may not) match characters in
the high ASCII range (ASCII 128-255 or 0x80-0xFF), depending on
which C library was used to compile sed. For non-English languages,
[[:alpha:]] and other classes may also match high ASCII characters.
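
As an illustration, the sketch below uses three of these classes in
ordinary substitutions. The sample strings are made up, and the
results assume plain ASCII input, where the classes behave like the
U.S. character sets in the table above.

```shell
# [[:digit:]] : mask every digit.
digits=$(echo 'id 42, room 7' | sed 's/[[:digit:]]/#/g')

# [[:blank:]] : collapse each run of spaces/tabs to one '_'.
blanks=$(printf 'ab \t cd\n' | sed 's/[[:blank:]]\{1,\}/_/g')

# [[:xdigit:]] : remove every hexadecimal digit.
hex=$(echo 'ff00zz' | sed 's/[[:xdigit:]]//g')

printf '%s\n' "$digits" "$blanks" "$hex"
# id ##, room #
# ab_cd
# zz
```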
_______________________________________________________________________________