% date: 2018-03-08
POSIX Shell Scripting Survival Guide
====================================
Authors: ulcer <[email protected]>
License: CC-BY-SA 4.0
Published: 2018-03-08
Updated: 2018-05-08
Link: gopher://sdf.org/0/users/ulcer/kb/kb/article-script-survival.md
Mirror: https://uuuuu.github.io/article-script-survival.txt
Buy me cookies: https://patreon.com/ulcer
Table of contents:
1. Introduction
1.1. Note on shells: bourne shell syntax, bashisms and POSIX
1.2. Tools of trade: BSD userland, GNU coreutils
1.3. Shell scripting limitations
2. Scripting basics
2.1. echo vs printf and few more details
2.2. Variables, eval and quotation
2.3. Conditions and loops: if, while
2.4. Options parsing and validation: getopt vs getopts, case
2.5. Life without arrays
2.6. Change file
2.7. Check lock
3. Commentary on tools
3.1. Working with pipes: stdbuf and pv
3.2. Notes on grep
3.3. Notes on sed
3.4. Notes on awk
3.5. Notes on portable syntax
3.6. Notes on UTF-8 compatibility
3.7. Working with XML and HTML: curl, tidy, xmlstarlet and others
3.8. Working with JSON: jq
3.9. Working with CSV: miller
4. Advanced topics
4.1. Reading user input
4.2. Internal field separator
4.3. Command line options manipulation
4.4. Nested constructions difficulties and recursion
4.5. Libraries and trap cascading
4.6. Debugging
4.7. Testing and deploying in new environment
4.8. Networking with shell
4.9. Paste safety
5. Further reading
6. References
7. Changelog
% body
1. Introduction
---------------
While it is true that the tasks shell solves are of limited scope,
with the POSIX shell and toolset you may still get plenty of
day-to-day and administration/deployment/maintenance work done without
caring much about the platform you use - BSD or GNU.
This guide was motivated by watching fellow SDFers make common
mistakes, and it assumes you know how to do "hello world". It should
also answer the question "how do I solve real life problems with this
junk". Given the number of historical tool alternatives and their
spread-out functionality, the answer is not obvious. This guide is
highly opinionated; where possible, a link with reasoning is provided.
### 1.1. Note on shells: bourne shell syntax, bashisms and POSIX
Since this guide assumes you know some basics, those basics were
probably learned in bash. If you didn't dive too deeply into it
(arrays aside), you should be aware that only a minimal number of
differences, called bashisms [1], holds you back from POSIX compliant
syntax (a subset of language features that runs under any major
contemporary shell without extra effort). Sticking to it also removes
questions about bash incompatibilities between versions. The same
bashisms curse follows ksh users.
Another reason for getting rid of bashisms is that bash is genuinely
slow. It is fine for interactive usage, though even there zsh holds
the ultimate position at command completion. The famous ShellShock
vulnerability also suggests the necessity of a less bloated shell for
system tasks.
You may check common bashisms list at [2] and use "checkbashisms"
script bundled in Debian "devscripts" package [3].
Closest to the POSIX standard are the Almquist shell successors: dash
and busybox ash. I usually develop scripts in dash, which is the
default Debian system shell. Dash scripts usually run under bash
without even minor changes, and bash can also be run in "bash --posix"
mode. Zsh provides bare minimum POSIX compatibility by default,
requiring "zsh -c 'emulate sh; sh'" to run POSIX scripts.
As you noticed, POSIX sh has one major drawback: it has no arrays.
Hence csh, ksh, bash and so on. One thing you should know when looking
at shell alternatives: csh (tcsh, if you like) is a dead branch [4]
[5]. If you ask which shell is the best compromise between POSIX
compliance and the bare minimum improvement - arrays - it would
probably be mksh (MirBSD ksh), which is already the default Android
shell.
Alternative syntaxes not compatible with POSIX sh: rc, fish and zsh.
Fish is a contemporary effort to fix bash problems, which doesn't
focus on execution efficiency. You may also take a look at the rc
publication and the plan9 ecosystem to spot the problematic parts of
the traditional shell [6]. Zsh, while not compatible with POSIX
syntax, offers an interactive experience significantly improved over
bash and the best completion system currently available.
Shells are largely bloatware. An example comparison of manpage sizes
between shells (using "MANWIDTH=80 man dash | wc -l"): dash - 1590,
bash - 5742, rc - 1143. Don't be surprised when you hit a bug: dash,
at the moment I studied it extensively (the Debian 7 version), had
empty "trap" output and a buggy printf implementation. So don't be
afraid of trying your portable work on multiple shells (e.g. dash and
bash) if you're not sure whether a shell misbehaves.
### 1.2. Tools of trade: BSD userland, GNU coreutils
The GNU part of the GNU/Linux name refers to the GNU ecosystem,
including coreutils: "cat", "sed", "tr" and so on. The relevant part
of the BSD world is called the userland. Embedded systems often
utilize busybox, which bundles all these tools in a single binary,
with functions often too restricted even for POSIX.
You may expect any of these tools to fully support UTF-8, to adhere
to the POSIX standard and to be bug free. Maybe in the future - so be
prepared to switch tools on occasion.
Following "do one thing and do it well", you should be aware of the
existing tools so as not to waste time attempting to solve your
problem with the wrong tool (like replacing newlines with sed).
### 1.3. Shell scripting limitations
A small subset of shell scripting deficiencies you should be aware of:
- speed. The lack of arrays and fancy operations, with a subshell
  fired each time you need to do something, contributes to general
  slowness. Yet the individual tools (be it grep or awk) provide a
  great speed benefit over scripting languages
- subshells consume variables: it's hard to make piped constructs
  return multiple values and impossible to carry exception-like
  messages, and so forth
- untyped and unstructured values: shell efficiency can be extended to
  the topic of typed/binary data flows, but that's another story
Following a comment in Structural Regular Expressions, "the silent
limits placed on line lengths by most tools can be frustrating." [7]
In general, the deeper you dive into shell scripting, the more
limitations you'll discover. Given all the subtle details in every
small facility, support (and even development) is a burden.
2. Scripting basics
-------------------
As promised in the title, this is more of a guide, so I won't
duplicate detailed answers and will only give solutions in an order
suitable for learning.
### 2.1. echo vs printf and few more details
To make a long story short: there are differences between built-in and
standalone echo implementations, there are subtle backslash and option
parsing details, and you shouldn't use echo for anything other than
fixed strings (preferably without backslashes and other special
chars). For everything else you've got printf. [8]
It's usually done like this:
$ a=$( printf "%s\n" "${var}" | something )
Note that command substitution consumes trailing newlines, so
depending on what you do you may need the trailing newline put back.
Piping into a while loop is a quite common pattern:
$ a=$( seq 1 5 ) ; printf "%s" "$a" | \
while IFS= read -r var; do
printf "%s\n" "${var}"
done
Which gives only 4 result lines, because we wrote 'printf "%s"'
without the trailing "\n", so the last line is not newline-terminated
and the final "read" fails. That said, you shouldn't use backticks for
command substitution; always favour the $() notation [9]. Also note
that "read" by default interprets backslashes, which is turned off
with the "-r" switch. For whole lines you may also use the line(1)
utility.
In case you need sole newline character, it's done like this:
$ nl=$(printf "\n "); nl="${nl% }"
There's no way to expand $'...' bashisms, so for the sake of being
universal you just printf the character into a variable from its octal
code.
You may see redirections in a certain order, which is important. Like
this:
$ date >/dev/null 2>&1
One thing you should remember about them: they are processed in the
order they are written, so writing "2>&1 >/dev/null" won't disable
stderr.
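A minimal demonstration of the ordering rule (the helper function is
made up for the example):

```shell
# a helper that writes one line to stdout and one to stderr
emit() { echo out; echo err >&2; }

# both streams silenced: stdout is pointed at /dev/null first,
# then stderr is duplicated onto it
emit >/dev/null 2>&1

# wrong order: stderr is duplicated onto the *old* stdout (the
# terminal), and only then stdout is discarded -- "err" still appears
emit 2>&1 >/dev/null
```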
### 2.2. Variables, eval and quotation
Let's go back to string printing example:
$ printf "%s\n" "${var}"
As you see, I wrote the variable inside curly braces inside double
quotes, that being the only safe way of using a variable (quote usage
and the "${}" expansion syntax being separate topics). [10]
Regarding "${}" safety consider next examples:
$ class="core"; var="path"; core_path="${HOME}"
$ eval dest=\"\$$class_$var\"
$ echo "Destination: ${dest}"
Which won't work until you wrap "class" from the second line in braces
("${class}_"), because "$class_" is parsed as a variable named
"class_".
See next example with unquoted variable evaluation:
$ AWK=mawk AWK_OPTS="-W interactive"
$ { while :; do date; sleep 1; done; } | $AWK $AWK_OPTS '{print}' -
Next example splits unquoted QUERY_STRING according to IFS into
positional params, available with $1, $2 and so on:
$ set -- ${QUERY_STRING}
All you have to know about "eval": it is the best way of shooting
yourself in the foot, because all your newlines, semicolons and all
kinds of expansions take effect. Don't use it without an extreme
reason and proper quotation.
Unix filenames may include newlines. All the fuss about proper
quotation, "xargs -0" and the like is about safety from crashes and
other malicious actions (e.g. with "while read line" loops).
Make it a rule to quote variables in double quotes every place you use
them, to prevent at least IFS splitting. Double quotes around a
command substitution ("$()") on the right side of a plain assignment
are not necessary.
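A quick demonstration of why unquoted expansion is dangerous (the
value is hypothetical):

```shell
# hypothetical value containing a space
f="my file.txt"

set -- $f            # unquoted: IFS splits the value into two words
printf "unquoted word count: %s\n" "$#"

set -- "$f"          # quoted: the value stays a single word
printf "quoted word count: %s\n" "$#"
```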
One note on variable scope. POSIX doesn't define "local" for local
variables in functions, but you may find it in literally any shell
around. Otherwise just use unique names and preserve shared variables
(like IFS) in backups.
### 2.3. Conditions and loops: if, while
Let's look at syntax from this example:
[ ! -f "${list_raw}" ] ||
[ "${TMWW_DRYRUN}" != "yes" \
-a $(( $(date +%s) - $( stat ${STATSEC} "${list_raw}" ) )) \
-gt "${TMWW_DELTA}" 2>/dev/null ] && \
fetch_all
When you write conditions, "[" is equivalent to the "test" built-in
(with "]" being optional decoration). It's quite a powerful operator,
but its error messages too often lack the problem's cause. First of
all, only "=" is correct for string comparison ("==" is a frequent
typo).
The only correct syntax for shell built-in calculations (arithmetic
expansion) is "$(())": "$ i=$(( $i + 1 ))". More complex calculations
are solved using expr(1), specifically when you need string
manipulation functions without resorting to a mastodon like awk.
A thing you may often see in scripts is ":", which is equivalent to
the "NOP" machine instruction and does exactly nothing. Like this:
$ while :; do date; sleep 1; done
$ : ${2?aborting: missing second param}
$ : >flush_or_create_file
A few words about the "if" statement. Tests like '[ "x$a" = "x" ]' are
archaic, related to the earliest shells, and absolutely useless
nowadays. With tests written as "test -n" or "test -z" you shouldn't
ever wonder whether a variable is "empty" or "unset", but something
like '[ "$b" = "" ]' is good too.
"while read" piped constructs with external calls are the slowest part
contributing to overall script speed. They also can't carry values
with embedded newlines. Being a pretty obscure case, it still may be
addressed with xargs [11]:
... | tr '\n' '\0' | xargs -0 -n1
### 2.4. Options parsing and validation: getopt vs getopts, case
General note on options notation: there are short options like "-a",
which can be written concatenated, like "-ab 'value'", depending on
how smart your option parser is, and GNU-style long options, like
"--version" and "--help" (these two are the most ubiquitous for GNU
tools).
When you need to explicitly tell the option parser to stop accepting
options, there's the handy "--" empty option:
$ kill -- "-${pgid}"
$ random_input | grep -- "${filter}"
Note that the "${filter}" variable in the last example may start with
a dash, so it's always good to put "--" beforehand.
The only "getopt" you should use is the shell built-in "getopts". If
it happens that you need long options for something like a shell
script, you really should reevaluate the right tool for your task.
[12]
The common pattern pairs "getopts" with a final "shift" dropping the
parsed options.
NOTE: if you struggle for "yes/no" and other interactive facilities in
      your script, remember that you lose all the scripting/piping
      benefits of unix filter-type programs
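For illustration, the getopts pattern with a final "shift" might look
like this (the "-v"/"-o" options are made up):

```shell
# hypothetical options: -v (flag) and -o <file> (takes a value)
verbose=no outfile=-
while getopts vo: opt; do
    case "${opt}" in
        v) verbose=yes ;;
        o) outfile="${OPTARG}" ;;
        *) echo "usage: $0 [-v] [-o file] args" >&2; exit 1 ;;
    esac
done
shift $(( OPTIND - 1 ))   # drop parsed options, operands stay in $@
printf "verbose=%s outfile=%s operands=%s\n" \
    "${verbose}" "${outfile}" "$*"
```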
Now we come close to the "case" instruction; there are a few hints to
care about. First, make sure you place the empty case before the "*"
wildcard (note the unquoted patterns):
case "$s" in
foo) echo bar ;;
'') echo empty ;;
*) echo else ;;
esac
You may also do basic validation:
case "${input}" in
*[!0-9]*) echo "Not a number" ;;
esac
These checks are limited to glob patterns (the same you use in
interactive sessions, like "rm *~"), so you should grep/sed for any
stronger validation.
### 2.5. Life without arrays
If you can't rely on something like mksh for array support, there's
still life. Most probably your data is line oriented, which you
append/sort/uniq. Let's query it (with "-F", each line of "$a" below
acts as a separate fixed-string pattern):
$ a=$( seq 1 3 ); seq 40 60 | grep -F "$a"
If you need to search field separated data (key-value or any kind of
CSV) per line, fast lookup is done with the "join" utility. Here, the
"storage" file is a CSV sorted on its first column, which is the
column to be queried; the second file is sorted, one term per line:
$ join -t ',' storage request
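A worked sketch with throwaway files (names and contents are made up);
both files are sorted on the join field:

```shell
# build a small key,value store and a request list
printf '%s\n' 'alice,admin' 'bob,user' 'carol,user' >storage
printf '%s\n' 'alice' 'carol' >request
join -t ',' storage request   # prints: alice,admin and carol,user
rm -f storage request
```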
The simple way of walking a string of words is parameter substitution.
Let's, for example, split off the first word of a word list:
$ array="alpha beta gamma"
$ head="${array%% *}"; array="${array#* }"
NOTE: if you can't remember which one is for the prefix and which for
      the suffix, "#" is on the left (under "3") on an IBM keyboard,
      which is the prefix, thus "%" being the suffix (you're reading
      this left to right anyway). See how you can split the hostname
      from a link:
$ link="gopher://sdf.org/0/users/ulcer/"
$ hostname="${link#*://}"; hostname="${hostname%%/*}"
The other approach involves splitting by IFS and is prone to errors.
See the "4.2. Internal field separator" chapter. The rest rely on
printf and pipes:
$ a=$( printf "%s\n" "$a" | grep -v "exclude_me" )
$ result=$( printf "%s\n" "${date}"; while read line; do
...
done; )
The size of a variable you operate on is limited by the size of the
arguments you may pass to the underlying exec(3) call [13]. Usually
it's on the order of hundreds of KB.
When you don't want to use awk's "getline" while-loops, the usual
practice is feeding awk with multiple files and detecting the end of
the first one with the NR==FNR check:
$ echo www-data | awk 'NR==FNR{split($0,a,":"); b[a[1]]=$0; next} \
{print 123, b[$1]}' /etc/passwd -
Furthermore, jumping between files listed on the command line is done
with the "nextfile" awk statement, which is not POSIX but is widely
supported (e.g. by gawk and mawk).
To pass more line oriented data to awk(1), you may send it as an awk
variable:
$ a=$( seq 1 10; ); echo | mawk -v a="$a" \
'END{split(a,f,"\n"); for (i in f) print f[i]}'
But if you think you need arrays of arrays, kinds of linked lists and
so on, it's time to reevaluate whether you'd still be able to read
this script, should you solve everything with shell/awk.
### 2.6. Change file
File editing is not as trivial as you may expect. The simplest way is
in-place sed editing with the GNU sed "-i" switch. But what if you
don't have GNU tools, or want to edit the file with awk?
Usual template looks like this [14]:
inplace() {
local input tmp
tmp=$( mktemp )
[ $? -ne 0 ] && { echo failed creating temp file; exit 1; }
trap "rm -f '${tmp}'" 0
input="$1"
shift
"$@" <"${input}" >"${tmp}" && cat "${tmp}" >"${input}"
rm -f "${tmp}"
}
inplace "target_file" sed "s/foo/bar/"
You may certainly use mv or cp, but cat here is the safest option for
writing the changes back, as it preserves permissions and hard links.
This is the behavior ed(1) provides.
A more complex example involves sharing files for creation/removal/
write access between group members, which requires mode 2770 on the
directory, umask 002 and proper FS ACL settings. cat is the only
utility which won't break permissions on modification. In such
environments you may additionally check files with "chmod +w".
Pay attention to pending disk writes with sync utility [15].
### 2.7. Check lock
Depending on system preferences, /var/lock or /var/run/lock can be
available for unprivileged locks.
Locking using mkdir(1) is preferred to other methods because it's an
atomic operation (no separate "check lock" then "create lock" steps)
and is native.
mkdir "$1" 2>/dev/null || { echo locked; exit 1; }
Prepare the parent path with a separate call, as "mkdir -p" ignores an
existing target directory and would never fail to grab the lock.
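Putting the pieces together, a minimal lock sketch might look like
this (the lock path is illustrative):

```shell
# minimal locking sketch; the lock path is made up
lock="${TMPDIR:-/tmp}/myscript.lock"

if ! mkdir "${lock}" 2>/dev/null; then
    echo "already running" >&2
    exit 1
fi
# release the lock on any exit; EXIT may fire after INT/TERM in some
# shells, hence the silenced second rmdir
trap 'rmdir "${lock}" 2>/dev/null' EXIT INT TERM

# ... critical section ...
```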
3. Commentary on tools
----------------------
When you lack arrays (and given shell execution speed) you are kind of
forced to pipe data through specialized filters. Luckily there are
enough of them nowadays.
### 3.1. Working with pipes: stdbuf and pv
You may essentially expect realtime output in piped constructs, but
the result depends on the line buffering behavior of the tools used.
stdbuf(1) is a wrapper which adjusts a target tool's buffering so each
line is output as soon as it's ready [16]. This matters most when you
mangle heavy datasets. Try something like this to get a better idea:
$ while :; do date; sleep 1; done | mawk -W interactive '{print}' -
A fast grep won't help you if you pass its results through a tool
without line buffering. Almost every tool in your standard toolset
(grep, cut, tr, etc.) requires tweaks or an external wrapper.
Line buffering support per tool:
- grep: GNU grep has "--line-buffered" switch
- mawk: has "-W interactive" switch
- sed: has "-u" switch
- cat: OK
- gawk, nawk, tr, cut: require stdbuf wrapper
Solutions outside of Linux vary. See [17] for detailed explanations
and a comparison of solutions (e.g. the TCL expect "unbuffer"
program). For awk, "fflush()" (POSIX) may be attempted.
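For instance, tr has no buffering switch of its own, so on a GNU
system it can be wrapped like the mawk example above (a sketch,
assuming GNU coreutils stdbuf is available):

```shell
# -oL forces line buffered stdout on tr, so each uppercased date line
# appears immediately instead of after a 4 KB block fills up
while :; do date; sleep 1; done | stdbuf -oL tr 'a-z' 'A-Z'
```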
pv(1) helps when you want to know the quantity of data passed over a
pipe, measure network speed with netcat and so on [18]. mbuffer(1) can
be used as an alternative to pv, exclusively for buffering tasks.
### 3.2. Notes on grep
Looks like in 2018 you can't market "grep" without "data science",
"big data" and other buzzwords. OK, here are a few hints for a faster
grep. First of all, the "-F" switch stops interpreting the pattern as
a regular expression, which means a faster grep. Other things to
consider: setting "LANG=C" instead of a UTF locale, and utilizing your
multiple CPU cores. See the relevant publications [19] for xargs(1)
(see the "-P" switch) and parallel(1) (note the "--linebuffer"
switch). Also don't forget about the "-m" switch if you know you only
need a few matches.
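A sketch of spreading a fixed-string search over CPU cores with GNU
find/xargs (the paths, job count, batch size and pattern are all
illustrative):

```shell
# search many logs in parallel: 4 jobs, up to 100 files per grep;
# -print0/-0 keep odd filenames safe (GNU/BSD extensions)
pattern="ERROR"
find /var/log -name '*.log' -print0 | \
    LANG=C xargs -0 -n100 -P4 grep -F -- "${pattern}"
```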
Again, remember about "grep --" when your search argument may start
with a dash.
In case of a fuzzy search term, there's approximate grep: agrep(1).
This tool searches regex patterns within a given Levenshtein distance
(which covers all kinds of letter rearrangement: mixing, dropping or
inserting extra letters). Try agrep on the next example to get the
idea:
wesnoth # agrep -0 wesnoth
wesnoht # agrep -1 wesnoth
westnorth # agrep -2 wesnoth
western # agrep -3 wesnoth
GNU grep is also good for searching binary data:
$ LANG=C grep -obUaP "PK\x03\x04" chrome_extension.crx
Which can also be performed with binwalk(1). And if you just want to
peek at a file for readable strings, there's the strings(1) tool from
binutils.
### 3.3. Notes on sed
The most important thing you should remember about sed: it's a tool,
not a programming language. People may write brainfuck interpreters in
sed - just leave them alone.
You probably know about the legacy basic and extended flavours of
regular expressions (try "man 7 regex" on Linux) - "Some people, when
confronted with a problem, think "I know, I'll use regular
expressions." Now they have two problems" //Jamie Zawinski. None of
these, POSIX or GNU, knows about the non-greedy quantifier. The usual
workaround for the look-ahead feature, for a single character, looks
like this:
$ echo "foo
https://www.ietf.org/standards/ bar" | \
sed 's|http[s]*://\([^/]*\)/[^ ]*|\1|'
foo www.ietf.org bar
For non-greedy replacement of a character sequence, the target
sequence is first replaced with a character unique for the given
input, and then the previous example is followed.
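For example, to drop everything up to the first occurrence of the
two-character delimiter "==", assuming "|" does not occur in the
input:

```shell
# map the two-char delimiter to a single unused char, apply the
# single-char non-greedy idiom, then map the delimiter back
echo 'pre==mid==post' | sed 's/==/|/g; s/^[^|]*|//; s/|/==/g'
# prints: mid==post (a greedy "s/^.*==//" would leave just "post")
```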
If you want to avoid the sed extended regexp flag inconsistency, the
usual thing lacking in basic regexp is the "?" quantifier, which can
be emulated with the POSIX BRE interval "x\{0,1\}" (or a GNU-style
"\(x\|\)" alternation; sed prefers the leftmost matching part).
GNU sed is unicode aware and supports lower/upper case conversion which
works like this:
$ echo Aa Bb Cc | sed 's/\(..\) \(..\) \(..\)/\L\1 \U\2 \E\3/'
aa BB Cc
Depending on the implementation, sed may force you to write each
command under a separate "-e" option and can be strict about closing
semicolons ";", like here:
$ sed -n '/marker/{p;q;}' example
A safe assumption about sed is that it doesn't know about escape
sequences, so you have to provide the desired character either
literally (e.g. a tab character) or from a shell variable, like
"tabchar=$( printf '\011' )".
### 3.4. Notes on awk
Which awk? The short answer: gawk when you need UTF-8 support, mawk
when you don't. Relevant literature: mawk vs gawk and other languages
speed comparison [20].
mawk 1.3.3, the most widely distributed version, suffers from a number
of sore bugs, like the lack of regex quantifier "{}" support.
A few survival hints for awk:
- delete an array portably: split("", array)
- a variable passed as a parameter to a function must be an array,
  like global[0], to be mutable:
$ echo | mawk 'function a(){b=5} { b=2; a(); print b}'
5
$ echo | mawk 'function a(c){c=5} { b=2; a(b); print b}'
2
$ echo | mawk 'function a(c){c[0]=5} { b[0]=2; a(b); print b[0]}'
5
### 3.5. Notes on portable syntax
When you want to know whether a tool is available, the most portable
way is "command -v", which should be a built-in:
$ AWK=$( command -v mawk 2>/dev/null ); AWK="${AWK:-awk}"
Sometimes you have to access user local directories, which the FHS
(Filesystem Hierarchy Standard) doesn't cover. Two well established
paths are "~/.config" and "~/.local/share". There is also the X
Desktop Group approach of searching ~/.config/user-dirs.dirs with
xdg-user-dir(1), with XDG_CONFIG_HOME and XDG_DATA_HOME corresponding
to the previously mentioned dirs. A graceful query of the local
configuration path may look like this, but it's most certainly just
"~/.config/":
$ config="${XDG_CONFIG_HOME:-${HOME}/.config}/ACMETOOL/config"
Temporary files can be attempted at TMPDIR (which is POSIX), then at
XDG_RUNTIME_DIR (a relatively recent addition), with /tmp as the last
resort.
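A sketch of that fallback order (the file name template is made up):

```shell
# pick a temp directory: TMPDIR, then XDG_RUNTIME_DIR, then /tmp
tmpdir="${TMPDIR:-${XDG_RUNTIME_DIR:-/tmp}}"
tmpfile=$( mktemp "${tmpdir}/myscript.XXXXXX" ) || exit 1
trap 'rm -f "${tmpfile}"' EXIT   # clean up on exit
printf "temp file: %s\n" "${tmpfile}"
```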
Most common GNU/BSD discrepancies:
- sed extended regex switch: "sed -r" vs "sed -E"
- print file in reverse order: "tac" vs "tail -r"
- file modification date in seconds since epoch:
"stat -c %Y" vs "stat -f %m"
BSD naming of GNU utils starts with "g", like "gawk": "gdate", "gsed",
etc. BSD date differs from GNU date in some details. The following
ISO8601 example works for both date versions:
$ date -u +'%Y-%m-%dT%H:%M:%SZ'
The reverse conversion to seconds since the epoch differs (neither
reads stdin): GNU date takes the string as an argument, as in
"date -u -d '2018-03-05T19:24:21Z' +%s", while BSD date needs "-j"
with an explicit "-f" format.
Last but not least, your sort(1) results surprisingly may vary between
systems depending on LC_COLLATE setting.
### 3.6. Notes on UTF-8 compatibility
gawk works with UTF-8 by default. The awk language wasn't crafted with
multibyte encodings in mind, so there can be problems if you work with
binary data. By default sprintf("%c", 150) will print a multibyte char
like "0xC2 0x96" and fail to compare strings if you pass 0x96 from a
shell variable like 'printf "\226"' in dash. gawk here requires the
"-b" switch; mawk works well by default.
GNU sed is unicode aware. Support in other seds varies; without proper
support you won't get it working at all, because a single-char "."
will have the wrong length.
"wc -c" counts bytes by definition; "wc -m" is supposed to count
characters, but support varies. tr doesn't handle multibyte chars
either, so for transliteration you have to resort to the GNU sed "y"
command. You shouldn't rely on the shell "${#var}" expansion, as it
still often reports the number of bytes and not characters.
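A small demonstration of bytes vs characters (the sample character is
built from octal escapes so the snippet survives any file encoding):

```shell
# "a-umlaut" (U+00E4) is two bytes in UTF-8
s=$( printf '\303\244' )
printf '%s' "$s" | wc -c   # bytes: 2 everywhere
printf '%s' "$s" | wc -m   # chars: 1 in a UTF-8 locale, 2 in C locale
```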
### 3.7. Working with XML and HTML: curl, tidy and xmlstarlet
This is relevant to web page scraping.
There are two URL encodings you may encounter: punycode and percent
encoding. Punycode, with its "xn--" prefix, covers internationalized
domains and can be converted back and forth with idn(1).
Percent encoding operations (urldecode and urlencode):
$ echo 'a%20b+c' | sed "s|+| |g;s|%|\\\\x|g" | xargs -L1 printf "%b"
$ printf "%s" test | od -An -tx1 | tr ' ' '%' | xargs printf "%s"
Pages are usually grabbed with curl(1) or wget(1), with wget being
more suitable for mirroring/recursive download. curl invocations
requesting compressed data and setting a user agent (with "${url}"
standing for the target):
$ curl -H 'Accept-encoding: gzip' "${url}" | gunzip -
$ curl -A 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/40.0.2214.85 Safari/537.36' "${url}"
Picky servers require the referer and user agent to be set. See also
the curl manpage for the "retry" set of options for responsible
downloads. If you fetch data repeatedly within some seconds, curl
handles timeout delays well, whereas not a single option ensures wget
runs within a defined time.
Don't forget to strip CR characters with "tr -d '\r'".
When you're done fetching, it's time to convert your HTML source to
uniform XML. tidy(1) corrects and reformats HTML into well-formed
XHTML [21]:
$ tidy -iqw0 -asxml input.html >output.xml
When you've finished scalpeling the sample input, you may validate and
pretty print it with xmllint(1), bundled with libxml(3):
$ xmllint --format ugly.xml >pretty.xml
Now it's time to finally make queries. The xmlstarlet(1) "sel" command
provides a compact way of writing XSLT right on the command line:
$ xmlstarlet sel --help
The XML tree structure inspection command is handy for unfamiliar
input:
$ xmlstarlet el -u
Finally, you may omit the namespace prefix by using the default
namespace shortcut "_":
$ xmlstarlet sel -t -v "//_:p[@content]"
One notable limitation of xmlstarlet is input file size. It doesn't
provide a SAX parser and will fail on files a few hundred MB in size.
Most important, W3 maintains a set of C tools dedicated to XML/HTML
work, like hxtoc for TOC creation, hxpipe for conversion to a
YAML-like structure more usable for awk(1) parsing, hxselect for
querying with CSS selectors, and others [22].
### 3.8. Working with JSON: jq
JSON (RFC4627) does what it has to - serializes data in a text
representation without much bloat, so you can read it without any kind
of decoder [23]. It lacks standard validation methods, but work has
been done in that direction [24].
The best tool to mutate and query JSON data is jq, which is already
popular enough.
I should mention that YAML with one entity per line also lacks the
bloat of XML and is just perfect for parsing with awk without extra
tools.
jq is also tolerant of JSON lines input (each line is a valid json
expression) [25], which is a compromise for mangling it with grep/awk
alongside jq, and solves the disgusting problem of the root braces {}.
Besides, it's great for parallelization.
An example jq call to rename field "a" to "b":
$ jq -c 'with_entries(if .key == "a"
    then {key: "b", value} else . end)'
### 3.9. Working with CSV: miller
The most common predecessor of CSV data is some flavour of
spreadsheet. There is the ubiquitous python-based csvkit, but probably
more correct would be Gnumeric's supplementary converter [26]:
Now that we have our data ready, let's process it. Without quoted
values, the usual tools are enough: awk with the "-F" flag, cut, sort
with the "-k" flag. A handy shortcut for skipping CSV fields on search
(with "tabchar" as defined earlier):
$ field="[^,${tabchar}]+[,${tabchar}]"
$ egrep -m1 "^${field}${field}${search_string}[,${tabchar}]"
See "4.2 Internal field separator" chapter for more examples for CSV.
miller(1) [27] jumps in straight where jq got its place: it's easy to
drag around a single binary, and it combines the speed of awk with the
features of popular scripting languages' CSV tools. Recently (around
2016?) it finally got double quoted RFC4180 compliant CSV support.
Depending on your language preferences, there are plenty of CSV
processing tools. Two of these are csvkit and q (both in python).
While csvkit is obvious, q is a way to query CSV with SQL [28].
4. Advanced topics
------------------
The following problems are not related to average scripting needs. As
was stated before, think twice about the right choice of tool.
### 4.1. Reading user input
You may attach GNU readline capabilities to any input using rlwrap;
just note that it will obviously store the input history (consult the
rlwrap manpage). Here is a Debian specific example for running the SDF
chat program:
$ rlwrap annotate-output ssh [email protected] 'com 2>/dev/null'
For timeouts on line/char reading routines, the most portable solution
involves stty with dd. Otherwise you may get "line -t", coreutils
timeout(1) or bash "read -t" working.
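A hedged sketch using coreutils timeout(1), one of the options above;
the 5 second limit and the fallback value are arbitrary:

```shell
# fall back to a default when no input line arrives within 5 seconds;
# the line is read in a child sh so timeout(1) can kill it
answer=$( timeout 5 sh -c 'IFS= read -r line && printf "%s" "$line"' ) \
    || answer="default"
printf "got: %s\n" "${answer}"
```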
### 4.2. Internal field separator
Which you usually shouldn't touch, because this path carries only
pitfalls.
IFS is a set of characters which act as field separators when
unquoted expansions are split. The first IFS char is also used for
joining on output ("$*"). IFS can be handy when you are certain about
the proper composition of your input. Any user provided input without
proper quotation carries the possibility of arbitrary code execution.
Let's see how it works:
$ tab=$( printf '\011' )
$ get_element() { eval printf \"\%s\" \$$(( $1 + 1 )) ; }
$ get_csv() { local id IFS; id="$1"; shift; IFS=",${tab}"
get_element $id $* ; }
$ get_csv 3 'foo,\\$(date);date \$(date),bar'
bar
$ print_csv() { local IFS; IFS=,; printf "%s" "$*"; }
$ print_csv a b c
a,b,c
Such code doesn't outsource tasks to external calls, and especially
not to pipes of external calls, but it doesn't provide a great
advantage over e.g. printf piped to cut(1) either.
IFS is often used in read cycles:
# read whole line
while IFS= read -r line; do
printf "%s\n" "${line}"
done
# read CSV fields with residue going into $c3
debug() { printf "<%s>\n" "$@"; }
while IFS=, read -r c1 c2 c3; do
debug "$c1" "$c2" "$c3"
done <csv_example
Preserve your IFS in some variable if you're going to write a script
to be inlined somewhere. Also note that ASCII has a special set of
unit separation characters, good for a unique IFS: 1C-1F (see RFC20 or
try "man 7 ascii").
### 4.3. Command line options manipulation
You don't usually want to do this, though there are a few useful
cases. Restoring options for getopts parsing requires resetting the
previously invoked getopts with OPTIND set to 1. This is also
important if your script is going to be inlined.
OPTIND=1
while getopts ...
Example code to preserve and rearrange positional params (note the
"prefix" leading parameter):
cmd_prefix="prefix"
escape_params() {
for i in "$@"; do
printf "%s" "$i" | sed -e "s/'/'\"'\"'/g" -e "s/.*/'&' /"
done
}
params="${cmd_prefix} "$( escape_params "$@" )
printf "DEBUG %s\n" "${params}"
eval set -- ${params}
printf "DEBUG arg: \"%s\"\n" "$@"
shift 3
# save params
params=$( escape_params "$@" )
# split string
split_me="a:b:c"
backifs="${IFS}"
IFS=:
set -- ${split_me}
printf "TEST arg: \"%s\"\n" "$@"
IFS="${backifs}"
# restore params
eval set -- ${params}
printf "DEBUG arg: \"%s\"\n" "$@"
This code doesn't handle parameters with newlines.
### 4.4. Nested constructions difficulties and recursion
Pipes spawn subshells. This is a problem because you can't pass a
variable from the subshell back to the parent script, which causes
painful code rearrangements and workarounds.
One workaround for being unable to pass a variable upwards is using
temporary files as stdin for e.g. while loops. Another is capturing
stdout/stderr output into a variable. Make sure you know what the
"Useless Use of cat Award" is about [30].
You should make a habit of writing conditional switches in the full
"if-then-else" form and not the compact "[ ] &&/|| something" form: in
"test && a || b", b runs not only when the test fails, but also when a
fails.
Recursion is useful in shell scripts, because without piped subshell
calls variable scope is not isolated. A safe assumption would be
something below 1000 calls (an artificial limit in mksh and zsh;
others prefer to crash). Test with:
$ sh -c 'a() { echo $1; a $(( $1 + 1 )); }; a 1'
### 4.5. Libraries and trap cascading
You may source heavy chunks of code formed into libraries, which may
also perform some kind of initialization. One way of preventing
duplicate runs of such init code is dedicated library management with
core code like this:
require_plugin() {
    if ! printf "%s" "${plugin_array}" | \
            grep -qw "$1" 2>/dev/null; then
        echo >&2 "Exporting: $1"
        plugin_array="${plugin_array} $1"
        . "$LIBPATH/$1"
    fi
}
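A hypothetical usage sketch (the library directory and plugin file are made up for the demonstration):

```shell
require_plugin() {
    if ! printf "%s" "${plugin_array}" | \
            grep -qw "$1" 2>/dev/null; then
        echo >&2 "Exporting: $1"
        plugin_array="${plugin_array} $1"
        . "$LIBPATH/$1"
    fi
}
# fake library for the demo
LIBPATH=$(mktemp -d)
echo 'greeting="hello from lib"' > "$LIBPATH/demo.sh"
require_plugin demo.sh      # sourced and recorded
require_plugin demo.sh      # second call is a no-op
echo "$greeting"            # hello from lib
rm -rf "$LIBPATH"
```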
Your very first inlined script will raise the question of preserving
parent trap handlers. Here is a proposed solution:
# $1 -- trap name (to prevent duplicates)
# $2 -- trap command
# $3... -- signals
trap_add() {
    # stay silent on incorrect options and on failure to set trap
    [ -z "$3" ] && return
    printf "%s" "${trap_array}" | grep -qw "$1" 2>/dev/null && return 0
    trap_array="${trap_array} $1 "
    trap_cmd="$2"; shift 2
    for trap_signal in "$@"; do
        trap -- "$(
            extract_trap_cmd() { printf '%s\n' "$3"; }
            eval extract_trap_cmd $( trap | \
                sed -n "H;/^trap --/h;/${trap_signal}$/{x;p;q;}" )
            printf '%s\n' "${trap_cmd}"
        );" "${trap_signal}" || return
    done
}
# debug
trap_add test 'echo 123' INT TERM EXIT
trap_add test2 'date' INT TERM EXIT
trap
Never forget to test your code chunks in multiple shells: e.g. dash and
mksh do not implement the POSIX exception under which a command
substitution consisting solely of trap is treated as if it were not run
in a subshell and so dumps the parent's existing signal handlers. The
former example therefore works only in bash/zsh; a portable version
must keep all installed signal handlers in dedicated variables.
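A sketch of such a portable variant for a single signal (the names are made up; a full version would keep one variable per trapped signal and avoid duplicates as trap_add above does):

```shell
# accumulate INT handlers in our own variable instead of asking
# the shell, so no command-substitution trap quirk is involved
trap_INT=''
trap_add_portable() {
    trap_INT="${trap_INT:+${trap_INT}; }$1"
    trap -- "${trap_INT}" INT
}
trap_add_portable 'echo first'
trap_add_portable 'echo second'
# trap_INT now holds: echo first; echo second
```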
### 4.6. Debugging
Debugging shell scripts is done either with the "set -x" shell option,
which prints each command as it is executed (with a suitable PS4 prompt
it also works well for performance troubleshooting), or with debug
printfs all around the code. The usual debug routine is
'printf "debug \"%s\"\n" "$@"', which expands to one line per
parameter.
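For example:

```shell
# the format string is reapplied once per remaining argument,
# so quoted "$@" yields one debug line per positional parameter
set -- one "two words" three
printf "debug \"%s\"\n" "$@"
# debug "one"
# debug "two words"
# debug "three"
```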
hexdump(1) may be missing on the target machine. Alternatives are
xxd(1), often distributed with vim(1), and od(1), with od being the
most ubiquitous.
Problem cases can be much more obscure. One common pitfall is a
forgotten ";" inside "$()"/"{}"/"()" constructs when the closing
bracket is on the same line as the last instruction.
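For example:

```shell
# broken: f() { echo hi }
#   "}" is taken as an argument to echo and the shell keeps
#   waiting for the real closing brace
f() { echo hi; }   # a ";" (or newline) is required before "}"
f                  # hi
```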
### 4.7. Testing and deploying in new environment
shtest [30] is a single file in POSIX shell without any dependencies.
It takes a different approach from unit tests in that it doesn't force
you to write anything. You copy the commands to be tested with some
prefix (by default "$", like any documentation does), run "shtest -r",
which records those commands' output and exit statuses, and finally run
the tests in a new environment (or under a different shell) to get diff
output if something went wrong. shtest tests are also easily embeddable
in markdown documentation.
shtest was inspired by the similar cram tool written in Python [31],
except that it doesn't need Python to run.
For more systematic approach see BATS [32], which can output
TAP-compatible messages, suitable for continuous integration.
### 4.8. Networking with shell
This part is completely esoteric and carries little practical value.
With the already mentioned curl doing its job for the HTTP (and
GOPHER!) protocols, it is horribly inefficient, yet possible, to do
simple tasks on binary data flows with just shell and awk. You still
need a few specialized tools.
tcpick(8) encodes byte streams into hex [33]:
/usr/bin/stdbuf -oL /usr/sbin/tcpick -i eth0 \
"src $1 && port 5122 && \
tcp[((tcp[12:1] & 0xf0) >> 2):2] = 0x9500" -yH | awk ...
Actually, you may do full-fledged networking with just netcat and sh
[34]. That one was a functional online game client: the netcat stream
was read word by word with dd, while expect-like behavior and binary
packet parsing were done in pure shell. It still serves as a good
example of dissecting binary data with shell alone.
### 4.9. Paste safety
It's a survival guide after all, so let's step away from the target
subject and look at how you actually do scripting. Obviously, no one
reads man pages; code gets written by copying parts from the web.
See this question [35] for details and at the very least inspect your
paste with "xsel -o | hd".
5. Further reading
------------------
POSIX standard is your best friend [36].
You should obviously read the manpages. The dash and mawk manpages are
excellent and compact, both for learning the corresponding topics and
for use as a language reference.
For GNU bloatware, important info is often contained in GNU info(1)
pages (which means the man pages may miss what you seek; take a look at
e.g. sed). This format should certainly die one day, but so far it is
accessible with info(1), or with the somewhat abandoned pinfo(1) for
colored output. And of course it is available online [37].
Then take a look at these resources:
- the archived comp.unix.questions / comp.unix.shell FAQ, which still
contains relevant answers on particular scripting topics [38].
- harmful.cat-v.org [39]
And finally, a few widely circulated individual pages:
- Rich's POSIX sh tricks, covering advanced topics [40]
- Sculpting text with regex, grep, sed, awk [41]
- 7 command-line tools for data science [42]
- Greg's Bash Pitfalls page, which still covers a lot of newbie
POSIX-related pitfalls [43]
6. References
-------------
1. Introduction
[1] Ubuntu Wiki, "Dash as /bin/sh"
https://wiki.ubuntu.com/DashAsBinSh
[2] Greg's Wiki, "How to make bash scripts work in dash"
https://mywiki.wooledge.org/Bashism
[3] checkbashisms - check for bashisms in /bin/sh scripts
https://manpages.debian.org/testing/devscripts/checkbashisms.1.en.html
[4] Tom Christiansen, "Csh Programming Considered Harmful", 1996-10-06
http://harmful.cat-v.org/software/csh
[5] Bruce Barnett, "Top Ten Reasons not to use the C shell", 2009-06-28
http://www.grymoire.com/unix/CshTop10.txt
[6] Tom Duff, "Rc — The Plan 9 Shell"
http://doc.cat-v.org/plan_9/4th_edition/papers/rc
[7] Rob Pike, "Structural Regular Expressions", 1987
http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf
2. Scripting basics
[8] Stéphane Chazelas. "Why is printf better than echo?"
https://unix.stackexchange.com/a/65819
[9] Why is $(...) preferred over `...` (backticks)?
http://mywiki.wooledge.org/BashFAQ/082
[10] Fred Foo. "When do we need curly braces around shell variables?"
https://stackoverflow.com/a/8748880
[11] Tobia. Make xargs execute the command once for each line of input
https://stackoverflow.com/a/28806991
[12] Dennis Williamson. "Cross-platform getopt for a shell script"
https://stackoverflow.com/a/2728625
[13] Graeme. "What defines the maximum size for a command single argument?"
https://unix.stackexchange.com/a/120842
[14] William Pursell. "sed edit file in place"
https://stackoverflow.com/a/12696585
[15] Wikipedia - sync (Unix)
https://en.wikipedia.org/wiki/Sync_(Unix)
3. Commentary on tools
[16] Pádraig Brady, "stdio buffering", 2006-05-26
http://www.pixelbeat.org/programming/stdio_buffering/
[17] Aaron Digulla. "Turn off buffering in pipe"
https://unix.stackexchange.com/questions/25372/turn-off-buffering-in-pipe
[18] Martin Streicher, "Speaking UNIX: Peering into pipes", 2009-11-03
https://www.ibm.com/developerworks/aix/library/au-spunix_pipeviewer/index.html
[19] Adam Drake, "Command-line Tools can be 235x Faster than your Hadoop Cluster", 2014-01-18
https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
[20] Brendan O'Connor, "Don’t MAWK AWK – the fastest and most elegant big data munging language!", 2012-10-25
https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/
[21] The granddaddy of HTML tools, with support for modern standards
https://github.com/htacg/tidy-html5
[22] HTML and XML manipulation utilities
https://www.w3.org/Tools/HTML-XML-utils/README
[23] Douglas Crockford, "JSON: The Fat-Free Alternative to XML", 2006-12-06
http://www.json.org/fatfree.html
[24] JSON Schema is a vocabulary that allows you to annotate and validate JSON documents
http://json-schema.org/
[25] JSON Lines
http://jsonlines.org/
[26] The Gnumeric Manual, "Converting Files"
https://help.gnome.org/users/gnumeric/stable/sect-files-ssconvert.html.en
[27] Miller is like awk, sed, cut, join, and sort for CSV
https://johnkerl.org/miller/doc/index.html
[28] Run SQL directly on CSV files
http://harelba.github.io/q/
4. Advanced topics
[29] Useless Use of Cat Award
http://porkmail.org/era/unix/award.html#uucaletter
[30] shtest - run command line tests
https://github.com/uuuuu/shtest
[31] Cram is a functional testing framework based on Mercurial's unified test format
https://bitheap.org/cram/
[32] Bats: Bash Automated Testing System
https://github.com/sstephenson/bats
[33] tcpick with awk example
https://github.com/uuuuu/tmww/blob/master/utils/accsniffer
[34] shamana - tmwa ghetto bot engine made with POSIX shell
https://github.com/uuuuu/shamana
[35] Sam Hocevar. "How can I protect myself from this kind of clipboard abuse?"
https://security.stackexchange.com/questions/39118/how-can-i-protect-myself-from-this-kind-of-clipboard-abuse
5. Further reading
[36] The Open Group Base Specifications Issue 7, 2016 Edition
http://pubs.opengroup.org/onlinepubs/9699919799/
[37] GNU Coreutils
https://www.gnu.org/software/coreutils/manual/html_node/index.html
[38] Unix - Frequently Asked Questions
http://www.faqs.org/faqs/unix-faq/faq/
[39] Encyclopedia of things considered harmful
http://harmful.cat-v.org/
[40] Rich’s sh (POSIX shell) tricks
http://www.etalabs.net/sh_tricks.html
[41] Matt Might: Sculpting text with regex, grep, sed, awk
http://matt.might.net/articles/sculpting-text/
[42] Jeroen Janssens, "7 command-line tools for data science", 2013-09-19
http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html
[43] Greg's Wiki, "Bash Pitfalls"
http://mywiki.wooledge.org/BashPitfalls
7. Changelog
------------
2018-03-08 initial release
2018-03-29 ADD missed to mention W3 hx* tools from html-xml-utils
ADD few newbie awk hints
2018-04-11 ADD portable urlencode/urldecode, fast querying with "join"
2018-05-08 FIX example in 3.4 "Notes on awk" about mutable variable passed as parameter