The shell of a common tower snail, 1826. A beautiful organism which,
like modern shell programs, builds around what’s already in place.
Source: [11]Wikipedia. License: [12]CC BY-SA 3.0.
An Ode to the Shell
Emulating common SQL operations with just shell tools, and the UNIX
philosophy and microservices
Vijay Lakshminarayanan
Feb 6 · 24 min read
The shell is the coolest thing out there. It is unmatched in terms of
pedagogy — helping people learn concepts — and yet, people seem to know
very little of it. I figured, since I like complaining, I’d write this
and complain that developers aren’t reading my post rather than
complain about their ignorance (which is always unfair because
ignorance isn’t a sin). This has the added benefit that now, when
friends challenge me about my inaction, I can point to this post.
So, here we go. We’re going to cover the following “cool” industry
topics:
* [16]MapReduce
* [17]Databases
* [18]Microservices
(It’s possible some of these topics aren’t as cool as they once were
but industry trends, like fashion trends, keep changing and I don’t pay
much attention to them. After all, I’m someone who appreciates
something as retro as the shell. And, if I’m able to do all right in
this industry despite eschewing trends — and thus avoiding the dreaded
[19]FOMO — I argue it’s because of the shell. The shell lets me
understand the concepts behind whatever’s new and fashionable without
getting bogged down by the details.)
Preliminaries
Before we get into the details of the programs below, let’s go through
a lightning tour of shell programs. The best way to understand the
programs shown here is to run them. Run them, tweak them, run them
again. In time you’ll be writing your own scripts and we can exchange
notes. One wonderful thing I’ve observed collaborating on the shell is
that no two people solve any problem the same way, so exchanging notes
is always educational.
Quick note: throughout this article, when I say shell, I mean the bash
shell. If you already know some other shell, great, skim through this
section and see how the discussed concepts can be applied in your shell
of choice.
Feel free to skip this section if you’re already familiar with basic
shell primitives.
Quick note 2: for pedagogical reasons, I minimize the use of shell
primitives. It’s a given that many of the scripts in this post can be
done more concisely or differently than what’s shown. There are some
things which are “wrong” in the technical sense but still justifiable,
in the same way that introductory mathematics courses teaching numbers
only discuss the Naturals (0, 1, 2, 3…) without discussing the other
infinite classes of numbers like the Integers, Rationals, Reals, and
Imaginaries[20]¹.
A mental model for shell programs
One way to think about shell programs is as functions which accept
arguments and return some output. All their arguments are strings and
all their return types are strings. Every invocation takes the form
program arg1 arg2 arg3... argN. The program interprets its args based
on its implementation. The documentation of any program, X, is
typically available by typing man X. As a start, try man man.
Some basic programs
cat — prints each given input file
man — the reference manual for most shell programs
echo — outputs its arguments, separated by spaces, terminated with a
newline
rev — reverses a line characterwise
sort — sorts, merges, or compares all the lines from the given files
There are others but we’ll look into them as we go.
Some shell primitives
> redirect — writes its input to a file
| pipe — passes the output of one program to the next
' quote — quotes its arguments, preventing their evaluation
# comment — the shell ignores everything that follows until the end of
the line
There are others but we won’t use them. Once you’ve played with these
examples, you can explore them on your own.
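Here’s a hedged preview tying all four primitives together in two
lines (demo is a file name I’ve made up); we’ll meet each of them
properly below:
$ echo 'a quoted string' | rev > demo # quote, pipe, redirect, comment
$ cat demo
gnirts detouq a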
Using the above shell examples
For all our examples, we’ll need some files to act on. Let’s create
them using the shell.
For our first file, we’ll create a file, named digits, containing the
digits 0–9 on one line like so:
$ echo 0 1 2 3 4 5 6 7 8 9 > digits
echo repeats its arguments
> writes its input to the file digits. In this case, it so happens that
the input to > is the output of echo.
You can print the file using cat
$ cat digits
0 1 2 3 4 5 6 7 8 9
If you want to repeat the contents of digits, just pass it twice to
cat.
$ cat digits digits
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
Using just the digits 0–9 isn’t sufficient. Let’s change things a
little. Let’s also save the digits in reversed order. We could just
type it out in reverse, but why type when the shell can do it for you?
We’ll instead use the rev program to do our work. (Programming
principle: DRY — don’t repeat yourself.)
$ rev digits
9 8 7 6 5 4 3 2 1 0
Let’s save the output of the above program to a file named stigid,
which is “digits” reversed. (Shell programs are full of inside jokes
like this. See, for instance, the program [21]tac. It’s the same as cat
but prints its lines in reverse order.)
$ rev digits > stigid
Now, we can cat both these files
$ cat stigid digits
9 8 7 6 5 4 3 2 1 0
0 1 2 3 4 5 6 7 8 9
If you’ve followed until now, you’re ready for the second shell
primitive, the | (read as “pipe”). The pipe operator is a way for one
program to communicate with the next. (Get it? It acts as a pipe.)
Once we’ve cat’d the above two files, we can send their output to sort
which will sort the lines like so:
$ cat stigid digits | sort
0 1 2 3 4 5 6 7 8 9
9 8 7 6 5 4 3 2 1 0
Even if you cat several outputs, sort will sort the lines as you
expect.
$ cat stigid digits stigid digits stigid | sort
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
9 8 7 6 5 4 3 2 1 0
9 8 7 6 5 4 3 2 1 0
9 8 7 6 5 4 3 2 1 0
Note that sort compares lines character by character starting from the
left, so in the above, the lines were ordered by comparing their
leading 0’s and 9’s.
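To see that the comparison continues past the first column, here’s a
small sketch (t1 through t3 are my file names):
$ echo 5 b > t1
$ echo 5 a > t2
$ echo 3 z > t3
$ cat t1 t2 t3 | sort
3 z
5 a
5 b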
We’ll consider more complex examples and the other primitives, # and ',
as we get into the actual concepts.
How to read this post
“The only way to learn a new programming language is by writing
programs in it.” — Brian Kernighan and Dennis Ritchie, The C
Programming Language
Open up a bash shell terminal, copy-paste the sample commands into your
terminal; then explore, experiment, discover. As a matter of
convention, when showing shell commands, the first character is a $,
which indicates the shell prompt. Don’t paste it into your terminal.
MapReduce
Let’s quickly recap what MapReduce does before we look at replicating
it on the shell. I will also add that practically every shell pipeline
is an implementation of the MapReduce paradigm.
In MapReduce, you have an input which you convert (or transform or, you
know, map) into another form, doing this as many times as you feel
necessary, before finally aggregating the results. This aggregation is
called, you’ve guessed it, reduction.
The classic, “Hello, World!” example of a MapReduce program is a word
counter. Given a large corpus of words, you write a MapReduce program
whose task it is to count the frequency of occurrence of each word. As
a negative example, i.e., how not to go about learning MapReduce, I
refer you to the Apache MapReduce Tutorial page:
[22]https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html.
Let’s now do word count using the lowly shell.
First, we’ll create our sample input and, in the process, meet our
third shell primitive, ', the single quote.
$ echo 'hello
world
hello
goodbye
cruel
WORLD' > input.txt
The ' lets us generate a string that spans multiple lines. We echo this
into a file named input.txt.
Here’s how to do a word count on this input:
$ cat input.txt | sort | uniq -c
1 WORLD
1 cruel
1 goodbye
2 hello
1 world
Let’s break it down. The first part, cat input.txt, should be obvious
so I won’t belabor it. What do we get when passing it to sort?
$ cat input.txt | sort
WORLD
cruel
goodbye
hello
hello
world
As one might expect, the output of sort is sorted. And, as readers
familiar with ASCII are no doubt aware, the uppercase characters are
sorted before the lowercase ones.
Next, we have the program uniq to “[23]report or omit repeated lines”.
When invoked with the argument -c, uniq prefixes each distinct line
with the number of times it occurred. (uniq only counts adjacent
repeats, which is why we sort first.)
And with just three programs — cat, sort, and uniq — we have a simple
shell word counter.
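If you’d rather see the most frequent word first, one more sort pass
does it; -n compares the leading counts numerically and -r reverses
the order (a quick sketch):
$ cat input.txt | sort | uniq -c | sort -rn | head -n 1
      2 hello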
Translation to MapReduce terms
In a MapReduce program, the mappers output every word they read as the
word itself followed by 1. A mapper which received the words, “hello”,
“world”, “hello” would output “hello 1”, “world 1”, “hello 1”
respectively. After this phase, the MapReduce system sorts (typically
an external sort) the mapper output. The output after the sort phase
will be: “hello 1”, “hello 1”, “world 1”. Now, one guarantee the MR
system gives is that the same keys will always end up at the same
reducer. A reducer that receives “hello 1” followed by “hello 1” will
output “hello 2”. (A reducer that receives “hello 301”, “hello 13” will
output “hello 314” — it sums the values of the same keys.) The output
from all the reducers will be the word count.
In the case of our shell script, cat was our mapper which output every
word as itself, sort was the external sort, and uniq the reducer.
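To make the mapper stage concrete, here’s a hedged sketch that
literally emits a “word 1” pair per input word; the while read loop is
bash syntax we haven’t otherwise introduced:
$ cat input.txt | while read word; do echo $word 1; done | sort
WORLD 1
cruel 1
goodbye 1
hello 1
hello 1
world 1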
__________________________________________________________________
Now, suppose you want the same word counter, but counting words in a
case-insensitive manner. Right now, we count WORLD and world
separately which, you feel, isn’t ideal. In MapReduce terms, you need to introduce
a new mapper. In shell terms, you need to introduce one program in the
middle. We’ll get into the details after we see the solution:
$ cat input.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c
1 cruel
1 goodbye
2 hello
2 world
This introduces the program tr. It’s short for translate and does just
that. In the above program, it converts all uppercase characters to
lowercase characters. Below I print the output of tr alone.
$ cat input.txt | tr '[:upper:]' '[:lower:]'
hello
world
hello
goodbye
cruel
world
Of course, this is the simplest sample. You can do this and much more.
(And I give you an exercise below.) Let’s take a look at databases.
Databases
The shell allows you to explore databases without concerning yourself
with SQL or setting up a database. Many a time, I’ve used the shell to
extract answers from log files which would otherwise have been too
time-consuming to load into a database.
We’ll be looking at the following aspects of databases:
⒈ [24]Projection (select only some columns from the table)
⒉ [25]Filtering (removing unwanted rows, selecting only some rows)
⒊ [26]Joins
⒋ [27]Query Planning
As with all database tutorials, let’s first set up our data. In true
shell-ian fashion, our tables will be files; they’ll have a schema,
but we’ll have to keep it all in our heads. We’ll use the reference
dataset from MySQL. For more details about the dataset, see
[28]https://dev.mysql.com/doc/employee/en/employees-introduction.html.
(It’s licensed [29]CC BY-SA 3.0, the same as this post.)
We’ll look at the tables and their schemata as we go on.
Projection
Of course, we’ve already seen the simplest SQL projection possible. The
equivalent of SELECT * from a table is just cat. What we’re really
interested in is selecting only particular columns from the table. It’s
time to look at our first table, employees. Below is a sample:
$ head load_employees.data
10001 1953-09-02 Georgi Facello M 1986-06-26
10002 1964-06-02 Bezalel Simmel F 1985-11-21
10003 1959-12-03 Parto Bamford M 1986-08-28
10004 1954-05-01 Chirstian Koblick M 1986-12-01
10005 1955-01-21 Kyoichi Maliniak M 1989-09-12
10006 1953-04-20 Anneke Preusig F 1989-06-02
10007 1957-05-23 Tzvetan Zielinski F 1989-02-10
10008 1958-02-19 Saniya Kalloufi M 1994-09-15
10009 1952-04-19 Sumant Peac F 1985-02-18
10010 1963-06-01 Duangkaew Piveteau F 1989-08-24
The table’s schema is employee_id, birth_date, first_name, last_name,
gender, hire_date. As is typical in the shell, the columns are
space-separated. This is also the time to introduce the head program to
“[30]output the first part of files”. By default, head prints the first
ten lines of each of its given files and that’s what we see.
The entire table has a little over three hundred thousand lines
(300,024 to be exact) so we will use head most of the time. The full
file, [31]load_employees.data, is available to download at my
repository, [32]test_db-fork. It contains the same data as MySQL’s
[33]load_employees.dump with the SQL bits removed. The script that
converted the SQL to the above file is also available there[34]².
Let’s execute the equivalent of SELECT employee_id, first_name, gender FROM employees:
$ cat load_employees.data | head | cut -d' ' -f1,3,5
10001 Georgi M
10002 Bezalel F
10003 Parto M
10004 Chirstian M
10005 Kyoichi M
10006 Anneke F
10007 Tzvetan F
10008 Saniya M
10009 Sumant F
10010 Duangkaew F
cut’s documentation says it can “remove sections from each line of
files” and that’s exactly what we’ve done. As before, we only read the
first ten lines of the file. cut takes any character as its delimiter;
this is specified above with the argument -d' ', which tells cut to use
the space character as its delimiter. The argument -f specifies which
fields cut must retain in every line. In this case, we retain the
columns 1, 3, and 5.
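cut’s field list also accepts ranges, which we’ll rely on later (e.g.
-f6-8). A quick sketch on the same sample:
$ cat load_employees.data | head -n 3 | cut -d' ' -f3-4
Georgi Facello
Bezalel Simmel
Parto Bamford
The same trick works on ad-hoc data: given a hypothetical access.log
in Apache’s common format (status code in field 9), cut -d' ' -f9 |
sort | uniq -c yields a status-code histogram.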
Filtering
Most of the time when we query a database (or any data source) we are
interested in extracting only some portions of it. Portions that match
some criteria we’re interested in. Given what we know about the
employee table’s schema, let’s invent some queries and then go ahead
and execute them.
The queries will be:
⒈ Find all male employees and
⒉ Find the employee_ids and last names of all female employees employed
after the year 2000
Interested readers should try to solve it by themselves before looking
at the answers below. Typically your approach will differ from mine and
it’s always fun to exchange notes.
⒈ Find all male employees
$ cat load_employees.data | grep ' M [12]' | head
10001 1953-09-02 Georgi Facello M 1986-06-26
10003 1959-12-03 Parto Bamford M 1986-08-28
10004 1954-05-01 Chirstian Koblick M 1986-12-01
10005 1955-01-21 Kyoichi Maliniak M 1989-09-12
10008 1958-02-19 Saniya Kalloufi M 1994-09-15
10012 1960-10-04 Patricio Bridgland M 1992-12-18
10013 1963-06-07 Eberhardt Terkki M 1985-10-20
10014 1956-02-12 Berni Genin M 1987-03-11
10015 1959-08-19 Guoxiang Nooteboom M 1987-07-02
10016 1961-05-02 Kazuhito Cappelletti M 1995-01-27
Say hello to grep. Probably the shell command I use most and easily my
favorite. According to its documentation, grep “[35]print[s] lines that
match patterns”. Its manual spans 592 lines. (At 30 lines a page,
that’s nearly 20 pages.) grep takes a pattern as its first argument and
(in this case) considers its input the data coming through the pipe.
(In addition to its ubiquity and popularity among all shell tools, grep
is also extraordinarily fast³.) For simplicity, we only print the first
ten lines.
Let’s take a look at our grep pattern in more detail. The fun thing
about the shell is you typically come up with inventive (and
non-intuitive) ways to get your answers. In this case, rather than
finding a way to say the equivalent of WHERE gender='M' we’ve done
something a bit different. Let’s take a look at our data. All our data
is space separated and none of the names have numbers in them. What
we’ve done is tell grep to pick out all lines which have the character
‘M’ surrounded by spaces on either side followed by [12]. grep
interprets the brackets ([ and ]) as defining a character class. In
this case, [12] matches either 1 or 2 and nothing else. As it so
happens, hire_date follows gender, and since all dates start with
either 1 or 2 (twentieth or twenty-first century), the grep pattern
matches only on the gender column.
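You can convince yourself of the bracket behavior on a toy input (the
file names t6 through t8 are mine):
$ echo x M 1990 > t6
$ echo y M x990 > t7
$ echo z M 2001 > t8
$ cat t6 t7 t8 | grep ' M [12]'
x M 1990
z M 2001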
⒉ Find the employee_ids and last names of all female employees employed
after the year 2000
$ cat load_employees.data | grep ' F 2' | cut -d' ' -f1,4
60134 Rathonyi
72329 Luit
205048 Alblas
222965 Perko
226633 Benzmuller
422990 Verspoor
499553 Delgrande
This solution should be self-explanatory because it introduces nothing
new. I’ll leave understanding it as an exercise.
Joins
Databases would be almost useless if not for joins. For our exploration
of joins, we’ll be using the shell program named, funnily enough, join.
The man page says it can “[36]join lines of two files on a common
field” which is exactly what a database join does. We’ll look at each
JOIN variant below.
It’s time to introduce two more tables: departments and
department_employees.
departments is a small table with only two columns: department_id and
department_name. The second table, department_employees, is the mapping
table with the following columns: employee_id, department_id,
from_date, to_date.
Here’s what both look like. The departments table is small. It only has
9 rows which we’ve shown in full.
$ cat load_departments.data
d001 Marketing
d002 Finance
d003 Human_Resources
d004 Production
d005 Development
d006 Quality_Management
d007 Sales
d008 Research
d009 Customer_Service
department_employees contains at least as many rows as there are
employees (employees who have worked in multiple departments will be
listed multiple times). As it happens, the table has 331,603 rows which
is 31,579 more than the number of employees. Here’s a sample of the
table:
$ tail load_dept_emp.data
499991 d009 1997-02-11 9999-01-01
499992 d001 1995-05-31 9999-01-01
499992 d003 1987-05-10 1995-05-31
499993 d004 1997-04-07 9999-01-01
499994 d004 1993-02-22 1993-10-27
499995 d004 1997-06-02 9999-01-01
499996 d004 1996-05-13 9999-01-01
499997 d005 1987-08-30 9999-01-01
499998 d002 1993-12-27 9999-01-01
499999 d004 1997-11-30 9999-01-01
The last column indicates the date on which the employee left the
department. If the employee is still in that department, the year is
set to 9999-01-01. Notice that employee_id 499992 (Siamak Salverda)
appears twice. She was originally in department d003 (Human Resources)
until May 1995 when she shifted to department d001 (Marketing). It’s
also a good time to meet tail, a program to “[37]output the last part
of files”. Like head, tail prints the last ten lines of its input by
default.
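As an aside, head and tail compose nicely. To see only rows four and
five of the employees table, take the first five rows and then the
last two of those:
$ cat load_employees.data | head -n 5 | tail -n 2
10004 1954-05-01 Chirstian Koblick M 1986-12-01
10005 1955-01-21 Kyoichi Maliniak M 1989-09-12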
MapReduce exercise: list all employee(s) who have changed departments.
INNER JOIN
An inner join combines two tables on the basis of what both of them
have in common. This is exactly what join does.
join requires that its data be sorted on the join field (the column in
common between the tables). This is reasonable when you think about it.
I hope you do because I’m leaving the answer as an exercise. (Hint: how
would you write join? Is it easier when the data is sorted?) join
expects its files to be sorted lexicographically (alphabetically) as
opposed to numerically. To satisfy this we’ll first sort our tables on
their join fields.
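To see why the lexicographic requirement matters for numeric keys like
employee_id, here’s a small sketch (t4 and t5 are my file names; -n
asks sort to compare numerically):
$ echo 9 z > t4
$ echo 10 y > t5
$ cat t4 t5 | sort
10 y
9 z
$ cat t4 t5 | sort -n
9 z
10 y
join walks its inputs in lexicographic order, so feeding it the
numerically sorted version would break the merge.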
Query: find the names and birthdays of all employees still working in
Marketing.
Let’s start simple. First we’ll find the Marketing department’s
department_id. We get that as below:
$ cat load_departments.data | grep Marketing
d001 Marketing
Next, we’ll use this information to join employees and
department_employees, but first we’ll sort them lexicographically on
their join field.
$ cat load_dept_emp.data | sort > load_dept_emp.data.sorted
$ cat load_employees.data | sort > load_employees.data.sorted
Below is the result. The full result set has 14,842 rows of which we
only see the last ten.
$ cat load_dept_emp.data.sorted | grep d001 | grep 9999-01-01 | join - load_employees.data.sorted | cut -d' ' -f6-8 | tail
1964-05-15 Gully Hebert
1964-08-21 Arunas Tokunaga
1960-08-10 Yishai Cesareni
1960-11-02 Zengping Rindone
1962-08-06 Ranga DuCasse
1956-06-25 Spyrose Waymire
1960-10-13 Jaewon Paludetto
1962-01-06 Along Schaap
1960-01-12 Amalendu Apsitis
1952-06-18 Matt Motley
But that’s cheating, I hear you say. In a real database, the three
tables would be INNER JOIN’d in one SQL statement. While true,
remember, we’re only trying to achieve understanding, not replicate the
database. That said, we’ll do a three table join in one statement after
we’ve analyzed the above shell script.
The first part, cat load_dept_emp.data.sorted | grep d001 | grep
9999-01-01, brings in the data we want, namely, all employees in
Marketing (d001) who’re still in the department (date: 9999-01-01). It
looks like this:
$ cat load_dept_emp.data.sorted | grep d001 | grep 9999-01-01 | tail
99751 d001 1989-08-02 9999-01-01
99765 d001 1989-11-03 9999-01-01
99784 d001 1994-08-19 9999-01-01
99869 d001 1991-09-21 9999-01-01
99884 d001 1986-04-14 9999-01-01
99933 d001 1985-03-06 9999-01-01
99936 d001 1995-04-11 9999-01-01
99950 d001 2002-07-13 9999-01-01
99965 d001 1990-04-10 9999-01-01
99988 d001 1999-04-21 9999-01-01
The second part, join - load_employees.data.sorted, is the more
interesting part. join, as we’ve discussed, takes two arguments, both
files to join. In this case, the second file is
load_employees.data.sorted which contains all the employee data. The
first file, listed as -, is not a file in the sense of being saved as a
named file on disk; rather, it tells join to treat the input received
from the pipe as a file. The output of the first two parts is shown
below; the lines overflow so you’ll need to squint a bit.
$ cat load_dept_emp.data.sorted | grep d001 | grep 9999-01-01 | join - load_employees.data.sorted | tail
99751 d001 1989-08-02 9999-01-01 1964-05-15 Gully Hebert F 1989-08-02
99765 d001 1989-11-03 9999-01-01 1964-08-21 Arunas Tokunaga M 1989-04-21
99784 d001 1994-08-19 9999-01-01 1960-08-10 Yishai Cesareni F 1992-12-27
99869 d001 1991-09-21 9999-01-01 1960-11-02 Zengping Rindone F 1991-09-21
99884 d001 1986-04-14 9999-01-01 1962-08-06 Ranga DuCasse M 1986-04-14
99933 d001 1985-03-06 9999-01-01 1956-06-25 Spyrose Waymire F 1985-03-06
99936 d001 1995-04-11 9999-01-01 1960-10-13 Jaewon Paludetto M 1994-07-02
99950 d001 2002-07-13 9999-01-01 1962-01-06 Along Schaap M 1990-03-04
99965 d001 1990-04-10 9999-01-01 1960-01-12 Amalendu Apsitis M 1990-04-10
99988 d001 1999-04-21 9999-01-01 1952-06-18 Matt Motley M 1994-06-14
[Screenshot: the output of `cat load_dept_emp.data.sorted | grep d001 |
grep 9999-01-01 | join - load_employees.data.sorted | tail`, with the
data grep matched highlighted in red. See: grep --color=always.]
As you can see, the output of join is the join field, employee_id in
this case, followed by the remaining columns of the first table
(filtered rows from load_dept_emp.data) followed by all columns of the
second table (load_employees.data) excluding the join field.
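A toy join makes the column order easy to see (left and right are my
file names):
$ echo k a1 a2 > left
$ echo k b1 > right
$ join left right
k a1 a2 b1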
Now let’s join the three tables in one command. Recall that join
requires sorted inputs and note that this means we cannot join
load_departments.data with load_dept_emp.data directly because they are
sorted on different keys. Thus, before we join with a single line,
we’ll sort load_dept_emp.data on department_id, pipe it forward, join
it, then sort it again. (This is very inefficient but we’ll ignore
inefficiencies for now. A better way would be to sort the file, save it
in a temporary file, and use that. See also the following section
[38]§Query Planning.)
$ cat load_dept_emp.data | sort -k2 | join -22 load_departments.data - | grep Marketing | sort -k3 | join -13 - load_employees.data.sorted | grep 9999-01-01 | cut -d' ' -f7-9 | tail
1964-05-15 Gully Hebert
1964-08-21 Arunas Tokunaga
1960-08-10 Yishai Cesareni
1960-11-02 Zengping Rindone
1962-08-06 Ranga DuCasse
1956-06-25 Spyrose Waymire
1960-10-13 Jaewon Paludetto
1962-01-06 Along Schaap
1960-01-12 Amalendu Apsitis
1952-06-18 Matt Motley
Take a second to verify that both times we’ve received the same output.
We see a new variant of sort above: the argument -k2 tells sort to sort
its data on its second column. Likewise, join’s -22 tells it to join on
the second field of its second file, and -13 on the third field of its
first file. Everything else should be familiar.
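Here’s a minimal sketch of the same flags on toy tables (emp and dept
are my file names): sort -k2 orders on the second column and join -12
joins on the second field of its first file.
$ echo 'alice d002
bob d001' > emp
$ echo 'd001 Marketing
d002 Finance' > dept
$ cat emp | sort -k2 | join -12 - dept
d001 bob Marketing
d002 alice Finance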
Query Planning
Let’s take a closer look at the join between the tables departments and
dept_emp. When joining them, we have at least the following options:
⒈ Filter ‘Marketing’ in departments, join with dept_emp, filter current
employees (date=9999-01-01)
⒉ Filter current employees in dept_emp, join with departments, filter
‘Marketing’
⒊ Join, filter current employees, filter ‘Marketing’ and
⒋ Join, filter ‘Marketing’, filter current employees.
Does it matter which approach we use among ⑴ or ⑵ or ⑶ or ⑷? (Spoiler:
it must matter otherwise all commercial databases and DBAs will be out
of a job.)
Is there a measurable difference between them? Let’s find out. Below I
list the shell commands which represent each of the query types. (The
file load_dept_emp.data.sorted-f2 is load_dept_emp.data pre-sorted on
its second field, department_id, and saved, as suggested above.) In the
interests of brevity, I don’t include the results. Verifying that all
variants give the same results is left, you guessed it, as an exercise.
Prediction
Feel free to skip over to the results but let’s first make some
predictions of which will win. My estimates in execution time are: ⑴ ≪
⑵ < ⑶ ≅ ⑷. In words: ⑴ will finish much faster than ⑵, which will be
faster than ⑶. I expect ⑶ and ⑷ will take around the same time.
It stands to reason that ⑴ will be fastest because it first removes
about 90% of the data in the small table, which (assuming the same
number of employees in each department) removes about 90% of the data
in the larger table, before filtering on the date. Since it removes 90%
of all data early on, we expect it’ll be the best. I expect ⑵ will be
slower because filtering out current employees will not achieve as
great a reduction in data size as ⑴. I expect ⑶ and ⑷ to take
approximately the same time because in both cases we’ve done no early
filtering; we keep the same data size and only filter at the end. I
don’t think the filtering order matters much.
Let’s get to the results.
Results
## (1) filter Marketing, join, filter date
$ time cat load_departments.data | grep Marketing | join -12 load_dept_emp.data.sorted-f2 - | grep 9999-01-01 | tail
## (2) filter date, join, filter Marketing
$ time cat load_dept_emp.data.sorted-f2 | grep 9999-01-01 | join -12 - load_departments.data | grep Marketing | tail
## (3) join, filter date, filter Marketing
$ time join -12 load_dept_emp.data.sorted-f2 load_departments.data | grep 9999-01-01 | grep Marketing | tail
## (4) join, filter Marketing, filter date
$ time join -12 load_dept_emp.data.sorted-f2 load_departments.data | grep Marketing | grep 9999-01-01 | tail
|------------+----------------|
| Query type | Time taken (s) |
|------------+----------------|
|          1 |          0.101 |
|          2 |          0.650 |
|          3 |          0.626 |
|          4 |          0.283 |
|------------+----------------|
Query ⑴ is clearly the fastest, almost 3x faster than the next best
option. Queries ⑵ and ⑶ are the worst by far, more than 6x slower than
the best and more than 2x slower than query ⑷. It seems the largest
performance predictor is time to read through the entire file. For ⑴
the bigger table is scanned twice, once during the join and next when
filtering the date. It turns out, the join output’s results are merely
6% of the bigger table. (The output is 20,211 rows from an original
size of 331,603.)
For query ⑵, we scan the big table thrice: once during date filtering,
which only removes 28% of the data, again during the join, and once
more when filtering out ‘Marketing’. (The number of rows with date
9999-01-01 is 240,124 or 72%.)
Queries ⑶ and ⑷, like ⑵, scan the big table thrice but in the case of
⑷, the second filter doesn’t have much to do for the same reasons as ⑴.
In the process of our mini query analysis, we’ve also encountered our
fourth shell primitive, the humble #. The shell disregards everything
from a # until the end of the line. As a matter of convention,
explanatory comments start with two #’s followed by a space and their
lines are kept to fewer than 80 characters.
We also encounter the time program which, “[39]causes timing statistics
to be printed… once [a program] finishes”.
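For instance, prefixing a command with time reports elapsed wall-clock,
user, and system times once it finishes (the numbers below are
illustrative; yours will differ):
$ time sleep 1

real    0m1.003s
user    0m0.001s
sys     0m0.002s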
(Incidentally, if you too want to generate such fancy tables that are
auto-indented and auto-aligned, use [40]GNU Emacs⁴ and its wonderful
[41]org-mode Tables support.)
Retrospective
If you read the §Prediction section, you know I got it wrong. My
prediction was ⑴≪⑵<⑶≅⑷ but the actual order is ⑴≪⑷<⑶≅⑵. Turns out
filtering order does matter. (And it seems obvious in retrospect!) I’ll
eat my humble pie now.
Also, if possible, thank your local DBA. Query analysis is hard work.
join’s man page describes how to do the equivalent of LEFT OUTER JOINs
and RIGHT OUTER JOINs (-a).
Microservices
Wikipedia gives the [42]definition of a microservice as “a software
development technique that arranges an application as a collection of
loosely coupled services. In a microservices architecture, services are
fine-grained and the protocols are lightweight.”
If you’ve read this far, I need not elaborate. We can replace the
Wikipedia definition with what we know about the shell and get
practically the same definition. For example: as we’ve seen, shell
scripting is “a software development technique that [gets answers from]
a collection of loosely coupled [programs]. In a [shell script],
[programs] are fine-grained and the protocols are lightweight.” It’s
hard to find a more lightweight protocol than strings, which are the
typical medium of exchange in shell programs. Shell programs, like
microservices (citing again from the same Wikipedia article), embody
the Unix principle to “do one thing and do it really well”.
Let’s run through what Wikipedia says are some “of the defining
characteristics” and compare them with shell programs. Wikipedia’s
points are in quotes followed by what we’ve seen thus far about the
shell.
* “Services in a microservice architecture are independently
deployable.”
‣ Programs in the shell are independently deployed.
* “Services are organized around business capabilities.”
‣ Each program does “one thing” and does “it really well”.
* “Services can be implemented using different programming languages,
databases, hardware and software environment, depending on what
fits best.”
‣ This is a truism wrt shell programs.
* “Services are small in size, messaging-enabled, bounded by
contexts, autonomously developed, independently deployable,
decentralized and built and released with automated processes.”
‣ Most shell programs are small in terms of their behavior but not
necessarily small (grep does one thing — string-search — but is not
a small program)
‣ The good ones are specifically messaging-enabled, for example,
join, as we saw above, can either take a file as argument or read
from its input pipe
‣ “autonomously developed”: ✓
‣ “independently deployable”: ✓
‣ “decentralized”: ✓
‣ We don’t know if they’re built and released with automated
processes but that’s always a good practice to have.
It would be dishonest of me to not include Wikipedia’s [43]§Criticism
and Concerns. While I’m not going to discuss them each individually
here, I encourage you to go through them. Of the list, specifically wrt
shell, I find this point particularly valid: “Viewing the size of
services as the primary structuring mechanism can lead to too many
services when the alternative of internal modularization may lead to a
simpler design.”
The Unix philosophy precedes microservices by a few decades. [44]GNU
Coreutils’s documentation (the source of all programs used in this
post, except grep) describes how the original “Unix developers,…
professional programmers and trained computer scientists,… found that
with the right machinery for hooking programs together, that the whole
was greater than the sum of the parts. By combining several special
purpose programs, you could accomplish a specific task that none of the
programs was designed for, and accomplish it much more quickly and
easily than if you had to write a special purpose program.” [my
emphasis]
Conclusion
In WWII, scientists developing the atomic bomb had to give the military
an estimate of the bomb’s explosive yield, but since everything about
it was new, the techniques to correctly estimate the yield were yet to
be developed. Enrico Fermi, one of the scientists working on the
project, performed a quick, rough estimate when the bomb was first
tested. Here’s his description of the event:
About 40 seconds after the explosion, the air blast reached me. I
tried to estimate its strength by dropping from about six feet small
pieces of paper before, during, and after the passage of the blast
wave. Since, at the time, there was no wind I could observe very
distinctly and actually measure the displacement of the pieces of
paper that were in the process of falling while the blast was
passing. The shift was about 2½ meters, which, at the time, I
estimated to correspond to the blast that would be produced by ten
thousand tons of T.N.T. [45]⁵
Fermi predicted 10 kilotons and the actual result, it turned out, was
21 kilotons[46]⁵. He was off by a mere factor of 2. Pretty impressive
for dropping bits of paper. In many ways, the shell gives us the same
kind of tools. It allows us to make quick estimates that land near the
actual answers, though they aren’t perfect.
I hope this post has shown you that shell programming can be useful,
edifying, and fun. It has helped and continues to help me in more ways
than I can quantify.
Acknowledgements
Many thanks to [47]Manish Sharma and [48]Shuveb Hussain for careful and
helpful editorial comments and suggestions. Thanks also to Shuveb for
titular assistance. Any errors that remain are, of course, my own.
Footnotes
¹ I wrote about enumerating the countable infinities (Natural numbers,
Integers, and Rationals) in my post titled “Catch a spy to enumerate
all Rational numbers” available at
[49]https://medium.com/galileo-onwards/catch-a-spy-to-enumerate-all-rational-numbers-c3ceb35e87cb.
² The command which cleaned up the original SQL is part of the commit
message available at
[50]https://github.com/lvijay/test_db-fork/commit/d3e3615b29111ce20698cbf9860e639916290cf9.
³ More information on grep’s performance is available at the following
two links:
[51]https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
and [52]http://ridiculousfish.com/blog/posts/old-age-and-treachery.html.
I am indebted to [53]Shuveb Hussain for sharing these references.
⁴ GNU Emacs is amazing software. Sacha Chua’s blog is also a great
place to learn Emacs. It is available at
[54]https://sachachua.com/blog/category/emacs/.
⁵ Fermi’s comments are taken from the Atomic Heritage Foundation
website at
[55]https://www.atomicheritage.org/key-documents/trinity-test-eyewitnesses.
The US government’s Manhattan Project history lists more anecdotes from
the first Trinity test; it’s available at
[56]https://www.osti.gov/opennet/manhattan-project-history/Events/1945/trinity.htm.
Technical details about the blast are available at
[57]https://www.lanl.gov/science/weapons_journal/wj_pubs/11nwj2-05.pdf.
References
11. https://en.wikipedia.org/wiki/File:Turritella_communis_fossiel.jpg
12. https://creativecommons.org/licenses/by-sa/3.0/deed.en
16. https://medium.com/galileo-onwards/an-ode-to-the-shell-6717f57afc2e#adeb
17. https://medium.com/galileo-onwards/an-ode-to-the-shell-6717f57afc2e#5060
18. https://medium.com/galileo-onwards/an-ode-to-the-shell-6717f57afc2e#643d
19. https://en.wikipedia.org/wiki/Fear_of_missing_out
20. https://medium.com/galileo-onwards/catch-a-spy-to-enumerate-all-rational-numbers-c3ceb35e87cb
21. https://www.gnu.org/software/coreutils/manual/coreutils.html#tac-invocation
22. https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
23. https://www.gnu.org/software/coreutils/uniq
24. https://medium.com/galileo-onwards/an-ode-to-the-shell-6717f57afc2e#ab8c
25. https://medium.com/galileo-onwards/an-ode-to-the-shell-6717f57afc2e#98dc
26. https://medium.com/galileo-onwards/an-ode-to-the-shell-6717f57afc2e#4076
27. https://medium.com/galileo-onwards/an-ode-to-the-shell-6717f57afc2e#8389
28. https://dev.mysql.com/doc/employee/en/employees-introduction.html
29. https://creativecommons.org/licenses/by-sa/3.0/
30. https://www.gnu.org/software/coreutils/head
31. https://github.com/lvijay/test_db-fork/blob/master/load_employees.data
32. https://github.com/lvijay/test_db-fork
33. https://github.com/datacharmer/test_db/blob/master/load_employees.dump
34. https://github.com/lvijay/test_db-fork/commit/d3e3615b29111ce20698cbf9860e639916290cf9
35. https://www.gnu.org/software/grep/
36. https://www.gnu.org/software/coreutils/join
37. https://www.gnu.org/software/coreutils/tail
38. https://medium.com/galileo-onwards/an-ode-to-the-shell-6717f57afc2e#8389
39. https://www.gnu.org/software/bash/manual/html_node/Pipelines.html#Pipelines
40. https://www.gnu.org/software/emacs/
41. https://orgmode.org/manual/Tables.html
42. https://en.wikipedia.org/wiki/Microservices
43. https://en.wikipedia.org/wiki/Microservices#Criticism_and_Concerns
44. https://www.gnu.org/software/coreutils/manual/coreutils.html#Toolbox-introduction
45. https://www.atomicheritage.org/key-documents/trinity-test-eyewitnesses
46. https://www.atomicheritage.org/key-documents/trinity-test-eyewitnesses
47. https://medium.com/u/4b32e54ea238
48. https://unixism.net/
49. https://medium.com/galileo-onwards/catch-a-spy-to-enumerate-all-rational-numbers-c3ceb35e87cb
50. https://github.com/lvijay/test_db-fork/commit/d3e3615b29111ce20698cbf9860e639916290cf9
51. https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
52. http://ridiculousfish.com/blog/posts/old-age-and-treachery.html
53. https://unixism.net/
54. https://sachachua.com/blog/category/emacs/
55. https://www.atomicheritage.org/key-documents/trinity-test-eyewitnesses
56. https://www.osti.gov/opennet/manhattan-project-history/Events/1945/trinity.htm
57. https://www.lanl.gov/science/weapons_journal/wj_pubs/11nwj2-05.pdf