##
...................................................
...................................................
...................................................
...................................................
......... ........
......... ........
......... DATA ANALYSIS ........
......... ........
......... ON ........
......... ........
......... THE COMMAND LINE ........
......... ........
...................................................
............ Enzo "KatolaZ" Nicosia ...........
............      [email protected]       ...........
...................................................
##
___I___
/= | #\
/.__-| __ \
|/ _\_/_ \|
(( __ \__))
NO DATA IS BIG ENOUGH __ ((()))))()) __
,' |()))))(((()|# `.
/ |^))()))))(^| =\
IF YOUR BEARD IS GREY ENOUGH / /^v^(())()()v^\' .\
|__.'^v^v^))))))^v^v`.__|
/_ ' \______(()_____( |
_..-' _//_____[xxx]_____\.-|
/,_#\.=-' /v^v^v^v^v^v^v^v^| _|
\)|) v^v^v^v^v^v^v^v^v| _|
|| :v^v^v^v^v^v`.-' |# \,
|| v^v^v^v`_/\__,--.|\_=_/
>< :v^v____| \_____|_
, || v^ / \ /
//\_||_)\ `/_..-._\ )_...__\
|| \/ #| |_='_( | =_(_
|| _/\_ | / =\ / ' =\
\\/ \/ )/ gnv |=____#| '=....#|
`
##
_____ _ ____ ___ ____
|_ _| |__ ___ | __ )_ _/ ___|
| | | '_ \ / _ \ | _ \| | | _
| | | | | | __/ | |_) | | |_| |
|_| |_| |_|\___| |____/___\____|
_ ___ ______ _____
| | | \ \ / / _ \| ____|
| |_| |\ V /| |_) | _|
| _ | | | | __/| |___
|_| |_| |_| |_| |_____|
##
. '@(@@@@@@@)@. (@@) ` . '
. @@'((@@@@@@@@@@@)@@@@@)@@@@@@@)@
@@(@@@@@@@@@@))@@@@@@@@@@@@@@@@)@@` .
@.((@@@@ _ ___ __ _ ___ @)@@@ .
(@@@@@@@@|_) | /__ | \ /\ | /\ @@@@)
(@@@@@@@@)|_) _|_ \_| |_/ /--\ | /--\ @@) `
.@(@@@@)# @ . .'
@@`(@@)###&&&&&!!;;;;;;::-_=@@\\@)@`@. ............................
` @@(@###&&&&!!;;;;;::-=_=@.@\\@@ ' : OH NO! BIG DATA IS HERE! :
` @.#####&&&!!;;;::=-_= .@ \\ : WE WILL ALL DIE!!!! :
####&&&!!;;::=_- ` ............................
###&&!!;;:-_= .----------. /
##&&!;::_= / .-. .-. \ /
##&&!;:= / | | | | \ /
##&&!:- \ `-' `-' _/
#&!;:- /\ .--. / |
#&!;= \ | / / / /
#&!- / | `--' /\ \
#&= /`-------' \ \ Jym Dyer
jgs #&-
\\#/'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
##
____ _ _
/ ___| | ___ _ _ __| | _ ___
| | | |/ _ \| | | |/ _` | / \ |_ _|
| |___| | (_) | |_| | (_| | / _ \ | |
\____|_|\___/ \__,_|\__,_| / ___ \ | |
_ _ _ /_/ \_\___|
| | | | __ _ __| | ___ ___ _ __
| |_| |/ _` |/ _` |/ _ \ / _ \| '_ \
| _ | (_| | (_| | (_) | (_) | |_) |
_ _ ____ _ \__,_|\___/ \___/| .__/
| \ | | ___/ ___| __ _| | |_|
| \| |/ _ \___ \ / _` | |
| |\ | (_) |__) | (_| | | ____ ____ _ ____ _ __
|_| \_|\___/____/ \__, |_| / ___|| _ \ / \ | _ \| |/ /
|_| \___ \| |_) / _ \ | |_) | ' /
___) | __/ ___ \| _ <| . \
|____/|_| /_/ \_\_| \_\_|\_\
##
____ _ _ _
| _ \ ___ __ _| (_) |_ _ _
| |_) / _ \/ _` | | | __| | | |
| _ < __/ (_| | | | |_| |_| |
|_| \_\___|\__,_|_|_|\__|\__, |
|___/
_ _
___| |__ ___ ___| | __
/ __| '_ \ / _ \/ __| |/ /
| (__| | | | __/ (__| <
\___|_| |_|\___|\___|_|\_\
##
- Data produced each day:
.===================================.
| |
| 2.5 quintillion bytes (2.5x10^18) |
| |
.===================================.
which is about one Sextillion bytes per year (0.9x10^21)
##
BUT AT THE SAME TIME:
=====================
- Not all this data is accessible/usable/USEFUL
- Computational power has increased at a similar pace
- We waste 90% of CPU time on layers upon layers upon
  layers upon layers of "abstraction" (in the end just
  to watch Facebook feeds...)
##
Example 0: Shell-script vs Hadoop (235-to-1)
============================================
o 1.75 GB of data (2 million chess games)
o Task: compute some statistics
o Solution 1 ==> use hadoop map/reduce
several machines ==> 26 MINUTES
o Solution 2 ==> `find ... | xargs ... | awk ...`
a single CPU ==> 12 SECONDS
https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
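For flavour, here is a scaled-down sketch of what a Solution-2-style pipeline looks like. The file name, the games and the statistic are invented here; the real pipeline is in the article above.

```shell
# Hypothetical miniature of the chess-statistics task: count white wins,
# black wins and draws from the "Result" tags of a tiny hand-made PGN file.
cat > sample.pgn <<'EOF'
[Result "1-0"]
[Result "0-1"]
[Result "1/2-1/2"]
[Result "1-0"]
EOF
grep '^\[Result' sample.pgn | awk '
    /1-0/  { white++ }
    /0-1/  { black++ }
    /1\/2/ { draw++ }
    END    { print white, black, draw }'
```

The same shape scales to millions of games: grep selects, awk aggregates, and no cluster is involved.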
##
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@ @@@@@@@@@@
@@@@@@@@@@ MOST DATA ANALYSIS CAN @@@@@@@@@@
@@@@@@@@@@ @@@@@@@@@@
@@@@@@@@@@ (AND SHOULD) @@@@@@@@@@
@@@@@@@@@@ @@@@@@@@@@
@@@@@@@@@@ BE DONE ON THE @@@@@@@@@@
@@@@@@@@@@ @@@@@@@@@@
@@@@@@@@@@ COMMAND LINE @@@@@@@@@@
@@@@@@@@@@ @@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
##
__| __| | __| __| __ __| _ _| \ | __|
\__ \ _| | _| ( | | . | (_ |
____/ ___| ____| ___| \___| _| ___| _|\_| \___|
(decide what you need)
##
Example 1: How many packages available?
=======================================
Alpine Linux keeps an index of available packages in APKINDEX
+---------------------------------------------------------------------------+
| C:Q1EPp2wfY2W6JurBN2CuY+6fS1fGI= |
--->| P:tftp-hpa |
| V:5.2-r2 |
| A:x86_64 |
| S:30831 |
| I:114688 |
| T:Official tftp server |
| U:https://www.kernel.org/pub/software/network/tftp/                       |
| L:BSD |
| o:tftp-hpa |
| m:Natanael Copa <[email protected]>                                         |
| t:1557154288 |
| c:730cdcef6901750f4029d4c3b8639ce02ee3ead1 |
| D:so:libc.musl-x86_64.so.1 |
| p:cmd:in.tftpd cmd:tftp |
| |
+---------------------------------------------------------------------------+
##
Example 1: How many packages available?
=======================================
Data flow:
+--------------------+ +----------------+
+-------+ | Get all lines that | | Count them and |
| INPUT |-->| begin by "P:" +----->| show the |
+-------+ | | | result |
+--------------------+ +----------------+
Which translates to:
$ egrep "^P:" APKINDEX | wc -l
---> 10335
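If you want to try it without the real APKINDEX, here is a toy version (the contents are invented, modelled on the record format above):

```shell
# A made-up two-package APKINDEX excerpt, just to exercise the pipeline.
cat > APKINDEX.sample <<'EOF'
C:Q1EPp2wfY2W6JurBN2CuY+6fS1fGI=
P:tftp-hpa
V:5.2-r2

P:busybox
V:1.31.1-r9
EOF
grep -E "^P:" APKINDEX.sample | wc -l    # counts the two "P:" lines
```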
##
+-------------------------------------------------------------+
| |
| Rule #0: identify what you want to know |
| -------- |
| Rule #1: split your problem into smaller sub-problems       |
| -------- |
| Rule #2: use a specialised tool to solve each sub-problem |
| -------- |
+-------------------------------------------------------------+
##
__| _ \ | | \ | __ __| _ _| \ | __|
( ( | | | . | | | . | (_ |
\___| \___/ \__/ _|\_| _| ___| _|\_| \___|
##
Example 2: How many packagers in Alpine?
========================================
+---------------------------------------------------------------------------+
| C:Q1EPp2wfY2W6JurBN2CuY+6fS1fGI= |
| P:tftp-hpa |
| V:5.2-r2 |
| A:x86_64 |
| S:30831 |
| I:114688 |
| T:Official tftp server |
| U:https://www.kernel.org/pub/software/network/tftp/                       |
| L:BSD |
| o:tftp-hpa |
--->| m:Natanael Copa <[email protected]>                                         |
| t:1557154288 |
| c:730cdcef6901750f4029d4c3b8639ce02ee3ead1 |
| D:so:libc.musl-x86_64.so.1 |
| p:cmd:in.tftpd cmd:tftp |
| |
+---------------------------------------------------------------------------+
##
Example 2: How many packagers in Alpine?
========================================
Data Flow:
+-------+ +----------------------+ +----------------------+
| INPUT +----->| Select rows starting +------->| Count how many times |
+-------+ | with "m:" | | the same row is |
+----------------------+ | present |
+----------------------+
Which translates to:
egrep "^m:" APKINDEX | ???
##
Example 2: How many packagers in Alpine?
========================================
At a closer look:
+----------------------+
| Count how many times |
| the same row is |
| present |
+----------------------+
Becomes:
+---------------+ +--------------+ +---------+
| Sort input in | | Remove | | Count |
| alphabetical +---->| Repeated +----->| Lines |
| order | | Lines | | |
+---------------+ +--------------+ +---------+
##
Example 2: How many packagers in Alpine?
========================================
Amended data flow:
+-------+ +----------------------+ +---------------+ +--------------+ +---------+
| INPUT +->| Select rows starting +->| Sort input in | | Remove | | Count |
+-------+ | with "m:" | | alphabetical +->| Repeated +->| Lines |
+----------------------+ | order | | Lines | | |
+---------------+ +--------------+ +---------+
Which translates to:
$ egrep "^m:" APKINDEX | sort | uniq | wc -l
181
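The same pipeline on an invented three-line sample (two distinct packagers, names and addresses made up):

```shell
# Three made-up maintainer lines, two distinct packagers:
# the sort | uniq | wc -l chain should therefore report 2.
cat > APKINDEX.sample <<'EOF'
m:Alice Example <alice@example.org>
m:Bob Example <bob@example.org>
m:Alice Example <alice@example.org>
EOF
grep -E "^m:" APKINDEX.sample | sort | uniq | wc -l
```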
##
+---------------------------------------------------------------------+
| |
| Rule #4: almost any "simple" task can be split into simpler tasks |
| -------- |
| Rule #5: reusing is much better than writing from scratch |
| -------- |
+---------------------------------------------------------------------+
##
__| _ _| | __ __| __| _ \ _ _| \ | __|
_| | | | _| / | . | (_ |
_| ___| ____| _| ___| _|_\ ___| _|\_| \___|
(focus on important stuff)
##
Are you sure you are counting it right?
=======================================
$ egrep "^m:" APKINDEX | sort | uniq | tail -n +25 | head -15
+----------------------------------------------------+
| m:Breno Leitao <[email protected]>                   |
| m:Cameron Banta <[email protected]>                  |
| m:Camille Scholtz <[email protected]>                |
!!! | m:Carlo Landmeter <[email protected]>            |
!!! | m:Carlo Landmeter <[email protected]>            |
| m:Chloe Kudryavtsev <[email protected]>              |
| m:Christian Kampka <[email protected]>               |
| m:Christine Dodrill <[email protected]>              |
| m:Clayton Craft <[email protected]>                  |
%%% | m:Corey Oliver <[email protected]>               |
%%% | m:Corey Oliver <[email protected]>               |
| m:Cág <[email protected]>                            |
| m:Dan Theisen <[email protected]>                    |
| m:Daniel Isaksen <[email protected]>                 |
| m:Daniel Sabogal <[email protected]>                 |
+----------------------------------------------------+
##
Are you sure you are counting it right?
=======================================
We would actually like both of the following:
   m:Carlo Landmeter <[email protected]>
   m:Carlo Landmeter <[email protected]>
to be counted as the single user "Carlo Landmeter". The same for:
   m:Corey Oliver <[email protected]>
   m:Corey Oliver <[email protected]>
which should be counted as the single user "Corey Oliver".
##
Are you sure you are counting it right?
=======================================
Data flow:
+-------+ +--------+ +-------------+ +------+ +------------+ +-------+
| INPUT | | Select | | Strip email | | Sort | | Delete | | Count |
| | | rows | | | | rows | | duplicates | | |
+-------+ +--------+ +-------------+ +------+ +------------+ +-------+
##
Are you sure you are counting it right?
=======================================
Data flow:
+-------+ +--------+ +-------------+ +------+ +------------+ +-------+
| INPUT | | Select | | Strip email | | Sort | | Delete | | Count |
| | | rows | | | | rows | | duplicates | | |
+-------+ +--------+ +-------------+ +------+ +------------+ +-------+
: : : :
Which becomes: : : : :
: : : :
$ egrep "^m:" APKINDEX | cut -d"<" -f1 | sort | uniq | wc -l
152
and is different from:
$ egrep "^m:" APKINDEX | sort | uniq | wc -l
181
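The difference is easy to reproduce on invented data, where one packager appears under two addresses:

```shell
# Invented data: "Carol" appears with two different addresses.
# Stripping the email with cut merges her two variants into one.
cat > APKINDEX.sample <<'EOF'
m:Carol Example <carol@example.org>
m:Carol Example <carol@example.net>
m:Dave Example <dave@example.org>
EOF
grep -E "^m:" APKINDEX.sample | sort | uniq | wc -l                    # counts 3
grep -E "^m:" APKINDEX.sample | cut -d"<" -f1 | sort | uniq | wc -l    # counts 2
```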
##
+---------------------------------------------------------------------+
| |
| Rule #6: data is rarely as you would like it to be |
| -------- |
| Rule #7: make sure your data flow actually does what you want |
| -------- |
+---------------------------------------------------------------------+
##
\ | _ \ _ \ __| __| _ \ | | \ | __ __| _ _| \ | __|
|\/ | ( | / _| ( ( | | | . | | | . | (_ |
_| _| \___/ _|_\ ___| \___| \___/ \__/ _|\_| _| ___| _|\_| \___|
##
Example 3: Who is the most prolific packager?
=============================================
$ egrep "^m:" APKINDEX | cut -d"<" -f1 | sort | head -15
m:7heo
m:7heo
m:7heo
m:7heo
m:7heo
m:7heo
m:7heo
m:A. Wilcox
m:A. Wilcox
m:A. Wilcox
m:A. Wilcox
m:A. Wilcox
m:A. Wilcox
m:A. Wilcox
m:A. Wilcox
##
Example 3: Who is the most prolific packager?
=============================================
Counting repeated (sorted) lines is easy:
... | sort | uniq -c
New data flow:
+-------+ +--------+ +-------------+ +------+ +------------+
| INPUT | | Select | | Strip email | | Sort | | Count |
| | | rows | | | | rows | | duplicates |
+-------+ +--------+ +-------------+ +------+ +------------+
$ egrep "^m:" APKINDEX | cut -d"<" -f1 | sort | uniq -c
2 m:lemon
32 m:prspkt
5 m:stef
57 m:tcely
2 m:wener
.........
##
Example 3: Who is the most prolific packager?
=============================================
We can easily rank the packagers by total number of packages!
+-------+ +--------+ +-------------+ +------+ +------------+ +---------+ +---------+
| INPUT | | Select | | Strip email | | Sort | | Count | | Sort by | | Get top |
| | | rows | | | | rows | | duplicates | | numbers | | ten |
+-------+ +--------+ +-------------+ +------+ +------------+ +---------+ +---------+
$ egrep "^m:" APKINDEX | cut -d"<" -f1 | sort | uniq -c | sort -rn | head -10
4081 m:Natanael Copa
1104 m:Francesco Colista
484 m:Jakub Jirutka
478 m:Leonardo Arena
392 m:Timo Teräs
271 m:Carlo Landmeter
262 m:Stuart Cardall
253 m:Fabian Affolter
242 m:Valery Kartel
145 m:Rasmus Thomsen
##
________ ___ __ __ ___ _ _ ____ ___ _ _ _____
|__ / _ \ / _ \| \/ |_ _| \ | |/ ___| / _ \| | | |_ _|
/ / | | | | | | |\/| || || \| | | _ | | | | | | | | |
/ /| |_| | |_| | | | || || |\ | |_| | | |_| | |_| | | |
/____\___/ \___/|_| |_|___|_| \_|\____| \___/ \___/ |_|
##
Making new tools out of simple ones
===================================
This is the pipeline we used to get the top-10 of Alpine Linux contributors:
............................................................................
+-------+:+--------+ +-------------+:+------+ +------------+ +---------+ +---------+:
| INPUT |:| Select | | Strip email |:| Sort | | Count | | Sort by | | Get top |:
| |:| rows | | |:| rows | | duplicates | | numbers | | ten |:
+-------+:+--------+ +-------------+:+------+ +------------+ +---------+ +---------+:
: : :
:..........................:................................................
Get interesting data points: Compute the top-10 of rows
on separate rows : given as input
TWO LOGICAL BLOCKS
INPUT +-------------+ rows +------------+
---------->| Select rows +-------------->| Get top-10 |
+-------------+ +------------+
##
+-------------+
| Select rows +
+-------------+
$ cat select_rows
+----------------------------------------------+
| #!/bin/sh |
| |
| FIN=${1:-"/dev/stdin"} |
| egrep "^m:" ${FIN} | cut -d"<" -f1 |
| |
+----------------------------------------------+
##
+------------+
| Get top-10 |
+------------+
$ cat top10
+-------------------------------------------------------+
| #!/bin/sh |
| |
| FIN=${1:-"/dev/stdin"} |
| sort $FIN | uniq -c | sort -rn | head -10 |
| |
+-------------------------------------------------------+
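Chaining the two blocks works as expected. Here they are recreated verbatim from the slides (with egrep spelled grep -E for portability) and fed three invented lines:

```shell
# Recreate the two helper scripts and chain them on a tiny made-up input.
cat > select_rows <<'EOF'
#!/bin/sh
FIN=${1:-"/dev/stdin"}
grep -E "^m:" ${FIN} | cut -d"<" -f1
EOF
cat > top10 <<'EOF'
#!/bin/sh
FIN=${1:-"/dev/stdin"}
sort $FIN | uniq -c | sort -rn | head -10
EOF
chmod +x select_rows top10
printf 'm:Eve X <e@example.org>\nm:Eve X <e@example.org>\nm:Mal Y <m@example.org>\n' \
    | ./select_rows | ./top10
```

Each script defaults to /dev/stdin, so both can sit anywhere in a pipe or take a file argument.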
##
+---------------------------------------------------------------------+
| |
| Rule #8: identify reusable blocks, and make tools out of them |
| -------- |
| Rule #9: good tools only spit out data to other tools (no rubbish) |
| -------- |
+---------------------------------------------------------------------+
##
_ _
___| |_ __ _| |_ ___
/ __| __/ _` | __/ __|
\__ \ || (_| | |_\__ \
|___/\__\__,_|\__|___/
__
/ _| ___ _ __
| |_ / _ \| '__|
| _| (_) | |
|_| \___/|_|
_ _
__| |_ _ _ __ ___ _ __ ___ (_) ___ ___
/ _` | | | | '_ ` _ \| '_ ` _ \| |/ _ \/ __|
| (_| | |_| | | | | | | | | | | | | __/\__ \
\__,_|\__,_|_| |_| |_|_| |_| |_|_|\___||___/
##
The typical (unmotivated) complaint
===================================
THE UNIX SHELL IS NOT GOOD FOR COMPUTATIONS!!!
but...
THERE ARE MANY UNIX TOOLS WHICH ARE
VERY GOOD AT THAT!
##
Example 4: Learn from the champions
===================================
TWIC: The Week In Chess -- 2.4+ Million Chess Games in official tournaments
+----------------------------------------------------------------------------+
| [Event "Sparkassen Gp 1"] |
| [Site "Dortmund GER"] |
| [Date "2002.07.06"] |
| [Round "1"] |
| [White "Topalov, V"] |
| [Black "Lutz, C"] |
| [Result "1-0"] |
--->| [WhiteElo "2745"] |
--->| [BlackElo "2650"] |
| [ECO "B48"] |
| [EventDate "2002.07.06"] |
| |
| 1.e4 c5 2.Nf3 e6 3.d4 cxd4 4.Nxd4 Nc6 5.Nc3 Qc7 6.Be3 a6 7.Qd2 Nf6 8.O-O-O |
| Bb4 9.f3 Ne5 10.Nb3 b5 11.Kb1 Nc4 12.Bxc4 bxc4 13.Nc1 Qb7 14.N1e2 Rb8 15. |
| b3 O-O 16.Bf4 Ra8 17.Bd6 Bxd6 18.Qxd6 cxb3 19.axb3 a5 20.Rd4 Ra6 21.Qa3 d5 |
| 22.exd5 exd5 23.Nf4 Be6 24.Rhd1 h6 25.Ncxd5 Nxd5 26.Nxd5 Rb8 27.Nf6+ gxf6 |
| 28.Rd8+ Rxd8 29.Rxd8+ Kh7 30.Qf8 Kg6 31.Qg8+ Kh5 32.Qg7 f5 33.Rd4 Bc8 34. |
| g3 1-0 |
+----------------------------------------------------------------------------+
##
Example 4: Learn from the champions
===================================
We want to find the average of the Elo score across all the games.
Data Flow:
INPUT +--------+ +---------+
+------->| Select +------->| Compute |
| rows | | average |
+--------+ +---------+
Which translates to:
egrep "^\[.....Elo " twic.pgn | ???
##
Example 4: Learn from the champions
===================================
Let's have a look at what comes out of:
egrep "^\[.....Elo " twic.pgn | head -5
[WhiteElo "2745"]
[BlackElo "2650"]
[WhiteElo "2710"]
[BlackElo "2697"]
[WhiteElo "2745"]
So we need to:
+-------------------+ +------------+ +--------------+ +----------------+
| get what follows | | Remove ']' | | Sum all the | | Divide the sum |
| the first space +-->| and '"' +-->| rows +-->| by the num of |
| | | | | | | rows |
+-------------------+ +------------+ +--------------+ +----------------+
##
Example 4: Learn from the champions
===================================
+-------------------+ +------------+ +--------------+ +----------------+
| get what follows | | Remove ']' | | Sum all the | | Divide the sum |
| the first space +-->| and '"' +-->| rows +-->| by the num of |
| | | | | | | rows |
+-------------------+ +------------+ +--------------+ +----------------+
Translates to:
cut -d" " -f2 | sed -r 's/\]//g;s/\"//g' | awk '{t += $1}END{print t/NR}'
So in the end we get:
$ egrep "^\[.....Elo " twic.pgn | cut -d" " -f 2 | \
sed -r 's/\]//g;s/\"//g' | awk '{t += $1}END {print t/NR}'
2263.95
So the average player in there is at least a Candidate Master!
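You can check the arithmetic of the averaging stage on the five Elo lines shown earlier:

```shell
# The same cut | sed | awk chain, fed the five sample Elo tags by hand.
printf '[WhiteElo "2745"]\n[BlackElo "2650"]\n[WhiteElo "2710"]\n[BlackElo "2697"]\n[WhiteElo "2745"]\n' \
    | cut -d" " -f2 | sed 's/\]//g;s/"//g' | awk '{t += $1} END {print t/NR}'
# prints 2709.4 (the mean of the five scores)
```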
##
+----------------------------------------------------------------+
| |
| Rule #10: leave the computations to the right tool |
| --------- |
| Rule #11: AWK is normally the right tool |
| --------- |
+----------------------------------------------------------------+
##
| _)
_ \ _| -_) (_-< -_) \ _| | \ _` |
.__/ _| \___| ___/ \___| _| _| \__| _| _| _| \__, |
_| ____/
| | _ \ | | _|
\_, | \___/ \_,_| _|
___/
| |
_| -_) (_-< | | | _| (_-<
_| \___| ___/ \_,_| _| \__| ___/
##
Example 5: How many "good" players in the TWIC DB?
==================================================
How many games by players with a score higher than 2000 (1st class)?
And how many by players with a score higher than 2200 (Candidate Master)?
And how many by players with a score higher than 2300 (FIDE Master)?
...
Reusing the last data flow:
+------------+ +-------------+ +-------------+
| Remove ']' | | Select good | | Construct a |
......--->| and '"' +-->| Players +--->| histogram |
| | | | | |
+------------+ +-------------+ +-------------+
##
Example 5: How many "good" players in the TWIC DB?
==================================================
We use a trick: keep only the first two digits of the Elo score!
2456 --> 24
2786 --> 27
....
And then we count how many 20s, 21s, 22s, etc. we have!
##
Example 5: How many "good" players in the TWIC DB?
==================================================
+-------------+ +-------------+
| Select good | | Construct a |
-->| Players +--->| histogram |
| | | |
+-------------+ +-------------+
... | egrep "2..." | sed -E 's/(2.).*/\1/g' | sort | uniq -c
404184 20
548579 21
670332 22
714168 23
717261 24
534279 25
246087 26
75139 27
6114 28
So we have:
404184 Elo entries between 2000 and 2099,
548579 Elo entries between 2100 and 2199,
.....
Let's produce a histogram now!
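The bucketing step in isolation (2456 and 2786 are the scores from the previous slide; 2499 is invented):

```shell
# Truncate each score to its first two digits, then count the buckets.
printf '2456\n2786\n2499\n' | sed -E 's/(2.).*/\1/' | sort | uniq -c
# two scores fall in bucket 24, one in bucket 27
```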
##
Example 5: How many "good" players in the TWIC DB?
==================================================
egrep "^\[.....Elo " twic.pgn | cut -d" " -f 2 | \
sed -r 's/\]//g;s/\"//g' | egrep "2..." | sed -E 's/(2.).*/\1/g' | \
sort | uniq -c | gnuplot dumbplot
800000 +------------------------------------------------------------------+
| + + * + + |
700000 |-+ * * * +-|
| * * * |
600000 |-+ * * * * +-|
500000 |-+ * * * * * +-|
| * * * * * |
400000 |-+ * * * * * * +-|
| * * * * * * |
300000 |-+ * * * * * * +-|
200000 |-+ * * * * * * * +-|
| * * * * * * * |
100000 |-+ * * * * * * * +-|
| * * * * * * * * + |
0 +------------------------------------------------------------------+
20 22 24 26 28
##
Example 5: How many "good" players in the TWIC DB?
==================================================
The magic is done by gnuplot:
$ cat dumbplot
+----------------------------------------------------------------+
| set term dumb size 80,18 ; |
| set xrange [19:29] |
| plot "< cat -" u 2:1 w impulses title ""; |
| |
+----------------------------------------------------------------+
##
+----------------------------------------------------------------+
| |
| Rule #12: using a shell does not preclude good visualisations |
| --------- |
| Rule #13: learn plot/gnuplot |
| --------- |
+----------------------------------------------------------------+
##
| _)
_` | | | -_) \
\__,_| _| _| \___| _| _|
_| |
_| _ \ _| ` \ _` | _| (_-<
_| \___/ _| _|_|_| \__,_| \__| ___/
##
Example 6: a geographic data set
=================================
Geographic data sets (points, lines, polygons) are often available in CSV format
(Comma-Separated Values).
Points of the perimeter of Wards across the UK (45+ million points - 30+ GB):
+--------------------------------------------------------------------------------+
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441163.8963,236323.0026 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441173.4029,236325.102 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441185.6962,236353.9038 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441199.2017,236366.9 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441223.4998,236279.3951 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441257.9972,236236.2974 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441266.704,236429.3022 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441271.2965,236222.7013 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441299.7007,236234.7978 |
| 1538,"Adderbury, Bloxham & Bodicote Ward",E05011348,441330.8011,236493.3038 |
+--------------------------------------------------------------------------------+
##
Example 6: a geographic data set
=================================
Objective: construct a graph where each node is a ward and two wards are connected
by an edge if they share a point on their perimeter
Data flow:
+---------------+ +------------------+
INPUT | Select ID and | | Join two IDS if |
+-------->| coordinates +---->| their boundaries |
| | | intersect |
+---------------+ +------------------+
Which translates to:
csvtool cols 1,4,5 wards.csv | sed -r 's/,/ /' | sort -k2 | ?????
##
Example 6: a geographic data set
=================================
+------------------+
| Join two IDS if |
---->| their boundaries |
| intersect |
+------------------+
To perform this operation, we will use a tool as old as myself:
_ _ __ _ __
(_) ___ (_)_ __ / / / | \ \ (*)
| |/ _ \| | '_ \ | | | | | |
| | (_) | | | | | | | | | | |
_/ |\___/|_|_| |_| | | |_| | |
|__/ \_\ /_/
(*) join(1) appeared in Unix V7 (1979)
##
Example 6: a geographic data set
=================================
+---------------+ +------------------+
INPUT | Select ID and | | Join two IDS if |
+-------->| coordinates +---->| their boundaries |
| | | intersect |
+---------------+ +------------------+
Translates to:
csvtool cols 1,4,5 wards.csv | sed -r 's/,/ /' | sort -k2 > tmpfile
followed by:
join -1 2 -2 2 tmpfile tmpfile | cut -d " " -f 2,3 | sort | uniq
where the last " ... | sort | uniq " removes duplicated edges.
(*) Each edge still appears twice in the list though... use AWK to solve this ;P
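Here is join(1) in isolation, on a hand-made three-point stand-in for tmpfile (the data is invented; the real file comes from csvtool above):

```shell
# "ward-id coordinate" pairs, already sorted on the coordinate key.
# Wards 1 and 2 share the point 10,20; ward 3 is alone.
cat > tmpfile.sample <<'EOF'
1 10,20
2 10,20
3 30,40
EOF
join -1 2 -2 2 tmpfile.sample tmpfile.sample | cut -d " " -f 2,3 | sort | uniq
```

The output contains self-pairs like "1 1" and both directions of each edge, which is exactly the duplication noted in (*) above.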
##
+----------------------------------------------------------------+
| |
| Rule #14: what you need for data analysis is there already     |
| --------- |
| Rule #15: no, really, you don't need Python or JavaScript      |
| --------- |
+----------------------------------------------------------------+
##
Other tools useful for data analysis
====================================
- xargs, parallel (run commands in parallel on many input chunks)
- jq (parse JSON structures)
- bc/dc (arbitrary precision calculators)
- tbl/eqn/pic (troff packages to create tables, equations, graphs)
- split/csplit (split files)
- comm (get common lines between files)
- paste (paste two files side-by-side)
- seq (create sequences of numbers -- useful in for loops...)
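Two of these in action (the file names are throwaway examples):

```shell
# seq generates number ranges; comm intersects two sorted files.
seq 1 5 > a.txt
seq 3 7 > b.txt
comm -12 a.txt b.txt    # lines present in both files: 3, 4, 5
```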
##
_____ _ _ _
|_ _| |__ __ _ _ __ | | _____ | |
| | | '_ \ / _` | '_ \| |/ / __| | |
| | | | | | (_| | | | | <\__ \ |_|
|_| |_| |_|\__,_|_| |_|_|\_\___/ (_)
##
Credits
=======
- This presentation was created with vim(1)
https://www.vim.org
- Box-and-arrow plots were made with gramscii(1)
git clone git://kalos.mine.nu/gramscii
- Headings were created with figlet(6) and toilet(1)
http://figlet.org
http://caca.zoy.org/wiki/toilet
- This presentation was delivered with catpoint
git clone git://r-36.net/catpoint
DISCLAIMER: most of the ASCII art is (c) by its original
            authors. If no author is indicated, it is
            either unknown or myself. In the latter
            case, the art is free for all.
##
Contacts
========
- email: katolaz [at] freaknet [dot] org [dot] remove [dot] dots [dot]
- gopher://kalos.mine.nu
- gopher://republic.circumlunar.space/1/~katolaz
- gopher://medialab.freaknet.org/1/katolaz/
- gopher://cgit.mine.nu
- https://cgit.mine.nu
##
Other References
================
csvtool - manage CSV and TSV files
http://forge.ocamlcore.org/projects/csv/
jq - manage and convert JSON files
https://github.com/stedolan/jq
TWIC The Week In Chess -- The oldest e-zine about chess
https://theweekinchess.com/
twic_script: keep an updated Chess games databases from TWIC
git clone git://cgit.mine.nu/twic_script