From: [email protected]
Date: 2018-06-29
Subject: Visualizing the History of Programming Languages

Recently, I came across the Wikipedia article, Timeline of Program-
ming Languages[1].  It has nicely-formatted tables for each  decade
since  the 1940s.  Each table has the same format: one language per
row and, for each language, the year it came into being, its  name,
creator, and a list of the languages that influenced it.  When data
establishes a relationship between elements, as this page does with
the  list of influences for each language, we have the makings of a
graph.

I've written before[2] about using PlantUML to create graphics from
textual information.  In this case I'm using a related tool, Plant-
Text[3], to create this graph.  PlantText is a web-based front-end
for PlantUML, which relies on GraphViz for layout.  GraphViz uses a
language, called DOT, to define the content and structure of infor-
mation to be visualized.  PlantText provides a number of examples
that illustrate how different kinds of visualizations can be creat-
ed.  I'm basing this graph on the World Dynamics template.  I also
got some additional ideas from "Drawing graphs with dot"[4] by
Gansner, Koutsofios, and North.

The  first  thing  I need to do is put the tables into a structured
format that I can parse with awk.  I'm using git-bash for this, and
git-bash lacks my go-to tools for something like this, lynx or w3m.
I do have Pandoc installed, and I can use that to transform HTML to
plain  text.   I  recently  created  a  function called myw3m in my
bash_profile.

    myw3m ()
    {
        curl -s "$1" | perl -pe 'use open qw(:std :utf8); s/[^[:ascii:]]//g;' | pandoc -fhtml -tplain -
    }

This function pulls down the page using curl, passes it to perl  to
strip  out non-ascii characters, and then to Pandoc to transform to
plain text.  I can pull the Wikipedia page to a  local  file,  con-
verting it to plain text along the way:

    myw3m https://en.wikipedia.org/wiki/Timeline_of_programming_languages > timeline.txt

This  works fine in git-bash, but on my Debian VPS, I have an older
version of Pandoc that wraps over-zealously.  I am able to get sim-
ilar  results  with w3m, using a large width value to prevent wrap-
ping.  This also requires a little manual futzing with the data  to
widen inter-column space.

    w3m -cols 400 -dump https://en.wikipedia.org/wiki/Timeline_of_programming_languages > timeline.txt

Now I have the whole page as plain text, including the table  rows,
which look like this:

      1970    Pascal              Niklaus Wirth, Kathleen Jensen                  ALGOL 60, ALGOL W

I want to pass this into awk, so I need to put this data into  tab-
delimited  fields.   First,  I need to strip out any lines that are
not data I need to process.  This is pretty easy because each  line
from  the  table starts with two spaces and a four-digit number.  I
can filter out everything else using sed:

    sed -ne '/^  [[:digit:]]\{4\}/p'

It looks like Pandoc has separated each column with three  or  more
spaces.  We are lucky that no field data contains three consecutive
spaces.  This means that we can use sed to replace any instance  of
three  or more spaces with a tab.  We should also strip off leading
spaces while we're at it.  We are trying to make this easy to  work
with in awk.

    sed -e 's/^  *//;s/    */\t/g'

Now a row in the table looks like this:

    1970^IPascal^INiklaus Wirth, Kathleen Jensen^IALGOL 60, ALGOL W
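
As a rough sketch, the filter and the substitution can be  combined
into one helper.  The name to_tsv is my own invention for illustra-
tion, and I build the tab with printf so the  substitution does not
depend on sed understanding \t in the replacement.

```shell
# Hypothetical helper (not from the original script): keep only table
# rows and convert them to tab-delimited fields.
to_tsv() {
    tab=$(printf '\t')
    # 1. Keep lines that start with two spaces and a four-digit year.
    # 2. Drop leading spaces, then turn each run of three or more
    #    spaces into a single tab.
    sed -n '/^  [[:digit:]]\{4\}/p' |
        sed -e 's/^  *//' -e "s/    */${tab}/g"
}

# Usage: the first line is not table data and is dropped; the Pascal
# row comes out as four fields separated by three tabs.
printf '%s\n' \
    'Timeline of programming languages' \
    '  1970    Pascal              Niklaus Wirth, Kathleen Jensen                  ALGOL 60, ALGOL W' |
    to_tsv
```

In practice this would be  run as  to_tsv < timeline.txt, with  the
result piped into awk.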

Pretty slick, no?  There are three sections of the graph I need  to
build out.  My first subgraph is the collection of years in the da-
ta with the relationship between each year made explicit.  This ef-
fectively creates an X-axis that will  organize  the  rest  of  the
graph.   By  building  a list of all the years, I can create a sub-
graph that looks like this:

    {
    "1988" -> "1989";
    "1989" -> "1990";
    "1990" -> "1991";
    }
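
One way this chain of years could be generated from the  tab-delim-
ited rows is sketched below.  This is an illustration only, not the
script[6] itself, and the function name year_chain is hypothetical.
It links only the years that appear in the data, so a  year with no
languages is simply skipped over.

```shell
# Hypothetical helper: read tab-delimited rows on stdin and print a
# DOT subgraph that links each distinct year to the next one.
year_chain() {
    cut -f1 | sort -un | awk '
        BEGIN  { print "{" }
        NR > 1 { printf "\"%s\" -> \"%s\";\n", prev, $1 }
               { prev = $1 }
        END    { print "}" }
    '
}

# Usage with a few year/name pairs taken from the tables (the other
# columns are left out here because only the year is read):
printf '1990\tHaskell\n1988\tTcl\n1989\tBash\n1988\tSPARK\n' | year_chain
# {
# "1988" -> "1989";
# "1989" -> "1990";
# }
```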

The next thing I need to do is tell GraphViz to associate each lan-
guage  with  its corresponding year in the subgraph.  This will en-
courage GraphViz to visually rank  each  language  along  with  its
year.  I do this with a series of subgraphs that look like this:

    {rank=same; "1988";  "rpg/400";  "tcl";  "stos basic";  "actor";  "object rexx";  "spark";  "a+";  "hamilton c shell"; }
    {rank=same; "1989";  "turbo pascal oop";  "modula-3";  "powerbasic";  "lpc";  "bash";  "magik";  "python"; }
    {rank=same; "1990";  "amos basic";  "object oberon";  "j";  "haskell";  "z shell"; }
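
A minimal sketch of  how such rank groups might be produced,  again
assuming tab-delimited input with the year in field 1 and the  lan-
guage name in field 2.  The function name rank_groups is made up
for this example, and the real script[6] may do this differently.

```shell
# Hypothetical helper: group lower-cased language names by year and
# print one {rank=same; ...} subgraph per year, in input order.
rank_groups() {
    awk -F '\t' '
        {
            year = $1
            name = tolower($2)
            # Remember the order in which years first appear.
            if (!(year in group)) order[++n] = year
            group[year] = group[year] "  \"" name "\";"
        }
        END {
            for (i = 1; i <= n; i++)
                printf "{rank=same; \"%s\";%s }\n", order[i], group[order[i]]
        }
    '
}

# Usage (only year and name are supplied; other columns are unused):
printf '1988\tTcl\n1988\tSPARK\n1989\tBash\n' | rank_groups
# {rank=same; "1988";  "tcl";  "spark"; }
# {rank=same; "1989";  "bash"; }
```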

You'll notice that everything is in lower case.  Capitalization  is
inconsistent within the Wikipedia article, so I convert every  name
to lower case to avoid duplicate nodes.

Finally,  I just need to define all the other nodes and edges.  The
edges are directional, so I define them going from the  influencing
language to the influenced language.  For a line like this:

    1987    Perl               Larry Wall                         C, sed, awk, sh

I create DOT statements like this:

    "c" -> "perl";
    "sed" -> "perl";
    "awk" -> "perl";
    "sh" -> "perl";

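The edge generation can be sketched in a few lines of awk.  This is
an illustration under my own assumptions, not the script[6] itself:
the function name influence_edges is hypothetical, the input is as-
sumed to be tab-delimited with the influences in field 4, and  rows
with an empty fourth field are skipped.  Real rows with notes or
footnotes in that column would need extra cleanup.

```shell
# Hypothetical helper: for each tab-delimited row, print one DOT edge
# from every influencing language (field 4) to the language (field 2).
influence_edges() {
    awk -F '\t' '
        $4 != "" {
            target = tolower($2)
            # Split the comma-separated influence list into src[1..n].
            n = split($4, src, /, */)
            for (i = 1; i <= n; i++)
                if (src[i] != "")
                    printf "\"%s\" -> \"%s\";\n", tolower(src[i]), target
        }
    '
}

# Usage with the Perl row shown above:
printf '1987\tPerl\tLarry Wall\tC, sed, awk, sh\n' | influence_edges
# "c" -> "perl";
# "sed" -> "perl";
# "awk" -> "perl";
# "sh" -> "perl";
```
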
Here is the finished product[5].  I had to cut it off after 2005
because GraphViz would crash on me.  You can also see the awk
script[6] used to create the data that I fed to PlantText.

I  had  a lot of fun with this challenge.  Projects like this are a
great way to learn.  I learned more about DOT and awk  while  doing
this.   It's  also fulfilling to see the potential in something and
then make it happen.

Happy hacking.

References:

1. https://en.wikipedia.org/wiki/Timeline_of_programming_languages
2. http://davebucklin.com/work/2017/09/11/diagrams-from-text-with-plantuml.html
3. https://www.planttext.com
4. http://www.graphviz.org/pdf/dotguide.pdf
5. http://davebucklin.com/assets/img/lang6.png
6. http://davebucklin.com/assets/toplg.txt