From: dbucklin@sdf.org

Date: 2018-06-29
Subject: Visualizing the History of Programming Languages

Recently, I came across the Wikipedia article, Timeline of Program-
ming Languages[1]. It has nicely-formatted tables for each decade
since the 1940s. Each table has the same format: one language per
row and, for each language, the year it came into being, its name,
creator, and a list of the languages that influenced it. When data
establishes a relationship between elements, as this page does with
the list of influences for each language, we have the makings of a
graph.

I've written before[2] about using PlantUML to create graphics from
textual information. In this case I'm using a similar tool,
PlantText[3], to create this graph. PlantText is a web-based editor
for PlantUML, and both can pass graph descriptions to GraphViz.
GraphViz uses a language called DOT to define the content and
structure of the information to be visualized. PlantText provides a
number of examples that illustrate how different kinds of
visualizations can be created. I'm basing this graph on the World
Dynamics template. I also got some additional ideas from "Drawing
graphs with dot"[4] by Gansner, Koutsofios, and North.

The first thing I need to do is put the tables into a structured
format that I can parse with awk. I'm using git-bash for this, and
git-bash lacks my go-to tools for something like this, lynx or w3m.
I do have Pandoc installed, and I can use that to transform HTML to
plain text. I recently created a function called myw3m in my
bash_profile.

myw3m ()
{
curl -s "$1" | perl -pe 'use open qw(:std :utf8); s/[^[:ascii:]]//g;' | pandoc -fhtml -tplain -
}

This function pulls down the page using curl, passes it to perl to
strip out non-ascii characters, and then to Pandoc to transform to
plain text. I can pull the Wikipedia page to a local file, con-
verting it to plain text along the way:

myw3m https://en.wikipedia.org/wiki/Timeline_of_programming_languages > timeline.txt

This works fine in git-bash, but on my Debian VPS, I have an older
version of Pandoc that wraps over-zealously. I am able to get sim-
ilar results with w3m, using a large width value to prevent wrap-
ping. This also requires a little manual futzing with the data to
widen inter-column space.

w3m -cols 400 -dump https://en.wikipedia.org/wiki/Timeline_of_programming_languages > timeline.txt

Now I have the whole page as plain text including the table rows
which look like this:

  1970   Pascal   Niklaus Wirth, Kathleen Jensen   ALGOL 60, ALGOL W

I want to pass this into awk, so I need to put this data into tab-
delimited fields. First, I need to strip out any lines that are
not data I need to process. This is pretty easy because each line
from the table starts with two spaces and a four-digit number. I
can filter out everything else using sed:

sed -ne '/^  [[:digit:]]\{4\}/p'

It looks like Pandoc has separated each column with three or more
spaces. We are lucky that no field data contains three consecutive
spaces. This means that we can use sed to replace any instance of
three or more spaces with a tab. We should also strip off leading
spaces while we're at it. We are trying to make this easy to work
with in awk.

sed -e 's/^ *//;s/ \{3,\}/\t/g'

Now a row in the table looks like this:

1970^IPascal^INiklaus Wirth, Kathleen Jensen^IALGOL 60, ALGOL W
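
The two sed steps above can be sketched as one shell function. This
is only a sketch under assumptions: the function name to_tsv and the
sample input are made up for illustration, and the tab is built with
printf instead of a \t escape, since not every sed understands \t in
a replacement.

```shell
# Hypothetical helper: reduce Pandoc's plain-text output to
# tab-delimited table rows.
tab=$(printf '\t')

to_tsv() {
    # Keep only lines that start with two spaces and a four-digit
    # year, strip the leading spaces, then turn every run of three
    # or more spaces into a single tab.
    sed -n '/^  [[:digit:]]\{4\}/p' |
        sed -e 's/^ *//' -e "s/ \{3,\}/${tab}/g"
}

# Sample input: one non-table line and one table row.
printf '%s\n' \
    'Some heading text' \
    '  1970   Pascal   Niklaus Wirth, Kathleen Jensen   ALGOL 60, ALGOL W' |
    to_tsv
```

With the real data, the same function would read timeline.txt on
standard input.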

Pretty slick, no?

There are three sections of the graph that I need to build out. The
first is a subgraph containing every year in the data, with the
relationship between consecutive years made explicit. This
effectively creates an X-axis that will organize the rest of the
graph. By building a list of all the years, I can create a subgraph
that looks like this:

{
"1988" -> "1989";
"1989" -> "1990";
"1990" -> "1991";
}
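
A year chain like the one above could be generated from the
tab-delimited rows roughly as follows. This is a sketch, not the
script[6] itself: the function name year_chain is hypothetical, and
it assumes field 1 of every row is a plain four-digit year.

```shell
# Hypothetical helper: print a DOT subgraph that links each distinct
# year to the next one.
year_chain() {
    # Field 1 is the year; sort -u yields each year once, in order.
    cut -f1 | sort -u | awk '
        BEGIN  { print "{" }
        NR > 1 { printf "\"%s\" -> \"%s\";\n", prev, $0 }
               { prev = $0 }
        END    { print "}" }
    '
}

# Sample rows (year, name); 1989 appears twice on purpose.
printf '1988\ttcl\n1989\tbash\n1989\tpython\n1990\tj\n' | year_chain
```

Sorting before chaining means the rows do not have to arrive in
chronological order.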

The next thing I need to do is tell GraphViz to associate each lan-
guage with its corresponding year in the subgraph. This will en-
courage GraphViz to visually rank each language along with its
year. I do this with a series of subgraphs that look like this:

{rank=same; "1988"; "rpg/400"; "tcl"; "stos basic"; "actor"; "object rexx"; "spark"; "a+"; "hamilton c shell"; }
{rank=same; "1989"; "turbo pascal oop"; "modula-3"; "powerbasic"; "lpc"; "bash"; "magik"; "python"; }
{rank=same; "1990"; "amos basic"; "object oberon"; "j"; "haskell"; "z shell"; }

You'll notice that everything is in lower case. I found the
capitalization to be inconsistent within the Wikipedia article, so
I convert every name to lower case to avoid duplicate nodes.
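
The rank=same groups, including the lower-casing, might be built
with awk along these lines. Again, this is a sketch under
assumptions rather than the script[6] itself: rank_groups is a
hypothetical name, and the input is assumed to be tab-delimited with
the year in field 1 and the language name in field 2.

```shell
# Hypothetical helper: emit one {rank=same; ...} group per year.
rank_groups() {
    awk -F'\t' '
        {
            year = $1
            name = tolower($2)
            # Remember the order in which years first appear.
            if (!(year in members)) order[++n] = year
            # Add each name to its year only once.
            if (!seen[year, name]++)
                members[year] = members[year] " \"" name "\";"
        }
        END {
            for (i = 1; i <= n; i++)
                printf "{rank=same; \"%s\";%s }\n", order[i], members[order[i]]
        }
    '
}

# Sample rows with mixed capitalization.
printf '1988\tTcl\n1988\tSPARK\n1989\tBash\n1989\tPython\n' | rank_groups
```

Lower-casing in one place keeps the node names here identical to
the ones used in the edge statements, which is what prevents
duplicate nodes.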

Finally, I just need to define all the other nodes and edges. The
edges are directional, so I define them going from the influencing
language to the influenced language. For a line like this:

1987 Perl Larry Wall C, sed, awk, sh

I create DOT statements like this:

"c" -> "perl";
"sed" -> "perl";
"awk" -> "perl";
"sh" -> "perl";

Here is the finished product[5]. I had to cut it off after 2005
because GraphViz would crash on me. You can also see the awk
script[6] used to create the data that I fed to PlantText.
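
As an illustration of the edge-generation step described for the
Perl row, here is a minimal awk sketch. It is not the script[6]
itself: influence_edges is a hypothetical name, and it assumes
tab-delimited input with the language name in field 2 and a
comma-separated list of influences in field 4. Rows whose fourth
field is not a plain comma-separated list would need extra handling.

```shell
# Hypothetical helper: emit one DOT edge per influence, pointing
# from the influencing language to the influenced one.
influence_edges() {
    awk -F'\t' '
        $4 != "" {
            target = tolower($2)
            # Split the influence list on commas and optional spaces.
            n = split($4, src, /, */)
            for (i = 1; i <= n; i++)
                if (src[i] != "")
                    printf "\"%s\" -> \"%s\";\n", tolower(src[i]), target
        }
    '
}

# The Perl row from the table, as tab-delimited fields.
printf '1987\tPerl\tLarry Wall\tC, sed, awk, sh\n' | influence_edges
```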

I had a lot of fun with this challenge. Projects like this are a
great way to learn. I learned more about DOT and awk while doing
this. It's also fulfilling to see the potential in something and
then make it happen.

Happy hacking.

References:

1. https://en.wikipedia.org/wiki/Timeline_of_programming_languages
2. http://davebucklin.com/work/2017/09/11/diagrams-from-text-with-plantuml.html
3. https://www.planttext.com
4. http://www.graphviz.org/pdf/dotguide.pdf
5. http://davebucklin.com/assets/img/lang6.png
6. http://davebucklin.com/assets/toplg.txt