                                Archiving URLs


  Archiving the Web, because nothing lasts forever: statistics, online
  archive services, extracting URLs automatically from browsers, and
  creating a daemon to regularly back up URLs to multiple sources.
  topics: [12]Haskell, [13]archiving, [14]Bitcoin, [15]meta, [16]shell,
  [17]R, [18]tutorial
  [19]source; created: 10 Mar 2011; modified: 5 Jan 2019; status:
  finished; [20]confidence: certain; [21]importance: 8
    __________________________________________________________________

    * [22]Link rot
    * [23]Detection
    * [24]Prevention
         + [25]Remote caching
         + [26]Local caching
              o [27]Caching Proxy
              o [28]Batch job downloads
              o [29]Daemon
              o [30]Cryptographic timestamping local archives
              o [31]Resource consumption
              o [32]URL sources
                   # [33]Browser history
                   # [34]Document links
                   # [35]Website spidering
    * [36]Reacting to broken links
    * [37]External Links
    * [38]Appendices
         + [39]Cryptographic timestamping
         + [40]filter-urls
         + [41]sort --key compression trick
              o [42]Locality
              o [43]Web archives
              o [44]Separate mirrors
              o [45]Alternatives
              o [46]External Links

    Links on the Internet last forever or a year, whichever comes first.
    This is a major problem for anyone serious about writing with good
    references, as link rot cripples several percent of all links
    each year, and the losses compound.

    To deal with link rot, I present my multi-pronged archival strategy
    using a combination of scripts, daemons, and Internet archival
    services: URLs are regularly dumped from both my web browser’s daily
    browsing and my website pages into an archival daemon I wrote, which
    pre-emptively downloads copies locally and attempts to archive them
    in the Internet Archive. This ensures a copy will be available
    indefinitely from one of several sources. Link rot is then detected
    by regular runs of linkchecker, and any newly dead links can be
    immediately checked for alternative locations, or restored from one
    of the archive sources.

    As an additional flourish, my local archives are [47]efficiently
    cryptographically timestamped using Bitcoin in case forgery is a
    concern, and I demonstrate a simple compression trick for
    [48]substantially reducing sizes of large web archives such as
    crawls (particularly useful for repeated crawls such as my [49]DNM
    archives).

  Given my interest in [50]long term content and extensive linking,
  [51]link rot is an issue of deep concern to me. I need backups not just
  for my files[52]^1, but for the web pages I read and use—they’re all
  part of my [53]exomind. It’s not much good to have an extensive essay
  on some topic where half the links are dead and the reader can neither
  verify my claims nor get context for my claims.

[54]Link rot

    “Decay is inherent in all compound things. Work out your own
    salvation with diligence.”

    Last words of the Buddha

  The dimension of digital decay is dismal and distressing.
  [55]Wikipedia:

    In a 2003 experiment, [56]Fetterly et al. discovered that about one
    link out of every 200 disappeared each week from the Internet.
    [57]McCown et al 2005 discovered that half of the URLs cited in
    [58]D-Lib Magazine articles were no longer accessible 10 years after
    publication [the irony!], and other studies have shown link rot in
    academic literature to be even worse ([59]Spinellis, 2003,
    [60]Lawrence et al., 2001). [61]Nelson and Allen (2002) examined
    link rot in digital libraries and found that about 3% of the objects
    were no longer accessible after one year.

  [62]Bruce Schneier remarks that one friend experienced 50% linkrot in
  one of his pages over less than 9 years (not that the situation was any
  better [63]in 1998), and that his own blog posts link to news articles
  that go dead in days[64]^2; [65]Vitorio checks bookmarks from 1997,
  finding that hand-checking indicates a total link rot of 91% with only
  half of the dead available in sources like the Internet Archive;
  [66]Ernie Smith found 1 semi-working link in a 1994 book about the
  Internet; the [67]Internet Archive itself has estimated the average
  lifespan of a Web page at [68]100 days. A [69]Science study looked at
  articles in prestigious journals; they didn’t use many Internet links,
  but when they did, 2 years later ~13% were dead[70]^3. The French
  company Linterweb studied external links on the [71]French Wikipedia
  before setting up [72]their cache of French external links, and
  found—back in 2008—already [73]5% were dead. (The English Wikipedia has
  seen a 2010-January 2011 spike from a few thousand dead links to
  [74]~110,000 out of [75]~17.5m live links.) [76]A followup check of the
  viral [77]The Million Dollar Homepage, where it cost up to $38k to
  insert a link by the last one on January 2006, found that a decade
  later in 2017, at least half the links were dead or squatted.[78]^4
  Bookmarking website [79]Pinboard, which provides some archiving
  services, [80]noted in August 2014 that 17% of 3-year-old links & 25%
  of 5-year-old links were dead. The dismal [81]studies [82]just [83]go
  [84]on [85]and [86]on [87]and [88]on ([89]and [90]on). Even in a highly
  stable, funded, curated environment, link rot happens anyway. For
  example, about [91]11% of Arab Spring-related tweets were gone within a
  year (even though Twitter is—currently—still around). And sometimes
  they just get quietly lost, like when MySpace admitted it lost [92]all
  music worldwide uploaded 2003–2015, euphemistically describing the mass
  deletion as [93]“We completely rebuilt Myspace and decided to move over
  some of your content from the old MySpace” (only some of the 2008–2010
  MySpace music could be rescued & put on the IA by [94]“an anonymous
  academic study”).

  My specific target date is 2070, 60 years from now. As of 10 March
  2011, gwern.net has around 6800 external links (with around 2200 to
  non-Wikipedia websites)[95]^5. Even at the lowest estimate of 3% annual
  linkrot, few will survive to 2070. If each link has a 97% chance of
  surviving each year, then the chance a link will be alive in 2070 is
  0.97^(2070−2011) ≈ 0.16 (or to put it another way, an 84% chance any given
  link will die). The 95% [96]confidence interval for such a [97]binomial
  distribution says that of the 2200 non-Wikipedia links, ~336–394 will
  survive to 2070[98]^6. If we try to predict using a more reasonable
  estimate of 50% linkrot, then an average of 0 links will survive
  (0.50^(2070−2011) × 2200 = 1.735×10^−18 × 2200 ≈ 0). It would be a good idea to
  simply assume that no link will survive.
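
  (That 16% figure is easy to sanity-check from the command line; bc's -l
  option gives enough precision:)
## chance that a single link survives 59 years of 3% annual linkrot
echo "0.97^(2070-2011)" | bc -l
# ≈ .1658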

  With that in mind, one can consider remedies. (If we lie to ourselves
  and say it won’t be a problem in the future, then we guarantee that it
  will be a problem. [99]“People can stand what is true, for they are
  already enduring it.”)

  If you want to pre-emptively archive a specific page, that is easy: go
  to the IA and archive it, or in your web browser, print a PDF of it, or
  for more complex pages, use a browser plugin (eg for Firefox,
  [100]Mozilla Archive Format or [101]ScrapBook). But I find I visit and
  link so many web pages that I rarely think in advance of link rot to
  save a copy; what I need is a more systematic approach to detecting
  dead links and creating archives of all web pages I might need.

[102]Detection

    “With every new spring
    the blossoms speak not a word
    yet expound the Law—
    knowing what is at its heart
    by the scattering storm winds.”

    Shōtetsu[103]^7

  The first remedy is to learn about broken links as soon as they happen,
  which allows one to react quickly and scrape archives or search engine
  caches ([104]‘lazy preservation’). I currently use [105]linkchecker to
  spider gwern.net looking for broken links. linkchecker is run in a
  [106]cron job like so:
@monthly linkchecker --check-extern --timeout=35 --no-warnings --file-output=html \
                     --ignore-url=^mailto --ignore-url=^irc --ignore-url=http://.*\.onion \
                     --ignore-url=paypal.com --ignore-url=web.archive.org \
                    https://www.gwern.net

  Just this command would turn up many false positives. For example,
  there would be several hundred warnings on Wikipedia links because I
  link to redirects; and linkchecker respects [107]robots.txts which
  forbid it to check liveness, but emits a warning about this. These can
  be suppressed by editing ~/.linkchecker/linkcheckerrc to say
  ignorewarnings=http-moved-permanent,http-robots-denied (the available
  warning classes are listed in linkchecker -h).
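
  (In other words, the relevant line of ~/.linkchecker/linkcheckerrc is just
  the following; depending on the linkchecker version it may need to go in
  the appropriate section of that config file:)
ignorewarnings=http-moved-permanent,http-robots-denied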

  The quicker you know about a dead link, the sooner you can look for
  replacements or its new home.

[108]Prevention

[109]Remote caching

    “Anything you post on the internet will be there as long as it’s
    embarrassing and gone as soon as it would be useful.”

    [110]taejo

  We can ask a third party to keep a cache for us. There are several
  [111]archive site possibilities:
   1. the [112]Internet Archive
   2. [113]WebCite
   3. [114]Perma.cc ([115]highly limited)
   4. Linterweb’s WikiWix[116]^8.
   5. Peeep.us (defunct as of 2018)
   6. [117]Archive.is
   7. [118]Pinboard (with the $25/yr archiving option[119]^9)
   8. [120]Hiyo.jp & [121]Megalodon.jp (may be difficult to use)

  There are other options, such as Google's cache[122]^10 or various
  commercial/government archives[123]^11, but these are not generally
  available or usable.

  (An example would be
  [124]bits.blogs.nytimes.com/2010/12/07/palm-is-far-from-game-over-says-former-chief/
  being archived at [125]webcitation.org/5ur7ifr12.)

  These archives are also good for archiving your own website:
   1. you may be keeping backups of it, but your own website/server
      backups can be lost (I can speak from personal experience here), so
      it’s good to have external copies
   2. Another benefit is the reduction in ‘bus-factor’: if you were hit
      by a bus tomorrow, who would get your archives and be able to
      maintain the websites and understand the backups etc? While if
      archived in IA, people already know how to get copies and there are
      tools to download entire domains.
   3. A focus on backing up only one’s website can blind one to the need
      for archiving the external links as well. Many pages are
      meaningless or less valuable with broken links. A linkchecker
      script/daemon can also archive all the external links.

  So there are several benefits to doing web archiving beyond simple
  server backups.

  My first program in this vein of thought was a bot which fired off
  WebCite, Internet Archive/Alexa, & Archive.is requests: [126]Wikipedia
  Archiving Bot, quickly followed up by a [127]RSS version. (Or you could
  install the [128]Alexa Toolbar to get automatic submission to the
  Internet Archive, if you have ceased to care about privacy.)

  The core code was quickly adapted into a [129]gitit wiki plugin which
  hooked into the save-page functionality and tried to archive every link
  in the newly-modified page: [130]Interwiki.hs.

  Finally, I wrote [131]archiver, a daemon which watches[132]^12/reads a
  text file. Source is available via git clone
  https://github.com/gwern/archiver-bot.git. (A similar tool is
  [133]Archiveror.)

  The library half of archiver is a simple wrapper around the appropriate
  HTTP requests; the executable half reads a specified text file and
  loops as it (slowly) fires off requests and deletes the appropriate
  URL.

  That is, archiver is a daemon which will process a specified text file,
  each line of which is a URL, and will one by one request that the URLs
  be archived or spidered.

  Usage of archiver might look like archiver ~/.urls.txt [email protected].
  In the past, archiver would sometimes crash for unknown reasons, so I
  usually wrap it in a while loop like so: while true; do archiver
  ~/.urls.txt [email protected]; done. If I wanted to put it in a detached
  [134]GNU screen session: screen -d -m -S "archiver" sh -c 'while true;
  do archiver ~/.urls.txt [email protected]; done'. Finally, rather than
  start it manually, I use a cron job to start it at boot, for a final
  invocation of
@reboot sleep 4m && screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt [email protected] \
       "cd ~/www && nice -n 20 ionice -c3 wget --unlink --limit-rate=20k --page-requisites --timestamping \
       -e robots=off --reject .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm \
       --user-agent='Firefox/4.9'" 500; done'

[135]Local caching

  Remote archiving, while convenient, has a major flaw: the archive
  services cannot keep up with the growth of the Internet and are
  woefully incomplete. I experience this regularly, where a link on
  gwern.net goes dead and I cannot find it in the Internet Archive or
  WebCite, and it is a general phenomenon: [136]Ainsworth et al 2012 find
  <35% of common Web pages ever copied into an archive service, and
  typically only one copy exists.

[137]Caching Proxy

  The most ambitious & total approach to local caching is to set up a
  proxy to do your browsing through, and record literally all your web
  traffic; for example, using [138]Live Archiving Proxy (LAP) or
  [139]WarcProxy which will save as [140]WARC files every page you visit
  through it. ([141]Zachary Vance explains how to set up a local HTTPS
  certificate to MITM your HTTPS browsing as well.)

  One may be reluctant to go this far, and prefer something
  lighter-weight, such as periodically extracting a list of visited URLs
  from one’s web browser and then attempting to archive them.

[142]Batch job downloads

  For a while, I used a shell script named, imaginatively enough,
  local-archiver:
#!/bin/sh
set -euo pipefail

cp `find ~/.mozilla/ -name "places.sqlite"` ~/
sqlite3 places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                      WHERE moz_places.id = moz_historyvisits.place_id \
                            and visit_date > strftime('%s','now','-1.5 month')*1000000 ORDER by \
                      visit_date;" | filter-urls >> ~/.tmp
rm ~/places.sqlite
split -l500 ~/.tmp ~/.tmp-urls
rm ~/.tmp

cd ~/www/
for file in ~/.tmp-urls*;
do (wget --unlink --continue --page-requisites --timestamping --input-file $file && rm $file &);
done

find ~/www -size +4M -delete

  The code is not the prettiest, but it’s fairly straightforward:
   1. the script grabs my Firefox browsing history by extracting it from
      the history SQL database file[143]^13, and feeds the URLs into
      [144]wget.
      wget is not the best tool for archiving as it will not run
      JavaScript or Flash or download videos etc. It will download
      included JS files but the JS will be obsolete when run in the
      future and any dynamic content will be long gone. To do better
      would require a headless browser like [145]PhantomJS which saves to
      MHT/MHTML, but PhantomJS [146]refuses to support it and I’m not
      aware of an existing package to do this. In practice, static
      content is what is most important to archive, most JS is of highly
      questionable value in the first place, and any important YouTube
      videos can be archived manually with youtube-dl, so wget’s
      limitations haven’t been so bad.
   2. The script splits the long list of URLs into a bunch of files and
      runs that many wgets in parallel because wget apparently has no way
      of simultaneously downloading from multiple domains. There’s also
      the chance of wget hanging indefinitely, so parallel downloads
      continue to make progress.
   3. The [147]filter-urls command is another shell script, which removes
      URLs I don’t want archived. This script is a hack which looks like
      this:
#!/bin/sh
set -euo pipefail
cat /dev/stdin | sed -e "s/#.*//" | sed -e "s/&sid=.*$//" | sed -e "s/\/$//" | grep -v -e 4chan -e reddit ...
   4. delete any particularly large (>4MB) files which might be media
      files like videos or audio (podcasts are particular offenders).

  A local copy is not the best resource—what if a link goes dead in a way
  your tool cannot detect so you don’t know to put up your copy
  somewhere? But it solves the problem decisively.

  The downside of this script’s batch approach soon became apparent to
  me:
   1. not automatic: you have to remember to invoke it and it only
      provides a single local archive, or if you invoke it regularly as a
      cron job, you may create lots of duplicates.
   2. unreliable: wget may hang, URLs may be archived too late, it may
      not be invoked frequently enough, >4MB non-video/audio files are
      increasingly common…
   3. I wanted copies in the Internet Archive & elsewhere as well to let
      other people benefit and provide redundancy to my local archive

  It was to fix these problems that I began working on archiver—which
  would run constantly archiving URLs in the background, archive them
  into the IA as well, and be smarter about media file downloads. It has
  been much more satisfactory.

[148]Daemon

  archiver has an extra feature where any third argument is treated as an
  arbitrary sh command to run after each URL is archived, to which is
  appended said URL. You might use this feature if you wanted to load
  each URL into Firefox, or append them to a log file, or simply download
  or archive the URL in some other way.

  For example, instead of a big local-archiver run, I have archiver run
  wget on each individual URL: screen -d -m -S "archiver" sh -c 'while
  true; do archiver ~/.urls.txt [email protected] "cd ~/www && wget
  --unlink --continue --page-requisites --timestamping -e robots=off
  --reject
  .iso,.exe,.gz,.xz,.rar,.7z,.tar,.bin,.zip,.jar,.flv,.mp4,.avi,.webm
  --user-agent='Firefox/3.6' 120"; done'. (For private URLs which require
  logins, such as [149]darknet markets, wget can still grab them with
  some help: installing the Firefox extension [150]Export Cookies,
  logging into the site in Firefox like usual, exporting one’s
  cookies.txt, and adding the option --load-cookies cookies.txt to give
  it access to the cookies.)
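
  (For instance, once cookies.txt has been exported, a login-protected page
  can be fetched like so; the URL here is purely illustrative:)
wget --load-cookies cookies.txt --page-requisites --timestamping \
     -e robots=off --user-agent='Firefox/4.9' \
     "https://example-market.example/private/listing-12345"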

  Alternately, you might use curl or a specialized archive downloader
  like the Internet Archive’s crawler [151]Heritrix.
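
  (A rough single-page curl equivalent of the wget settings above might look
  like the following; unlike wget, curl does not fetch page requisites such
  as images or CSS, so this grabs only the HTML itself:)
url="https://www.example.com/some-page"            # placeholder URL
curl --silent --location --max-time 120 --user-agent 'Firefox/4.9' \
     --output "archive-$(date +%s).html" "$url"    # timestamped output filename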

[152]Cryptographic timestamping local archives

  We may want cryptographic timestamping to prove that we created a file
  or archive at a particular date and have not since altered it.
  [153]Using a timestamping service’s API, I’ve written 2 shell scripts
  which implement downloading (wget-archive) and timestamping strings or
  files (timestamp). With these scripts, extending the archive bot is as
  simple as changing the shell command:
@reboot sleep 4m && screen -d -m -S "archiver" sh -c 'while true; do archiver ~/.urls.txt [email protected] \
       "wget-archive" 200; done'

  Now every URL we download is automatically cryptographically
  timestamped with ~1-day resolution for free.
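
  (The two scripts are linked above; as a rough illustration of the shape of
  such a wrapper, here is a minimal sketch of a wget-archive-style script,
  with the OpenTimestamps client ots standing in for whichever timestamping
  service's API is actually used:)
#!/bin/sh
# Sketch only, not the actual wget-archive: fetch one URL into a fresh
# directory under ~/www, hash everything fetched, and timestamp the digests.
set -eu
url="$1"
dir=~/www/$(date +%Y%m%d-%H%M%S)
mkdir -p "$dir" && cd "$dir"
wget --unlink --page-requisites -e robots=off "$url" || true
find . -type f -print0 | xargs -0 -r sha256sum > digests.sha256
ots stamp digests.sha256    # OpenTimestamps stand-in; writes digests.sha256.ots as the proof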

[154]Resource consumption

  The space consumed by such a backup is not that bad; only 30–50
  gigabytes for a year of browsing, and less depending on how hard you
  prune the downloads. (More, of course, if you use linkchecker to
  archive entire sites and not just the pages you visit.) Storing this is
  quite viable in the long term; while page sizes have [155]increased 7x
  between 2003 and 2011 and pages average around 400kb[156]^14,
  [157]Kryder’s law has also been operating and has increased disk
  capacity by ~128x—in 2011, $80 will buy you at least [158]2 terabytes,
  which works out to 4 cents a gigabyte, or 80 cents for the low estimate
  of a year's downloads; that is much better than the $25 annual fee that
  somewhere like [159]Pinboard charges. Of course, you need to back this
  up yourself. We’re relatively fortunate here—most Internet documents
  are ‘born digital’ and easy to migrate to new formats or inspect in the
  future. We can download them and worry about how to view them only when
  we need a particular document, and Web browser backwards-compatibility
  already stretches back to files written in the early 1990s. (Of course,
  we’re probably screwed if we discover the content we wanted was
  dynamically presented only in Adobe Flash or as an inaccessible ‘cloud’
  service.) In contrast, if we were trying to preserve programs or
  software libraries instead, we would face a much more formidable task
  in keeping a working ladder of binary-compatible virtual machines or
  interpreters[160]^15. The situation with [161]digital movie
  preservation hardly bears thinking on.

  There are ways to cut down on the size; if you tar it all up and run
  [162]7-Zip with maximum compression options, you could probably compact
  it to 1/5th the size. I found that the uncompressed files could be
  reduced by around 10% by using [163]fdupes to look for duplicate files
  and turning the duplicates into a space-saving [164]hard link to the
  original with a command like fdupes --recurse --hardlink ~/www/.
  (Apparently there are a lot of bit-identical JavaScript (eg.
  [165]JQuery) and images out there.)
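
  (Concretely, a maximum-compression pass plus the deduplication step might
  look like the following; xz, used elsewhere on this page, would work
  equally well in place of 7-Zip:)
# tar the mirror and compress the stream with 7-Zip at maximum settings
tar cf - ~/www | 7z a -si -mx=9 ~/www-backup.tar.7z
# replace bit-identical duplicates (JQuery copies, repeated images etc) with hardlinks
fdupes --recurse --hardlink ~/www/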

  Good filtering of URL sources can reduce the number of URLs to archive by a
  large amount. Examining my manual backups of Firefox browsing history,
  over the 1153 days from 2014-02-25 to 2017-04-22, I visited 2,370,111
  URLs or 2055 URLs per day; after passing through my filtering script,
  that leaves 171,446 URLs, which after de-duplication yields 39,523 URLs
  or ~34 unique URLs per day or 12,520 unique URLs per year to archive.

  This shrunk my archive by 9GB from 65GB to 56GB, although at the cost
  of some archiving fidelity by removing many filetypes like CSS or
  JavaScript or GIF images. As of 22 April 2017, after ~6 years of
  archiving, between xz compression (at the cost of easy searchability),
  aggressive filtering, occasional manual deletion of overly bulky
  domains I feel are probably adequately covered in the IA etc, my full
  WWW archives weigh 55GB.

[166]URL sources

[167]Browser history

  There are a number of ways to populate the source text file. For
  example, I have a script firefox-urls:
#!/bin/sh
set -euo pipefail

cp --force `find ~/.mozilla/firefox/ -name "places.sqlite"|sort|head -1` ~/
sqlite3 -batch places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                      WHERE moz_places.id = moz_historyvisits.place_id and \
                      visit_date > strftime('%s','now','-1 day')*1000000 ORDER by \
                      visit_date;" | filter-urls
rm ~/places.sqlite

  (filter-urls is the same script as in local-archiver. If I don’t want a
  domain locally, I’m not going to bother with remote backups either. In
  fact, because of WebCite’s rate-limiting, archiver is almost
  perpetually back-logged, and I especially don’t want it wasting time on
  worthless links like [168]4chan.)

  This is called every hour by cron:
@hourly firefox-urls >> ~/.urls.txt

  This gets all URLs visited in the previous day and appends them to the
  file for archiver to process. Hence, everything I browse is backed up
  through archiver.

  Non-Firefox browsers can be supported with similar strategies; for
  example, Zachary Vance's Chromium scripts likewise extract URLs from
  Chromium’s [169]SQL history & [170]bookmarks.
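
  (For reference, a Chromium counterpart to firefox-urls might look like the
  sketch below (not Vance's actual scripts); the profile path, the urls/visits
  schema, and the 11644473600-second offset between Chromium's 1601 epoch and
  the Unix epoch are assumptions to verify against one's own install:)
#!/bin/sh
set -eu
# Chromium locks the live database, so work on a copy
cp --force ~/.config/chromium/Default/History /tmp/chromium-history
sqlite3 -batch /tmp/chromium-history "SELECT urls.url FROM urls, visits \
                      WHERE urls.id = visits.url AND \
                            visits.visit_time > (strftime('%s','now','-1 day') + 11644473600) * 1000000 \
                      ORDER BY visits.visit_time;" | filter-urls
rm /tmp/chromium-history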

[171]Document links

  More useful perhaps is a script to extract external links from Markdown
  files and print them to standard out: [172]link-extractor.hs

  So now I can take find . -name "*.page", pass the 100 or so Markdown
  files in my wiki as arguments, and add the thousand or so external
  links to the archiver queue (eg. find . -name "*.page" -type f -print0
  | xargs -0 ~/wiki/haskell/link-extractor.hs | filter-urls >>
  ~/.urls.txt); they will eventually be archived/backed up.

[173]Website spidering

  Sometimes a particular website is of long-term interest to one even if
  one has not visited every page on it; one could manually visit every
  page and rely on the previous Firefox script to dump the URLs into
  archiver, but this isn't always practical or time-efficient. linkchecker
  inherently
  spiders the websites it is turned upon, so it’s not a surprise that it
  can build a [174]site map or simply spit out all URLs on a domain;
  unfortunately, while linkchecker has the ability to output in a
  remarkable variety of formats, it cannot simply output a
  newline-delimited list of URLs, so we need to post-process the output
  considerably. The following is the shell one-liner I use when I want to
  archive an entire site (note that this is a bad command to run on a
  large or heavily hyper-linked site like the English Wikipedia or
  [175]LessWrong!); edit the target domain as necessary:
linkchecker --check-extern -odot --complete -v --ignore-url=^mailto --no-warnings http://www.longbets.org
   | fgrep http # [
   | fgrep -v -e "label=" -e "->" -e '" [' -e '" ]' -e "/ "
   | sed -e "s/href=\"//" -e "s/\",//" -e "s/ //"
   | filter-urls
   | sort --unique >> ~/.urls.txt

  When linkchecker does not work, one alternative is to do a wget
  --mirror and extract the URLs from the filenames—list all the files and
  prefix with a “http://” etc.
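
  (A sketch of that approach, with example.org standing in for the real
  domain:)
wget --mirror --no-parent "http://example.org/"
find example.org/ -type f | sed -e 's/^/http:\/\//' | filter-urls >> ~/.urls.txt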

[176]Reacting to broken links

  archiver combined with a tool like linkchecker means that there will
  rarely be any broken links on gwern.net since one can either find a
  live link or use the archived version. In theory, one has multiple
  options now:
   0. Search for a copy on the live Web
   1. link the Internet Archive copy
   2. link the WebCite copy
   3. link the WikiWix copy
   4. use the wget dump
      If it’s been turned into a full local file-based version with
      --convert-links --page-requisites, one can easily convert the dump
      into something like a standalone PDF suitable for public
      distribution. (A PDF is easier to store and link than the original
      directory of bits and pieces or other HTML formats like a ZIP
      archive of said directory.)
      I use [177]wkhtmltopdf, which does a good job (see the sketch after
      this list); an example of a dead webpage with no Internet mirrors is
      http://www.aeiveos.com/~bradbury/MatrioshkaBrains/MatrioshkaBrainsPaper.html
      which can be found at [178]1999-bradbury-matrioshkabrains.pdf, or
      Sternberg et al's 2001 review [179]"The Predictive Value of IQ".
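
  (The PDF conversion itself is a single command once the page has been
  mirrored; the paths here are illustrative:)
# grab a self-contained local copy, then render it to a single PDF
wget --convert-links --page-requisites "http://example.org/dead-page.html"
wkhtmltopdf example.org/dead-page.html dead-page.pdf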

[180]External Links

    * [181]archivenow (library for submission to archive.is, archive.st,
      perma.cc, Megalodon, WebCite, & IA)
    * [182]Memento meta-archive search engine (for checking IA & other
      archives)
    * [183]Archive-It (by the Internet Archive); [184]donate to the IA
    * [185]Pinboard
         + [186]“Bookmark Archives That Don’t”
    * [187]“Testing 3 million hyperlinks, lessons learned”, Stack
      Exchange
    * [188]“Backup All The Things”, muflax
    * [189]Tesoro (archive service; [190]discussion)
    * [191]“Digital Resource Lifespan”, XKCD
    * [192]“Archiving web sites”, LWN
    * software:
         + [193]webrecorder.io (online service); [194]webrecorder
           (offline tool; WARC-based & can deal with dynamically-loaded
           content)
         + [195]pywb
         + [196]ArchiveBox (headless Chrome)
         + [197]IPFS
         + [198]WordPress plugin: Broken Link Checker
         + [199]youtube-dl
         + [200]PhantomJS
         + [201]erised
         + [202]wpull
    * [203]Hacker News discussion
    * [204]“Indeed, it seems that Google is forgetting the old Web”
      ([205]HN)
    * [206]“The Lost Picture Show: Hollywood Archivists Can’t Outpace
      Obsolescence—Studios invested heavily in magnetic-tape storage for
      film archiving but now struggle to keep up with the technology”

[207]Appendices

[208]Cryptographic timestamping

  Due to length, this section has [209]been moved to a separate page.

[210]filter-urls

  A raw dump of URLs, while certainly archivable, will typically result
  in a very large mirror of questionable value (is it really necessary to
  archive Google search queries or Wikipedia articles? usually, no) and
  worse, given the rate-limiting necessary to store URLs in the Internet
  Archive or other services, may wind up delaying the archiving of the
  important links & risking their total loss. Disabling the remote
  archiving is unacceptable, so the best solution is to simply take a
  little time to manually blacklist various domains or URL patterns.

  This blacklisting can be as simple as a command like filter-urls | grep
  -v en.wikipedia.org, but can be much more elaborate. The following
  shell script is the skeleton of my own custom blacklist, derived from
  manually filtering through several years of daily browsing as well as
  spiders of [211]dozens of websites for various people & purposes,
  demonstrating a variety of possible techniques: regexps for domains &
  file-types & query-strings, sed-based rewrites, fixed-string matches
  (both blacklists and whitelists), etc:
#!/bin/sh

# USAGE: `filter-urls` accepts on standard input a list of newline-delimited URLs or filenames,
# and emits on standard output a list of newline-delimited URLs or filenames.
#
# This list may be shorter and entries altered. It tries to remove all unwanted entries, where 'unwanted'
# is a highly idiosyncratic list of regexps and fixed-string matches developed over hundreds of thousands
# of URLs/filenames output by my daily browsing, spidering of interesting sites, and requests
# from other people to spider sites for them.
#
# You are advised to test output to make sure it does not remove
# URLs or filenames you want to keep. (An easy way to test what is removed is to use the `comm` utility.)
#
# For performance, it does not sort or remove duplicates from output; both can be done by
# piping `filter-urls` to `sort --unique`.

set -euo pipefail

cat /dev/stdin \
   | sed -e "s/#.*//" -e 's/>$//' -e "s/&sid=.*$//" -e "s/\/$//" -e 's/$/\n/' -e 's/\?sort=.*$//' \
     -e 's/^[ \t]*//' -e 's/utm_source.*//' -e 's/https:\/\//http:\/\//' -e 's/\?showComment=.*//' \
   | grep "\." \
   | fgrep -v "*" \
   | egrep -v -e '\/\.rss$' -e "\.tw$" -e "//%20www\." -e "/file-not-found" -e "258..\.com/$" \
      -e "3qavdvd" -e "://avdvd" -e "\.avi" -e "\.com\.tw" -e "\.onion" -e "\?fnid\=" -e "\?replytocom=" \
      -e "^lesswrong.com/r/discussion/comments$" -e "^lesswrong.com/user/gwern$" \
      -e "^webcitation.org/query$" \
      -e "ftp.*" -e "6..6\.com" -e "6..9\.com" -e "6??6\.com" -e "7..7\.com" -e "7..8\.com" -e "7..\.com" \
      -e "78..\.com" -e "7??7\.com" -e "8..8\.com" -e "8??8\.com" -e "9..9\.com" -e "9??9\.com" \
      -e gold.*sell -e vip.*club \
   | fgrep -v -e "#!" -e ".bin" -e ".mp4" -e ".swf" -e "/mediawiki/index.php?title=" -e "/search?q=cache:" \
     -e "/wiki/Special:Block/" -e "/wiki/Special:WikiActivity" -e "Special%3ASearch" \
     -e "Special:Search" -e "__setdomsess?dest="
      # ...

# prevent URLs from piling up at the end of the file
echo ""

  filter-urls can be used on one’s local archive to save space by
  deleting files which may be downloaded by wget as dependencies. For
  example:
find ~/www | sort --unique >> full.txt && \
   find ~/www | filter-urls | sort --unique >> trimmed.txt
comm -23 full.txt trimmed.txt | xargs -d "\n" rm
rm full.txt trimmed.txt

[212]sort --key compression trick

    Programming folklore notes that one way to get better lossless
    compression efficiency is to rearrange files inside the archive to
    group ‘similar’ files together and expose redundancy to the
    compressor, in accordance with information-theoretical principles. A
    particularly easy and broadly-applicable way of doing this, which
    does not require using any unusual formats or tools and is fully
    compatible with the default archive methods, is to sort the files by
    filename and especially file extension. I show how to do this with
    the standard Unix command-line sort tool, using the so-called “sort
    --key trick”, and give examples of the large space-savings possible
    from my archiving work for personal website mirrors and for making
    [213]darknet market mirror datasets where the redundancy at the file
    level is particularly extreme and the sort --key trick shines
    compared to the naive approach.

  One way to look at data compression is as a form of intelligence (see
  the [214]Hutter Prize & [215]Burfoot 2011): a compression tool like
  [216]xz is being asked to predict what the next bit of the original
  file is, and the better its predictions, the less data needs to be
  stored to correct its mistaken predictions (“I know how to spell
  ‘banana’, I just don’t know when to stop”). The smarter the program is,
  the better its compression will be; but on the flip side, you can also
  improve the compression ratio by giving it a little help and organizing
  the data into an equivalent form the program can understand better—for
  example, by using the [217]Burrows-Wheeler transform. Or by preserving
  spatial locality: keeping similar data together, and not dispersing it
  all over. (This also explains why multimedia files barely compress:
  because the lossy encodings are the product of decades of work
  specialized to the particular domain of audio or images or video, and a
  general-purpose lossless compression would have to be very intelligent,
  on par with [218]PAQ, to beat them at their own game.) Files on one
  topic should be compressed together.

[219]Locality

  Spatial locality can be subtle. For example, natural language text,
  though not structured line-by-line as visibly as a dictionary, is still
  far from random and has many local correlations a compression tool
  might be able to pick up. If this is true, we would expect that with a
  sufficiently smart compressor, a text file would compress better in its
  natural form than if it were randomly shuffled. Is this the case, or
  are compression utilities too stupid to see any difference between
  random lines and English prose? Taking 14M of text from gwern.net, we
  can see for ourselves:
## uncompressed
cat *.page */*.page */*/*.page |
wc --bytes
# 13588814

## compressed, files in lexicographic order
cat *.page */*.page */*/*.page |                      xz -9 --extreme --stdout |
wc --bytes
# 3626732
## compressed, all lines sorted alphabetically
cat *.page */*.page */*/*.page | sort               | xz -9 --extreme --stdout |
wc --bytes
# 3743756
## compressed, all lines randomly shuffled except for non-unique lines sorted together
cat *.page */*.page */*/*.page | sort --random-sort | xz -9 --extreme --stdout |
wc --bytes
# 3831740
## compressed, all lines randomly shuffled
cat *.page */*.page */*/*.page | shuf               | xz -9 --extreme --stdout |
wc --bytes
# 3862632

  The unmolested text compresses to 3.626M, but the same text randomly
  shuffled is 3.862M! I also included an intermediate form of
  randomization: despite the name, the --random-sort option to sort is
  not actually a random shuffle but a random hash (this is not documented
  in the man page, though it is in the info page) and so any repeated
  non-unique lines will be sorted together (allowing for some easy
  duplication deletion), and so we would expect the only-partial
  randomization of --random-sort to maybe perform a bit better than the
  true random shuffle of shuf. And it does.
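
  (This is easy to verify: because --random-sort hashes whole lines,
  duplicate lines always land next to each other, while shuf scatters them:)
printf 'apple\nbanana\napple\ncherry\nbanana\n' | sort --random-sort
# apple
# apple
# banana
# banana
# cherry
## (the order of the groups varies from run to run, but duplicates stay adjacent)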

[220]Web archives

  Spatial locality also applies to our web archiving. If you are
  mirroring websites, or otherwise compiling a lot of directories with
  redundant data on a file-by-file level, there’s a cute trick to
  massively improve your compression ratios: don’t sort the usual
  lexicographic way, but sort by a subdirectory. (I learned about this
  trick a few years ago while messing around with archiving my home
  directory using find and tar.) This is one of the issues with archiving
  gigabytes of crawls from thousands of domains: URLs have a historical
  oddity where they are not consistently hierarchical. URLs were
  originally modeled after hierarchical Unix filesystems; this page, for
  example, lives at the name /home/gwern/wiki/Archiving-URLs.page, which
  follows a logical left-to-right pattern of increasing narrowness. If
  one lists my entire filesystem in lexicographic order, all the files in
  /home/gwern/ will be consecutive, and the files in wiki/ will be
  consecutive, and so on. Unfortunately, the top level of URLs breaks
  this scheme—one does not visit
  https://com/google/mail/?shva=1#search/l%3aunread, one visits
  https://mail.google.com/mail/?shva=1#search/l%3aunread; one does not
  visit http://net/www/gwern/Archiving-URLs but
  https://www.gwern.net/Archiving-URLs. So if I download a.google.com and
  then later z.google.com, a lexicographic list of downloaded files will
  separate the files as much as possible (even though they are
  semantically probably similar). A quick example from my current WWW
  archive:
ls
# ...
# typemoon.wikia.com/
# tytempletonart.wordpress.com/
# ubc-emotionlab.ca/
# ubook.info/
# ucblibrary3.berkeley.edu/
# uhaweb.hartford.edu/
# ujsportal.pacourts.us/
# ukpmc.ac.uk/
# uk.reuters.com/
# ultan.org.uk/
# ...

  The directories are indeed sorted, but aside from the initial 2 letters
  or so, look nothing like each other: a Wikia subdomain rubs shoulders
  with a WordPress blog, a .ca domain is between a .com, a .info, and a
  .edu (with a .us and .uk thrown in for variety), and so on. Is there
  any way to sort these directories with a bit of parsing thrown in? For
  example, maybe we could reverse each line? Some web browsers store URLs
  reversed right-to-left to enable more efficient database operations, as
  do Google’s [221]BigTable systems[222]^16 (to assist their relatively
  weak compression utility [223]Snappy). Turns out GNU [224]sort already
  supports something similar, the --key & --field-separator options; the
  man page is not very helpful but the [225]info page tells us:
'-t SEPARATOR'
'--field-separator=SEPARATOR'
    Use character SEPARATOR as the field separator when finding the
    sort keys in each line.  By default, fields are separated by the
    empty string between a non-blank character and a blank character.
    By default a blank is a space or a tab, but the 'LC_CTYPE' locale
    can change this.

    That is, given the input line ' foo bar', 'sort' breaks it into
    fields ' foo' and ' bar'.  The field separator is not considered
    to be part of either the field preceding or the field following,
    so with 'sort -t " "' the same input line has three fields: an
    empty field, 'foo', and 'bar'.  However, fields that extend to the
    end of the line, as '-k 2', or fields consisting of a range, as
    '-k 2,3', retain the field separators present between the
    endpoints of the range.


'-k POS1[,POS2]'
'--key=POS1[,POS2]'
    Specify a sort field that consists of the part of the line between
    POS1 and POS2 (or the end of the line, if POS2 is omitted),
    _inclusive_.

    Each POS has the form 'F[.C][OPTS]', where F is the number of the
    field to use, and C is the number of the first character from the
    beginning of the field.  Fields and character positions are
    numbered starting with 1; a character position of zero in POS2
    indicates the field's last character.  If '.C' is omitted from
    POS1, it defaults to 1 (the beginning of the field); if omitted
    from POS2, it defaults to 0 (the end of the field).  OPTS are
    ordering options, allowing individual keys to be sorted according
    to different rules; see below for details.  Keys can span multiple
    fields.

    Example:  To sort on the second field, use '--key=2,2' ('-k 2,2').
    See below for more notes on keys and more examples.  See also the
    '--debug' option to help determine the part of the line being used
    in the sort.

  Hence, we can do better by ordering sort to break on the dots and focus
  on the second part of a URL, like so:
ls | sort --key=2 --field-separator="."
# ...
# uhaweb.hartford.edu/
# adsabs.harvard.edu/
# chandra.harvard.edu/
# cmt.harvard.edu/
# dash.harvard.edu/
# gking.harvard.edu/
# isites.harvard.edu/
# ...

  There are many possible ways to sort, though. So I took my WWW archive as
  of 15 June 2014, optimized all PNGs & JPEGs with optipng & jpegoptim,
  ran all the files through filter-urls & deleted the ones which failed
  (this took out all of the JS files, which is fine since I don’t think
  those are useful for archival purposes), and was left with ~86.5GB of
  files. Then I tested out several ways of sorting the filenames to see
  what gave the best compression on my corpus.
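
  (The image-optimization pass was along these lines; this is a
  reconstruction rather than the exact flags used, and both tools rewrite the
  files in place:)
find ~/www -type f -name "*.png" -exec optipng -o7 {} \;
find ~/www -type f \( -name "*.jpg" -o -name "*.jpeg" \) -exec jpegoptim --strip-all {} \;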

  First, I establish the baseline:
   1. Size of uncompressed unsorted tarball, which eliminates filesystem
      overhead and tells us how much compression is really saving:
cd ~/www/ && find ./ -type f -print0 | \
             tar c --to-stdout --no-recursion --null --files-from - | wc --bytes
# 86469734400
## 1x
   2. Size of sorted tarball:
cd ~/www/ && find . -type f -print0 | sort --zero-terminated | \
            tar c --to-stdout --no-recursion --null --files-from - | wc --bytes
# 86469734400
## 1x
      So sorting a tarball doesn’t give any benefits. This is mostly as I
      expected, since tar is only supposed to produce a linear archive
      packing together all the specified files and otherwise preserve
      them exactly. I thought there might have been some sort of
      consolidation of full path-names which might yield a small space
      savings, but apparently not.

  Now we can begin sorting before compression. I thought of 6 approaches;
  in decreasing order of final archive size (smaller=better):
   1. Sort by file names, simply by reversing, sorting, and unreversing
      (foo.png ... bar.png reverses to gnp.oof ... gnp.rab, which then
      sort together, and then losslessly reverse back to bar.png /
      foo.png / ...):
cd ~/www/ && find . -type f | rev | sort | rev | \
            tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout | wc --bytes
# 24970605748
## 0.2887x
      (Note that find+rev doesn’t correctly handle filenames with the
      wrong/non-UTF-8 encoding; I ultimately used brute force in the form
      of [226]detox to find all the non-UTF-8 files and rename them.)
   2. Compress tarball, but without any sorting (files are consumed in
      the order find produces them in the filesystem):
cd ~/www/ && find . -type f -print0 | \
            tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
# 24268747400
## 0.2806x
   3. Sort by file suffixes, trying to parse the filenames first:
cd ~/www/ && find . -type f -printf '%f/%p\n' | sort --field-separator="." --key=2 | cut -f2- -d/ | \
            tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout | wc --bytes
# 24097155132
## 0.2786x
   4. Sort normally, in lexicographic order (subdomain, domain, TLD,
      subdirectories & files etc):
cd ~/www/ && find . -type f -print0 | sort --zero-terminated | \
   tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
# 23967317664
## 0.2771x
   5. Sort by middle of domain:
cd ~/www/ && find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="." | \
             tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
# 23946061272
## 0.2769x
   6. Sort by first subdirectory (if there’s a bunch of
      foo.com/wp-content/* & bar.com/wp-content/* files, then the
      /wp-content/ files will all sort together regardless of “f” and “b”
      being far from each other; similarly for domain.com/images/,
      domain.com/css/ etc):
cd ~/www/ && find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="/" | \
            tar c --to-stdout --no-recursion --null --files-from - | xz -9 --stdout | wc --bytes
# 23897682908
## 0.2763x

  Surprisingly, #1, reversing filenames in order to sort on the suffixes,
  turns out to be even worse than not sorting at all. The improved
  attempt to sort on filetypes doesn’t do much better, although it at
  least beats the baseline of no-sorting; it may be that to get a
  compression win, real semantic knowledge of filetypes will be needed
  (perhaps calling file on every file and sorting by the detected
  filetype?). The regular sort also performs surprisingly well, but my
  intuitions win for once and it’s beaten by my previous strategy of
  sorting by the middle of domains. Finally, the winner is a bit of a
  surprise too, a sort I only threw in out of curiosity because I noticed
  blogs tend to have similar site layouts.
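
  (One way to test that idea: ask file for a MIME type, sort on the type, and
  strip the type back off before handing the list to tar. This is only a
  sketch; filenames containing colons or newlines would need more careful
  handling:)
cd ~/www/ && find . -type f -exec file --mime-type {} + \
   | sort --field-separator=":" --key=2 \
   | cut -d: -f1 \
   | tar c --to-stdout --no-recursion --files-from - | xz -9 --stdout | wc --bytes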

  In this case, the best version saved 24268747400−23897682908=371064492
  or 371MB over the unsorted version. Not as impressive as in the next
  use case, but enough to show this seems like a real gain.

[227]Separate mirrors

  Top-level domains are not the only thing we might want to sort
  differently on. To take my mirrors of darknet market drug sites such as
  [228]Silk Road 1: I download a site each time as a separate wget run in
  a timestamped folder. So in my Silk Road 2 folder, I have both
  2013-12-25/ & 2014-01-15/. These share many similar & identical files
  so they compress together with xz down from 1.8GB to 0.3GB.

  But they could compress even better: the similar files may be thousands
  of files and hundreds of megabytes away by alphabetical or file-inode
  order, so even with a very large window and a top-notch compressor, it
  will fail to spot many long-range redundancies. In between
  2013-12-25/default.css and 2014-01-15/default.css are going to be all
  sorts of files which have nothing to do with CSS, like
  2014-01-16/items/2-grams-of-pure-vanilla-ketamine-powder-dutchdope?vendor_feedback_page=5
  and 2014-01-16/items/10-generic-percocet-10-325-plus-1-30mg-morphine. You
  see the problem.

  Because we sort the files by ‘all files starting with “2013”’ and then
  ‘all files starting “2014”’, we lose all proximity. If instead, we
  could sort by subfolder and only then by the top-level folder, then
  we’d have everything line up nicely. Fortunately, we already know how
  to do this! Reuse the sort-key trick, specify “/” as the delimiter to
  parse on, and the nth field to sort on. We feed it a file list, tell it
  to break filenames by “/”, and then to sort on a lower level, and if we
  did it right, we will indeed get output like 2013-12-25/default.css
  just before 2014-01-15/default.css, which will do wonders for our
  compression, and which will pay ever more dividends as we accumulate
  more partially-redundant mirrors.

  Here is an example of output for my Pandora mirrors, where, due to
  frequent rumors of its demise triggering mirroring on my part, I have 5
  full mirrors at the time of testing; and naturally, if we employ the
  sort-key trick (find . -type f | sort --key=3 --field-separator="/"),
  we find a lot of similar-sounding files:
/2014-01-15/profile/5a66e5238421f0422706b267b735d2df/6
/2014-01-16/profile/5a9df4f5482d55fb5a8997c270a1e22d
/2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/1
/2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d.1
/2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/2
/2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d.2
/2013-12-25/profile/5a9df4f5482d55fb5a8997c270a1e22d/3
/2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/4
/2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/5
/2014-01-15/profile/5a9df4f5482d55fb5a8997c270a1e22d/6
/2013-12-25/profile/5abb81db167294478a23ca110284c587
/2013-12-25/profile/5acc44d370e305e252dd4e2b91fda9d0/1
/2014-01-15/profile/5acc44d370e305e252dd4e2b91fda9d0.1
/2013-12-25/profile/5acc44d370e305e252dd4e2b91fda9d0/2
/2014-01-15/profile/5acc44d370e305e252dd4e2b91fda9d0.2

  Note the interleaving of 5 different mirrors, impossible in a normal
  left-to-right alphabetical sort. You can bet that these 4 files (in 15
  versions) are going to compress much better than if they were separated
  by a few thousand other profile pages.

  So here’s an example invocation (doing everything in pipelines to avoid
  disk IO):
find . -type f -print0 | sort --zero-terminated --key=3 --field-separator="/" \
  | tar --no-recursion --null --files-from - -c \
  | xz -9 --extreme --stdout \
  > ../mirror.tar.xz

  Used on my two Silk Road 2 mirrors which together weigh 1800M, a normal
  run without the --key/--field-separator options, as mentioned before,
  yields a 308M archive. That’s not too bad. Certainly much better than
  hauling around almost 2GB. However, if I switch to the sort-key trick,
  the archive is now 271M, or 37M less. Same compression
  algorithm, same files, same unpacked results, same speed, just 2 little
  obscure sort options… and I get an archive 87% the size of the
  original.

  Not impressed? Well, I did say that the advantage increases with the
  number of mirrors to extract redundancy from. With only 2 mirrors, the
  SR2 results can’t be too impressive. How about the Pandora mirrors? 5
  of them gives the technique more scope to shine. And as expected, it’s
  even more impressive when I compare the Pandora archives: 71M vs 162M.
  The sort-keyed archive saves 56% of the regular archive’s size. The
  final Pandora archive, [229]released as part of my DNM archive, boasts
  105 mirrors from 2013-12-25 to 2014-11-05 of 1,745,778 files (many
  duplicates due to being present over multiple crawls) taking up 32GB
  uncompressed. This archive is 0.58GB (606,203,320 bytes) when
  sort-keyed; when sorted alphabetically, it is instead 7.5GB
  (7,466,374,836 bytes) or 12.31x larger, so the sort-keyed archive saves
  92%. Forum mirrors are even more redundant than market mirrors because
  all content added to a forum will be present in all subsequent crawls,
  so each crawl will yield the previous crawl plus some additional posts;
  hence, the Pandora forum’s raw size of 8,894,521,228 bytes shrinks to
  283,747,148 bytes when compressed alphabetically but shrinks
  dramatically to the 31x smaller sort-keyed archive of 9,084,508 bytes,
  saving 97%!

  Clearly in these DNM use-cases, sort-keying has gone from a cute trick
  to indispensable.

[230]Alternatives

  The sort-key trick is most useful when we can infer a lot of spatial
  locality just from parts of the file names, but it's not a panacea. There
  are other approaches:
   1. if we have multiple temporally-ordered datasets and we don’t mind
      making it more difficult to access older copies, it may be simpler
      to store the data as a DVCS like [231]git where each dataset is a
      large patch
   2. sort files by minimizing binary differences between them using Timm
      S. Müller’s [232]binsort utility
      The default optimization setting of -o15 underperforms the sort-key
      trick on both the SR2 and Pandora datasets by 5–10MB, and a higher
      setting like -o1000 is best. Note that binsort is O(n^2) in the number
      of files, so it’s limited to sorting <10,000 files. An example
      pipeline for compressing posts from the Silk Road forums regarding
      [233]a minor Ponzi scheme there, since the filenames offer little
      guidance for a semantically-meaningful sort:
find dkn255hz262ypmii.onion/ -type f -exec fgrep -i -l vladimir  {} \; >> ponzi.txt
find dkn255hz262ypmii.onion/ -type f -exec fgrep -i -l Limetless {} \; >> ponzi.txt

mkdir ponzi/ && cp `cat ponzi.txt` ponzi/
~/bin/binsort/binsort -t 6 -o 1000 ponzi/ > ponzi-list.txt
cat ponzi-list.txt | tar --no-recursion --files-from - -c | xz -9 --extreme --stdout > ~/srf1-ponzi.tar.xz
      (Embarrassingly, binsort seems to underperform on this dataset: the
      files are 380M uncompressed, 3.536M sort-keyed, and 3.450M sorted.)
   3. if we know there are many duplicates but they are far apart by a
      lexicographic sort and we don’t have any good way to sort filenames
      and a lot of RAM, we can try out a compression tool which
      specifically looks for long-range redundancies throughout the
      entire bitstream, such as [234]lrzip ([235]homepage/[236]Arch
      wiki). lrzip is packaged in Debian and should be at least as good
      as xz since it too uses LZMA; but it is an obscure tool which is
      not a default install on Linux distributions like xz is, and this
      makes distributing archives to other people difficult.

[237]External Links

    * [238]“Sorting multiple keys with Unix sort”
    * [239]“SimHash: Hash-based Similarity Detection” (see also
      [240]MinHash)
    __________________________________________________________________

   1. I use [241]duplicity & [242]rdiff-backup to backup my entire home
      directory to a cheap 1.5TB hard drive (bought from Newegg using
      forre.st’s [243]“Storage Analysis—GB/$ for different sizes and
      media” price-chart); a limited selection of folders are backed up
      to [244]Backblaze B2 using duplicity.
      I used to semiannually tar up my important folders, add [245]PAR2
      redundancy, and burn them to DVD, but that’s no longer really
      feasible; if I ever get a Blu-ray burner, I’ll resume WORM backups.
      (Magnetic media doesn’t strike me as reliable over many decades,
      and it would ease my mind to have optical backups.)[246]↩︎
   2. [247]“When the Internet Is My Hard Drive, Should I Trust Third
      Parties?”, Wired:

    Bits and pieces of the web disappear all the time. It’s called ‘link
    rot’, and we’re all used to it. A friend saved 65 links in 1999 when
    he planned a trip to Tuscany; only half of them still work today. In
    my own blog, essays and news articles and websites that I link to
    regularly disappear – sometimes within a few days of my linking to
    them.
      [248]↩︎
   3. [249]“Going, Going, Gone: Lost Internet References”; abstract:

    The extent of Internet referencing and Internet reference activity
    in medical or scientific publications was systematically examined in
    more than 1000 articles published between 2000 and 2003 in the New
    England Journal of Medicine, The Journal of the American Medical
    Association, and Science. Internet references accounted for 2.6% of
    all references (672/25548) and in articles 27 months old, 13% of
    Internet references were inactive.
      [250]↩︎
   4. The Million Dollar Homepage still gets a surprising amount of
      traffic, so one fun thing one could do is buy up expired domains
      which paid for particularly large links.[251]↩︎
   5. By 6 January 2013, the number has increased to ~12000 external
      links, ~7200 to non-Wikipedia domains.[252]↩︎
   6. If each link has a fixed chance of dying in each time period, such
      as 3%, then the total risk of death is an [253]exponential
      distribution; over the time period 2011–2070 the cumulative chance
      it will beat each of the 3% risks is 0.1658. So in 2070, how many
      of the 2200 links will have beat the odds? Each link is
      independent, so they are like flipping a biased coin and form a
      [254]binomial distribution. The binomial distribution, being
      discrete, has no easy equation, so we just ask R how many links
      survive at the 5th percentile/quantile (a lower bound) and how many
      survive at the 95th percentile (an upper bound):
qbinom(c(0.05, 0.95), 2200, 0.97^(2070-2011))
# [1] 336 394

## the 50% annual link rot hypothetical:
qbinom(c(0.05, 0.50), 2200, 0.50^(2070-2011))
# [1] 0 0
      [255]↩︎
   7. 101, ‘Buddhism related to Blossoms’; [256]Unforgotten Dreams: Poems
      by the Zen monk Shōtetsu; trans. Steven D. Carter, ISBN
      0-231-10576-2[257]↩︎
   8. Which I suspect is only accidentally ‘general’ and would shut down
      access if there were some other way to ensure that Wikipedia
      external links still got archived.[258]↩︎
   9. Since Pinboard is a bookmarking service more than an archive site,
      I asked whether treating it as such would be acceptable and Maciej
      said “Your current archive size, growing at 20 GB a year, should
      not be a problem. I’ll put you on the heavy-duty server where my
      own stuff lives.”[259]↩︎
  10. Google Cache is generally recommended only as a last ditch resort
      because pages expire quickly from it. Personally, I’m convinced
      that Google would never just delete colossal amounts of Internet
      data—this is Google, after all, the epitome of storing unthinkable
      amounts of data—and that Google Cache merely ceases to make public
      its copies. And to request a Google spider visit, one has to solve
      a CAPTCHA—so that’s not a scalable solution.[260]↩︎
  11. Which would not be publicly accessible or submittable; I know they
      exist, but because they hide themselves, I know only from random
      comments online eg. [261]“years ago a friend of mine who I’d lost
      contact with caught up with me and told me he found a cached copy
      of a website I’d taken down in his employer’s equivalent to the
      Wayback Machine. His employer was a branch of the federal
       government.”[262]↩︎
  12. [263]Version 0.1 of my archiver daemon didn’t simply read the file
      until it was empty and exit, but actually watched it for
      modifications with [264]inotify. I removed this functionality when
      I realized that the required WebCite choking (just one URL every
      ~25 seconds) meant that archiver would never finish any reasonable
      workload.[265]↩︎
  13. Much easier than it was in the past; [266]Jamie Zawinski records
      his travails with the previous Mozilla history format in the
      aptly-named [267]“when the database worms eat into your
      brain”.[268]↩︎
  14. An older [269]2010 Google article put the average at 320kb, but
      that was an average over the entire Web, including all the old
      content.[270]↩︎
  15. Already one runs old games like the classic [271]LucasArts
      adventure games in emulators of the DOS operating system like
      [272]DOSBox; but those emulators will not always be maintained. Who
      will emulate the emulators? Presumably in 2050, one will instead
      emulate some ancient but compatible OS—Windows 7 or Debian 6.0,
      perhaps—and inside that run DOSBox (to run the DOS which can run
      the game).[273]↩︎
  16. [274]“BigTable: A Distributed Storage System for Structured Data”,
      Chang et al 2006:

    BigTable maintains data in lexicographic order by row key. The row
    range for a table is dynamically partitioned. Each row range is
    called a tablet, which is the unit of distribution and load
    balancing. As a result, reads of short row ranges are efficient and
    typically require communication with only a small number of
    machines. Clients can exploit this property by selecting their row
    keys so that they get good locality for their data accesses. For
    example, in Webtable, pages in the same domain are grouped together
    into contiguous rows by reversing the hostname components of the
    URLs. For example, we store data for maps.google.com/index.html
    under the key com.google.maps/index.html. Storing pages from the
    same domain near each other makes some host and domain analyses more
    efficient.
      [275]↩︎
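
  To make the reversed-hostname trick from note 16 concrete, here is a
  minimal shell/awk sketch (the helper name reverse_host_key is mine,
  purely illustrative): it rewrites a “host/path” key such as
  maps.google.com/index.html into com.google.maps/index.html, so that a
  plain lexicographic sort groups pages from the same domain into
  adjacent rows, the same locality idea exploited by the sort --key
  compression trick in the appendix.

reverse_host_key () {
    # Reverse the dot-separated hostname components of a "host/path" key,
    # leaving the path untouched, eg.:
    #   "maps.google.com/index.html" -> "com.google.maps/index.html"
    echo "$1" | awk -F/ -v OFS=/ '{
        n = split($1, h, ".")            # split the hostname on "."
        key = h[n]                       # start from the TLD
        for (i = n - 1; i >= 1; i--)     # append the rest in reverse
            key = key "." h[i]
        $1 = key                         # swap in the reversed hostname
        print                            # path components are untouched
    }'
}

reverse_host_key "maps.google.com/index.html"
# com.google.maps/index.html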

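  And a rough command-line sketch of the backup routine described in
  note 1: the folder, bucket name, and credentials below are
  placeholders, and the b2:// URL form is as I understand the duplicity
  manual, so treat this as an outline of the workflow rather than the
  exact commands.

# incremental, encrypted backup of a chosen folder to Backblaze B2 via
# duplicity (placeholder bucket & credentials;
# scheme: b2://account_id:application_key@bucket/folder)
duplicity ~/documents "b2://ACCOUNT_ID:APPLICATION_KEY@example-bucket/documents"

# the old WORM-style routine: tar up a folder, add ~10% PAR2 recovery
# data, and burn the resulting .tar + .par2 volumes to optical media
tar cf documents-$(date +%F).tar ~/documents
par2 create -r10 documents-$(date +%F).par2 documents-$(date +%F).tar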

References

  1. http://www.gwern.net/
  2. https://old.reddit.com/r/gwern/search.rss?q=flair:'Gwern'&sort=new&restrict_sr=on
  3. https://www.gwern.net/./index
  4. https://www.gwern.net/./index
  5. https://www.gwern.net/./About
  6. https://www.gwern.net/./Links
  7. https://www.gwern.net/./Changelog
  8. https://tinyletter.com/gwern
  9. https://old.reddit.com/r/gwern/
 10. https://www.patreon.com/gwern
 11. https://www.gwern.net/./About#design
 12. https://www.gwern.net/./tags/Haskell
 13. https://www.gwern.net/./tags/archiving
 14. https://www.gwern.net/./tags/Bitcoin
 15. https://www.gwern.net/./tags/meta
 16. https://www.gwern.net/./tags/shell
 17. https://www.gwern.net/./tags/R
 18. https://www.gwern.net/./tags/tutorial
 19. https://www.gwern.net/./Archiving-URLs.page
 20. https://www.gwern.net/./About#confidence-tags
 21. https://www.gwern.net/./About#importance-tags
 22. https://www.gwern.net/Archiving-URLs#link-rot
 23. https://www.gwern.net/Archiving-URLs#detection
 24. https://www.gwern.net/Archiving-URLs#prevention
 25. https://www.gwern.net/Archiving-URLs#remote-caching
 26. https://www.gwern.net/Archiving-URLs#local-caching
 27. https://www.gwern.net/Archiving-URLs#caching-proxy
 28. https://www.gwern.net/Archiving-URLs#batch-job-downloads
 29. https://www.gwern.net/Archiving-URLs#daemon
 30. https://www.gwern.net/Archiving-URLs#cryptographic-timestamping-local-archives
 31. https://www.gwern.net/Archiving-URLs#resource-consumption
 32. https://www.gwern.net/Archiving-URLs#url-sources
 33. https://www.gwern.net/Archiving-URLs#browser-history
 34. https://www.gwern.net/Archiving-URLs#document-links
 35. https://www.gwern.net/Archiving-URLs#website-spidering
 36. https://www.gwern.net/Archiving-URLs#reacting-to-broken-links
 37. https://www.gwern.net/Archiving-URLs#external-links
 38. https://www.gwern.net/Archiving-URLs#appendices
 39. https://www.gwern.net/Archiving-URLs#cryptographic-timestamping
 40. https://www.gwern.net/Archiving-URLs#filter-urls
 41. https://www.gwern.net/Archiving-URLs#sort---key-compression-trick
 42. https://www.gwern.net/Archiving-URLs#locality
 43. https://www.gwern.net/Archiving-URLs#web-archives
 44. https://www.gwern.net/Archiving-URLs#separate-mirrors
 45. https://www.gwern.net/Archiving-URLs#alternatives
 46. https://www.gwern.net/Archiving-URLs#external-links-1
 47. https://www.gwern.net/./Timestamping
 48. https://www.gwern.net/./Archiving-URLs#sort---key-compression-trick
 49. https://www.gwern.net/./DNM-archives
 50. https://www.gwern.net/./About#long-content
 51. https://en.wikipedia.org/wiki/link_rot
 52. https://www.gwern.net/Archiving-URLs#fn1
 53. https://en.wikipedia.org/wiki/exomind
 54. https://www.gwern.net/Archiving-URLs#link-rot
 55. https://en.wikipedia.org/wiki/Link_rot#Prevalence
 56. https://www.gwern.net/./docs/www/www2003.org/c220b6d34d074987565bf1515cae7d8117131926.html
 57. https://www.gwern.net/./docs/www/pdfs.semanticscholar.org/623f1cf6f89c4e662cb4080b5ec309fea437a015.pdf
 58. https://en.wikipedia.org/wiki/D-Lib_Magazine
 59. https://www.gwern.net/./docs/www/www.spinellis.gr/f49aecb7855870028c010c7a51296616051c78d5.html
 60. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.9695&rep=rep1&type=pdf
 61. https://www.gwern.net/./docs/www/www.dlib.org/8a11e08c0ffba13eb4220502190eb5d0c7cd9fb5.html
 62. https://en.wikipedia.org/wiki/Bruce_Schneier
 63. https://www.gwern.net/./docs/www/www.pantos.org/10723128e7820b8612d887af0fd0052d7ca41741.html
 64. https://www.gwern.net/Archiving-URLs#fn2
 65. https://notes.pinboard.in/u:vitorio/05dec9f04909d9b6edff
 66. https://motherboard.vice.com/en_us/article/paabgg/i-bought-a-book-about-the-internet-from-1994-and-none-of-the-links-worked
 67. https://en.wikipedia.org/wiki/Internet_Archive
 68. https://web.archive.org/web/20071019225237/http://www.wired.com/culture/lifestyle/news/2001/10/47894
 69. https://en.wikipedia.org/wiki/Science_(journal)
 70. https://www.gwern.net/Archiving-URLs#fn3
 71. https://en.wikipedia.org/wiki/French_Wikipedia
 72. http://www.wikiwix.com/
 73. http://fr.wikipedia.org/wiki/Utilisateur:Pmartin/Cache
 74. https://en.wikipedia.org/wiki/File:Articles-w-Dead-Links-Jan-2011.png
 75. https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_External_links/Webcitebot2#Summary
 76. https://lil.law.harvard.edu/blog/2017/07/21/a-million-squandered-the-million-dollar-homepage-as-a-decaying-digital-artifact/
 77. https://en.wikipedia.org/wiki/The_Million_Dollar_Homepage
 78. https://www.gwern.net/Archiving-URLs#fn4
 79. https://en.wikipedia.org/wiki/Pinboard_(website)
 80. https://twitter.com/pinboard/status/501406303747457024
 81. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253
 82. http://jnci.oxfordjournals.org/content/96/12/969.full
 83. https://www.gwern.net/./docs/linkrot/2007-dimitrova.pdf
 84. https://www.gwern.net/./docs/linkrot/2008-wren.pdf
 85. https://www.gwern.net/./docs/www/archderm.ama-assn.org/9a7a7ce4aa904413f5b5cd674ea704ab9e63aa9b.pdf
 86. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2213465/
 87. https://www.gwern.net/./docs/www/pdfs.semanticscholar.org/a93d671d3fa25a19baabb16cdde8f98951adb260.pdf
 88. https://www.gwern.net/./docs/www/www.fasebj.org/9a57fd429260c51e1d6eb22da088f4d4461000dd.html
 89. https://www.gwern.net/./docs/linkrot/2012-moghaddam.pdf
 90. https://www.gwern.net/./docs/www/cdn.harvardlawreview.org/462cd5385125d17e90515a9fec0d463051375267.pdf
 91. https://arxiv.org/abs/1209.3026v1
 92. https://www.gwern.net/./docs/www/old.reddit.com/3fa6bd447f8e0c36e551dcae4d8bc6568b019022.html#t1_e1n8hzn
 93. https://www.gwern.net/./docs/www/help.myspace.com/e92581637046e3b764df3e2d03b4b3a00121bff6.html
 94. https://archive.org/details/myspace_dragon_hoard_2010
 95. https://www.gwern.net/Archiving-URLs#fn5
 96. https://en.wikipedia.org/wiki/confidence_interval
 97. https://en.wikipedia.org/wiki/binomial_distribution
 98. https://www.gwern.net/Archiving-URLs#fn6
 99. http://wiki.lesswrong.com/wiki/Litany_of_Gendlin
100. https://addons.mozilla.org/en-US/firefox/addon/mozilla-archive-format/
101. https://github.com/danny0838/firefox-scrapbook/wiki/Features
102. https://www.gwern.net/Archiving-URLs#detection
103. https://www.gwern.net/Archiving-URLs#fn7
104. https://www.gwern.net/./docs/www/www.cs.odu.edu/5d0f24ecdfa4b26f35532b92ddf6fc43fdf2b4c3.html
105. http://wummel.github.io/linkchecker/
106. https://en.wikipedia.org/wiki/cron
107. https://en.wikipedia.org/wiki/robots.txt
108. https://www.gwern.net/Archiving-URLs#prevention
109. https://www.gwern.net/Archiving-URLs#remote-caching
110. https://news.ycombinator.com/item?id=21270467
111. https://en.wikipedia.org/wiki/archive_site
112. https://en.wikipedia.org/wiki/Internet_Archive
113. https://en.wikipedia.org/wiki/WebCite
114. http://perma.cc/
115. https://blogs.harvard.edu/perma/2019/01/07/introducing-individual-account-subscription-tiers-for-perma/
116. https://www.gwern.net/Archiving-URLs#fn8
117. http://archive.today/
118. https://pinboard.in/
119. https://www.gwern.net/Archiving-URLs#fn9
120. https://www.gwern.net/./docs/www/hiyo.jp/20b25c1bc5641a4d93e64aab2128e9246f15a469.html
121. https://www.gwern.net/./docs/www/megalodon.jp/d58907b22d706b8374f8b22a3761daacdcc5df4e.html
122. https://www.gwern.net/Archiving-URLs#fn10
123. https://www.gwern.net/Archiving-URLs#fn11
124. http://bits.blogs.nytimes.com/2010/12/07/palm-is-far-from-game-over-says-former-chief/
125. https://www.gwern.net/./docs/www/webcitation.org/ec6cc4cb308f20688510fda4d5e55f07dd81bf5a.html
126. https://www.gwern.net/./haskell/Wikipedia-Archive-Bot
127. https://www.gwern.net/./haskell/Wikipedia-RSS-Archive-Bot
128. https://en.wikipedia.org/wiki/Alexa_Toolbar
129. https://hackage.haskell.org/package/gitit
130. https://github.com/jgm/gitit/blob/master/plugins/Interwiki.hs
131. https://hackage.haskell.org/package/archiver
132. https://www.gwern.net/Archiving-URLs#fn12
133. https://github.com/rahiel/archiveror
134. https://en.wikipedia.org/wiki/GNU_screen
135. https://www.gwern.net/Archiving-URLs#local-caching
136. https://arxiv.org/abs/1212.6177
137. https://www.gwern.net/Archiving-URLs#caching-proxy
138. https://github.com/INA-DLWeb/LiveArchivingProxy
139. https://github.com/odie5533/WarcProxy
140. https://en.wikipedia.org/wiki/Web_ARChive
141. https://www.gwern.net/./docs/www/blog.za3k.com/e6b85edad0874ce9f51ecac2c7a2225dbf016a6e.html
142. https://www.gwern.net/Archiving-URLs#batch-job-downloads
143. https://www.gwern.net/Archiving-URLs#fn13
144. https://en.wikipedia.org/wiki/wget
145. https://en.wikipedia.org/wiki/PhantomJS
146. https://github.com/ariya/phantomjs/issues/10117
147. https://www.gwern.net/Archiving-URLs#filter-urls
148. https://www.gwern.net/Archiving-URLs#daemon
149. https://www.gwern.net/./DNM-archives
150. https://addons.mozilla.org/en-US/firefox/addon/export-cookies/
151. http://crawler.archive.org/
152. https://www.gwern.net/Archiving-URLs#cryptographic-timestamping-local-archives
153. https://www.gwern.net/./Timestamping
154. https://www.gwern.net/Archiving-URLs#resource-consumption
155. https://www.gwern.net/./docs/www/www.websiteoptimization.com/e636df6f082c30b99a930d0450c4af5fd912e729.html
156. https://www.gwern.net/Archiving-URLs#fn14
157. https://en.wikipedia.org/wiki/Kryder's_law
158. https://www.gwern.net/./docs/www/forre.st/07963fe85eed8b845122c702ce0cbe668816ac02.html#hdd
159. http://pinboard.in/upgrade/
160. https://www.gwern.net/Archiving-URLs#fn15
161. https://www.gwern.net/./docs/www/www.davidbordwell.net/f933619e20a14d5d4fbf35ab7ea25b4acea8b23b.html
162. https://en.wikipedia.org/wiki/7-Zip
163. https://en.wikipedia.org/wiki/fdupes
164. https://en.wikipedia.org/wiki/hard_link
165. https://en.wikipedia.org/wiki/JQuery
166. https://www.gwern.net/Archiving-URLs#url-sources
167. https://www.gwern.net/Archiving-URLs#browser-history
168. https://en.wikipedia.org/wiki/4chan
169. https://github.com/za3k/rip-chrome-history
170. https://github.com/za3k/export-chrome-bookmarks
171. https://www.gwern.net/Archiving-URLs#document-links
172. https://www.gwern.net/./haskell/link-extractor.hs
173. https://www.gwern.net/Archiving-URLs#website-spidering
174. https://en.wikipedia.org/wiki/site_map
175. http://lesswrong.com/
176. https://www.gwern.net/Archiving-URLs#reacting-to-broken-links
177. https://www.gwern.net/./docs/www/wkhtmltopdf.org/1166a9ebce4000a02b4175174daae3731a5d756c.html
178. https://www.gwern.net/./docs/ai/1999-bradbury-matrioshkabrains.pdf
179. https://www.gwern.net/./docs/iq/2001-sternberg.pdf
180. https://www.gwern.net/Archiving-URLs#external-links
181. https://github.com/oduwsdl/archivenow
182. http://timetravel.mementoweb.org/
183. http://www.archive-it.org/
184. https://archive.org/donate/
185. http://pinboard.in/
186. https://web.archive.org/web/20170707135337/http://blog.pinboard.in:80/2010/11/bookmark_archives_that_don_t
187. https://www.gwern.net/./docs/www/samsaffron.com/5f76aa4a1c761f34c30a84292e11bed153b6420d.html#comment-31366
188. https://www.gwern.net/./docs/linkrot/2011-muflax-backup.pdf
189. https://www.gwern.net/./docs/www/tesoro.io/fde1f38f35a3caa5d2cc214dc79bfdf4ecffa3d5.html
190. https://news.ycombinator.com/item?id=14644441
191. https://xkcd.com/1909/
192. https://lwn.net/Articles/766374/
193. https://webrecorder.io/
194. https://github.com/webrecorder/webrecorder
195. https://github.com/ikreymer/pywb
196. https://archivebox.io/
197. https://en.wikipedia.org/wiki/IPFS
198. https://www.gwern.net/./docs/www/wordpress.org/87feddbeb90c8f8f076dbac7ab01d77296329ea2.html
199. https://rg3.github.io/youtube-dl/
200. https://en.wikipedia.org/wiki/PhantomJS
201. https://github.com/marvelm/erised
202. https://github.com/chfoo/wpull
203. https://news.ycombinator.com/item?id=6504331
204. https://www.gwern.net/./docs/www/stop.zona-m.net/783905af0715e433863b32c879821ed20cee5de3.html
205. https://news.ycombinator.com/item?id=19604135
206. https://www.gwern.net/./docs/www/spectrum.ieee.org/99737546cb17f6d5cd38009b1d8ae3ead68119df.html
207. https://www.gwern.net/Archiving-URLs#appendices
208. https://www.gwern.net/Archiving-URLs#cryptographic-timestamping
209. https://www.gwern.net/./Timestamping
210. https://www.gwern.net/Archiving-URLs#filter-urls
211. http://lesswrong.com/lw/7kg/rationalist_sites_worth_archiving/
212. https://www.gwern.net/Archiving-URLs#sort---key-compression-trick
213. https://www.gwern.net/./DNM-archives
214. https://en.wikipedia.org/wiki/Hutter_Prize
215. https://arxiv.org/abs/1104.5466
216. https://en.wikipedia.org/wiki/xz
217. https://en.wikipedia.org/wiki/Burrows-Wheeler_transform
218. https://en.wikipedia.org/wiki/PAQ
219. https://www.gwern.net/Archiving-URLs#locality
220. https://www.gwern.net/Archiving-URLs#web-archives
221. https://en.wikipedia.org/wiki/BigTable
222. https://www.gwern.net/Archiving-URLs#fn16
223. https://en.wikipedia.org/wiki/Snappy_(software)
224. https://en.wikipedia.org/wiki/sort_(Unix)
225. https://en.wikipedia.org/wiki/info_(Unix)
226. https://www.gwern.net/./docs/www/linux.die.net/7f87c1cf8f6af23204c59fb718697d7a4a645e7b.html
227. https://www.gwern.net/Archiving-URLs#separate-mirrors
228. https://www.gwern.net/./Silk-Road
229. https://www.gwern.net/./DNM-archives
230. https://www.gwern.net/Archiving-URLs#alternatives
231. https://en.wikipedia.org/wiki/Git_(software)
232. https://www.gwern.net/./docs/www/neoscientists.org/9802872843f29498229b95588e7bdebd7d86cad8.html
233. https://old.reddit.com/r/SilkRoad/comments/1yzndq/a_trip_down_memory_lane_vladimir_limetless_and/
234. https://en.wikipedia.org/wiki/rzip
235. https://www.gwern.net/./docs/www/ck.kolivas.org/47782d34488b34eef4ba08da0d7c26063f9242d2.html
236. https://wiki.archlinux.org/index.php/Lrzip
237. https://www.gwern.net/Archiving-URLs#external-links-1
238. https://stackoverflow.com/questions/357560/sorting-multiple-keys-with-unix-sort
239. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.473.7179&rep=rep1&type=pdf
240. https://en.wikipedia.org/wiki/MinHash
241. http://duplicity.nongnu.org/
242. http://www.nongnu.org/rdiff-backup/
243. https://www.gwern.net/./docs/www/forre.st/07963fe85eed8b845122c702ce0cbe668816ac02.html#hdd
244. https://en.wikipedia.org/wiki/Backblaze
245. https://en.wikipedia.org/wiki/PAR2
246. https://www.gwern.net/Archiving-URLs#fnref1
247. https://web.archive.org/web/20121108093008/http://www.wired.com/politics/security/commentary/securitymatters/2008/02/securitymatters_0221
248. https://www.gwern.net/Archiving-URLs#fnref2
249. https://www.gwern.net/./docs/www/scimaps.org/d0b8798bc2a55f9a99ff87a4cb002bb043702294.pdf
250. https://www.gwern.net/Archiving-URLs#fnref3
251. https://www.gwern.net/Archiving-URLs#fnref4
252. https://www.gwern.net/Archiving-URLs#fnref5
253. https://en.wikipedia.org/wiki/exponential_distribution
254. https://en.wikipedia.org/wiki/binomial_distribution
255. https://www.gwern.net/Archiving-URLs#fnref6
256. https://www.gwern.net/./docs/japanese/1997-carter-shotetsu-unforgottendreams.pdf
257. https://www.gwern.net/Archiving-URLs#fnref7
258. https://www.gwern.net/Archiving-URLs#fnref8
259. https://www.gwern.net/Archiving-URLs#fnref9
260. https://www.gwern.net/Archiving-URLs#fnref10
261. https://news.ycombinator.com/item?id=2880427
262. https://www.gwern.net/Archiving-URLs#fnref11
263. http://hackage.haskell.org/package/archiver-0.1
264. https://en.wikipedia.org/wiki/inotify
265. https://www.gwern.net/Archiving-URLs#fnref12
266. https://en.wikipedia.org/wiki/Jamie_Zawinski
267. http://www.jwz.org/blog/2004/03/when-the-database-worms-eat-into-your-brain/
268. https://www.gwern.net/Archiving-URLs#fnref13
269. https://web.archive.org/web/20100628055041/http://code.google.com/speed/articles/web-metrics.html
270. https://www.gwern.net/Archiving-URLs#fnref14
271. https://en.wikipedia.org/wiki/LucasArts_adventure_games
272. https://en.wikipedia.org/wiki/DOSBox
273. https://www.gwern.net/Archiving-URLs#fnref15
274. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.96&rep=rep1&type=pdf
275. https://www.gwern.net/Archiving-URLs#fnref16