TITLE: Scraping Instagram without an account
DATE: 2019-10-20
AUTHOR: John L. Godlee
====================================================================


There are lots of people I would like to follow on Instagram,
mostly woodworkers, bicycle people, and outdoors people. It seems
to be a really good method of delivering content. Unfortunately for
Instagram, there is absolutely no way I would make an account with
them. I fear it would be too much of a time sink, and I'm paranoid
of giving too much detail of my personal interests to Facebook.

I found a command line tool called InstaLooter which can be used to
scrape public Instagram profiles without an account and save the
images to my local machine, where I can read them at my leisure, in
the spirit of RSS. This is how I set it up.

 [InstaLooter]: https://github.com/althonos/InstaLooter

I created a text file which lives in my $HOME called .ig_subs.txt.
The file holds a list of Instagram user IDs for the accounts I want
to scrape from:

   kelsoparadiso
   lloyd.kahn
   exploringalternatives
   barnthespoon
   terrybarentsen
   woodlands.co.uk
   zedoutdoors
   mossy_bottom

Then I made a shell script which lives in my path, called insta_dl:

   #!/bin/bash

   # Make directory if it doesn't exist
   mkdir -p "$HOME/Downloads/ig"

   # Make newlines the only separator
   IFS=$'\n'

   # Disable globbing
   set -f

   # Loop over each account in the subscription list
   for i in $(cat < "$HOME/.ig_subs.txt"); do
     instalooter user "$i" "$HOME/Downloads/ig/" -n 1 -N \
       -T '{username}.{date}.{id}'
   done

instalooter user $i downloads photos from each user i. -n 1 only
downloads the most recent post, whether that post is one photo or
multiple. -N only downloads images which don't already exist in the
destination directory ($HOME/Downloads/ig/), based on the filename.
-T {username}.{date}.{id} sets the filename of each photo. {id} is
unique for each photo on Instagram, so it uniquely identifies each
file downloaded for use by -N. The filenames then look something
like this:

   exploringalternatives.2019-09-27.2142383070393557093.jpg
   kelsoparadiso.2019-10-09.2150831532411304437.jpg
   kelsoparadiso.2019-10-09.2150831532419588103.jpg
   kelsoparadiso.2019-10-09.2150831532419839765.jpg
   lloyd.kahn.2019-10-11.2152638264107259024.jpg
   mossy_bottom.2019-10-09.2151026330651686709.jpg
   terrybarentsen.2019-10-03.2146722625883638769.jpg
   terrybarentsen.2019-10-03.2146722625900303797.jpg
   terrybarentsen.2019-10-03.2146722625950630270.jpg
   woodlands.co.uk.2019-10-11.2152273592812162360.jpg
   zedoutdoors.2019-10-02.2145942922787735607.jpg

If I wanted to I guess I could further file each image into its own
directory based on username or date, but I don't want that.
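If I ever change my mind, that filing step could be sketched
something like this (hypothetical, not part of my script). It leans
on the {username}.{date}.{id} filename template above, and strips
the two trailing dot-fields (date and post ID) rather than splitting
on the first dot, because usernames like lloyd.kahn and
woodlands.co.uk contain dots themselves:

```shell
#!/bin/bash
# Hypothetical sketch: sort downloaded images into per-user folders.
DIR="$HOME/Downloads/ig"

for f in "$DIR"/*.jpg; do
    [ -e "$f" ] || continue     # skip if the glob matched nothing
    name="$(basename "$f" .jpg)"
    rest="${name%.*}"           # strip trailing .{id}
    user="${rest%.*}"           # strip trailing .{date}, keep username
    mkdir -p "$DIR/$user"
    mv "$f" "$DIR/$user/"
done
```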

I can now create a cronjob or a LaunchAgent to automate this to run
every day or every week in the background.
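For the cron route, a crontab entry (added with crontab -e) might
look something like this, assuming insta_dl lives in $HOME/bin:

```shell
# Hypothetical crontab entry: run insta_dl at 08:00 every day.
0 8 * * * $HOME/bin/insta_dl
```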

Update - 2019-10-31

I updated the insta_dl shell script so that it also grabs the
caption of each Instagram post downloaded and stores it in a text
file. InstaLooter can download post metadata as a JSON file by
adding the -d flag (--dump-json). Then I use jq to parse the JSON
file for each post to extract the full name of the account
(.owner.full_name), the @username of the account (.owner.username)
and the content of the caption of the post
(.edge_media_to_caption[][][].text). Then I use sed to put a blank
line after each caption to make them easier to read, and finally
delete the original JSON files:

   #!/bin/bash

   DIR="$HOME/Downloads/ig"

   # Make directory if it doesn't exist
   mkdir -p "$DIR"

   # Make newlines the only separator
   IFS=$'\n'

   # Loop over each account in the subscription list
   for i in $(cat < "$HOME/.ig_subs.txt"); do
       instalooter user "$i" "$DIR" -v -d -n 1 -N \
         -T '{username}.{date}.{id}'
   done

   # Extract "Full Name (username): caption" from each JSON dump
   for i in "$DIR"/*.json; do
       jq '(.owner.full_name + " (" + .owner.username + "): " + .edge_media_to_caption[][][].text)' "$i"
   done > "$DIR/description.txt"

   # Insert a blank line after each caption for readability
   sed -i 'G' "$DIR/description.txt"

   rm "$DIR"/*.json
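To see what the jq filter produces, here it is run against a
minimal stand-in for the metadata InstaLooter dumps (the field
names match those used in the script; the account details and
caption text are invented for illustration):

```shell
# Stand-in for one InstaLooter JSON dump (invented values).
cat > /tmp/sample.json <<'EOF'
{
  "owner": {"full_name": "Lloyd Kahn", "username": "lloyd.kahn"},
  "edge_media_to_caption": {"edges": [{"node": {"text": "Building a tiny cabin."}}]}
}
EOF

# Same filter as in the script above.
jq '(.owner.full_name + " (" + .owner.username + "): " + .edge_media_to_caption[][][].text)' /tmp/sample.json
# → "Lloyd Kahn (lloyd.kahn): Building a tiny cabin."
```

Passing -r to jq would drop the surrounding quotes if plain text is
preferred in description.txt.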