With a bit of Python, lynx, and tidy I was able to pull very clean
plain text versions of my WordPress posts. The sparse HTML can be found
at [1]http://tokyogringo.myjp.net and the markdown text version can be
found on my gopher site at [2]gopher://sdf.org:70/0/users/tokyogringo/.
How did I do it? This site has full text RSS for everyone's enjoyment.
No one has to actually visit https://www.prjorgensen.com in order to
consume the high value content I generate. The feed contains everything
needed for this plain text life. How to make use of it?
I fumbled through my first Python script in a long time, relying
heavily on the very powerful feedparser module.
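A minimal sketch of how feedparser can pull the full post body out of a
feed, not the script itself. The function names and the fallback to
`summary` are my own assumptions; `feedparser.parse`, `entries`,
`content`, and `tags` are the module's real interface.

```python
def extract_post(entry):
    """Pull the fields a plain-text export needs out of one feed entry.

    `entry` is anything dict-like; feedparser entries qualify.
    """
    # Full-text feeds carry the post body in `content`; fall back to
    # `summary` when the feed only offers excerpts (assumed behavior).
    content = entry.get("content") or []
    html = content[0].get("value", "") if content else entry.get("summary", "")
    return {
        "title": entry.get("title", "untitled"),
        "link": entry.get("link", ""),
        "published": entry.get("published", ""),
        "tags": [t.get("term") for t in entry.get("tags", []) if t.get("term")],
        "html": html,
    }


def fetch_posts(feed_url):
    """Parse the feed at `feed_url` and return a list of post dicts."""
    import feedparser  # third-party: pip install feedparser

    feed = feedparser.parse(feed_url)
    return [extract_post(entry) for entry in feed.entries]
```

Calling `fetch_posts()` with the site's feed URL would return one dict
per post, ready to hand to whatever cleans up the HTML.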
This Just In: Python's documentation is terse almost to the point of
incomprehensibility. While accurate, the documentation does not help
beginning (and maybe middling) Python coders get to solving problems.
Oddly, Reddit and the StackExchange sites are also of limited utility,
as the answers there often point back to or copy the documentation.
Anyway, taking a very Unix approach I decided not to do everything in
Python. I know tidy for making valid HTML. I know lynx for
terminal-based web browsing, and its '-dump' option produces plain
text, markdown-like versions of web pages.
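One way to hand the work to tidy and lynx from Python is through
subprocess. This is a sketch under assumptions: the helper names are
hypothetical, and the flags chosen here (`-q`, `-asxhtml`, `-utf8`,
`--show-warnings no` for tidy; `-dump`, `-force_html` for lynx) are one
reasonable set, not necessarily the ones the real script uses.

```python
import shutil
import subprocess


def tidy_command():
    # -q: quiet, -asxhtml: emit XHTML, -utf8: UTF-8 in and out.
    return ["tidy", "-q", "-asxhtml", "-utf8", "--show-warnings", "no"]


def lynx_command(html_path):
    # -dump: render the page to stdout and exit.
    # -force_html: treat the file as HTML whatever its extension.
    return ["lynx", "-dump", "-force_html", str(html_path)]


def tidy_html(raw_html):
    """Return tidy's cleaned-up version of `raw_html`."""
    result = subprocess.run(
        tidy_command(), input=raw_html,
        capture_output=True, text=True, encoding="utf-8",
    )
    # tidy exits 0 when clean, 1 on warnings, 2 on errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout


def html_file_to_text(html_path):
    """Return the lynx plain-text rendering of an HTML file on disk."""
    result = subprocess.run(
        lynx_command(html_path),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout


def tools_available():
    """True only if both external programs are on the PATH."""
    return all(shutil.which(tool) for tool in ("tidy", "lynx"))
```

Keeping the command lists in their own functions makes it easy to try
different flags without touching the plumbing.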
Once I got the script to the point of providing the website data in a
reliable and eventually parse-able way, then I turned to getting all my
posts.
I cranked the item limit of the prjorgensen.com RSS feed up to 20,000
so that the feed briefly included all of my posts. I moved my parsing
script to my MacBook Pro because I didn't want to choke the sdf.org
servers with my madness. I installed the modules it needs and adapted
the script to run on the MBP.
I ran the script. I checked my email. I then got up to … hmmm. The
script finished in under two minutes. Suddenly I had all of my posts
back to 2011 in both very clean HTML and in plain text. I synced them
to their proper home. I reset my website feed back to a more reasonable
number.
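For the bulk run, each post needs a stable file name for its HTML and
text copies. The sketch below shows one possible scheme, a date prefix
plus a slug of the title. The naming convention and function names are
assumptions of mine, not what the script actually does;
`published_parsed` is the `time.struct_time` that feedparser attaches
to each entry.

```python
import re
import time
from pathlib import Path


def slugify(title):
    """Lowercase the title and collapse anything non-alphanumeric to hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def post_basename(title, published_parsed):
    """Build a sortable name such as 2018-06-08-plain-text-life."""
    return time.strftime("%Y-%m-%d", published_parsed) + "-" + slugify(title)


def write_post(out_dir, title, published_parsed, html, text):
    """Write the HTML and plain-text copies of one post side by side."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    base = post_basename(title, published_parsed)
    html_path = out_dir / (base + ".html")
    text_path = out_dir / (base + ".txt")
    html_path.write_text(html, encoding="utf-8")
    text_path.write_text(text, encoding="utf-8")
    return html_path, text_path
```

Date-prefixed names sort chronologically in a directory listing, which
suits both a bare HTML index and a gopher menu.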
There are any number of improvements I can make:
  * My script does not grab images
  * I capture categories and tags from WordPress but don't do anything
    useful with them
  * I need to include modifying my gophermap and my index.html (as
    appropriate)
  * A full text RSS feed of the plain HTML site
  * A full text RSS feed of the gopher site
  * Maybe use a static web site generator like Jekyll for the plain
    HTML site
  * Maybe use this for tokyogringo.com and PVCSec.com? If so, then I
    need to handle …
       * Media enclosures
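The gophermap item above could be handled by generating the menu from
the list of exported posts. This is a sketch of that not-yet-written
step, assuming the common gophermap layout of type character plus
label, selector, host, and port separated by tabs. The host, port, and
base selector are taken from the gopher URL above; the function names
are hypothetical.

```python
def gophermap_line(title, selector, host="sdf.org", port=70, item_type="0"):
    """Format one menu entry; item type 0 marks a plain text file."""
    # Tabs delimit the fields, so they must not appear inside the label.
    label = title.replace("\t", " ")
    return f"{item_type}{label}\t{selector}\t{host}\t{port}"


def build_gophermap(posts, base_selector="/users/tokyogringo/"):
    """`posts` is an iterable of (title, filename) pairs, newest first."""
    lines = [
        gophermap_line(title, base_selector + filename)
        for title, filename in posts
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned string to the gophermap file after each sync
would keep the menu in step with the exported posts.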
Watch this space for the link to my script on GitHub. Which is [3]here!
__________________________________________________________________
My original entry is here: [4]Plain text life, including gopher. It
posted Fri, 08 Jun 2018 11:51:38 +0000.
Filed under: administrivia, tech
References
1. http://tokyogringo.myjp.net/
2. gopher://sdf.org/0/users/tokyogringo/
3. https://github.com/zenshinji/gopher-parser
4. https://www.prjorgensen.com/?p=1203