From:
[email protected]
Date: 2018-04-17
Subject: Evernote Extraction
I take notes all the time. I love having access to my notes
wherever I go. Evernote does that. However, I've become
increasingly dissatisfied with the complexity of their client
software. Also, they recently stopped supporting Geeknote, a CLI
client. [1] Geeknote has its own problems, so maybe it's time to
make a change.
After evaluating a number of solutions, I settled on vimwiki. [2]
Vimwiki will let me manage my information in plaintext and I can
even publish an HTML version of it. My entire collection of notes
should be small enough that I can pull everything down to my phone.
Now I just have to extract my data from Evernote. Easy, right?
Evernote doesn't make a desktop client for Linux, so I fired up my
Mac Mini, since I need to use the desktop client to export my data.
I exported each of my notebooks into a separate enex file
(Evernote's XML format). Looking at it, I wonder if it's even
valid XML. How am I going to get my data out of here?
My first move is to install html-xml-utils. After experimenting
with `hxpipe` and `hxextract`, it seems like html-xml-utils are
more about manipulating html/xml while retaining the format, not
filtering the data away from the format.
I had a quick chat with tomasino [3] and he referred me to
ever2simple [4]. Ever2simple is a tool that aims to help people
migrate from Evernote to Simplenote. After some trial and error,
I was able to install ever2simple, but I first had to install
python-pip, python-libxml2, python-lxml, and python-lxslt.
I'm starting with one of my smallest notebooks, a journal, just so
I can prove the concept. I want to migrate these journal entries
to my journal.txt file that I maintain with jrnl. [5] I tried the
`-f dir` option first, hoping this would just give me a folder full
of text files. That's exactly what it does, but there's no
metadata. I need the timestamps. Using ever2simple with the `-f json`
option gives me my metadata, but now everything is in a huge JSON
stream. After some experimentation with sed, I conclude that sed
is not the right tool for this job.
I remember hearing about something called `jq` that should let me
work with JSON. The apt package description for `jq` starts with,
"jq is like sed for JSON...". Well, I'm sold. Also, no
dependencies! What a bonus. The man page is full of explanations
and examples, but I'm going to need to experiment with the filters.
After some experimentation, I land on
jq '.[] | .createdate,.content' journal.json
This cycles through each top-level element and extracts the
createdate and content values. Now I wonder how I can add a
separator so that I can dissect the data into discrete files with
awk or something. I should be able to add a literal to the list of
filters.
jq '.[] | .createdate,.content,"%%"' journal.json
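To sanity-check the filter shape, here it is on a toy input (the
field names match ever2simple's output; the file name is just
illustrative):

```shell
# Build a minimal JSON file shaped like ever2simple's output.
printf '[{"createdate":"Jul 25 2011 14:30:00","content":"note body"}]' > sample.json
# Cycle through the array, emitting createdate, content, and a literal separator.
jq '.[] | .createdate,.content,"%%"' sample.json
# "Jul 25 2011 14:30:00"
# "note body"
# "%%"
```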
Well, the %% lines include the quotes, but that's not the end of
the world. I wonder what date format I need for jrnl. Each jrnl
entry starts with
YYYY-MM-DD HH:MM Title
Evernote gives me dates that look like
Jul 25 2011 HH:MM:SS
`date --help` to the rescue!
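A quick check with GNU date shows the conversion is doable at the
command line, too (assuming GNU coreutils; the timestamp is a
made-up example):

```shell
# -d parses the Evernote-style string; the + format emits jrnl's style.
date -d "Jul 25 2011 14:30:00" +"%Y-%m-%d %H:%M"
# prints: 2011-07-25 14:30
```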
Looking at date handling in `jq`, I should be able to convert the
dates from the format used by Evernote to the format used by jrnl
with the filter
strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")
All together, then.
jq '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json
I still have some garbage in there, but I'm getting close to being
able to just prepend this to my journal.txt file. OK, I'm close
enough with this:
jq '.[] | (.createdate|strptime("%b %d %Y %H:%M:%S")|strftime("%Y-%m-%d %H:%M")),.content,"%%"' journal.json | sed -e 's/^"//;s/"$//;s/\\n/\n/g' | sed -e '/^ *$/d' >journal.part
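The sed stage is doing two jobs: stripping jq's surrounding quotes
and turning literal \n sequences back into real newlines. On a
fabricated one-line sample:

```shell
# printf '%s\n' passes its argument through literally (no \n expansion),
# so sed sees a quoted string containing a backslash-n sequence.
printf '%s\n' '"line one\nline two"' | sed -e 's/^"//;s/"$//;s/\\n/\n/g'
# prints:
# line one
# line two
```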
Okay, let's try the recipes notebook. My recipes notebook should
be a little more challenging than my journal entries, but it's not
as massive as my main notebook.
ever2simple -f json -o recipes.json recipes.enex
My journal json file was 5k. This one is 105k. Running the same
command as before gives me pretty legible output. I know some of
these notes had attachments, but I don't see them in the JSON. I
wonder if they are mime-encoded in the XML file.
Looking back at my recipes.enex file, attachments do appear to be
base64 encoded in the XML, but ever2simple doesn't copy this data
into the JSON file it creates. This makes sense since its target
is Simplenote. Maybe html-xml-utils can help me get these files
out.
hxextract 'resource' recipes.enex
It looks like the files are encapsulated within resource elements.
The resource element contains metadata about the attachment and the
base64-encoded data itself is inside a data element. I can isolate
the data using hxselect.
hxselect -c -s '\n\n' data < recipes.enex > recipes.dat
This gives me all the mime attachments in a single file. Each
base64-encoded file is separated by two newlines. This doesn't
preserve my metadata, but I'm anxious to get the data out and see
what's in there. Let's see if I can pipe the first one into base64
-d to decode it. An awk one-liner should let me terminate output
at the first blank line.
awk '/^$/ {exit}{print $0}' recipes.dat | base64 -d > testfile
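The same one-liner on a toy two-record file shows the cutoff at the
first blank line ('aGVsbG8=' is just 'hello' in base64):

```shell
# Two base64 blocks separated by a blank line, like recipes.dat.
printf 'aGVsbG8=\n\nd29ybGQ=\n' > sample.dat
# Stop at the first blank line, then decode what came before it.
awk '/^$/ {exit}{print $0}' sample.dat | base64 -d
# prints: hello
```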
Now I can use `file` to find out what kind of file it is.
file testfile
This tells me that it's an image. A JPEG, to be specific, and it's
300 dpi and 147x127. That seems small. I wonder if Evernote
encoded all of the images that were in the html pages I saved.
Opening the file in an image viewer, I can see that that's exactly
what it is. How many attachments are in there? Could I...
sed -e '/^./d' recipes.dat | wc
Damn, that's slick. There are 74 files in there. I'll bet only a
handful of them have any value to me. I think the easiest way to
go forward is to copy each base64 attachment into its own file.
Looking at split(1), it splits on line count, not a delimiter.
What if I do something like...
#!/usr/bin/awk -f
# Split blank-line-separated base64 blocks into dump/1.base64, 2.base64, ...
BEGIN {fcount=1}
/^$/ {fcount++; next}    # blank separator: advance to the next file, skip it
{ print $0 >> "dump/" fcount ".base64" }
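One caveat when running it: the script appends into a dump/
directory that has to exist first (split.awk is just the name I'm
assuming for the saved script):

```shell
mkdir -p dump                # the script appends, so the directory must exist
awk -f split.awk recipes.dat
ls dump                      # one numbered .base64 file per attachment
```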
This goes through my recipes.dat file and puts each base64-encoded
attachment into its own file. Now I need to decode them and give
them an appropriate suffix.
#!/bin/bash
# Decode each dump and rename it with a suffix based on file(1)'s output.
for f in dump/*.base64
do
  outfile="${f%.*}.out"
  base64 -d "${f}" > "${outfile}"
  type=$(file "${outfile}")   # e.g. "dump/1.out: JPEG image data, ..."
  type="${type#* }"           # drop the "name: " prefix
  type="${type%% *}"          # keep the first word of the type
  newout="${outfile%.out}.${type}"
  mv "$outfile" "$newout"
done
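The two parameter expansions lean on file's "name: TYPE detail..."
output shape; checked in isolation on a canned string:

```shell
type="dump/1.out: JPEG image data, JFIF standard 1.01"
type="${type#* }"     # shortest prefix match: drops "dump/1.out: "
type="${type%% *}"    # longest suffix match: keeps only the first word
echo "$type"
# prints: JPEG
```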
Phew! Now I have 74 files to look through. Most of these are
garbage from web pages I saved. There are really only five of them
that I want to keep. There are a few problems with this approach:
* I lose the original file name.
* I use the file utility to reconstruct the filename extension.
* I lose the association between the file and the note.
This has been a lot of work, and there's a lot more to be done.
Looking at my main notebook, I may revisit ever2simple's `-f dir`
option. I could even look at the source and see if there's a way
to tack on metadata.
I assume there are better ways to go about this, but I love
challenges like this because they're an excuse to learn new tools
and get better at the ones I'm already familiar with. Next time,
I'll show you how I migrate this information to vimwiki.
## References
1. http://www.geeknote.me/
2. https://vimwiki.github.io/
3. gopher://gopher.black
4. https://github.com/claytron/ever2simple
5. http://jrnl.sh