This is a text-only version of the following page on https://raymii.org:
---
Title       :   Word occurrence counter and analyzer
Author      :   Remy van Elst
Date        :   07-03-2013
URL         :   https://raymii.org/s/articles/Word_occurrence_counter_and_analyzer.html
Format      :   Markdown/HTML
---



With these commands you can analyze a text file. It will count all the
occurrences of all words and put out the stats. It is usefull for song lyrics,
books, notes and everything. It helps me analyze my writing style, which words
do I use more often, where are my spelling errors and such. It is also nice to
win an argument against someone over a dragonforce song. This example will use
lyrics as example, but it is applicable to all text files.

<p class="ad"> <b>Recently I removed all Google Ads from this site due to their invasive tracking, as well as Google Analytics. Please, if you found this content useful, consider a small donation using any of the options below:</b><br><br> <a href="https://leafnode.nl">I'm developing an open source monitoring app called  Leaf Node Monitoring, for windows, linux & android. Go check it out!</a><br><br> <a href="https://github.com/sponsors/RaymiiOrg/">Consider sponsoring me on Github. It means the world to me if you show your appreciation and you'll help pay the server costs.</a><br><br> <a href="https://www.digitalocean.com/?refcode=7435ae6b8212">You can also sponsor me by getting a Digital Ocean VPS. With this referral link you'll get $100 credit for 60 days. </a><br><br> </p>


##### Get the Lyrics (text)

First get the lyrics, or the text you want to analyze into a text file. I've
heard nano, vi(m) and emacs are quite good with text. In this song I will use a
song by Dragonforce. It does not matter which one because they're all full of
the same words.

My lyrics file is named: `df1.txt`

##### Sanitize them

The tools we are going to use do not like all those comma's, colons, exclamation
marks and weird non-alphanumeric characters. So sanitize the file like this:



   cat df1.txt | tr -cd '[:alnum:] [:space:]' > df1san.txt


What this does is pump the file through the tr command, that command (with these
arguments) strips everything which is not a-zA-Z0-9 or a space. Exactly what we
want.

##### Analyze it Now we do the magic:



   sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' dfsan.txt | sort | uniq -c | sort -nr | head -n 20



   remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' dfsan.txt | sort | uniq -c | sort -nr | head -n 20
   72 the
   32
   25 and
   22 of
   20 in
   17 we
   16 on
   14 our
   13 a
   8 were
   8 lost
   8 for
   7 will
   7 still
   7 light
   6 to
   6 so
   6 fire
   6 far
   5 through


### Other Example

#### on my class notes about blood and the immune system



   remy@vps8:~$ cat afweer.txt | tr -cd '[:alnum:] [:space:]' > afweersan.txt
   remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' afweersan.txt | sort | uniq -c | sort -nr | head -n 20
   195
   108 de
   80 een
   72 van
   65 het
   51 in
   46 is
   40 en
   24 zijn
   24 op
   24 afweer
   22 die
   20 vraag
   20 deze
   19 worden
   18 kan
   17 bij
   16 dit
   15 er
   14 of


After stripping it of the non-usefull words:



   remy@vps8:~$ cat afwres.txt | head -n 10
   24 afweer
   14 cellen
   11 bacterin
   9 waar
   9 reactie
   9 antigeen
   8 specifieke
   7 milieu
   7 lymfocyten
   7 lichaam


#### Fabian Scherschels NanoWriMo 2011 Book: Nightwatch

[GIT tree of the book][2] & [NaNoWiMo page][3] Book is `Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported License`



   1020 the
   454 he
   421 and
   418 of
   357 to
   347 had
   297 a
   267 was
   257 his
   241 that
   216 in
   132 it
   130 marc
   112 him
   108 as
   105 this
   105 they
   93 with
   90 but
   82 were
   82 from
   82 been
   82 at
   74 on
   70 would
   68 for
   68 could
   56 their
   56 be
   53 out
   51 into
   50 man
   49 all
   48 there
   48 so
   48 by
   47 looked
   46 not
   44 up
   44 them
   44 like


#### Analyzing IP and log files

Today I found another usefull use for this command. Analyzing IP adresses. First
I grepped my entire lighttpd log file:

cat access.log | egrep -o
'[[:digit:]]{1,3}.[[:digit:]]{1,3}.[[:digit:]]{1,3}.[[:digit:]]{1,3}' | tr
[:space:] '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr

(egrep -o spits out only the IP adress, not the whole line on which the IP
adress is on)

That gives out this nice list (this list is made up, not real IP adresses):



   2 83.64.150.248
   2 94.0.74.75
   2 94.142.55.252
   2 95.237.133.3
   2 98.225.130.26
   3 108.100.28.45
   3 213.93.70.87
   5 81.30.145.69
   348 66.228.43.247
   467 173.255.236.50


[Thanks to the wonderfull community at stackexchange][4]

  [1]: https://www.digitalocean.com/?refcode=7435ae6b8212
  [2]: https://gitorious.org/nano2011
  [3]: http://www.nanowrimo.org/en/participants/fabsh
  [4]: http://unix.stackexchange.com/questions/39039/get-text-file-word-occurrence-count-of-all-words-print-output-sorted

---

License:
All the text on this website is free as in freedom unless stated otherwise.
This means you can use it in any way you want, you can copy it, change it
the way you like and republish it, as long as you release the (modified)
content under the same license to give others the same freedoms you've got
and place my name and a link to this site with the article as source.

This site uses Google Analytics for statistics and Google Adwords for
advertisements. You are tracked and Google knows everything about you.
Use an adblocker like ublock-origin if you don't want it.

All the code on this website is licensed under the GNU GPL v3 license
unless already licensed under a license which does not allows this form
of licensing or if another license is stated on that page / in that software:

   This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program.  If not, see <http://www.gnu.org/licenses/>.

Just to be clear, the information on this website is for meant for educational
purposes and you use it at your own risk. I do not take responsibility if you
screw something up. Use common sense, do not 'rm -rf /' as root for example.
If you have any questions then do not hesitate to contact me.

See https://raymii.org/s/static/About.html for details.