2018-04-09
___D_e_a_l_i_n_g__w_i_t_h__r_o_g_u_e__c_r_a_w_l_e_r_s________________

Today I got hit by a crawler that thinks indexing all of my stagit
repo pages is a good idea.

Now I am unsure about the usefulness of a robots.txt file: if someone
wants to access a selector, or all of them, that is fine by me. As
long as my data volume limit is not hit, let them access my
selectors.

But I have seen a lot of spiders requesting selectors that aren't
valid, and I think one needs to deal with this properly. I am
implementing the following steps:

  * Add a pf(1) table for greylisting potential spammers

  * Add some tarpit selectors that will trigger another check
  against the table whether the calling IP is in the greylist.

  * If the calling IP is in the greylist and hits a bogus
  selector again, move it to the blacklist

  * Blacklisted IPs will get blocked from the system entirely for X
  hours

  * The tarpit daemon will slowly respond to each request with a
  huge, potentially never-ending text file stating some explanation
  and then hang up

  * A cron job will clean up the blacklist after a while.

So how to do this with pf(1)? Turns out to be quite easy:

'''pf.conf
  table <spammers-black> persist
  block in on egress proto tcp from <spammers-black> to any port 70
'''
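
While debugging, the table contents can be inspected and tweaked with
the standard pfctl(1) table commands; the address below is just a
documentation-range example:

'''shell
  # list the current contents of the table
  pfctl -t spammers-black -T show

  # manually add or remove a single address
  pfctl -t spammers-black -T add 203.0.113.5
  pfctl -t spammers-black -T delete 203.0.113.5
'''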

The entries can be filled with pfctl(1); I am using a simple script
called update-pf:

'''shell
  # pfctl -t spammers-black -T replace -f /var/gopher/blacklist
'''

Entries are deleted again with the '-T expire <seconds>' command. The
former will be done within the trap CGI and the latter in a cronjob.
Note that this script is for geomyidae; other servers may not provide
REMOTE_ADDR. Check the documentation (or better, the source!) of your
gopher server.
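
On geomyidae the trap is then just a menu entry that points at the
CGI. As a rough sketch only (the /pit/ path and the trap.cgi name are
invented here, and this assumes geomyidae's gph menu format), such an
entry could look like:

'''gph
[1|Some uninteresting content (do not follow!)|/pit/trap.cgi|server|port]
'''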


The CGI:

'''shell
#!/bin/ksh

# Make sure the list files exist on the first run.
touch /var/gopher/greylist /var/gopher/blacklist

# First offence: greylist the caller. Second offence: move it from
# the greylist to the blacklist. -F matches the address literally
# (dots in an IP are regex metacharacters), -x the whole line.
if ! grep -qFx "$REMOTE_ADDR" /var/gopher/greylist; then
       echo "$REMOTE_ADDR" >> /var/gopher/greylist
else
       sed -i.bak "/^$REMOTE_ADDR\$/d" /var/gopher/greylist
       echo "$REMOTE_ADDR" >> /var/gopher/blacklist
fi

# Reload the pf table with the new blacklist, then stall the caller.
doas /sbin/update-pf 2>/dev/null
gopher-tarpit
'''
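
The script can be exercised outside of the gopher server by setting
REMOTE_ADDR by hand; trap.cgi is just a made-up name for the script
above, and the address is again a documentation-range example:

'''shell
  # first call greylists the address, the second blacklists it
  REMOTE_ADDR=203.0.113.5 ./trap.cgi
  REMOTE_ADDR=203.0.113.5 ./trap.cgi
'''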

The gopher tarpit is just a dumb program that sends its output
slowly; you can use anything, really. Please adjust the server
settings to your needs. It sends some selectors pointing to the CGI
again:

'''c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* A gopher menu: two info lines, two links back into the pit, and
 * the terminating ".". Info lines carry dummy selector/host/port. */
char message[] = "iHi this is a tarpit...\tInfo\tvernunftzentrum.de\t70\r\n"
       "iFollow any of the links below or this selector again, and you will be banned\tInfo\tserver\tport\r\n"
       "1Some uninteresting content (do not follow!)\t/pit/\tvernunftzentrum.de\t70\r\n"
       "1More uninteresting content (do not follow!)\t/pit/\tvernunftzentrum.de\t70\r\n"
       ".\r\n";

int
main (void)
{
       size_t l = strlen(message);

       /* Drip the menu out one byte per second to stall the crawler. */
       for (size_t i = 0; i < l; i++) {
               putchar(message[i]);
               fflush(stdout);
               sleep(1);
       }
       return 0;
}
'''
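
Assuming the source lives in tarpit.c, it builds with the base
compiler, no special flags needed:

'''shell
cc -o gopher-tarpit tarpit.c
'''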

On OpenBSD not everyone can alter the packet filter config, so I put
the pfctl call into a script and allow this in doas.conf:

'''shell
#!/bin/ksh

pfctl -t spammers-black -T replace -f /var/gopher/blacklist
'''

'''doas.conf
permit nopass :_geomyidae cmd /sbin/update-pf
'''
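
doas(1) can verify the rule without running anything: with -C it
parses the given config and prints the action that would apply to the
invoking user and command.

'''shell
doas -C /etc/doas.conf /sbin/update-pf
'''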

With that you have shut out the rogue crawler. Now let's free the
IPs again:

'''shell
pfctl -t spammers-black -T expire 7200
'''

Put that in a cronjob. Adjust the time value to taste.
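
As a sketch, the corresponding root crontab entry, running the
cleanup once per hour, could look like this:

'''crontab
# expire blacklist table entries older than two hours
0 * * * * pfctl -t spammers-black -T expire 7200
'''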

That sums it up for this little proof of concept. Please don't
deploy this 1:1. I encourage you to make an educated decision on
whether it really is necessary. If it is, you now hold the seed for
a cure to your problems.

I would like to thank __20h__ for cross checking the text (modulo the
pf commands). All mistakes are mine.

Thanks for reading!

_____________________________________________________________________