
Blocking invasive web crawlers from my site

Today, I decided to take a peek at my web server access logs and saw that two bots were attempting to index literally anything they could find on my sites: following links, tacking query parameters onto the URLs, etc. Take a look at the example below:

git.snowcake.me:443 85.208.96.210 - - [18/Sep/2023:19:44:04 -0400] "GET /mirrors/Bibata_Cursor/issues?assignee=1&labels=0&milestone=0&poster=0&project=0&q&sort=mostcomment&state=open&type=all HTTP/1.1" 200 9511 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 85.208.96.207 - - [18/Sep/2023:19:45:01 -0400] "GET /mirrors/Atmosphere/commit/fca213460bcd8cd826dc507769ee5100695d496e HTTP/1.1" 200 21244 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 85.208.96.209 - - [18/Sep/2023:19:45:21 -0400] "GET /primrose/switch-sigpatches/pulls?assignee=0&labels&milestone=-1&poster=0&project=0&q&sort=farduedate&state=closed&type=all HTTP/1.1" 200 9509 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 185.191.171.9 - - [18/Sep/2023:19:45:27 -0400] "GET /mirrors/hbc-archive/issues?assignee=1&labels=0&milestone=-1&poster=0&project=-1&q&sort=farduedate&state=open&type=all HTTP/1.1" 200 9551 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 85.208.96.207 - - [18/Sep/2023:19:46:11 -0400] "GET /primrose/emsite/pulls?assignee=1&labels&milestone=0&project=-1&q&sort=farduedate&state=closed&type=all HTTP/1.1" 200 9406 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 185.191.171.6 - - [18/Sep/2023:19:47:31 -0400] "GET /mirrors/apple_cursor/issues?labels&milestone=0&poster=0&project=0&q&sort=farduedate&state=open&type=all HTTP/1.1" 200 9463 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 185.191.171.14 - - [18/Sep/2023:19:47:45 -0400] "GET /mirrors/wii/commit/2a16bf72f527e66eef7689469d5bdc95228b74de?show-outdated&style=unified&whitespace=ignore-eol HTTP/1.1" 200 16062 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 185.191.171.8 - - [18/Sep/2023:19:47:53 -0400] "GET /mirrors/wii/issues?assignee=-1&labels=0&poster=0&q&sort=oldest&state=closed&type=all HTTP/1.1" 200 9527 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 51.222.253.1 - - [18/Sep/2023:19:49:40 -0400] "GET /mirrors/speedie-page/commit/17c140a78744619953b9ad0406842942e980088d HTTP/1.1" 200 15480 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"

This nonsense of attempting to index all the git repositories on my Forgejo instance goes on and on in the log for tons of lines at a time. If you look closer, some of the GET requests are pulling full commit diffs, and of course, this will eventually add up and waste my bandwidth.

There is also another notable bot that I found while perusing my access logs:

snowcake.me:443 162.216.149.224 - - [18/Sep/2023:19:12:03 -0400] "GET / HTTP/1.1" 200 2453 "http://98.116.68.200:80/" "Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com"

I am not even a customer of Palo Alto Networks, so this is just yet another annoyance for me.

To solve these issues, I simply put the following into each virtual host config for the nginx reverse proxy:

if ($http_user_agent ~* "semrushbot|dotbot|expanse|palo|alto|gptbot") {
    return 403;
}
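If you want to verify a rule like this, you can spoof one of the blocked user agents with curl; after reloading nginx, a request like the one below should come back with a 403:

curl -I -A "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)" https://git.snowcake.me/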

This works really well, and because it catches the annoying bots at the reverse proxy level, the other virtual machines that run my Apache web servers don't need to waste their resources processing these bogus requests.
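One thing worth noting: since the same if line gets duplicated into every virtual host, the list could instead live in one place using nginx's map directive. This is just a sketch of that approach, not what I'm actually running (the $blocked_bot variable name is made up for the example):

# in the http { } context, shared by every virtual host
map $http_user_agent $blocked_bot {
    default         0;
    "~*semrushbot"  1;
    "~*dotbot"      1;
    "~*expanse"     1;
    "~*palo alto"   1;  # the Expanse UA string; my original rule matched "palo" and "alto" separately
    "~*gptbot"      1;
}

# then in each server { } block
if ($blocked_bot) {
    return 403;
}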

I also put GPTBot on the blacklist, because fuck OpenAI and fuck artificial intelligence in general. There are probably a ton of other crawlers used to gather training data for AI models, but I am not sure where to find a big list of them. If you happen to know of some I can add to my blacklist, please send me an email at primrose@snowcake.me. Thanks!
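For what it's worth, a couple of AI crawlers at least claim to honor robots.txt: OpenAI documents the GPTBot user agent token, and CCBot is Common Crawl's crawler, whose dumps are a common training-data source. Entries like these are only a polite request though, which is why the nginx rule above does the actual enforcement:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /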

Signed,

Primrose


Stray has got to be the best PC game I've ever played

Yesterday, I finished the game Stray. For those who don't know, Stray is a game about a stray ginger cat exploring a hidden cyberpunk underground city, looking for a way back to the outside after getting separated from its group of cat friends. In my opinion, the storyline is one of the best I've ever seen in a game, and the graphics and soundtrack are amazing. You get to do so much along the way, and the ending left me in tears. I can't go into too many details without spoiling it for those who haven't played yet, though.

What's even more surprising is that this is BlueTwelve Studio's first game, which is absolutely astonishing considering the quality. It took me about 6 hours to get through everything, but it was worth it. Maybe I'm biased because I love cats, but even if you took the cats out of the game, the quality of everything else would still be remarkable. I highly recommend Stray if you don't know what else to play and are looking for a game with a good storyline.

You can find more information on the game at its official website, linked here.

Signed,

Primrose