Blocking invasive web crawlers from my site

Today, I decided to take a peek at my web server access logs, and saw that two bots were attempting to index literally anything they could find on my sites: following links, tacking query parameters onto the URLs, etc. Take a look at the example below:

git.snowcake.me:443 85.208.96.210 - - [18/Sep/2023:19:44:04 -0400] "GET /mirrors/Bibata_Cursor/issues?assignee=1&labels=0&milestone=0&poster=0&project=0&q&sort=mostcomment&state=open&type=all HTTP/1.1" 200 9511 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 85.208.96.207 - - [18/Sep/2023:19:45:01 -0400] "GET /mirrors/Atmosphere/commit/fca213460bcd8cd826dc507769ee5100695d496e HTTP/1.1" 200 21244 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 85.208.96.209 - - [18/Sep/2023:19:45:21 -0400] "GET /primrose/switch-sigpatches/pulls?assignee=0&labels&milestone=-1&poster=0&project=0&q&sort=farduedate&state=closed&type=all HTTP/1.1" 200 9509 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 185.191.171.9 - - [18/Sep/2023:19:45:27 -0400] "GET /mirrors/hbc-archive/issues?assignee=1&labels=0&milestone=-1&poster=0&project=-1&q&sort=farduedate&state=open&type=all HTTP/1.1" 200 9551 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 85.208.96.207 - - [18/Sep/2023:19:46:11 -0400] "GET /primrose/emsite/pulls?assignee=1&labels&milestone=0&project=-1&q&sort=farduedate&state=closed&type=all HTTP/1.1" 200 9406 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 185.191.171.6 - - [18/Sep/2023:19:47:31 -0400] "GET /mirrors/apple_cursor/issues?labels&milestone=0&poster=0&project=0&q&sort=farduedate&state=open&type=all HTTP/1.1" 200 9463 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 185.191.171.14 - - [18/Sep/2023:19:47:45 -0400] "GET /mirrors/wii/commit/2a16bf72f527e66eef7689469d5bdc95228b74de?show-outdated&style=unified&whitespace=ignore-eol HTTP/1.1" 200 16062 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 185.191.171.8 - - [18/Sep/2023:19:47:53 -0400] "GET /mirrors/wii/issues?assignee=-1&labels=0&poster=0&q&sort=oldest&state=closed&type=all HTTP/1.1" 200 9527 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
git.snowcake.me:443 51.222.253.1 - - [18/Sep/2023:19:49:40 -0400] "GET /mirrors/speedie-page/commit/17c140a78744619953b9ad0406842942e980088d HTTP/1.1" 200 15480 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"

This nonsense of attempting to index every git repository on my Forgejo instance goes on and on in the log, tons of lines at a time. If you look closer, some of the GET requests are fetching full commit diffs, and of course, this will eventually add up and waste my bandwidth.
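If you want to see just how bad it is on your own server, something like the following tallies requests per user agent (assuming nginx's standard combined log format and the default log path, which may differ on your setup; splitting each line on double quotes puts the user agent string in the sixth field):

awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head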

There is also another notable bot that I found while perusing my access logs:

snowcake.me:443 162.216.149.224 - - [18/Sep/2023:19:12:03 -0400] "GET / HTTP/1.1" 200 2453 "http://98.116.68.200:80/" "Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com"

I am not even a customer of Palo Alto Networks, so this is just yet another annoyance for me.

To solve these issues, I simply put the following into each virtual host config for the nginx reverse proxy:

# Return 403 for any user agent matching a known bad crawler
if ($http_user_agent ~* "(semrushbot|ahrefsbot|dotbot|expanse|palo alto|gptbot)") { return 403; }
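To check that the rule actually works, you can fake a bot user agent with curl and make sure the 403 comes back:

curl -I -A 'Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)' https://git.snowcake.me/

The -A flag sets the User-Agent header and -I fetches only the response headers, which should now show a 403 instead of a 200.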

This works really well, and because the annoying bots get caught at the reverse proxy, the virtual machines running my Apache web servers behind it don't need to waste their resources processing these bogus requests.
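Since the same if block gets copy-pasted into every virtual host, a slightly tidier variant (just a sketch of the same idea; $bad_bot is a name I made up for the example) would be to compute a flag once with nginx's map directive in the http block:

map $http_user_agent $bad_bot {
    default 0;
    "~*(semrushbot|ahrefsbot|dotbot|expanse|palo alto|gptbot)" 1;
}

and then check it in each virtual host:

if ($bad_bot) { return 403; }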

I also put GPTBot into the blacklist, because fuck OpenAI and fuck artificial intelligence in general. There are probably a ton of other crawlers used to gather training data for AI models, but I am not sure where to find a big list of them. If you happen to know of some I can add to my blacklist, please send me an email at primrose@snowcake.me. Thanks!
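In the meantime, CCBot (Common Crawl's crawler, whose dumps are a common source of AI training data) and ChatGPT-User (OpenAI's browsing agent) are two more that can be blocked the same way:

if ($http_user_agent ~* "(semrushbot|ahrefsbot|dotbot|expanse|palo alto|gptbot|ccbot|chatgpt-user)") { return 403; }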

Signed,

Primrose