Nginx: managing bot connections

In this tutorial, I will explain how to manage bots and limit their impact on your websites with Nginx, especially when you use it as a reverse proxy.

Today, more and more bots, also called crawlers, browse websites on the internet, for various purposes:

  • Indexing (Google, Bing)
  • AI (OpenAI, Claude …)
  • Data collection

Indexing robots generally optimize their crawl of your website and do not consume too many resources. Other bots, however, are very poorly optimized, such as Amazon's AI bot, which can consume a lot of resources and, in some cases, make your site unavailable.

With some hosting services, you also need to take resource usage into account, such as bandwidth, which is consumed and generates an additional cost.

Most robots are supposed to read the robots.txt file located at the root of your site, where you can give instructions to limit their impact. In practice, some crawlers do not respect these directives, so we end up having to use the web server configuration (Nginx) to block them.
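For reference, a minimal robots.txt giving such instructions might look like this (the bot names and values here are purely illustrative):

```txt
# robots.txt at the root of the site
User-agent: AhrefsBot
Crawl-delay: 10

User-agent: Amazonbot
Disallow: /

User-agent: *
Crawl-delay: 5
```

Well-behaved crawlers will honor these directives; the rest of this tutorial deals with those that do not.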

In this tutorial, we will look at several approaches: request limiting, outright blocking, serving robots.txt from the configuration, and blocking from a community-maintained user-agent list.

Limit the number of requests with Nginx

The first limitation we can put in place is to cap the number of requests per second or per minute that a robot can make.

To do this, we will use the following parameters:

  • limit_req_zone: defines the key, the shared memory zone and the rate limit
  • limit_req: applies the zone in a server or location context
  • limit_req_status: the HTTP status code returned when the limit is exceeded

Official documentation: https://nginx.org/en/docs/http/ngx_http_limit_req_module.html

In your virtual host configuration file, we'll start by declaring the bots by user agent, along with the restriction. This must be placed before the server { ... } block (i.e., at the http level).

map $http_user_agent $bad_crawlers_rdritcom {
    ~*(bytespider|amazonbot|claudebot|DotBot|petalbot) $http_user_agent;
    default "";
}
limit_req_zone $bad_crawlers_rdritcom zone=badcrawlerrdritcom:10m rate=1r/m;

Here, any robot whose user agent contains bytespider, amazonbot, etc. will be limited to 1 request per minute. All other user agents map to an empty string, and requests with an empty key are not limited.

Now, in the server { ... } block, we will apply the limitation:

limit_req zone=badcrawlerrdritcom;
limit_req_status 429;

This references the previously declared zone and sets the status code returned when the limit is exceeded (429 Too Many Requests).
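If a hard limit feels too strict, limit_req also accepts burst and nodelay parameters to absorb short spikes instead of rejecting them immediately. A variation along these lines (the burst value is illustrative):

```nginx
# Allow up to 5 requests above the rate before returning 429
limit_req zone=badcrawlerrdritcom burst=5 nodelay;
limit_req_status 429;
```

With nodelay, the burst requests are served immediately rather than queued, and only traffic beyond the burst is rejected.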

Test the Nginx configuration:

sudo nginx -t

If you don’t encounter any errors, reload the configuration:

sudo systemctl reload nginx

With this configuration, you will be able to limit the number of requests and reduce the impact of bots.

Block bots with Nginx

Another, slightly more aggressive approach is to simply block requests with a 403 return code.

Here is an example of a configuration that should be placed in the server { ... } block.

if ($http_user_agent ~* (bytespider)) {
    return 403;
}

if ($http_user_agent ~* (claudebot)) {
    return 403;
}

if ($http_user_agent ~* (amazonbot)) {
    return 403;
}

This configuration will block the bytespider, claudebot and amazonbot robots by returning a 403 error code.
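The three if blocks do the same thing and can be collapsed into a single one with a regex alternation, which is easier to maintain as the list grows:

```nginx
if ($http_user_agent ~* (bytespider|claudebot|amazonbot)) {
    return 403;
}
```

Since the ~* match is case-insensitive and unanchored, this matches the bot names anywhere in the user agent string.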

Check the Nginx configuration:

sudo nginx -t

Reload the configuration:

sudo systemctl reload nginx

You now know how to block bots with Nginx.

Serving the robots.txt file from the configuration

A final possible solution is to generate the robots.txt file directly from the virtual host configuration, if you don't have the option to create a robots.txt file on the web server itself. This can be the case with certain business applications or proprietary portals.

Here is the code to put in the server { ... } block of the virtual host.

location = /robots.txt {
    default_type text/plain;
    return 200 "User-agent: AhrefsBot\nCrawl-delay: 10\n\nUser-agent: Amazonbot\nCrawl-delay: 10\n";
}

This block returns crawl directives for the AhrefsBot and Amazonbot robots.

Check the Nginx configuration:

sudo nginx -t

Reload the configuration:

sudo systemctl reload nginx

Block User Agents from mitchellkrogza’s list

To continue with protection against bots, we will use the list provided by mitchellkrogza, the author of nginx-ultimate-bad-bot-blocker.

The list is available here: raw.githubusercontent.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/master/_generator_lists/bad-user-agents.list

I wrote a script that downloads this list and generates a configuration file for Nginx.

Here is the script:

#!/bin/bash

# -------------------------------
# Configuration
# ------------------------------
LIST_URL="https://raw.githubusercontent.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/master/_generator_lists/bad-user-agents.list"
WORK_DIR="/opt/nginx-bad-ua"
LIST_FILE="$WORK_DIR/blocked_user_agents.list"
CLEAN_LIST="$WORK_DIR/ua-cleaned.list"
CONF_FILE="/etc/nginx/conf.d/blocked_ua_vhost.conf" # map file for a vhost

# Create the directory if necessary
mkdir -p "$WORK_DIR"

# -------------------------------
# 1️⃣ Download the list
# -------------------------------
if ! curl -fsSL "$LIST_URL" -o "$LIST_FILE"; then
    exit 1
fi

# -------------------------------
# 2️⃣ Clean up the list: remove blank lines, comments, spaces and backslashes
# -------------------------------
grep -vE '^[[:space:]]*(#|$)' "$LIST_FILE" | tr -d ' \\' > "$CLEAN_LIST"

# -------------------------------
# 3️⃣ Generate the Nginx map file
# -------------------------------
{
    echo "# Auto-generated $(date)"
    echo "# Map to block User-Agents (specific vhost)"
    echo "map \$http_user_agent \$bad_ua {"
    echo " default 0;"
    while read -r ua; do
        echo " ~*$ua 1;"
    done < "$CLEAN_LIST"
    echo "}"
} > "$CONF_FILE"

# -------------------------------
# 4️⃣ Nginx Check and Reload
# -------------------------------
if nginx -t; then
    systemctl reload nginx
else
    exit 1
fi

# -------------------------------
# 5️⃣ Instructions for the vhost
# -------------------------------
echo ""
echo "include /etc/nginx/conf.d/blocked_ua_vhost.conf;"
echo "if (\$bad_ua) { return 444; }"
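The clean-up in step 2️⃣ can be tried in isolation on a small inline sample. This sketch assumes the list only needs comments, blank lines, spaces and backslashes stripped; the sample entries are made up:

```shell
# Create a sample list containing a comment, a blank line,
# an entry with a backslash and an entry with a space
printf '# comment\n\nAhrefs\\Bot\nBad Bot\n' > /tmp/sample.list

# Drop comment/blank lines, then delete spaces and backslashes
grep -vE '^[[:space:]]*(#|$)' /tmp/sample.list | tr -d ' \\' > /tmp/sample-cleaned.list

cat /tmp/sample-cleaned.list
```

This should print the two cleaned entries, AhrefsBot and BadBot, one per line.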

Copy the script to the Nginx server and run it; this will create a configuration file that will be loaded by Nginx.
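For illustration, the generated /etc/nginx/conf.d/blocked_ua_vhost.conf should look roughly like this (the actual entries depend on the downloaded list; the bot names below are examples):

```nginx
# Auto-generated <date>
# Map to block User-Agents (specific vhost)
map $http_user_agent $bad_ua {
    default 0;
    ~*SomeBadBot 1;
    ~*AnotherBot 1;
}
```

Any request whose user agent matches an entry sets $bad_ua to 1, which the vhost can then test.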

Next, in the virtual hosts where you want to block bots, add this to the server { ... } block.

if ($bad_ua) {
    return 444;
}

Test the Nginx configuration:

sudo nginx -t

Reload so the change is taken into account:

sudo systemctl reload nginx

If you want to be able to easily analyze bot blocking, it is possible to have a separate log file.

In the nginx.conf file, in the http { ... } section, add:

log_format bad_ua '$remote_addr - $remote_user [$time_local] '
                  '"$request" $status $body_bytes_sent '
                  '"$http_user_agent" "$host"';

Finally, in the virtual host file, add:

access_log /var/log/nginx/bad_ua.log bad_ua if=$bad_ua;

You now know how to manage robots (bots) directly from the Nginx configuration and limit their impact on your web server.

Romain Drouche
System Architect | MCSE: Core Infrastructure
IT infrastructure expert with over 15 years of field experience. Currently a Systems and Networks Project Manager and Information Systems Security (ISS) expert, I use my expertise to ensure the reliability and security of technological environments.
