fail2ban + Traefik — blocking HTTP DDoS flood

On the access nodes where our clients create tunnels, we currently use Traefik for HTTP traffic termination. Recently, we encountered a problem where one of the domains was receiving thousands of requests per second, causing the web server to reach 100% CPU, affecting all client traffic on the node in that region, with users reporting freezes and timeouts. This is the story of how we implemented fail2ban to block DDoS floods using access logs.

We configure fail2ban to block DDoS floods using Traefik access logs.

Background

On access nodes where our clients create tunnels, we currently use Traefik for HTTP traffic termination. It has dynamic configuration, automatically issues Let's Encrypt certificates, and solves many tasks.

The tunnel service consists of hundreds of customer domains, which go through a reverse proxy to the clients' machines. However, the termination and initial routing of HTTP traffic still happens on our node. Recently, we ran into an issue where one domain was receiving thousands of requests per second and Traefik maxed out the CPU, affecting all client traffic on the node in that region, with users complaining about lag and timeouts.

After analyzing the Traefik access logs, we saw that all the requests were going to a single domain from dozens of AWS IP addresses. We do have throttling rules in place: when one domain receives too many requests, we start returning 429s. Apparently one of our users had launched something in AWS, perhaps Lambda functions, and the 429s didn't stop it. All resources were spent on establishing HTTPS connections and returning 429 or 404 errors (the client's tunnel was inactive).

Example of one access log message:

{"ClientAddr":"198.51.100.1:54536","ClientHost":"198.51.100.1","ClientPort":"54536","ClientUsername":"-","DownstreamContentSize":0,"DownstreamStatus":429,"Duration":16389705,"OriginContentSize":0,"OriginDuration":15588690,"OriginStatus":0,"Overhead":801015,"RequestAddr":"example.ru.tuna.am","RequestContentSize":0,"RequestCount":21820187,"RequestHost":"example.ru.tuna.am","RequestMethod":"GET","RequestPath":"/api","RequestPort":"-","RequestProtocol":"HTTP/2.0","RequestScheme":"https","RetryAttempts":0,"RouterName":"tuna-web@file","ServiceAddr":"127.0.0.1:8080","ServiceName":"tuna@file","ServiceURL":"http://127.0.0.1:8080","StartLocal":"2025-04-02T12:20:50.019075429+03:00","StartUTC":"2025-04-02T09:20:50.019075429Z","TLSCipher":"TLS_AES_128_GCM_SHA256","TLSVersion":"1.3","entryPointName":"websecure","level":"info","msg":"","time":"2025-04-02T12:20:50+03:00"}

As a result, when I woke up in the morning and saw that nothing was working, I had to investigate and eventually blocked all the IPs like this:

for i in $(tail -n 10000  /var/log/traefik/access.log | jq -cr 'select(.RequestHost == "example.ru.tuna.am") | .ClientHost' | sort | uniq) ; do ufw deny from ${i} ; done
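The same selection can be sketched in Python, which is easier to extend than a one-liner. This is only an illustration: the function name offending_ips and the sample lines are made up for the example, and it prints the ufw commands instead of running them.

```python
import json

def offending_ips(lines, host):
    """Return the sorted unique client IPs that sent requests to `host`."""
    ips = set()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict):
            continue
        if entry.get("RequestHost") == host and entry.get("ClientHost"):
            ips.add(entry["ClientHost"])
    return sorted(ips)

# Hypothetical, shortened log lines using the same field names as the access log
sample = [
    '{"ClientHost":"198.51.100.1","RequestHost":"example.ru.tuna.am","DownstreamStatus":429}',
    '{"ClientHost":"198.51.100.2","RequestHost":"other.ru.tuna.am","DownstreamStatus":200}',
    '{"ClientHost":"198.51.100.1","RequestHost":"example.ru.tuna.am","DownstreamStatus":404}',
    'not json at all',
]

for ip in offending_ips(sample, "example.ru.tuna.am"):
    # Print the command rather than executing it, so the sketch is safe to run
    print(f"ufw deny from {ip}")
```

On a real node the lines would come from the tail of /var/log/traefik/access.log rather than from a hard-coded list.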

The block helped: at the firewall level the problem was solved completely. But a week later the situation repeated, so it became clear that this was not a one-time fix and needed to be automated.

Solution

After a bit of research it turned out that unless you are facing a carrier-grade DDoS with terabytes of traffic, a BGP blackhole isn’t required, and almost everyone solves the problem with a simple firewall, or more precisely, with fail2ban. The neat thing about this solution is that not only the blocking is automatic, but also the unblocking once the quarantine period ends.

fail2ban setup

fail2ban is a daemon written in Python that watches log files, matches each line against a regular expression, and increments a hit counter for the offending host on every match. Once a threshold is exceeded, it puts the offender in a jail, that is, adds the IP to a firewall ban rule. It’s a huge project and it’s highly customizable. On top of that, the software is popular: major distributions ship a package that is already adapted to the distribution’s current firewall, so you don’t have to worry about it, and it comes with tons of pre-configured filters. For example, right after installation the sshd rule is ready and blocks bots trying to brute-force access.

sudo apt install -y fail2ban

Install the package. On Debian 12 it’s a bit outdated and the daemon can’t start out of the box, because sshd no longer writes its log to a file and all output goes to journald. To fix this, edit /etc/fail2ban/jail.d/defaults-debian.conf and add backend = systemd at the end:

# cat /etc/fail2ban/jail.d/defaults-debian.conf 
[sshd]
enabled = true
backend = systemd

Now you can start the daemon with fail2ban-client start and check that everything works. The same client is also convenient for viewing which jails are active, how many offenders are banned, and so on. Really handy: run fail2ban-client --help for a bunch of hints.

Let’s get back to the task. Essentially, the entire setup comes down to two files. The first is a filter, where we describe what to match via regex; filters live in /etc/fail2ban/filter.d, which already comes packed with a bunch of them from your distribution’s maintainers. The second is a jail; jails live in /etc/fail2ban/jail.d and describe the rules under which offenders end up banned.

Based on the log format, we write a filter, create the file /etc/fail2ban/filter.d/traefik-429.conf with the following contents:

[INCLUDES]
before = common.conf

[Definition]

# The regular expression looks for:
# - "ClientHost":"<HOST>" – required group for the IP (Fail2Ban uses the <HOST> tag)
# - "DownstreamStatus":429 – status 429

failregex = ^\{.*"ClientHost":"<HOST>".*"DownstreamStatus":\s*429.*\}$
ignoreregex =

Pretty straightforward here: we look for lines with 429 HTTP errors and extract the client’s IP.
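Since fail2ban filters are Python regular expressions, the pattern can be sanity-checked in a few lines of Python. This is a rough sketch under assumptions: the <HOST> tag is replaced here by a simplified IPv4-only group (fail2ban's real expansion is broader and also covers IPv6 and hostnames), the log lines are shortened, and fail2ban's own timestamp handling is ignored.

```python
import re

FAILREGEX = r'^\{.*"ClientHost":"<HOST>".*"DownstreamStatus":\s*429.*\}$'

# Simplified stand-in for fail2ban's <HOST> tag: IPv4 only (an assumption)
HOST_PATTERN = r'(?P<host>\d{1,3}(?:\.\d{1,3}){3})'

pattern = re.compile(FAILREGEX.replace("<HOST>", HOST_PATTERN))

def matched_host(line):
    """Return the client IP if the line counts as a failure, else None."""
    m = pattern.search(line)
    return m.group("host") if m else None

# Shortened, hypothetical log lines with the relevant fields only
hit = '{"ClientAddr":"198.51.100.1:54536","ClientHost":"198.51.100.1","DownstreamStatus":429,"RequestHost":"example.ru.tuna.am"}'
miss = '{"ClientAddr":"198.51.100.1:54536","ClientHost":"198.51.100.1","DownstreamStatus":404,"RequestHost":"example.ru.tuna.am"}'

print(matched_host(hit))   # 198.51.100.1
print(matched_host(miss))  # None
```

For a check against the real log and the real filter file, fail2ban ships the fail2ban-regex tool: fail2ban-regex /var/log/traefik/access.log /etc/fail2ban/filter.d/traefik-429.conf.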

Now let’s create a jail /etc/fail2ban/jail.d/traefik-429.conf:

[traefik-429]
enabled  = true
filter   = traefik-429
logpath  = /var/log/traefik/access.log
maxretry = 1000
findtime = 1m
bantime  = 1m

Nothing complicated here either, but let me break it down:

  • [traefik-429] - the name of the jail, i.e., the set of Fail2Ban rules that will be applied to the logs. Usually, the name is related to the filter or the target to be blocked.

  • enabled - state, on/off

  • filter - specifies the filter name to be used in log analysis.

  • logpath - path to the log file that Fail2Ban will analyze.

  • maxretry - number of matches from a single IP address (<HOST>) allowed before banning.

  • findtime - the time window in which attempts are counted.

  • bantime - duration of the IP address (<HOST>) ban.
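How maxretry, findtime and bantime interact can be illustrated with a small sliding-window model. This is a toy sketch, not fail2ban's actual implementation: the JailSketch class, the client addresses and the request rates are invented for the example, and it assumes a ban is triggered as soon as the number of matches inside the window reaches maxretry.

```python
from collections import defaultdict, deque

class JailSketch:
    """Toy model of a jail: counts matches per IP inside a sliding time window."""

    def __init__(self, maxretry, findtime, bantime):
        self.maxretry = maxretry          # matches that trigger a ban
        self.findtime = findtime          # window length, seconds
        self.bantime = bantime            # ban length, seconds
        self.hits = defaultdict(deque)    # ip -> timestamps of recent matches
        self.banned_until = {}            # ip -> time when the ban expires

    def is_banned(self, ip, now):
        return self.banned_until.get(ip, 0) > now

    def record_failure(self, ip, now):
        """Register one filter match; return True if it triggers a ban."""
        if self.is_banned(ip, now):
            return False
        window = self.hits[ip]
        window.append(now)
        # Forget matches that fell out of the findtime window
        while window and window[0] <= now - self.findtime:
            window.popleft()
        if len(window) >= self.maxretry:
            self.banned_until[ip] = now + self.bantime
            window.clear()
            return True
        return False

# Same values as the jail above: maxretry = 1000, findtime = 1m, bantime = 1m
jail = JailSketch(maxretry=1000, findtime=60, bantime=60)

first_ban = {}
for i in range(120 * 20):                 # simulate 120 seconds
    t = i / 20
    # Hypothetical noisy client: 20 matches per second
    if jail.record_failure("203.0.113.7", t):
        first_ban.setdefault("203.0.113.7", t)
    # Hypothetical quiet client: 10 matches per second
    if i % 2 == 0 and jail.record_failure("203.0.113.8", t):
        first_ban.setdefault("203.0.113.8", t)

print(first_ban)  # {'203.0.113.7': 49.95}
```

The noisy client collects its 1000th match after about 50 seconds and gets banned, while the quiet one never has more than 600 matches inside any 60-second window and stays free.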

With the files in place, apply the rules with fail2ban-client reload --all and check that they took effect:

# fail2ban-client status
Status
|- Number of jail:      2
`- Jail list:   sshd, traefik-429

or just for our jail:

# fail2ban-client status traefik-429
Status for the jail: traefik-429
|- Filter
|  |- Currently failed: 0
|  |- Total failed:     3647
|  `- File list:        /var/log/traefik/access.log
`- Actions
   |- Currently banned: 0
   |- Total banned:     3
   `- Banned IP list:

Well, looks like everything’s ready. Let’s test it.

Testing

I’m used to testing this with wrk, and I also have a lua script for proper detailed checks.

report.lua
local os_name = io.popen("uname"):read("*l")
local file_path

if os_name == "Linux" then
    file_path = "/dev/shm/responses.tmp"
elseif os_name == "Darwin" or os_name == "FreeBSD" then
    file_path = "/tmp/responses.tmp"
else
    file_path = "responses.tmp"
end

response = function(status, headers, body)
    local file = io.open(file_path, "a")  -- Open file for appending
    if file then
        file:write(status .. "\n")  -- Write status code to file
        file:close()
    else
        print("Failed to open file for writing")
    end
end

-- Function called after the test is completed
done = function(summary, latency, requests)
   io.write("------------------------------\n")
   io.write(string.format("Requests: %d\n", summary.requests))
   io.write(string.format("Duration: %.2f s\n", summary.duration / 1000000))
   io.write(string.format("Bytes: %d\n", summary.bytes))

   io.write(string.format("Requests/sec: %.2f\n", summary.requests / (summary.duration / 1000000.0)))
   io.write(string.format("Transfer/sec: %.2f MB\n", (summary.bytes / 1048576) / (summary.duration / 1000000.0)))
   io.write("------------------------------\n")
   io.write(string.format("\nLatency Distribution (ms):\n"))
   for _, p in pairs({ 50, 90, 99, 99.999 }) do
      local n = latency:percentile(p)
      io.write(string.format("  %g%%: %d\n", p, n / 1000))
   end
   io.write("------------------------------\n")
   io.write(string.format("\nSummary:\n"))
   io.write(string.format("  Min Latency: %d ms\n", latency.min / 1000))
   io.write(string.format("  Max Latency: %d ms\n", latency.max / 1000))
   io.write(string.format("  Mean Latency: %.2f ms\n", latency.mean / 1000))
   io.write(string.format("  Stdev Latency: %.2f ms\n", latency.stdev / 1000))
   io.write(string.format("  Percentile 99.9: %d ms\n", latency:percentile(99.9) / 1000))

   -- This is a hack: each wrk thread runs in its own Lua state, so counters
   -- collected in response() are not visible in done(). The status codes are
   -- passed through a temp file instead.
   local responses = {}
   local file = io.open(file_path, "r")
   if file then
      for line in file:lines() do
         local status = tonumber(line)
         if status then
            if responses[status] == nil then
               responses[status] = 0
            end
            responses[status] = responses[status] + 1
         end
      end
      file:close()
   else
      print("Failed to open file for reading")
   end
   -- Delete the temp file after reading
   os.remove(file_path)

   io.write("------------------------------\n")
   io.write(string.format("\nHTTP Status Codes:\n"))
   if next(responses) ~= nil then
      for status, count in pairs(responses) do
          io.write(string.format("  %d : %d\n", status, count))
      end
   else
      io.write("  No status codes recorded.\n")
   end
end

Of course we won’t be testing from our own machine, so that we don’t get ourselves banned; we rent a VPS for that. We run wrk and see what happens:

# wrk -c 10 -t 10 -d 10s  -s report.lua https://fail2ban.stage.tuna.am
Running 10s test @ https://fail2ban.stage.tuna.am
  10 threads and 10 connections
...
HTTP Status Codes:
  404 : 70
  429 : 1104

We got more than 1000 429s in 10 seconds, which exceeds maxretry within findtime, so the ban should have kicked in:
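A quick back-of-the-envelope check of these numbers, as a sketch that assumes the 429s were spread evenly over the run (the helper seconds_until_ban is made up for this estimate, and fail2ban's own log polling delay is ignored):

```python
def seconds_until_ban(hits, duration_s, maxretry, findtime_s):
    """Estimate when an evenly paced client reaches maxretry, or None if it never does."""
    rate = hits / duration_s
    needed = maxretry / rate  # seconds needed to accumulate maxretry matches
    if needed > duration_s or needed > findtime_s:
        return None
    return needed

# 1104 responses with status 429 in a 10 s run, against maxretry = 1000 and findtime = 1m
print(f"{1104 / 10:.1f} matches/s")                           # 110.4 matches/s
print(f"{seconds_until_ban(1104, 10, 1000, 60):.2f} s")       # 9.06 s
```

So the threshold is crossed only about a second before the 10-second run ends, which is why the first run completes almost untouched.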

# fail2ban-client status traefik-429
Status for the jail: traefik-429
|- Filter
|  |- Currently failed: 1
|  |- Total failed:     4751
|  `- File list:        /var/log/traefik/access.log
`- Actions
   |- Currently banned: 1
   |- Total banned:     4
   `- Banned IP list:   65.21.241.57

On the target host, the VM's IP successfully got banned. Let's try launching wrk again:

# wrk -c 10 -t 10 -d 10s  -s report.lua https://fail2ban.stage.tuna.am
unable to connect to fail2ban.stage.tuna.am:https Connection refused

Indeed, the connection is now refused at the firewall level before it ever reaches Traefik, so everything seems to work as expected.

Looks like it's ready for production, or am I missing something?
