Hi fellow admins,

Just wanted to share a quick tip about reducing server load.

Before implementing these Cloudflare rules, my server load (on a 4-core Ubuntu box) was consistently between 3.5 and 4.0. Now it’s running much smoother, with the load hovering around 0.3.

The best part? It only took 3 rules, and they all work with the Cloudflare free plan.

The order of the rules is important, so please pay attention.

Allowlist

This rule comes first in order to avoid friction with other Fediverse servers and good crawlers. Use the action Skip.


(http.user_agent contains "Observatory") or

(http.user_agent contains "FediFetcher") or
(http.user_agent contains "FediDB/") or
(http.user_agent contains "+fediverse.observer") or
(http.user_agent contains "FediList Agent/") or

(starts_with(http.user_agent, "Blackbox Exporter/")) or
(http.user_agent contains "Lestat") or
(http.user_agent contains "Lemmy-Federation-Exporter") or
(http.user_agent contains "lemmy-stats-crawler") or
(http.user_agent contains "lemmy-explorer-crawler/") or

(starts_with(http.user_agent, "Lemmy/")) or
(starts_with(http.user_agent, "PieFed/")) or
(http.user_agent contains "Mlmym") or
(http.user_agent contains "Photon") or
(http.user_agent contains "Boost") or
(starts_with(http.user_agent, "Jerboa")) or
(http.user_agent contains "Thunder") or
(http.user_agent contains "VoyagerApp/") or

(cf.verified_bot_category in {
    "Search Engine Crawler"
    "Search Engine Optimization" 
    "Monitoring & Analytics"
    "Feed Fetcher"
    "Archiver"
    "Page Preview"
    "Academic Research"
    "Security"
    "Accessibility"
    "Webhooks"
  }
  and http.host ne "old.lemmy.eco.br"
  and http.host ne "photon.lemmy.eco.br"
) or

(http.user_agent contains "letsencrypt"
  and http.request.uri.path contains "/.well-known/acme-challenge/"
) or

(starts_with(http.request.full_uri, "https://lemmy.eco.br/pictrs/") and 
  http.request.method eq "GET" and not 
  starts_with(http.user_agent, "Mozilla") and not 
  ip.src.asnum in {
    200373 198571 26496 31815 18450 398101 50673 7393 14061
    205544 199610 21501 16125 51540 264649 39020 30083 35540
    55293 36943 32244 6724 63949 7203 201924 30633 208046 36352
    25264 32475 23033 31898 210920 211252 16276 23470 136907
    12876 210558 132203 61317 212238 37963 13238 2639 20473
    63018 395954 19437 207990 27411 53667 27176 396507 206575
    20454 51167 60781 62240 398493 206092 63023 213230 26347
    20738 45102 24940 57523 8100 8560 6939 14178 46606 197540
    397630 9009 11878 49453 29802
})
  1. The User Agent contains the name of known Fediverse crawlers and monitoring tools (e.g., “Observatory”, “FediFetcher”, “lemmy-stats-crawler”).
  2. The User Agent contains the name of known Lemmy mobile apps and alternative frontends (e.g., “Jerboa”, “Boost”, “VoyagerApp”).
  3. The request comes from Cloudflare-verified bots in specific categories (like “Search Engine Crawler” or “Monitoring & Analytics”) and is not targeting the specific hosts “old.lemmy.eco.br” or “photon.lemmy.eco.br” where I host alternative frontends.
  4. The request is a Let’s Encrypt challenge for the domain (used for SSL certificate renewal).
  5. The request is a GET request to the “pictrs” image server that does not come from a standard web browser (a User Agent starting with “Mozilla”) and does not originate from a list of specified Autonomous System Numbers (ASNs). These ASNs all belong to VPS providers, so there is no excuse for a browser User Agent coming from them.

Blocklist

This list blocks the majority of bad crawlers and bots. Use the action Block.

(cf.verified_bot_category in {"AI Crawler"}) or

(ip.src.country in {"T1"}) or 

(starts_with(http.user_agent, "Mozilla/") and 
http.request.version in {"HTTP/1.0" "HTTP/1.1" "HTTP/1.2" "SPDY/3.1"} and 
any(http.request.headers["accept"][*] contains "text/html")) or

(http.user_agent wildcard r"HeadlessChrome/*") or

(
  http.request.uri.path contains "/xmlrpc.php" or
  http.request.uri.path contains "/wp-config.php" or
  http.request.uri.path contains "/wlwmanifest.xml"
) or

(ip.src.asnum in {
    200373 198571 26496 31815 18450 398101 50673 7393 14061
    205544 199610 21501 16125 51540 264649 39020 30083 35540
    55293 36943 32244 6724 63949 7203 201924 30633 208046 36352
    25264 32475 23033 31898 210920 211252 16276 23470 136907
    12876 210558 132203 61317 212238 37963 13238 2639 20473
    63018 395954 19437 207990 27411 53667 27176 396507 206575
    20454 51167 60781 62240 398493 206092 63023 213230 26347
    20738 45102 24940 57523 8100 8560 6939 14178 46606 197540
    397630 9009 11878 49453 29802
  }
  and http.user_agent wildcard r"Mozilla/*"
) or

(http.request.uri.path ne "/robots.txt") and 
((http.user_agent contains "Amazonbot") or
  (http.user_agent contains "Anchor Browser") or
  (http.user_agent contains "Bytespider") or
  (http.user_agent contains "CCBot") or
  (http.user_agent contains "Claude-SearchBot") or
  (http.user_agent contains "Claude-User") or
  (http.user_agent contains "ClaudeBot") or
  (http.user_agent contains "FacebookBot") or
  (http.user_agent contains "Google-CloudVertexBot") or
  (http.user_agent contains "GPTBot") or
  (http.user_agent contains "meta-externalagent") or
  (http.user_agent contains "Novellum") or
  (http.user_agent contains "PetalBot") or
  (http.user_agent contains "ProRataInc") or
  (http.user_agent contains "Timpibot")
) or

(ip.src.asnum eq 32934)
  1. The request comes from Cloudflare-verified bots in the “AI Crawler” category.
  2. The request originates from a Tor exit node (Cloudflare’s special country code “T1”); Tor was a heavy source of bot traffic here.
  3. The request uses a Mozilla browser User Agent with an outdated HTTP version while accepting HTML content. In 2025 a real browser negotiates HTTP/2 or newer, so these are effectively all bots.
  4. The User Agent is HeadlessChrome, which is automation by definition.
  5. The request path targets common WordPress vulnerability endpoints (/xmlrpc.php, /wp-config.php, /wlwmanifest.xml).
  6. The request originates from a specific list of Autonomous System Numbers (ASNs) and uses a Mozilla User Agent. Again, more bots.
  7. The request is not for /robots.txt and the User Agent contains the name of known crawlers or bots (e.g., “GPTBot”, “Bytespider”, “FacebookBot”).
  8. The request originates from Autonomous System Number 32934 (Facebook).
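
The /robots.txt exception in item 7 exists so blocked crawlers can still fetch the file, but it only pays off if robots.txt actually disallows them. A minimal sketch, using a few of the User Agents from the blocklist above:

```
User-agent: GPTBot
User-agent: CCBot
User-agent: Bytespider
User-agent: ClaudeBot
Disallow: /
```

Well-behaved crawlers will honor this and stop requesting pages entirely; the WAF rule then only has to catch the ones that ignore it.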

Challenge

This one protects the frontends. I added some conditions so that logged-in users aren’t forced to verify with Cloudflare; a crawler normally won’t have a user account. Set the action to Managed Challenge.

(http.host eq "old.lemmy.eco.br" and not len(http.request.cookies["jwt"]) > 0)

or (http.host eq "photon.lemmy.eco.br" 
  and not len(http.request.headers["authorization"]) > 0 
  and not starts_with(http.cookie, "ph_phc"))

or (http.host wildcard "lemmy.eco.br" 
  and not len(http.request.cookies["jwt"]) > 0 
  and not len(http.request.headers["authorization"]) > 0 
  and starts_with(http.user_agent, "Mozilla") 
  and not http.referer contains "photon.lemmy.eco.br")

or (http.user_agent contains "yandex"
  or http.user_agent contains "sogou"
  or http.user_agent contains "semrush"
  or http.user_agent contains "ahrefs"
  or http.user_agent contains "baidu"
  or http.user_agent contains "python-requests"
  or http.user_agent contains "neevabot"
  or http.user_agent contains "CF-UC"
  or http.user_agent contains "sitelock"
  or http.user_agent contains "mj12bot"
  or http.user_agent contains "zoominfobot"
  or http.user_agent contains "mojeek")

or ((http.user_agent contains "crawl"
  or http.user_agent contains "spider"
  or http.user_agent contains "bot")
  and not cf.client.bot)

or (ip.src.asnum in {135061 23724 4808}
  and http.user_agent contains "siteaudit")
  1. A request to the host “old.lemmy.eco.br” that does not have a “jwt” cookie.
  2. A request to the host “photon.lemmy.eco.br” that lacks both an “Authorization” header and a cookie starting with “ph_phc”.
  3. A request to the host “lemmy.eco.br” that lacks both a “jwt” cookie and an “Authorization” header, uses a Mozilla User Agent, and does not have a referrer from “photon.lemmy.eco.br”.
  4. The User Agent contains the name of a specific crawler, bot, or tool (e.g., “yandex”, “baidu”, “python-requests”, “sitelock”).
  5. The User Agent contains the words “crawl”, “spider”, or “bot” but is not a verified Cloudflare-managed bot.
  6. The request originates from specific Autonomous System Numbers (135061, 23724, 4808) and the User Agent contains the word “siteaudit”.

These rules are heavily inspired by this article: https://urielwilson.com/a-practical-guide-to-custom-cloudflare-waf-rules/

Please let me know your thoughts.

  • chgxvjh [he/him, comrade/them]@hexbear.net
    1 day ago

    I think it’s pretty reckless to give a company that is almost certainly connected to American military intelligence access to your users’ communications and connection data.

    If you are bothered by bots, step one is robots.txt; then you can still block crawlers in the config of your own webserver, just like you did on Cloudflare, and then you can roll out tools like Anubis or iocaine to frustrate bots further.
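
    The webserver-level blocking mentioned above is straightforward; a minimal sketch for nginx (assuming nginx, with bot names taken from the blocklist in the post) placed inside a server block:

    ```nginx
    # Refuse a few of the AI/scraper User Agents at the webserver itself.
    if ($http_user_agent ~* (GPTBot|CCBot|Bytespider|ClaudeBot|PetalBot)) {
        return 403;
    }
    ```

    The trade-off is that these requests still reach and consume resources on your own server, which is the load the original post is trying to offload.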

    • Ademir@lemmy.eco.brOP
      1 day ago

      I get your point, but any solution would still need to block threats at the server level, and that costs server load. Honestly, we’re too small for the U.S. military to even notice us. Plus, the free CDN is a great benefit.

      My real work for the revolution happens on the streets, and there’s very little any intelligence agency can do to stop that.