Discussion:
Rate limit spider / bots ?
haproxy-6mkYu7MjHcjIEberA2iXlQC/
2012-02-13 21:14:37 UTC
Hi folks

I've been using HAProxy for a while now and love it for load balancing Apache and Nginx web server clusters, and I'm glad to have stumbled across this forum :)

The question I have is: is it possible to rate limit spiders and bots by User-Agent at the HAProxy level? e.g. rate limit the Yandex and Baidu bots?

thanks

---
posted at http://www.serverphorums.com
http://www.serverphorums.com/read.php?10,445347,445347#msg-445347
Baptiste
2012-02-14 05:20:11 UTC
Hi,

For now, you can only track users by IP.
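[Editor's note: for reference, IP-based tracking is done with a stick-table. This is an untested sketch; the frontend name, table sizing, and thresholds are illustrative, and the tcp-request connection track-sc syntax requires a recent HAProxy development build:]

```
frontend www_fe
    bind :80
    mode http
    # one entry per source IP; counters kept for 30 seconds
    stick-table type ip size 200k expire 30s store conn_cur,conn_rate(3s)
    tcp-request connection track-sc1 src
    # drop clients holding 10 or more concurrent connections
    tcp-request connection reject if { sc1_conn_cur ge 10 }
    # drop clients opening 20 or more connections within 3 seconds
    tcp-request connection reject if { sc1_conn_rate ge 20 }
    default_backend www_be
```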

cheers
haproxy-6mkYu7MjHcjIEberA2iXlQC/
2012-02-14 20:00:03 UTC
Thanks Baptiste

here's hoping such a feature gets added :)

---
posted at http://www.serverphorums.com
http://www.serverphorums.com/read.php?10,445347,445917#msg-445917
John Lauro
2012-02-15 00:20:40 UTC
You could set up the ACLs so the bots all go to one backend, and then limit the
number of connections on that backend to something low like 1. Not exactly a
rate limit, but at most 1 connection to serve them all...
haproxy-6mkYu7MjHcjIEberA2iXlQC/
2012-02-16 02:39:10 UTC
John, how would I go about using ACLs? I thought the rate-limit option didn't support backends: http://code.google.com/p/haproxy-docs/wiki/rate_limit_sessions ?

---
posted at http://www.serverphorums.com
http://www.serverphorums.com/read.php?10,445347,446647#msg-446647
John Lauro
2012-02-16 05:24:57 UTC
This isn't tested, just a sample idea... obviously some parts are missing.

Create your ACLs something like:

frontend web
    acl is_bot hdr_sub(User-Agent) -i bot
    ...
    use_backend botq if is_bot
    default_backend normalq

backend normalq
    server be1 10.1.2.3 check minconn 20 maxconn 30
    server be2 10.1.2.4 check minconn 20 maxconn 30

backend botq
    server be1 10.1.2.3 track normalq/be1 minconn 1 maxconn 1
    server be2 10.1.2.4 track normalq/be2 minconn 1 maxconn 1


So that doesn't really rate limit it; it just means at most one concurrent
request (per backend server) is serviced at a time for all identified bots.
Personally I think that would be better than rate limiting. That being said,
if you really want to rate limit, nothing says botq couldn't connect to a
different port on 127.0.0.1, and you could rate limit that internal port as a
separate frontend in HAProxy. Preserving the original IP is possible through
the loopback, so it can be used for complicated setups such as rate limiting
the backend, at the cost of the CPU overhead of having HAProxy talk to
itself... As long as that is not the path for most traffic, it shouldn't be
a big deal.
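[Editor's note: the loopback idea above could look something like the following untested sketch. The port number and the rate-limit value are illustrative, and preserving the client IP across the loopback hop would need additional setup such as transparent proxying:]

```
frontend web
    bind :80
    acl is_bot hdr_sub(User-Agent) -i bot
    use_backend botq if is_bot
    default_backend normalq

# bots are proxied back into haproxy through the loopback
backend botq
    server loop 127.0.0.1:8081

# a second frontend that only bots reach, with its own session rate limit
frontend botq_throttled
    bind 127.0.0.1:8081
    rate-limit sessions 5
    default_backend normalq
```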
haproxy-6mkYu7MjHcjIEberA2iXlQC/
2012-02-16 21:27:45 UTC
Thanks John, that gives me other ideas. Would this work?

###
frontend www_fe
    bind :80
    mode http
    maxconn 4096
    default_backend www_be
    option contstats
    acl spiderbots hdr_sub(user-agent) -i -f /etc/haproxy/spiderbotlist.lst
    use_backend spider_backend if spiderbots

backend spider_backend
    acl too_fast fe_sess_rate gt 10
    acl too_many fe_conn gt 10
    tcp-request inspect-delay 1000ms
    tcp-request content accept if ! too_fast or ! too_many
    tcp-request content accept if WAIT_END
    server be1 10.1.2.3 check maxconn 100
    server be2 10.1.2.4 check maxconn 100

would that in theory limit new sessions and concurrent connections to spider_backend to maxconn / fe_sess_rate?

i.e. 100/10 = 10 new sessions/sec, or 10 concurrent connections at a time?

and the 11th new session or concurrent user would see a 1000ms delay until serviced?

thanks

---
posted at http://www.serverphorums.com
http://www.serverphorums.com/read.php?10,445347,447162#msg-447162
Baptiste
2012-02-17 09:37:25 UTC
Hi,

Let me just update your configuration a bit:

frontend www_fe
    bind :8080
    mode http
    maxconn 4096
    default_backend www_be
    acl spiderbots hdr_sub(user-agent) -i -f /etc/haproxy/spiderbotlist.lst
    use_backend spider_backend if spiderbots

backend www_be
    mode http
    server be1 127.0.0.1:80 check maxconn 100

backend spider_backend
    mode http
    acl too_fast be_sess_rate gt 10
    acl too_many be_conn gt 10
    tcp-request inspect-delay 3s
    tcp-request content accept if ! too_fast or ! too_many
    tcp-request content accept if WAIT_END
    server be1 127.0.0.1:80 check maxconn 100


I tested it with apache bench: without the WAIT_END line I got 3600 req/s;
with the WAIT_END line, only 8.5 req/s.

cheers
haproxy-6mkYu7MjHcjIEberA2iXlQC/
2012-02-17 17:00:52 UTC
sweet, so bumping tcp-request inspect-delay from 1500ms to 3s dramatically slowed down bot activity :)

I tried apachebench with a tcp-request inspect-delay of 1.5s, 2s and 3s, giving me 217 req/s, 97 req/s, and 67 req/s respectively. Without rate limiting it was around 948 req/s.

thanks Baptiste

---
posted at http://www.serverphorums.com
http://www.serverphorums.com/read.php?10,445347,447698#msg-447698
Baptiste
2012-02-17 17:48:59 UTC
you're welcome :)

I've added an article on our blog (http://blog.exceliance.fr/) about
this piece of configuration; it's easy to implement and quite
efficient :)

cheers
haproxy-6mkYu7MjHcjIEberA2iXlQC/
2012-02-17 18:05:13 UTC
yeah, very simple implementation

I have it working on a live site right now; it's pretty good to see live spider bot activity on the HAProxy admin stats side too.

In your blog-posted config, what does this line do exactly?

acl spiderbots hdr_cnt(User-Agent) eq 0

Is it because bots often send no User-Agent header at all, and you're trying to match those bots to the spiderbot backend as well?

---
posted at http://www.serverphorums.com
http://www.serverphorums.com/read.php?10,445347,447738#msg-447738
Baptiste
2012-02-17 18:30:31 UTC
I just did that to provide some examples.
This week I was working on a customer Aloha (Exceliance LB)
installation when I saw that some spiders were browsing the website,
and they had no User-Agent header. So I decided to add this kind of
example to my configuration.

The main purpose is to show people the different things they can do with
ACLs, not only matching a simple Host header value.
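[Editor's note: multiple acl lines with the same name are ORed together, so both checks can be combined. An untested sketch based on the config earlier in this thread:]

```
frontend www_fe
    # matches known bots by User-Agent substring...
    acl spiderbots hdr_sub(user-agent) -i -f /etc/haproxy/spiderbotlist.lst
    # ...and also requests that carry no User-Agent header at all
    acl spiderbots hdr_cnt(User-Agent) eq 0
    use_backend spider_backend if spiderbots
```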

cheers