Problems with HAProxy, down servers and 503 errors

Discussion:

John Marrett

2009-01-23 22:38:29 UTC

We have been using HAProxy in a production environment, without issue
for a long period. Thanks for a wonderful product!

Unfortunately we recently encountered some issues as we have worked on
the migration of one of our sites onto a new HAProxy based load
balancing solution. We've started to notice issues related to persistent
cookies, client requests and down backend servers.

This new application requires users remain on the same web server to
avoid losing session information, which is not shared between backend
servers. If we stop one of the backend servers (port 80 is no longer
listening, HAProxy receives a RST packet from the server when sending a
health check or client request) the clients who have a persistent
session on this web server will continue to be sent to that server until it is
formally declared down (after two health checks fail, as controlled by
the fall 2 parameter).

Here's where we run into our issue:

While HAProxy is receiving RST responses, it sends a 503 response to the
client. We're not very eager to send this error response to the client.
It appeared, from my reading of the documentation, that by setting
"option redispatch" and "retries 3" (or greater than 1) we should get
HAProxy to retry, and, in the event of explicit connection failure from
the backend server, move on to the next functioning server on the final
retry. This doesn't appear to be the case.

To make matters worse, when HAProxy throws a 503 response because of a
RST it ignores the errorfile directive. If you have two servers, and
stop one you will receive an extremely plain 503 error response. If no
backend is available at all the errorfile directive functions properly,
and the "pretty" error message is returned.

Ideally, in the event that a backend server is returning RSTs, we'd like
to move to the next server. HAProxy could either do this immediately or
buffer the request until it makes the final determination that the
backend server is down and can send it to another server. If that isn't
possible, we'd really like the 503 response returned to the client to be
the one specified in the errorfile.

I'm going to investigate the possibility of creating a patch for this
issue tonight, though if more experienced hands could help, either with
the patch, or with something obvious that I've missed in my
configuration, I'd greatly appreciate it.

A few other notes:

Interestingly, even upon receiving a RST to a client request to the
backend server, HAProxy doesn't consider the server as having failed a
health check until it performs its next health check. So, if you have a
health checking interval of 10 seconds, if a customer makes a request 1
second after the first health check, with fall 2 set, it will take 29
seconds before the backend server is declared down and the client moved
on to the next server.

The documentation for the track command could also be made a bit
clearer. It took me a while (and a colleague's examination of the source)
to determine that the <proxy> is another backend, in the case that you
are trying to reference a server from a different backend. (Perhaps,
depending on your configuration, it's not always a backend, but could be
something else?).

Thank you for taking the time to read this novella :), configuration
follows, thanks in advance for your help,

-JohnF

Configuration Details

We are running 1.3.15.7, with the following configuration (excerpt):

global
    stats socket /var/run/haproxy.stat

defaults
    balance roundrobin
    cookie SERVERID insert indirect
    option httpchk GET /index.html HTTP/1.0
    timeout client 10m
    timeout server 10m
    timeout connect 3s

frontend http_frontend *:80
    mode http
    reqirep ^Host:([^:]*) Host:\1
    #Traffic matching ACLs
    [...]
    acl host_qa_site_com hdr(host) -i qa.site.com
    use_backend qa_site_com_http if host_qa_site_com

frontend ssl_frontend *:81
    mode http
    reqirep ^Host:([^:]*) Host:\1
    #Traffic matching ACLs
    [...]
    acl host_qa_site_com hdr(host) -i qa.site.com
    use_backend qa2_site_com_ssl if host_qa_site_com

[...]

backend qa_site_com_http
    mode http
    errorfile 503 /etc/haproxy/errorfiles/503.http
    option redispatch
    retries 3
    option httpchk GET /ut.asp
    server web_1 web1:80 cookie 3f06565277298ce80af6bbaab8c5b584 check inter 5100ms fall 2
    server web_2 web2:80 cookie bc62407de9ecf95bae662880b593a0d4 check inter 5100ms fall 2

backend qa_site_com_ssl
    mode http
    errorfile 503 /etc/haproxy/errorfiles/503.http
    option redispatch
    retries 3
    option httpchk GET /ut.asp
    server web_1 web1:80 cookie 3f06565277298ce80af6bbaab8c5b584 track qa_site_com_http/web_1
    server web_2 web2:80 cookie bc62407de9ecf95bae662880b593a0d4 track qa_site_com_http/web_2

John Marrett

2009-01-25 16:23:24 UTC

I'm embarrassed to report that this is not an HAProxy issue.

In addition to the changes being made on the load balancing level, we
have also upgraded the backend real servers. It seems there has been a
change in their shutdown procedure, where before they would stop
responding immediately when a shutdown was initiated they now return 503
errors during the (protracted) shut down process.

This explains both unusual issues perfectly clearly.

I'm very sorry for any time wasted looking into this issue.

-JohnF

Willy Tarreau

2009-01-25 22:49:43 UTC

Hi John,

Post by John Marrett
I'm embarrassed to report that this is not an HAProxy issue.

Don't feel embarrassed. I'm glad that you found the issue. And it's
kind of you to send us an update.

Post by John Marrett
In addition to the changes being made on the load balancing level, we
have also upgraded the backend real servers. It seems there has been a
change in their shutdown procedure, where before they would stop
responding immediately when a shutdown was initiated they now return 503
errors during the (protracted) shut down process.
This explains both unusual issues perfectly clearly.
I'm very sorry for any time wasted looking into this issue.

No problem, no time wasted yet !

Have you at least found a solution to your issue ?

Regards,
Willy

John Marrett

2009-01-26 00:06:23 UTC

Willy,

Post by Willy Tarreau
No problem, no time wasted yet !

Well, none of your time :) It took me far longer than it should have to
realise my error. Regrettable, as packet captures are usually my first
diagnostic tool. A mistake I won't make again any time soon.

Post by Willy Tarreau
Have you at least found a solution to your issue ?

I've found a partial solution to my issue, and in fact, now I have a
question that's relevant to the list. The backend server is IIS. If
you're getting 503s during shutdowns, you can use this solution to turn
them into RSTs [1].

The RST is sent by IIS after it receives the full client request from
HAProxy (I suspect that it may want to see the Host header before it
decides how it wants to treat the request). When HAProxy receives the
RST it returns a 503 to the client (respecting the errorfile!). Despite
the presence of "option redispatch", HAProxy does not send the request
to another backend server.

If there was a way to get HAProxy to send the request to another
functional real server at this time it would be great, though I fear
that HAProxy no longer has the request information after having sent it
to the server.

Any further advice would be much appreciated, I can provide packet
captures off list if required.

-JohnF

John Marrett

2009-01-26 00:14:59 UTC

Forgot the link for the IIS 503 / RST solution:

http://technet.microsoft.com/en-us/library/cc757659.aspx

I believe that our application itself (currently) throws 503s, so we
couldn't use some kind of "mark the server down on a 503 response"
solution, though we could probably change that if it might afford us a
solution.

-JohnF

Willy Tarreau

2009-01-26 05:13:43 UTC

Post by John Marrett
The RST is sent by IIS after it receives the full client request from
HAProxy (I suspect that it may want to see the Host header before it
decides how it wants to treat the request). When HAProxy receives the
RST it returns a 503 to the client (respecting the errorfile!). Despite
the presence of "option redispatch", HAProxy does not send the request
to another backend server.
If there was a way to get HAProxy to send the request to another
functional real server at this time it would be great, though I fear
that HAProxy no longer has the request information after having sent it
to the server.

You're perfectly right, redispatch only happens when the request is still
in haproxy. Once it has been sent, it cannot be performed. It must not
be performed either for non-idempotent requests, because there is no way
to know whether some processing has begun on the server before it died
and returned an RST.
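
As a rough illustration of that distinction, here is an annotated fragment
based on the backend from the first message. The comments are a reading of
the explanation above, not a statement of haproxy's exact internals:

```haproxy
backend qa_site_com_http
    mode http
    # Retries apply to failed connection attempts, that is, while the
    # request is still held by haproxy and has not been forwarded.
    retries 3
    # Allows a session whose cookie designates a failing server to be
    # sent to another server when the connection retries run out.
    option redispatch
    # Not covered by either setting: an RST that arrives after the request
    # has already been sent to the server. That case ends in a 503.
    server web_1 web1:80 cookie 3f06565277298ce80af6bbaab8c5b584 check inter 5100ms fall 2
    server web_2 web2:80 cookie bc62407de9ecf95bae662880b593a0d4 check inter 5100ms fall 2
```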

Post by John Marrett
Any further advice would be much appreciated, I can provide packet
captures off list if required.

Shouldn't you include the Host header in the health checks, in order to
solicit the final server and get a chance to see it fail ?
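
One way to do that, as a hedged sketch: the documentation for "option
httpchk" describes a trick where the header is appended after the version
string. The host name qa.site.com is taken from the ACLs in the original
configuration; that /ut.asp is served for that host is an assumption.

```haproxy
backend qa_site_com_http
    mode http
    # "\r\n" ends the request line and "\ " escapes the space, so the check
    # is sent as "GET /ut.asp HTTP/1.1" followed by "Host: qa.site.com".
    option httpchk GET /ut.asp HTTP/1.1\r\nHost:\ qa.site.com
```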

Regards,
Willy

John Marrett

2009-01-26 10:01:18 UTC

Willy,

Post by Willy Tarreau
You're perfectly right, redispatch only happens when the request is still
in haproxy. Once it has been sent, it cannot be performed. It must not
be performed either for non-idempotent requests, because there is no way
to know whether some processing has begun on the server before it died
and returned an RST.

I suspected that this would be the case.

Post by Willy Tarreau

Post by John Marrett
Any further advice would be much appreciated, I can provide packet
captures off list if required.

Shouldn't you include the Host header in the health checks, in order to
solicit the final server and get a chance to see it fail ?

The health checks naturally include the Host header, and work properly.
We just wanted to avoid returning 503 to the client for the 4-14 seconds
it takes for the backend server to be reported down (inter 5 fall 2).
We'll see if there's something we can do to get IIS to return a RST
before it receives the full request.

-JohnF

Willy Tarreau

2009-01-27 05:59:03 UTC

Post by John Marrett
The health checks naturally include the Host header, and work properly.

OK.

Post by John Marrett
We just wanted to avoid returning 503 to the client for the 4-14 seconds
it takes for the backend server to be reported down (inter 5 fall 2).

You could shorten that time using "fastinter". This will replace "inter"
when the server has failed one check. For instance, you use
"inter 1000 fastinter 200 fall 2" and you can detect a failure in 1.2 seconds.
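
Applied to the backend from the first message, that could look like the
sketch below. The values are the ones from the example above rather than
tuned recommendations, and bare numbers are read as milliseconds:

```haproxy
backend qa_site_com_http
    mode http
    option httpchk GET /ut.asp
    # Healthy server: one check every 1000 ms.
    # After the first failed check: the next check comes 200 ms later.
    # Worst case to reach "fall 2": 1000 ms + 200 ms = 1.2 s.
    server web_1 web1:80 cookie 3f06565277298ce80af6bbaab8c5b584 check inter 1000 fastinter 200 fall 2
    server web_2 web2:80 cookie bc62407de9ecf95bae662880b593a0d4 check inter 1000 fastinter 200 fall 2
```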

Post by John Marrett
We'll see if there's something we can do to get IIS to return a RST
before it receives the full request.

Alternatively, you could send a redirect to the home page and remove
any persistence cookie so that the user gets sent to another server.

Willy