Discussion:
PUT/POST request bodies sometimes are not proxied in version 1.8
Nick Vatamaniuc
2018-11-08 16:26:25 UTC
Permalink
Hello,

I'd like to report an issue where sometimes PUT/POST request bodies do not
get proxied. Headers are sent then the backend server waits to receive the
body. It never arrives and after 5 minutes the socket times out. This
happens in all the 1.8 releases up until 1.8.14 but it doesn't happen in
1.7. In fact one of the ways we mitigate the issue is by reverting back to
1.7.

Normally the requests go through without an issue. We see about 1M+
requests per day. However about 5 or 10 of those end up failing as
described. Is there anything we can do to debug the issue further, or were
there any changes that might lead to this specific bug in 1.8.x?

More details about our setup:
Lua is used on all requests
nbproc > 1
Not using threads
Request body sizes about 30-50KB

Some config settings:
option http-server-close
option log-health-checks
option log-separate-errors
option redispatch
retries 4
timeout client 150000
timeout connect 5000
timeout queue 5000
timeout server 3600000

$ /usr/sbin/haproxy -vv
HA-Proxy version 1.8.14-52e4d43 2018/09/20
Copyright 2000-2018 Willy Tarreau <***@haproxy.org>

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
-Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-unused-label
OPTIONS = USE_SLZ=1 USE_REGPARM=1 USE_THREAD=1 USE_OPENSSL=1 USE_LUA=1
USE_SYSTEMD=1 USE_STATIC_PCRE2=1 USE_PCRE2_JIT=1 USE_TFO=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.1 11 Sep 2018
Running on OpenSSL version : OpenSSL 1.1.1 11 Sep 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE2 version : 10.31 2018-02-12
PCRE2 library supports JIT : yes
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

Thank you
-Nick Vatamaniuc
Àbéjídé Àyodélé
2018-11-08 17:26:28 UTC
Permalink
I believe this is similar to our experience with:

https://www.mail-archive.com/***@formilux.org/msg31645.html

Abejide Ayodele
It always seems impossible until it's done. --Nelson Mandela
Post by Nick Vatamaniuc
Hello,
I'd like to report an issue where sometimes PUT/POST request bodies do not get proxied. Headers are sent then the backend server waits to receive the body. It never arrives and after 5 minutes the socket times out. This happens in all the 1.8 releases up until 1.8.14 but it doesn't happen in 1.7. In fact one of the ways we mitigate the issue is by reverting back to 1.7.
Normally the requests go through without an issue. We see about 1M+ requests per day. However about 5 or 10 of those end up failing as described. Is there anything we can do to debug the issue further, or were there any changes that might lead to this specific bug in 1.8.x?
Lua is used on all requests
nbproc > 1
Not using threads
Request body sizes about 30-50KB
option http-server-close
option log-health-checks
option log-separate-errors
option redispatch
retries 4
timeout client 150000
timeout connect 5000
timeout queue 5000
timeout server 3600000
$ /usr/sbin/haproxy -vv
HA-Proxy version 1.8.14-52e4d43 2018/09/20
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-unused-label
OPTIONS = USE_SLZ=1 USE_REGPARM=1 USE_THREAD=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_STATIC_PCRE2=1 USE_PCRE2_JIT=1 USE_TFO=1
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with OpenSSL version : OpenSSL 1.1.1 11 Sep 2018
Running on OpenSSL version : OpenSSL 1.1.1 11 Sep 2018
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.4
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE2 version : 10.31 2018-02-12
PCRE2 library supports JIT : yes
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
[SPOE] spoe
[COMP] compression
[TRACE] trace
Thank you
-Nick Vatamaniuc
Willy Tarreau
2018-11-09 04:01:26 UTC
Permalink
not sure if you're seeing 503, I think Nick is seeing 504 instead.

Willy
Willy Tarreau
2018-11-09 04:15:22 UTC
Permalink
Post by Nick Vatamaniuc
Hello,
I'd like to report an issue where sometimes PUT/POST request bodies do not
get proxied. Headers are sent then the backend server waits to receive the
body. It never arrives and after 5 minutes the socket times out. This
happens in all the 1.8 releases up until 1.8.14 but it doesn't happen in
1.7. In fact one of the ways we mitigate the issue is by reverting back to
1.7.
Normally the requests go through without an issue. We see about 1M+
requests per day. However about 5 or 10 of those end up failing as
described. Is there anything we can do to debug the issue further, or were
there any changes that might lead to this specific bug in 1.8.x?
I'm not aware of anything which could produce this situation. The low
rate tends to indicate a difficult race condition involving several
elements. Just trying to bisect the situation, do you know if your
application uses 100-continue or not ? Could you try if
"option http-buffer-request" changes anything ? This option makes haproxy
wait for the body before sending the request to the server. Not only this
reduces the number of concurrent connections on the server, but it will
also help figure where the problem happens. Ideally if you can spot a
situation where such a connection is currently frozen, a "show sess all"
issued over the CLI could significantly help, by showing the session's
exact state. Be careful the output can be huge.

If you want to periodically filter the output to search for such a frozen
session, instead you can do this :

1) spot the candidates, those aged 1 minute or more :
$ echo "show sess" | socat - /path/to/socket | grep age='[0-9]*m'
0xb5fff600: proto=tcpv6 src=1.2.3.4:57355 fe=http-in be=<NONE> srv=<none> ts=02 age=1m5s calls=1 rq[f=400000h,i=0,an=36h,rx=25s,wx=,ax=5s] rp[f=80000000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=1,ex=] s1=[0,10h,fd=-1,ex=] exp=5s

2) show the session's details :
$ echo "show sess 0xb5fff600" | socat - /path/to/socket

This can simply be scripted to do this every 30 seconds :

while sleep 30; do
sess=$(echo "show sess" | socat - /path/to/socket | grep age='[0-9]*m' | cut -f1 -d:)
if [ -n "$sess" ]; then
echo "=========== $(date) ============="
for i in $sess; do echo -n "show sess $i;"done | socat - /path/to/socket
fi >> /path/to/faulty-sessions.log
done
(...)
I'm not seeing anything wrong nor suspicious here.

Thanks,
Willy
Nick Vatamaniuc
2018-11-09 18:58:09 UTC
Permalink
Hi Willy,

Thanks for responding.

Requests were not using 100-continue. Also setting http-buffer-request
didn't seem to make any difference.

I captured a few sessions as you suggested. The architecture looks
something like this:

(client) -> [http/tcp] -> fe=http_proxy,be=http_proxy,srv=ratelimit-5 ->
[unix socket] ->
-> fe=ratelimt-5,be=testy017,srv=dbX -> [http/tcp] -> (backend server)

This is a session captured to the ratelimit-5 server::

0x2533f00: [09/Nov/2018:17:25:57.386724] id=137665 proto=tcpv4 source=
208.43.229.156:56109
flags=0xce, conn_retries=4, srv_conn=0x21e0cb0, pend_pos=(nil)
frontend=http_proxy (id=2 mode=http), listener=? (id=1) addr=
208.43.229.156:56109
backend=http_proxy (id=2 mode=http) addr=unix
server=ratelimit-5 (id=1) addr=unix
task=0x254b530 (state=0x04 nice=0 calls=14 exp=56m10s tmask=0x1 age=3m50s)
txn=0x2567f10 flags=0x88600000 meth=3 status=-1 req.st=MSG_DATA
rsp.st=MSG_RPBEFORE
waiting=0
si[0]=0x2534148 (state=EST flags=0x08 endp0=NONE:0x24ca6e0 exp=<NEVER>,
et=0x000)
si[1]=0x2534170 (state=EST flags=0x18 endp1=NONE:0x2484810 exp=<NEVER>,
et=0x000)
co0=0x24cd390 ctrl=tcpv4 xprt=SSL mux=PASS data=STRM
target=LISTENER:0x1ff91c0
flags=0x80203306 fd=254 fd.state=25 fd.cache=0 updt=0 fd.tmask=0x1
co1=0x2438810 ctrl=unix_stream xprt=RAW mux=PASS data=STRM
target=SERVER:0x21e0cb0
flags=0x00203306 fd=287 fd.state=25 fd.cache=0 updt=0 fd.tmask=0x1
req=0x2533f10 (f=0x20870000 an=0x28000 pipe=0 tofwd=1470 total=30720)
an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
buf=0xb1b6a0 data=0xb1b6b4 o=0 p=0 req.next=0 i=0 size=0
res=0x2533f50 (f=0xa0400000 an=0xda00000 pipe=0 tofwd=0 total=0)
an_exp=<NEVER> rex=56m10s wex=<NEVER>
buf=0xb1b6a0 data=0xb1b6b4 o=0 p=0 rsp.next=0 i=0 size=0


And this is one captured from ratelimit-5 to backend server:

0x2560d80: [09/Nov/2018:17:26:24.347239] id=578926 proto=unix_stream source=
208.43.229.156:56129
flags=0xce, conn_retries=4, srv_conn=0x2234e20, pend_pos=(nil)
frontend=ratelimit-5 (id=3 mode=http), listener=? (id=1) addr=
208.43.229.156:56129
backend=testy017 (id=2109660078 mode=http) addr=10.125.66.126:40530
server=db2 (id=2) addr=10.56.221.163:5984
task=0x2496430 (state=0x04 nice=0 calls=3 exp=55m26s tmask=0x1 age=4m33s)
txn=0x2561190 flags=0x18700000 meth=3 status=-1 req.st=MSG_DATA
rsp.st=MSG_RPBEFORE
waiting=0
si[0]=0x2560fc8 (state=EST flags=0x08 endp0=NONE:0x251ddf0 exp=<NEVER>,
et=0x000)
si[1]=0x2560ff0 (state=EST flags=0x18 endp1=NONE:0x1ff10c0 exp=<NEVER>,
et=0x000)
co0=0x22ce6e0 ctrl=unix_stream xprt=RAW mux=PASS data=STRM
target=LISTENER:0x21e5ed0
flags=0xa0203306 fd=272 fd.state=25 fd.cache=0 updt=0 fd.tmask=0x1
co1=0x25f8c80 ctrl=tcpv4 xprt=RAW mux=PASS data=STRM
target=SERVER:0x2234e20
flags=0x00203306 fd=275 fd.state=25 fd.cache=0 updt=0 fd.tmask=0x1
req=0x2560d90 (f=0x20841080 an=0x28000 pipe=0 tofwd=1470 total=30969)
an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
buf=0xb1b6a0 data=0xb1b6b4 o=0 p=0 req.next=0 i=0 size=0
res=0x2560dd0 (f=0xa0400000 an=0xda00000 pipe=0 tofwd=0 total=0)
an_exp=<NEVER> rex=55m26s wex=<NEVER>
buf=0xb1b6a0 data=0xb1b6b4 o=0 p=0 rsp.next=0 i=0 size=0


I can generate enough load to make this happen with the wrk2 benchmark :
(https://github.com/giltene/wrk2)

$ ./wrk --script randbody.lua --connections 500 --duration 10m --threads 10
\
--rate 1000 --timeout 310 https://...


------------------- randbody.lua -------------------------------

function randstring(length)
local array = {}
for i = 1, length do
array[i] = string.char(math.random(65, 90))
end
return table.concat(array)
end

wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.headers["Authorization"] = "xxxxx..."
wrk.body = '{"data": "' .. randstring(32000) .. '"}'

--------------- (end randbody.lua) ------------------------------

And the output looks like this:

----------------------------------------------------------------
Running 10m test @ https://....
10 threads and 500 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.08s 30.15s 5.00m 98.09%
Req/Sec 93.18 91.83 456.00 45.47%
559450 requests in 10.00m, 317.46MB read
Non-2xx or 3xx responses: 41
Requests/sec: 932.33
Transfer/sec: 541.74KB

---------------------------------------------------------------

Thank you,
-Nick Vatamaniuc
Post by Willy Tarreau
Post by Nick Vatamaniuc
Hello,
I'd like to report an issue where sometimes PUT/POST request bodies do
not
Post by Nick Vatamaniuc
get proxied. Headers are sent then the backend server waits to receive
the
Post by Nick Vatamaniuc
body. It never arrives and after 5 minutes the socket times out. This
happens in all the 1.8 releases up until 1.8.14 but it doesn't happen in
1.7. In fact one of the ways we mitigate the issue is by reverting back
to
Post by Nick Vatamaniuc
1.7.
Normally the requests go through without an issue. We see about 1M+
requests per day. However about 5 or 10 of those end up failing as
described. Is there anything we can do to debug the issue further, or
were
Post by Nick Vatamaniuc
there any changes that might lead to this specific bug in 1.8.x?
I'm not aware of anything which could produce this situation. The low
rate tends to indicate a difficult race condition involving several
elements. Just trying to bisect the situation, do you know if your
application uses 100-continue or not ? Could you try if
"option http-buffer-request" changes anything ? This option makes haproxy
wait for the body before sending the request to the server. Not only this
reduces the number of concurrent connections on the server, but it will
also help figure where the problem happens. Ideally if you can spot a
situation where such a connection is currently frozen, a "show sess all"
issued over the CLI could significantly help, by showing the session's
exact state. Be careful the output can be huge.
If you want to periodically filter the output to search for such a frozen
$ echo "show sess" | socat - /path/to/socket | grep age='[0-9]*m'
0xb5fff600: proto=tcpv6 src=1.2.3.4:57355 fe=http-in be=<NONE>
srv=<none> ts=02 age=1m5s calls=1 rq[f=400000h,i=0,an=36h,rx=25s,wx=,ax=5s]
rp[f=80000000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=1,ex=]
s1=[0,10h,fd=-1,ex=] exp=5s
$ echo "show sess 0xb5fff600" | socat - /path/to/socket
while sleep 30; do
sess=$(echo "show sess" | socat - /path/to/socket | grep
age='[0-9]*m' | cut -f1 -d:)
if [ -n "$sess" ]; then
echo "=========== $(date) ============="
for i in $sess; do echo -n "show sess $i;"done | socat - /path/to/socket
fi >> /path/to/faulty-sessions.log
done
(...)
I'm not seeing anything wrong nor suspicious here.
Thanks,
Willy
Loading...