Discussion:
A few problems seen in haproxy (threads, connections)
Krishna Kumar (Engineering)
2018-10-02 15:48:19 UTC
Hi Willy, and community developers,

I am not sure if I am doing something wrong, but wanted to report
some issues that I am seeing. Please let me know if this is a problem.

1. HAProxy system:
Kernel: 4.17.13
CPU: 48 core E5-2670 v3
Memory: 128GB
NIC: Mellanox 40g with IRQ pinning

2. Client: 48 cores, similar to the server above. Test command line:
wrk -c 4800 -t 48 -d 30s http://<IP:80>/128

3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout as of Oct 2nd).
# haproxy-git -vv
HA-Proxy version 1.9-dev3 2018/09/29
Copyright 2000-2018 Willy Tarreau <***@haproxy.org>

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
-Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
-Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols markes as <default> cannot be specified using 'proto' keyword)
h2 : mode=HTTP side=FE
<default> : mode=TCP|HTTP side=FE|BE

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

4. HAProxy results for #processes and #threads (RPS as reported by wrk):

 #    Threads-RPS    Procs-RPS
 1          20903        19280
 2          46400        51045
 4          96587       142801
 8         172224       254720
12         210451       437488
16         173034       437375
24          79069       519367
32          55607       586367
48          31739       596148

5. Lock stats for 1.9-dev3: some write locks took much longer on average to
acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
Stats about Lock FD:
# write lock : 143933900
# write unlock: 143933895 (-5)
# wait time for write : 11370.245 msec
# wait time for write/lock: 78.996 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_RQ:
# write lock : 2062874
# write unlock: 2062875 (1)
# wait time for write : 7820.234 msec
# wait time for write/lock: 3790.941 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_WQ:
# write lock : 2601227
# write unlock: 2601227 (0)
# wait time for write : 5019.811 msec
# wait time for write/lock: 1929.786 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock POOL:
# write lock : 2823393
# write unlock: 2823393 (0)
# wait time for write : 11984.706 msec
# wait time for write/lock: 4244.788 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LISTENER:
# write lock : 184
# write unlock: 184 (0)
# wait time for write : 0.011 msec
# wait time for write/lock: 60.554 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock PROXY:
# write lock : 291557
# write unlock: 291557 (0)
# wait time for write : 109.694 msec
# wait time for write/lock: 376.235 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock SERVER:
# write lock : 1188511
# write unlock: 1188511 (0)
# wait time for write : 854.171 msec
# wait time for write/lock: 718.690 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LBPRM:
# write lock : 1184709
# write unlock: 1184709 (0)
# wait time for write : 778.947 msec
# wait time for write/lock: 657.501 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock BUF_WQ:
# write lock : 669247
# write unlock: 669247 (0)
# wait time for write : 252.265 msec
# wait time for write/lock: 376.939 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock STRMS:
# write lock : 9335
# write unlock: 9335 (0)
# wait time for write : 0.910 msec
# wait time for write/lock: 97.492 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock VARS:
# write lock : 901947
# write unlock: 901947 (0)
# wait time for write : 299.224 msec
# wait time for write/lock: 331.753 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
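
(Quick cross-check of the per-lock averages: each "wait time for write/lock"
is simply the total wait time divided by the number of acquisitions, e.g. for
POOL:

echo "11984.706 * 1000000 / 2823393" | bc -l   # => ~4244.8 nsec per write lock

which matches the 4244.788 nsec reported above.)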

6. CPU utilization after the test (processes and threads): haproxy-1.9-dev3
keeps running at 4800% (48 CPUs) for 30 seconds after the test is done. This
behavior was not seen with 1.8.14. I ran the following command for both:
"ss -tnp | awk '{print $1}' | sort | uniq -c | sort -n"
1.8.14 during test:
451 SYN-SENT
9166 ESTAB
1.8.14 after test:
2 ESTAB

1.9-dev3 during test:
109 SYN-SENT
9400 ESTAB
1.9-dev3 after test:
2185 CLOSE-WAIT
2187 ESTAB
All connections in CLOSE-WAIT were from the client, while all connections in
ESTAB state were to the servers. This lasted for 30 seconds.
On the client system, all sockets were in FIN-WAIT-2 state:
2186 FIN-WAIT-2
The matching counts (2185/2186) suggest that the client closed its connections
but haproxy did not close its side of the sockets for 30 seconds. This also
results in high CPU utilization on haproxy for some reason (100% for each
process for 30 seconds), which is unexpected since the remote side has already
closed the socket.
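
If it helps, the same ss output can also be split by peer address to separate
the client-facing and backend-facing sides. A rough one-liner (the 10.x match
for the backend subnet is only an assumption, adjust to your addressing):

ss -tn | awk 'NR>1 {split($5,a,":"); side=(a[1] ~ /^10\./) ? "backend" : "client"; print $1, side}' | sort | uniq -c

This gives the kind of per-side CLOSE-WAIT vs ESTAB split described above.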

7. Configuration file for process mode:
global
    daemon
    maxconn 26000
    nbproc 48
    stats socket /var/run/ha-1-admin.sock mode 600 level admin process 1
    # (and so on for 48 processes).

defaults
    option http-keep-alive
    balance leastconn
    retries 2
    option redispatch
    maxconn 25000
    option splice-response
    option tcp-smart-accept
    option tcp-smart-connect
    option splice-auto
    timeout connect 5000ms
    timeout client 30000ms
    timeout server 30000ms
    timeout client-fin 30000ms
    timeout http-request 10000ms
    timeout http-keep-alive 75000ms
    timeout queue 10000ms
    timeout tarpit 15000ms

frontend fk-fe-upgrade-80
    mode http
    default_backend fk-be-upgrade
    bind <VIP>:80 process 1
    # (and so on for 48 processes).

backend fk-be-upgrade
    mode http
    default-server maxconn 2000 slowstart
    # 58 server lines follow, e.g.: "server <name> <ip:80>"
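
For reference, each per-process stats socket declared above can be queried
individually, e.g. with socat. The ha-N-admin.sock naming for processes 2-48
is only an assumption based on the comment in the config:

echo "show info" | socat stdio /var/run/ha-1-admin.sock
# sum cumulative requests across all 48 processes (assumed socket naming)
for i in $(seq 1 48); do
    echo "show info" | socat stdio /var/run/ha-$i-admin.sock
done | awk -F: '/^CumReq/ {sum += $2} END {print sum}'

This is one way to cross-check the per-process RPS numbers reported by wrk.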

8. Configuration file for thread mode:
global
    daemon
    maxconn 26000
    stats socket /var/run/ha-1-admin.sock mode 600 level admin
    nbproc 1
    nbthread 48
    # cpu-map auto:1/1-48 0-39

defaults
    option http-keep-alive
    balance leastconn
    retries 2
    option redispatch
    maxconn 25000
    option splice-response
    option tcp-smart-accept
    option tcp-smart-connect
    option splice-auto
    timeout connect 5000ms
    timeout client 30000ms
    timeout server 30000ms
    timeout client-fin 30000ms
    timeout http-request 10000ms
    timeout http-keep-alive 75000ms
    timeout queue 10000ms
    timeout tarpit 15000ms

frontend fk-fe-upgrade-80
    mode http
    bind <VIP>:80 process 1/1-48
    default_backend fk-be-upgrade

backend fk-be-upgrade
    mode http
    default-server maxconn 2000 slowstart
    # 58 server lines follow, e.g.: "server <name> <ip:80>"
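
With the thread-mode config above (nbproc 1, nbthread 48), the post-test CPU
usage can be broken down per thread to see where the time goes; a minimal
sketch, assuming a single worker pid:

# live per-thread CPU view of the haproxy process
top -H -p "$(pidof haproxy)"
# or sampled once per second
pidstat -t -p "$(pidof haproxy)" 1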

I have also captured 'perf' output for the system for threads vs processes,
and can send it later if required.

Thanks,
- Krishna
Krishna Kumar (Engineering)
2018-10-04 03:12:53 UTC
Re-sending in case this mail was missed. To summarise the 3 issues seen:

1. Performance drops 18x at higher thread counts (nbthread) compared to the
same number of processes (nbproc).
2. CPU utilisation remains at 100% for 30 seconds after wrk finishes (seen on
1.9-dev3 with both nbproc and nbthread).
3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy they remain
in either CLOSE-WAIT (towards the clients) or ESTAB (towards the backend
servers), until the server/client timeout expires.

The tests for threads and processes were done on the same systems, so there
is no difference in system parameters.

Thanks,
- Krishna


Илья Шипицин
2018-10-04 04:51:20 UTC
Load testing like this is quite useful.
Can you describe the overall setup? (I want to reproduce it and play with it.)

Krishna Kumar (Engineering)
2018-10-04 04:59:27 UTC
Sure.

1. Client: use one of the following two setups:
- a single baremetal (48 core, 40g) system, running:
"wrk -c 4800 -t 48 -d 30s http://<IP>:80/128", or
- 100 2-core VMs, each running:
"wrk -c 16 -t 2 -d 30s http://<IP>:80/128",
with the results summarized via a parallel-ssh setup (a rough sketch
follows after this list).

2. HAProxy running on a single baremetal host (same system config as the
client: 48 core, 40g, 4.17.13 kernel, IRQs pinned to different cores of the
same NUMA node, irqbalance killed), with the haproxy configuration file as
given in my first mail. Around 60 backend servers are configured in haproxy.

3. Backend servers are 2-core VMs running nginx and serving a file called
"/128", which is 128 bytes in size.
Let me know if you need more information.

Thanks,
- Krishna
Илья Шипицин
2018-10-04 06:38:57 UTC
Please share the haproxy config, the nginx config, and any non-default sysctls.

As a side note, can you have a look at the "dmesg" output? Do you have nf
conntrack enabled? What are its limits?

Krishna Kumar (Engineering)
2018-10-04 06:52:54 UTC
1. haproxy config: same as given above (both the process and thread configs
were given in the mail).
2. nginx: default, no changes.
3. sysctls: nothing set. All other changes are as described earlier (e.g.
irqbalance, IRQ pinning, etc.).
4. nf_conntrack: disabled.
5. dmesg: no messages.
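
For completeness, the conntrack status and limits asked about can be
double-checked with something like (output is environment-specific):

lsmod | grep -i conntrack
sysctl net.netfilter.nf_conntrack_max net.netfilter.nf_conntrack_count 2>/dev/null
dmesg | tail -n 20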

With the same system and settings, threads give 18x lower RPS than processes,
along with the other two issues described in my mail today.
Илья Шипицин
2018-10-04 07:28:05 UTC
What I'm going to try (when I have some spare time) is sampling with google
perftools:

https://github.com/gperftools/gperftools

They are great for CPU profiling.
You can try them yourself if you have the time/wish :)
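
A rough sketch of one way to do that with haproxy (the library and binary
paths are assumptions, and running with -d keeps haproxy in the foreground so
the profile is written on exit):

LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=/tmp/haproxy.prof \
    haproxy -d -f /etc/haproxy/haproxy.cfg
pprof --text /usr/sbin/haproxy /tmp/haproxy.prof | head -30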


чт, 4 Пкт. 2018 г. в 11:53, Krishna Kumar (Engineering) <
Post by Krishna Kumar (Engineering)
1. haproxy config: Same as given above (both processes and threads were
given in the mail)
2. nginx: default, no changes.
3. sysctl's: nothing set. All changes as described earlier (e.g.
irqbalance, irq pinning, etc).
4. nf_conntrack: disabled
5. dmesg: no messages.
With the same system and settings, threads gives 18x lesser RPS than
processes, along with
the other 2 issues given in my mail today.
Post by Илья Шипицин
haproxy config, nginx config
non default sysctl (if any)
as a side note, can you have a look at "dmesg" output ? do you have nf
conntrack enabled ? what are its limits ?
чт, 4 Пкт. 2018 г. в 9:59, Krishna Kumar (Engineering) <
Post by Krishna Kumar (Engineering)
Sure.
- a single baremetal (48 core, 40g) system
Run: "wrk -c 4800 -t 48 -d 30s http://<IP>:80/128", or,
- 100 2 core vm's.
Run "wrk -c 16 -t 2 -d 30s http://<IP>:80/128" from
each VM and summarize the results using some
parallel-ssh setup.
2. HAProxy running on a single baremetal (same system config
as client - 48 core, 40g, 4.17.13 kernel, irq tuned to use different
cores of the same NUMA node for each irq, kill irqbalance, with
haproxy configuration file as given in my first mail. Around 60
backend servers are configured in haproxy.
3. Backend servers are 2 core VM's running nginx and serving
a file called "/128", which is 128 bytes in size.
Let me know if you need more information.
Thanks,
- Krishna
Post by Илья Шипицин
load testing is somewhat good.
can you describe an overall setup ? (I want to reproduce and play with it)
чт, 4 Пкт. 2018 г. в 8:16, Krishna Kumar (Engineering) <
Post by Krishna Kumar (Engineering)
1. Performance drops 18x with higher number of nbthreads as compared
to nbprocs.
2. CPU utilisation remains at 100% after wrk finishes for 30 seconds
(for 1.9-dev3
for nbprocs and nbthreads).
3. Sockets on client remain in FIN-WAIT-2, while on HAProxy it remains
in either
CLOSE-WAIT (towards clients) and ESTAB (towards the backend
servers), till
the server/client timeout expires.
The tests for threads and processes were done on the same systems, so
there is
no difference in system parameters.
Thanks,
- Krishna
Krishna Kumar (Engineering)
2018-10-04 08:14:03 UTC
Permalink
Thanks, will take a look!
Post by Илья Шипицин
what I'm going to try (when I have some spare time) is sampling with
google perftools
https://github.com/gperftools/gperftools
they are great for cpu profiling.
you can try them yourself if you have time/wish :)
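For reference, a minimal sketch of sampling haproxy with the gperftools CPU
profiler (the library path and config path below are assumptions, and the
profile is only flushed when the process exits cleanly; the report tool is
installed as pprof or google-pprof depending on the distribution):
  # run haproxy in the foreground with the profiler preloaded
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so \
  CPUPROFILE=/tmp/haproxy.prof haproxy -db -f /etc/haproxy/haproxy.cfg
  # after it exits, print the hottest functions
  google-pprof --text $(command -v haproxy) /tmp/haproxy.prof | head -30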
Thu, 4 Oct 2018 at 11:53, Krishna Kumar (Engineering) <
Post by Krishna Kumar (Engineering)
1. haproxy config: Same as given above (both processes and threads were
given in the mail)
2. nginx: default, no changes.
3. sysctls: nothing set. All changes are as described earlier (e.g.
irqbalance, irq pinning, etc).
4. nf_conntrack: disabled
5. dmesg: no messages.
With the same system and settings, threads give 18x lower RPS than
processes, along with the other 2 issues given in my mail today.
Post by Илья Шипицин
haproxy config, nginx config
non-default sysctls (if any)
as a side note, can you have a look at "dmesg" output ? do you have nf
conntrack enabled ? what are its limits ?
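As an aside on the conntrack question, a quick way to check whether
nf_conntrack is active and how close it is to its limit (a sketch; the two
sysctls only exist while the module is loaded):
  lsmod | grep nf_conntrack
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max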
Krishna Kumar (Engineering)
2018-10-05 08:55:13 UTC
Permalink
Sorry for repeating once again, but this is my last unsolicited
mail on this topic. Any directions for what to look out for?
Thanks,
- Krishna
Willy Tarreau
2018-10-05 21:33:34 UTC
Permalink
Hi Krishna,
Post by Krishna Kumar (Engineering)
Sorry for repeating once again, but this is my last unsolicited
mail on this topic. Any directions for what to look out for?
Sorry, but I didn't even have the time to read your mail over the last
two days and still have a huge backlog pending. Maybe next week.
Regards,
Willy
Willy Tarreau
2018-10-10 13:09:28 UTC
Permalink
Hi Krishna,
On Tue, Oct 02, 2018 at 09:18:19PM +0530, Krishna Kumar (Engineering) wrote:
(...)
Post by Krishna Kumar (Engineering)
Kernel: 4.17.13,
CPU: 48 core E5-2670 v3
Memory: 128GB memory
NIC: Mellanox 40g with IRQ pinning
wrk -c 4800 -t 48 -d 30s http://<IP:80>/128
3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout as
of
Oct 2nd).
# haproxy-git -vv
HA-Proxy version 1.9-dev3 2018/09/29
(...)
Post by Krishna Kumar (Engineering)
4. HAProxy results for #processes and #threads
# Threads-RPS Procs-RPS
1 20903 19280
2 46400 51045
4 96587 142801
8 172224 254720
12 210451 437488
16 173034 437375
24 79069 519367
32 55607 586367
48 31739 596148
Our largest thread test was on 12 cores and it happens that in your case
it's also the optimal one.
However I do have some comments about your config before going back to the numbers.
Post by Krishna Kumar (Engineering)
# cpu-map auto:1/1-48 0-39
=> you must absolutely pin your processes, and they must be pinned
to cores *not* shared with the network card. That's critical.
Moreover it's also important that threads are not split across
multiple physical CPUs because the remote L3 cache access time
over QPI/UPI is terrible. When you run on 12 threads with two
12-cores/24-threads CPUs, you could very well have haproxy using
12 threads from 6 cores, and the NIC using 12 threads from the
other 6 cores of the same physical CPU. The second socket is,
as usual, useless for anything requiring low latency. However
it's perfect to run SSL. So you could be interested in testing
if running the NIC on one socket (try to figure what node the
PCIe lanes are physically connected to), and haproxy on the other
one. It *could* be possible that you get more performance from 12
cores of each but I strongly doubt it based on a number of tests.
If you use SSL however it's different as you will benefit from
lots of cores much more than low latency.
Post by Krishna Kumar (Engineering)
bind <VIP>:80 process 1/1-48
=> it's also critical for scalability to have individual bind lines. Here
you have a single socket accessed from all 48 threads, so there's no
efficient thread load balancing. By having this:
bind <VIP>:80 process 1/1
bind <VIP>:80 process 1/2
...
bind <VIP>:80 process 1/47
bind <VIP>:80 process 1/48
You will let the kernel perform the load balancing and distribute a
fair load to all threads. This way none of them will risk picking a
larger share of the incoming connections than optimal. I know it's
annoying to configure at the moment; I've been thinking about having
a way to automatically iterate from a single config line (like the
"auto" feature of cpu-map), but for now it's not done.
Post by Krishna Kumar (Engineering)
5. Lock stats for 1.9-dev3: Some write locks on average took a lot more time
# write lock : 143933900
# write unlock: 143933895 (-5)
# wait time for write : 11370.245 msec
This one definitely is huge. We know some work is still needed on this lock
and that there are still a few low hanging fruits but not much savings to
expect short term. This output is very revealing however of the importance
of this lock.
Post by Krishna Kumar (Engineering)
# wait time for write/lock: 78.996 nsec
That's roughly the time it takes to access the other CPU's cache, so using
your two sockets for the same process definitely hurts a lot here.
Post by Krishna Kumar (Engineering)
# write lock : 2062874
# write unlock: 2062875 (1)
# wait time for write : 7820.234 msec
This one is still far too large for what we'd hope, even though it
has significantly shrunk since 1.8. It could be related to the poor
distribution of the incoming connections across threads.
Post by Krishna Kumar (Engineering)
# wait time for write/lock: 3790.941 nsec
Wow, 3.8 microseconds to acquire the write lock is a lot! I'm starting
to suspect some longer tree walks than expected. Again, lack of fairness
between threads can make this significantly worse than it should.
Post by Krishna Kumar (Engineering)
# write lock : 2601227
# write unlock: 2601227 (0)
# wait time for write : 5019.811 msec
# wait time for write/lock: 1929.786 nsec
For this one I had an idea that could significantly improve the situation.
It is unlikely to make it into 1.9 though.
Post by Krishna Kumar (Engineering)
# write lock : 2823393
# write unlock: 2823393 (0)
# wait time for write : 11984.706 msec
Interesting. Very interesting... I thought we merged the lock-free pool
code. Maybe I was mistaken. Or I misunderstood something about the nature
of the changes that went in early during the 1.9-dev cycle. This one is
not expected to be hard to address, and I even see how we could go further
with a small thread-local cache since I developed such a thing a few years
ago (though it was over-engineered) that could easily be retrofitted here.
Post by Krishna Kumar (Engineering)
# wait time for write/lock: 4244.788 nsec
Looks like a malloc() time. Maybe the measure was huge during the ramp-up
and amortized later.
Post by Krishna Kumar (Engineering)
# write lock : 291557
# write unlock: 291557 (0)
# wait time for write : 109.694 msec
# wait time for write/lock: 376.235 nsec
Good to know, could still be improved.
Post by Krishna Kumar (Engineering)
# write lock : 1188511
# write unlock: 1188511 (0)
# wait time for write : 854.171 msec
# wait time for write/lock: 718.690 nsec
Same here!
Post by Krishna Kumar (Engineering)
# write lock : 1184709
# write unlock: 1184709 (0)
# wait time for write : 778.947 msec
# wait time for write/lock: 657.501 nsec
Here it will be less easy because it depends on the LB
algorithm.
Post by Krishna Kumar (Engineering)
# write lock : 669247
# write unlock: 669247 (0)
# wait time for write : 252.265 msec
# wait time for write/lock: 376.939 nsec
Post-1.9 I think we could make this one thread-local for the vast majority
of buffer allocations. Short term, we would possibly try to improve the
situation to avoid taking the lock for all successes. It's tricky but I
think possible.
Post by Krishna Kumar (Engineering)
# write lock : 901947
# write unlock: 901947 (0)
# wait time for write : 299.224 msec
# wait time for write/lock: 331.753 nsec
This one is used by variables.
Post by Krishna Kumar (Engineering)
6. CPU utilization after test for processes/threads: haproxy-1.9-dev3 runs
at 4800% (48 cpus) for 30 seconds after the test is done. For 1.8.14,
this behavior was not seen.
We've had the exact opposite situation yesterday on another case. It really
depends on the scheduling, but it's possible you've hit a bug in 1.9-dev3
though.

(...)
Post by Krishna Kumar (Engineering)
2185 CLOSE-WAIT
2187 ESTAB
All connections that were in CLOSE-WAIT were from the client,
I suspect we still have a few cases where some client connections may
only be checked upon client timeout.
Many thanks, your test is very useful at this stage of the development!
There definitely are some performance points we won't address for 1.9,
but it confirms some observations and/or completes them.
Best regards,
Willy
Krishna Kumar (Engineering)
2018-10-11 02:48:21 UTC
Permalink
Hi Willy,
Thank you very much for the in-depth analysis and configuration setting
suggestions.
I believe I have got the 3 key items to continue based on your mail:
1. Thread pinning
2. Fix system irq pinning accordingly
3. Listen on all threads.
I will post the configuration changes and the results.
Regards,
- Krishna
Willy Tarreau
2018-10-11 03:07:40 UTC
Permalink
Post by Krishna Kumar (Engineering)
Hi Willy,
Thank you very much for the in-depth analysis and configuration setting
suggestions.
1. Thread pinning
2. Fix system irq pinning accordingly
3. Listen on all threads.
I will post the configuration changes and the results.
By the way, please pull the latest master fixes. I've addressed two issues
there with locking :
- one where the scheduler work was slightly too high, increasing the time
spent on RQ lock
- another one where I messed up on a fix, causing lock-free pools to be
disabled (as seen in your output, where the POOL lock appears a lot)
On some tests I've run here, I've found the stick-tables lock to be a
bottleneck when tracking is enabled. I don't have a short-term solution
to this, but looking at the code it's obvious that it can be significantly
improved (though it will take quite some time). I'll probably at least
try to replace it with an RW lock as I think it could improve the situation.
The FD lock is another one requiring some rework. I'm certain it's possible;
I just don't know whether it will degrade low-thread-count performance by
using too many atomic ops instead. We'll have to experiment.
Cheers,
Willy
Krishna Kumar (Engineering)
2018-10-11 03:23:02 UTC
Permalink
Thanks, will do that.
Krishna Kumar (Engineering)
2018-10-11 06:34:31 UTC
Permalink
I must say the improvements are pretty impressive!
Earlier number reported with 24 processes: 519K
Earlier number reported with 24 threads: 79K
New RPS with system irq tuning, today's git,
configuration changes, 24 threads: 353K
Old code with same tuning gave: 290K
My test machine is a 2-NUMA-node server, with CPUs 0, 2, ..., 22 on node 0 and
1, 3, ..., 23 on node 1. That adds up to 24 cpus; the remaining 24 logical cpus
are hyperthreads (node 0 has 24, 26, ..., 46 and node 1 has 25, 27, ..., 47).
This may explain why it scales well up to 24. The 16 irqs of the NIC are pinned
to cpus 0, 2, ..., 30.
Hoping performance further improves.
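For what it's worth, a quick way to double-check that topology and the NIC IRQ
placement from the shell (the 'mlx' match assumes the Mellanox driver names its
interrupt vectors that way):
  # cpus belonging to each NUMA node
  cat /sys/devices/system/node/node*/cpulist
  # allowed cpus for each NIC interrupt
  for irq in $(awk '/mlx/ {sub(":","",$1); print $1}' /proc/interrupts); do
      echo -n "irq $irq -> "; cat /proc/irq/$irq/smp_affinity_list
  done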
12 threads: 280K
16 threads: 318K
24 threads: 353K (occasional drop till 330K)
32 threads: 238K
I am attaching 2 text files of the lock metrics for 24 and 32 threads. A
vimdiff shows the differences nicely (fd, task_rq, task_wq, proxy, server,
lbprm and buf_wq increased significantly).
Thanks!
Willy Tarreau
2018-10-15 05:18:07 UTC
Permalink
Hi Krishna,
Post by Krishna Kumar (Engineering)
I must say the improvements are pretty impressive!
Earlier number reported with 24 processes: 519K
Earlier number reported with 24 threads: 79K
New RPS with system irq tuning, today's git,
configuration changes, 24 threads: 353K
Old code with same tuning gave: 290K
OK, that's much better, but I'm still horrified by the time taken in
the load balancing algorithm. I thought it could be fwlc_reposition(),
which contains an eb32_delete()+eb32_insert(), so I decided to replace
this with a new eb32_move() which moves the node within the tree, and
it didn't change anything here. Also, I cannot manage to reach anywhere
near that much time spent in this lock (300ms here, 58s for you). There
is one possible difference that might explain it: do you have a maxconn
setting on your servers? If so, is it possible that it's reached? You
can take a look at your stats page and see if the "Queue/Max" entry for
any backend is non-null.
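In case it helps, the same information is available without the HTML page via
the stats socket from the earlier configs; a sketch assuming socat is installed
(qcur/qmax are the queue columns of "show stat"):
  echo "show stat" | socat stdio /var/run/ha-1-admin.sock \
    | awk -F, 'NR==1 {for (i=1;i<=NF;i++) {if ($i=="qcur") qc=i; if ($i=="qmax") qm=i}}
               NR>1 && $qm+0 > 0 {print $1 "/" $2, "qcur=" $qc, "qmax=" $qm}'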
Indeed, I'm seeing that once a server is saturated, we skip it for the
next one. This part can be expensive. Ideally we should remove such servers
from the tree until they're unblocked, but there is one special case making
this difficult, which is the dynamic limitation (minconn+maxconn+fullconn).
However I think we could improve this so that only this use case would be
affected and not the other ones.
I'm also seeing that this lock could be replaced by an RW lock. But before
taking a deeper look, I'm interested in verifying that it's indeed the
situation you're facing.
Thanks,
Willy
Willy Tarreau
2018-10-15 17:44:49 UTC
Permalink
Hi again,
finally I got rid of the FD lock for single-threaded accesses (most of
them), and based on Olivier's suggestion, I implemented a per-thread
wait queue, and cache-aligned some list heads to avoid undesired cache
line sharing. For me all of this combined resulted in a performance
increase of 25% on a 12-thread workload. I'm interested in your test
results; all of this is in the latest master.
If you still see LBPRM a lot, I can send you the experimental patch
to move the element inside the tree without unlinking/relinking it
and we can see if that provides any benefit or not (I'm not convinced).
Cheers,
Willy
Krishna Kumar (Engineering)
2018-10-16 06:42:14 UTC
Permalink
Hi Willy,
My systems were out of rotation for some other tests, so I did not get
to this till now. I have pulled the latest bits just now and tested. Regarding
maxconn, I simply set maxconn in global/defaults to 1 million and have
this line in the backend section:
default-server maxconn 1000000
I have not seen the Queue/Max you mentioned earlier.
The FD time has gone down to zero, but the LB time has increased
about 50% from last time (7700 ns to 11600 ns, I am using 'balance
leastconn').
Test was run for 1 minute:
$ wrk -c 4800 -t 48 -d 60s http://www.flipkart.com/128
The results were for 32 threads, which is the same configuration I tested
with earlier. Both of these tests were done with threads pinned to NUMA-1
cores (cores 1, 3, 5, ..., 47), and irqs to NUMA-0 (0, 2, 4, ..., 46). However,
the cpu list recycles from 1-47 back to 1-15 for the thread pinning, so that
may explain the much higher lock numbers that I am seeing. When I changed
this to use all cpus (0-31), the LBPRM lock took 74339.117 ns per operation,
but performance dropped from 210K to 80K.
Overall, I am not at ease with threading, or will have to settle for 12
threads on the 12 non-hyperthreaded cores of a single socket.
I am inlining the lock output for the case where all threads are pinned to
NUMA-1 cores (and hence two threads share a core for some cores) at the
end of this mail.
Thanks,
- Krishna
Stats about Lock FD:
# write lock : 2
# write unlock: 2 (0)
# wait time for write : 0.001 msec
# wait time for write/lock: 302.000 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_RQ:
# write lock : 373317
# write unlock: 373317 (0)
# wait time for write : 341.875 msec
# wait time for write/lock: 915.775 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_WQ:
# write lock : 373432
# write unlock: 373432 (0)
# wait time for write : 491.524 msec
# wait time for write/lock: 1316.235 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LISTENER:
# write lock : 1248
# write unlock: 1248 (0)
# wait time for write : 0.295 msec
# wait time for write/lock: 236.341 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock PROXY:
# write lock : 12524202
# write unlock: 12524202 (0)
# wait time for write : 20979.972 msec
# wait time for write/lock: 1675.154 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock SERVER:
# write lock : 50100330
# write unlock: 50100330 (0)
# wait time for write : 76908.311 msec
# wait time for write/lock: 1535.086 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LBPRM:
# write lock : 50096808
# write unlock: 50096808 (0)
# wait time for write : 584505.012 msec
# wait time for write/lock: 11667.510 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock BUF_WQ:
# write lock : 35653802
# write unlock: 35653802 (0)
# wait time for write : 80406.420 msec
# wait time for write/lock: 2255.199 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock STRMS:
# write lock : 9602
# write unlock: 9602 (0)
# wait time for write : 5.613 msec
# wait time for write/lock: 584.594 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock VARS:
# write lock : 37596611
# write unlock: 37596611 (0)
# wait time for write : 2285.148 msec
# wait time for write/lock: 60.781 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec