Discussion: High CPU Usage followed by segfault error
Soji Antony
2018-10-02 14:56:12 UTC
Hello,

We are currently using haproxy 1.8.3 with a single-process, multithreaded
configuration.
We have 1 process and 10 threads, each mapped to a separate core [0-9]. We
are running our haproxy instances on a c4.4xlarge AWS EC2 instance. The
only other CPU-intensive process running on this server is a log shipper,
which is explicitly pinned to CPU cores 13-16 using the taskset command.
We have also given the haproxy processes 'SCHED_RR' priority 99.

OS: Ubuntu 14
Kernel: 4.4.0-134-generic
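For reference, a rough sketch of what this kind of pinning setup looks like
(values and PIDs below are placeholders, not our exact configuration; per-thread
cpu-map syntax depends on the haproxy version, so only process-level pinning is
shown here):

# haproxy.cfg, global section
global
    nbthread 10
    cpu-map 1 0-9                      # pin process 1 (and its threads) to cores 0-9

# keep the log shipper off haproxy's cores, give haproxy SCHED_RR priority 99
taskset -cp 13-16 <logshipper_pid>
chrt -r -a -p 99 <haproxy_pid>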

The issue we are seeing with HAProxy is that CPU usage suddenly spikes to
100% on the cores haproxy is using, causing latency spikes and high load
on the server. We see the following error messages in the system /
kernel logs when this issue happens.

haproxy[92558]: segfault at 8 ip 000055f04b1f5da2 sp 00007ffdab2bdd40 error 6 in haproxy[55f04b101000+170000]

Sep 29 12:21:02 marathonlb-int21 kernel: [2223350.996059] sched: RT
throttling activated

We are using marathonlb for auto-discovery, and reloads are quite frequent
on this server. The last time this issue happened we saw haproxy
using 750% CPU, and it went into D state. The old process was also
consuming CPU.

'hard-stop-after' was not set in our haproxy configuration, and we were seeing
multiple old PIDs running on the server. After the last CPU outage we set
'hard-stop-after' to 10s, and we are no longer seeing multiple haproxy
instances running after a reload. I would really appreciate it if someone
could explain why the CPU usage spikes along with the above segfault error,
and what this error means exactly.
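For reference, 'hard-stop-after' is a single line in the global section; a rough
sketch with the 10s value mentioned above:

global
    hard-stop-after 10s    # force old processes to exit 10s after a soft reload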

FYI: There was no traffic spike on this haproxy instance when the issue
happened. We have even seen the same issue on a non-prod haproxy instance that
was receiving no traffic at all; the system went down due to CPU usage and the
same segfault error was found in the logs.

Thanks

Soji Antony
2018-10-12 06:31:51 UTC
Hi Olivier,

Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but are seeing
the same CPU issue again.
I have found that the segmentation fault we were seeing earlier is
not related to the CPU spike, as it happens at a different time. Recently
we had the same issue with one of our haproxy servers and found the
following in the strace output:

# haproxy -vv

HA-Proxy version 1.8.14-1ppa1~trusty 2018/09/23
Copyright 2000-2018 Willy Tarreau <***@haproxy.org>

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4
-Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1
USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.31 2012-07-06
Running on PCRE version : 8.31 2012-07-06
PCRE library supports JIT : no (libpcre build without JIT?)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

strace output:

[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>
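For completeness, spinning like the above can be captured roughly as follows
(the PID selection is illustrative; 'pidof haproxy' may return several PIDs
around a reload):

# follow all threads (-f) of the worker and record timestamps (-tt)
strace -f -tt -o /tmp/haproxy.strace -p "$(pidof haproxy | awk '{print $1}')"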

kernel.log

Oct 10 19:13:04 int16 kernel: [192997.000062] sched: RT throttling activated
Oct 10 19:16:28 int16 kernel: [193201.140115] INFO: task <logshipper>:1213
blocked for more than 120 seconds.
Oct 10 19:16:28 int16 kernel: [193201.144250] Tainted: G OE
<kernel-version>
Oct 10 19:16:28 int16 kernel: [193201.147927] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 19:16:28 int16 kernel: [193201.152389] ffff880768c878a8
ffff880768c87968 ffff880766ae3700 ffff880768c88000
Oct 10 19:16:28 int16 kernel: [193201.152392] 0000000000000000
7fffffffffffffff ffff88078ffd50f0 ffffffff817f1700
Oct 10 19:16:28 int16 kernel: [193201.152394] ffff880768c878c0
ffffffff817f0fb5 ffff88076f917200 ffff880768c87970
Oct 10 19:16:28 int16 kernel: [193201.152396] Call Trace:
Oct 10 19:16:28 int16 kernel: [193201.152402] [<ffffffff817f1700>] ?
bit_wait+0x50/0x50
Oct 10 19:16:28 int16 kernel: [193201.152404] [<ffffffff817f0fb5>]
schedule+0x35/0x80
Oct 10 19:16:28 int16 kernel: [193201.152418] [<ffffffff817f39ab>]
schedule_timeout+0x23b/0x2d0
Oct 10 19:16:28 int16 kernel: [193201.152430] [<ffffffff81022d35>] ?
xen_clocksource_read+0x15/0x20
Oct 10 19:16:28 int16 kernel: [193201.152438] [<ffffffff81036ab9>] ?
sched_clock+0x9/0x10
Oct 10 19:16:28 int16 kernel: [193201.152441] [<ffffffff813bf154>] ?
__blk_mq_alloc_request+0xe4/0x1f0
Oct 10 19:16:28 int16 kernel: [193201.152442] [<ffffffff817f1700>] ?
bit_wait+0x50/0x50
Oct 10 19:16:28 int16 kernel: [193201.152445] [<ffffffff817f06f6>]
io_schedule_timeout+0xa6/0x110
Oct 10 19:16:28 int16 kernel: [193201.152446] [<ffffffff817f171b>]
bit_wait_io+0x1b/0x60
Oct 10 19:16:28 int16 kernel: [193201.152448] [<ffffffff817f1362>]
__wait_on_bit+0x62/0x90
Oct 10 19:16:28 int16 kernel: [193201.152451] [<ffffffff81187bc0>]
wait_on_page_bit+0xc0/0xd0
Oct 10 19:16:28 int16 kernel: [193201.152454] [<ffffffff810c27f0>] ?
autoremove_wake_function+0x40/0x40
Oct 10 19:16:28 int16 kernel: [193201.152456] [<ffffffff81197e66>]
truncate_inode_pages_range+0x366/0x6d0
Oct 10 19:16:28 int16 kernel: [193201.152459] [<ffffffff81554398>] ?
blkif_queue_rq+0x508/0x690
Oct 10 19:16:28 int16 kernel: [193201.152461] [<ffffffff817f2d82>] ?
down_write+0x12/0x40
Oct 10 19:16:28 int16 kernel: [193201.152465] [<ffffffff811b6646>] ?
unmap_mapping_range+0x66/0x110
Oct 10 19:16:28 int16 kernel: [193201.152467] [<ffffffff81198297>]
truncate_pagecache+0x47/0x60
Oct 10 19:16:28 int16 kernel: [193201.152469] [<ffffffff811982e2>]
truncate_setsize+0x32/0x40
Oct 10 19:16:28 int16 kernel: [193201.152509] [<ffffffffc014d5d8>]
xfs_setattr_size+0x168/0x3d0 [xfs]
Oct 10 19:16:28 int16 kernel: [193201.152522] [<ffffffffc014d8df>]
xfs_vn_setattr+0x9f/0xb0 [xfs]
Oct 10 19:16:28 int16 kernel: [193201.152524] [<ffffffff81221e80>]
notify_change+0x250/0x470
Oct 10 19:16:28 int16 kernel: [193201.152533] [<ffffffff81202a86>]
do_truncate+0x66/0xa0
Oct 10 19:16:28 int16 kernel: [193201.152545] [<ffffffff81212d06>]
path_openat+0x266/0x12e0
Oct 10 19:16:28 int16 kernel: [193201.152555] [<ffffffffc02f80ba>] ?
record_event_consumer.part.5+0x2ea/0x9e0 [sysdigcloud_probe]
Oct 10 19:16:28 int16 kernel: [193201.152564] [<ffffffff8118a6f5>] ?
generic_file_read_iter+0x5d5/0x670
Oct 10 19:16:28 int16 kernel: [193201.152566] [<ffffffff812159ee>]
do_filp_open+0x7e/0xd0
Oct 10 19:16:28 int16 kernel: [193201.152569] [<ffffffff81222f84>] ?
__alloc_fd+0xc4/0x180
Oct 10 19:16:28 int16 kernel: [193201.152571] [<ffffffff81204b19>]
do_sys_open+0x129/0x270
Oct 10 19:16:28 int16 kernel: [193201.152574] [<ffffffffc02f8aef>] ?
syscall_enter_probe+0xef/0x100 [sysdigcloud_probe]
Oct 10 19:16:28 int16 kernel: [193201.152576] [<ffffffff81204c94>]
SyS_openat+0x14/0x20
Oct 10 19:16:28 int16 kernel: [193201.152578] [<ffffffff817f4a12>]
tracesys_phase2+0x94/0x99

FYI, the CPU scheduling policy set for haproxy is SCHED_RR, priority 99.

Thanks
Hi,
Post by Soji Antony
[original message of 2018-10-02 quoted in full, trimmed]
A good first step would probably be to upgrade to the latest version if
possible.
1.8.3 is quite old, and a bunch of bugs have been fixed since then,
especially when using multithreading.
Regards,
Olivier
Soji Antony
2018-10-13 08:04:53 UTC
I would really appreciate it if someone could help with this issue. It is
happening quite frequently on our servers. I took a coredump the last time it
happened, using the 'gcore' command. However, I am unable to share it as it
might contain SSL-related information. Is there any way I can remove
confidential information from this coredump file before sharing it?
Unfortunately we are not able to reproduce this issue in a test environment.
Also, CPU usage stayed at 100% on all the cores haproxy was using for around
4 hours, even though we had taken it out of DNS rotation and it was not serving
any traffic. To me it looks like something internal to haproxy is stuck in an
infinite loop, causing the heavy CPU usage.
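For reference, this is roughly how such a dump and a shareable backtrace can be
collected (paths and PID handling here are illustrative):

pid=$(pidof haproxy | awk '{print $1}')
gcore -o /tmp/haproxy-core "$pid"                 # writes /tmp/haproxy-core.<pid>
gdb -batch -ex 'thread apply all bt' -ex 'info threads' \
    /usr/sbin/haproxy "/tmp/haproxy-core.$pid" > /tmp/haproxy-backtrace.txt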

Thanks

[image: Screen Shot 2018-10-12 at 8.13.02 PM.png]
Post by Soji Antony
[previous message of 2018-10-12 and Olivier Houchard's reply quoted in full, trimmed]
Soji Antony
2018-10-15 17:40:09 UTC
Hi Olivier,

Many thanks for your reply.
Please find the gdb output given below.

# gdb /usr/sbin/haproxy core.dump3.13871
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/haproxy...(no debugging symbols
found)...done.
[New LWP 13872]
[New LWP 13873]
[New LWP 13888]
[New LWP 13889]
[New LWP 13890]
[New LWP 13892]
[New LWP 13893]
[New LWP 13894]
[New LWP 13895]
[New LWP 13871]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/haproxy'.
#0 0x000055b79086de43 in _start ()
(gdb) thread apply all bt

Thread 10 (Thread 0x7f88ee327980 (LWP 13871)):
#0 0x000055b79086dee1 in _start ()

Thread 9 (Thread 0x7f88d37fe700 (LWP 13895)):
#0 0x000055b79086de46 in _start ()

Thread 8 (Thread 0x7f88cb7fe700 (LWP 13894)):
#0 0x000055b79086de46 in _start ()

Thread 7 (Thread 0x7f88d3fff700 (LWP 13893)):
#0 0x000055b79086de43 in _start ()

Thread 6 (Thread 0x7f88e8c25700 (LWP 13892)):
#0 0x000055b79086de43 in _start ()

Thread 5 (Thread 0x7f88e94c5700 (LWP 13890)):
#0 0x000055b79086de46 in _start ()

Thread 4 (Thread 0x7f88e9cc6700 (LWP 13889)):
#0 0x000055b79086de46 in _start ()

Thread 3 (Thread 0x7f88ea4c7700 (LWP 13888)):
#0 0x000055b79086de43 in _start ()

Thread 2 (Thread 0x7f88eacc8700 (LWP 13873)):
#0 0x000055b79086de43 in _start ()

Thread 1 (Thread 0x7f88eb4c9700 (LWP 13872)):
#0 0x000055b79086de43 in _start ()
(gdb) info threads
Id Target Id Frame
10 Thread 0x7f88ee327980 (LWP 13871) 0x000055b79086dee1 in _start ()
9 Thread 0x7f88d37fe700 (LWP 13895) 0x000055b79086de46 in _start ()
8 Thread 0x7f88cb7fe700 (LWP 13894) 0x000055b79086de46 in _start ()
7 Thread 0x7f88d3fff700 (LWP 13893) 0x000055b79086de43 in _start ()
6 Thread 0x7f88e8c25700 (LWP 13892) 0x000055b79086de43 in _start ()
5 Thread 0x7f88e94c5700 (LWP 13890) 0x000055b79086de46 in _start ()
4 Thread 0x7f88e9cc6700 (LWP 13889) 0x000055b79086de46 in _start ()
3 Thread 0x7f88ea4c7700 (LWP 13888) 0x000055b79086de43 in _start ()
2 Thread 0x7f88eacc8700 (LWP 13873) 0x000055b79086de43 in _start ()
* 1 Thread 0x7f88eb4c9700 (LWP 13872) 0x000055b79086de43 in _start ()
(gdb)
Post by Soji Antony
Hi,
Post by Soji Antony
[previous message of 2018-10-13 quoted in full, trimmed]
Sorry for the late answer.
Without sending us the core, if you could at least give us the gdb output
of "thread apply all bt" so that we know where each thread is spinning, that
may be very useful, and shouldn't leak any confidential information.
Thanks !
Olivier
Post by Soji Antony
[earlier messages quoted in full, trimmed]
Soji Antony
2018-10-16 03:58:46 UTC
FYI, the initial version we were using before upgrading to 1.8.14
was 1.8.13.
I mistakenly wrote it as 1.8.3 in my first email.

Thanks
Post by Soji Antony
[previous messages quoted in full, trimmed]
Willy Tarreau
2018-10-16 14:11:20 UTC
Hi,
Post by Soji Antony
FYI, the initial version which we were using before upgrading to 1.8.14
was 1.8.13.
By mistake updated it as 1.8.3 in my first email.
No problem, thanks for the clarification. After re-reading the code, I found a
bug which is extremely difficult to trigger and which could induce the
situation you're running into. It requires a race between two "show fd"
on the CLI, where the second must start exactly when the first one stops.
I couldn't reproduce it since the window is too narrow, but theoretically it
exists, and the fact that you don't see it often could confirm this.
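For context, "show fd" is a runtime CLI command issued over the stats socket;
the path below is only an example and assumes a "stats socket" line in the
global section:

echo "show fd" | socat stdio UNIX-CONNECT:/var/run/haproxy.sock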

However since you're facing the same issue with 1.8.13 which uses the older
synchronization point, I still have some doubts about the root cause. But
we know the last one could be a bit tricky so we can't rule out a slightly
different issue giving overall the same visible effects.

Could you please apply the attached patch ? I'm going to merge it into 1.9
and we'll backport it to 1.8 later.

By the way, any reason you're running with SCHED_RR ? It might make things
worse during reloads by letting some threads spin on their own spinlocks
without offering a chance to the same thread of the other process to complete
its work.

Regards,
Willy
Willy Tarreau
2018-10-16 15:02:30 UTC
Post by Willy Tarreau
Could you please apply the attached patch ? I'm going to merge it into 1.9
and we'll backport it to 1.8 later.
And please add the attached one as well, which is specific to 1.8. I
suspect that different compiler versions could emit inappropriate
code due to the threads_want_sync variable not being marked volatile.

In your case the issue would manifest itself if you're having heavy
server queueing ("maxconn" on the server lines) or if you're seeing
a lot of "server up/down" events.

Thanks,
Willy
Soji Antony
2018-10-17 03:56:20 UTC
Hi Willy / Olivier,

Thank you very much for the patch & detailed explanation. I will apply this
patch on our servers.
Unfortunately, as is often the case with gdb, that's less than useful :/
If you have that available, you may install the haproxy-dbg package, but
I'm not convinced it will yield better results.
PFA the gdb.txt file, which has the 'thread apply all bt' and 'info threads'
command output after installing the haproxy-dbg package, if that helps.
Can you share your config, obfuscating any confidential information, IP
addresses, etc.?
PFA the haproxy.cfg file; I have added a comment [removed] wherever I have
truncated lines.
You mentioned you were getting a segfault, do you know how to reproduce
it?

Not sure how to reproduce it. I can see these segfault error messages in the
kernel logs on random servers. I can try enabling a crash dump on one of the
servers and share the details.

Oct 9 16:16:35 kernel: [85669.521234] haproxy[59075]: segfault at
7fda1fb0fc60 ip 000055c7273b643b sp 00007fd8c2ffaab0 error 4 in
haproxy[55c72734e000+172000]
Oct 10 09:48:43 [148797.364018] haproxy[60048]: segfault at 8 ip
0000556ba5c7eac2 sp 00007ffc5ef9e730 error 6 in haproxy[556ba5b89000+172000]
Oct 11 14:30:56 kernel: [252130.055746] haproxy[4538]: segfault at
7fe088e87350 ip 00005637ab43fea7 sp 00007fe0857e8c20 error 4 in
haproxy[5637ab410000+172000]
Oct 11 16:47:03 kernel: [260297.444482] haproxy[74455]: segfault at
7f07d0de7290 ip 00005574f96e1ea7 sp 00007f07ce9c6c20 error 4 in
haproxy[5574f96b2000+172000]
Oct 11 22:06:19 : [279453.364729] haproxy[103724]: segfault at 7f7e492535d0
ip 000055c8b4f1dea7 sp 00007f7e46d93c20 error 4 in
haproxy[55c8b4eee000+172000]
Oct 13 04:31:14 : [388948.155673] haproxy[92338]: segfault at 8 ip
00005583be079ac2 sp 00007ffc6cb34e60 error 6 in haproxy[5583bdf84000+172000]
Oct 15 15:17:04 kernel: [600498.581053] haproxy[63374]: segfault at 8 ip
000055dd2e7d1ac2 sp 00007ffed747e1d0 error 6 in haproxy[55dd2e6dc000+172000]
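For what it's worth, the "error N" in those kernel lines is the x86
page-fault error-code bitfield, so "error 4" is a user-mode read of an
unmapped address and "error 6" a user-mode write (here to address 8, i.e.
through a near-NULL pointer). A tiny decoder, for reference only:

/* Decode the x86 page-fault error code from kernel "segfault at ..." lines.
 * bit 0: page was present (protection fault) vs unmapped
 * bit 1: write access vs read
 * bit 2: fault raised in user mode vs kernel mode */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    unsigned long err = argc > 1 ? strtoul(argv[1], NULL, 0) : 6;

    printf("error %lu: %s-mode %s on a %s page\n", err,
           (err & 4) ? "user" : "kernel",
           (err & 2) ? "write" : "read",
           (err & 1) ? "present (protection fault)" : "non-present");
    return 0;
}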
You also mentioned reloads are frequent, can you tell if the CPU spike
happens immediately after a reload ?

It is very difficult to say as reloads are quite frequent. Attaching the
graph for your reference. I can see that the last reload happened at
16:56:19 and CPU usage started spiking at 16:57:30. But due to the high CPU
usage, the script which we use to send this reload count to graphite might
have failed, so it might not be reflected in the graph.
By the way, any reason you're running with SCHED_RR ? It might make things
worse during reloads by letting some threads spin on their own spinlocks
without offering a chance to the same thread of the other process to
complete
its work.

This was a cron added long back when the 'Meltdown and Spectre' fix slowed
down the CPU. We were using hap 1.6 in single-process mode at that time.
Later on we upgraded hap to 1.8.13 but the cron was still enabled on our
servers [#* * * * * /usr/bin/chrt -a -p 99 $(cat /var/run/haproxy.pid)].
Initially we suspected this might have caused the CPU spikes and disabled
it. It is currently disabled on our servers, but CPU spikes are observed
even after disabling it.
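In case it helps anyone cleaning up a similar chrt cron, the policy can be
reset to the default with 'chrt --other -p 0 <pid>', or roughly like this
from C (a sketch only, not taken from our tooling):

/* Sketch: report a PID's scheduling policy and, if it is still SCHED_RR,
 * put it back to the default SCHED_OTHER.  Needs root for other processes. */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : 0;    /* 0 = self */
    int policy = sched_getscheduler(pid);

    printf("current policy: %s\n",
           policy == SCHED_RR    ? "SCHED_RR" :
           policy == SCHED_FIFO  ? "SCHED_FIFO" :
           policy == SCHED_OTHER ? "SCHED_OTHER" : "other/unknown");

    if (policy == SCHED_RR) {
        struct sched_param sp = { .sched_priority = 0 };  /* 0 for SCHED_OTHER */
        if (sched_setscheduler(pid, SCHED_OTHER, &sp) == -1)
            perror("sched_setscheduler");
    }
    return 0;
}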
Post by Willy Tarreau
Could you please apply the attached patch ? I'm going to merge it into 1.9
and we'll backport it to 1.8 later.
And please add the attached one as well, which is specific to 1.8. I
suspect that different versions of compiler could emit inappropriate
code due to the threads_want_sync variable not being marked volatile.
In your case the issue would manifest itself if you're having heavy
server queueing ("maxconn" on the server lines) or if you're seeing
a lot of "server up/down" events.
Thanks,
Willy
Nice catch !
This one is a good candidate.
Regards,
Olivier
Soji Antony
2018-10-20 16:28:10 UTC
Permalink
Hi

FYI, the following is the backtrace for the segfault which we are seeing in
kern.log.

Oct 18 10:11:30 kernel: [841364.001036] haproxy[30696]: segfault at 8 ip
00005567eaf6aac2 sp 00007ffdd70447b0 error 6 in haproxy[5567eae75000+172000]

# apport-retrace -g _usr_sbin_haproxy.0.crash
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Reading symbols from /usr/sbin/haproxy...Reading symbols from
/usr/lib/debug/.build-id/56/c5ffb3112d35c68a487caa1f4b788953891ade.debug...done.
done.
[New LWP 30696]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p
/var/run/haproxy.pid -sf 30646'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00005567eaf6aac2 in do_unbind_listener
(listener=listener@entry=0x5567ebd82a00,
do_close=do_close@entry=1) at src/listener.c:319
319 src/listener.c: No such file or directory.
(gdb) list
314 in src/listener.c
(gdb) backtrace
#0 0x00005567eaf6aac2 in do_unbind_listener
(listener=listener@entry=0x5567ebd82a00,
do_close=do_close@entry=1) at src/listener.c:319
#1 0x00005567eaf6b252 in unbind_listener
(listener=listener@entry=0x5567ebd82a00)
at src/listener.c:342
#2 0x00005567eaf6b308 in unbind_all_listeners (proto=0x5567eb1fbcc0
<proto_tcpv4>) at src/listener.c:365
#3 0x00005567eaf9bf0e in protocol_unbind_all () at src/protocol.c:76
#4 0x00005567eaf397d6 in deinit () at src/haproxy.c:2289
#5 0x00005567eaea1e1f in main (argc=<optimized out>, argv=<optimized out>)
at src/haproxy.c:3092

HAProxy 1.8.14 is installed from the Ubuntu PPA repository:

# cat vbernat-haproxy.list
deb http://ppa.launchpad.net/vbernat/haproxy-1.8/ubuntu trusty main
deb-src http://ppa.launchpad.net/vbernat/haproxy-1.8/ubuntu trusty main

Thanks