High CPU Usage followed by segfault error

Discussion:

Soji Antony

2018-10-02 14:56:12 UTC

Hello,

We are currently using haproxy 1.8.3 with single process multithreaded
configuration.
We have 1 process and 10 threads each mapped to a separate core [0-9]. We
are running our haproxy instances on a c4.4xlarge aws ec2 instance. The
only other CPU intensive process running on this server is a log shipper
which is explicity mapped to cpu cores 13 - 16 explicitly using taskset
command. Also we have given 'SCHED_RR' priority 99 for haproxy processes.

OS: Ubuntu 14
Kernel: 4.4.0-134-generic

The issue we are seeing with Haproxy is all of a sudden CPU usage spikes to
100% on cores which haproxy is using & causing latency spikes and high load
on the server. We are seeing the following error messages in system /
kernel logs when this issue happens.

haproxy[92558]: segfault at 8 ip 000055f04b1f5da2 sp 00007ffdab2bdd40 error
6 in haproxy[55f04b10100
0+170000]

Sep 29 12:21:02 marathonlb-int21 kernel: [2223350.996059] sched: RT
throttling activated

We are using marathonlb for auto discovery and reloads are quite frequent
on this server. Last time when this issue happened we had seen haproxy
using 750% of CPU and it went into D state. Also the old process was also
taking cpu.

hard-stop-after was not set in our hap configuration and we were seeing
multiple old pid's running on the server. After the last outage we had with
CPU we set 'hard-stop-after' to 10s and now we are not seeing multiple hap
instances running after reload. I would really appreciate if some one can
explain us why the CPU usage spikes with the above segfault error & what
this error exactly means.

FYI: There was no traffic spike on this hap instance when the issue
happened. We have even seen the same issue in a non-prod hap where no
traffic was coming & system went down due to CPU usage & found the same
segfault error in the logs.

Thanks

Thanks

Soji Antony

2018-10-12 06:31:51 UTC

Permalink

Hi Oliver,

Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but seeing
the same CPU issue again.
I have found that the segmentation fault which we were seeing earlier is
not related to the CPU spike as it is happening at different time. Recently
we had the same issue with one of our haproxy servers and found the
following in strace o/p:

# haproxy -vv

HA-Proxy version 1.8.14-1ppa1~trusty 2018/09/23
Copyright 2000-2018 Willy Tarreau <***@haproxy.org>

Build options :
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4
-Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1
USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1

Default settings :
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.31 2012-07-06
Running on PCRE version : 8.31 2012-07-06
PCRE library supports JIT : no (libpcre build without JIT?)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

Strace O/P:

[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>

kernel.log

Oct 10 19:13:04 int16 kernel: [192997.000062] sched: RT throttling activated
Oct 10 19:16:28 int16 kernel: [193201.140115] INFO: task <logshipper>:1213
blocked for more than 120 seconds.
Oct 10 19:16:28 int16 kernel: [193201.144250] Tainted: G OE
<kernel-version>
Oct 10 19:16:28 int16 kernel: [193201.147927] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 19:16:28 int16 kernel: [193201.152389] ffff880768c878a8
ffff880768c87968 ffff880766ae3700 ffff880768c88000
Oct 10 19:16:28 int16 kernel: [193201.152392] 0000000000000000
7fffffffffffffff ffff88078ffd50f0 ffffffff817f1700
Oct 10 19:16:28 int16 kernel: [193201.152394] ffff880768c878c0
ffffffff817f0fb5 ffff88076f917200 ffff880768c87970
Oct 10 19:16:28 int16 kernel: [193201.152396] Call Trace:
Oct 10 19:16:28 int16 kernel: [193201.152402] [<ffffffff817f1700>] ?
bit_wait+0x50/0x50
Oct 10 19:16:28 int16 kernel: [193201.152404] [<ffffffff817f0fb5>]
schedule+0x35/0x80
Oct 10 19:16:28 int16 kernel: [193201.152418] [<ffffffff817f39ab>]
schedule_timeout+0x23b/0x2d0
Oct 10 19:16:28 int16 kernel: [193201.152430] [<ffffffff81022d35>] ?
xen_clocksource_read+0x15/0x20
Oct 10 19:16:28 int16 kernel: [193201.152438] [<ffffffff81036ab9>] ?
sched_clock+0x9/0x10
Oct 10 19:16:28 int16 kernel: [193201.152441] [<ffffffff813bf154>] ?
__blk_mq_alloc_request+0xe4/0x1f0
Oct 10 19:16:28 int16 kernel: [193201.152442] [<ffffffff817f1700>] ?
bit_wait+0x50/0x50
Oct 10 19:16:28 int16 kernel: [193201.152445] [<ffffffff817f06f6>]
io_schedule_timeout+0xa6/0x110
Oct 10 19:16:28 int16 kernel: [193201.152446] [<ffffffff817f171b>]
bit_wait_io+0x1b/0x60
Oct 10 19:16:28 int16 kernel: [193201.152448] [<ffffffff817f1362>]
__wait_on_bit+0x62/0x90
Oct 10 19:16:28 int16 kernel: [193201.152451] [<ffffffff81187bc0>]
wait_on_page_bit+0xc0/0xd0
Oct 10 19:16:28 int16 kernel: [193201.152454] [<ffffffff810c27f0>] ?
autoremove_wake_function+0x40/0x40
Oct 10 19:16:28 int16 kernel: [193201.152456] [<ffffffff81197e66>]
truncate_inode_pages_range+0x366/0x6d0
Oct 10 19:16:28 int16 kernel: [193201.152459] [<ffffffff81554398>] ?
blkif_queue_rq+0x508/0x690
Oct 10 19:16:28 int16 kernel: [193201.152461] [<ffffffff817f2d82>] ?
down_write+0x12/0x40
Oct 10 19:16:28 int16 kernel: [193201.152465] [<ffffffff811b6646>] ?
unmap_mapping_range+0x66/0x110
Oct 10 19:16:28 int16 kernel: [193201.152467] [<ffffffff81198297>]
truncate_pagecache+0x47/0x60
Oct 10 19:16:28 int16 kernel: [193201.152469] [<ffffffff811982e2>]
truncate_setsize+0x32/0x40
Oct 10 19:16:28 int16 kernel: [193201.152509] [<ffffffffc014d5d8>]
xfs_setattr_size+0x168/0x3d0 [xfs]
Oct 10 19:16:28 int16 kernel: [193201.152522] [<ffffffffc014d8df>]
xfs_vn_setattr+0x9f/0xb0 [xfs]
Oct 10 19:16:28 int16 kernel: [193201.152524] [<ffffffff81221e80>]
notify_change+0x250/0x470
Oct 10 19:16:28 int16 kernel: [193201.152533] [<ffffffff81202a86>]
do_truncate+0x66/0xa0
Oct 10 19:16:28 int16 kernel: [193201.152545] [<ffffffff81212d06>]
path_openat+0x266/0x12e0
Oct 10 19:16:28 int16 kernel: [193201.152555] [<ffffffffc02f80ba>] ?
record_event_consumer.part.5+0x2ea/0x9e0 [sysdigcloud_probe]
Oct 10 19:16:28 int16 kernel: [193201.152564] [<ffffffff8118a6f5>] ?
generic_file_read_iter+0x5d5/0x670
Oct 10 19:16:28 int16 kernel: [193201.152566] [<ffffffff812159ee>]
do_filp_open+0x7e/0xd0
Oct 10 19:16:28 int16 kernel: [193201.152569] [<ffffffff81222f84>] ?
__alloc_fd+0xc4/0x180
Oct 10 19:16:28 int16 kernel: [193201.152571] [<ffffffff81204b19>]
do_sys_open+0x129/0x270
Oct 10 19:16:28 int16 kernel: [193201.152574] [<ffffffffc02f8aef>] ?
syscall_enter_probe+0xef/0x100 [sysdigcloud_probe]
Oct 10 19:16:28 int16 kernel: [193201.152576] [<ffffffff81204c94>]
SyS_openat+0x14/0x20
Oct 10 19:16:28 int16 kernel: [193201.152578] [<ffffffff817f4a12>]
tracesys_phase2+0x94/0x99

FYI, CPU scheduling priority set for haproxy is SCHEDULED_RR , priority 99.

Thanks

Hi,

Post by Soji Antony
Hello,
We are currently using haproxy 1.8.3 with single process multithreaded
configuration.
We have 1 process and 10 threads each mapped to a separate core [0-9]. We
are running our haproxy instances on a c4.4xlarge aws ec2 instance. The
only other CPU intensive process running on this server is a log shipper
which is explicity mapped to cpu cores 13 - 16 explicitly using taskset
command. Also we have given 'SCHED_RR' priority 99 for haproxy processes.
OS: Ubuntu 14
Kernel: 4.4.0-134-generic
The issue we are seeing with Haproxy is all of a sudden CPU usage spikes

Post by Soji Antony
100% on cores which haproxy is using & causing latency spikes and high

load

Post by Soji Antony
on the server. We are seeing the following error messages in system /
kernel logs when this issue happens.
haproxy[92558]: segfault at 8 ip 000055f04b1f5da2 sp 00007ffdab2bdd40

error

Post by Soji Antony
6 in haproxy[55f04b10100
0+170000]
Sep 29 12:21:02 marathonlb-int21 kernel: [2223350.996059] sched: RT
throttling activated
We are using marathonlb for auto discovery and reloads are quite frequent
on this server. Last time when this issue happened we had seen haproxy
using 750% of CPU and it went into D state. Also the old process was also
taking cpu.
hard-stop-after was not set in our hap configuration and we were seeing
multiple old pid's running on the server. After the last outage we had

with

Post by Soji Antony
CPU we set 'hard-stop-after' to 10s and now we are not seeing multiple

hap

Post by Soji Antony
instances running after reload. I would really appreciate if some one can
explain us why the CPU usage spikes with the above segfault error & what
this error exactly means.
FYI: There was no traffic spike on this hap instance when the issue
happened. We have even seen the same issue in a non-prod hap where no
traffic was coming & system went down due to CPU usage & found the same
segfault error in the logs.

A good first step would probably to upgrade to the latest version if
possible.
1.8.3 is quite old, and a bunch of bugs have been fixed since then,
especially when using multithreading.
Regards,
Olivier

Soji Antony

2018-10-13 08:04:53 UTC

Permalink

Really appreciate if some one can help with this issue. This is happening
quite frequently on our servers. I have taken a coredump when this happened
last time using 'gcore' command. However unable to share it as it might
have ssl related information. Is there anyway I can remove confidential
information from this coredump file before sharing it. Unfortunately we are
not able to reproduce this issue in a test environment. Also CPU usage was
100% for all the cores which haproxy was using for around 4hrs though we
have taken it out of rotation from DNS & was not serving any traffic. For
me it looks like something internal to haproxy is causing an infinite loop
causing heavy cpu usage.

Thanks

[image: Screen Shot 2018-10-12 at 8.13.02 PM.png]

Post by Soji Antony
Hi Oliver,
Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but seeing
the same CPU issue again.
I have found that the segmentation fault which we were seeing earlier is
not related to the CPU spike as it is happening at different time. Recently
we had the same issue with one of our haproxy servers and found the
# haproxy -vv
HA-Proxy version 1.8.14-1ppa1~trusty 2018/09/23
TARGET = linux2628
CPU = generic
CC = gcc
CFLAGS = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4
-Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1
USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1
maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
Built with Lua version : Lua 5.3.1
Built with transparent proxy support using: IP_TRANSPARENT
IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.31 2012-07-06
Running on PCRE version : 8.31 2012-07-06
PCRE library supports JIT : no (libpcre build without JIT?)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
[SPOE] spoe
[COMP] compression
[TRACE] trace
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>
[pid 114267] sched_yield( <unfinished ...>
[pid 114266] <... sched_yield resumed> ) = 0
[pid 114265] <... sched_yield resumed> ) = 0
[pid 114267] <... sched_yield resumed> ) = 0
[pid 114266] sched_yield( <unfinished ...>
[pid 114265] sched_yield( <unfinished ...>
kernel.log
Oct 10 19:13:04 int16 kernel: [192997.000062] sched: RT throttling activated
Oct 10 19:16:28 int16 kernel: [193201.140115] INFO: task <logshipper>:1213
blocked for more than 120 seconds.
Oct 10 19:16:28 int16 kernel: [193201.144250] Tainted: G
OE <kernel-version>
Oct 10 19:16:28 int16 kernel: [193201.147927] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 10 19:16:28 int16 kernel: [193201.152389] ffff880768c878a8
ffff880768c87968 ffff880766ae3700 ffff880768c88000
Oct 10 19:16:28 int16 kernel: [193201.152392] 0000000000000000
7fffffffffffffff ffff88078ffd50f0 ffffffff817f1700
Oct 10 19:16:28 int16 kernel: [193201.152394] ffff880768c878c0
ffffffff817f0fb5 ffff88076f917200 ffff880768c87970
Oct 10 19:16:28 int16 kernel: [193201.152402] [<ffffffff817f1700>] ?
bit_wait+0x50/0x50
Oct 10 19:16:28 int16 kernel: [193201.152404] [<ffffffff817f0fb5>]
schedule+0x35/0x80
Oct 10 19:16:28 int16 kernel: [193201.152418] [<ffffffff817f39ab>]
schedule_timeout+0x23b/0x2d0
Oct 10 19:16:28 int16 kernel: [193201.152430] [<ffffffff81022d35>] ?
xen_clocksource_read+0x15/0x20
Oct 10 19:16:28 int16 kernel: [193201.152438] [<ffffffff81036ab9>] ?
sched_clock+0x9/0x10
Oct 10 19:16:28 int16 kernel: [193201.152441] [<ffffffff813bf154>] ?
__blk_mq_alloc_request+0xe4/0x1f0
Oct 10 19:16:28 int16 kernel: [193201.152442] [<ffffffff817f1700>] ?
bit_wait+0x50/0x50
Oct 10 19:16:28 int16 kernel: [193201.152445] [<ffffffff817f06f6>]
io_schedule_timeout+0xa6/0x110
Oct 10 19:16:28 int16 kernel: [193201.152446] [<ffffffff817f171b>]
bit_wait_io+0x1b/0x60
Oct 10 19:16:28 int16 kernel: [193201.152448] [<ffffffff817f1362>]
__wait_on_bit+0x62/0x90
Oct 10 19:16:28 int16 kernel: [193201.152451] [<ffffffff81187bc0>]
wait_on_page_bit+0xc0/0xd0
Oct 10 19:16:28 int16 kernel: [193201.152454] [<ffffffff810c27f0>] ?
autoremove_wake_function+0x40/0x40
Oct 10 19:16:28 int16 kernel: [193201.152456] [<ffffffff81197e66>]
truncate_inode_pages_range+0x366/0x6d0
Oct 10 19:16:28 int16 kernel: [193201.152459] [<ffffffff81554398>] ?
blkif_queue_rq+0x508/0x690
Oct 10 19:16:28 int16 kernel: [193201.152461] [<ffffffff817f2d82>] ?
down_write+0x12/0x40
Oct 10 19:16:28 int16 kernel: [193201.152465] [<ffffffff811b6646>] ?
unmap_mapping_range+0x66/0x110
Oct 10 19:16:28 int16 kernel: [193201.152467] [<ffffffff81198297>]
truncate_pagecache+0x47/0x60
Oct 10 19:16:28 int16 kernel: [193201.152469] [<ffffffff811982e2>]
truncate_setsize+0x32/0x40
Oct 10 19:16:28 int16 kernel: [193201.152509] [<ffffffffc014d5d8>]
xfs_setattr_size+0x168/0x3d0 [xfs]
Oct 10 19:16:28 int16 kernel: [193201.152522] [<ffffffffc014d8df>]
xfs_vn_setattr+0x9f/0xb0 [xfs]
Oct 10 19:16:28 int16 kernel: [193201.152524] [<ffffffff81221e80>]
notify_change+0x250/0x470
Oct 10 19:16:28 int16 kernel: [193201.152533] [<ffffffff81202a86>]
do_truncate+0x66/0xa0
Oct 10 19:16:28 int16 kernel: [193201.152545] [<ffffffff81212d06>]
path_openat+0x266/0x12e0
Oct 10 19:16:28 int16 kernel: [193201.152555] [<ffffffffc02f80ba>] ?
record_event_consumer.part.5+0x2ea/0x9e0 [sysdigcloud_probe]
Oct 10 19:16:28 int16 kernel: [193201.152564] [<ffffffff8118a6f5>] ?
generic_file_read_iter+0x5d5/0x670
Oct 10 19:16:28 int16 kernel: [193201.152566] [<ffffffff812159ee>]
do_filp_open+0x7e/0xd0
Oct 10 19:16:28 int16 kernel: [193201.152569] [<ffffffff81222f84>] ?
__alloc_fd+0xc4/0x180
Oct 10 19:16:28 int16 kernel: [193201.152571] [<ffffffff81204b19>]
do_sys_open+0x129/0x270
Oct 10 19:16:28 int16 kernel: [193201.152574] [<ffffffffc02f8aef>] ?
syscall_enter_probe+0xef/0x100 [sysdigcloud_probe]
Oct 10 19:16:28 int16 kernel: [193201.152576] [<ffffffff81204c94>]
SyS_openat+0x14/0x20
Oct 10 19:16:28 int16 kernel: [193201.152578] [<ffffffff817f4a12>]
tracesys_phase2+0x94/0x99
FYI, CPU scheduling priority set for haproxy is SCHEDULED_RR , priority 99.
Thanks

Post by Soji Antony
Hi,

Post by Soji Antony
Hello,
We are currently using haproxy 1.8.3 with single process multithreaded
configuration.
We have 1 process and 10 threads each mapped to a separate core [0-9].

Post by Soji Antony
are running our haproxy instances on a c4.4xlarge aws ec2 instance. The
only other CPU intensive process running on this server is a log shipper
which is explicity mapped to cpu cores 13 - 16 explicitly using taskset
command. Also we have given 'SCHED_RR' priority 99 for haproxy

processes.

Post by Soji Antony
OS: Ubuntu 14
Kernel: 4.4.0-134-generic
The issue we are seeing with Haproxy is all of a sudden CPU usage

spikes to

Post by Soji Antony
100% on cores which haproxy is using & causing latency spikes and high

load

Post by Soji Antony
on the server. We are seeing the following error messages in system /
kernel logs when this issue happens.
haproxy[92558]: segfault at 8 ip 000055f04b1f5da2 sp 00007ffdab2bdd40

error

frequent

Post by Soji Antony
on this server. Last time when this issue happened we had seen haproxy
using 750% of CPU and it went into D state. Also the old process was

also

Post by Soji Antony
taking cpu.
hard-stop-after was not set in our hap configuration and we were seeing
multiple old pid's running on the server. After the last outage we had

with

Post by Soji Antony
CPU we set 'hard-stop-after' to 10s and now we are not seeing multiple

hap

Post by Soji Antony
instances running after reload. I would really appreciate if some one

can

Post by Soji Antony
explain us why the CPU usage spikes with the above segfault error & what
this error exactly means.
FYI: There was no traffic spike on this hap instance when the issue
happened. We have even seen the same issue in a non-prod hap where no
traffic was coming & system went down due to CPU usage & found the same
segfault error in the logs.

Soji Antony

2018-10-15 17:40:09 UTC

Permalink

Hi Olivier,

Many thanks for your reply.
Please find the gdb output given below.

# gdb /usr/sbin/haproxy core.dump3.13871
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/haproxy...(no debugging symbols
found)...done.
[New LWP 13872]
[New LWP 13873]
[New LWP 13888]
[New LWP 13889]
[New LWP 13890]
[New LWP 13892]
[New LWP 13893]
[New LWP 13894]
[New LWP 13895]
[New LWP 13871]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/haproxy'.
#0 0x000055b79086de43 in _start ()
(gdb) thread apply all bt

Thread 10 (Thread 0x7f88ee327980 (LWP 13871)):
#0 0x000055b79086dee1 in _start ()

Thread 9 (Thread 0x7f88d37fe700 (LWP 13895)):
#0 0x000055b79086de46 in _start ()

Thread 8 (Thread 0x7f88cb7fe700 (LWP 13894)):
#0 0x000055b79086de46 in _start ()

Thread 7 (Thread 0x7f88d3fff700 (LWP 13893)):
#0 0x000055b79086de43 in _start ()

Thread 6 (Thread 0x7f88e8c25700 (LWP 13892)):
#0 0x000055b79086de43 in _start ()

Thread 5 (Thread 0x7f88e94c5700 (LWP 13890)):
#0 0x000055b79086de46 in _start ()

Thread 4 (Thread 0x7f88e9cc6700 (LWP 13889)):
#0 0x000055b79086de46 in _start ()

Thread 3 (Thread 0x7f88ea4c7700 (LWP 13888)):
#0 0x000055b79086de43 in _start ()

Thread 2 (Thread 0x7f88eacc8700 (LWP 13873)):
#0 0x000055b79086de43 in _start ()

Thread 1 (Thread 0x7f88eb4c9700 (LWP 13872)):
#0 0x000055b79086de43 in _start ()
(gdb) info threads
Id Target Id Frame
10 Thread 0x7f88ee327980 (LWP 13871) 0x000055b79086dee1 in _start ()
9 Thread 0x7f88d37fe700 (LWP 13895) 0x000055b79086de46 in _start ()
8 Thread 0x7f88cb7fe700 (LWP 13894) 0x000055b79086de46 in _start ()
7 Thread 0x7f88d3fff700 (LWP 13893) 0x000055b79086de43 in _start ()
6 Thread 0x7f88e8c25700 (LWP 13892) 0x000055b79086de43 in _start ()
5 Thread 0x7f88e94c5700 (LWP 13890) 0x000055b79086de46 in _start ()
4 Thread 0x7f88e9cc6700 (LWP 13889) 0x000055b79086de46 in _start ()
3 Thread 0x7f88ea4c7700 (LWP 13888) 0x000055b79086de43 in _start ()
2 Thread 0x7f88eacc8700 (LWP 13873) 0x000055b79086de43 in _start ()
* 1 Thread 0x7f88eb4c9700 (LWP 13872) 0x000055b79086de43 in _start ()
(gdb)

Post by Soji Antony
Hi,

Post by Soji Antony
Really appreciate if some one can help with this issue. This is happening
quite frequently on our servers. I have taken a coredump when this

happened

Post by Soji Antony
last time using 'gcore' command. However unable to share it as it might
have ssl related information. Is there anyway I can remove confidential
information from this coredump file before sharing it. Unfortunately we

are

Post by Soji Antony
not able to reproduce this issue in a test environment. Also CPU usage

was

Post by Soji Antony
100% for all the cores which haproxy was using for around 4hrs though we
have taken it out of rotation from DNS & was not serving any traffic. For
me it looks like something internal to haproxy is causing an infinite

loop

Post by Soji Antony
causing heavy cpu usage.
Thanks

Sorry for the late answer.
Without sending us the core, if you could at least give use the gdb output
of "thread apply all bt" so that we know where each thread is spinning,
that
may be very useful, and shouldn't leak any confidential information.
Thanks !
Olivier

Post by Soji Antony
[image: Screen Shot 2018-10-12 at 8.13.02 PM.png]

Post by Soji Antony
Hi Oliver,
Thanks for the suggestion. We have upgraded haproxy to 1.8.14 but

seeing