Discussion:
Do `tune.rcvbuf.server` and `tune.sndbuf.server` (and their `tune.*.client` equivalents) lead to TCP fragmentation?
Ciprian Dorin Craciun
2018-09-29 18:57:20 UTC
Permalink
Hello all!

I've played with `tune.rcvbuf.server`, `tune.sndbuf.server`,
`tune.rcvbuf.client`, and `tune.sndbuf.client` and explicitly set them
to various values ranging from 4k to 256k. Unfortunately, in all cases
it seems that this generates TCP packets that are too large (larger than
the advertised and agreed MSS, in both directions), which in turn leads
to TCP fragmentation and reassembly. (Both client and server are Linux
4.10. The protocol used was HTTP 1.1 over TLS 1.2.)
The result was a bandwidth of about 100 KB (for a client-server
latency of 160 ms).

The only configuration that didn't have this effect was not setting them at all. The
resulting bandwidth was around 10 MB.

(I have tested the backend server without HAProxy, in fact two
different web servers, both with and without HAProxy, and I would
rule them out as the source of the issue.)


Thus I was wondering if anyone has encountered similar issues and how
they fixed them. (My guess is that this is due more to the Linux TCP
stack than to HAProxy.)


As a side note, is the following interpretation correct?
* `tune.*buf.server` refers to the TCP sockets that the frontend binds
to and on which it listens for actual clients;
* `tune.*buf.client` refers to the TCP sockets that the backend
creates to connect to the actual servers.

Thanks,
Ciprian.
Willy Tarreau
2018-09-30 06:08:25 UTC
Permalink
Hi Ciprian,
Post by Ciprian Dorin Craciun
Hello all!
I've played with `tune.rcvbuf.server`, `tune.sndbuf.server`,
`tune.rcvbuf.client`, and `tune.sndbuf.client` and explicitly set them
to various values ranging from 4k to 256k. Unfortunately in all cases
it seems that this generates too large TCP packets (larger than the
advertised and agreed MSS in both direction), which in turn leads to
TCP fragmentation and reassembly. (Both client and server are Linux
4.10. The protocol used was HTTP 1.1 over TLS 1.2.)
No no no, I'm sorry but this is not possible at all. You will never find
a single TCP stack doing this! I'm pretty sure there is an issue somewhere
in your capture or analysis.

MSS is the maximum segment size and corresponds to the maximum *payload*
transported over TCP. It doesn't include the IP or TCP headers. Usually
over Ethernet it's 1460, resulting in 1500-byte packets. If you're seeing
fragments, it is very likely due to an intermediary router or firewall
which has a shorter MTU at some point, such as an IPsec VPN, IP tunnel
or ADSL link, and which must fragment to deliver the data. Some such
equipment is capable of interfering with the MSS negotiation to reduce
it to fit the MTU reduction, so you need to check the affected equipment.

Also, regarding your initial question, tune.rcvbuf/sndbuf will have no
effect on all this since they only specify the extra buffer size in the
system.

However, if the problem you're experiencing is only with the listening
side, there's an "mss" parameter that you can set on your "bind" lines
to enforce a lower MSS, it may be a workaround in your case. I'm
personally using it at home to reduce the latency over ADSL ;-)
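
For illustration, such a bind line could look like the sketch below
(the 1400 value and the certificate path are only placeholders; pick an
MSS that fits the smallest MTU on the path):
~~~~
frontend https-in
    # clamp the MSS advertised to clients so segments fit through a
    # smaller-MTU hop (VPN, tunnel, PPPoE/ADSL, ...); 1400 is illustrative
    bind :443 ssl crt /etc/haproxy/example.pem mss 1400
    default_backend app
~~~~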
Post by Ciprian Dorin Craciun
The resulting bandwidth was around 10 MB.
Please use correct units when reporting issues, in order to reduce the
confusion. "10 MB" is not a bandwidth but a size (10 megabytes). Most
likely you want to mean 10 megabytes per second (10 MB/s). But maybe
you even mean 10 megabits per second (10 Mb/s or 10 Mbps), which equals
1.25 MB/s.

Regards,
Willy
Ciprian Dorin Craciun
2018-09-30 06:29:58 UTC
Permalink
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
I've played with `tune.rcvbuf.server`, `tune.sndbuf.server`,
`tune.rcvbuf.client`, and `tune.sndbuf.client` and explicitly set them
to various values ranging from 4k to 256k. Unfortunately in all cases
it seems that this generates too large TCP packets (larger than the
advertised and agreed MSS in both direction), which in turn leads to
TCP fragmentation and reassembly. (Both client and server are Linux
4.10. The protocol used was HTTP 1.1 over TLS 1.2.)
No no no, I'm sorry but this is not possible at all. You will never find
a single TCP stack doing this! I'm pretty sure there is an issue somewhere
in your capture or analysis.
[...]
However, if the problem you're experiencing is only with the listening
side, there's an "mss" parameter that you can set on your "bind" lines
to enforce a lower MSS, it may be a workaround in your case. I'm
personally using it at home to reduce the latency over ADSL ;-)
I am also extremely skeptical that this is HAProxy's fault; however,
the only change needed to eliminate this issue was commenting out
these tune arguments. I have also explicitly set the `mss` parameter
to `1400`.

The capture was taken directly on the server, on the public interface.

I'll try to make a fresh capture to see if I can replicate this.
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
The resulting bandwidth was around 10 MB.
Please use correct units when reporting issues, in order to reduce the
confusion. "10 MB" is not a bandwidth but a size (10 megabytes). Most
likely you want to mean 10 megabytes per second (10 MB/s). But maybe
you even mean 10 megabits per second (10 Mb/s or 10 Mbps), which equals
1.25 MB/s.
:) Sorry for that. (That's the outcome of writing emails at 3 AM
after 4 hours of poking into a production system.) I completely
agree with you about the MB/Mb consistency, and I always hate that
some providers still use MB to mean megabits, like it's 2000. :)

Yes, I meant 10 megabytes per second. Sorry again.

Ciprian.
Mathias Weiersmüller
2018-09-30 07:06:29 UTC
Permalink
I am pretty sure you have TCP segmentation offload enabled. The TCP/IP stack therefore sends bigger-than-allowed TCP segments towards the NIC, which in turn takes care of the proper segmentation.

You want to check the output of "ethtool -k eth0" and the values of:
tcp-segmentation-offload
generic-segmentation-offload
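
For instance (eth0 being the interface name from your setup):
~~~~
ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload'
~~~~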

Cheers

Mathias


Ciprian Dorin Craciun
2018-09-30 07:20:06 UTC
Permalink
On Sun, Sep 30, 2018 at 10:06 AM Mathias Weiersmüller
Post by Mathias Weiersmüller
I am pretty sure you have TCP segmentation offload enabled. The TCP/IP stack therefore sends bigger-than-allowed TCP segments towards the NIC who in turn takes care about the proper segmentation.
I was just trying to replicate the issue I saw yesterday, and for
a moment (in initial tests) I was able to. However, on repeated tests
it seems that `tune.rcvbuf.*` (and related) have no impact, as I
constantly see TCP fragments (around 2842-byte Ethernet frames).
Post by Mathias Weiersmüller
tcp-segmentation-offload
generic-segmentation-offload
The output of `ethtool -k eth0` is below:
~~~~
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
generic-segmentation-offload: on
~~~~

Thanks,
Ciprian.
Willy Tarreau
2018-09-30 07:35:16 UTC
Permalink
Post by Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 10:06 AM Mathias Weiersmüller
Post by Mathias Weiersmüller
I am pretty sure you have TCP segmentation offload enabled. The TCP/IP
stack therefore sends bigger-than-allowed TCP segments towards the NIC who
in turn takes care about the proper segmentation.
I was just trying to replicate the issue I've seen yesterday, and for
a moment (in initial tests) I was able to. However on repeated tests
it seems that the `tune.rcvbuf.*` (and related) have no impact, as I
constantly see TCP fragments (around 2842 bytes Ethernet frames).
Note that these are not fragments but segments. And as Matti suggested,
it's indeed due to GSO, you're seeing two TCP frames sent at once through
the stack, and they will be segmented by the NIC.
Post by Ciprian Dorin Craciun
Post by Mathias Weiersmüller
tcp-segmentation-offload
generic-segmentation-offload
~~~~
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
generic-segmentation-offload: on
~~~~
Indeed.

Willy
Ciprian Dorin Craciun
2018-09-30 07:52:10 UTC
Permalink
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
I was just trying to replicate the issue I've seen yesterday, and for
a moment (in initial tests) I was able to. However on repeated tests
it seems that the `tune.rcvbuf.*` (and related) have no impact, as I
constantly see TCP fragments (around 2842 bytes Ethernet frames).
Note that these are not fragments but segments. And as Matti suggested,
it's indeed due to GSO, you're seeing two TCP frames sent at once through
the stack, and they will be segmented by the NIC.
[Just as info.]

So it seems I was able to reproduce the bandwidth issue just by toying
with `tune.sndbuf.client`:
* with no value, downloading an 8 MB file I get decent bandwidth of
around 4 MB/s (for larger files I even get up to 10 MB/s); a
typical Ethernet frame length as reported by `tcpdump` is around 59 KB
towards the end of the transfer;
* with that tune parameter set to 128 KB, I get around 1 MB/s; a
typical Ethernet frame length is around 4 KB;
* with that tune parameter set to 16 KB, I get around 100 KB/s; a
typical Ethernet frame length is around 2 KB.


By "typical Ethernet frame length" I mean that a packet, as reported by
`tcpdump` and viewed in Wireshark, looks like this (for the first one):
~~~~
Frame 1078: 59750 bytes on wire (478000 bits), 59750 bytes captured
(478000 bits)
Encapsulation type: Ethernet (1)
Arrival Time: Sep 30, 2018 10:26:58.667739000 EEST
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1538292418.667739000 seconds
[Time delta from previous captured frame: 0.000018000 seconds]
[Time delta from previous displayed frame: 0.000018000 seconds]
[Time since reference or first frame: 1.901135000 seconds]
Frame Number: 1078
Frame Length: 59750 bytes (478000 bits)
Capture Length: 59750 bytes (478000 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ethertype:ip:tcp:ssl:ssl]
[Coloring Rule Name: TCP]
[Coloring Rule String: tcp]
Ethernet II, Src: f2:3c:91:9f:51:b8 (f2:3c:91:9f:51:b8), Dst:
Cisco_9f:f0:0a (00:00:0c:9f:f0:0a)
Destination: Cisco_9f:f0:0a (00:00:0c:9f:f0:0a)
Source: f2:3c:91:9f:51:b8 (f2:3c:91:9f:51:b8)
Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: XXX.XXX.XXX.XXX, Dst: XXX.XXX.XXX.XXX
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
0000 00.. = Differentiated Services Codepoint: Default (0)
.... ..00 = Explicit Congestion Notification: Not ECN-Capable
Transport (0)
Total Length: 59736
Identification: 0x8d7a (36218)
Flags: 0x4000, Don't fragment
0... .... .... .... = Reserved bit: Not set
.1.. .... .... .... = Don't fragment: Set
..0. .... .... .... = More fragments: Not set
...0 0000 0000 0000 = Fragment offset: 0
Time to live: 64
Protocol: TCP (6)
Header checksum: 0x3054 [validation disabled]
[Header checksum status: Unverified]
Source: XXX.XXX.XXX.XXX
Destination: XXX.XXX.XXX.XXX
Transmission Control Protocol, Src Port: 443, Dst Port: 38150, Seq:
8271805, Ack: 471, Len: 59684
Source Port: 443
Destination Port: 38150
[Stream index: 0]
[TCP Segment Len: 59684]
Sequence number: 8271805 (relative sequence number)
[Next sequence number: 8331489 (relative sequence number)]
Acknowledgment number: 471 (relative ack number)
1000 .... = Header Length: 32 bytes (8)
Flags: 0x010 (ACK)
Window size value: 234
[Calculated window size: 29952]
[Window size scaling factor: 128]
Checksum: 0x7d1c [unverified]
[Checksum Status: Unverified]
Urgent pointer: 0
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
[SEQ/ACK analysis]
[Timestamps]
TCP payload (59684 bytes)
TCP segment data (16095 bytes)
TCP segment data (10779 bytes)
[2 Reassembled TCP Segments (16405 bytes): #1073(310), #1078(16095)]
[Frame: 1073, payload: 0-309 (310 bytes)]
[Frame: 1078, payload: 310-16404 (16095 bytes)]
[Segment count: 2]
[Reassembled TCP length: 16405]
Secure Sockets Layer
Secure Sockets Layer
~~~~

I'll try to disable offloading and see what happens.

I forgot to say that this is a paravirtualized VM running on Linode in
their Dallas datacenter.

Ciprian.
Ciprian Dorin Craciun
2018-09-30 08:11:51 UTC
Permalink
Post by Willy Tarreau
Note that these are not fragments but segments. And as Matti suggested,
it's indeed due to GSO, you're seeing two TCP frames sent at once through
the stack, and they will be segmented by the NIC.
I have disabled all offloading features:
~~~~
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
~~~~

Now I see "as expected" Ethernet frames with `tcpdump` / `Wireshark`.
(There is indeed however a bump in kernel CPU usage.)


However the bandwidth behaviour is exactly the same:
* no `tune.sndbuf.client`, bandwidth goes up to 11 MB/s for a large download;
* with `tune.sndbuf.client 16384` it goes up to ~110 KB/s;
* with `tune.sndbuf.client 131072` it goes up to ~800 KB/s;
* with `tune.sndbuf.client 262144` it goes up to ~1400 KB/s;
(These are bandwidths obtained after the TCP window has "settled".)

It seems there is a linear correlation between that tune parameter and
the bandwidth.


However, given that I get the same behaviour both with and
without offloading, I wonder whether there isn't some "hidden"
consequence of setting this `tune.sndbuf.client` parameter?

Thanks,
Ciprian.
Mathias Weiersmüller
2018-09-30 08:33:08 UTC
Permalink
However the bandwidth behaviour is exactly the same:
* no `tune.sndbuf.client`, bandwidth goes up to 11 MB/s for a large download;
* with `tune.sndbuf.client 16384` it goes up to ~110 KB/s;
* with `tune.sndbuf.client 131072` it goes up to ~800 KB/s;
* with `tune.sndbuf.client 262144` it goes up to ~1400 KB/s; (These are bandwidths obtained after the TCP window has "settled".)

It seems there is a liniar correlation between that tune parameter and the bandwidth.


However due to the fact that I get the same behaviour both with and without offloading, I wonder if there isn't somehow a "hidden"
consequence of setting this `tune.sndbuf.client` parameter?

==============

Sorry for the extremely brief answer:
- you mentioned you have 160 ms latency.
- tune.sndbuf.client 16384 allows you to have 16384 bytes "on the fly", meaning unacknowledged. 16384 / 0.16 sec = roughly 100 KB/s....
- do the math with your value of 131072 and you will get your ~800 KB/s.
- no hidden voodoo happening here: read about BDP (Bandwidth Delay Product)
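
Spelling the arithmetic out with the numbers from this thread
(approximate; the remaining gap is protocol overhead and measurement noise):
~~~~
throughput ≈ send window / RTT

 16384 bytes / 0.16 s ≈  100 KB/s   (observed: ~110 KB/s)
131072 bytes / 0.16 s ≈  800 KB/s   (observed: ~800 KB/s)
262144 bytes / 0.16 s ≈ 1600 KB/s   (observed: ~1400 KB/s)
~~~~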

Cheers

Matti
Ciprian Dorin Craciun
2018-09-30 08:41:11 UTC
Permalink
On Sun, Sep 30, 2018 at 11:33 AM Mathias Weiersmüller
Post by Mathias Weiersmüller
- you mentioned you have 160 ms latency.
Yes, I mentioned this because I've read somewhere (I don't remember
where now) that the `SO_SNDBUF` socket option also impacts the TCP
window size.
Post by Mathias Weiersmüller
- tune.sndbuf.client 16384 allows you to have 16384 bytes "on-the-fly", meaning unacknowlegded. 16384 / 0.16 sec = roughly 128 KB/s....
- do the math with your value of 131072 and you will have get your ~800 KB/s.
- no hidden voodoo happening here: read about BDP (Bandwidth Delay Product)
Please don't get me wrong: I didn't imply any "voodoo". :)

When I asked if there is some "hidden" consequence I didn't mean it
as "magic", but as a question about what other (unknown to me)
consequences there are.

And it seems that `tune.sndbuf.client` also limits the TCP window size.

So my question is: how can I (if at all possible) configure the buffer
size without "breaking" the TCP window size?

Thanks,
Ciprian.
Ciprian Dorin Craciun
2018-09-30 09:01:54 UTC
Permalink
On Sun, Sep 30, 2018 at 11:41 AM Ciprian Dorin Craciun
Post by Mathias Weiersmüller
- tune.sndbuf.client 16384 allows you to have 16384 bytes "on-the-fly", meaning unacknowlegded. 16384 / 0.16 sec = roughly 128 KB/s....
- do the math with your value of 131072 and you will have get your ~800 KB/s.
However, something bothers me... Setting `tune.sndbuf.client` is
used only to call `setsockopt(SO_SNDBUF)`, right? It is not used by
HAProxy for any internal buffer size?

If so, then by not setting it, the kernel should choose the default
value, which, according to:
~~~~
sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 16384 4194304
~~~~
, should be 16384.

Looking with `netstat` at the `Recv-Q` column, it seems that with no
`tune` setting the value even goes up to 5 MB.
However, with the `tune` parameter set, it only goes up to around 20 KB.




Anyway, here is why I am trying to configure the send buffer size: I
have large downloads and (some) slow clients, and as a consequence
HAProxy times out waiting for the kernel buffer to drain.
However, if I configure the buffer size small enough, it seems HAProxy
is "kept busy" and nothing breaks.

Thus, is there a way to have both OK bandwidth for normal clients and
no timeouts for slow clients?

Thanks,
Ciprian.
Willy Tarreau
2018-09-30 09:12:13 UTC
Permalink
Post by Ciprian Dorin Craciun
On Sun, Sep 30, 2018 at 11:41 AM Ciprian Dorin Craciun
Post by Mathias Weiersmüller
- tune.sndbuf.client 16384 allows you to have 16384 bytes "on-the-fly", meaning unacknowlegded. 16384 / 0.16 sec = roughly 128 KB/s....
- do the math with your value of 131072 and you will have get your ~800 KB/s.
However, something bothers me... Setting `tune.sndbuf.client`, is
used only to call `setsockopt (SO_SNDBUF)`, right? It is not used by
HAProxy for any internal buffer size?
No, it's totally independent. haproxy uses tune.bufsize for its own
buffers. You are configuring the recv and send buffers of the sockets
(in the system) and as such acting on the window sizes.
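
As a sketch of how the two families of settings sit side by side in the
global section (the values below are only illustrative, not recommendations):
~~~~
global
    # haproxy's own internal per-stream buffer (default 16384)
    tune.bufsize 16384
    # kernel socket buffers (SO_SNDBUF / SO_RCVBUF) for client-facing sockets;
    # these cap the TCP window, so leaving them unset lets the kernel autotune
    #tune.sndbuf.client 131072
    #tune.rcvbuf.client 131072
~~~~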
Post by Ciprian Dorin Craciun
If so then by not setting it the kernel should choose the default
~~~~
sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 16384 4194304
~~~~
, should be 16384.
No, it *starts* at 16384 then grows up to the configured limit depending
on the ability to do so without losses and the available memory.
Post by Ciprian Dorin Craciun
Looking with `netstat` at the `Recv-Q` column, it seems that with no
`tune` setting the value even goes up to 5 MB.
Recv-Q is based on tcp_rmem, not tcp_wmem.
Post by Ciprian Dorin Craciun
However setting the `tune` parameter it always goes up to around 20 KB.
Anyway, why am I trying to configure the sending buffer size: if I
have large downloads and I have (some) slow clients, and as a
consequence HAProxy times out waiting for the kernel buffer to clear.
Thus you might have very short timeouts! Usually it's not supposed to
be an issue.
Post by Ciprian Dorin Craciun
However if I configure the buffer size small enough it seems HAProxy
is "kept bussy" and nothing breaks.
I see, but then maybe you should simply lower the tcp_wmem max value a
little bit, or increase your timeout?
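
For example (1 MB is just an illustrative ceiling), the autotuning maximum
can be lowered system-wide instead of pinning SO_SNDBUF per socket:
~~~~
# net.ipv4.tcp_wmem = min default max (bytes); only the max is lowered here
sysctl -w net.ipv4.tcp_wmem="4096 16384 1048576"
~~~~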
Post by Ciprian Dorin Craciun
Thus, is there a way to have both OK bandwidth for normal clients, and
not timeout for slow clients?
That's exactly the role of the TCP stack. It measures RTT and losses and
adjusts the send window accordingly. You must definitely let the TCP
stack play its role there; you'll have far fewer problems. Even if you
keep 4 MB as the max send window, for a 1 Mbps client that's roughly 40
seconds of transfer. You can deal with this using much larger timeouts
(1 or 2 minutes), and configure the tcp-ut value on the bind line to
get rid of clients which do not ACK the data they're being sent at the
TCP level.
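
As an illustration of that combination (the timeout and tcp-ut values
below are examples only, not recommendations):
~~~~
frontend https-in
    # give slow but live clients enough time to drain large responses
    timeout client 2m
    # drop clients that stop ACKing anything at the TCP level for 20s
    bind :443 ssl crt /etc/haproxy/example.pem tcp-ut 20s
~~~~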

Willy
Ciprian Dorin Craciun
2018-09-30 09:23:20 UTC
Permalink
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
If so then by not setting it the kernel should choose the default
~~~~
sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 16384 4194304
~~~~
, should be 16384.
No, it *starts* at 16384 then grows up to the configured limit depending
on the ability to do so without losses and the available memory.
OK. The Linux man page omits this part... Good to know. :)
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
Anyway, why am I trying to configure the sending buffer size: if I
have large downloads and I have (some) slow clients, and as a
consequence HAProxy times out waiting for the kernel buffer to clear.
Thus you might have very short timeouts! Usually it's not supposed to
be an issue.
I wouldn't say they are "small":
~~~~
timeout server 60s
timeout server-fin 6s
timeout client 30s
timeout client-fin 6s
timeout tunnel 180s
timeout connect 6s
timeout queue 30s
timeout check 6s
timeout tarpit 30s
~~~~

As seen above, the timeout I believe is the culprit is `timeout
client 30s`, which I would guess is quite enough.
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
However if I configure the buffer size small enough it seems HAProxy
is "kept bussy" and nothing breaks.
I see but then maybe you should simply lower the tcp_wmem max value a
little bit, or increase your timeout ?
I'll try to experiment with `tcp_wmem max` as you've suggested.
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
Thus, is there a way to have both OK bandwidth for normal clients, and
not timeout for slow clients?
That's exactly the role of the TCP stack. It measures RTT and losses and
adjusts the send window accordingly. You must definitely let the TCP
stack play its role there, you'll have much less problems. Even if you
keep 4 MB as the max send window, for a 1 Mbps client that's rougly 40
seconds of transfer. You can deal with this using much larger timeouts
(1 or 2 minutes), and configure the tcp-ut value on the bind line to
get rid of clients which do not ACK the data they're being sent at the
TCP level.
I initially let TCP "do its thing", but it got me into trouble
with poor wireless clients...

I'll also give `tcp-ut` a try, as suggested.

Thanks,
Ciprian.
Willy Tarreau
2018-09-30 11:22:53 UTC
Permalink
Post by Ciprian Dorin Craciun
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
Anyway, why am I trying to configure the sending buffer size: if I
have large downloads and I have (some) slow clients, and as a
consequence HAProxy times out waiting for the kernel buffer to clear.
Thus you might have very short timeouts! Usually it's not supposed to
be an issue.
~~~~
timeout server 60s
timeout server-fin 6s
timeout client 30s
timeout client-fin 6s
timeout tunnel 180s
timeout connect 6s
timeout queue 30s
timeout check 6s
timeout tarpit 30s
~~~~
As seen the timeout which I believe is the culprit is the `timeout
client 30s` which I guess is quite enough.
It's enough for a 2 Mbps bandwidth on the client, not for less. I don't
see the point in setting too short timeouts on the client side for data
transfers. I tend to consider that if the response has started to be sent,
then the most expensive part was already done and it had better be completed;
otherwise the client will try again and inflict the same cost on the
server again. You should probably increase this enough so that you
don't see unexpected timeouts anymore, and rely on tcp-ut to cut early
if a client doesn't read the data.

Willy
Ciprian Dorin Craciun
2018-09-30 11:35:24 UTC
Permalink
Post by Willy Tarreau
Post by Ciprian Dorin Craciun
As seen the timeout which I believe is the culprit is the `timeout
client 30s` which I guess is quite enough.
I tend to consider that if the response starts to be sent,
then the most expensive part was done and it'd better be completed
otherwise the client will try again and inflict the same cost to the
server again.
I prefer shorter timeout values because on the server side I have
uWSGI with Python, and with its default model (one process handling one
request at a time), having long-outstanding connections could degrade the
user experience.
Post by Willy Tarreau
You should probably increase this enough so that you
don't see unexpected timeouts anymore, and rely on tcp-ut to cut early
if a client doesn't read the data.
One question about this: if the client gradually reads from the
(server-side) buffer but doesn't completely clear it, would having
`TCP_USER_TIMEOUT` configured consider this connection "live"?
More specifically, say there is 4 MB in the server buffer and the
client "consumes" (i.e. acknowledges) only small parts of it; would
the timeout apply:
(A) until the entire buffer is cleared, or
(B) until at least "some" amount of data is read?

Thanks,
Ciprian.
Willy Tarreau
2018-09-30 12:15:57 UTC
Permalink
Post by Ciprian Dorin Craciun
One question about this: if the client gradually reads from the
(server side) buffer, but it doesn't completely clears it, having this
`TCP_USER_TIMEOUT` configured would consider this connection "live"?
Yes, that's it.
Post by Ciprian Dorin Craciun
More specifically, say there is 4MB in the server buffer and the
client "consumes" (i.e. acknowledges) only small parts of it, would
(A) until the entire buffer is cleared, or
(B) until at least "some" amount of data is read;
The timeout is an inactivity period. So let's say you set 10s in tcp-ut,
it would only kill the connection if the client acks nothing in 10s, even
if it takes 3 minutes to dump the whole buffer. It's mostly used in
environments with very long connections where clients may disappear
without warning, such as websocket connections or webmails.

Willy

Willy Tarreau
2018-09-30 07:32:57 UTC
Permalink
Post by Mathias Weiersmüller
I am pretty sure you have TCP segmentation offload enabled. The TCP/IP stack
therefore sends bigger-than-allowed TCP segments towards the NIC who in turn
takes care about the proper segmentation.
tcp-segmentation-offload
generic-segmentation-offload
Yep, totally agreed: as soon as you have either GSO or TSO, you will see
large frames. Ciprian, in this case it's better to capture from another
machine in the path to get a reliable capture. You can also disable
TSO/GSO using ethtool -K, but be prepared to see a significant bump in
CPU usage. Don't do this if you are already running above 20% CPU usage
on average.
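
For reference, turning the offloads off (as was done earlier in this
thread) would look roughly like this; turn them back on with `on` if the
CPU cost is too high:
~~~~
# disable TSO, GSO and GRO so captures show on-the-wire sized segments
ethtool -K eth0 tso off gso off gro off
~~~~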

Willy