XPRA does not recover from network congestion (even when network does) #2090

totaam · 2018-12-24T19:55:15Z

Issue migrated from trac ticket #2090

component: network | priority: critical | resolution: needinfo

2018-12-24 19:55:15: nathan_lstc created the issue

I have been running performance tests on XPRA (between work and home), and I noticed that if my internet connection gets congested with H264 content from my OpenGL window, my xterm windows become very slow and never recover unless I restart the client.

I have now tested things on my gigabit work LAN, using a piece of software that rate-limits the MS Windows XPRA client. I found that if I impose a 5kb/s upload limit things get very slow, and that when I turn off this limit the recovery is poor. I believe, therefore, that I have found a way to reliably simulate the problem that I was seeing from home: when temporary congestion makes things slow, XPRA does not seem to recover. I'm baffled.

Here is my procedure:

rate limit

rotate 3D model until the network congestion spinner appears

turn off rate limit

What happens:

The 3D window remains slow even after several seconds of idling with the rate limit turned off. This window does recover, but seemingly only after a long period of continuous use without congestion. Desired behavior: the window should recover right away.

Even though the action was in the OpenGL window the xterm window never recovers until I restart the client. Specifically, the time between keypress and echo becomes very long and stays that way. Desired behavior: the xterm should recover instantly.

What I'm using:

head revision server.

Bandwidth detection settings seem to make no difference.

I've tried with several clients.

I've so far only used paramiko ssh (one hop) on the controlled lan tests.

From home I've observed that the problem is worse with the 2-hop setup than with ssh -L from the shell followed by a one-hop connection (whether the one-hop is tcp:// or ssh:// seems to make little to no difference). I think, therefore, that I may need to multithread the 2-hop ssh. I'll send patches soon (I plan on starting with paramiko). Can I put them on this ticket?

totaam · 2018-12-24T22:05:26Z

2018-12-24 22:05:26: nathan_lstc commented

XPRA_FORCE_BATCH=1 seems to solve the problem on the LAN (later I will look at what happens over broadband). I think that somehow what is going on in my OpenGL window is driving the batch delay up. (When I opened the ticket I didn't understand what batch delay did).

If my guess about what is happening is right, I think the batch delay for "text" windows should plummet rapidly after a spike, because sluggish shells are hard to use (compared to GUIs).

totaam · 2018-12-25T19:17:30Z

2018-12-25 19:17:30: antoine changed owner from antoine to nathan_lstc

totaam · 2018-12-25T19:17:30Z

2018-12-25 19:17:30: antoine commented

Sounds similar to #1911.

Forcing XPRA_FORCE_BATCH=1 will ensure that we let regions accumulate, giving more opportunity for rectangles to get merged and for packet aggregation (#619) - which means better use of the more limited bandwidth.
Maybe we should always batch by default, or at least always batch when we see any congestion.
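
As a rough illustration of why batching helps on a constrained link, here is a minimal sketch of accumulating damaged rectangles and merging them before sending. It is not xpra's implementation: the names `Rect`, `DamageBatcher` and `bounding_box` are hypothetical, the merge is a plain bounding box, and the flush is triggered by the next damage event rather than by a timer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int


def bounding_box(rects):
    """Merge a list of damaged rectangles into one enclosing rectangle."""
    x1 = min(r.x for r in rects)
    y1 = min(r.y for r in rects)
    x2 = max(r.x + r.w for r in rects)
    y2 = max(r.y + r.h for r in rects)
    return Rect(x1, y1, x2 - x1, y2 - y1)


class DamageBatcher:
    """Hypothetical batcher: collects damage for `delay_ms`, then flushes once."""

    def __init__(self, delay_ms):
        self.delay_ms = delay_ms
        self.pending = []
        self.first_damage_ms = None

    def damage(self, now_ms, rect):
        """Record a rectangle; return a merged Rect once the batch has expired, else None."""
        if self.first_damage_ms is None:
            self.first_damage_ms = now_ms
        self.pending.append(rect)
        # Simplification: a real server would use a timer instead of
        # waiting for the next damage event to notice the expiry.
        if now_ms - self.first_damage_ms >= self.delay_ms:
            return self.flush()
        return None

    def flush(self):
        """Merge everything accumulated so far and reset the batch."""
        merged = bounding_box(self.pending)
        self.pending = []
        self.first_damage_ms = None
        return merged


batcher = DamageBatcher(delay_ms=5)
batcher.damage(0, Rect(0, 0, 10, 10))
batcher.damage(2, Rect(5, 5, 10, 10))
merged = batcher.damage(6, Rect(20, 0, 5, 5))  # three updates, one packet
```

With a 5ms window, the three small updates above go out as a single merged region instead of three separate packets.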

How can I reproduce this easily on my laptop?

totaam · 2018-12-26T23:03:04Z

2018-12-26 23:03:04: nathan_lstc commented

My procedure is

Using VirtualGL run lsprepost with a nontrivial 3D model. I can provide you with lsprepost and a nontrivial data set. I'm guessing that anything that really congests the network will do the trick.

I've only tested on Windows, where I used a program called "NetLimiter 4". I set it to limit download to 50KB/s and upload to 5KB/s. Then things get very slow (as expected). Then I go ahead and uncheck the limits, and the batch delay is huge and things, especially xterms, become unusable.

I have found that with:

Environment=XPRA_FORCE_BATCH=1
Environment=XPRA_BATCH_MAX_DELAY=50

I get pretty good results over broadband and LAN. I'm not saying that is the right thing to do; frankly, I have very little clue about this. I'm going to test performance on an LTE hotspot shortly...

With these settings + dbus video-box hinting XPRA is working better than anything else I've ever tried. I'll soon be deploying it to some more of my users.
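
One way to picture what a cap like XPRA_BATCH_MAX_DELAY=50 does is as a clamp applied to a dynamically computed delay. The sketch below assumes the variable is an upper bound in milliseconds; the helper names, the 500ms fallback and the 5ms floor are illustrative choices, not xpra's actual code.

```python
import os


def env_int(name, default):
    """Read an integer from the environment, falling back to `default`."""
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default


def capped_delay(dynamic_delay_ms, min_ms=5):
    """Clamp a dynamically computed delay between min_ms and the configured cap."""
    # Assumption: XPRA_BATCH_MAX_DELAY is an upper bound in milliseconds.
    # The fallback of 500 is a hypothetical default for this sketch.
    max_ms = env_int("XPRA_BATCH_MAX_DELAY", 500)
    return max(min_ms, min(dynamic_delay_ms, max_ms))
```

With the variable set to 50, a delay that congestion had pushed to 900ms would be held at 50ms, which matches the behaviour described above but also hides the real value from any other logic that reads it.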

totaam · 2018-12-27T19:55:53Z

2018-12-27 19:55:53: antoine changed status from new to assigned

totaam · 2018-12-27T19:55:53Z

2018-12-27 19:55:53: antoine changed owner from nathan_lstc to antoine

totaam · 2018-12-27T19:55:53Z

2018-12-27 19:55:53: antoine commented

> Using VirtualGL run lsprepost with a nontrivial 3D model.
> I can provide you with lsprepost and a nontrivial data set.
> I'm guessing that anything that really congests the network will do the trick.

Can you reproduce with something more widely available? glxgears or glxspheres perhaps?

> I set it to limit download to 50KB/s and upload to 5KB/s.
> Then things get very slow (as expected).

It's a miracle it works at all. 5KB/s is much lower than anything it was ever designed for!

> Then I go ahead and uncheck the limits, and the batch delay is huge and things, especially xterms, become unusable.

Ah.

> Environment=XPRA_FORCE_BATCH=1

It should be fine to change this default.
In almost all cases, batching is the right thing to do. The minimum, which is 5ms, is imperceptible anyway.
VNC servers have it always enabled.

> Environment=XPRA_BATCH_MAX_DELAY=50

I am less sure about this one.
With the mostly dynamic batch delay code, the base value is almost meaningless, except when it is high and causes problems...
The only issue with capping the batch delay is that we have other heuristics that use the batch delay as input.
I'll see what I can do.

totaam · 2018-12-28T23:03:29Z

2018-12-28 23:03:29: nathan_lstc commented

I have not disappeared.

The performance of VirtualGL on LTE hotspot was really great. This is a testament to XPRA, not my hotspot, which isn't that great.

I will get back to you on this once I have finished hooking dbus to lsprepost. I have gotten our developers to give me a callback, and I have figured out how to extract relative coordinates from the glCanvas. Now, I have to write the dbus code...

totaam · 2019-01-02T09:10:35Z

2019-01-02 09:10:35: antoine commented

r21267 enables batching by default.
As for the max delay change, my current ideas are:

use the existing "soft-expired" mechanism: we could let the batch delay increase, but speculatively allow packets to go out when there is no bottleneck. (the batch delay would end up increasing a lot less since the "actual batch delay" will remain lower)

if we have already waited longer than the current batch delay value (ie: not many screen updates in an xterm), then we can lower the batch delay for the next screen update that comes - (ideally ignoring small screen updates...)
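
The second idea can be sketched as follows. This is only an illustration under simplifying assumptions (a single window, timestamps in milliseconds, a hypothetical function name), not the code that was committed.

```python
def wait_before_send(now_ms, last_update_ms, batch_delay_ms):
    """Return how long (ms) to hold back a new screen update before sending it."""
    idle_ms = now_ms - last_update_ms
    if idle_ms >= batch_delay_ms:
        # The window has already been quiet for longer than the batch delay,
        # so waiting again would add latency without merging anything.
        return 0
    # Updates are arriving in quick succession: keep batching as usual.
    return batch_delay_ms


# An xterm keypress echo after 2 seconds of silence goes out immediately,
# while a busy OpenGL window keeps being batched.
xterm_wait = wait_before_send(now_ms=12_000, last_update_ms=10_000, batch_delay_ms=900)
opengl_wait = wait_before_send(now_ms=12_000, last_update_ms=11_990, batch_delay_ms=900)
```

The appeal of this approach is that it helps mostly-idle windows such as terminals without touching the batch delay value that other heuristics read.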

totaam · 2019-01-02T20:47:07Z

2019-01-02 20:47:07: antoine changed status from assigned to new

totaam · 2019-01-02T20:47:07Z

2019-01-02 20:47:07: antoine changed owner from antoine to nathan_lstc

totaam · 2019-01-02T20:47:07Z

2019-01-02 20:47:07: antoine commented

More improvements in:

r21273: batching is enabled by default, move some checks to save some cpu cycles

r21275: skip waiting unnecessarily when the window was idle for longer than the batch delay + move more code out of hot path

This won't fix your problems, but it will help.
Can you please capture the -d stats debug output of when the batch delay stays high when it shouldn't?

totaam · 2019-01-04T19:12:00Z

2019-01-04 19:12:00: nathan_lstc commented

I've attached log1.txt.

At about 9:50:50 I do a rotation and it's good.
At about 9:51:00 I choke the bandwidth.
At about 9:51:22 I restore the bandwidth. The 3D window remains slow, which is not good. Oddly, my xterm comes right back, which is good.
At about 9:52:30 the lag seems to be mostly gone.

totaam · 2019-01-04T19:12:32Z

2019-01-04 19:12:32: nathan_lstc uploaded file `log1.txt` (351.9 KiB)

totaam · 2019-01-04T19:30:59Z

2019-01-04 19:30:59: antoine changed priority from major to critical

totaam · 2019-01-04T19:30:59Z

2019-01-04 19:30:59: antoine changed status from new to assigned

totaam · 2019-01-04T19:30:59Z

2019-01-04 19:30:59: antoine changed owner from nathan_lstc to antoine

totaam · 2019-01-04T19:30:59Z

2019-01-04 19:30:59: antoine commented

Thanks, I can reproduce the problem locally using tc.
This is caused by the latency heuristics: an increase in latency causes the batch delay to go up quickly, but a decrease in latency does not bring it back down quickly enough.
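
To make the asymmetry concrete, here is a toy model of a delay that follows a latency-derived target quickly on the way up and slowly on the way down. The weights and thresholds are invented for illustration and are not taken from xpra's heuristics.

```python
def update_delay(current_ms, target_ms, up_weight=0.5, down_weight=0.05):
    """Move the delay toward the target, faster upward than downward."""
    weight = up_weight if target_ms > current_ms else down_weight
    return current_ms + (target_ms - current_ms) * weight


def steps_to_recover(start_ms, target_ms, threshold_ms, **weights):
    """Count updates needed for the delay to fall to threshold_ms or below."""
    delay = start_ms
    steps = 0
    while delay > threshold_ms:
        delay = update_delay(delay, target_ms, **weights)
        steps += 1
    return steps


# Congestion pushed the delay to 900ms; the network is healthy again (target 10ms).
slow = steps_to_recover(900, 10, 50)                   # asymmetric weights
fast = steps_to_recover(900, 10, 50, down_weight=0.5)  # symmetric weights
```

In this model the symmetric variant gets back under 50ms within a handful of updates, while the asymmetric one needs more than ten times as many, which is the kind of lingering sluggishness reported above.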

totaam · 2019-01-11T13:19:19Z

2019-01-11 13:19:19: antoine changed status from assigned to new

totaam · 2019-01-11T13:19:19Z

2019-01-11 13:19:19: antoine changed owner from antoine to nathan_lstc

totaam · 2019-01-11T13:19:19Z

2019-01-11 13:19:19: antoine commented

More fixes:

r21290 + r21294 cosmetic

r21299: expire regions much more quickly (but don't send until backlog is cleared)

@nathan_lstc: is this now usable? (ignoring the pycuda issue from #2022#comment:72 for now)

totaam · 2019-01-13T13:55:21Z

2019-01-13 13:55:21: nathan_lstc commented

I have 4 computers running XPRA servers right now. Only one of them is running vanilla yum-repo. I have just upgraded that one to r21314. Out of the box it seems to work well. Here is my systemd file:
ExecStart=/bin/sh -c "cd ~;PATH=/opt/xpra/bin:$PATH xpra --no-daemon start --no-printing  --start-via-proxy=no --systemd-run=no --start=\"xrdb -merge $HOME/.Xresources\" --start-child=xterm --exit-with-children --mdns=no --xsettings=no :`id -u`"
Environment=PYTHONPATH=/opt/xpra/lib64/python2.7/site-packages
Environment=LD_LIBRARY_PATH=/opt/libjpeg-turbo/lib64/:/usr/local/cuda/lib64:/usr/lib64/xpra
Environment=CUDA_VISIBLE_DEVICES=0
Everything is working right for me (I'll check with the guy having the pycuda issue on Monday). Looking through the patches, I can see they do what I was setting through env variables. Thanks!

One more question: "xpra info" has a lot of numbers called "delay". How do I interpret them all?
client.batch.delay.50p=112
client.batch.delay.80p=932
client.batch.delay.90p=936
client.batch.delay.avg=412
client.batch.delay.cur=935
client.batch.delay.max=941
client.batch.delay.min=60
client.batch.locked=False
client.batch.max-delay=500
client.batch.min-delay=16
client.batch.timeout-delay=15000
client.window.1.batch.actual_delays.90p=63
client.window.1.batch.actual_delays.avg=57
client.window.1.batch.actual_delays.cur=57
client.window.1.batch.actual_delays.max=106
client.window.1.batch.actual_delays.min=46
Is window.1.batch.actual_delays.cur the batch delay that is currently happening? If so, what about "client.batch.delay.cur"? Right now, one number looks okay the other not so good:
[nathan@bobross lsprepost4.7_centos7]$ xpra info :250 |grep -i delay |grep cur
client.batch.delay.cur=934
client.window.1.batch.actual_delays.cur=80
client.window.1.batch.delay.cur=80
client.window.8.batch.actual_delays.cur=230
client.window.8.batch.delay.cur=230
[nathan@bobross lsprepost4.7_centos7]$
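
For anyone who wants to track these values over time rather than eyeball them, the grep pipeline above can be reproduced in a few lines of Python. This is a minimal sketch that assumes the `key=value` line format shown in this comment; the function names are hypothetical.

```python
def parse_info(text):
    """Parse 'key=value' lines into a dict, skipping anything else (e.g. shell prompts)."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key and " " not in key:
            info[key] = value
    return info


def current_delays(info):
    """Equivalent of `grep -i delay | grep cur`, with the values converted to int."""
    return {
        key: int(value)
        for key, value in info.items()
        if "delay" in key.lower() and key.endswith(".cur")
    }


def per_window(delays):
    """Keep only the per-window entries, dropping the global client.batch value."""
    return {k: v for k, v in delays.items() if k.startswith("client.window.")}
```

Feeding it the captured output of `xpra info :250` separates the global `client.batch.delay.cur` from the per-window values, which makes it easier to see which window is actually affected.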

totaam · 2019-01-13T14:14:11Z

2019-01-13 14:14:11: antoine commented

> One more question: "xpra info" has a lot of numbers called "delay". How do I interpret them all?

That's coming from get_weighted_list_stats in [/browser/xpra/trunk/src/xpra/simple_stats.py].

"min", "max" and "avg" should be self explanatory

"cur" is current

90p is the 90th percentile (likewise for 50p and 80p)
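
As a simplified picture of what these fields mean, the sketch below computes the same set of names from a plain list of samples. It is unweighted and uses a nearest-rank percentile, so it should be read as an illustration of the field names only, not as a copy of get_weighted_list_stats.

```python
def simple_stats(values):
    """Summarise a non-empty list of delay samples (oldest first)."""
    ordered = sorted(values)
    count = len(ordered)

    def percentile(p):
        # Nearest-rank percentile using integer arithmetic: ceil(p * count / 100).
        rank = -(-p * count // 100)
        return ordered[max(0, rank - 1)]

    return {
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(ordered) // count,
        "cur": values[-1],  # most recent sample
        "50p": percentile(50),
        "80p": percentile(80),
        "90p": percentile(90),
    }


stats = simple_stats([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
```

A "cur" value that sits near "max" and well above "50p", as in the output quoted earlier, suggests the delay has recently climbed and has not yet come back down.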

> Is window.1.batch.actual_delays.cur the batch delay that is currently happening?

For this window, yes.

> If so, what about "client.batch.delay.cur"?

That's the global value, which is only used when creating new windows.

> Right now, one number looks okay the other not so good:
> (..)
> client.window.8.batch.actual_delays.cur=230

Yes, that's too high.
Can you get the -d stats log?

totaam · 2019-02-07T15:14:38Z

2019-02-07 15:14:38: antoine commented

Bump.

totaam · 2019-02-11T03:20:14Z

2019-02-11 03:20:14: antoine commented

Fixes to the batch delay changes in r21621. (found thanks to #2140#comment:1)
This may explain the high batch delay values. The new dynamic delay code was hiding that somewhat - doing its job, but masking unreasonable values.

totaam · 2019-02-13T21:59:29Z

2019-02-13 21:59:29: nathan_lstc commented

I am far from certain, but I just tried a combination of operations over a 1Gb LAN that felt slow in the previous revisions, and in this revision it seemed absolutely smooth. I'm going to run a much more extensive test and get back to you.
[nathan@curry lsprepost4.7_centos7]$ xpra info :250 | grep delay | grep cur
client.batch.delay.cur=8
client.window.1.batch.actual_delays.cur=60
client.window.1.batch.delay.cur=3
client.window.15.batch.actual_delays.cur=150
client.window.15.batch.delay.cur=11
client.window.16.batch.actual_delays.cur=8
client.window.16.batch.delay.cur=8
[nathan@curry lsprepost4.7_centos7]$

totaam · 2019-03-07T12:38:01Z

2019-03-07 12:38:01: antoine commented

Bump.

totaam · 2019-03-18T02:46:34Z

2019-03-18 02:46:34: antoine changed status from new to closed

totaam · 2019-03-18T02:46:34Z

2019-03-18 02:46:34: antoine set resolution to needinfo

totaam · 2020-02-10T08:56:38Z

2020-02-10 08:56:38: antoine commented

See also #2421.

totaam closed this as completed Mar 18, 2019

totaam added the v2.4.x label Jan 22, 2021

totaam mentioned this issue Jan 22, 2021

Window icon not updating on client #2140

Closed

totaam mentioned this issue Oct 4, 2020

network layer improvements #1590

Open
