XPRA does not recover from network congestion (even when network does) #2090

Closed
totaam opened this issue Dec 24, 2018 · 30 comments

@totaam
Collaborator

totaam commented Dec 24, 2018

Issue migrated from trac ticket # 2090

component: network | priority: critical | resolution: needinfo

2018-12-24 19:55:15: nathan_lstc created the issue


I have been running performance tests on XPRA between work and home. I noticed that if my internet connection gets congested with H264 content from my OpenGL window, my xterm windows become very slow and never recover unless I restart the client.

I have now tested things on my gigabit work LAN, using a piece of software that rate-limits the MS Windows XPRA client. I found that if I impose a 5kb/s upload limit things get very slow, and that when I turn off this limit the recovery is poor. I believe, therefore, that I have found a way to reliably simulate the problem that I was seeing from home: once temporary congestion has made things slow, XPRA does not seem to recover. I'm baffled.

Here is my procedure:

  1. rate limit
  2. rotate 3D model until the network congestion spinner appears
  3. turn off rate limit

What happens:

  1. The 3D window remains slow even after several seconds of idle time with the rate limit turned off. This window does recover, but seemingly only after a long period of continuous use without congestion. Desired behavior: the window should recover right away.

  2. Even though the action was in the OpenGL window the xterm window never recovers until I restart the client. Specifically, the time between keypress and echo becomes very long and stays that way. Desired behavior: the xterm should recover instantly.

What I'm using:

  1. head revision server.
  2. Bandwidth detection settings seem to make no difference.
  3. I've tried with several clients.
  4. I've so far only used paramiko ssh (one hop) on the controlled LAN tests.
  5. From home I've observed the problem is worse with the 2hop than with ssh -L from the shell and then a one-hop (whether the one-hop is tcp:// or ssh:// seems to make little to no difference). I think, therefore, I may need to multithread the 2hop ssh. I'll send patches soon (I plan on starting with paramiko). Can I put them on this ticket?

totaam commented Dec 24, 2018

2018-12-24 22:05:26: nathan_lstc commented


XPRA_FORCE_BATCH=1 seems to solve the problem on the LAN (later I will look at what happens over broadband). I think that somehow what is going on in my OpenGL window is driving the batch delay up. (When I opened the ticket I didn't understand what batch delay did).

If my guess is right, I think the batch delay for "text" windows should drop rapidly after a spike, because sluggish shells are hard to use (compared to GUIs).

totaam commented Dec 25, 2018

2018-12-25 19:17:30: antoine changed owner from antoine to nathan_lstc

totaam commented Dec 25, 2018

2018-12-25 19:17:30: antoine commented


Sounds similar to #1911.

Forcing XPRA_FORCE_BATCH=1 will ensure that we let regions accumulate, giving more opportunity for rectangles to get merged and for packet aggregation (#619) - which means better use of the more limited bandwidth.
Maybe we should always batch by default, or at least always batch when we see any congestion.
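To make the batching argument concrete, here is a minimal Python sketch of how letting damage regions accumulate for one batch window can reduce the number of rectangles that need to be sent. The `Rect` type, the `merge_overlapping` helper and the bounding-box merge policy are illustrative assumptions, not xpra's actual region code.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share any area or touch at an edge."""
    return (a.x <= b.x + b.w and b.x <= a.x + a.w and
            a.y <= b.y + b.h and b.y <= a.y + a.h)


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    x = min(a.x, b.x)
    y = min(a.y, b.y)
    x2 = max(a.x + a.w, b.x + b.w)
    y2 = max(a.y + a.h, b.y + b.h)
    return Rect(x, y, x2 - x, y2 - y)


def merge_overlapping(rects: List[Rect]) -> List[Rect]:
    """Merge the damage rectangles accumulated during one batch window."""
    merged: List[Rect] = []
    for r in rects:
        changed = True
        while changed:
            changed = False
            for m in merged:
                if overlaps(m, r):
                    # fold the existing rectangle into r, then re-check the rest
                    merged.remove(m)
                    r = union(m, r)
                    changed = True
                    break
        merged.append(r)
    return merged


# three damage events arriving within the same batch window
damage = [Rect(0, 0, 10, 10), Rect(5, 5, 10, 10), Rect(100, 100, 4, 4)]
to_send = merge_overlapping(damage)
```

Without batching, each of the three damage events would be encoded and sent on its own; with batching, the two overlapping ones collapse into a single update.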

How can I reproduce this easily on my laptop?

totaam commented Dec 26, 2018

2018-12-26 23:03:04: nathan_lstc commented


My procedure is

  1. Using VirtualGL run lsprepost with a nontrivial 3D model. I can provide you with lsprepost and a nontrivial data set. I'm guessing that anything that really congests the network will do the trick.

  2. I've only tested on Windows, where I used a program called "NetLimiter 4". I set it to limit download to 50KB/s and upload to 5KB/s. Then things get very slow (as expected). Then I go ahead and uncheck the limits, and the batch delay is huge and things, especially xterms, become unusable.

I have found that with:

Environment=XPRA_FORCE_BATCH=1
Environment=XPRA_BATCH_MAX_DELAY=50

I get pretty good results over broadband and LAN. I'm not saying that is the right thing to do; frankly, I have very little clue about this. I'm going to test performance on an LTE hotspot shortly...

With these settings + dbus video-box hinting XPRA is working better than anything else I've ever tried. I'll soon be deploying it to some more of my users.

totaam commented Dec 27, 2018

2018-12-27 19:55:53: antoine changed status from new to assigned

totaam commented Dec 27, 2018

2018-12-27 19:55:53: antoine changed owner from nathan_lstc to antoine

totaam commented Dec 27, 2018

2018-12-27 19:55:53: antoine commented


> Using VirtualGL run lsprepost with a nontrivial 3D model.
> I can provide you with lsprepost and a nontrivial data set.
> I'm guessing that anything that really congests the network will do the trick.

Can you reproduce with something more widely available? glxgears or glxspheres perhaps?

> I set it to limit download to 50KB/s and upload to 5KB/s.
> Then things get very slow (as expected).

It's a miracle it works at all. 5KB/s is really much lower than anything it was ever designed for!

> Then I go ahead and uncheck the limits, and the batch delay is huge and things, especially xterms, become unusable.

Ah.

> Environment=XPRA_FORCE_BATCH=1

It should be fine to change this default.
In almost all cases, batching is the right thing to do. The minimum, which is 5ms, is imperceptible anyway.
VNC servers have it always enabled.

> Environment=XPRA_BATCH_MAX_DELAY=50

I am less sure about this one.
With the mostly dynamic batch delay code, the base value is almost meaningless, except when it is high and causes problems...
The only issue with capping the batch delay is that we have other heuristics that use the batch delay as input.
I'll see what I can do.

totaam commented Dec 28, 2018

2018-12-28 23:03:29: nathan_lstc commented


I have not disappeared.

The performance of VirtualGL on LTE hotspot was really great. This is a testament to XPRA, not my hotspot, which isn't that great.

I will get back to you on this once I have finished hooking dbus to lsprepost. I have gotten our developers to give me a callback, and I have figured out how to extract relative coordinates from the glCanvas. Now, I have to write the dbus code...

totaam commented Jan 2, 2019

2019-01-02 09:10:35: antoine commented


r21267 enables batching by default.
As for the max delay change, my current ideas are:

  • use the existing "soft-expired" mechanism: we could let the batch delay increase, but speculatively allow packets to go out when there is no bottleneck. (the batch delay would end up increasing a lot less since the "actual batch delay" will remain lower)
  • if we have already waited longer than the current batch delay value (ie: not many screen updates in an xterm), then we can lower the batch delay for the next screen update that comes - (ideally ignoring small screen updates...)
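The second idea can be sketched roughly as follows. This is a toy model under stated assumptions: the function name `next_batch_delay` and the "halve once per elapsed delay period" policy are hypothetical choices made for illustration and are not taken from the xpra source. The 5ms floor is the minimum quoted earlier in this ticket.

```python
MIN_DELAY = 5.0  # ms, the minimum batch delay mentioned earlier in this ticket


def next_batch_delay(current_delay: float, idle_ms: float,
                     min_delay: float = MIN_DELAY) -> float:
    """Return the batch delay (ms) to use for the next screen update.

    If the window has been idle for longer than its current batch delay,
    the congestion estimate behind that delay is stale, so decay it.
    Halving once per elapsed "delay period" is an arbitrary policy.
    """
    if current_delay <= 0 or idle_ms <= current_delay:
        # busy window (or nothing to decay): keep the current value
        return max(min_delay, current_delay)
    periods = min(int(idle_ms // current_delay), 32)
    return max(min_delay, current_delay / (2 ** periods))


busy = next_batch_delay(900, idle_ms=100)          # window still updating often
short_idle = next_batch_delay(900, idle_ms=2000)   # two delay periods elapsed
long_idle = next_batch_delay(900, idle_ms=60000)   # e.g. an xterm left alone
```

With a policy like this, an xterm that was idle while the 3D window saturated the link would start its next update near the minimum delay instead of inheriting a stale, inflated value.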

totaam commented Jan 2, 2019

2019-01-02 20:47:07: antoine changed status from assigned to new

totaam commented Jan 2, 2019

2019-01-02 20:47:07: antoine changed owner from antoine to nathan_lstc

totaam commented Jan 2, 2019

2019-01-02 20:47:07: antoine commented


More improvements in:

  • r21273: batching is enabled by default, move some checks to save some cpu cycles
  • r21275: skip waiting unnecessarily when the window was idle for longer than the batch delay + move more code out of hot path

This won't fix your problems, but it will help.
Can you please capture the -d stats debug output of when the batch delay stays high when it shouldn't?

totaam commented Jan 4, 2019

2019-01-04 19:12:00: nathan_lstc commented


I've attached log1.txt.

At about 9:50:50 I do a rotation and it's good.
At about 9:51:00 I choke the bandwidth.
At about 9:51:22 I restore the bandwidth. The 3D window remains slow, which is not good. Oddly, my xterm comes right back, which is good.
At about 9:52:30 the lag seems to have largely subsided.

totaam commented Jan 4, 2019

2019-01-04 19:12:32: nathan_lstc uploaded file log1.txt (351.9 KiB)

totaam commented Jan 4, 2019

2019-01-04 19:30:59: antoine changed priority from major to critical

totaam commented Jan 4, 2019

2019-01-04 19:30:59: antoine changed status from new to assigned

totaam commented Jan 4, 2019

2019-01-04 19:30:59: antoine changed owner from nathan_lstc to antoine

totaam commented Jan 4, 2019

2019-01-04 19:30:59: antoine commented


Thanks, I can reproduce the problem locally using tc.
This is caused by the latency heuristics: an increase in latency causes the batch delay to go up quickly, but a decrease in latency does not bring it back down quickly enough.
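The asymmetry described here can be illustrated with a toy model. The function below is not xpra's latency heuristic: the gain factors `up` and `down`, the congestion threshold and the clamping limits are made-up values, chosen only to show how fast growth combined with slow decay leaves the delay high long after latency has returned to normal.

```python
def simulate(latencies, base_latency=10.0, delay=5.0,
             up=2.0, down=0.95, min_delay=5.0, max_delay=10000.0):
    """Return the batch delay (ms) after each latency sample (ms)."""
    history = []
    for latency in latencies:
        if latency > 2 * base_latency:
            delay *= up      # congestion detected: back off quickly
        else:
            delay *= down    # latency is fine again: creep back down slowly
        delay = min(max_delay, max(min_delay, delay))
        history.append(delay)
    return history


# 8 congested samples followed by 40 healthy ones
congested, healthy = 8, 40
trace = simulate([500.0] * congested + [10.0] * healthy)
peak = trace[congested - 1]
# number of healthy samples needed before the delay drops to half its peak
steps_to_halve = next(i for i, d in enumerate(trace[congested:], 1)
                      if d <= peak / 2)
```

In this model the delay needs more healthy samples just to halve than it needed congested samples to reach its peak, and it is still far above the minimum at the end of the run, which matches the "slow to come back down" symptom reported above.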

totaam commented Jan 11, 2019

2019-01-11 13:19:19: antoine changed status from assigned to new

totaam commented Jan 11, 2019

2019-01-11 13:19:19: antoine changed owner from antoine to nathan_lstc

totaam commented Jan 11, 2019

2019-01-11 13:19:19: antoine commented


More fixes:

  • r21290 + r21294 cosmetic
  • r21299: expire regions much more quickly (but don't send until backlog is cleared)

@nathan_lstc: is this now usable? (ignoring the pycuda issue from #2022#comment:72 for now)

totaam commented Jan 13, 2019

2019-01-13 13:55:21: nathan_lstc commented


I have 4 computers running XPRA servers right now. Only one of them is running vanilla yum-repo. I have just upgraded that one to r21314. Out of the box it seems to work well. Here is my systemd file:

ExecStart=/bin/sh -c "cd ~;PATH=/opt/xpra/bin:$PATH xpra --no-daemon start --no-printing  --start-via-proxy=no --systemd-run=no --start=\"xrdb -merge $HOME/.Xresources\" --start-child=xterm --exit-with-children --mdns=no --xsettings=no :`id -u`"
Environment=PYTHONPATH=/opt/xpra/lib64/python2.7/site-packages
Environment=LD_LIBRARY_PATH=/opt/libjpeg-turbo/lib64/:/usr/local/cuda/lib64:/usr/lib64/xpra
Environment=CUDA_VISIBLE_DEVICES=0

Everything is working right for me (I'll check with the guy having the pycuda issue on Monday). Looking through the patches, I can see they do what I was setting through env variables. Thanks!

One more question: "xpra info" has a lot of numbers called "delay". How do I interpret them all?


client.batch.delay.50p=112
client.batch.delay.80p=932
client.batch.delay.90p=936
client.batch.delay.avg=412
client.batch.delay.cur=935
client.batch.delay.max=941
client.batch.delay.min=60
client.batch.locked=False
client.batch.max-delay=500
client.batch.min-delay=16
client.batch.timeout-delay=15000
client.window.1.batch.actual_delays.90p=63
client.window.1.batch.actual_delays.avg=57
client.window.1.batch.actual_delays.cur=57
client.window.1.batch.actual_delays.max=106
client.window.1.batch.actual_delays.min=46

Is window.1.batch.actual_delays.cur the batch delay that is currently happening? If so, what about "client.batch.delay.cur"? Right now, one number looks okay the other not so good:


[nathan@bobross lsprepost4.7_centos7]$ xpra info :250 |grep -i delay |grep cur
client.batch.delay.cur=934
client.window.1.batch.actual_delays.cur=80
client.window.1.batch.delay.cur=80
client.window.8.batch.actual_delays.cur=230
client.window.8.batch.delay.cur=230
[nathan@bobross lsprepost4.7_centos7]$
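To compare the global and per-window values at a glance, the `xpra info` output can be post-processed. The sketch below only assumes the `key=value` line format visible in the output above; the helper name `current_delays` is hypothetical, and the sample text is copied from this comment rather than produced by running xpra.

```python
import re
from typing import Dict, Optional, Tuple

SAMPLE = """\
client.batch.delay.cur=934
client.window.1.batch.actual_delays.cur=80
client.window.1.batch.delay.cur=80
client.window.8.batch.actual_delays.cur=230
client.window.8.batch.delay.cur=230
"""

GLOBAL_RE = re.compile(r"^client\.batch\.delay\.cur=(\d+)$")
WINDOW_RE = re.compile(r"^client\.window\.(\d+)\.batch\.actual_delays\.cur=(\d+)$")


def current_delays(info_text: str) -> Tuple[Optional[int], Dict[int, int]]:
    """Return (global delay, {window id: actual delay}) parsed from the text."""
    global_delay: Optional[int] = None
    windows: Dict[int, int] = {}
    for line in info_text.splitlines():
        line = line.strip()
        match = GLOBAL_RE.match(line)
        if match:
            global_delay = int(match.group(1))
            continue
        match = WINDOW_RE.match(line)
        if match:
            windows[int(match.group(1))] = int(match.group(2))
    return global_delay, windows


global_delay, per_window = current_delays(SAMPLE)
```

In practice the same function could be fed the real output, for example by piping `xpra info :250` into a script that calls `current_delays(sys.stdin.read())`.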

totaam commented Jan 13, 2019

2019-01-13 14:14:11: antoine commented


> One more question: "xpra info" has a lot of numbers called "delay". How do I interpret them all?

That's coming from get_weighted_list_stats in [/browser/xpra/trunk/src/xpra/simple_stats.py].

  • "min", "max" and "avg" should be self explanatory
  • "cur" is current
  • "90p" is the 90th percentile

> Is window.1.batch.actual_delays.cur the batch delay that is currently happening?

For this window, yes.

> If so, what about "client.batch.delay.cur"?

That's the global value, which is only used when creating new windows.

> Right now, one number looks okay the other not so good:
> (..)
> client.window.8.batch.actual_delays.cur=230

Yes, that's too high.
Can you get the -d stats log?
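
For readers unfamiliar with the `min` / `max` / `avg` / `cur` / `90p` suffixes, here is a simplified, unweighted Python sketch of how such a summary could be computed from a list of recent delay samples. It is not the `get_weighted_list_stats` implementation (whose name suggests that samples are weighted); the function name `summarize` and the nearest-rank percentile method are assumptions made for illustration.

```python
from typing import Dict, Sequence


def summarize(samples: Sequence[float],
              percentiles: Sequence[int] = (50, 80, 90)) -> Dict[str, float]:
    """Summarize samples (oldest first) with min/max/avg/cur and NNp keys."""
    if not samples:
        return {}
    ordered = sorted(samples)
    count = len(ordered)
    stats: Dict[str, float] = {
        "min": ordered[0],
        "max": ordered[-1],
        "avg": sum(ordered) / count,
        "cur": samples[-1],  # the most recent sample
    }
    for p in percentiles:
        # nearest-rank percentile, using integer ceiling division
        rank = max(1, (p * count + 99) // 100)
        stats[f"{p}p"] = ordered[rank - 1]
    return stats


stats = summarize([10, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```

Read this way, a `cur` value close to `max` and well above `50p` (as in the global `client.batch.delay.*` numbers quoted above) means the most recent samples are among the highest seen.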

totaam commented Feb 7, 2019

2019-02-07 15:14:38: antoine commented


Bump.

totaam commented Feb 11, 2019

2019-02-11 03:20:14: antoine commented


Fixes to the batch delay changes in r21621. (found thanks to #2140#comment:1)
This may explain the high batch delay values. The new dynamic delay code was hiding that somewhat - doing its job, but masking unreasonable values.

totaam commented Feb 13, 2019

2019-02-13 21:59:29: nathan_lstc commented


I am far from certain, but I just tried a combination of operations over a 1Gb LAN that felt slow in the previous revisions, and in this revision it seemed absolutely smooth. I'm going to run a much more extensive test and get back to you.

[nathan@curry lsprepost4.7_centos7]$ xpra info :250 | grep delay | grep cur
client.batch.delay.cur=8
client.window.1.batch.actual_delays.cur=60
client.window.1.batch.delay.cur=3
client.window.15.batch.actual_delays.cur=150
client.window.15.batch.delay.cur=11
client.window.16.batch.actual_delays.cur=8
client.window.16.batch.delay.cur=8
[nathan@curry lsprepost4.7_centos7]$

totaam commented Mar 7, 2019

2019-03-07 12:38:01: antoine commented


Bump.

totaam commented Mar 18, 2019

2019-03-18 02:46:34: antoine changed status from new to closed

totaam commented Mar 18, 2019

2019-03-18 02:46:34: antoine set resolution to needinfo

@totaam totaam closed this as completed Mar 18, 2019

totaam commented Feb 10, 2020

2020-02-10 08:56:38: antoine commented


See also #2421.
