Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refresh bug in certain applications #1781

Closed
totaam opened this issue Mar 2, 2018 · 23 comments
Closed

Refresh bug in certain applications #1781

totaam opened this issue Mar 2, 2018 · 23 comments

Comments

@totaam
Copy link
Collaborator

totaam commented Mar 2, 2018

Issue migrated from trac ticket # 1781

component: encodings | priority: major | resolution: fixed

2018-03-02 17:32:11: maxmylyn created the issue


I finally managed to reliably reproduce this - it's easiest in Firefox and Chrome on [https://sfbay.craigslist.org/search/cta] (a site for classifieds - this sub-site happens to be for cars, I've gotten it to trigger on the Motorcycles one as well).

My client and server are both Fedora 26 machines running trunk r18661 built from source.

I started my server with: XPRA_BANDWIDTH_DETECTION=0 xpra start :13 --bind-tcp=0.0.0.0: --start-new-commands=yes --start-child=xterm --start-child=google-chrome --start-via-proxy=no --systemd-run=no --no-daemon (running without the bandwidth limits makes it easier to detect) and attached with xpra attach tcp:ip.

Once attached I nagivated to the link I provided above. Just mousing over a few of the ads (and cycling through the photos) is enough to induce a state where the client is constantly repainting. Continued use of this site makes the paints more and more aggressive to the point where it becomes almost impossible to type into the Xterm.

I ran two different sessions - one to get the xpra info of both a fresh and a bugged state. And then a second one with -d refresh, compress enabled. On the second run I got into the bugged state almost instantly - I let it sit for at least 10-20 seconds before trying to stop the session with the Xterm. Hopefully that's enough.

@totaam
Copy link
Collaborator Author

totaam commented Mar 2, 2018

2018-03-02 17:33:04: maxmylyn uploaded file 1781freshinfo.txt (149.2 KiB)

Fresh xpra info

@totaam
Copy link
Collaborator Author

totaam commented Mar 2, 2018

2018-03-02 17:33:19: maxmylyn uploaded file 1781buggedinfo.txt (150.8 KiB)

After stuck in a constant refresh state

@totaam
Copy link
Collaborator Author

totaam commented Mar 2, 2018

2018-03-02 17:41:19: maxmylyn uploaded file 1781refreshcompresstrimmed.log (325.2 KiB)

uncompressed and trimmed log - the first few seconds, and the last few seconds (at the point where it was sitting and not being interacted with)

@totaam
Copy link
Collaborator Author

totaam commented Mar 2, 2018

2018-03-02 17:41:45: maxmylyn uploaded file 1781log.tar (20.0 KiB)

tarball of the entire log - tarred with -caf

@totaam
Copy link
Collaborator Author

totaam commented Mar 5, 2018

2018-03-05 16:14:32: antoine changed owner from antoine to maxmylyn

@totaam
Copy link
Collaborator Author

totaam commented Mar 5, 2018

2018-03-05 16:14:32: antoine commented


Your log file is empty:

$ tar -xvf 1781log.tar 
1781refreshcompress.log
$ ls -la 1781*
-rw-rw-r--. 1 antoine antoine 20480 Mar  3 12:40 1781log.tar
-rw-rw-r--. 1 antoine antoine 10240 Mar  3 00:38 1781refreshcompress.log
$ wc -l 1781refreshcompress.log 
0 1781refreshcompress.log

In any case, never put a single file in a tar file. That never does anything useful except increasing the file size, tar is not a compressed format, it is archival format.
Use xz instead.

As for the log sample attached, it is trimmed just around the time where things got interesting so I don't really know how things got into this state.

@totaam
Copy link
Collaborator Author

totaam commented Mar 6, 2018

2018-03-06 17:16:59: maxmylyn uploaded file 1781log.tar.xz (192.7 KiB)

(properly) compressed full log

@totaam
Copy link
Collaborator Author

totaam commented Mar 6, 2018

2018-03-06 17:21:03: maxmylyn commented


Well I've learned something today:

  • The a flag on tar will auto use the correct compression based on the name, rather than magically auto-selecting the proper one based on what I assumed it was before

I've properly tarred (and verified that it's correct) the full log from start to finish.

@totaam
Copy link
Collaborator Author

totaam commented Mar 6, 2018

2018-03-06 17:27:58: antoine commented


As per comment:1:

never put a single file in a tar file. That never does anything useful except increasing the file size, tar is not a compressed format, it is archival format. Use xz instead.

200	1781log.tar.xz
196	1781refreshcompress.log.xz

@totaam
Copy link
Collaborator Author

totaam commented Mar 10, 2018

2018-03-10 05:51:26: antoine commented


I believe this is caused by r17540, triggering a refresh loop.
This bug is present in the 2.2.x branch, but I believe that the bandwidth detection changes in 2.3 (#999) make it easier to trigger.

FYI: you can generally avoid refresh related problems by turning off auto-refresh.

18683 disables lossy refresh in the 2.2.x branch.
I will find a better solution for 2.3.

@totaam
Copy link
Collaborator Author

totaam commented Mar 10, 2018

2018-03-10 09:27:25: antoine commented


My guess in comment:4 was based on log events like this one:

compress:   0.1ms for    2x1    pixels at 1126,965  for wid=2     using      webp with ratio 525.0%  (    0KB to     0KB), sequence  3608, client_options={'flush': 3, 'quality': 100, 'rgb_format': 'BGRX'}
compress:   0.3ms for    3x16   pixels at 1126,22   for wid=2     using      webp with ratio  21.9%  (    0KB to     0KB), sequence  1165, client_options={'flush': 3, 'quality': 100, 'rgb_format': 'BGRX'}

We shouldn't be using "webp" for such tiny screen updates.

But it turns out that this may not be fixed by 18683, r18685 is likely to be a more correct solution. (not sure this should be backported though).

It's not clear to me which part of the screen was refreshing unnecessarily. Can you specify the coordinates or at least take a screenshot with paint boxes turned on.
And assuming that you can reproduce this reliably still, please try turning off features to see if they are involved in this, ie: video region detection, refresh delay, etc..

@totaam
Copy link
Collaborator Author

totaam commented Mar 13, 2018

2018-03-13 20:34:17: maxmylyn commented


  • Upped client and server to r18711

I no longer see the excessive flickering that made it obvious what was going on. However, my session still bogs down to the point of being unusable in the same manner, and the logs still show a flood of Webp refreshes -d refresh (see below). Turning auto-refresh off xpra control :13 auto-refresh 0 makes the issue go away; alternatively detaching and re-attaching help, but the issue will crop up again eventually.

2018-03-13 13:17:53,688 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:53,688 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(474, 78, 637, 28), refresh regions=[])
2018-03-13 13:17:53,735 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:53,901 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,067 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,290 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,368 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,492 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,585 auto refresh:  h264 screen update (actual quality=  0, lossy= True), added pixels to refresh regions, scheduling refresh in 150ms (pct=0, batch=28) (region=rectangle(0, 73, 1128, 804), refresh regions=[])
2018-03-13 13:17:54,586 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 73), refresh regions=[])

I also managed to get a traceback once or twice(something about video subregion), but can't seem to get it again for this comment.

@totaam
Copy link
Collaborator Author

totaam commented Mar 14, 2018

2018-03-14 00:20:18: antoine commented


Hmmm, looking at this again, maybe r18712 will fix this.
This code is far from being new though, so I would have expected it to be reported sooner (every version going back to pre 1.0 releases would be affected, unless somehow a changeset like r17480 made it easier to trigger).

If you still hit the issue, please provide "-d damage,refresh,regionrefresh,subregion" with more than a few lines and maybe also with "-d encoding" added. And if it does fire similar large-ish webp refreshes, does the region being refreshed match xpra info | grep video_subregion?

The traceback could be critical, please always save and report ALL tracebacks.

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2018

2018-03-16 17:13:20: maxmylyn commented


Upped client and server to r18739:

I can still trigger this exact same state fairly reliably - by opening Chrome and navigating to [sfbay.craigslist.org] and navigating a few of the ads (cars and motorcycles in particular). Oddly enough it was far more difficult to trigger while the logs were running, at least in my case. It took a few tries yesterday afternoon, but I managed to get it fairly quickly this morning before the log file got way too large. Running with the OpenGL paint boxes makes it much easier to tell when you've gotten into the correct state - it's most apparent when you see a ton of small paint boxes along the edges of the windows. I'll attach the log you asked for.

NOTE: I switched from a browser and terminal start children to two start terminal children, and I used one for starting the browser and the other to stop the session quickly.

As you mentioned as well, I double checked to see if they matched from xpra info | grep video_subregion and....they kind of do.

From my Xpra info (different session than the logs I'll be attaching):

window.3.video_subregion.rectangle=(1, 1, 1127, 965)
window.3.video_subregion.score=99
window.3.video_subregion.scores.(1, 1, 1127, 965)=99
window.3.video_subregion.set-at=7354
window.3.video_subregion.supported=True
window.3.video_subregion.time=65861
window.3.video_subregion.video-mode=False
window.3.video_subregion.width=1127
window.3.video_subregion.x=1
window.3.video_subregion.y=1

and from the logs (same session as the one I got the info from):

2018-03-16 09:45:43,432 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1, 106, 4, 856), refresh regions=[])
2018-03-16 09:45:43,470 remove_refresh_region(rectangle(161, 301, 6, 13)) updated refresh regions=[]
2018-03-16 09:45:43,470 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(161, 301, 6, 13), refresh regions=[])
2018-03-16 09:45:43,479 refresh_subregion([R(0, 3, 1, 960)])
2018-03-16 09:45:43,503 remove_refresh_region(rectangle(1, 1, 1126, 105)) updated refresh regions=[]
2018-03-16 09:45:43,504 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1, 1, 1126, 105), refresh regions=[])
2018-03-16 09:45:43,505 remove_refresh_region(rectangle(1126, 1, 2, 383)) updated refresh regions=[]
2018-03-16 09:45:43,505 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1126, 1, 2, 383), refresh regions=[])
2018-03-16 09:45:43,600 refresh_subregion([R(0, 0, 1128, 3), R(1124, 3, 4, 959), R(0, 962, 1128, 4)])
2018-03-16 09:45:43,602 refresh_subregion([R(3, 0, 1122, 1)])
2018-03-16 09:45:43,602 remove_refresh_region(rectangle(0, 0, 1129, 967)) updated refresh regions=[]
2018-03-16 09:45:43,602 refresh_subregion([R(1, 1, 1126, 105), R(1, 106, 4, 856), R(1124, 106, 3, 856), R(1, 962, 1126, 3)])

Note the (1, 1, 1126, 105) from the logs and the (1, 1, 1127, 965) from the info. The start coordinates are the same, and the ending X-coordinates almost match, but the subregion from the info is much taller than what's getting auto-refreshed. That feels like it's too big of a coincidence, but I have a feeling you know a lot more about what's going on here.

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2018

2018-03-16 17:14:20: maxmylyn uploaded file 1781damagerefreshregionsubregion.log (3203.3 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2018

2018-03-16 17:51:26: antoine commented


Note the (1, 1, 1126, 105) from the logs and the (1, 1, 1127, 965) from the info.
That's very likely to be because the browser paints the video area using horizontal bands. This data from the log file uses a different geometry but the idea is the same (here with 58 pixel high bands):

damage(5, 48, 1119, 58, {})    wid=3, scheduling batching expiry for sequence 1809 in 34.0 ms
damage(5, 106, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
damage(5, 164, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
...
damage(5, 860, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
damage(5, 918, 1119, 44, {})   wid=3, using existing delayed {} regions created 0.0ms ago

We then run some clever code to identify the "real" video region from that. (usually by aggregating the horizontal bands together)
We also round down the dimensions (1127 becomes 1126) because video encoders often use colourspace subsampling, which requires even dimensions to operate.

This log file looks like it would be useful, but I'm not sure where to look (~25000 lines in there): did the problem appear at the start? If not, when?
Do you have the corresponding "paint boxes" screenshot so I can try to reconcile? Or at least the "xpra info"?

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2018

2018-03-16 17:53:01: maxmylyn commented


I managed to get it into the stuck state in the last few seconds of the log - when the flood of webp paints start is usually when it's in the stuck state.

I'll try again and post a new log with screenshots and an xpra info.

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2018

2018-03-16 18:19:14: maxmylyn commented


Okay got myself stuck into the state really quick with the logs running, so I'll attach them and a screenshot and an Xpra info.

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2018

2018-03-16 18:19:45: maxmylyn uploaded file 1781loground2.log (1025.9 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2018

2018-03-16 18:19:58: maxmylyn uploaded file 1781loginfo.txt (170.6 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2018

2018-03-16 18:20:44: maxmylyn uploaded file 1781 paint bug.png (701.9 KiB)

Stuck paint state - note the boxes along the edge
1781 paint bug.png

@totaam
Copy link
Collaborator Author

totaam commented Mar 17, 2018

2018-03-17 17:27:51: antoine commented


From the log, we see refreshes following the last add_video_refresh on line 7802:

refresh_subregion([R(1124, 3, 4, 184), R(1124, 207, 4, 25), R(1124, 251, 4, 694), R(1124, 958, 4, 3), R(1124, 961, 4, 1), R(0, 962, 1128, 1)])

Which would match the right and bottom edges of the window where the black paint boxes are on the screenshot: window.3.size=(1129, 967).
Just noticed that webp was missing from the default [/browser/xpra/trunk/src/xpra/client/paint_colors.py paint colours] definition, which is why it showed up as black rather than pink, this is fixed in r18745.

Then each one of those regions gets refreshed, ie:

auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 962, 1128, 1), refresh regions=[])

Yet, after about 130ms, we're back to refreshing the same list again. Without seeing any damage(..) events in between.

That's because the video area shrank and we then request the areas that are no longer managed as video to get refreshed... and the method fires repeatedly instead of just once. The minimal fix for this is r18747.
This bug affected all versions that have video region support. (some recent optimizations may have made it easier to hit)

r18748 is a more complete code update in this area and r18749 is another related minor improvement.

@totaam
Copy link
Collaborator Author

totaam commented Mar 28, 2018

2018-03-28 16:57:47: maxmylyn commented


Meant to close this one but somehow I never got around to it. I'll just blame that on being retasked.

Anyways, post r18749, I no longer see the flickering or excessive webp refreshes. I do see some fairly consistent small updates after a short time in some cases, but I can't rule out odd Firefox or Chrome painting behavior as being a culprit, and it doesn't seem to have any notable impact on my session like I was seeing before.

Closing

@totaam totaam closed this as completed Mar 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant