
Client doesn't stay on NVENC h264, even when selected in the UI #2022

Closed
totaam opened this issue Nov 1, 2018 · 100 comments

totaam commented Nov 1, 2018

Issue migrated from trac ticket # 2022

component: encodings | priority: major | resolution: fixed

2018-11-01 16:03:47: nathan_lstc created the issue


Unless I run xpra with only h264 enabled,

/usr/bin/python2 /usr/bin/xpra start -d encoder --start=xterm --no-daemon --video-encoders=nvenc --encodings=h264 :57

I can observe from the debug data that very few frames are actually rendered using NVENC even when I have h264 selected from the UI. When I run it using the above command, it works wonderfully. I'm using r20858 on CentOS7 with VirtualGL on the server and an r20882 client on windows. The driver is 390.87. I'm using a C2075 for render and a K20c for encode.

I do not think the problem has much to do with my hardware or revision, because I'm having this trouble on all of my configurations, including a brand-new computer (the other one is pretty old) with a GTX 1060 running a release 2.3.4 server with a release 2.4 client.

Nathan


totaam commented Nov 1, 2018

Xpra will only use video encoders when it thinks it is necessary; for more details on the original feature see #410 - there have been many updates and fixes since.

Roughly speaking, the video encoder will not be triggered unless something on screen refreshes rapidly enough.
By only allowing h264, you are bypassing all the automatic encoding logic.

To test, you can run glxgears or anything else that generates lots of frames and verify that NVENC does get used.
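If the server output is captured to a file, that check can be scripted; a minimal sketch (the log path is an assumption, and the file is created empty here just so the sketch runs without a live server):

```shell
# Count nvenc-related lines in a server log started with "-d encoder".
# /tmp/xpra-encoder.log is an assumed path - point it at your real log.
LOG=/tmp/xpra-encoder.log
touch "$LOG"                       # placeholder so the sketch runs standalone
NVENC_LINES=$(grep -ci "nvenc" "$LOG" || true)
echo "nvenc log lines: $NVENC_LINES"
```

A count that stays at zero while glxgears is animating would confirm that the video encoder is never being triggered.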

If you feel that your screen updates would be better compressed with a video encoder and that the automatic logic is too slow to switch to it, you can debug the video region detection with -d regiondetect and the encoding selection logic with -d encoding.
Bear in mind that GPU encoders take longer to set up, so it is best to only use them when we're certain that the content is video-ish and is going to keep producing screen updates regularly.

Maybe we could tweak the logic to switch to video sooner when we know we're dealing with an efficient encoder (ie: NVENC).


totaam commented Nov 1, 2018

2018-11-01 19:44:13: nathan_lstc commented


Here's a "-d compress" log. For context, this log was collected during a sustained rotation operation on an OpenGL rendering of a finite element model of vehicle impact (it is a standard benchmark for us). The log looks like the below excerpt the whole way through. When XPRA jumps between RGB and YUV the screen briefly locks hard.

I have my system tray UI set to speed=auto, quality=auto and encoding = h264.

There are two cases I want to include:

(1) Without "XPRA_FORCE_CSC_MODE=BGRX" it is slow and throws "fail to encode errors" for NVENC sometimes, especially when I set speed to 100 (I had speed set to auto). In this case it didn't throw those errors, but the whole picture did turn yellow a few times during the collection of this log.

(2) with XPRA_FORCE_CSC_MODE=BGRX it is also very slow, but not as slow as before (but nothing like speed and perfection of having only NVENC enabled). There were noticeable jerks when it switched between encodings. Never any NVENC encoding errors nor random color changes like when BGRX isn't set.

Case (1) Plain, no XPRA_FORCE_CSC_MODE.

2018-11-01 13:12:06,968 compress:   1.2ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   134, client_options={'zlib': 5, 'delta': 181, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 186}
2018-11-01 13:12:07,553 compress: 582.9ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   1.0%  ( 6128KB to    61KB), sequence   135, client_options={u'speed': 41, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 20}
2018-11-01 13:12:07,695 compress:   2.0ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   136, client_options={'zlib': 5, 'delta': 186, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 189}
2018-11-01 13:12:08,232 compress: 533.5ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   1.0%  ( 6128KB to    60KB), sequence   137, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,318 compress:   0.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   138, client_options={'zlib': 5, 'delta': 189, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 192}
2018-11-01 13:12:08,365 compress:  46.0ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.8%  ( 6128KB to    48KB), sequence   139, client_options={u'speed': 46, u'frame': 1, u'pts': 626, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,506 compress:   0.2ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   140, client_options={'zlib': 5, 'delta': 192, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 195}
2018-11-01 13:12:08,548 compress:  40.0ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.6%  ( 6128KB to    34KB), sequence   141, client_options={u'speed': 46, u'frame': 2, u'pts': 814, 'csc': 'YUV444P', u'quality': 34}
2018-11-01 13:12:08,696 compress:   1.4ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   142, client_options={'zlib': 5, 'delta': 195, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 198}
2018-11-01 13:12:09,204 compress: 505.4ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   1.0%  ( 6128KB to    58KB), sequence   143, client_options={u'speed': 44, u'frame': 0, u'pts': 0, 'csc': u'XRGB', u'quality': 44}
2018-11-01 13:12:09,300 compress:   0.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   144, client_options={'zlib': 5, 'delta': 198, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 201}
2018-11-01 13:12:09,914 compress: 612.8ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   2.2%  ( 6128KB to   137KB), sequence   145, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:12:09,924 compress:   0.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   146, client_options={'zlib': 5, 'delta': 201, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 204}
2018-11-01 13:12:10,584 compress: 658.1ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   1.2%  ( 6128KB to    73KB), sequence   147, client_options={u'speed': 46, u'frame': 0, u'pts': 0, 'csc': u'XRGB', u'quality': 24}
2018-11-01 13:12:10,648 compress:   0.4ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   148, client_options={'zlib': 5, 'delta': 204, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 207}
2018-11-01 13:12:11,210 compress: 561.0ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   1.2%  ( 6128KB to    70KB), sequence   149, client_options={u'speed': 30, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:11,231 compress:   0.1ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   150, client_options={'zlib': 6, 'delta': 207, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 210}
2018-11-01 13:12:11,802 compress: 569.8ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.4%  ( 6128KB to    23KB), sequence   151, client_options={u'speed': 30, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:11,822 compress:   0.2ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   152, client_options={'zlib': 6, 'delta': 210, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 213}
2018-11-01 13:12:11,846 compress:  23.9ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.2%  ( 6128KB to    13KB), sequence   153, client_options={u'speed': 30, u'frame': 1, u'pts': 591, 'csc': 'YUV444P', u'quality': 5}
2018-11-01 13:12:12,834 compress:   0.2ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   154, client_options={'zlib': 7, 'delta': 213, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 216}
2018-11-01 13:12:13,367 compress: 532.6ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.3%  ( 6128KB to    21KB), sequence   155, client_options={u'speed': 28, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 4}
2018-11-01 13:12:13,368 compress:   0.1ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   156, client_options={'zlib': 7, 'delta': 216, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 219}
2018-11-01 13:12:13,423 compress:  52.1ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.1%  ( 6128KB to     8KB), sequence   157, client_options={u'speed': 28, u'frame': 1, u'pts': 453, 'csc': 'YUV444P', u'quality': 4}
2018-11-01 13:12:13,744 compress:   0.8ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   158, client_options={'zlib': 6, 'delta': 219, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 222}
2018-11-01 13:12:14,316 compress: 569.9ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.4%  ( 6128KB to    22KB), sequence   159, client_options={u'speed': 32, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 8}

Case 2:

2018-11-01 13:09:17,544 compress:   0.5ms for   36x36   pixels at 1596,411  for wid=2     using     rgb24 with ratio  37.5%  (    5KB to     1KB), sequence   323, client_options={'zlib': 5, 'delta': 401, 'bucket': 1, 'rgb_format': 'BGRX', 'flush': 1, 'store': 469}
2018-11-01 13:09:17,547 compress:   1.1ms for 1512x16   pixels at    2,941  for wid=2     using      jpeg with ratio   1.7%  (   94KB to     1KB), sequence   324, client_options={'quality': 49}
2018-11-01 13:09:18,099 compress:  89.3ms for 1636x959  pixels at    0,0    for wid=2     using      webp with ratio   2.6%  ( 6128KB to   161KB), sequence   325, client_options={u'quality': 100, u'rgb_format': 'BGRX'}
2018-11-01 13:09:18,234 compress:   0.6ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   326, client_options={'zlib': 4, 'delta': 460, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 475}
2018-11-01 13:09:18,814 compress: 577.6ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   4.7%  ( 6128KB to   290KB), sequence   327, client_options={u'speed': 39, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:18,835 compress:   0.5ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   328, client_options={'zlib': 6, 'delta': 475, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 478}
2018-11-01 13:09:18,892 compress:  56.2ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   2.2%  ( 6128KB to   133KB), sequence   329, client_options={u'speed': 39, u'frame': 1, u'pts': 606, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:19,385 compress:   0.7ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.9%  (    6KB to     0KB), sequence   330, client_options={'flush': 1, 'zlib': 6, 'bucket': 0, 'store': 481, 'rgb_format': 'BGRX'}
2018-11-01 13:09:19,442 compress:  55.0ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   3.2%  ( 6128KB to   193KB), sequence   331, client_options={u'speed': 39, u'frame': 2, u'pts': 1156, 'csc': 'YUV444P', u'quality': 10}
2018-11-01 13:09:19,841 compress:   1.2ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   332, client_options={'zlib': 5, 'delta': 481, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 484}
2018-11-01 13:09:20,438 compress: 592.9ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.4%  ( 6128KB to    26KB), sequence   333, client_options={u'speed': 41, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 7}
2018-11-01 13:09:20,569 compress:   0.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   334, client_options={'zlib': 5, 'delta': 484, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 487}
2018-11-01 13:09:20,620 compress:  50.2ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.3%  ( 6128KB to    16KB), sequence   335, client_options={u'speed': 41, u'frame': 1, u'pts': 730, 'csc': 'YUV444P', u'quality': 7}
2018-11-01 13:09:20,832 compress:   0.8ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   336, client_options={'zlib': 5, 'delta': 487, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 490}
2018-11-01 13:09:21,400 compress: 562.8ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.7%  ( 6128KB to    41KB), sequence   337, client_options={u'speed': 47, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 23}
2018-11-01 13:09:21,504 compress:   0.7ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   338, client_options={'zlib': 5, 'delta': 490, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 493}
2018-11-01 13:09:21,514 compress:   1.2ms for 1636x959  pixels at    0,0    for wid=2     using    scroll as   8 rectangles  ( 6128KB)           , sequence   340, client_options={'flush': 2}
2018-11-01 13:09:21,543 compress:  28.7ms for 1636x561  pixels at    0,144  for wid=2     using      jpeg with ratio   0.9%  ( 3585KB to    31KB), sequence   341, client_options={'quality': 23, 'optimize': True, 'flush': 1}
2018-11-01 13:09:21,547 compress:   3.7ms for 1636x61   pixels at    0,763  for wid=2     using      jpeg with ratio   0.6%  (  389KB to     2KB), sequence   342, client_options={'quality': 23, 'optimize': True}
2018-11-01 13:09:21,699 compress:   0.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   342, client_options={'zlib': 5, 'delta': 493, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 496}
2018-11-01 13:09:21,707 compress:   0.8ms for 1636x959  pixels at    0,0    for wid=2     using    scroll as   1 rectangles  ( 6128KB)           , sequence   344, client_options={'flush': 1}
2018-11-01 13:09:21,740 compress:  32.2ms for 1636x606  pixels at    0,260  for wid=2     using      jpeg with ratio   1.1%  ( 3872KB to    42KB), sequence   345, client_options={'quality': 23, 'optimize': True}
2018-11-01 13:09:21,929 compress:   0.2ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   345, client_options={'zlib': 5, 'delta': 496, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 499}
2018-11-01 13:09:22,522 compress: 591.0ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   1.7%  ( 6128KB to   101KB), sequence   346, client_options={u'speed': 49, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,523 compress:   0.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   347, client_options={'zlib': 5, 'delta': 499, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 502}
2018-11-01 13:09:22,571 compress:  47.3ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.9%  ( 6128KB to    56KB), sequence   348, client_options={u'speed': 49, u'frame': 1, u'pts': 531, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,634 compress:   1.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   349, client_options={'zlib': 5, 'delta': 502, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 505}
2018-11-01 13:09:22,688 compress:  51.2ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.7%  ( 6128KB to    41KB), sequence   350, client_options={u'speed': 49, u'frame': 2, u'pts': 703, 'csc': 'YUV444P', u'quality': 26}
2018-11-01 13:09:22,827 compress:   0.7ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   351, client_options={'zlib': 5, 'delta': 505, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 508}
2018-11-01 13:09:23,371 compress: 542.7ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.8%  ( 6128KB to    49KB), sequence   352, client_options={u'speed': 51, u'frame': 0, u'pts': 0, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:09:23,427 compress:  55.6ms for 1636x959  pixels at    0,0    for wid=2     using      webp with ratio   2.0%  ( 6128KB to   122KB), sequence   353, client_options={u'quality': 100, u'rgb_format': 'BGRX'}
2018-11-01 13:09:23,495 compress:   0.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   354, client_options={'zlib': 5, 'delta': 508, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 513}
2018-11-01 13:09:23,542 compress:  45.6ms for 1636x958  pixels at    0,0    for wid=2     using      h264 with ratio   0.5%  ( 6128KB to    28KB), sequence   355, client_options={u'speed': 51, u'frame': 1, u'pts': 669, 'csc': 'YUV444P', u'quality': 24}
2018-11-01 13:09:23,654 compress:   0.9ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   356, client_options={'zlib': 5, 'delta': 513, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 516}
2018-11-01 13:09:23,663 compress:   0.8ms for 1636x959  pixels at    0,0    for wid=2     using    scroll as   1 rectangles  ( 6128KB)           , sequence   358, client_options={'flush': 1}
2018-11-01 13:09:23,683 compress:  20.4ms for 1636x601  pixels at    0,266  for wid=2     using      jpeg with ratio   0.9%  ( 3840KB to    35KB), sequence   359, client_options={'quality': 24}
2018-11-01 13:09:23,986 compress:   0.3ms for 1636x1    pixels at    0,958  for wid=2     using     rgb24 with ratio   0.8%  (    6KB to     0KB), sequence   359, client_options={'zlib': 5, 'delta': 516, 'bucket': 0, 'rgb_format': 'BGRX', 'flush': 1, 'store': 519}
2018-11-01 13:09:23,999 compress:   0.6ms for 1636x959  pixels at    0,0    for wid=2     using    scroll as   3 rectangles  ( 6128KB)           , sequence   361, client_options={'flush': 4}
2018-11-01 13:09:24,016 compress:  16.3ms for 1636x516  pixels at    0,312  for wid=2     using      jpeg with ratio   1.1%  ( 3297KB to    35KB), sequence   362, client_options={'quality': 26, 'flush': 3}
2018-11-01 13:09:24,018 compress:   1.0ms for 1636x4    pixels at    0,836  for wid=2     using       png with ratio   0.8%  (   25KB to     0KB), sequence   363, client_options={'flush': 2}
2018-11-01 13:09:24,020 compress:   1.8ms for 1636x5    pixels at    0,915  for wid=2     using       png with ratio   4.6%  (   31KB to     1KB), sequence   364, client_options={'flush': 1}
2018-11-01 13:09:24,023 compress:   1.5ms for 1636x12   pixels at    0,926  for wid=2     using      jpeg with ratio   3.4%  (   76KB to     2KB), sequence   365, client_options={'quality': 26}
2018-11-01 13:09:24,619 compress:  73.4ms for 1636x959  pixels at    0,0    for wid=2     using      webp with ratio   1.9%  ( 6128KB to   116KB), sequence   365, client_options={u'quality': 100, u'rgb_format': 'BGRX'}


totaam commented Nov 2, 2018

First things first, something is not right with your setup if rgb ends up using zlib compression. You should have lz4 installed and if not then you should see a big warning in your server log on startup.
You can confirm the status by running xpra/net/net_util.py on Linux and Network_info.exe on win32.

As for the logs, I meant "-d compress" combined with the -d regiondetect and -d encoding from comment:1.

From the first log sample, most of the time it seems to be taking 500ms to compress at a resolution of 1636x958. That's dreadfully slow. You may want to raise min-speed and leave the encoding set to "auto" so it can at least switch to vp8 / vp9 when not using NVENC.
Why it isn't switching to nvenc would require the -d score log output. I think that is going to be the key.
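A sketch of what the suggested invocation could look like (the display number and the min-speed value of 30 are illustrative, not values from this ticket; the command only actually launches where xpra is installed):

```shell
# Sketch: same style of server invocation as earlier in this ticket, but
# raising the speed floor and leaving the encoding selection on "auto".
# The display number (:57) and --min-speed=30 are illustrative values.
XPRA_CMD="xpra start :57 --start=xterm --no-daemon --video-encoders=nvenc --encoding=auto --min-speed=30"
# Only launch where xpra is actually installed, otherwise just show it:
command -v xpra >/dev/null 2>&1 && $XPRA_CMD || echo "$XPRA_CMD"
```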

it is slow and throws "fail to encode errors" for NVENC sometimes

Please post the exact error message, including the full stacktrace.
You may need to add -d video,encoder to see it.
That could also cause encoder selection problems, we should be able to fix those or at least avoid the conditions that trigger it.

with XPRA_FORCE_CSC_MODE=BGRX

Why are you changing this switch exactly?
Setting a high enough (min-)quality should give the same result.
There are two ways that NVENC can handle BGRX input: natively for the cards + driver combinations that support it (can be turned off with XPRA_NVENC_NATIVE_RGB=0), or via a CUDA conversion step otherwise.
It's not clear what your card supports.
Please post the output of the server startup with -d nvenc
Either way, NVENC should win the scoring compared to doing the CSC step via another module (libyuv or whatever).

There are some scroll encoding packets in the middle, those can cause problems when encoding video as we then skip frames and encode them differently.
You can play with raising:

XPRA_SCROLL_MIN_PERCENT=80

Or even disable scroll encoding altogether:

XPRA_SCROLL_ENCODING=0
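For reference, both switches are plain environment variables read by the xpra server process at startup; a sketch of exporting them before launching (either one can be used on its own, values are the examples above):

```shell
# Require a closer match before "scroll" encoding is chosen:
export XPRA_SCROLL_MIN_PERCENT=80
# Or disable scroll encoding entirely:
export XPRA_SCROLL_ENCODING=0
# ...then start the server as usual, e.g.: xpra start :57 --start=xterm
```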

As for the IRC discussion regarding setting hints, we can worry about that once the encoding selection bug is fixed - it will help (see #1952 and #2023)


totaam commented Nov 2, 2018

2018-11-02 20:10:25: nathan_lstc commented


After a lot of struggling I've gotten LZ4 to work, but I don't understand why it was wrong. For some reason, on my computer, MINGW_BUILD.sh doesn't include lz4 correctly for version 2.1.0 or 2.1.1. When I backed down to version 1.1.0 my windows build correctly includes lz4, and I see this:


2018-11-02 13:01:50,174 compress:   0.3ms for 1636x959  pixels at    0,0    for wid=2     using    scroll as   1 rectangles  ( 6128KB)           , sequence   111, client_options={'flush': 2}
2018-11-02 13:01:50,315 compress: 141.1ms for 1636x593  pixels at    0,233  for wid=2     using     rgb24 with ratio   1.7%  ( 3789KB to    66KB), sequence   112, client_options={'flush': 1, 'lz4': 1, 'rgb_format': 'BGRX'}
2018-11-02 13:01:50,320 compress:   4.2ms for 1636x20   pixels at    0,915  for wid=2     using     rgb24 with ratio   3.4%  (  127KB to     4KB), sequence   113, client_options={'lz4': 1, 'rgb_format': 'BGRX'}
2018-11-02 13:01:50,740 compress:   8.3ms for 1636x959  pixels at    0,0    for wid=2     using     rgb24 with ratio   3.3%  ( 6128KB to   199KB), sequence   113, client_options={'lz4': 1, 'rgb_format': 'BGRX'}

Version 2.1.0 works only if I run the python script from within the MinGW shell. However, lz4 doesn't seem to get put into the build directory correctly when I use version 2.1.0, so it doesn't make it into the installer. I'm not sure why it's wrong, and I'll look into it a little more before giving up and moving on to your other suggestions. (This problem is less important to me than the encoder one, especially since I've got a workaround, namely downgrading to 1.1.0.)

Making lz4 work didn't resolve the other issues regarding the NVIDIA encoder.


totaam commented Nov 2, 2018

2018-11-02 21:34:54: nathan_lstc commented


By running python with "-S" in the "dist/lib" directory I've determined that lz4 2.1.1 works if I copy the "egg" data into "dist/lib".

Python is far outside my core expertise, so I can only speculate about the right fix. Maybe the install_exe function of setup.py needs to include the "egg" directories...


totaam commented Nov 2, 2018

2018-11-02 22:27:56: nathan_lstc commented


Command line:

 xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc :57

For this capture I had encoder set to H264 (everything else was set to auto). I rendered a 3D model and rotated it. I had XPRA_SCROLL_ENCODING=0 (because other experiments showed this helps some). This particular run was very jerky on the gigabit LAN. I have no idea why.

If you are interested I can send you all the software I am running.

[/raw-attachment/ticket/2022/1.txt]


totaam commented Nov 2, 2018

2018-11-02 22:31:00: nathan_lstc commented


Same as before except everything is set to "auto" including encoder. Very slow.

[/attachment/ticket/2022/2.txt]


totaam commented Nov 2, 2018

2018-11-02 22:39:58: nathan_lstc commented


In this case I disabled all encodings except rgb and h264. When I begin to rotate it's jerky, but once it gets going it's great!

Command line:

xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc --encodings=h264,h265,rgb :57

Output:

[/attachment/ticket/2022/3.txt]


totaam commented Nov 2, 2018

2018-11-02 22:45:38: nathan_lstc uploaded file 2.txt (50.6 KiB)

Log for second try.


totaam commented Nov 2, 2018

2018-11-02 22:47:30: nathan_lstc uploaded file 3.txt (181.5 KiB)

third example


totaam commented Nov 3, 2018

2018-11-03 05:34:52: antoine uploaded file 1.txt (89.1 KiB)

re-adding attachment with txt extension so it can be viewed online


totaam commented Nov 3, 2018

After a lot of struggling I've gotten LZ4 to work, but I don't understand why it was wrong.
For some reason, on my computer, MINGW_BUILD.sh doesn't include lz4 correctly for version 2.1.0 or 2.1.1.

So you're using custom builds rather than our official releases. Please make sure to mention that every time, as per ReportingBugs.
And test with official builds too to make sure that the problem doesn't come from your builds.
As for lz4, this is a well documented issue: #1883#comment:4 with its own ticket: #1929

Making lz4 work didn't resolve the other issues regarding the NVIDIA encoder.

It wasn't meant to.
It was essential for finding out why your output was not what was expected.
And it will help with performance (outdated benchmark here: [https://xpra.org/stats/lz4/])

I had XPRA_SCROLL_ENCODING=0 (because other experiments showed this helps some)

"scroll" encoding is not used for windows we correctly tag as "video", which is why you should look into tagging your video windows.

xpra start -d compress,regiondetect,encoding --start=gnome-terminal --no-daemon --video-encoders=nvenc --encodings=h264,h265,rgb :57

As I mentioned before, that's a very bad idea:

  • don't disable other video encoders, include "-d score" and then we'll see if the other video encoders somehow get used when they should not (NVENC should win)
  • don't disable other encodings! xpra will be forced to use "rgb" for non-video stuff, which is going to completely kill your bandwidth, even on a gigabit lan

Considering those important configuration issues, the logs you attached are not very telling.
Please see comment:3 :

Why it isn't switching to nvenc would require the -d score log output. I think that is going to be the key.

Please also include both:

  • -d score,video
  • -d score,encoding,video,scaling,csc,stats (just in case details are missing from the first one)

What I do see in there:

  • lots of:
send_delayed_regions: using full window update 1636x959 as  h264, from <bound method WindowVideoSource.get_best_encoding_video of WindowVideoSource(2 : (1636, 959))>
do_video_encode(h264, XShmImageWrapper(BGRX: 0, 0, 1636, 959), {})
make_instance() nvenc - instance count=0

So it is choosing the correct encoding (h264) and the correct encoder (nvenc), but for some reason the context gets recycled every time.
And since nvenc is slow to set up, that's probably where the delay comes from.

  • by the time the log samples start, the batch delay is already through the roof:
send_delayed for wid 2, batch delay is 233ms, elapsed time is 233ms

So something already happened before that which reduced the framerate to less than 5fps.
It could well be the wacky settings or the recycling issue; -d stats should be enough to see why.


totaam commented Nov 3, 2018

2018-11-03 17:47:08: nathan_lstc commented


I have been testing with both builds consistently on the client side, and I've been using the builds from winswitch beta on CentOS7. There are no other surprises in my configs. I have not, for instance, set LD_LIBRARY_PATH to anything interesting. I had not seen wiki/ReportingBug before (I should have thought to look for such a document). My apologies!

Regarding tagging, can you point me to some information on how that works? (I'm ignorant about how a program passes this kind of information through.) I will pass it on to our UI guys, and I expect that if it's straightforward I can get them to put it in our UI trunk right away. (They are using wxWidgets with GTK2 on X. Our software's window has a rectangular video region surrounded by buttons.)

Regarding the additional debug captures, I will try to do that in a few hours.

Again, thank you.


totaam commented Nov 3, 2018

Regarding tagging, can you point me to some information on how that works

It's all in this ticket which is assigned to you since I knew you would be interested: #2023. There's a trivial example included: test_apps/test_window_content_type.py
Basically, set the _XPRA_CONTENT_TYPE X11 window property to what you want (ie: "video" or "text").
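From the shell, the same property can be set on an existing window with xprop; a sketch (the window id is a placeholder to be replaced with one from xwininfo, and the 8u format specifier is an assumption):

```shell
# Tag an existing X11 window as "video" content using xprop.
# WINDOW_ID is a placeholder - find yours with "xwininfo" or "wmctrl -l".
WINDOW_ID=${WINDOW_ID:-0x3a00007}
xprop -id "$WINDOW_ID" -f _XPRA_CONTENT_TYPE 8u \
      -set _XPRA_CONTENT_TYPE "video" 2>/dev/null \
  || echo "xprop failed (no X display, or invalid window id)"
```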


totaam commented Nov 3, 2018

2018-11-03 22:47:53: nathan_lstc commented


I'm doing the first requested test: debug on score and video. My command line:

xpra start -d score,video --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --video-encoders=nvenc  :57 2>&1 |tee 4.txt

The "vglrun" command is so that I can spare you all the startup logging. Once I connected to the server I went straight to rotating.

[/attachment/ticket/2022/4.txt]

Same exercise but with -d score,video,compress

[/attachment/ticket/2022/5.txt]

Same exercise but with -d score,encoding,video,scaling,csc,stats

[/attachment/ticket/2022/6.txt]

Same exercise but with -d score,encoding,video,scaling,csc,stats,compress

[/attachment/ticket/2022/7.txt]

Same exercise but with -d score,encoding,video,scaling,csc,stats,compress using the 20905 client from the web site.

[/attachment/ticket/2022/8.txt]

After uploading the 7th one, only as a control, I tried "--encodings=h264,h265" (this is not what I did for 8.txt). I found that it locks up just a tiny bit at the beginning unless I let it settle before setting in on rotating. For 8.txt (which had all encoders enabled and used the stock client) I was extra careful to let it "settle". As I expected, even with settling it still has the issue for which I opened this ticket, but I figure it's best to isolate one problem at a time. I therefore regard 8.txt as the nicest debug log.

@totaam
Copy link
Collaborator Author

totaam commented Nov 3, 2018

2018-11-03 22:48:37: nathan_lstc uploaded file 4.txt (119.1 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Nov 3, 2018

2018-11-03 22:48:46: nathan_lstc uploaded file 5.txt (117.9 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Nov 3, 2018

2018-11-03 22:48:59: nathan_lstc uploaded file 6.txt (338.0 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Nov 3, 2018

2018-11-03 22:49:09: nathan_lstc uploaded file 7.txt (289.3 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Nov 3, 2018

2018-11-03 22:56:59: nathan_lstc commented


I've forwarded this to our chief UI developer. I'll let you know when it's done.

Replying to [comment:11 Antoine Martin]:

Regarding tagging, can you point me to some information on how that works
It's all in this ticket which is assigned to you since I knew you would be interested: #2023#comment:1. There's a trivial example included: [/browser/xpra/trunk/src/tests/xpra/test_apps/test_window_content_type.py].
Basically, set the _XPRA_CONTENT_TYPE X11 window property to what you want (ie: "video" or "text").

@totaam
Copy link
Collaborator Author

totaam commented Nov 3, 2018

2018-11-03 23:23:06: nathan_lstc uploaded file 8.txt (225.6 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Nov 4, 2018

As per previous comments, do not disable the other video encoders.

What I found in the logs despite that:

  • DPI set to 45 x 25 (wanted 96 x 96)
    You should be using the patched dummy driver from the repo.
  • we don't find a video encoder for small areas (because you disabled the other video encoders):
video size 484x92 out of range for nvenc, min 128x128
(..)
Error: no video pipeline options found for BGRX at 485x93
  • your card is limited to 2 contexts since you're using a gamer card without a license key:
CUDA 9.2.0 / PyCUDA 2018.1, found 1 device:
  + GeForce GTX 1060 6GB @ 0000:AF:00.0 (memory: 98% free, compute: 6.1)
(..)
setup_pipeline failed for (84, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 520, 188, video_spec(nvenc)): cannot open encoding session, context is NULL

And that causes failures when we reach the limit. 2 contexts should be enough if you only have one "video" window though.
When we do hit this limit, things go downhill fast:

setup_pipeline(..) failed! took 2197.59ms

That's the encoding pipeline blocked for over 2 seconds! ouch.

  • the batch delay is already high because of this:
update_batch_delay: congestion                   : 1.87,8.74  {}

You may want to turn off bandwidth detection to see if that helps with --bandwidth-detection=no. (#1912)
Or post the -d bandwidth log output so we can fix this.
But then it also goes up because of this:

update_batch_delay: compression-work-queue       : 2.64,5.12  {u'target': 1000, u'weight_multiplier': 3648, u'smoothing': u'logp', u'aim': 250, u'aimed_avg': 5247, u'div': 1000, u'avg': 12057, u'recent': 13314}

That's nvenc hogging the compression work queue because it is slow to set up, and even worse when it hits the 2 context limit and fails (2 seconds as per above)... not much we can do about this.

  • the video context seems to get recycled too quickly, then we see:
check_pipeline('auto', 451, 134, 'BGRX') setting up a new pipeline as check failed - encodings=['h264', 'h265']

Why it failed is not logged, so I can't be sure of the cause.

r20927 also adds an env var you can use to turn on video context cleanup debugging:

XPRA_DEBUG_VIDEO_CLEAN=1 xpra start ...

This should tell us why the video context is being destroyed so quickly.

There are new centos 7.5 builds here: [https://xpra.org/beta].

@totaam
Copy link
Collaborator Author

totaam commented Nov 4, 2018

2018-11-04 18:08:47: nathan_lstc commented


I did not understand what you meant by "do not disable the other video encoders". For some reason I thought you meant not to use "--encodings="; only now do I realize you don't want "--video-encoders=" either. I apologize for that. I was not being intentionally obtuse. I am very grateful for your help and do not mean to waste your time!

I was aware of the limit regarding encoder streams, but I didn't realize I was bumping up against it. Is there an easy way to buy or request licenses from NVIDIA (for testing)?

All of my users have GTX cards and, except for a solitary Tesla card, that's all I have. I have two Kepler-era K20c Tesla cards. That is the card that sometimes draws frames with weird colors (lots of yellow) with XPRA. I infer older cards have some limitations in this regard. That card, by the way, does work great for the render window if I add the "CSC_MODE=BGRX" environment variable and an --encodings option. I think that card can also be made to work. If getting a Quadro (or some other card) will expedite this, I'll buy whatever you recommend. Any recommendations?

Ultimately, I'd like to get XPRA to do NVENC only in my CAD render window and not anywhere else (everywhere else works just fine), so hopefully we can eventually use GTX cards. I'll try to get into the office later today so that I can perform the follow-up tests you requested.

Thanks again!

@totaam
Copy link
Collaborator Author

totaam commented Nov 4, 2018

I did not understand what you meant

My bad, a lot of things that seem obvious to me probably aren't.

I was aware of the limit regarding encoder streams, but I didn't realize I was bumping up against it

If we can fix all the bugs, you should be OK with just 2 contexts.

Is there an easy way to buy or request licenses from NVIDIA (for testing)?

I don't know.

I have two kepler-era K20c tesla cards. That is the card that sometimes draws frames with weird colors (lots of yellow) with XPRA

Could be bugs with lossless and / or YUV444 mode with those older cards.
If you can get us some "-d video" debug of when that happens, we may be able to restrict them to the modes that do work.

If getting a Quadro (or some other card) will expedite this I'll buy whatever you recommend. Any recommendations?

I've used both gamer cards (GTX 7xx to GTX10xx - with and without license keys) and quadros myself.
There is absolutely no difference in performance, the quadros just cost at least 4 times as much.
See Video Encode and Decode GPU Support Matrix

Ultimately, I'd like to get XPRA to do NVENC only in my CAD render window and not anywhere else (everywhere else works just fine)

Tagging windows should allow you to do that.

@totaam
Copy link
Collaborator Author

totaam commented Nov 4, 2018

2018-11-04 22:42:12: nathan_lstc commented


I've installed the dummy driver. I've updated to r20927. I'm running with the environment variable you specified. I've turned off bandwidth detection. Bandwidth detection made no noticeable difference. The DPI problem seems to be resolved (based on my browsing of the log file).

I've run:

 xpra start -d score,encoding,video,scaling,csc,stats,compress --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --bandwidth-detection=no  :57 2>&1 |tee 11.txt

Here's the full log:

[/attachment/ticket/2022/11.txt]

After enabling XPRA_DEBUG_VIDEO_CLEAN=1 I'm getting this:

2018-11-04 14:27:23,066 video_context_clean()
  File "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/site-packages/xpra/server/background_worker.py", line 62, in run
    item()
  File "/usr/lib64/python2.7/site-packages/xpra/server/source/encodings_mixin.py", line 160, in recalculate_delays
    ws.reconfigure()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 1102, in reconfigure
    self.update_encoding_options(force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1035, in update_encoding_options
    self.check_pipeline_score(force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 1131, in check_pipeline_score
    self.video_context_clean()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 300, in video_context_clean
    traceback.print_stack()
2018-11-04 14:27:23,069 video_context_clean() done

@totaam
Copy link
Collaborator Author

totaam commented Nov 4, 2018

2018-11-04 22:42:33: nathan_lstc uploaded file 11.txt (682.6 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Nov 4, 2018

2018-11-04 23:01:07: nathan_lstc commented


Replying to [comment:16 Antoine Martin]:

I did not understand what you meant
My bad, a lot of things that seem obvious to me probably aren't.

I very much appreciate your patience!

If you can get us some "-d video" debug of when that happens, we may be able to restrict them to the modes that do work.

I just tried this exercise from the gigabit LAN. This older config is working much better than the GTX config. With bandwidth detection turned off, my eyeball cannot detect anything wrong (when bandwidth detection is on, it locks occasionally). Last time I saw the colors go weird I was using broadband, so I'll go back home and try that (I came to work for this test).

For completeness I'm posting my experience on gigabit on my old config. I'm rendering to a C2075 (which is so old it cannot encode) and encoding on a K20c (which doesn't support rendering).

Command:

 xpra start -d score,encoding,video,scaling,csc,stats,compress --start="vglrun ./lspp45 ../d3plots/d3plot" --no-daemon --bandwidth-detection=no  :57 2>&1 |tee 12.txt

Environment:

[nathan@bobross lsprepost4.5_centos7]$ env | grep XPRA
XPRA_CUDA_DEVICE_NAME=Tesla K20c
XPRA_CUDA_DEVICE=0
XPRA_SCROLL_ENCODING=0
[nathan@bobross lsprepost4.5_centos7]$

[/attachment/ticket/2022/12.txt]

Update: From home I find that with nvenc and only nvenc enabled (--encodings) performance is great. With normal options (nothing disabled) performance is bad and my client eventually crashes. I've attached the server log from that.

[/attachment/ticket/2022/14.txt]

@totaam
Copy link
Collaborator Author

totaam commented Nov 4, 2018

2018-11-04 23:02:55: nathan_lstc uploaded file 12.txt (1591.8 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Nov 5, 2018

This older config is working much better than the GTX config

Probably because of the 2 context limit on gamer cards.

From this new log:

  • nvenc always wins the scoring
get_video_pipeline_options(['h264', 'vp9', 'vp8', 'mpeg1', 'mpeg2'], 3840, 1027, 'BGRX') \
    scores=[
     (81, (1, 1), None, 0, 0, None, 'BGRX', (1, 1), 3840, 1026, video_spec(nvenc)),
     (..)
    ]

So disabling the other video encoders is counter-productive.

  • this log sample seems to be missing XPRA_DEBUG_VIDEO_CLEAN=1 so I still cannot tell why the video encoder context gets recycled...
    The sample from comment:17 lacks the log context around the backtrace. A single backtrace is not enough in any case, some video context re-cycling is normal.
    That said, there aren't many instances of re-cycling during this test, so this may not tell us much.
  • the nvenc encoder seems to stick around for quite some time, long enough for over 750 frames:
video_encode nvenc encoder: h264 1462x756 result is 148828 bytes (56.9 MPixels/s), \
    client options={u'speed': 94, u'frame': 778, u'pts': 19778, 'csc': 'YUV444P', u'quality': 97}

The batch delay was also very good (low) at that point:

calculate_batch_delay for wid=1 current batch delay=8, last update 1 seconds ago

You seem to be getting at least 25 fps, reaching closer to 50fps near the end of the log - which is quite impressive.
Was it not smooth at that point?

At which point in time in this log file am I looking for problems?

With bandwidth detection turned off my eyeball cannot detect anything wrong

Maybe no problems in this log then?

@totaam
Copy link
Collaborator Author

totaam commented Nov 5, 2018

2018-11-05 05:11:57: nathan_lstc uploaded file 14.txt (475.9 KiB)

@totaam
Copy link
Collaborator Author

totaam commented Jan 6, 2019

2019-01-06 21:17:01: nathan_lstc commented


I've modified it as follows (added init). I also reduced the sleep number from 0.1 to 0.01:

from pycuda import driver

driver.init()
print("initilizing device")

output:

[root@curry nathan]# !py
python test_cuda_context_gil.py 
******initilizing device
*device=<pycuda._driver.Device object at 0x7ff61455a680>
context=<pycuda._driver.Context object at 0x7ff5ce4ad668>
*done
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)
[root@curry nathan]#

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2019

2019-01-07 06:51:38: antoine commented


The error printed at the end can probably be avoided by calling cuda_context.detach().
In hindsight, this test code output is not very useful... I wanted to know whether the print thread was being blocked during the call to make_context. Was it? (Did the "*" stop being printed out?)
A better way would have been to print timestamps every time instead.
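A reusable version of that idea (my own sketch, not code from the ticket): run the suspect call while a daemon thread records timestamps, then inspect the largest gap between ticks. If the gap is much larger than the tick interval, the suspect call held the GIL for that long:

```python
import threading
import time

def probe_gil(suspect, interval=0.005, warmup=0.02):
    """Run suspect() while a daemon thread records timestamps.

    Returns (result, max_gap). A max_gap much larger than `interval`
    means suspect() held the GIL for roughly that long.
    """
    stamps = []
    stop = threading.Event()

    def ticker():
        while not stop.is_set():
            stamps.append(time.time())
            time.sleep(interval)

    t = threading.Thread(target=ticker)
    t.daemon = True
    t.start()
    time.sleep(warmup)          # let the ticker get going first
    result = suspect()
    stop.set()
    t.join()
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    return result, max(gaps) if gaps else 0.0
```

For example, `probe_gil(lambda: time.sleep(0.3))` reports a small gap because `time.sleep` releases the GIL, whereas wrapping `device.make_context()` should reproduce the stall seen above.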

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2019

2019-01-07 22:11:33: nathan_lstc commented


Here's the output for a few runs:

[root@curry nathan]# python test_cuda_context_gil.py
******************initilizing device
*device=<pycuda._driver.Device object at 0x7fe1928ef680>
1546898949.39
1546898949.6
context=<pycuda._driver.Context object at 0x7fe14c867668>
**********done
*[root@curry nathan]# python test_cuda_context_gil.py
*************************************************initilizing device
device=<pycuda._driver.Device object at 0x7f0f6c8e2680>
*1546898951.31
1546898951.49
context=<pycuda._driver.Context object at 0x7f0f2685b668>
********done
[root@curry nathan]#

The current version:

#!/usr/bin/env python
import sys
import time

def print_thread():
    while True:
        sys.stdout.write("*")
        sys.stdout.flush()
        time.sleep(0.001)

from threading import Thread
t = Thread(target=print_thread)
t.daemon = True
t.start()

time.sleep(0)
from pycuda import driver

driver.init()
print("initilizing device")
device = driver.Device(0)
print("device=%s" % (device,))
print(time.time())
context = device.make_context()
print(time.time())
print("context=%s" % (context,))
time.sleep(0.01)
context.detach()
print("done")

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2019

2019-01-07 22:13:20: antoine commented


It's still printing stars "*" and not timestamps...

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2019

2019-01-07 22:14:18: antoine commented


We want to see if the print_thread is paused during the call to make_context.

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2019

2019-01-07 22:21:25: nathan_lstc commented


There is a "print time.time()" on either side of the make_context call. The stars are printing fast enough that I think they should have appeared, but you are right, it would be clearer with timestamps. Here it is:

[root@curry nathan]# [root@curry nathan]# !py
python test_cuda_context_gil.py
*print thread: 0.00265097618103
*print thread: 0.00385713577271
*print thread: 0.00500512123108
*print thread: 0.00671005249023
*print thread: 0.00848293304443
*print thread: 0.00974202156067
*print thread: 0.0113780498505
*print thread: 0.0127730369568
*print thread: 0.01389503479
*print thread: 0.0152540206909
*print thread: 0.0163860321045
*print thread: 0.0179829597473
*print thread: 0.019357919693
*print thread: 0.0204629898071
*print thread: 0.0215699672699
*print thread: 0.0227890014648
*print thread: 0.0246050357819
*print thread: 0.026202917099
*print thread: 0.0370080471039
*print thread: 0.0383989810944
*print thread: 0.039715051651
*print thread: 0.0414950847626
*print thread: 0.0425989627838
*print thread: 0.0456280708313
*print thread: 0.0471770763397
*print thread: 0.0486969947815
*print thread: 0.0498020648956
*print thread: 0.0512359142303
*print thread: 0.0525970458984
*print thread: 0.0548369884491
*print thread: 0.0561380386353
*print thread: 0.0572350025177
*print thread: 0.0583231449127
*print thread: 0.0594069957733
*print thread: 0.0614700317383
*print thread: 0.0625629425049
*print thread: 0.0636930465698
*print thread: 0.0647871494293
*print thread: 0.0663070678711
*print thread: 0.0675950050354
*print thread: 0.0690340995789
*print thread: 0.0701401233673
*print thread: 0.0712389945984
*print thread: 0.0740530490875
*print thread: 0.078320980072
*print thread: 0.07941198349
*print thread: 0.080500125885
*print thread: 0.0816259384155
*print thread: 0.0827519893646
*print thread: 0.0839450359344
*print thread: 0.0852839946747
*print thread: 0.0901050567627
*initilizing device
device=<pycuda._driver.Device object at 0x7f36ca67c6d8>
 print thread: 0.258333921432
1546899654.25
1546899654.47
context=<pycuda._driver.Context object at 0x7f36845e66e0>
*print thread: 0.471437931061
*print thread: 0.472643136978
*print thread: 0.473837137222
*print thread: 0.474987030029
*print thread: 0.476146936417
*print thread: 0.477342128754
*print thread: 0.478557109833
*print thread: 0.479777097702
*done
[root@curry nathan]#

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2019

2019-01-07 22:24:59: antoine commented


Sorry, didn't see that.

So, as I expected, the problem is that python-pycuda is stalling the whole python interpreter during both device init and context creation.
I'll ask for help; otherwise we may have to rewrite the cuda glue in Cython so we can release the GIL ourselves (see Cython: Interfacing with External C Code: Releasing the GIL).
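The Cython pattern would look roughly like this (an illustrative sketch only, not xpra's actual code: `cuCtxCreate` and `CUcontext` are from the CUDA driver API, while the wrapper name `make_context_nogil` is made up):

```cython
# Release the GIL around the blocking CUDA driver call so other
# Python threads (compression, networking) keep running meanwhile.
cdef extern from "cuda.h":
    ctypedef void* CUcontext
    ctypedef int CUdevice
    ctypedef int CUresult
    CUresult cuCtxCreate(CUcontext *pctx, unsigned int flags, CUdevice dev) nogil

def make_context_nogil(int device, unsigned int flags=0):
    cdef CUcontext ctx
    cdef CUresult r
    with nogil:
        # no Python objects may be touched inside this block
        r = cuCtxCreate(&ctx, flags, device)
    if r != 0:
        raise RuntimeError("cuCtxCreate failed: error %d" % r)
    return <size_t> ctx
```

The same `nogil` treatment would be needed for device init, which is the other slow call identified above.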

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2019

2019-01-07 22:55:06: nathan_lstc commented


Should I still have my user do the XPRA_SOCKET_NODELAY and the XPRA_ACK_TOLERANCE test?

@totaam
Copy link
Collaborator Author

totaam commented Jan 8, 2019

2019-01-08 04:02:38: antoine commented


Should I still have my user do the XPRA_SOCKET_NODELAY and the XPRA_ACK_TOLERANCE test?
Probably not. The pycuda blockage is probably the main cause of the latency problems.

@totaam
Copy link
Collaborator Author

totaam commented Jan 8, 2019

2019-01-08 06:04:48: antoine commented


Question on pycuda mailing list: releasing the GIL during init and make_context

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2019

2019-01-09 15:26:04: antoine changed owner from antoine to nathan_lstc

@totaam
Copy link
Collaborator Author

totaam commented Jan 9, 2019

2019-01-09 15:26:04: antoine commented


Batch delay fixes tracked in #2090#comment:10

As per this reply: No particular reason from my end. I'd be happy to consider a PR with such a change. Should be pretty straightforward.
Time for me to work on pycuda a bit then.

In the meantime, this should achieve the same results as setting XPRA_BATCH_MAX_DELAY=50 but better: we expire the delayed regions quicker, but they only actually get sent if there really isn't any reason to wait longer.
This doesn't fix the problem where we are too sensitive to latency going up and not sensitive enough when it goes back down, but it should mitigate most of it.

@nathan_lstc: is this better out of the box? (bar pycuda issue..)

@totaam
Copy link
Collaborator Author

totaam commented Jan 13, 2019

2019-01-13 16:58:22: antoine uploaded file test_cuda_context_gil.3.py (0.8 KiB)

increase timer frequency so we can verify the change more clearly

@totaam
Copy link
Collaborator Author

totaam commented Jan 13, 2019

2019-01-13 17:28:40: antoine commented


I've just tried on two systems with GTX cards:

$ lspci -v | grep GTX;python ./test_cuda_context_gil.py 
10:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2) (prog-if 00 [VGA controller])
* 1547391204.69
initilizing device
device=<pycuda._driver.Device object at 0x7f5304885890>
context=<pycuda._driver.Context object at 0x7f52e7a29cf8>
 - 1547391204.8
done
$ lspci -v | grep GTX;python ./test_cuda_context_gil.py 
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1) (prog-if 00 [VGA controller])
* 1547391242.08
initilizing device
device=<pycuda._driver.Device object at 0x7fe51232e998>
context=<pycuda._driver.Context object at 0x7fe4f5c32cf8>
 - 1547391242.25
done

The slowest one takes 170ms, which isn't very high, but enough to verify the changes with the updated test class firing the timer every ~10 ms:

$ lspci -v | grep GTX;python ./test_cuda_context_gil.py 
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1) (prog-if 00 [VGA controller])
* 1547398645.759
* 1547398645.765 loading pycuda
* 1547398645.779
* 1547398645.793
* 1547398645.805
* 1547398645.805 driver init
* 1547398645.815
* 1547398645.824 initializing device
* 1547398645.824 device=<pycuda._driver.Device object at 0x7ff8569cdcb0>
* 1547398645.825
* 1547398645.836
* 1547398645.847
* 1547398645.857
* 1547398645.867
* 1547398645.878
* 1547398645.888
* 1547398645.898
* 1547398645.908
* 1547398645.914 context=<pycuda._driver.Context object at 0x7ff83a038de8>
all done

The timer now runs regularly, even during the call to make_context.

r21358 adds a patch to pycuda, which we can submit upstream after further testing.
I've uploaded new python-pycuda RPM packages to the [https://xpra.org/beta] repository.

@nathan_lstc: does that work for you? You should no longer see spurious congestion events during nvenc context initialization.

@totaam
Copy link
Collaborator Author

totaam commented Jan 14, 2019

2019-01-14 04:54:35: antoine commented


I have audited all the few remaining slow calls to pycuda and nvenc:

  • copy_image calls either memcpy_htod or memcpy_dtod and both are already allowing other threads to run
  • nvEncInitializeEncoder and nvEncDestroyEncoder already use Cython's nogil
  • cuda_context.detach() can take ~25ms, I have not modified it because it uses CUDAPP_CALL_GUARDED_CLEANUP which does not have a THREADED variant and I think there might well be a reason for that - this can be revisited if the call takes much longer on some setups

@totaam
Copy link
Collaborator Author

totaam commented Jan 14, 2019

2019-01-14 23:01:22: nathan_lstc commented


The user who reported this problem has tested and he reports that it works. I have likewise tested and it works (but I never saw the problem very much to begin with).

So, yes, I think it works! Thanks!

@totaam
Copy link
Collaborator Author

totaam commented Jan 17, 2019

2019-01-17 15:50:35: antoine commented


Based on comment:35, I've created #2111.
As per comment:47 and others, without the log sample we won't be improving the detection heuristics so you will need to use the dbus video-region api to avoid context recycling.

I hope that the congestion events were caused by pycuda and that they're now all gone.

At last, closing!

@totaam totaam closed this as completed Jan 17, 2019
@totaam
Copy link
Collaborator Author

totaam commented Jan 17, 2019

2019-01-17 16:44:30: nathan_lstc commented


I totally forgot about the log sample.

I have, indeed, resolved my problem with the help of DBUS. XPRA is working great for LS-PREPOST now.

However, I am happy to help you improve XPRA more generally! If you are still interested, I will do it right away; just let me know. Shall I do it?

Open a new ticket, perhaps?

@totaam
Copy link
Collaborator Author

totaam commented Jan 17, 2019

2019-01-17 17:28:41: antoine commented


Open a new ticket, perhaps?
Yes please, something like "better video region detection".
With a log of screen updates, and what the video region should have been.
Then we can see if we can tune the code to auto-detect it properly, given time.
