Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite loop in ssl_tls_write() #3229

Open
metalefty opened this issue Aug 30, 2024 · 10 comments
Open

Infinite loop in ssl_tls_write() #3229

metalefty opened this issue Aug 30, 2024 · 10 comments

Comments

@metalefty
Copy link
Member

An enterprise user reports that they're seeing a problem where xrdp 0.10.1 is stuck in a logging loop.

[2024-08-28T08:38:02.147-0400] [INFO ] Received memory_allocation_complete command. width: 7680, height: 4320
[2024-08-28T17:48:35.369-0400] [ERROR] SSL_read: I/O error
[2024-08-28T17:48:35.370-0400] [ERROR] xrdp_iso_send: trans_write_copy_s failed
[2024-08-28T17:48:35.370-0400] [ERROR] Sending [ITU T.125] DisconnectProviderUltimatum failed
...
[2024-08-28T17:49:23.596-0400] [ERROR] SSL_write: Failure in SSL library (protocol error?)
[2024-08-28T17:49:23.597-0400] [ERROR] SSL: error:140D010F:SSL routines:SSL_write:bad length
[2024-08-28T17:49:23.597-0400] [ERROR] xrdp_iso_send: trans_write_copy_s failed
[2024-08-28T17:49:23.598-0400] [ERROR] xrdp_mcs_send: xrdp_iso_send failed
[2024-08-28T17:49:23.598-0400] [ERROR] xrdp_sec_send: xrdp_mcs_send failed
[2024-08-28T17:49:23.599-0400] [ERROR] xrdp_rdp_send_data_from_channel: xrdp_sec_send failed
[2024-08-28T17:49:23.599-0400] [ERROR] xrdp_iso_send: trans_write_copy_s failed
[2024-08-28T17:49:23.600-0400] [ERROR] xrdp_mcs_send: xrdp_iso_send failed
[2024-08-28T17:49:23.600-0400] [ERROR] xrdp_sec_send: xrdp_mcs_send failed
[2024-08-28T17:49:23.601-0400] [ERROR] xrdp_rdp_send_data_from_channel: xrdp_sec_send failed
...

I haven't reproduced the issue I guess there are 2 issues.

  1. POLLERR / POLLHUP is not checked
  2. Return value of g_sck_can_send/recv() is not checked

Regarding issue 1, POLLERR / POLLHUP is always checked on FreeBSD but not on GNU/Linux.

Linux

https://man7.org/linux/man-pages/man2/poll.2.html

       POLLERR
              Error condition (only returned in revents; ignored in
              events).  This bit is also set for a file descriptor
              referring to the write end of a pipe when the read end has
              been closed.

       POLLHUP
              Hang up (only returned in revents; ignored in events).
              Note that when reading from a channel such as a pipe or a
              stream socket, this event merely indicates that the peer
              closed its end of the channel.  Subsequent reads from the
              channel will return 0 (end of file) only after all
              outstanding data in the channel has been consumed.

FreeBSD

https://man.freebsd.org/cgi/man.cgi?query=poll&apropos=0&sektion=0&manpath=FreeBSD+14.1-RELEASE+and+Ports&arch=default&format=html

       POLLERR	      An  exceptional  condition has occurred on the device or
		      socket.  This  flag  is  always  checked,	 even  if  not
		      present in the events bitmask.

       POLLHUP	      The  device  or socket has been disconnected.  This flag
		      is always	checked, even if not  present  in  the	events
		      bitmask.	 Note that POLLHUP and POLLOUT should never be
		      present in the revents bitmask at	the same time.
@metalefty
Copy link
Member Author

I haven't tested but the fix will be like this. Maybe the same change should applied in other places.

@matt335672 What do you think?

diff --git a/common/os_calls.c b/common/os_calls.c
index afd5a958..1952d425 100644
--- a/common/os_calls.c
+++ b/common/os_calls.c
@@ -1567,7 +1567,7 @@ g_sck_can_send(int sck, int millis)
         pollfd.revents = 0;
         if (poll(&pollfd, 1, millis) > 0)
         {
-            if ((pollfd.revents & POLLOUT) != 0)
+            if ((pollfd.revents & (POLLOUT | POLLHUP | POLLERR)) != 0)
             {
                 rv = 1;
             }
diff --git a/common/ssl_calls.c b/common/ssl_calls.c
index 70d2d7c8..63636770 100644
--- a/common/ssl_calls.c
+++ b/common/ssl_calls.c
@@ -1362,11 +1362,11 @@ ssl_tls_write(struct ssl_tls *tls, const char *data, int length)
              *     SSL_ERROR_WANT_WRITE
              */
             case SSL_ERROR_WANT_READ:
-                g_sck_can_recv(tls->trans->sck, SSL_WANT_READ_WRITE_TIMEOUT);
-                continue;
+                break_flag = g_sck_can_recv(tls->trans->sck, SSL_WANT_READ_WRITE_TIMEOUT);
+                break;
             case SSL_ERROR_WANT_WRITE:
-                g_sck_can_send(tls->trans->sck, SSL_WANT_READ_WRITE_TIMEOUT);
-                continue;
+                break_flag = g_sck_can_send(tls->trans->sck, SSL_WANT_READ_WRITE_TIMEOUT);
+                break;

             /* socket closed */
             case SSL_ERROR_ZERO_RETURN:

@matt335672
Copy link
Member

I've had a look at the Linux/FreeBSD issue.

The language used in both is more complicated than it needs to be. The POSIX manpage says:-

POLLERR
   An error has occurred on the device or stream. This flag is only valid in the revents bitmask; it shall be ignored in the events
   member.
POLLHUP
   A device has been disconnected, or a pipe or FIFO has been closed by the last process that had it open for writing. Once set,
   the hangup state of a FIFO shall persist until some process opens the FIFO for writing or until all read-only file descriptors for
   the FIFO are closed. This event and POLLOUT are mutually-exclusive; a stream can never be writable if a hangup has occurred.
   However, this event and POLLIN, POLLRDNORM, POLLRDBAND, or POLLPRI are not mutually-exclusive. This flag is only
   valid in the revents bitmask; it shall be ignored in the events member.

I think these are all saying the same thing, which is that we don't need to set POLLERR or POLLHUP but they are checked (and returned) automatically.

As for the existing code, this looks OK to me. The poll() will return regardless of which flag is set. In ssl_calls.c this means we'll reattempt the write, and then we'll get the error which we can log. That's where these two errors are coming from:-

[2024-08-28T17:48:35.369-0400] [ERROR] SSL_read: I/O error
...
[2024-08-28T17:49:23.596-0400] [ERROR] SSL_write: Failure in SSL library (protocol error?)

So I don't think this needs changing.

What does need changing is that something isn't checking the status of a read or write. It's not obvious looking at the code where this would be.

A couple of questions:-

  1. Can the user reproduce this? It looks like it's resize-related
  2. Can the user take debug builds from us?

Thanks.

@matt335672
Copy link
Member

I think I saw something like this once during GFX testing.

My guess is we've sent the client something on the GFX channel that it doesn't like. The client is closing the GFX channel (but not the main channel), and then we're logging errors for everything else we send on that channel. We do do a lot of error checking in those code paths.

If that's right, we need to figure out what the client is objecting to.

@metalefty
Copy link
Member Author

@matt335672

Thanks for the info. I've been busy for a while. I haven't asked the end-user if the issue is GFX related but they've taken a stack trace.


The stack trace taken whilst the xrdp process is still running shows

#0  0x00007f87bae11a35 in write () from /lib64/libpthread.so.0
#1  0x00007f87bb5766a0 in g_file_write () from /usr/lib64/xrdp/libcommon.so.0
#2  0x00007f87bb57409a in internal_log_message () from /usr/lib64/xrdp/libcommon.so.0
#3  0x00007f87bb57445a in log_message () from /usr/lib64/xrdp/libcommon.so.0
#4  0x00007f87bb54bafb in xrdp_sec_send () from /usr/lib64/xrdp/libxrdp.so.0
#5  0x00007f87bb5486d6 in xrdp_rdp_send_data_from_channel () from /usr/lib64/xrdp/libxrdp.so.0
#6  0x00007f87bb52fa03 in libxrdp_send_bitmap () from /usr/lib64/xrdp/libxrdp.so.0
#7  0x00005645f41b1384 in xrdp_painter_end_update ()
#8  0x00005645f419fab7 in xrdp_bitmap_invalidate.part ()
#9  0x00005645f41a0f91 in xrdp_bitmap_invalidate ()
#10 0x00005645f41b968e in xrdp_wm_init ()
#11 0x00005645f41b9eb5 in xrdp_wm_check_wait_objs ()
#12 0x00005645f41b3724 in xrdp_process_main_loop ()
#13 0x00005645f41a6cfb in xrdp_listen_main_loop ()
#14 0x00005645f419e29c in main ()

strace output shows evidence that the socket is disconnected (POLLHUP):

poll([{fd=14, events=POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT|POLLERR|POLLHUP}])

@matt335672
Copy link
Member

That's useful - thanks.

That's the code that sends GFX updates when the backend isn't running in GFX mode.

My suspicion is one of these lines. Neither has error checking on it:-

xrdp/libxrdp/libxrdp.c

Lines 586 to 587 in 364790b

xrdp_rdp_send_data((struct xrdp_rdp *)session->rdp, s,
RDP_DATA_PDU_UPDATE);

xrdp/libxrdp/libxrdp.c

Lines 694 to 695 in 364790b

xrdp_rdp_send_data((struct xrdp_rdp *)session->rdp, s,
RDP_DATA_PDU_UPDATE);

If you're stuck for time at the moment, let me know if you'd like a patch.

@metalefty
Copy link
Member Author

@matt335672 Could you work on patch? I'm still stuck for time for a while.

@matt335672
Copy link
Member

Will do. Should get to it Monday at the latest.

@matt335672
Copy link
Member

I think I can (sort of) reproduce this now with this patch:-

--- a/xrdp/xrdp_wm.c
+++ b/xrdp/xrdp_wm.c
@@ -833,6 +833,13 @@ xrdp_wm_init(struct xrdp_wm *self)
     {
         LOG(LOG_LEVEL_DEBUG, "   xrdp_wm_init: no autologin / auto run detected, draw login window");
         xrdp_login_wnd_create(self);
+static int first_time = 1;
+if (first_time)
+{
+    LOG(LOG_LEVEL_INFO, "Sleeping....");
+    g_sleep(10 * 1000);
+    LOG(LOG_LEVEL_INFO, "Done sleeping....");
+}
         /* clear screen */
         xrdp_bitmap_invalidate(self->screen, 0);
         xrdp_wm_set_focused(self, self->login_window);

I then connect and kill the connection before the sleep finishes.

I don't get an infinite loop, but I get a lot of logging along these lines:-

[2024-09-16T11:19:53.833+0100] [ERROR] [xrdp_iso_send(xrdp_iso.c:639)] xrdp_iso_send: trans_write_copy_s failed
[2024-09-16T11:19:53.833+0100] [ERROR] [xrdp_mcs_send(xrdp_mcs.c:1475)] xrdp_mcs_send: xrdp_iso_send failed
[2024-09-16T11:19:53.833+0100] [ERROR] [xrdp_sec_send(xrdp_sec.c:1328)] xrdp_sec_send: xrdp_mcs_send failed
[2024-09-16T11:19:53.833+0100] [ERROR] [xrdp_rdp_send_data_from_channel(xrdp_rdp.c:680)] xrdp_rdp_send_data_from_channel: xrdp_sec_send failed

Despite my earlier comments, it's nothing to do with GFX - in fact in GFX mode things seem to terminate more quickly.

I'll step through this and figure out where the best place to detect the connection has failed is. Adding a check to each potential write is a lot of work and probably unnecessary. There's probably something that can be done at the top level.

@matt335672
Copy link
Member

After spending quite a lot of time looking into this, I think I'm chasing my own tail. I'm sure I'm missing something.

The very first comment contains this line :-

[2024-08-28T08:38:02.147-0400] [INFO ] Received memory_allocation_complete command. width: 7680, height: 4320

That has to be generated by xup.c. In turn that means the client has authenticated and is trying to connect to a session.

The stacktrace is related to the login screen being drawn. I didn't think that was possible after we've authenticated and are trying to connect to a session.

This looks like two separate problems. To stop even more flailing about on my part, I'm going to ignore the stacktrace for now.

Going back to the logging above, we've got this message:-

[2024-08-28T17:48:35.370-0400] [ERROR] Sending [ITU T.125] DisconnectProviderUltimatum failed

This is generated here in xrdp_mcs_disconnect():-

xrdp/libxrdp/xrdp_mcs.c

Lines 1534 to 1540 in 1c33f3d

if (xrdp_iso_send(self->iso_layer, s) != 0)
{
free_stream(s);
close_rdp_socket(self);
LOG(LOG_LEVEL_ERROR, "Sending [ITU T.125] DisconnectProviderUltimatum failed");
return 1;
}

Following the call stack up:-

  1. xrdp_mcs_disconnect() is called from one place only, in xrdp_sec_disconnect()
  2. xrdp_sec_disconnect() is called from one place only, in xrdp_rdp_disconnect()
  3. xrdp_rdp_disconnect() is called from one place only in libxrdp_disconnect()
  4. libxrdp_disconnect() is called from two places. These are both in xrdp_process_main_loop(), and could be replaced with a single call.

After libxrdp_disconnect() returns, I can't see how a separate xrdp process would fail to exit.

@metalefty - can you check a couple of things with the user:-

  1. Are they running with fork=true in xrdp.ini?
  2. Can you ask them to set EnableProcessId=true in the [Logging] section? That should let us tie up the logging with the actual xrdp processes being used. I'd like to know if these messages are coming from one xrdp process, or many xrdp processes.

@metalefty
Copy link
Member Author

Are they running with fork=true in xrdp.ini?

Yes.

Can you ask them to set EnableProcessId=true in the [Logging] section? That should let us tie up the logging with the actual xrdp processes being used. I'd like to know if these messages are coming from one xrdp process, or many xrdp processes.

Yes, trying to reproduce with EnableProcessId=true now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants