This repository has been archived by the owner on Apr 6, 2019. It is now read-only.

Tacopie TCP server library (read access violation after close under TCP zero window condition) #42

Open
Cylix opened this issue Feb 28, 2018 · 4 comments


@Cylix
Owner

Cylix commented Feb 28, 2018

Issue imported from emails:

Hello Simon,

How are you?

I'm very happy to have discovered your excellent Tacopie library on GitHub. You know, I was looking for a modern TCP server library that would take a lot of work off my hands. I'm currently using it to run a few custom VNC-server experiments.

However, I now think I have run into a bug (either in my own program, tacopie or perhaps both).

Background:

I have made a small tacopie based "VNC server" program that talks to the well known "Ultra VNC Viewer" / "Tight VNC viewer" etc.

Now the current implementation of "my" VNC server is minimalistic. 
For instance it does not send 'incremental' screen updates at all (yet).

Build-up

When I connect with the UVNC client, the connection is established fine and things generally work as expected. UVNC displays the screen I'm sending and receiving keyboard/mouse input works too. Life is good!

Remember I don't send incremental updates to UVNC. Well, at some point in time, UVNC starts to worry about that and decides that it now has waited long enough for those. It then simply stops receiving data from its socket and displays a (misleading!) "server closed the connection" popup.

And I know it is misleading because all the while its TCP socket is actually still open! I verified this with Wireshark: no FIN messages etc.

To further convince myself the connection is still open I press the "refresh screen" button on the UVNC toolbar a few times. The server 'sees' these update requests and duly replies, writing a lot of data to the socket.

Since UVNC is no longer receiving data from the socket, this causes a so-called "TCP zero window" situation, where the TCP receive buffer on the client side is exhausted.

Here comes the real problem

If I dismiss the UVNC popup under these specific conditions, the TCP connection will actually be terminated (which is good). But the server application may now crash with an access violation.

The crash happens when trying to obtain the mutex m_write_requests_mtx inside method tcp_client::process_write. I suspect the tcp_client object is destroyed before process_write is called.

The more bytes I attempt to write in that zero window situation, the higher the chance the server application will crash this way. If I press that refresh button numerous times, the crash is virtually guaranteed.

I figure this has something to do with those bytes that could not be written to the socket or delivered.
It looks to me like a race condition: part of the library wants to notify the connection object after it has already been destroyed somewhere else (from another thread?).

Disclaimer: I'm not sure if this can be attributed to a bug in my program or the library, but I thought I would communicate it anyway... It may or may not be of help to you.

@Cylix
Owner Author

Cylix commented Feb 28, 2018

Email reply:

Hi Edwin,

Thanks for reaching out to me and for the detailed explanation!

Technically, when the tcp_client is destroyed, it will call .disconnect(true) internally.
Note that we pass true as a parameter, which enables the option wait_for_removal.
This option blocks the thread until all pending callbacks are processed.

Considering your case, it seems that this waiting process does not occur.
In that case, going through the code of the tcp_client, the only possibility would be:
1. disconnect() is called after a read or write failure. In that case, wait_for_removal is not set to true, which means the function returns right away.
2. When the tcp_client is destroyed, the destructor is called, leading to disconnect() being called. But because the client is already marked as disconnected, the function returns right away.
This would match your current context where the client fails to read from the remote server.
Even though .disconnect() prevents the insertion of new write requests and clears the pending ones, it is indeed true that a thread may still notify the tcp_client about a write event on the socket, because we started watching the socket before the disconnection.

However, the fix is not as trivial as it may sound:
- Calling disconnect(true) from the read or write callback would block indefinitely: it would wait for the callbacks to be executed, but the callback would be waiting for itself to complete.
- Always waiting for callback cleanup in the destructor can also be an issue: if the client is disconnected and the socket closed, as it is right now, the socket file descriptor might be reused by the operating system when a new socket is created. Then, when trying to wait for cleanup, we would actually wait for the cleanup of another socket.

I guess the best solution at the moment would be:
- Always wait for callback cleanup when the destructor is called, even if the client has been marked as disconnected.
- Do not close the socket until the client is destroyed, so that the operating system cannot reuse the same socket identifier in the meantime.
I'm going to work on a fix and will try to release it ASAP; I'll keep you up to date :)

Best, and thanks a lot for reporting this. Really appreciated!

@Cylix
Owner Author

Cylix commented Feb 28, 2018

This issue is still in progress; the fix should be released soon. Sorry for the delay.

Cylix added a commit that referenced this issue Mar 12, 2018
> io service does not delegate tasks to the thread_pool anymore
> tcp_client spawn tasks by delegating the callbacks to the thread_pool (in place of the io_service)
> get rid of the wait_for_removal parameters that made the API and code logic too complex
@Cylix
Owner Author

Cylix commented Mar 12, 2018

Fixed in 9c9b9d4

@Cylix Cylix closed this as completed Mar 12, 2018
@Cylix
Owner Author

Cylix commented Mar 13, 2018

The implementation had some issues; I rolled back the change and reopened the issue.

@Cylix Cylix reopened this Mar 13, 2018