Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seems like read_until can miss the delimiter byte from time to time #4908

Closed
erdosboti opened this issue Aug 12, 2022 · 5 comments
Closed

Seems like read_until can miss the delimiter byte from time to time #4908

erdosboti opened this issue Aug 12, 2022 · 5 comments
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-io Module: tokio/io

Comments

@erdosboti
Copy link

Version
List the versions of all tokio crates you are using. The easiest way to get
this information is using cargo tree subcommand: v.1.17.0

cargo tree | grep tokio
│ │ ├── tokio v1.17.0
│ │ │ └── tokio-macros v1.7.0 (proc-macro)
│ ├── tokio v1.17.0 ()
│ │ ├── tokio v1.17.0 (
)
│ │ ├── tokio-util v0.7.0
│ │ │ └── tokio v1.17.0 ()
│ │ ├── tokio v1.17.0 (
)
│ │ └── tokio-util v0.6.9
│ │ └── tokio v1.17.0 ()
│ ├── tokio v1.17.0 (
)
│ ├── tokio-util v0.6.9 ()
├── tokio v1.17.0 (
)
├── tokio-executor-trait v2.1.0
│ └── tokio v1.17.0 ()
├── tokio-reactor-trait v1.1.0
│ ├── tokio v1.17.0 (
)
│ └── tokio-stream v0.1.8
│ └── tokio v1.17.0 ()
└── tokio-retry v0.3.0
└── tokio v1.17.0 (
)

Platform
The output of uname -a (UNIX), or version and 32 or 64-bit (Windows)

Linux localhost 5.18.14-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 23 Jul 2022 11:46:17 +0000 x86_64 GNU/Linux

Description
I'm fairly new to rust, so I might not do something correctly, but here is what I observe:
I use tokio to communicate with multiple cameras over TCP sockets. XML commands can be sent to and XML messages can be received from the cameras. The delimiter that marks the end of a message is the NULL byte b'\0'. I use read_until to read a single message from the socket. It works 99% of the time, but from time to time I get 2 messages in the buffer with 2 NULL bytes (one being at the end correctly). So it seems like sometimes read_until can "miss" the delimiter.

I tried this code:

The relevant code part looks like this:

    pub async fn read_next_message(&mut self) {
        match &mut self.stream {
            None => log::error!("Cannot read message, because connection is not yet established!"),
            Some(stream) => {
                let len = stream
                    .reader
                    .read_until(b'\0', &mut stream.read_buffer)
                    .await
                    .unwrap();
                if len == 0 {
                    log::warn!(
                        "Disconnected from {camera_address}...",
                        camera_address = self.camera_id.addr
                    );
                    self.establish_connection().await;
                    return;
                }

                // Temp error checking ------------------------------------------------
                if has_null_byte(&stream.read_buffer[..(stream.read_buffer.len() - 1)]) {
                    log::error!("BAD XML");
                    log::info!("buf len: {}", len);
                    log::info!("vec len: {}", stream.read_buffer.len());
                    let mut file = tokio::fs::File::create("/home/bpe/BAD_XML.xml")
                        .await
                        .unwrap();
                    file.write_all(&stream.read_buffer[..]).await.unwrap();
                }
                // ---------------------------------------------------------------------

               // probably not relevant part as the problem already happens before this --------------- 
               let xml_message =
                    std::str::from_utf8(&stream.read_buffer[0..(stream.read_buffer.len() - 1)])
                        .expect("Found invalid UTF-8");
                let msg = messages::CameraMessage::from(xml_message);
                log::info!("{}: Sent {}.", self.camera_id, msg);

                if let messages::CameraMessage::FixedCameraStatus(_) = msg {
                } else {
                    ObservationMessageJob::new(xml_message, self.camera_id)
                        .push(&mut self.redis_connection)
                        .await;

                    log::info!("{}: Pushed {} to sidekiq.", self.camera_id, msg);
                }
                // ------------------------------------------------------------------------------------------------

                stream.read_buffer.clear();
            }
        }
    }

I expected to see this happen:

I would expect that we never enter into the 'Temp error checking' code part, since we have only one NULL byte at the end of stream.read_buffer: Vec<u8>

Instead, this happened:

Sometimes we got in there (error checking part) and if I check the content of BAD_XML.xml I indeed see 2 messages and 2 NULL bytes at the end of those. I also log the return value of the read_until call and the stream.read_buffer which gives this:

... [ERROR] - BAD XML
... [INFO] - buf len: 255014
... [INFO] - vec len: 506172

I know that len can be smaller than the stream.read_buffer.len() if cancellation happen, I just thought it be worth mentioning that the first NULL bytes is at position 251157 which is a different value.

Additional information

  • This code is called in a tokio::select! branch.
  • The application is built into a static binary with cargo build --release --target=x86_64-unknown-linux-musl and is run on a server as a systemd service. The server platform is: Linux ... 4.15.0-188-generic #199-Ubuntu SMP Wed Jun 15 20:42:56 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux.
@erdosboti erdosboti added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Aug 12, 2022
@Darksonn
Copy link
Contributor

Have you tried this?

let len = stream
    .reader
    .read_until(b'\0', &mut stream.read_buffer)
    .await
    .unwrap();
if len == 0 {
    log::warn!(
        "Disconnected from {camera_address}...",
        camera_address = self.camera_id.addr
    );
    self.establish_connection().await;
    return;
}

 // Temp error checking ------------------------------------------------
-if has_null_byte(&stream.read_buffer[..(stream.read_buffer.len() - 1)]) {
+if has_null_byte(&stream.read_buffer[..len]) {
    log::error!("BAD XML");
    log::info!("buf len: {}", len);
    log::info!("vec len: {}", stream.read_buffer.len());
    let mut file = tokio::fs::File::create("/home/bpe/BAD_XML.xml")
        .await
        .unwrap();
    file.write_all(&stream.read_buffer[..]).await.unwrap();
}
// ---------------------------------------------------------------------

@Darksonn Darksonn added the M-io Module: tokio/io label Aug 12, 2022
@Darksonn
Copy link
Contributor

Actually, giving this another look, I am more worried about the fact that the call to .clear() is far away from the call to read_until. I would recommend clearing the buffer right before the read_until call.

@erdosboti
Copy link
Author

Thx for looking into it.
No I haven't tried, but as I mentioned len is not equal to the position of the first NULL byte, and I learned the hard way that it can be a smaller value than the actual message length.

Wow you're fast. I was just answering your comment and it's already updated.
About the .clear() thing: while I was describing the issue, I also realized that it might play a role in this. I'll try you suggestion and report back.

Thx again.

@erdosboti
Copy link
Author

erdosboti commented Aug 13, 2022

I applied your suggestion, but that caused broken messages again, but with the end part. So I finally figured out that the whole problem is that I didn't really understand how tokio::select! works. For future reference I describe what the problem was:
I had a tokio::select! in a loop with 2 branches:

  • one sends a command to the camera
  • the other one (described in the original bug description) reads a message from the camera

If the "send command to camera" branch is driven to completion first then the the "receive message from camera" branch is cancelled at a non-deterministic point, then in the next iteration of the loop it reruns the whole future. So this was the cause of the problem in both cases:

  • when the .clear() was at the end of the function then it could happen that a whole message was read into the buffer and then the cancellation happened and after that when it was rerun a new message was read in. So that's why it seemed like read_until was missing a delimiter.
  • when the clear() was just before the socket read operation then it could happen that not a whole message was read from the socket before the cancellation happened. And then when it was rerun we cleared the first part of the message and then the second half was read in.

So the solution to this was to have the clear() just before the read_until call as you suggested, but only do the clear() if the buffer's last byte is the delimiter byte.

Thank you very much for the help.

@Darksonn
Copy link
Contributor

Closing since this appears to be due to incorrect use of read_until. If there's an actual bug, please reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-io Module: tokio/io
Projects
None yet
Development

No branches or pull requests

2 participants