Failed to write message to substream when using QUIC with Rendezvous #3281

Closed
Firaenix opened this issue Dec 24, 2022 · 17 comments

Comments

@Firaenix commented Dec 24, 2022

Summary

I'm trying to build a P2P network using rendezvous for peer discovery. However, whenever I try to register a peer with my rendezvous server, I get the following warning:

2022-12-24T04:15:27.496989Z  WARN  Connection with peer 12D3KooWGVBwNbD2pmxUTCbw1h5GX1XGBkLwkmX94EsdNe8FzFwr failed: Failed to write message to substream
    at /Users/firaenix/.cargo/registry/src/github.com-1ecc6299db9ec823/libp2p-rendezvous-0.11.0/src/client.rs:203 on ThreadId(1)

I get no error/warning messages on the rendezvous server, only on the client.
I do receive a response to a discover request, but of course the array of registrations is always empty.

Expected behaviour

I would expect to be able to register my peer and discover that peer on another instance of my application.

Version

Libp2p 0.5

@thomaseizinger
Contributor

Does it work with other transports?

@Firaenix
Author

I don't get the warning anymore using the tokio_development_transport on my rendezvous server and clients.

However, my clients still don't seem to be registering with the server.
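
For reference, a minimal sketch of that transport swap, assuming the `libp2p` facade crate (around v0.50) with its tokio, tcp, dns, websocket, noise, mplex and yamux features enabled; this is not the exact code from the thread, only an illustration, and the rest of the swarm setup stays as in the examples:

use libp2p::core::muxing::StreamMuxerBox;
use libp2p::core::transport::Boxed;
use libp2p::{identity, PeerId};

// Sketch only: build the batteries-included TCP-based development transport
// instead of the QUIC transport and hand it to the swarm as before.
async fn build_dev_transport(
    identity: &identity::Keypair,
) -> std::io::Result<Boxed<(PeerId, StreamMuxerBox)>> {
    libp2p::tokio_development_transport(identity.clone()).await
}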

@thomaseizinger
Contributor

Can you share the code you are using? Have you looked at the rendezvous examples?

@Firaenix
Author

Yeah, I've been following the examples but haven't run them explicitly. I'll need to prep an example repo for you since my current code base is rather large and intertwined; I will get back to you.

@thomaseizinger
Contributor

Last time I checked, the rendezvous examples were functional. Best to compare your code to those and maybe try incorporating QUIC there first. I'd be surprised if there is an interaction between QUIC and rendezvous though.

@Firaenix
Author

Just ran the rendezvous examples; they worked out of the box, but when I added QUIC they stopped working with the same issue I mentioned in the first post.

[2022-12-27T04:52:00Z WARN libp2p_rendezvous::client] Connection with peer 12D3KooWDpJ7As7BWAwRMfu1VU2WCqNjvq387JEYKDBj4kx6nXTN failed: Failed to write message to substream

I get the error for both the discover example and the register_with_identify example. The rendezvous_point does not show any warnings but does mention that the peer disconnected after serving.

[2022-12-27T04:48:05Z INFO  rendezvous_point] Local peer id: 12D3KooWDpJ7As7BWAwRMfu1VU2WCqNjvq387JEYKDBj4kx6nXTN
[2022-12-27T04:52:00Z INFO  rendezvous_point] Connected to 12D3KooWAUUL6ZDg5epkFxkVK5SEeTUp3uzkR9581DF1brMK3JZn
[2022-12-27T04:52:00Z INFO  rendezvous_point] Peer 12D3KooWAUUL6ZDg5epkFxkVK5SEeTUp3uzkR9581DF1brMK3JZn registered for namespace 'rendezvous'
[2022-12-27T04:52:00Z INFO  rendezvous_point] Disconnected from 12D3KooWAUUL6ZDg5epkFxkVK5SEeTUp3uzkR9581DF1brMK3JZn
[2022-12-27T04:52:08Z INFO  rendezvous_point] Connected to 12D3KooWQvPesNZxthxLnFveEsjWVppBeuKr3nb1PRDHxwZpXuZA
[2022-12-27T04:52:08Z INFO  rendezvous_point] Served peer 12D3KooWQvPesNZxthxLnFveEsjWVppBeuKr3nb1PRDHxwZpXuZA with 1 registrations
[2022-12-27T04:52:08Z INFO  rendezvous_point] Disconnected from 12D3KooWQvPesNZxthxLnFveEsjWVppBeuKr3nb1PRDHxwZpXuZA

@dariusc93
Member

Could you provide the code you use when you add the QUIC transport?

@Firaenix
Author

This is the register_with_identify example; the only thing I have changed is adding the QUIC transport. I copied the same transport code to the other examples and updated their multiaddresses.

use futures::StreamExt;
use libp2p_core::{identity, upgrade::Version, Multiaddr, PeerId, Transport, muxing::StreamMuxerBox};
use libp2p_identify as identify;
use libp2p_ping as ping;
use libp2p_rendezvous as rendezvous;
use libp2p_swarm::{keep_alive, NetworkBehaviour, Swarm, SwarmEvent};
use std::time::Duration;
use void::Void;

#[tokio::main]
async fn main() {
    env_logger::init();

    let rendezvous_point_address = "/ip4/127.0.0.1/udp/62649/quic-v1".parse::<Multiaddr>().unwrap();
    let rendezvous_point = "12D3KooWDpJ7As7BWAwRMfu1VU2WCqNjvq387JEYKDBj4kx6nXTN"
        .parse()
        .unwrap();

    let identity = identity::Keypair::generate_ed25519();

    // QUIC provides encryption and stream multiplexing itself, so no separate
    // noise/yamux upgrade is applied; StreamMuxerBox only erases the concrete
    // muxer type for the swarm.
    let quic_config = libp2p_quic::Config::new(&identity);

    let transport = libp2p_quic::tokio::Transport::new(quic_config)
        .map(|(p, c), _| (p, StreamMuxerBox::new(c)))
        .boxed();

    let mut swarm = Swarm::with_tokio_executor(
        transport, 
        MyBehaviour {
            identify: identify::Behaviour::new(identify::Config::new(
                "rendezvous-example/1.0.0".to_string(),
                identity.public(),
            )),
            rendezvous: rendezvous::client::Behaviour::new(identity.clone()),
            ping: ping::Behaviour::new(ping::Config::new().with_interval(Duration::from_secs(1))),
            keep_alive: keep_alive::Behaviour,
        },
        PeerId::from(identity.public()),
    );

    log::info!("Local peer id: {}", swarm.local_peer_id());

    let _ = swarm.listen_on("/ip4/0.0.0.0/udp/0/quic-v1".parse().unwrap());

    swarm.dial(rendezvous_point_address).unwrap();

    while let Some(event) = swarm.next().await {
        match event {
            SwarmEvent::NewListenAddr { address, .. } => {
                log::info!("Listening on {}", address);
            }
            SwarmEvent::ConnectionClosed {
                peer_id,
                cause: Some(error),
                ..
            } if peer_id == rendezvous_point => {
                log::error!("Lost connection to rendezvous point {}", error);
            }
            // once `/identify` did its job, we know our external address and can register
            SwarmEvent::Behaviour(MyEvent::Identify(identify::Event::Received { .. })) => {
                swarm.behaviour_mut().rendezvous.register(
                    rendezvous::Namespace::from_static("rendezvous"),
                    rendezvous_point,
                    None,
                );
            }
            SwarmEvent::Behaviour(MyEvent::Rendezvous(rendezvous::client::Event::Registered {
                namespace,
                ttl,
                rendezvous_node,
            })) => {
                log::info!(
                    "Registered for namespace '{}' at rendezvous point {} for the next {} seconds",
                    namespace,
                    rendezvous_node,
                    ttl
                );
            }
            SwarmEvent::Behaviour(MyEvent::Rendezvous(
                rendezvous::client::Event::RegisterFailed(error),
            )) => {
                log::error!("Failed to register {}", error);
                return;
            }
            SwarmEvent::Behaviour(MyEvent::Ping(ping::Event {
                peer,
                result: Ok(ping::Success::Ping { rtt }),
            })) if peer != rendezvous_point => {
                log::info!("Ping to {} is {}ms", peer, rtt.as_millis())
            }
            other => {
                log::debug!("Unhandled {:?}", other);
            }
        }
    }
}

#[derive(Debug)]
#[allow(clippy::large_enum_variant)]
enum MyEvent {
    Rendezvous(rendezvous::client::Event),
    Identify(identify::Event),
    Ping(ping::Event),
}

impl From<rendezvous::client::Event> for MyEvent {
    fn from(event: rendezvous::client::Event) -> Self {
        MyEvent::Rendezvous(event)
    }
}

impl From<identify::Event> for MyEvent {
    fn from(event: identify::Event) -> Self {
        MyEvent::Identify(event)
    }
}

impl From<ping::Event> for MyEvent {
    fn from(event: ping::Event) -> Self {
        MyEvent::Ping(event)
    }
}

impl From<Void> for MyEvent {
    fn from(event: Void) -> Self {
        void::unreachable(event)
    }
}

#[derive(NetworkBehaviour)]
#[behaviour(
    out_event = "MyEvent",
    event_process = false,
    prelude = "libp2p_swarm::derive_prelude"
)]
struct MyBehaviour {
    identify: identify::Behaviour,
    rendezvous: rendezvous::client::Behaviour,
    ping: ping::Behaviour,
    keep_alive: keep_alive::Behaviour,
}

@Firaenix commented Dec 28, 2022

Something odd is going on. I was diving into the QUIC transport code in VSCode to debug the issue, but it seems to work correctly in the VSCode terminal.
In my standalone iTerm terminal, I get the error mentioned above...

Neovim terminal running rendezvous_point, split iTerm terminal running register_with_identify: (screenshot)

VSCode terminal running register_with_identify: (screenshot)

Both register_with_identify instances are run using the same command: RUST_LOG=info cargo run --example register_with_identify

EDIT:
I can even run the compiled binary for register_with_identify in each terminal and it produces different results:
RUST_LOG=info /Users/firaenix/Projects/rust-libp2p/target/debug/examples/register_with_identify

@Firaenix commented Dec 28, 2022

The error seems to be here: https://github.com/libp2p/rust-libp2p/blob/master/protocols/rendezvous/src/handler/outbound.rs#L91

This is not a fix by any means, but when I comment out the close, the rendezvous point registers the peer successfully. Still unsure why that would work differently in each terminal though.

When I run RUST_LOG=info cargo run --example discover, it doesn't matter which terminal I use, I always get the Failed to write message to substream warning.

@wpy-snqu

The WebRTC transport works OK.

@thomaseizinger
Contributor

The error seems to be here: https://github.com/libp2p/rust-libp2p/blob/master/protocols/rendezvous/src/handler/outbound.rs#L91

This is not a fix by any means, but when I comment out the close, the rendezvous point registers the peer successfully. Still unsure why that would work differently in each terminal though.

When I run RUST_LOG=info cargo run --example discover, it doesn't matter which terminal I use, I always get the Failed to write message to substream warning.

We've had some issues with closing of QUIC streams before. Are you on latest master?

Related: #3144

@Firaenix
Author

Those logs and the code I shared above are from latest master, yes. I pulled the repo and ran the rendezvous examples.

@wpy-snqu commented Dec 30, 2022

With this fix, it works OK:

--- a/transports/quic/src/connection/substream.rs
+++ b/transports/quic/src/connection/substream.rs
@@ -185,8 +185,9 @@ impl AsyncWrite for Substream {
             }
             WriteState::Closed => return Poll::Ready(Ok(())),
             WriteState::Stopped => {
-                let err = quinn_proto::FinishError::Stopped(0u32.into());
-                return Poll::Ready(Err(io::Error::new(io::ErrorKind::ConnectionReset, err)));
+                // let err = quinn_proto::FinishError::Stopped(0u32.into());
+                // return Poll::Ready(Err(io::Error::new(io::ErrorKind::ConnectionReset, err)));
+                return Poll::Ready(Ok(()));
             }
         }

@thomaseizinger
Contributor

@elenaf9 What do you think of the above patch? Would that make sense?

@elenaf9 commented Jan 11, 2023

Sorry for the late reply.
The issue here is that closing the substream between server and client is racy.
The following sequence happens:

sequenceDiagram
    participant Client
    participant Server
Note left of Client: write
Client ->>Server: <data>
Note right of Server: read
Note right of Server: write
Server->>Client:<data>
Note right of Server: close
Server ->>Client: FIN
Note left of Client: read
Client ->>Server: ack
Note left of Client : close
Note right of Server : drop
Client ->>Server: FIN

When the server finishes writing and closing the stream, it drops that stream. If this happens before it receives the client's FIN, it sends a STOP_SENDING to signal the client that it won't read any more data. This then causes the above error in the client's close call.

We can't simply apply the above patch, since there may be situations where we still have unacknowledged write data. In that case we don't know whether the remote read all of our data, so we need to surface the FinishError::Stopped to the user.

For the rendezvous example, we can fix this by having the client call close directly after the write call, before it starts reading (see the sketch below). That would guarantee that the server receives the FIN.
That being said, the calling order write -> read -> close should generally be possible and not cause racy bugs. I am currently looking into how to handle this case better.
I hope that helps. If you need a fix for rendezvous right now, you could open a PR with the change I described above, but I hope to get a proper fix done this week. I am also looking into whether this is related to the bugs described in #3298 and #3308.
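
A minimal sketch of that write -> close -> read ordering, as an illustration using the generic `futures` I/O traits rather than the actual rendezvous handler code (`request_response` is a made-up helper name):

use futures::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

// Write the full request, then close our write side *before* reading the
// response. Closing here guarantees the server receives our FIN while it is
// still holding the stream, instead of racing with it dropping the stream
// after replying.
async fn request_response<S>(stream: &mut S, request: &[u8]) -> std::io::Result<Vec<u8>>
where
    S: AsyncRead + AsyncWrite + Unpin,
{
    stream.write_all(request).await?;
    stream.close().await?; // send FIN now, before we start reading
    let mut response = Vec::new();
    stream.read_to_end(&mut response).await?;
    Ok(response)
}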

@thomaseizinger
Contributor

Closing as stale (or completed?).

@thomaseizinger closed this as not planned on Sep 20, 2023