repair window responses are not retransmitted #336

aeyakovenko · 2018-06-08T19:52:07Z

1. leader sends packet to 1 validator (A)
2. validator A retransmits the packet to all the peers

if the packet is dropped, we do not know whether it was dropped in step 1 or 2. We need some way for the validators to decide whether they should ask the peers or the leader for the packet, and if it was dropped in step 1 the leader should respond with a packet that the validator will retransmit.

the hard part here is avoiding having multiple validators retransmit this packet to the peers, because it would flood the network. so the leader needs to do some flow control.

aeyakovenko · 2018-06-09T17:57:15Z

we don’t have a retransmit flag, the leader sets the blob's sender id to self. And packets from the leader are retransmitted to peers.

So we can set the id to self on the first repair packet. Or the second one from a different node

aeyakovenko · 2018-06-09T17:58:38Z

I think if we included the window bits in the repair messages we could do something smarter and proactive.

So if the leader gets multiple requests for repair it can evaluate all the windows and see which packets should be retransmitted to all the peers. We would need to weight them by stake size to be spam resistant eventually
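
A minimal sketch of that idea, under stated assumptions: each repair request carries the requester's stake and its window bits (`true` means the blob was received), and the leader retransmits every blob whose missing stake reaches a threshold. `RepairRequest`, `blobs_to_retransmit`, and `threshold_percent` are hypothetical names for illustration, not anything in the codebase.

```rust
/// Hypothetical repair request that includes the requester's window bits.
pub struct RepairRequest {
    /// Stake of the requesting validator (assumed to be known to the leader).
    pub stake: u64,
    /// One bit per window slot: true = blob received, false = missing.
    pub received: Vec<bool>,
}

/// Returns the window slots the leader should retransmit to all peers:
/// those where the stake reporting the blob as missing is at least
/// `threshold_percent` of `total_stake`.
pub fn blobs_to_retransmit(
    requests: &[RepairRequest],
    total_stake: u64,
    threshold_percent: u64,
) -> Vec<usize> {
    let window_len = requests
        .iter()
        .map(|r| r.received.len())
        .max()
        .unwrap_or(0);

    // Sum the stake that is missing each slot.
    let mut missing_stake = vec![0u64; window_len];
    for req in requests {
        for (i, slot) in missing_stake.iter_mut().enumerate() {
            // Assumption: slots beyond a shorter reported window count as missing.
            if !req.received.get(i).copied().unwrap_or(false) {
                *slot += req.stake;
            }
        }
    }

    missing_stake
        .into_iter()
        .enumerate()
        .filter_map(|(i, stake)| {
            // Compare in u128 so large stakes cannot overflow.
            let lhs = (stake as u128) * 100;
            let rhs = (total_stake as u128) * (threshold_percent as u128);
            if lhs >= rhs {
                Some(i)
            } else {
                None
            }
        })
        .collect()
}
```

Weighting by stake rather than by request count means a low-stake node that spams repair requests cannot force a network-wide retransmit on its own.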

pgarg66 · 2018-06-10T18:56:02Z

How about this:

1. The validator sends the retransmit request to the leader.
2. If the leader gets the retransmit request from multiple validators, it sends the packet to one of the validators (maybe the first one that requested the packet, or chosen by some better scheduler), setting sender id to self. The validator will retransmit the packet to other validators in this case.
3. If the retransmit request was from one validator only, it'll send it to that validator, setting sender id to that validator (?) The validator will not retransmit this packet to other validators.
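
The proposal above can be sketched roughly as follows. This is only an illustration: `NodeId`, `RepairResponse`, and `plan_response` are hypothetical names, and picking the first requester as the target is the simplest of the scheduling options mentioned.

```rust
use std::collections::HashSet;

/// Hypothetical node identifier, used only for this sketch.
pub type NodeId = u64;

/// Where the leader sends the repaired blob and which sender id it carries.
#[derive(Debug, PartialEq)]
pub struct RepairResponse {
    pub to: NodeId,
    pub sender_id: NodeId,
}

/// Decides how the leader answers the repair requests seen for one blob.
/// `requesters` lists the requesting validators in arrival order.
pub fn plan_response(leader: NodeId, requesters: &[NodeId]) -> Option<RepairResponse> {
    // No requests, nothing to send.
    let first = *requesters.first()?;
    let unique: HashSet<NodeId> = requesters.iter().copied().collect();

    // Multiple distinct validators asked: set sender id to the leader so the
    // receiving validator retransmits. A single requester gets its own id as
    // sender id, so it will not retransmit.
    let sender_id = if unique.len() > 1 { leader } else { first };
    Some(RepairResponse { to: first, sender_id })
}
```

Repeated requests from the same validator are deduplicated, so one node retrying does not by itself trigger a retransmit.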

As you said, the leader can maintain a window. Also, does/can leader know which validator got which packet (in the window) when it was originally transmitted? If retransmission requests are being received for packets sent to a particular validator, it can indicate some problem (network/host) with that validator.

Thoughts?

aeyakovenko · 2018-06-10T23:57:03Z

Would you wait before responding, until message 2? Or just keep a counter of how many unique repair requests there are?

Right now the validators randomly send each other repair requests, and the leader is part of the random group.

pgarg66 · 2018-06-11T00:05:19Z

There can be a small (TBD) fixed wait before responding.

If one of the validators is down/bottlenecked (or if the leader-to-validator packet was dropped), more than one peer validator will request a retransmission within some time interval. If the unicast packet from one validator to another was dropped, then only one of them will request a retransmit.
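
One way to picture the fixed wait, as a sketch only: the leader opens a short window on the first request for a blob, collects requester ids, and decides once the wait has elapsed. `RepairWindow`, its methods, and the millisecond timestamps passed in by the caller are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical per-blob wait window kept by the leader.
pub struct RepairWindow {
    opened_at_ms: u64,
    wait_ms: u64,
    requesters: HashSet<u64>,
}

impl RepairWindow {
    /// Opens a window at `now_ms` that waits `wait_ms` before deciding.
    pub fn new(now_ms: u64, wait_ms: u64) -> Self {
        RepairWindow {
            opened_at_ms: now_ms,
            wait_ms,
            requesters: HashSet::new(),
        }
    }

    /// Records a repair request from validator `requester`.
    pub fn record(&mut self, requester: u64) {
        self.requesters.insert(requester);
    }

    /// Returns None while the wait is still running. Afterwards returns
    /// Some(true) if more than one distinct validator asked (the blob was
    /// probably lost on the leader-to-validator hop, so the response should
    /// be retransmitted) and Some(false) otherwise.
    pub fn should_retransmit(&self, now_ms: u64) -> Option<bool> {
        if now_ms.saturating_sub(self.opened_at_ms) < self.wait_ms {
            return None;
        }
        Some(self.requesters.len() > 1)
    }
}
```

The cost of this approach is that every repair response is delayed by at least `wait_ms`, which is why the wait has to stay small.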

So, validators don't know who the leader is?

aeyakovenko · 2018-06-11T16:33:26Z

@pgarg66 validators know who the leader is. so each validator gets a different packet and retransmits to all the other peers, that's how we are splitting the leader's bandwidth across N downstream nodes.

I think something simple we can try is asking to retransmit with exponential backoff, so 2, 4, 8th... repair request

pgarg66 · 2018-06-12T03:45:28Z

"I think something simple we can try is asking to retransmit with exponential backoff, so 2, 4, 8th... repair request"

Sorry, I am slightly confused with this. Is the requester (validator) exponentially backing off before requesting a retransmission? Is the purpose of back off that multiple validators won't ask for a retransmission of the same packet?

aeyakovenko · 2018-06-12T04:23:44Z

the leader sets the sender id as self (which indicates retransmit), every time the number of requests to repair that specific packet doubles.
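
A rough sketch of that rule, assuming the leader keeps a per-blob request counter: the response is marked for retransmit on the 2nd, 4th, 8th, ... request, and every other request gets a plain response. `RepairCounter` and `record_request` are hypothetical names, and treating the very first request as non-retransmitting is an assumption taken from the "2, 4, 8th" wording.

```rust
use std::collections::HashMap;

/// Hypothetical per-blob counter of repair requests seen by the leader.
#[derive(Default)]
pub struct RepairCounter {
    counts: HashMap<u64, u64>,
}

impl RepairCounter {
    /// Records one repair request for `blob_index`. Returns true when the
    /// leader should set the sender id to self on the response, which tells
    /// the receiving validator to retransmit it to its peers.
    pub fn record_request(&mut self, blob_index: u64) -> bool {
        let count = self.counts.entry(blob_index).or_insert(0);
        *count += 1;
        let n = *count;
        // True each time the request count doubles: 2, 4, 8, ...
        n >= 2 && n.is_power_of_two()
    }
}
```

Because the gap between retransmits doubles each time, a blob that many validators keep asking for causes only a logarithmic number of network-wide retransmits.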

pgarg66 · 2018-06-15T23:02:22Z

Isn't this code already retrying to repair the window?

streamer.rs: line 203
let reqs = find_next_missing(locked_window, crdt, consumed, received)?;
let sock = UdpSocket::bind("0.0.0.0:0")?;
for (to, req) in reqs {
    //todo cache socket
    info!("repair_window request {} {} {}", *consumed, *received, to);
    assert!(req.len() < BLOB_SIZE);
    sock.send_to(&req, to)?;
}

aeyakovenko · 2018-06-15T23:18:39Z

the problem is here
https://github.com/solana-labs/solana/blob/master/src/crdt.rs#L609

we mark the response to the repair request so that it is never retransmitted. so if the packet is dropped in the first hop, all the peers are missing the packet and none will broadcast it to the rest of the network

pgarg66 · 2018-06-16T00:02:30Z

I understand it now.
Good thing, now I also have some understanding of validator side of repair window processing.

aeyakovenko added this to the v0.7.0 milestone Jun 8, 2018

pgarg66 self-assigned this Jun 9, 2018

This was referenced Jun 17, 2018

Issue #336: Added throttling of repair messages (will add the retransmission of repair messages logic soon) #371

Closed

Issue #336: added retransmission of repair messages #373

Closed

garious mentioned this issue Jun 19, 2018

Retransmit messages #382

Merged

garious closed this as completed in #382 Jun 19, 2018
