-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: resolve domains in enode strings #8188
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice to have the new type DNSNodeRecord
live in reth-net-common
f.write_char(':')?; | ||
self.tcp_port.fmt(f)?; | ||
if self.tcp_port != self.udp_port { | ||
f.write_str("?discport=")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should open an issue to spec this behaviour with discport
|
||
// resolve trusted peers if they use a domain instead of dns | ||
for peer in &config.network.trusted_peers { | ||
let resolved = resolve_dns_node_record(peer.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure we want to resolve DNS at this point?
When deploying geth on k8s, we have found its behaviour to eagerly resolve DNS of hostnames in enode strings to be a blocker for trusted peer setups (e.g. trying to resolve hostname for a clusterip of a statefulset that is being deployed in parallel, preventing peers from starting up further).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make it work for this use case, what would be the best way to resolve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hostname for a statefulset pod on k8s should stabilize within a few seconds of the pod being up and running. The issue with geth was that the pods would terminate almost instantly on start up due to the hostname resolution failure.
To give us the best shot of not having the same issue, we could consider some or all of these options:
- Build in a configurable number of retries (and maybe backoff strategy) for resolving peer hostnames;
- Make the type for accessing peer records handle hostname resolution just in time (e.g.
peer.ip_addr()
would perform resolution if it had not been done before).
Option 1 would allow operators to ensure start up does not fail on the basis of hostnames that are in the process of being available on cluster.
Option 2 is not guaranteed to solve the issue, but might mitigate the number of retries required.
Let me know if you think we should address this in a separate issue and PR. And if you want me to help finish off this PR at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build in a configurable number of retries (and maybe backoff strategy) for resolving peer hostnames;
yeah, to me this seems like the most likely to work, and option 2 seems more difficult anyways. It would be great if you could help implement it! Feel free to submit a PR to this branch!
@Rjected I looked into this unit test failure
The From what I can see our only option is to post-process the url.host after that parsing has occurred. Would be better if we could control the url parsing logic but haven't seen a way to do that in our favour here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not super obvious to me why we need this.
this basically replaces the Noderecord usage, why?
what's the progress here?
/// cannot be resolved, an error is returned. | ||
/// | ||
/// If the host is already an IP address, the [NodeRecord] is returned immediately. | ||
pub async fn resolve_dns_node_record(node_record: DNSNodeRecord) -> Result<NodeRecord, Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a standalone function and not a fn of dnsrecord?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattsse I have moved the fn to be a method of the dns struct in this pr #8358 (review)
I also think we should consolidate DNSNodeRecord into NodeRecord and perhaps lazily resolve domain names when the tcp/udp getter fns are used
I have only just started contributing to reth so please take what I say with a grain of salt. My understanding is that the current impl of NodeRecord does not support domain names, only IPv4 and IPv6 addrs: pub struct NodeRecord {
/// The Address of a node.
pub address: IpAddr This PR is adding support for domain names. @mattsse are you suggesting we should be adding this support with the existing Apologies if I have misinterpreted anything 🙏 |
gotcha, @Rjected please document all of this because having like 4 different node variants is super confusing |
3510f0f
to
cdfdaff
Compare
@Rjected what's the path forward here? |
5d3c0be
to
6277861
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
solving this with a trustedpeer type makes sense to me.
I'm okay with the Option but don't love it, especially since we already use the backon crate.
don't have a preference, but ideally the caller is responsible for retrying futures, like backon does.
eg:
reth/bin/reth/src/commands/debug_cmd/in_memory_merkle.rs
Lines 110 to 123 in 1e0d724
let backoff = ConstantBuilder::default().with_max_times(retries); | |
let client = fetch_client.clone(); | |
let header = (move || { | |
get_single_header(client.clone(), BlockHashOrNumber::Number(target_block_number)) | |
}) | |
.retry(&backoff) | |
.notify(|err, _| warn!(target: "reth::cli", "Error requesting header: {err}. Retrying...")) | |
.await?; | |
let client = fetch_client.clone(); | |
let chain = provider_factory.chain_spec(); | |
let block = (move || get_single_body(client.clone(), Arc::clone(&chain), header.clone())) | |
.retry(&backoff) |
crates/net/network/src/manager.rs
Outdated
let mut resolved_boot_nodes = vec![]; | ||
for record in &boot_nodes { | ||
let resolved = record.resolve(None).await?; | ||
resolved_boot_nodes.push(resolved); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be a join
crates/net/types/src/trusted_peer.rs
Outdated
pub async fn resolve( | ||
&self, | ||
retry_strategy: Option<RetryStrategy>, | ||
) -> Result<NodeRecord, Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like intregrating a retrystrategy in there like this.
imo it should be the caller's responsibility, to retry the resolve fn.
I'd rather have this as a separate function, or leave it up to the caller entirely.
we also already use the backon crate for retries. can we use that instead, the tokio-retry crate is very smol, but still. Also open to remove backon in favor of tokio-retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have created #8613 to address this 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't see load_toml_config()
being tested. Now that the retry logic is moved there, I feel obligated to add tests for that fn.
Is that functionality perhaps tested by some e2e suite in CI? Or would we want to add a unit test for it perhaps?
for peer in &config.network.trusted_peers { | ||
let retry_strategy = RetryStrategy::new( | ||
std::time::Duration::from_millis(config.network.retry_millis), | ||
config.network.retry_attempts, | ||
); | ||
let resolved = peer | ||
.resolve(Some(retry_strategy)) | ||
.await | ||
.wrap_err_with(|| format!("Could not resolve trusted peer {peer}"))?; | ||
toml_config.peers.trusted_nodes.insert(resolved); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be a join as well.
// resolve trusted peers if they use a domain instead of dns | ||
for peer in &config.network.trusted_peers { | ||
let backoff = ConstantBuilder::default().with_max_times(config.network.dns_retries); | ||
let resolved = (move || { peer.resolve() }) | ||
.retry(&backoff) | ||
.notify(|err, _| warn!(target: "reth::cli", "Error resolving peer domain: {err}. Retrying...")) | ||
.await?; | ||
toml_config.peers.trusted_nodes.insert(resolved); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could extract this logic, into a separate function.
but will track this in a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to ping me for any followup if I can help
7bc27a3
to
83323eb
Compare
Introduces a type that should properly deserialize bootnode enode strings with domains, since the current
NodeRecord
type does not do this.fixes #8187