Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idna to_unicode() API has degraded in 1.0 #938

Open
djc opened this issue Jun 11, 2024 · 10 comments
Open

idna to_unicode() API has degraded in 1.0 #938

djc opened this issue Jun 11, 2024 · 10 comments

Comments

@djc
Copy link
Contributor

djc commented Jun 11, 2024

I work on a domain search engine that deals with many domains. As part of performance efforts to optimize this path, I had previously carefully optimized Idna::to_unicode() to avoid allocations where possible (for example, in #653). However, the 1.0 release (apart from bringing in 25 transitive new dependencies which IMO is not great by itself for such a low-level crate) proposes I use domain_to_unicode() (which is a little simpler but definitely doesn't enable me to avoid per-conversion allocations), but then says:

This function exists for backward-compatibility. Consider using Uts46::to_user_interface or Uts46::to_unicode.

In turn, Uts64::to_unicode() documents itself as:

Most applications probably shouldn’t use this method and should be using Uts46::to_user_interface instead.

Meanwhile the interface for to_user_interface() is:

pub fn to_user_interface<'a, OutputUnicode: FnMut(&[char], &[char], bool) -> bool>(
        &self,
        domain_name: &'a [u8],
        ascii_deny_list: AsciiDenyList,
        hyphens: Hyphens,
        output_as_unicode: OutputUnicode,
    ) -> (Cow<'a, str>, Result<(), crate::Errors>) {

Which IMO is pretty unreasonable to suggest as an interface "most applications" should use. If there is a need for a more complex API it seems clear that the previous approach of building a Config with a builder pattern and building an instance with an internal buffer that could be reused allowed for more idiomatic and more performant operations.

@valenting
Copy link
Collaborator

@hsivonen did some benchmarking in Firefox and the new IDNA crate should be slightly faster than the old one.
Of course, it might still be suboptimal in its allocation strategy.
I suggest you hold on with the upgrade until we get the chance to make some more optimizations. We might even downgrade to the previous idna version temporarily (see #937)

@djc
Copy link
Contributor Author

djc commented Jun 11, 2024

Did the benchmarking involve doing a bunch of toUnicode conversions in a row rather than one-shot?

@valenting
Copy link
Collaborator

@hsivonen
Copy link
Collaborator

hsivonen commented Jun 24, 2024

which is a little simpler but definitely doesn't enable me to avoid per-conversion allocations

Do I understand correctly that you want to an API entry point that takes a &mut String so that you can set the logical length of the string back to zero between conversions without dropping the String's internal buffer?

Is the issue that Idna::to_unicode() was marked deprecated or that its internals changed in a way that causes more allocations?

The remark about "most applications" comes from the observation that for protocol needs, apps generally need to run ToASCII (as a matter of observation, the common case for using this crate directly is to call the crate top-level domain_to_ascii and nothing else), and for user interface reasons, apps should be running a policy layer that deals with confusables instead of running pure ToUnicode. Notably, there don't seem to be good use cases for pure ToUnicode in a piece of software as complex as a Web browser!

The interface complexity of to_user_interface arises from two reasons: 1) keeping the policy out of scope for this crate (hence the closure) and 2) allowing configurability for the bits (denied ASCII and positional hyphens) where theory and practice disagree and callers are likely to want to pick what degree they want between theory and practice.

I'm curious: What's your use case for pure (without policy for dealing with confusables) ToUnicode?

@djc
Copy link
Contributor Author

djc commented Jun 24, 2024

I work on InstantDomainSearch, which processes several GBs worth of input files (mostly CSV) to build up state (of which domains are free vs taken vs available on the aftermarket) and then processes user queries against this state. I mainly use idna to normalize domains using UTF-8 which seems like a sensibly compact canonical form.

Whereas the Config API seems to prefer establishing a policy upfront and then applying it to lots of conversions (which IMO is a sensible approach), the Uts64 API seems to completely externalize the policy which only makes sense to me if it changes from call to call which seems mostly unlikely. I also think that FnMut(&[char], &[char], bool) -> bool by itself is a fairly inscrutable generic type and so doesn't make for good Rust API -- it might be better off as an explicit trait where it's possible to name the parameters.

@hsivonen
Copy link
Collaborator

I work on InstantDomainSearch, which processes several GBs worth of input files (mostly CSV) to build up state (of which domains are free vs taken vs available on the aftermarket) and then processes user queries against this state. I mainly use idna to normalize domains using UTF-8 which seems like a sensibly compact canonical form.

This makes sense, but I don't think it's unreasonable for a "most applications" statement in the docs not to apply to your application. Most applications need an UTS 46 implementation for the purpose on resolving names (and most of those won't call into the idna crate directly but via url). Fewer apps also need to render domain names to the user, but that's still a common thing to do.

Code that's about registering domains is quite far from "most applications". (And registration needs policy checks on the ToUnicode output that are different from the display policy integration point that the API caters to.)

Whereas the Config API seems to prefer establishing a policy upfront and then applying it to lots of conversions (which IMO is a sensible approach), the Uts64 API seems to completely externalize the policy which only makes sense to me if it changes from call to call which seems mostly unlikely.

Letting the parameters change from call to call isn't at all what the API design is about. On the contrary, the API design more obviously allows for constant propagation than an API where the parameters don't looks like constants at the call site.

I also think that FnMut(&[char], &[char], bool) -> bool by itself is a fairly inscrutable generic type and so doesn't make for good Rust API -- it might be better off as an explicit trait where it's possible to name the parameters.

The closure signature not naming the parameters indeed isn't nice. In retrospect, at least the two bools there could have been two-state enums with descriptive names.

Anyway, back to the core question: Is this about having API surface that takes &mut String and being able to set the logical length to zero between calls without deallocating? Ignoring the deprecation annotation, does Idna::to_unicode() no longer allow you to get the behavior you wanted (assuming non-transitional mode)?

@djc
Copy link
Contributor Author

djc commented Jun 27, 2024

Anyway, back to the core question: Is this about having API surface that takes &mut String and being able to set the logical length to zero between calls without deallocating? Ignoring the deprecation annotation, does Idna::to_unicode() no longer allow you to get the behavior you wanted (assuming non-transitional mode)?

Yes, when I benchmarked this a few years ago, IIRC avoiding one allocation per to_unicode() operation was a clear performance win.

@hsivonen
Copy link
Collaborator

hsivonen commented Jul 1, 2024

Anyway, back to the core question: Is this about having API surface that takes &mut String and being able to set the logical length to zero between calls without deallocating? Ignoring the deprecation annotation, does Idna::to_unicode() no longer allow you to get the behavior you wanted (assuming non-transitional mode)?

Yes, when I benchmarked this a few years ago, IIRC avoiding one allocation per to_unicode() operation was a clear performance win.

That answers the first question. Thanks. It doesn't answer the second one though.

Anyway, to avoid API surface marked deprecated, it seems that this function needs to go somewhere: either in your app or in the idna crate:

    pub fn to_unicode(domain: &str, out: &mut String) -> Result<(), Errors> {
        match Uts46::new().process(
            domain.as_bytes(),
            FIRST_APP_SPECIFIC_CONSTANT,
            SECOND_APP_SPECIFIC_CONSTANT,
            THIRD_APP_SPECIFIC_CONSTANT,
            |_, _, _| true,
            out,
            None,
        ) {
            Ok(ProcessingSuccess::Passthrough) => {
                out.push_str(domain);
                Ok(())
            }
            Ok(ProcessingSuccess::WroteToSink) => Ok(()),
            Err(ProcessingError::ValidityError) => Err(crate::Errors::default()),
            Err(ProcessingError::SinkError) => unreachable!(),
        }
    }

That the three constants are app-specific hints at the direction that the function should go into your app.

Is there an obvious-enough a combination for the constants that this could go into the crate without exposing three arguments instead of making the three values constant? Also, if this went into the crate, instead of &mut String, out should probably be a type implemeting Write. If there were arguments instead of constants and Write instead of String, we'd be very close to the low-level API already in the crate. (In the crate, Write instead of String would be justified: Your allocation optimization might not be optimal enough for everyone: Since DNS has a length limit, it's plausible that a String-like type with a stack-allocated buffer could be more appropriate than String for optimizing allocations.)

@djc
Copy link
Contributor Author

djc commented Jul 3, 2024

It feels to me like a failure of the current crate API that it doesn't propose reasonable defaults for all these constants -- another reason that IMO the previous Config builder API was a better API design for most users, allowing a reasonable progressive disclosure of complexity rather than the current all or nothing approach. For ascii_deny_list, it seems to me like AsciiDenyList::URL is a reasonable default, and Hyphens::Allow also seems more reasonable than the other options. For ErrorPolicy IMO FailFast is arguably more aligned with the usual expectations on Rust APIs (contrary to the old API which IIRC more or less only supported MarkErrors). The use of fmt::Write seems okay to me.

@hsivonen
Copy link
Collaborator

hsivonen commented Jul 3, 2024

Indeed, AsciiDenyList::URL and Hyphens::Allow are supposed to be the defaults, since they are the values relevant to the URL Standard.

However, since AsciiDenyList::URL caters to non-DNS naming systems and Hyphens::Allow caters to Internet realities not matching IETF policy, it's not obvious to me that an API that just went with these values would suit your app, which sounds DNS-oriented and, as far as I could imagine, might be subject to hyphen policies.

The current design assumes that the progression is that if the main entry point without parameters isn't suitable, the API user would configure the ASCII deny list, the hyphen policy, or the confusable policy before configuring the allocation policy. Since you come from the angle of customizing allocation but only a bit (reusing the buffer of String instead of bringing an alternative string type), the progression is the wrong way round for you. Unclear what progression direction would serve the most API users the best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants