Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closure-based unescaping with custom entities #415

Merged
merged 3 commits into from
Jul 15, 2022

Conversation

dralley
Copy link
Collaborator

@dralley dralley commented Jul 10, 2022

@dralley
Copy link
Collaborator Author

dralley commented Jul 10, 2022

@Mingun I think this is roughly the API you asked for, but I'm having some trouble getting the custom_entities.rs example to compile.

Under this new model, the lifetimes are a bit more complex. Feel free to do some experimentation (I'm going to sleep)

@dralley dralley force-pushed the escape-api branch 2 times, most recently from 657a8f6 to 81d63a4 Compare July 10, 2022 20:06
src/escapei.rs Outdated Show resolved Hide resolved
@dralley dralley force-pushed the escape-api branch 6 times, most recently from 5eccdae to fd609e2 Compare July 11, 2022 16:22
@Mingun
Copy link
Collaborator

Mingun commented Jul 11, 2022

It is better to replace suffix _with_custom_entities to _with -- this is more similar how other Rust APIs is named

examples/custom_entities.rs Outdated Show resolved Hide resolved
@dralley
Copy link
Collaborator Author

dralley commented Jul 11, 2022

It is better to replace suffix _with_custom_entities to _with -- this is more similar how other Rust APIs is named

I agree but when it comes to unescape_and_decode_with() the name sounds misleading - it sounds like the "with" refers to the decoding portion instead of the unescaping portion.

As with the corrections you've made recently, the more appropriate name would be decode_and_unescape_with(). Thoughts?

@Mingun
Copy link
Collaborator

Mingun commented Jul 11, 2022

Yes, it should be decode_and_unescape everywhere, because this is the correct order of operations.

Actually, I would prefer the following naming:

Name Description
unescape(&self, decoder: Decoder) Decodes internal byte content by specified decoder, then unescapes it
unescape_with<F>(&self, decoder: Decoder, resolve_entity: F) -//-
utf8_unescape(&self) Assumes UTF-8 encoded internal content, unescapes it
(available only if feature "encoding" is not enabled)
utf8_unescape_with<F>(&self, resolve_entity: F) -//-

But moving some functions under a feature flag will now breaks serde deserializer, which currently also uses incorrect unescape-then-decode order. Fixing that is not a fast task

src/events/attributes.rs Outdated Show resolved Hide resolved
examples/custom_entities.rs Outdated Show resolved Hide resolved
@dralley
Copy link
Collaborator Author

dralley commented Jul 11, 2022

@Mingun I'm not 100% sure if this makes sense, obviously you've put more thought into decoding, but:

Given that encoding is a property of the source file, and not something that changes on an element-by-element basis, does it really make sense for the complexity of decoding to be distributed throughout all the different structs?

I feel like the reader ought to just decode everything into the buffer as utf-8, and thereafter Event* and Attribute* and *escape* can just assume that it's working with UTF-8. Instead of reading the bytes directly and trying to work with it no matter the encoding. And likewise on the writing side, working with UTF-8 and then converting it to UTF-16 (or whichever encoding) as it gets written seems much simpler.

This is the approach used by other libraries such as libxml https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Encodings-support#the-internal-encoding-how-and-why

From a performance perspective, this can also be better, since encoding / decoding huge blocks of data keeps the loops hot and is easy to accelerate with SIMD (which encoding_rs tries to use when possible). It might be worse in some cases, if you're using an alternate encoding and not reading the full document, but that feels a) not especially common and b) not worth adding runtime overhead to the common utf-8 case for.

It also means the API would be cut down to a much smaller number of functions.

@Mingun
Copy link
Collaborator

Mingun commented Jul 11, 2022

I would be glad to decode the input at time of get them from the underlying reader, even at cost of some minor performance penalty. But need to make this with caution, I trying to keep the spirit of @tafia's implementation -- do not do work if it is unrequired

@dralley
Copy link
Collaborator Author

dralley commented Jul 11, 2022

(BTW, @tafia , feel free to weigh in).

The way I see this is that if the user actually does want / need to support different encodings, the current approach will actually require more work (both human and computer) to be done.

Decoding individual strings means allocating a bunch of individual strings, and if you use the unescape_and_decode() functions to do that then you will be allocating everything all the time whether you're using utf-8 or not - on top of any other small incremental performance penalties that decoding larger blocks of data might avoid.

You could try to decode and unescape manually to get back the performance for utf-8, but that's going to be a huge amount of work to do that in every individual place that it would need to be done, and not worth it at all for the vast majority of users.

So I think, the approach would be better in most cases, and also much simpler to use, maintain and test.

@dralley dralley force-pushed the escape-api branch 4 times, most recently from 6abd001 to ce19763 Compare July 12, 2022 15:40
@dralley dralley marked this pull request as ready for review July 12, 2022 15:40
@dralley dralley requested a review from Mingun July 12, 2022 15:46
@dralley
Copy link
Collaborator Author

dralley commented Jul 12, 2022

Commits are a little dirty in the sense that changes mixed together a bit, I can fix that if you mind deeply.

@codecov-commenter
Copy link

codecov-commenter commented Jul 12, 2022

Codecov Report

Merging #415 (a4d50a5) into master (a6588c2) will increase coverage by 0.09%.
The diff coverage is 54.02%.

@@            Coverage Diff             @@
##           master     #415      +/-   ##
==========================================
+ Coverage   49.58%   49.67%   +0.09%     
==========================================
  Files          22       22              
  Lines       13935    13856      -79     
==========================================
- Hits         6909     6883      -26     
+ Misses       7026     6973      -53     
Flag Coverage Δ
unittests 49.67% <54.02%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
benches/macrobenches.rs 0.00% <0.00%> (ø)
benches/microbenches.rs 0.00% <0.00%> (ø)
examples/custom_entities.rs 0.00% <0.00%> (ø)
src/events/attributes.rs 94.42% <0.00%> (+0.83%) ⬆️
src/lib.rs 12.33% <0.00%> (+<0.01%) ⬆️
src/reader.rs 60.13% <0.00%> (ø)
src/de/mod.rs 77.38% <50.00%> (+0.09%) ⬆️
src/escapei.rs 12.10% <100.00%> (-1.81%) ⬇️
src/events/mod.rs 72.28% <100.00%> (+2.17%) ⬆️
src/writer.rs 49.28% <100.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a6588c2...a4d50a5. Read the comment docs.

@dralley
Copy link
Collaborator Author

dralley commented Jul 12, 2022

I'm leaving the decode_and_unescape*() API as-is for now. The more I think about it the more the UTF-8 only buffer approach seems like the appropriate course of action.

Otherwise it is very difficult to match tags in an encoding-agnostic way without decoding them, and that's a tremendous number of allocations, probably enough to nullify any benefit of avoiding the conversion overhead. If you're parsing the whole document including text and attributes, it would become pure overhead that you can't avoid.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is almost good, only few small nits are better to solve:

  • change *ed method names to *e names for further consistency
  • add feature gate for methods that assumes UTF-8 encoding, which should restrict their incorrect usage

src/events/attributes.rs Outdated Show resolved Hide resolved
}

/// gets escaped content
///
/// Searches for '&' into content and try to escape the coded character if possible
/// returns Malformed error with index within element if '&' is not followed by ';'
///
/// See also [`unescaped_with_custom_entities()`](Self::unescaped_with_custom_entities)
/// See also [`unescaped_with()`](Self::unescaped_with)
pub fn unescaped(&self) -> Result<Cow<[u8]>> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly

Suggested change
pub fn unescaped(&self) -> Result<Cow<[u8]>> {
#[cfg(not(feature = "encoding"))]
pub fn unescape(&self) -> Result<Cow<[u8]>> {

}

fn make_unescaped<'s>(
pub fn unescaped_with<'s, 'entity>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly

Suggested change
pub fn unescaped_with<'s, 'entity>(
#[cfg(not(feature = "encoding"))]
pub fn unescape_with<'s, 'entity>(

@@ -37,39 +37,33 @@ impl<'a> Attribute<'a> {
///
/// This will allocate if the value contains any escape sequences.
///
/// See also [`unescaped_value_with_custom_entities()`](Self::unescaped_value_with_custom_entities)
/// See also [`unescaped_value_with()`](Self::unescaped_value_with)
pub fn unescaped_value(&self) -> XmlResult<Cow<[u8]>> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly

Suggested change
pub fn unescaped_value(&self) -> XmlResult<Cow<[u8]>> {
#[cfg(not(feature = "encoding"))]
pub fn unescape_value(&self) -> XmlResult<Cow<[u8]>> {

Comment on lines 59 to 63
pub fn unescaped_value_with<'s, 'entity>(
&'s self,
resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,
) -> XmlResult<Cow<'s, [u8]>> {
unescape_with(&*self.value, resolve_entity).map_err(Error::EscapeError)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will be more consistently with other codebase

Suggested change
pub fn unescaped_value_with<'s, 'entity>(
&'s self,
resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,
) -> XmlResult<Cow<'s, [u8]>> {
unescape_with(&*self.value, resolve_entity).map_err(Error::EscapeError)
#[cfg(not(feature = "encoding"))]
pub fn unescape_value_with<'s, 'entity>(
&'s self,
resolve_entity: impl Fn(&[u8]) -> Option<&'entity str>,
) -> XmlResult<Cow<'s, [u8]>> {
Ok(unescape_with(&*self.value, resolve_entity)?)

The added feature gate prevents incorrect usage. All usages in tests should be replaced with decode_* variant. Another way to solve that -- add a decoder to the Attribute / Event and keep only variant without reader parameter, but I think that this may be a task for another PR, focused on encoding.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather push all the decoding changes into a different PR. At the moment all of the parsing code is based on searches for single ascii bytes, which means it is effectively a UTF-8 only library, even if the detection of encodings works properly. #322

So there's a lot of work needed to make the encoding feature actually useful.

Copy link
Collaborator

@Mingun Mingun Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand and support. That is why I suggest just disable dangerous methods when "encoding" feature is enabled.

So there's a lot of work needed to make the encoding feature actually useful.

Actually, it is useful for one-byte encodings. Bearing in mind, that

  • we use encoding_rs
  • encoding_rs is not extensible by design
  • as the result, the list of encodings is fixed
  • the only supported encodings, that is not XML compatible (because is not ASCII compatible, which is more strict restriction) is (generated by this snippet):
    • ISO-2022-JP
    • replacement (not a real encoding, actually)
    • UTF-16BE
    • UTF-16LE

So the encoding feature actually may be useful even in the current state.


The other suggestion is to make namings is even more consistently, compare:

  • unescaped
  • decode_and_unescape

Because the difference only in the decode operation applied in the second function, it is logically, that stripping out the decode_and_ prefix we should get a name of the method that works without decoding

(also, I think, you could squash last 3 commits)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right about the other one-byte encodings, I just don't know how common they actually are, certainly compared to UTF-16. But probably what I should have said is that I don't really want to go back and forth over functions that it might make sense to remove entirely at some point in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will go ahead and make the changes anyway (tomorrow, since I really need to go to sleep now)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mingun I wont forget about this, but I am going to put this in a separate PR. I want to finish running some experiments first re: bulk decode vs. decoding individual elements and the outcome of that might influence what we want the API to look like.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm totally agree that encoding part should be in a separate PR, but can you polish namings in this PR? Only one simple change:

  • unescaped() -> unescape()
  • unescaped_with() -> unescape_with()
  • unescaped_value() -> unescape_value()
  • unescaped_value_with() -> unescape_value_with()

As you already changed some of that names it is quite logically to finish with that here, just to keep history clean :). And after that it can be merged

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK :) but, please take a look at #158 (comment)

I'm going to draft this for now because I'm 100% convinced we should change the APIs to pure &str / String ones.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to draft this for now because I'm 100% convinced we should change the APIs to pure &str / String ones.

I agree, I already work in this direction on top of this PR, namely changing the order of unescape and decode operations in the serde deserializer. I think I'm going to merge it now and continue evolving API in the another PRs. At least I want to make the new closure-based API accepts &str instead of &[u8] before the release.

src/events/mod.rs Outdated Show resolved Hide resolved
@dralley
Copy link
Collaborator Author

dralley commented Jul 13, 2022

I moved all the thoughts about future encoding strategy to the appropriate issue #158, and did a bit of background research on performance

It's redundant now that the macrobenchmarks and (un)escape benchmarks exist.
Instead of providing unescaping functions with an entity mapping
via a data structure, instead provide a closure which maps the entity
with replacement text.
@dralley dralley marked this pull request as draft July 15, 2022 02:01
@dralley dralley marked this pull request as ready for review July 15, 2022 04:53
///
/// [`unescaped_value()`]: Self::unescaped_value
/// [`unescape_value()`]: Self::unescape_value
Copy link
Collaborator Author

@dralley dralley Jul 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I think it reads a little worse, even if it's more consistent. I don't care enough to argue any more (lol) but I'll just say it for what it's worth

  • .value / raw_value() -> ... (nothing)
  • unescaped_value() -> unescape_value()
  • normalized_value() -> normalize_value()

The former fields weird when compared against the latter, just a personal opinion. Ultimately it doesn't matter very much.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just rename private function to something, I'll remove it in my PR tonight.

@Mingun Mingun merged commit 72e11d1 into tafia:master Jul 15, 2022
@Mingun
Copy link
Collaborator

Mingun commented Jul 15, 2022

Thanks for you hard work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants