
Add little-endian support #46

Open · wants to merge 1 commit into master

Conversation

@ocecaco commented Jan 9, 2020

This pull request adds support for little-endian encoding (#21), which is what Bedrock Edition uses for encoding NBT data. The encoding (i.e., big endian or little endian) is determined by a parameter passed to the serialization/deserialization functions at runtime.
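A minimal sketch of the runtime-selection idea described above (the enum and function names here are illustrative, not necessarily the PR's exact API):

```rust
use std::io::{self, Read};

/// Which byte order to use when decoding or encoding NBT payloads.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Endianness {
    Big,    // Java Edition
    Little, // Bedrock Edition
}

/// The encoding is an ordinary runtime argument rather than a type parameter.
pub fn read_bare_int<R: Read>(reader: &mut R, endian: Endianness) -> io::Result<i32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(match endian {
        Endianness::Big => i32::from_be_bytes(buf),
        Endianness::Little => i32::from_le_bytes(buf),
    })
}
```

Callers would then pass, e.g., `Endianness::Little` when reading Bedrock Edition data, with no extra type parameters involved.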

Initially, I implemented it differently, with the encoding as a compile-time type parameter (making functions generic over <B: ByteOrder>). However, this led to a proliferation of type parameters and would essentially produce two copies of the library during monomorphization, so I concluded that runtime selection of the encoding makes the most sense.

I also fixed the tests and benchmarks, but didn't add any new tests/documentation yet. I would be glad to hear if you have any comments or suggestions about my approach and implementation.

@atheriel (Collaborator)

This is an interesting approach, and thanks for taking the initiative to implement it. I have a couple thoughts:

  • What is the cost of monomorphization here? Idiomatic Rust usually favours generic solutions to these problems, so it is unusual (to my eyes) to see an argument for runtime checks.

  • You've made many breaking changes to the internal and external APIs. I don't think it would be possible to do this otherwise, but it does feel a bit rash to me. Can you explain the API changes a little more?

  • The commit makes several unrelated style or naming changes. Is this the result of an automated tool?

@ocecaco (Author) commented Jan 10, 2020

These are all good questions. This implementation is something I put together in a few hours as a proof of concept, and I wanted to see whether you thought this approach makes sense before I spent a lot more time on it. I have now also pushed my original experiments with the monomorphization approach, so you can see what that looks like: it is on the endianness branch of my repository. This branch also passes all the tests.

Monomorphization

I am also still not sure whether the runtime branching is the most idiomatic Rust way to implement this. It is definitely unusual to have a runtime check every time a raw NBT field is read from the input, and it might be that the monomorphization approach makes more sense in the end. My main reasoning was that when you have a client library that uses both the little-endian and big-endian encodings, they would get essentially two complete copies of the library, each copy specialized for one encoding. Of course, generating specialized code is exactly the point of monomorphization, but the question is where to draw the monomorphization "boundary". In the runtime-checking based implementation, we get two copies of some small functions, whereas if we propagate the type parameters up to the top-level library functions, we get two copies of most of the library.

Getting (at worst) two copies of the library might not be so bad, since: (1) the library is not that large, so there wouldn't be a big explosion in code size; (2) most client libraries would only use one encoding, since they probably wouldn't have to deal with the Java and Bedrock Edition formats at the same time, so they would get just one copy without any unnecessary runtime checks; and (3) it might also be possible to avoid generating two copies by using a trait object (i.e., Box<dyn ByteOrder>) in place of the type parameter, which is the typical Rust way of avoiding the code-size increase from monomorphization. I haven't tried this yet, but it should work as long as the ByteOrder trait is object-safe.
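One caveat on point (3): byteorder's ByteOrder trait exposes associated functions with no self receiver, so it is not object-safe as-is; a small adapter trait with &self methods would be needed. A rough sketch of what dynamic dispatch could look like (all names here are hypothetical, not part of the library):

```rust
use std::io::{self, Read};

// Object-safe adapter trait: methods take &self, so &dyn works.
trait DynByteOrder {
    fn read_i32(&self, buf: [u8; 4]) -> i32;
}

struct Big;
struct Little;

impl DynByteOrder for Big {
    fn read_i32(&self, buf: [u8; 4]) -> i32 { i32::from_be_bytes(buf) }
}
impl DynByteOrder for Little {
    fn read_i32(&self, buf: [u8; 4]) -> i32 { i32::from_le_bytes(buf) }
}

// One compiled copy of this function serves both encodings: the byte-order
// choice is resolved through the vtable rather than by monomorphization.
fn read_int<R: Read>(reader: &mut R, order: &dyn DynByteOrder) -> io::Result<i32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(order.read_i32(buf))
}
```

The trade-off is one indirect call per primitive read, which is essentially the same cost as the runtime enum check.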

Another argument for making the endianness a runtime parameter rather than a type parameter is that type parameters are awkward to pass around: in Rust, when you explicitly specify one type parameter, you have to spell out the whole parameter list, using underscores for those that can be inferred.
For instance, a function like from_reader would have a type parameter R: io::Read and a second type parameter E: ByteOrder for the endianness. To call it, you would have to write from_reader::<_, LittleEndian>(...) to specify the endianness. This might be acceptable for the internal API, but for the external API I think the underscore placeholders are syntactically noisy.

By restructuring the external API somewhat, it is possible to avoid having to write the additional type parameters everywhere, for instance by having the endianness be specified once on a struct, like so:

struct Foo<E: ByteOrder> {
    _marker: PhantomData<E>,
}

impl<E: ByteOrder> Foo<E> {
    fn from_reader<R: io::Read>(...) {
    }
}
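A fleshed-out version of that sketch, with a self-contained stand-in for byteorder's ByteOrder trait so it compiles without the external crate (the struct name Foo remains a placeholder):

```rust
use std::io::{self, Read};
use std::marker::PhantomData;

// Stand-in for byteorder's ByteOrder trait, kept minimal for the example.
pub trait ByteOrder {
    fn read_i32(buf: [u8; 4]) -> i32;
}

pub struct BigEndian;
pub struct LittleEndian;

impl ByteOrder for BigEndian {
    fn read_i32(buf: [u8; 4]) -> i32 { i32::from_be_bytes(buf) }
}
impl ByteOrder for LittleEndian {
    fn read_i32(buf: [u8; 4]) -> i32 { i32::from_le_bytes(buf) }
}

pub struct Foo<E: ByteOrder> {
    _marker: PhantomData<E>,
}

impl<E: ByteOrder> Foo<E> {
    /// The byte order is fixed by the type, so no per-call `::<_, E>` noise.
    pub fn read_int<R: Read>(reader: &mut R) -> io::Result<i32> {
        let mut buf = [0u8; 4];
        reader.read_exact(&mut buf)?;
        Ok(E::read_i32(buf))
    }
}
```

A caller could also name the instantiation once, e.g. `type BedrockNbt = Foo<LittleEndian>;`, and never mention the endianness again.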

API changes

The main external API change is that all of the serialization/deserialization functions, like from_reader and to_writer, now take an additional Endianness argument, specifying whether a little-endian or big-endian encoding should be used.

Internally, the main change is that functions like read_bare_int are now defined on a RawReader struct, which wraps an R: io::Read field (which is kept hidden to prevent it from being used directly) and exposes an endianness-agnostic API. This means that where previously some code might call read_i32::<BigEndian>(...), it would now call the read_bare_int function on RawReader. In essence, this means that the only part of the library that directly deals with endianness is the raw module. The idea is that raw provides an abstraction layer so the rest of the library doesn't have to think about endianness anymore.
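A hedged sketch of that RawReader idea, following the description above (the real implementation in the PR may differ in detail):

```rust
use std::io::{self, Read};

#[derive(Clone, Copy)]
pub enum Endianness { Big, Little }

/// Wraps the underlying reader. Keeping the field private means the rest of
/// the library can only go through the endianness-agnostic methods.
pub struct RawReader<R: Read> {
    inner: R,
    endian: Endianness,
}

impl<R: Read> RawReader<R> {
    pub fn new(inner: R, endian: Endianness) -> Self {
        RawReader { inner, endian }
    }

    /// Replaces direct calls like `read_i32::<BigEndian>(...)`; the byte
    /// order is decided here, once, instead of at every call site.
    pub fn read_bare_int(&mut self) -> io::Result<i32> {
        let mut buf = [0u8; 4];
        self.inner.read_exact(&mut buf)?;
        Ok(match self.endian {
            Endianness::Big => i32::from_be_bytes(buf),
            Endianness::Little => i32::from_le_bytes(buf),
        })
    }

    /// NBT bytes are signed (`i8`), matching the Java implementation.
    pub fn read_bare_byte(&mut self) -> io::Result<i8> {
        let mut buf = [0u8; 1];
        self.inner.read_exact(&mut buf)?;
        Ok(buf[0] as i8)
    }
}
```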

There are also some more minor changes, such as replacing u8 by i8 in a few places, to be closer to the Java implementation of NBT, and to be able to use the read_bare_byte function on RawReader.

Styling and name changes

I think that the name changes you might be referring to are the changes in the benches/file_tests.rs file. I noticed that this file seemed to be out of date, since the names it used no longer matched up with the exported names from the library. Therefore, I updated it so it compiles and runs (using cargo bench) again.

With regard to the style changes, I replaced try! with ? in a few places and replaced a &Trait with &dyn Trait somewhere. I thought that fixing some of the deprecation warnings in files that already had a lot of changes (like raw.rs or value.rs) made sense, but I can see that fixing compiler warnings on unrelated files/lines might clutter up the commit history, so I can get rid of those changes if you like.
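For reference, the two spellings are equivalent; both return early from the function on an Err. A minimal illustration (the function is invented for the example):

```rust
use std::num::ParseIntError;

// Modern form; the pre-2018 equivalent was: let n = try!(s.parse::<i32>());
pub fn increment(s: &str) -> Result<i32, ParseIntError> {
    let n = s.parse::<i32>()?;
    Ok(n + 1)
}
```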
