CIP-0041? | UPLC Serialization Optimizations #314
Conversation
This proposal suggests ways to reduce serialized script size.
The changes were designed to keep the bit-oriented style of flat, minimizing the number of required bits for values where possible.
The rationale section is a bit thin, though there's a bit of rationale sprinkled in various places of the specification. It'd be nice to have a summary of what motivates each design choice. This can most likely be done by reworking a bit how the specification is written (that is, write the specification without justifications, and move the justifications to the rationale section).
For example, most of the "data serialization" section is about explaining what is currently wrong with the Data CBOR serialization. Instead, leave the specification focused on what the proposal is about, and move the rationale to this section.
It'd be nice to consider some benchmarks: taking some reference UPLC scripts and serializing them side-by-side with the two methods to see how much gain one can expect.
Thank you @KtorZ; will refactor as suggested (and correct the various typos 😄)
## Abstract
this document describes the parts of the current serialization algorithm that can be improved and provides the specification and documentation needed in order to implement an optimized version of this one.
You probably want to clean up the formatting, e.g. use capital case at the beginning of sentences.
function pad( missingBits: number ): string {
    switch( missingBits ) {
        // cases 0-5 reconstructed from the equivalent Haskell version below
        case 0: return "00000001";
        case 1: return "1";
        case 2: return "01";
        case 3: return "001";
        case 4: return "0001";
        case 5: return "00001";
        case 6: return "000001";
        case 7: return "0000001";
        default: throw new Error( "invalid number of missing bits" );
    }
}
What language is this? Why not describe this in Haskell?
TypeScript, just because I feel the description is more intuitive using it.
It translates to Haskell as:
pad :: Int -> String
pad 0 = "00000001"
pad 1 = "1"
pad 2 = "01"
pad 3 = "001"
pad 4 = "0001"
pad 5 = "00001"
pad 6 = "000001"
pad 7 = "0000001"
pad _ = undefined
---
CIP: "???"
Title: optimized UPLC serialization
Authors: Michele Nuzzi <[email protected]>
Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
Authors: Michele Nuzzi <michele.nuzzi.2014@gmail.com>
The proposed changes to the algorithm will cause the same UPLC Abstract Syntax Tree to serialize in a different way depending on the algorithm used;
In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
In order to allow the deserializaton process to handle the old serialization algorithm the version of the program should be checked first.
cc @michaelpj
This could probably have started life as a ... @kwxm, can you take a look?
Apart from anything else, this could do with an impact assessment that assesses how much of an improvement this is: actual numbers are very relevant.
```
The case in which ```missingBits``` is 0 implies that the current serialized program is already byte-aligned.
Since this padding carries no useful information, the current ```pad( 0 )``` adds a useless byte each time padding is needed and the number of used bits is a multiple of 8.
I think that the reason for this is so that the decoder can easily discard padding at the end without having to check whether it's at a byte boundary, and this makes it faster. The proposed change would on average save one bit per bytestring, so I'm not sure if it's worth it.
If the proposed changes are accepted, the only place where padding will be used is at the end of the script, since bytestrings will no longer be required to be byte-aligned.
The check for ```pad( 0 )``` then becomes just ``lengthInBits `mod` 8 == 0``.
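The revised rule can be sketched as follows; this is an illustrative reading of the proposal, and the function and variable names (`padToByte`, `lengthInBits`) are assumptions, not names from the CIP:

```typescript
// Illustrative sketch of the proposed padding rule (names are assumed):
// pad with zeroes and a terminating 1 bit only when the bit stream is
// not already byte-aligned; emit nothing when it is, unlike the
// current `pad( 0 )`, which always appends a full "00000001" byte.
function padToByte(lengthInBits: number): string {
    const missingBits = (8 - (lengthInBits % 8)) % 8;
    if (missingBits === 0) return ""; // already aligned: no padding at all
    // `missingBits - 1` zeroes followed by the terminating 1 bit,
    // matching pad(1) = "1", pad(2) = "01", ..., pad(7) = "0000001"
    return "0".repeat(missingBits - 1) + "1";
}
```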
Tags from ```integer``` to ```bool```, and the ```data``` tag, are directly followed by the respective value encoding;
Tags ```list``` and ```pair``` are the only tags that require some other tag in order to form a defined type; since the two always require some other type in order to be valid, the type application is implicit and it should be removed.
Yes, that's a bit annoying: it's like that because it reflects how the internal representations of types work. The extra generality might permit us to do some more complex things in future, but it's not clear if we'll ever actually need that. Simplifying this might be a good thing to do, since it complicates the encoding/decoding process as well as taking up extra room.
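As an illustration of what the implicit application could look like to a decoder, here is a hedged sketch; the tag numbers and type names are placeholders, not the actual UPLC constants:

```typescript
// Illustrative type-tag decoding with implicit application: the `list`
// and `pair` tags directly consume their argument type(s) from the
// stream, with no separate "type application" tag. Tag numbers here
// are placeholders, not the real UPLC tag assignments.
type Ty =
    | { k: "integer" } | { k: "bytestring" } | { k: "bool" }
    | { k: "list"; of: Ty }
    | { k: "pair"; fst: Ty; snd: Ty };

function decodeTy(tags: number[], pos = { i: 0 }): Ty {
    const tag = tags[pos.i++];
    switch (tag) {
        case 0: return { k: "integer" };
        case 1: return { k: "bytestring" };
        case 4: return { k: "bool" };
        case 5: // list: exactly one argument type follows implicitly
            return { k: "list", of: decodeTy(tags, pos) };
        case 6: // pair: exactly two argument types follow implicitly
            return { k: "pair", fst: decodeTy(tags, pos), snd: decodeTy(tags, pos) };
        default: throw new Error("unknown type tag: " + tag);
    }
}
```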
### ```data``` serialization
All the effort of minimizing the size of on-chain scripts by preferring ```flat``` over ```CBOR``` serialization is ignored when it comes to ```data``` serialization.
Yes, it would be good to have a more efficient encoding of Data rather than just wrapping the CBOR representation. However, I think the reason for this is that Data is used elsewhere on the chain (including things that are passed to validator scripts as parameters), and CBOR is the preferred format there. It might be difficult to switch because of this.
Conversion from CBOR to a specific data encoding can be done by the node prior to the value being passed as an argument; CBOR encoding is definitely too expensive to be included in a script.
@michaelpj I will work on a CIP implementation these days to have a comparison in terms of size
This implies ```(~1) + #chunks + 1``` meaningless bytes are added per ```ByteString```
In the description above:
- step ```1``` allows for an easy serialization and deserialization but doesn't carry any meaningful information; given the importance of ByteStrings it should be removed at the cost of an added shift while serializing/deserializing the value
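For reference, a hedged sketch of the chunked layout being described; the exact chunk size and terminator are assumptions inferred from the text, not taken from the flat specification:

```typescript
// Sketch of the chunked ByteString layout described above (assumed details):
// after padding to a byte boundary (step 1), each chunk of at most 255 bytes
// is prefixed by a length byte, and an empty (0-length) chunk terminates
// the ByteString; hence roughly (~1) + #chunks + 1 bytes of pure overhead.
function encodeChunked(bytes: Uint8Array): number[] {
    const out: number[] = [];
    for (let i = 0; i < bytes.length; i += 255) {
        const chunk = bytes.subarray(i, i + 255);
        out.push(chunk.length, ...chunk); // length byte, then chunk body
    }
    out.push(0); // empty chunk marks the end of the ByteString
    return out;
}
```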
I think that might slow things down significantly, since I think we'd need to shift every byte in a bytestring if it didn't start at a byte boundary (or have I got that wrong?). It's important that on-chain deserialisation be as fast as possible, so the extra expense might be problematic. I'd like to see some benchmark results for this.
It might be possible to implement deserialization without shifting bytes for efficiency.
I'll leave here the plu-ts implementation of the readNBits function on which the deserialization process is based.
I believe a very similar result can be achieved in Haskell using the Data.Binary.Get monad.
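(The plu-ts code itself did not survive in this transcript; the following is an illustrative sketch of what a `readNBits`-style helper can look like, assuming MSB-first bit order within each byte. It is not the plu-ts implementation.)

```typescript
// Illustrative sketch (not the plu-ts implementation): read `n` bits
// starting at absolute bit offset `bitOffset` of a byte buffer,
// touching only the bytes that actually contain the requested bits
// rather than shifting the whole buffer.
function readNBits(buf: Uint8Array, bitOffset: number, n: number): number {
    let result = 0;
    for (let i = 0; i < n; i++) {
        const bit = bitOffset + i;
        // select the byte containing this bit, then the bit inside it
        const b = (buf[bit >> 3] >> (7 - (bit & 7))) & 1;
        result = (result << 1) | b;
    }
    return result;
}
```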
2) as many bytes as specified in the Unsigned Integer at step 1
```
Using the new algorithm the ```ByteString``` serialization space complexity goes from ```O(n)``` to ```O(log n)``` where ```n``` is the number of bytes in the ```ByteString```
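To make the comparison concrete, here is a hedged sketch of the length-prefixed layout described in the two steps above, assuming a 7-bits-per-byte variable-length unsigned integer with the high bit as a continuation flag (an assumption about the intended integer encoding, not a detail from the CIP):

```typescript
// Sketch of the proposed ByteString layout (integer encoding assumed):
//  1) the length as a variable-length unsigned integer: 7 data bits per
//     byte, most significant bit set on every byte except the last,
//  2) the raw bytes themselves, with no chunking and no terminator.
// The per-ByteString overhead is just the prefix, which grows with O(log n).
function encodeByteString(bytes: Uint8Array): number[] {
    const prefix: number[] = [];
    let len = bytes.length;
    do {
        const septet = len & 0x7f;
        len >>= 7;
        prefix.push(len > 0 ? septet | 0x80 : septet); // continuation bit
    } while (len > 0);
    return [...prefix, ...bytes];
}
```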
Hmmm. I need to think more carefully about this. It's a while since I've thought about bytestring encodings and I'll need to remind myself of the details of how we currently do it.
The ideas here aren't nonsensical, but I'd like to see some benchmark results to see how they affect (a) the space taken up by serialised scripts, and (b) deserialisation times. I think that the benefits would have to be quite big in order to accept this.
I'll add that we're aware that there's room for improving things, but the reason that things are the way that they currently are is because we're using a pre-existing library and that's the way that it does things. We're definitely not averse to changing things, but then we'd need to maintain our own version of the flat library.
Apologies for a slightly hurried review: this had slipped off the bottom of my list. I'll think a bit more about this, in particular the bytestring encoding.
Wondering if this is still an issue and needs to remain open given the advancements of Plutus v2/v3 and Plu.TS, Aiken, et al. in the ensuing years since this CIP was opened. @michele-nuzzi what do you think?
Theoretically it could still be something worth exploring. Even with Aiken and plu-ts reducing size as much as possible, devs still tend to max out script size in an effort to reduce recursion and handle many inputs. However, I or my team are not able to focus on this in the short term. If someone else wants to take over the CIP I'm ok with that.
thanks for the update @michele-nuzzi ... @Ryun1 @Crypto2099 I'll remove this from the
This CIP proposes some changes to the UPLC AST serialization in order to reduce the size of the serialized output.