CIP-0046? | Merkelised Plutus Scripts #385
Conversation
This proposal is a low-cost way to allow us to design protocols autonomously making use of staking validators. For that alone it is worth pushing forward on, as staking validators are currently not very useful apart from one fancy technique. I'm also thinking it may find other uses, but I'll sit on it for a while.
Withdrawn my CIP in favour of this.
### Relation with BIP-144

BIP144 uses this trick to avoid submitting the parts of the script that aren't used.
Given that reference scripts are common in Haskell, this isn't a big win for efficiency,
> Given that reference scripts are common in Haskell
Not sure what this means, do you just mean "Given that Cardano supports reference scripts"?
brainfart
BIP144 uses this trick to avoid submitting the parts of the script that aren't used.
Given that reference scripts are common in Haskell, this isn't a big win for efficiency,
but it might be worth implementing for the sake of scripts used only once.
This CIP however doesn't require that that be implemented.
We looked into MAST during the development of Plutus Core, but we concluded that it wasn't worth it because the size of the hashes corresponding to omitted subtrees cancelled out the saving from omitting the subtree. You can read some notes on it here: https://github.com/input-output-hk/plutus/tree/master/doc/notes/plutus-core/Merklisation
Yes, we tried a very similar Merklisation scheme, but for different reasons. We were looking at ways to reduce script sizes and the idea of using Merklisation to let us omit unused parts of the AST in fully applied validators seemed promising. It turned out that that involved replacing subtrees of the AST with hashes which were large (32 bytes) and incompressible, and that meant that we couldn't get any worthwhile size reductions, so we abandoned that idea. However that was for an entirely different purpose, so I don't think it's too relevant here.
However, it is arguably not the optimal solution due to the reference
script problem described above. Even if the reference script problem
is solved as described above, it seems logical to allow supplying a datum
to a staking validator, or somehow combining the payment address and staking address for scripts,
The problem with supplying a datum to anything is where does the datum live? For a validator script it lives on the output. Where could it live for a staking validator? If we can come up with a sane answer to that, then in principle we could just give staking validators datums.
### Staking

This makes staking validators much more powerful, since a single protocol can
now manage many rewards accounts (by instantiating the script with a numeric identifier).
Please can you write out this use case in more detail. You've alluded to it a few times but I'd really like to see more detail because I'm not familiar with it and I'm trying to back-infer the actual details, probably wrongly. And it seems to be the load-bearing example here.
Will do
This CIP however doesn't require that that be implemented.

The argument for privacy doesn't apply, private smart contracts can be achieved through
the use of non-interactive zero-knowledge probabilistic proofs.
Not today they can't. So I think it is still quite relevant.
Wdym? They definitely can once we have at least bitwise primitives.
### Reference scripts

Currently, different instances of the same script will need their own reference inputs
since their hashes are different. It seems feasible to allow sharing of a single reference script,
... or they can put them in the datum?
- A script address + datum can't fit in an address,
  if you want that you also need this (or need to change what an address is).

## Specification
This would also need changes to the ledger spec. At the moment, the ledger doesn't deserialise Plutus scripts at all, it passes them to the evaluator still serialised, and despite this it can still hash them etc, straightforwardly. This CIP would probably require changing that in the spec and the implementation, so that the ledger has deserialized scripts around (one reason for this is that deserialization can fail, whereas hashing cannot). It might be good to have at least a sketch of those changes here.
I also don't know whether it violates any principles of the ledger to not have the hash of an item be the hash of its serialised bytes. I think that's true for everything else, it's possible that there's a reason for that (e.g. making it possible to check hashes without having to know the serialization).
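A rough sketch of the type-level consequence being described, just to make the new failure point visible (illustrative names only, not the actual cardano-ledger API):

```haskell
import Data.ByteString (ByteString)

newtype ScriptHash = ScriptHash ByteString
data DecodeError   = DecodeError String
data Term          = Term  -- stands in for the deserialised AST

-- Today (roughly): the ledger hashes the still-serialised bytes,
-- which cannot fail, so it never needs the decoded script to hash it.
hashScriptBytes :: ByteString -> ScriptHash
hashScriptBytes = ScriptHash  -- placeholder for the real hash

-- Under this CIP: the hash is defined over the AST, and
-- deserialisation can fail, so the ledger must keep decoded scripts
-- around (or surface decode errors wherever it hashes).
deserialiseScript :: ByteString -> Either DecodeError Term
deserialiseScript _ = Right Term  -- placeholder

merkleScriptHash :: Term -> ScriptHash
merkleScriptHash _ = ScriptHash mempty  -- placeholder
```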
currently has 8 constructors. On-chain, annotations are always the unit type,
and are hence ignored for this algorithm. Each case/constructor is generally handled by
hashing the concatenation of a prefix (single byte corresponding to the
constructor index) along with the hashes of the arguments passed to the constructor.
This is slightly different to what @kwxm wrote here (https://github.com/input-output-hk/plutus/blob/master/doc/notes/plutus-core/Merklisation/Merklisation-notes.md#modified-merklisation-technique), which I think also included the serialized versions of the nodes in the value that gets hashed. Not sure if that's important, Kenneth do you remember?
> which I think also included the serialized versions of the nodes in the value that gets hashed. Not sure if that's important
I'm not quite sure what you mean. It talks about "[serialising] all of the contents of the node into bytestrings", but I think by "contents" I meant all of the fields (things like variable names) except subnodes: you wouldn't serialise those and calculate hashes, but instead recursively apply the Merkle hash process. I think the overall process is basically similar to what's going on here.
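For concreteness, here is a minimal sketch of the constructor-prefix scheme being discussed, over a cut-down untyped term (the prefix bytes are illustrative rather than values the CIP fixes, and `blake2b_256` is wrapped from the cryptonite package):

```haskell
import Crypto.Hash (Blake2b_256 (..), hashWith)  -- cryptonite
import Data.ByteArray (convert)
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS

blake2b_256 :: ByteString -> ByteString
blake2b_256 = convert . hashWith Blake2b_256

-- Cut-down untyped term; the real UPLC Term has 8 constructors.
data Term
  = Var ByteString    -- serialised de Bruijn index
  | LamAbs Term
  | Apply Term Term
  | Error

merkleHash :: Term -> ByteString
merkleHash t = case t of
  -- Leaves: hash the constructor prefix plus any serialised payload.
  Var ix    -> blake2b_256 (BS.singleton 0x00 <> ix)
  Error     -> blake2b_256 (BS.singleton 0x06)
  -- Interior nodes: hash the prefix followed by the children's hashes,
  -- making a script's hash a function of its immediate children's hashes.
  LamAbs b  -> blake2b_256 (BS.singleton 0x01 <> merkleHash b)
  Apply f x -> blake2b_256 (BS.singleton 0x02 <> merkleHash f <> merkleHash x)
```

Unlike hashing the serialisation in one go, the hash of `Apply f x` is computable from the hashes of `f` and `x` alone, which is what the instantiation use case relies on.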
In pseudocode: `hash $ prefix <> blake2b_256 (serialiseData d)`

## Rationale
We might need some discussion of the cost of this kind of hashing. Our experiments suggested it was ~10x more expensive (https://github.com/input-output-hk/plutus/blob/master/doc/notes/plutus-core/Merklisation/Merklisation-notes.md#the-cost-of-calculating-merkle-hashes), unclear if this will have a meaningful impact but it might.
I think that the potential cost of this is my main concern. Calculating the hash involves traversing the entire AST (although as the document points out it can be fused with the deserialisation process), but also calling the underlying hash function(s) at every node, which could become expensive compared with just feeding the serialised script directly to a hashing function in one go. I'd really like to see some figures for this: it's conceivable that computing the Merkle hash might be more expensive than executing the actual scripts, and that might make this proposal impractical. The estimates from our earlier experiments (which were maybe three years ago) were entirely theoretical though, and things have changed a lot since we did that: for one thing, we're using `flat` instead of CBOR now, which makes the serialised scripts a lot smaller. I think some experiments would really be needed to decide whether the extra cost is a real issue.
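For reference, such figures could be gathered with a criterion comparison along these lines (a sketch only: it assumes a `merkleHash` like the one sketched above, a `serialise :: Term -> ByteString`, and a representative `script`):

```haskell
import Criterion.Main (bench, defaultMain, nf)

-- Compare hashing the serialised script in one go against the
-- node-by-node Merkle hash; `nf` fully forces the resulting
-- ByteString (its NFData instance comes with the bytestring package).
main :: IO ()
main = defaultMain
  [ bench "hash of serialisation" $ nf (blake2b_256 . serialise) script
  , bench "Merkle hash of AST"    $ nf merkleHash script
  ]
```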
it is not clear how much/less to Merkelise the hashing.
E.g., the hashing of data itself could be Merkelised. This is not done in this CIP.
The hashing of a `Data` constant could also prepend the prefix directly to the serialisation,
rather than to the hash of the `Data`. It is not clear what is best.
I think stopping the merkelization at the constants is the right place.
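For reference, the two options for a `Data` constant being weighed here look roughly like this (a sketch; `Data` and `serialiseData` are the existing builtin notions, and `conPrefix` is an illustrative constant prefix):

```haskell
conPrefix :: ByteString
conPrefix = BS.singleton 0x04  -- illustrative value only

-- Variant in the CIP: the prefix is prepended to the *hash* of the
-- serialisation.
hashDataConstant :: Data -> ByteString
hashDataConstant d = blake2b_256 (conPrefix <> blake2b_256 (serialiseData d))

-- Alternative mentioned: the prefix is prepended directly to the
-- serialisation, saving one hash invocation per constant.
hashDataConstant' :: Data -> ByteString
hashDataConstant' d = blake2b_256 (conPrefix <> serialiseData d)
```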
Hence, they have been included.
They use Merkle-tree hashing since that's the simplest and most useful in this case.

## Path to Active
I think this should have some Acceptance Criteria a la the new CIP-001. Perhaps:
- The ledger specification is updated as necessary
- The Plutus Core specification is updated with the new hashing scheme
- Performance assessment has been performed
- Necessary hashing builtins have been added to PLC and costed
- Example use cases have been validated to run within an acceptable budget, considering the increased use of hashing builtins
This seems reasonable to me, but calculating a few hashes (see example pseudocode) is well within the budget last time I checked.
> calculating a few hashes (see example pseudocode) is well within the budget last time I checked.
Is that really true? If you're referring to the pseudocode here (under Rationale), then you need the hashes `original` and `script`, and I think those have to be calculated on the chain (or at least one of them does, no?), so there's a potentially large cost that has to be paid before you even get to that pseudocode.
`original` and `script` are constants here. `original` is a fixed script, and can thus be computed beforehand and inlined into the script. `script` comes from `ScriptContext`.
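To make the constant-folding concrete, here is a sketch of the kind of check the Rationale's pseudocode presumably performs, reusing the `merkleHash` sketch from earlier (`applyPrefix` and the names are assumptions, not the CIP's normative values):

```haskell
applyPrefix :: ByteString
applyPrefix = BS.singleton 0x02  -- illustrative Apply prefix

-- Is scriptHash the Merkle hash of `original` applied to `arg`?
-- originalHash is baked in at compile time, scriptHash comes from the
-- ScriptContext, so only the argument's hash is computed on-chain.
isInstantiation :: ByteString -> ByteString -> Term -> Bool
isInstantiation scriptHash originalHash arg =
  scriptHash == blake2b_256 (applyPrefix <> originalHash <> merkleHash arg)
```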
This may be a n00b question and I may have forgotten my PL coursework, but doesn't this make the hash of the script reliant on the parser/lexer now? This is not worse than the way it works for compilation today, I am just wondering if the exact same AST will be generated between different platform bindings (e.g., Plutus/Haskell, Helios, Aiken) or if the writer/deployers will have to be cognizant of the platform choice.
@michaelpj
@matiwinnetou why can't they use a datum?
Thanks for the review, Michael.
The basic idea here looks sensible, but I'd like to see some figures comparing the time taken to compute the Merkle hash of a script with the time taken by the current method (computing the hash of a serialised script), and perhaps with the time taken to actually evaluate the script. There's a cost that has to be paid somewhere and I'm worried that it might be prohibitively expensive. I could be totally wrong about that though! We really need to see some numbers. We have a bunch of validation scripts here which we use for benchmarking and which we believe to be reasonably realistic: they'd be good candidates for benchmarking Merklisation costs.
Apart from that, the main things that I find myself wondering about are (a) how much this will affect the work that the node needs to do preparing a transaction for validation, and (b) how compelling the use case given here is in comparison with existing techniques (and with forthcoming extensions to the ledger model). I'll leave those issues to better-qualified people though.
The universe of types used on-chain is always `DefaultUni`.
Each possible data type is handled differently, with each having
a different prefix. The total number of prefixes does not exceed
255. If it did, the prefix would have to be increased to two bytes.
Is it 255 or 256? I think any unsigned byte is a valid prefix, but I could be wrong.
You are right. I'm dumb.
The serialisation according to [CIP-58](https://github.com/cardano-foundation/CIPs/blob/a1b9ff0190ad9f3e51ae23e85c7a8f29583278f0/CIP-%3F/README.md#representation-of-builtininteger-as-builtinbytestring-and-conversions),
prefixed with the two-byte prefix, is hashed.

In pseudocode: `hash $ prefix <> prefix' <> serialiseCIP58 n`
What's going on here? Is it that `prefix` tells you that you've got an integer and `prefix'` tells you the sign?
Oh yes, I guess that's what the sentence on line 121 means.
Actually, I think this is a mistake. This is a previous scheme I had, but there's no reason not to collapse it into one byte.
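Collapsed into a single byte as concluded here, the integer case would presumably become (a sketch; `serialiseCIP58` stands for the CIP-58 integer-to-bytestring conversion, and `integerPrefix` is illustrative):

```haskell
integerPrefix :: ByteString
integerPrefix = BS.singleton 0x03  -- illustrative value only

hashIntegerConstant :: Integer -> ByteString
hashIntegerConstant n = blake2b_256 (integerPrefix <> serialiseCIP58 n)
```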
but it has to be proven to be random, hence hashing the prefix byte
is the best option.

In pseudocode: `hash prefix`
At first I found it a little confusing that everything used `prefix` (further complicated by the fact that earlier on it mentions that there's a version prefix attached to serialised Plutus Core scripts). It might be clearer if it said `error_prefix`, `lamabs_prefix` and so on, like it does later. You could even propose concrete values for the prefixes and use those. We might introduce more `Term` constructors in the future, but I don't think that's a problem.
Yes.
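Named prefix constants in the suggested style, matching the illustrative bytes used in the earlier sketch (the concrete values would still need to be fixed by the CIP):

```haskell
varPrefix, lamAbsPrefix, applyPrefix, errorPrefix :: ByteString
varPrefix    = BS.singleton 0x00
lamAbsPrefix = BS.singleton 0x01
applyPrefix  = BS.singleton 0x02
errorPrefix  = BS.singleton 0x06

-- The quoted `hash prefix` for Error then reads:
hashError :: ByteString
hashError = blake2b_256 errorPrefix
```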
The hash of a `Builtin` is the hash of the prefix prepended to the base-256 encoded
(i.e. serialised to bytestring) index of the built-in function.
Because there are less than 256 built-ins, this is currently the same
Less than 256 or less than 257? I think that if we had 256 you could still get away with one byte here.
257
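With fewer than 257 builtins the base-256 index fits in one byte, so this case reduces to (a sketch; `builtinPrefix` is an illustrative value):

```haskell
import Data.Word (Word8)

builtinPrefix :: ByteString
builtinPrefix = BS.singleton 0x05  -- illustrative value only

-- Builtin indices currently fit in 0..255, i.e. a single byte.
hashBuiltin :: Word8 -> ByteString
hashBuiltin i = blake2b_256 (builtinPrefix <> BS.singleton i)
```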
---
CIP: ?
Title: Merkelised Plutus Scripts
To be annoyingly pedantic, I'll point out that the process is named after Ralph Merkle, so it's Merklisation (or Merklization) rather than Merkelisation (which sounds like something from German politics).
I realise this. I had thought of this, but Merklisation looks odd.
For the efficiency argument, could you, instead of merklising every node of the AST, make each node optionally Merkle-hashed or just the hash of its serialisation? Since usually specific parts of the tree need to be referenced (such as in the case of parameterisation), fully merkling is wasteful, but with partial merklisation you could introduce a 'merkle' keyword in whatever high-level language, erase it from the UPLC, and use it to generate a tree. This also means the tree used for this can be provided on-chain for referencing, for example so you can identify a parameterised contract without actually knowing the parameters.
There are two ways of doing this:

The structure of the Flat encoding is thankfully quite simple. One thing I'm wondering about is, will you ever hash a script without also running it? If you will run it, you have to decode it anyway. Assuming that there are instances where you don't currently have to run them (and thus decode them),

For the latter, e.g. rather than hashing an integer the way described in the CIP,

Hashing a

Going back to @micahkendall's idea, perhaps we can make it doable:

This scheme is a bit complicated, but AFAICT would keep hashing just as fast for all existing scripts. As for hashing the constants, that would probably be kept as described in the CIP for constants inside

This scheme however seems more complicated than what's described in this CIP, and in the end I have to run some benchmarks to see in practice how efficient Merkle hashing can get. I will work on this when I find time; I suppose the CIP is blocked on that until then.
You may be aware of this already, but FYI there's a very detailed specification of the flat format as it's used for Plutus Core in Appendix E of the draft of the updated Plutus Core specification here.
I've thought about this for a bit, and come to the conclusion that the scheme described in the previous comment is in fact flawed. However, I think the original scheme in the CIP is in fact not problematic. When we attach a reference script to a UTXO, the hash verification can never fail, because the hash, rather than being verified, can be computed from the attached script. The node can then cache the computation of the hash of this script and store it as part of the UTXO (if this is not already done). The other case to consider is when we need to pass in a witness/concrete script that matches a script hash. There are two broad fixes to this:

With this, AFAICT, performance is no longer a problem. I am fine with either of the above two solutions. 2) might be less disruptive, but 1) seems cleaner. Thoughts? @kwxm @michaelpj and others
I'm not a huge fan of either of those:

Plus both of these would be quite a bit of work for the ledger. So it doesn't seem worth it to me. I wonder if there's a more focussed change that would get you what you want. This changes the entire means of hashing scripts in order to find out whether a script is applied to something. We're not really using all that power! An alternative would be to have something like this:

I don't particularly love this either, but I think it's at least worth thinking about alternatives that aren't so invasive.
I don't really know much about hashing so this question is more out of curiosity. Why is merklisation better than allowing parameterized reference scripts (#354)? It seems that there isn't a consensus on whether the merklisation can be done efficiently enough. Does anyone have an idea of what the resource cost would be of allowing a custom parameter in reference scripts?
---
CIP: ?
Suggested change:

```suggestion
CIP: 46
```
@michaelpj That seems feasible. I don't see any issues with that design, and though it's effectively "1 level" of Merkle hashing, it seems to be powerful enough for almost all use cases? The use case this wouldn't cover is passing in an arbitrary script on-chain, then applying a parameter to it, because that arbitrary script may already be using 1 level of Merkle hashing. IMO though, checking the script hash should be phase 2 anyway, and so should anything that doesn't have to be in phase 1 (though this may be due to ignorance on my part). I am however fine with implementing this proposed solution. It seems simple to implement.

One question that remains is, should the

In that case, do we need to increment the Plutus language version? Seems like a hard fork would be enough.
I don't see why. All other hash checking is phase 1. Phase 2 only exists for script evaluation, because it's so expensive.
To be clear, that was a straw proposal. I don't really like it either. I just wanted to encourage the search for more ideas.
I had a long talk with @micahkendall, and I believe that what you proposed is more than sufficient, in addition to being the optimal solution. One minor change, however, is that it probably makes sense to have it be

```haskell
data PlutusScript = JustScript ActualPlutusScript | AppliedScript ActualPlutusScript Data
```

Morally, this applied argument is similar to redeemers and datums, and hence should be in the same format. Everything in the motivation can still be done with this scheme. What do you think? @michaelpj
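For what it's worth, a minimal sketch of how hashing such a value could work, assuming the applied case hashes a tag byte together with the underlying script's hash and the datum's hash (the tag value, `hashScriptBytes`, and `serialiseData` are assumptions, not part of the proposal):

```haskell
scriptHash :: PlutusScript -> ByteString
scriptHash (JustScript s) =
  hashScriptBytes s  -- unchanged: the hash of the serialised script
scriptHash (AppliedScript s d) =
  -- one level of Merkle hashing: the pair's hash is computable from
  -- the script hash and the datum hash alone
  blake2b_256 (BS.singleton 0x01 <> hashScriptBytes s <> blake2b_256 (serialiseData d))
```

This keeps hashing of unapplied scripts exactly as fast as today, which is the property the earlier cost discussion worries about.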
@L-as if this submission is not
Currently, the hash of a script is simply the hash of its serialisation. This CIP proposes changing this such that the hash of a script (term) is a function of its immediate children's hashes, forming a Merkle tree from the AST. This allows one to shallowly verify a script's hash, and is useful on Cardano, because it allows scripts to check that a script hash is an instantiation of a parameterised script. In addition, a `blake2b_224` built-in function must be added.

This is inspired by BIP-144, but the motivations are very different.

Rendered