Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query currentProtocolParameters fails against 8.0.0 node #314

Closed
AndrewWestberg opened this issue May 18, 2023 · 26 comments
Closed

Query currentProtocolParameters fails against 8.0.0 node #314

AndrewWestberg opened this issue May 18, 2023 · 26 comments
Labels
infrastructure server Issues which regard the server

Comments

@AndrewWestberg
Copy link

What Git revision / release tag are you using?

Ogmios 5.6.0

Do you use any client SDK? If yes, which one?

Kogmios

Describe what the problem is?

The underlying cbor of the response between node and Ogmios has changed.
maxCollateralInputs used to be in a cbor integer value outside of the array returned from the node. This has been moved inside the array (bug fixed).

Cbor indexes for certain parameters coming back in the payload have changed
Protocol Version is now an array instead of two separate integer numbers:
protocolMajorVersion was: index 12... now, index 12[0]
protocolMinorVersion was: index 13... now, index 12[1]

Any index that was over 13 is now off by 1. for example
utxoCostPerByte was: index 15... now, index 14

What should be the expected behavior?

Well, it's probably going to take a special ogmios release to fix it. The node team obviously broke the contract of communicating with the node on these given miniprotocols.

If applicable, what are the logs from the server around the occurrence of the problem?

image

@JaredCorduan
Copy link

I suspect that this is the same issues that was going to be resolved by IntersectMBO/cardano-node#5100.

It was closed only because it targeted a branch that was merged.

@AndrewWestberg
Copy link
Author

@JaredCorduan What is the path forward here? Should we fix all downstream tools like Ogmios, or wait for an 8.0.1 patch to the node that restores the miniprotocol version api contract?

@JaredCorduan
Copy link

@AndrewWestberg I am personally working on node 8.1.0 myself now. See IntersectMBO/cardano-node#5243. I think I am through all real the integration work, and am just waiting on things like a now consensus release and updating packages and rebasing, etc. So I think it makes sense to wait on 8.1.0.

@disassembler
Copy link

@AndrewWestberg can you test that PR out and see if it addresses your problems?

@lehins
Copy link

lehins commented May 19, 2023

I actually know the reason for this failure. Serialization for PParams was fixed in 8.0.0. Before it was serialized as invalid CBOR. Here is the commit that fixes it: IntersectMBO/cardano-ledger@e015e09

In hindsight, I think there was a graceful way to fix this bug by using a different serializer in the query implementation and adding a new query with fixed serialization, but I suspect it is too late now, since serialization has already changed. Downstream tools will need to adjust their deserializers

@JaredCorduan
Copy link

@lehins is totally right, I was wrong. Folks will need to adjust to 8.0.0, apologies :(

@AndrewWestberg
Copy link
Author

@JaredCorduan @lehins
I think delegationsAndRewards is also busted. Was that also "invalid" cbor? To be fair, the cbor in previous releases was not invalid, it was just two cbor objects instead of the array wrapping everything into a single object.

Even if the cbor had bugs before, you should respect the protocol version contracts instead of having Ogmios, phyrhose, TxPipe, Goroboros, and everybody else downstream code work-arounds.

@KtorZ
Copy link
Member

KtorZ commented May 19, 2023

@lehins Before it was serialized as invalid CBOR

That is untrue. The protocol version used to be serialised as a list of two integers which is perfectly valid CBOR. It might not be what you would semantically expect or not the most elegant choice for the structure but, from a binary standpoint, no one cares -- it is valid. (edit: indeed, it seems like the protocol version, a tuple of number, was serialised as a sequence of numbers without any CBOR major type in the middle of a map sequence; causing an off-by one error on the map sequence).

Though we do care that the format has now changed in an incompatible manner.

@lehins Downstream tools will need to adjust their deserializers

I don't think this is an acceptable outcome. The reason being that the ouroboros mini-protocols and the entire framework around it are built to be multi-era and multi-codecs capable. There's an inherent complexity coming with that framework. Client applications are performing a handshake with the node to negotiate precise protocol versions. Specific codecs are associated to versions. So, it is reasonable from client applications to expect servers (i.e. the node) to honor the other part of the protocol. If not, then let's please ditch the entire mini-protocol framework and use something a lot simpler and more adapted where breaking changes like that are meant to happen.

@JaredCorduan I was wrong. Folks will need to adjust to 8.0.0, apologies :(

I don't think you were wrong no. Changing the serialization back to what they were is the right call. If the format was different from what it should've been then it is actually too late to change it. Or if changed, then it needs to go hand-in-hand with the multi-version codecs supported by the Ouroboros framework.

@coot
Copy link

coot commented May 19, 2023

I agree that we need to fix the encoding. The whole contract of negotiating a given NodeToClientV_X is that each mini-protocol & its encoding are backward compatible (thus immutable).

@KtorZ versioning we do is a very standard thing, and it doesn't even have anything to do with mini-protocols (/ typed-protocols). Furthermore, structure gives simplicity and that's been our aim since the beginning.

@KtorZ
Copy link
Member

KtorZ commented May 19, 2023

@coot structure gives simplicity and that's been our aim since the beginning.

Well, I think that from a client perspective, we kind of missed the aim for simplicity 😅 ... I've never encountered something as complicated as the mini protocol to use as a client application in my entire engineering career. I can fathom what I get from that complexity, the immutability, version negotiation and the correct-by-construction protocols so that's okay.

But if I am now being told that I can't even get the one thing I was promised in exchange of that complexity, I am -- reasonably I believe -- a little disappointed.

@coot versioning we do is a very standard thing, and it doesn't even have anything to do with mini-protocols

I am not sure to follow your thoughts here? Surely, codecs are currently versioned -- or at least pretend to be? For example, this is how Ogmios currently selects codecs using methods and data-types from the ouroboros-network libraries:

codecs
    :: forall m. (MonadST m)
    => EpochSlots
    -> NodeToClientVersion
    -> ClientCodecs (CardanoBlock StandardCrypto) m
codecs epochSlots nodeToClientV =
    clientCodecs cfg (supportedVersions ! nodeToClientV) nodeToClientV
  where
    supportedVersions = supportedNodeToClientVersions (Proxy @(CardanoBlock StandardCrypto))

@JaredCorduan
Copy link

The part which was incorrect CBOR was a list of length 4, but where the CBOR claimed it was a list of length 5.

@lehins
Copy link

lehins commented May 19, 2023

To be fair, the cbor in previous releases was not invalid, it was just two cbor objects instead of the array wrapping everything into a single object.

@AndrewWestberg It actually was invalid, that's how I discovered it. Generic cbor decoder was choking on the PParams, because number of elements was reported one smaller than there was actually number of elements in the list.

Even if the cbor had bugs before, you should respect the protocol version contracts instead of having Ogmios, phyrhose, TxPipe, Goroboros, and everybody else downstream code work-arounds.

I totally agree with you. This breakage was unintentional., That's why I said: "I think there was a graceful way to fix this bug, but..."

@lehins
Copy link

lehins commented May 19, 2023

That is untrue. The protocol version used to be serialised as a list of two integers which is perfectly valid CBOR. It might not be what you would semantically expect or not the most elegant choice for the structure but, from a binary standpoint, no one cares -- it is valid.

@KtorZ This is true. It was wrong, not because protocol version was unrolled, but it was wrong because this unrolling affected the total number of elements in the list without adjusting the encoded value for the length of the list. I am actually surprised that no-one in the community discovered this.

@lehins
Copy link

lehins commented May 19, 2023

I agree that we need to fix the encoding.

@coot You mean that we need to revert to the broken encoding in order to support downstream users of the query.

I have no problem with that. If that's what people need, I am not gonna stand in the way. However, if that's what we wanna do then we also have to deprecate this query and implement a new one that has the correct encoding.

@KtorZ

But if I am now being told that I can't even get the one thing I was promised in exchange of that complexity, I am -- reasonably I believe -- a little disappointed.

Don't be so sensitive and get disappointed so quickly, nothing has been decided yet. We are having a discussion here. This query breakage was unintentional, I apologize for that. But the bug was real and needed to be fixed. Problem is that the serialization bug was in ledger, while the breakage happened in consensus. The take away from this should be that we need to invent a mechanism to protect us from this happening in the future.

@KtorZ
Copy link
Member

KtorZ commented May 19, 2023

I mean, this isn't a first time a "bug is fixed" and things break down the line. I'd hope that by now, we had a better process in place to prevent this kind of situation from happening a lot earlier.

A software version has been tagged and released, and people should reasonably expect to use this as is. This one particular issue is low impact because it only affects state queries as far as I can tell but what if it had affected the block serialisation format? I am puzzled that discrepancies like this can surface all the way without being detected prior to a release. So, yes, I am a little sensitive about this now. I mean not to sound aggressive or anything like this, but that's somehow hard to capture on a written form. I am just puzzled and worried.

Note that one way to have caught earlier would have been to have cddl specification for the state queries, and test against the specification. I believe we've been gently asking for these for a while in the community but priorities have been elsewhere.

@KtorZ
Copy link
Member

KtorZ commented May 19, 2023

@lehins I believe that this particular issue can be "fixed" by simply bumping the size of the serialized map by one (and reverting the change). This makes the PParams serialization valid, yet preserves the on-the-wire format for client applications and existing decoder where the protocol version is inlined as two fields. The reason why no one bumped into this earlier is probably because they're either using the ledger as a library directly (e.g. Ogmios, Hydra, db-sync); or because no one have written a generic cbor decoder for protocol parameters and have written them by hand; thus likely ignoring the encoded length altogether.

That doesn't quite solve: #313 however, which I believe to be quite closely related. But for another encoder.

@lehins
Copy link

lehins commented May 19, 2023

This one particular issue is low impact because it only affects state queries as far as I can tell but what if it had affected the block serialisation format?

Ledger is fully responsible for [de]serialization of everything that goes on chain. That's why we have cddl spec and a whole lot of tests that ensure we don't break it. In fact we recently introduced versioned serialization for all types that live on chain, so we can gracefully fix bugs and deprecate undesired features when going from one era to another. With respect to serialization of queries, we don't have the same scrutiny, unfortunately. In my opinion the problem lays in the separation of concerns: consensus is responsible for the queries, while serialization of the types returned by queries live in ledger. This makes it incredibly hard for ledger team to know serialization of which types from ledger state is allowed to change and which ones cannot be changed. Now that it is on my radar I'll make sure we'll have this problem resolved

Note that one way to have caught earlier would have been to have cddl specification for the state queries, and test against the specification. I believe we've been gently asking for these for a while in the community but priorities have been elsewhere.

I 100% support you on this. Considering the issue that just happen, we should give it a higher priority.

I believe that this particular issue can be "fixed" by simply bumping the size of the serialized map by one (and reverting the change).

I do like this suggestion. This would allow us to avoid deprecation of the old query, while fixing the bug of invalid CBOR. Technically it would still be a breaking change to the query, but the impact would be minimal, as you very well alluded. It does have another small downside of not matching the way protocol version is stored in PParamsUpdates, but I don't think it is terribly important.

I am not quite sure what's the reason for #313 That query hasn't changed, nor its serialization. That requires some investigation.

@coot
Copy link

coot commented May 19, 2023

I have no problem with that. If that's what people need, I am not gonna stand in the way. However, if that's what we wanna do then we also have to deprecate this query and implement a new one that has the correct encoding.

@lehins we can change the encoding, as long as a new NodeToClientVersion was introduced. If that was the case, then the tooling needs to upgrade; if we failed to upgrade the version we need to fix it 8.0, bump version of the node-to-client protocol in 8.1 and then use the new encoding. I assumed it's the later case, but I am not that sure now, somebody needs to investigate it.

@KtorZ I won't argue but you're confusing simplicity with things being easy. We aim for the former not for the latter. And by the way in the context of cardano-wallet you haven't even been exposed to typed-protocols but to wrappers which hide quite a lot of its complexity. Deadlock freedom doesn't come for free, but the benefits are tremendous; if you had to analyse rare deadlocks by only having access to logs, you know what I mean.

@disassembler
Copy link

From my perspective reverting it doesn't solve the problem because end users still cannot use generic decoders. However; the fix in 8.0.0 is sub-optimal in changing the logic users have to use. The optimal solution is to keep the structure the same but prefix with 5 rather than 4 so it fixes the cbor decoder issues and causes minimal impact to end users. Would you agree with doing that @KtorZ To elaborate, what we propose is we switch it to:

(5, [a, b, c, d, e]) - generic parser can parse it and mostly backwards compatible with 1.35

Per @coot we should also increase the network protocol version to denote a breaking change.

@lehins
Copy link

lehins commented May 19, 2023

NodeToClientVersion was introduced

@coot NodeToClient was not introduced in 8.0, because we didn't realize this change in serialization of this query

@KtorZ
Copy link
Member

KtorZ commented May 19, 2023

@disassembler I believe you and I are saying the same thing.

@AndrewWestberg
Copy link
Author

This issue appears to be working with Ogmios 5.6.0 against cardano-node 8.1.1-pre.

@SebastienGllmt
Copy link

Cardano Preview network just hardforked to protocol version 9 so everybody is forced to upgrade to cardano-node v8 or higher, but this issue is still a blocker

@lehins
Copy link

lehins commented Jul 18, 2023

@SebastienGllmt This issue has been fixed in cardano-node-8.1, so I am not quite sure why it hasn't been closed yet.

Also, the fact that SanchoNet (preview) is on protocol version 9, doesn't force anyone to upgrade, only those who want to play around with it can do that. The Cardano TestNet is still on protocol version 8, so if you use it for testing and such, you can continue doing so.

@KtorZ
Copy link
Member

KtorZ commented Jul 18, 2023

@lehins i think 8.1 only fixed part of the problem. What I understood from @disassembler is that there are still changes wrt how the ledger now expects definite vs indefinite cbor structure. So it still fails.

This will get fixed with Ogmios' next upgrade. I was hoping to delay the integration with 8.x to later but it sounds necessary.

@lehins
Copy link

lehins commented Jul 18, 2023 via email

@KtorZ KtorZ added server Issues which regard the server infrastructure labels Jul 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infrastructure server Issues which regard the server
Projects
None yet
Development

No branches or pull requests

7 participants