Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Update ADR-040 to store hash(value) in SMT leaf #9680

Merged
merged 8 commits into from
Sep 15, 2021

Conversation

roysc
Copy link
Contributor

@roysc roysc commented Jul 12, 2021

Description

This revises ADR-40 to specify storing hash(value) as the SMT mapped value.
This should allow the same indexing functionality as using hash(key + value), while working much better with IPLD by storing only the hash of the mapped content.
Some discussion here: #9331 (comment)


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • included the correct type prefix in the PR title
  • added ! to the type prefix if API or client breaking change - N/A
  • targeted the correct branch (see PR Targeting)
  • provided a link to the relevant issue or specification
  • followed the guidelines for building modules - N/A
  • included the necessary unit and integration tests - N/A
  • added a changelog entry to CHANGELOG.md
  • included comments for documenting Go code - N/A
  • updated the relevant documentation or specification
  • reviewed "Files changed" and left comments if necessary
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed ! in the type prefix if API or client breaking change
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic
  • reviewed API design and naming
  • reviewed documentation is accurate
  • reviewed tests and test coverage
  • manually tested (if applicable)

@github-actions github-actions bot added the T: ADR An issue or PR relating to an architectural decision record label Jul 12, 2021
@robert-zaremba robert-zaremba self-assigned this Jul 12, 2021
Copy link
Contributor

@i-norden i-norden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the purposes of IPLD it is more conventional to store hash(value) so we can map it to an IPLD block that stores the value. But as I think about this more, perhaps we actually do want to include key information in the IPLD in this manner... hash(key || value) => key || value is still a proper content hash mapping since the key is included as part of the IPLD content. This would be a more compact way to store key information in the IPLD DAG, but if we switch to storing hash(value) as the leaf value we can still store the raw key in the IPLD DAG with a distinct IPLD that maps hash(key) => key (i.e. we IPLDize the inverse index).

Sorry for going back and forth on this @roysc! So I think the most compelling reason to switch to storing hash(value) as the leaf value is preventing collisions when key1 || value1 == key2 || value2 (e.g. a || bc == ab || c).

In any case I think we should explicitly define the tree nodes, including the node prefixes and hash function used. E.g.

  • An inner node is 0x01 || left_hash || right_hash.
  • A leaf node is 0x00 || path || leaf_value.
    • The leaf_value is the SHA_256(value).
    • The path is the SHA_256(key).
    • left_hash and right_hash are SHA_256(left_node) and SHA_256(right_node), respectively
    • || is byte concatenation


SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(key, value)` in a leaf. Since we don't know a structure of a value (in particular if it contains the key) we hash both the key and the value in the `SC` leaf.
SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(value)` in a leaf.
Copy link
Collaborator

@robert-zaremba robert-zaremba Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `hash(value)` in a leaf.
SMT is a merkle tree structure: we don't store keys directly. For every `(key, value)` pair, `hash(key)` is stored in a path (we hash a key to evenly distribute keys in the tree) and `0x00 || key || hash(value)` in a leaf.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed - in the leaf we need to commit to the key as well (it's not enough that it is in a path).

Copy link
Contributor

@i-norden i-norden Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hash(key)/path is stored in the leaf node, even if not as part of the "leaf value" (i.e. the value returned from calling Get on the SMT)

leaf node == prefix || path || leaf_value == prefix || hash(key) || hash(value_provided_to_the_SMT)

When calling Set(key, value_provided_to_the_SMT) it hashes the key into the path and the value_provided_to_the_SMT into the leaf_value. When calling Get(key) it returns value_provided_to_the_SMT.

Note that value_provided_to_the_SMT currently is hash(key || value_in_the_StateStore) so that when we call Get we retrieve hash(key || value_in_the_StateStore) which is the key we need for the (current) inverse index.

So the SMT leaf node is, in current practice, prefix || hash(key) || hash(hash(key || value_in_the_StateStore))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so we can add prefix + to my suggestion. We can use || operator instead of + if you prefer.

Note that value_provided_to_the_SMT currently is hash(key || value_in_the_StateStore)

Why is that? We should provide key and obj_value without modifying it. SMT will do all necessary operations.

Copy link
Contributor Author

@roysc roysc Jul 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was just a misunderstanding of the ADR language, I think. We didn't think it was describing the SMT's internal structure, since the hashed values are not exposed by the SMT interface (so we assumed the hashed value should be passed in). But we can add methods for that and fork the code if necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that? We should provide key and obj_value without modifying it. SMT will do all necessary operations.

Like Roy said it is because the current implementation does not expose the hashed values.

We did it like this so that when we would Get from the SMT we retrieved the value we needed for the old inverse index hash(key || value).

  1. Set takes a key and value, Get only returns the value provided to Set not some internal transformation of key and/or value- this would be really odd behavior for a Setter and Getter interface.
  2. If the value we provided to Set was the unhashed "value" (key || obj_value), then when we would Get from the SMT we would get that unhashed value (again, not some internal- hashed- transformation of the value we provided) and we would have to hash it again at the level above the SMT before we could use it in the inverse index. This would have worked but would mean we were duplicating hashing efforts at the two levels.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are on the same page. In my suggestion I added hash(key) to the leaf (I'm using + operator rather than ||). It seams we need to update it to (as you noted in the comment above):

prefix + hash(key) + hash(value_provided_to_the_SMT)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping, @roysc , @i-norden - let's update the paragraph to what's in the SMT leaf and merge this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robert-zaremba
Copy link
Collaborator

@roysc , let's update the PR and merge it.

@robert-zaremba
Copy link
Collaborator

We need to add an information about the prefix.

@roysc
Copy link
Contributor Author

roysc commented Jul 30, 2021

I would argue the prefix is an implementation detail and doesn't need to be in the spec, since it's not relevant to the ADR's functionality?

@robert-zaremba
Copy link
Collaborator

I would argue the prefix is an implementation detail and doesn't need to be in the spec, since it's not relevant to the ADR's functionality?

We are defining what's in the SMT leaf, so let's be precise - because this spec will be used by IBC, and they need a full information.

@roysc
Copy link
Contributor Author

roysc commented Jul 31, 2021

Let's be thorough then, we should include the spec for internal nodes and the hash function used as well. I'll link the spec from Celestia and quote the relevant parts here.

@roysc
Copy link
Contributor Author

roysc commented Jul 31, 2021

@robert-zaremba updated, let me know how that looks.

@roysc roysc force-pushed the roysc/adr-040-update-smt-leaf branch from d65f218 to bf2a783 Compare August 1, 2021 03:53
@roysc
Copy link
Contributor Author

roysc commented Aug 1, 2021

I realized a couple things need clarification, though maybe just for my own understanding. This seems as good a place as any to ask, as they are relevant questions:

  1. @robert-zaremba what is the anticipated use case for the reverse index? In what context would only h(k), and not k ,be available to the program? Do we actually need to have h(k) available anywhere outside of proof generation? I may just not be familiar enough with the SDK's usage to see this.
  2. @i-norden in what context will we need h(v) for IPLD? Would there be any case where h(v) is needed that v is not also available?

If these hashes won't be needed outside of a context where their preimages are also available, we can simplify the design and implementation somewhat.

@robert-zaremba
Copy link
Collaborator

let's update here that reverse index is needed for IPDL

Copy link
Collaborator

@robert-zaremba robert-zaremba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - we clarify what's needed in a call.

@robert-zaremba robert-zaremba changed the title docs: Update ADR-40 to store hash(value) in SMT leaf docs: Update ADR-040 to store hash(value) in SMT leaf Aug 4, 2021
@roysc
Copy link
Contributor Author

roysc commented Aug 10, 2021

I picked a few nits, but this is done and should be good to merge.

@robert-zaremba
Copy link
Collaborator

we miss one more approval. @i-norden could you approve a well?

@robert-zaremba
Copy link
Collaborator

@roysc , could you sync this branch? It should merge automatically once it's synced.

@robert-zaremba robert-zaremba added the A:automerge Automatically merge PR once all prerequisites pass. label Sep 15, 2021
@mergify mergify bot merged commit 86b0c0a into cosmos:master Sep 15, 2021
@roysc roysc deleted the roysc/adr-040-update-smt-leaf branch September 16, 2021 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A:automerge Automatically merge PR once all prerequisites pass. T: ADR An issue or PR relating to an architectural decision record
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants