Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: incremental-hasher #261

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions src/block/interface.ts
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,22 @@ export interface Phantom<T> {
[Marker]?: T
}

/**
* [Multicodec code] usually used to tag [multiformat]. It is simply an Integer that
* utilizes `Phantom` type to capture code name which typescript aware tools will
* surface providing info about the code withouth having to look it up in the table.
*
* Type also can be used to convey the fact that value must be a multicodec
*
* [multiformat]:https://multiformats.io/
* [multicodec code]:https://github.com/multiformats/multicodec/blob/master/table.csv
*/
export type MulticodecCode<
Code extends number = number,
Name extends string = string
> = Code & Phantom<Name>


/**
* Represents an IPLD block (including its CID) that can be decoded to data of
* type `T`.
Expand Down
70 changes: 66 additions & 4 deletions src/hashes/interface.ts
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
// # Multihash
import type { MulticodecCode } from '../block/interface.js'

/**
* Represents a multihash digest which carries information about the
Expand All @@ -9,7 +10,7 @@
// a bunch of places that parse it to extract (code, digest, size). By creating
// this first class representation we avoid reparsing and things generally fit
// really nicely.
export interface MultihashDigest<Code extends number = number> {
export interface MultihashDigest<Code extends MulticodecCode = MulticodecCode, Size extends number = number> {
/**
* Code of the multihash
*/
Expand All @@ -23,7 +24,7 @@ export interface MultihashDigest<Code extends number = number> {
/**
* byte length of the `this.digest`
*/
size: number
size: Size

/**
* Binary representation of this multihash digest.
Expand All @@ -35,7 +36,7 @@ export interface MultihashDigest<Code extends number = number> {
* Hasher represents a hashing algorithm implementation that produces as
* `MultihashDigest`.
*/
export interface MultihashHasher<Code extends number = number> {
export interface MultihashHasher<Code extends MulticodecCode = MulticodecCode> {
/**
* Takes binary `input` and returns it (multi) hash digest. Return value is
* either promise of a digest or a digest. This way general use can `await`
Expand Down Expand Up @@ -67,6 +68,67 @@ export interface MultihashHasher<Code extends number = number> {
* `SyncMultihashHasher` is useful in certain APIs where async hashing would be
* impractical e.g. implementation of Hash Array Mapped Trie (HAMT).
*/
export interface SyncMultihashHasher<Code extends number = number> extends MultihashHasher<Code> {
export interface SyncMultihashHasher<Code extends MulticodecCode = MulticodecCode> extends MultihashHasher<Code> {
digest: (input: Uint8Array) => MultihashDigest<Code>
}

/**
* Incremental variant of the `MultihashHasher` that can be used to compute
* digest of the payloads that would be impractical or impossible to load all
* into a memory.
*/
export interface IncrementalMultihashHasher<
Code extends MulticodecCode,
Size extends number,
Digest = MultihashDigest<Code, Size>
> {
/**
* Size of the digest this hasher produces.
*/
size: Size

/**
* Code of the multihash
*/
code: Code

/**
* Name of the multihash
*/
name: string

/**
* Number of bytes that were consumed.
*/
count(): bigint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean if someone is hashing >9PiB of data in JS then 👏👏👏.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah ... is this overkill?

Comment on lines +100 to +103
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* Number of bytes that were consumed.
*/
count(): bigint

Let's just drop this method, we can revisit if we find it really necessary.


/**
* Returns multihash digest of the bytes written so far. Should not have
* side-effects, meaning you should be able to write some more bytes and
* call `digest` again to get the digest for all the bytes written from
* creation (or from reset)
*/
digest(): Digest

/**
* Computes the digest of the given input and writes it into the provided
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of the given input

Do you mean "of the bytes written so far"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Computes the digest of the given input and writes it into the provided
* Computes the digest of the bytes written so far and writes it into the provided

* `output` at given `offset`. The `offset` is optional and if omitted is
* implicitly `0`. The optional `asMultihash` whith is implicitly `true`,
* can be use to control whether multihash prefix is written, if `false`
* only the raw digest writtend omitting the prefix.
*/
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return output? Not sure how useful it is to chain this method.

Suggested change
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): this
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): Uint8Array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there's a better convention for method name when receiving a parameter to mutate. digestInto is a bit awkwardly named (IMO!).

digestBYOB? 🙄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asMultihash - do we need? The point of this library is multiformats.

Copy link
Contributor Author

@Gozala Gozala Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went into this in the linked issue. In practice I have encountered many instances where I do need to leave out the prefix, I could go and use another library in those instances, but seems like we could just expose this. That sayid I agree that this method is kind of meh and we could do better.

@vasco-santos suggested having a whole another method, personally I'm wondering if perhaps we should have methods just to get digest without a prefix as this is low level API anyway ?

Suggested change
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): this
// multihash prefix for it
header: Uint8Array
// only writes the digest without a prefix
digestInto(output: Uint8Array, offset?: number): this

Or alternatively something like this

Suggested change
digestInto(output: Uint8Array, offset?: number, asMultihash?: boolean): this
encodeMultihashInto(target: Uint8Array, offset?: number): this
encodeMultihashHeaderInto(target: Uint8Array, offset?: number): this
encodeDigestInto(target: Uint8Array, offset?: number): thsi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

digestInto is a bit awkwardly named (IMO!).

I'm happy to call this something, else I was trying to align with varint.encodeInto, which as it turns out was called varint.encodeTo 😅

digestBYOB? 🙄

Works for me although I had to google to figure out what BYOB stand for.

Alternatively we could just have read(bytes: Uint8Array, offfset?: number): this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return output? Not sure how useful it is to chain this method.

I personally find mutations that return values misleading, that said I'm amendable to the idea

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to google to figure out what BYOB stand for.

BYOB was aiming for familiarity with https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamBYOBReader etc. but I guess that was a miss 😆


/**
* Writes bytes to be digested.
*/
write(bytes: Uint8Array): this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically in streaming hashers this is called update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with calling it update although I do find that name confusing personally as I think of update as overwrite as opposed to append.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but streaming hashers aren't appending to a buffer they are updating their internal state with the new data you pass.


/**
* Resets this hasher to its initial state. Can be used to recycle this
* instance. It resets `count` and and discards all the bytes that were
* written prior.
*/
reset(): this
}