Update test files #125

ben221199 · 2024-06-05T12:04:21Z

This pull request adds test vectors for Base 45 and Proquint.

rvagg · 2024-06-06T02:10:15Z

wow, proquint too, I've been tempted for a long time to hack on that in JS but never got around to it

I'll verify, it'll depend on inspiration levels and my ability to pull myself away from more pressing things, plz be patient

ben221199 · 2024-06-06T07:10:12Z

I did read the spec and wrote some JS thing like this:

var CONSONANTS = "bdfghjklmnprstvz".split("");
var VOWELS     = "aiou".split("");

function encode(str){
    if(str.length>0 && str.length%2!=0){
        throw new Exception("Should be divisible by 16 bits");
    }
    var shorts = str.length/2;
    var parts = [];
    for(var i=0;i<shorts;i++){
        var value = str[i*2].charCodeAt(0)*256 + str[i*2+1].charCodeAt(0);
        parts.push(CONSONANTS[value>>12 & 0xF]+VOWELS[value>>10 & 0x3]+CONSONANTS[value>>6 & 0xF]+VOWELS[value>>4 & 0x3]+CONSONANTS[value>>0 & 0xF]);
    }
    return parts.join('-');
}

After that, I patched it a little, because it wasn't perfect. The * 256 was changed to << 8, the + became |, and the input string was changed to be an UInt8Array. Then it should work and be able to encode. Note that the input must always have an even amount of bytes. In the spec I didn't see anything that allowed any length, because that would cause the middle consonant sometimes to be split in halve, possibly giving the possibility to use multiple letters where the first two bits of the letter index are the same as the last two bits of the bytes. I don't think they want that.

tests/case_insensitivity.csv

Ref: https://datatracker.ietf.org/doc/html/rfc9285 Ref: multiformats/multibase#123 Ref: multiformats/multibase#125

rvagg · 2024-06-06T13:23:07Z

My hand-rolled proquint implementation at multiformats/js-multiformats#292 gives the following vectors:

Decentralize everything!! = pro-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fah
yes mani ! = pro-lojoj-lasob-kujod-kunon-fabod
hello world = pro-kodoj-kudos-kusob-litoz-lanos-kib
\x00yes mani ! = pro-badun-kijug-fadot-kajov-kohob-fah
\x00\x00yes mani ! = pro-babab-lojoj-lasob-kujod-kunon-fabod

Left some additional notes in multiformats/js-multiformats#292, but other reference implementations that can do more than just 16 and 32 bit numbers would be nice to try against.

rvagg · 2024-06-06T13:24:41Z

The Proquint exercise makes me think we should update the multibase RFC doc for it to account for uneven byte inputs; I had to make a choice about that and I think it's a reasonable one.

ben221199 · 2024-06-06T13:39:47Z

My outputs for those values (when odd, I added 0x00, else my implementation only prints p):

pro-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fahab (Same, but with ab)
pro-lojoj-lasob-kujod-kunon-fabod (Same)
pro-kodoj-kudos-kusob-litoz-lanos-kibab (Same, but with ab)
pro-badun-kijug-fadot-kajov-kohob-fahab (Same, but with ab)
pro-babab-lojoj-lasob-kujod-kunon-fabod (Same)

Seems your system works.

In order to support 8-bit chunks instead of the 16-bit chunks now, I think we should contact the original inventor of the thing. If we just do it ourselves, we break spec and possibly compatibility with other systems. I don't like that.

ben221199 · 2024-06-06T13:55:43Z

The Proquint exercise makes me think we should update the multibase RFC doc for it to account for uneven byte inputs; I had to make a choice about that and I think it's a reasonable one.

Okay, another option that doesn't break proquint spec, but gives us the possibility to indicate an odd amount of bytes would be this:

The p is our prefix.
Then it is followed by ro-.
Then follows the real 16-bit chunk encoded proquint output.

How do we tell the system that the content has an odd number of bytes? Well, like this:

First we decide what byte we want for the padding. I think 0x00 is fine. That will cause the additional ab in the end.
Then, instead of ro-, we output or- to indicate the last byte can be removed from the end.

Because the ro- thing seems very Multibase specific, I don't think this will cause problems with systems elsewhere.

ben221199 · 2024-06-06T13:56:39Z

So:

por-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fahab (Odd)
pro-lojoj-lasob-kujod-kunon-fabod (Even)
por-kodoj-kudos-kusob-litoz-lanos-kibab (Odd)
por-badun-kijug-fadot-kajov-kohob-fahab (Odd)
pro-babab-lojoj-lasob-kujod-kunon-fabod (Even)

rvagg · 2024-06-07T05:20:51Z

That's an interesting option, although I still prefer mine. cvc is still pronounceable and 3 vs 5 make it very clear that there's truncation. But let's go fishing a bit more! I'll check with some folks who have some connection with the source of proquint in the multibase table. Then maybe we just go ask dsw's opinion over at https://github.com/dsw/proquint. But first I did some digging and here's what I discovered:

Proquint was first introduced, contributed by an external contributor to go-namesys but it was put together by Jeromy and merged by Juan in 2014, right back in the very early days of the whole stack: ipfs/go-namesys@59eb235
The implementation used depends on there being an even number of bytes! https://github.com/Bren2010/proquint/blob/master/proquint.go - that thing is going to panic if it gets an odd number to encode, but it'll always decode to an even number. I suppose the intention was to just use it for ipv4 addresses.
It was finally removed from go-ipfs in 2021 with a big resolver refactor: feat: support custom DoH resolvers ipfs/kubo#8068 (go-namesys was bumped to v0.2.0 and removed it entirely).
One of the problems that lead to its removal was that it lead to confusing errors for users, afaik nobody ever used them but it ended up getting involved in decoding bad ipns or dnslink strings, Error: not a valid proquint string (what on earth is this??), that was pushed out in go-ipfs v0.9.0: https://github.com/ipfs/kubo/releases/tag/v0.9.0
With the removal from go-ipfs, it was then shifted to the multibase table, along with the original mini-spec in there: feat: add proquint #78 - the v0.9.0 release said "We have removed support for proquints as they were out of place and largely unused, however proquints are valid multibases so if there is renewed interest in them there is a way forward".

And that's where we are.

But I still see no evidence of needing to deal with the odd byte problem. But let's fish for opinions!

ben221199 · 2024-07-17T08:54:56Z

But I still see no evidence of needing to deal with the odd byte problem. But let's fish for opinions!

Well, it is the only Multibase encoding that doesn't support odd bytes yet. In my opinion, a good Multibase encoding should be able to encode any data. For Proquints that is now not the case. I think there are 3 options:

Remove Proquints from Multibase (I don't like this option)
Keep Proquints and make it odd byte compatible without breaking specs (e.g. pro and por or another solution)
Keep Proquints and make it odd byte compatible with breaking specs (I don't like this option either)

Tagging @dsw and @randomwalker here.

rvagg · 2024-07-17T10:39:48Z

I disagree that it breaks the proquint spec by shortening the last section. Any block of 3 characters using the same rules is still pronounceable and I don't think you can read the spec as being particularly strict about this, any more than its use being only applicable to 32-bit integers which seems to be what the main implementations are being used for. So I'm still in favour of keeping pro as a stable prefix so as not to have two parsing branches at the beginning for a multiformats consumer, and just shortening the last block to 3 (cvc) characters.

dsw · 2024-07-17T22:02:28Z

I see: the problem is that what if the number of bits gets cut off in the middle of a proquint or in the middle of a proquint character? The problem really is that you are implicitly encoding the length of the string in the length of the characters. Hex has the same problem: the length is expected to be a multiple of 4 bits. The solution that CPUs pick is they just zero-extend or sign-extend out to the machine word size and then they encode the length otherwise. Is this somehow not an option? Daniel

…

On Wed, Jul 17, 2024 at 3:40 AM Rod Vagg ***@***.***> wrote: I disagree that it breaks the proquint spec by shortening the last section. Any block of 3 characters using the same rules is still pronounceable and I don't think you can read the spec as being particularly strict about this, any more than its use being only applicable to 32-bit integers which seems to be what the main implementations are being used for. So I'm still in favour of keeping pro as a stable prefix so as not to have two parsing branches at the beginning for a multiformats consumer, and just shortening the last block to 3 (cvc) characters. — Reply to this email directly, view it on GitHub <#125 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAL6XACNWTWIDVR2VHY4RDZMZCYXAVCNFSM6AAAAABI2SERWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZSHE4TGOBTGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ben221199 · 2024-07-18T08:17:32Z

Hi @dsw, I see your point with hexadecimal. Many encodings/bases in multibase are not able to properly encode and decode nibbles without padding.

However, in the computer world we have decided that 8 bits (octet) is the smallest piece of information we can send over the internet and it is also standardized in many other ways. Because hex encodes each nibble (4 bits) with an alphanumeric and two of those nibbles fit in a byte, there is no problem. For every data with length x, you know you need a hex string with length 2x.

The problem with Proquints is that we are now not talking about things smaller than a byte/octet, but a thing that is bigger: 16 bits. When encoding data with an even amount x of bytes, you know that the output has x/2 proquint words. The problem occurs when the input data has an odd amount of bytes. The formula x/2 will output a fraction, so we came up with two options:

@rvagg option:
- If the input has an even amount of bytes, do nothing.
- If the input has an odd amount of bytes, cut the (last) proquint in half. So, hex 00 00 would be babab, but hex 00 would be bab. The unused last two bits are set to zero. (In my opinion this would break the Proquint spec at https://arxiv.org/html/0901.4016)
@ben221199 option:
- If the input has an even amount of bytes, do nothing.
- If the input has an odd amount of bytes, add 0x00 to the end. The data has now an even amount of bytes, so it is possible to encode it as usual. However, to indicate we padded the data with one byte, we use a different prefix. Normally it is p + ro (pro), but when the data input has odd bytes, it will become p + or (por).

dsw · 2024-07-18T18:52:00Z

Hm. I think changing the prefix would be weird. I think we should keep the prefix but extend the encoding. Let me think about it.

…

On Thu, Jul 18, 2024 at 1:17 AM Ben ***@***.***> wrote: Hi @dsw <https://github.com/dsw>, I see your point with hexadecimal. Many encodings/bases in multibase are not able to properly encode and decode nibbles for example without padding. However, in the computer world we have decided that 8 bits (octet) is the smallest piece of information we can send over the internet and it is also standardized in many other ways. Because hex encodes each nibble (4 bits) with an alphanumeric and two of those nibbles fit in a byte, there is no problem. For every data with length x, you know you need a hex string with length 2x. The problem with Proquints is that we are now not talking about things smaller than a byte/octet, but a thing that is bigger: 16 bits. When encoding data with an even amount x of bytes, you know that the output has x/2 proquint words. The problem occurs when the input data has an odd amount of bytes. The formula x/2 will output a fraction, so we came up with two options: - @rvagg <https://github.com/rvagg> option: - If the input has an even amount of bytes, do nothing. - If the input has an odd amount of bytes, cut the (last) proquint in half. So, hex 00 00 would be babab, but hex 00 would be bab. The unused last two bits are set to zero. (In my opinion this would break the Proquint spec at https://arxiv.org/html/0901.4016) - @ben221199 <https://github.com/ben221199> option: - If the input has an even amount of bytes, do nothing. - If the input has an odd amount of bytes, add 0x00 to the end. The data has now an even amount of bytes, so it is possible to encode it as usual. However, to indicate we padded the data with one byte, we use a different prefix. Normally it is p + ro (pro), but when the data input has odd bytes, it will become p + or (por). — Reply to this email directly, view it on GitHub <#125 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAL6XARME6CMBBLZSHSVR3ZM523HAVCNFSM6AAAAABI2SERWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZVHEYDOOJYG4> . You are receiving this because you were mentioned.Message ID: ***@***.***>

dsw · 2024-07-18T19:12:44Z

Encode the last 2 bits using another vowel, preceded by a spoken but unwritten "y". To be canonical, do this only in the case where the encoding would require a consonant at the end of a string but not enough bits remain to make one. Now we can encode any string of even bit length. The point of proquints was to be easily pronounceable and understandable, and the vowels "a", "i", "o", "u" all satisfy those properties when next to each other. I made the encoding alternate consonant-vowel mostly just to prevent unpronounceable consonant clusters. You could still end up with weird vowel clusters. However, (1) those are not that bad to say, and (2) and we could have a convention like Hungarian where you often put a "j" between them. Since "j" is used but "y" is unused we could use that instead. Actually writing the "y" would make the width irregular. However we could make the "y" pronounced without being written whenever there are two vowels in a row. Note that doing this requires the following pronunciation rules to prevent ambiguity: * g = hard g as in "golf", * j = hard j as in "just", * y = soft y as in "yes". So in your example: 0x00 = 0qba[y]a. Daniel On Thu, Jul 18, 2024 at 11:51 AM Daniel Wilkerson < ***@***.***> wrote:

…

Hm. I think changing the prefix would be weird. I think we should keep the prefix but extend the encoding. Let me think about it. On Thu, Jul 18, 2024 at 1:17 AM Ben ***@***.***> wrote: > Hi @dsw <https://github.com/dsw>, I see your point with hexadecimal. > Many encodings/bases in multibase are not able to properly encode and > decode nibbles for example without padding. > > However, in the computer world we have decided that 8 bits (octet) is the > smallest piece of information we can send over the internet and it is also > standardized in many other ways. Because hex encodes each nibble (4 bits) > with an alphanumeric and two of those nibbles fit in a byte, there is no > problem. For every data with length x, you know you need a hex string > with length 2x. > > The problem with Proquints is that we are now not talking about things > smaller than a byte/octet, but a thing that is bigger: 16 bits. When > encoding data with an even amount x of bytes, you know that the output > has x/2 proquint words. The problem occurs when the input data has an > odd amount of bytes. The formula x/2 will output a fraction, so we came > up with two options: > > - @rvagg <https://github.com/rvagg> option: > - If the input has an even amount of bytes, do nothing. > - If the input has an odd amount of bytes, cut the (last) proquint > in half. So, hex 00 00 would be babab, but hex 00 would be bab. > The unused last two bits are set to zero. (In my opinion this would break > the Proquint spec at https://arxiv.org/html/0901.4016) > - @ben221199 <https://github.com/ben221199> option: > - If the input has an even amount of bytes, do nothing. > - If the input has an odd amount of bytes, add 0x00 to the end. > The data has now an even amount of bytes, so it is possible to encode it as > usual. However, to indicate we padded the data with one byte, we use a > different prefix. Normally it is p + ro (pro), but when the data > input has odd bytes, it will become p + or (por). > > — > Reply to this email directly, view it on GitHub > <#125 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAAL6XARME6CMBBLZSHSVR3ZM523HAVCNFSM6AAAAABI2SERWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZVHEYDOOJYG4> > . > You are receiving this because you were mentioned.Message ID: > ***@***.***> >

ben221199 · 2024-07-18T20:09:12Z

Seems possible to use letters that are not yet in use by the spec. However, doesn't that break current spec and existing implementations?

And how does 0qba[y]a work? Where does 0q (zero Q) come from? I know 00 00 is babab as seen below, so I would only expect the last 2 or 3 letters different. Did you just mean baya?

I think the advantage of my option is that we don't have to change the proquint spec (which is focussed on 16-bit chunks in its core) and that the padding indication is outside of this encoding/decoding.

dsw · 2024-07-18T21:05:11Z

Ah, I never said in the spec that I intended the unambiguous indicator of a proquint to be the prefix "0q", to parallel "0x" for hex. The "y" is in square brackets because you do not write it, you just say it. I think changing the prefix to indicate length is frankly a really bad idea: the format has not changed and that is what a prefix usually indicates. So I am going to update the spec to clarify all of this. I appreciate your pointing out the problem to me.

…

On Thu, Jul 18, 2024 at 1:09 PM Ben ***@***.***> wrote: Seems possible to use letters that are not yet in use by the spec. However, doesn't that break current spec and existing implementations? And how does 0qba[y]a work? Where does 0q (zero Q) come from? I know 00 00 is babab as seen below, so I would only expect the last 2 or 3 letters different. Did you just mean baya? image.png (view on web) I think the advantage of my option is that we don't have to change the proquint spec (which is focussed on 16-bit chunks in its core) and that the padding indication is outside of this encoding/decoding. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

ben221199 · 2024-07-19T07:50:46Z

Ah, I never said in the spec that I intended the unambiguous indicator of a proquint to be the prefix "0q", to parallel "0x" for hex.

Ah, fair enough. Seems fine to me. (Multibase uses a different prefix of course.)

The "y" is in square brackets because you do not write it, you just say it.

So, I see baa, and I say baya? 🤔 CVV indeed makes 8 bits, where CVC makes 10 bits. It makes sense in some way, but many implementations have to be changed. (And I care about backward compatability.)

I think changing the prefix to indicate length is frankly a really bad idea: the format has not changed and that is what a prefix usually indicates.

We wouldn't indicate length, we would indicate padding. The Multibase prefix letter p still stays p. The letters ro don't have a meaning other then making the prefix sound like pro, as in proquint. My first idea was using the inverse of it, or, to make por. But, because the data is "padded", it would also be possible to use ad, which makes pad. 😎 At least, the conclusion is we only change the meaningless second and third letters in this case.

ben221199 · 2024-07-19T07:59:19Z

Now I'm thinking, it seems you want this 16-bit problem to be solved in Proquint itself, so that it will support 8-bit from now on too. But is it a problem if Proquint just happens to be 16-bit chunked and we have to solve the problem elsewhere? I tried to solve it using the Multibase prefix pro/por/pad, but that doesn't work with 0q. If you want to solve it for 0q too, then I understand your option to change the spec, but in any other case, I think it is too drastic a change.

dsw · 2024-07-19T11:56:25Z

I never said that Proquints is 16-bit based. I said Proquints are "Identifiers that are Readable, Spellable, and Pronounceable". You are trying to maintain a constraint that you made up and that makes no sense to maintain. Lexing two vowels in a row is the *least* drastic change and *is* backwards compatible. There is no ambiguity in lexing the format, which makes it backwards compatible. If you just allow any letter in any position, the code actually gets simpler, rather than adding any special case for the new semantics. Changing the prefix is a very odd choice and I would like you to please not make it. I have literally never seen someone change a prefix to indicate a length. It is frankly a bad factoring to associate a prefix with a string length; I have literally never seen anyone even consider doing that.

…

On Fri, Jul 19, 2024 at 12:59 AM Ben ***@***.***> wrote: Now I'm thinking, it seems you want this 16-bit problem to be solved in Proquint itself, so that it will support 8-bit from now on too. But is it a problem if Proquint just happens to be 16-bit chunked and we have to solve the problem elsewhere? I tried to solve it using the Multibase prefix pro/ por/pad, but that doesn't work with 0q. If you want to solve it for 0q too, then I understand your option to change the spec, but in any other case, I think it is too drastic a change. — Reply to this email directly, view it on GitHub <#125 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAL6XEX5IW7WPXDEOGOVY3ZNDBO3AVCNFSM6AAAAABI2SERWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZYGU4TCOJYHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Update test files with Base45 and Proquint

cf254c8

ben221199 mentioned this pull request Jun 5, 2024

Test vectors needed for Base45 #124

Open

ben221199 added 4 commits June 5, 2024 14:09

Fix case insensitivity test

300ed1e

Fix case insensitivity test for Proquint

eeaf1ec

Fix case insensitivity test for Base 45

e6b7619

Fix case insensitivity test for Base 45 again

df1ef5f

rvagg self-requested a review June 6, 2024 02:10

rvagg self-assigned this Jun 6, 2024

rvagg reviewed Jun 6, 2024

View reviewed changes

tests/case_insensitivity.csv Show resolved Hide resolved

rvagg added a commit to multiformats/js-multiformats that referenced this pull request Jun 6, 2024

feat: add base45 multibase

eef3821

Ref: https://datatracker.ietf.org/doc/html/rfc9285 Ref: multiformats/multibase#123 Ref: multiformats/multibase#125

rvagg mentioned this pull request Jun 6, 2024

feat: add base45 multibase multiformats/js-multiformats#291

Open

rvagg added a commit to multiformats/js-multiformats that referenced this pull request Jun 6, 2024

feat: add base45 multibase

1c3d54f

Ref: https://datatracker.ietf.org/doc/html/rfc9285 Ref: multiformats/multibase#123 Ref: multiformats/multibase#125

ben221199 mentioned this pull request Jun 6, 2024

Add proquint test vectors #107

Open

rvagg added a commit to multiformats/js-multiformats that referenced this pull request Jun 6, 2024

feat: add base45 multibase

72c7df7

Ref: https://datatracker.ietf.org/doc/html/rfc9285 Ref: multiformats/multibase#123 Ref: multiformats/multibase#125

rvagg added a commit to multiformats/js-multiformats that referenced this pull request Jun 6, 2024

feat: add base45 multibase

b962767

Ref: https://datatracker.ietf.org/doc/html/rfc9285 Ref: multiformats/multibase#123 Ref: multiformats/multibase#125

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update test files #125

Update test files #125

ben221199 commented Jun 5, 2024 •

edited

Loading

rvagg commented Jun 6, 2024

ben221199 commented Jun 6, 2024 •

edited

Loading

rvagg commented Jun 6, 2024

rvagg commented Jun 6, 2024

ben221199 commented Jun 6, 2024

ben221199 commented Jun 6, 2024

ben221199 commented Jun 6, 2024

rvagg commented Jun 7, 2024

ben221199 commented Jul 17, 2024

rvagg commented Jul 17, 2024

dsw commented Jul 17, 2024 via email

ben221199 commented Jul 18, 2024 •

edited

Loading

dsw commented Jul 18, 2024 via email

dsw commented Jul 18, 2024 via email

ben221199 commented Jul 18, 2024

dsw commented Jul 18, 2024 via email

ben221199 commented Jul 19, 2024

ben221199 commented Jul 19, 2024

dsw commented Jul 19, 2024 via email

Update test files #125

Are you sure you want to change the base?

Update test files #125

Conversation

ben221199 commented Jun 5, 2024 • edited Loading

rvagg commented Jun 6, 2024

ben221199 commented Jun 6, 2024 • edited Loading

rvagg commented Jun 6, 2024

rvagg commented Jun 6, 2024

ben221199 commented Jun 6, 2024

ben221199 commented Jun 6, 2024

ben221199 commented Jun 6, 2024

rvagg commented Jun 7, 2024

ben221199 commented Jul 17, 2024

rvagg commented Jul 17, 2024

dsw commented Jul 17, 2024 via email

ben221199 commented Jul 18, 2024 • edited Loading

dsw commented Jul 18, 2024 via email

dsw commented Jul 18, 2024 via email

ben221199 commented Jul 18, 2024

dsw commented Jul 18, 2024 via email

ben221199 commented Jul 19, 2024

ben221199 commented Jul 19, 2024

dsw commented Jul 19, 2024 via email

ben221199 commented Jun 5, 2024 •

edited

Loading

ben221199 commented Jun 6, 2024 •

edited

Loading

ben221199 commented Jul 18, 2024 •

edited

Loading