Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update test files #125

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

ben221199
Copy link
Contributor

@ben221199 ben221199 commented Jun 5, 2024

This pull request adds test vectors for Base 45 and Proquint.

@rvagg
Copy link
Member

rvagg commented Jun 6, 2024

wow, proquint too, I've been tempted for a long time to hack on that in JS but never got around to it

I'll verify, it'll depend on inspiration levels and my ability to pull myself away from more pressing things, plz be patient

@rvagg rvagg self-requested a review June 6, 2024 02:10
@rvagg rvagg self-assigned this Jun 6, 2024
@ben221199
Copy link
Contributor Author

ben221199 commented Jun 6, 2024

I did read the spec and wrote some JS thing like this:

var CONSONANTS = "bdfghjklmnprstvz".split("");
var VOWELS     = "aiou".split("");

function encode(str){
    if(str.length>0 && str.length%2!=0){
        throw new Exception("Should be divisible by 16 bits");
    }
    var shorts = str.length/2;
    var parts = [];
    for(var i=0;i<shorts;i++){
        var value = str[i*2].charCodeAt(0)*256 + str[i*2+1].charCodeAt(0);
        parts.push(CONSONANTS[value>>12 & 0xF]+VOWELS[value>>10 & 0x3]+CONSONANTS[value>>6 & 0xF]+VOWELS[value>>4 & 0x3]+CONSONANTS[value>>0 & 0xF]);
    }
    return parts.join('-');
}

After that, I patched it a little, because it wasn't perfect. The * 256 was changed to << 8, the + became |, and the input string was changed to be an UInt8Array. Then it should work and be able to encode. Note that the input must always have an even amount of bytes. In the spec I didn't see anything that allowed any length, because that would cause the middle consonant sometimes to be split in halve, possibly giving the possibility to use multiple letters where the first two bits of the letter index are the same as the last two bits of the bytes. I don't think they want that.

@rvagg
Copy link
Member

rvagg commented Jun 6, 2024

My hand-rolled proquint implementation at multiformats/js-multiformats#292 gives the following vectors:

  • Decentralize everything!! = pro-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fah
  • yes mani ! = pro-lojoj-lasob-kujod-kunon-fabod
  • hello world = pro-kodoj-kudos-kusob-litoz-lanos-kib
  • \x00yes mani ! = pro-badun-kijug-fadot-kajov-kohob-fah
  • \x00\x00yes mani ! = pro-babab-lojoj-lasob-kujod-kunon-fabod

Left some additional notes in multiformats/js-multiformats#292, but other reference implementations that can do more than just 16 and 32 bit numbers would be nice to try against.

@rvagg
Copy link
Member

rvagg commented Jun 6, 2024

The Proquint exercise makes me think we should update the multibase RFC doc for it to account for uneven byte inputs; I had to make a choice about that and I think it's a reasonable one.

@ben221199
Copy link
Contributor Author

My outputs for those values (when odd, I added 0x00, else my implementation only prints p):

  • pro-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fahab (Same, but with ab)
  • pro-lojoj-lasob-kujod-kunon-fabod (Same)
  • pro-kodoj-kudos-kusob-litoz-lanos-kibab (Same, but with ab)
  • pro-badun-kijug-fadot-kajov-kohob-fahab (Same, but with ab)
  • pro-babab-lojoj-lasob-kujod-kunon-fabod (Same)

Seems your system works.


In order to support 8-bit chunks instead of the 16-bit chunks now, I think we should contact the original inventor of the thing. If we just do it ourselves, we break spec and possibly compatibility with other systems. I don't like that.

@ben221199
Copy link
Contributor Author

The Proquint exercise makes me think we should update the multibase RFC doc for it to account for uneven byte inputs; I had to make a choice about that and I think it's a reasonable one.

Okay, another option that doesn't break proquint spec, but gives us the possibility to indicate an odd amount of bytes would be this:

The p is our prefix.
Then it is followed by ro-.
Then follows the real 16-bit chunk encoded proquint output.

How do we tell the system that the content has an odd number of bytes? Well, like this:

First we decide what byte we want for the padding. I think 0x00 is fine. That will cause the additional ab in the end.
Then, instead of ro-, we output or- to indicate the last byte can be removed from the end.

Because the ro- thing seems very Multibase specific, I don't think this will cause problems with systems elsewhere.

@ben221199
Copy link
Contributor Author

So:

  • por-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fahab (Odd)
  • pro-lojoj-lasob-kujod-kunon-fabod (Even)
  • por-kodoj-kudos-kusob-litoz-lanos-kibab (Odd)
  • por-badun-kijug-fadot-kajov-kohob-fahab (Odd)
  • pro-babab-lojoj-lasob-kujod-kunon-fabod (Even)

@rvagg
Copy link
Member

rvagg commented Jun 7, 2024

That's an interesting option, although I still prefer mine. cvc is still pronounceable and 3 vs 5 make it very clear that there's truncation. But let's go fishing a bit more! I'll check with some folks who have some connection with the source of proquint in the multibase table. Then maybe we just go ask dsw's opinion over at https://github.com/dsw/proquint. But first I did some digging and here's what I discovered:

  • Proquint was first introduced, contributed by an external contributor to go-namesys but it was put together by Jeromy and merged by Juan in 2014, right back in the very early days of the whole stack: ipfs/go-namesys@59eb235
  • The implementation used depends on there being an even number of bytes! https://github.com/Bren2010/proquint/blob/master/proquint.go - that thing is going to panic if it gets an odd number to encode, but it'll always decode to an even number. I suppose the intention was to just use it for ipv4 addresses.
  • It was finally removed from go-ipfs in 2021 with a big resolver refactor: feat: support custom DoH resolvers ipfs/kubo#8068 (go-namesys was bumped to v0.2.0 and removed it entirely).
  • One of the problems that lead to its removal was that it lead to confusing errors for users, afaik nobody ever used them but it ended up getting involved in decoding bad ipns or dnslink strings, Error: not a valid proquint string (what on earth is this??), that was pushed out in go-ipfs v0.9.0: https://github.com/ipfs/kubo/releases/tag/v0.9.0
  • With the removal from go-ipfs, it was then shifted to the multibase table, along with the original mini-spec in there: feat: add proquint #78 - the v0.9.0 release said "We have removed support for proquints as they were out of place and largely unused, however proquints are valid multibases so if there is renewed interest in them there is a way forward".

And that's where we are.

But I still see no evidence of needing to deal with the odd byte problem. But let's fish for opinions!

@ben221199
Copy link
Contributor Author

But I still see no evidence of needing to deal with the odd byte problem. But let's fish for opinions!

Well, it is the only Multibase encoding that doesn't support odd bytes yet. In my opinion, a good Multibase encoding should be able to encode any data. For Proquints that is now not the case. I think there are 3 options:

  • Remove Proquints from Multibase (I don't like this option)
  • Keep Proquints and make it odd byte compatible without breaking specs (e.g. pro and por or another solution)
  • Keep Proquints and make it odd byte compatible with breaking specs (I don't like this option either)

Tagging @dsw and @randomwalker here.

@rvagg
Copy link
Member

rvagg commented Jul 17, 2024

I disagree that it breaks the proquint spec by shortening the last section. Any block of 3 characters using the same rules is still pronounceable and I don't think you can read the spec as being particularly strict about this, any more than its use being only applicable to 32-bit integers which seems to be what the main implementations are being used for. So I'm still in favour of keeping pro as a stable prefix so as not to have two parsing branches at the beginning for a multiformats consumer, and just shortening the last block to 3 (cvc) characters.

@dsw
Copy link

dsw commented Jul 17, 2024 via email

@ben221199
Copy link
Contributor Author

ben221199 commented Jul 18, 2024

Hi @dsw, I see your point with hexadecimal. Many encodings/bases in multibase are not able to properly encode and decode nibbles without padding.

However, in the computer world we have decided that 8 bits (octet) is the smallest piece of information we can send over the internet and it is also standardized in many other ways. Because hex encodes each nibble (4 bits) with an alphanumeric and two of those nibbles fit in a byte, there is no problem. For every data with length x, you know you need a hex string with length 2x.

The problem with Proquints is that we are now not talking about things smaller than a byte/octet, but a thing that is bigger: 16 bits. When encoding data with an even amount x of bytes, you know that the output has x/2 proquint words. The problem occurs when the input data has an odd amount of bytes. The formula x/2 will output a fraction, so we came up with two options:

  • @rvagg option:
    • If the input has an even amount of bytes, do nothing.
    • If the input has an odd amount of bytes, cut the (last) proquint in half. So, hex 00 00 would be babab, but hex 00 would be bab. The unused last two bits are set to zero. (In my opinion this would break the Proquint spec at https://arxiv.org/html/0901.4016)
  • @ben221199 option:
    • If the input has an even amount of bytes, do nothing.
    • If the input has an odd amount of bytes, add 0x00 to the end. The data has now an even amount of bytes, so it is possible to encode it as usual. However, to indicate we padded the data with one byte, we use a different prefix. Normally it is p + ro (pro), but when the data input has odd bytes, it will become p + or (por).

@dsw
Copy link

dsw commented Jul 18, 2024 via email

@dsw
Copy link

dsw commented Jul 18, 2024 via email

@ben221199
Copy link
Contributor Author

Seems possible to use letters that are not yet in use by the spec. However, doesn't that break current spec and existing implementations?

And how does 0qba[y]a work? Where does 0q (zero Q) come from? I know 00 00 is babab as seen below, so I would only expect the last 2 or 3 letters different. Did you just mean baya?

image

I think the advantage of my option is that we don't have to change the proquint spec (which is focussed on 16-bit chunks in its core) and that the padding indication is outside of this encoding/decoding.

@dsw
Copy link

dsw commented Jul 18, 2024 via email

@ben221199
Copy link
Contributor Author

Ah, I never said in the spec that I intended the unambiguous indicator of a proquint to be the prefix "0q", to parallel "0x" for hex.

Ah, fair enough. Seems fine to me. (Multibase uses a different prefix of course.)

The "y" is in square brackets because you do not write it, you just say it.

So, I see baa, and I say baya? 🤔 CVV indeed makes 8 bits, where CVC makes 10 bits. It makes sense in some way, but many implementations have to be changed. (And I care about backward compatability.)

I think changing the prefix to indicate length is frankly a really bad idea: the format has not changed and that is what a prefix usually indicates.

We wouldn't indicate length, we would indicate padding. The Multibase prefix letter p still stays p. The letters ro don't have a meaning other then making the prefix sound like pro, as in proquint. My first idea was using the inverse of it, or, to make por. But, because the data is "padded", it would also be possible to use ad, which makes pad. 😎 At least, the conclusion is we only change the meaningless second and third letters in this case.

@ben221199
Copy link
Contributor Author

Now I'm thinking, it seems you want this 16-bit problem to be solved in Proquint itself, so that it will support 8-bit from now on too. But is it a problem if Proquint just happens to be 16-bit chunked and we have to solve the problem elsewhere? I tried to solve it using the Multibase prefix pro/por/pad, but that doesn't work with 0q. If you want to solve it for 0q too, then I understand your option to change the spec, but in any other case, I think it is too drastic a change.

@dsw
Copy link

dsw commented Jul 19, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants