Update test files #125
wow, proquint too. I've been tempted for a long time to hack on that in JS but never got around to it. I'll verify, but it'll depend on inspiration levels and my ability to pull myself away from more pressing things, plz be patient |
I did read the spec and wrote some JS thing like this:
var CONSONANTS = "bdfghjklmnprstvz".split("");
var VOWELS = "aiou".split("");
// Encodes a binary string (one char per byte, char codes 0-255) as proquints.
function encode(str){
  if(str.length%2!==0){
    // JS has no built-in Exception class; Error is the standard one.
    throw new Error("Input length should be a multiple of 16 bits (2 bytes)");
  }
  var shorts = str.length/2;
  var parts = [];
  for(var i=0;i<shorts;i++){
    // Combine two bytes into one big-endian 16-bit value.
    var value = str.charCodeAt(i*2)*256 + str.charCodeAt(i*2+1);
    // 16 bits = consonant (4) + vowel (2) + consonant (4) + vowel (2) + consonant (4).
    parts.push(CONSONANTS[value>>12 & 0xF]+VOWELS[value>>10 & 0x3]+CONSONANTS[value>>6 & 0xF]+VOWELS[value>>4 & 0x3]+CONSONANTS[value>>0 & 0xF]);
  }
  return parts.join('-');
}
After that, I patched it a little, because it wasn't perfect. The |
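For checking round trips against the encoder above, a matching decoder could look roughly like the following. This is only a sketch: `decodeWord` and `decode` are hypothetical names, not taken from any of the implementations mentioned in this thread, and it assumes strict 5-letter words joined with `-`.

```javascript
// Hypothetical inverse of the encode() sketch; names are illustrative only.
const CONSONANTS = "bdfghjklmnprstvz";
const VOWELS = "aiou";

function decodeWord(word) {
  // A standard proquint word is consonant-vowel-consonant-vowel-consonant.
  if (word.length !== 5) {
    throw new Error("Expected a 5-letter proquint word, got: " + word);
  }
  const tables = [CONSONANTS, VOWELS, CONSONANTS, VOWELS, CONSONANTS];
  let value = 0;
  for (let i = 0; i < 5; i++) {
    const index = tables[i].indexOf(word[i]);
    if (index < 0) {
      throw new Error("Unexpected letter '" + word[i] + "' in: " + word);
    }
    // Consonants carry 4 bits, vowels carry 2 bits.
    const bits = tables[i] === CONSONANTS ? 4 : 2;
    value = (value << bits) | index;
  }
  return value; // 16-bit unsigned integer
}

function decode(proquint) {
  const words = proquint.split("-");
  const bytes = new Uint8Array(words.length * 2);
  words.forEach((word, i) => {
    const value = decodeWord(word);
    bytes[i * 2] = value >> 8; // high byte first (big-endian)
    bytes[i * 2 + 1] = value & 0xff;
  });
  return bytes;
}
```

With the example from the proquint spec, `decode("lusab-babad")` gives the bytes 127, 0, 0, 1.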
My hand-rolled proquint implementation at multiformats/js-multiformats#292 gives the following vectors:
Left some additional notes in multiformats/js-multiformats#292, but other reference implementations that can do more than just 16 and 32 bit numbers would be nice to try against. |
The Proquint exercise makes me think we should update the multibase RFC doc for it to account for uneven byte inputs; I had to make a choice about that and I think it's a reasonable one. |
My outputs for those values (when odd, I added
Seems your system works. In order to support 8-bit chunks instead of the 16-bit chunks now, I think we should contact the original inventor of the thing. If we just do it ourselves, we break spec and possibly compatibility with other systems. I don't like that. |
Okay, another option that doesn't break proquint spec, but gives us the possibility to indicate an odd amount of bytes would be this: The How do we tell the system that the content has an odd number of bytes? Well, like this: First we decide what byte we want for the padding. I think Because the |
So:
|
That's an interesting option, although I still prefer mine. cvc is still pronounceable and 3 vs 5 make it very clear that there's truncation. But let's go fishing a bit more! I'll check with some folks who have some connection with the source of proquint in the multibase table. Then maybe we just go ask dsw's opinion over at https://github.com/dsw/proquint. But first I did some digging and here's what I discovered:
And that's where we are. But I still see no evidence of needing to deal with the odd byte problem. But let's fish for opinions! |
Well, it is the only Multibase encoding that doesn't support odd bytes yet. In my opinion, a good Multibase encoding should be able to encode any data. For Proquints that is now not the case. I think there are 3 options:
Tagging @dsw and @randomwalker here. |
I disagree that it breaks the proquint spec by shortening the last section. Any block of 3 characters using the same rules is still pronounceable and I don't think you can read the spec as being particularly strict about this, any more than its use being only applicable to 32-bit integers which seems to be what the main implementations are being used for. So I'm still in favour of keeping pro as a stable prefix so as not to have two parsing branches at the beginning for a multiformats consumer, and just shortening the last block to 3 (cvc) characters. |
I see: the problem is what happens if the number of bits gets cut off in the middle of a proquint or in the middle of a proquint character?
The problem really is that you are implicitly encoding the length of the string in the length of the characters. Hex has the same problem: the length is expected to be a multiple of 4 bits.
The solution that CPUs pick is they just zero-extend or sign-extend out to the machine word size and then they encode the length otherwise. Is this somehow not an option?
Daniel
|
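To make the zero-extension idea from Daniel's reply concrete: one possible reading is to pad the input to a 16-bit boundary and carry the true byte length separately. The sketch below assumes exactly that; `zeroExtend`, `truncate` and the returned object shape are made up for illustration, and where the length would actually be stored is left open, as it is in the thread.

```javascript
// Sketch only: zero-extend to a 16-bit boundary and keep the real length
// out of band. Function names and the returned shape are hypothetical.
function zeroExtend(bytes) {
  const paddedLength = bytes.length + (bytes.length % 2);
  const padded = new Uint8Array(paddedLength); // zero-filled by default
  padded.set(bytes);
  return { padded, byteLength: bytes.length };
}

function truncate(padded, byteLength) {
  // Drop the padding byte again using the separately stored length.
  return padded.slice(0, byteLength);
}
```

The padded buffer can then go through any 16-bit proquint encoder unchanged; the open question in this thread is only how the consumer learns `byteLength`.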
Hi @dsw, I see your point with hexadecimal. Many encodings/bases in multibase are not able to properly encode and decode nibbles without padding. However, in the computer world we have decided that 8 bits (octet) is the smallest piece of information we can send over the internet and it is also standardized in many other ways. Because hex encodes each nibble (4 bits) with an alphanumeric and two of those nibbles fit in a byte, there is no problem. For every data with length x, you know you need a hex string with length 2x.
The problem with Proquints is that we are now not talking about things smaller than a byte/octet, but a thing that is bigger: 16 bits. When encoding data with an even amount x of bytes, you know that the output has x/2 proquint words. The problem occurs when the input data has an odd amount of bytes. The formula x/2 will output a fraction, so we came up with two options:
- @rvagg option:
  - If the input has an even amount of bytes, do nothing.
  - If the input has an odd amount of bytes, cut the (last) proquint in half. So, hex 00 00 would be babab, but hex 00 would be bab. The unused last two bits are set to zero. (In my opinion this would break the Proquint spec at https://arxiv.org/html/0901.4016)
- @ben221199 option:
  - If the input has an even amount of bytes, do nothing.
  - If the input has an odd amount of bytes, add 0x00 to the end. The data has now an even amount of bytes, so it is possible to encode it as usual. However, to indicate we padded the data with one byte, we use a different prefix. Normally it is p + ro (pro), but when the data input has odd bytes, it will become p + or (por).
|
Hm. I think changing the prefix would be weird. I think we should keep the prefix but extend the encoding. Let me think about it.
|
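The two odd-byte options Ben lists above can be compared side by side with a small sketch. The function names are hypothetical, and since the thread does not pin down how the prefix is joined to the body, `encodePadded` returns the two separately rather than guessing.

```javascript
const CONSONANTS = "bdfghjklmnprstvz";
const VOWELS = "aiou";

// Standard 16-bit word: consonant-vowel-consonant-vowel-consonant.
function encodeWord(value) {
  return (
    CONSONANTS[(value >> 12) & 0xf] +
    VOWELS[(value >> 10) & 0x3] +
    CONSONANTS[(value >> 6) & 0xf] +
    VOWELS[(value >> 4) & 0x3] +
    CONSONANTS[value & 0xf]
  );
}

// Encodes all complete byte pairs; a trailing odd byte is ignored here.
function encodePairs(bytes) {
  const words = [];
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    words.push(encodeWord((bytes[i] << 8) | bytes[i + 1]));
  }
  return words;
}

// @rvagg option: shorten the last word to consonant-vowel-consonant
// (10 bits), with the two unused low bits set to zero.
function encodeTruncated(bytes) {
  const words = encodePairs(bytes);
  if (bytes.length % 2 === 1) {
    const value = bytes[bytes.length - 1] << 2; // 8 data bits + 2 zero bits
    words.push(
      CONSONANTS[(value >> 6) & 0xf] +
        VOWELS[(value >> 4) & 0x3] +
        CONSONANTS[value & 0xf]
    );
  }
  return words.join("-");
}

// @ben221199 option: pad with 0x00 and signal the padding via the prefix.
function encodePadded(bytes) {
  const odd = bytes.length % 2 === 1;
  const padded = odd ? [...bytes, 0x00] : [...bytes];
  return { prefix: odd ? "por" : "pro", body: encodePairs(padded).join("-") };
}
```

For the single byte 00, `encodeTruncated([0x00])` yields `bab`, while `encodePadded([0x00])` yields the prefix `por` with the body `babab`, matching the examples in the comment.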
Encode the last 2 bits using another vowel, preceded by a spoken but
unwritten "y". To be canonical, do this only in the case where the
encoding would require a consonant at the end of a string but not enough
bits remain to make one. Now we can encode any string of even bit length.
The point of proquints was to be easily pronounceable and understandable,
and the vowels "a", "i", "o", "u" all satisfy those properties when next to
each other. I made the encoding alternate consonant-vowel mostly just to
prevent unpronounceable consonant clusters.
You could still end up with weird vowel clusters. However, (1) those are
not that bad to say, and (2) we could have a convention like Hungarian
where you often put a "j" between them. Since "j" is used but "y" is
unused we could use that instead. Actually writing the "y" would make the
width irregular. However we could make the "y" pronounced without being
written whenever there are two vowels in a row.
Note that doing this requires the following pronunciation rules to prevent
ambiguity:
* g = hard g as in "golf",
* j = hard j as in "just",
* y = soft y as in "yes".
So in your example: 0x00 = 0qba[y]a.
Daniel
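One way to read the trailing-vowel proposal above, for whole-byte input: a leftover byte becomes consonant (4 bits) + vowel (2 bits) + vowel (2 bits), because only two bits remain where a consonant would normally go. The sketch below is an interpretation, not the updated spec; `encodeTrailingVowel` is a hypothetical name, the `0q` prefix is left out, and the spoken "y" is never written.

```javascript
const CONSONANTS = "bdfghjklmnprstvz";
const VOWELS = "aiou";

function encodeTrailingVowel(bytes) {
  const words = [];
  let i = 0;
  // Complete byte pairs use the ordinary 5-letter word.
  for (; i + 1 < bytes.length; i += 2) {
    const value = (bytes[i] << 8) | bytes[i + 1];
    words.push(
      CONSONANTS[(value >> 12) & 0xf] +
        VOWELS[(value >> 10) & 0x3] +
        CONSONANTS[(value >> 6) & 0xf] +
        VOWELS[(value >> 4) & 0x3] +
        CONSONANTS[value & 0xf]
    );
  }
  if (i < bytes.length) {
    // One byte left: the last 2 bits cannot fill a consonant,
    // so they are written as a second vowel instead.
    const b = bytes[i];
    words.push(CONSONANTS[b >> 4] + VOWELS[(b >> 2) & 0x3] + VOWELS[b & 0x3]);
  }
  return words.join("-");
}
```

Under this reading, `encodeTrailingVowel([0x00])` gives `baa`, which would be pronounced "ba[y]a" as in the example above.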
|
Seems possible to use letters that are not yet in use by the spec. However, doesn't that break current spec and existing implementations? And how does 0qba[y]a work? Where does 0q (zero Q) come from? I know 00 00 is babab, so I would only expect the last 2 or 3 letters to be different. Did you just mean baya? I think the advantage of my option is that we don't have to change the proquint spec (which is focused on 16-bit chunks at its core) and that the padding indication is outside of this encoding/decoding. |
Ah, I never said in the spec that I intended the unambiguous indicator of a proquint to be the prefix "0q", to parallel "0x" for hex.
The "y" is in square brackets because you do not write it, you just say it.
I think changing the prefix to indicate length is frankly a really bad idea: the format has not changed and that is what a prefix usually indicates.
So I am going to update the spec to clarify all of this. I appreciate your pointing out the problem to me.
|
Ah, fair enough. Seems fine to me. (Multibase uses a different prefix of course.)
So, I see
We wouldn't indicate length, we would indicate padding. The Multibase prefix letter |
Now I'm thinking, it seems you want this 16-bit problem to be solved in Proquint itself, so that it will support 8-bit from now on too. But is it a problem if Proquint just happens to be 16-bit chunked and we have to solve the problem elsewhere? I tried to solve it using the Multibase prefix pro/por/pad, but that doesn't work with 0q. If you want to solve it for 0q too, then I understand your option to change the spec, but in any other case, I think it is too drastic a change. |
I never said that Proquints is 16-bit based. I said Proquints are "Identifiers that are Readable, Spellable, and Pronounceable". You are trying to maintain a constraint that you made up and that makes no sense to maintain.
Lexing two vowels in a row is the *least* drastic change and *is* backwards compatible. There is no ambiguity in lexing the format, which makes it backwards compatible. If you just allow any letter in any position, the code actually gets simpler, rather than adding any special case for the new semantics.
Changing the prefix is a very odd choice and I would like you to please not make it. I have literally never seen someone change a prefix to indicate a length. It is frankly a bad factoring to associate a prefix with a string length; I have literally never seen anyone even consider doing that.
|
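The remark that allowing any letter in any position makes the code simpler can be illustrated with a position-independent decoder: every consonant contributes 4 bits, every vowel 2 bits, and bytes are emitted as they fill up. This is a sketch of that idea only; `decodeAnyLetters` is a hypothetical name, and rejecting input that does not end on a byte boundary is an assumption made here, not something stated in the thread.

```javascript
const CONSONANTS = "bdfghjklmnprstvz";
const VOWELS = "aiou";

function decodeAnyLetters(text) {
  let acc = 0; // bits collected but not yet emitted
  let accBits = 0; // how many bits are in acc
  const bytes = [];
  for (const ch of text) {
    if (ch === "-") continue; // separators carry no data
    let index = CONSONANTS.indexOf(ch);
    let width = 4;
    if (index < 0) {
      index = VOWELS.indexOf(ch);
      width = 2;
    }
    if (index < 0) {
      throw new Error("Unexpected letter: " + ch);
    }
    acc = (acc << width) | index;
    accBits += width;
    if (accBits >= 8) {
      // Emit the top 8 bits and keep the remainder.
      accBits -= 8;
      bytes.push((acc >> accBits) & 0xff);
      acc &= (1 << accBits) - 1;
    }
  }
  if (accBits !== 0) {
    // Assumption: leftover bits are treated as an error.
    throw new Error("Input does not end on a byte boundary");
  }
  return Uint8Array.from(bytes);
}
```

Existing 5-letter words decode as before (`lusab-babad` gives 127, 0, 0, 1), and a word ending in two vowels such as `baa` decodes to the single byte 0 without a special case. A 3-letter consonant-vowel-consonant word like `bab` carries 10 bits, so this particular sketch rejects it.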
This pull request adds test vectors for Base 45 and Proquint.