encoding into an existing buffer #21

Closed
bakkot opened this issue May 15, 2023 · 21 comments · Fixed by #33

Comments

@bakkot
Collaborator

bakkot commented May 15, 2023

Can we extend the decoding methods to write to an existing buffer?

The simplest thing I can imagine is a into option, which takes a buffer and writes into it, and returns that buffer instead of a new one. You can make a view if you want an offset. And it could throw (before doing any writes) if the data doesn't fit (or not).

Unfortunately you're probably also going to want the number of bytes written, and figuring that out in the case of padding in base64 is annoying. In the streaming case we could return a written count, at least, but there's no obvious place to put a count in the non-streaming case. Maybe the option should only be for the streaming case?

Alternatively we could have new methods for writing to an existing buffer, at the cost of having new methods.

I'm not sure either is worth it. cc @syg
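
To make the shape concrete, here's a rough userland sketch of how an `into` option might behave (the `decodeBase64Into` helper and its exact semantics are hypothetical, not part of any proposal):

```javascript
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decode a padded base64 string into an existing Uint8Array, returning the
// buffer and the number of bytes written; throws before writing anything
// if the decoded data would not fit.
function decodeBase64Into(str, into) {
  const chars = str.replace(/=+$/, "");
  const byteLength = Math.floor((chars.length * 6) / 8);
  if (byteLength > into.length) throw new RangeError("data does not fit");
  let bits = 0, bitCount = 0, written = 0;
  for (const ch of chars) {
    bits = ((bits << 6) | B64.indexOf(ch)) & 0xffffff; // keep at most 24 bits
    bitCount += 6;
    if (bitCount >= 8) {
      bitCount -= 8;
      into[written++] = (bits >> bitCount) & 0xff;
    }
  }
  return { buffer: into, written };
}

const buf = new Uint8Array(16);
const { written } = decodeBase64Into("SGVsbG8=", buf); // decodes "Hello"
```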

@bakkot
Collaborator Author

bakkot commented May 15, 2023

A concrete possible design:

Constrain it to only the streaming case. Add a written property to the returned object to tell you how many bytes actually got written. Add an optional into argument which takes an existing Uint8Array (if omitted a new one is allocated; if included the passed array is the result in the returned object).

And then extra gets used both for "input wasn't fully padded" and for "data didn't fit into the passed array". It can be arbitrarily large, so hopefully engines would be able to implement it as a view of the original string (which is easier because it's a suffix, not an arbitrary slice).

I wasn't originally planning to add a hex version of the streaming API, but we'd presumably need to add it to support this design.

@jridgewell
Member

jridgewell commented May 16, 2023

Trying to work out how this would be used, I think it's a little cumbersome? That's because the "there's still more to process" is both coming from the decoder (you didn't have enough buffer to decode) and from the chunking stream (you didn't give me enough bytes to decode):

let chunks = ["mpmZmZmZuT+am", "ZmZmZnJPzMz", "MzMz", "M9M/mpmZmZmZ", "2T8="];
let reader = new Float64Array(1);
let { buffer, byteLength } = reader;

let output = [];
let extra;

for (let c of chunks) {
  // Need to loop multiple times per chunk, in case the BYOB isn't large
  // enough.
  for (let i = 0; true; i++) {
    let written = 0;

    // I think we need to pass extra as the input param on the second call?
    // Because extra gets prepended to the input, and if we passed the same `c`
    // again it'd get all screwy.
    let next = i === 0 ? c : extra;
    let nextExtra = i === 0 ? extra : undefined;
    let decoded = Uint8Array.fromPartialBase64(next, {
      more: true,
      extra: nextExtra,
      into: buffer,
    });

    ({ extra, written } = decoded);

    for (let i = 0; i < written / byteLength; i++) {
      output.push(reader[i]);
    }

    // If written is less than our buffer length, it means the decoder didn't
    // receive enough input bytes.
    if (written < buffer.byteLength) break;
  }
}

// Need to flush the decoder
// …

console.log(output);

Because extra has a double meaning, it requires extra (🥁) handling, too. I think it'd be better if extra is just an opaque structure, with an offset being returned/used so that the decoder knows how far forward to skip when c is recalled:

let chunks = ["mpmZmZmZuT+am", "ZmZmZnJPzMz", "MzMz", "M9M/mpmZmZmZ", "2T8="];
let reader = new Float64Array(1);
let { buffer, byteLength } = reader;

let output = [];
let extra;

for (let c of chunks) {
  // Need to loop multiple times per chunk, in case the BYOB isn't large
  // enough.
  for (let i = 0; true; i++) {
    let offset = 0, written = 0;

    // Look Ma, the same `c` is used! And the same `extra`!
    let decoded = Uint8Array.fromPartialBase64(c, {
      more: true,
      extra,
      into: buffer,
      offset
    });

    ({ extra, written, offset } = decoded);

    for (let i = 0; i < written / reader.byteLength; i++) {
      output.push(reader[i]);
    }

    // If written is less than our buffer length, it means the decoder didn't
    // receive enough input bytes.
    if (written < buffer.byteLength) break;
  }
}

// Need to flush the decoder
// …

console.log(output);

@syg

syg commented May 23, 2023

Constrain it to only the streaming case. Add a written property to the returned object to tell you how many bytes actually got written. Add an optional into argument which takes an existing Uint8Array (if omitted a new one is allocated; if included the passed array is the result in the returned object).

I would also expect an offset to pass written back to the next chunk without creating a new Uint8Array window.

Agreed it's a bit cumbersome but still seems worth it to me to avoid copies.

To @jridgewell's example:

That's because the "there's still more to process" is both coming from the decoder (you didn't have enough buffer to decode) and from the chunking stream (you didn't give me enough bytes to decode):

I don't understand the "and from the chunking stream" part. If I'm decoding a series of chunks, it seems fairly rare I would even know ahead of time how many bytes the final result ought to be. Wouldn't you write the decoding loop to be done when the input is exhausted?

@jridgewell
Member

jridgewell commented May 23, 2023

Wouldn't you write the decoding loop to be done when the input is exhausted?

Yes. The outer loop (for (let c of chunks) {}) is to be expected. And having to pass an extra param from one call in the loop to the next is totally fine, that's just us maintaining the state.

But the way the inner loop needs to handle the changed params to Uint8Array.fromPartialBase64 is unexpected, and it's because the buffer may not have enough room to handle decoding. The extra now has to maintain state both from chunk to chunk and for the buffer-out-of-space case, and we need to call the next inner-loop iteration with the changed extra as the input in the out-of-space case.


PS. How would fromPartialBase64 handle a large output buffer? Eg, I've already allocated 1mb of buffer, and I want to decode each chunk into the appropriate spot in that buffer? I think we'd need an intoOffset to keep track of how much has been written into the buffer already, so that the next call can start at that index.

Because we're passing into: ArrayBuffer, we can't use Uint8Array's byteOffset to give an offset view of the buffer. And having to create new Uint8Arrays, even over the same array buffer, is soo slow, so I wouldn't recommend doing this either.

@syg

syg commented May 23, 2023

@jridgewell I don't understand how propagating extra from one chunk to the next is any different than propagating written from one chunk to the next. That's still "not enough buffer to hold the decoded bytes", not "not enough input to fill all the space in the buffer" issue (which I still think is not an issue), no?

And yes, I also expected an offset per above.

@jridgewell
Member

jridgewell commented May 23, 2023

I don't understand how propagating extra from one chunk to the next is any different than propagating written from one chunk to the next.

See the sample code:

    // I think we need to pass extra as the input param on the second call?
    // Because extra gets prepended to the input, and if we passed the same `c`
    // again it'd get all screwy.
    let next = i === 0 ? c : extra;
    let nextExtra = i === 0 ? extra : undefined;
    let decoded = Uint8Array.fromPartialBase64(next, {
      more: true,
      extra: nextExtra,
      into: buffer,
    });

We're not just passing extra, we're manipulating the input params to handle the fact that extra also means out-of-space. If we pass offset, this becomes a much more reasonable:

    // Look Ma, the same `c` is used! And the same `extra`!
    let decoded = Uint8Array.fromPartialBase64(c, {
      more: true,
      extra,
      into: buffer,
      offset
    });

I'm fine with calling fromPartialBase64 multiple times on the same chunk, and passing multiple state values into the options bag so the next call knows how to resume. I'm not fine with having to rearrange the params on the 2nd+ calls to handle the same chunk.


And yes, I also expected an offset per above.

I think you're responding to my:

I think we'd need an intoOffset to keep track of how much has been written into the buffer already, so that the next call can start at that index

I'm saying there are 2 offsets:

  1. Offset into c, the number of bytes (and maybe ⅓ bytes?) that have been processed in this chunk
  2. Offset into buffer, the index to start writing bytes in my preallocated buffer.

@bakkot
Collaborator Author

bakkot commented Jun 17, 2023

Agreed about the need for an intoOffset which specifies the offset within into at which to start writing.

But playing with these examples, I'm realizing I was wrong above when I said that extra was necessarily a suffix. In the case that the data didn't fit, it might be that the passed extra didn't fit, in which case the resulting extra would need to cover both the remaining characters from the passed extra as well as the entire passed chunk. That's probably bad. So probably my original suggestion doesn't work. I'll try to come up with something else.

@jridgewell I don't understand what extra is supposed to be in your modified example.

Also,

Because we're passing into: ArrayBuffer, we can't use Uint8Array's byteOffset to give an offset view of the buffer. And having to create new Uint8Arrays, even over the same array buffer, is soo slow, so I wouldn't recommend doing this either.

I was envisioning into: Uint8Array, not into: ArrayBuffer. I'm surprised this is slow; I don't see why it would fundamentally need to be.

@jridgewell
Member

@jridgewell I don't understand what extra is supposed to be in your modified example.

extra is the same as it is now.

I'm surprised this is slow; I don't see why it would fundamentally need to be be.

It shouldn't be, but it's what I found while performance testing sourcemap-codec. It's faster to have a single subarray and shift the buffer data around than to allocate a new subarray every time you want to flush. This might be because of the increased memory pressure 🤷

@bakkot
Collaborator Author

bakkot commented Jun 17, 2023

extra is the same as it is now.

"Now" meaning in the current design, rather than the one I'm suggesting here?

That is, it's the 0-3 characters at the end which don't fit cleanly into a 4-character chunk to decode. So in the overflow case it would be null (or the empty string), since we haven't gotten to the end of the chunk yet. (Except when extra was passed and doesn't fit; see below.) Is that right?

That makes sense, I think. But suppose the output buffer has only a single byte of space left, and extra is three characters long. What do you return in that case? Or do you just throw?

I guess if offset is in terms of bytes decoded from the input thus far, and starts from the beginning of extra (if it's passed), then you could return the same extra and an offset of 1. Those are kind of weird units for offset (since I'd normally assume it's a number of characters), but it does work, I think.

@bakkot
Collaborator Author

bakkot commented Jun 17, 2023

As a slight tweak, instead of having a separate offset return value, you could have a boolean to indicate whether the input has been consumed (up to the last partial 4-character chunk), with the expectation that if not you'll call it again with the same arguments but incrementing the offset into the input chunk by written:

let offsetIntoOutputBuffer = 0;
for (let c of chunks) {
  // Need to loop multiple times per chunk, in case the BYOB isn't large
  // enough.
  let extra, written, finishedChunk;
  let bytesReadFromThisChunk = 0;
  while (true) {
    // if `finishedChunk` is false, then the returned `extra` is the passed `extra`
    ({ extra, written, finishedChunk } = Uint8Array.fromPartialBase64(c, {
      more: true,
      extra,
      into: buffer,
      bytesReadFromThisChunk,
      offsetIntoOutputBuffer,
    }));
    bytesReadFromThisChunk += written;
    offsetIntoOutputBuffer += written;

    if (!finishedChunk) {
      // output is full; consume it
      console.log(reader);
      offsetIntoOutputBuffer = 0;
    } else {
      // finished this chunk; time for the next one
      // make sure that the current value of `extra` is passed on to the next chunk
      break;
    }
  }
}

Edit: on reflection I feel a little weird about returning the input extra, even though it's convenient. I think I'd prefer it be null, in which case you have to do something more like

let { extra: next, written, finishedChunk } = Uint8Array.fromPartialBase64(c, {
  more: true,
  extra,
  into: buffer,
  bytesReadFromThisChunk,
  offsetIntoOutputBuffer,
});
bytesReadFromThisChunk += written;
offsetIntoOutputBuffer += written;

if (!finishedChunk) {
  // output is full; consume it
  // assert: extra is null (or the empty string idk)
  console.log(reader);
  offsetIntoOutputBuffer = 0;
} else {
  // finished this chunk; time for the next one
  // make sure that the current value of `extra` is passed on to the next chunk
  extra = next;
  break;
}

This is marginally more annoying since you need to conditionally reassign extra. But you're already doing stuff conditionally, so it's not much different.

I don't feel strongly about extra and could be convinced to stick with the round-trip option.

@jridgewell
Member

"Now" meaning in the current design, rather than the one I'm suggesting here?

Yes, the current design.

So in the overflow case it would be null (or the empty string), since we haven't gotten to the end of the chunk yet. (Except when extra was passed and doesn't fit; see below.) Is that right?

Either seems fine.

But suppose the output buffer has only a single byte of space left, and extra is three characters long. What do you return in that case? Or do you just throw?

I think this might be two issues? What happens if we can't flush 4 bytes to the output, and what happens if the output's buffer.length < 4?

For the first, I would return the 4 bytes in extra, and have the caller go around again once they're done processing what we could write.

In the second, it's more like do we want to handle partial writes of the bytes stored in extra, and it can only happen if the buffer is too small to ever be useful. I would just validate that at the beginning.

I guess if offset is in terms of bytes decoded from the input thus far, and starts from the beginning of extra (if it's passed), then you could return the same extra and an offset of 1

I think offset needs to be explicitly the index of the input string that we've already processed, not tied to extra.


As a slight tweak, instead of having a separate offset return value, you could have a boolean to indicate whether the input has been consumed (up to the last partial 4-character chunk), with the expectation that if not you'll call it again with the same arguments but incrementing the offset into the input chunk by written:

Does the bytes consumed from input always equal the bytes written to output? I think that can't be the case, because of the 4/3s encoding. You need both.

But if your concern is with a bunch of return values, we can change extra to be an opaque state instead. It could encode both the 1-3 bytes that were seen but not processed, and the offset into the source string.

@bakkot
Collaborator Author

bakkot commented Jun 20, 2023

For the first, I would return the 4 bytes in extra, and have the caller go around again once they're done processing what we could write.

Sorry, which 4 bytes? extra is a string, in this API, not a list of bytes. And right now it can only contain strings of length 0 to 3.

In the second, it's more like do we want to handle partial writes of the bytes stored in extra, and it can only happen if the buffer is too small to ever be useful.

It can also happen if you are getting very close to the end of the buffer. E.g. if you're reading 4 Float64 values at a time, that buffer will be 32 bytes long, which is not a multiple of 3 and so will not fill up precisely. If your first chunk is 43 base64 characters, that will leave your metaphorical cursor (intoOffset in my sample code) pointing at index 30 within the buffer, which will only let it contain another two bytes. But you might quite reasonably want to fill up those last two bytes before moving on.
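
Under the string-valued extra design, the arithmetic for that example works out as follows (each complete 4-character unit yields 3 bytes):

```javascript
// 43 base64 characters against a 32-byte output buffer:
const chars = 43;
const fullUnits = Math.floor(chars / 4); // 10 complete 4-char units decoded
const bytesWritten = fullUnits * 3;      // 30 bytes written to the buffer
const leftoverChars = chars % 4;         // 3 chars carried over in `extra`
const freeBytes = 32 - bytesWritten;     // only 2 bytes of space remain
```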

I think offset needs to be explicitly the index of the input string that we've already process, not tied to extra.

There's a few problems with that.

  1. Indexes into the string are not actually able to represent the thing you want them to - if you read 2 bytes out of a base64-encoded string, the place to read the next byte is partway through the 3rd character of the string. I guess technically we could point at that character and have it mean "partway through that character" based on its position mod 4 within the string, but that's kinda gross.

  2. Since extra is logically prepended to the input before processing, it doesn't really make sense to index the input and also make use of extra - that would amount to splicing extra halfway into the input chunk, which is never the thing you want.

  3. That doesn't let you walk partway into extra, which means that you can't decode only a single byte or two bytes, which is necessary to be able to completely fill buffers.
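
To illustrate point 1: each base64 character carries 6 bits while each output byte needs 8, so a whole number of output bytes usually corresponds to a fractional number of input characters:

```javascript
// Characters consumed after producing n output bytes
// (8 bits per byte, 6 bits per base64 character): n * 8 / 6.
const charsConsumedAfter = (bytes) => (bytes * 8) / 6;
// 1 byte  -> 1.33... characters
// 2 bytes -> 2.66... characters
// 3 bytes -> exactly 4 characters
```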

Does the bytes consumed from input always equal the bytes written to output?

When I say "bytes consumed from the input", what I mean is "if you decoded this whole string to a list of bytes, how many of those bytes would we already have written to the output". The input is characters, not bytes.

But if your concern is with a bunch of return values

Not hugely concerned; it's more the issue above, about being able to resume partway through a 4-character unit.

we can change extra to be an opaque state instead

I am very much trying to avoid having opaque values here. The fact that everything is serializable is extremely useful and I would be loath to give it up.


Let me try to step back a little: with the design I've suggested this comment and the subsequent comment, you get a few nice properties:

  • you can completely fill up a buffer before having to move on; the user never has to figure out how to handle mostly-full buffers (which can get really annoying)
  • everything is serializable - just normal JS values, nothing opaque
  • you still only need one fromPartialBase64 call within your nested loop

Those are good properties to have. I'm happy to explore alternative ways of achieving them but I wouldn't really want to give any of them up (well, except the third, which I don't care that much about).

@bakkot
Collaborator Author

bakkot commented Jun 21, 2023

@michaelficarra proposes having a "number of bytes left to read from this chunk" int instead of "is this chunk finished" boolean, so you can resize your buffer.

I was initially on board with that, but consider the edge case where your chunk has a length which is not a multiple of 4 characters - say, 7. If you call this and pass it a sufficiently large buffer, you'll get the 3 bytes out from the first set of 4 characters, and then the remaining 3 characters are returned in extra. You're supposed to pass those into the subsequent call, so "finishedChunk" would be true in that case, but it's not obvious to me that this is the natural interpretation of a "number of bytes left to read from this chunk" integer.

Thoughts?

@jridgewell
Member

jridgewell commented Jun 22, 2023

I'm going to mostly ignore the responses because I think we're opening up too many threads for this to be productive.

Let's go back to your example in #21 (comment). I think this is mostly fine, but I think there's 2 flaws:

First, the use of written:

bytesReadFromThisChunk += written;
offsetIntoOutputBuffer += written;

This must be a bug, because we'll have written 3 bytes for every 4 bytes of input. There has to be a separate read and written count, right?


Second, the need for finishedChunk. This is because you're set on extra being a string, and that you need to see all 4 bytes of input before anything can be written. (You hint at this again in #21 (comment)).

If extra were not literal string chars, then this isn't an issue. If you've only seen 3 bytes, then you can write 2 bytes, and store the partial 2 bits in extra (or state). The bytesReadFromThisChunk will equal chunk.length, and we know we can move on.

I think the goal of having the extra be serializable is not important at all. But if so, this could literally be a uint8 (at most 6 bits of unprocessed input, 2 bits to signal how many unprocessed bits are present).


Rewriting your example with this in mind, I'd be happy with this API:

let offsetIntoOutputBuffer = 0;
let state;
for (let c of chunks) {
  let written, read;
  let bytesReadFromThisChunk = 0;

  // Need to loop multiple times per chunk, in case the BYOB isn't large
  // enough.
  while (bytesReadFromThisChunk < c.length) {
    ({ state, written, read } = Uint8Array.fromPartialBase64(c, {
      more: true,
      extra: state,
      into: buffer,
      bytesReadFromThisChunk,
      offsetIntoOutputBuffer,
    }));

    bytesReadFromThisChunk += read;
    offsetIntoOutputBuffer += written;

    if (offsetIntoOutputBuffer === buffer.length) {
      // output is full; consume it
      offsetIntoOutputBuffer = 0;
    }
  }
}

@bakkot
Collaborator Author

bakkot commented Jun 22, 2023

This must be a bug, because we'll have written 3 bytes for every 4 bytes of input.

Well, this is a tangent if we're discussing your design, but - in my design, that's not a bug. The idea was that it indicates the number of bytes you've decoded from the input, not the number of characters you've consumed. The input is a list of characters, not of bytes; only the output is a list of bytes. (Representing "number of characters you've read from the input" is awkward because you can consume part of a character, but you've always decoded a whole number of bytes.)

If we do something like you're proposing, we'd want to use a different name than bytesReadFromThisChunk, since it doesn't represent a number of bytes.

If extra were not literal string chars, then this isn't an issue. [...] this could literally be an uint8 (at most 6 bits of unprocessed input, 2 bits to signal how many unprocessed bits are present).

Interesting thought.

I'm open to alternatives to a string, here, though I'd be a little uncomfortable with using different bits in a JS number to indicate different things. That seems pretty unprecedented in the web platform (outside of flags in WebGL etc, but even then every bit represents a flag). If engines aren't worried about an extra allocation here, I'd feel better about a { bits, value } pair.

Also, why at most 6? I guess the idea is that it counts as "reading" a character as soon as you've used any bits from it? In that case, you only have at most 4 unprocessed bits, since each character in the input represents 6 bits in the output, and if you've used at least 1 bit from a character, then you've used at least 2.

(For a given 4-character unit of input, you either read 1 byte of output, which uses the first character plus 2 bits from the second character, leaving 4 bits remaining in that character; or you read 2 bytes of output, which uses the first two characters plus 4 bits from the third character, leaving 2 bits remaining in that character; or you read 3 bytes of output, which uses all 4 characters.)


So, let me write a further iteration:

let offsetIntoOutputBuffer = 0;
let extra;
for (let c of chunks) {
  let written, read;
  let offsetIntoInput = 0;

  // Need to loop multiple times per chunk, in case the BYOB isn't large
  // enough.
  while (offsetIntoInput < c.length) {
    ({ extra, written, read } = Uint8Array.fromPartialBase64(c, {
      more: true,
      extra,
      into: buffer,
      offsetIntoInput,
      offsetIntoOutputBuffer,
    }));
    // extra is one of `undefined`, `{ bits: 2, value: v1 }`, or `{ bits: 4, value: v2 }`, where v1 ranges from 0 to 3 and v2 ranges from 0 to 15

    offsetIntoInput += read;
    offsetIntoOutputBuffer += written;

    if (offsetIntoOutputBuffer === buffer.length) {
      // output is full; consume it
      offsetIntoOutputBuffer = 0;
    }
  }
}

I note that this affects the design even if you're not writing into an existing buffer - in my original conception, if you were decoding a length 7 string and not providing an output buffer, you'd get a length 3 buffer plus an extra containing the 3 unused characters. With this design, you'd instead get a length 5 buffer plus an extra representing the 2 unused bits from the 7th character.

That change is fine when going in this direction. But it's awkward if we make the symmetric change to extra in toPartialBase64 - we'd end up creating base64 strings whose length is not a multiple of 4. I guess we're not obligated to make them precisely symmetric, though; we could have extra in toPartialBase64 be a JS number representing 0-2 full bytes from the underlying buffer, instead of getting into partial cases. Fortunately there's no "writing to an existing limited-size buffer" case to worry about when outputting strings.

@jridgewell
Member

jridgewell commented Jun 22, 2023

Well, this is a tangent if we're discussing your design, but - in my design, that's not a bug. The idea was that it indicates the number of bytes you've decoded from the input, not the number of characters you've consumed

But wouldn't it be incorrect after the first set of 4 input chars are decoded? We can't both increment the input offset by 4 and the output offset by 4, because you can only output 3 bytes.

though I'd be a little uncomfortable with using different bits in a JS number to indicate different things. That seems pretty unprecedented in the web platform (outside of flags in WebGL etc, but even then every bit represents a flag).

VLQ is super common (source maps), and it's 1 bit for continue and 5 bits for data (per byte, it's even less efficient than base64, because it uses base64…). And I've definitely done bit-packing with flags in JS.

Also, why at most 6? … In that case, you only have at most 4 unprocessed bits, since each character in the input represents 6 bits in the output, and if you've used at least 1 bit from a character, then you've used at least 2.

Maybe I misunderstand base64 encoding, but I'd assumed if you fed 1 input char, you'd have 6 bits of partial data and 2 bits of waste. We'd need another input char to get the final 2 bits of the first output byte. For 2 input chars, you have 12 bits of data, so 1 output byte and 4 bits of partial data. With 3 input chars, you have 18 bits of data, so 2 output bytes and 2 bits of partial data. And at 4 input chars, 24 bits of data, so 3 bytes of output and 0 bits of partial data.

Doesn't that mean we have at most 6 bits of partial data?
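
That arithmetic can be sanity-checked directly (6 bits per input character, 8 bits per output byte):

```javascript
// Full output bytes and leftover partial bits after reading n base64 chars.
function partial(chars) {
  const bits = chars * 6;
  return { fullBytes: Math.floor(bits / 8), partialBits: bits % 8 };
}
// 1 char  -> 0 bytes, 6 partial bits
// 2 chars -> 1 byte,  4 partial bits
// 3 chars -> 2 bytes, 2 partial bits
// 4 chars -> 3 bytes, 0 partial bits
```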

I guess the idea is that it counts as "reading" a character as soon as you've used any bits from it?

Yes. Once you read an input char, the read pointer increments. If you don't have enough data to process, that char's bits are stored as partial data. You don't need to reread that char again.


So, let me write a further iteration:

LGTM, though I'd love to save that allocation and just use a int.


With this design, you'd instead get a length 5 buffer plus an extra representing the 2 unused bits from the 7th character.

SGTM.

But it's awkward if we make the symmetric change to extra in toPartialBase64 - we'd end up creating base64 strings whose length is not a multiple of 4.

I don't think that's a hard requirement? Like, unpadded base64 is everywhere and it's not a multiple of 4.

we could have extra in toPartialBase64 be a JS number representing 0-2 full bytes from the underlying buffer, instead of getting into partial cases. Fortunately there's no "writing to an existing limited-size buffer" case to worry about when outputting strings.

Is 0-2 bytes necessary? I think every set of 1-4 input bytes could be handled by writing a base64 char output and keeping 0-6 bits + 2 bits, right?

@bakkot
Collaborator Author

bakkot commented Jun 22, 2023

But wouldn't it be incorrect after the first set of 4 input chars are decoded? We can't both increment the input offset by 4 and the output offset by 4, because you can only output 3 bytes.

The idea is that after producing the first 3 bytes of output, the resulting written would be 3, and you'd increment both the "total bytes written" and "total bytes written using this chunk" counters by 3. Though again, that's not applicable to your design, which wants to count characters of input consumed from this chunk, rather than bytes produced from this chunk.

VLQ is super common (source maps), and it's 1 bit for continue and 5 bits for data (per bytes, it's even less efficient than base64, because it uses base64…). And I've definitely done bit-packing with flags in JS.

VLQ is a specific well-known format, not a bespoke bitpacking, and I suspect most JS programmers won't have encountered bespoke packings before. I've certainly done it but I don't want to generalize from my experience.

I could be convinced bit packing is fine here, but I'd like to get more input.

I'd assumed if you fed 1 input char, you'd have 6 bits of partial data and 2 bits of waste. We'd need another input char to get the final 2 bits of the first output byte. For 2 input chars, you have 12 bits of data, so 1 output byte and 4 bits of partial data. With 3 input chars, you have 18 bits of data, so 2 output bytes and 2 bits of partial data. And at 4 input chars, 24 bits of data, so 3 bytes of output and 0 bits of partial data.

Ah, I see my mistake. In the case that you only have a single character of input, you can't produce even a single byte of output, so I was imagining in this case that you would have read and written both be 0. But actually you'd want to represent all 6 bits from the final character in extra, and have read be 1 so you can move on to the next chunk. So yes, up to 6 bits.

I don't think that's a hard requirement? Like, unpadded base64 is everywhere and it's not a multiple of 4.

It's not a hard requirement, but it's convenient for the ultimate destination not to need to handle unpadded base64. Not every system is set up to deal with it.

Is 0-2 bytes necessary? I think every set of 1-4 input bytes could be handled by writing a base64 char output and keeping 0-6 bits + 2 bits, right?

Right, it's only necessary if we're trying to have the property of producing only correctly padded base64 (and no extra padding characters on non-final chunks, so the consumer can concatenate if they want to). Which as I say is not a hard requirement, but it's convenient.

And since you can fit 0-2 full bytes plus a 2-bit length in uint32, I don't see much reason not to go with that even if we do use bitpacking, since it gives us the nice-to-have property of correctly padding.
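
As an illustration (the layout here is an assumption for the sketch, not a proposed format), 0-2 leftover bytes plus a 2-bit count do pack comfortably into a uint32:

```javascript
// Hypothetical packing: low 2 bits hold how many leftover bytes are present,
// the bits above hold the 0-2 bytes themselves.
function packExtra(bytes) { // bytes: array of 0-2 values in 0-255
  let packed = bytes.length;
  for (let i = 0; i < bytes.length; i++) {
    packed |= bytes[i] << (2 + 8 * i);
  }
  return packed >>> 0;
}

function unpackExtra(packed) {
  const count = packed & 0b11;
  const bytes = [];
  for (let i = 0; i < count; i++) {
    bytes.push((packed >> (2 + 8 * i)) & 0xff);
  }
  return bytes;
}
```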

@bakkot
Collaborator Author

bakkot commented Jun 26, 2023

PTAL: #26

I did notice that the more parameter is not necessary, which simplifies things a little. I've kept extra as an object for the moment, but changing that to use bit-packing instead would be straightforward and doesn't affect how it's used at all.

I also didn't change toPartialBase64 because, it being a prototype method, you need to be able to call extra.toPartialBase64 for the final portion. So I've only changed fromPartialBase64. I haven't yet added any way to decode hex to an existing buffer, but probably will add fromPartialHex later.

Here's the two complete examples from the playground for that PR:

Single input:

let input = 'SGVsbG8gV29ybGQ=';
let buffer = new Uint8Array(4);
let outputOffset = 0;
let inputOffset = 0;
let extra, written, read;

while (inputOffset < input.length) {
  ({ extra, written, read } = Uint8Array.fromPartialBase64(input, {
    extra,
    into: buffer,
    inputOffset,
    outputOffset,
  }));

  inputOffset += read;
  outputOffset += written;

  if (outputOffset === buffer.length) {
    // output is full; consume it
    console.log([...buffer]);
    outputOffset = 0;
  }
}
if (outputOffset > 0) {
  console.log([...buffer].slice(0, outputOffset));
}

Chunked input:

let chunks = ['VGhpcyB', 'pcyBzb2', '1lIGV4YW1w', 'bGUgZGF0YS4='];
let output = new Uint8Array(4);
let outputOffset = 0;
let extra;
for (let chunk of chunks) {
  let written, read;
  let inputOffset = 0;

  while (inputOffset < chunk.length) {
    ({ extra, written, read } = Uint8Array.fromPartialBase64(chunk, {
      extra,
      into: output,
      inputOffset,
      outputOffset,
    }));

    inputOffset += read;
    outputOffset += written;

    if (outputOffset === output.length) {
      // output is full; consume it
      console.log([...output]);
      outputOffset = 0;
    }
  }
}
if (outputOffset > 0) {
  console.log([...output].slice(0, outputOffset));
}

@annevk
Member

annevk commented Jul 4, 2023

I think this should mirror TextEncoder's encodeInto() as closely as possible.
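
For reference, `TextEncoder.prototype.encodeInto` is an existing standard API with roughly this shape: it writes into a caller-supplied `Uint8Array` and returns a `{ read, written }` pair:

```javascript
// encodeInto consumes as much of the string as fits in the target buffer
// and reports both sides of the exchange.
const target = new Uint8Array(8);
const { read, written } = new TextEncoder().encodeInto("hello", target);
// read === 5 (UTF-16 code units consumed), written === 5 (UTF-8 bytes produced)
```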

@syg

syg commented Nov 22, 2023

V8's feelings here are that BYOB will be important for adoption even in the initial proposal. Copying is fast, but a "scratch buffer" implementation of the decoding loop needs to be possible to write without extra buffer allocations.

With the upcoming simplified streaming API proposal, Anne's suggestion of a fromBase64Into(string, u8) (name up to bikeshedding) sounds good to me. It can return a { read, written } where the read is the index cursor into the source string. Ideally there would be no result object allocation either, but I see no way around that.

@bakkot
Collaborator Author

bakkot commented Dec 13, 2023

Adding this in #33; PTAL.

let { read, written } = Uint8Array.fromBase64Into(string, target, { outputOffset })

@bakkot bakkot closed this as completed in #33 Jan 7, 2024