decoding into an existing buffer #21
A concrete possible design: Constrain it to only the streaming case. Add an `into` option which takes a buffer and writes into it. And then I wasn't originally planning to add a hex version of the streaming API, but we'd presumably need to add it to support this design. |
Trying to work out how this would be used, I think it's a little cumbersome? That's because "there's still more to process" comes both from the decoder (you didn't have enough buffer to decode into) and from the chunking stream (you didn't give me enough bytes to decode):

```js
let chunks = ["mpmZmZmZuT+am", "ZmZmZnJPzMz", "MzMz", "M9M/mpmZmZmZ", "2T8="];
let reader = new Float64Array(1);
let { buffer, byteLength } = reader;
let output = [];
let extra;
for (let c of chunks) {
// Need to loop multiple times per chunk, in case the BYOB isn't large
// enough.
for (let i = 0; true; i++) {
let written = 0;
// I think we need to pass extra as the input param on the second call?
    // Because extra gets prepended to the input, and if we passed the same `c`
// again it'd get all screwy.
let next = i === 0 ? c : extra;
let nextExtra = i === 0 ? extra : undefined;
let decoded = Uint8Array.fromPartialBase64(next, {
more: true,
extra: nextExtra,
into: buffer,
});
({ extra, written } = decoded);
    for (let j = 0; j < written / byteLength; j++) {
      output.push(reader[j]);
    }
// If written is less than our buffer length, it means the decoder didn't
// receive enough input bytes.
    if (written < buffer.byteLength) break;
}
}
// Need to flush the decoder
// …
console.log(output);
```

Because of that, the method could instead return an `offset` into the input, so that the very same `c` (and the same `extra`) can be passed on every call:

```js
let chunks = ["mpmZmZmZuT+am", "ZmZmZnJPzMz", "MzMz", "M9M/mpmZmZmZ", "2T8="];
let reader = new Float64Array(1);
let { buffer, byteLength } = reader;
let output = [];
let extra;
for (let c of chunks) {
  let offset = 0;
  // Need to loop multiple times per chunk, in case the BYOB isn't large
  // enough.
  for (let i = 0; true; i++) {
    let written = 0;
// Look Ma, the same `c` is used! And the same `extra`!
let decoded = Uint8Array.fromPartialBase64(c, {
more: true,
extra,
into: buffer,
offset
});
({ extra, written, offset } = decoded);
    for (let j = 0; j < written / byteLength; j++) {
      output.push(reader[j]);
    }
// If written is less than our buffer length, it means the decoder didn't
// receive enough input bytes.
    if (written < buffer.byteLength) break;
}
}
// Need to flush the decoder
// …
console.log(output);
```

|
I would also expect an `offset` option, yes. Agreed it's a bit cumbersome but still seems worth it to me to avoid copies. To @jridgewell's example:
I don't understand the "and from the chunking stream" part. If I'm decoding a series of chunks, it seems fairly rare I would even know ahead of time how many bytes the final result ought to be. Wouldn't you write the decoding loop to be done when the input is exhausted? |
Yes. The outer loop (`for (let c of chunks)`) ends when the input is exhausted; the inner loop is about the output filling up. But the way the inner loop needs to handle the changed params to `fromPartialBase64` is the cumbersome part. PS. How would flushing the decoder work at the end? Because we're passing `more: true` on every call. |
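For what it's worth, the "flush the decoder" step that both examples elide might look something like the following. Treating an empty input with `more: false` as the flush signal is my assumption - the thread hasn't pinned down a flush call - and the variables are the ones from the examples above:

```js
// Hypothetical final call, in the shape of the second example: no new
// input and `more: false`, so the decoder drains whatever is left in
// `extra` (including handling any `=` padding).
let { written } = Uint8Array.fromPartialBase64("", {
  more: false,
  extra,
  into: buffer,
});
for (let j = 0; j < written / byteLength; j++) {
  output.push(reader[j]);
}
console.log(output);
```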
@jridgewell I don't understand how passing `extra` as the input on the second call is supposed to work. And yes, I also expected an `offset` option. |
See the sample code:

```js
// I think we need to pass extra as the input param on the second call?
// Because extra gets prepended to the input, and if we passed the same `c`
// again it'd get all screwy.
let next = i === 0 ? c : extra;
let nextExtra = i === 0 ? extra : undefined;
let decoded = Uint8Array.fromPartialBase64(next, {
more: true,
extra: nextExtra,
into: buffer,
});
```

We're not just passing `extra` through untouched in that design - it has to become the input on the second call. With an `offset`, the call stays the same every time:

```js
// Look Ma, the same `c` is used! And the same `extra`!
let decoded = Uint8Array.fromPartialBase64(c, {
more: true,
extra,
into: buffer,
offset
});
```

I'm fine with calling the option something other than `offset` if that's clearer.
I think you're responding to a different point than the one I was making. I'm saying there are 2 offsets: an offset into the input string (how far through `c` we've read), and an offset into the output buffer (where the next decoded byte lands). |
Agreed about the need for an `offset`. But playing with these examples, I'm realizing I was wrong about part of what I said above. @jridgewell I don't understand what the distinction between the two offsets buys you, though. Also, on the performance point:
I was envisioning allocating a new subarray each time you flush - why would reusing a single one be faster? |
It shouldn't be, but it's what I found while performance testing sourcemap-codec. It's faster to have a single subarray and shift the buffer data around than to allocate a new subarray every time you want to flush. This might be because of the increased memory pressure 🤷 |
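For context, the pattern being described looks roughly like this; the buffer size and the `consume` callback are illustrative, not taken from sourcemap-codec:

```js
// One long-lived scratch buffer: hand the filled prefix to the consumer,
// then shift the unconsumed tail to the front with copyWithin, instead of
// allocating a fresh subarray on every flush.
const scratch = new Uint8Array(1024);
let used = 0;

function flush(consume, keep) {
  consume(scratch, used - keep);            // consumer reads scratch[0 .. used-keep)
  scratch.copyWithin(0, used - keep, used); // shift leftover bytes to index 0
  used = keep;
}
```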
"Now" meaning in the current design, rather than the one I'm suggesting here? That is, it's the 0-3 characters at the end which don't fit cleanly into a 4-character chunk to decode. So in the overflow case it would be That makes sense, I think. But suppose the output buffer has only a single byte of space left, and I guess if |
As a slight tweak, instead of having a separate returned `offset`, the caller can track the offsets itself and the method can just report whether it finished the chunk:

```js
let offsetIntoOutputBuffer = 0;
let extra;
for (let c of chunks) {
// Need to loop multiple times per chunk, in case the BYOB isn't large
// enough.
  let written, finishedChunk;
let bytesReadFromThisChunk = 0;
while (true) {
// if `finishedChunk` is false, then the returned `extra` is the passed `extra`
0, { extra, written, finishedChunk } = Uint8Array.fromPartialBase64(c, {
more: true,
extra,
into: buffer,
bytesReadFromThisChunk,
offsetIntoOutputBuffer,
});
bytesReadFromThisChunk += written;
offsetIntoOutputBuffer += written;
if (!finishedChunk) {
// output is full; consume it
console.log(reader);
offsetIntoOutputBuffer = 0;
} else {
// finished this chunk; time for the next one
// make sure that the current value of `extra` is passed on to the next chunk
break;
}
}
}
```

Edit: on reflection I feel a little weird about returning the input `extra` back out unchanged when the chunk isn't finished, so here's a variant that gives the returned value its own name:

```js
let { extra: next, written, finishedChunk } = Uint8Array.fromPartialBase64(c, {
more: true,
extra,
into: buffer,
bytesReadFromThisChunk,
offsetIntoOutputBuffer,
});
bytesReadFromThisChunk += written;
offsetIntoOutputBuffer += written;
if (!finishedChunk) {
// output is full; consume it
// assert: extra is null (or the empty string idk)
console.log(reader);
offsetIntoOutputBuffer = 0;
} else {
// finished this chunk; time for the next one
// make sure that the current value of `extra` is passed on to the next chunk
extra = next;
break;
}
```

This is marginally more annoying since you need to conditionally reassign `extra` at the end of each chunk. I don't feel strongly about it either way. |
Yes, the current design.
Either seems fine.
I think this might be two issues? What happens if we can't flush 4 bytes to the output, and what happens if the output's free space can only take some of them. For the first, I would return the 4 bytes in `extra`. For the second, it's more like: do we want to handle partial writes of the bytes stored in `extra`?
Do the bytes consumed from the input always equal the bytes written to the output? I think that can't be the case, because of the 4/3s encoding. You need both. But if your concern is with a bunch of return values, we can change `extra` into an opaque `state` value that carries the offsets too. |
Sorry, which 4 bytes?
It can also happen if you are getting very close to the end of the buffer. E.g. if you're reading 4 Float64 values at a time, that buffer will be 32 bytes long, which is not a multiple of 3 and so will not fill up precisely. If your first chunk is 43 base64 characters, that will leave your metaphorical cursor (the write offset into the output) at byte 30 of 32, partway through a 4-character unit; see the sketch after this comment.
There are a few problems with that.
When I say "bytes consumed from the input", what I mean is "if you decoded this whole string to a list of bytes, how many of those bytes would we already have written to the output". The input is characters, not bytes.
Not hugely concerned; it's more the issue above, about being able to resume partway through a 4-character unit.
I am very much trying to avoid having opaque values here. The fact that everything is serializable is extremely useful and I would be loath to give it up. Let me try to step back a little: with the design I've suggested in this comment and the subsequent comment, you get a few nice properties:
Those are good properties to have. I'm happy to explore alternative ways of achieving them but I wouldn't really want to give any of them up (well, except the third, which I don't care that much about). |
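To make the arithmetic from the 32-byte example above concrete (plain numbers, no decode calls):

```js
// 4 Float64s means a 32-byte output buffer; 32 is not a multiple of 3.
const chunkChars = 43;                       // base64 characters in the chunk
const units = Math.floor(chunkChars / 4);    // 10 complete 4-char units
const bytesWritten = units * 3;              // 30 bytes decoded
const spaceLeft = 32 - bytesWritten;         // 2 bytes of room left
// The next 4-char unit would produce 3 bytes, which can't fit in 2, so
// the decoder has to stop (and later resume) partway through a unit.
console.log(units, bytesWritten, spaceLeft); // 10 30 2
```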
@michaelficarra proposes having a "number of bytes left to read from this chunk" int instead of an "is this chunk finished" boolean, so you can resize your buffer. I was initially on board with that, but the edge case where your chunk has a length which is not a multiple of 4 characters - say, 7 - bothers me. If you call this and pass it a sufficiently-large buffer, you'll get the 3 bytes out from the first set of 4 characters, and then the remaining 3 characters are returned in `extra` - so what should the "bytes left" int say? Thoughts? |
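A concrete trace of that 7-character edge case; the split follows from the 4-characters-to-3-bytes rule:

```js
const chunk = "QUJDREV";                        // 7 chars: "QUJD" + "REV"
const fullUnits = Math.floor(chunk.length / 4); // 1 complete unit
const bytesOut = fullUnits * 3;                 // 3 bytes ("ABC") written
const leftover = chunk.slice(fullUnits * 4);    // "REV" would land in `extra`
// "REV" carries 18 bits = 2 whole bytes + 2 bits, but none of it can be
// emitted until the next chunk arrives, so it's unclear whether "bytes
// left to read from this chunk" should report 0 or 2.
console.log(bytesOut, leftover);                // 3 "REV"
```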
I'm going to mostly ignore the responses because I think we're opening up too many threads for this to be productive. Let's go back to your example in #21 (comment). I think this is mostly fine, but I think there's 2 flaws. First, the use of:

```js
bytesReadFromThisChunk += written;
offsetIntoOutputBuffer += written;
```

This must be a bug, because we'll have written 3 bytes for every 4 bytes of input. There has to be a separate `read` count for the input side.

Second, the need for `finishedChunk`: if the method returns how much input it `read`, the caller already knows when the chunk is finished by comparing against `c.length`.

I think the goal of having the caller own both offsets is right, though. Rewriting your example with this in mind, I'd be happy with this API:

```js
let offsetIntoOutputBuffer = 0;
let state;
for (let c of chunks) {
let written, read;
let bytesReadFromThisChunk = 0;
// Need to loop multiple times per chunk, in case the BYOB isn't large
// enough.
while (bytesReadFromThisChunk < c.length) {
0, { state, written, read } = Uint8Array.fromPartialBase64(c, {
more: true,
      state,
into: buffer,
bytesReadFromThisChunk,
offsetIntoOutputBuffer,
});
bytesReadFromThisChunk += read;
offsetIntoOutputBuffer += written;
    if (offsetIntoOutputBuffer === buffer.byteLength) {
// output is full; consume it
offsetIntoOutputBuffer = 0;
}
}
}
```

|
Well, this is a tangent if we're discussing your design, but - in my design, that's not a bug. The idea was that it indicates the number of bytes you've decoded from the input, not the number of characters you've consumed. The input is a list of characters, not of bytes; only the output is a list of bytes. (Representing "number of characters you've read from the input" is awkward because you can consume part of a character, but you've always decoded a whole number of bytes.) If we do something like you're proposing, we'd want to use a different name than `bytesReadFromThisChunk`, since what's being counted is characters.
Interesting thought. I'm open to alternatives to a string here, though I'd be a little uncomfortable with using different bits in a JS number to indicate different things. That seems pretty unprecedented in the web platform (outside of flags in WebGL etc, but even then every bit represents a flag). If engines aren't worried about an extra allocation here, I'd feel better about a small `{ bits, value }` object. Also, why at most 6? I guess the idea is that it counts as "reading" a character as soon as you've used any bits from it? In that case, you only have at most 4 unprocessed bits, since each character in the input represents 6 bits in the output, and if you've used at least 1 bit from a character, then you've used at least 2. (For a given 4-character unit of input, you either read 1 byte of output, which uses the first character plus 2 bits from the second character, leaving 4 bits remaining in that character; or you read 2 bytes of output, which uses the first two characters plus 4 bits from the third character, leaving 2 bits remaining in that character; or you read 3 bytes of output, which uses all 4 characters.) So, let me write a further iteration:

```js
let offsetIntoOutputBuffer = 0;
let extra;
for (let c of chunks) {
let written, read;
let offsetIntoInput = 0;
// Need to loop multiple times per chunk, in case the BYOB isn't large
// enough.
while (offsetIntoInput < c.length) {
0, { extra, written, read } = Uint8Array.fromPartialBase64(c, {
more: true,
extra,
into: buffer,
offsetIntoInput,
offsetIntoOutputBuffer,
});
// extra is one of `undefined`, `{ bits: 2, value: v1 }`, or `{ bits: 4, value: v2 }`, where v1 ranges from 0 to 3 and v2 ranges from 0 to 15
offsetIntoInput += read;
offsetIntoOutputBuffer += written;
    if (offsetIntoOutputBuffer === buffer.byteLength) {
// output is full; consume it
offsetIntoOutputBuffer = 0;
}
}
}
```

I note that this affects the design even if you're not writing into an existing buffer - in my original conception, if you were decoding a length 7 string and not providing an output buffer, you'd get a length 3 buffer plus an `extra` holding the remaining 3 characters; with this iteration you'd instead get a length 5 buffer plus 2 bits of `extra`. That change is fine when going in this direction. But it's awkward if we make the symmetric change to the encoding direction. |
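A quick check of that accounting, one output byte at a time within a 4-character unit:

```js
// Each base64 character carries 6 bits; each output byte needs 8.
for (let bytes = 1; bytes <= 3; bytes++) {
  const bitsUsed = bytes * 8;
  const charsTouched = Math.ceil(bitsUsed / 6);      // chars with any bits used
  const bitsRemaining = charsTouched * 6 - bitsUsed; // unprocessed bits
  console.log(bytes, charsTouched, bitsRemaining);
  // 1 2 4  (1 byte out leaves 4 bits)
  // 2 3 2  (2 bytes out leave 2 bits)
  // 3 4 0  (3 bytes out leave nothing)
}
```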
But wouldn't it be incorrect after the first set of 4 input chars are decoded? We can't both increment the input offset by 4 and the output offset by 4, because you can only output 3 bytes.
VLQ is super common (source maps), and it's 1 bit for continue and 5 bits for data (per byte, it's even less efficient than base64, because it uses base64…). And I've definitely done bit-packing with flags in JS.
Maybe I misunderstand base64 encoding, but I'd assumed if you fed 1 input char, you'd have 6 bits of partial data and 2 bits of waste. We'd need another input char to get the final 2 bits of the first output byte. For 2 input chars, you have 12 bits of data, so 1 output byte and 4 bits of partial data. With 3 input chars, you have 18 bits of data, so 2 output bytes and 2 bits of partial data. And at 4 input chars, 24 bits of data, so 3 bytes of output and 0 bits of partial data. Doesn't that mean we have at most 6 bits of partial data?
Yes. Once you read an input char, the read pointer increments. If you don't have enough data to process, that char's bits are stored as partial data. You don't need to reread that char again.
LGTM, though I'd love to save that allocation and just use an int.
SGTM.
I don't think that's a hard requirement? Like, unpadded base64 is everywhere and it's not a multiple of 4.
Is 0-2 bytes necessary? I think every set of 1-4 input bytes could be handled by writing a base64 char output and keeping 0-6 bits + 2 bits, right? |
The idea is that after producing the first 3 bytes of output, the resulting `read` is 4: all four input characters have been consumed, while `written` is 3, so the two offsets advance by different amounts.
VLQ is a specific well-known format, not a bespoke bitpacking, and I suspect most JS programmers won't have encountered bespoke packings before. I've certainly done it but I don't want to generalize from my experience. I could be convinced bit packing is fine here, but I'd like to get more input.
Ah, I see my mistake. In the case that you only have a single character of input, you can't produce even a single byte of output, so I was imagining in this case that you would leave that character unread entirely rather than carrying its 6 bits as partial data, which is how I arrived at 4.
It's not a hard requirement, but it's convenient for the ultimate destination not to need to handle unpadded base64. Not every system is set up to deal with it.
Right, it's only necessary if we're trying to have the property of producing only correctly padded base64 (and no extra padding characters on non-final chunks, so the consumer can concatenate if they want to). Which as I say is not a hard requirement, but it's convenient. And since you can fit 0-2 full bytes plus a 2-bit length in uint32, I don't see much reason not to go with that even if we do use bitpacking, since it gives us the nice-to-have property of correctly padding. |
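If bit packing did win out, one illustrative layout for 0-2 whole bytes plus a 2-bit length in a uint32 - nothing here is settled, and the field order is arbitrary:

```js
// state = [byte1 (8 bits)][byte0 (8 bits)][length (2 bits)]
function packState(bytes) {               // bytes: an array of 0-2 byte values
  let s = bytes.length;                   // 2-bit length in the low bits
  for (let i = 0; i < bytes.length; i++) s |= bytes[i] << (2 + 8 * i);
  return s >>> 0;
}
function unpackState(s) {
  const out = [];
  for (let i = 0; i < (s & 0b11); i++) out.push((s >>> (2 + 8 * i)) & 0xff);
  return out;
}
console.log(unpackState(packState([0x41, 0xff]))); // [65, 255]
```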
PTAL: #26. Here are the two complete examples from the playground for that PR.

Single input:

```js
let input = 'SGVsbG8gV29ybGQ=';
let buffer = new Uint8Array(4);
let outputOffset = 0;
let inputOffset = 0;
let extra, written, read;
while (inputOffset < input.length) {
0, { extra, written, read } = Uint8Array.fromPartialBase64(input, {
extra,
into: buffer,
inputOffset,
outputOffset,
});
inputOffset += read;
outputOffset += written;
if (outputOffset === buffer.length) {
// output is full; consume it
console.log([...buffer]);
outputOffset = 0;
}
}
if (outputOffset > 0) {
console.log([...buffer].slice(0, outputOffset));
}
```

Chunked input:

```js
let chunks = ['VGhpcyB', 'pcyBzb2', '1lIGV4YW1w', 'bGUgZGF0YS4='];
let output = new Uint8Array(4);
let outputOffset = 0;
let extra;
for (let chunk of chunks) {
let written, read;
let inputOffset = 0;
while (inputOffset < chunk.length) {
0, { extra, written, read } = Uint8Array.fromPartialBase64(chunk, {
extra,
into: output,
inputOffset,
outputOffset,
});
inputOffset += read;
outputOffset += written;
if (outputOffset === output.length) {
// output is full; consume it
console.log([...output]);
outputOffset = 0;
}
}
}
if (outputOffset > 0) {
console.log([...output].slice(0, outputOffset));
}
```

|
I think this should mirror `TextEncoder.prototype.encodeInto`. |
V8's feelings here are that BYOB will be important even in the initial proposal for adoption. Copying is fast, but a "scratch buffer" implementation for the decoding loop needs to be possible to write without extra allocations of buffers. With the upcoming simplified streaming API proposal, Anne's suggestion of a `fromBase64Into`-style method seems like the right shape. |
Adding this in #33; PTAL.

```js
let { read, written } = Uint8Array.fromBase64Into(string, target, { outputOffset });
```
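Assuming the signature sketched above (see #33 for the actual semantics), usage would look something like:

```js
const target = new Uint8Array(16);
const { read, written } = Uint8Array.fromBase64Into(
  "SGVsbG8gV29ybGQ=",
  target,
  { outputOffset: 0 },
);
// `read`: characters consumed from the string; `written`: bytes stored.
console.log(read, written, [...target.subarray(0, written)]);
```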
Can we extend the decoding methods to write to an existing buffer?
The simplest thing I can imagine is an `into` option, which takes a buffer and writes into it, and returns that buffer instead of a new one. You can make a view if you want an offset. And it could throw (before doing any writes) if the data doesn't fit (or not).

Unfortunately you're probably also going to want the number of bytes written, and figuring that out in the case of padding in base64 is annoying. In the streaming case we could return a `written` count, at least, but there's no obvious place to put a count in the non-streaming case. Maybe the option should only be for the streaming case?

Alternatively we could have new methods for writing to an existing buffer, at the cost of having new methods.
I'm not sure either is worth it. cc @syg
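For illustration, the simplest version described above might behave like this; the `into` option and the throw-before-writing behavior are exactly the hypotheticals under discussion, not a settled API:

```js
const buf = new Uint8Array(32);
// Hypothetically decodes into `buf` and returns that same buffer rather
// than allocating; could throw up front if the decoded bytes wouldn't fit.
const result = Uint8Array.fromBase64("SGVsbG8gV29ybGQ=", { into: buf });
console.log(result === buf); // true under this design
```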