Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: Encoding.TryGetBytes/Chars #84425

Closed
stephentoub opened this issue Apr 6, 2023 · 8 comments · Fixed by #85120
Closed

[API Proposal]: Encoding.TryGetBytes/Chars #84425

stephentoub opened this issue Apr 6, 2023 · 8 comments · Fixed by #85120
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Encoding
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Apr 6, 2023

Background and motivation

In .NET Core 2.1 we added an Encoding.GetBytes(ReadOnlySpan<char>, Span<byte>) and Encoding.GetChars(ReadOnlySpan<byte>, Span<char>) that will encode/decode the source into the destination and return how many chars/bytes it wrote. However, the destination span needs to be large enough to store all the written data; if it's too small, the method throws. Encoding doesn't provide a variant that allows for failing without exception when the destination is too small, and that's particularly useful when writing out a larger composed output where you loop to grow the buffer to accommodate more output.

Workarounds today are to first call GetByte/CharCount to determine how much space is required, or to use alternate APIs for the specific encoding in question, e.g. UTF8.FromUtf16 that's an OperationStatus-based API.

API Proposal

namespace System.Text;

public abstract class Encoding
{
    public int GetBytes(ReadOnlySpan<char> chars, Span<byte> bytes);
+   public bool TryGetBytes(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten);

    public int GetChars(ReadOnlySpan<byte> bytes, Span<char> chars);
+   public bool TryGetChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten);
}

The base virtual implementation will use GetByte/CharCount and GetBytes/GetChars, but implementations like UTF8Encoding can then do better in their overrides.

API Usage

if (Encoding.UTF8.TryGetBytes(str, destination, out int bytesWritten))
{
    _length += bytesWritten;
    return true;
}

return false;

Alternative Designs

No response

Risks

No response

@stephentoub stephentoub added area-System.Text.Encoding api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 6, 2023
@stephentoub stephentoub added this to the 8.0.0 milestone Apr 6, 2023
@ghost
Copy link

ghost commented Apr 6, 2023

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

In .NET Core 2.1 we added an Encoding.GetBytes(ReadOnlySpan<char>, Span<byte>) and Encoding.GetChars(ReadOnlySpan<byte>, Span<char>) that will encode/decode the source into the destination and return how many chars/bytes it wrote. However, the destination span needs to be large enough to store all the written data; if it's too small, the method throws. Encoding doesn't provide a variant that allows for failing without exception when the destination is too small, and that's particularly useful when writing out a larger composed output where you loop to grow the buffer to accommodate more output.

Workarounds today are to first call GetByte/CharCount to determine how much space is required, or to use alternate APIs for the specific encoding in question, e.g. UTF8.FromUtf16 that's an OperationStatus-based API.

API Proposal

namespace System.Text;

public abstract class Encoding
{
    public int GetBytes(ReadOnlySpan<char> chars, Span<byte> bytes);
+   public bool TryGetBytes(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten);

    public int GetChars(ReadOnlySpan<byte> bytes, Span<char> chars);
+   public bool TryGetChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten);
}

API Usage

if (Encoding.UTF8.TryGetBytes(str, destination, out int bytesWritten))
{
    _length += bytesWritten;
    return true;
}

return false;

Alternative Designs

No response

Risks

No response

Author: stephentoub
Assignees: -
Labels:

area-System.Text.Encoding, api-ready-for-review

Milestone: 8.0.0

@stephentoub
Copy link
Member Author

cc: @GrabYourPitchforks, @EgorBo

@stephentoub
Copy link
Member Author

stephentoub commented Apr 7, 2023

I also spoke with @EgorBo about making the UTF8EncodingSealed override of TryGetBytes a JIT intrinsic, and he's prototyped it. When we use TryGetBytes from the Utf8.TryWrite interpolated string handler, this should then enable all of the valid literals created for the interpolated string to be encoded at JIT time once rather than on every append, and also the copies unrolled (effectively turning the implementation of TryGetBytes("literal", dest) into "literal"u8.TryCopyTo(dest)). Anyone else using Encoding.UTF8.TryGetBytes with a literal would also benefit, whether that literal was supplied directly or exposed via inlining.

@eiriktsarpalis
Copy link
Member

that's particularly useful when writing out a larger composed output where you loop to grow the buffer to accommodate more output.

But isn't that suboptimal compared to proactively calculating an upper bound via the GetMaxByteCount/GetMaxCharCount methods?

@stephentoub
Copy link
Member Author

But isn't that suboptimal compared to proactively calculating an upper bound via the GetMaxByteCount/GetMaxCharCount methods?

GetMaxByte/CharCount will typically be a complete overestimate. You can't use the result of those to determine whether the data you have will fit into a span you have, as the space remaining in the span might be perfectly sufficient to store the encoded data but still less than those maxes.

Maybe I'm not understanding your question?

@eiriktsarpalis
Copy link
Member

Maybe our use cases are different, I typically rent/allocate a span after I calculate the max length.

@stephentoub
Copy link
Member Author

stephentoub commented Apr 7, 2023

The use case here is you're given a buffer to write into.

But even if it weren't, the max approach while fast to compute can also result in overallocating by 4x (in the case of UTF8).

@bartonjs
Copy link
Member

bartonjs commented Apr 20, 2023

Video

Corrected the proposal to make these methods virtual. Otherwise, looks good as proposed.

We discussed the names source/destination versus the chars/bytes pattern, but since the existing non-Try method has chars/bytes, we went for local consistency.

namespace System.Text;

public abstract partial class Encoding
{
    public virtual bool TryGetBytes(ReadOnlySpan<char> chars, Span<byte> bytes, out int bytesWritten);
    public virtual bool TryGetChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int charsWritten);
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Apr 20, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 20, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 23, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Encoding
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants