Invalid UTF-8 error handling policy #257

jmacd · 2024-05-10T23:29:18Z

This is a draft following research done into

open-telemetry/opentelemetry-specification#3421

and

open-telemetry/opentelemetry-specification#3950.

XSAM

@jmacd Thanks for inviting me to share my thoughts.

Overall, I like this proposal. We might need to mention how to deal with the current mapping rules: if we convert invalid UTF-8 sequences into �, we probably don't need to convert invalid UTF-8 strings into bytes.

Also, if we expose byte-slice valued attributes in our API, I would prefer to convert invalid UTF-8 into � rather than silently change the type.
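
As a minimal sketch of the replacement behavior discussed here, the following Python snippet decodes raw bytes and substitutes U+FFFD (�) for an invalid byte. The function name `to_valid_utf8` is hypothetical and not part of any OpenTelemetry API; it only illustrates the idea using Python's built-in `errors="replace"` decoding.

```python
def to_valid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


valid = "héllo".encode("utf-8")
invalid = b"abc\xffdef"  # 0xFF can never appear in well-formed UTF-8

unchanged = to_valid_utf8(valid)   # valid input round-trips unchanged
cleaned = to_valid_utf8(invalid)   # "abc\ufffddef": the bad byte becomes U+FFFD
```

The value keeps its string type, which is the point of preferring replacement over a silent switch to bytes.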

text/0000-utf8-handling.md

jmacd · 2024-05-14T15:26:09Z

@XSAM This was discussed in the Spec SIG today. There does not appear to be much support for binary attribute values. I think that is bad for users, but it is not so bad if we automatically correct invalid UTF-8. Therefore, I will move forward with only half of this proposal.

text/0257-utf8-handling.md

joaopgrassi

I think it's great that this is handled. For example, if a receiver uses the bindings we offer at https://github.com/open-telemetry/opentelemetry-proto-java, it will drop the entire batch if anything contains invalid UTF-8.
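
To illustrate why a single bad string can cost a whole batch, here is a hedged Python sketch. It is not the opentelemetry-proto-java code; `decode_batch_strict` is a hypothetical stand-in for any decoder that validates every string field strictly and fails the whole message on the first error.

```python
from typing import List


def decode_batch_strict(raw_values: List[bytes]) -> List[str]:
    """Strictly decode every value; one invalid value raises for the whole batch."""
    return [raw.decode("utf-8") for raw in raw_values]  # errors="strict" is the default


batch = [b"service-a", b"bad\xc3(", b"service-b"]  # second value is malformed

try:
    decoded = decode_batch_strict(batch)
except UnicodeDecodeError:
    decoded = []  # both valid neighbors are lost together with the bad value
```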

text/0257-utf8-handling.md

Co-authored-by: Gerhard Stöbich <[email protected]> Co-authored-by: Joao Grassi <[email protected]>

jmacd · 2024-10-10T23:31:09Z

@open-telemetry/specs-logs-approvers @open-telemetry/specs-metrics-approvers @open-telemetry/specs-trace-approvers

Please consider this updated OTEP.

The changes I have applied:

- The OTel group has already decided not to support byte-valued attributes: document this. (Tough!)
- Specific wording for SDK requirements: SHOULD be opt-out, SHOULD replace invalid sequences with �, etc.
- Specific wording for Collector "behavior": SHOULD be opt-out, SHOULD follow each receiver for validation of external data, not recommended for processor manipulations.

jsuereth · 2024-10-11T12:44:37Z

text/0257-utf8-handling.md

+simple and preserves what valid content can be recovered from the
+data.
+
+#### Dropping data


What's the cost of performing this check on valid strings in the collector?

I see this as a performance tradeoff over where to enforce UTF-8, and my preference would be to push as much as possible to the generation side.

I'll read your alternatives considered, as you probably call this out.

"Generation side" - SDK/Exporter? Then the question becomes "should the collector trust the input" (I think the answer is "no").
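
One rough way to get a feel for the validation cost asked about above is to time a strict decode over already-valid payloads. This is only a sketch in Python, not a benchmark of the Collector; the payload sizes and the `is_valid_utf8` and `time_validation` helpers are assumptions chosen for illustration.

```python
import timeit


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def time_validation(raw: bytes, repeats: int = 1000) -> float:
    """Return the average seconds per validation call for one payload."""
    return timeit.timeit(lambda: is_valid_utf8(raw), number=repeats) / repeats


for size in (16, 256, 4096):
    payload = ("a" * size).encode("utf-8")  # hypothetical attribute value of `size` bytes
    print(f"{size:>5} bytes: {time_validation(payload) * 1e9:.0f} ns per check")
```

The absolute numbers depend on the runtime and hardware, so they say nothing about a specific receiver; the shape of the measurement is what carries over.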

jsuereth · 2024-10-11T12:52:10Z

text/0257-utf8-handling.md

+`rejected_data_points`, or `rejected_log_records` count along with a
+message indicating data loss.
+
+### Survey of existing systems


You should add a few more.

E.g. Java only enforces UTF-8 when attempting to read the bytes into Java's String format. See https://github.com/protocolbuffers/protobuf/blob/0bfe41b27e3dd8a30ae383210d7af10c28a642ea/java/core/src/main/java/com/google/protobuf/Internal.java#L56 for the gory details.
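
The lazy behavior described for Java can be sketched as follows. This is an analogy in Python, not a port of the linked protobuf code: the hypothetical `LazyString` class keeps the wire bytes as received and validates them only when the text is actually requested.

```python
class LazyString:
    """Holds raw bytes; UTF-8 is checked only when .text is first read."""

    def __init__(self, raw: bytes) -> None:
        self.raw = raw
        self._text = None  # decoded lazily

    @property
    def text(self) -> str:
        if self._text is None:
            # Raises UnicodeDecodeError here, not at construction time.
            self._text = self.raw.decode("utf-8")
        return self._text


field = LazyString(b"\xff\xfe")  # accepted without complaint
forwarded = field.raw            # bytes can be passed along untouched

try:
    field.text
    failed_on_read = False
except UnicodeDecodeError:
    failed_on_read = True        # the error only surfaces on read
```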

jsuereth · 2024-10-11T12:58:11Z

text/0257-utf8-handling.md

+send, simply resulting from invalid UTF-8.
+
+Considering whether components could support permissive data
+transport, it appears to be risky and for little reward.  If


I'd still like to understand the cost implications for validating on receivers.

I think the permissive tradeoff is risky: you require ANYONE who needs to interpret a string as UTF-8 to handle failure at that moment.

However, in a risk/reward trade-off, for well-behaving systems, the cost of UTF-8 validation at every endpoint can add up.

I like having validation as opt-in/opt-out, I'm not sure which should be the default though.

How likely do we think utf-8 issues are in practice?

What is the cost of performing this check in collector components?

Personally, I think that, related to your "consider invalid utf8 a bug in a processor" point, we should push responsibility for UTF-8 validity as close to generation as possible, so I'd rather see this as an opt-in feature of OTel than opt-out. BUT I may be missing some context or use cases where this is highly problematic.

I like having validation as opt-in/opt-out, I'm not sure which should be the default though.

+1, I don't just "like" it, I think we SHOULD do this.

Here are my reasons:

1. The collector normally sends the data to some endpoints. Many backend services already perform such validation/correction, so folks might want to do it just once in the backend rather than duplicating the effort.
2. There are cases where data is consumed directly on the collector side (e.g. a collector running in a local data center might decide to trigger a rollback because certain KPI metrics drop during a deployment). I think this is a general pattern for things that run on the Edge.
3. Depending on the ownership of different parts of the system, the parts could be designed to trust each other or not. The Collector needs to provide flexibility.
4. Things could break between the Collector and the backend (e.g. bit flips caused by high-energy particles from the universe, hardware failures), and certain software needs to handle these as part of its design.

Regarding which one should be the default, based on what I've seen in Microsoft across Windows/Office/Azure/etc., I think it should be off-by-default and allow folks to opt-in.

jmacd · 2024-10-11T22:18:00Z

@jsuereth and @reyang I appreciate the feedback. Both of you are, I think, suggesting that we make UTF-8 validation an opt-in instead of an opt-out feature. I support that motion. The most critical thing for me is that if the SDK is configured with a permissive stance (opt-out), the SDK "MUST" configure its underlying technologies to support it.

In other words, opting out does not mean doing nothing; it means explicitly configuring a pipeline to permit invalid UTF-8 unless a user opts in to UTF-8 validation.

When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.
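
The choices discussed in this comment can be sketched as a small policy switch. All names here (`Utf8Policy`, `apply_policy`) are hypothetical and do not come from any OpenTelemetry SDK; the sketch only shows how a permissive default, (a) rejection, and (b) correction differ in outcome.

```python
from enum import Enum
from typing import List, Tuple


class Utf8Policy(Enum):
    PERMISSIVE = "permissive"  # validation not opted in: pass bytes through
    REJECT = "reject"          # option (a): drop items containing invalid UTF-8
    CORRECT = "correct"        # option (b): replace invalid sequences with U+FFFD


def apply_policy(items: List[bytes], policy: Utf8Policy) -> Tuple[List[bytes], int]:
    """Return (items to export, number of rejected items)."""
    if policy is Utf8Policy.PERMISSIVE:
        return list(items), 0

    kept: List[bytes] = []
    rejected = 0
    for raw in items:
        try:
            raw.decode("utf-8")
            kept.append(raw)
        except UnicodeDecodeError:
            if policy is Utf8Policy.REJECT:
                rejected += 1
            else:
                fixed = raw.decode("utf-8", errors="replace")
                kept.append(fixed.encode("utf-8"))
    return kept, rejected


items = [b"ok", b"bad\xff"]
print(apply_policy(items, Utf8Policy.REJECT))   # ([b'ok'], 1)
print(apply_policy(items, Utf8Policy.CORRECT))  # ([b'ok', b'bad\xef\xbf\xbd'], 0)
```

Under rejection, the returned count is the kind of number that could feed a rejected-items counter; under correction nothing is dropped, and the invalid byte is rewritten as the three-byte encoding of U+FFFD.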

reyang · 2024-10-11T22:42:51Z

When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.

I think if we have very limited bandwidth, we should do (b). (a) can be added later if we see a huge demand.
One technical detail: I think for attribute values with string type, we should do some correction, but for attribute names that contain invalid UTF-8, correction could be a very bad idea. I'm a bit on the fence here...
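
One possible concern with correcting attribute names, offered here as an assumption rather than something stated in the comment, is that distinct invalid keys can collapse into the same corrected key. A minimal Python sketch, with hypothetical helper names:

```python
from typing import Dict


def correct(raw: bytes) -> str:
    """Replace invalid UTF-8 sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def correct_attributes(attrs: Dict[bytes, bytes]) -> Dict[str, str]:
    """Correct both keys and values; later keys silently overwrite earlier ones."""
    return {correct(key): correct(value) for key, value in attrs.items()}


attrs = {b"user\xff": b"alice", b"user\xfe": b"bob"}  # two distinct (invalid) keys
corrected = correct_attributes(attrs)

print(len(attrs), len(corrected))  # 2 1
```

Both keys map to the same corrected name, so one value is lost without any error being raised.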

tsloughter · 2024-10-13T17:45:04Z

text/0257-utf8-handling.md

+
+#### No byte-slice valued attribute API
+
+As a caveat, the OpenTelemetry project has previously debated and


Was this ever formally rejected?

I'd like this option because we technically already support it in Erlang. This follows from the fact that the main string type in Erlang/Elixir is binary, combined with the SDK safety mechanism mentioned elsewhere in this doc, which stores invalid UTF-8 in the bytes_value of the proto.

So, because attribute values are already of type binary, the user can pass any binary data they want as an attribute value and it gets used.

I recall it being rejected informally in Spec SIG meetings, but maybe a formal proposal would have better luck. Do you think there would be any chance of that?
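
The safety mechanism described here can be sketched as below. This is written in Python rather than Erlang and is not the Erlang SDK's code; `to_any_value` is a hypothetical helper, and the plain dict stands in for the OTLP `AnyValue` message, whose `string_value` and `bytes_value` field names are the only parts taken from the proto.

```python
def to_any_value(raw: bytes) -> dict:
    """Map a binary to string_value when it is valid UTF-8, else to bytes_value."""
    try:
        return {"string_value": raw.decode("utf-8")}
    except UnicodeDecodeError:
        return {"bytes_value": raw}


print(to_any_value(b"checkout"))      # {'string_value': 'checkout'}
print(to_any_value(b"\x00\xff\x01"))  # {'bytes_value': b'\x00\xff\x01'}
```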

Draft invalid UTF-8 error handling policy

4f6c3a5

This was referenced May 10, 2024

Clarify that API users SHOULD use valid UTF-8 for exportable data open-telemetry/opentelemetry-specification#3421

Open

Clarify the valid content in primitive type string open-telemetry/opentelemetry-specification#3950

Open

XSAM reviewed May 11, 2024

text/0000-utf8-handling.md Outdated

update, lint, pr num

6df233d

jmacd marked this pull request as ready for review May 13, 2024 21:58

jmacd requested a review from a team May 13, 2024 21:58

Flarna reviewed May 21, 2024

text/0257-utf8-handling.md Outdated

joaopgrassi approved these changes May 22, 2024

text/0257-utf8-handling.md Outdated

jmacd and others added 8 commits May 22, 2024 09:13

Apply suggestions from code review

c30ff3d

Co-authored-by: Gerhard Stöbich <[email protected]> Co-authored-by: Joao Grassi <[email protected]>

Merge branch 'main' of github.com:open-telemetry/oteps into jmacd/utf8

e9a0d56

Edits.

47b656f

Merge branch 'jmacd/utf8' of github.com:jmacd/oteps into jmacd/utf8

15353f1

lint

f23f8b0

apply feedback; be specific

d79c75f

typo

e84d1e9

edits

10721f9

edit

4dbb16f

jsuereth reviewed Oct 11, 2024

tsloughter reviewed Oct 13, 2024

mx-psi mentioned this pull request Oct 15, 2024

[otlpreceiver] OpenTelemetry Collector accepts non-utf8 data open-telemetry/opentelemetry-collector#11449

Open

mugli mentioned this pull request Oct 28, 2024

[pkg/ottl] truncate_all function corrupts UTF-8 encoding open-telemetry/opentelemetry-collector-contrib#36017

Open

mtwo added priority:p1 triaged labels Oct 29, 2024
