Invalid UTF-8 error handling policy #257
base: main
Conversation
@jmacd Thanks for inviting me to share my thoughts.
Overall, I like this proposal. We might need to mention how to deal with the current mapping rules: if we convert invalid UTF-8 strings into �, we probably don't need to convert invalid UTF-8 into bytes.
Also, if we expose byte-slice valued attributes in our API, I would prefer converting invalid UTF-8 into � over silently changing the type.
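For concreteness, the "convert to �" behavior under discussion can be sketched in Go. This is a minimal sketch; `sanitizeAttr` is a hypothetical helper name, not an actual SDK or Collector API.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeAttr is a hypothetical helper illustrating the "correct" policy:
// valid UTF-8 passes through untouched, while each run of invalid bytes
// is replaced with U+FFFD (�).
func sanitizeAttr(s string) string {
	if utf8.ValidString(s) { // fast path: no allocation for valid input
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Println(sanitizeAttr("héllo"))       // valid input is unchanged
	fmt.Println(sanitizeAttr("bad\xffbyte")) // invalid byte becomes �
}
```

Because the valid-string fast path allocates nothing, well-formed attributes pay only a linear scan.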
@XSAM This was discussed in the Spec SIG today. There appears to be little support for binary-valued attributes. I think that's bad for users, but it's not so bad if we automatically correct invalid UTF-8. Therefore, I will move forward with only half of this proposal.
I think it's great that this is handled. For example, if a receiver uses the bindings we offer in https://github.com/open-telemetry/opentelemetry-proto-java, it will drop the entire batch if anything contains invalid UTF-8.
Co-authored-by: Gerhard Stöbich <[email protected]>
Co-authored-by: Joao Grassi <[email protected]>
@open-telemetry/specs-logs-approvers @open-telemetry/specs-metrics-approvers @open-telemetry/specs-trace-approvers Please consider this updated OTEP. The changes I have applied:
> …simple and preserves what valid content can be recovered from the
> data.
>
> #### Dropping data
What's the cost of performing this check on valid strings in the collector?
I see this as a performance tradeoff for where to do the enforcement of UTF-8, and my preference would be to push as much to the generation side as possible.
I'll read your alternatives considered, as you probably call this out.
"Generation side" - SDK/Exporter? Then the question becomes "should the collector trust the input" (I think the answer is "no").
> …`rejected_data_points`, or `rejected_log_records` count along with a
> message indicating data loss.
>
> ### Survey of existing systems
You should add a few more. E.g., Java only enforces UTF-8 when attempting to read the bytes into Java's `String` format. See https://github.com/protocolbuffers/protobuf/blob/0bfe41b27e3dd8a30ae383210d7af10c28a642ea/java/core/src/main/java/com/google/protobuf/Internal.java#L56 for the gory details.
> …send, simply resulting from invalid UTF-8.
>
> Considering whether components could support permissive data
> transport, it appears to be risky and for little reward. If …
I'd still like to understand the cost implications of validating on receivers.
I think the tradeoff in permissive mode is risky: you require ANYONE who needs to interpret a string as UTF-8 to handle failure at that moment. However, in a risk/reward trade-off for well-behaved systems, avoiding UTF-8 validation at every endpoint can add up.
I like having validation as opt-in/opt-out; I'm not sure which should be the default though.
- How likely do we think UTF-8 issues are in practice?
- What is the cost of performing this check in collector components?
Personally, related to your "consider invalid UTF-8 a bug in a processor" point, I think we should push responsible UTF-8 handling as close to generation as possible, so I'd rather see this as an opt-in feature of OTel than opt-out. BUT I may be missing some context or use cases where this is highly problematic.
> I like having validation as opt-in/opt-out, I'm not sure which should be the default though.
+1, I don't just "like" it, I think we SHOULD do this.
Here are my reasons:
- The collector is normally sending the data to some endpoints. Many backend services already perform such validation/correction, so folks might want to just do it once in the backend rather than duplicating the effort.
- There are cases where data can be consumed directly on the collector side (e.g. a collector running in a local data center might decide to trigger a rollback due to certain metrics KPI drop during a deployment), I think this is a general pattern for things that run on Edge.
- Depending on the ownership of different parts of the system, the parts could be designed to trust each other or not. The Collector needs to provide flexibility.
- Things could break between Collector and backend (e.g. bit flips caused by high energy particles from the universe, hardware failures), certain software needs to handle these as part of the design.
Regarding which one should be the default, based on what I've seen at Microsoft across Windows/Office/Azure/etc., I think it should be off by default and allow folks to opt in.
@jsuereth and @reyang I appreciate the feedback. Both of you are, I think, suggesting making UTF-8 validation an opt-in instead of an opt-out feature. I support that motion. The most critical thing for me is that if the SDK is configured with a permissive stance (opt-out), the SDK MUST configure its underlying technologies in support. Opting out does not mean doing nothing; it means explicitly configuring a pipeline to permit invalid UTF-8 unless a user opts in to UTF-8 validation. When UTF-8 validation is selected (opt-in), it seems we have two options: (a) reject individual items, (b) correct invalid UTF-8. Do either of you think both of these options are worthwhile? I think (b) should be preferred, but I would accept (a) too.
I think if we have very limited bandwidth, we should do (b); (a) can be added later if we see huge demand.
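The two opt-in behaviors being weighed here, (a) reject and (b) correct, could look roughly like this in Go. The function names and the batch shape are hypothetical, for illustration only.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// rejectPolicy illustrates option (a): drop items containing invalid
// UTF-8 and count them, as would feed a rejected_log_records or
// rejected_data_points count in a partial-success response.
func rejectPolicy(items []string) (kept []string, rejected int) {
	for _, s := range items {
		if utf8.ValidString(s) {
			kept = append(kept, s)
		} else {
			rejected++
		}
	}
	return kept, rejected
}

// correctPolicy illustrates option (b): keep every item, replacing
// invalid byte sequences with U+FFFD (�).
func correctPolicy(items []string) []string {
	out := make([]string, 0, len(items))
	for _, s := range items {
		out = append(out, strings.ToValidUTF8(s, "\uFFFD"))
	}
	return out
}

func main() {
	batch := []string{"ok", "bro\xffken"}
	kept, rejected := rejectPolicy(batch)
	fmt.Println(kept, rejected)       // [ok] 1
	fmt.Println(correctPolicy(batch)) // [ok bro�ken]
}
```

Option (b) preserves the surrounding valid content of each item, which is why it reads as the gentler default for telemetry.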
> #### No byte-slice valued attribute API
>
> As a caveat, the OpenTelemetry project has previously debated and …
Was this ever formally rejected?
I'd like this option because we technically already support it in Erlang: the main string type in Erlang/Elixir is `binary`, and the SDK safety mechanism mentioned elsewhere in this doc stores invalid UTF-8 in the proto's `bytes_value`. Because attribute values are already of type `binary`, the user can pass any binary data they want as an attribute value and it gets used.
I recall it being rejected informally in Spec SIG meetings, but a formal proposal might have different luck. Do you think there is any chance of that?
This is a draft following research done into
open-telemetry/opentelemetry-specification#3421
and
open-telemetry/opentelemetry-specification#3950.