[pkg/ottl]: Add a ParseXML converter to parse XML strings #31133
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
@TylerHelmuth @evan-bradley Any thoughts on this? |
I'd very much like @djaglowski's opinion as well. At first glance I am against a I would also prefer to stick to well defined Do you have an existing use case that is going to benefit from this parser? |
Let's omit for now, I've removed it from the proposal. The stdlib supports parsing non-strictly here with a simple bool flag, so it's not a hard thing to support at a later time if we decide there is some use.
I realize now I really failed to explain any of the rationale for what I proposed for the XML parsing.
I can't really find a "well-defined" and common xml -> map pattern. It seems like there are a bunch of different schemes. The scheme I'm proposing is informed largely by this library: https://pkg.go.dev/github.com/clbanning/mxj#section-readme (to be clear, I'm not suggesting we use this library, but the scheme it uses seemed appropriate).
Alternatively, you could parse it how datadog does. I've avoided that scheme because it doesn't seem to handle collisions between attribute and tag names.
There's also a similar scheme for elastic's logstash agent:
Example payload
{
"parsed_message" => {
"User" => {
"Name" => "Joe",
"Email" => "[email protected]",
"ID" => {
"content" => [
[0] "en",
[1] "00001"
]
}
},
"Text" => "User did a thing"
},
"@version" => "1",
"event" => {
"original" => "<Log><User><ID content=\"en\">00001</ID><Name>Joe</Name><Email>[email protected]</Email></User><Text>User did a thing</Text></Log>"
},
"host" => {
"hostname" => "Brandons-MBP"
},
"@timestamp" => 2024-02-20T19:14:07.919944Z,
"message" => "<Log><User><ID content=\"en\">00001</ID><Name>Joe</Name><Email>[email protected]</Email></User><Text>User did a thing</Text></Log>"
}
In that example, you can see that there was a collision ( So basically, I came up with this scheme by combining what I think are the best aspects of these.
Yes, we have a customer who we are moving to OTel that needs XML parsing capabilities. We've decided that the best place for that is the transform processor. We're trying to avoid using logstransform since it's eventually going to be removed, so we'd like to have this in OTTL as opposed to only being available in stanza-based receivers. |
I suggest we leave it out initially, always using strict mode. If necessary, we can look at adding an option later.
The problem with xml -> map is that xml is not a perfect 1:1 like json or yaml. In order to preserve all the semantics of the tag name, attributes, content, and child elements, any mapping requires tradeoffs. Comparing the two proposals, I see important differences.
The proposed design above prioritizes concision but in my opinion comes with unacceptable downsides.
The stanza design uses verbose and explicit structure in order to ensure clarity and consistency.
To illustrate some of these points, compare the following inputs, which all follow the same simple xml schema. (In the proposed format, I could not cleanly represent the output as yaml so I used json instead.)
|
My main issue with the Stanza design is that it's really not a great experience for a user. What I mean is, imagine you have this example as a log:
<Customer>
  <Order>
    <OrderID>000000</OrderID>
    <Item>SomeItem</Item>
  </Order>
</Customer>
If I'm looking to match e.g. all logs with OrderID 000000, it's actually really difficult with the Stanza structure.
tag: Customer
children:
  - tag: Order
    children:
      - tag: OrderID
        chardata: "000000"
      - tag: Item
        chardata: "SomeItem"
I need a way to match You might be able to find a way with the transform processor to match this? But even then, I think a lot of backends are limited in this type of matching, and the logs lose a lot of their use when they aren't properly searchable.
I also think this represents a lot of XML payloads you'll see, where there are a bunch of distinct tags with useful data to match on. So it makes sense to optimize for that case. That's why you see datadog and logstash parse such that attributes and tags are always the keys.
So while I largely agree with your analysis of the tradeoffs of the proposed format, I do find that forcing the Stanza format could greatly harm the ability to actually do anything useful with the parsed payload (and maybe I'm just not familiar with something in the collector that would help here). I think it could help to have some switch that could modify the format to have a more key:value approach, similar to how logstash allows switching between the two. |
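To make the matching difficulty concrete, here is a minimal Python sketch (not collector or OTTL code) of what finding the OrderID involves when tags are stored as values. The dict mirrors the Stanza-style YAML in the comment above; the helper name `find_chardata` and its path argument are hypothetical and exist only for illustration.

```python
# Stanza-style output for the Customer/Order example, written as a Python dict.
# Field names ("tag", "children", "chardata") follow the YAML in the comment above.
parsed = {
    "tag": "Customer",
    "children": [
        {
            "tag": "Order",
            "children": [
                {"tag": "OrderID", "chardata": "000000"},
                {"tag": "Item", "chardata": "SomeItem"},
            ],
        }
    ],
}


def find_chardata(node, path):
    """Return the chardata of every element reached by following `path`.

    `path` is a list of tag names starting at the root, e.g.
    ["Customer", "Order", "OrderID"]. Hypothetical helper, not part of
    OTTL or Stanza.
    """
    if not path or node.get("tag") != path[0]:
        return []
    if len(path) == 1:
        return [node["chardata"]] if "chardata" in node else []
    matches = []
    # Tags are values rather than keys, so every child has to be inspected.
    for child in node.get("children", []):
        matches.extend(find_chardata(child, path[1:]))
    return matches


order_ids = find_chardata(parsed, ["Customer", "Order", "OrderID"])
has_order = "000000" in order_ids
```

The point of the sketch is that a lookup needs a loop over `children` at every level, which is the kind of traversal a plain key path cannot express.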
Can you clarify how you would do this with the proposed structure? |
It would end up looking like this:
"#tag": "Customer"
Order:
  "#tag": Order
  OrderID:
    "#tag": OrderID
    "#chardata": "000000"
  Item:
    "#tag": Item
    "#chardata": "SomeItem"
Which means the order ID is able to be referenced as e.g. That being said, that's a bit of a cherry-picked example, because you might imagine something like this:
<Customer>
  <Order>
    <OrderID>000000</OrderID>
    <Item>SomeItem</Item>
  </Order>
  <Order>
    <OrderID>000001</OrderID>
    <Item>SecondItem</Item>
  </Order>
</Customer>
Which becomes:
"#tag": "Customer"
"Order#0":
  "#tag": Order
  OrderID:
    "#tag": OrderID
    "#chardata": "000000"
  Item:
    "#tag": "Item"
    "#chardata": "SomeItem"
"Order#1":
  "#tag": Order
  OrderID:
    "#tag": OrderID
    "#chardata": "000001"
  Item:
    "#tag": "Item"
    "#chardata": "SecondItem"
^ This is not easily walkable. Not using an array makes this problem worse. I do agree that this should be an array, maybe taking the datadog style where we group by tag. (I feel that we can omit all but the top-level "#tag" if we do this too; it was mainly there to preserve the original tag somewhere in case of collision.)
"#tag": "Customer"
Order:
  - OrderID:
      "#chardata": "000000"
    Item:
      "#chardata": "SomeItem"
  - OrderID:
      "#chardata": "000001"
    Item:
      "#chardata": "SecondItem"
I guess my point is, if the XML schema is written in such a way that a simple "key -> (obj|string)" map is possible, it would really be advantageous to create the map that way without arrays.
That being said, I also think that in a lot of situations the Stanza format is preferable. Like above, if your schema makes use of "one-or-more" type elements, that consistency is really valuable. Logstash allows you to configure this to force arrays (it forces them by default), so basically the first example here would be (also applying the datadog style for collisions):
"#tag": "Customer"
Order:
  - OrderID:
      - "#chardata": "000000"
    Item:
      - "#chardata": "SomeItem"
I think the one thing that isn't resolved by this is the naming conventions. With this design of having, it's necessary to avoid collisions with some naming scheme (or to handle collisions in a lossy way). We could also resolve collisions the way that they are resolved in logstash, but I think just throwing them into an array together is really unclear (e.g. if you have a tag "chardata" and some character data, both the character data and the tag would be dumped in this array, which isn't great).
I'm not fully against the old Stanza method of doing it. It's really just that one thing with the arrays, and maybe it makes sense to just start with the Stanza method of parsing, see what users think, and add options to tweak the output if needed. |
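The tag-keyed variants discussed in this comment (group repeated tags into an array, optionally force arrays everywhere as logstash does) can be sketched in a few lines of Python. This is only an illustration of the scheme, not the proposal's implementation: the function names and the `force_array` flag are made up here, attributes are left out for brevity, and text that follows a child element (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET


def to_keyed_map(elem, force_array=False):
    """Convert an element's contents to a tag-keyed dict (illustrative only).

    Character data goes under "#chardata"; children are keyed by tag name.
    Repeated sibling tags are grouped into a list. With force_array=True
    every child is wrapped in a list, even when it appears only once.
    """
    node = {}
    text = (elem.text or "").strip()
    if text:
        node["#chardata"] = text
    grouped = {}
    for child in elem:
        grouped.setdefault(child.tag, []).append(to_keyed_map(child, force_array))
    for tag, items in grouped.items():
        node[tag] = items if force_array or len(items) > 1 else items[0]
    return node


def parse_keyed(xml_string, force_array=False):
    """Parse an XML string; only the root keeps its name, under "#tag"."""
    root = ET.fromstring(xml_string)
    result = {"#tag": root.tag}
    result.update(to_keyed_map(root, force_array))
    return result


one_order = (
    "<Customer><Order><OrderID>000000</OrderID>"
    "<Item>SomeItem</Item></Order></Customer>"
)
two_orders = (
    "<Customer>"
    "<Order><OrderID>000000</OrderID><Item>SomeItem</Item></Order>"
    "<Order><OrderID>000001</OrderID><Item>SecondItem</Item></Order>"
    "</Customer>"
)

single = parse_keyed(one_order)
double = parse_keyed(two_orders)
forced = parse_keyed(one_order, force_array=True)
```

Without `force_array`, the path to the order ID changes shape between `single` (`single["Order"]["OrderID"]`) and `double` (`double["Order"][0]["OrderID"]`), which is the consistency concern raised in this thread; with `force_array=True` the path is always list-indexed.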
It's not clear to me that any approach mentioned here is both consistent and generally ottl-queryable. In the absence of a clear path to both, I think it makes sense to prioritize consistency first and we can look for ways to increase queryability later. |
That's fair. I'll modify the proposal to just be the old Stanza format, then. |
Built-in looping in OTTL may help with that. |
Proposal is now modified to be the Stanza format. |
I've opened #31487 implementing this. If we need more discussion about implementation, let me know, happy to discuss more! |
**Description:**
* Adds a ParseXML converter function that can be used to parse an XML document to a pcommon.Map value
**Link to tracking Issue:** Closes #31133
**Testing:**
Unit tests
Manually tested parsing XML logs
**Documentation:** Added documentation for the ParseXML function to the ottl_funcs README.
---------
Co-authored-by: Evan Bradley <[email protected]>
Component(s)
pkg/ottl
Is your feature request related to a problem? Please describe.
We have logs that are formatted as XML strings sent to a gateway collector. We'd like to be able to parse these XML strings into maps so the log can be searched and operated on properly.
Describe the solution you'd like
I'd like to add a new converter function, ParseXML.
ParseXML(target)
target is the string to parse as XML, returning a pcommon.Map.
Currently, there is not an XML parser operator in pkg/stanza. However, there is one for standalone stanza, which we could use as the basis for XML parsing.
These are the parsing rules it follows:
Each rule populates one field of the resulting map: the content field, the tag field, the attribute field, and the children field.
Here are some examples:
Text values in nested elements
->
Tag collision
->
Further nested example
->
Attribute only element
->
Comments are ignored
->
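Since the example payloads did not survive here, the following Python sketch gives a rough picture of the Stanza-style mapping using the four field names listed above (content, tag, attribute, children). It is an approximation under stated assumptions, not the ParseXML implementation (which would be written in Go and return a pcommon.Map): the function names are hypothetical, and omitting empty fields is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def element_to_map(elem):
    """Map one XML element to a dict with tag/attribute/content/children keys."""
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attribute"] = dict(elem.attrib)
    # Character data directly inside this element: the leading text plus any
    # text that follows each child element, trimmed and then joined.
    pieces = [elem.text or ""] + [child.tail or "" for child in elem]
    content = "".join(piece.strip() for piece in pieces)
    if content:
        node["content"] = content
    # Children are kept in document order as a list, so repeated tags and
    # tag/attribute name collisions need no special handling.
    children = [element_to_map(child) for child in elem]
    if children:
        node["children"] = children
    return node


def parse_xml(target):
    """Parse an XML string into nested dicts (hypothetical helper name)."""
    # ElementTree's default parser drops comments, so they never reach
    # element_to_map.
    return element_to_map(ET.fromstring(target))


result = parse_xml(
    '<Log><User><ID content="en">00001</ID><Name>Joe</Name></User>'
    "<!-- comments are ignored --><Text>User did a thing</Text></Log>"
)
```

Because the tag is stored as a value and children are always a list, the output shape is the same for every input, which is the consistency property the thread settles on.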
Describe alternatives you've considered
No response
Additional context
No response