[pkg/ottl]: Add a ParseXML converter to parse XML strings #31133

BinaryFissionGames · 2024-02-08T16:05:36Z

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

We have logs that are in formatted as XML strings sent to a gateway collector. We'd like to be able to parse these XML strings into maps so the log can be searched and operated on properly.

Describe the solution you'd like

I'd like to add a new converter function, ParseXML.

ParseXML(target)

target is the string to parse as XML, returning a pcommon.Map.

Currently, there is not an XML parser operator in pkg/stanza. However, there is one for standalone stanza, which we could use as the basis for XML parsing.

These are the parsing rules it follows:

All character data for an XML element is trimmed and placed in a content field
The tag for an XML element is trimmed and placed in a tag field
The attributes for an XML element is placed as a map[string]string in a attribute field
Processing instructions, directives, and comments are ignored and not represented in to resultant map
All child elements are parsed as above, and placed in an array in a children field.

Here are some examples:

Text values in nested elements

<Log><User><ID>00001</ID><Name>Joe</Name><Email>[email protected]</Email></User><Text>User did a thing</Text></Log>

->

children:
- children:
  - content: "00001"
    tag: ID
  - content: Joe
    tag: Name
  - content: [email protected]
    tag: Email
  tag: User
- content: User did a thing
  tag: Text
tag: Log

Tag collision

<Log>This record has a collision<User id="0001"/><User id="0002"/></Log>

->

children:
- attributes:
    id: "0001"
  tag: User
- attributes:
    id: "0002"
  tag: User
content: This record has a collision
tag: Log

Further nested example

<Log><User><ID>00001</ID><Name><First>Joe</First></Name></User><Text>User did a thing</Text></Log>

->

children:
- children:
  - content: "00001"
    tag: ID
  - children:
    - content: Joe
      tag: First
    tag: Name
  tag: User
- content: User did a thing
  tag: Text
tag: Log

Attribute only element

<HostInfo hostname="example.com" zone="east-1" cloudprovider="aws" />

->

tag: HostInfo
attributes:
  cloudprovider: aws
  hostname: example.com
  zone: east-1

Comments are ignored

<Log>This has a comment <!-- This is comment text --></Log>

->

tag: Log
content: This has a comment

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

github-actions · 2024-02-08T16:05:53Z

Pinging code owners:

pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley

See Adding Labels via Comments if you do not have permissions to add labels yourself.

BinaryFissionGames · 2024-02-16T15:28:54Z

@TylerHelmuth @evan-bradley Any thoughts on this?

TylerHelmuth · 2024-02-20T17:47:10Z

I'd very much like @djaglowski opinion as well.

At a first glance I am against a strict option - it feels to me like we should accept valid xml or error - we don't handle malformed data for other formats.

I would also prefer to stick to well defined xml -> map patterns (whatever those are). Let's not reinvent anything about parsing XML unless there is a good reason.

Do you have an existing use case that is going to benefit from this parser?

BinaryFissionGames · 2024-02-20T19:30:58Z

At a first glance I am against a strict option - it feels to me like we should accept valid xml or error - we don't handle malformed data for other formats.

Let's omit for now, I've removed it from the proposal. The stdlib supports parsing non-strictly here with a simple bool flag, so it's not a hard thing to support at a later time if we decide there is some use.

I would also prefer to stick to well defined xml -> map patterns (whatever those are). Let's not reinvent anything about parsing XML unless there is a good reason.

I realize now I really failed to explain any of the rationale for why I decided to propose what I did for the XML parsing.

I can't really find a "well-defined" and common xml -> map pattern. It seems like there are a bunch of different schemes. The idea for this scheme I'm proposing is informed largely from this library: https://pkg.go.dev/github.com/clbanning/mxj#section-readme (to be clear, not suggesting we use this library, but the scheme it uses seemed appropriate).

But alternatively, you could parse it how datadog does. I've avoided that scheme because it doesn't seem to handle collisions between attribute and tag names.

There's also a similar scheme for elastic's logstash agent:

Example payload

{
    "parsed_message" => {
        "User" => {
             "Name" => "Joe",
            "Email" => "[email protected]",
               "ID" => {
                "content" => [
                    [0] "en",
                    [1] "00001"
                ]
            }
        },
        "Text" => "User did a thing"
    },
          "@version" => "1",
             "event" => {
        "original" => "<Log><User><ID content=\"en\">00001</ID><Name>Joe</Name><Email>[email protected]</Email></User><Text>User did a thing</Text></Log>"
    },
              "host" => {
        "hostname" => "Brandons-MBP"
    },
        "@timestamp" => 2024-02-20T19:14:07.919944Z,
           "message" => "<Log><User><ID content=\"en\">00001</ID><Name>Joe</Name><Email>[email protected]</Email></User><Text>User did a thing</Text></Log>"
}

In that example, you can see that there was a collision (content is some meta key for logstash) and it ended up just dumping them in an array together, which means I can't necessarily differentiate between which one was the content and which one was the attribute.

So basically, I came up with this scheme by kinda combining what I think are the best aspects from these.
Which is:

Add a prefix to attributes to prevent collisions with tag names and "meta" keys
Add a prefix to "meta" keys to prevent collisions with attributes and keys
Default to parsing just the tag name as the key, and having the mapping be flat (avoid arrays when possible)
Have a strategy for tag collisions (I'm using a key suffix here, it might be more appropriate to use an array in this case, the way datadog handles it)

Do you have an existing use case that is going to benefit from this parser?

Yes, we have a customer who we are moving to OTel that needs XML parsing capabilities. We've decided that the best place for that is the transform processor, we're trying to avoid usage of logstransform since it's eventually going to be removed, so we'd like to have this in OTTL as opposed to only being available in stanza-based receivers.

djaglowski · 2024-02-20T19:49:48Z

At a first glance I am against a strict option - it feels to me like we should accept valid xml or error - we don't handle malformed data for other formats.

I suggest we leave it out initially, always using strict mode. If necessary, we can look at adding an option later.

I would also prefer to stick to well defined xml -> map patterns (whatever those are). Let's not reinvent anything about parsing XML unless there is a good reason.

The problem with xml -> map is that xml is not a perfect 1:1 like json or yaml. In order to preserve all the semantics of the tag name, attributes, content, and child elements, any mapping requires tradeoffs. Comparing the two proposals, I see important differences.

Naming of keys
Representation of order
Duplication of information
Consistency (or lack thereof) in the output shape resulting from a given xml schema.

The proposed design above prioritizes concision but in my opinion comes with unacceptable downsides.

It relies on "tricky" naming conventions. Original keys are in many cases not preserved. Instead, the user must perform a mental (or actual) translation in order to understand their precise meaning. Additionally, the proposed naming conventions would be problematic if translated directly into yaml, since # is a special character denoting a comment. (Technically it can be escaped, but this would certainly lead to problems for users.)
It preserves ordering of child elements in a way which is correct, but elements cannot be accessed directly based on index. Instead, it requires building "tricky" keys which themselves contain the index. It's also necessary to know ahead of time which type of tag you expect to find at a given index.
Tag names are effectively duplicated, as they are represented both as keys and as values within the object to which the key refers. This is not only redundant but prone to corruption. For example, if a users wishes to rename a tag name, they must do it in two different places, each requiring a different mechanism, either or which can corrupt the structure if not done correctly with the other.
A given xml schema will be parsed into a different shape depending on cardinality of children. For example, see the table below, showing how "Customers" would each be represented with structures that differ beyond cardinality, despite matching the same xml schema.

The stanza design uses verbose and explicit structure in order to ensure clarity and consistency.

It preserves all tag names and attributes keys exactly as they are. The interleaved metadata layers are the downside of this approach, but they are always clear and consistent: "tag", "attributes", "content", "children".
It represents ordered elements using ordered structures (i.e. arrays). Elements can be accesses without prior knowledge of what the tag is. The size of the array can be queried, etc. Basically, most/all of the reasons why arrays exist are relevent to ordered data so why not use an array?
There's no duplication in the data. Tag names appear exactly once per tag, always in the same place relative to the object.
This representation always converts the same xml schema in the same way.

To illustrate some of these points, compare the following inputs, which all follow the same simple xml schema. (In the proposed format, I could not cleanly represent the output as yaml so I used json instead.)

Input

Proposed Above

Implemented in stanza

<Customer>
  <Order>
    <Item/>
  </Order>
</Customer>

{
  "#tag": "Customer",
  "Order": {
    "#tag": "Order",
    "Item": { }
  } 
}

tag: Customer
children:
- tag: Order
  children:
  - tag: Item

<Customer>
  <Order>
    <Item/>
    <Item/>
  </Order>
</Customer>

{
  "#tag": "Customer",
  "Order": {
    "#tag": "Order",
    "Item#0": { },
    "Item#1": { }
  } 
}

tag: Customer
children:
- tag: Order
  children:
  - tag: Item
  - tag: Item

<Customer>
  <Order>
    <Item/>
  </Order>
  <Order>
    <Item/>
    <Item/>
  </Order>
</Customer>

{
  "#tag": "Customer",
  "Order#0": {
    "#tag": "Order",
    "Item": { },
  },
  "Order#1": {
    "Item#0": { },
    "Item#1": { }
  } 
}

tag: Customer
children:
- tag: Order
  children:
  - tag: Item
- tag: Order
  children:
  - tag: Item
  - tag: Item

BinaryFissionGames · 2024-02-20T20:34:52Z

My main issue with the Stanza design is it's really not a great experience for a user.

What I mean is, imagine you have this example as a log:

<Customer>
  <Order>
    <OrderID>000000<OrderID>
    <Item>SomeItem</Item>
  </Order>
</Customer>

If I'm looking to match all e.g. logs with OrderID 000000, it's actually really difficult with the Stanza structure.

tag: Customer
children:
- tag: Order
  children:
  - tag: OrderID
    chardata: "000000"
  - tag: Item
    chardata: "SomeItem"

I need a way to match children[0].children[*].chardata == "000000" if I want to filter on chardata. Except, I don't necessarily know where the OrderID element is indexed. It very well could change depending on the specific log message.

You might be able to find a way with the transform processor to match this? But even then, I think a lot of backends are limited in this type of matching, and the logs lose a lot of their use when they aren't properly searchable.

I also think this represents a lot of XML payloads you'll see, where there are a bunch of distinct tags with useful data to match on. So it makes sense to optimize for that case. That's why you see datadog and logstash parse such attributes and tags are always the keys.

So while I largely agree with your analysis on the tradeoffs of the proposed format, I do find forcing the Stanza format could greatly harm the ability to actually do anything useful with the parsed payload (and maybe I'm just not familiar with something in the collector that would help here).

I think it could help to have some switch that could modify the format to have a more key:value approach, similar to how logstash allows switching between the two.

djaglowski · 2024-02-20T21:15:38Z

If I'm looking to match all e.g. logs with OrderID 000000, it's actually really difficult with the Stanza structure.

Can you clarify how you would do this with the proposed structure?

BinaryFissionGames · 2024-02-21T17:17:47Z

Can you clarify how you would do this with the proposed structure?

It would end up looking like this:

"#tag": "Customer"
Order:
  "#tag": Order
  OrderID:
    "#tag": OrderID
    "#chardata": "000000"
  Item:
    "#tag": Item
    "#chardata": "SomeItem"

Which means the order ID is able to be referenced as e.g. attributes["Order"]["OrderID"]["#chardata"]

That being said, that's a bit of a cherry picked example, because you might imagine something like this:

<Customer>
  <Order>
    <OrderID>000000<OrderID>
    <Item>SomeItem</Item>
  </Order>
  <Order>
    <OrderID>000001<OrderID>
    <Item>SecondItem</Item>
  </Order>
</Customer>

Which becomes:

"#tag": "Customer"
"Order#0":
  "#tag": Order
  OrderID:
    "#tag": OrderID
    "#chardata": "000000"
  Item:
    "#tag": "Item"
    "#chardata": "SomeItem"
"Order#1":
  OrderID:
    "#tag": OrderID,
    "#chardata": "000001"
  Item:
    "#tag": "Item"
    "#chardata": "SecondItem"

^ This is not easily walkable. Not using an array makes this problem worse. I do agree that this should be an array, maybe taking the datadog style where we group by tag (I feel that we can omit all but the top-level "#tag" if we do this too, it was mainly to preserve the original tag somewhere in case of collision)

"#tag": "Customer"
Order:
  - OrderID:
       "#chardata": "000000"
    Item:
       "#chardata": "SomeItem"
  - OrderID:
       "#chardata": "000001"
    Item:
       "#chardata": "SecondItem"

I guess my point is, if the XML schema is written in such a way that simple "key -> (obj|string)" map is possible, it would really be advantageous to create the map that way without arrays.

That being said, I also think that in a lot of situations the Stanza format is preferable. Like above, if you're schema makes use of "one-or-more" type elements, that consistency is really valuable.

Logstash allows you to configure this to force arrays (it forces them by default), so basically the first example here would be (also applying the datadog style for collisions):

"#tag": "Customer"
Order:
  -  OrderID:
        - "#chardata": "000000"
      Item:
        - "#chardata": "SomeItem"

I think the one thing that isn't resolved by this is the naming conventions. With this design of having, it's necessary to avoid collisions with some naming scheme (or handling collisions in a lossy way).

We could also resolve collisions the way that they are resolved in logstash, but I think just throwing them into an array together is really unclear (e.g. if you have a tag "chardata" and some character data, both the character data and the tag would be dumped in this array, which isn't great)

I'm not fully against the old Stanza method of doing it, it's really just that one thing with the arrays, and maybe it makes sense to just start with the Stanza method of parsing, and seeing what users think, and adding options to tweak the output if needed.

djaglowski · 2024-02-21T21:27:06Z

It's not clear to me that any approach mentioned here is both consistent and generally ottl-queryable. In the absence of a clear path to both, I think it makes sense to prioritize consistency first and we can look for ways to increase queryability later.

BinaryFissionGames · 2024-02-21T21:29:41Z

That's fair. I'll modify the proposal to just be the old Stanza format, then.

TylerHelmuth · 2024-02-21T21:35:55Z

Built in looping in OTTL may help with then children[0].children[*].chardata == "000000" problem.

BinaryFissionGames · 2024-02-21T22:24:03Z

Proposal is now modified to be the Stanza format.

BinaryFissionGames · 2024-02-28T17:06:58Z

I've opened #31487 implementing this. If we need more discussion about implementation, let me know, happy to discuss more!

**Description:** * Adds a ParseXML converter function that can be used to parse an XML document to a pcommon.Map value **Link to tracking Issue:** Closes #31133 **Testing:** Unit tests Manually tested parsing XML logs **Documentation:** Added documentation for the ParseXML function to the ottl_funcs README. --------- Co-authored-by: Evan Bradley <[email protected]>

BinaryFissionGames added enhancement New feature or request needs triage New item requiring triage labels Feb 8, 2024

github-actions bot added the pkg/ottl label Feb 8, 2024

evan-bradley added priority:p2 Medium and removed needs triage New item requiring triage labels Feb 8, 2024

github-actions bot mentioned this issue Feb 13, 2024

Weekly Report: 2024-02-06 - 2024-02-13 #31192

Closed

TylerHelmuth mentioned this issue Feb 21, 2024

[pkg/ottl] Determine approach to looping #29289

Open

BinaryFissionGames mentioned this issue Feb 28, 2024

[pkg/ottl]: Add ParseXML converter #31487

Merged

TylerHelmuth closed this as completed in #31487 Mar 15, 2024

djaglowski mentioned this issue Sep 17, 2024

[pkg/ottl] Enhance ParseXML function with intuitive parse format and add MarshalXML function to convert parsed output back to XML #35076

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pkg/ottl]: Add a ParseXML converter to parse XML strings #31133

[pkg/ottl]: Add a ParseXML converter to parse XML strings #31133

BinaryFissionGames commented Feb 8, 2024 •

edited

Loading

github-actions bot commented Feb 8, 2024

BinaryFissionGames commented Feb 16, 2024

TylerHelmuth commented Feb 20, 2024 •

edited

Loading

BinaryFissionGames commented Feb 20, 2024 •

edited

Loading

djaglowski commented Feb 20, 2024 •

edited

Loading

BinaryFissionGames commented Feb 20, 2024 •

edited

Loading

djaglowski commented Feb 20, 2024

BinaryFissionGames commented Feb 21, 2024 •

edited

Loading

djaglowski commented Feb 21, 2024

BinaryFissionGames commented Feb 21, 2024

TylerHelmuth commented Feb 21, 2024

BinaryFissionGames commented Feb 21, 2024

BinaryFissionGames commented Feb 28, 2024

[pkg/ottl]: Add a ParseXML converter to parse XML strings #31133

[pkg/ottl]: Add a ParseXML converter to parse XML strings #31133

Comments

BinaryFissionGames commented Feb 8, 2024 • edited Loading

Component(s)

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

github-actions bot commented Feb 8, 2024

BinaryFissionGames commented Feb 16, 2024

TylerHelmuth commented Feb 20, 2024 • edited Loading

BinaryFissionGames commented Feb 20, 2024 • edited Loading

djaglowski commented Feb 20, 2024 • edited Loading

BinaryFissionGames commented Feb 20, 2024 • edited Loading

djaglowski commented Feb 20, 2024

BinaryFissionGames commented Feb 21, 2024 • edited Loading

djaglowski commented Feb 21, 2024

BinaryFissionGames commented Feb 21, 2024

TylerHelmuth commented Feb 21, 2024

BinaryFissionGames commented Feb 21, 2024

BinaryFissionGames commented Feb 28, 2024

BinaryFissionGames commented Feb 8, 2024 •

edited

Loading

TylerHelmuth commented Feb 20, 2024 •

edited

Loading

BinaryFissionGames commented Feb 20, 2024 •

edited

Loading

djaglowski commented Feb 20, 2024 •

edited

Loading

BinaryFissionGames commented Feb 20, 2024 •

edited

Loading

BinaryFissionGames commented Feb 21, 2024 •

edited

Loading