Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl]: Add a ParseXML converter to parse XML strings #31133

Closed
BinaryFissionGames opened this issue Feb 8, 2024 · 13 comments · Fixed by #31487
Closed

[pkg/ottl]: Add a ParseXML converter to parse XML strings #31133

BinaryFissionGames opened this issue Feb 8, 2024 · 13 comments · Fixed by #31487
Labels
enhancement New feature or request pkg/ottl priority:p2 Medium

Comments

@BinaryFissionGames
Copy link
Contributor

BinaryFissionGames commented Feb 8, 2024

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

We have logs that are in formatted as XML strings sent to a gateway collector. We'd like to be able to parse these XML strings into maps so the log can be searched and operated on properly.

Describe the solution you'd like

I'd like to add a new converter function, ParseXML.

ParseXML(target)

  • target is the string to parse as XML, returning a pcommon.Map.

Currently, there is not an XML parser operator in pkg/stanza. However, there is one for standalone stanza, which we could use as the basis for XML parsing.

These are the parsing rules it follows:

  1. All character data for an XML element is trimmed and placed in a content field
  2. The tag for an XML element is trimmed and placed in a tag field
  3. The attributes for an XML element is placed as a map[string]string in a attribute field
  4. Processing instructions, directives, and comments are ignored and not represented in to resultant map
  5. All child elements are parsed as above, and placed in an array in a children field.

Here are some examples:

Text values in nested elements
<Log><User><ID>00001</ID><Name>Joe</Name><Email>[email protected]</Email></User><Text>User did a thing</Text></Log>

->

children:
- children:
  - content: "00001"
    tag: ID
  - content: Joe
    tag: Name
  - content: [email protected]
    tag: Email
  tag: User
- content: User did a thing
  tag: Text
tag: Log
Tag collision
<Log>This record has a collision<User id="0001"/><User id="0002"/></Log>

->

children:
- attributes:
    id: "0001"
  tag: User
- attributes:
    id: "0002"
  tag: User
content: This record has a collision
tag: Log
Further nested example
<Log><User><ID>00001</ID><Name><First>Joe</First></Name></User><Text>User did a thing</Text></Log>

->

children:
- children:
  - content: "00001"
    tag: ID
  - children:
    - content: Joe
      tag: First
    tag: Name
  tag: User
- content: User did a thing
  tag: Text
tag: Log
Attribute only element
<HostInfo hostname="example.com" zone="east-1" cloudprovider="aws" />

->

tag: HostInfo
attributes:
  cloudprovider: aws
  hostname: example.com
  zone: east-1
Comments are ignored
<Log>This has a comment <!-- This is comment text --></Log>

->

tag: Log
content: This has a comment

Describe alternatives you've considered

No response

Additional context

No response

@BinaryFissionGames BinaryFissionGames added enhancement New feature or request needs triage New item requiring triage labels Feb 8, 2024
Copy link
Contributor

github-actions bot commented Feb 8, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@evan-bradley evan-bradley added priority:p2 Medium and removed needs triage New item requiring triage labels Feb 8, 2024
@BinaryFissionGames
Copy link
Contributor Author

@TylerHelmuth @evan-bradley Any thoughts on this?

@TylerHelmuth
Copy link
Member

TylerHelmuth commented Feb 20, 2024

I'd very much like @djaglowski opinion as well.

At a first glance I am against a strict option - it feels to me like we should accept valid xml or error - we don't handle malformed data for other formats.

I would also prefer to stick to well defined xml -> map patterns (whatever those are). Let's not reinvent anything about parsing XML unless there is a good reason.

Do you have an existing use case that is going to benefit from this parser?

@BinaryFissionGames
Copy link
Contributor Author

BinaryFissionGames commented Feb 20, 2024

At a first glance I am against a strict option - it feels to me like we should accept valid xml or error - we don't handle malformed data for other formats.

Let's omit for now, I've removed it from the proposal. The stdlib supports parsing non-strictly here with a simple bool flag, so it's not a hard thing to support at a later time if we decide there is some use.

I would also prefer to stick to well defined xml -> map patterns (whatever those are). Let's not reinvent anything about parsing XML unless there is a good reason.

I realize now I really failed to explain any of the rationale for why I decided to propose what I did for the XML parsing.

I can't really find a "well-defined" and common xml -> map pattern. It seems like there are a bunch of different schemes. The idea for this scheme I'm proposing is informed largely from this library: https://pkg.go.dev/github.com/clbanning/mxj#section-readme (to be clear, not suggesting we use this library, but the scheme it uses seemed appropriate).

But alternatively, you could parse it how datadog does. I've avoided that scheme because it doesn't seem to handle collisions between attribute and tag names.

There's also a similar scheme for elastic's logstash agent:

Example payload
{
    "parsed_message" => {
        "User" => {
             "Name" => "Joe",
            "Email" => "[email protected]",
               "ID" => {
                "content" => [
                    [0] "en",
                    [1] "00001"
                ]
            }
        },
        "Text" => "User did a thing"
    },
          "@version" => "1",
             "event" => {
        "original" => "<Log><User><ID content=\"en\">00001</ID><Name>Joe</Name><Email>[email protected]</Email></User><Text>User did a thing</Text></Log>"
    },
              "host" => {
        "hostname" => "Brandons-MBP"
    },
        "@timestamp" => 2024-02-20T19:14:07.919944Z,
           "message" => "<Log><User><ID content=\"en\">00001</ID><Name>Joe</Name><Email>[email protected]</Email></User><Text>User did a thing</Text></Log>"
}

In that example, you can see that there was a collision (content is some meta key for logstash) and it ended up just dumping them in an array together, which means I can't necessarily differentiate between which one was the content and which one was the attribute.

So basically, I came up with this scheme by kinda combining what I think are the best aspects from these.
Which is:

  • Add a prefix to attributes to prevent collisions with tag names and "meta" keys
  • Add a prefix to "meta" keys to prevent collisions with attributes and keys
  • Default to parsing just the tag name as the key, and having the mapping be flat (avoid arrays when possible)
  • Have a strategy for tag collisions (I'm using a key suffix here, it might be more appropriate to use an array in this case, the way datadog handles it)

Do you have an existing use case that is going to benefit from this parser?

Yes, we have a customer who we are moving to OTel that needs XML parsing capabilities. We've decided that the best place for that is the transform processor, we're trying to avoid usage of logstransform since it's eventually going to be removed, so we'd like to have this in OTTL as opposed to only being available in stanza-based receivers.

@djaglowski
Copy link
Member

djaglowski commented Feb 20, 2024

At a first glance I am against a strict option - it feels to me like we should accept valid xml or error - we don't handle malformed data for other formats.

I suggest we leave it out initially, always using strict mode. If necessary, we can look at adding an option later.

I would also prefer to stick to well defined xml -> map patterns (whatever those are). Let's not reinvent anything about parsing XML unless there is a good reason.

The problem with xml -> map is that xml is not a perfect 1:1 like json or yaml. In order to preserve all the semantics of the tag name, attributes, content, and child elements, any mapping requires tradeoffs. Comparing the two proposals, I see important differences.

  1. Naming of keys
  2. Representation of order
  3. Duplication of information
  4. Consistency (or lack thereof) in the output shape resulting from a given xml schema.

The proposed design above prioritizes concision but in my opinion comes with unacceptable downsides.

  1. It relies on "tricky" naming conventions. Original keys are in many cases not preserved. Instead, the user must perform a mental (or actual) translation in order to understand their precise meaning. Additionally, the proposed naming conventions would be problematic if translated directly into yaml, since # is a special character denoting a comment. (Technically it can be escaped, but this would certainly lead to problems for users.)
  2. It preserves ordering of child elements in a way which is correct, but elements cannot be accessed directly based on index. Instead, it requires building "tricky" keys which themselves contain the index. It's also necessary to know ahead of time which type of tag you expect to find at a given index.
  3. Tag names are effectively duplicated, as they are represented both as keys and as values within the object to which the key refers. This is not only redundant but prone to corruption. For example, if a users wishes to rename a tag name, they must do it in two different places, each requiring a different mechanism, either or which can corrupt the structure if not done correctly with the other.
  4. A given xml schema will be parsed into a different shape depending on cardinality of children. For example, see the table below, showing how "Customers" would each be represented with structures that differ beyond cardinality, despite matching the same xml schema.

The stanza design uses verbose and explicit structure in order to ensure clarity and consistency.

  1. It preserves all tag names and attributes keys exactly as they are. The interleaved metadata layers are the downside of this approach, but they are always clear and consistent: "tag", "attributes", "content", "children".
  2. It represents ordered elements using ordered structures (i.e. arrays). Elements can be accesses without prior knowledge of what the tag is. The size of the array can be queried, etc. Basically, most/all of the reasons why arrays exist are relevent to ordered data so why not use an array?
  3. There's no duplication in the data. Tag names appear exactly once per tag, always in the same place relative to the object.
  4. This representation always converts the same xml schema in the same way.

To illustrate some of these points, compare the following inputs, which all follow the same simple xml schema. (In the proposed format, I could not cleanly represent the output as yaml so I used json instead.)

Input Proposed Above Implemented in stanza
<Customer>
  <Order>
    <Item/>
  </Order>
</Customer>
{
  "#tag": "Customer",
  "Order": {
    "#tag": "Order",
    "Item": { }
  } 
}
tag: Customer
children:
- tag: Order
  children:
  - tag: Item
<Customer>
  <Order>
    <Item/>
    <Item/>
  </Order>
</Customer>
{
  "#tag": "Customer",
  "Order": {
    "#tag": "Order",
    "Item#0": { },
    "Item#1": { }
  } 
}
tag: Customer
children:
- tag: Order
  children:
  - tag: Item
  - tag: Item
<Customer>
  <Order>
    <Item/>
  </Order>
  <Order>
    <Item/>
    <Item/>
  </Order>
</Customer>
{
  "#tag": "Customer",
  "Order#0": {
    "#tag": "Order",
    "Item": { },
  },
  "Order#1": {
    "Item#0": { },
    "Item#1": { }
  } 
}
tag: Customer
children:
- tag: Order
  children:
  - tag: Item
- tag: Order
  children:
  - tag: Item
  - tag: Item

@BinaryFissionGames
Copy link
Contributor Author

BinaryFissionGames commented Feb 20, 2024

My main issue with the Stanza design is it's really not a great experience for a user.

What I mean is, imagine you have this example as a log:

<Customer>
  <Order>
    <OrderID>000000<OrderID>
    <Item>SomeItem</Item>
  </Order>
</Customer>

If I'm looking to match all e.g. logs with OrderID 000000, it's actually really difficult with the Stanza structure.

tag: Customer
children:
- tag: Order
  children:
  - tag: OrderID
    chardata: "000000"
  - tag: Item
    chardata: "SomeItem"

I need a way to match children[0].children[*].chardata == "000000" if I want to filter on chardata. Except, I don't necessarily know where the OrderID element is indexed. It very well could change depending on the specific log message.

You might be able to find a way with the transform processor to match this? But even then, I think a lot of backends are limited in this type of matching, and the logs lose a lot of their use when they aren't properly searchable.

I also think this represents a lot of XML payloads you'll see, where there are a bunch of distinct tags with useful data to match on. So it makes sense to optimize for that case. That's why you see datadog and logstash parse such attributes and tags are always the keys.

So while I largely agree with your analysis on the tradeoffs of the proposed format, I do find forcing the Stanza format could greatly harm the ability to actually do anything useful with the parsed payload (and maybe I'm just not familiar with something in the collector that would help here).

I think it could help to have some switch that could modify the format to have a more key:value approach, similar to how logstash allows switching between the two.

@djaglowski
Copy link
Member

If I'm looking to match all e.g. logs with OrderID 000000, it's actually really difficult with the Stanza structure.

Can you clarify how you would do this with the proposed structure?

@BinaryFissionGames
Copy link
Contributor Author

BinaryFissionGames commented Feb 21, 2024

Can you clarify how you would do this with the proposed structure?

It would end up looking like this:

"#tag": "Customer"
Order:
  "#tag": Order
  OrderID:
    "#tag": OrderID
    "#chardata": "000000"
  Item:
    "#tag": Item
    "#chardata": "SomeItem"

Which means the order ID is able to be referenced as e.g. attributes["Order"]["OrderID"]["#chardata"]

That being said, that's a bit of a cherry picked example, because you might imagine something like this:

<Customer>
  <Order>
    <OrderID>000000<OrderID>
    <Item>SomeItem</Item>
  </Order>
  <Order>
    <OrderID>000001<OrderID>
    <Item>SecondItem</Item>
  </Order>
</Customer>

Which becomes:

"#tag": "Customer"
"Order#0":
  "#tag": Order
  OrderID:
    "#tag": OrderID
    "#chardata": "000000"
  Item:
    "#tag": "Item"
    "#chardata": "SomeItem"
"Order#1":
  OrderID:
    "#tag": OrderID,
    "#chardata": "000001"
  Item:
    "#tag": "Item"
    "#chardata": "SecondItem"

^ This is not easily walkable. Not using an array makes this problem worse. I do agree that this should be an array, maybe taking the datadog style where we group by tag (I feel that we can omit all but the top-level "#tag" if we do this too, it was mainly to preserve the original tag somewhere in case of collision)

"#tag": "Customer"
Order:
  - OrderID:
       "#chardata": "000000"
    Item:
       "#chardata": "SomeItem"
  - OrderID:
       "#chardata": "000001"
    Item:
       "#chardata": "SecondItem"

I guess my point is, if the XML schema is written in such a way that simple "key -> (obj|string)" map is possible, it would really be advantageous to create the map that way without arrays.


That being said, I also think that in a lot of situations the Stanza format is preferable. Like above, if you're schema makes use of "one-or-more" type elements, that consistency is really valuable.

Logstash allows you to configure this to force arrays (it forces them by default), so basically the first example here would be (also applying the datadog style for collisions):

"#tag": "Customer"
Order:
  -  OrderID:
        - "#chardata": "000000"
      Item:
        - "#chardata": "SomeItem"

I think the one thing that isn't resolved by this is the naming conventions. With this design of having, it's necessary to avoid collisions with some naming scheme (or handling collisions in a lossy way).

We could also resolve collisions the way that they are resolved in logstash, but I think just throwing them into an array together is really unclear (e.g. if you have a tag "chardata" and some character data, both the character data and the tag would be dumped in this array, which isn't great)


I'm not fully against the old Stanza method of doing it, it's really just that one thing with the arrays, and maybe it makes sense to just start with the Stanza method of parsing, and seeing what users think, and adding options to tweak the output if needed.

@djaglowski
Copy link
Member

It's not clear to me that any approach mentioned here is both consistent and generally ottl-queryable. In the absence of a clear path to both, I think it makes sense to prioritize consistency first and we can look for ways to increase queryability later.

@BinaryFissionGames
Copy link
Contributor Author

That's fair. I'll modify the proposal to just be the old Stanza format, then.

@TylerHelmuth
Copy link
Member

Built in looping in OTTL may help with then children[0].children[*].chardata == "000000" problem.

@BinaryFissionGames
Copy link
Contributor Author

Proposal is now modified to be the Stanza format.

@BinaryFissionGames
Copy link
Contributor Author

I've opened #31487 implementing this. If we need more discussion about implementation, let me know, happy to discuss more!

TylerHelmuth pushed a commit that referenced this issue Mar 15, 2024
**Description:**
* Adds a ParseXML converter function that can be used to parse an XML
document to a pcommon.Map value

**Link to tracking Issue:** Closes #31133

**Testing:**
Unit tests
Manually tested parsing XML logs

**Documentation:**
Added documentation for the ParseXML function to the ottl_funcs README.

---------

Co-authored-by: Evan Bradley <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pkg/ottl priority:p2 Medium
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants