
Add json_array_parser parser and assign_keys transformer #30644

Merged
merged 30 commits into from
Jan 23, 2024

Conversation

RoeiDimi
Contributor

@RoeiDimi RoeiDimi commented Jan 17, 2024

Description:

We are using the otel-collector as infrastructure and receive many types of data from a client. The client's data is always a form of JSON, and in one use case it is a simple headerless jarray, so we need a way to parse it and match headers to each field (similar to what csv_parser does, but with support for the types allowed in JSON, including nested objects).
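For illustration (the values and keys here are hypothetical, modeled on the header used later in this thread), such a headerless jarray might look like:

```json
["2024-01-17T13:29:00Z", "10.0.0.1", 443, {"region": "eu"}]
```

which, given the keys `TimeGenerated`, `SourceIP`, `SourcePort`, `Ext`, would be parsed into:

```json
{"TimeGenerated": "2024-01-17T13:29:00Z", "SourceIP": "10.0.0.1", "SourcePort": 443, "Ext": {"region": "eu"}}
```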

Link to tracking Issue:
#30321

Testing:

  • unittests
    All the tests found in csv_parser were copied and adjusted, adding test scenarios for different types (numbers, booleans, null) as well as a test for parsing a nested object as part of the jarray
  • End to end tests
    Used generated traffic on a running otel collector that uses the parser and verified the data is as expected in the end table

Documentation:

@RoeiDimi RoeiDimi requested a review from a team January 17, 2024 13:29

linux-foundation-easycla bot commented Jan 17, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@RoeiDimi RoeiDimi changed the title Add a headerless jarray parser operator Add json_array_parser parser and assign_keys transformer Jan 22, 2024
Member

@djaglowski djaglowski left a comment


Thanks @RoeiDimi, the design feels right. I left a lot of comments but mostly just nits. A few minor things I think we need to take care of though.

Files with review comments (resolved):

  • pkg/stanza/docs/operators/assign_keys.md
  • pkg/stanza/operator/parser/json_array/json_array_parser.go (multiple comments)
  • pkg/stanza/operator/transformer/assign_keys/assign_keys.go (multiple comments)
@djaglowski
Member

@RoeiDimi, I was able to resolve the issue by adding the new dependency directly to the root module. Thanks for the new features and for iterating on the design with me.

@djaglowski djaglowski merged commit aee7b70 into open-telemetry:main Jan 23, 2024
85 checks passed
@github-actions github-actions bot added this to the next release milestone Jan 23, 2024
@RoeiDimi
Contributor Author

RoeiDimi commented Jan 23, 2024

Thanks @djaglowski, it's been a pleasure working together!

One last question though: assign_keys can't really parse the values straight into attributes (the way csv_parser can).
Can you think of an easy way to make that happen?

I mean, I now understand that all this may not completely solve my initial issue.
With the initial design I could do this:

    - type: csv_parser
      header: TimeGenerated,SourceIP,SourcePort,DestinationIP,DestinationPort,Protocol,SentBytes,ReceivedBytes,ExtID
      parse_from: body
      parse_to: attributes
      is_jarray: true
We have generic exporter code that reads all the key-value pairs from attributes and sends them.

Can you think of a way I could use the json_array_parser -> assign_keys design to achieve that?

@djaglowski
Member

djaglowski commented Jan 23, 2024

@RoeiDimi, looks like we lost track of the goal somewhat but luckily you should be able to flatten after assigning keys:

    operators:
      - type: json_array_parser
        parse_from: body
        parse_to: attributes.tmp
      - type: assign_keys
        keys: ["a", "b", "c"]
        field: attributes.tmp
      - type: flatten
        field: attributes.tmp

Edit: I should point out this is not entirely by luck, but rather is a natural benefit of the highly decomposed approach we've always pursued with these operators. With a robust set of simple operations, we can typically compose them to do what's needed.

@RoeiDimi
Contributor Author

@djaglowski yeah that's pretty cool, thank you so much! :)

I apologize, but I have another question:
About the feature gate we added, how do I enable it? I saw this link,

so I tried running the collector with all of the following variations, but none worked:

  • --feature-gates=logs.assignKeys,logs.jsonParserArray
  • --feature-gates=assignKeys,+jsonParserArray
  • --feature-gates=assignKeys,jsonParserArray
  • --feature-gates=+logs.assignKeys,+logs.jsonParserArray

@sumo-drosiek
Member

sumo-drosiek commented Jan 24, 2024

@djaglowski, @RoeiDimi Shouldn't we have this functionality in the OTTL language? I believe I was able to achieve the same functionality using the transformprocessor:

exporters:
  logging:
    verbosity: detailed
processors:
  transform/parsejson:
    log_statements:
      - context: log
        statements:
          # ParseJSON expects `{}`, so I need to build it manually
          - set(body, ParseJSON(Concat(["{\"tmp\":", body, "}"], "")))
          - set(body["key1"], body["tmp"][0])
          - set(body["key2"], body["tmp"][1])
          - set(body["key3"], body["tmp"][2])
          - delete_key(body, "tmp")
receivers:
  filelog/containers:
    start_at: beginning
    include:
      - ./tmp/logs/5.json
service:
  pipelines:
    logs/containers:
      exporters:
      - logging
      processors:
      - transform/parsejson
      receivers:
      - filelog/containers

./tmp/logs/5.json

["value1", "value2", "value3"]

output:

LogRecord #0
ObservedTimestamp: 2024-01-24 08:11:49.193363736 +0000 UTC
Timestamp: 1970-01-01 00:00:00 +0000 UTC
SeverityText: 
SeverityNumber: Unspecified(0)
Body: Map({"key1":"value1","key2":"value2","key3":"value3"})
Attributes:
     -> log.file.name: Str(5.json)
Trace ID: 
Span ID: 
Flags: 0
        {"kind": "exporter", "data_type": "logs", "name": "logging"}

@RoeiDimi
Contributor Author

RoeiDimi commented Jan 24, 2024

@sumo-drosiek @djaglowski, this suggestion may be interesting, as after I ran large-scale stress tests it seems that using the 3 operators together (json_array_parser, assign_keys, flatten) results in a massive performance hit. Chaining operators seems not very efficient in general: the initial design with a single operator (headerless_jarray_parser) that does the same logic as json_array_parser + assign_keys together performed significantly better.

Anyway @sumo-drosiek, I'd like to test your suggestion. I tried adding this to the builder config:

processors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/transformprocessor v0.90.0

but I'm getting an error saying the revision is wrong:
go: github.com/open-telemetry/opentelemetry-collector-contrib/[email protected]: reading github.com/open-telemetry/opentelemetry-collector-contrib/transformprocessor/go.mod at revision transformprocessor/v0.90.0: unknown revision transformprocessor/v0.90.0

Is it not v0.90.0? Can you please share the revision you used?

Thanks guys for helping me through this entirely new world of OTEL collector hehe..

@sumo-drosiek
Member

@RoeiDimi It should be github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor

@RoeiDimi
Contributor Author

RoeiDimi commented Jan 25, 2024

Hi @djaglowski
I know this is closed as completed, but after trying to use the solution we went with, I see too big a performance hit to be able to use it. Using these 3 operators is at least 2x slower than using the original headerless_jarray_parser alone.
I also tried @sumo-drosiek's suggestion (thank you for trying to help!) and it shows roughly the same numbers as using the 3 operators.

For my benchmarks I'm using Azure Kubernetes, single node, single pod, 5 CPUs. I tested the following:

  1. headerless-jarray-parser can take ~10-11 million events per minute
  2. 3 operators chained (or the processor suggestion): ~4-5 million events per minute
  3. Just for testing I removed the flatten to see whether it was the only issue, but it's only part of it; I got ~6-7 million events per minute
  4. json_array_parser alone got me back to ~10-11 million events per minute

It also makes sense in a way, as headerless-jarray-parser could iterate over the input array only once, create a map, and then replace attributes with that map.

I totally get the refactor we went for from a modularity point of view, but is there any chance you'd agree to now also adding headerless-jarray-parser for performance reasons?

@djaglowski
Member

@RoeiDimi, thanks for providing detailed performance numbers.

I think given the notable performance benefit, it's reasonable to revisit the combined operator.

json_array_parser and assign_keys are both useful in their own right, even if composition affects performance. I'd like to keep these capabilities in place regardless.

If we apply the terminology update we arrived at, we are talking about a headerless_json_array_parser. Rather than having two closely related parsers, headerless_json_array_parser and json_array_parser, we should consider what it looks like to support both use cases in one operator. In other words, pull the header/headerless distinction into the config. I imagine this could work as follows:

  • Re-add header settings, equivalent to those in the csv parser.
  • If neither header setting is used, the parser behaves exactly as it does today. That is, the result placed in parse_to is expected to be []any.
  • If either header setting is used, we perform the parsing step and then apply the header, ultimately placing an object in parse_to.

The benefit of this approach is that the operator simply gains an optional behavior, one that is closely related to the first step of parsing the array. WDYT?
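A minimal sketch of what such a config might look like (the `header` setting name here follows csv_parser and is an assumption, not a final design):

```yaml
operators:
  # Without a header: behaves exactly as today; parse_to receives a []any
  - type: json_array_parser
    parse_from: body
    parse_to: attributes.tmp

  # With a header: parse the array and apply the keys in a single pass;
  # parse_to receives an object
  - type: json_array_parser
    header: TimeGenerated,SourceIP,SourcePort
    parse_from: body
    parse_to: attributes
```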

@RoeiDimi
Contributor Author

@djaglowski yes, that's a good, clean idea
I guess we can now continue here - #30814
:)

djaglowski pushed a commit that referenced this pull request Jan 31, 2024
**Description:**
Adding a feature following #30644.
This feature allows the json_array_parser to accept a comma-delimited header; for every JSON array it parses, it outputs a map whose keys are the header fields and whose values are those parsed from the input JSON array.

This feature was added mainly for performance reasons; from a functional point of view it is mostly equivalent to chaining the two operators `json_array_parser -> assign_keys`.

**Link to tracking Issue:**

#30321

**Testing:**

- unittests
- End-to-end tests
Used generated traffic on a running otel collector that uses the parser and verified the data is as expected in the end table, and performance looks good

**Documentation:**

-
[json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/58cc91ca30eabbd35c074d79db8630fc474164d9/pkg/stanza/docs/operators/json_array_parser.md)
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this pull request Feb 1, 2024
…try#30644)

**Description:**
We are using the otel-collector as infrastructure and receive many types of data from a client. The client's data is always a form of JSON, and in one use case it is a simple headerless jarray, so we need a way to parse it and match headers to each field (similar to what csv_parser does, but with support for the types allowed in JSON, including nested objects).

**Link to tracking Issue:**

open-telemetry#30321

**Testing:**
* unittests
All the tests found in csv_parser were copied and adjusted, adding test scenarios for different types (numbers, booleans, null) as well as a test for parsing a nested object as part of the jarray
* End-to-end tests
Used generated traffic on a running otel collector that uses the parser and verified the data is as expected in the end table

**Documentation:**
*
[json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/beacea489ff4ae61c0bac4f477c04748944c9054/pkg/stanza/docs/operators/json_array_parser.md)
*
[assign_keys.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/beacea489ff4ae61c0bac4f477c04748944c9054/pkg/stanza/docs/operators/assign_keys.md)

---------

Co-authored-by: Daniel Jaglowski <[email protected]>
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this pull request Feb 1, 2024
(Same commit message as the Jan 31 commit above.)