feat: Drain uses different tokenizer based on log format #13384

Merged: 13 commits, Jul 4, 2024

@cyriltovena cyriltovena commented Jul 2, 2024

What this PR does / why we need it:

This replaces the tokenizer with a specialized one chosen according to the log format. It also discards JSON logs.
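
As a rough illustration of choosing a tokenizer per log format, the sketch below guesses the format of a line and picks a tokenizer name, dropping JSON lines. Everything in it (`detectFormat`, `tokenizerFor`, the `logFormat` constants, and the heuristics themselves) is hypothetical and invented for this example; it is not the detection logic from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// logFormat is a hypothetical label for the detected shape of a log line.
type logFormat string

const (
	formatJSON    logFormat = "json"
	formatLogfmt  logFormat = "logfmt"
	formatUnknown logFormat = "unknown"
)

// detectFormat is an assumed heuristic, not the PR's implementation:
// a line wrapped in braces is treated as JSON, and a line whose first
// space-separated field looks like key=value is treated as logfmt.
func detectFormat(line string) logFormat {
	trimmed := strings.TrimSpace(line)
	if len(trimmed) >= 2 && trimmed[0] == '{' && trimmed[len(trimmed)-1] == '}' {
		return formatJSON
	}
	first, _, _ := strings.Cut(trimmed, " ")
	if strings.IndexByte(first, '=') > 0 {
		return formatLogfmt
	}
	return formatUnknown
}

// tokenizerFor returns a hypothetical tokenizer name for the format.
// ok is false when the line should be discarded (JSON in this sketch).
func tokenizerFor(f logFormat) (name string, ok bool) {
	switch f {
	case formatJSON:
		return "", false
	case formatLogfmt:
		return "logfmt", true
	default:
		return "generic", true
	}
}

func main() {
	lines := []string{
		`{"level":"info","msg":"started"}`,
		`ts=2024-07-02T10:00:00Z level=info msg="started"`,
		`I0702 10:00:00 server started`,
	}
	for _, l := range lines {
		f := detectFormat(l)
		name, ok := tokenizerFor(f)
		fmt.Printf("format=%s tokenizer=%q keep=%v\n", f, name, ok)
	}
}
```

Detecting the format once and then dispatching keeps the per-line work in a single specialized code path, which is one plausible way to arrive at a design like the one described here.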

I also improved performance by removing most allocations.

benchstat before.txt after.txt
name                                                            old time/op    new time/op    delta
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-16          1.71ms ± 0%    0.84ms ± 2%   -51.08%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-16        123µs ± 3%      57µs ± 4%   -53.82%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-16             302µs ±24%     172µs ± 6%   -43.19%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-16    5.87ms ± 1%    3.01ms ±18%   -48.80%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/journald.txt-16              2.63ms ± 4%    1.85ms ± 3%   -29.62%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kafka.txt-16                 1.85ms ± 6%    1.03ms ± 2%   -44.42%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-16            2.29ms ± 3%    1.40ms ± 2%   -38.93%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/vault.txt-16                 1.89ms ± 9%    1.11ms ± 9%   -40.96%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/calico.txt-16                3.02ms ±28%    1.48ms ± 3%   -51.13%  (p=0.008 n=5+5)

name                                                            old alloc/op   new alloc/op   delta
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-16          1.35MB ± 0%    0.03MB ± 0%   -97.96%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-16       96.7kB ± 0%     0.0kB ± 0%  -100.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-16             545kB ± 0%       5kB ± 0%   -99.07%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-16    4.80MB ± 0%    0.00MB ± 8%  -100.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/journald.txt-16              3.19MB ± 0%    0.03MB ± 0%   -99.18%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kafka.txt-16                 2.98MB ± 0%    0.02MB ± 0%   -99.19%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-16            3.17MB ± 0%    0.02MB ± 0%   -99.22%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/vault.txt-16                 2.87MB ± 0%    0.02MB ± 0%   -99.16%  (p=0.016 n=5+4)
Drain_TrainExtractsPatterns/testdata/calico.txt-16                3.16MB ± 0%    0.03MB ± 0%   -99.20%  (p=0.008 n=5+5)

name                                                            old allocs/op  new allocs/op  delta
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-16           20.0k ± 0%      0.1k ± 0%   -99.42%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-16        1.60k ± 0%     0.00k       -100.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-16               660 ± 0%       210 ± 0%   -68.18%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-16     80.0k ± 0%      0.0k       -100.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/journald.txt-16               3.96k ± 0%     1.01k ± 0%   -74.47%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kafka.txt-16                  3.99k ± 0%     1.00k ± 0%   -74.96%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-16             4.00k ± 0%     1.00k ± 0%   -74.91%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/vault.txt-16                  4.00k ± 0%     1.00k ± 0%   -75.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/calico.txt-16                 4.04k ± 0%     1.02k ± 0%   -74.76%  (p=0.008 n=5+5)


@cyriltovena cyriltovena marked this pull request as ready for review July 3, 2024 21:30
@cyriltovena cyriltovena requested a review from a team as a code owner July 3, 2024 21:30
for !t.dec.EOL() && t.dec.ScanKeyval() {
	key := t.dec.Key()
	if isVariableField(key) {
		tokens = append(tokens, unsafeString(t.dec.Key()), t.varReplace)
Contributor

Interesting that this works. I thought that ScanKeyval reused those values after every call, but it seems that it doesn't?

Contributor Author

It returns either a copy if it needs to decode quoted bytes or just a subslice.
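
To make the copy-versus-subslice distinction concrete, here is a minimal sketch of a zero-copy byte-to-string conversion. The name `unsafeString` comes from the snippet under review, but the body shown here is an assumption (it needs Go 1.20 or later), and `sharesMemory` is a hypothetical helper added only for the demonstration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// unsafeString reinterprets b as a string without copying.
// Assumed implementation; the helper of the same name in the PR may differ.
// The bytes must not be modified while the returned string is in use.
func unsafeString(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// sharesMemory is a hypothetical helper that reports whether the string's
// data pointer falls inside the backing bytes of b.
func sharesMemory(s string, b []byte) bool {
	if len(s) == 0 || len(b) == 0 {
		return false
	}
	p := uintptr(unsafe.Pointer(unsafe.StringData(s)))
	start := uintptr(unsafe.Pointer(unsafe.SliceData(b)))
	return p >= start && p < start+uintptr(len(b))
}

func main() {
	line := []byte(`level=info msg=hello`)
	key := line[:5] // a subslice of the input, like an unquoted logfmt key

	aliased := unsafeString(key) // no allocation, points into line
	copied := string(key)        // regular conversion, owns its own bytes

	fmt.Println(aliased, sharesMemory(aliased, line))
	fmt.Println(copied, sharesMemory(copied, line))
}
```

The aliased string is only valid while the underlying buffer is left untouched, which is why the reviewer's question about value reuse matters: if a decoder overwrote its buffer between calls, tokens built this way would change underneath the caller.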

buf := bytes.NewBuffer(make([]byte, 0, 1024))
enc := gologfmt.NewEncoder(buf)
for i := 0; i < len(tokens); i += 2 {
	k, v := tokens[i], tokens[i+1]
Contributor

How does this handle multi-word values? Are they a single token?

Contributor Author

Correct, they are.
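
The flat key/value token layout discussed in this thread can be sketched with the standard library alone. `tokenizePairs` and `joinPairs` below are hypothetical, simplified stand-ins: they ignore escape sequences and are not the go-logfmt decoder and encoder the PR uses. They only show how a quoted multi-word value stays a single token and how stepping by two walks the pairs.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tokenizePairs splits a logfmt-like line into alternating keys and values.
// Hypothetical and simplified: escapes inside quoted values are not handled.
func tokenizePairs(line string) []string {
	var tokens []string
	for {
		line = strings.TrimLeft(line, " ")
		if line == "" {
			return tokens
		}
		end := strings.IndexAny(line, "= ")
		if end < 0 { // trailing bare key with no value
			return append(tokens, line, "")
		}
		key := line[:end]
		if line[end] == ' ' { // bare key followed by more fields
			tokens = append(tokens, key, "")
			line = line[end:]
			continue
		}
		rest := line[end+1:]
		var value string
		if strings.HasPrefix(rest, `"`) {
			closing := strings.IndexByte(rest[1:], '"')
			if closing < 0 { // unterminated quote: take the remainder
				value, rest = rest[1:], ""
			} else {
				value, rest = rest[1:1+closing], rest[closing+2:]
			}
		} else if sp := strings.IndexByte(rest, ' '); sp >= 0 {
			value, rest = rest[:sp], rest[sp:]
		} else {
			value, rest = rest, ""
		}
		tokens = append(tokens, key, value)
		line = rest
	}
}

// joinPairs walks the tokens two at a time and re-quotes values with spaces.
func joinPairs(tokens []string) string {
	var sb strings.Builder
	for i := 0; i+1 < len(tokens); i += 2 {
		k, v := tokens[i], tokens[i+1]
		if i > 0 {
			sb.WriteByte(' ')
		}
		sb.WriteString(k)
		sb.WriteByte('=')
		if strings.ContainsRune(v, ' ') {
			sb.WriteString(strconv.Quote(v))
		} else {
			sb.WriteString(v)
		}
	}
	return sb.String()
}

func main() {
	tokens := tokenizePairs(`level=info msg="connection reset by peer" ts=123`)
	fmt.Printf("%d tokens: %q\n", len(tokens), tokens)
	fmt.Println(joinPairs(tokens))
}
```

For the sample line this yields six tokens, with `connection reset by peer` occupying a single value slot, so the pairwise loop never drifts out of alignment.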

@cyriltovena cyriltovena merged commit bc01e6f into grafana:main Jul 4, 2024
60 checks passed