feat: Drain uses different tokenizer based on log format #13384

cyriltovena · 2024-07-02T16:40:24Z

What this PR does / why we need it:

This replaces the tokenizer with a specialized one depending on the log format. It also discards JSON logs.

I also improved performance by removing most of the allocations.

benchstat before.txt after.txt
name                                                            old time/op    new time/op    delta
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-16          1.71ms ± 0%    0.84ms ± 2%   -51.08%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-16        123µs ± 3%      57µs ± 4%   -53.82%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-16             302µs ±24%     172µs ± 6%   -43.19%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-16    5.87ms ± 1%    3.01ms ±18%   -48.80%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/journald.txt-16              2.63ms ± 4%    1.85ms ± 3%   -29.62%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kafka.txt-16                 1.85ms ± 6%    1.03ms ± 2%   -44.42%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-16            2.29ms ± 3%    1.40ms ± 2%   -38.93%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/vault.txt-16                 1.89ms ± 9%    1.11ms ± 9%   -40.96%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/calico.txt-16                3.02ms ±28%    1.48ms ± 3%   -51.13%  (p=0.008 n=5+5)

name                                                            old alloc/op   new alloc/op   delta
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-16          1.35MB ± 0%    0.03MB ± 0%   -97.96%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-16       96.7kB ± 0%     0.0kB ± 0%  -100.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-16             545kB ± 0%       5kB ± 0%   -99.07%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-16    4.80MB ± 0%    0.00MB ± 8%  -100.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/journald.txt-16              3.19MB ± 0%    0.03MB ± 0%   -99.18%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kafka.txt-16                 2.98MB ± 0%    0.02MB ± 0%   -99.19%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-16            3.17MB ± 0%    0.02MB ± 0%   -99.22%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/vault.txt-16                 2.87MB ± 0%    0.02MB ± 0%   -99.16%  (p=0.016 n=5+4)
Drain_TrainExtractsPatterns/testdata/calico.txt-16                3.16MB ± 0%    0.03MB ± 0%   -99.20%  (p=0.008 n=5+5)

name                                                            old allocs/op  new allocs/op  delta
Drain_TrainExtractsPatterns/testdata/agent-logfmt.txt-16           20.0k ± 0%      0.1k ± 0%   -99.42%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/ingester-logfmt.txt-16        1.60k ± 0%     0.00k       -100.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/drone-json.txt-16               660 ± 0%       210 ± 0%   -68.18%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/distributor-logfmt.txt-16     80.0k ± 0%      0.0k       -100.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/journald.txt-16               3.96k ± 0%     1.01k ± 0%   -74.47%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kafka.txt-16                  3.99k ± 0%     1.00k ± 0%   -74.96%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/kubernetes.txt-16             4.00k ± 0%     1.00k ± 0%   -74.91%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/vault.txt-16                  4.00k ± 0%     1.00k ± 0%   -75.00%  (p=0.008 n=5+5)
Drain_TrainExtractsPatterns/testdata/calico.txt-16                 4.04k ± 0%     1.02k ± 0%   -74.76%  (p=0.008 n=5+5)

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
Title matches the required conventional commits format, see here
- Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

cyriltovena · 2024-07-04T06:58:17Z

Fixes https://github.com/grafana/loki-private/issues/1014

benclive · 2024-07-04T08:50:20Z

pkg/pattern/drain/line_tokenizer.go

+	for !t.dec.EOL() && t.dec.ScanKeyval() {
+		key := t.dec.Key()
+		if isVariableField(key) {
+			tokens = append(tokens, unsafeString(t.dec.Key()), t.varReplace)


Interesting that this works - I thought that ScanKeyval reused those values after every call, but it seems that it doesn't?

It returns either a copy if it needs to decode quoted bytes or just a subslice.
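The concern in the exchange above is about aliasing: a string built from a byte slice without copying shares that slice's memory, so it is only safe while the underlying bytes are left untouched. The sketch below illustrates the difference between an aliasing conversion and a copying one. The `unsafeString` body here is an assumption (a common Go 1.20+ idiom), not necessarily the helper used in the diff, and `sharesMemory` is a hypothetical name introduced for the demonstration.

```go
package main

import (
	"fmt"
	"unsafe"
)

// unsafeString converts b to a string without allocating or copying.
// Assumed implementation; the returned string aliases b's memory, so b
// must not be modified while the string is in use.
func unsafeString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// sharesMemory reports whether s and b start at the same address.
func sharesMemory(s string, b []byte) bool {
	if len(s) == 0 || len(b) == 0 {
		return false
	}
	return unsafe.StringData(s) == unsafe.SliceData(b)
}

func main() {
	line := []byte("level=info msg=hello")
	key := line[:5] // a subslice of the line, like a decoder returning a view

	aliased := unsafeString(key) // no copy: points into line
	copied := string(key)        // regular conversion: copies the bytes

	fmt.Println("aliased shares memory:", sharesMemory(aliased, key))
	fmt.Println("copied shares memory: ", sharesMemory(copied, key))
}
```

If the decoder reused an internal buffer between calls, the aliased string would silently change; a subslice of the original line (or a fresh copy) does not have that problem as long as the line itself stays alive and unmodified.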

benclive · 2024-07-04T08:52:34Z

pkg/pattern/drain/line_tokenizer.go

+	buf := bytes.NewBuffer(make([]byte, 0, 1024))
+	enc := gologfmt.NewEncoder(buf)
+	for i := 0; i < len(tokens); i += 2 {
+		k, v := tokens[i], tokens[i+1]


How does this handle multi-word values? Are they a single token?

Correct, they are.

feat: Drain uses different tokenizer based on log format

2efa571

pull-request-size bot added the size/L label Jul 2, 2024

cyriltovena added 9 commits July 2, 2024 19:27

fixes panic in metrics stat

feb699b

fixes panic in pattern ingestion for json

458d2ec

add support for json

3cf6a6e

Merge remote-tracking branch 'upstream/main' into feat/drain-format

8367a67

add support for json

9776b54

Merge remote-tracking branch 'upstream/main' into feat/drain-format

63b36ff

improve performance by removing allocs

9e6cfc4

use better unsafe function

466197d

lint

b390b58

cyriltovena marked this pull request as ready for review July 3, 2024 21:30

cyriltovena requested a review from a team as a code owner July 3, 2024 21:30

cyriltovena added 2 commits July 4, 2024 00:00

lint files

347b0e9

Merge branch 'main' into feat/drain-format

16eea97

skip empty streams

a7bd24f

benclive reviewed Jul 4, 2024

benclive approved these changes Jul 4, 2024

cyriltovena merged commit bc01e6f into grafana:main Jul 4, 2024
60 checks passed

This was referenced Jul 8, 2024

chore(k210): release 3.1.0 #13435

Closed

chore(k210): release 3.1.0 #13462

Closed

loki-gh-app bot mentioned this pull request Jul 15, 2024

chore(k211): release 3.1.0 #13521

Closed

loki-gh-app bot mentioned this pull request Jul 22, 2024

chore(k212): release 3.1.0 #13595

Closed

This was referenced Aug 15, 2024

chore(k215): release 3.2.0 #13905

Open

chore(k216): release 3.2.0 #13929

Open

loki-gh-app bot mentioned this pull request Sep 9, 2024

chore(k218): release 3.2.0 #14088

Merged

loki-gh-app bot mentioned this pull request Sep 23, 2024

chore(k221): release 3.2.0 #14214

Open

loki-gh-app bot mentioned this pull request Sep 30, 2024

chore(k222): release 3.2.0 #14305

Open

loki-gh-app bot mentioned this pull request Oct 7, 2024

chore(k223): release 3.2.0 #14402

Open

This was referenced Oct 14, 2024

chore(k224): release 3.2.0 #14486

Open

chore(k225): release 3.2.0 #14543

Closed

This was referenced Oct 21, 2024

chore(k225): release 3.2.0 #14548

Open

chore(k226): release 3.2.0 #14625

Open
