enhancement(regex_parser transform): Add RegexSet support to regex #2493

mre · 2020-04-29T13:13:13Z

This adds support for specifying multiple regular expressions,
which are matched against the input using regex::RegexSet.
Fixes #2469.

Note that this is a breaking change, as it requires all configs to be rewritten
to use regexes = ["..."] instead of regex = "...". Alternatively, we could
support both fields in the config, although I think this would lead to confusion down
the road.

Design decisions

The individual regular expressions are kept in a vector named regexes, while
the actual RegexSet is called regexset in the transform. Both are needed because
RegexSet::matches only reports which patterns matched, as a set of indices; it does not provide capture groups.
Each index corresponds to the position of the pattern in the list given to
RegexSet's constructor, which is the same as its position in the vector, so the matching expression can be looked up there to extract the captures.

Type coercions are applied to all captures with the same name across all patterns.
The reasoning is that captures with the same name will probably be
used similarly further down the pipeline, no matter which pattern matched.
Also, splitting up the coercions per pattern would likely cause additional duplication
and make the configuration syntax harder to understand for beginners.
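
To illustrate the coercion rule described above, here is a minimal sketch. The `Conversion` and `Value` enums and the `coerce` helper are hypothetical names chosen for this example, not Vector's actual conversion code. The point is only that one map keyed by capture name is applied to whatever fields the matching pattern produced.

```rust
use std::collections::HashMap;

/// Hypothetical target types; Vector's real conversion type may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Conversion {
    Integer,
    Float,
    Boolean,
}

/// Hypothetical typed value produced by a coercion.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Integer(i64),
    Float(f64),
    Boolean(bool),
    Text(String),
}

/// Coerce captured fields by name. The same `types` map is used no matter
/// which pattern produced `captures`; names without an entry, and values
/// that fail to parse, are kept as text.
fn coerce(
    captures: &[(&str, &str)],
    types: &HashMap<&str, Conversion>,
) -> HashMap<String, Value> {
    captures
        .iter()
        .map(|&(name, raw)| {
            let value = match types.get(name) {
                Some(Conversion::Integer) => raw.parse().map(Value::Integer).ok(),
                Some(Conversion::Float) => raw.parse().map(Value::Float).ok(),
                Some(Conversion::Boolean) => raw.parse().map(Value::Boolean).ok(),
                None => None,
            }
            .unwrap_or_else(|| Value::Text(raw.to_string()));
            (name.to_string(), value)
        })
        .collect()
}
```

With a map that declares `id` as an integer, calling `coerce` on the captures of a short pattern (only `id`) and of a longer one (`id`, `time`, `check`) yields an integer `id` in both cases, which is the behavior this PR aims for.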

Open questions

If possible, we should avoid the two calls to clone(). I was considering storing references
to the regular expressions inside the vector instead of the expressions themselves,
but I was hoping to avoid lifetimes to keep the code more maintainable. Maybe it's a trade-off
worth considering, though, or there is a better way.

As discussed with @Hoverbear and @lukesteensen.

binarylogic · 2020-04-29T13:37:52Z

Very nice! Thanks for your work on this. I'd like to see if we can deprecate the previous regex option without breaking backward compatibility. If you don't want to spend time doing that we can get someone on the team to implement that.

mre · 2020-04-29T13:47:56Z

> I'd like to see if we can deprecate the previous regex option without breaking backward compatibility.

Good idea! We could print a warning when vector boots, e.g.

⚠️ Usage of `regex` is deprecated and will be removed with version 1.0.
Please upgrade your config to use `regexes` instead like so:
`regexes = ['<pattern_here>']`

The <pattern_here> part would be the actual pattern from the user's config to make upgrading very easy.
Even better, we could provide a vector --fix command (similar to rustfix) that would automatically update the configuration to the latest version.

binarylogic · 2020-04-29T13:53:35Z

Agree! I say we start with a single-line warn log message and then consider the --fix option with something more general, like #1037. As you can see in #1037, we'd like to version our configuration and use that as a guide for updating. We're still discussing that, but I think that would be the best approach. Curious what you think!

mre · 2020-05-04T13:24:16Z

@binarylogic sounds good!
I've added the warning now and it looks okay to me when I run vector --config vector.toml:

May 04 15:18:39.938  INFO vector: Vector is starting. version="0.9.1" git_version="v0.9.0-105-g57dd046" released="Mon, 04 May 2020 13:14:03 +0000" arch="x86_64"
May 04 15:18:39.944  WARN vector::transforms::regex_parser: Usage of `regex` is deprecated and will be removed in a future version. Please upgrade your config to use `regexes` instead: `regexes = ['^/appdata/nomad/alloc/(?P<alloc_id>\w[^/]*?)/alloc/logs/(?P<container_name>.*?)\.(?P<logtype>\w+).*']`

Let me know if the wording should be changed.

The CI pipeline is currently failing, but it doesn't look like it's related to the changes in this PR.
Before merging this, I'd propose running a benchmark to check for performance regressions.
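
For readers following along, the backward-compatible fallback discussed above could look roughly like the sketch below. `RegexParserConfig` and `resolve_patterns` are hypothetical names for this illustration, and placing the deprecated pattern first is an assumption, not something the thread specifies. The warning text is the one shown in the log output above.

```rust
/// Hypothetical, simplified config: the deprecated single `regex` field next
/// to the new `regexes` list. Not Vector's actual config struct.
#[derive(Debug, Default)]
struct RegexParserConfig {
    regex: Option<String>,
    regexes: Vec<String>,
}

/// Returns the patterns to compile plus an optional deprecation warning.
/// A caller would log the warning once at startup.
fn resolve_patterns(config: &RegexParserConfig) -> (Vec<String>, Option<String>) {
    match &config.regex {
        Some(old) => {
            let warning = format!(
                "Usage of `regex` is deprecated and will be removed in a future version. \
                 Please upgrade your config to use `regexes` instead: `regexes = ['{}']`",
                old
            );
            // Assumption: the deprecated pattern is tried before any new-style ones.
            let mut patterns = vec![old.clone()];
            patterns.extend(config.regexes.iter().cloned());
            (patterns, Some(warning))
        }
        None => (config.regexes.clone(), None),
    }
}
```

Returning the warning instead of logging it directly keeps the function easy to test; the transform's build step would pass it to `warn!` if present.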

mre · 2020-05-04T13:26:17Z

src/transforms/regex_parser.rs

+                warn!(
+                    "Usage of `regex` is deprecated and will be removed in a future version. \
+                     Please upgrade your config to use `regexes` instead: \
+                     `regexes = ['{}']`",


Shall we add a link to the regex transform documentation here?

Hoverbear · 2020-05-11

Seems like a good idea!

lukesteensen

This is looking great, thank you @mre! I added notes about a couple of things we'll want to clean up before merging, and a couple more general things that I'm curious about.

src/transforms/regex_parser.rs

mre · 2020-05-11T22:24:33Z

Found some time to fix the tests. The PR should be good to go now.
@lukesteensen @binarylogic can you have another look please? 😊

There's a caveat in the implementation:
Say your input is 1234 235.42 true and the regexes are

regexes = [
  '(?P<id>\d+)',
  '(?P<id>\d+) (?P<time>[\d.]+) (?P<check>\S+)',
]

then the first pattern will match and be used, even though the second one describes the whole line. That's because an unanchored pattern matches anywhere in the input. It's expected behavior for regular expressions of course, but in the case of multiple patterns, this can be confusing. (At least I was confused by that, hence the failing test.)

The proper solution is to anchor each pattern so that it has to match the entire line, like so:

regexes = [
  '^(?P<id>\d+)$',
  '^(?P<id>\d+) (?P<time>[\d.]+) (?P<check>\S+)$',
]

This works as "intended" and the second pattern matches.
I wonder if we should document this somehow or if it's obvious.
Alternatively, we could add another option (called exhaustive or similar) that would enforce that a pattern has to match the full input. I'd argue that this would make things unnecessarily complex, though.

Hoverbear · 2020-05-11T22:29:16Z

@mre I think a little warning in the docs is plenty. :) I don't think an extra option is necessary.

lukesteensen

New changes look great! Thank you so much for this @mre, we really appreciate it.

If @binarylogic and @Hoverbear agree, I'd like to go ahead with the s/regexes/patterns/ rename, but this seems otherwise just about ready to go.

Hoverbear · 2020-05-12T03:12:59Z

Fine with me!

mre · 2020-05-12T13:11:03Z

Okay, I renamed regexes to patterns and signed off the commits.
I also added a link to the docs to the warning that is shown when a config is still using the old regex field.
The only remaining issue from my side is the failing make generate. It seems to fail on an unrelated change, but I might be wrong:

check-blog_1                                      | `/` is not writable.
check-blog_1                                      | Bundler will use `/tmp/bundler20200512-1-chzyp41' as your home directory temporarily.
check-blog_1                                      | ---> The resulting hash from the `/.meta/**/*.toml` files failed
check-blog_1                                      |      validation against the following schema:
check-blog_1                                      |      
check-blog_1                                      |          /.meta/schema/meta.json
check-blog_1                                      |      
check-blog_1                                      |      The errors include:
check-blog_1                                      |      
check-blog_1                                      |          * The value at `/sources/vector/options/tls/children/verify_certificate/kubernetes` failed validation for `/definitions/field/additionalProperties`, reason: `schema`

@binarylogic is that something you could help me with?

Hoverbear · 2020-05-13T18:04:28Z

@mre It looks like the problem exists on master as well. Please give us a day or two to fix it. :)

lukesteensen · 2020-05-13T19:49:47Z

I think that's something we can sort on master, so I'm going to go ahead and merge this as-is.

Thanks again @mre! We're super appreciative of the time and effort you've put in here to make Vector better.

Previously, the transform incorrectly mapped the capture indexes of the matched pattern across the capture groups of all patterns so that, with something like:

```toml
[sources.in]
  type = "stdin"

[transforms.regex]
  type = "regex_parser"
  inputs = ["in"]
  patterns = [
    '^blah $(?P<socket_code>[0-9]+): (?:[^$]+)\) while (?P<timeout_while>.)',
    "^notblah (?P<close_while>.+)$",
  ]

[sinks.out]
  inputs = ["regex"]
  type = "console"
  encoding.codec = "json"
```

and a line of:

```
notblah something
```

it would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of the capture information for each pattern, similar to `capture_logs`, and uses the same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is this what we want? It was indirectly discussed in #2493, but a preference wasn't explicitly stated and it doesn't appear to be documented (in `master`) for the new `patterns` field. Once we decide what the behavior should be, I can document it and/or change the implementation if needed. I might have expected it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the unreleased documentation, just marked as deprecated, but it appears to have been dropped wholesale in https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41 . I just wanted to clarify whether this is the expected process for deprecating parameters.

Fixes: #3096

Signed-off-by: Jesse Szwedko <[email protected]>
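
The fix described above can be sketched as follows. `GroupNames` and `fields_for_match` are hypothetical names for this illustration, and the captured groups are passed in as plain slices rather than taken from a real regex match. The name lists mirror the shape of what `Regex::capture_names` yields in the `regex` crate (group 0 and unnamed groups have no name).

```rust
use std::collections::BTreeMap;

/// Group names for one pattern, indexed by capture group number.
/// Group 0 (the whole match) and unnamed groups are `None`.
type GroupNames = Vec<Option<&'static str>>;

/// Build event fields from the groups captured by the pattern at `pattern_index`.
/// `capture_names` holds one `GroupNames` per pattern, in the same order as the
/// patterns themselves, so the match index selects the right name list.
fn fields_for_match(
    capture_names: &[GroupNames],
    pattern_index: usize,
    groups: &[Option<&str>],
) -> BTreeMap<String, String> {
    capture_names[pattern_index]
        .iter()
        .zip(groups)
        .filter_map(|(name, value)| match (name, value) {
            // Only named groups that actually participated in the match become fields.
            (Some(name), Some(value)) => Some((name.to_string(), value.to_string())),
            _ => None,
        })
        .collect()
}
```

For the example config, the line `notblah something` matches the pattern at index 1, so only that pattern's name list is consulted and `socket_code` is no longer set.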

mre requested a review from lukesteensen as a code owner April 29, 2020 13:13

mre changed the title ~~WIP: Add RegexSet support to regex~~ WIP: enhancement(regex_parser transform): Add RegexSet support to regex Apr 29, 2020

mre force-pushed the regexset branch from 4051ee3 to b137907 Compare April 29, 2020 13:35

mre requested a review from binarylogic as a code owner April 29, 2020 13:35

mre force-pushed the regexset branch 3 times, most recently from 0f5ee0f to 3b92e9e Compare April 29, 2020 15:09

mre changed the title ~~WIP: enhancement(regex_parser transform): Add RegexSet support to regex~~ enhancement(regex_parser transform): Add RegexSet support to regex Apr 29, 2020

mre force-pushed the regexset branch 2 times, most recently from 1891428 to 0fbb770 Compare May 4, 2020 11:07

lukesteensen requested changes May 4, 2020

lukesteensen approved these changes May 12, 2020

enhancement(regex_parser transform): Multi-pattern support (vectordot…

f1b625b

…dev#2469) This allows to specify multiple regular expression patterns to be defined that will be matched on the input using `regex::RegexSet`. Signed-off-by: Matthias Endler <[email protected]>

mre force-pushed the regexset branch from 1cf67a9 to f1b625b Compare May 12, 2020 13:10

lukesteensen merged commit 0cdc500 into vectordotdev:master May 13, 2020

fanatid mentioned this pull request May 13, 2020

Change config types in docs to appropriate types in code #2592

Closed

mre deleted the regexset branch May 19, 2020 07:35

fanatid mentioned this pull request Jun 2, 2020

docs: Fix wrong example in tokenizer module (version 0.9.1) #2716

Merged

Hoverbear changed the title ~~enhancement(regex_parser transform): Add RegexSet support to regex~~ enhancement(regex_parser transform)!: Add RegexSet support to regex Jul 13, 2020

Hoverbear changed the title ~~enhancement(regex_parser transform)!: Add RegexSet support to regex~~ enhancement(regex_parser transform): Add RegexSet support to regex Jul 13, 2020

jszwedko mentioned this pull request Jul 22, 2020

fix(regex_parser transform): Correctly assign capture group fields #3164

Merged

binarylogic mentioned this pull request Aug 7, 2020

Allow multiple patterns in parser transforms #1477

Closed

mengesb pushed a commit to jacobbraaten/vector that referenced this pull request Dec 9, 2020

enhancement(regex_parser transform): Add RegexSet support to regex (v…

c635db1

…ectordotdev#2493) Signed-off-by: Brian Menges <[email protected]>
