enhancement(regex_parser transform): Add RegexSet support to regex #2493
Very nice! Thanks for your work on this. I'd like to see if we can deprecate the previous
Good idea! We could print a warning when
The
Agree! I say we start with a single-line
@binarylogic sounds good!
Let me know if the wording should be changed. The CI pipeline is currently failing, but it doesn't look like it's related to the changes in this PR.
src/transforms/regex_parser.rs
warn!(
    "Usage of `regex` is deprecated and will be removed in a future version. \
    Please upgrade your config to use `regexes` instead: \
    `regexes = ['{}']`",
Shall we add a link to the regex transform documentation here?
Seems like a good idea!
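The deprecation path discussed in this thread could look roughly like the following. This is a minimal std-only sketch, not the code in the PR: `resolve_patterns` is a hypothetical helper, returning the warning as a string is an assumption made so the example stays self-contained (the transform itself logs through `warn!`), and putting the deprecated value first when both options are set is also an assumption.

```rust
/// Hypothetical helper: folds the deprecated single `regex` option into the
/// new `regexes` list and produces a deprecation warning when the old option
/// is present. The warning is returned instead of logged to keep this
/// sketch free of logging dependencies.
fn resolve_patterns(
    regex: Option<String>,
    regexes: Vec<String>,
) -> (Vec<String>, Option<String>) {
    match regex {
        // Old-style config: treat it as the first list entry and warn.
        // (Ordering the old value first is an assumption of this sketch.)
        Some(old) => {
            let warning = format!(
                "Usage of `regex` is deprecated and will be removed in a future version. \
                 Please upgrade your config to use `regexes` instead: `regexes = ['{}']`",
                old
            );
            let mut merged = vec![old];
            merged.extend(regexes);
            (merged, Some(warning))
        }
        // New-style config: nothing to migrate, nothing to warn about.
        None => (regexes, None),
    }
}

fn main() {
    let (patterns, warning) =
        resolve_patterns(Some(r"(?P<id>\d+)".to_string()), Vec::new());
    println!("{:?}", patterns);
    if let Some(message) = warning {
        eprintln!("{}", message);
    }
}
```

Keeping both fields readable during a deprecation window lets existing configs keep working while nudging users toward the list form.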
This is looking great, thank you @mre! I added notes about a couple of things we'll want to clean up before merging, and a couple more general things that I'm curious about.
Found some time to fix the tests. The PR should be good to go now.
There's a caveat in the implementation. If the config contains
regexes = [
  '(?P<id>\d+)',
  '(?P<id>\d+) (?P<time>[\d.]+) (?P<check>\S+)',
]
then the first pattern will match. That's because an unanchored pattern matches anywhere in the input. It's expected behavior for regular expressions of course, but in the case of multiple patterns, this can be confusing. (At least I was confused by that, hence the failing test.) The proper solution is to capture the entire line like so:
regexes = [
  '^(?P<id>\d+)$',
  '^(?P<id>\d+) (?P<time>[\d.]+) (?P<check>\S+)$',
]
This works as "intended" and the second pattern matches.
@mre I think a little warning in the docs is plenty. :) I don't think an extra option is necessary.
New changes look great! Thank you so much for this @mre, we really appreciate it.
If @binarylogic and @Hoverbear agree, I'd like to go ahead with the s/regexes/patterns/ rename, but this seems otherwise just about ready to go.
Fine with me!
…dev#2469) This allows multiple regular expression patterns to be specified, which are matched against the input using `regex::RegexSet`. Signed-off-by: Matthias Endler <[email protected]>
Okay renamed
@binarylogic is that something you could help me with?
@mre It looks like the problem exists on master as well. Please give us a day or two to fix it. :)
I think that's something we can sort on master, so I'm going to go ahead and merge this as-is. Thanks again @mre! We're super appreciative of the time and effort you've put in here to make Vector better.
Previously, the transform was incorrectly mapping the capture indexes of the matched pattern across the capture groups of all patterns so that, with something like:

```toml
[sources.in]
  type = "stdin"

[transforms.regex]
  type = "regex_parser"
  inputs = ["in"]
  patterns = [
    '^blah \((?P<socket_code>[0-9]+): (?:[^\)]+)\) while (?P<timeout_while>.)',
    "^notblah (?P<close_while>.+)$",
  ]

[sinks.out]
  inputs = ["regex"]
  type = "console"
  encoding.codec = "json"
```

and a line of:

```
notblah something
```

it would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of the capture information for each pattern, similar to `capture_logs`, and uses the same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is this what we want? It was indirectly discussed in #2493, but a preference wasn't explicitly stated and it doesn't appear to be documented (in `master`) for the new `patterns` field. Once we decide what the behavior should be, I can document it and/or change the implementation if needed. I might have expected it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the unreleased documentation, just marked as deprecated, but it appears to have been dropped wholesale in https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41 . I just wanted to clarify whether this is the expected process for deprecating parameters.

Fixes: #3096
Signed-off-by: Jesse Szwedko <[email protected]>
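To make the indexing problem described in this commit message concrete, here is a minimal std-only sketch. It does not use the `regex` crate or the real `RegexParser` type: `assign_fields` and the hard-coded capture tables are hypothetical stand-ins, and the captured values are supplied by hand as if the second pattern had already matched the line `notblah something`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the field assignment step. `names` holds
/// (capture group index, field name) pairs and `captures` holds the text of
/// each group of the pattern that matched (index 0 is the whole match).
fn assign_fields(names: &[(usize, &str)], captures: &[&str]) -> BTreeMap<String, String> {
    names
        .iter()
        .filter_map(|(idx, name)| {
            captures
                .get(*idx)
                .map(|value| (name.to_string(), value.to_string()))
        })
        .collect()
}

fn main() {
    // One capture table per pattern, in the same order as `patterns`.
    let per_pattern: Vec<Vec<(usize, &str)>> = vec![
        vec![(1, "socket_code"), (2, "timeout_while")],
        vec![(1, "close_while")],
    ];

    // Assumed inputs: pattern 1 matched, and these are its capture groups.
    let matched = 1;
    let captures = ["notblah something", "something"];

    // Old behavior (sketched): all capture names flattened into one list,
    // so group 1 of the matched pattern is also stored as `socket_code`.
    let flat: Vec<(usize, &str)> = per_pattern.iter().flatten().cloned().collect();
    println!("flattened:   {:?}", assign_fields(&flat, &captures));

    // New behavior (sketched): only the matched pattern's names are used.
    println!("per pattern: {:?}", assign_fields(&per_pattern[matched], &captures));
}
```

The only difference between the two calls is which table of names is consulted; selecting it with the match index keeps capture groups from unrelated patterns out of the event.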
…3164) * fix(regex_parser transform): Correctly assign capture group fields
…ectordotdev#2493) Signed-off-by: Brian Menges <[email protected]>
This allows multiple regular expressions to be specified that will be matched on the input using `regex::RegexSet`. Fixes #2469.
Note that this is a breaking change, as it requires all configs to be rewritten to use `regexes = ["..."]` instead of `regex = "..."`. Alternatively, we could support both fields in the config, although I think this would lead to confusion down the road.
Design decisions
`regexes`, while the actual `RegexSet` is called `regexset` in the transform. The reason for keeping both is that `RegexSet::matches` returns the set of regular expressions that match in the given text.
The set returned contains the index of each regular expression that matches in the given text.
The index is in correspondence with the order of regular expressions given to `RegexSet`'s constructor, which matches the index in the vector.
The reasoning behind this was that types with the same name will probably be
used similarly further down the pipeline, no matter which pattern was matching.
Also, splitting up the coercions for each pattern would likely cause additional duplication
and make the configuration syntax harder to understand for beginners.
Open questions
`clone()`. I was considering storing references to the regular expressions inside the vector instead of the expressions themselves, but I was hoping to avoid lifetimes to keep the code more maintainable. Maybe it's a trade-off worth considering, though, or there is a better way.
As discussed with @Hoverbear and @lukesteensen.