Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancement(regex_parser transform): Add RegexSet support to regex #2493

Merged
merged 1 commit into from
May 13, 2020

Conversation

mre
Copy link
Contributor

@mre mre commented Apr 29, 2020

This allows to specify multiple regular expressions to be defined
that will be matched on the input using regex::RegexSet.
Fixes #2469.

Note that this is a breaking change, as it requires all configs to be rewritten
to use regexes = ["..."] instead of regex = "...". Alternatively, we could
support both fields in the config; although I think this would lead to confusion down
the road.

Design decisions

  • The individual regular expressions are kept in a vector named regexes, while
    the actual RegexSet is called regexset in the transform. The reason for keeping both is that
    RegexSet::matches returns the set of regular expressions that match in the given text.
    The set returned contains the index of each regular expression that matches in the given text.
    The index is in correspondence with the order of regular expressions given to
    RegexSet's constructor, which matches with the index in the vector.
  • Type coercions are applied to all captures with the same name across all patterns.
    The reasoning behind this was that types with the same name will probably be
    used similarly further down the pipeline, no matter which pattern was matching.
    Also, splitting up the coercions for each pattern would likely cause additional duplication
    and make the configuration syntax harder to understand for beginners.

Open questions

  • If possible, we should avoid the two calls to clone(). Was considering to store references
    to the regular expressions inside the vector instead of the expressions themselves,
    but I was hoping to avoid lifetimes to keep the code more maintainable. Maybe it's a trade-off,
    worth considering, though - or there is a better way.

As discussed with @Hoverbear and @lukesteensen.

@mre mre requested a review from lukesteensen as a code owner April 29, 2020 13:13
@mre mre changed the title WIP: Add RegexSet support to regex WIP: enhancement(regex_parser transform): Add RegexSet support to regex Apr 29, 2020
@mre mre requested a review from binarylogic as a code owner April 29, 2020 13:35
@binarylogic
Copy link
Contributor

Very nice! Thanks for your work on this. I'd like to see if we can deprecate the previous regex option without breaking backward compatibility. If you don't want to spend time doing that we can get someone on the team to implement that.

@mre
Copy link
Contributor Author

mre commented Apr 29, 2020

I'd like to see if we can deprecate the previous regex option without breaking backward compatibility.

Good idea! We could print a warning when vector boots, e.g.

⚠️ Usage of `regex` is deprecated and will be removed with version 1.0.
Please upgrade your config to use `regexes` instead like so:
`regexes = ['<pattern_here>']`

The <pattern_here> part would be the actual pattern from the user's config to make upgrading very easy.
Even better, we could provide a vector --fix command (similar to rustfix), that would automatically update the configuration to the latest version.

@binarylogic
Copy link
Contributor

Agree! I say we start with a single-line warn log message and then consider the --fix option with something more general, like #1037. As you can see in #1037, we'd like to version our configuration and use that as a guide for updating. We're still discussing that, but I think that would be the best approach. Curious what you think!

@mre mre force-pushed the regexset branch 3 times, most recently from 0f5ee0f to 3b92e9e Compare April 29, 2020 15:09
@mre mre changed the title WIP: enhancement(regex_parser transform): Add RegexSet support to regex enhancement(regex_parser transform): Add RegexSet support to regex Apr 29, 2020
@mre mre force-pushed the regexset branch 2 times, most recently from 1891428 to 0fbb770 Compare May 4, 2020 11:07
@mre
Copy link
Contributor Author

mre commented May 4, 2020

@binarylogic sounds good!
I've added the warning now and it looks okay to me when I run vector --config vector.toml:

May 04 15:18:39.938  INFO vector: Vector is starting. version="0.9.1" git_version="v0.9.0-105-g57dd046" released="Mon, 04 May 2020 13:14:03 +0000" arch="x86_64"
May 04 15:18:39.944  WARN vector::transforms::regex_parser: Usage of `regex` is deprecated and will be removed in a future version. Please upgrade your config to use `regexes` instead: `regexes = ['^/appdata/nomad/alloc/(?P<alloc_id>\w[^/]*?)/alloc/logs/(?P<container_name>.*?)\.(?P<logtype>\w+).*']`

Let me know if the wording should be changed.

The CI pipeline is currently failing, but it doesn't look like it's related to the changes in this PR.
Before merging this, I'd propose to run a benchmark to check for performance regressions.

warn!(
"Usage of `regex` is deprecated and will be removed in a future version. \
Please upgrade your config to use `regexes` instead: \
`regexes = ['{}']`",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a link to the regex transform documentation here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good idea!

Copy link
Member

@lukesteensen lukesteensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, thank you @mre! I added notes about a couple of things we'll want to clean up before merging, and a couple more general things that I'm curious about.

src/transforms/regex_parser.rs Outdated Show resolved Hide resolved
src/transforms/regex_parser.rs Show resolved Hide resolved
src/transforms/regex_parser.rs Outdated Show resolved Hide resolved
src/transforms/regex_parser.rs Outdated Show resolved Hide resolved
src/transforms/regex_parser.rs Outdated Show resolved Hide resolved
@mre
Copy link
Contributor Author

mre commented May 11, 2020

Found some time to fix the tests. The PR should be good to go now.
@lukesteensen @binarylogic can you have another look please? 😊

There's a caveat in the implementation:
Say your input is 1234 235.42 true and the regexes are

regexes = [
  '(?P<id>\d+)',
  '(?P<id>\d+) (?P<time>[\d.]+) (?P<check>\S+)',
]

then the first pattern will match. That's because it matches the pattern anywhere in the input. It's expected behavior for regular expressions of course, but in the case of multiple patterns, this can be confusing. (At least I was confused by that, hence the failing test.)

The proper solution is to capture the entire line like so:

regexes = [
  '^(?P<id>\d+)$',
  '^(?P<id>\d+) (?P<time>[\d.]+) (?P<check>\S+)$',
]

This works as "intended" and the second pattern matches.
I wonder if we should document this somehow or if it's obvious.
Alternatively, we could add another exhaustive option or so, which would enforce that the full pattern has to be matched. I'd argue that this would make things unnecessarily complex, though.

@Hoverbear
Copy link
Contributor

@mre I think a little warning in the docs is plenty. :) I don't think an extra option is necessary.

Copy link
Member

@lukesteensen lukesteensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New changes look great! Thank you so much for this @mre, we really appreciate it.

If @binarylogic and @Hoverbear agree, I'd like to go ahead with the s/regexes/patterns/ rename, but this seems otherwise just about ready to go.

@Hoverbear
Copy link
Contributor

Fine with me!

…dev#2469)

This allows to specify multiple regular expression patterns
to be defined that will be matched on the input using
`regex::RegexSet`.

Signed-off-by: Matthias Endler <[email protected]>
@mre
Copy link
Contributor Author

mre commented May 12, 2020

Okay renamed regexes to patterns now and signed off the commits.
Also added a link to the docs in case a config is still using the old regex field.
The only missing issue from my side is the failing make generate. It seems to fail on a seemingly unrelated change, but I might be wrong:

check-blog_1                                      | `/` is not writable.
check-blog_1                                      | Bundler will use `/tmp/bundler20200512-1-chzyp41' as your home directory temporarily.
check-blog_1                                      | ---> The resulting hash from the `/.meta/**/*.toml` files failed
check-blog_1                                      |      validation against the following schema:
check-blog_1                                      |      
check-blog_1                                      |          /.meta/schema/meta.json
check-blog_1                                      |      
check-blog_1                                      |      The errors include:
check-blog_1                                      |      
check-blog_1                                      |          * The value at `/sources/vector/options/tls/children/verify_certificate/kubernetes` failed validation for `/definitions/field/additionalProperties`, reason: `schema`

@binarylogic is that something you could help me with?

@Hoverbear
Copy link
Contributor

Hoverbear commented May 13, 2020

@mre It looks like the problem exists on master as well. Please give us a day or two to fix it. :)
image

@lukesteensen
Copy link
Member

I think that's something we can sort on master, so I'm going to go ahead and merge this as-is.

Thanks again @mre! We're super appreciative of the time and effort you've put in here to make Vector better.

@lukesteensen lukesteensen merged commit 0cdc500 into vectordotdev:master May 13, 2020
@mre mre deleted the regexset branch May 19, 2020 07:35
@Hoverbear Hoverbear changed the title enhancement(regex_parser transform): Add RegexSet support to regex enhancement(regex_parser transform)!: Add RegexSet support to regex Jul 13, 2020
@Hoverbear Hoverbear changed the title enhancement(regex_parser transform)!: Add RegexSet support to regex enhancement(regex_parser transform): Add RegexSet support to regex Jul 13, 2020
jszwedko added a commit that referenced this pull request Jul 22, 2020
Previously, was incorrectly mapping the capture indexes of the matched
pattern across the capture groups of all patterns so that, with
something like:

```toml
[sources.in]
type = "stdin"

[transforms.regex]
type = "regex_parser"
inputs = ["in"]
patterns = [
	'^blah \((?P<socket_code>[0-9]+): (?:[^\)]+)\) while (?P<timeout_while>.)',
	"^notblah (?P<close_while>.+)$",
]

[sinks.out]
inputs = ["regex"]
type = "console"
encoding.codec = "json"
```

And a line of:

```
'notblah something
``

Would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of
the capture information for each pattern similar to `capture_logs` and uses the
same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is
this what we want? It was indirectly
discussed in #2493 but a preference
wasn't explicitly stated and it doesn't appear to be documented (in `master`)
for the new `patterns` field. Once we decide what the behavior should be, I can
document it and/or change the implementation if needed. I might have expected
it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the, unrelased,
documentation; just marked as deprecated, but it appears to have been dropped
wholesale in
https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41
. I just wanted to clarify if this is the expected the process for deprecating
parameters.

Fixes: #3096

Signed-off-by: Jesse Szwedko <[email protected]>
jszwedko added a commit that referenced this pull request Jul 24, 2020
Previously, was incorrectly mapping the capture indexes of the matched
pattern across the capture groups of all patterns so that, with
something like:

```toml
[sources.in]
type = "stdin"

[transforms.regex]
type = "regex_parser"
inputs = ["in"]
patterns = [
	'^blah \((?P<socket_code>[0-9]+): (?:[^\)]+)\) while (?P<timeout_while>.)',
	"^notblah (?P<close_while>.+)$",
]

[sinks.out]
inputs = ["regex"]
type = "console"
encoding.codec = "json"
```

And a line of:

```
'notblah something
``

Would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of
the capture information for each pattern similar to `capture_logs` and uses the
same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is
this what we want? It was indirectly
discussed in #2493 but a preference
wasn't explicitly stated and it doesn't appear to be documented (in `master`)
for the new `patterns` field. Once we decide what the behavior should be, I can
document it and/or change the implementation if needed. I might have expected
it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the, unrelased,
documentation; just marked as deprecated, but it appears to have been dropped
wholesale in
https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41
. I just wanted to clarify if this is the expected the process for deprecating
parameters.

Fixes: #3096

Signed-off-by: Jesse Szwedko <[email protected]>
jszwedko added a commit that referenced this pull request Jul 24, 2020
Previously, was incorrectly mapping the capture indexes of the matched
pattern across the capture groups of all patterns so that, with
something like:

```toml
[sources.in]
type = "stdin"

[transforms.regex]
type = "regex_parser"
inputs = ["in"]
patterns = [
	'^blah \((?P<socket_code>[0-9]+): (?:[^\)]+)\) while (?P<timeout_while>.)',
	"^notblah (?P<close_while>.+)$",
]

[sinks.out]
inputs = ["regex"]
type = "console"
encoding.codec = "json"
```

And a line of:

```
'notblah something
``

Would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of
the capture information for each pattern similar to `capture_logs` and uses the
same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is
this what we want? It was indirectly
discussed in #2493 but a preference
wasn't explicitly stated and it doesn't appear to be documented (in `master`)
for the new `patterns` field. Once we decide what the behavior should be, I can
document it and/or change the implementation if needed. I might have expected
it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the, unrelased,
documentation; just marked as deprecated, but it appears to have been dropped
wholesale in
https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41
. I just wanted to clarify if this is the expected the process for deprecating
parameters.

Fixes: #3096

Signed-off-by: Jesse Szwedko <[email protected]>
jszwedko added a commit that referenced this pull request Jul 28, 2020
…3164)

* fix(regex_parser transform): Correctly assign capture group fields

Previously, was incorrectly mapping the capture indexes of the matched
pattern across the capture groups of all patterns so that, with
something like:

```toml
[sources.in]
type = "stdin"

[transforms.regex]
type = "regex_parser"
inputs = ["in"]
patterns = [
	'^blah \((?P<socket_code>[0-9]+): (?:[^\)]+)\) while (?P<timeout_while>.)',
	"^notblah (?P<close_while>.+)$",
]

[sinks.out]
inputs = ["regex"]
type = "console"
encoding.codec = "json"
```

And a line of:

```
'notblah something
``

Would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of
the capture information for each pattern similar to `capture_logs` and uses the
same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is
this what we want? It was indirectly
discussed in #2493 but a preference
wasn't explicitly stated and it doesn't appear to be documented (in `master`)
for the new `patterns` field. Once we decide what the behavior should be, I can
document it and/or change the implementation if needed. I might have expected
it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the, unrelased,
documentation; just marked as deprecated, but it appears to have been dropped
wholesale in
https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41
. I just wanted to clarify if this is the expected the process for deprecating
parameters.

Fixes: #3096
fanatid pushed a commit that referenced this pull request Jul 28, 2020
…3164)

* fix(regex_parser transform): Correctly assign capture group fields

Previously, was incorrectly mapping the capture indexes of the matched
pattern across the capture groups of all patterns so that, with
something like:

```toml
[sources.in]
type = "stdin"

[transforms.regex]
type = "regex_parser"
inputs = ["in"]
patterns = [
	'^blah \((?P<socket_code>[0-9]+): (?:[^\)]+)\) while (?P<timeout_while>.)',
	"^notblah (?P<close_while>.+)$",
]

[sinks.out]
inputs = ["regex"]
type = "console"
encoding.codec = "json"
```

And a line of:

```
'notblah something
``

Would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of
the capture information for each pattern similar to `capture_logs` and uses the
same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is
this what we want? It was indirectly
discussed in #2493 but a preference
wasn't explicitly stated and it doesn't appear to be documented (in `master`)
for the new `patterns` field. Once we decide what the behavior should be, I can
document it and/or change the implementation if needed. I might have expected
it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the, unrelased,
documentation; just marked as deprecated, but it appears to have been dropped
wholesale in
https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41
. I just wanted to clarify if this is the expected the process for deprecating
parameters.

Fixes: #3096
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request Dec 9, 2020
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request Dec 9, 2020
…ectordotdev#3164)

* fix(regex_parser transform): Correctly assign capture group fields

Previously, was incorrectly mapping the capture indexes of the matched
pattern across the capture groups of all patterns so that, with
something like:

```toml
[sources.in]
type = "stdin"

[transforms.regex]
type = "regex_parser"
inputs = ["in"]
patterns = [
	'^blah \((?P<socket_code>[0-9]+): (?:[^\)]+)\) while (?P<timeout_while>.)',
	"^notblah (?P<close_while>.+)$",
]

[sinks.out]
inputs = ["regex"]
type = "console"
encoding.codec = "json"
```

And a line of:

```
'notblah something
``

Would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of
the capture information for each pattern similar to `capture_logs` and uses the
same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is
this what we want? It was indirectly
discussed in vectordotdev#2493 but a preference
wasn't explicitly stated and it doesn't appear to be documented (in `master`)
for the new `patterns` field. Once we decide what the behavior should be, I can
document it and/or change the implementation if needed. I might have expected
it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the, unrelased,
documentation; just marked as deprecated, but it appears to have been dropped
wholesale in
https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41
. I just wanted to clarify if this is the expected the process for deprecating
parameters.

Fixes: vectordotdev#3096
Signed-off-by: Brian Menges <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add RegexSet support to regex
4 participants