Skip to content

Commit

Permalink
fix(regex_parser transform): Correctly assign capture group fields
Browse files Browse the repository at this point in the history
Previously, was incorrectly mapping the capture indexes of the matched
pattern across the capture groups of all patterns so that, with
something like:

```toml
[sources.in]
type = "stdin"

[transforms.regex]
type = "regex_parser"
inputs = ["in"]
patterns = [
	'^blah \((?P<socket_code>[0-9]+): (?:[^\)]+)\) while (?P<timeout_while>.)',
	"^notblah (?P<close_while>.+)$",
]

[sinks.out]
inputs = ["regex"]
type = "console"
encoding.codec = "json"
```

And a line of:

```
'notblah something
``

Would end up setting both the `socket_code` and `close_while` fields:

```json
{
  "close_while": "something",
  "host": "jesse-thinkpad",
  "socket_code": "something",
  "source_type": "stdin",
  "timestamp": "2020-07-22T19:30:12.647060371Z"
}
```

This change simply updates `RegexParser.capture_names` to also be a `Vec` of
the capture information for each pattern similar to `capture_logs` and uses the
same match index later to access it.

A couple of questions that came up while I was looking at this:

It looks like, if multiple patterns match, it simply chooses the first one. Is
this what we want? It was indirectly
discussed in #2493 but a preference
wasn't explicitly stated and it doesn't appear to be documented (in `master`)
for the new `patterns` field. Once we decide what the behavior should be, I can
document it and/or change the implementation if needed. I might have expected
it to apply each matching pattern.

I expected to still see the deprecated `regex` parameter in the, unrelased,
documentation; just marked as deprecated, but it appears to have been dropped
wholesale in
https://github.com/timberio/vector/pull/2493/files#diff-4d642800436bfa506ff51f7b75556d9dL41
. I just wanted to clarify if this is the expected the process for deprecating
parameters.

Fixes: #3096

Signed-off-by: Jesse Szwedko <[email protected]>
  • Loading branch information
jszwedko committed Jul 24, 2020
1 parent fe69741 commit 2770bb1
Showing 1 changed file with 38 additions and 10 deletions.
48 changes: 38 additions & 10 deletions src/transforms/regex_parser.rs
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@ pub struct RegexParser {
drop_failed: bool,
target_field: Option<Atom>,
overwrite_target: bool,
capture_names: Vec<(usize, Atom, Conversion)>,
capture_names: Vec<Vec<(usize, Atom, Conversion)>>,
capture_locs: Vec<CaptureLocations>,
}

Expand Down Expand Up @@ -152,7 +152,7 @@ impl RegexParser {

// Calculate the location (index into the capture locations) of
// each named capture, and the required type coercion.
let capture_names: Vec<(usize, Atom, Conversion)> = patterns
let capture_names: Vec<Vec<(usize, Atom, Conversion)>> = patterns
.iter()
.map(|regex| {
let capture_names: Vec<(usize, Atom, Conversion)> = regex
Expand All @@ -168,11 +168,10 @@ impl RegexParser {
.collect();
capture_names
})
.flatten()
.collect();

// Pre-calculate if the source field name should be dropped.
drop_field = drop_field && !capture_names.iter().any(|(_, f, _)| *f == field);
drop_field = drop_field && !capture_names.iter().flatten().any(|(_, f, _)| *f == field);

Self {
regexset,
Expand Down Expand Up @@ -232,7 +231,7 @@ impl Transform for RegexParser {
}
}

for (idx, name, conversion) in &self.capture_names {
for (idx, name, conversion) in &self.capture_names[id] {
if let Some((start, end)) = capture_locs.get(*idx) {
let capture: Value = value[start..end].into();
match conversion.convert(capture) {
Expand Down Expand Up @@ -474,24 +473,53 @@ mod tests {
}

#[test]
fn supports_multiple_patterns() {
fn chooses_first_of_multiple_matching_patterns() {
let log = do_transform(
"1234 235.42 true",
r#"[
'^(?P<id>\d+)$',
'^(?P<id>\d+) (?P<time>[\d.]+) (?P<check>\S+)$',
'^(?P<id1>\d+)',
'^(?P<id2>\d+) (?P<time>[\d.]+) (?P<check>\S+)$',
]"#,
r#"
drop_field = false
[types]
id1 = "int"
id2 = "int"
time = "float"
check = "boolean"
"#,
)
.unwrap();

assert_eq!(log[&"id1".into()], Value::Integer(1234));
assert_eq!(log.get(&"id2".into()), None);
assert_eq!(log.get(&"time".into()), None);
assert_eq!(log.get(&"check".into()), None);
assert!(log.get(&"message".into()).is_some());
}

#[test]
// https://github.com/timberio/vector/issues/3096
fn correctly_maps_capture_groups_for_if_matching_pattern_is_not_first() {
let log = do_transform(
"match1234 235.42 true",
r#"[
'^nomatch(?P<id1>\d+)$',
'^match(?P<id2>\d+) (?P<time>[\d.]+) (?P<check>\S+)$',
]"#,
r#"
drop_field = false
[types]
id = "int"
id1 = "int"
id2 = "int"
time = "float"
check = "boolean"
"#,
)
.unwrap();

assert_eq!(log[&"id".into()], Value::Integer(1234));
assert_eq!(log.get(&"id1".into()), None);
assert_eq!(log[&"id2".into()], Value::Integer(1234));
assert_eq!(log[&"time".into()], Value::Float(235.42));
assert_eq!(log[&"check".into()], Value::Boolean(true));
assert!(log.get(&"message".into()).is_some());
Expand Down

0 comments on commit 2770bb1

Please sign in to comment.