-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(codecs): new syslog codec #17668
Conversation
✅ Deploy Preview for vrl-playground ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
✅ Deploy Preview for vector-project ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
38dfb93
to
c2c811c
Compare
c2c811c
to
ef11555
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by review if helpful.
It's very dense, so main takeaways are:
- Refactor (facility + severity logic): Consider making these "string to number"
match
blocks DRY?- You could leverage an
enum
for these two types instead? - With the
strum
crate this would greatly simplify thedeserialize_facility()
anddeserialize_severity()
methods, while also making theget_num_facility()
+get_num_severity()
methods redundant? - Vector already has to some extent implemented equivalent enums, although I think my suggested enum with
strum
is more appropriate here?
- You could leverage an
- Refactor (simplify methods with less nesting): I've additionally provided suggestions to simply
encode()
,get_field_or_config()
+get_field()
andget_timestamp()
.encode()
could be improved, but the approach is better for constructing the string vspush()
/push_str()
(especially when no capacity was set in advance).PRI
value construction might be better hidden away as a method on yourSyslogRFC
enum?
- Questions: A few queries regarding
SyslogSerializerConfig
struct:- Should it really inline the syslog data fields (specific to composing a syslog message) with the config settings of the serializer? I might have misunderstood the intent, as I am not familiar with Vector codebase.
- Perhaps replace the
#[serde(default)]
attribute annotations with preference to a single annotation on the whole struct + additional#[derive(Default)]
? - The three struct fields you customized defaults for could be better served as
Option<String>
?- The default methods could still apply when
None
, which (with the exception of assigning an actual fallback withapp_name
) would apply theNIL_VALUE
fallback duringencode()
via yourget_field_or_config()
method? - Additionally these RFC 5424 fields roughly replace RFC 3164
tag
field (which you could construct from these fields). Does it make sense to keeptag
in this struct? How will it be used? (the doc comments don't highlight that it is only valid with RFC 3164) NIL_VALUE
variable might benefit with some added context for maintainers?
- The default methods could still apply when
fn encode(&mut self, event: Event, buffer: &mut BytesMut) -> Result<(), Self::Error> { | ||
match event { | ||
Event::Log(log) => { | ||
let mut buf = String::from("<"); | ||
let pri = get_num_facility(&self.config.facility, &log) * 8 + get_num_severity(&self.config.severity, &log); | ||
buf.push_str(&pri.to_string()); | ||
buf.push_str(">"); | ||
match self.config.rfc { | ||
SyslogRFC::Rfc3164 => { | ||
let timestamp = get_timestamp(&log); | ||
let formatted_timestamp = format!(" {} ", timestamp.format("%b %e %H:%M:%S")); | ||
buf.push_str(&formatted_timestamp); | ||
buf.push_str(&get_field("hostname", &log)); | ||
buf.push(' '); | ||
buf.push_str(&get_field_or_config(&self.config.tag, &log)); | ||
buf.push_str(": "); | ||
if self.config.add_log_source { | ||
add_log_source(&log, &mut buf); | ||
} | ||
}, | ||
SyslogRFC::Rfc5424 => { | ||
buf.push_str("1 "); | ||
let timestamp = get_timestamp(&log); | ||
buf.push_str(×tamp.to_rfc3339_opts(SecondsFormat::Millis, true)); | ||
buf.push(' '); | ||
buf.push_str(&get_field("hostname", &log)); | ||
buf.push(' '); | ||
buf.push_str(&get_field_or_config(&&self.config.app_name, &log)); | ||
buf.push(' '); | ||
buf.push_str(&get_field_or_config(&&self.config.proc_id, &log)); | ||
buf.push(' '); | ||
buf.push_str(&get_field_or_config(&&self.config.msg_id, &log)); | ||
buf.push_str(" - "); // no structured data | ||
if self.config.add_log_source { | ||
add_log_source(&log, &mut buf); | ||
} | ||
} | ||
} | ||
let mut payload = if self.config.payload_key.is_empty() { | ||
serde_json::to_vec(&log).unwrap_or_default() | ||
} else { | ||
get_field(&&self.config.payload_key, &log).as_bytes().to_vec() | ||
}; | ||
let mut vec = buf.as_bytes().to_vec(); | ||
vec.append(&mut payload); | ||
buffer.put_slice(&vec); | ||
}, | ||
_ => {} | ||
} | ||
Ok(()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn encode(&mut self, event: Event, buffer: &mut BytesMut) -> Result<(), Self::Error> { | |
match event { | |
Event::Log(log) => { | |
let mut buf = String::from("<"); | |
let pri = get_num_facility(&self.config.facility, &log) * 8 + get_num_severity(&self.config.severity, &log); | |
buf.push_str(&pri.to_string()); | |
buf.push_str(">"); | |
match self.config.rfc { | |
SyslogRFC::Rfc3164 => { | |
let timestamp = get_timestamp(&log); | |
let formatted_timestamp = format!(" {} ", timestamp.format("%b %e %H:%M:%S")); | |
buf.push_str(&formatted_timestamp); | |
buf.push_str(&get_field("hostname", &log)); | |
buf.push(' '); | |
buf.push_str(&get_field_or_config(&self.config.tag, &log)); | |
buf.push_str(": "); | |
if self.config.add_log_source { | |
add_log_source(&log, &mut buf); | |
} | |
}, | |
SyslogRFC::Rfc5424 => { | |
buf.push_str("1 "); | |
let timestamp = get_timestamp(&log); | |
buf.push_str(×tamp.to_rfc3339_opts(SecondsFormat::Millis, true)); | |
buf.push(' '); | |
buf.push_str(&get_field("hostname", &log)); | |
buf.push(' '); | |
buf.push_str(&get_field_or_config(&&self.config.app_name, &log)); | |
buf.push(' '); | |
buf.push_str(&get_field_or_config(&&self.config.proc_id, &log)); | |
buf.push(' '); | |
buf.push_str(&get_field_or_config(&&self.config.msg_id, &log)); | |
buf.push_str(" - "); // no structured data | |
if self.config.add_log_source { | |
add_log_source(&log, &mut buf); | |
} | |
} | |
} | |
let mut payload = if self.config.payload_key.is_empty() { | |
serde_json::to_vec(&log).unwrap_or_default() | |
} else { | |
get_field(&&self.config.payload_key, &log).as_bytes().to_vec() | |
}; | |
let mut vec = buf.as_bytes().to_vec(); | |
vec.append(&mut payload); | |
buffer.put_slice(&vec); | |
}, | |
_ => {} | |
} | |
Ok(()) | |
} | |
fn encode(&mut self, event: Event, buffer: &mut BytesMut) -> Result<(), Self::Error> { | |
if let Event::Log(log) = event { | |
let pri_num = get_num_facility(&self.config.facility, &log) * 8 + get_num_severity(&self.config.severity, &log); | |
let pri = ["<", &pri_num.to_string(), ">"].concat(); | |
let timestamp = get_timestamp(&log); | |
let hostname = get_field("hostname", &log); | |
let mut syslog_packet = match self.config.rfc { | |
SyslogRFC::Rfc3164 => { | |
let header = [ | |
×tamp.format("%b %e %H:%M:%S"), | |
&hostname, | |
].join(" "); | |
let msg = [ | |
&get_field_or_config(&self.config.tag, &log), // `TAG` field | |
":", // Start of `CONTENT` field | |
// `payload` later appends here the remaining data for the `CONTENT` field | |
].concat(); | |
[pri, header, " ", msg].concat() | |
}, | |
SyslogRFC::Rfc5424 => { | |
// Equivalent to `TAG` field from RFC 3164: | |
let tag = [ | |
&get_field_or_config(&self.config.app_name, &log), | |
&get_field_or_config(&self.config.proc_id, &log), | |
&get_field_or_config(&self.config.msg_id, &log), | |
].join(" "); | |
let header = [ | |
"1", // `VERSION` field | |
×tamp.to_rfc3339_opts(SecondsFormat::Millis, true), | |
&hostname, | |
// RFC 5424 moves equivalent data of the RFC 3164 `TAG` field | |
// from `MSG` part to `HEADER` part: | |
&tag, | |
// "structured data" support not implemented: | |
NIL_VALUE, | |
].join(" "); | |
[pri, header].concat() | |
} | |
}; | |
// Add final delimiter before `payload` is appended: | |
syslog_packet.push(' '); | |
// Insert third-party message content prior to payload?: | |
if self.config.add_log_source { | |
add_log_source(&log, &mut syslog_packet); | |
} | |
let mut payload = if self.config.payload_key.is_empty() { | |
serde_json::to_vec(&log).unwrap_or_default() | |
} else { | |
get_field(&self.config.payload_key, &log).as_bytes().to_vec() | |
}; | |
let mut vec = syslog_packet.as_bytes().to_vec(); | |
vec.append(&mut payload); | |
buffer.put_slice(&vec); | |
} | |
Ok(()) | |
} |
Change overview
- Matching on
event
is not necessary, useif let
on expected enum variant instead. Removes redundant logic and indentation. - RFC 3164 begins the
TIMESTAMP
header field directly after thePRI
part/field, should not have a delimiter between values (reference). - Common fields adjusted to be referenced as variables prior to RFC match blocks.
- Use
[ ... ].join(" ")
/[ ... ].concat()
methods to construct the string(s). Easier to grok by minimizing the delimiter +.push_str()
/push()
noise, additionally more performant AFAIK as allocated capacity forString
value should be more accurate (could be improved still due to updates aftermatch
). - For clarity, at least during review process for those that are less familiar with the structure (and perhaps benefit of future maintenance) I've better communicated the separate packet parts
pri
,header
,msg
and added some comments for review context (one is a question foradd_log_source
, as that seems dependent on deployment context, rather than generic syslog support?).
I am not familiar with Vector internals, but the logic interacting with the log
input seems like there might be a better approach? (especially with the payload
handling converting to a byte vec to append the payload?) I could be mistaken 😅
Additional context
While I think the suggestion above is a step in the right direction, some of the logic here may be better extracted to one of the existing data structures (config / RFC) as methods to generate the pri
+ header
+ msg
parts of the syslog packet value?
- At least that logic seems less useful to
encode()
for grokking a breakdown of the encode process. - RFC 5424 fields could represent a conversion to RFC 3164 format, rather than interleaving the fields into the same config struct?
RFC 5424 provides a comparison to BSD syslog (RFC 3164) (syslog-ng
docs provide a more comprehensive overview of the two RFC formats that is nicer to digest):
Collapsed for brevity (click to expand)
- There is advice for interop with
TIMESTAMP
field (user local timezone + drop timezone and year information), and the inverse if ingesting RFC 3164 syslog to relay/collect. - RFC 3164 documents relay behaviours that RFC 5424 doesn't focus on (potentially relevant for syslog source + sink?)
HOSTNAME
is compatible.PRI
value syntax and semantics are retained for compatibility.- NOTE: While the
PRI
part is defined formally as a field of theHEADER
part (unlike in RFC 3164), the header now starts with theVERSION
field which like the first field of theHEADER
part from RFC 3164 (TIMESTAMP
) begins immediately afterPRI
without a space delimiter.TIMESTAMP
follows afterVERSION
+ space delimiter.
- NOTE: While the
- The
MSG
part was previously split intoTAG
andCONTENT
fields.- Now the
MSG
part only represents theCONTENT
field while theTAG
field has been relocated into theHEADER
part and split into 3 separate fields (APP-NAME
,PROCID
,MSGID
). APP-NAME
would be equivalent toTAG
(usually a process name), whilePROCID
would have been the optional[value]
(usually PID) that would technically have been part of theCONTENT
field (along with:
that'd follow before the actual message payload).PROCID
is no longer wrapped with[ ]
, nor is:
used.MSGID
provides additional context, but all three of these fields are optional and can be opt-out via-
(akaNILVALUE
).
- Now the
- The new Structured Data field (_opt-out via
-
/NILVALUE_) is said to be treated as part of
CONTENT` if formatting to RFC 3164. Some resources note that this field isn't common to encounter in actual usage? - RFC 3164 implementations lack consistency.
- Wikipedia notes RFC 3164 (2001) was an effort to document the status quo at the time, while RFC 5424 (2009) formally standardized the format.
- Feb 2021 blogpost conveys that since RFC 5424 was published, it is now the more commonly encountered format (however some popular software like HAProxy is known to default to RFC3164 by default),
Rsyslog
follows RFC 5424.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference RFC 5424 syslog message is constructed elsewhere with the more convenient (but less performant) format!
macro:
Lines 1372 to 1388 in c7ae2a6
impl fmt::Display for SyslogMessageRfc5424 { | |
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | |
write!( | |
f, | |
"<{}>{} {} {} {} {} {} {} {}", | |
encode_priority(self.severity, self.facility), | |
self.version, | |
self.timestamp, | |
self.host, | |
self.appname, | |
self.procid, | |
self.msgid, | |
format_structured_data_rfc5424(&self.structured_data), | |
self.message | |
) | |
} | |
} |
Which shows one of the benefits of extracting that logic out of this encode()
method, performance/efficiency wise you may want to avoid unnecessary allocations (creating the String
value via .join()
/.concat()
should have similar benefits to String::with_capacity()
). You may want to wait for the Vector devs to weigh in on this.
References:
/// Tag | ||
#[serde(default)] | ||
tag: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RFC 3164 only. Can technically construct this during encode method from the RFC 5424 equivalent fields.
Is it useful to keep in this struct?
/// Payload key | ||
#[serde(default)] | ||
payload_key: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the doc comment mention the syslog equivalent representation that this lookup key will provide?
- RFC 3164 it is the
CONTENT
field ofMSG
part (technically this would be preceded by[value]:
/:
after theTAG
field, as the starting portion of theCONTENT
field before the actual message content). - RFC 5424 it is the
MSG
part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this option. This should always be the field with the meaning message
.
/// Config used to build a `SyslogSerializer`. | ||
#[configurable_component] | ||
#[derive(Debug, Clone, Default)] | ||
pub struct SyslogSerializerConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You apply the tag to each individual field, would it be better to just apply it once here and provide an impl Default
or derive(Default)
instead?
/// Config used to build a `SyslogSerializer`. | |
#[configurable_component] | |
#[derive(Debug, Clone, Default)] | |
pub struct SyslogSerializerConfig { | |
/// Config used to build a `SyslogSerializer`. | |
#[configurable_component] | |
#[derive(Debug, Default, Clone, Default)] | |
#[serde(default)] | |
pub struct SyslogSerializerConfig { |
/// RFC 5424 | ||
Rfc5424 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be useful to highlight that RFC 5424 obsoletes RFC 3164?:
This document obsoletes RFC 3164, which is an Informational document
describing some implementations found in the field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This enum could also technically align with the equivalent Protocol
enum (used in decoding/format/syslog.rs
) from the syslog_loose
crate which is already added to Vector as a dependency?
There is no default implemented by that crates equivalent enum though. As Vector is only using it for parsing syslog messages, perhaps there isn't much value in re-use here 🤷♂️
fn get_num_facility(config_facility: &Facility, log: &LogEvent) -> u8 { | ||
match config_facility { | ||
Facility::Fixed(num) => return *num, | ||
Facility::Field(field_name) => { | ||
if let Some(field_value) = log.get(field_name.as_str()) { | ||
let field_value_string = String::from_utf8(field_value.coerce_to_bytes().to_vec()).unwrap_or_default(); | ||
let num_value = field_value_string.parse::<u8>(); | ||
match num_value { | ||
Ok(num) => { | ||
if num > 23 { | ||
return 1 // USER | ||
} else { | ||
return num | ||
} | ||
} | ||
Err(_) => { | ||
let num = match field_value_string.to_uppercase().as_str() { | ||
"KERN" => 0, | ||
"USER" => 1, | ||
"MAIL" => 2, | ||
"DAEMON" => 3, | ||
"AUTH" => 4, | ||
"SYSLOG" => 5, | ||
"LPR" => 6, | ||
"NEWS" => 7, | ||
"UUCP" => 8, | ||
"CRON" => 9, | ||
"AUTHPRIV" => 10, | ||
"FTP" => 11, | ||
"NTP" => 12, | ||
"SECURITY" => 13, | ||
"CONSOLE" => 14, | ||
"SOLARIS-CRON" => 15, | ||
"LOCAL0" => 16, | ||
"LOCAL1" => 17, | ||
"LOCAL2" => 18, | ||
"LOCAL3" => 19, | ||
"LOCAL4" => 20, | ||
"LOCAL5" => 21, | ||
"LOCAL6" => 22, | ||
"LOCAL7" => 23, | ||
_ => 24, | ||
}; | ||
if num > 23 { | ||
return 1 // USER | ||
} else { | ||
return num | ||
} | ||
} | ||
} | ||
} else { | ||
return 1 // USER | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the strum
/ enum
approach for Facility
and Severity
, these get_num_facility()
/ get_num_severity()
methods may be redundant and ca be dropped?
These two methods are only used within encode()
for composing PRIVAL
, which is as simple as either:
(facility as u8 << 3) | severity as u8
(facility as u8 * 8) + severity as u8
Since Facility
would be set already by deserialize_facility()
, I'm not sure how important all this logic here is?
My main question is with handling of log field with log.get()
:
if let Some(field_value) = log.get(field_name.as_str()) {
let field_value_string = String::from_utf8(field_value.coerce_to_bytes().to_vec()).unwrap_or_default();
// ...
You have conditions that return a fallback (Facility::User
/ 1
)
- One condition implies the value could exceed the bound of 23 (either as number parsed from string, or field name not matching a valid facility name).
- If
log.get()
returnsNONE
to default toFacility::User
.
Since it appears ok for the number variant to return a value directly (and thus has been set via deserialize_facility()
?), I'm assuming that these two lines extract the same string value that deserialize_facility()
would have accomplished (deserializing into value
)?
- I may have misunderstood your usage, and this
Field
enum variant may not map to the equivalent facility?- You store it to lookup the value via
log.get()
above here? (that would be either a number or syslog facility name?) - Does it provide some other functionality? (dynamic?)
- Are my suggestions thus not compatible?
- You store it to lookup the value via
This method otherwise doesn't appear needed, since it'd effectively become:
fn get_num_facility(config_facility: &Facility, log: &LogEvent) -> u8 { | |
match config_facility { | |
Facility::Fixed(num) => return *num, | |
Facility::Field(field_name) => { | |
if let Some(field_value) = log.get(field_name.as_str()) { | |
let field_value_string = String::from_utf8(field_value.coerce_to_bytes().to_vec()).unwrap_or_default(); | |
let num_value = field_value_string.parse::<u8>(); | |
match num_value { | |
Ok(num) => { | |
if num > 23 { | |
return 1 // USER | |
} else { | |
return num | |
} | |
} | |
Err(_) => { | |
let num = match field_value_string.to_uppercase().as_str() { | |
"KERN" => 0, | |
"USER" => 1, | |
"MAIL" => 2, | |
"DAEMON" => 3, | |
"AUTH" => 4, | |
"SYSLOG" => 5, | |
"LPR" => 6, | |
"NEWS" => 7, | |
"UUCP" => 8, | |
"CRON" => 9, | |
"AUTHPRIV" => 10, | |
"FTP" => 11, | |
"NTP" => 12, | |
"SECURITY" => 13, | |
"CONSOLE" => 14, | |
"SOLARIS-CRON" => 15, | |
"LOCAL0" => 16, | |
"LOCAL1" => 17, | |
"LOCAL2" => 18, | |
"LOCAL3" => 19, | |
"LOCAL4" => 20, | |
"LOCAL5" => 21, | |
"LOCAL6" => 22, | |
"LOCAL7" => 23, | |
_ => 24, | |
}; | |
if num > 23 { | |
return 1 // USER | |
} else { | |
return num | |
} | |
} | |
} | |
} else { | |
return 1 // USER | |
} | |
} | |
} | |
} | |
fn get_num_facility(config_facility: &Facility, log: &LogEvent) -> u8 { | |
config_facility as u8 | |
} |
fn get_num_severity(config_severity: &Severity, log: &LogEvent) -> u8 { | ||
match config_severity { | ||
Severity::Fixed(num) => return *num, | ||
Severity::Field(field_name) => { | ||
if let Some(field_value) = log.get(field_name.as_str()) { | ||
let field_value_string = String::from_utf8(field_value.coerce_to_bytes().to_vec()).unwrap_or_default(); | ||
let num_value = field_value_string.parse::<u8>(); | ||
match num_value { | ||
Ok(num) => { | ||
if num > 7 { | ||
return 6 // INFORMATIONAL | ||
} else { | ||
return num | ||
} | ||
} | ||
Err(_) => { | ||
let num = match field_value_string.to_uppercase().as_str() { | ||
"EMERGENCY" => 0, | ||
"ALERT" => 1, | ||
"CRITICAL" => 2, | ||
"ERROR" => 3, | ||
"WARNING" => 4, | ||
"NOTICE" => 5, | ||
"INFORMATIONAL" => 6, | ||
"DEBUG" => 7, | ||
_ => 8, | ||
}; | ||
if num > 7 { | ||
return 6 // INFORMATIONAL | ||
} else { | ||
return num | ||
} | ||
} | ||
} | ||
} else { | ||
return 6 // INFORMATIONAL | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same feedback as raised for get_num_facility()
.
If the proposed enum is sufficient, this can be discarded to get the desired u8
value for the enum variant.
match log.get("@timestamp") { | ||
Some(value) => { | ||
if let Value::Timestamp(timestamp) = value { | ||
DateTime::<Local>::from(*timestamp) | ||
} else { | ||
Local::now() | ||
} | ||
}, | ||
_ => Local::now() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this is a bit simpler with map_or_else()
:
match log.get("@timestamp") { | |
Some(value) => { | |
if let Value::Timestamp(timestamp) = value { | |
DateTime::<Local>::from(*timestamp) | |
} else { | |
Local::now() | |
} | |
}, | |
_ => Local::now() | |
} | |
log.get("@timestamp").map_or_else( | |
|| Local::now(), | |
|timestamp| DateTime::<Local>::from(*timestamp) | |
) |
The two function args to map_or_else()
:
- Handles the fallback (else), so you only need to cover that once.
- Operates on the
Some()
value (map), so we can handle your conversion here without theif let
andmatch
nesting 👍
For reference to reviewers:
- RFC 3164
TIMESTAMP
is expected to be local timezone and without year, in the format ofMmm dd hh:mm:ss
.- Handled via chrono
.format("%b %e %H:%M:%S")
. - Note that advice from RFC 5424 (to format
TIMESTAMP
for conversion to RFC 3164) appears to be pre-processing advice?
- Handled via chrono
- RFC 5424
TIMESTAMP
expects RFC 3339 with some additional restraints (constrained subset of an ISO8601 timestamp).- Handled via chrono
.to_rfc3339_opts(SecondsFormat::Millis, true)
in the PR methodencode()
.
- Handled via chrono
fn get_field_or_config(config_name: &String, log: &LogEvent) -> String { | ||
if let Some(field_name) = config_name.strip_prefix("$.message.") { | ||
return get_field(field_name, log) | ||
} else { | ||
return config_name.clone() | ||
} | ||
} | ||
|
||
fn get_field(field_name: &str, log: &LogEvent) -> String { | ||
if let Some(field_value) = log.get(field_name) { | ||
return String::from_utf8(field_value.coerce_to_bytes().to_vec()).unwrap_or_default(); | ||
} else { | ||
return NIL_VALUE.to_string() | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these two methods you shouldn't need an explicit return
keyword AFAIK? (omit the ;
for unwrap_or_default()
)
I am not familiar with Vector codebase, so if using return
keyword explicitly is their preferred convention, ignore me :)
You could alternatively adopt the unwrap_or_else()
, since the logic here is quite simple (providing a fallback value if None
).
/// Add log source | ||
#[serde(default)] | ||
add_log_source: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few fields in this struct unrelated to syslog packet/message format, would it be better to extract out the syslog format fields into their own sub struct? Or is that an issue with how Vector (or it's config UX) works?
You can reference existing syslog support in Vector, a similar struct is already defined (may want to align the struct field names to match?):
Lines 1326 to 1340 in c7ae2a6
#[derive(Deserialize, PartialEq, Clone, Debug)] | |
struct SyslogMessageRfc5424 { | |
msgid: String, | |
severity: Severity, | |
facility: Facility, | |
version: u8, | |
timestamp: String, | |
host: String, | |
source_type: String, | |
appname: String, | |
procid: usize, | |
message: String, | |
#[serde(flatten)] | |
structured_data: StructuredData, | |
} |
@polarathene Thanks for a detailed review, I've learned a few things. There's some history of this PR here: #6863. This code was supposed to mimic the fluentd syslog plugin, with some extensions. The vector configuration is not directly exposed to the user but auto-generated by a k8s operator. (thus RFC-specific fields in configuration, like Tag). The idea behind
(if we take severity as an example) is to be able to set the severity value in configuration as a number, within the correct limits, or a string, which in turn can be the name of a severity level, like "ALERT" or a log event field reference, if it is of the |
@syedriko Any progress on this? |
Sorry, I didn't get around to putting this response through earlier 😓 This was a good chunk of it below, with a little bit more and some pseudo-code that I've excluded. I have refactored this PR with a local build based on my original feedback. It'd need a little bit more work but if you'd like I could either open a new PR with that or do a 2nd review with the change suggestions in bulk? I have kept the third-party additions, so it retains that functionality, although I don't think it should have either the
You're welcome! Apologies for the verbosity 😅
Sounds great! ❤️ I'm lacking time to build it locally and configure for a project that I'm interested in bringing Vector into. Hopefully I can allocate some time this month or next to give it a try and ensure the syslog codec support integrates well 👍 Additional feedback
That functionality is retained (and simpler?) with the enum approach suggested: // Must be within the ordinal bounds, otherwise `None` => `Err(...)`:
let s = Severity::from_repr( num ).ok_or_else(|| /* Err */ )
// String must match a valid enum variant, otherwise `Err(...)`:
let s = Severity::from_str( str ).map_err( /* Err */ )
// Regardless of how you created `s`, easily get the desired output:
println!("{}", s as u8);
// strum implements Display (so you could also use `.to_string()`):
println!("{s}"); Making the custom deserialize methods quite a bit simpler to grok, and the // NOTE: This would be a bit nicer out of `encode()`, instead via a struct method?
let prival = (facility as u8 * 8) + severity as u8;
let pri = ["<", &prival.to_string(), ">"].concat(); I'm unable to provide much feedback for the latter two points.
|
Hi. Thanks for doing this, this could be very useful to incorporate. Is there a compelling reason to support RFC3164? Given how loosely specified 3164 is, I feel this would just add complexity that we don't really need. If we can I would prefer to drop this. It would be really useful to be able to create structured data. This could be done by adding a map of template fields to the configuration. Something similar to the way the loki sink handles labels. The ID field could just be a string field in the config. So the config for this would look something like: structured_data:
id: data@12312
fields:
thing: "{{ thing }}"
thang: "{{ thang }}" The spec allows for multiple structured data sections - but I think we should be able to just allow a single section to be specified, at least for the initial version. Would it be possible to add this? |
Some software still seems to emit logs in that format, so it's preferred if Vector is to ingest the logs but the user still wants to maintain compatibility with the current RFC 3164 logs. Which happens to be my case at a glance. I want to bring Vector into a project with a wide user base, but I don't want that to be too disruptive if possible. Supporting RFC 3164 doesn't seem like it requires that much more logic 🤷♂️
My refactor has this functionality implemented.
My approach just relied on remap to provide a |
It's more UX complexity I am concerned with for 3164.
I'm not sure I fully understand. What is your refactor? |
Could you elaborate? On the user end with config it defaults to RFC 5424, but you can explicitly add the
I've forked this PR and rewritten the logic based on my review feedback. It is simpler and easier to review IMO, I've implemented Structured Data support for both RFC 5424 and 3164 (as RFC 5424 advises by prepending it to a message when not I'll probably raise a separate PR in the weekend since there is not much activity here since my review. |
According to the RFC, 5424 has made 3164 obsolete. 3164 is very loosely specified, so really if we say we are conforming to 3164 it just means 'our interpretation of a 3164 structure'. If we have an incoming 3164 message and we send out a 3164 message - there is no guarantee that they would be exactly the same text. Sending a 3164 message to a system that claims to parse 3164 doesn't provide any real guarantee that it will actually be able to parse our 3164 message. Plus I don't like that the year gets dropped in the timestamp.. With a 5424 message it is much more precise and we can have more guarantees about the structure of the message. I'm not saying absolutely we shouldn't support 3164, if, say, there was a well known system that was unable to parse 5124 messages then that would be reason enough for us to support it. But to my knowledge there isn't such a system, but I am all ears if you know of any? |
Yes, and where possible it should be adopted. A project I work on that bundles popular software in Docker is still outputting logs with RFC 3164, and I plan to change that with Vector, but I initially want that RFC 3164 support, it is very minimal to support. Here is the RFC 3164 specific portion: SyslogRFC::Rfc3164 => {
// TIMESTAMP field format:
// https://datatracker.ietf.org/doc/html/rfc3164#section-4.1.2
let timestamp = self.timestamp.format("%b %e %H:%M:%S").to_string();
// MSG part begins with TAG field + optional context:
// https://datatracker.ietf.org/doc/html/rfc3164#section-4.1.3
let mut msg_start = self.tag.encode_rfc_3164();
// When RFC 5424 "Structured Data" is available, it can be included in the `CONTENT` field:
// https://datatracker.ietf.org/doc/html/rfc5424#appendix-A.1
if let Some(sd) = structured_data.as_deref() {
msg_start = msg_start + " " + sd
}
[
timestamp.as_str(),
hostname,
&msg_start,
].join(" ")
}
// ...
struct Tag {
app_name: String,
proc_id: Option<String>,
msg_id: Option<String>
}
// `.as_deref()` usage below avoids requiring `self.clone()`
impl Tag {
// Roughly equivalent RFC 5424 fields can compose the start of
// RFC 3164 MSG part (TAG + CONTENT fields):
// https://datatracker.ietf.org/doc/html/rfc5424#appendix-A.1
fn encode_rfc_3164(&self) -> String {
let Self { app_name, proc_id, msg_id } = self;
match proc_id.as_deref().or(msg_id.as_deref()) {
Some(context) => [&app_name, "[", &context, "]:"].concat(),
None => [&app_name, ":"].concat()
}
}
// ...
} Roughly 4 lines without comments and formatting (EDIT: Forgot the
Vector already has support for 3164 elsewhere, and outputs a specific format from the demo logs source for example. This aligns with the same 3164 logs I encounter with the popular software bundled via the project I maintain, so that's common enough for me 🤷 Sure it might not meet the expectation of everyone else for 3164, but I don't see an issue with this as:
Then don't use the format? That's specific to RFC 3164, it's expected.
It's a low maintenance feature to support as shown above. If you get requests for other variations with enough demand (or contributions) you can discuss if supporting that is worthwhile. It's a motivating factor for me to contribute an improved version of this PR with that functionality intact. I don't mind it being deprecated if it's necessary to drop it in future for some reason, but I'd like RFC 3164 so that I can use that when switching over to Vector, and transitioning away from it. If there is a strong argument from maintainers of Vector against that, I won't have much say, but I'd appreciate it considering how low maintenance it is and the existing RFC 3164 support present in Vector. |
@polarathene I am very much interested in you driving this forward given @syedriko has been pulled in a different direction for us. We currently support RFC3164 in the design that @syedriko has presented but we have upcoming usages where I can push dropping its support though would take it if the burden is low. Regarding some of the specific deployment extentions, those I could live without given they are not part of the spec. Looking forward to your new PR |
Sure, I'll try find some time to raise the PR of what I have done. Should be done before mid jan 👍 (hopefully earlier depending on workload elsewhere)
As stated, for my usage logs are in RFC3164 too. I can likewise probably address that, but I don't see the extra code to support RFC 3164 to be much of a maintenance burden.
Great! In my fork I implemented structured data (for RFC 3164 it is prepended to the log message, as RFC 5424 advises). You can optionally use VRL to construct the same content in your config instead of the hard-coded approach that was done here. UPDATE: Had a busy January and had to delay tackling this. Hoping to have time soon to make progress on pushing this forward. UPDATE (March 14th): Actively working on my fork to get it in shape for review 👍 |
@polarathene Any update on this? |
Sorry I got burnt out earlier. I am able to publish the progress I had, but it was not in a state I was comfortable with for review. It still needs more work to be ready, but perhaps it could get some review prior and maybe someone else can assist as I realize I've not been able to reliably push forward on this as I've wanted to with all my other commitments 😓 |
@polarathene Following-up if this ever happen? Can we either move this PR forward or see your proposed changes so that we can make progress |
No, nobody expressed interest since I wrote that comment.
I just wrapped up a large task elsewhere, so I have time to publish the branch (and potentially rebase if there is no major conflicts to resolve). I'll tackle that tomorrow, getting late here 👍 My branch diverged quite a bit from the one here, which while an improvement.. I'm not sure if it'll help much towards making progress if nobody else with the skills wants to assist 😅 I do want to complete the work, it's just been difficult to budget my volunteer time for. So if nobody helps, I will eventually be able to block time out again for coming back to this. I sunk a considerable chunk of time into the branch already, would be an awful waste to leave it to rot. |
It seems like I actually did push my branch up months ago and forgot about that: https://github.com/polarathene/vector/tree/feat/syslog-codec Diff (as if it was opened as a PR): master...polarathene:feat/syslog-codec The last commit was April, and locally it looks like I was refactoring some structs and fields to address some concerns with configuration + VRL. I had to shift priorities and I think ran into a tricky problem, so that progress was not completed. I haven't tried building the public branch I linked, but you could give that a go? (it's not been rebased to upstream, so there may be conflicts) The commits should be fairly tidy and scoped with associated context in each commit message. I don't have time to write up much guidance, especially since I haven't touched the branch for several months. If the branch itself fails to build successfully, that's probably what I was addressing at the time and why I didn't announce anything earlier. Pretty sure I had it working locally with the uncommitted changes. Hopefully that's helpful enough for now? I'll see if I can grok what I have locally, but I'm not sure if I'll have the time to get familiar with the uncommitted changes and push those improvements/refactors today. I could open a PR with what I have on the public branch, but I wouldn't want to waste reviewer time (my local refactoring + no time to write up a proper PR description). Feel free to give it a glance/try and provide feedback though 👍 |
Any updates on this? Really useful and needed! |
I haven't worked on it for about a month or so. I rarely have the time spare lately. It would probably help to know how you'd like to use/configure it.
I'm not sure if anyone else is trying to work with what I shared earlier, I assume they'd chime in here if they were. Not sure if anyone has tried to even run it 😅 Feedback on the last shared iteration would also be helpful to know if that's going in the right direction or not. I might have more time spare by October or November to finish up my branch. |
I do not know exactly what stalled this PR but I would recommend if you believe this one should be abandoned in favor of the code you have written then create a new PR. @syedriko is part of our organization and I would like to see this long standing ask pushed forward one way (merging this PR) or another (coming to consensus on your proposal). As is stands now we have nothing and are making no progress, and we can't make progress without a pull request. |
I was the only one that participated in review, that feedback wasn't applied. The one here has issues IIRC that were resolved in my branch.
I provided a diff link to my branch earlier: master...polarathene:feat/syslog-codec That is the equivalent of making that branch as a new PR. Once I can allocate some time to finish up the local refactoring I was working on, I can update that branch and raise an actual PR with some examples for building/testing if helpful. My work on it is purely voluntary, no employer backing, so dedicating my spare time on it is tricky 😓 When you say that we have nothing and making no progress without a PR, are you offering to participate in review? So far no one has provided any feedback on the linked branch/diff, so if a new PR left in a WIP status is opened will anything change? |
Apologies, this was on my radar, but it slipped by the side. I'll have a look over it as soon as I can. If I recall, I think the main changes we would like to see would be to use semantic meanings to fill out the fields rather than have them as configuration options. This way we can use |
Efforts are appreciated but AFAIK it is not equivalent to a PR; I'm pretty certain you are only able to comment and provide input with a pull request. Reviewers do not get notifications about responses and changes to branch diffs
I defer to those who are better qualified then myself.
@syedriko is ⬆️ something you can address or maybe review the referenced branch and come to consensus |
It's been a while since I submitted this PR and I think it would be good to go back and revisit why it is the way it is. This is not supposed to be a general-purpose Vector syslog encoder. It is a port of a legacy fluentd syslog plugin. Here https://gist.github.com/syedriko/2aa01a199e8a5d4026e1c42f618e684e I summed up the configuration this PR is driven from and how it affects what goes on the wire. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about the delay in going over this. This will be a very useful feature once we get it through!
There are a number of changes that need to be made, mostly to do with relying on field meanings rather than configuring options. That way it does allow for more flexibility, the user can set these fields as needed message by message.
Let me know if you have any questions.
/// Facility | ||
#[serde(default)] | ||
#[serde(deserialize_with = "deserialize_facility")] | ||
facility: Facility, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this field optional. If None
we should attempt to get the value from the message with the meaning facility
.
/// Severity | ||
#[serde(default)] | ||
#[serde(deserialize_with = "deserialize_severity")] | ||
severity: Severity, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this field optional. If None
we should attempt to get the value from the message with the meaning severity
.
/// Payload key | ||
#[serde(default)] | ||
payload_key: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this option. This should always be the field with the meaning message
.
|
||
/// Add log source | ||
#[serde(default)] | ||
add_log_source: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand what this option is for, but I think it should be removed as add_log_source
seems very specific. We should work out a better way to make this suitable for a wider range of usecases.
|
||
/// App Name, RFC 5424 only | ||
#[serde(default = "default_app_name")] | ||
app_name: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this field optional. If None
we should attempt to get the value from the message with the meaning service
.
|
||
/// Proc ID, RFC 5424 only | ||
#[serde(default = "default_nil_value")] | ||
proc_id: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This option is not needed.
If the message contains a field with a meaning of proc_id
, we should use that field. Otherwise, this field should be the PID of Vector. https://doc.rust-lang.org/std/process/fn.id.html
We will need to update the syslog decoder so that when it adds the proc id field to the message it adds it with a meaning of proc_id
.
/// App Name, RFC 5424 only | ||
#[serde(default = "default_app_name")] | ||
app_name: String, | ||
|
||
/// Proc ID, RFC 5424 only | ||
#[serde(default = "default_nil_value")] | ||
proc_id: String, | ||
|
||
/// Msg ID, RFC 5424 only | ||
#[serde(default = "default_nil_value")] | ||
msg_id: String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this option and take the value from the event with a meaning of msgid
. We should also update the decoder to insert msgid
with this meaning.
String::from(NIL_VALUE) | ||
} | ||
|
||
fn add_log_source(log: &LogEvent, buf: &mut String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure quite what this is trying to do, but I'm pretty sure it should be removed. If users want these details as part of the message they can use vrl
to construct it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StephenWakely What do you mean by "field with a meaning of X"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, fields can be given a meaning. This allows us to have fields coming from different sources with different names, but we are still able to get the field by their meaning without worrying about what their name is.
For example, the syslog source creates a field called hostname
, the journald source creates a field called host
. Both mean the same thing, but have different names.
The Papertrail sink needs to use the host field in its message. So rather than get the field by it's name it get's it by the meaning. See here.
This can also be managed in VRL with the set_semantic_meaning
function.
You'll notice in get_host that each log message can be Vector
or Legacy
. The concept of meaning is in transition and only applies to the Vector namespace. There is an RFC that goes into it here. A long term goal for us is to eventually remove the legacy namespace - but that is probably a long way away.
For the purposes of this PR, I would call the functions in log_event
where there exists one - eg get_message
. For other fields, eg procid
, we can just hardcode the field names for the Legacy namespace.
eg.
match self.namespace() {
LogNamespace::Vector => self.get_by_meaning("procid"),
LogNamespace::Legacy => self.get("procid"),
}
Base it on the names set by the syslog decoder.
Hope that makes sense. Any problems, give me a shout..
@StephenWakely I am a bit confused why you reviewed the PR given my prior feedback. I assume you were aware of the branch I linked? Do you want me to make a draft PR for master...polarathene:feat/syslog-codec I handled the facility and severity better IMO, along with other improvements. I have some refactoring to return to locally that still needs to be wrapped up before I push to the branch again, it had already addressed feedback from your recent review here where I had replaced I strongly believe the work I was doing is in a far better state to continue from. If there is promise of actual feedback / support to progress my branch further I will try allocate time to get the local changes sorted before the end of the month. Once that's done it's mostly getting some user feedback that the design works well for others and aligns with what Vector maintainers expect, we can sort out some tests and get it merged. If you instead want to continue with development of this PR branch and ignore mine, please let me know. I may have identified some bugs in the past and resolved them, each commit in my branch has detailed history for additional context (not that I expect anyone else to read that). |
Sorry to chime in, but I have several thoughts for @polarathene and their works
I think @jcantrill has mentioned in #17668 (comment)
If you think your PR is in a far better state to continue from, you could proceed to open a new separate PR as @jcantrill suggested. Reviewers couldn't give proper feedback without an issue or PR ticket of the related branch. As for this PR, it was opened by @syedriko to track feedback for their branch. By no means am I saying that one branch is better than the other, but in order for reviewers to leave proper feedback for a branch it needs to have an issue or PR ticket. This would allow reviewers to make comparison between branches and might proceed with the better one. I have been following this feature request silently as I have no capacity to provide feedback for the code. I appreciate all individuals that are involved here and couldn't wait for the feature to be available ❤️ |
I had stated why I had not done so multiple times prior to that suggestion:
If someone is willing to commit to providing a review, I'll open a PR for them. If a maintainer requests it, I'll open a PR for them. Now that someone has taken time to contribute a review, they ignored my own prior review and recent comments about the work I had done with a link to the diff. I could absolutely open a PR if they want to review it properly as well, but like I said I was hoping to just get general feedback if it was going in the right direction or if there was anything that stood out at a glance. That sort of thing doesn't necessarily require line by line review comments.
I'm not sure how you're going to improve making a comparison from the link I provided vs a PR? That link is the exact same diff view you'd get from a PR. The only difference at this point is there is no separate discussion with review comment functionality.
I understand this. I do reviews frequently. It does not change the fact that there is little additional value if that PR is going to be without a helpful description, and reflect a state for review that will be refactored. I honestly did not intend for such a long delay on my end, so in hindsight while slightly redundant and wasteful to reviewer time, the feedback I may have received by now would have been useful. It's a couple clicks for me to turn that diff link into a draft PR. I can do it as soon as someone is willing to offer the intention to provide feedback. Otherwise I think it would be more respectful to the project maintainers to review it once I can write up a proper PR description and ensure the branch itself is in a better state along with instructions for users like yourself to build and try it for additional feedback. |
@polarathene I'm betting @StephenWakely would be kind enough to review your PR since he already provided feedback on this one. IMO, maintainers can decide if you are 'wasting their time' by choosing to review (or not) your submission. The fact is, without a PR, nothing is moving forward.
@syedriko My original expectation was this would be a general purpose syslog encoder. This would be my preference and aligns with comments made by @StephenWakely . If @polarathene 's changes and code (and PR) accomplish this, I'm certain we can move our specific logic into VRL. |
The end result, yes, absolutely, the code in this PR as I submitted it, decidedly not, and it's the delta between the two we're converging on in this thread. @polarathene, your branch is more elegant Rust than mine. I think this PR has served its purpose and it's time for master...polarathene:feat/syslog-codec to become the new PR and implement @StephenWakely's comments in it. |
@syedriko thank you for your efforts and for kicking this off. If you are ok with moving ahead with @polarathene's branch then lets proceed with that. @polarathene apologies I hadn't realised there was a competing branch. Can you create a PR from your branch and I will take a look. I will make sure to be more timely with the review this time! Can you apply any of the feedback I have made to this PR regarding using the meanings to retrieve data from the event? |
I can open a PR yes, but it will be with an empty description with only the code itself and commit messages for context. I'm not sure if the current public branch is in a state that was usable, my local refactoring ran into an issue I wasn't sure how to approach / resolve.
Not at this point, I can make some time to address feedback within October, or once a PR is up apply code suggestions. However as mentioned I'd need to diff the local changes as I know with the refactor This is why I have been hesitant open a PR to seek a proper review. I will open a draft PR for you, and in the weekend look into what local changes I can commit to push. General feedback will be appreciated. I'd love to finish the PR sooner but have no employer at present, thus it's tricky to afford dedicating time to this over other priorities 😓 |
I've opened a PR: #21307 |
feat(syslog codec) POC of a syslog codec
This is a naive syslog codec implementation, geared towards the needs of the OpenShift logging stack