Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce source.address and destination.address. #247

Merged
merged 5 commits into from
Dec 11, 2018

Conversation

webmat
Copy link
Contributor

@webmat webmat commented Dec 6, 2018

As discussed earlier today, let's see if we can introduce this in Beta 2.

This would be very useful, as the field that's always guaranteed to have the endpoint address, no matter if it's an IP, a hostname or a unix socket.

@webmat webmat self-assigned this Dec 6, 2018
Copy link
Contributor

@MikePaquette MikePaquette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example where this would play out? If an event contains an IP address string as the host, then why don't we just put that in the destination.domain field? Seeing an IP address string in a domain field is somewhat common and I'm not sure is a problem. Also, I'm concerned about populating the destination.ip field with information that is not coming directly from a packet or session event.

I think I prefer the simpler "just put whatever you get" into the destination.domain field.

Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of this change as it's provide a good place for a value where it's not clear what it is yet. So far we used domain but to me it always didn't feel obvious to put it there, especially the sockets.

@webmat
Copy link
Contributor Author

webmat commented Dec 7, 2018

@MikePaquette This would create a situation where source.domain and destination.domain in almost all flow events would contain IP addresses, however.

I'd much rather introduce this field that's known to be kind of a mess, whose name doesn't mean anything specific (address is more generic than domain), but is always guaranteed to have a value of whatever that endpoint was.

@andrewkroh
Copy link
Member

I think adding a generic address fields introduces ambiguity and will make it more difficult to query information. The field provides more flexibility when generating data, but the cost is that you need to query more fields to find information (ip, domain, and now address) and altogether lose the ability to do CIDR searches if the value is an IP.

It's a trade-off between doing the work at ingest time to classify the data or pushing that work to query-time.

(I'm considering the case where you have either an IP address or hostname, and ignoring unix sockets. For unix sockets perhaps having separate field distinctly for them would make sense.)

@ruflin
Copy link
Member

ruflin commented Dec 10, 2018

I think this is also about providing a field when we can't do it at ingest time. It could even be a post-processing job to split up the fields but it would still know what default field it should be based on.

For me address is not really the field we should encourage users to query on.

@webmat
Copy link
Contributor Author

webmat commented Dec 10, 2018

@andrewkroh The way I envision this (and perhaps my definition isn't clear enough) this field works exactly the reverse way:

If you have an IP:

  • .address == the IP
  • .ip == the IP
  • .domain == empty (unless you do reverse a DNS query)

If you have a domain:

  • .address == the domain
  • .domain == the domain
  • .ip == empty (unless you do a DNS query)

If you have a socket:

  • .address == the socket
  • .domain == empty
  • .ip == empty

Depending on the event stream being observed, the overwhelming majority of events will fall cleanly within expectations, using the expected field (e.g. access logs will mostly contain IP addresses). But if you then rely solely on the field you expect*, you are at risk of filtering out the weird events that fall outside of that expectation. Whether it's local access via a unix socket or it's a resolved name in the place of the IP address (httpd's HostnameLookups set to On).

The goal of this field is to allow looking for the long tail of weirdness. To have a place where you know if you look there, you have that endpoint's address, whatever form it took.

Also, the unix socket is so rare that I'd much rather have only .ip, .domain and .address with these semantics, than have .ip, .domain and .socket.

* Unless you ensure to always consider the events with "missing value" as well.

@MikePaquette
Copy link
Contributor

@webmat your explanation is helpful, and I am OK to define such fields as extended fields in ECS.

I would prefer a more descriptive/explicit field name that has less potential for confusion for users. For example, might_be_address, ip_or_domain, ip_or_host , ip_or_domain_or_socket

Related optimization question: would there be any advantage for subsequent queries/aggs/viz's if in the "domain" or "socket" cases, ECS converters populated the .ip fields with a constant known value, rather than leaving them unpopulated? (I am not sure how the IP datatype works when the field is empty.) ? e.g., 0.0.0.0 ?

@ruflin
Copy link
Member

ruflin commented Dec 10, 2018

+1 on using .address and put it into extended. I think we really need to field to make ingestion easy and so far it's the best name we came up with.

@webmat
Copy link
Contributor Author

webmat commented Dec 10, 2018

@MikePaquette For missing values, leaving the fields absent is the simplest option, and the one I think we should advocate.

For the name of the field, I do think .address makes most sense. It's simple, and it can also become a new pattern: we've been discussing using service.address to put contain addresses such as ODBC-style DSN strings. Address contains the whole thing as observed (well, with password anonymized) and then if your pipeline extracts domain/IP/port/user out of that, all of this goes to the expected fields. But .address is where the full raw value goes.

@webmat
Copy link
Contributor Author

webmat commented Dec 10, 2018

I meant to paste the Beats PR in my comment above: elastic/beats#8941

Copy link
Member

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM besides the changelog.

CHANGELOG.md Outdated
@@ -38,6 +38,7 @@ All notable changes to this project will be documented in this file based on the
* Reintroduce a streamlined `user_agent` field set. #240
* Add `geo.name` for ad hoc location names. #248
* Add `event.timezone` to allow for proper interpretation of incomplete timestamps. #258
* Add fields `source.address` and `destination.address`. #247
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is also added to client and server it seems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, will fix

Copy link
Contributor

@MikePaquette MikePaquette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation helps clarify purpose nicely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants