
Support for event.ingested and event.processed #453

Closed
peterpramb opened this issue May 9, 2019 · 12 comments

@peterpramb

event.created is described as the time when the event was first read by an agent or by a pipeline. In a distributed setup, however, this means that information about the time taken to process an event at the various stages is lost.

Would it be possible to add new fields to event to keep that information? For example, event.processed could hold the time when e.g. Logstash (the last one in a chain, or maybe an array) actually processes the event, and event.ingested the time when an ingest pipeline handles it.
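As a very rough sketch of the event.ingested part (the pipeline name is made up, and this assumes the field gets added to ECS as proposed), an ingest pipeline could stamp the event with a simple set processor:

```json
PUT _ingest/pipeline/stamp-event-ingested
{
  "description": "Sketch: record the time the ingest node handled the event",
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```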

Does this make sense?

@ruflin
Member

ruflin commented May 9, 2019

It's something we discussed in the past; it goes a bit into event tracing. I can definitely see the value of these "measurements" on the event. I wonder if we could come up with a more generic way of storing multiple timestamps, letting everyone add as many timestamps as they need.

@peterpramb
Author

Well, there is the append pipeline processor, which can create or extend an array, so only one field would be needed (as long as the order is preserved). I'm not sure if something similar exists in Logstash, or if it would make extracting the timestamps again too complicated.
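For illustration, a single append processor per stage would be enough. This sketch assumes a made-up array field event.pipeline_timestamps (append creates the array if it doesn't exist yet, and appends otherwise):

```json
{
  "append": {
    "field": "event.pipeline_timestamps",
    "value": "{{_ingest.timestamp}}"
  }
}
```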

@ruflin
Member

ruflin commented May 10, 2019

My assumption is that something similar exists in Logstash. The challenge I see with this is that we lose the context / meaning of each timestamp. It would also make it impossible to query one specific timestamp, for example processed.
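To illustrate: with a dedicated field, a targeted query is trivial (sketch below, field name as proposed). Against a single flat array like the hypothetical event.pipeline_timestamps, the same range query would match any of the timestamps, with no way to tell which stage produced it.

```json
GET logs-*/_search
{
  "query": {
    "range": { "event.processed": { "gte": "now-15m" } }
  }
}
```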

@webmat
Contributor

webmat commented May 10, 2019

Yes, I also see a need to provide guidance on how to track pipeline metadata. Timings like you describe here are part of it. We could also have a specific place for errors at the various stages.

I agree that keeping all of the context about each step will be a challenge. An array of keywords will lose this context, whereas an array of objects is difficult to use with Kibana...

So stay tuned for when we start tackling this, and suggestions are welcome.

In the meantime, don't forget that you can already track this the way you want, with custom fields. Once you've experimented with your use case, further feedback on your findings will be welcome :-)

@peterpramb
Author

peterpramb commented May 12, 2019

Just thinking about some issues, independent of how the storage is actually done (feel free to ignore the spam):

  • event.created must be properly saved before @timestamp is overwritten for the first time somewhere in the chain (e.g. with timestamps parsed from log files), or things will break.
  • Timezone calculation must be done properly (e.g. when using timestamps from log files without TZ information), or things will break.
  • Storing the timestamps of the various stages is nice for documentation, but will it help with visualization? The idea is to have some sort of stacked time diagram showing the average delay at the various stages. But how would Kibana handle raw timestamps?
  • To support visualization, one should probably also store the calculated delays together with the raw timestamps.
  • Is there a need for raw timestamps at all? What about storing only the delay values? But without raw timestamps the delay can only be calculated absolutely (current time minus event creation time). How could one recalculate the relative delays in Kibana?
  • The whole date math can be done in scripts, but it is not trivial and is error-prone (see the sketch after this list). Maybe this calls for some sort of dedicated processor in Logstash and Elasticsearch?
  • Can the whole processing be done implicitly, without any user intervention?
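As a sketch of the date math in question (field names hypothetical; assumes both fields already exist and hold ISO 8601 strings with timezone information), a script processor could compute the absolute delay in milliseconds:

```json
{
  "script": {
    "lang": "painless",
    "source": "ZonedDateTime created = ZonedDateTime.parse(ctx.event.created); ZonedDateTime ingested = ZonedDateTime.parse(ctx.event.ingested); ctx.event.delay = ChronoUnit.MILLIS.between(created, ingested);"
  }
}
```

Even this small example silently assumes parseable, timezone-aware inputs, which is exactly the kind of fragility meant above.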

@webmat
Contributor

webmat commented May 23, 2019

Not spam at all. Thanks for the thoughts :-)

Re-reading your initial issue, I like the approach of focusing on the logical progression of the event through the pipeline, rather than on the concrete steps of an arbitrary pipeline. The latter is how I had been thinking about it initially. But if we go with logical progression, we may be able to come up with a straightforward proposal for these timestamps (and other metadata), which users will be able to map onto their concrete topologies.

@peterpramb
Author

peterpramb commented May 24, 2019

Well, after rethinking my use case and my initial proposal, I'm not sure this is the right way to go. I'm basically interested in the average (and min/max/stddev/histogram/...) delays through the whole chain, but not for individual events. The data-to-metadata ratio is already bad enough for certain documents, so storing this information with the documents would make it even worse, and I don't see any use case for keeping that information. Instead I'd rather think about forwarding that information along with events and processing it directly in the endpoint.

Though similar, the proposal really covers two different use cases:

  • External

Each supporting component along the path might add timing information as metadata to events, at any depth it sees fit (e.g. Logstash processors). Even network delays could be calculated that way:

  /---\             /----\             /----\             /----\
  | S |-------------| C1 |-------------| C2 |-------------| ES |
  \---/             \----/             \----/             \----/

T_S(create)       T_C1(retr)         T_C2(retr)         T_ES(retr)
T_S(send)         T_C1(send)         T_C2(send)         T_ES(store)
t_S(delay)        t_C1(delay)        t_C2(delay)        t_ES(delay) 

       t_S_C1(delay)      t_C1_C2(delay)     t_C2_ES(delay)

ES, as the last component in the path, should then extract that information from incoming events and remove the metadata afterwards. The processed information could be exposed via a new stats endpoint. If the data structure correctly represents the topology, one could even draw topology diagrams from it automatically.
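The per-hop metadata on an event could then look something like this (a purely hypothetical structure with made-up field names and values, mirroring the diagram above):

```json
{
  "@timestamp": "2019-05-24T10:15:30.000Z",
  "_meta": {
    "path": [
      { "component": "S",  "created":   "2019-05-24T10:15:30.000Z", "sent": "2019-05-24T10:15:30.010Z" },
      { "component": "C1", "retrieved": "2019-05-24T10:15:30.050Z", "sent": "2019-05-24T10:15:30.060Z" },
      { "component": "C2", "retrieved": "2019-05-24T10:15:30.090Z", "sent": "2019-05-24T10:15:30.100Z" }
    ]
  }
}
```

The network delay t_S_C1 would then be C1's retrieved minus S's sent, and each component's own delay the difference between its two timestamps.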

  • Internal

This is really about ingest pipeline profiling, where each pipeline and processor adds the timing information as mentioned above.

While the node stats endpoint already provides the total count and total processing time, it currently covers neither the actual topology (for example, sub-pipelines do not contain their processors and possibly another level of sub-pipelines), nor does it calculate the actual delays yet. Additionally, it might be helpful to not only provide overall stats per pipeline and processor, but to introduce another granularity level, like stats per 'document class' (configurable by the user, possibly defaulting to event.dataset). That way, different code paths could be analyzed when a pipeline processes different events with different sub-pipelines.
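For reference, the counters mentioned above come from the existing node stats endpoint; a heavily trimmed response shape (values made up):

```json
GET _nodes/stats/ingest

{
  "nodes": {
    "<node-id>": {
      "ingest": {
        "pipelines": {
          "<pipeline-name>": {
            "count": 1234,
            "time_in_millis": 567,
            "current": 0,
            "failed": 0
          }
        }
      }
    }
  }
}
```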

Maybe both use cases fit into the same data structure, maybe not.

What do you think?

@peterpramb
Author

peterpramb commented May 26, 2019

Hm, this still doesn't give the whole picture as I see it. Let me try it in a different way:

Have you ever seen or used any network discovery tools? They actively scan and passively listen to your networks and, when you give them access to your network devices, extract all possible useful (or not so useful) information from them. In the end you get a more or less nice infrastructure diagram showing all your network devices and their relationships (aka interconnections), as well as all sorts of network and device information and operational statistics.

I'd like to propose something similar for the Elastic Stack; call it "automatic infrastructure discovery based on event flows" or something like that. Not by actively scanning anything, but by simply collecting infrastructure information from events, added sequentially by the participating components along the event path. The nice thing is that you don't need to do anything special on the Elasticsearch side, just look for and extract the infrastructure information (and of course process it, but that should be simple ;).

That information could be as simple as the identifier of the event-processing component plus the timing information mentioned in my last comment, and could grow to whatever makes sense for a component (e.g. pipeline profiling data in Logstash or Elasticsearch itself). The components would not even need to add their information to every processed event; this could be adjusted dynamically based on the current event throughput, e.g. only every tenth event (it is therefore important to note that this would not be like time series data, as the event flow is not predictable).

When a component is not aware of the discovery mechanism, it will simply forward the event without adding anything useful to it (an important difference from using separate probe events, which not only require all components along the path to support them, but also to route them correctly, e.g. to different Elasticsearch clusters).

One day I'd like to see a nice, configurable dashboard in Kibana, showing me the complete event processing infrastructure per event flow, all component information and statistics (like events/s or delay histograms), and the component relationships. I should be able to identify any operational issue at a glance (maybe in combination with Watcher), for example event spikes or an extraordinary delay in event processing due to GC. In such cases it would be useful to dig further into a component (when available), e.g. to get more component statistics or to profile pipelines. The possibilities are open to whatever someone could imagine.

I know that this is a lot different from my initial request, and that this is probably no longer the right place to discuss it. Should we move it over to Elasticsearch? What do you suggest?

@jordansissel

jordansissel commented Sep 16, 2019

To add another use case: we have data from laptops coming in with unpredictable delays due to factors such as being offline. We want alerts on new data as it comes in, so we think that adding a field at ingest time (with an Elasticsearch ingest processor or with Logstash) would allow us to write alerts like "In the past 5 minutes, were any events received that match [some condition]?"

event.created isn't a fit for this, given its description (first seen by some kind of data pipeline/agent, which in this case would be auditbeat).
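For the alerting side, once such an ingest-time field exists, the "past 5 minutes" check becomes a simple range filter. A sketch, where the index pattern and the match condition are placeholders:

```json
GET auditbeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "event.ingested": { "gte": "now-5m" } } },
        { "term":  { "event.action": "some-condition" } }
      ]
    }
  }
}
```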

@graphaelli
Member

does #582 close this?

@webmat
Contributor

webmat commented Dec 13, 2019

> does #582 close this?

Not fully. We've merged and released event.ingested, but we haven't addressed the pipeline timestamp yet. Let's keep it open for now.

@webmat
Contributor

webmat commented Aug 18, 2020

We created meta-issue #940 to discuss capturing pipeline details in ECS. Since one of the two timestamps has already been added, and the other is related to the meta-issue, I'm closing in favor of the meta issue.
