
Support for event.ingested and event.processed #453

Closed
peterpramb opened this issue May 9, 2019 · 12 comments

@peterpramb

event.created is described as the time when the event was first read by an agent or by a pipeline. In a distributed setup, however, this means that information about the time taken to process an event at the various stages is lost.

Would it be possible to add new fields to event to keep that information? For example, event.processed could hold the time when e.g. Logstash (the last one in a chain, or maybe an array) actually processes the event, and event.ingested the time when an ingest pipeline handles it.
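As a very rough sketch of the event.ingested part (the pipeline name is made up, and this assumes the field gets added to ECS as proposed), an ingest pipeline could stamp the event with a simple set processor:

```json
PUT _ingest/pipeline/stamp-event-ingested
{
  "description": "Sketch: record the time the ingest node handled the event",
  "processors": [
    {
      "set": {
        "field": "event.ingested",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```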

Does this make sense?

@ruflin
Member

ruflin commented May 9, 2019

It's something we discussed in the past; it goes a bit into event tracing. I can definitely see the value of these "measurements" on the event. I wonder if we could come up with a more generic way of storing multiple timestamps, letting everyone add as many timestamps as they need.

@peterpramb
Author

Well, there is the append pipeline processor, which can create or extend an array, so only one field would be needed (as long as the order is preserved). I'm not sure if something similar exists in Logstash, or if it would make extracting the timestamps again too complicated.
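For illustration, a single append processor per stage would be enough. This sketch assumes a made-up array field event.pipeline_timestamps (append creates the array if it doesn't exist yet, and appends otherwise):

```json
{
  "append": {
    "field": "event.pipeline_timestamps",
    "value": "{{_ingest.timestamp}}"
  }
}
```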

@ruflin
Member

ruflin commented May 10, 2019

My assumption is that something similar exists in Logstash. The challenge I see with this is that we lose the context / meaning of each timestamp. It would also make it impossible to query one specific timestamp, for example processed.
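To illustrate: with a dedicated field, a targeted query is trivial (sketch below, field name as proposed). Against a single flat array like the hypothetical event.pipeline_timestamps, the same range query would match any of the timestamps, with no way to tell which stage produced it.

```json
GET logs-*/_search
{
  "query": {
    "range": { "event.processed": { "gte": "now-15m" } }
  }
}
```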

@webmat
Contributor

webmat commented May 10, 2019

Yes, I also see a need to provide guidance on how to track pipeline metadata. Timings like you describe here are part of it. We could also have a specific place for errors at the various stages.

I agree that keeping all of the context about each step will be a challenge. An array of keywords will lose this context, whereas an array of objects is difficult to use with Kibana...

So stay tuned for when we start tackling this, and suggestions are welcome.

In the meantime, don't forget that you can already track this the way you want, with custom fields. Once you've experimented with your use case, further feedback on your findings will be welcome :-)

@peterpramb
Author

peterpramb commented May 12, 2019

Just thinking about some issues, independent of how the storage is actually done (feel free to ignore the spam):

  • event.created must be properly saved before @timestamp is overwritten for the first time somewhere in the chain (e.g. with timestamps parsed from log files), or things will break.
  • Timezone calculation must be done properly (e.g. when using timestamps from log files without TZ information), or things will break.
  • Storing the timestamps of the various stages is nice for documentation, but will it help with visualization? The idea is to have some sort of stacked time diagram showing the average delay at the various stages. But how would Kibana handle raw timestamps?
  • To support visualization, one should probably also store the calculated delays together with the raw timestamps.
  • Is there a need for raw timestamps at all? What about storing only the delay values? But without raw timestamps the delay can only be calculated absolutely (current time minus event creation time). How could one recalculate the relative delays in Kibana?
  • The whole date math can be done in scripts, but it is not trivial and is error-prone (see the sketch after this list). Maybe this calls for some sort of dedicated processor in Logstash and Elasticsearch?
  • Can the whole processing be done implicitly, without any user intervention?
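As a sketch of the date math in question (field names hypothetical; assumes both fields already exist and hold ISO 8601 strings with timezone information), a script processor could compute the absolute delay in milliseconds:

```json
{
  "script": {
    "lang": "painless",
    "source": "ZonedDateTime created = ZonedDateTime.parse(ctx.event.created); ZonedDateTime ingested = ZonedDateTime.parse(ctx.event.ingested); ctx.event.delay = ChronoUnit.MILLIS.between(created, ingested);"
  }
}
```

Even this small example silently assumes parseable, timezone-aware inputs, which is exactly the kind of fragility meant above.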

@webmat
Contributor

webmat commented May 23, 2019

Not spam at all. Thanks for the thoughts :-)

Re-reading your initial issue, I like the approach of focusing on the logical progression of the event through the pipeline, rather than on the concrete steps of an arbitrary pipeline. The latter is how I had been thinking about it initially. But if we go with logical progression, we may be able to come up with a straightforward proposal for these timestamps (and other metadata), which users will be able to map onto their concrete topologies.

@peterpramb
Author

peterpramb commented May 24, 2019

Well, after rethinking my use case and my initial proposal, I'm not sure this is the right way to go. I'm basically interested in the average (and min/max/stddev/histogram/...) delays through the whole chain, but not for individual events. The data-to-metadata ratio is already bad enough for certain documents, so storing this information with the documents would make it even worse, and I don't see any use case for keeping that information. Instead I'd rather think about forwarding that information along with events and processing it directly in the endpoint.

Though similar, the proposal really covers two different use cases:

  • External

Each supporting component along the path might add timing information as metadata to events, at any depth it sees fit (e.g. Logstash processors). Even network delays could be calculated that way:

  /---\             /----\             /----\             /----\
  | S |-------------| C1 |-------------| C2 |-------------| ES |
  \---/             \----/             \----/             \----/

T_S(create)       T_C1(retr)         T_C2(retr)         T_ES(retr)
T_S(send)         T_C1(send)         T_C2(send)         T_ES(store)
t_S(delay)        t_C1(delay)        t_C2(delay)        t_ES(delay) 

       t_S_C1(delay)      t_C1_C2(delay)     t_C2_ES(delay)

ES, as the last component in the path, should then extract that information from incoming events and remove the metadata afterwards. The processed information could be exposed via a new stats endpoint. If the data structure correctly represents the topology, one could even draw topology diagrams from it automatically.
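The per-hop metadata on an event could then look something like this (a purely hypothetical structure with made-up field names and values, mirroring the diagram above):

```json
{
  "@timestamp": "2019-05-24T10:15:30.000Z",
  "_meta": {
    "path": [
      { "component": "S",  "created":   "2019-05-24T10:15:30.000Z", "sent": "2019-05-24T10:15:30.010Z" },
      { "component": "C1", "retrieved": "2019-05-24T10:15:30.050Z", "sent": "2019-05-24T10:15:30.060Z" },
      { "component": "C2", "retrieved": "2019-05-24T10:15:30.090Z", "sent": "2019-05-24T10:15:30.100Z" }
    ]
  }
}
```

The network delay t_S_C1 would then be C1's retrieved minus S's sent, and each component's own delay the difference between its two timestamps.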

  • Internal

This is really about ingest pipeline profiling, where each pipeline and processor adds the timing information as mentioned above.

While the node stats endpoint already provides the total count and total processing time, it currently covers neither the actual topology (for example, sub-pipelines do not contain their processors and possibly another level of sub-pipelines), nor does it calculate the actual delays yet. Additionally, it might be helpful to not only provide overall stats per pipeline and processor, but to introduce another granularity level, like stats per 'document class' (configurable by the user, possibly defaulting to event.dataset). That way, different code paths could be analyzed when a pipeline processes different events with different sub-pipelines.
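For reference, the counters mentioned above come from the existing node stats endpoint; a heavily trimmed response shape (values made up):

```json
GET _nodes/stats/ingest

{
  "nodes": {
    "<node-id>": {
      "ingest": {
        "pipelines": {
          "<pipeline-name>": {
            "count": 1234,
            "time_in_millis": 567,
            "current": 0,
            "failed": 0
          }
        }
      }
    }
  }
}
```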

Maybe both use cases fit into the same data structure, maybe not.

What do you think?

@peterpramb
Author

peterpramb commented May 26, 2019

Hm, this still doesn't give the whole picture as I see it. Let me try it in a different way:

Have you ever seen or used any network discovery tools? They actively scan and passively listen to your networks and, when you give them access to your network devices, extract all possible useful (or not so useful) information from them. In the end you get a more or less nice infrastructure diagram showing all your network devices and their relationships (aka interconnections), as well as all sorts of network and device information and operational statistics.

I'd like to propose something similar for the Elastic Stack; call it "automatic infrastructure discovery based on event flows" or something like that. Not by actively scanning anything, but by simply collecting infrastructure information from events, added sequentially by the participating components along the event path. The nice thing is that you don't need to do anything special on the Elasticsearch side, just look for and extract the infrastructure information (and of course process it, but that should be simple ;).

That information could be as simple as the identifier of the event-processing component plus the timing information mentioned in my last comment, and could grow to whatever makes sense for a component (e.g. pipeline profiling data in Logstash or Elasticsearch itself). The components would not even need to add their information to every processed event; this could be adjusted dynamically based on the current event throughput, e.g. only every tenth event (it is therefore important to note that this would not be like time series data, as the event flow is not predictable).

When a component is not aware of the discovery mechanism, it will simply forward the event without adding anything useful to it (an important difference from using separate probe events, which not only require all components along the path to support them, but also to route them correctly, e.g. to different Elasticsearch clusters).

One day I'd like to see a nice, configurable dashboard in Kibana, showing me the complete event processing infrastructure per event flow, all component information and statistics (like events/s or delay histograms), and the component relationships. I should be able to identify any operational issue at a glance (maybe in combination with Watcher), for example event spikes or an extraordinary delay in event processing due to GC. In such cases it would be useful to dig further into a component (when available), e.g. to get more component statistics or to profile pipelines. The possibilities are open to whatever someone could imagine.

I know that this is a lot different from my initial request, and that this is probably no longer the right place to discuss it. Should we move it over to Elasticsearch? What do you suggest?

@jordansissel

jordansissel commented Sep 16, 2019

To add another use case: we have data from laptops coming in with unpredictable delays due to factors such as being offline. We want alerts on new data as it comes in, so we think that adding a field at ingest time (with an Elasticsearch ingest processor or with Logstash) would allow us to write alerts like "In the past 5 minutes, were any events received that match [some condition]?"

event.created isn't a fit for this, given its description (first seen by some kind of data pipeline/agent, which in this case would be auditbeat).
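For the alerting side, once such an ingest-time field exists, the "past 5 minutes" check becomes a simple range filter. A sketch, where the index pattern and the match condition are placeholders:

```json
GET auditbeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "event.ingested": { "gte": "now-5m" } } },
        { "term":  { "event.action": "some-condition" } }
      ]
    }
  }
}
```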

@graphaelli
Member

does #582 close this?

@webmat
Contributor

webmat commented Dec 13, 2019

> does #582 close this?

Not fully. We've merged and released event.ingested, but we haven't addressed the pipeline timestamp yet. Let's keep it open for now.

@webmat
Contributor

webmat commented Aug 18, 2020

We created meta-issue #940 to discuss capturing pipeline details in ECS. Since one of the two timestamps has already been added, and the other is related to the meta-issue, I'm closing in favor of the meta issue.
