
EventSource::SendManifest causes lost events which result in unmergeable ETL files #77014

Closed
AloisKraus opened this issue Oct 13, 2022 · 14 comments · Fixed by #78213

Comments

@AloisKraus

Description

When doing ETW recording on Windows I have found that whenever a .NET runtime session is started, every traced .NET 6.0 application emits its ETW manifest into the event stream. These events are large (30+ KB), and with multiple processes (I have many .NET services running on that box) they create an event storm that causes ETW to drop events.
That in itself is not too bad, but sometimes the ETL files become corrupt so that the merge operation with xperf or PerfView fails. That is an issue.

Reproduction Steps

  1. Start many .NET 6 applications
  2. Start an ETW session for the .NET runtime provider:
 C:\etl>xperf -start ClrSession -on  Microsoft-Windows-DotNETRuntime -f C:\temp\ETW.clrSession.etl -BufferSize 512 -minbuffers 256 -maxbuffers 512 -FileMode Sequential
  3. Stop the ETW session:
C:\etl>xperf -stop ClrSession
xperf: warning: Session "ClrSession" lost 241898 events.
The trace you have just captured "C:\temp\ETW.clrSession.etl" may contain personally identifiable information, including but not necessarily limited to paths to files accessed, paths to registry accessed and process names. Exact information depends on the events that were logged. Please be aware of this when sharing out this trace with other people.

To avoid losing events, for in-file tracing try increasing buffer size or the minimum/maximum number of buffers. It is also recommended that a separate secondary storage device is used for in-file tracing, especially when analyzing secondary storage activity. Another way to avoid losing events is to trace in memory (-buffering mode). This guarantees that events will not be lost, but the time range of events captured can be significantly shorter than with in-file tracing.

  4. Check whether the .NET runtime provider has lost events (one possible check is sketched below)
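One possible way to inspect the captured file (not part of the original report; this assumes the standard Windows tracerpt tool is available) is to generate a summary and look at the event counts and the reported number of lost events:

C:\etl>tracerpt C:\temp\ETW.clrSession.etl -summary summary.txt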

Expected behavior

No lost events

Actual behavior

Lost events and sometimes unmergeable ETL files

xperf -merge  broken.etl broken.user.etl final.etl -mergeonly
xperf: error: final.etl: The parameter is incorrect. (0x57).

I can provide example broken ETL files via a private channel.

Regression?

This was not the case with .NET Core 3.1

Known Workarounds

So far none. Does filtering out the event "fix" this when I later do a proper rundown?

tracelog -enableex ClrSession -guid #e13c0d23-ccbc-4e12-931b-d9cc2eee27e4 -EventIdFilter -out 1 65534
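For context, the rundown could then be captured in a separate session. A minimal sketch, assuming the rundown provider name Microsoft-Windows-DotNETRuntimeRundown and reusing the xperf options from the repro steps above (not commands taken verbatim from this thread):

C:\etl>xperf -start ClrRundownSession -on Microsoft-Windows-DotNETRuntimeRundown -f C:\temp\ETW.clrRundown.etl -BufferSize 512 -minbuffers 256 -maxbuffers 512 -FileMode Sequential
C:\etl>xperf -stop ClrRundownSession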

Configuration

No response

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 13, 2022
@tommcdon tommcdon added the question Answer questions and provide assistance, not an issue with source code or documentation. label Oct 13, 2022
@tommcdon tommcdon added this to the 8.0.0 milestone Oct 13, 2022
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Oct 13, 2022
@tommcdon
Member

@brianrob @davmason

@brianrob
Member

Starting with .NET 6, NativeRuntimeEventSource will dump its manifest into the event stream. To my knowledge, this is not used by any tools, but it does have an impact on the verbosity of trace data coming from .NET 6+ processes. I can't explain why it would create unmergeable ETL files.

@davmason, this will likely require producing a minimal repro and reaching out to the ETW folks. The additional overhead of the manifest events, especially when there are many processes, is also a good example of one of the negative side effects of emitting the manifest.

@AloisKraus
Author

@brianrob: My guess is that some of the lost events belong to the manifest, leaving it incomplete, which then causes the merge error. Since I cannot debug into Windows to see whether the ETW infrastructure knows more about non-manifest-based providers during merge, I can only speculate. No buffers were lost, but a few events (e.g. 4) were dropped. I do not know of any other key events during merging that could cause ETW to declare the data corrupt. This could be OS dependent, which would be even worse, because then we would need a fix on several Windows versions ...

@brianrob
Member

It's super interesting to me that dropped events would cause a merge error, but I suppose anything is possible. The manifest events themselves aren't special, other than that they're much larger than most events. They actually aren't used at all by xperf as xperf doesn't know about them. Either way, if you can repro this reliably, it does feel like it's worth investigating. @davmason, is this something you can take a look at?

@davmason
Member

@AloisKraus how often do you see the merge error? I played around with it, and when there are many 6.0 processes I can reliably see the dropped events but I haven't been able to reproduce the merge error.

@AloisKraus
Author

AloisKraus commented Oct 17, 2022

@davmason: I have opened a case TrackingID#2210140040001578 which contains an example broken ETL file. Yes, we can reliably reproduce the issue when we start our application, which spawns ca. 50+ .NET 4.8 and some .NET 6.0 processes. When we start/stop profiling for the first time after that startup, it reliably produces unmergeable ETL files.

@davmason
Member

Thanks @AloisKraus. The .NET team does not own xperf, so I am trying to get your case routed to the appropriate people who can help; I have no particular knowledge about xperf itself.

For an immediate workaround, have you considered using dotnet-trace? It only works on .NET Core 3.1 and later, but you can trace at a per-process level instead of a per-machine level.
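A minimal sketch of such a per-process collection, assuming dotnet-trace is installed and using a placeholder process ID (not a command from this thread):

dotnet-trace collect --process-id <pid> --providers Microsoft-Windows-DotNETRuntime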

@AloisKraus
Author

Hi @davmason, I will try out dotnet-trace for Docker containers, but the majority is still .NET 4.8 and large amounts of C++. Using dotnet-trace is therefore not an option when I want to understand system-level issues. We use ETW to automate perf trending tests to see the impact of antivirus, OS changes, our software and many other things. In fact I have written ETWAnalyzer to mass-analyze ETW data, which is working very well. Perhaps it is also something to consider for some key metrics for dotnet itself? The FileWriter example shows how it can be used to generate mass ETW data.

@tommcdon tommcdon added tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly and removed question Answer questions and provide assistance, not an issue with source code or documentation. labels Oct 25, 2022
@tommcdon tommcdon modified the milestones: 8.0.0, Future Oct 25, 2022
@tommcdon
Member

Using dotnet-trace is therefore not an option if I want to understand system level issues.

Sorry to hear.

Perhaps it is also something to consider for some key metrics for dotnet itself? The FileWriter example shows how it can be used to generate mass ETW data.

Thanks for the suggestion @AloisKraus! Feel free to open another issue to track this as an enhancement.

the majority is still .NET 4.8 and larger amounts of C++

Since this is an issue external to .NET Core, I am marking it with the tracking-external-issue label and moving it into the Future milestone.

@AloisKraus
Author

@tommcdon: I have found the cause of the xperf merge error. I was running a copy of xperf 10.0.22621 that had this merge bug; it no longer happens with the latest Win 11 package of the same 10.0.22621 version.

Still, the event loss issue remains. Would it be possible to add a variable to skip the manifest sending at trace session start, so one can later do a rundown where everything can be collected even if events were lost? I already do this for .NET processes, and it works well. The rundown session always loses events, but for simplicity I reset the lost event counter of the rundown session ETL to get WPA to open these files without warning messages.

@davmason
Member

Hi @AloisKraus,

We can use this issue to track the request to introduce a configuration flag to not emit these events.

For the existing runtimes, are you able to avoid lost events by increasing your buffer size while tracing?
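For illustration, a larger-buffer variant of the session start from the repro steps above could look like this (the specific sizes are placeholders for experimentation, not a recommendation from this thread):

C:\etl>xperf -start ClrSession -on Microsoft-Windows-DotNETRuntime -f C:\temp\ETW.clrSession.etl -BufferSize 1024 -minbuffers 512 -maxbuffers 1024 -FileMode Sequential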

@AloisKraus
Author

AloisKraus commented Oct 27, 2022

Pre-.NET 6 runtimes, including .NET 4.8, do not emit many large events. By default I use 512 KB buffers, which have been working fine for many years. When I look at the ETW manifest it is already 537 KB, so this will cause lost events whenever more than one process tries to write chunked manifest data to ETW:

 Directory of C:\Source\Git\CoreClr\src\vm
05/04/2018  08:04 AM           537,832 ClrEtwAll.man

As far as I know, lost events happen when an ETW buffer becomes full and no other buffer has been prepared to switch over; then all events logged for that buffer are discarded. This does not scale with the number of .NET 6+ processes.
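A rough back-of-the-envelope illustration, using the ~30 KB chunk size from the description and the 537 KB manifest above (my own estimate, not measured in this thread):

537 KB manifest / ~30 KB per chunk event ≈ 18 manifest events per .NET 6+ process
18 chunks ≈ 537 KB, i.e. more than one full 512 KB buffer per process in a single burst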

@brianrob
Member

@davmason, does this manifest always match the native provider? If it does, this feels like a good reason to not emit this event, as in pre-.NET 6, because the events may be coming either from an unregistered native provider or from the EventSource; thus, tools must know about the provider without depending on it being in the event stream. When NativeRuntimeEventSource was initially created, it was special-cased out of emitting its manifest because it was an implementation detail that it was an EventSource, and not actually an EventSource that represented its own set of events.

@davmason
Member

After discussion, it should be safe to not emit the manifest for NativeRuntimeEventSource

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 11, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Nov 13, 2022
@tommcdon tommcdon removed the tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly label Nov 14, 2022
@tommcdon tommcdon modified the milestones: Future, 8.0.0 Nov 14, 2022