Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ApplyStartupHook diagnostic IPC command #86813

Merged
merged 23 commits into from
Jun 13, 2023

Conversation

jander-msft
Copy link
Member

@jander-msft jander-msft commented May 26, 2023

These changes allow adding diagnostic startup hooks through a new diagnostic IPC command. The new command has a single string property on it that represents the path to the assembly on disk that contains the startup hook.

The diagnostic startup hooks are stored in a static variable and then passed to StartupHookProvider.ManagedStartup; here the list of hooks is combined with the ones from the STARTUP_HOOKS app context data and then processed as startup hooks normally would be processed. Tests for both the managed library changes and the event pipe have been added.

I did not implement this in Mono; I did not see a way of dynamically providing startup hooks (e.g. such as environment variable) and didn't want to enable this dynamic notion without a scenario in place. I also have very little understanding on how to implement it for that runtime, so I've fixed it up enough that it would pass an empty string for the diagnostic startup hooks, thus preserving the existing behavior.

I did not implement this for AOT scenarios since adding startup hooks there doesn't make sense (no managed code execution).

Corresponding diagnostics change: dotnet/diagnostics#3918

Partially addresses #83756

cc @dotnet/dotnet-monitor

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label May 26, 2023
@ghost
Copy link

ghost commented May 26, 2023

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

Issue Details

These changes allow adding diagnostic startup hooks through a new diagnostic IPC command. The new command has a single string property on it that represents the path to the assembly on disk that contains the startup hook.

The diagnostic startup hooks are stored in a separate host property called DIAGNOSTIC_STARTUP_HOOKS and is provided to the default app domain upon creation. The StartupHookProvider in System.Private.CoreLib library has been updated to look for this new setting in addition to the existing STARTUP_HOOKS setting.

There are currently no added/updated tests as I'm first looking for validation of the concept and feedback on the initial implementation; once settled, I can invest more time in adding tests.

Partially addresses #83756

cc @dotnet/dotnet-monitor

Author: jander-msft
Assignees: -
Labels:

area-Diagnostics-coreclr, community-contribution

Milestone: -

@jander-msft
Copy link
Member Author

cc @vitek-karas and @elinor-fung: wanted your feedback on where this new property should be get/set from the diagnostic command (currently, it's going through the host runtime contract), and how to compose that (and maybe the other host properties) with the list of properties that were passed into coreclr_initialize (which is looks to be the same set as those from the host runtime contract, but may be different depending on how the caller invokes coreclr_intialize). Any other thoughts around structures to use, tests that could be added, etc... would be appreciated.

@jander-msft
Copy link
Member Author

jander-msft commented May 30, 2023

ILLink suppression needs to be updated to remove the trim warnings: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/ILLink/ILLink.Suppressions.LibraryBuild.xml#L22

@jkotas
Copy link
Member

jkotas commented May 31, 2023

Test failures:

  Starting:    Loader.StartupHooks.XUnitWrapper (parallel test collections = on, max threads = 6)
    Loader/StartupHooks/StartupHookTests/StartupHookTests.sh [FAIL]
      Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
         at System.String.MakeSeparatorListVectorized(System.ReadOnlySpan`1<Char>, System.Collections.Generic.ValueListBuilder`1<Int32> ByRef, Char, Char, Char)
         at System.String.MakeSeparatorListAny(System.ReadOnlySpan`1<Char>, System.ReadOnlySpan`1<Char>, System.Collections.Generic.ValueListBuilder`1<Int32> ByRef)
         at System.String.SplitInternal(System.ReadOnlySpan`1<Char>, Int32, System.StringSplitOptions)
         at System.StartupHookProvider.ProcessStartupHooks(System.String)
         at StartupHookTests.ValidHookName()

Copy link
Member

@vitek-karas vitek-karas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really try to add some test for this new capability. Hopefully we already have some test infra somewhere to help with the setup of the diag channel.

src/coreclr/vm/assembly.cpp Outdated Show resolved Hide resolved
@jander-msft
Copy link
Member Author

We should really try to add some test for this new capability. Hopefully we already have some test infra somewhere to help with the setup of the diag channel.

There seems to be event pipe tests located here; I don't think any of them involve loading an assembly separate from the entrypoint assembly, so there may be additional work for enabling that compilation and locating of that assembly in the test output.

I can also add some variants of the existing tests here for the managed side.

@jander-msft
Copy link
Member Author

/azp run runtime

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 86813 in repo dotnet/runtime

@davmason
Copy link
Member

davmason commented Jun 7, 2023

I'd agree (but am not too familiar with the internal workings there). It was also only non-Windows (so using sockets vs named pipes), right? Could you file an issue? Unless @davmason thinks this is expected.

I need a little bit to investigate, I don't know off the top of my head. I will let you know later today if it's expected or not.

@davmason
Copy link
Member

davmason commented Jun 7, 2023

/azp run runtime

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@davmason
Copy link
Member

davmason commented Jun 8, 2023

I have done some research and the behavior of the ResumeRuntime command is by design. We keep track of each port's suspend status (so you can suspend for multiple sessions) and then pause startup until all ports that requested a suspend have marked themselves as resumed.

So if you have a reverse server port and then connect to the default diagnostic port to send ResumeRuntime, it will mark the default diagnostic port as resumed but keep suspended because the reverse server port is still waiting for its ResumeRuntime.

These two functions are the interesting ones, if anyone wants to dig through code:

ds_server_pause_for_diagnostics_monitor (void)
{
_is_paused_for_startup = true;
if (ds_ipc_stream_factory_any_suspended_ports ()) {
EP_ASSERT (ep_rt_wait_event_is_valid (&_server_resume_runtime_startup_event));
DS_LOG_ALWAYS_0 ("The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command.");
if (ep_rt_wait_event_wait (&_server_resume_runtime_startup_event, 5000, false) != 0) {
ds_rt_server_log_pause_message ();
DS_LOG_ALWAYS_0 ("The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command and has waited 5 seconds.");
ep_rt_wait_event_wait (&_server_resume_runtime_startup_event, EP_INFINITE_WAIT, false);
}
}
// allow wait failures to fall through and the runtime to continue coming up
}
void
ds_server_resume_runtime_startup (void)
{
ds_ipc_stream_factory_resume_current_port ();
if (!ds_ipc_stream_factory_any_suspended_ports () && ep_rt_wait_event_is_valid (&_server_resume_runtime_startup_event)) {
ep_rt_wait_event_set (&_server_resume_runtime_startup_event);
_is_paused_for_startup = false;
}
}

@jander-msft
Copy link
Member Author

I have done some research and the behavior of the ResumeRuntime command is by design. We keep track of each port's suspend status (so you can suspend for multiple sessions) and then pause startup until all ports that requested a suspend have marked themselves as resumed.

So if you have a reverse server port and then connect to the default diagnostic port to send ResumeRuntime, it will mark the default diagnostic port as resumed but keep suspended because the reverse server port is still waiting for its ResumeRuntime.

These two functions are the interesting ones, if anyone wants to dig through code:

ds_server_pause_for_diagnostics_monitor (void)
{
_is_paused_for_startup = true;
if (ds_ipc_stream_factory_any_suspended_ports ()) {
EP_ASSERT (ep_rt_wait_event_is_valid (&_server_resume_runtime_startup_event));
DS_LOG_ALWAYS_0 ("The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command.");
if (ep_rt_wait_event_wait (&_server_resume_runtime_startup_event, 5000, false) != 0) {
ds_rt_server_log_pause_message ();
DS_LOG_ALWAYS_0 ("The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command and has waited 5 seconds.");
ep_rt_wait_event_wait (&_server_resume_runtime_startup_event, EP_INFINITE_WAIT, false);
}
}
// allow wait failures to fall through and the runtime to continue coming up
}
void
ds_server_resume_runtime_startup (void)
{
ds_ipc_stream_factory_resume_current_port ();
if (!ds_ipc_stream_factory_any_suspended_ports () && ep_rt_wait_event_is_valid (&_server_resume_runtime_startup_event)) {
ep_rt_wait_event_set (&_server_resume_runtime_startup_event);
_is_paused_for_startup = false;
}
}

That wasn't the issue though; the hang problem was occurring from:

  1. Accept reverse connection stream
  2. Send ApplyStartupHook on default connection
  3. Dispose reverse connection stream
  4. Accept reverse connection stream (this is where the hang occurred)
  5. Send ResumeRuntime on reverse connection
  6. Dispose reverse connection stream

This commit fixed the test that shows changing from using the default connection for the ApplyStartupHook to using the reverse connection stream makes it work.

@jander-msft
Copy link
Member Author

It looks like the test legs that are cancelling due to timeouts are because the Helix runs are taking a lot longer than they normally should. I've looked an some of those and all of the work items eventually complete, just longer than the timeout for the AzDO job (which is 2.5 hrs). For example:

The other reported test failures seem to be unrelated to these changes.

Any guidance on what's the path forward from here?

@davmason
Copy link
Member

davmason commented Jun 8, 2023

That wasn't the issue though; the hang problem was occurring from:

  1. Accept reverse connection stream
  2. Send ApplyStartupHook on default connection
  3. Dispose reverse connection stream
  4. Accept reverse connection stream (this is where the hang occurred)
  5. Send ResumeRuntime on reverse connection
  6. Dispose reverse connection stream

This commit fixed the test that shows changing from using the default connection for the ApplyStartupHook to using the reverse connection stream makes it work.

In step 4 do you accept on the same port as step 1 or a different port? If it's the same port I am thinking to not bother with fixing it. It isn't the best behavior but since we expect to get the ResumeRuntime command on the same connection as the ReverseServer I don't think we need to spend a lot of effort making sure we fail the right way when we receive the command on a different port.

@jander-msft
Copy link
Member Author

That wasn't the issue though; the hang problem was occurring from:

  1. Accept reverse connection stream
  2. Send ApplyStartupHook on default connection
  3. Dispose reverse connection stream
  4. Accept reverse connection stream (this is where the hang occurred)
  5. Send ResumeRuntime on reverse connection
  6. Dispose reverse connection stream

This commit fixed the test that shows changing from using the default connection for the ApplyStartupHook to using the reverse connection stream makes it work.

In step 4 do you accept on the same port as step 1 or a different port? If it's the same port I am thinking to not bother with fixing it. It isn't the best behavior but since we expect to get the ResumeRuntime command on the same connection as the ReverseServer I don't think we need to spend a lot of effort making sure we fail the right way when we receive the command on a different port.

It's the same port.

@jander-msft
Copy link
Member Author

/azp run runtime

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 86813 in repo dotnet/runtime

@jander-msft
Copy link
Member Author

I keep forgetting I can't kick off new builds for this repo, can someone kick one off again for me? Trying to get a fresh PR build merge point and hoping to get through the backed up Helix queues.

@jander-msft
Copy link
Member Author

I was hoping to get a new build (not a rerun of the existing build) in order to see if the chrome-DebuggerTests and the build failure on freebsd could be avoided.

@jkotas
Copy link
Member

jkotas commented Jun 9, 2023

/azp run runtime

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jander-msft
Copy link
Member Author

Finally have a build that didn't time out.

I logged a know build issue (#87323) because the work items sent to the (AlmaLinux.8.Amd64.Open)[email protected]/dotnet-buildtools/prereqs:almalinux-8-helix-amd64 Helix queue are failing to be executed due to some Docker issue on the machines.

The Build Analysis seems to have attributed all of the failures to known issues.

I would like to ask for approvals and this to be merged if there isn't any additional feedback.

@davmason
Copy link
Member

I'm happy with the current state, @elinor-fung, @lateralusX, or @vitek-karas if you want to take a further look let us know, otherwise I will merge it after end of day Monday.

@davmason davmason merged commit 39bbff8 into dotnet:main Jun 13, 2023
@vitek-karas
Copy link
Member

Thanks a lot Justin, especially for being patient with us to figure out the final design.

@jander-msft jander-msft deleted the applystartuphook branch June 13, 2023 14:06
@ghost ghost locked as resolved and limited conversation to collaborators Jul 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Diagnostics-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants