
Events deadlock #226

Open

wants to merge 44 commits into master
Conversation

@bibrak bibrak commented Oct 30, 2024

Creates an internal DAG of events at runtime and maintains a topological sort over it. If a new dependency would introduce a cycle, inserting it into the topological sort returns an error, thereby detecting the deadlock.

@bibrak bibrak changed the title Events deadlock [Draft] [RFC] Events deadlock Oct 30, 2024
}

ze_result_t
eventsDeadlockChecker::ZEeventsDeadlockChecker::zeCommandListAppendMemoryCopyPrologue(
Contributor

I would be interested in seeing a high-level outline of how the algorithm works before getting to the implementation of specific functions:

  1. Which APIs will add things to graph
  2. Which APIs will remove things from graph
  3. Which APIs will execute deadlock detection mechanism


ze_result_t
eventsDeadlockChecker::ZEeventsDeadlockChecker::zeCommandListAppendMemoryCopyPrologue(
ze_command_list_handle_t hCommandList, ///< [in] handle of command list
Contributor

I think you may need to consider the specific properties of the command list to determine what to do with it.

For instance, a command list with ZE_COMMAND_LIST_FLAG_IN_ORDER has an implicit dependency between all things appended to it.

An immediate command list executes immediately, so you probably want to do deadlock detection immediately.

A non-immediate command list does not execute until it is explicitly submitted, so you may want to delay deadlock detection until that point. (Or maybe not - not really sure on this one!)

Author

The deadlock detection happens as the API (action) is called from the user app (it doesn't matter how the driver stages it in queues). The detection is based only on the hSignalEvent and any phWaitEvents events.


if (eventToDagID.find(hSignalEvent) != eventToDagID.end()) {
if (eventToDagID[hSignalEvent] != invalidDagID) {
// This event already exists in the DAG. Get the DAG node ID.
Contributor

Just a nitpick: it's not really a DAG if there is a deadlock, because then there is a cycle.

Author

Yes, when an edge would create a cycle, the underlying DAG structure does not accept that .add() and returns an error. This way we know there is a cycle and hence a deadlock.

SUCCESS_OR_TERMINATE(zeMemAllocDevice(context, &device_desc, buffer_size, 0, pDevice, &device_mem_ptr));

std::cout << "\n\n\n";

Contributor

@bmyates bmyates Oct 30, 2024

There is an API, zeEventHostSignal, that allows signaling events from the host. If that API were called at any point in time for event[2], then everything would work.

There is also the possibility that a later GPU command signals this event, in which case there is also no problem.

This seems intractable to me. I don't know that it is solvable, because there is no way to predict what will happen at some point in the infinite future.

Author

@bibrak bibrak Oct 30, 2024

But even in that case, wouldn't it be an ill-formed use of the L0 API? Like a single action being signaled twice? And even if it were OK, we would then have an infinite cycle and a deadlock the second time it is triggered.


private:
// events point from/out to a DAG node. This map stores the DAG ID for each event (if there is one).
std::unordered_map<ze_event_handle_t, int> eventToDagID;
Contributor

Events can be reused many times and can be associated with many different command list actions. It seems like this is limiting events to only being used once.

Author

Yes, I have yet to recycle them. I will have to come up with something like eventToDagID[event] = invalidDagID; when a reset is called, or for the myriad other ways that could happen that I am not aware of, and I will need help with that later.

std::cout << "\tDAG: Adding edge from " << dagID << " to " << this_action_new_node_id << std::endl;
} else {
std::cerr << "eventsDeadlockChecker: zeCommandListAppendMemoryCopyPrologue: Error: Wait event not found in eventToDagID map" << std::endl;
std::terminate();
Contributor

If something goes wrong we should return an error instead of terminating.

We also need to make sure any error messages we log are understandable to someone who is not familiar with the internals here.

Contributor

@rwmcguir rwmcguir left a comment

We need the original Apache 2.0 LICENSE file from the XLA project put into the thirdparty/xla folder

Related-To: NEO-12810

Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
@bibrak bibrak changed the title [Draft] [RFC] Events deadlock [Draft] Events deadlock Nov 5, 2024
@bibrak bibrak changed the title [Draft] Events deadlock Events deadlock Nov 5, 2024
// SPDX-License-Identifier: Apache-2.0
// Copyright (C) 2024 Intel Corporation
Contributor

@lisanna-dettwyler a question: in our other normal source files we have nomenclature like
//* @file graphcycles.h
in our comments. Should we insert these?
From a legal perspective this is not necessary, but is infrastructure using this for any reason?

Contributor

@rwmcguir rwmcguir left a comment

From Licensing perspective this looks good.
You have my informal +1 on License basis...

@@ -18,6 +18,7 @@ By default, no validation modes will be enabled. The individual validation modes

- `ZE_ENABLE_PARAMETER_VALIDATION`
- `ZE_ENABLE_HANDLE_LIFETIME`
- `ZEL_ENABLE_EVENTSDEADLOCK_CHECKER`


Did we want to rename this to simply "events_checker"? That way it can perform deadlock detection and any other checks on events in the future.

Author

That is a good idea.

Author

Done!

@@ -48,6 +49,9 @@ This mode maintains an internal mapping of each handle type to a state structure
- Additional per handle state checks added as needed
- Example - Check ze_cmdlist_handle_t open or closed

### `ZEL_ENABLE_EVENTSDEADLOCK_CHECKER`

The Events Deadlock Checker is designed to detect potential deadlocks that might occur due to improper event usage in the Level Zero API. It prints out wairning messages for user when it detects a potential deadlock.


typo "wairning"

Author

Fixed

namespace validation_layer {

constexpr uint32_t invalidDagID = std::numeric_limits<uint32_t>::max();
constexpr ze_event_handle_t invalidEventAddress = std::numeric_limits<ze_event_handle_t>::max();


what is max of an event_handle?

Author

I think the max value of the pointer?

eventToDagID.erase(hEvent);
}

return ZE_RESULT_SUCCESS;


General question: how is the epilogue call handled? If the call to the driver returns an error, does the epilogue still get called? Also, does the epilogue need to return the return value from the driver call?


@nrspruit do you know?

Contributor

Regardless of the result, the epilogue will be called, as here:
https://github.com/oneapi-src/level-zero/blob/master/source/layers/validation/ze_valddi.cpp#L49

Perhaps we could expand this to pass the "result" into the epilogue so that a check does not occur, or not call the epilogue at all given an error?

Author

Yes, either don't call the epilogue altogether if there is an error in the actual zeCall, or pass the error into the epilogue, and then it is up to the checker's logic and behavior what to do in case of error.

In the latter case, it will have to change the API for all checkers for all zeAPI calls. Or just for ZeEventDestroy()?

Again, in the former case, will it be only for ZeEventDestroy() or for all zeAPI calls?

@bibrak bibrak requested a review from rwmcguir November 12, 2024 17:18
rwmcguir previously approved these changes Nov 12, 2024
5 participants