-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Events deadlock #226
base: master
Are you sure you want to change the base?
Events deadlock #226
Conversation
source/layers/validation/checkers/events_deadlock/zel_events_deadlock_checker.h
Outdated
Show resolved
Hide resolved
} | ||
|
||
ze_result_t | ||
eventsDeadlockChecker::ZEeventsDeadlockChecker::zeCommandListAppendMemoryCopyPrologue( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be insterested in seeing a high level outline of how the algorithm works before getting to implementation of specific functions
- Which APIs will add things to graph
- Which APIs will remove things from graph
- Which APIs will execute deadlock detection mechanism
|
||
ze_result_t | ||
eventsDeadlockChecker::ZEeventsDeadlockChecker::zeCommandListAppendMemoryCopyPrologue( | ||
ze_command_list_handle_t hCommandList, ///< [in] handle of command list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may need to consider the specific properties of the command list to determine what to do with it.
For instance, a command list with ZE_COMMAND_LIST_FLAG_IN_ORDER
has an implicit dependency between all things appended to it.
An immediate command list executes immediately and you probably want to do deadlock deetection immediately
a non-immediate command list does not execute until it is explicitly submittted, and you may want to delay deadlock detection until that point. (Or maybe not -not really sure on this one!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deadlock detection happens as the API (action) is called from the user app (doesn't matter how the driver stages it in queues). The detection is only based on the hSignalEvent
, and any phWaitEvents
events.
|
||
if (eventToDagID.find(hSignalEvent) != eventToDagID.end()) { | ||
if (eventToDagID[hSignalEvent] != invalidDagID) { | ||
// This event already exists in the DAG. Get the DAG node ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a nitpick, it's not really a DAG if there is a deadlock - because then there is a cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when a cycle is added the underlying DAG structure will not accept that .add()
and will return error. This way we know there is a cycle and hence a deadlock.
SUCCESS_OR_TERMINATE(zeMemAllocDevice(context, &device_desc, buffer_size, 0, pDevice, &device_mem_ptr)); | ||
|
||
std::cout << "\n\n\n"; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an api zeEventHostSignal
that allows signalling events from host. If that API were called at any point in time for event[2], then everything would work.
There is also the possibility that a later gpu command signals this event, and then there is also no problem in that case.
This seems intractable to me. I don't know that it is solvable because there is no way to predict what will happen at some point in the infinite future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But even in that case won't it be an ill formed use of the L0 API? Like a single action is being signaled twice? And even if it were ok then we will have an infinite cycle and a deadlock the second time it is triggered.
|
||
private: | ||
// events point from/out to a DAG node. This map stores the DAG ID for each event (if there is one). | ||
std::unordered_map<ze_event_handle_t, int> eventToDagID; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
events can be reused many times and can be associated with many different command list actions. It seems like this is limiting events to only being used once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have yet to recycle them. I will have to come up with something like eventToDagID[event] = invalidDagID;
when there is a reset called or myriad of those ways that could happen that I am not aware of and will need help later.
std::cout << "\tDAG: Adding edge from " << dagID << " to " << this_action_new_node_id << std::endl; | ||
} else { | ||
std::cerr << "eventsDeadlockChecker: zeCommandListAppendMemoryCopyPrologue: Error: Wait event not found in eventToDagID map" << std::endl; | ||
std::terminate(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If something goes wrong we sohuld return an error instead of terminating.
We also need to make sure any error messages we log are understandable to someone that is not familiar with the internals here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need the original Apache 2.0 LICENSE file from the XLA project put into the thirdparty/xla folder
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
00217b0
to
dcaf532
Compare
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
// SPDX-License-Identifier: Apache-2.0 | ||
// Copyright (C) 2024 Intel Corporation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lisanna-dettwyler a question in our other normal source files we have nomenclature like
//* @file graphcycles.h
in our comments. Should we insert these?
from legal perspective this is not necessary, but is infrastructure using this for anyreason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From Licensing perspective this looks good.
You have my informal +1 on License basis...
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
source/layers/validation/README.md
Outdated
@@ -18,6 +18,7 @@ By default, no validation modes will be enabled. The individual validation modes | |||
|
|||
- `ZE_ENABLE_PARAMETER_VALIDATION` | |||
- `ZE_ENABLE_HANDLE_LIFETIME` | |||
- `ZEL_ENABLE_EVENTSDEADLOCK_CHECKER` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we want to rename this to simply "events_checker" ? that way it can perform deadlock detection and any other checks on events as well in future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
source/layers/validation/README.md
Outdated
@@ -48,6 +49,9 @@ This mode maintains an internal mapping of each handle type to a state structure | |||
- Additional per handle state checks added as needed | |||
- Example - Check ze_cmdlist_handle_t open or closed | |||
|
|||
### `ZEL_ENABLE_EVENTSDEADLOCK_CHECKER` | |||
|
|||
The Events Deadlock Checker is designed to detect potential deadlocks that might occur due to improper event usage in the Level Zero API. It prints out wairning messages for user when it detects a potential deadlock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "wairning"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
namespace validation_layer { | ||
|
||
constexpr uint32_t invalidDagID = std::numeric_limits<uint32_t>::max(); | ||
constexpr ze_event_handle_t invalidEventAddress = std::numeric_limits<ze_event_handle_t>::max(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is max of an event_handle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the max value of the pointer?
eventToDagID.erase(hEvent); | ||
} | ||
|
||
return ZE_RESULT_SUCCESS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
general question: how is epilogue call handled? if the call to driver returns an error then does epilogue get called? also does epilogue need to return the return value from the driver call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nrspruit do you know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regardless of the result, the epilogue will be called like here:
https://github.com/oneapi-src/level-zero/blob/master/source/layers/validation/ze_valddi.cpp#L49 .
Perhaps we could expand this to pass in the "result" to the epilogue such that a check does not occur or don't call epilogue given an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, either not call epilogue altogether if error in the actual zeCall or pass the error in the epilogue and then its as per the checker logic and behavior what it needs to do in case of error.
In the later case, then it will have to change API for all checkers for all zeAPI calls. Or just for the ZeEventDestroy()?
Again, in the former case it will be only for ZeEventDestroy() or all ZeAPI calls?
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Related-To: NEO-12810 Signed-off-by: Chandio, Bibrak Qamar <[email protected]>
Creates an internal DAG of the events at runtime. From that DAG maintains a topological sort. If there is a cycle in the DAG, adding that dependency in the topological sort will return error hence detecting the deadlock.