Global mutex issues #72

Open
broskoTT opened this issue Sep 23, 2024 · 3 comments

Comments

@broskoTT
Contributor

After talking with @tt-vjovanovic, he raised a concern with me about the global mutexes we use:

  • The mutexes don't have "tt_" or anything similar in their names, so from outside of our code it is hard to tell, at the system level, which process they belong to.
  • The mutexes aren't cleaned up when code crashes while holding them. This leaves hanging mutexes, and until you manually clean them up, UMD won't work properly on that system.

The task at hand is to refactor how these mutexes are used to address these issues (a rough sketch of one option is at the end of this comment).
It is still not clear, however, what the right path to achieve this is:

  • Refactor existing mutexes
  • According to the new UMD design, it might make sense to tie them to some of the new classes. This could come naturally from the previous point and from the redesign itself.
  • @joelsmithTT worked at one point on moving them to KMD; maybe that is the path to take
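
For illustration only, here is a minimal sketch of the "refactor existing mutexes" option, assuming POSIX shared memory and pthread robust mutexes are acceptable. The class name, shared-memory naming scheme, and error handling are all hypothetical, not UMD's current implementation; it only shows how a "tt_"-prefixed name plus a robust mutex would address the identification and crash-cleanup points above.

```cpp
// Sketch only: a "tt_"-prefixed, crash-tolerant interprocess mutex.
// All names here are illustrative, not existing UMD APIs.
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cerrno>
#include <stdexcept>
#include <string>

class TTNamedMutex {
public:
    explicit TTNamedMutex(const std::string& resource) {
        // A "tt_" prefix makes the object easy to spot under /dev/shm from outside our code.
        std::string shm_name = "/tt_umd_" + resource;
        int fd = shm_open(shm_name.c_str(), O_CREAT | O_RDWR, 0666);
        if (fd < 0) throw std::runtime_error("shm_open failed");
        ftruncate(fd, sizeof(pthread_mutex_t));
        mutex_ = static_cast<pthread_mutex_t*>(
            mmap(nullptr, sizeof(pthread_mutex_t), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        close(fd);

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        // ROBUST means a holder that dies does not hang every other process forever.
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        // Note: real code needs a once-only guard so two processes don't both init the mutex.
        pthread_mutex_init(mutex_, &attr);
        pthread_mutexattr_destroy(&attr);
    }

    void lock() {
        int rc = pthread_mutex_lock(mutex_);
        if (rc == EOWNERDEAD) {
            // Previous owner crashed while holding the lock; mark the state consistent and continue.
            pthread_mutex_consistent(mutex_);
        }
    }

    void unlock() { pthread_mutex_unlock(mutex_); }

private:
    pthread_mutex_t* mutex_ = nullptr;
};
```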
@joelsmithTT
Contributor

The presence of these mutexes reflects a system design flaw. Consider the case of the ARC_MSG mutexes: like the other mutexes in UMD, they are implemented with shared memory so that they work in a multiprocess context. If two separate UMD-based applications attempt to message a device's ARC firmware simultaneously, the mutex serializes the accesses. (The same is true for a single UMD-based application running with multiple threads.)

The design is flawed because UMD is not the only software that can interact with hardware. Applications based on the Luwen library can message the ARC firmware without participating in UMD's mutex scheme. The same is true of Python-based tooling used internally.

There are a variety of bad techniques one could try to solve this:

  • Don't run programs that try to message the ARC (or use any other similarly guarded chip resource) simultaneously. This is the situation today. This isn't realistic: users want to run e.g. tt-smi simultaneously with an ML workload.
  • Everything that accesses the chip does so through UMD. This is not realistic: tools built out of the Luwen library are not going away.
  • Everything that accesses the chip agrees upon and respects a locking convention to guard hardware resources. KMD has basic support for this, but there is no established convention. Moreover, the KMD implementation does not handle lock contention properly: it forces userspace to poll. A userspace-based approach is also possible: teach Luwen and the Python code to touch the same shared memory that UMD uses. This has other drawbacks, including the cleanup problem mentioned in this issue.

There is a good solution:

  • Lower resource management to KMD. KMD can enforce serialization. This would require augmenting the KMD interface.
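
To make the last point concrete, here is a hypothetical sketch of what an augmented KMD interface could look like from userspace. The ioctl names and the struct below do not exist in the current KMD; they only illustrate the idea that the kernel would block contending callers (no userspace polling) and release the lock automatically if the holding process dies.

```cpp
// Hypothetical sketch of a KMD-arbitrated resource lock. The ioctl numbers,
// names, and struct are illustrative, not part of the existing KMD interface.
#include <sys/ioctl.h>
#include <cstdint>

struct tt_lock_resource {          // hypothetical ioctl argument
    uint32_t resource_id;          // e.g. an ID for the ARC_MSG queue
    uint32_t flags;                // e.g. blocking vs. try-lock
};

#define TENSTORRENT_IOCTL_LOCK_RESOURCE   _IOW('T', 0x20, struct tt_lock_resource)
#define TENSTORRENT_IOCTL_UNLOCK_RESOURCE _IOW('T', 0x21, struct tt_lock_resource)

int message_arc(int device_fd) {
    tt_lock_resource req{/*resource_id=*/0, /*flags=*/0};

    // The kernel sleeps this caller until the resource is free and cleans up
    // the lock if the holder exits or crashes while holding it.
    if (ioctl(device_fd, TENSTORRENT_IOCTL_LOCK_RESOURCE, &req) != 0) return -1;

    // ... write the ARC message here ...

    ioctl(device_fd, TENSTORRENT_IOCTL_UNLOCK_RESOURCE, &req);
    return 0;
}
```

A scheme like this would work for UMD, Luwen-based tools, and Python tooling alike, since they all go through the same device file.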

@broskoTT
Contributor Author

Thanks for the additional info.

You identified two approaches to solving this issue:

  • Use the same mechanism across different codepaths, which in our context specifically means using the same mutexes in Luwen and UMD. I think this is harder to maintain.
  • Have a single point of entry to our device. You mention KMD, but maybe UMD can be that point as well. If we call something a driver (even though it is user mode), I believe most of our tools should use it. We should not use different drivers; rather, we should change the one driver as needed.

@TTDRosen
Contributor

TLDR for the below: I think that KMD, not UMD, should be where the resource management for the chips lives.

I agree that we should have a single point of entry to simplify the device communication story. What it looks like you are proposing is that we take the GPU route and turn UMD into a global dynamically linked library. This could work, but it's not clear to me how you gather and maintain a global view. From my understanding, your two options for getting and maintaining a global view are to stick it in the driver (in which case, why are we pretending that UMD is a requirement?) or to use the filesystem (at which point the original issues raise themselves again; docker containers also become harder to set up). Furthermore, Linux requires that there be a single KMD per PCI device but places no such restriction on UMD. Therefore KMD, with its global view of our device's PCI resources, should be the one to arbitrate and hand those out.

In addition, UMD is not trying to do only PCI resource management; it is also supposed to support simulation and whatever other interfaces customers want to implement. We should not fall into the trap of making it something for everyone.

UMD wants to be a high-level "just read/write" interface. We should not tie a global install to an interface like that; I think it's far simpler to just install it as a dependency. We already see the trouble we have with fw flashing.
