Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(shortfin-sd) Adds program isolation optionality and fibers_per_device. #360

Merged
merged 8 commits into from
Oct 30, 2024

Conversation

monorimet
Copy link
Contributor

No description provided.

@monorimet
Copy link
Contributor Author

This sets HIP_VISIBLE_DEVICES from the .yaml -- it can be exposed as a pytest option as well (it's an option for the server CLI) but this is easiest and most obvious. #342 describes why this method was chosen for the runner (it's a little faster to init)

Comment on lines 101 to +102
ctest --timeout 30 --output-on-failure --test-dir build
pytest tests/apps/sd/e2e_test.py -v -s --system=amdgpu
HIP_VISIBLE_DEVICES=0 pytest tests/apps/sd/e2e_test.py -v -s --system=amdgpu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sets HIP_VISIBLE_DEVICES from the .yaml -- it can be exposed as a pytest option as well (it's an option for the server CLI) but this is easiest and most obvious. #342 describes why this method was chosen for the runner (it's a little faster to init)

I think @saienduri also did something with HIP_VISIBLE_DEVICES for having multiple github actions runner instances on the same node, with one runner per GPU.

Copy link
Contributor Author

@monorimet monorimet Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runner setup doesn't seem to be restricting the visible devices -- with no specified visibility, a test failed due to a known multi-gpu bug. Thats OK as long as we don't mind using an environment variable here. If we switch to a runner that manages that variable, and multi-gpu still isn't fixed, we can remove the usage here.

Copy link
Contributor

@saienduri saienduri Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monorimet can you try using ROCR_VISIBLE_DEVICES? That's what we've been using in IREE and SHARK-TestSuite

@monorimet monorimet enabled auto-merge (squash) October 29, 2024 21:51
worker = sysman.ls.create_worker(f"{name}-inference-{device.name}-{i}")
fiber = sysman.ls.create_fiber(worker, devices=[device])
self.workers.append(worker)
self.fibers.append(fiber)
self.locks.append(asyncio.Lock())
self.fiber_status.append(0)
Copy link
Contributor Author

@monorimet monorimet Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not in love with this, but haven't dreamed up a different way about it yet. The context manager was making per-call difficult to toggle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we're missing a data structure. If I read this right, you just want to be able to pick a free fiber, right? There should really be some kind of a Pool or something which we don't have yet, but you can fake it with a simple fiber idle list: all available fibers go on the idle_list. Then when you need one, you pop and have the fiber put itself back when done. Something like that. You'd typically use a data structure that can yield if none are available but I think you are somehow never managing to underflow here? The underflow blocking could be faked today with a Queue or something like that.

Copy link
Contributor

@stellaraccident stellaraccident left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's give this a go and see where it takes us. I think we'll end up landing on a simpler set of options but we can break it down then.

worker = sysman.ls.create_worker(f"{name}-inference-{device.name}-{i}")
fiber = sysman.ls.create_fiber(worker, devices=[device])
self.workers.append(worker)
self.fibers.append(fiber)
self.locks.append(asyncio.Lock())
self.fiber_status.append(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we're missing a data structure. If I read this right, you just want to be able to pick a free fiber, right? There should really be some kind of a Pool or something which we don't have yet, but you can fake it with a simple fiber idle list: all available fibers go on the idle_list. Then when you need one, you pop and have the fiber put itself back when done. Something like that. You'd typically use a data structure that can yield if none are available but I think you are somehow never managing to underflow here? The underflow blocking could be faked today with a Queue or something like that.

@monorimet monorimet enabled auto-merge (squash) October 30, 2024 00:33
@monorimet monorimet merged commit c1176b6 into main Oct 30, 2024
11 checks passed
@monorimet monorimet deleted the sfsd-concurrency branch October 30, 2024 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants