(shortfin-sd) Adds program isolation optionality and fibers_per_device. #360

monorimet · 2024-10-29T20:49:12Z

No description provided.

monorimet · 2024-10-29T21:04:51Z

This sets HIP_VISIBLE_DEVICES from the .yaml. It could also be exposed as a pytest option (it is already an option for the server CLI), but this is the easiest and most obvious approach. #342 describes why this method was chosen for the runner (it is a little faster to init).

ScottTodd · 2024-10-29T21:07:42Z

.github/workflows/ci-sdxl.yaml

        ctest --timeout 30 --output-on-failure --test-dir build
-        pytest tests/apps/sd/e2e_test.py -v -s --system=amdgpu
+        HIP_VISIBLE_DEVICES=0 pytest tests/apps/sd/e2e_test.py -v -s --system=amdgpu


I think @saienduri also did something with HIP_VISIBLE_DEVICES for having multiple github actions runner instances on the same node, with one runner per GPU.

The runner setup doesn't seem to be restricting the visible devices: with no visibility specified, a test failed due to a known multi-GPU bug. That's OK as long as we don't mind using an environment variable here. If we switch to a runner that manages that variable, and multi-GPU still isn't fixed, we can remove the usage here.

@monorimet can you try using ROCR_VISIBLE_DEVICES? That's what we've been using in IREE and SHARK-TestSuite
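
For local experiments, the same device restriction can be applied from Python instead of the shell. The following is a minimal sketch, not code from this PR: the helper names `env_with_visible_devices` and `run_with_devices` are hypothetical, and the only details taken from the thread are the two variable names (`HIP_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES`). The variable is set in the child's environment before the process starts.

```python
import os
import subprocess
import sys


def env_with_visible_devices(device_ids, var="ROCR_VISIBLE_DEVICES"):
    """Return a copy of the current environment with `var` set to the device ids."""
    env = dict(os.environ)
    env[var] = ",".join(str(d) for d in device_ids)
    return env


def run_with_devices(cmd, device_ids, var="ROCR_VISIBLE_DEVICES"):
    """Run `cmd` in a child process that inherits the restricted environment."""
    return subprocess.run(
        cmd,
        env=env_with_visible_devices(device_ids, var),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # The child only echoes the variable; a real run would launch pytest instead.
    child = [sys.executable, "-c",
             "import os; print(os.environ['ROCR_VISIBLE_DEVICES'])"]
    print(run_with_devices(child, [0]).stdout.strip())
```

Passing `var="HIP_VISIBLE_DEVICES"` gives the behavior of the workflow line above; which variable a given runner should use is the open question in this thread.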

monorimet · 2024-10-30T00:22:28Z

shortfin/python/shortfin_apps/sd/components/service.py

                worker = sysman.ls.create_worker(f"{name}-inference-{device.name}-{i}")
                fiber = sysman.ls.create_fiber(worker, devices=[device])
                self.workers.append(worker)
                self.fibers.append(fiber)
-                self.locks.append(asyncio.Lock())
+                self.fiber_status.append(0)


I'm not in love with this, but I haven't dreamed up a different way to go about it yet. The context manager was making it difficult to toggle per call.

stellaraccident

Ok, let's give this a go and see where it takes us. I think we'll end up landing on a simpler set of options but we can break it down then.

stellaraccident · 2024-10-30T00:30:42Z

shortfin/python/shortfin_apps/sd/components/service.py

                worker = sysman.ls.create_worker(f"{name}-inference-{device.name}-{i}")
                fiber = sysman.ls.create_fiber(worker, devices=[device])
                self.workers.append(worker)
                self.fibers.append(fiber)
-                self.locks.append(asyncio.Lock())
+                self.fiber_status.append(0)


Yeah, we're missing a data structure. If I read this right, you just want to be able to pick a free fiber, right? There should really be some kind of a Pool or something which we don't have yet, but you can fake it with a simple fiber idle list: all available fibers go on the idle_list. Then when you need one, you pop and have the fiber put itself back when done. Something like that. You'd typically use a data structure that can yield if none are available but I think you are somehow never managing to underflow here? The underflow blocking could be faked today with a Queue or something like that.
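
The idle-list idea above can be sketched with `asyncio.Queue`, which already provides the "yield if none are available" behavior. This is an illustration under stated assumptions, not the implementation in this PR: `FiberIdlePool`, `pop`, and `push` are hypothetical names, and plain strings stand in for shortfin fibers.

```python
import asyncio


class FiberIdlePool:
    """Idle list of fibers; pop() waits when every fiber is busy."""

    def __init__(self, fibers):
        self._idle = asyncio.Queue()
        for fiber in fibers:
            self._idle.put_nowait(fiber)

    async def pop(self):
        # Suspends the caller until a fiber has been pushed back.
        return await self._idle.get()

    def push(self, fiber):
        # Called when the work on this fiber is done.
        self._idle.put_nowait(fiber)

    def idle_count(self):
        return self._idle.qsize()


async def demo():
    # Strings are placeholders for real fiber objects (assumption).
    pool = FiberIdlePool(["fiber-0", "fiber-1"])
    order = []

    async def job(n):
        fiber = await pool.pop()
        try:
            order.append((n, fiber))
            await asyncio.sleep(0.01)  # stand-in for an inference call
        finally:
            pool.push(fiber)

    # Four jobs share two fibers; the extra jobs wait instead of underflowing.
    await asyncio.gather(*(job(n) for n in range(4)))
    return order, pool.idle_count()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Explicit `pop`/`push` calls are used here rather than a context manager so that a caller can decide per call whether to take a fiber from the pool at all.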

monorimet force-pushed the sfsd-concurrency branch from 5014b1f to a48fc23 Compare October 29, 2024 20:55

Adds program isolation optionality and fibers_per_device.

5b8dbdd

monorimet force-pushed the sfsd-concurrency branch from a48fc23 to 5b8dbdd Compare October 29, 2024 20:57

Specify device visibility with an env var.

5c8f1fb

ScottTodd reviewed Oct 29, 2024

eagarvey-amd and others added 4 commits October 29, 2024 16:11

Recover test changes from rebase

4dc6bf9

Small fix to test and disable tqdm progress by default.

dea7b3a

Pipe through progress option properly.

70d0ba6

Merge branch 'main' into sfsd-concurrency

bc805d2

monorimet enabled auto-merge (squash) October 29, 2024 21:51

Merge branch 'main' into sfsd-concurrency

57b92b9

monorimet disabled auto-merge October 29, 2024 23:20

monorimet requested a review from stellaraccident October 30, 2024 00:16

monorimet commented Oct 30, 2024

stellaraccident approved these changes Oct 30, 2024

monorimet enabled auto-merge (squash) October 30, 2024 00:33

Merge branch 'main' into sfsd-concurrency

df13660

monorimet merged commit c1176b6 into main Oct 30, 2024
11 checks passed

monorimet deleted the sfsd-concurrency branch October 30, 2024 00:37

stbaione mentioned this pull request Nov 5, 2024

Create Fiber Pools to Enable Batch Requests to Shortfin LLM Server #428

Open
