fix: do not filter slots for mixed-slot-type pools #9902

Merged
merged 12 commits into from
Sep 13, 2024

@carolinaecalderon carolinaecalderon commented Sep 9, 2024

Ticket

CM-503

Description

For agent-based deployments, if agents of different slot types are assigned to the same resource pool, change the "Compute Slots Allocated" label to "Unspecified Slots Allocated" to make the mixed status clear to the user.
Additionally, add an error log message in the zero-slot and multi-slot-type cases.
Finally, if a resource pool's slot type is TYPE_UNSPECIFIED, do not filter out any agents from the slot progress bar count.
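
A minimal sketch of the filtering and label rule described above, not the actual WebUI code. The `SlotType` union, the `AgentSlots` shape, and the `countPoolSlots` and `slotsAllocatedLabel` helpers are hypothetical names chosen for illustration, and the slot counts are made up:

```typescript
// Illustrative types only; the real WebUI types are not shown on this page.
type SlotType = 'cuda' | 'rocm' | 'cpu' | 'unspecified';

interface AgentSlots {
  agentId: string;
  slotType: SlotType;
  slots: number;
}

// Sum the slots that feed the progress bar for one resource pool.
// A pool whose slot type is 'unspecified' keeps every agent; otherwise only
// agents matching the pool's slot type are counted.
function countPoolSlots(agents: AgentSlots[], poolSlotType: SlotType): number {
  const counted =
    poolSlotType === 'unspecified'
      ? agents
      : agents.filter((a) => a.slotType === poolSlotType);
  return counted.reduce((sum, a) => sum + a.slots, 0);
}

// Choose the label text for the slot allocation bar.
function slotsAllocatedLabel(poolSlotType: SlotType): string {
  return poolSlotType === 'unspecified'
    ? 'Unspecified Slots Allocated'
    : 'Compute Slots Allocated';
}

// Hypothetical mixed pool: one CUDA agent and one CPU agent.
const mixedPool: AgentSlots[] = [
  { agentId: 'agent1', slotType: 'cuda', slots: 4 },
  { agentId: 'agent2', slotType: 'cpu', slots: 8 },
];

console.log(countPoolSlots(mixedPool, 'unspecified')); // 12
console.log(countPoolSlots(mixedPool, 'cuda')); // 4
```

With filtering on, the CPU agent's 8 slots would drop out of the bar for a mixed pool, which is the undercount this change avoids.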

See the new slots-allocated label for pools with multi-slot-type agents (pool1) or zero-slot agents (pool2).

Test Plan

See the unit tests in agent, which confirm that the slot type assigned in the resource summary is "zero" or "unspecified" for zero-slot or multi-slot-type agents. See the screenshots of the WebUI changes in the test cluster.
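
For reviewers who want the rule under test in one place, here is a rough TypeScript sketch of how a summary slot type could be derived. It is an assumption-laden illustration, not a port of the Go code in master/internal/rm/agentrm: `AgentSlotType`, `SummarySlotType`, `summarizeSlotType`, and the log message wording are all hypothetical.

```typescript
// Hypothetical names; the Go implementation is not reproduced here.
type AgentSlotType = 'cuda' | 'rocm' | 'cpu';
type SummarySlotType = AgentSlotType | 'zero' | 'unspecified';

// Derive the slot type reported in a pool's resource summary from the slot
// types of the slots its agents contribute. An error is logged for the
// zero-slot and mixed-slot-type cases.
function summarizeSlotType(
  slotTypes: AgentSlotType[],
  logError: (msg: string) => void = console.error,
): SummarySlotType {
  // Keep the first occurrence of each slot type.
  const distinct = slotTypes.filter((t, i) => slotTypes.indexOf(t) === i);
  if (distinct.length === 0) {
    logError('resource pool has no slots; reporting slot type "zero"');
    return 'zero';
  }
  if (distinct.length > 1) {
    logError(
      'resource pool has mixed slot types (' + distinct.join(', ') + '); reporting "unspecified"',
    );
    return 'unspecified';
  }
  return distinct[0];
}
```

Passing the logger as a parameter is only a convenience for testing the two error paths without capturing console output.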

For the release party, I really think this should be manually tested: spin up your own AWS Ubuntu devcluster (reach out to me for instructions) and configure devcluster to have 1 agent with all the GPUs and 1 agent with artificial slots.

You can access my demo cluster WebUI at http://54.84.91.59:8080/ (message me for the password). Resource pool pool1 has agents of multiple slot types (CUDA agent1 and CPU agent2). The configuration for these agents is:

  - agent:
      # Each agent stage should have a unique name for devcluster.
      name: agent1
      pre:
        - sh: make -C agent build
      config_file:
        master_host: 127.0.0.1
        master_port: 8080
        slot_type: cuda
        resource_pool: pool1
        container_master_host: $DOCKER_LOCALHOST
        # Often dtrain clusters have multiple gpus per agent.
        # Each agent needs a unique agent_id.
        agent_id: agent1
        visible_gpus: 3
  - agent:
      # Each agent stage should have a unique name for devcluster.
      name: agent2
      pre:
        - sh: make -C agent build
      config_file:
        master_host: 127.0.0.1
        master_port: 8080
        resource_pool: pool1
        container_master_host: $DOCKER_LOCALHOST
        artificial_slots: 8
        # Each agent needs a unique agent_id.
        agent_id: agent2
  - agent:
      # Each agent stage should have a unique name for devcluster.
      name: agent3
      pre:
        - sh: make -C agent build
      config_file:
        master_host: 127.0.0.1
        master_port: 8080
        slot_type: cuda
        resource_pool: pool2
        container_master_host: $DOCKER_LOCALHOST
        # Often dtrain clusters have multiple gpus per agent.
        # Each agent needs a unique agent_id.
        agent_id: agent3
        visible_gpus: 0, 1, 2

[Screenshots: Screen Shot 2024-09-12 at 10 12 08 AM; Screen Shot 2024-09-12 at 10 20 16 AM]

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@cla-bot cla-bot bot added the cla-signed label Sep 9, 2024

netlify bot commented Sep 9, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit e831f0e
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/66e4c1d0d21c94000890f4cf


codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 54.52%. Comparing base (a58ed7c) to head (e831f0e).
Report is 1 commits behind head on main.

Files with missing lines                           Patch %   Lines
webui/react/src/components/SlotAllocationBar.tsx   0.00%     3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9902   +/-   ##
=======================================
  Coverage   54.52%   54.52%           
=======================================
  Files        1252     1252           
  Lines      156551   156557    +6     
  Branches     3597     3600    +3     
=======================================
+ Hits        85356    85369   +13     
+ Misses      71063    71056    -7     
  Partials      132      132           
Flag      Coverage Δ
backend   45.14% <100.00%> (+0.01%) ⬆️
harness   72.75% <ø> (ø)
web       54.33% <62.50%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines                           Coverage Δ
master/internal/rm/agentrm/resource_pool.go        39.74% <100.00%> (ø)
master/internal/rm/agentrm/summaries.go            100.00% <100.00%> (+39.13%) ⬆️
webui/react/src/utils/cluster.ts                   100.00% <100.00%> (ø)
webui/react/src/components/SlotAllocationBar.tsx   16.40% <0.00%> (ø)

... and 5 files with indirect coverage changes

@determined-ci determined-ci requested a review from a team September 10, 2024 16:32
@determined-ci determined-ci added the documentation label Sep 10, 2024
@carolinaecalderon carolinaecalderon changed the title from "adding tmp devcluster" to "chore: clarify slots available for mixed-slot-type pools" Sep 11, 2024
@carolinaecalderon carolinaecalderon marked this pull request as ready for review September 11, 2024 21:33

@johnkim-det johnkim-det left a comment

one note but otherwise web LGTM

webui/react/src/utils/cluster.ts (outdated, resolved)

@ShreyaLnuHpe ShreyaLnuHpe left a comment

LGTM!

@carolinaecalderon carolinaecalderon changed the title from "chore: clarify slots available for mixed-slot-type pools" to "fix: do not filter slots for mixed-slot-type pools" Sep 13, 2024
@carolinaecalderon carolinaecalderon force-pushed the carolinac/cuda-escalation branch 2 times, most recently from 0a80940 to 21c9290 Compare September 13, 2024 19:20
@determined-ci determined-ci requested a review from a team September 13, 2024 21:19
@carolinaecalderon carolinaecalderon merged commit 3a2ea56 into main Sep 13, 2024
82 of 95 checks passed
@carolinaecalderon carolinaecalderon deleted the carolinac/cuda-escalation branch September 13, 2024 23:19
Labels
cla-signed documentation Improvements or additions to documentation
6 participants