Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change tt_cluster to use single tt_SiliconDevice #13949

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

broskoTT
Copy link
Contributor

@broskoTT broskoTT commented Oct 17, 2024

Ticket

First step towards #13948

What's changed

Changing tt_cluster so that it uses a single tt_SiliconDevice (to be renamed to tt::umd::Cluster) for all connected chips.
In UMD, the following changes were made prior to this change:

  • Some API tests added to track current API usage.
    • Made a test which explicitly shows our current driver doesn't support multiple separate clusters of chips (e.g. two N300 cards not connected through ethernet).
  • chip_id added to setup_core_to_tlb_map, since each mmio chip has it's own set of TLBs
  • Changed wait_for_non_mmio_flush to be able to accept chip_id, so it can run only for a specific remote chip.
  • chip_id added to get_pcie_base_addr_from_device
  • dynamic_tlb_config removed from constructor (this might be checked in with another PR before this one, since it is already in master)

Related UMD PRs

tenstorrent/tt-umd#166
tenstorrent/tt-umd#179
tenstorrent/tt-umd#165
tenstorrent/tt-umd#183
tenstorrent/tt-umd#131

Note that each of these UMD PRs are well scoped (as it, they have merit to be checked in by themselves), and they each slightly change the API, but are contributing to the overall effort of using a single tt_SiliconDevice, as is done in this PR.

Checklist

I've ran T3K and TGG workloads to make sure everything still works for multi chip environment:
https://github.com/tenstorrent/tt-metal/actions/runs/11398968066
https://github.com/tenstorrent/tt-metal/actions/runs/11398965728

T3K unit tests: https://github.com/tenstorrent/tt-metal/actions/runs/11441393489
T3K demo tests: https://github.com/tenstorrent/tt-metal/actions/runs/11441405195
TG unit tests: https://github.com/tenstorrent/tt-metal/actions/runs/11441415308
TG demos: https://github.com/tenstorrent/tt-metal/actions/runs/11441422195
TGG unit tests: https://github.com/tenstorrent/tt-metal/actions/runs/11441430017
TGG demos: https://github.com/tenstorrent/tt-metal/actions/runs/11441437612

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently pointing to brosko/tt_metal_singledriver_test, which holds several PRs merged.
Will keep this comment open until I switch to a commit on main in this PR.

@broskoTT
Copy link
Contributor Author

Is there a set of tests running on any multichip environment (ideally multi mmio chips), which I can run to increase confidence in this change?

Copy link
Contributor

@abhullar-tt abhullar-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good to me and okay to merge once the tests list we discussed offline have parity with main

if (is_tg_cluster_) {
num_host_mem_ch_per_mmio_device = HOST_MEM_CHANNELS;
}
uint32_t num_host_mem_ch_per_mmio_device = HOST_MEM_CHANNELS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets go for min(HOST_MEM_CHANNELS, number_of_chips) for now otherwise on some machines we might see the warning from umd that prints that X channels requested but only Y available

broskoTT added a commit to tenstorrent/tt-umd that referenced this pull request Oct 23, 2024
Currently tt_SiliconDevice accepts one map for everything. But that goes
against our effort to separate cluster vs chip responsibilities, related
to #157
This change includes:
- Adding chip_id to setup_core_to_tlb_map. There should be a map per
mmio chip
- Started a chip api tests file
- Added an api example/test on how TLB setup functions at the moment.
This also does some minor configuration testing, whether it throws or
not.
- Minor cosmetic changes to other API tests

Contributes to #154 since it adds more api tests.
This change will require tt_metal changes.

- Breaks existing usages of setup_core_to_tlb_map
- tt_metal corresponding PR:
tenstorrent/tt-metal#13949
- tt_debuda change: Not used
broskoTT added a commit to tenstorrent/tt-umd that referenced this pull request Oct 24, 2024
This attribute is tied to a chip. Although it could be argued that
currently we only support a cluster of chips of same architecture,
therefore this would be the same for all chips in a single cluster.
Still, this makes sense and further unblocks supporting multiple
architectures in a single cluster/driver. It will also allow it to be
lowered to TTDevice class which will is arch specific (currently
architecture_implementation).

Related to #157 , contributes to #154 .

- Breaks existing usages of get_pcie_base_addr_from_device
- tt_metal corresponding PR:
tenstorrent/tt-metal#13949
- tt_debuda change: Not used
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants