Parallel build with limited resource #981

Closed

@phu0ngng phu0ngng commented Jul 2, 2024

Description

  • Enforcing csrc build with Ninja for all frameworks.
  • The pyproject.toml is used to check and install required packages, thus all the found_xxx() functions were removed.
  • By default, a Ninja build uses all available threads (equivalent to make -j). The maximum number of build threads can be capped with the NVTE_MAX_BUILD_JOBS or MAX_JOBS environment variables.
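
As a rough illustration of the job cap described in the last item, here is a minimal sketch of how a build script might read the two environment variables and pass the limit to Ninja. The helper names max_build_jobs and ninja_command are hypothetical, and the precedence (NVTE_MAX_BUILD_JOBS checked before MAX_JOBS) is an assumption; the PR description does not say which variable wins.

```python
import os


def max_build_jobs(default=None):
    """Return the job cap from the environment, or `default` if none is set.

    Assumption: NVTE_MAX_BUILD_JOBS takes precedence over MAX_JOBS.
    """
    for name in ("NVTE_MAX_BUILD_JOBS", "MAX_JOBS"):
        value = os.environ.get(name, "").strip()
        if value:
            jobs = int(value)
            if jobs < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")
            return jobs
    return default


def ninja_command(build_dir):
    """Build a Ninja command line, adding -j only when a cap is set."""
    cmd = ["ninja", "-C", build_dir]
    jobs = max_build_jobs()
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    return cmd
```

With neither variable set, the sketch leaves out -j so that Ninja picks its own default level of parallelism.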

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 timmoon10 left a comment

For future reference:

timmoon10 commented Jul 3, 2024

We abandoned this PR based on the following chain of logic:

  • We wanted to make Ninja a build-time dependency so we can have consistent build parallelization.
  • Setuptools has deprecated the setup_requires kwarg to setuptools.setup (see deprecated keywords in the list of setuptools keywords). The recommended approach is to create a pyproject.toml with a [build-system] table (see this PyPA guide).
  • When Pip detects a pyproject.toml, it uses build isolation (see Pip docs). That is, it builds within a temporary virtual environment with only build-time dependencies. As far as I can tell, this can only be disabled by the user running pip install --no-build-isolation. This is a deliberate design by the Python developers to enforce their vision of package hygiene.
  • However, building PyTorch and Paddle extensions requires access to Setuptools wrappers (see torch.utils.cpp_extension). It's also important for Transformer Engine to be framework-agnostic, so our set of build-time dependencies is dynamic.
  • We must either ask users to change their build workflows, try to circumvent Pip's build isolation, or find some way to specify dynamic build-time dependencies.
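
To make the "dynamic build-time dependencies" point concrete, here is a minimal sketch of detecting at build time which frameworks are installed. The helper name detect_frameworks and the candidate list are assumptions for illustration, not the project's actual code. Inside Pip's isolated build environment, a check like this would find nothing unless the frameworks were listed as build requirements, which is the conflict described above.

```python
import importlib.util


def detect_frameworks(candidates=("torch", "paddle")):
    """Return the candidate top-level modules importable in this environment.

    find_spec() locates a top-level module without importing it, so the check
    stays cheap even for large frameworks.
    """
    return [
        name
        for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    # In an isolated build environment this typically prints an empty list.
    print(detect_frameworks())
```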

We're not the first to note that build isolation is poorly suited to the ML ecosystem (see astral-sh/uv#1715). We should keep this in mind for the future in case we need to modernize the build process and add a pyproject.toml. Users may want to preemptively run pip install --no-build-isolation so that we don't break their build workflows.

For now, the much simpler approach is to modify our build process to handle either Ninja or make. See #987.
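
A minimal sketch of what "handle either Ninja or make" could look like, assuming the build script simply prefers Ninja when it is on PATH and falls back to make otherwise. The function names are hypothetical and this is not taken from #987.

```python
import shutil


def select_build_tool(which=shutil.which):
    """Prefer ninja when available, otherwise fall back to make.

    `which` is injectable so the selection logic can be tested without
    depending on what is installed on the machine.
    """
    for tool in ("ninja", "make"):
        if which(tool) is not None:
            return tool
    raise RuntimeError("neither ninja nor make was found on PATH")


def build_command(tool, max_jobs=None):
    """Return the command line for `tool`, with an optional job cap."""
    cmd = [tool]
    if max_jobs is not None:
        # Both ninja and make accept -jN to limit parallel jobs.
        cmd.append(f"-j{max_jobs}")
    return cmd
```

One design point worth noting: without a -j argument, make runs one job at a time while Ninja builds in parallel, so a real build script would probably want to pass an explicit job count in the make case to get comparable parallelism.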
