
Fix common compilation issues by auto-adjusting ninja MAX_JOBS env var #832

Merged (1 commit) Feb 18, 2024

Conversation

Qubitium (Contributor)

Flash-attn takes a lot of RAM and CPU cores to compile. Most users do not have 64 cores or over 96GB of RAM, and hand-adjusting MAX_JOBS is trial-and-error, not to mention a waste of dev time.

Problem this PR attempts to solve:

  1. Ninja MAX_JOBS is unaware of the threads=4 spawned by nvcc.
  2. Ninja MAX_JOBS is unaware of the actual available memory in the environment.
  3. Both 1 and 2 combined result in OOM, heavy swap usage, threads oversubscribed relative to core count (thread starvation), and generally reduced compile efficiency, since ninja is not resource-aware.

Without this PR, `python setup.py install` will OOM on an Intel 14700K consumer machine with 96GB of RAM (swap disabled). Using the README-recommended MAX_JOBS=4 leads to very slow compilation. Both options are sub-optimal, which is why I created this PR. The problem is worse for consumer machines that generally have 32GB of RAM, especially laptops. Overall, the chance of over-waiting for flash-attention to compile, due to under-utilization or an outright OOM, is extremely high in all environments.

This PR solves most of those headaches by auto-adjusting ninja MAX_JOBS based on both CPU core count and available memory, so compilation runs near max efficiency in both consumer and server environments. There is no longer a need to manually adjust the MAX_JOBS value.

Base logic: the code calculates the most efficient MAX_JOBS from two metrics, 1. CPU cores and 2. available memory, and takes the min() of the two. The core count is physical cores / 2 (since nvcc spawns threads=4), and memory is divided by a constant of 9GB, since I observed each job using 8-9GB of RAM at peak.
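The min() heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the PR's exact code: `auto_max_jobs` and its parameter names are hypothetical, and the 9GB-per-job divisor reflects the peak-memory observation above.

```python
import os


def auto_max_jobs(physical_cores: int, free_mem_gb: float) -> int:
    """Pick a MAX_JOBS value bounded by both CPU and available memory."""
    # CPU bound: each nvcc job spawns ~4 threads (threads=4), so halving
    # the physical core count keeps total threads near 2x the cores.
    cpu_jobs = physical_cores // 2
    # Memory bound: each compile job was observed to peak at 8-9GB of RAM,
    # so divide available memory by a 9GB-per-job constant.
    mem_jobs = int(free_mem_gb // 9)
    # Take the stricter of the two limits, but always allow at least one job.
    return max(1, min(cpu_jobs, mem_jobs))


# Example: 14700K-class machine with 96GB of RAM free.
jobs = auto_max_jobs(physical_cores=20, free_mem_gb=96)
# Respect a user-supplied MAX_JOBS; only fill in the auto value if unset.
os.environ.setdefault("MAX_JOBS", str(jobs))
```

With 96GB free the memory bound (96 // 9 = 10 jobs) is the binding constraint, while on a 32GB laptop it would drop to 3 jobs regardless of core count.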

Test Env:

Ubuntu 22.04
Torch 2.2
Intel 14700K + 96GB DDR5 6600 + swap disabled (to reproduce the worst possible OOM situations).

…ads starvation when letting ninja decide how many workers to spawn or manual MAX_JOBS "guesses". Logic is to take the min value of MAX_JOBS auto-calculated by two metrics: 1: cpu cores 2: free memory. This should allow flash-attn to compile close to the most efficient manner under any consumer/server env.
@Qubitium Qubitium changed the title Fix common compilation issues caused by auto-adjusting ninja MAX_JOBS env var Fix common compilation issues by auto-adjusting ninja MAX_JOBS env var Feb 17, 2024
@tridao tridao merged commit f45bbb4 into Dao-AILab:main Feb 18, 2024
tridao (Contributor) commented Feb 18, 2024

This is great, thanks so much @Qubitium !
