Fix common compilation issues by auto-adjusting ninja MAX_JOBS env var #832
Flash-attn takes a lot of RAM and many CPU cores to compile. Most users do not have 64 cores or over 96 GB of RAM, and hand-adjusting MAX_JOBS is trial-and-error, not to mention a waste of dev time.
Problem this PR attempts to solve:

Without this PR,

python setup.py install

will OOM on an Intel 14700K consumer machine with 96 GB of RAM (swap disabled). Using the README-recommended MAX_JOBS=4 instead leads to very slow compilation. Both options are sub-optimal, which is why I created this PR. The problem is worse on consumer machines that generally have 32 GB of RAM, especially laptops. Overall, the chance of either waiting far too long for flash-attention to compile due to under-utilization, or hitting an outright OOM, is extremely high in all environments.

This PR tries to solve most of these headaches by auto-adjusting ninja's MAX_JOBS based on both CPU core count and available memory, so that compilation runs near maximum efficiency in both consumer and server environments. There is no longer a need to manually tune MAX_JOBS.
Base logic: the code calculates the most efficient MAX_JOBS from two metrics, (1) CPU cores and (2) available memory, and then takes the min() of the two. The core-based cap is physical cores / 2, since each compile job itself runs with threads=4; the memory-based cap is available memory divided by a constant of 9 GB, since I observed each job peaking at 8-9 GB of RAM.
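Roughly, the logic is the sketch below (a hypothetical illustration, not the PR's actual diff; it assumes a Linux host for the `os.sysconf` memory query, and approximates the physical-core count from `os.cpu_count()`, which reports logical cores):

```python
import os

def estimate_max_jobs(mem_per_job_gb: int = 9) -> int:
    """Pick a MAX_JOBS value as min(core-based cap, memory-based cap)."""
    # Core cap: halve the logical core count as a rough stand-in for
    # "physical cores / 2", since each compile job is itself multi-threaded.
    logical = os.cpu_count() or 1
    cores_cap = max(1, logical // 2)
    # Memory cap: available memory divided by ~9 GB, the observed
    # peak RAM use of a single flash-attn compile job.
    try:
        avail_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError):
        return cores_cap  # non-Linux / query failed: fall back to the CPU cap
    mem_cap = max(1, int(avail_bytes / (mem_per_job_gb * 2**30)))
    return min(cores_cap, mem_cap)

# Respect a user-pinned MAX_JOBS; only fill in a value if it is unset.
os.environ.setdefault("MAX_JOBS", str(estimate_max_jobs()))
```

Taking the min() of the two caps means the build is throttled by whichever resource is scarcer: plenty of RAM but few cores stays CPU-bound, while a many-core box with little free memory is kept from OOMing.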
Test Env: