Fix common compilation issues by auto-adjusting ninja MAX_JOBS env var #832
Flash-attn takes a lot of RAM and many CPU cores to compile. Most users do not have 64 cores or over 96 GB of RAM, and hand-adjusting MAX_JOBS is trial-and-error, not to mention a waste of dev time.
Problem this PR attempts to solve:

Without this PR,

python setup.py install

will OOM on an Intel 14700K consumer machine with 96 GB of RAM (swap disabled). Using the README-recommended MAX_JOBS=4 instead leads to very slow compilation. Both options are sub-optimal, which is why I created this PR. The problem is worse on consumer machines that generally have 32 GB of RAM, especially laptops. Overall, the chance of either waiting far too long for flash-attention to compile due to under-utilization, or hitting an outright OOM, is extremely high in all environments.

This PR tries to solve most of these headaches by auto-adjusting ninja's MAX_JOBS based on both CPU core count and available memory, so that compilation runs near maximum efficiency in both consumer and server environments. There is no longer a need to manually tune MAX_JOBS.
Base logic: the code calculates the most efficient MAX_JOBS from two metrics, (1) CPU cores and (2) available memory, and then takes the min() of the two. The core-based cap is physical cores / 2, since each compile job itself runs with threads=4; the memory-based cap is available memory divided by a constant of 9 GB, since I observed each job peaking at 8-9 GB of RAM.
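Roughly, the logic is the sketch below (a hypothetical illustration, not the PR's actual diff; it assumes a Linux host for the `os.sysconf` memory query, and approximates the physical-core count from `os.cpu_count()`, which reports logical cores):

```python
import os

def estimate_max_jobs(mem_per_job_gb: int = 9) -> int:
    """Pick a MAX_JOBS value as min(core-based cap, memory-based cap)."""
    # Core cap: halve the logical core count as a rough stand-in for
    # "physical cores / 2", since each compile job is itself multi-threaded.
    logical = os.cpu_count() or 1
    cores_cap = max(1, logical // 2)
    # Memory cap: available memory divided by ~9 GB, the observed
    # peak RAM use of a single flash-attn compile job.
    try:
        avail_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (ValueError, OSError):
        return cores_cap  # non-Linux / query failed: fall back to the CPU cap
    mem_cap = max(1, int(avail_bytes / (mem_per_job_gb * 2**30)))
    return min(cores_cap, mem_cap)

# Respect a user-pinned MAX_JOBS; only fill in a value if it is unset.
os.environ.setdefault("MAX_JOBS", str(estimate_max_jobs()))
```

Taking the min() of the two caps means the build is throttled by whichever resource is scarcer: plenty of RAM but few cores stays CPU-bound, while a many-core box with little free memory is kept from OOMing.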
Test Env: