
[IPEX] Slice SDPA into smaller chunks #14353

Merged (2 commits, Jan 1, 2024)

Conversation

Contributor

@Nuullll Nuullll commented Dec 18, 2023

Description

Slice scaled_dot_product_attention into smaller chunks so that no chunk's SDPA requests an allocation larger than the given limit.

This was initially designed to work around the 4GB single block allocation limitation of Intel compute-runtime (RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 8.00 GiB). Then I found out that setting a smaller limit would reduce the VRAM footprint during SDPA calculation. The current limit (VRAM // 8) was tuned for Intel Arc A770 16G and A750 8G without sacrificing performance.
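The sizing idea can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name and parameters are mine, and the sketch only models the dominant allocation, the attention-weight tensor of shape (batch, heads, query_chunk, kv_len).

```python
import math

def find_slice_size(batch, heads, q_len, kv_len, dtype_bytes, limit_bytes):
    """Hypothetical helper: largest query-chunk length whose attention-weight
    tensor (batch x heads x chunk x kv_len) stays under limit_bytes, e.g. a
    limit of VRAM // 8 as tuned in this PR."""
    per_query_row = batch * heads * kv_len * dtype_bytes  # bytes per query position
    chunk = max(1, min(q_len, limit_bytes // per_query_row))
    n_chunks = math.ceil(q_len / chunk)
    return chunk, n_chunks

# e.g. batch 2, 8 heads, 4096 query/key tokens, fp16, 64 MiB limit:
print(find_slice_size(2, 8, 4096, 4096, 2, 64 * 1024 * 1024))  # -> (512, 8)
```

SDPA is then run once per query chunk against the full key/value tensors, and the outputs are concatenated along the token dimension; the result is mathematically identical because each query row attends independently.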

With this change, the A770 16G can generate 512x512 images at batch size 32, and the A750 8G at batch size 16.

Test results:

Common settings: --use-ipex --opt-sdp-attention, txt2img, DPM++ 2M Karras, 20 steps, 512x512 resolution, batch count = 5

- Effective it/s = Batch size * Batch count * Steps / Total time taken
- RE in the tables refers to `RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB!`
- OOM in the tables refers to `RuntimeError: Allocation is out of device memory on current platform.`
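The throughput metric above is plain arithmetic; a minimal sketch (my own helper name, with made-up example numbers):

```python
def effective_its(batch_size, batch_count, steps, total_seconds):
    # Effective it/s as defined above: total sampling iterations over wall time.
    return batch_size * batch_count * steps / total_seconds

# e.g. batch size 4, batch count 5, 20 steps, finishing in 100 s:
print(effective_its(4, 5, 20, 100))  # -> 4.0
```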

A770 16G (connected with two monitors [taking up ~1.1GB VRAM])

| Batch Size | Before: Peak VRAM (GB) | After: Peak VRAM (GB) | Delta | Before: Effective it/s | After: Effective it/s | Delta |
|---|---|---|---|---|---|---|
| 1 | 6.8 | 6.6 | -2.9% | 5.95 | 6.45 | +8.4% |
| 2 | 8.5 | 8.4 | -1.2% | 7.66 | 7.84 | +2.3% |
| 4 | 11.6 | 11.1 | -4.3% | 8.95 | 9.17 | +2.5% |
| 8 | 15.9 | 13.9 | -12.6% | 4.46 | 10.74 | +140.8% |
| 16 | RE | 15.1 | - | - | 11.40 | - |
| 32 | RE | 15.5 | - | - | 11.24 | - |

A750 8G (not connected with monitors)

| Batch Size | Before: Peak VRAM (GB) | After: Peak VRAM (GB) | Delta | Before: Effective it/s | After: Effective it/s | Delta |
|---|---|---|---|---|---|---|
| 1 | 5.7 | 5.6 | -1.8% | 5.49 | 6.06 | +10.4% |
| 2 | 7.4 | 6.8 | -8.1% | 7.22 | 7.55 | +4.6% |
| 4 | 7.9 | 7.5 | -5.1% | 6.81 | 8.68 | +27.5% |
| 8 | OOM | 7.9 | - | - | 9.47 | - |
| 16 | RE | 7.9 | - | - | 9.15 | - |
| 32 | RE | OOM | - | - | - | - |

Screenshots/videos:

(screenshot attached in the original PR)

@AUTOMATIC1111 AUTOMATIC1111 merged commit cba6fba into AUTOMATIC1111:dev Jan 1, 2024
3 checks passed
@w-e-w w-e-w mentioned this pull request Feb 17, 2024
@pawel665j pawel665j mentioned this pull request Apr 16, 2024