
[IPEX] Slice SDPA into smaller chunks #14353

Merged (2 commits, Jan 1, 2024)

Conversation

Contributor

@Nuullll Nuullll commented Dec 18, 2023

Description

Slice scaled_dot_product_attention into smaller chunks so that no chunk's SDPA requests an allocation larger than the given limit.

This was initially designed to work around the 4GB single block allocation limitation of Intel compute-runtime (RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 8.00 GiB). Then I found out that setting a smaller limit would reduce the VRAM footprint during SDPA calculation. The current limit (VRAM // 8) was tuned for Intel Arc A770 16G and A750 8G without sacrificing performance.
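The sizing idea can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name and parameters are mine, and the sketch only models the dominant allocation, the attention-weight tensor of shape (batch, heads, query_chunk, kv_len).

```python
import math

def find_slice_size(batch, heads, q_len, kv_len, dtype_bytes, limit_bytes):
    """Hypothetical helper: largest query-chunk length whose attention-weight
    tensor (batch x heads x chunk x kv_len) stays under limit_bytes, e.g. a
    limit of VRAM // 8 as tuned in this PR."""
    per_query_row = batch * heads * kv_len * dtype_bytes  # bytes per query position
    chunk = max(1, min(q_len, limit_bytes // per_query_row))
    n_chunks = math.ceil(q_len / chunk)
    return chunk, n_chunks

# e.g. batch 2, 8 heads, 4096 query/key tokens, fp16, 64 MiB limit:
print(find_slice_size(2, 8, 4096, 4096, 2, 64 * 1024 * 1024))  # -> (512, 8)
```

SDPA is then run once per query chunk against the full key/value tensors, and the outputs are concatenated along the token dimension; the result is mathematically identical because each query row attends independently.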

With this change, the A770 16G can generate 512x512 images at batch size 32, and the A750 8G at batch size 16.

Test results:

Common settings: --use-ipex --opt-sdp-attention, txt2img, DPM++ 2M Karras, 20 steps, 512x512 resolution, batch count = 5

- Effective it/s = Batch size * Batch count * Steps / Total time taken
- RE in the tables refers to `RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB!`
- OOM in the tables refers to `RuntimeError: Allocation is out of device memory on current platform.`
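The throughput metric above is plain arithmetic; a minimal sketch (my own helper name, with made-up example numbers):

```python
def effective_its(batch_size, batch_count, steps, total_seconds):
    # Effective it/s as defined above: total sampling iterations over wall time.
    return batch_size * batch_count * steps / total_seconds

# e.g. batch size 4, batch count 5, 20 steps, finishing in 100 s:
print(effective_its(4, 5, 20, 100))  # -> 4.0
```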

A770 16G (connected with two monitors [taking up ~1.1GB VRAM])

| Batch Size | Before: Peak VRAM (GB) | After: Peak VRAM (GB) | Delta | Before: Effective it/s | After: Effective it/s | Delta |
|---|---|---|---|---|---|---|
| 1 | 6.8 | 6.6 | -2.9% | 5.95 | 6.45 | +8.4% |
| 2 | 8.5 | 8.4 | -1.2% | 7.66 | 7.84 | +2.3% |
| 4 | 11.6 | 11.1 | -4.3% | 8.95 | 9.17 | +2.5% |
| 8 | 15.9 | 13.9 | -12.6% | 4.46 | 10.74 | +140.8% |
| 16 | RE | 15.1 | - | - | 11.40 | - |
| 32 | RE | 15.5 | - | - | 11.24 | - |

A750 8G (not connected with monitors)

| Batch Size | Before: Peak VRAM (GB) | After: Peak VRAM (GB) | Delta | Before: Effective it/s | After: Effective it/s | Delta |
|---|---|---|---|---|---|---|
| 1 | 5.7 | 5.6 | -1.8% | 5.49 | 6.06 | +10.4% |
| 2 | 7.4 | 6.8 | -8.1% | 7.22 | 7.55 | +4.6% |
| 4 | 7.9 | 7.5 | -5.1% | 6.81 | 8.68 | +27.5% |
| 8 | OOM | 7.9 | - | - | 9.47 | - |
| 16 | RE | 7.9 | - | - | 9.15 | - |
| 32 | RE | OOM | - | - | - | - |

Screenshots/videos:

(screenshot attached in the original PR)

@AUTOMATIC1111 AUTOMATIC1111 merged commit cba6fba into AUTOMATIC1111:dev Jan 1, 2024
3 checks passed
@w-e-w w-e-w mentioned this pull request Feb 17, 2024
@pawel665j pawel665j mentioned this pull request Apr 16, 2024