Intel Arc A770 LLM Fails #3076

Closed
kprinssu opened this issue Jul 29, 2024 · 9 comments
Labels: bug (Something isn't working), unconfirmed


kprinssu commented Jul 29, 2024

I am trying to get text generation working on my Intel Arc A770 8GB and I am running into issues with SYCL not utilising the GPU. When I attempt to use any prompt or any model, I see the following messages in the logs:

instance-1  | 9:55PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:34249): stderr Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
instance-1  | 9:55PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:34249): stderr Exception caught at file:/build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-sycl.cpp, line:3523, func:operator()
instance-1  | 9:55PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:34249): stderr SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *main_stream, oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
instance-1  | 9:55PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:34249): stderr   in function ggml_sycl_mul_mat_batched_sycl at /build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-sycl.cpp:3523
instance-1  | 9:55PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:34249): stderr /build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-sycl/common.hpp:103: SYCL error

It looks like my hardware does not support half (f16) precision?

LocalAI version:

v2.19.3

Environment, CPU architecture, OS, and Version:

> uname -a
Linux kishor-docker 6.8.0-38-generic #38+TEST2072755v20240712b1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jul 12 x86_64 x86_64 x86_64 GNU/Linux

Note: I had to use a custom test kernel from here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2072755/comments/4 to keep my Intel Arc from timing out. It looks like the 6.8 kernel has regressions and I am now running into them.

Edit:

I tried the mainline 6.10.2 kernel and I am still running into the same error. I am still seeing GPU hangs and resets.

Describe the bug

To Reproduce

Attempt to use text generation via the Web UI with any LLM model.

Expected behavior

Text to be produced and llama.cpp not to crash.

Logs

Logs:
https://gist.github.com/kprinssu/6cc5e9798018e05098ec091d1b5f4611

Linux Kernel logs:

[26569.857186] i915 0000:01:00.0: [drm] GPU HANG: ecode 12:10:85def5fa, in grpcpp_sync_ser [352969]
[26569.857197] i915 0000:01:00.0: [drm] grpcpp_sync_ser[352969] context reset due to GPU hang
[26572.958874] Fence expiration time out i915-0000:01:00.0:grpcpp_sync_ser[352969]:54!
[26589.959895] Fence expiration time out i915-0000:01:00.0:grpcpp_sync_ser[352969]:56!

Additional context

Docker Compose:

services:
  instance:
    image: quay.io/go-skynet/local-ai:v2.19.3-aio-gpu-intel-f32
    privileged: true
    ports:
      - 8085:8080
    volumes:
      - ./models:/build/models
      - ./images:/tmp/generated/images/
    devices:
      - /dev/dri:/dev/dri
    environment:
      - DEBUG=true
      - NEOReadDebugKeys=1
      - OverrideGpuAddressSpace=48
      - ZES_ENABLE_SYSMAN=1
    group_add:
      - 993
      - 996
      - 44

Note: I need to set the extra env vars above because sycl-ls did not list my GPU on the 6.8 kernel (intel/compute-runtime#710 and intel/compute-runtime#710 (comment)).
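
As a sanity check (a sketch; this assumes sycl-ls is available inside the LocalAI Intel image), the GPU should then show up as a Level Zero device inside the container:

docker compose exec instance sycl-ls
# Expect a level_zero GPU entry for the Arc A770, e.g. "[level_zero:gpu:0] ... Intel(R) Arc(TM) A770 Graphics ..."
# If only OpenCL CPU devices are listed, the workaround env vars are not taking effect.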

Example Model Config:

name: gpt4-gpu
mmap: false
context_size: 4096

f16: true
gpu_layers: 4

parameters:
  model: llava-v1.6-mistral-7b.Q5_K_M.gguf

stopwords:
- "<|im_end|>"
- "<dummy32000>"
- "</tool_call>"
- "<|eot_id|>"
- "<|end_of_text|>"
kprinssu added the bug and unconfirmed labels Jul 29, 2024

mudler (Owner) commented Jul 30, 2024

@kprinssu can you try disabling f16 in the model configuration?

kprinssu (Author) commented Jul 30, 2024

@mudler That works, but llama.cpp does not use my GPU. It uses my CPU, pegs all of my CPU cores, and then does not stream nor produce any output. llama.cpp keeps running until I force-stop the container.

Edit: Here are the logs when I attempt to re-use the same model config (with f16 set to false):

https://gist.github.com/kprinssu/a3f617daeb416919e3360365bc0d43a6

Error that causes SYCL to fail (same as when f16 is true):

instance-1  | 2:24PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:33375): stdout MKL Warning: Incompatible OpenCL driver version. GPU performance may be reduced.
instance-1  | 2:24PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:33375): stderr Native API failed. Native API returns: -2 (PI_ERROR_DEVICE_NOT_AVAILABLE) -2 (PI_ERROR_DEVICE_NOT_AVAILABLE)
instance-1  | 2:24PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:33375): stderr Exception caught at file:/build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-sycl.cpp, line:3523, func:operator()
instance-1  | 2:24PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:33375): stderr SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *main_stream, oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
instance-1  | 2:24PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:33375): stderr   in function ggml_sycl_mul_mat_batched_sycl at /build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-sycl.cpp:3523
instance-1  | 2:24PM DBG GRPC(llava-v1.6-mistral-7b.Q5_K_M.gguf-127.0.0.1:33375): stderr /build/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml-sycl/common.hpp:103: SYCL error

Kernel logs:

[  535.566050] i915 0000:01:00.0: [drm] GPU HANG: ecode 12:10:85def5fa, in grpcpp_sync_ser [17132]
[  535.566057] i915 0000:01:00.0: [drm] grpcpp_sync_ser[17132] context reset due to GPU hang
[  539.221533] Fence expiration time out i915-0000:01:00.0:grpcpp_sync_ser[17132]:54!
[  555.767327] Fence expiration time out i915-0000:01:00.0:grpcpp_sync_ser[17132]:56!

Model config for reference:

name: gpt4-gpu
mmap: false
context_size: 4096

f16: false
gpu_layers: 4

parameters:
  model: llava-v1.6-mistral-7b.Q5_K_M.gguf

stopwords:
- "<|im_end|>"
- "<dummy32000>"
- "</tool_call>"
- "<|eot_id|>"
- "<|end_of_text|>"

mudler (Owner) commented Jul 30, 2024

I cannot confirm this yet, but I'm going to try with my Intel Arc cluster soon to update the images. However, from the logs you are sharing it looks related to a driver incompatibility issue. When I was setting up my cluster I remember I had to go down to Ubuntu 22.04 LTS because the Intel drivers were not compatible.

Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-101-generic x86_64)

@kprinssu (Author)

I'll attempt to run a local build of llama.cpp, and subsequently LocalAI, on my host rather than via Docker and see if I run into the same error.

kprinssu (Author) commented Jul 30, 2024

Just to follow up, it indeed was a driver issue. I hacked in the Ubuntu 22.04 Jammy packages and it's working well. For reference, I had to use https://chsasank.com/intel-arc-gpu-driver-oneapi-installation.html to install all the required libraries on my host machine. Then I installed local-ai as a native binary and it's working great!
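
For anyone else hitting this, the gist of that guide is adding Intel's client-GPU apt repository for jammy and installing the user-space compute runtime. A rough sketch (the repo URL and suite name are assumptions based on Intel's jammy client-GPU instructions; the linked guide is the authoritative reference):

wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
  sudo gpg --dearmor -o /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
  sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt-get update
# OpenCL + Level Zero user-space stack needed by SYCL/oneAPI:
sudo apt-get install -y intel-opencl-icd intel-level-zero-gpu level-zero clinfo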

Edit:

I am pretty sure I missed something, as I am unable to get the local-ai binary to use SYCL.

kprinssu closed this as completed Aug 6, 2024
kprinssu (Author) commented Aug 6, 2024

Just for clarification, I was unable to get LocalAI working with GPU acceleration and I have pivoted to running llama.cpp directly.

@KimSHenriksen

You can try the Vulkan version. You may have to manually update the YAML files for the models and specify gpu_layers. Start with a low number; if you set it too high you'll get an error like 'OutOfDeviceMemoryError'.
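
For reference, reusing the model config from earlier in this thread, that would look roughly like this (a sketch; the gpu_layers value is illustrative and depends on available VRAM):

name: gpt4-gpu
mmap: false
context_size: 4096

f16: false
gpu_layers: 4   # start low and raise gradually; too high triggers OutOfDeviceMemoryError

parameters:
  model: llava-v1.6-mistral-7b.Q5_K_M.gguf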

mudler (Owner) commented Aug 10, 2024

> Just for clarification, I was unable to get LocalAI working with GPU acceleration and I have pivoted to running llama.cpp directly.

If you could share the logs I might be able to help; otherwise it's hard.

Did you try the container images?

kprinssu (Author) commented Aug 10, 2024

I have tried the images and unfortunately I found that they stalled or timed out. I tried my own custom build of Ollama and it seems to be working very well now.

For reference, this is my Dockerfile to build my own Ollama:

FROM intel/oneapi-basekit

# Update and upgrade the existing packages; wget is needed below for the Miniconda download
RUN apt-get update && apt-get upgrade -y && apt-get install -y --no-install-recommends wget \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/* /tmp/*

ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"

# Install conda
RUN mkdir -p ~/miniconda3 && \
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh && \
  bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3 && \
  rm -rf ~/miniconda3/miniconda.sh && \
  ~/miniconda3/bin/conda init bash && \
  . ~/.bashrc && \
    conda create -n llm-cpp python=3.11 && \
    conda init && conda activate llm-cpp
RUN pip install --pre --upgrade ipex-llm[cpp]

RUN mkdir /app
WORKDIR /app
RUN init-ollama


# Default environment variables
ENV OLLAMA_NUM_GPU=999
ENV no_proxy=0.0.0.0

ENV ZES_ENABLE_SYSMAN=1
ENV SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
ENV SYCL_CACHE_PERSISTENT=1
ENV ONEAPI_DEVICE_SELECTOR=level_zero:0

# NOTE: only set this for Intel Arc or Max GPUs (see the note further below)
ENV USE_XETLA=OFF

CMD ["sh", "./ollama serve"]

@mudler I believe these env vars, which I grabbed from Intel's ipex-llm documentation, will be helpful. Would we be able to document these in the LocalAI docs?

I found these helped accelerate the computation quite dramatically.

Environment variables:

ENV ZES_ENABLE_SYSMAN=1
ENV SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
ENV SYCL_CACHE_PERSISTENT=1
ENV ONEAPI_DEVICE_SELECTOR=level_zero:0

# Only set this for Intel Arc and Intel Max GPUs
ENV USE_XETLA=OFF

I also found excellent documentation on what each env var does in the SYCL LLVM repo.
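
For the docker-compose setup in the original post, these would presumably map onto the service's environment: section along these lines (an untested sketch; ONEAPI_DEVICE_SELECTOR=level_zero:0 assumes the Arc is the only Level Zero device):

    environment:
      - DEBUG=true
      - ZES_ENABLE_SYSMAN=1
      - SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
      - SYCL_CACHE_PERSISTENT=1
      - ONEAPI_DEVICE_SELECTOR=level_zero:0
      # Only set this for Intel Arc and Intel Max GPUs
      - USE_XETLA=OFF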
