Environment, CPU architecture, OS, and Version:
Linux chrispc 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
This is ubuntu 22.04 on WSL2 with Nvidia drivers available in the VM.
Describe the bug
Using exllama2 directly, by cloning the repository and installing as per its GitHub instructions, I'm able to use an exl2 model. Example:
python ./test_inference.py -m ../Mixtral-8x7B-instruct-exl2 -p "In a land far far away ..."
-- Model: ../Mixtral-8x7B-instruct-exl2
-- Options: []
-- Loading model...
-- Loaded model in 32.8406 seconds
-- Loading tokenizer...
-- Warmup...
-- Generating...
In a land far far away ...
A group of explorers, with the help of a few friendly locals, must navigate through a series of increasingly difficult challenges in order to reach their ultimate goal: find the fabled city of gold.
The game is divided into several "scenes" or areas. Each scene contains a set of tasks and puzzles that must be solved in order to move on to the next scene.
Each scene is unique and requires different skills in order to solve the puzzles. Some scenes may require physical strength, others may require agility, and still others may require cunning and intellect.
The game
-- Response generated in 2.50 seconds, 128 tokens, 51.12 tokens/second (includes prompt eval.)
Using the same model with exllama2 through LocalAI I get
curl http://chrispc.zarek.cc:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "Mixtral",
"prompt": "A long time ago in a galaxy far, far away"
}'
{"error":{"code":500,"message":"grpc service not ready","type":""}}
See the logs below in the log section during this time.
To Reproduce
I adjusted the file ./backend/python/exllama2/install.sh to use the master branch of exllama2 just in case.
## A bash script installs the required dependencies of VALL-E-X and prepares the environment
-export SHA=c0ddebaaaf8ffd1b3529c2bb654e650bce2f790f
+#export SHA=c0ddebaaaf8ffd1b3529c2bb654e650bce2f790f
+export SHA=master
I'm building with sudo docker build --build-arg="BUILD_TYPE=cublas" --build-arg="CUDA_MAJOR_VERSION=12" --build-arg="CUDA_MINOR_VERSION=4" -t localai .
The Mixtral model is from: https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2. I've tried 3.5bpw and 3.0bpw (this particular run is 3.0). Both work fine when using the built-in example from exllama2, and both fail in this same way when using LocalAI.
It seems that there is an issue with connecting to the gRPC service after loading the Mixtral model. The error message indicates a connection refusal when trying to reach the gRPC service at 127.0.0.1:46605.
To troubleshoot this issue, you can try the following steps:
Verify that the backend process is running properly. You can do this by checking the output of the backend command:
For exllama2: ps aux | grep exllama2
For exllama: ps aux | grep exllama
Ensure that the firewall is not blocking the GRPC port (46605 in this case). You may need to open the port in the firewall settings or add an exception.
Check if there are any other instances of the backend process running, as this could cause a conflict. You can do this by checking the process list using the ps command and looking for any duplicate processes.
Make sure that there is no network issue preventing the connection to the GRPC service. Check the network connectivity between the host and 127.0.0.1:46605.
Try restarting the backend process and see if the issue persists. You can do this by stopping the current process and starting a new one, for example:
For exllama2: kill -9 <process_id> ; exllama2
For exllama: kill -9 <process_id> ; exllama
If the issue still persists after trying these steps, you may need to look into specific configuration settings or seek further assistance from the support channels for the Mixtral model or the backend you are using (exllama2, exllama, etc.).
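The port check from the steps above can be sketched with Python's standard library alone. Note that 46605 is a dynamically assigned port taken from this particular run's logs; substitute the port printed in your own logs:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 46605 is the backend port from the logs above; it changes per run.
print(port_open("127.0.0.1", 46605))
```

If this prints False while the model claims to be loaded, the backend process never bound its gRPC port, which matches the "connection refused" in the logs.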
I also wanted to give exl2 a shot. The model is loading and the gRPC server seems fine, but I get this error on inference: Error rpc error: code = Unknown desc = Exception iterating responses: 'Result' object is not an iterator
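For context, "'Result' object is not an iterator" is what Python raises when gRPC's server-streaming machinery calls next() on a plain object instead of a generator. A minimal illustration (the Result class here is hypothetical, not LocalAI's actual backend code):

```python
class Result:
    """Hypothetical stand-in for a backend response object."""
    def __init__(self, text: str):
        self.text = text

def predict_stream_broken(prompt: str):
    # Returns the object itself; calling next() on it fails.
    return Result("hello")

def predict_stream_fixed(prompt: str):
    # 'yield' turns this into a generator, which can be iterated.
    yield Result("hello")

try:
    next(predict_stream_broken("hi"))
except TypeError as e:
    print(e)  # 'Result' object is not an iterator
```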
LocalAI version:
v2.12.1
Then running this docker-compose:
The file mixtral.yaml in the /models folder is:
Logs
Model name: Buttercup
Model name: Llama
Model name: Mixtral
6:54PM DBG Model: Mixtral (config: {PredictionOptions:{Model:/Mixtral Language: N:0 TopP:0xc0001fc320 TopK:0xc0001fc328 Temperature:0xc0001fc330 Maxtokens:0xc0001fc338 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc360 TypicalP:0xc0001fc358 Seed:0xc0001fc378 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Mixtral F16:0xc0001fc318 Threads:0xc0001fc310 Debug:0xc0001fc370 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc350 MirostatTAU:0xc0001fc348 Mirostat:0xc0001fc340 NGPULayers:0xc0001fc368 MMap:0xc0001fc370 MMlock:0xc0001fc371 LowVRAM:0xc0001fc371 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc308 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:})
6:54PM DBG Model: Buttercup (config: {PredictionOptions:{Model:/Buttercup Language: N:0 TopP:0xc0001fc110 TopK:0xc0001fc118 Temperature:0xc0001fc120 Maxtokens:0xc0001fc128 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc150 TypicalP:0xc0001fc148 Seed:0xc0001fc168 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Buttercup F16:0xc0001fc108 Threads:0xc0001fc100 Debug:0xc0001fc160 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc140 MirostatTAU:0xc0001fc138 Mirostat:0xc0001fc130 NGPULayers:0xc0001fc158 MMap:0xc0001fc160 MMlock:0xc0001fc161 LowVRAM:0xc0001fc161 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc0f8 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:})
6:54PM DBG Model: Llama (config: {PredictionOptions:{Model:/Llama Language: N:0 TopP:0xc0001fc218 TopK:0xc0001fc220 Temperature:0xc0001fc228 Maxtokens:0xc0001fc230 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc258 TypicalP:0xc0001fc250 Seed:0xc0001fc270 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Llama F16:0xc0001fc210 Threads:0xc0001fc208 Debug:0xc0001fc268 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc248 MirostatTAU:0xc0001fc240 Mirostat:0xc0001fc238 NGPULayers:0xc0001fc260 MMap:0xc0001fc268 MMlock:0xc0001fc269 LowVRAM:0xc0001fc269 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc200 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:})
6:54PM DBG Extracting backend assets files to /tmp/localai/backend_data
6:54PM INF core/startup process completed!
6:54PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json
6:54PM DBG No configuration file found at /tmp/localai/config/assistants.json
6:54PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
┌───────────────────────────────────────────────────┐
│ Fiber v2.52.0 │
│ http://127.0.0.1:8080 │
│ (bound on host 0.0.0.0 and port 8080) │
│ │
│ Handlers ........... 181 Processes ........... 1 │
│ Prefork ....... Disabled PID ................. 1 │
└───────────────────────────────────────────────────┘
6:54PM DBG Request received: {"model":"Mixtral","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":"A long time ago in a galaxy far, far away","instruction":"","input":null,"stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
6:54PM DBG
input
: &{PredictionOptions:{Model:Mixtral Language: N:0 TopP: TopK: Temperature: Maxtokens: Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ: TypicalP: Seed: NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Context:context.Background.WithCancel Cancel:0x4ab9a0 File: ResponseFormat:{Type:} Size: Prompt:A long time ago in a galaxy far, far away Instruction: Input: Stop: Messages:[] Functions:[] FunctionCall: Tools:[] ToolsChoice: Stream:false Mode:0 Step:0 Grammar: JSONFunctionGrammarObject: Backend: ModelBaseName:}6:54PM DBG Parameter Config: &{PredictionOptions:{Model:/Mixtral Language: N:0 TopP:0xc0001fc320 TopK:0xc0001fc328 Temperature:0xc0001fc330 Maxtokens:0xc0001fc338 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc0001fc360 TypicalP:0xc0001fc358 Seed:0xc0001fc378 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Mixtral F16:0xc0001fc318 Threads:0xc0001fc310 Debug:0xc000398bc8 Roles:map[] Embeddings:false Backend:exllama2 TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[A long time ago in a galaxy far, far away] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc0001fc350 MirostatTAU:0xc0001fc348 Mirostat:0xc0001fc340 NGPULayers:0xc0001fc368 MMap:0xc0001fc370 MMlock:0xc0001fc371 LowVRAM:0xc0001fc371 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc0001fc308 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 
TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
6:54PM INF Loading model '/Mixtral' with backend exllama2
6:54PM DBG Loading model in memory from file: /models/Mixtral
6:54PM DBG Loading Model /Mixtral with gRPC (file: /models/Mixtral) (backend: exllama2): {backendString:exllama2 model:/Mixtral threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0004d6000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
6:54PM DBG Loading external backend: /build/backend/python/exllama2/run.sh
6:54PM DBG Loading GRPC Process: /build/backend/python/exllama2/run.sh
6:54PM DBG GRPC Service for /Mixtral will be running at: '127.0.0.1:46605'
6:54PM DBG GRPC Service state dir: /tmp/go-processmanager4021755915
6:54PM DBG GRPC Service Started
6:55PM ERR failed starting/connecting to the gRPC service error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:46605: connect: connection refused""
6:55PM DBG GRPC Service NOT ready
[192.168.1.202]:50909 500 - POST /v1/completions
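The grpcAttempts:20 and grpcAttemptsDelay:2 values in the load log suggest LocalAI polls the backend for roughly 40 seconds before giving up. A rough Python sketch of that wait loop (not the actual Go implementation):

```python
import socket
import time

def wait_for_backend(host: str, port: int,
                     attempts: int = 20, delay: float = 2.0) -> bool:
    """Poll a TCP port until it accepts a connection or attempts run out.
    Defaults mirror the grpcAttempts/grpcAttemptsDelay values in the log."""
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(delay)
    return False
```

In the failing run the backend process never binds 127.0.0.1:46605, so every attempt ends in connection refused and the API returns the 500 "grpc service not ready" shown above.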
Additional context