Ollama QOL settings #1800
Conversation
If anyone feels like fixing the failing mypy tests for private_gpt/components/llm/custom/ollama.py, feel free. I got some errors trying to import stuff only needed for annotations for mypy.
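For whoever picks that up: if the errors come from modules that are imported only for type hints, a minimal sketch of the usual guard is below. The `Ollama` import is just a placeholder example, not necessarily what ollama.py needs.

```python
# Sketch only, assuming the mypy failures come from imports that are
# needed solely for type annotations: guard them behind TYPE_CHECKING
# so they never run at import time.
from __future__ import annotations  # postpone evaluation of annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Placeholder import, visible only to the type checker; swap in
    # whatever ollama.py actually needs for its hints.
    from llama_index.llms.ollama import Ollama


def describe(llm: Ollama) -> str:
    # The annotation is resolved only by mypy, never at runtime.
    return type(llm).__name__
```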
An alternative implementation could wrap the methods instead of creating a new subclass. This might not be technically correct but you get the general idea.
from typing import Callable

def add_keep_alive(func: Callable) -> Callable:
    def wrapper(*args, **kwargs):
        # Add the keep_alive='5m' keyword argument to every call
        kwargs['keep_alive'] = '5m'
        # Call the original function with the updated kwargs
        return func(*args, **kwargs)
    return wrapper

self.llm.chat = add_keep_alive(self.llm.chat)
self.llm.stream_chat = add_keep_alive(self.llm.stream_chat)
self.llm.complete = add_keep_alive(self.llm.complete)
self.llm.stream_complete = add_keep_alive(self.llm.stream_complete)
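For what it's worth, a typed variant of the same wrapper (just a sketch, not code from this PR) keeps the wrapped signature visible to mypy via `ParamSpec` and preserves the original metadata with `functools.wraps`:

```python
# Sketch only: a keep_alive-injecting wrapper that mypy can follow.
import functools
from typing import Callable, ParamSpec, TypeVar  # ParamSpec needs Python 3.10+

P = ParamSpec("P")
R = TypeVar("R")


def add_keep_alive(func: Callable[P, R], keep_alive: str = "5m") -> Callable[P, R]:
    @functools.wraps(func)  # keep the original name/docstring metadata
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # Inject keep_alive into every call; strict mypy may still want a
        # cast or `# type: ignore` for mutating **kwargs like this.
        kwargs["keep_alive"] = keep_alive
        return func(*args, **kwargs)

    return wrapper
```

The assignments above (`self.llm.chat = add_keep_alive(self.llm.chat)`, etc.) would work unchanged.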
Thanks for the suggestion @dbzoo.
Thanks for taking that suggestion to heart. The code looks better for it, too. Nice job.
Really powerful contribution!
ollama settings: ability to keep LLM in memory for a longer time + ability to run ollama embedding on another instance
We've got a butter-smooth production setup right now by doing the following things:

1. Running the embedding model on a separate Ollama instance (Docker container). This avoids the waiting time where Ollama swaps the LLM out of (V)RAM for the embedding model and back.
2. Explicitly stating with each request that we want the model to stay in (V)RAM for another 6 hours (as sketched below). By default Ollama unloads a model from (V)RAM after 5 minutes of inactivity, which caused long wait times to reload the LLM after more than 5 idle minutes (we're running a 20 GB quant at the moment).
3. (ingest_mode: pipeline)
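For anyone wanting to replicate this, a rough sketch of what the configuration could look like. The key names (`keep_alive`, `embedding_api_base`, `ingest_mode`) and the values here are illustrative, taken from this PR's description; check the settings files in the branch for the exact names:

```yaml
# Sketch only; key names and values are assumptions, not copied from the branch.
embedding:
  mode: ollama
  ingest_mode: pipeline                        # parallelized ingestion

ollama:
  api_base: http://localhost:11434             # main Ollama instance serving the LLM
  embedding_api_base: http://localhost:11435   # second Ollama instance (e.g. a docker container) for embeddings
  keep_alive: 6h                               # keep the model loaded in (V)RAM between requests
```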
I hope this PR can make others as happy as I am right now ;)