
Ollama QOL settings #1800

Merged: 7 commits into zylon-ai:main on Apr 2, 2024
Conversation

Robinsane
Contributor

Ollama settings: ability to keep the LLM in memory for a longer time + ability to run Ollama embeddings on another instance

We've got a butter-smooth production setup right now by doing the following:

  1. Run the embeddings on a separate Ollama instance (Docker container).
    By doing this we avoid the wait while Ollama swaps the LLM out of (V)RAM for the embedding model and back.

  2. Explicitly state with each request that we want the model to stay in (V)RAM for another 6 hours (sketched at the end of this comment).
    By default Ollama unloads a model from (V)RAM after 5 minutes of inactivity, which caused long wait times to reload the LLM after more than 5 minutes of no use (we're running a 20 GB quant at the moment).

(3. ingest_mode: pipeline)

I hope this PR can make others as happy as I am right now ;)
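
For context, here is roughly what those two tweaks look like at the Ollama HTTP API level. This is only an illustrative sketch, not the code in this PR; the second port, the model names and the 6-hour value are placeholders.

# Illustrative only: two Ollama instances, embeddings on their own port so the
# LLM never gets swapped out of (V)RAM, and keep_alive sent with each request.
import requests

LLM_OLLAMA = "http://localhost:11434"    # instance that serves the chat LLM
EMBED_OLLAMA = "http://localhost:11435"  # separate instance for embeddings (placeholder port)

# 1. Embedding requests go to the dedicated embedding instance.
embedding = requests.post(
    f"{EMBED_OLLAMA}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "text to embed"},
).json()

# 2. LLM requests ask Ollama to keep the model loaded for 6 hours instead of
#    the default 5 minutes, via the per-request keep_alive field.
completion = requests.post(
    f"{LLM_OLLAMA}/api/generate",
    json={"model": "llama2", "prompt": "Hello", "stream": False, "keep_alive": "6h"},
).json()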

@Robinsane
Contributor Author

If anyone feels like fixing the failing mypy checks for private_gpt/components/llm/custom/ollama.py, feel free.

I got errors when I tried to import things that are only needed for annotations, purely to satisfy mypy...
I feel like it's not really necessary, since you can just look at the Ollama superclass to understand it all, right?
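
For reference, a common way around this is to import those names only for type checking. A minimal sketch, not this PR's code; the class name is made up and the llama-index import path may differ between versions:

from __future__ import annotations  # keeps annotations as strings at runtime

from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Only evaluated by mypy / IDEs, never at runtime (path may vary by llama-index version)
    from llama_index.core.llms import ChatMessage, ChatResponse


class PatchedOllama:  # hypothetical wrapper class, for illustration only
    def chat(self, messages: list[ChatMessage], **kwargs: Any) -> ChatResponse:
        ...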

Contributor

@dbzoo dbzoo left a comment


An alternative implementation could wrap the methods instead of creating a new subclass. This might not be technically correct but you get the general idea.

from typing import Any, Callable

def add_keep_alive(func: Callable[..., Any]) -> Callable[..., Any]:
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Add the keep_alive='5m' keyword argument
        kwargs['keep_alive'] = '5m'
        # Call the original function with the updated kwargs
        return func(*args, **kwargs)
    return wrapper

# Patch the bound methods on the underlying llama-index Ollama client
self.llm.chat = add_keep_alive(self.llm.chat)
self.llm.stream_chat = add_keep_alive(self.llm.stream_chat)
self.llm.complete = add_keep_alive(self.llm.complete)
self.llm.stream_complete = add_keep_alive(self.llm.stream_complete)

@Robinsane Robinsane marked this pull request as draft March 28, 2024 06:02
@Robinsane Robinsane marked this pull request as ready for review March 28, 2024 07:28
@Robinsane
Contributor Author

Thanks for the suggestion @dbzoo

extra:
If the default keep_alive is left unchanged, I don't wrap, leaving the requests just like they used to be :)
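
In other words, something along these lines (just a sketch of the idea, not the merged code; the default value and attribute names are assumptions):

from typing import Any, Callable

OLLAMA_DEFAULT_KEEP_ALIVE = "5m"  # Ollama's built-in default

def add_keep_alive(func: Callable[..., Any], keep_alive: str) -> Callable[..., Any]:
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        kwargs["keep_alive"] = keep_alive
        return func(*args, **kwargs)
    return wrapper

def maybe_patch_keep_alive(llm: Any, keep_alive: str) -> None:
    # Only wrap when the user configured a non-default value;
    # otherwise the requests stay exactly as they were.
    if keep_alive == OLLAMA_DEFAULT_KEEP_ALIVE:
        return
    for name in ("chat", "stream_chat", "complete", "stream_complete"):
        setattr(llm, name, add_keep_alive(getattr(llm, name), keep_alive))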

@dbzoo
Contributor

dbzoo commented Mar 28, 2024

> Thanks for the suggestion @dbzoo
>
> extra: If the default keep_alive is left unchanged, I don't wrap, leaving the requests just like they used to be :)

Thanks for taking that suggestion to heart. The code looks better for it, too. Nice job.

Collaborator

@imartinez imartinez left a comment

Really powerful contribution!

@imartinez imartinez merged commit b3b0140 into zylon-ai:main Apr 2, 2024
6 checks passed
mrepetto-certx pushed a commit to mrepetto-certx/privateGPT that referenced this pull request Apr 18, 2024