
simonw/llm-cluster: LLM plugin for clustering embeddings #839

Open

ShellLM opened this issue Jul 3, 2024 · 1 comment
Labels: CLI-UX (Command Line Interface user experience and best practices), embeddings (vector embeddings and related tools), llm (Large Language Models)


ShellLM commented Jul 3, 2024

simonw/llm-cluster: LLM plugin for clustering embeddings


LLM plugin for clustering embeddings

Background on this project: Clustering with llm-cluster.

Installation

Install this plugin in the same environment as LLM.

llm install llm-cluster

Usage

The plugin adds a new command, llm cluster. This command takes the name of an embedding collection and the number of clusters to return.

First, use paginate-json and jq to populate a collection. In this case we are embedding the title of every issue in the llm repository, and storing the result in an issues.db database:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db --store

The --store flag causes the content to be stored in the database along with the embedding vectors.
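To see why --store matters, it helps to picture the database: each row holds a packed embedding vector, and the content column is only populated when --store is used. The schema and helper below are a simplified illustration, not llm's actual table layout:

```python
import sqlite3
import struct

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE embeddings (collection TEXT, id TEXT, embedding BLOB, content TEXT)"
)

def store(collection, item_id, vector, content=None):
    # Pack the float vector into a binary blob; content stays NULL
    # unless the caller opts in (the equivalent of --store)
    blob = struct.pack(f"{len(vector)}f", *vector)
    db.execute(
        "INSERT INTO embeddings VALUES (?, ?, ?, ?)",
        (collection, item_id, blob, content),
    )

store("llm-issues", "1650662628", [0.1, 0.2, 0.3], "Initial design")
row = db.execute(
    "SELECT content FROM embeddings WHERE id = '1650662628'"
).fetchone()
```

Features that display or summarize content can only work for rows where that last column was filled in, which is why --store is a prerequisite for them.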

Now we can cluster those embeddings into 10 groups:

llm cluster llm-issues 10 \
  -d issues.db

If you omit the -d option the default embeddings database will be used.
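Conceptually, the command runs a k-means-style algorithm over the stored vectors. The pure-Python sketch below, with farthest-point initialization and toy 2-D vectors, is only an illustration of the idea, not the plugin's actual implementation:

```python
def kmeans(vectors, k, iterations=10):
    """Tiny deterministic k-means sketch: returns k lists of item indices."""
    def dist(v, c):
        return sum((a - b) ** 2 for a, b in zip(v, c))

    # Farthest-point initialization keeps the sketch deterministic
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        centroids.append(
            list(max(vectors, key=lambda v: min(dist(v, c) for c in centroids)))
        )

    clusters = []
    for _ in range(iterations):
        # Assign each vector to its nearest centroid
        clusters = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            clusters[min(range(k), key=lambda c: dist(v, centroids[c]))].append(i)
        # Move each centroid to the mean of its members
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(len(vectors[0]))
                ]
    return clusters

# Two obvious groups of toy 2-D "embeddings"
vectors = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)]
clusters = kmeans(vectors, 2)
```

Real embedding vectors have hundreds or thousands of dimensions rather than two, but the grouping logic is the same.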

The output should look something like this (truncated):

[
  {
    "id": "2",
    "items": [
      {
        "id": "1650662628",
        "content": "Initial design"
      },
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      }
    ]
  },
  {
    "id": "4",
    "items": [
      {
        "id": "1650760699",
        "content": "llm web command - launches a web server"
      },
      {
        "id": "1759659476",
        "content": "`llm models` command"
      },
      {
        "id": "1784156919",
        "content": "`llm.get_model(alias)` helper"
      }
    ]
  },
  {
    "id": "7",
    "items": [
      {
        "id": "1650765575",
        "content": "--code mode for outputting code"
      },
      {
        "id": "1659086298",
        "content": "Accept PROMPT from --stdin"
      },
      {
        "id": "1714651657",
        "content": "Accept input from standard in"
      }
    ]
  }
]

The content displayed is truncated to 100 characters. Pass --truncate 0 to disable truncation, or --truncate X to truncate to X characters.
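The truncation rule is simple: cut each item's content to N characters, with 0 meaning no cut at all. A one-line Python equivalent (the helper name is ours, not the plugin's):

```python
def truncate(text, limit=100):
    # limit=0 disables truncation, mirroring --truncate 0
    return text if limit == 0 else text[:limit]
```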

Generating summaries for each cluster

The --summary flag will cause the plugin to generate a summary for each cluster, by passing the content of the items (truncated according to the --truncate option) through a prompt to a Large Language Model.

This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.

Since this can run a large amount of text through an LLM, it can be expensive, depending on which model you are using.

This feature only works for embeddings that have had their associated content stored in the database using the --store flag.

You can use it like this:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary

This uses the default prompt and the default model.
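Summarization amounts to assembling the (truncated) item contents for each cluster and sending them to the model alongside the prompt. The sketch below shows only the assembly step; the helper name and the final model call are assumptions, since the real plugin goes through llm's own API:

```python
def build_summary_input(items, truncate=100):
    # Join each item's stored content, truncated, one per line;
    # items without stored content (no --store) are skipped
    return "\n".join(item["content"][:truncate] for item in items if item.get("content"))

cluster = {
    "id": "5",
    "items": [
        {"id": "1650682379", "content": "Log prompts and responses to SQLite"},
        {"id": "1650757081", "content": "Command for browsing captured logs"},
    ],
}
prompt = "Short, concise title for this cluster of related documents."
# In the plugin, model_input would be sent to the chosen model with the
# prompt above; here we only show the assembled input.
model_input = build_summary_input(cluster["items"])
```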

Partial example output:

[
  {
    "id": "5",
    "items": [
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      },
      {
        "id": "1650757081",
        "content": "Command for browsing captured logs"
      }
    ],
    "summary": "Log Management and Interactive Prompt Tracking"
  },
  {
    "id": "6",
    "items": [
      {
        "id": "1650771320",
        "content": "Mechanism for continuing an existing conversation"
      },
      {
        "id": "1740090291",
        "content": "-c option for continuing a chat (using new chat_id column)"
      },
      {
        "id": "1784122278",
        "content": "Figure out truncation strategy for continue conversation mode"
      }
    ],
    "summary": "Continuing Conversation Mechanism and Management"
  }
]

To use a different model, e.g. GPT-4, pass the --model option:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4

The default prompt used is:

Short, concise title for this cluster of related documents.

To use a custom prompt, pass --prompt:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4 \
  --prompt 'Summarize this in a short line in the style of a bored, angry panda'

A "summary" key will be added to each cluster, containing the generated summary.

Suggested labels

None

ShellLM commented Jul 3, 2024

Related content

#325 similarity score: 0.91
#62 similarity score: 0.88
#678 similarity score: 0.87
#105 similarity score: 0.87
#6 similarity score: 0.87
#750 similarity score: 0.87
