feat: Embedding Model for Databend #10689

Open
4 of 6 tasks
BohuTANG opened this issue Mar 21, 2023 · 7 comments

@BohuTANG
Member

BohuTANG commented Mar 21, 2023

Summary

Tasks

Introduction

An embedding model is designed to map high-dimensional data into a lower-dimensional vector space, which facilitates various applications such as NLP, recommendation systems, and anomaly detection.

Obtaining Embedding Vectors with OpenAI API

To extract embedding vectors using the OpenAI API, call the embeddings endpoint with one of OpenAI's pre-trained embedding models. Below is a Python example:

import openai

openai.api_key = "your_openai_api_key"

def get_embedding(text, model="text-embedding-ada-002"):
    # Call the embeddings endpoint (openai<1.0 interface). A completion
    # model would return generated text, not a vector.
    response = openai.Embedding.create(model=model, input=text)
    # The vector is a list of floats under data[0].embedding.
    return response["data"][0]["embedding"]

text = "Databend warehouse"
embedding = get_embedding(text)
print(embedding)

Storing Embedding Vectors in Databend

To store the embedding vectors returned by the OpenAI API in Databend, create a table with a column of type VECTOR (an alias for ARRAY(FLOAT32) that can carry an IVF PQ index) to hold the vectors. Assuming you have connected to a Databend instance:

CREATE TABLE embeddings (
    id INT,
    text VARCHAR NOT NULL,
    vector VECTOR NOT NULL
);

Computing the Distance Between Vectors in Databend

Databend can compute the distance between a query vector and stored vectors using a built-in function called cosine_distance. This function calculates the distance between two ARRAY(FLOAT32) inputs and can be used directly in SQL queries.
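To make the metric concrete, here is a minimal Python sketch of cosine distance, assuming it is defined as 1 minus cosine similarity (an assumption about the built-in function, not taken from Databend's implementation). It uses the sample vectors from the example queries in this proposal.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Brute-force nearest neighbor: compare the query against every stored row.
query = [0.11, 0.33, -0.55, 0.77]
rows = [
    (1, "Databend warehouse", [0.12, 0.34, -0.56, 0.78]),
    (2, "Data warehouse", [-0.15, 0.37, 0.29, -0.22]),
]
nearest = min(rows, key=lambda row: cosine_distance(row[2], query))
print(nearest[0], nearest[1])
```

The scan touches every row, which is the cost the indexing techniques in this proposal aim to avoid.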

However, calculating vector distance for every pair of vectors becomes computationally expensive and slow with large-scale datasets and high-dimensional vectors. To tackle this issue, we propose the following techniques:

  • Inverted File (IVF) Index: An inverted file is an index data structure that maps words or terms to their locations in a set of documents. Within a vector database, it stores a mapping from a set of quantized vectors to their locations. An inverted file enables fast and memory-efficient search for approximate nearest neighbors.
  • Product Quantization (PQ) Index: Product Quantization is a vector compression technique that reduces memory footprint and computational cost while searching for nearest neighbors in high-dimensional spaces. PQ quantizes the original vector space into a Cartesian product of multiple lower-dimensional subspaces, compressing each high-dimensional vector into a compact code by quantizing its sub-vectors and concatenating the quantization indices. This enables efficient and approximate distance computation between compressed vectors.
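As a rough illustration of the product quantization step, the sketch below encodes a vector against small hand-picked codebooks and computes an approximate distance from a raw query to the compressed code. The codebooks, the function names `pq_encode` and `pq_distance`, and the toy dimensions are all hypothetical; a real index would learn the codebooks (typically with k-means) from the data.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_encode(vector, codebooks):
    """Split the vector into len(codebooks) sub-vectors and return, for each,
    the index of the nearest centroid in the matching codebook."""
    sub_dim = len(vector) // len(codebooks)
    code = []
    for i, codebook in enumerate(codebooks):
        sub = vector[i * sub_dim:(i + 1) * sub_dim]
        code.append(min(range(len(codebook)),
                        key=lambda k: squared_l2(sub, codebook[k])))
    return code

def pq_distance(query, code, codebooks):
    """Approximate squared L2 distance between a raw query and a PQ code."""
    sub_dim = len(query) // len(codebooks)
    return sum(
        squared_l2(query[i * sub_dim:(i + 1) * sub_dim], codebooks[i][code[i]])
        for i in range(len(codebooks))
    )

# Hypothetical codebooks: 2 subspaces of dimension 2, 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [-1.0, 1.0]],
]
vector = [0.9, 1.1, -0.8, 0.7]
code = pq_encode(vector, codebooks)
print(code)  # one centroid index per subspace
```

The stored code holds one small integer per subspace instead of the original floats, which is where the memory saving comes from.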

The IVF PQ index combines these techniques: a coarse quantizer first assigns each database vector to an inverted list, and product quantization then compresses the vectors stored in each list. At query time only the lists closest to the query are scanned, and distances are computed against the compact codes. This approach allows for a fast and memory-efficient search of approximate nearest neighbors in high-dimensional vector spaces, which is particularly beneficial in large-scale multimedia retrieval systems.
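The inverted-file part of the search can be sketched as follows. This is a simplified illustration, not Databend's design: the centroids are hand-picked, `build_ivf`, `ivf_search`, and the `nprobe` parameter are hypothetical names, and the lists hold raw vectors for brevity where a full IVF PQ index would hold product-quantized codes.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: squared_l2(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in lists[c]]
    if not candidates:
        return None
    return min(candidates, key=lambda vec_id: squared_l2(query, vectors[vec_id]))

# Hypothetical coarse centroids and a tiny dataset.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {1: [0.5, 0.2], 2: [1.0, 1.5], 3: [9.0, 9.5], 4: [10.5, 11.0]}
lists = build_ivf(vectors, centroids)
result = ivf_search([9.2, 9.4], vectors, centroids, lists)
print(lists, result)
```

With `nprobe=1` only half of the toy dataset is scanned. Raising `nprobe` trades speed for recall, since the true nearest neighbor may sit in a list that was not probed.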

Example SQL Queries

CREATE TABLE embeddings (
    id INT,
    text VARCHAR NOT NULL,
    vector VECTOR NOT NULL
);

Insert sample data

INSERT INTO embeddings (id, text, vector) VALUES
(1, 'Databend warehouse', ARRAY[0.12, 0.34, -0.56, 0.78]),
(2, 'Data warehouse', ARRAY[-0.15, 0.37, 0.29, -0.22]);

Query

WITH query_vector AS (
    SELECT ARRAY[0.11, 0.33, -0.55, 0.77] AS vector
)
SELECT id, text, cosine_distance(vector, query_vector.vector) AS distance
FROM embeddings, query_vector
ORDER BY distance ASC
LIMIT 1;
@BohuTANG BohuTANG added the C-feature Category: feature label Mar 21, 2023
@mokeyish

vector_distance: Similarity Metrics

@BohuTANG
Member Author

From openai doc:
https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use

We recommend [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). The choice of distance function typically doesn’t matter much.

OpenAI embeddings are normalized to length 1, which means that:

  • Cosine similarity can be computed slightly faster using just a dot product
  • Cosine similarity and Euclidean distance will result in the identical rankings
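A small Python check of the two points above, as a sketch: for unit-length vectors a and b, ||a - b||^2 = 2 - 2(a · b), so ranking by descending dot product and ranking by ascending Euclidean distance give the same order. The vectors are the sample values from this issue, normalized here by hand; the helper names are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = normalize([0.11, 0.33, -0.55, 0.77])
docs = {
    "Databend warehouse": normalize([0.12, 0.34, -0.56, 0.78]),
    "Data warehouse": normalize([-0.15, 0.37, 0.29, -0.22]),
}

# On unit vectors the dot product equals cosine similarity.
by_dot = sorted(docs, key=lambda name: -dot(docs[name], query))
by_l2 = sorted(docs, key=lambda name: euclidean(docs[name], query))
print(by_dot == by_l2)
```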

@BohuTANG BohuTANG changed the title from "feat: Embedding Model Proposal for Databend(By ChatGPT4)" to "feat: Embedding Model for Databend" Mar 25, 2023
@BohuTANG BohuTANG reopened this Mar 27, 2023
@BohuTANG BohuTANG self-assigned this Mar 27, 2023
@thatcort

thatcort commented Aug 7, 2023

When is the vector index feature expected to be complete?

@thatcort thatcort mentioned this issue Aug 7, 2023
9 tasks
@BohuTANG
Member Author

BohuTANG commented Aug 8, 2023

When is the vector index feature expected to be complete?

Indeed, there is already a PR: #11318

But there is still a lot of work to do.
In our users' cases so far, the data is not large, so we made this ticket low priority.

@thatcort

thatcort commented Aug 8, 2023

I'm considering Databend for querying over large data sets of text and vectors. Vector indexing would allow replacing the current vector DB and save a lot of money by using object storage. Would be great if you raised the priority of that feature!

@BohuTANG
Member Author

BohuTANG commented Aug 8, 2023

Thank you for your explanation. We will raise the priority of this feature, but there is still no definite expected time, as there are many higher-priority tasks that need to be completed.

@thatcort

Another library worth looking at for vector ANN (approximate nearest neighbor) support is USearch: https://unum-cloud.github.io/usearch/
