Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #3971: Check how to integrate vector databases via rest APIs #4059

Merged
merged 8 commits into from
May 27, 2024

Conversation

vga91
Copy link
Collaborator

@vga91 vga91 commented May 2, 2024

Fixes #3971

Changes

Emulate the https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/ commands.

Neo4j Vector Index Vector database correspondent
CREATE VECTOR INDEX apoc.vectordb.qdrant.createCollection
DROP VECTOR INDEX apoc.vectordb.qdrant.deleteCollection
add vector node/rel apoc.vectordb.qdrant.upsert
CALL db.index.vector.queryNodes / CALL db.index.vector.queryRelationships apoc.vectordb.qdrant.get and apoc.vectordb.qdrant.query
Delete vector node/rel apoc.vectordb.qdrant.delete
  • the same for the ChromaDb procedures.
  • the same for the Weaviate procedures

NOTE: Like the apoc.ml ones, the chroma, qdrand and weaviate procedures are implemented in such a way that they have the same signature, even though under the hood they have different bodies/methods/etc.

  • Added 2 custom procedures apoc.vectordb.qdrant.get and apoc.vectordb.custom to handle other vector databases (like Pinecone tested in PineconeTest).

Using the apoc.vectordb.*.get and apoc.vectordb.*.query procedures, we can auto-create neo4j vector indexes and entities, using the mapping config.

NOTE: by default, with the apoc.vectordb.*get and apoc.vectordb.*query only score, metatada and entity are retrieved, to get also other results, we have to set the config allResults: true.

To evaluate

  • apoc.vectordb.custom could be changed to a more generic naming, e.g. apoc.restapi.custom(<conf>), since it could be used with other rest APIs
  • move RestAPIConfig to util package

Additional notes (after PR merge)

  • Open a follow-up issue:

  • Test / custom procedures with other databases (like Pinecone)

  • Added trello Core card: problem with Pinecone, create a PR after neo4j-contrib PR creation...
    We cannot execute Pinecone fetch API with method: "", due to these 2 pieces of apoc core codes:

    • setDoOutput(true)
    • http.setChunkedStreamingMode(1024 * 1024);
      In both cases, we receive a 200OK, but with no results.

@Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception {
var config = new HashMap<>(configuration);

String qdrantUrl = getChromaUrl(hostOrKey);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy & paste typo - not qdrant :)

@Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception {
var config = new HashMap<>(configuration);

String qdrantUrl = getChromaUrl(hostOrKey);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

public URLAccessChecker urlAccessChecker;

@Procedure("apoc.vectordb.chroma.createCollection")
@Description("apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a bit better descriptions (for all the procedures), not just the signature again? otherwise the apoc.help output is not really informative if it shows the same content twice without a human description

}

private static Entity handleMappingNode(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) {
String query = "CREATE CONSTRAINT IF NOT EXISTS FOR (n:%s) REQUIRE n.%s IS UNIQUE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you test that you can run both the constraint as well as the data creation operation in the same tx?

shouldn't we leave that to the user to create the constraint, otherwise it would do it for every entity

transaction.commit();
}

String setVectorQuery = "CALL db.create.setNodeVectorProperty($entity, $key, $vector)";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can set the property to a float array ourselves, no need to call cypher here.


private static Entity handleMappingRel(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) {
try {
String query = "CREATE CONSTRAINT IF NOT EXISTS FOR ()-[r:%s]-() REQUIRE (r.%s) IS UNIQUE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above I don't think we need to do that

// in this case we cannot auto-create the rel, since we should have to define start and end node as well
Relationship rel;
try (Transaction transaction = db.beginTx()) {
Object propValue = metaProps.remove(mapping.getId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we really remove the mapping-id ? if we later return the metadata that's missing?

try (Transaction transaction = db.beginTx()) {
Object propValue = metaProps.remove(mapping.getId());
rel = transaction.findRelationship(RelationshipType.withName(mapping.getType()), mapping.getProp(), propValue);
if (rel != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this not only happen when "create: true" is set?

transaction.commit();
}

String setVectorQuery = "CALL db.create.setRelationshipVectorProperty($entity, $key, $vector)";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can set the float array property in the same tx above

node = transaction.createNode(Label.label(mapping.getLabel()));
node.setProperty(mapping.getProp(), propValue);
}
if (node != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why to we write properties if create is not set to true? then we should just return the found node

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only populate a node when create is true
alternatively we could have 3 modes (create / update / read) with read the default

try {
Node node;
try (Transaction transaction = db.beginTx()) {
Object propValue = metaProps.remove(mapping.getId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as below we should not remove the mapping id from the metadata

}

db.executeTransactionally(setVectorQuery,
Map.of("entity", Util.rebind(tx, entity), "key", mapping.getEmbeddingProp(), "vector", embedding));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure to turn the double list into a float array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and just set the float array directly as property


APOC provides these set of procedures, which leverages the Rest APIs, to interact with Vector Databases:

- `apoc.vectordb.qdrant.*` (to interact with https://qdrant.tech/documentation/overview/[Qdrant])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add pinecone to docs

* @param entity we cannot declare entity with class Entity,
* as an error `cannot be converted to a Neo4j type: Don't know how to map `org.neo4j.graphdb.Entity` to the Neo4j Type` would be thrown
*/
public record EmbeddingResult(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we have two fields one for Node and one for Relationship
where one or the other is null?

otherwise Cypher cannot do anything with that Object result and you have to first call convert.toNode which would be really annoying.

enum Type {
CHROMA(new ChromaEmbeddingType()),
QDRANT(new QdrantEmbeddingType()),
WEAVIATE(new WeaviateEmbeddingType());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinecone?

import static org.junit.Assert.assertTrue;

/**
* It leverages `apoc.vectordb.custom*` procedures
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we have a dedicated pinecone procedures set?

Copy link
Member

@jexp jexp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments

@vga91 vga91 marked this pull request as draft May 17, 2024 16:32
@vga91 vga91 force-pushed the issue-3971 branch 3 times, most recently from 9ff02b4 to 9a8f108 Compare May 20, 2024 07:13

@Procedure(value = "apoc.vectordb.chroma.query", mode = Mode.SCHEMA)
@Description("apoc.vectordb.chroma.query(hostOrKey, collection, vector, filter, limit, $configuration) - Retrieve closest vectors the the defined `vector`, `limit` of results, in the collection with the name specified in the 2nd parameter")
public Stream<EmbeddingResult> query(@Name("hostOrKey") String hostOrKey,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, that comment for the query procedure was meant for here:

I think if we should move the write behavior into a separate method, like queryAndUpdate or so? or updateGraphFromQuery ? and keep the query method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed Mode.SCHEMA, I had accidentally left it in that initially the procedure also auto-created the vector indexes in neo4j, I removed it now.
And added procedures queryAndUpdate with WRITE mode

v -> listOfListsToMap((Map) v).stream());
}

private Map<String, Object> getVectorDbInfo(String hostOrKey, String collection, Map<String, Object> configuration, String templateUrl) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great if we could expose this getVectorDbInfo in a procedure call for each of the databases to get an overview what's in there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added procedures with names apoc.vectordb.<type>.info

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what I meant was not the configuration stored on the neo4j side, but rather the metadata (which collections and sizes are available in the vector db) but we can also do that post 5.20

* and mapping data to auto-create neo4j vector indexes and properties
*/
@Procedure(value = "apoc.vectordb.custom.get", mode = Mode.SCHEMA)
@Description("apoc.vectordb.custom.get(host, $configuration) - Customizable get / query procedure")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

little bit more detail in the description?

throw new RuntimeException(embeddingErrMsg);
}

entity.setProperty(mapping.getEmbeddingProp(), embedding.stream()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just do a utility method that creates a float array of the list size and uses a for loop over the list to set the values. then the JVM can also optimize that to SIMD. I don't think that the streams are efficient here.

// -- implementations
//

class QdrantEmbeddingHandler implements VectorEmbeddingHandler {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should move these implementations closer to where the vector databases are? either into the procedures file or an associated file? Otherwise we have to update this file whenever we add a new db?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved

@vga91 vga91 force-pushed the issue-3971 branch 4 times, most recently from b4af8cb to ae0152a Compare May 24, 2024 23:30
[NOTE]
====
To optimize performances, we can choose what to `YIELD` with the apoc.vectordb.qdrant.query and the `apoc.vectordb.qdrant.get` procedures.
For example, by executing a `CALL apoc.vectordb.chroma.query(...) YIELD metadata, score, id`, the RestAPI request will have an {"include": ["metadatas", "documents", "distances"]},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but this is a Chroma db stuff, which use metadatas as a key: https://docs.trychroma.com/getting-started#6.-inspect-results


See the following pages for more details on specific vector db procedures

- xref:./qdrand.adoc[Qdrant]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in filename?

})
----

We can get the current configuration by executing the following procedure:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not expose the stored secrets.

|===


which, in case of configuration key not found, just returns the baseUrl, for example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not expose the stored secrets.

Retrieve closest vectors from the defined `vector`, `limit` of results, in the collection with the name specified in the 2nd parameter, and optionally creates/updates neo4j entities.
Note that, besides the common config parameters, this procedure requires a `field: [listOfProperty]` config, to define which properties are to be retrieved from GraphQL running under-the-hood.
The default endpoint is `<hostOrKey param>/graphql`.
| apoc.vectordb.weaviate.info(keyConfig) | Given the `keyConfig` returns the current configuration, created with the `apoc.vectordb.configure('WEAVIATE', keyConfig, ...)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not expose the stored secrets.


[source,cypher]
----
CALL apoc.vectordb.weaviate.query($host, 'test_collection',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's separate that into an extra procedure, e.g. updateGraph, so that the query method can remain read-only.

public static Stream<MapResult> getInfoProcCommon(String hostOrKey, VectorDbHandler handler) {
Map<String, Object> info = getCommonVectorDbInfo(hostOrKey, "", Map.of(), "%s", handler);
// endpoint is equivalent to baseUrl config
info.remove("endpoint");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also remove "headers" ? so that the bearer token goes away

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or other api-keys or credentials entires.
perhaps better to just keySet().retainAll(asList("keyConfig", ...)) ?

@vga91 vga91 force-pushed the issue-3971 branch 2 times, most recently from fb04bf8 to 19ef4ca Compare May 27, 2024 09:32
@vga91 vga91 merged commit 89d167b into dev May 27, 2024
5 checks passed
@vga91 vga91 deleted the issue-3971 branch May 27, 2024 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants