Fixes #3971: Check how to integrate vector databases via rest APIs #4059

vga91 · 2024-05-02T09:41:01Z

Changes

Created ad-hoc procedures for Chroma, Qdrant and Weaviate.

Emulate the https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/ commands.

| Neo4j Vector Index | Vector database correspondent |
|---|---|
| `CREATE VECTOR INDEX` | `apoc.vectordb.qdrant.createCollection` |
| `DROP VECTOR INDEX` | `apoc.vectordb.qdrant.deleteCollection` |
| Add vector node/rel | `apoc.vectordb.qdrant.upsert` |
| `CALL db.index.vector.queryNodes` / `CALL db.index.vector.queryRelationships` | `apoc.vectordb.qdrant.get` and `apoc.vectordb.qdrant.query` |
| Delete vector node/rel | `apoc.vectordb.qdrant.delete` |

The same mapping applies to the ChromaDb procedures and to the Weaviate procedures.

NOTE: Like the apoc.ml ones, the Chroma, Qdrant and Weaviate procedures are implemented so that they share the same signatures, even though under the hood they have different bodies, methods, etc.

Added 2 custom procedures, apoc.vectordb.custom.get and apoc.vectordb.custom, to handle other vector databases (such as Pinecone, tested in PineconeTest).

Using the apoc.vectordb.*.get and apoc.vectordb.*.query procedures, we can auto-create neo4j vector indexes and entities, using the mapping config.

NOTE: By default, the apoc.vectordb.*.get and apoc.vectordb.*.query procedures retrieve only score, metadata and entity. To also get the other results, we have to set the config allResults: true.

To evaluate

apoc.vectordb.custom could be changed to a more generic naming, e.g. apoc.restapi.custom(<conf>), since it could be used with other rest APIs
move RestAPIConfig to util package

Additional notes (after PR merge)

Open a follow-up issue:
Test / custom procedures with other databases (like Pinecone)
Added trello Core card: problem with Pinecone, create a PR after neo4j-contrib PR creation...
We cannot execute the Pinecone fetch API with method: "", due to these 2 pieces of APOC Core code:
- setDoOutput(true)
- http.setChunkedStreamingMode(1024 * 1024);
  In both cases, we receive a 200 OK, but with no results.

jexp · 2024-05-15T22:09:57Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+                                    @Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception {
+        var config = new HashMap<>(configuration);
+
+        String qdrantUrl = getChromaUrl(hostOrKey);


copy & paste typo - not qdrant :)

jexp · 2024-05-15T22:11:03Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+            @Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception {
+        var config = new HashMap<>(configuration);
+
+        String qdrantUrl = getChromaUrl(hostOrKey);


jexp · 2024-05-15T22:12:31Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+    public URLAccessChecker urlAccessChecker;
+
+    @Procedure("apoc.vectordb.chroma.createCollection")
+    @Description("apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config)")


can we have a bit better descriptions (for all the procedures), not just the signature again? otherwise the apoc.help output is not really informative if it shows the same content twice without a human description

jexp · 2024-05-15T22:17:14Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+    }
+
+    private static Entity handleMappingNode(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) {
+        String query = "CREATE CONSTRAINT IF NOT EXISTS FOR (n:%s) REQUIRE n.%s IS UNIQUE"


did you test that you can run both the constraint as well as the data creation operation in the same tx?

shouldn't we leave that to the user to create the constraint, otherwise it would do it for every entity

jexp · 2024-05-15T22:18:22Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+                transaction.commit();
+            }
+
+            String setVectorQuery = "CALL db.create.setNodeVectorProperty($entity, $key, $vector)";


we can set the property to a float array ourselves, no need to call cypher here.

jexp · 2024-05-15T22:19:00Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+
+    private static Entity handleMappingRel(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) {
+        try {
+            String query = "CREATE CONSTRAINT IF NOT EXISTS FOR ()-[r:%s]-() REQUIRE (r.%s) IS UNIQUE"


same as above I don't think we need to do that

jexp · 2024-05-15T22:20:05Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+            // in this case we cannot auto-create the rel, since we should have to define start and end node as well
+            Relationship rel;
+            try (Transaction transaction = db.beginTx()) {
+                Object propValue = metaProps.remove(mapping.getId());


should we really remove the mapping-id ? if we later return the metadata that's missing?
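
To illustrate the concern, here is a minimal sketch (not the PR's code) contrasting `Map.get` with `Map.remove` on a metadata map. The class name, method names and the `myId` key are hypothetical and only serve the example.

```java
import java.util.Map;

final class MetadataIdLookup {
    private MetadataIdLookup() {}

    /** Reads the mapped id and leaves the metadata map unchanged. */
    static Object peekId(Map<String, Object> metaProps, String idKey) {
        return metaProps.get(idKey);
    }

    /** Mirrors the reviewed pattern: returns the id and drops the entry from the map. */
    static Object takeId(Map<String, Object> metaProps, String idKey) {
        return metaProps.remove(idKey);
    }
}
```

With `peekId`, a metadata map that is returned to the caller later still contains the id entry; with `takeId`, that entry is gone from the returned metadata.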

jexp · 2024-05-15T22:20:24Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+            try (Transaction transaction = db.beginTx()) {
+                Object propValue = metaProps.remove(mapping.getId());
+                rel = transaction.findRelationship(RelationshipType.withName(mapping.getType()), mapping.getProp(), propValue);
+                if (rel != null) {


should this not only happen when "create: true" is set?

jexp · 2024-05-15T22:20:49Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+                transaction.commit();
+            }
+
+            String setVectorQuery = "CALL db.create.setRelationshipVectorProperty($entity, $key, $vector)";


we can set the float array property in the same tx above

jexp · 2024-05-15T22:21:24Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+                    node = transaction.createNode(Label.label(mapping.getLabel()));
+                    node.setProperty(mapping.getProp(), propValue);
+                }
+                if (node != null) {


why do we write properties if create is not set to true? then we should just return the found node

I think we should only populate a node when create is true
alternatively we could have 3 modes (create / update / read) with read the default
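
A possible shape for the three-mode idea, as a sketch only: the `MappingMode` enum and the `mode` config key are assumptions made for illustration and are not part of the PR.

```java
import java.util.Locale;
import java.util.Map;

enum MappingMode {
    READ, UPDATE, CREATE;

    /**
     * Resolves the mode from a config map, defaulting to READ when the
     * (hypothetical) "mode" key is absent. Unknown values make valueOf
     * throw an IllegalArgumentException.
     */
    static MappingMode fromConfig(Map<String, Object> config) {
        Object raw = config.get("mode");
        if (raw == null) {
            return READ;
        }
        return valueOf(raw.toString().trim().toUpperCase(Locale.ROOT));
    }

    /** UPDATE and CREATE may write properties on entities that were found. */
    boolean mayWriteProperties() {
        return this != READ;
    }

    /** Only CREATE may create entities that do not exist yet. */
    boolean mayCreateEntities() {
        return this == CREATE;
    }
}
```

With READ as the default, a call that passes no mode would only return the entities it finds and would not write anything.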

jexp · 2024-05-15T22:21:47Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+        try {
+            Node node;
+            try (Transaction transaction = db.beginTx()) {
+                Object propValue = metaProps.remove(mapping.getId());


as below we should not remove the mapping id from the metadata

jexp · 2024-05-15T22:22:57Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+        }
+
+        db.executeTransactionally(setVectorQuery,
+                Map.of("entity", Util.rebind(tx, entity), "key", mapping.getEmbeddingProp(), "vector", embedding));


make sure to turn the double list into a float array

and just set the float array directly as property

jexp · 2024-05-15T22:26:10Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb.adoc

+
+APOC provides these set of procedures, which leverages the Rest APIs, to interact with Vector Databases:
+
+- `apoc.vectordb.qdrant.*` (to interact with https://qdrant.tech/documentation/overview/[Qdrant])


add pinecone to docs

jexp · 2024-05-15T22:28:01Z

extended/src/main/java/apoc/vectordb/VectorDbUtil.java

+     * @param entity we cannot declare entity with class Entity, 
+     *               as an error `cannot be converted to a Neo4j type: Don't know how to map `org.neo4j.graphdb.Entity` to the Neo4j Type` would be thrown
+     */
+    public record EmbeddingResult(


could we have two fields one for Node and one for Relationship
where one or the other is null?

otherwise Cypher cannot do anything with that Object result and you have to first call convert.toNode which would be really annoying.

jexp · 2024-05-15T22:28:36Z

extended/src/main/java/apoc/vectordb/VectorEmbedding.java

+    enum Type {
+        CHROMA(new ChromaEmbeddingType()),
+        QDRANT(new QdrantEmbeddingType()),
+        WEAVIATE(new WeaviateEmbeddingType());


jexp · 2024-05-15T22:29:46Z

extended/src/test/java/apoc/vectordb/PineconeTest.java

+import static org.junit.Assert.assertTrue;
+
+/**
+ * It leverages `apoc.vectordb.custom*` procedures


shouldn't we have a dedicated pinecone procedures set?

jexp

Please see my comments

jexp · 2024-05-24T07:30:09Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+
+    @Procedure(value = "apoc.vectordb.chroma.query", mode = Mode.SCHEMA)
+    @Description("apoc.vectordb.chroma.query(hostOrKey, collection, vector, filter, limit, $configuration) - Retrieve closest vectors the the defined `vector`, `limit` of results,  in the collection with the name specified in the 2nd parameter")
+    public Stream<EmbeddingResult> query(@Name("hostOrKey") String hostOrKey,


sorry, that comment for the query procedure was meant for here:

I think we should move the write behavior into a separate method, like queryAndUpdate or so? or updateGraphFromQuery ? and keep the query method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.

Removed Mode.SCHEMA. I had accidentally left it in from when the procedure also auto-created the vector indexes in Neo4j.
I also added queryAndUpdate procedures with WRITE mode.

jexp · 2024-05-24T07:31:22Z

extended/src/main/java/apoc/vectordb/ChromaDb.java

+                v -> listOfListsToMap((Map) v).stream());
+    }
+
+    private Map<String, Object> getVectorDbInfo(String hostOrKey, String collection, Map<String, Object> configuration, String templateUrl) {


it would be great if we could expose this getVectorDbInfo in a procedure call for each of the databases to get an overview what's in there.

added procedures with names apoc.vectordb.<type>.info

Sorry, what I meant was not the configuration stored on the neo4j side, but rather the metadata (which collections and sizes are available in the vector db) but we can also do that post 5.20

jexp · 2024-05-24T07:32:34Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+     * and mapping data to auto-create neo4j vector indexes and properties
+     */
+    @Procedure(value = "apoc.vectordb.custom.get", mode = Mode.SCHEMA)
+    @Description("apoc.vectordb.custom.get(host, $configuration) - Customizable get / query procedure")


little bit more detail in the description?

jexp · 2024-05-24T07:36:19Z

extended/src/main/java/apoc/vectordb/VectorDb.java

+            throw new RuntimeException(embeddingErrMsg);
+        }
+
+        entity.setProperty(mapping.getEmbeddingProp(), embedding.stream()


I think we should just do a utility method that creates a float array of the list size and uses a for loop over the list to set the values. then the JVM can also optimize that to SIMD. I don't think that the streams are efficient here.
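
Such a utility method could look roughly like the sketch below. The class and method names are hypothetical, and nothing here is taken from the PR itself.

```java
import java.util.List;

final class EmbeddingArrays {
    private EmbeddingArrays() {}

    /** Copies a list of doubles into a primitive float array with a plain loop. */
    static float[] toFloatArray(List<Double> embedding) {
        float[] result = new float[embedding.size()];
        int i = 0;
        for (Double value : embedding) {
            // Narrowing conversion: double precision is reduced to float precision.
            result[i++] = value.floatValue();
        }
        return result;
    }
}
```

The resulting `float[]` can then be passed directly to `entity.setProperty(...)` in place of the stream-based conversion.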

jexp · 2024-05-24T07:39:51Z

extended/src/main/java/apoc/vectordb/VectorEmbeddingHandler.java

+    // -- implementations
+    //
+
+    class QdrantEmbeddingHandler implements VectorEmbeddingHandler {


I wonder if we should move these implementations closer to where the vector databases are? either into the procedures file or an associated file? Otherwise we have to update this file whenever we add a new db?

vga91 · 2024-05-26T09:51:50Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/chroma.adoc

+[NOTE]
+====
+To optimize performances, we can choose what to `YIELD` with the apoc.vectordb.qdrant.query and the `apoc.vectordb.qdrant.get` procedures.
+For example, by executing a `CALL apoc.vectordb.chroma.query(...) YIELD metadata, score, id`, the RestAPI request will have an {"include": ["metadatas", "documents", "distances"]},


Yeah, but this is Chroma-specific: Chroma uses `metadatas` as a key: https://docs.trychroma.com/getting-started#6.-inspect-results

jexp · 2024-05-27T07:23:05Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+
+See the following pages for more details on specific vector db procedures
+
+- xref:./qdrand.adoc[Qdrant]


typo in filename?

jexp · 2024-05-27T07:23:57Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+})
+---- 
+
+We can get the current configuration by executing the following procedure:


let's not expose the stored secrets.

jexp · 2024-05-27T07:24:24Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/index.adoc

+|===
+
+
+which, in case of configuration key not found, just returns the baseUrl, for example:


let's not expose the stored secrets.

jexp · 2024-05-27T07:24:40Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/weaviate.adoc

+    Retrieve closest vectors from the defined `vector`, `limit` of results, in the collection with the name specified in the 2nd parameter, and optionally creates/updates neo4j entities.
+    Note that, besides the common config parameters, this procedure requires a `field: [listOfProperty]` config, to define which properties are to be retrieved from GraphQL running under-the-hood.
+    The default endpoint is `<hostOrKey param>/graphql`.
+| apoc.vectordb.weaviate.info(keyConfig) | Given the `keyConfig` returns the current configuration, created with the `apoc.vectordb.configure('WEAVIATE', keyConfig, ...)`


let's not expose the stored secrets.

jexp · 2024-05-27T07:26:27Z

docs/asciidoc/modules/ROOT/pages/database-integration/vectordb/weaviate.adoc

+
+[source,cypher]
+----
+CALL apoc.vectordb.weaviate.query($host, 'test_collection',


Let's separate that into an extra procedure, e.g. updateGraph, so that the query method can remain read-only.

jexp · 2024-05-27T07:41:23Z

extended/src/main/java/apoc/vectordb/VectorDbUtil.java

+    public static Stream<MapResult> getInfoProcCommon(String hostOrKey, VectorDbHandler handler) {
+        Map<String, Object> info = getCommonVectorDbInfo(hostOrKey, "", Map.of(), "%s", handler);
+        // endpoint is equivalent to baseUrl config
+        info.remove("endpoint");


also remove "headers" ? so that the bearer token goes away

or other api-keys or credentials entries.
perhaps better to just keySet().retainAll(asList("keyConfig", ...)) ?
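
The allowlist approach could be sketched as follows. The `InfoSanitizer` class and the exact set of safe keys are assumptions for illustration; which keys are actually safe to return depends on what the stored configuration contains.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InfoSanitizer {
    // Hypothetical allowlist: any key not listed here is dropped.
    static final List<String> SAFE_KEYS = List.of("keyConfig", "baseUrl");

    private InfoSanitizer() {}

    /** Returns a copy of the info map that keeps only the allowlisted keys. */
    static Map<String, Object> withoutSecrets(Map<String, Object> info) {
        Map<String, Object> copy = new HashMap<>(info); // the caller's map stays untouched
        copy.keySet().retainAll(SAFE_KEYS);             // removes headers, tokens, api keys, ...
        return copy;
    }
}
```

An allowlist fails closed: a credential field added to the configuration later is hidden by default, whereas a list of `remove(...)` calls would have to be extended each time.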

vga91 force-pushed the issue-3971 branch from 0d89656 to 1d25ac5 Compare May 2, 2024 10:16

vga91 added extended-functionality dev labels May 2, 2024

vga91 force-pushed the issue-3971 branch 5 times, most recently from 6d57a89 to 4291f7a Compare May 8, 2024 12:26

vga91 force-pushed the issue-3971 branch from 4291f7a to 019ec21 Compare May 10, 2024 09:21

jexp reviewed May 15, 2024

jexp requested changes May 15, 2024

vga91 marked this pull request as draft May 17, 2024 16:32

vga91 force-pushed the issue-3971 branch 3 times, most recently from 9ff02b4 to 9a8f108 Compare May 20, 2024 07:13

jexp reviewed May 24, 2024

vga91 force-pushed the issue-3971 branch 4 times, most recently from b4af8cb to ae0152a Compare May 24, 2024 23:30

vga91 added 7 commits May 25, 2024 01:43

Fixes #3971: Check how to integrate vector databases via rest APIs

c94e0b3

fixed CI errors and removed unused imports

532b257

Changes review: added weaviate db, removed vector idx autocreation an…

b6c7461

…d vector as a default result

code clean

634cd24

Changes review: added systemdb store, removed constraint creation

47467dc

code clean

8f691b0

2nd changes review

d075a24

vga91 force-pushed the issue-3971 branch from ae0152a to d075a24 Compare May 24, 2024 23:43

vga91 commented May 26, 2024

jexp reviewed May 27, 2024

vga91 force-pushed the issue-3971 branch 2 times, most recently from fb04bf8 to 19ef4ca Compare May 27, 2024 09:32

fixed qdrant filename typo and removed info procs from docs

8f566ab

vga91 force-pushed the issue-3971 branch from 19ef4ca to 8f566ab Compare May 27, 2024 09:33

vga91 merged commit 89d167b into dev May 27, 2024
5 checks passed

vga91 deleted the issue-3971 branch May 27, 2024 13:32

