-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #3971: Check how to integrate vector databases via rest APIs #4059
Conversation
6d57a89
to
4291f7a
Compare
@Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception { | ||
var config = new HashMap<>(configuration); | ||
|
||
String qdrantUrl = getChromaUrl(hostOrKey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy & paste typo - not qdrant :)
@Name(value = "configuration", defaultValue = "{}") Map<String, Object> configuration) throws Exception { | ||
var config = new HashMap<>(configuration); | ||
|
||
String qdrantUrl = getChromaUrl(hostOrKey); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
public URLAccessChecker urlAccessChecker; | ||
|
||
@Procedure("apoc.vectordb.chroma.createCollection") | ||
@Description("apoc.vectordb.chroma.createCollection(hostOrKey, collection, similarity, size, $config)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a bit better descriptions (for all the procedures), not just the signature again? otherwise the apoc.help output is not really informative if it shows the same content twice without a human description
} | ||
|
||
private static Entity handleMappingNode(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) { | ||
String query = "CREATE CONSTRAINT IF NOT EXISTS FOR (n:%s) REQUIRE n.%s IS UNIQUE" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you test that you can run both the constraint as well as the data creation operation in the same tx?
shouldn't we leave that to the user to create the constraint, otherwise it would do it for every entity
transaction.commit(); | ||
} | ||
|
||
String setVectorQuery = "CALL db.create.setNodeVectorProperty($entity, $key, $vector)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can set the property to a float array ourselves, no need to call cypher here.
|
||
private static Entity handleMappingRel(Transaction tx, GraphDatabaseService db, VectorMappingConfig mapping, Map<String, Object> metaProps, List<Double> embedding) { | ||
try { | ||
String query = "CREATE CONSTRAINT IF NOT EXISTS FOR ()-[r:%s]-() REQUIRE (r.%s) IS UNIQUE" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above I don't think we need to do that
// in this case we cannot auto-create the rel, since we should have to define start and end node as well | ||
Relationship rel; | ||
try (Transaction transaction = db.beginTx()) { | ||
Object propValue = metaProps.remove(mapping.getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we really remove the mapping-id ? if we later return the metadata that's missing?
try (Transaction transaction = db.beginTx()) { | ||
Object propValue = metaProps.remove(mapping.getId()); | ||
rel = transaction.findRelationship(RelationshipType.withName(mapping.getType()), mapping.getProp(), propValue); | ||
if (rel != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this not only happen when "create: true" is set?
transaction.commit(); | ||
} | ||
|
||
String setVectorQuery = "CALL db.create.setRelationshipVectorProperty($entity, $key, $vector)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can set the float array property in the same tx above
node = transaction.createNode(Label.label(mapping.getLabel())); | ||
node.setProperty(mapping.getProp(), propValue); | ||
} | ||
if (node != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why to we write properties if create is not set to true? then we should just return the found node
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only populate a node when create is true
alternatively we could have 3 modes (create / update / read) with read the default
try { | ||
Node node; | ||
try (Transaction transaction = db.beginTx()) { | ||
Object propValue = metaProps.remove(mapping.getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as below we should not remove the mapping id from the metadata
} | ||
|
||
db.executeTransactionally(setVectorQuery, | ||
Map.of("entity", Util.rebind(tx, entity), "key", mapping.getEmbeddingProp(), "vector", embedding)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to turn the double list into a float array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and just set the float array directly as property
|
||
APOC provides these set of procedures, which leverages the Rest APIs, to interact with Vector Databases: | ||
|
||
- `apoc.vectordb.qdrant.*` (to interact with https://qdrant.tech/documentation/overview/[Qdrant]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add pinecone to docs
* @param entity we cannot declare entity with class Entity, | ||
* as an error `cannot be converted to a Neo4j type: Don't know how to map `org.neo4j.graphdb.Entity` to the Neo4j Type` would be thrown | ||
*/ | ||
public record EmbeddingResult( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we have two fields one for Node and one for Relationship
where one or the other is null?
otherwise Cypher cannot do anything with that Object result and you have to first call convert.toNode which would be really annoying.
enum Type { | ||
CHROMA(new ChromaEmbeddingType()), | ||
QDRANT(new QdrantEmbeddingType()), | ||
WEAVIATE(new WeaviateEmbeddingType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinecone?
import static org.junit.Assert.assertTrue; | ||
|
||
/** | ||
* It leverages `apoc.vectordb.custom*` procedures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we have a dedicated pinecone procedures set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comments
9ff02b4
to
9a8f108
Compare
|
||
@Procedure(value = "apoc.vectordb.chroma.query", mode = Mode.SCHEMA) | ||
@Description("apoc.vectordb.chroma.query(hostOrKey, collection, vector, filter, limit, $configuration) - Retrieve closest vectors the the defined `vector`, `limit` of results, in the collection with the name specified in the 2nd parameter") | ||
public Stream<EmbeddingResult> query(@Name("hostOrKey") String hostOrKey, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, that comment for the query procedure was meant for here:
I think if we should move the write behavior into a separate method, like queryAndUpdate
or so? or updateGraphFromQuery
? and keep the query method read-only, otherwise read-only users can't use it and accidental write behavior will be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed Mode.SCHEMA, I had accidentally left it in that initially the procedure also auto-created the vector indexes in neo4j, I removed it now.
And added procedures queryAndUpdate
with WRITE mode
v -> listOfListsToMap((Map) v).stream()); | ||
} | ||
|
||
private Map<String, Object> getVectorDbInfo(String hostOrKey, String collection, Map<String, Object> configuration, String templateUrl) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be great if we could expose this getVectorDbInfo
in a procedure call for each of the databases to get an overview what's in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added procedures with names apoc.vectordb.<type>.info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, what I meant was not the configuration stored on the neo4j side, but rather the metadata (which collections and sizes are available in the vector db) but we can also do that post 5.20
* and mapping data to auto-create neo4j vector indexes and properties | ||
*/ | ||
@Procedure(value = "apoc.vectordb.custom.get", mode = Mode.SCHEMA) | ||
@Description("apoc.vectordb.custom.get(host, $configuration) - Customizable get / query procedure") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
little bit more detail in the description?
throw new RuntimeException(embeddingErrMsg); | ||
} | ||
|
||
entity.setProperty(mapping.getEmbeddingProp(), embedding.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just do a utility method that creates a float array of the list size and uses a for loop over the list to set the values. then the JVM can also optimize that to SIMD. I don't think that the streams are efficient here.
// -- implementations | ||
// | ||
|
||
class QdrantEmbeddingHandler implements VectorEmbeddingHandler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should move these implementations closer to where the vector databases are? either into the procedures file or an associated file? Otherwise we have to update this file whenever we add a new db?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved
b4af8cb
to
ae0152a
Compare
…d vector as a default result
[NOTE] | ||
==== | ||
To optimize performances, we can choose what to `YIELD` with the apoc.vectordb.qdrant.query and the `apoc.vectordb.qdrant.get` procedures. | ||
For example, by executing a `CALL apoc.vectordb.chroma.query(...) YIELD metadata, score, id`, the RestAPI request will have an {"include": ["metadatas", "documents", "distances"]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but this is a Chroma db stuff, which use metadatas as a key: https://docs.trychroma.com/getting-started#6.-inspect-results
|
||
See the following pages for more details on specific vector db procedures | ||
|
||
- xref:./qdrand.adoc[Qdrant] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in filename?
}) | ||
---- | ||
|
||
We can get the current configuration by executing the following procedure: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not expose the stored secrets.
|=== | ||
|
||
|
||
which, in case of configuration key not found, just returns the baseUrl, for example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not expose the stored secrets.
Retrieve closest vectors from the defined `vector`, `limit` of results, in the collection with the name specified in the 2nd parameter, and optionally creates/updates neo4j entities. | ||
Note that, besides the common config parameters, this procedure requires a `field: [listOfProperty]` config, to define which properties are to be retrieved from GraphQL running under-the-hood. | ||
The default endpoint is `<hostOrKey param>/graphql`. | ||
| apoc.vectordb.weaviate.info(keyConfig) | Given the `keyConfig` returns the current configuration, created with the `apoc.vectordb.configure('WEAVIATE', keyConfig, ...)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not expose the stored secrets.
|
||
[source,cypher] | ||
---- | ||
CALL apoc.vectordb.weaviate.query($host, 'test_collection', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's separate that into an extra procedure, e.g. updateGraph, so that the query method can remain read-only.
public static Stream<MapResult> getInfoProcCommon(String hostOrKey, VectorDbHandler handler) { | ||
Map<String, Object> info = getCommonVectorDbInfo(hostOrKey, "", Map.of(), "%s", handler); | ||
// endpoint is equivalent to baseUrl config | ||
info.remove("endpoint"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also remove "headers" ? so that the bearer token goes away
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or other api-keys or credentials entires.
perhaps better to just keySet().retainAll(asList("keyConfig", ...))
?
fb04bf8
to
19ef4ca
Compare
Fixes #3971
Changes
Emulate the https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/ commands.
CREATE VECTOR INDEX
apoc.vectordb.qdrant.createCollection
DROP VECTOR INDEX
apoc.vectordb.qdrant.deleteCollection
apoc.vectordb.qdrant.upsert
CALL db.index.vector.queryNodes
/CALL db.index.vector.queryRelationships
apoc.vectordb.qdrant.get
andapoc.vectordb.qdrant.query
apoc.vectordb.qdrant.delete
NOTE: Like the
apoc.ml
ones, the chroma, qdrand and weaviate procedures are implemented in such a way that they have the same signature, even though under the hood they have different bodies/methods/etc.apoc.vectordb.qdrant.get
andapoc.vectordb.custom
to handle other vector databases (like Pinecone tested in PineconeTest).Using the
apoc.vectordb.*.get
andapoc.vectordb.*.query
procedures, we can auto-create neo4j vector indexes and entities, using the mapping config.NOTE: by default, with the
apoc.vectordb.*get
andapoc.vectordb.*query
only score, metatada and entity are retrieved, to get also other results, we have to set the configallResults: true
.To evaluate
apoc.vectordb.custom
could be changed to a more generic naming, e.g.apoc.restapi.custom(<conf>)
, since it could be used with other rest APIsRestAPIConfig
to util packageAdditional notes (after PR merge)
Open a follow-up issue:
Test / custom procedures with other databases (like Pinecone)
Added trello Core card: problem with Pinecone, create a PR after neo4j-contrib PR creation...
We cannot execute Pinecone fetch API with method: "", due to these 2 pieces of apoc core codes:
In both cases, we receive a 200OK, but with no results.