Grove
is a tree-based hierarchical implementation of vector databases. Given that much data is hierarchical, Grove is a database designed for such data where indices can search over subindices, allowing for recursive retrieval.
For example, suppose a use case of indexing research papers. Each node can be the vector database for individual papers and their keys would be the embedding of the abstract. At query time, the query is first compared to all the abstracts in the database before continuing the search on the most suitable paper.
Currently, Grove supports 3 different searching algorithms
- Hierarchical Navigable Small World (HNSW) search for the leaf indices which contain the actual data points
- Flat Search conducts a KNN search to find the top k nearest neighbours.
- SVM Search is an implementation of Karparthy's notebook.
from Grove.Indices import *
from Grove.Entry import BaseEntry
import numpy as np
DIM = 768
# Create a new index
root = FlatRootIndex("root", max_children=10) # Create database
# Create SVM leaf index at current level with 100 max elements
root.create_child("child1", SVMLeafIndex, max_children=100, loc="",
key = np.float32(np.random.rand(DIM)), dim = DIM)
# Create flat inner index at current level with 10 max children leaf indices
root.create_child("child2", FlatInnerIndex, max_children=10, loc = "",
key = np.float32(np.random.rand(DIM)))
# Create HNSW leaf index one level lower at child2 with 100 max elements
root.create_child("child2a", HNSWLeafIndex, max_children=100, loc = "child2", key = np.float32(np.random.rand(DIM)), dim = DIM)
root.insert_all([BaseEntry(np.float32(np.random.rand(DIM)),
metadata= {}) for i in range(100)], "child1") # insert into child 1
root.insert_all([BaseEntry(np.float32(np.random.rand(DIM)),
metadata= {}) for i in range(100)], "child2-child2a") # insert into child 2a
print(root.get_schema())
# FlatRootIndex(name = root, num_children = 2)
# FlatInnerIndex(name=child2, num_children=1, searchable = True)
# HNSWLeafIndex(name = child2a, dim = 768, current_count = 100, searchable = True)
# SVMLeafIndex(name = child1, dim = 768, current_count = 100, searchable = True)
query = np.float32(np.random.rand(DIM)) # Example query in the form of numpy array
path, results, distances = root.search(query, k = 5)
# print(path) # prints "child1"- -> searched inside child1
Grove
is structured around indices, it contains 3 types of indices: RootIndex
, InnerIndex
and LeafIndex
. Every database starts with a RootIndex
and contains at least one LeafIndex
which stores the actual data points.
BaseEntry
: Every data point in Grove
is wrapped by a BaseEntry
which stores each vector and contains metadata for each data point in the form of a python dictionary.
Locations in Grove
are specified in the following syntax: A-B-C
where B
is the child of A
and C
is the leaf of the structure,
To access a inner node directly, use the following syntax: parent.children["child_name"]
.
There are 2 types of Root Indices - SVMRootIndex
, FlatRootIndex
, CrossEncoderRootIndex
and they support the following methods. These indices form the base of every database in Grove
.
insert(self, item: BaseEntry, loc: str) -> None:
inserts aBaseEntry
into the specified location of the database.insert_all(self, items: List[BaseEntry], loc: str) -> None:
inserts a list ofBaseEntries
into the specified location. This is the same as callinginsert()
on each item of the list.create_child(self, new_child_name, t:Index, loc: str, **kwargs) -> None:
creates a child at the specified location with the givenchild_name
andIndex
type with the providedkwargs
. Provide empty string to create child at the currentIndex
.create_child_level(self, names: List[str], t: Index, loc: str, keys: List[np.array] = None, **kwargs) -> None:
createslen(names)
number of children at the given location from the givenkeys
. Other arguments are the same ascreate_child
. Note that this method creates children with the samekwargs
.delete_child(self, loc: str) -> None:
deletes the child at the given location.delete(self, metadata: dict, loc: str) -> None:
deletes the item given a metadata. IfBaseEntry
is provided for metadata, delete the associated entry in the database.search(self, query: np.array, k: int = 5) -> Tuple[str, List[BaseEntry], np.array]:
searches for the top k data points with the givenquery
. Returns a tuple of (location of the leaf node the results are from, the top K results, and an array of distances).get_schema(self) -> str:
returns a formatted string of the database structure.save_to_disk(self, safe_path: str) -> None:
pickles the database to diskload_from_disk(cls, name: str, safe_path: str) -> "RootIndex":
loads the database to disk. Currently does not check if the loadingIndex
is the same type as the callingIndex
.
There are 2 types of Inner Indices - FlatInnerIndex
and SVMInnerIndex
. They support the same functionality as the Root Indices. However, all inner indices must contain a key to be searchable. As such, they also contain the following methods:
set_key (self, key: np.array) -> bool:
sets a key at the givenIndex
. Returns true if successful else if the key is already set, it returns false.is_searchable(self) -> bool:
checks if the key has been set.
Leaf Indices are where the actual searching takes place. There are currently 3 different variants: FlatLeafIndex
, HNSWLeafIndex
and FlatLeafIndex
. They contain the following methods:
search(self, query: np.array, k: int = 5) -> Tuple[str, List[BaseEntry], np.array]: -> None
searches for the top k data points with the givenquery
. Returns a tuple of (location of the leaf node the results are from, aQ
xK
list of results, aQ
xK
np.array of distances).insert(self, item: BaseEntry, loc: str = None) -> None:
inserts aBaseEntry
into the specified location of the database.insert_all(self, items: List[BaseEntry], loc: str) -> None:
inserts a list ofBaseEntries
into the specified location. This is the same as callinginsert()
on each item of the list.delete(self, metadata: dict, loc: str) -> None:
deletes the item given a metadata. IfBaseEntry
is provided for metadata, delete the associated entry in the database.get_ids(self) -> List[dict]:
returns all the metadata present in this leaf
- Support multiple queries at once
- Performance metrics
- Other databases