Storing IVF indexes on disk
For large indexes, it is useful to store the index content on disk. This incurs a runtime penalty compared to a standard index held in RAM, but the index can grow beyond the size of available RAM.
The content of the inverted lists is stored in a file that is distinct from the index file. By default the file layout is just raw data:

```
[codes for inv list #0]
[ids for inv list #0]
[codes for inv list #1]
[ids for inv list #1]
...
[codes for inv list #nlist-1]
[ids for inv list #nlist-1]
```
The file is memory mapped at search time.
The OnDiskInvertedLists object contains an indirection table that maps an inverted list number to an offset in the file on disk.
This table is kept in RAM so that accessing an inverted list requires a single disk seek.
When the index is written to a file, the offsets table is written in the index file.
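A minimal stdlib sketch of this mechanism (plain Python, not Faiss code; the field widths and layout are simplified assumptions): inverted lists are written back to back in one file, the offsets table is kept in RAM, and reading one list is a single lookup into the memory-mapped file.

```python
import mmap
import os
import struct
import tempfile

# Toy layout: [codes #0][ids #0][codes #1][ids #1] ...
# codes as 32-bit ints, ids as 64-bit ints (simplified assumption).
lists = {0: ([1, 2, 3], [10, 20, 30]), 1: ([7], [99])}

tmp = tempfile.NamedTemporaryFile(delete=False)
path = tmp.name
tmp.close()

table = {}  # in-RAM indirection: list_no -> (offset in file, n entries)
with open(path, "wb") as f:
    for list_no, (codes, ids) in lists.items():
        table[list_no] = (f.tell(), len(codes))
        f.write(struct.pack(f"{len(codes)}i", *codes))
        f.write(struct.pack(f"{len(ids)}q", *ids))

# At "search" time: memory-map the file and fetch one list directly,
# with no other disk access than the seek to its offset.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset, n = table[1]
    codes = struct.unpack_from(f"{n}i", mm, offset)
    ids = struct.unpack_from(f"{n}q", mm, offset + 4 * n)
    mm.close()
os.remove(path)
print(codes, ids)  # -> (7,) (99,)
```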
A standard index can be memory-mapped as an on-disk index by using the I/O flag IO_FLAG_MMAP, e.g. `index = faiss.read_index(filename, faiss.IO_FLAG_MMAP)` in Python.
This makes it possible to load many indexes that would not otherwise fit in RAM.
OnDiskInvertedLists does support adding vectors to the index, but it is very inefficient, and this support will likely be removed in a future version of Faiss.
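The layout above suggests why in-place additions are costly: lists are contiguous in the file, so growing one list means rewriting every list after it and updating every later offset. A small sketch of that cost (plain Python, not Faiss code):

```python
# Three inverted lists stored back to back in one blob.
lists = [b"AAAA", b"BB", b"CCC"]
blob = b"".join(lists)
offsets = [0, 4, 6]  # in-RAM indirection table

# Append two bytes to list 0: everything after it must move.
lists[0] += b"aa"
blob = b"".join(lists)  # rewrites the bytes of all later lists
offsets = [0, 6, 8]     # and all later offsets change
```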
The supported way is to:

- train the index
- shard the dataset into parts that are small enough to fit in RAM
- build one index for each shard, and store each in its own file
- merge the sharded indexes into one big index with OnDisk.merge_from
See demo_ondisk_ivf.py for a demo on how to do this.
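Conceptually, the merge step concatenates, for each inverted list number, the entries coming from every shard (all shards share the same trained quantizer, so list numbers are comparable across shards). A stdlib sketch of that operation, with made-up shard contents:

```python
# list_no -> [(code, id), ...] for two shards of the same trained index
shard_a = {0: [(b"c0", 10)], 2: [(b"c2", 11)]}
shard_b = {0: [(b"c9", 50)], 1: [(b"c1", 51)]}

merged = {}
for shard in (shard_a, shard_b):
    for list_no, entries in shard.items():
        merged.setdefault(list_no, []).extend(entries)

print(merged[0])  # -> [(b'c0', 10), (b'c9', 50)]
```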
Note that there is a subtle interplay between RAM and disk caching. The data may be stored on disk, but due to the kernel's buffering mechanism it may actually be served from RAM. This can happen in two cases:

- when the dataset is small (as in the demo above)
- when the same search is repeated several times: the inverted lists end up in the kernel cache buffers, so searches after the first one are fast
The on-disk functionality relies on "advanced" functionality from the host OS and compiler. It is known to work on native Linux and OSX. It is known to fail in at least the following cases:

- when using address sanitizers like ASAN, the memory overhead is too high
- some virtual machines do not implement memory mapping properly, see issue #1262