Storing IVF indexes on disk

For large indexes, it is useful to store the index content on disk. This incurs a runtime penalty compared to a standard index kept in RAM, but it allows the index to grow larger than the available RAM.

Data layout

The content of the inverted lists is stored in a file that is distinct from the index file. By default, the file layout is simply the raw data:

[codes for inv list #0]
[ids for inv list #0]
[codes for inv list #1]
[ids for inv list #1]
...
[codes for inv list #nlist-1]
[ids for inv list #nlist-1]

The file is memory mapped at search time.

The OnDiskInvertedLists object

The OnDiskInvertedLists object contains an indirection table that maps an inverted list number to an offset in the file on disk. This table is kept in RAM, so accessing an inverted list requires a single disk seek. When the index is written to a file, the offsets table is written to the index file.
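
As an illustration, here is a minimal Python sketch of attaching an OnDiskInvertedLists to a trained IVF index (the file names are placeholders):

import faiss

# Load a trained IVF index (placeholder file name).
index = faiss.read_index("trained.index")

# Back the inverted lists with a file on disk;
# nlist and code_size must match the index.
invlists = faiss.OnDiskInvertedLists(
    index.nlist, index.code_size, "index.ivfdata")

# Replace the default in-RAM lists with the on-disk ones.
# Only the offsets table is kept in RAM; keep a Python reference
# to invlists for as long as the index is in use.
index.replace_invlists(invlists)

# Writing the index stores the offsets table in the index file,
# while the codes and ids live in index.ivfdata.
faiss.write_index(index, "populated.index")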

Loading a standard index on disk

A standard index can be memory-mapped as an on-disk index using the I/O flag IO_FLAG_MMAP. This makes it possible to load many indexes that would not otherwise fit in RAM.
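
In Python, this looks like the following (the index file name is a placeholder):

import numpy as np
import faiss

# Memory-map the index instead of loading the inverted lists into RAM.
index = faiss.read_index("populated.index", faiss.IO_FLAG_MMAP)

# Searching works as usual; list contents are paged in from disk on demand.
queries = np.random.rand(10, index.d).astype('float32')
distances, ids = index.search(queries, 5)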

Building an on-disk index

OnDiskInvertedLists does support adding vectors to the index, but it is very inefficient, and this support will likely be removed in a future version of Faiss.

The supported way is to:

  • train the index

  • shard the dataset into parts that are small enough to fit in RAM

  • build one index for each shard, store them in one file per shard

  • merge the sharded indexes into one big index with OnDiskInvertedLists.merge_from

See demo_ondisk_ivf.py for a demo of how to do this.
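
Condensed from that demo, the merge step looks roughly like the sketch below (file names and the number of shards are placeholders; the trained index and the per-shard indexes are assumed to have been written beforehand):

import faiss

# Load the empty, trained index; its quantizer and parameters
# are shared by all shards.
index = faiss.read_index("trained.index")

# Collect the inverted lists of every shard. Each shard was built
# independently on a subset of the data and written with write_index.
ivf_vector = faiss.InvertedListsPtrVector()
shards = []  # keep references alive while merging
for i in range(4):
    shard = faiss.read_index("block_%d.index" % i, faiss.IO_FLAG_MMAP)
    ivf_vector.push_back(shard.invlists)
    shards.append(shard)

# Prepare the on-disk storage and copy all inverted lists into it.
invlists = faiss.OnDiskInvertedLists(
    index.nlist, index.code_size, "merged_index.ivfdata")
ntotal = invlists.merge_from(ivf_vector.data(), ivf_vector.size())

# Attach the on-disk lists to the index and save it: the offsets table
# goes into populated.index, the codes and ids into merged_index.ivfdata.
index.ntotal = ntotal
index.replace_invlists(invlists)
faiss.write_index(index, "populated.index")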

Search performance

Note that there is a subtle interplay between RAM and disk caching. The data may be stored on disk, but thanks to the kernel's buffering mechanism it may actually be accessed from RAM. This can happen in two cases:

  • when the dataset is small (like the demo above)

  • when the same search is repeated several times. In that case the inverted lists end up in the kernel's buffer cache, so searches after the first one are fast.

OS support

The on-disk functionality relies on "advanced" functionality from the host OS and compiler. It is known to work on native Linux and OSX. It is known to fail in at least the following cases:

  • when using address sanitizers like ASAN, the memory overhead is too high.

  • some virtual machines do not implement memory mapping properly, see issue #1262
