Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient way to increment on-disk merged index #2876

Closed
fferroni opened this issue May 25, 2023 · 2 comments
Closed

Efficient way to increment on-disk merged index #2876

fferroni opened this issue May 25, 2023 · 2 comments

Comments

@fferroni
Copy link

fferroni commented May 25, 2023

Hello,

Great library. Had two questions I couldn't find answers to in the past issues / READMEs

  1. Is it possible to add additional shards to an already pre-built index that was created from a set of shards (faiss.contrib.ondisk.merge_ondisk()), or does one need to run merge_ondisk again? I'm dealing with very large indices...
  2. In a distributed setting, indexing shards requires knowing the global ID offset. (https://github.com/facebookresearch/faiss/blob/main/demos/demo_ondisk_ivf.py#L53). Is there an efficient way to merge index shards that always start from zero? For example, I may have 10 machines that create an IVF index shard of 100 elements from 0-99. In the merge step, can I efficiently (or not) "on-the-fly" offset the indices of each such that they are globally consistent, perhaps based on the order of the list provided to merge_ondisk?

Thank you!

@mdouze
Copy link
Contributor

mdouze commented May 26, 2023

  1. this boils down to supporting addition in a merged index. Support is implemented but so inefficient that it is not practically usable unfortunately

  2. that would be a useful addition, to add an offset to all inverted lists, marking as enhancement. Should be implemented here: https://github.com/facebookresearch/faiss/blob/main/faiss/invlists/OnDiskInvertedLists.h#L104
    Another way would be to make a wrapper around an invlist object that shifts the ids.

@fferroni
Copy link
Author

fferroni commented Jun 5, 2023

Hi,
Thank you for the reply @mdouze. For 2. would you have some tips/pointers on how to implement this? I suppose one could add an extra default argument like auto_offset=False to the method. But in terms of implementation in the .cpp, how does one access the index IDs and offset them?

kuarora added a commit to kuarora/faiss that referenced this issue Mar 28, 2024
…Lists

Summary:
**Context**
1. [Issue 2621](facebookresearch#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](facebookresearch#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address facebookresearch#1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address facebookresearch#2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Differential Revision: D55482518
kuarora added a commit to kuarora/faiss that referenced this issue Mar 28, 2024
…Lists (facebookresearch#3327)

Summary:

**Context**
1. [Issue 2621](facebookresearch#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](facebookresearch#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address facebookresearch#1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address facebookresearch#2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Differential Revision: D55482518
kuarora added a commit to kuarora/faiss that referenced this issue Apr 3, 2024
…Lists (facebookresearch#3327)

Summary:

**Context**
1. [Issue 2621](facebookresearch#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](facebookresearch#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address facebookresearch#1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address facebookresearch#2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Reviewed By: mdouze

Differential Revision: D55482518
kuarora added a commit to kuarora/faiss that referenced this issue Apr 3, 2024
…Lists (facebookresearch#3327)

Summary:

**Context**
1. [Issue 2621](facebookresearch#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](facebookresearch#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address facebookresearch#1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address facebookresearch#2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Reviewed By: mdouze

Differential Revision: D55482518
kuarora added a commit to kuarora/faiss that referenced this issue Apr 3, 2024
…Lists (facebookresearch#3327)

Summary:

**Context**
1. [Issue 2621](facebookresearch#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](facebookresearch#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address facebookresearch#1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address facebookresearch#2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Reviewed By: mdouze

Differential Revision: D55482518
facebook-github-bot pushed a commit that referenced this issue Apr 3, 2024
…Lists (#3327)

Summary:
Pull Request resolved: #3327

**Context**
1. [Issue 2621](#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address #1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address #2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Reviewed By: mdouze

Differential Revision: D55482518

fbshipit-source-id: 95470c7449160488d2b45b024d134cbc037a2083
@kuarora kuarora closed this as completed Apr 3, 2024
abhinavdangeti pushed a commit to blevesearch/faiss that referenced this issue Jul 12, 2024
…Lists (facebookresearch#3327)

Summary:
Pull Request resolved: facebookresearch#3327

**Context**
1. [Issue 2621](facebookresearch#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](facebookresearch#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address #1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address #2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Reviewed By: mdouze

Differential Revision: D55482518

fbshipit-source-id: 95470c7449160488d2b45b024d134cbc037a2083
aalekhpatel07 pushed a commit to aalekhpatel07/faiss that referenced this issue Oct 17, 2024
…Lists (facebookresearch#3327)

Summary:
Pull Request resolved: facebookresearch#3327

**Context**
1. [Issue 2621](facebookresearch#2621) discuss inconsistency between OnDiskInvertedList and InvertedList. OnDiskInvertedList is supposed to handle disk based multiple Index Shards. Thus, we should name it differently when merging invls from index shard.
2. [Issue 2876](facebookresearch#2876) provides usecase of shifting ids when merging invls from different shards.

**In this diff**,
1. To address #1 above, I renamed the merge_from function to merge_from_multiple without touching merge_from base class.
why so? To continue to allow merge invl from one index to ondiskinvl from other index.

2. To address Enet4#2 above, I have added support of shift_ids in merge_from_multiple to shift ids from different shards. This can be used when each shard has same set of ids but different data. This is not recommended if id is already unique across shards.

Reviewed By: mdouze

Differential Revision: D55482518

fbshipit-source-id: 95470c7449160488d2b45b024d134cbc037a2083
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants