
[FR] Add simple index deduplication #14

Open
svilupp opened this issue Apr 18, 2024 · 3 comments
Comments

svilupp (Owner) commented Apr 18, 2024

When we run load_index!([:a,:b]), the knowledge packs a and b can have duplicate content. It would be good to dedupe once we merge them (see src/loading.jl::78).

adarshpalaskar1 commented

Hello, I am interested in working on this issue.

Are we considering cosine-similarity-based deduplication? If so, here is a proposed approach:

  • Introduce a deduplicate_index() function after merging the knowledge packs.
  • Iteratively calculate the cosine similarity scores of new embeddings with the unique ones.
  • Discard embeddings based on a threshold.

Please let me know if this approach aligns with the plan.
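For concreteness, the proposed steps could be sketched roughly as below. This is a hypothetical illustration, not code from the package: the function name `cosine_duplicates`, the threshold value, and the assumption that embeddings are stored one per column are all placeholders.

```julia
using LinearAlgebra

# Hypothetical sketch of the proposed cosine-similarity deduplication.
# `emb` holds one embedding per column; a column whose cosine similarity
# to an already-kept column reaches `threshold` is flagged as a duplicate.
function cosine_duplicates(emb::AbstractMatrix{<:Real}; threshold::Real = 0.99)
    n = size(emb, 2)
    duplicates = falses(n)
    kept = Int[]  # column indices considered unique so far
    for j in 1:n
        is_dup = any(kept) do i
            dot(view(emb, :, i), view(emb, :, j)) /
                (norm(view(emb, :, i)) * norm(view(emb, :, j))) >= threshold
        end
        if is_dup
            duplicates[j] = true
        else
            push!(kept, j)
        end
    end
    return duplicates
end
```

Note this is O(n * u) in the number of chunks n and unique chunks u, which may matter for large merged indices.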

svilupp (Owner, Author) commented Apr 21, 2024

Hi Adarsh,

Awesome! I think this one can be done more simply: replace your second step with a plain string comparison, since we are looking for exact matches.

Some broader context:
Deduplication does have some negative impacts (e.g., the same text can appear in multiple docstrings, and by deleting it from all but one, it can never be linked to the other valid chunks...), so we will need some modularity — i.e., the ability to enable/disable it in the future, OR to handle it when we do embedding lookups.

But for now it would be great to have the brute-force deduplication!

Here are some helpful snippets from a pipeline that has not been published yet (so feel free to tweak/re-use/improve them):

using SHA  # provides sha256; bytes2hex is in Base

"Finds duplicates in a list of chunks using a SHA-256 hash. Returns a bit vector of the same length as the input list, where `true` indicates a duplicate (a second instance of the same text)."
function find_duplicates(chunks::AbstractVector{<:AbstractString})
    # hash the chunks for easier search
    hashed_chunks = bytes2hex.(sha256.(chunks))
    sorted_indices = sortperm(hashed_chunks)  # Sort indices based on hashed values

    duplicates = falses(length(chunks))
    prev_hash = ""  # Initialize with an empty string to ensure the first comparison fails

    for idx in sorted_indices
        current_hash = hashed_chunks[idx]
        # Check if current hash matches the previous one, indicating a duplicate
        if current_hash == prev_hash
            duplicates[idx] = true  # Mark as duplicate
        else
            prev_hash = current_hash  # Update previous hash for the next iteration
        end
    end

    return duplicates
end

"Removes chunks that are duplicated in the input list of chunks and their corresponding sources."
function remove_duplicates(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString})
    idxs = find_duplicates(chunks)
    return chunks[.!idxs], sources[.!idxs]
end

remove_duplicates shows how find_duplicates is meant to be used.
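To make that pairing concrete, here is a minimal self-contained run-through (the definitions are repeated from the snippets above so it can be executed standalone; the chunk texts and source paths are illustrative):

```julia
using SHA

function find_duplicates(chunks::AbstractVector{<:AbstractString})
    # hash the chunks for easier search
    hashed_chunks = bytes2hex.(sha256.(chunks))
    sorted_indices = sortperm(hashed_chunks)  # equal hashes become adjacent

    duplicates = falses(length(chunks))
    prev_hash = ""  # empty string so the first comparison fails
    for idx in sorted_indices
        current_hash = hashed_chunks[idx]
        if current_hash == prev_hash
            duplicates[idx] = true  # repeated text -> mark as duplicate
        else
            prev_hash = current_hash
        end
    end
    return duplicates
end

function remove_duplicates(chunks::AbstractVector{<:AbstractString},
                           sources::AbstractVector{<:AbstractString})
    idxs = find_duplicates(chunks)
    return chunks[.!idxs], sources[.!idxs]
end

chunks  = ["foo", "bar", "foo"]                        # "foo" is duplicated
sources = ["pack_a/x.jl", "pack_b/y.jl", "pack_b/z.jl"]
deduped_chunks, deduped_sources = remove_duplicates(chunks, sources)
# one instance of "foo" (and its source) survives alongside "bar"
```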

You can keep your own function name; the difference is that you will need to check whether each ChunkIndex field is nothing or provided, and then remove the duplicates along the correct dimension.

Please make sure it is specific to the ChunkIndex type only (as it is hard-coded to that type's field names).
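A rough shape for that ChunkIndex-specific version might look like the sketch below. The field names (`chunks`, `sources`, `embeddings`) and the assumption that embeddings are stored one column per chunk are guesses to be verified against the actual struct; the function names are placeholders:

```julia
using SHA

# Exact-match keep mask: `true` = keep (first occurrence), `false` = duplicate.
function keep_mask(chunks::AbstractVector{<:AbstractString})
    seen = Set{String}()
    keep = falses(length(chunks))
    for (i, c) in pairs(chunks)
        h = bytes2hex(sha256(c))
        if !(h in seen)
            push!(seen, h)
            keep[i] = true
        end
    end
    return keep
end

# Sketch of deduplicating the chunk-aligned fields of an index.
# `embeddings` may be `nothing` or a matrix with one column per chunk,
# in which case columns must be dropped along the second dimension.
function deduplicate_index(chunks, sources, embeddings)
    keep = keep_mask(chunks)
    new_embeddings = isnothing(embeddings) ? nothing : embeddings[:, keep]
    return chunks[keep], sources[keep], new_embeddings
end
```

The `nothing` check is the part that makes the optional fields safe to handle, per the comment above.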

Let me know what you think / if you have any questions!

adarshpalaskar1 commented Apr 23, 2024

Great! Thanks for the detailed explanation. I think this is clear to me for now, and I will add the PR soon.
