
[FR] Add simple index deduplication #14

Open
svilupp opened this issue Apr 18, 2024 · 3 comments
Comments

svilupp (Owner) commented Apr 18, 2024

When we run load_index!([:a,:b]), the knowledge packs a and b can have duplicate content. It would be good to dedupe once we merge them (see src/loading.jl::78).

adarshpalaskar1 commented

Hello, I am interested in working on this issue.

Are we considering cosine-similarity-based deduplication? If so, here is a proposed approach:

  • Introduce a deduplicate_index() function after merging the knowledge packs.
  • Iteratively calculate the cosine similarity scores of new embeddings with the unique ones.
  • Discard embeddings based on a threshold.

Please let me know if this approach aligns with the plan.
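For concreteness, the proposed steps could be sketched roughly as below. This is a hypothetical illustration, not code from the package: the function name `cosine_duplicates`, the threshold value, and the assumption that embeddings are stored one per column are all placeholders.

```julia
using LinearAlgebra

# Hypothetical sketch of the proposed cosine-similarity deduplication.
# `emb` holds one embedding per column; a column whose cosine similarity
# to an already-kept column reaches `threshold` is flagged as a duplicate.
function cosine_duplicates(emb::AbstractMatrix{<:Real}; threshold::Real = 0.99)
    n = size(emb, 2)
    duplicates = falses(n)
    kept = Int[]  # column indices considered unique so far
    for j in 1:n
        is_dup = any(kept) do i
            dot(view(emb, :, i), view(emb, :, j)) /
                (norm(view(emb, :, i)) * norm(view(emb, :, j))) >= threshold
        end
        if is_dup
            duplicates[j] = true
        else
            push!(kept, j)
        end
    end
    return duplicates
end
```

Note this is O(n * u) in the number of chunks n and unique chunks u, which may matter for large merged indices.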

svilupp (Owner, Author) commented Apr 21, 2024

Hi Adarsh,

Awesome! I think this one can be done more simply: replace your second step with a plain string comparison, since we are looking for exact matches.

Some broader context:
Deduplication does have some negative impacts (e.g., the same text can appear in multiple docstrings, and by deleting it from all but one, it can never be linked to the other valid chunks...), so we will need some modularity — i.e., the ability to enable/disable it in the future, OR to handle it when we do embedding lookups.

But for now it would be great to have the brute-force deduplication!

Here are some helpful snippets from a pipeline that has not been published yet (so feel free to tweak/re-use/improve them):

using SHA  # provides sha256; bytes2hex is in Base

"Finds duplicates in a list of chunks using a SHA-256 hash. Returns a bit vector of the same length as the input list, where `true` indicates a duplicate (a second instance of the same text)."
function find_duplicates(chunks::AbstractVector{<:AbstractString})
    # hash the chunks for easier search
    hashed_chunks = bytes2hex.(sha256.(chunks))
    sorted_indices = sortperm(hashed_chunks)  # Sort indices based on hashed values

    duplicates = falses(length(chunks))
    prev_hash = ""  # Initialize with an empty string to ensure the first comparison fails

    for idx in sorted_indices
        current_hash = hashed_chunks[idx]
        # Check if current hash matches the previous one, indicating a duplicate
        if current_hash == prev_hash
            duplicates[idx] = true  # Mark as duplicate
        else
            prev_hash = current_hash  # Update previous hash for the next iteration
        end
    end

    return duplicates
end

"Removes chunks that are duplicated in the input list of chunks and their corresponding sources."
function remove_duplicates(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString})
    idxs = find_duplicates(chunks)
    return chunks[.!idxs], sources[.!idxs]
end

remove_duplicates shows how find_duplicates is meant to be used.
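To make that pairing concrete, here is a minimal self-contained run-through (the definitions are repeated from the snippets above so it can be executed standalone; the chunk texts and source paths are illustrative):

```julia
using SHA

function find_duplicates(chunks::AbstractVector{<:AbstractString})
    # hash the chunks for easier search
    hashed_chunks = bytes2hex.(sha256.(chunks))
    sorted_indices = sortperm(hashed_chunks)  # equal hashes become adjacent

    duplicates = falses(length(chunks))
    prev_hash = ""  # empty string so the first comparison fails
    for idx in sorted_indices
        current_hash = hashed_chunks[idx]
        if current_hash == prev_hash
            duplicates[idx] = true  # repeated text -> mark as duplicate
        else
            prev_hash = current_hash
        end
    end
    return duplicates
end

function remove_duplicates(chunks::AbstractVector{<:AbstractString},
                           sources::AbstractVector{<:AbstractString})
    idxs = find_duplicates(chunks)
    return chunks[.!idxs], sources[.!idxs]
end

chunks  = ["foo", "bar", "foo"]                        # "foo" is duplicated
sources = ["pack_a/x.jl", "pack_b/y.jl", "pack_b/z.jl"]
deduped_chunks, deduped_sources = remove_duplicates(chunks, sources)
# one instance of "foo" (and its source) survives alongside "bar"
```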

You can keep your own function name; the difference is that you will need to check whether each ChunkIndex field is nothing or provided, and then remove the duplicates along the correct dimension.

Please make sure it is specific to the ChunkIndex type only (as it is hard-coded to that type's field names).
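A rough shape for that ChunkIndex-specific version might look like the sketch below. The field names (`chunks`, `sources`, `embeddings`) and the assumption that embeddings are stored one column per chunk are guesses to be verified against the actual struct; the function names are placeholders:

```julia
using SHA

# Exact-match keep mask: `true` = keep (first occurrence), `false` = duplicate.
function keep_mask(chunks::AbstractVector{<:AbstractString})
    seen = Set{String}()
    keep = falses(length(chunks))
    for (i, c) in pairs(chunks)
        h = bytes2hex(sha256(c))
        if !(h in seen)
            push!(seen, h)
            keep[i] = true
        end
    end
    return keep
end

# Sketch of deduplicating the chunk-aligned fields of an index.
# `embeddings` may be `nothing` or a matrix with one column per chunk,
# in which case columns must be dropped along the second dimension.
function deduplicate_index(chunks, sources, embeddings)
    keep = keep_mask(chunks)
    new_embeddings = isnothing(embeddings) ? nothing : embeddings[:, keep]
    return chunks[keep], sources[keep], new_embeddings
end
```

The `nothing` check is the part that makes the optional fields safe to handle, per the comment above.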

Let me know what you think / if you have any questions!

adarshpalaskar1 commented Apr 23, 2024

Great! Thanks for the detailed explanation. I think this is clear to me for now, and I will add the PR soon.
