-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FR] Add simple index deduplication #14
Comments
Hello, I am interested in working on this issue. Are we considering cosine-similarity-based deduplication? If so, here is a proposed approach:
Please let me know if this approach aligns with the plan. |
Hi Adarsh, Awesome! I think this one could be done easier by replacing your second step with a simple string comparison, as we look for an exact match. Some broader context: But for now it would be great to have the brute-force deduplication! Some helpful snippets that are in the pipeline that has not been published yet (so you might tweak/re-use/improve it): "Finds duplicates in a list of chunks using SHA-256 hash. Returns a bit vector of the same length as the input list, where `true` indicates a duplicate (second instance of the same text)."
function find_duplicates(chunks::AbstractVector{<:AbstractString})
# hash the chunks for easier search
hashed_chunks = bytes2hex.(sha256.(chunks))
sorted_indices = sortperm(hashed_chunks) # Sort indices based on hashed values
duplicates = falses(length(chunks))
prev_hash = "" # Initialize with an empty string to ensure the first comparison fails
for idx in sorted_indices
current_hash = hashed_chunks[idx]
# Check if current hash matches the previous one, indicating a duplicate
if current_hash == prev_hash
duplicates[idx] = true # Mark as duplicate
else
prev_hash = current_hash # Update previous hash for the next iteration
end
end
return duplicates
end
"Removes chunks that are duplicated in the input list of chunks and their corresponding sources."
function remove_duplicates(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString})
idxs = find_duplicates(chunks)
return chunks[.!idxs], sources[.!idxs]
end The You can use your function name and difference would be that you need to check if the ChunkIndex has fields nothing or provided and then remove the duplicates from the right dimension. Please make sure it's specific to ChunkIndex type only (as it's hard-coded to the field names) Let me know what you think / if you have any questions! |
Great! Thanks for the detailed explanation. I think this is clear to me for now, and I will add the PR soon. |
When we run
load_index!([:a,:b])
, the knowledge packs a and b can have duplicate content. It would be good to dedupe once we merge them (seesrc/loading.jl::78
).The text was updated successfully, but these errors were encountered: