Training on randomly selected documents for many epochs can be sub-optimal for the downstream performance of language models. For more information on when this is harmful, please see Muennighoff et al., 2023 and Tirumala et al., 2023. The exact and fuzzy document-level deduplication module in NeMo Curator aims to reduce the occurrence of duplicate and near-duplicate documents in the dataset. Exact deduplication refers to removing identical documents (i.e., the document strings are equal) from the dataset, while fuzzy deduplication refers to removing near-identical documents (e.g., an excerpt of one document is reused in another).
Both functionalities are supported in NeMo Curator and accelerated using RAPIDS. Exact deduplication works by hashing each document and only keeping one document per hash. Fuzzy deduplication is more involved and follows the method outlined in Microsoft Turing NLG 530B.
Because exact deduplication is a much less involved procedure and requires significantly less compute, we typically run it before fuzzy deduplication. Also, in our experience deduplicating Common Crawl snapshots, a significant portion of the duplicates are in fact exact duplicates.
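To illustrate the idea behind exact deduplication, here is a minimal pure-Python sketch (not the GPU-accelerated implementation NeMo Curator uses) that keeps one document per content hash:

    import hashlib

    def exact_dedup(documents):
        """Keep the first document seen for each MD5 hash of the document text."""
        seen_hashes = set()
        kept = []
        for doc in documents:
            digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                kept.append(doc)
        return kept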
When removing near-duplicates within the corpus, we perform fuzzy deduplication at the document level in order to remove documents with high Jaccard similarity. Our approach closely resembles the one described in Smith et al., 2020, and can essentially be split into two conceptual stages. The first stage computes MinHash signatures on documents and then performs Locality Sensitive Hashing (LSH) to find candidate duplicates. Because the bucketing via MinHash + LSH is approximate (Leskovec et al., 2020), the second stage processes each of the buckets to remove any false positives that may have been hashed into them.
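The toy sketch below illustrates both stages in pure Python: MinHash signatures over character n-grams, LSH banding to surface candidate pairs, and a Jaccard check to filter false positives. It is only meant to convey the concept; the signature length, band count, and seeded hash function are illustrative choices and are unrelated to the GPU implementation in NeMo Curator.

    import hashlib
    from collections import defaultdict

    def char_ngrams(text, n=5):
        """Character n-gram shingles of a document."""
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def minhash_signature(text, num_hashes=128, ngram_size=5):
        """Minimum of each seeded hash function over the document's shingles."""
        shingles = char_ngrams(text, ngram_size)
        return [
            min(int.from_bytes(hashlib.md5(f"{seed}-{s}".encode()).digest()[:8], "big")
                for s in shingles)
            for seed in range(num_hashes)
        ]

    def lsh_candidates(signatures, num_bands=16):
        """Band each signature; documents sharing any band bucket become candidate duplicates."""
        rows = len(next(iter(signatures.values()))) // num_bands
        buckets = defaultdict(set)
        for doc_id, sig in signatures.items():
            for b in range(num_bands):
                buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
        return {pair for ids in buckets.values() if len(ids) > 1
                for pair in ((a, c) for a in ids for c in ids if a < c)}

    def jaccard(a, b, n=5):
        """n-gram Jaccard similarity used to verify candidate pairs (false-positive check)."""
        sa, sb = char_ngrams(a, n), char_ngrams(b, n)
        return len(sa & sb) / len(sa | sb)

The scripts described below implement the same pipeline at scale: minhash computation, LSH bucketing, bucket mapping and shuffling, Jaccard verification, and finally connected components to group duplicates.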
Before running either of these modules, users should assign a unique document ID to each document in the corpus. This can be accomplished using the add_id module within NeMo Curator:
add_id \
--input-data-dir=<Path to directory containing jsonl files> \
--log-dir=./log/add_id
By default, this will create a new field named adlr_id within each JSON document, which will have the form "doc_prefix-000001". If the dataset already has a unique ID, this step can be skipped.

Note: Fuzzy deduplication only works with numeric IDs or the specific ID format generated by the add_id script. If the dataset does not contain IDs in this format, it is recommended to convert them to an integer-based ID or an ID created by the add_id script.
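The ID assignment can also be done from Python instead of the CLI. The snippet below is a minimal sketch assuming the AddId module and DocumentDataset reader exposed by the nemo_curator package; the exact argument names (id_field, id_prefix) may vary between releases, so consult the add_id example under the examples directory for a tested version.

    # Sketch only: argument names are assumptions based on the packaged examples.
    import nemo_curator as nc
    from nemo_curator.datasets import DocumentDataset

    # Read the corpus of JSONL files into a distributed DocumentDataset.
    dataset = DocumentDataset.read_json("<Path to directory containing jsonl files>")

    # Add an adlr_id-style unique ID of the form "doc_prefix-000001".
    add_id = nc.AddId(id_field="adlr_id", id_prefix="doc_prefix")
    dataset_with_ids = add_id(dataset)

    # Persist the dataset with IDs so the dedup scripts and modules can consume it.
    dataset_with_ids.to_json("<Output directory>")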
Once a unique ID has been added to each document, users can proceed with exact and fuzzy deduplication, which roughly require the following steps (all scripts are included in the nemo_curator/scripts/ subdirectory):
- Exact dedup
  - Input: Data directories
  - Output: _exact_duplicates.parquet. List of exact duplicates and the document hash.
- Fuzzy Dedup
  - Compute Minhashes
    - Input: Data directories
    - Output: minhashes.parquet for each data directory.
    - Example call:

      # same as `python compute_minhashes.py`
      gpu_compute_minhashes \
        --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
        --output-minhash-dir /path/to/output_minhashes \
        --input-json-text-field text_column_name \
        --input-json-id-field id_column_name \
        --minhash-length number_of_hashes \
        --char-ngram char_ngram_size \
        --hash-bytes 4 `#or 8 byte hashes` \
        --seed 42 \
        --log-dir ./
        # --scheduler-file /path/to/file.json
  - Buckets (Minhash Buckets)
    - Input: Minhash directories
    - Output: _buckets.parquet
    - Example call:

      # same as `python minhash_lsh.py`
      minhash_buckets \
        --input-data-dirs /path/to/output_minhashes/dir1 /path/to/output_minhashes/dir2 \
        --output-bucket-dir /path/to/dedup_output \
        --input-minhash-field _minhash_signature \
        --input-json-id-field id_column_name \
        --minhash-length number_of_hashes \
        --num-bands num_bands \
        --buckets-per-shuffle 1 `#Value between 1 and num_bands. Higher is better but might lead to OOM` \
        --log-dir ./
        # --scheduler-file /path/to/file.json
  - Jaccard Map Buckets
    - Input: _buckets.parquet + data directories
    - Output: anchor_docs_with_bk.parquet
    - Example call:

      # same as `python map_buckets.py`
      jaccard_map_buckets \
        --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
        --input-bucket-dir /path/to/dedup_output/_buckets.parquet \
        --output-dir /path/to/dedup_output \
        --input-json-text-field text_column_name \
        --input-json-id-field id_column_name
        # --scheduler-file /path/to/file.json
  - Jaccard Shuffle
    - Input: anchor_docs_with_bk.parquet + data directories
    - Output: shuffled_docs.parquet
    - Example call:

      # same as `python jaccard_shuffle.py`
      jaccard_shuffle \
        --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
        --input-bucket-mapping-dir /path/to/dedup_output/anchor_docs_with_bk.parquet \
        --output-dir /path/to/dedup_output \
        --input-json-text-field text_column_name \
        --input-json-id-field id_column_name
        # --scheduler-file /path/to/file.json
  - Jaccard Compute
    - Input: shuffled_docs.parquet
    - Output: jaccard_similarity_results.parquet
    - Example call:

      # same as `python jaccard_compute.py`
      jaccard_compute \
        --shuffled-docs-path /path/to/dedup_output/shuffled_docs.parquet \
        --output-dir /path/to/dedup_output \
        --ngram-size char_ngram_size_for_similarity
        # --scheduler-file /path/to/file.json
  - Connected Components
    - Input: jaccard_similarity_results.parquet
    - Output: connected_components.parquet
    - Example call:

      # same as `python connected_components.py`
      gpu_connected_component \
        --jaccard-pairs-path /path/to/dedup_output/jaccard_similarity_results.parquet \
        --output-dir /path/to/dedup_output \
        --cache-dir /path/to/cc_cache \
        --jaccard-threshold 0.8
        # --scheduler-file /path/to/file.json
- Incremental Fuzzy Dedup
  To incrementally perform fuzzy dedup, organize your incremental dataset snapshots into separate directories and pass a list of all of those directories to gpu_compute_minhashes. All subsequent steps can be run as described above without modification.
  - Input (assuming the incremental snapshots are all under /input/):

      /input/cc-2020-40
      /input/cc-2021-42
      /input/cc-2022-60

  - Output (assuming --output-minhash-dir=/output):

      /output/cc-2020-40/minhashes.parquet
      /output/cc-2021-42/minhashes.parquet
      /output/cc-2022-60/minhashes.parquet

  - Example call:

      # same as `python compute_minhashes.py`
      gpu_compute_minhashes \
        --input-data-dirs /input/cc-2020-40 /input/cc-2021-42 /input/cc-2022-60 \
        --output-minhash-dir /output/ \
        --input-json-text-field text_column_name \
        --input-json-id-field id_column_name \
        --minhash-length number_of_hashes \
        --char-ngram char_ngram_size \
        --hash-bytes 4 `#or 8 byte hashes` \
        --seed 42 \
        --log-dir ./
        # --scheduler-file /path/to/file.json
In addition to the scripts, the examples directory contains examples that showcase using the Python module directly in your own code, including how to remove documents from the corpus using the list of duplicate IDs generated by exact or fuzzy deduplication.
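As a rough outline of what that Python usage can look like for exact deduplication, including removal of the identified duplicates, see the sketch below. The class and argument names (ExactDuplicates, id_field, text_field, hash_method) and the _hashes column in its output are assumptions based on the packaged examples and may differ between releases; treat this as a sketch rather than the canonical API.

    # Sketch only: follows the packaged exact-deduplication example; names may differ by release.
    from nemo_curator import ExactDuplicates
    from nemo_curator.datasets import DocumentDataset

    dataset = DocumentDataset.read_json("/path/to/jsonl/dir1")

    exact_dups = ExactDuplicates(
        id_field="adlr_id",          # the unique ID added earlier
        text_field="text",
        hash_method="md5",           # documents with identical hashes are exact duplicates
        cache_dir="/path/to/dedup_output",
    )
    duplicates = exact_dups(dataset)  # DocumentDataset of duplicate IDs and their hashes

    # For each hash, keep the first document and mark the rest for removal
    # (duplicate hashes are assumed to be co-located within a partition, as in the example).
    docs_to_remove = duplicates.df.map_partitions(
        lambda part: part[part["_hashes"].duplicated(keep="first")]
    )
    removal_ids = docs_to_remove["adlr_id"].compute()

    # Drop the marked documents from the corpus.
    deduped = DocumentDataset(dataset.df[~dataset.df["adlr_id"].isin(removal_ids)])

For fuzzy deduplication, the analogous removal step starts from the connected components output, which groups near-duplicate documents so that all but one document per group can be dropped.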