Fix issue #43 (empty files creation) and improve reading/writing speed #57

miguelusque · 2024-05-08T13:19:06Z

This commit fixes issue #43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero.

In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method.

…g speed This commit fixes issue NVIDIA#43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero. In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method. Signed-off-by: Miguel Martínez <[email protected]>

…g speed (NVIDIA#57) This commit fixes issue NVIDIA#43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero. In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method. Signed-off-by: Miguel Martínez <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>

…g speed (NVIDIA#57) This commit fixes issue NVIDIA#43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero. In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method. Signed-off-by: Miguel Martínez <[email protected]> Signed-off-by: Nicole Luo <[email protected]>

* Init commit for tutorial notebook Signed-off-by: Nicole Luo <[email protected]> * Fix metadata inference with pandas and dask (#35) * Fix metadata inference with pandas and dask Signed-off-by: Ryan Wolf <[email protected]> * Fix datatypes for task decontamination Signed-off-by: Ryan Wolf <[email protected]> * Use targetted import Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Disable PyTorch Compile Multiprocessing (#34) * Move tokenizer import Signed-off-by: Ryan Wolf <[email protected]> * Reduce inductor threads Signed-off-by: Ryan Wolf <[email protected]> * Change env int to string Signed-off-by: Ryan Wolf <[email protected]> * Change location of env var Signed-off-by: Ryan Wolf <[email protected]> * Add comment linking issue Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Improve speed of AddId module (#36) * Add fast id method Signed-off-by: Ryan Wolf <[email protected]> * Add type conversion Signed-off-by: Ryan Wolf <[email protected]> * Fix off by one errors in tests Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Make GPU dependencies optional (#27) * Move GPU imports and make them optional Signed-off-by: Ayush Dattagupta <[email protected]> * Move gpu dependencies to a seperate install Signed-off-by: Ayush Dattagupta <[email protected]> * Remove unused import Signed-off-by: Ayush Dattagupta <[email protected]> * Switch to placeholder import that raises on usage Signed-off-by: Ayush Dattagupta <[email protected]> * Remove deprecated utils usage Signed-off-by: Ayush Dattagupta <[email protected]> * Add cuML attribution Signed-off-by: Ayush Dattagupta <[email protected]> * Safe import tests, improve install instruction, update gha workflow Signed-off-by: Ayush Dattagupta <[email protected]> * Fix pytests due to loc bug Signed-off-by: Ayush Dattagupta <[email protected]> * update install instructions Signed-off-by: Ayush Dattagupta <[email protected]> * Raise on non module-not-found errors, update logging Signed-off-by: Ayush Dattagupta <[email protected]> * Update logging to not change root logger Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fix failing GPU tests with latest pandas bump (#41) Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Adds Nemo Curator K8s example (#40) * [K8s]: Adds a helper script to create a dask cluster on k8s and includes instructions for how to a Curator workload on k8s Signed-off-by: Terry Kong <[email protected]> * black formatting Signed-off-by: Terry Kong <[email protected]> * big_english -> my_dataset Signed-off-by: Terry Kong <[email protected]> * 24.01 -> 24.03 default container Signed-off-by: Terry Kong <[email protected]> * Add help kwarg to all flags Signed-off-by: Terry Kong <[email protected]> * Clarify why venv is needed Signed-off-by: Terry Kong <[email protected]> * fix precommit failures Signed-off-by: Terry Kong <[email protected]> --------- Signed-off-by: Terry Kong <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Move common dedup utils and remove unused code (#42) * Refactor common utils and remove unused code Signed-off-by: Ayush Dattagupta <[email protected]> * More cleanup Signed-off-by: Ayush Dattagupta <[email protected]> * More updates/shuffling Signed-off-by: Ayush Dattagupta <[email protected]> * Move gpu_dedup scripts into subfolder Signed-off-by: Ayush Dattagupta <[email protected]> * Remove gpu_deduplication subfolder Signed-off-by: Ayush Dattagupta <[email protected]> * Add readme to fuzzy dedup scripts section Signed-off-by: Ayush Dattagupta <[email protected]> * Fix typo and relative links Signed-off-by: Ayush Dattagupta <[email protected]> * Remove legacy script entrypoints Signed-off-by: Ayush Dattagupta <[email protected]> * Remove legacy scripts and add init file Signed-off-by: Ayush Dattagupta <[email protected]> * Update GpuDeduplication.rst Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fix lang id example (#37) * Fix lang id example Signed-off-by: Ryan Wolf <[email protected]> * Add classifier unit tests Signed-off-by: Ryan Wolf <[email protected]> * Add test for failure Signed-off-by: Ryan Wolf <[email protected]> * Remove failure test Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Add dataset blending tool (#32) * Add initial dataset blending function Signed-off-by: Ryan Wolf <[email protected]> * Add blend unit tests Signed-off-by: Ryan Wolf <[email protected]> * Add self parameter Signed-off-by: Ryan Wolf <[email protected]> * Fix return type of blend dataset Signed-off-by: Ryan Wolf <[email protected]> * Fix blending tests Signed-off-by: Ryan Wolf <[email protected]> * Change assert statement for very uneven blend Signed-off-by: Ryan Wolf <[email protected]> * Fix key error Signed-off-by: Ryan Wolf <[email protected]> * Add proper proportion blending test Signed-off-by: Ryan Wolf <[email protected]> * Add four dataset blend and clarify docs Signed-off-by: Ryan Wolf <[email protected]> * Add shuffle module Signed-off-by: Ryan Wolf <[email protected]> * Add blend example and tests Signed-off-by: Ryan Wolf <[email protected]> * Fix random method name Signed-off-by: Ryan Wolf <[email protected]> * Wrap return type in DocumentDataset Signed-off-by: Ryan Wolf <[email protected]> * Save result of column drop Signed-off-by: Ryan Wolf <[email protected]> * Change equality check for shuffle tests Signed-off-by: Ryan Wolf <[email protected]> * Fix expected order after shuffle Signed-off-by: Ryan Wolf <[email protected]> * Add more documents to shuffle test Signed-off-by: Ryan Wolf <[email protected]> * Add assert statement Signed-off-by: Ryan Wolf <[email protected]> * Add within partition shuffle Signed-off-by: Ryan Wolf <[email protected]> * Refactor add rand column for shuffle Signed-off-by: Ryan Wolf <[email protected]> * Fix filename tests Signed-off-by: Ryan Wolf <[email protected]> * Add determinism handling for shuffle Signed-off-by: Ryan Wolf <[email protected]> * Change numpy random function Signed-off-by: Ryan Wolf <[email protected]> * Fix tests with new random method Signed-off-by: Ryan Wolf <[email protected]> * Remove length call from blending Signed-off-by: Ryan Wolf <[email protected]> * Improve scaling of blending function Signed-off-by: Ryan Wolf <[email protected]> * Fix blend tests Signed-off-by: Ryan Wolf <[email protected]> * Add blending script Signed-off-by: Ryan Wolf <[email protected]> * Add additional file paths call Signed-off-by: Ryan Wolf <[email protected]> * Add documentation Signed-off-by: Ryan Wolf <[email protected]> * Reformat docs Signed-off-by: Ryan Wolf <[email protected]> * Remove backticks Signed-off-by: Ryan Wolf <[email protected]> * Add context manager for shuffle tests Signed-off-by: Ryan Wolf <[email protected]> * Add better deterministic shuffle path Signed-off-by: Ryan Wolf <[email protected]> * Update documentation and reset index Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * High level fuzzy duplicates module (#46) * Initial pass at fuzzy dedup api Signed-off-by: Ayush Dattagupta <[email protected]> * Update deprecated shuffle arg Signed-off-by: Ayush Dattagupta <[email protected]> * dask_cuda gpu only import Signed-off-by: Ayush Dattagupta <[email protected]> * Move fuzzy_dedup imports to optional Signed-off-by: Ayush Dattagupta <[email protected]> * more tests Signed-off-by: Ayush Dattagupta <[email protected]> * Move FuzzyDeDupConfig to it's own class Signed-off-by: Ayush Dattagupta <[email protected]> * Add example script and config file, fix typo Signed-off-by: Ayush Dattagupta <[email protected]> * Remove slurm examples for gpu dedup Signed-off-by: Ayush Dattagupta <[email protected]> * Add config module Signed-off-by: Ayush Dattagupta <[email protected]> * Rename FuzzyDeDupConfig and minhash_length to FuzzyDuplicatesConfig, num_hashes Signed-off-by: Ayush Dattagupta <[email protected]> * Add comments and update example Signed-off-by: Ayush Dattagupta <[email protected]> * Write to same format as input in fuzzy dedup example Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fix indexing in PII Modifier (#55) * Fix pii index issue Signed-off-by: Ryan Wolf <[email protected]> * Add sequential wrapper Signed-off-by: Ryan Wolf <[email protected]> * Fix pii tests Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Disable string conversion globally (#56) Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fix issue #43 (empty files creation) and improve reading/writing speed (#57) This commit fixes issue #43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero. In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method. Signed-off-by: Miguel Martínez <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * [Tutorials] Add a tutorial for PEFT data curation (#45) This PR adds a new tutorial to demonstrate data curation for PEFT use-cases. Signed-off-by: Mehran Maghoumi <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Only import PII constants during Curator import (#61) * Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies Signed-off-by: Ayush Dattagupta <[email protected]> * Add comment around import, move constant import to global scope Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Deleting links Signed-off-by: Nicoel Luo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fixed typo. Update content to lastest NeMo Curator version. Added fuzzy deduplication wrapper example Signed-off-by: Nicole Luo <[email protected]> * Fixing Style Signed-off-by: Nicole Luo <[email protected]> * Updating container version Signed-off-by: Nicole Luo <[email protected]> * Fixing style Signed-off-by: Nicole Luo <[email protected]> * Update get_client() according to latest version; Update log path for map_bucket section Signed-off-by: Nicole Luo <[email protected]> --------- Signed-off-by: Nicole Luo <[email protected]> Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Terry Kong <[email protected]> Signed-off-by: Miguel Martínez <[email protected]> Signed-off-by: Mehran Maghoumi <[email protected]> Signed-off-by: Nicoel Luo <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Co-authored-by: Ryan Wolf <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> Co-authored-by: Terry Kong <[email protected]> Co-authored-by: Miguel Martínez <[email protected]> Co-authored-by: Mehran Maghoumi <[email protected]> Co-authored-by: Ryan Wolf <[email protected]>

ryantwolf approved these changes May 8, 2024

View reviewed changes

ryantwolf merged commit 72b9775 into NVIDIA:main May 8, 2024
3 checks passed

ayushdg mentioned this pull request May 8, 2024

[BUG] Empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py #43

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue #43 (empty files creation) and improve reading/writing speed #57

Fix issue #43 (empty files creation) and improve reading/writing speed #57

miguelusque commented May 8, 2024

Fix issue #43 (empty files creation) and improve reading/writing speed #57

Fix issue #43 (empty files creation) and improve reading/writing speed #57

Conversation

miguelusque commented May 8, 2024