-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve speed of AddId module #36
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
ayushdg
approved these changes
Apr 23, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes. Can confirm this doesn't impact the GPU dedup results from small scale testing.
nicoleeeluo
pushed a commit
to nicoleeeluo/NeMo-Curator
that referenced
this pull request
May 20, 2024
* Add fast id method Signed-off-by: Ryan Wolf <[email protected]> * Add type conversion Signed-off-by: Ryan Wolf <[email protected]> * Fix off by one errors in tests Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]>
ryantwolf
added a commit
that referenced
this pull request
May 24, 2024
* Init commit for tutorial notebook Signed-off-by: Nicole Luo <[email protected]> * Fix metadata inference with pandas and dask (#35) * Fix metadata inference with pandas and dask Signed-off-by: Ryan Wolf <[email protected]> * Fix datatypes for task decontamination Signed-off-by: Ryan Wolf <[email protected]> * Use targetted import Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Disable PyTorch Compile Multiprocessing (#34) * Move tokenizer import Signed-off-by: Ryan Wolf <[email protected]> * Reduce inductor threads Signed-off-by: Ryan Wolf <[email protected]> * Change env int to string Signed-off-by: Ryan Wolf <[email protected]> * Change location of env var Signed-off-by: Ryan Wolf <[email protected]> * Add comment linking issue Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Improve speed of AddId module (#36) * Add fast id method Signed-off-by: Ryan Wolf <[email protected]> * Add type conversion Signed-off-by: Ryan Wolf <[email protected]> * Fix off by one errors in tests Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Make GPU dependencies optional (#27) * Move GPU imports and make them optional Signed-off-by: Ayush Dattagupta <[email protected]> * Move gpu dependencies to a seperate install Signed-off-by: Ayush Dattagupta <[email protected]> * Remove unused import Signed-off-by: Ayush Dattagupta <[email protected]> * Switch to placeholder import that raises on usage Signed-off-by: Ayush Dattagupta <[email protected]> * Remove deprecated utils usage Signed-off-by: Ayush Dattagupta <[email protected]> * Add cuML attribution Signed-off-by: Ayush Dattagupta <[email protected]> * Safe import tests, improve install instruction, update gha workflow Signed-off-by: Ayush Dattagupta <[email protected]> * Fix pytests due to loc bug Signed-off-by: Ayush Dattagupta <[email protected]> * update install instructions Signed-off-by: Ayush Dattagupta <[email protected]> * Raise on non module-not-found errors, update logging Signed-off-by: Ayush Dattagupta <[email protected]> * Update logging to not change root logger Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fix failing GPU tests with latest pandas bump (#41) Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Adds Nemo Curator K8s example (#40) * [K8s]: Adds a helper script to create a dask cluster on k8s and includes instructions for how to a Curator workload on k8s Signed-off-by: Terry Kong <[email protected]> * black formatting Signed-off-by: Terry Kong <[email protected]> * big_english -> my_dataset Signed-off-by: Terry Kong <[email protected]> * 24.01 -> 24.03 default container Signed-off-by: Terry Kong <[email protected]> * Add help kwarg to all flags Signed-off-by: Terry Kong <[email protected]> * Clarify why venv is needed Signed-off-by: Terry Kong <[email protected]> * fix precommit failures Signed-off-by: Terry Kong <[email protected]> --------- Signed-off-by: Terry Kong <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Move common dedup utils and remove unused code (#42) * Refactor common utils and remove unused code Signed-off-by: Ayush Dattagupta <[email protected]> * More cleanup Signed-off-by: Ayush Dattagupta <[email protected]> * More updates/shuffling Signed-off-by: Ayush Dattagupta <[email protected]> * Move gpu_dedup scripts into subfolder Signed-off-by: Ayush Dattagupta <[email protected]> * Remove gpu_deduplication subfolder Signed-off-by: Ayush Dattagupta <[email protected]> * Add readme to fuzzy dedup scripts section Signed-off-by: Ayush Dattagupta <[email protected]> * Fix typo and relative links Signed-off-by: Ayush Dattagupta <[email protected]> * Remove legacy script entrypoints Signed-off-by: Ayush Dattagupta <[email protected]> * Remove legacy scripts and add init file Signed-off-by: Ayush Dattagupta <[email protected]> * Update GpuDeduplication.rst Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fix lang id example (#37) * Fix lang id example Signed-off-by: Ryan Wolf <[email protected]> * Add classifier unit tests Signed-off-by: Ryan Wolf <[email protected]> * Add test for failure Signed-off-by: Ryan Wolf <[email protected]> * Remove failure test Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Add dataset blending tool (#32) * Add initial dataset blending function Signed-off-by: Ryan Wolf <[email protected]> * Add blend unit tests Signed-off-by: Ryan Wolf <[email protected]> * Add self parameter Signed-off-by: Ryan Wolf <[email protected]> * Fix return type of blend dataset Signed-off-by: Ryan Wolf <[email protected]> * Fix blending tests Signed-off-by: Ryan Wolf <[email protected]> * Change assert statement for very uneven blend Signed-off-by: Ryan Wolf <[email protected]> * Fix key error Signed-off-by: Ryan Wolf <[email protected]> * Add proper proportion blending test Signed-off-by: Ryan Wolf <[email protected]> * Add four dataset blend and clarify docs Signed-off-by: Ryan Wolf <[email protected]> * Add shuffle module Signed-off-by: Ryan Wolf <[email protected]> * Add blend example and tests Signed-off-by: Ryan Wolf <[email protected]> * Fix random method name Signed-off-by: Ryan Wolf <[email protected]> * Wrap return type in DocumentDataset Signed-off-by: Ryan Wolf <[email protected]> * Save result of column drop Signed-off-by: Ryan Wolf <[email protected]> * Change equality check for shuffle tests Signed-off-by: Ryan Wolf <[email protected]> * Fix expected order after shuffle Signed-off-by: Ryan Wolf <[email protected]> * Add more documents to shuffle test Signed-off-by: Ryan Wolf <[email protected]> * Add assert statement Signed-off-by: Ryan Wolf <[email protected]> * Add within partition shuffle Signed-off-by: Ryan Wolf <[email protected]> * Refactor add rand column for shuffle Signed-off-by: Ryan Wolf <[email protected]> * Fix filename tests Signed-off-by: Ryan Wolf <[email protected]> * Add determinism handling for shuffle Signed-off-by: Ryan Wolf <[email protected]> * Change numpy random function Signed-off-by: Ryan Wolf <[email protected]> * Fix tests with new random method Signed-off-by: Ryan Wolf <[email protected]> * Remove length call from blending Signed-off-by: Ryan Wolf <[email protected]> * Improve scaling of blending function Signed-off-by: Ryan Wolf <[email protected]> * Fix blend tests Signed-off-by: Ryan Wolf <[email protected]> * Add blending script Signed-off-by: Ryan Wolf <[email protected]> * Add additional file paths call Signed-off-by: Ryan Wolf <[email protected]> * Add documentation Signed-off-by: Ryan Wolf <[email protected]> * Reformat docs Signed-off-by: Ryan Wolf <[email protected]> * Remove backticks Signed-off-by: Ryan Wolf <[email protected]> * Add context manager for shuffle tests Signed-off-by: Ryan Wolf <[email protected]> * Add better deterministic shuffle path Signed-off-by: Ryan Wolf <[email protected]> * Update documentation and reset index Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * High level fuzzy duplicates module (#46) * Initial pass at fuzzy dedup api Signed-off-by: Ayush Dattagupta <[email protected]> * Update deprecated shuffle arg Signed-off-by: Ayush Dattagupta <[email protected]> * dask_cuda gpu only import Signed-off-by: Ayush Dattagupta <[email protected]> * Move fuzzy_dedup imports to optional Signed-off-by: Ayush Dattagupta <[email protected]> * more tests Signed-off-by: Ayush Dattagupta <[email protected]> * Move FuzzyDeDupConfig to it's own class Signed-off-by: Ayush Dattagupta <[email protected]> * Add example script and config file, fix typo Signed-off-by: Ayush Dattagupta <[email protected]> * Remove slurm examples for gpu dedup Signed-off-by: Ayush Dattagupta <[email protected]> * Add config module Signed-off-by: Ayush Dattagupta <[email protected]> * Rename FuzzyDeDupConfig and minhash_length to FuzzyDuplicatesConfig, num_hashes Signed-off-by: Ayush Dattagupta <[email protected]> * Add comments and update example Signed-off-by: Ayush Dattagupta <[email protected]> * Write to same format as input in fuzzy dedup example Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fix indexing in PII Modifier (#55) * Fix pii index issue Signed-off-by: Ryan Wolf <[email protected]> * Add sequential wrapper Signed-off-by: Ryan Wolf <[email protected]> * Fix pii tests Signed-off-by: Ryan Wolf <[email protected]> --------- Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Disable string conversion globally (#56) Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fix issue #43 (empty files creation) and improve reading/writing speed (#57) This commit fixes issue #43 (empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py) by double-checking the files size after being generated, and deleting them with size zero. In addition to that, I have noticed there is no need to parse to JSON object the content of the different lines, which should be already in json format. By removing that extra-parsing, there is a significant speed up in the execution of this method. Signed-off-by: Miguel Martínez <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * [Tutorials] Add a tutorial for PEFT data curation (#45) This PR adds a new tutorial to demonstrate data curation for PEFT use-cases. Signed-off-by: Mehran Maghoumi <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Only import PII constants during Curator import (#61) * Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies Signed-off-by: Ayush Dattagupta <[email protected]> * Add comment around import, move constant import to global scope Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Deleting links Signed-off-by: Nicoel Luo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Update tutorials/single_node_tutorial/single_gpu_tutorial.ipynb Co-authored-by: Ryan Wolf <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Signed-off-by: Nicole Luo <[email protected]> * Fixed typo. Update content to lastest NeMo Curator version. Added fuzzy deduplication wrapper example Signed-off-by: Nicole Luo <[email protected]> * Fixing Style Signed-off-by: Nicole Luo <[email protected]> * Updating container version Signed-off-by: Nicole Luo <[email protected]> * Fixing style Signed-off-by: Nicole Luo <[email protected]> * Update get_client() according to latest version; Update log path for map_bucket section Signed-off-by: Nicole Luo <[email protected]> --------- Signed-off-by: Nicole Luo <[email protected]> Signed-off-by: Ryan Wolf <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Terry Kong <[email protected]> Signed-off-by: Miguel Martínez <[email protected]> Signed-off-by: Mehran Maghoumi <[email protected]> Signed-off-by: Nicoel Luo <[email protected]> Signed-off-by: nicoleeeluo <[email protected]> Co-authored-by: Ryan Wolf <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> Co-authored-by: Terry Kong <[email protected]> Co-authored-by: Miguel Martínez <[email protected]> Co-authored-by: Mehran Maghoumi <[email protected]> Co-authored-by: Ryan Wolf <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Improves the speed of the add id module by reducing the size of the task graph. However, this new method uses a different id structure. If users need the old format that is guaranteed to be ordered, they can still access by passing in a definitive
start_id
.