Changelog

NeMo Curator 0.5.0

Highlights

Image Curation
- Image Embedding Creation
- Aesthetic Classifier
- NSFW Classifier
- Semantic Deduplication

Text Curation
- Quality Classifier
- Aegis Classifier
- FineWeb-Edu Classifier

Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.5.0

NeMo Curator 0.4.1

What's Changed

Add spacy<3.8 pin to r0.4.1 by @ayushdg in #279

Full Changelog: https://github.com/NVIDIA/NeMo-Curator/compare/v0.4.0...v0.4.1

NeMo Curator 0.4.0

Highlights

Semantic Deduplication
Resiliparse for Text Extraction
Improve Distributed Data Classification - Domain classifier is 1.55x faster through intelligent batching
Synthetic data generation for fine-tuning

What's Changed

Update README by @ryantwolf in #6
[Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in #5
Add workflow for running cpu pytests by @ayushdg in #13
Add pre-commit style checks by @ayushdg in #14
Add citation by @ryantwolf in #15
Fix Noisy CUDA Shutdown by @ryantwolf in #20
Bump Python and RAPIDS versions by @ryantwolf in #16
Add batched decorator by @ryantwolf in #18
Add issue templates by @ayushdg in #22
Add dependency to fix justext by @ryantwolf in #24
Fix metadata inference with pandas and dask by @ryantwolf in #35
Disable PyTorch Compile Multiprocessing by @ryantwolf in #34
Improve speed of AddId module by @ryantwolf in #36
Make GPU dependencies optional by @ayushdg in #27
Fix failing GPU tests with latest pandas bump by @ayushdg in #41
Adds Nemo Curator K8s example by @terrykong in #40
Move common dedup utils and remove unused code by @ayushdg in #42
Fix lang id example by @ryantwolf in #37
Add dataset blending tool by @ryantwolf in #32
High level fuzzy duplicates module by @ayushdg in #46
Fix indexing in PII Modifier by @ryantwolf in #55
Disable string conversion globally by @ryantwolf in #56
Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in #57
[Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in #45
Only import PII constants during Curator import by @ayushdg in #61
Align extract_partitioning_index logic with upstream shuffling by @rjzamora in #60
[REVIEW] Switch Models to use Crossfit by @VibhuJawa in #58
Remove argparse from get_client function signature by @ryantwolf in #12
Fuzzy Dedup: Use text_field instead of hardcoded text column by @ayushdg in #74
Add pull request template by @ayushdg in #78
Add jupyter notebook tutorial for single node multilingual dataset by @nicoleeeluo in #30
Update issue templates by @ryantwolf in #81
Fix #91 - Incorrect reference to domain_classifier_example.py by @miguelusque in #92
Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes by @miguelusque in #75
Update readme by @ayushdg in #93
Update documentation for new version by @ryantwolf in #83
Update requirements documentation. by @ayushdg in #98
Make sure query-planning is disabled for now by @rjzamora in #97
Applying SEO Best Practices by @aschilling-nv in #104
Shuffle CC result on group before writing out by @ayushdg in #110
Added tutorials to index.rst by @jgerh in #113
Pin to numpy<2 to avoid spacy compat issues by @ayushdg in #119
Fix #116. Fix broken links by @miguelusque in #117
Update index.rst by @aschilling-nv in #129
Fix nemo_curator import in CPU only environment when GPU packages are installed. by @ayushdg in #123
Improve Common Crawl download by @ryantwolf in #82
Update README.md by @Maghoumi in #126
Allow multiple filenames per partition when separating by metadata by @ayushdg in #99
[REVIEW] Add Resiliparse option for text extraction by @sarahyurick in #128
Fix 69 - Refactor how arguments are added to scripts by @miguelusque in #102
Stricter check for query planning. by @ayushdg in #107
Add DataFrame example to Distributed Data Classification tutorial by @sarahyurick in #137
Enable Sem-dedup by @VibhuJawa in #130
Remove lxml installation by @ryantwolf in #140
Nemotron 340 SDG Pipeline Tutorial by @chrisalexiuk-nvidia in #144
Add Synthetic Data Generation Module by @ryantwolf in #136
Skip explicit comms shuffle for dask-cuda 24.06 by @ayushdg in #147
Add support for NeMo SDK by @ryantwolf in #131
[REVIEW] Fix SemDedup bugs by @VibhuJawa in #151
[pre-commit.ci] pre-commit suggestions by @pre-commit-ci in #135
Fix bug with torch rmm and nemo by @ryantwolf in #155

New Contributors

@ayushdg made their first contribution in #13
@terrykong made their first contribution in #40
@rjzamora made their first contribution in #60
@nicoleeeluo made their first contribution in #30
@aschilling-nv made their first contribution in #104
@pre-commit-ci made their first contribution in #135

Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.4.0

NeMo Curator 0.3.0

What's Changed

Update README by @ryantwolf in #6
[Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in #5
Add workflow for running cpu pytests by @ayushdg in #13
Add pre-commit style checks by @ayushdg in #14
Add citation by @ryantwolf in #15
Fix Noisy CUDA Shutdown by @ryantwolf in #20
Bump Python and RAPIDS versions by @ryantwolf in #16
Add batched decorator by @ryantwolf in #18
Add issue templates by @ayushdg in #22
Add dependency to fix justext by @ryantwolf in #24
Fix metadata inference with pandas and dask by @ryantwolf in #35
Disable PyTorch Compile Multiprocessing by @ryantwolf in #34
Improve speed of AddId module by @ryantwolf in #36
Make GPU dependencies optional by @ayushdg in #27
Fix failing GPU tests with latest pandas bump by @ayushdg in #41
Adds Nemo Curator K8s example by @terrykong in #40
Move common dedup utils and remove unused code by @ayushdg in #42
Fix lang id example by @ryantwolf in #37
Add dataset blending tool by @ryantwolf in #32
High level fuzzy duplicates module by @ayushdg in #46
Fix indexing in PII Modifier by @ryantwolf in #55
Disable string conversion globally by @ryantwolf in #56
Fix issue #43 (empty files creation) and improve reading/writing speed by @miguelusque in #57
[Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in #45
Only import PII constants during Curator import by @ayushdg in #61
Align extract_partitioning_index logic with upstream shuffling by @rjzamora in #60

New Contributors

@Maghoumi made their first contribution in #5
@terrykong made their first contribution in #40
@miguelusque made their first contribution in #57
@rjzamora made their first contribution in #60

Full Changelog: https://github.com/NVIDIA/NeMo-Curator/commits/v0.3.0

PyPI

https://pypi.org/project/nemo-curator/0.3.0/
