Fix #116. Fix task-decontamination broken links (#117)
* Fix task-decontamination broken link

Fix #116. Broken link in readme.md for task-decontamination.

Signed-off-by: Miguel Martínez <[email protected]>

* Fixed remaining links

There were multiple broken links. I have fixed them.

Signed-off-by: Miguel Martínez <[email protected]>

---------

Signed-off-by: Miguel Martínez <[email protected]>
miguelusque authored Jun 18, 2024
1 parent 9da54b8 commit 462b964
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -8,40 +8,40 @@ At the core of the NeMo Curator is the `DocumentDataset` which serves as the the

NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include:

-[Data download and text extraction](docs/user-guide/Download.rst)
+[Data download and text extraction](docs/user-guide/download.rst)

- Default implementations for downloading and extracting Common Crawl, Wikipedia, and ArXiv data
- Easily customize the download and extraction and extend to other datasets

-[Language identification and separation](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
+[Language identification and separation](docs/user-guide/languageidentificationunicodeformatting.rst)

- Language identification with [fastText](https://fasttext.cc/docs/en/language-identification.html) and [pycld2](https://pypi.org/project/pycld2/)

-[Text reformatting and cleaning](docs/user-guide/LanguageIdentificationUnicodeFormatting.rst)
+[Text reformatting and cleaning](docs/user-guide/languageidentificationunicodeformatting.rst)

- Fix unicode decoding errors via [ftfy](https://ftfy.readthedocs.io/en/latest/)

-[Quality filtering](docs/user-guide/QualityFiltering.rst)
+[Quality filtering](docs/user-guide/qualityfiltering.rst)

- Multilingual heuristic-based filtering
- Classifier-based filtering via [fastText](https://fasttext.cc/)

-[Document-level deduplication](docs/user-guide/GpuDeduplication.rst)
+[Document-level deduplication](docs/user-guide/gpudeduplication.rst)

- Both exact and fuzzy deduplication are accelerated using cuDF and Dask
- For fuzzy deduplication, our implementation follows the method described in [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)

-[Multilingual downstream-task decontamination](docs/user-guide/TaskDecontamination.rst)
+[Multilingual downstream-task decontamination](docs/user-guide/taskdecontamination.rst)

- Our implementation follows the approach of [OpenAI GPT3](https://arxiv.org/pdf/2005.14165.pdf) and [Microsoft Turing NLG 530B](https://arxiv.org/abs/2201.11990)

-[Distributed data classification](docs/user-guide/DistributedDataClassification.rst)
+[Distributed data classification](docs/user-guide/distributeddataclassification.rst)

- Multi-node, multi-GPU classifier inference
- Provides sophisticated domain and quality classification
- Flexible interface for extending to your own classifier network

-[Personal identifiable information (PII) redaction](docs/user-guide/PersonalIdentifiableInformationIdentificationAndRemoval.rst)
+[Personal identifiable information (PII) redaction](docs/user-guide/personalidentifiableinformationidentificationandremoval.rst)

- Identification tools for removing addresses, credit card numbers, social security numbers, and more
