-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update README.md #126
Update README.md #126
Conversation
d913c7e
to
e1d5b18
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of changes I'd like to see, but overall nice job!
README.md
Outdated
|
||
## Key Features | ||
# NeMo Curator | ||
🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can we capitalize this like a title?
Suggested change
🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀 🚀 **The GPU-Accelerated Open Source Framework for Efficient Large Language Models Data Curation** 🚀 - Not sure how much of this is for SEO, but I personally would prefer to omit the "Open Source" part. It feels self-evident given that we are on GitHub.
- I feel like it should be "Large Language Model Data Curation", not "Large Language Models Data Curation". Am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Maghoumi given you didn't change according to 2 & 3, I just want to confirm that the open source is needed for SEO and that "Large Language Models Data Curation" is correct (instead of "Large Language Model Data Curation")
|
||
NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include: | ||
<p align="center"> | ||
<img src="./docs/user-guide/images/diagram.png" alt="diagram"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am personally not a fan of having a diagram, but after talking with others they have found it useful. I still have a couple of critiques of this diagram in particular.
- Too much detail for the first thing the user sees. I think a smaller, simpler diagram would serve us better.
- Slightly confusing information. There are two "Language Detections" that might confuse the user. We would need to go more in detail to explain why each of them are needed than what is good for a diagram like this.
- The text is too small.
- The diagram aspect ratio is too wide.
I'm not sure what the best way to proceed is. We should perhaps have a separate meeting to discuss how to improve it. Do we know of any graphic designers we can reach out to to help refine this? I can make suggestions all day, but actually producing a better version is beyond my skillset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments. Adding @arhamm1 who made the diagram for awareness.
I think it'd be great to have a sync on the diagram with all the stakeholders.
In the meantime, let me know if you want me to remove this diagram from the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the diagram in the PR for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, kept.
5f8d210
to
e4c5a47
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryantwolf Thanks for reviewing. I made all the changes and pushed. Please take a look when you get a chance.
|
||
NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include: | ||
<p align="center"> | ||
<img src="./docs/user-guide/images/diagram.png" alt="diagram"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments. Adding @arhamm1 who made the diagram for awareness.
I think it'd be great to have a sync on the diagram with all the stakeholders.
In the meantime, let me know if you want me to remove this diagram from the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a couple of nits this time. Thanks again!
README.md
Outdated
- `peft-curation` which focuses on data curation for parameter-efficient fine-tuning use-cases. | ||
- [`tinystories`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/tinystories) which focuses on data curation for training LLMs from scratch. | ||
- [`peft-curation`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation) which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use-cases. | ||
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation and blending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: There is no data blending that occurs in this tutorial.
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation and blending. | |
- [`distributed_data_classification`](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/distributed_data_classification) which focuses on using the quality and domain classifiers to help with data annotation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, the tutorial's overview says it's mean to "help" with data annotation and blending, which is why I thought to include that here.
I changed the wording so it only mentions annotation.
README.md
Outdated
ScoreFilter(FastTextQualityFilter(model_path="model.bin")), | ||
# Discard records from irrelevant tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit inaccurate. What this is actually doing is ensuring that evaluation metrics don't leak into the training data. Think of it like this: The model shouldn't see/memorize the answers to the final exam. I think one of these would be better
- "Discard records from the evaluation metrics"
- "Prevent test set leakage"
- "Prevent test set contamination"
I'll let you decide on what you think is the best description is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, thanks for correcting my understanding. Revised.
|
||
NeMo Curator provides a collection of scalable data-mining modules. Some of the key features include: | ||
<p align="center"> | ||
<img src="./docs/user-guide/images/diagram.png" alt="diagram"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the diagram in the PR for now
README.md
Outdated
|
||
## Key Features | ||
# NeMo Curator | ||
🚀 **The GPU-accelerated open source framework for efficient large language models data curation** 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Maghoumi given you didn't change according to 2 & 3, I just want to confirm that the open source is needed for SEO and that "Large Language Models Data Curation" is correct (instead of "Large Language Model Data Curation")
e4c5a47
to
f5c7f9e
Compare
@ryantwolf Oops, I missed your original comment. I changed it to "Large Language Model Data Curation". Yes the redundant open source emphasis is for SEO as there is no other mention of it in the readme file. |
* Improve and shorten. * Include links to blogs and tutorials. * Remove incorrect info about non-existent branches. * Add a header and a diagram. * Add a note about incremental deduplication. Signed-off-by: Mehran Maghoumi <[email protected]>
f5c7f9e
to
5b0214b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good with me. Thanks!
Update README.md:
Description
Make minor revisions to the landing page. Include a link to the TinyStories blog post.
Checklist