How to build the dataset for continued pre-training with the finetune_cosmopedia.sh script #26

Open
RuipingWang1986 opened this issue May 20, 2024 · 2 comments


@RuipingWang1986

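Hello, I am currently using finetune_cosmopedia.sh for continued pre-training. It works with the datasets published under HuggingFaceTB, but I would now like to use my own dataset, which is in txt format. Is there a way to convert our own data into a form that can be used for continued pre-training, or is there a tool for this? Thank you.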

@hills-code
Collaborator

You can refer to the official Hugging Face datasets documentation on loading txt files: https://huggingface.co/docs/datasets/nlp_load
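
For readers following that link, here is a minimal sketch of loading local txt files with the `text` builder from the `datasets` library. The helper names `collect_txt_files` and `load_txt_dataset` and the directory `my_corpus` are hypothetical, not part of this repository. The sketch also assumes that the training script reads a `text` column, which is the column the `text` builder produces; check finetune_cosmopedia.sh before relying on that.

```python
from pathlib import Path


def collect_txt_files(data_dir):
    """Return the sorted paths of all .txt files under data_dir (recursive)."""
    return sorted(str(p) for p in Path(data_dir).rglob("*.txt"))


def load_txt_dataset(data_dir, sample_by="line"):
    """Load local .txt files as a DatasetDict with a single "text" column.

    sample_by controls how the files are split into examples:
    "line", "paragraph", or "document" (one example per file).
    """
    # Imported lazily so that collect_txt_files works without the library.
    from datasets import load_dataset

    files = collect_txt_files(data_dir)
    if not files:
        raise FileNotFoundError(f"no .txt files found under {data_dir}")
    return load_dataset("text", data_files={"train": files}, sample_by=sample_by)


if __name__ == "__main__":
    # "my_corpus" is a placeholder directory; replace it with your own path.
    dataset = load_txt_dataset("my_corpus", sample_by="paragraph")
    print(dataset["train"][0]["text"])
```

Whether to split by line, paragraph, or document depends on how the txt files are laid out. For long-form pre-training text, `"paragraph"` or `"document"` usually keeps more context per example than the default `"line"`.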

@RuipingWang1986
Author

OK, I'll give it a try first. Thanks for the reply.
