Return the name of the currently loaded file in the load_dataset function. #5806
Comments
Implementing this makes sense (e.g., |
Hey @mariosasko, can I work on this issue? It seems interesting to implement. I have contributed to JupyterLab recently and would love to contribute here as well. |
@tsabbir96 if you are planning to start working on this, you can take on this issue by writing a comment with only the keyword: #self-assign |
#self-assign |
@albertvillanova thank you for letting me contribute here. |
Hello there, is this issue resolved? @tsabbir96, are you still working on it? Otherwise, I would love to give it a try. |
@EduardoPach This issue is still relevant, so feel free to work on it. |
Hey @mariosasko, I've taken the time to look at how we usually load datasets. My main question now is about the final solution. So the idea is that whenever we load a dataset, the builders' _generate_tables() method also adds a new column called filename (or file_name) that refers to the files contained in each split, right? Do you have any suggestions on where I could add that? |
Is this issue still open? If yes, I'd like to work on it. Thanks |
Definitely still open. I gave it a try, but I didn't get any feedback on my last question, so I stopped. Feel free to work on it. |
It's still open, so feel free to work on it. This can be implemented by adding a param to the packaged builders' configs that inserts a column with file names (in |
Hi, is this issue still open? I see no activity since September, but it shows that it is still assigned to tsabbir96. If |
Looking forward to your implementation. I also really need this feature. |
Hi. I am new and would like to contribute to this issue @tsabbir96 |
Feature request
Add an optional parameter return_file_name in the load_dataset function. When it is set to True, the function will include the name of the file corresponding to the current line as a feature in the returned output.
Motivation
When training large language models, machine problems may interrupt the training process. In such cases, it is common to load a previously saved checkpoint to resume training. I would like to be able to obtain the names of the previously trained data shards, so that I can skip these parts of the data during continued training to avoid overfitting and redundant training time.
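To make the resume scenario concrete, here is a minimal standard-library sketch, not the `datasets` API. It assumes each row carries the name of its source shard; the helper `iter_rows_with_file_name`, the `file_name` key, and the `skip_file_names` argument are all hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def iter_rows_with_file_name(
    paths: Iterable[Path], skip_file_names: frozenset = frozenset()
) -> Iterator[dict]:
    """Yield JSON Lines rows, each tagged with a hypothetical "file_name" key."""
    for path in paths:
        if path.name in skip_file_names:
            continue  # this shard was fully consumed before the checkpoint
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():  # ignore blank lines
                    yield {**json.loads(line), "file_name": path.name}


# Demo: three one-row shards; pretend shard-0 was finished before the interruption.
with tempfile.TemporaryDirectory() as tmp:
    shards = []
    for i in range(3):
        shard = Path(tmp) / f"shard-{i}.jsonl"
        shard.write_text(json.dumps({"text": f"doc {i}"}) + "\n", encoding="utf-8")
        shards.append(shard)

    finished = frozenset({"shard-0.jsonl"})  # e.g. restored from a checkpoint
    resumed_rows = list(iter_rows_with_file_name(shards, finished))
```

With the file name available on every row, the training loop can record which shards it has finished and pass that set back in when it resumes. A shard that was only partially consumed would need extra bookkeeping (for example a row offset), which this sketch does not cover.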
Your contribution
I currently use a dataset in jsonl format, so I am primarily interested in the json format. I suggest adding the file name to the returned table here https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.