Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix datasets export to JSON #7181

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

varadhbhatnagar
Copy link
Contributor

No description provided.

@varadhbhatnagar
Copy link
Contributor Author

Linked Issue: #7037
Ideas: #7039

@varadhbhatnagar
Copy link
Contributor Author

varadhbhatnagar commented Sep 29, 2024

@albertvillanova / @lhoestq any early feedback?

AFAIK there is no param orient in load_dataset(). So for orientations other than "records", the loading isn't very accurate. Any thoughts?

@varadhbhatnagar varadhbhatnagar marked this pull request as ready for review October 9, 2024 18:17
@varadhbhatnagar
Copy link
Contributor Author

varadhbhatnagar commented Oct 9, 2024

orient = "split" can also be handled. I will add the changes soon

@lhoestq
Copy link
Member

lhoestq commented Oct 11, 2024

Thanks for diving into this ! I don't think we want the JSON export to be that complex though, especially if people can do ds.to_pandas().to_json(orient=...). Maybe we can just raise an error and suggest users to use pandas ? And also note that it loads the full dataset in memory so it's mainly for small scale datasets. The only acceptable option for large scale datasets is probably just JSON Lines anyway since it enables streaming.

@varadhbhatnagar
Copy link
Contributor Author

@lhoestq Simply doing ds.to_pandas().to_json(orient=...) is not going to give any batching or multiprocessing benefits right? Also, which function are you referring to - when you say that its meant for small scale datasets only?

@lhoestq
Copy link
Member

lhoestq commented Oct 14, 2024

Yes indeed. Though I think it's fine since using something else than orient="lines" is only suitable/useful for small datasets. Or you know a case where a big dataset need to be in a format that is not orient="lines" ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants