Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement FileFormat for BSON #2308

Closed
vrongmeal opened this issue Dec 27, 2023 · 6 comments
Closed

Implement FileFormat for BSON #2308

vrongmeal opened this issue Dec 27, 2023 · 6 comments
Labels
refactor ♻️ Code refactoring

Comments

@vrongmeal
Copy link
Contributor

Description

bson_scan has a unique implementation compared to other object scans. We can implement FileFormat trait for BSON and create the BSON scan table function similar to others (csv_scan etc.)

@vrongmeal vrongmeal added the refactor ♻️ Code refactoring label Dec 27, 2023
@tychoish
Copy link
Collaborator

When I started writing this, I tried to do this and it didn't work. I think this probably doesn't make sense for a couple of reasons:

Having said that, I think even if it were possible, I'd be less inclined to want to do this: it'd mean writing rather a lot of boilerplate code, minimal benefits, and as it stands, the bson tables are easy to partition (and therefore potentially distribute/parallelize), and are quite lazily generated, which isn't impossible to do otherwise, but is pretty nice.

@vrongmeal
Copy link
Contributor Author

I'm not sure it gets us anything, except adding more boilerplate code: the streaming table implementation is rather clean.

It allows us to have a common implementation for both external tables and table functions (and any other thing we have in future).

it's not actually possible at this time given the state of the FileType enum and its dependency on DataFusion's FileFormat trait. see:

I think we can not use FileType and have an enum of our own. We create dyn FileFormat on our own using the enum. We can parse into our own enum and use it instead.

Not implementing FileFormat will lead us to maintain everything for (csv, json, parquet) and bson separately. Moreover, when we support other types, it will lead to more maintenance headache. I think it's worth investing in some boilerplate code in this case.

@tychoish
Copy link
Collaborator

It allows us to have a common implementation for both external tables and table functions (and any other thing we have in future).

The external table and table funcs share an implementation. We don't implement anything special for any of the other formats, we just us stuff from DF, which is great: changing in this way, would move one of our own implementations to a different form of our own implementation. There doesn't seem to be a clear win in terms of functionality, or maintenance unless we can upstream it, which seems unlikely.

I think we can not use FileType and have an enum of our own. We create dyn FileFormat on our own using the enum. We can parse into our own enum and use it instead.

I tried this. Because the FileFormat trait in DF depends on the FileType enum in DF, you can't inject your own FileFormat. The DF ticket I linked should address this problem, but we'd be trading what we currently have for a much more complicated implementation without any new features.

@tychoish
Copy link
Collaborator

(Just to close the loop on this; I think the current implementation is probably temporary and that this solution will eventually make sense, but we do need to wait for changes to datafusion and I don't think it's worth dedicating time to doing this work unless there's some kind of functional gain.)

@tychoish
Copy link
Collaborator

tychoish commented Mar 6, 2024

I'm increasingly disinclined to do this, and now that we have object store support for bson without needing FileFormat (and a parallel implementation for json stream data) there's even less reason to rewrite the code this way. Also by more tightly coupling to datafusion interfaces, we expose ourselves to more upgrade related friction.

It might make sense to do this as part of some other feature development, but I think we can go ahead and close this.

@tychoish tychoish closed this as completed Mar 6, 2024
@vrongmeal
Copy link
Contributor Author

Agreed. I did think about this some more and this makes sense not to move forward with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor ♻️ Code refactoring
Projects
None yet
Development

No branches or pull requests

2 participants