Spark 3.5: Fix NotSerializableException when migrating Spark tables #11157

Open
manuzhang wants to merge 1 commit into main from fix-migrate-partitioned-tables

Conversation

manuzhang
Contributor

@manuzhang manuzhang commented Sep 18, 2024

@manuzhang manuzhang force-pushed the fix-migrate-partitioned-tables branch 2 times, most recently from 999c1ba to e4aeb31 Compare September 18, 2024 13:49
@manuzhang manuzhang force-pushed the fix-migrate-partitioned-tables branch from e4aeb31 to 005114b Compare October 6, 2024 16:45
@manuzhang
Contributor Author

@nastra could you help take a look?

@RussellSpitzer RussellSpitzer added this to the Iceberg 1.7.0 milestone Oct 23, 2024
@RussellSpitzer
Member

RussellSpitzer commented Oct 24, 2024

@manuzhang Can you summarize the usage of ExecutorService on the Spark executors? It looks like the current fix involves making a new ExecutorService per task, and I'm not sure that's what we want to do. I wonder if it makes more sense to pass in a Supplier so we don't have to implement a wrapper class.

But before we do that I want to make sure we are using the ExecutorService for the right reasons in the code path.
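
For illustration, a minimal sketch of the Supplier idea raised above, assuming the task carries a serializable supplier and builds the pool only where it runs. `SerializableSupplier`, `FileLengthTask`, and `SupplierDemo` are hypothetical names defined here for the sketch, not the classes from this PR, and "reading a file" is simulated by taking the length of its path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical interface for this sketch: a Supplier that can be serialized.
interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

// Hypothetical stand-in for per-task work on a Spark executor.
class FileLengthTask implements Serializable {
  private final SerializableSupplier<ExecutorService> poolSupplier;

  FileLengthTask(SerializableSupplier<ExecutorService> poolSupplier) {
    this.poolSupplier = poolSupplier;
  }

  // Creates the pool where the task runs, uses it, then shuts it down.
  int totalLength(List<String> paths) {
    ExecutorService pool = poolSupplier.get();
    try {
      List<Future<Integer>> futures = new ArrayList<>();
      for (String path : paths) {
        Callable<Integer> read = () -> path.length(); // simulated file read
        futures.add(pool.submit(read));
      }
      int total = 0;
      for (Future<Integer> future : futures) {
        total += future.get();
      }
      return total;
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }
}

class SupplierDemo {
  // The lambda captures only an int, so it serializes cleanly.
  static SerializableSupplier<ExecutorService> fixedPool(int threads) {
    return () -> Executors.newFixedThreadPool(threads);
  }

  // Serializes and deserializes an object, mimicking shipping a task to an executor.
  @SuppressWarnings("unchecked")
  static <T> T roundTrip(T obj) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(obj);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (T) in.readObject();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    FileLengthTask shipped = roundTrip(new FileLengthTask(fixedPool(2)));
    System.out.println(shipped.totalLength(Arrays.asList("a.parquet", "b.parquet")));
  }
}
```

Note that this sketch still creates one pool per task invocation, which is the concern raised in the comment; a supplier only moves pool creation to the executor side, it does not by itself share a pool across tasks.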

@manuzhang
Contributor Author

ExecutorService is used to parallelize reading files to build manifests on the Spark executors for Spark table migration procedures (add_files, migrate, snapshot).
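
For context, a minimal sketch of why a thread pool held by a task fails Java serialization. It assumes the task shipped to the Spark executors ends up holding a reference to the ExecutorService; `ManifestTask` and `SerializationCheck` are hypothetical names for illustration, not classes from Iceberg or this PR.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a task that would be shipped to executors.
class ManifestTask implements Serializable {
  // Thread pools returned by Executors are not Serializable.
  private final ExecutorService service;

  ManifestTask(ExecutorService service) {
    this.service = service;
  }
}

class SerializationCheck {
  // Returns true if Java serialization of the object succeeds,
  // false if it fails with NotSerializableException.
  static boolean canSerialize(Object obj) {
    try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
      out.writeObject(obj);
      return true;
    } catch (NotSerializableException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      System.out.println(canSerialize(new ManifestTask(pool)));
    } finally {
      pool.shutdown();
    }
  }
}
```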

@RussellSpitzer
Member

RussellSpitzer commented Oct 25, 2024

> ExecutorService is used to parallelize reading files to build manifests on the Spark executors for Spark table migration procedures (add_files, migrate, snapshot).

I meant specifically, are we using this in listPartitions or in the FileIndex implementation?

The title of this issue says this is an issue for "partitioned tables" so why does it work in that case but not in this case? Is it because the listPartitions code is using the executor service or what?

@manuzhang
Contributor Author

manuzhang commented Oct 25, 2024

Yes, it's used in listPartitions, and the title was not accurate: migrating unpartitioned Spark tables has the same issue. I've updated the title.

@manuzhang manuzhang changed the title Spark 3.5: Fix NotSerializableException when migrating partitioned Spark tables Spark 3.5: Fix NotSerializableException when migrating Spark tables Oct 25, 2024
@github-actions github-actions bot added the API label Oct 26, 2024
@RussellSpitzer
Member

After thinking about this for a while, I think you are probably right that we need to build a specific LazyExecutorService like you did originally. I'm sorry I led you on a goose chase here. Let's make sure it is Spark specific and doesn't touch any of the other implementations.
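
As a rough illustration of the lazy-wrapper idea, here is a sketch of a serializable ExecutorService that creates its thread pool on first use. This is an assumption about the general shape, not the class from this PR: the field names, the fixed-size pool, and the `isInitialized` and `roundTrip` helpers are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.AbstractExecutorService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: only the pool size is serialized; the pool itself is built lazily.
class LazyExecutorService extends AbstractExecutorService implements Serializable {
  private final int parallelism;
  // transient: skipped by Java serialization, so it is null after deserialization.
  private transient volatile ExecutorService delegate;

  LazyExecutorService(int parallelism) {
    this.parallelism = parallelism;
  }

  // Double-checked locking so the pool is created at most once per instance.
  private ExecutorService delegate() {
    ExecutorService local = delegate;
    if (local == null) {
      synchronized (this) {
        local = delegate;
        if (local == null) {
          local = Executors.newFixedThreadPool(parallelism);
          delegate = local;
        }
      }
    }
    return local;
  }

  boolean isInitialized() {
    return delegate != null;
  }

  @Override
  public void execute(Runnable command) {
    delegate().execute(command);
  }

  @Override
  public void shutdown() {
    ExecutorService local = delegate;
    if (local != null) {
      local.shutdown();
    }
  }

  @Override
  public List<Runnable> shutdownNow() {
    ExecutorService local = delegate;
    return local != null ? local.shutdownNow() : Collections.<Runnable>emptyList();
  }

  @Override
  public boolean isShutdown() {
    ExecutorService local = delegate;
    return local != null && local.isShutdown();
  }

  @Override
  public boolean isTerminated() {
    ExecutorService local = delegate;
    return local != null && local.isTerminated();
  }

  @Override
  public boolean awaitTermination(long timeout, TimeUnit unit) throws InterruptedException {
    ExecutorService local = delegate;
    return local == null || local.awaitTermination(timeout, unit);
  }
}

class LazyExecutorDemo {
  // Serializes and deserializes an object, mimicking shipping it to an executor.
  @SuppressWarnings("unchecked")
  static <T> T roundTrip(T obj) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(obj);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (T) in.readObject();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}
```

Because the wrapper is itself an ExecutorService, it can be handed to any method that already accepts one, while the threads are only started on the JVM that first submits work to it.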

@manuzhang
Contributor Author

@RussellSpitzer I can revert to the previous commit, and this is Spark specific, but can you elaborate on why LazyExecutorService is better than SerializableSupplier? I agree with you that people can pass around a non-lazy ExecutorService in future implementations.

I submitted #11417 to add a warning to the docs, since this PR can't get into 1.7.0.

@RussellSpitzer
Member

RussellSpitzer commented Oct 29, 2024

The main reason is that the API specified in the API module allows withExecutorService(ExecutorService), so unless we want to break that API (which we could), we need to stick with just passing through an executor service. Alternatively, we could change the API if you think that's warranted. If we did that, I'd probably remove "executor service" altogether.

@manuzhang manuzhang force-pushed the fix-migrate-partitioned-tables branch from ac414d8 to 005114b Compare October 30, 2024 02:22
@github-actions github-actions bot removed the API label Oct 30, 2024
@manuzhang
Contributor Author

@RussellSpitzer I've reverted to lazy executor service. Please check again. Thanks.

Development

Successfully merging this pull request may close these issues.

procedure add_files parallelism > 1 -> NotSerializableException
3 participants