Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Add __eq__, __hash__ to SparkSource for correct comparison #4028

Merged
merged 5 commits into from
Mar 26, 2024
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -185,6 +185,21 @@ def get_table_query_string(self) -> str:

return f"`{tmp_table_name}`"

def __eq__(self, other):
base_eq = super().__eq__(other)
if not base_eq:
return False
if self.table != other.table:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for nitpicking :) Won't this look better as a single expression? something like return table == table and query == query ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they have priority, I see that there is a kind of hierarchy in the way we get the data out of sparksource in get_table_query_string function, if not table provided then read query if not read path. So I assume that there is a case that like this
sparksourceA(table = 'tableA', query=None) and after that it change to sparksourceA(table = 'tableA', query='select some thing from sometable'), how should we compare in this case. I still have some concern about the logic of providing table, query and path when init sparksource, isn't it we just need 1 of those to be provided, and 2 remained will be null, should we have some contraint on this? If yes so we can do some things like return table == table and query == query ... @tokoko @HaoXuAI

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are good points, but I think we should still simply check for absolute equality, so priority doesn't really matter, if there is a change in query or if user changes query to None and switches to using table, this method should return False. In some cases, the change might not actually have an impact i guess, but it's still a change. So a single expression should be fine.

As for the question about 3 competing parameters (table, query and path), some sort of a constraint is probably a good idea, but imho the whole idea of using SparkSource for everything was a really bad decision in the first place. If the user wants to register a parquet folder as a data source for example, i believe the way to do that should be by providing FileSource object instead of a SparkSource one... To achieve that, we need to teach SparkOfflineStore how to read FileSource and deprecate path parameter from SparkSource... but that's a discussion for another day and a little out of scope here :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay I think we can by pass the order, thanks @tokoko

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HaoXuAI , I think we can merge the pr .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HaoXuAI , can you help me with the pr ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. thanks for the work 👍

return False
if self.query != other.query:
return False
if self.path != other.path:
return False
return True

def __hash__(self):
return super().__hash__()


class SparkOptions:
allowed_formats = [format.value for format in SparkSourceFormat]
Expand Down
Loading