
refactor(ingest): Call source_helpers via new WorkUnitProcessors in base Source #8101

Merged · 8 commits · May 24, 2023

Conversation

asikowitz (Collaborator):

Somewhat serious refactor that provides a default implementation of get_workunits and makes get_workunits_internal the main method to override. Adds get_workunit_processors(), which returns a list of callables that take in a WorkUnit stream and alter that stream (i.e. source helpers). Provides a default list of workunit processors (source helpers) that can be changed by overriding the method.
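In rough terms, the new control flow looks like the following. This is a minimal sketch of the pattern described above, not the exact code from this PR; the MetadataWorkUnit placeholder and the method bodies are illustrative:

from typing import Callable, Iterable, List, Optional

class MetadataWorkUnit:
    """Placeholder for DataHub's workunit type."""

# A workunit processor consumes a stream of workunits and yields a
# (possibly altered) stream.
MetadataWorkUnitProcessor = Callable[
    [Iterable[MetadataWorkUnit]], Iterable[MetadataWorkUnit]
]

class Source:
    def get_workunits(self) -> Iterable[MetadataWorkUnit]:
        # Default implementation: thread the raw stream through each
        # processor in order, skipping any None entries.
        stream = self.get_workunits_internal()
        for processor in self.get_workunit_processors():
            if processor is not None:
                stream = processor(stream)
        return stream

    def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
        # The main method sources now override.
        raise NotImplementedError

    def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
        # Default list of source helpers; sources override this to change it
        # (in the PR itself the default includes auto_workunit_reporter).
        return []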

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@@ -119,14 +118,9 @@ def auto_workunit_reporter(report: SourceReport, stream: Iterable[T]) -> Iterabl

def auto_materialize_referenced_tags(
    stream: Iterable[MetadataWorkUnit],
    active: bool = True,
asikowitz (Collaborator, Author):

Now handled by passing in None for the workunit processor, although I don't think active was ever passed as False.

    state_type_class=GenericCheckpointState,
    pipeline_name=self.ctx.pipeline_name,
    run_id=self.ctx.run_id,
self.stale_entity_removal_handler = StaleEntityRemovalHandler.create(
asikowitz (Collaborator, Author):

This gets used to call self.stale_entity_removal_handler.add_urn_to_skip(node_datahub_urn) at some point

@@ -1197,6 +1185,17 @@ def get_workspace_workunit(
        for workunit in dataset_workunits:
            yield workunit

    def get_workunit_processors(self) -> Sequence[Optional[MetadataWorkUnitProcessor]]:
asikowitz (Collaborator, Author):

Pay attention to this file; some pretty non-standard behavior. Not sure if there's a cleaner way to do this... could directly override get_workunits as one option.

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label May 23, 2023
    def get_workunits(self) -> Iterable[MetadataWorkUnit]:
        return auto_workunit_reporter(self.report, self.get_workunits_internal())

    def get_workunit_processors(self) -> Sequence[Optional[MetadataWorkUnitProcessor]]:
        return [partial(auto_workunit_reporter, self.report)]
Collaborator:

add a comment that not calling super() is intentional here
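i.e. something along these lines (hypothetical wording for the suggested comment, assuming functools.partial and auto_workunit_reporter as in the snippet above):

    def get_workunit_processors(self) -> Sequence[Optional[MetadataWorkUnitProcessor]]:
        # Intentionally not calling super().get_workunit_processors(): this
        # source only wants workunit reporting, not the full default chain.
        return [partial(auto_workunit_reporter, self.report)]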

@@ -155,9 +176,35 @@ def create(cls, config_dict: dict, ctx: PipelineContext) -> "Source":
        # can't make this method abstract.
        raise NotImplementedError('sources must implement "create"')

    @abstractmethod
    def get_workunit_processors(self) -> Sequence[Optional[MetadataWorkUnitProcessor]]:
Collaborator:

I think we should change this return type to List[Optional[MetadataWorkUnitProcessor]]

asikowitz (Collaborator, Author):

So we can append to it? I guess it's more specific.

Collaborator:

not for append - we shouldn't be using append here

Sequence is a bit weird to work with in some places with mypy. We can always make it more general in the future, but let's not restrain ourselves unnecessarily

asikowitz (Collaborator, Author):

Oh really, what issues have you seen with Sequence? In general, I think we want to use Sequence (if possible) for parameters but List / other specific types for return types

Collaborator:

Actually I think sequences are fine. IIRC the issue I've seen is that Iterable is not covariant (so passing Iterable[Square] to a function that takes Iterable[Shape] would throw an error), but Sequence works fine there.

> In general, I think we want to use Sequence (if possible) for parameters but List / other specific types for return types

Yup, this seems like a good rule of thumb - might be worth adding to the list here: https://datahubproject.io/docs/metadata-ingestion/developing/#code-style
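A minimal sketch of the variance point under mypy (for the record, typing's Sequence and Iterable are both covariant; it is List that is invariant):

from typing import List, Sequence

class Shape: ...

class Square(Shape): ...

def accepts_sequence(shapes: Sequence[Shape]) -> None: ...

def accepts_list(shapes: List[Shape]) -> None: ...

squares: List[Square] = [Square()]
accepts_sequence(squares)  # OK: Sequence is covariant, so List[Square] is accepted
accepts_list(squares)      # mypy error: List is invariant, List[Square] != List[Shape]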

)


def get_current_checkpoint_from_pipeline(
Collaborator:

we're gonna get merge conflicts on this with my PR

Collaborator:

ref #8104

@hsheth2 (Collaborator) left a comment:

LGTM

@mayurinehate (Collaborator) left a comment:

LGTM

@@ -155,9 +177,35 @@ def create(cls, config_dict: dict, ctx: PipelineContext) -> "Source":
        # can't make this method abstract.
        raise NotImplementedError('sources must implement "create"')

    @abstractmethod
    def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
        """A list of functions that transforms the workunits produced by this source.
Collaborator:

So the returned list can have None? Not clear to me why we need Optional[MetadataWorkUnitProcessor] as opposed to simply MetadataWorkUnitProcessor.

asikowitz (Collaborator, Author):

Yeah, the idea is we can do something like `return [*super().get_workunit_processors(), other_workunit_processor if self.config.flag else None]`.
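For instance, a hypothetical subclass (the materialize_tags config flag is made up for illustration; auto_materialize_referenced_tags is the helper touched by this PR):

class MySource(Source):
    def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
        return [
            *super().get_workunit_processors(),
            # None marks a processor disabled by config; the base class
            # skips None entries when chaining processors.
            auto_materialize_referenced_tags if self.config.materialize_tags else None,
        ]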

            StaleEntityRemovalHandler.create(
                self, self.config, self.ctx
            ).workunit_processor,
        ]
Collaborator:

Stale entity removal would emit some workunits which won't be reported, I believe, as the workunit_processor for stale entity removal comes after auto_workunit_reporter. The same was the case earlier, so guessing that's okay.

asikowitz (Collaborator, Author):

Yeah, was trying to match existing behavior. If we want it the other way we can always make the change
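To make the ordering concern concrete (hypothetical wiring that mirrors the behavior discussed above):

stream = source.get_workunits_internal()
# The reporter only counts workunits that flow through it at this point...
stream = auto_workunit_reporter(report, stream)
# ...so workunits that stale entity removal appends afterwards are
# never seen by the reporter.
stream = stale_entity_removal_handler.workunit_processor(stream)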

@asikowitz asikowitz merged commit fdbc4de into datahub-project:master May 24, 2023
@asikowitz asikowitz deleted the workunit-processors branch May 24, 2023 20:36