Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest): Produce browse paths v2 on demand and with platform instance #8173

Merged
merged 10 commits into from
Jun 9, 2023

Conversation

asikowitz
Copy link
Collaborator

Also updates MetadataWorkUnit.get_aspects_of_type to only return a single aspect (the last one) for code simplicity.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jun 5, 2023
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall pretty clean

none of these comments are blocking, so we can merge and then fix in a follow-up

browse_path_processor,
partial(auto_workunit_reporter, self.get_report()),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the original order made more sense

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to report these browse path workunits?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh are these run top to bottom? For some reason I thought the bottom was the innermost one. In that case this is fine

@@ -58,7 +66,7 @@ class PipelineConfig(ConfigModel):
source: SourceConfig
sink: DynamicTypedConfig
transformers: Optional[List[DynamicTypedConfig]]
flags: FlagsConfig = Field(default=FlagsConfig())
flags: FlagsConfig = Field(default=FlagsConfig(), hidden_from_docs=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pipeline config docs aren't autogenerated anywhere so this is a no-op, but still nice to have

parent_urn = container_aspect.container
containers_used_as_parent.add(parent_urn)
paths[urn] = [
*paths.setdefault(parent_urn, []), # Guess parent has no parents
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it'd be more clear from a code readability perspective to split this into two lines

also, do we need paths.setdefault here or can we just use paths.get(parent_urn, []). the mutation is throwing me off a bit

parent_path = paths.setdefault(parent_urn, [])
paths[urn] = [...]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this used to have an impact but looking at the code, doesn't seem like it anymore. I'll just replace with the .get(...)

if browse_path_aspect and browse_path_aspect.paths:
legacy_path = [
BrowsePathEntryClass(id=p.strip())
for p in browse_path_aspect.paths[0].strip("/").split("/")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slight nit - we call p.strip() three times, so maybe we should have an inner generator that breaks those apart

or maybe a helper method for _split_legacy_browse_path_entry?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeahh I didn't like this much, but it's solved by upgrading to python 3.8 and using the walrus operator, which we will have to do quite soon. I can add a TODO to update once we upgrade though

"""

# For telemetry, to see if our sources violate assumptions
num_out_of_order = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
num_out_of_order = 0
num_containers_out_of_order = 0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this name can be confusing because it's num_container_aspects_out_of_order rather than num_container_entities_out_of_order. I'll think about it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine too - I just wanted the word container in there


# For telemetry, to see if our sources violate assumptions
num_out_of_order = 0
num_out_of_batch = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
num_out_of_batch = 0
num_aspects_out_of_batch = 0

# Set for all containers and urns with a Container aspect
# Used to construct container paths while iterating through stream
# Assumes topological order of entities in stream
paths: Dict[str, List[BrowsePathEntryClass]] = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe note that this one does not contain platform instance details

@asikowitz asikowitz merged commit f2c66fd into datahub-project:master Jun 9, 2023
@asikowitz asikowitz deleted the browse-path-updates branch June 9, 2023 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants