feat(ingest/powerbi): support modified_since, extract_dataset_schema and many more #7519

Merged on Apr 21, 2023 (20 commits)

@aezomz aezomz commented Mar 8, 2023

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub
  1. modified_since config: ingests only modified workspaces (this also changes stateful ingestion to checkpoint per workspace)
  2. ownership config: gives finer-grained control over how the corp user is built and which criteria establish authority over the PowerBI asset
  3. extract_datasets_to_containers config: wraps PowerBI tables (DataHub datasets) under one PowerBI dataset (DataHub container)
  4. extract_only_matched_endorsed_dataset config: extracts only datasets that have a matching endorsement
  5. extract_dashboards config: controls whether PowerBI dashboards are ingested
  6. extract_dataset_schema config: extracts the PowerBI dataset table schema into the DataHub dataset schema

I think the main improvement is modified_since, which leaves an interesting problem for stateful ingestion.
I tried my best to solve it in a more elegant way than in my previous PR, which got rejected.

To preserve backward compatibility, these extra configs are entirely optional and will not impact current users. :)

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 8, 2023
@aezomz aezomz changed the title feat(ingest): powerbi lots of improvement feat(ingest): powerbi config improvements modified_since, extract_dataset_schema and many more Mar 8, 2023

aezomz commented Mar 13, 2023

@hsheth2, @mohdsiddique and @jjoyce0510, could you all please take a look?

@siddiquebagwan
Contributor

I am going through it

)

assert len(data_platform_tables) == 1
assert data_platform_tables[0].name == "public_consumer_price_index"
Contributor

This should match table.name.

assert data_platform_tables[0].name == "public_consumer_price_index"
assert (
data_platform_tables[0].full_name
== "hive_metastore.sandbox_revenue.public_consumer_price_index"
Contributor

And it should be table.full_name.

    ) -> SchemaFieldClass:
        if isinstance(field, powerbi_data_classes.Column):
            data_type = field.dataType
            type_class = FIELD_TYPE_MAPPING.get(data_type)
Contributor

Could you please move FIELD_TYPE_MAPPING to data_classes.py and, while creating Column and Measure in data_resolver.py, set the data_type to the DataHub data type? I believe we don't need to check whether it is a Measure or a Column, as both have mostly the same attributes; the only difference is in forming the description.

Contributor Author

This is a good recommendation; I will create a datahubDataType attribute.
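
A minimal sketch of the suggestion above, assuming the mapping lives next to the data classes and is resolved once when a column is built. The PowerBI type names, the string stand-ins for DataHub type classes, and the make_column helper are all hypothetical and only illustrate the idea; they are not the code from this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Assumed PowerBI type names mapped to string stand-ins for DataHub type
# classes (the real source would map to DataHub schema type classes).
FIELD_TYPE_MAPPING: Dict[str, str] = {
    "Int64": "NumberType",
    "Double": "NumberType",
    "Boolean": "BooleanType",
    "DateTime": "DateType",
    "String": "StringType",
}
DEFAULT_TYPE = "NullType"  # hypothetical fallback for unmapped types


@dataclass
class Column:
    name: str
    dataType: str
    datahubDataType: str


def make_column(raw: Dict[str, Any]) -> Column:
    """Hypothetical resolver helper: resolve the DataHub type at creation time."""
    data_type = raw.get("dataType", "")
    return Column(
        name=raw["name"],
        dataType=data_type,
        datahubDataType=FIELD_TYPE_MAPPING.get(data_type, DEFAULT_TYPE),
    )
```

With the type resolved up front, the schema-building code can read datahubDataType directly instead of looking up the mapping for each field kind.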

@@ -223,12 +292,22 @@ def to_datahub_dataset(
        dataset_mcps: List[MetadataChangeProposalWrapper] = []
        if dataset is None:
            return dataset_mcps
        if (
            self.__config.extract_only_matched_endorsed_dataset is not None
Contributor

Could you please code it as an allow/deny pattern? Check workspace_id_pattern for reference. The default would be allow all.
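
A rough sketch of an allow/deny filter over endorsement tags, assuming regex patterns and an allow-all default. The AllowDeny class below is a self-contained stand-in written for illustration, not DataHub's own pattern class, and dataset_is_selected is a hypothetical helper name.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AllowDeny:
    """Illustrative stand-in for an allow/deny pattern config."""

    allow: List[str] = field(default_factory=lambda: [".*"])  # default: allow all
    deny: List[str] = field(default_factory=list)

    def allowed(self, value: str) -> bool:
        # Deny patterns take precedence over allow patterns.
        if any(re.match(pattern, value) for pattern in self.deny):
            return False
        return any(re.match(pattern, value) for pattern in self.allow)


def dataset_is_selected(tags: Optional[List[str]], pattern: AllowDeny) -> bool:
    # Mirrors the `dataset.tags or [""]` idiom from the diff: an untagged
    # dataset is tested as the empty string, so it passes only allow-all.
    return any(pattern.allowed(tag) for tag in (tags or [""]))
```

With the default pattern every dataset is kept, so existing recipes behave as before; narrowing allow to a specific endorsement filters out everything else.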

workspace.scan_result
)
# Fetch endorsements tag if it is enabled from configuration
if self.__config.extract_endorsements_to_tags:
Contributor

Why did this get changed? The powerbi source already has code to extract endorsements to tags.

Contributor Author

I didn't actually change anything, just the indentation, because of the workspace for loop.

-    def get_allowed_workspaces(self) -> Iterable[powerbi_data_classes.Workspace]:
-        all_workspaces = self.powerbi_client.get_workspaces()
+    def get_allowed_workspaces(self) -> List[powerbi_data_classes.Workspace]:
+        if self.source_config.admin_apis_only and self.source_config.modified_since:
Contributor

Please remove the admin_apis_only condition. A user can grant admin permission to the client credential and still ingest a specific workspace. Please refer to the code of the _get_entity_users method and write it along those lines.

Contributor

Please remove the extra condition handling from here and add it in self.powerbi_client.get_workspaces(). That method should know how to return workspaces, whether modified only or all.

@@ -191,11 +193,16 @@ def compute_job_id(cls, platform: Optional[str]) -> JobId:

         # Default name for everything else
         job_name_suffix = "stale_entity_removal"
-        return JobId(f"{platform}_{job_name_suffix}" if platform else job_name_suffix)
+        unique_suffix = f"_{unique_id}" if unique_id else ""
Contributor

Could you please add a comment for this code change?
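
For readers following the hunk above, here is a small sketch of how an optional unique suffix can extend the job id without changing existing ids. Only the two f-strings come from the diff; the way they are combined, the unique_id parameter position, and the plain str return type are assumptions made for illustration.

```python
from typing import Optional


def compute_job_id(platform: Optional[str], unique_id: Optional[str] = None) -> str:
    # Default name for everything else (taken from the diff).
    job_name_suffix = "stale_entity_removal"
    # The suffix is empty when no unique_id is given, so ids produced before
    # this change stay the same and old checkpoints remain addressable.
    unique_suffix = f"_{unique_id}" if unique_id else ""
    base = f"{platform}_{job_name_suffix}" if platform else job_name_suffix
    return f"{base}{unique_suffix}"
```

Calling it with a workspace id yields a distinct checkpoint key per workspace, while calling it without one reproduces the original id.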

@@ -6,7 +6,7 @@
"aspectName": "corpUserKey",
"aspect": {
"json": {
"username": "[email protected]"
"username": "urn:li:corpuser:users.[email protected]"
Contributor

Ownership should not break the old golden files. The code should be backward compatible.


aezomz commented Mar 17, 2023

@mohdsiddique hi, please help me review again. I have made changes according to your comments. Thanks!

@@ -47,6 +48,9 @@ class SupportedDataPlatform(Enum):
     MS_SQL = DataPlatformPair(
         powerbi_data_platform_name="Sql", datahub_data_platform_name="mssql"
     )
+    DATABRICK_SQL = DataPlatformPair(
+        powerbi_data_platform_name="Databrick", datahub_data_platform_name="databrick"
Contributor

I haven't found databrick as a platform in DataHub. This mapping is used for upstream lineage. Could you please map it to the proper DataHub platform? Check this documentation: https://datahubproject.io/docs/generated/ingestion/sources/databricks.

Given your data-access function Databricks.Catalogs, you might need to map it to unity-catalog.

        field: Union[powerbi_data_classes.Column, powerbi_data_classes.Measure],
    ) -> SchemaFieldClass:
        data_type = field.dataType
        if getattr(field, "expression", None):
Contributor

We could do an isinstance check here instead of an attribute check. For example, if the field is an instance of Column, then determine the description from the expression and description fields.
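
One way the isinstance-based dispatch could look, as a sketch only. The Column and Measure classes here are simplified stand-ins with assumed fields, and build_description and its wording are hypothetical; the actual data classes in the PowerBI source may differ.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Column:
    name: str
    dataType: str
    description: Optional[str] = None
    expression: Optional[str] = None


@dataclass
class Measure:
    name: str
    expression: str
    description: Optional[str] = None


def build_description(field: Union[Column, Measure]) -> str:
    """Form the schema field description based on the concrete field type."""
    parts = [field.description] if field.description else []
    if isinstance(field, Measure):
        # A measure is defined by its expression, so it is always included.
        parts.append(f"Measure expression: {field.expression}")
    elif isinstance(field, Column) and field.expression:
        # Only calculated columns carry an expression.
        parts.append(f"Column expression: {field.expression}")
    return "\n".join(parts)
```

Compared with getattr(field, "expression", None), the type check makes it explicit which kind of field produced the text and lets a type checker verify attribute access.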

return builder.make_user_urn(user.split("@")[0])
return builder.make_user_urn(f"users.{user}")

def get_dataset_table_schema(
Contributor

Can we rename this to to_datahub_schema_field?

for tag in (dataset.tags or [""])
]
):
return dataset_mcps
Contributor

Please add a debug-level log message, like `Returning empty dataset_mcps as no dataset tag matched with extract_only_matched_endorsed_dataset`.

@@ -241,9 +287,17 @@ def to_datahub_dataset(

logger.debug(f"{Constant.Dataset_URN}={ds_urn}")
# Create datasetProperties mcp
custom_properties = {}
if table.expression:
custom_properties["expression"] = table.expression
Contributor

Add "expression" to Constant.

@@ -260,7 +314,63 @@ def to_datahub_dataset(
aspect_name=Constant.STATUS,
aspect=StatusClass(removed=False),
)
dataset_mcps.extend([info_mcp, status_mcp])
if self.__config.extract_dataset_schema:
Contributor

Let's have a separate function for this logic.

Contributor

Please create a function extract_dataset_schema, similar to self.extract_lineage, and move the schema generation logic into it.

@@ -183,6 +186,11 @@ def fill_tags() -> None:
        return reports

    def get_workspaces(self) -> List[Workspace]:
        if self.__config.modified_since:
            workspaces = self.get_modified_workspaces()
            if workspaces:
Contributor

Why this `if`? If get_modified_workspaces returns an empty list, then we don't need it.
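
A sketch of the simplification being asked for, assuming that an empty result from the modified-workspace lookup simply means there is nothing to ingest. The function signature, the callables, and the timestamp format in the usage note are hypothetical; only the branching idea reflects the thread.

```python
from typing import Callable, List, Optional


def get_workspaces(
    modified_since: Optional[str],
    fetch_modified: Callable[[str], List[str]],
    fetch_all: Callable[[], List[str]],
) -> List[str]:
    if modified_since:
        # An empty list is a valid answer here: no workspace changed,
        # so the caller just iterates over nothing.
        return fetch_modified(modified_since)
    return fetch_all()
```

For example, get_workspaces("2023-03-01T00:00:00Z", fetch_modified, fetch_all) returns whatever fetch_modified yields, including an empty list, without a separate truthiness check.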

@@ -1,12 +1,46 @@
[
{
Contributor

Could you please check why user-related MCPs got added? Ideally these should not get ingested.

Custom-properties-related changes in the golden file are expected.

@@ -65,14 +65,48 @@
"runId": "powerbi-test"
}
},
{
Contributor

Same here: users should not get added to existing golden files. It looks like your owner-related changes are breaking the previous configuration.

Collaborator

@aezomz looks like this one is still unresolved - I think this is the last pending item on this one.

Contributor Author (@aezomz, Apr 12, 2023)

The same aspect is still emitted before and after this PR for this test.

# As modified_workspaces is not idempotent, hence we checkpoint for each powerbi workspace
# Because job_id is used as dictionary key, we have to set a new job_id
# Refer to https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/stateful_ingestion_base.py#L390
self.stale_entity_removal_handler.set_job_id(workspace.id)
Collaborator

This breaks backwards compatibility for all users of powerbi stateful ingestion

Contributor Author

OK, I will make some changes so that it is backward compatible. I will need help reviewing the logic.


aezomz commented Mar 21, 2023

@hsheth2 and @mohdsiddique please help to review again. thanks!


aezomz commented Mar 27, 2023

Hi @hsheth2 , can we get this reviewed again? thanks

@siddiquebagwan siddiquebagwan left a comment

Could you please make sure existing golden files do not get updated because of the new code changes? Description-related changes or the new PowerBI dataset as container are acceptable.

}

-    def create_scan_job(self, workspace_id: str) -> str:
+    def create_scan_job(self, workspace_ids: List[str]) -> str:
Contributor (@siddiquebagwan, Mar 27, 2023)

Please revert it to a single workspace_id: str so that we can have concurrent asset extraction per workspace if needed at a higher level.

For example, check https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/looker/looker_source.py#L1255

Contributor Author

This function eventually wraps the IDs in a list and sends them to PBI either way.
So it doesn't matter whether you pass in one workspace_id or a List[workspace_id]; in the end it still wraps them in a list and sends it.
I actually recommend sending a list of workspaces to the scan API. PBI accepts a list and processes it quite quickly too.

But if you want to think async, let's make batch_size configurable then.

https://github.com/aezomz/datahub/blob/powerbi-improvements/metadata-ingestion/src/datahub/ingestion/source/powerbi/powerbi.py#L1185
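
To make the batch_size idea concrete, here is a minimal sketch of splitting workspace ids into scan batches. The batched helper and its parameter names are hypothetical, and no particular batch limit of the PowerBI scan API is assumed.

```python
from typing import Iterator, List


def batched(workspace_ids: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size workspace ids."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(workspace_ids), batch_size):
        yield workspace_ids[start : start + batch_size]
```

A batch_size of 1 gives one scan job per workspace, which keeps per-workspace concurrency possible at a higher level; a larger value sends several workspaces in one scan request, as the author recommends.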

self.reporter, auto_status_aspect(self.get_workunits_internal())
),
)
if self.source_config.modified_since:
Contributor

Is there any reason to skip auto_stale_entity_removal for workspaces returned in the modified_since flow?

@@ -1,34 +1,4 @@
[
{
Contributor

Could you please make sure the existing golden file does not get updated because of your code changes? Description-related changes or the new PowerBI dataset as container are acceptable.

Contributor Author

Yep! Checking through. Actually, corpuserkey just has inconsistent JSON ordering.

I have ensured that the new feature has as little impact on existing users as possible.

Contributor Author

The description and custom properties updates are new, and so are the subtypes. The rest remains the same. You can open it in full view and check.

self.stale_entity_removal_handler
)

yield from auto_stale_entity_removal(
Contributor Author

Is there any reason to skip auto_stale_entity_removal for workspaces returned in the modified_since flow?

modified_since will only work this way, where we create a checkpoint state for each workspace ID.
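
A toy illustration of why a per-workspace checkpoint matters here, under the assumption that a modified_since run only visits a subset of workspaces. The function names and the set-based state are hypothetical simplifications, not the stateful ingestion implementation.

```python
from typing import Dict, Set


def stale_with_global_checkpoint(previous: Set[str], current: Set[str]) -> Set[str]:
    # Everything seen last time but not this time is treated as stale.
    return previous - current


def stale_with_workspace_checkpoints(
    previous: Dict[str, Set[str]], current: Dict[str, Set[str]]
) -> Set[str]:
    stale: Set[str] = set()
    # Only workspaces visited in this run are compared; untouched
    # workspaces keep their earlier checkpoint and entities.
    for workspace_id, urns in current.items():
        stale |= previous.get(workspace_id, set()) - urns
    return stale
```

If the last run saw ws1 with entities a and b and ws2 with c, and the current run only revisits ws1 and finds a, the global comparison would flag both b and c as stale, while the per-workspace comparison flags only b.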


aezomz commented Mar 28, 2023

@hsheth2 and @mohdsiddique please help to review again. I hope we can close this soon as the default recipe setting is already backward compatible.

@@ -472,13 +638,20 @@
}
},
{
"entityType": "corpuser",
"entityUrn": "urn:li:corpuser:[email protected]",
Contributor Author

@mohdsiddique, same as before. The corpuser is created here; only the order of the JSON changed.


codecov-commenter commented Mar 28, 2023

Codecov Report

Patch coverage: 87.05% and project coverage change: -7.63 ⚠️

Comparison is base (c7d35ff) 74.90% compared to head (b4366d7) 67.27%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7519      +/-   ##
==========================================
- Coverage   74.90%   67.27%   -7.63%     
==========================================
  Files         353      353              
  Lines       35395    35586     +191     
==========================================
- Hits        26511    23940    -2571     
- Misses       8884    11646    +2762     
Flag | Coverage Δ
pytest-testIntegration | ?
pytest-testIntegrationBatch1 | 36.46% <33.72%> (+<0.01%) ⬆️
pytest-testQuick | 63.78% <87.05%> (+0.20%) ⬆️
pytest-testSlowIntegration | 32.97% <32.54%> (+0.01%) ⬆️

Impacted Files | Coverage Δ
...tahub/ingestion/source/powerbi/m_query/resolver.py | 78.76% <22.72%> (-4.07%) ⬇️
...on/src/datahub/ingestion/source/powerbi/powerbi.py | 94.14% <89.28%> (-2.08%) ⬇️
...n/source/powerbi/rest_api_wrapper/data_resolver.py | 85.86% <92.30%> (+0.15%) ⬆️
...ion/source/powerbi/rest_api_wrapper/powerbi_api.py | 88.71% <94.59%> (+0.85%) ⬆️
...on/source/powerbi/rest_api_wrapper/data_classes.py | 92.72% <97.50%> (+2.72%) ⬆️
...on/src/datahub/ingestion/source/common/subtypes.py | 100.00% <100.00%> (ø)
...ion/src/datahub/ingestion/source/powerbi/config.py | 97.47% <100.00%> (-0.26%) ⬇️
...stion/source/state/stale_entity_removal_handler.py | 93.49% <100.00%> (-0.68%) ⬇️

... and 73 files with indirect coverage changes


aezomz commented Mar 30, 2023

@hsheth2 can you help review again? We should close this soon. Can you help resolve the conflict if all is good?


aezomz commented Apr 18, 2023

@mohdsiddique

We can remove expression from custom properties and add it to viewProperties; please refer to https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/looker/lookml_source.py#L935.
DataHub displays the code in the View Definition section of the portal. (https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:looker,long_tail_companions.view.customer_focused,PROD)/View%20Definition?is_lineage_mode=false)
Could you please confirm that your user-related changes are backward compatible?

I have added view properties and the view subtype to enable the view definition.

I have fixed the user MCP ordering issues. The golden JSON ordering for corpuser will now be similar to the OSS one.

@hsheth2 please review and let's close this. Thanks.

@siddiquebagwan
Contributor

Hi @aezomz,
Could you please update the concept mapping? https://github.com/aezomz/datahub/blob/powerbi-improvements/metadata-ingestion/docs/sources/powerbi/powerbi_pre.md#concept-mapping


aezomz commented Apr 19, 2023

Hi @mohdsiddique, I think it's already done in this PR, which was approved by Joyce:
#7835


aezomz commented Apr 19, 2023

Once again, please close this soon. thanks


hsheth2 commented Apr 21, 2023

@aezomz I'm going to merge this in tonight, assuming CI turns green

One thing I'd like to call out - the changes to stateful ingestion in this PR are definitely experimental, and I can't guarantee that we won't be making changes to stateful ingestion in the future that may mess with the implementation here.

@hsheth2 hsheth2 changed the title feat(ingest): powerbi config improvements modified_since, extract_dataset_schema and many more feat(ingest/powerbi): support modified_since, extract_dataset_schema and many more Apr 21, 2023
@hsheth2 hsheth2 merged commit 1a5c716 into datahub-project:master Apr 21, 2023
iprentic pushed a commit that referenced this pull request Apr 27, 2023