perf(matadata-io): neo4j generateLineageStatement use shortestPath #7219

Merged
merged 6 commits into from
Feb 14, 2023
Conversation

shidianshifen
Contributor

@shidianshifen shidianshifen commented Feb 2, 2023

Nested or circular lineage between tables, as shown below, makes the current Cypher MATCH return a very large number of results from the Neo4j database, which is expensive.

graph LR
B(table_B) --> |DownstreamOf| A(table_A)
D(table_D) -->|DownstreamOf| A(table_A)
C(table_C) -->|DownstreamOf| B(table_B)
D(table_D) -->|DownstreamOf| B(table_B)
C(table_C) -->|DownstreamOf| D(table_D)
E(table_E) -->|DownstreamOf| D(table_D)
F(table_F) -->|DownstreamOf| E(table_E)
D(table_D) -->|DownstreamOf| F(table_F)

Finding all downstream tables of table_A with the current Cypher:

MATCH path=(a {urn: 'urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)'})<-[r:DownstreamOf*1..1000]-(b) 
WHERE b:dataJob OR b:dataProcess OR b:mlFeature OR b:dataset OR b:chart OR b:dashboard OR b:mlPrimaryKey 
RETURN a,nodes(path) as related_nodes, size(r), b

Results:

| index | a | related_nodes | path_length | b |
| --- | --- | --- | --- | --- |
| 1 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)}] | 1 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)} |
| 2 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)}] | 2 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)} |
| 3 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)}] | 3 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)} |
| 4 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)}] | 3 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)} |
| 5 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)}] | 4 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)} |
| 6 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)}] | 5 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)} |
| 7 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)}] | 6 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)} |
| 8 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)}] | 2 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)} |
| 9 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)}] | 1 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)} |
| 10 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)}] | 2 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)} |
| 11 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)}] | 2 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)} |
| 12 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)}] | 3 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)} |
| 13 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)}] | 4 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)} |
| 14 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)}] | 5 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)} |

Using Neo4j's shortestPath, the Cypher becomes:

MATCH path=shortestPath((a {urn: 'urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)'})<-[r:DownstreamOf*1..1000]-(b)) 
WHERE (b:dataJob OR b:dataProcess OR b:mlFeature OR b:dataset OR b:chart OR b:dashboard OR b:mlPrimaryKey) 
AND b.urn <> 'urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)'
RETURN a, nodes(path) as related_nodes, size(r) as path_length, b

Results:

| index | a | related_nodes | path_length | b |
| --- | --- | --- | --- | --- |
| 1 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)}] | 1 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_B,PROD)} |
| 2 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)}] | 1 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)} |
| 3 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)}] | 3 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_F,PROD)} |
| 4 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)}] | 2 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_C,PROD)} |
| 5 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)} | [{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_A,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_D,PROD)},{"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)}] | 2 | {"urn":urn:li:dataset:(urn:li:dataPlatform:bigquery,table_E,PROD)} |

The number of results passed to GMS drops significantly, from 14 to 5. This could help avoid GMS OOM and GC problems.
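The 14 vs. 5 row counts can be reproduced outside Neo4j with a small sketch. It assumes the variable-length MATCH returns every path that does not reuse a relationship, and that shortestPath keeps one minimal-hop path per reachable node. The single-letter node names and the function names are illustrative only and are not part of the patch.

```python
from collections import deque

# DownstreamOf edges from the example graph, written as (downstream, upstream).
EDGES = [
    ("B", "A"), ("D", "A"), ("C", "B"), ("D", "B"),
    ("C", "D"), ("E", "D"), ("F", "E"), ("D", "F"),
]


def downstream_index(edges):
    """Map each node to the nodes that are directly downstream of it."""
    index = {}
    for downstream, upstream in edges:
        index.setdefault(upstream, []).append(downstream)
    return index


def all_paths(edges, start):
    """Every path from start that never reuses an edge (one row per path)."""
    index = downstream_index(edges)
    rows = []

    def walk(node, path, used):
        for child in index.get(node, []):
            edge = (child, node)
            if edge in used:
                continue  # an edge may appear only once per path
            new_path = path + [child]
            rows.append(new_path)
            walk(child, new_path, used | {edge})

    walk(start, [start], frozenset())
    return rows


def shortest_hops(edges, start):
    """Breadth-first search: minimal hop count for each reachable node."""
    index = downstream_index(edges)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in index.get(node, []):
            if child not in hops:
                hops[child] = hops[node] + 1
                queue.append(child)
    del hops[start]  # mirrors the b.urn <> a.urn filter
    return hops


if __name__ == "__main__":
    print(len(all_paths(EDGES, "A")))   # rows from the path enumeration
    print(shortest_hops(EDGES, "A"))    # one entry per downstream table
```

The cycle D -> F -> E -> D is what inflates the first count: table_D and table_C are each reached again through the loop, while the breadth-first search visits every table once.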

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@shidianshifen shidianshifen changed the title perf-(matadata-io): neo4j generateLineageStatement use shortestPath perf(matadata-io): neo4j generateLineageStatement use shortestPath Feb 2, 2023
@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Feb 2, 2023
@shidianshifen
Contributor Author

@aditya-radhakrishnan Hi, could you help me review this patch?

@aditya-radhakrishnan
Contributor

Hey @shidianshifen sorry for the delay here! Have you been able to test these changes? Looks good to me though :)

@shidianshifen
Contributor Author

> Hey @shidianshifen sorry for the delay here! Have you been able to test these changes? Looks good to me though :)

Yes, I tested it with no problems. I have around 12,000+ entities, and for a 12-hop query the number of results returned from Neo4j dropped from 7 million to the actual 1,500+ using shortestPath.

add missing '' for urn in neo4j cypher template
Contributor

@aditya-radhakrishnan aditya-radhakrishnan left a comment


LGTM! Thank you for making this change :)

@aditya-radhakrishnan
Contributor

Will merge once CI is passing.

@shidianshifen
Contributor Author

@aditya-radhakrishnan Thanks very much for your review.

@shidianshifen
Contributor Author

@aditya-radhakrishnan This patch has not been merged into master. Should I do something else to make that happen? Also, the Neo4j graph backend does not support the time filter in the new release, and I am preparing a patch to make it work. What should be done next?

@aditya-radhakrishnan
Contributor

aditya-radhakrishnan commented Feb 14, 2023

Hey @shidianshifen this is true, Neo4j time filtering is not supported with the new release. We can merge this one separately and then collaborate on updating for time filtering! I will message you on Slack!

@aditya-radhakrishnan
Contributor

CI is running now (not sure why it didn't before). Will merge once green, apologies it didn't go through before!

@jjoyce0510 jjoyce0510 merged commit 6901f31 into datahub-project:master Feb 14, 2023
looppi pushed a commit to looppi/datahub that referenced this pull request Feb 15, 2023
oleg-ruban pushed a commit to RChygir/datahub that referenced this pull request Feb 28, 2023
yoonhyejin pushed a commit that referenced this pull request Mar 3, 2023