Crawl Table ACLs in all databases #122
Conversation
bin/install.py
Outdated
logger.info("Uploading...")
ws.workspace.mkdirs(folder_base)
with open(local_wheel_file, "rb") as fh:
    ws.workspace.upload(path=remote_wheel_file, content=fh.read(), format=ImportFormat.AUTO)
buf = BytesIO(INSTALL_NOTEBOOK.format(remote_wheel_file=remote_wheel_file).encode())
ws.workspace.upload(path=remote_notebook_file, content=buf)
ws.workspace.upload(path=remote_configuration_file, content=buf)
content should be different :)
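A minimal sketch of what the fix might look like: give the configuration file its own buffer instead of re-uploading the notebook buffer (the `config_yaml` variable is a hypothetical placeholder for the serialized install configuration, not code from this repo):

```python
from io import BytesIO

# upload the generated install notebook
notebook_buf = BytesIO(INSTALL_NOTEBOOK.format(remote_wheel_file=remote_wheel_file).encode())
ws.workspace.upload(path=remote_notebook_file, content=notebook_buf)

# upload the configuration with its own content, not the notebook buffer
config_buf = BytesIO(config_yaml.encode())  # hypothetical: serialized configuration
ws.workspace.upload(path=remote_configuration_file, content=config_buf)
```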
I will fix this on Friday
# trigger one-time-run job and get waiter object
waiter = w.jobs.submit(
    run_name=f"sdk-{time.time_ns()}",
    tasks=[
        j.SubmitTask(
            existing_cluster_id=cluster_id,
            python_wheel_task=j.PythonWheelTask(
                "crawler", package_name="databricks-labs-ucx", parameters=[inventory_catalog, inventory_schema]
            ),
            task_key=f"sdk-{time.time_ns()}",
        )
    ],
)
from databricks.sdk.service import jobs, compute

created_job = ws.jobs.create(tasks=[
    jobs.Task(task_key='crawl',
              python_wheel_task=jobs.PythonWheelTask(
                  package_name='databricks.labs.ucx.toolkits.table_acls',
                  entry_point='the_function_you_are_adding'
              ),
              new_cluster=jobs.ClusterSpec(
                  libraries=[compute.Library(
                      whl='... the location of the wheel ..'
                  )],
                  new_cluster=compute.ClusterSpec(
                      node_type_id=ws.clusters.select_node_type(min_cores=10),
                      spark_version=ws.clusters.select_spark_version(latest=True),
                      num_workers=1,
                      # .. other properties if needed
                  )
              ))])
I didn't ask for a one-time run of the job; I requested creating a persistent job. Add a function somewhere to set up this job, for example in the init script, and then use the following in tests:
ws.jobs.run_now(created_job.job_id).result()
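A rough sketch of how that split might look, assuming a helper named `install_crawl_job` and placeholder package/entry-point names (none of these exist in the repo as written):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

def install_crawl_job(ws: WorkspaceClient, wheel_path: str) -> int:
    # create the persistent job once, e.g. from the install script
    created_job = ws.jobs.create(
        name="ucx-crawl-table-acls",
        tasks=[
            jobs.Task(
                task_key="crawl",
                python_wheel_task=jobs.PythonWheelTask(
                    package_name="databricks_labs_ucx",  # assumed package name
                    entry_point="crawl",                 # assumed entry point
                ),
                libraries=[compute.Library(whl=wheel_path)],
                new_cluster=compute.ClusterSpec(
                    node_type_id=ws.clusters.select_node_type(min_cores=10),
                    spark_version=ws.clusters.select_spark_version(latest=True),
                    num_workers=1,
                ),
            )
        ],
    )
    return created_job.job_id

# in a test: trigger the persistent job and wait for it to finish
# job_id = install_crawl_job(ws, wsfs_wheel)
# ws.jobs.run_now(job_id).result()
```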
This PR, databricks/databricks-sdk-py#264, will prevent this class of errors.
The suggested code throws the following error:
Cluster validation error: Missing required field: settings.cluster_spec.new_cluster.size
I believe this is caused by the member variable new_cluster of the Task dataclass in the SDK being typed as compute.ClusterSpec.
Shouldn't it be Optional['ClusterSpec'] instead?
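For reference, the missing-size complaint usually means the cluster spec has no explicit size (no `num_workers` or autoscale). A sketch that sets `libraries` and `new_cluster` directly on `jobs.Task` with an explicit worker count; the wheel path is a placeholder, not a real location:

```python
from databricks.sdk.service import jobs, compute

task = jobs.Task(
    task_key="crawl",
    python_wheel_task=jobs.PythonWheelTask(
        package_name="databricks.labs.ucx.toolkits.table_acls",
        entry_point="the_function_you_are_adding",
    ),
    libraries=[compute.Library(whl="dbfs:/path/to/ucx.whl")],  # placeholder path
    new_cluster=compute.ClusterSpec(
        node_type_id=ws.clusters.select_node_type(min_cores=10),
        spark_version=ws.clusters.select_spark_version(latest=True),
        num_workers=1,  # gives the job cluster an explicit size
    ),
)
created_job = ws.jobs.create(tasks=[task])
```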
commands = CommandExecutor(ws)
commands.install_notebook_library(f"/Workspace{wsfs_wheel}")
you don't need this, as the wheel has to be installed on the job cluster, not the interactive cluster. refactor this :)
I will complete this on Friday
inventory_schema = make_schema(catalog=make_catalog())
inventory_catalog, inventory_schema = inventory_schema.split(".")

crawl_tacl(cluster_id, ws, inventory_catalog, inventory_schema)
I don't really think we need this function; ws.jobs.run_now(created_job.job_id).result() does the job just fine.
Missing:
@nfx how is this PR related to the v0.0.3 milestone? We don't need any scheduled crawler jobs to deliver that functionality; we just need table ACLs collected in parallel once during the migration. Could you please explain why scheduled crawling is required?
@renardeinside it's better to kick off crawling from a job, as it's the only way to deterministically specify correct cluster settings, which is not guaranteed in the notebook.
We just need to specify the requirements for the cluster that is used by the notebook, nothing more than that. This adds extra complexity, requires the user to configure an additional cluster, and so on. This is a nice feature, but definitely not something on the critical path.
@renardeinside this has to become self-service with minimum setup steps. This is the first step towards it :)
Current PR state from the perspective of milestone v0.0.3:
Functional:
Tests:
Non-functional:
A crawler job is not a priority for the v0.0.3 milestone: it's not mentioned in any requirements, and it's just a nice-to-have from this perspective.
@saraivdbx please factor out of this PR any work related to scheduling / setting up the crawler job. In milestone v0.0.3 this code will only run from the notebook anyway.
Please revert the database: str -> databases: list signature changes from TablesCrawler and GrantsCrawler, as it doesn't fit and introduces more risk. Instead, iterate over databases and call the methods that were already there:
for db in databases:
    for grant in grants_crawler.snapshot('hive_metastore', db):
        yield grant
@@ -118,24 +118,20 @@ def __init__(self, tc: TablesCrawler):
         super().__init__(tc._backend, tc._catalog, tc._schema, "grants")
         self._tc = tc

     def snapshot(self, catalog: str, database: str) -> list[Grant]:
This changeset looks larger than necessary. What justifies changing the signatures of 5 downstream methods versus iterating over databases in a loop? Doesn't this code look more maintainable in comparison?
for db in databases:
    for grant in grants_crawler.snapshot('hive_metastore', db):
        yield grant
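For illustration, the same loop could live in a thin wrapper method so the single-database snapshot signature stays untouched. A sketch only; `_databases` and `_gc` are assumed attribute names, not necessarily the ones in this PR:

```python
def grants_snapshot(self):
    # iterate the configured databases and reuse the existing
    # single-database GrantsCrawler.snapshot without changing its signature
    for db in self._databases:
        yield from self._gc.snapshot("hive_metastore", db)
```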
I ended up implementing this approach because I attempted both versions, yield and return, just like you suggested, and both resulted in only taking a snapshot of the first database in the iteration.
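If the crawler internals are read correctly, that behaviour would come from the caching in CrawlerBase._snapshot rather than from yield vs. return: once the first database has been crawled and the inventory table exists, later calls return the already-fetched rows instead of crawling the next database. A simplified sketch of that flow (names abbreviated, not the actual code):

```python
def _snapshot(self, fetcher, loader):
    # simplified: after the first database is crawled, the inventory table
    # exists, so fetcher() succeeds and the loader for the next database
    # is never invoked again
    try:
        return list(fetcher())      # returns cached rows for db #2, #3, ...
    except Exception as e:
        if "TABLE_OR_VIEW_NOT_FOUND" not in str(e):
            raise
        records = list(loader())    # only reached while the table is missing
        self._append_records(records)
        return records
```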
def grants_snapshot(self, schema):
    return self._gc.snapshot("hive_metastore", schema)

def grants_snapshot(self):
    return self._gc.snapshot("hive_metastore", self.databases)
Why do we need to change the signatures of TablesCrawler and GrantsCrawler, touching a lot of already-tested code? Why didn't we use this approach?
grants = []
for db in self.databases:
    for grant in grants_crawler.snapshot('hive_metastore', db):
        grants.append(grant)
return grants
Same as above.
Resolve conflicts.
I'm still not comfortable with a signature change that also introduces hidden bugs. I think that adding a small state machine to CrawlerBase would avoid changes in TablesCrawler, GrantsCrawler, and other future subclasses. Please try.
Index: src/databricks/labs/ucx/tacl/_internal.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/databricks/labs/ucx/tacl/_internal.py b/src/databricks/labs/ucx/tacl/_internal.py
--- a/src/databricks/labs/ucx/tacl/_internal.py	(revision a11ae80d3c7a2384dca79c70c00c3eada4a9d91f)
+++ b/src/databricks/labs/ucx/tacl/_internal.py	(date 1693919765155)
@@ -143,15 +143,25 @@
         Returns:
             list[any]: A list of data records, either fetched or loaded.
         """
+        loaded = False
+        trigger_load = ValueError('trigger records load')
         while True:
             try:
                 logger.debug(f"[{self._full_name}] fetching {self._table} inventory")
-                return list(fetcher())
+                cached_results = list(fetcher())
+                if len(cached_results) == 0 and loaded:
+                    return cached_results
+                if len(cached_results) == 0 and not loaded:
+                    raise trigger_load
+                return cached_results
             except Exception as e:
-                if "TABLE_OR_VIEW_NOT_FOUND" not in str(e):
+                if not (e == trigger_load or "TABLE_OR_VIEW_NOT_FOUND" in str(e)):
                     raise e
-                logger.debug(f"[{self._full_name}] {self._table} inventory not found, crawling")
-                self._append_records(klass, loader())
+                logger.debug(f"[{self._full_name}] crawling new batch for {self._table}")
+                loaded_records = list(loader())
+                if len(loaded_records) > 0:
+                    self._append_records(klass, loaded_records)
+                loaded = True

     @staticmethod
     def _row_to_sql(row, fields):
Index: tests/integration/test_tacls.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/tests/integration/test_tacls.py b/tests/integration/test_tacls.py
--- a/tests/integration/test_tacls.py	(revision a11ae80d3c7a2384dca79c70c00c3eada4a9d91f)
+++ b/tests/integration/test_tacls.py	(date 1693918922272)
@@ -15,9 +15,13 @@
     schema = make_schema(catalog="hive_metastore")
     managed_table = make_table(schema=schema)
     external_table = make_table(schema=schema, external=True)
-    tmp_table = make_table(schema=schema, ctas="SELECT 2+2 AS four")
     view = make_table(schema=schema, ctas="SELECT 2+2 AS four", view=True)
-    non_delta = make_table(schema=schema, non_detla=True)
+
+    schema2 = make_schema(catalog="hive_metastore")
+    tmp_table = make_table(schema=schema2, ctas="SELECT 2+2 AS four")
+    non_delta = make_table(schema=schema2, non_detla=True)
+
+    schema3 = make_schema(catalog="hive_metastore")

     logger.info(
         f"managed_table={managed_table}, "
@@ -33,6 +37,10 @@
     all_tables = {}
     for t in tak.database_snapshot(schema.split(".")[1]):
         all_tables[t.key] = t
+    for t in tak.database_snapshot(schema2.split(".")[1]):
+        all_tables[t.key] = t
+    for t in tak.database_snapshot(schema3.split(".")[1]):
+        all_tables[t.key] = t  # no tables in schema3
     assert len(all_tables) == 5
     assert all_tables[non_delta].table_format == "JSON"
Applied suggested changes.
Still needs a few changes.
last comment
notebooks/toolkit.py
Outdated
log_level="DEBUG", | ||
) | ||
toolkit = GroupMigrationToolkit(config) | ||
tacltoolkit = TaclToolkit(toolkit._ws, config.inventory.table.catalog, config.tacl) |
tacltoolkit = TaclToolkit(toolkit._ws,
                          inventory_catalog=config.inventory.table.catalog,
                          inventory_schema=config.inventory.table.schema,
                          databases=["default"],
                          auto=False,
                          warehouse_id=None)
This is missing the inventory schema.
class TaclToolkit:
    def __init__(self, ws: WorkspaceClient, inventory_catalog, inventory_schema, warehouse_id=None):
    def __init__(
Last comment: don't depend on TaclConfig, but rather add explicit constructor arguments: databases=["default"], auto=False. This will make it easier to extend and test.
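A minimal sketch of what such a constructor could look like; the body and defaults are assumptions for illustration, not the actual implementation in this PR:

```python
from databricks.sdk import WorkspaceClient

class TaclToolkit:
    def __init__(
        self,
        ws: WorkspaceClient,
        inventory_catalog,
        inventory_schema,
        databases=None,   # explicit list of databases to crawl
        auto=False,       # or discover all databases automatically
        warehouse_id=None,
    ):
        self._ws = ws
        self._inventory_catalog = inventory_catalog
        self._inventory_schema = inventory_schema
        self._databases = databases or ["default"]
        self._auto = auto
        self._warehouse_id = warehouse_id
```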
Added only databases, as during configuration we enforce setting either databases or auto.
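For context, that mutual exclusivity could be enforced with a small check like the one below; this is a sketch with an assumed helper name, not the actual validation code in the repo:

```python
def _validate_tacl_config(databases, auto):
    # exactly one of the two selection modes must be chosen
    if auto and databases:
        raise ValueError("set either 'databases' or 'auto', not both")
    if not auto and not databases:
        raise ValueError("either 'databases' or 'auto' must be set")
```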
lgtm
rebase
# Version changelog

## 0.1.0

Features

* Added interactive installation wizard ([#184](#184), [#117](#117)).
* Added schedule of jobs as part of `install.sh` flow and created some documentation ([#187](#187)).
* Added debug notebook companion to troubleshoot the installation ([#191](#191)).
* Added support for Hive Metastore Table ACLs inventory from all databases ([#78](#78), [#122](#122), [#151](#151)).
* Created `$inventory.tables` from Scala notebook ([#207](#207)).
* Added local group migration support for ML-related objects ([#56](#56)).
* Added local group migration support for SQL warehouses ([#57](#57)).
* Added local group migration support for all compute-related resources ([#53](#53)).
* Added local group migration support for security-related objects ([#58](#58)).
* Added local group migration support for workflows ([#54](#54)).
* Added local group migration support for workspace-level objects ([#59](#59)).
* Added local group migration support for dashboards, queries, and alerts ([#144](#144)).

Stability

* Added `codecov.io` publishing ([#204](#204)).
* Added more tests to group.py ([#148](#148)).
* Added tests for group state ([#133](#133)).
* Added tests for inventorizer and typed ([#125](#125)).
* Added tests WorkspaceListing ([#110](#110)).
* Added `make_*_permissions` fixtures ([#159](#159)).
* Added reusable fixtures module ([#119](#119)).
* Added testing for permissions ([#126](#126)).
* Added inventory table manager tests ([#153](#153)).
* Added `product_info` to track as SDK integration ([#76](#76)).
* Added failsafe permission get operations ([#65](#65)).
* Always install the latest `pip` version in `./install.sh` ([#201](#201)).
* Always store inventory in `hive_metastore` and make only `inventory_database` configurable ([#178](#178)).
* Changed default logging level from `TRACE` to `DEBUG` log level ([#124](#124)).
* Consistently use `WorkspaceClient` from `databricks.sdk` ([#120](#120)).
* Convert pipeline code to use fixtures. ([#166](#166)).
* Exclude mixins from coverage ([#130](#130)).
* Fixed codecov.io reporting ([#212](#212)).
* Fixed configuration path in job task install code ([#210](#210)).
* Fixed a bug with dependency definitions ([#70](#70)).
* Fixed failing `test_jobs` ([#140](#140)).
* Fixed the issues with experiment listing ([#64](#64)).
* Fixed integration testing configuration ([#77](#77)).
* Make project runnable on nightly testing infrastructure ([#75](#75)).
* Migrated cluster policies to new fixtures ([#174](#174)).
* Migrated clusters to the new fixture framework ([#162](#162)).
* Migrated instance pool to the new fixture framework ([#161](#161)).
* Migrated to `databricks.labs.ucx` package ([#90](#90)).
* Migrated token authorization to new fixtures ([#175](#175)).
* Migrated experiment fixture to standard one ([#168](#168)).
* Migrated jobs test to fixture based one. ([#167](#167)).
* Migrated model fixture to the standard fixtures ([#169](#169)).
* Migrated warehouse fixture to standard one ([#170](#170)).
* Organise modules by domain ([#197](#197)).
* Prefetch all account-level and workspace-level groups ([#192](#192)).
* Programmatically create a dashboard ([#121](#121)).
* Properly integrate Python `logging` facility ([#118](#118)).
* Refactored code to use Databricks SDK for Python ([#27](#27)).
* Refactored configuration and remove global provider state ([#71](#71)).
* Removed `pydantic` dependency ([#138](#138)).
* Removed redundant `pyspark`, `databricks-connect`, `delta-spark`, and `pandas` dependencies ([#193](#193)).
* Removed redundant `typer[all]` dependency and its usages ([#194](#194)).
* Renamed `MigrationGroupsProvider` to `GroupMigrationState` ([#81](#81)).
* Replaced `ratelimit` and `tenacity` dependencies with simpler implementations ([#195](#195)).
* Reorganised integration tests to align more with unit tests ([#206](#206)).
* Run `build` workflow also on `main` branch ([#211](#211)).
* Run integration test with a single group ([#152](#152)).
* Simplify `SqlBackend` and table creation logic ([#203](#203)).
* Updated `migration_config.yml` ([#179](#179)).
* Updated legal information ([#196](#196)).
* Use `make_secret_scope` fixture ([#163](#163)).
* Use fixture factory for `make_table`, `make_schema`, and `make_catalog` ([#189](#189)).
* Use new fixtures for notebooks and folders ([#176](#176)).
* Validate toolkit notebook test ([#183](#183)).

Contributing

* Added a note on external dependencies ([#139](#139)).
* Added ability to run SQL queries on Spark when in Databricks Runtime ([#108](#108)).
* Added some ground rules for contributing ([#82](#82)).
* Added contributing instructions link from main readme ([#109](#109)).
* Added info about environment refreshes ([#155](#155)).
* Clarified documentation ([#137](#137)).
* Enabled merge queue ([#146](#146)).
* Improved `CONTRIBUTING.md` guide ([#135](#135), [#145](#145)).
This PR enumerates all databases for GrantsCrawler.