Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: edge case in Snowflake information_schema.last_altered value #360

Merged

Conversation

instazackwu
Copy link
Contributor

@instazackwu instazackwu commented Sep 4, 2020

Summary of Changes

Add Snowflake documentation link, account for edge case for null last_altered values for tables in information_schema

SELECT lower(t.table_catalog) AS CLUSTER,
       lower(t.table_catalog) AS db,
       lower(t.table_schema) AS SCHEMA,
       lower(t.table_name) AS TABLE_NAME,
       NVL(DATE_PART(epoch_second, t.last_altered), 0) AS last_updated_time
FROM EXAMPLE_DATABASE.INFORMATION_SCHEMA.TABLES t
WHERE lower(SCHEMA) = 'information_schema'
ORDER BY last_updated_time ASC;

CLUSTER	DB	SCHEMA	TABLE_NAME	LAST_UPDATED_TIME
sample_db	sample_schema	information_schema	views	0
sample_db	sample_schema	information_schema	table_privileges	0
sample_db	sample_schema	information_schema	databases	0
sample_db	sample_schema	information_schema	functions	0
sample_db	sample_schema	information_schema	procedures	0
sample_db	sample_schema	information_schema	enabled_roles	0
sample_db	sample_schema	information_schema	referential_constraints	0
sample_db	sample_schema	information_schema	table_constraints	0
sample_db	sample_schema	information_schema	load_history	0
sample_db	sample_schema	information_schema	tables	0
sample_db	sample_schema	information_schema	columns	0
sample_db	sample_schema	information_schema	schemata	0
sample_db	sample_schema	information_schema	sequences	0
sample_db	sample_schema	information_schema	usage_privileges	0
sample_db	sample_schema	information_schema	replication_databases	0
sample_db	sample_schema	information_schema	object_privileges	0
sample_db	sample_schema	information_schema	file_formats	0
sample_db	sample_schema	information_schema	applicable_roles	0
sample_db	sample_schema	information_schema	stages	0
sample_db	sample_schema	information_schema	information_schema_catalog_name	0
sample_db	sample_schema	information_schema	table_storage_metrics	0
sample_db	sample_schema	information_schema	pipes	0
sample_db	sample_schema	information_schema	external_tables	0

For context, this was the error that was sent without the NVL (The null was being translated to None):

Traceback (most recent call last):
  File "/app/snowflake_table_last_updated_databuilder.py", line 176, in <module>
    cli()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/app/snowflake_table_last_updated_databuilder.py", line 170, in run
    sc.create_snowflake_last_updated_job()
  File "/app/snowflake_table_last_updated_databuilder.py", line 105, in create_snowflake_last_updated_job
    job.launch()
  File "/usr/local/lib/python3.7/site-packages/databuilder/job/job.py", line 77, in launch
    raise e
  File "/usr/local/lib/python3.7/site-packages/databuilder/job/job.py", line 65, in launch
    self._init()
  File "/usr/local/lib/python3.7/site-packages/databuilder/job/job.py", line 52, in _init
    self.task.init(self.conf)
  File "/usr/local/lib/python3.7/site-packages/databuilder/task/task.py", line 47, in init
    self.extractor.init(Scoped.get_scoped_conf(conf, self.extractor.get_scope()))
  File "/usr/local/lib/python3.7/site-packages/databuilder/extractor/sql_alchemy_extractor.py", line 37, in init
    self._execute_query()
  File "/usr/local/lib/python3.7/site-packages/databuilder/extractor/sql_alchemy_extractor.py", line 56, in _execute_query
    for result in self.results]
  File "/usr/local/lib/python3.7/site-packages/databuilder/extractor/sql_alchemy_extractor.py", line 56, in <listcomp>
    for result in self.results]
  File "/usr/local/lib/python3.7/site-packages/databuilder/models/table_last_updated.py", line 32, in __init__
    self.last_updated_time = int(last_updated_time_epoch)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Tests

Data examples provided above.

Documentation

https://docs.snowflake.com/en/sql-reference/info-schema/views.html#columns

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test

@instazackwu
Copy link
Contributor Author

cc @Alagappan FYI this was an edge case with sending Snowflake data

@feng-tao
Copy link
Member

feng-tao commented Sep 4, 2020

will let @Alagappan comment, but the title is about snowflake, but the file seems to be related to myssql?

@instazackwu
Copy link
Contributor Author

Ah 😅 I committed the wrong file, will modify shortly!

@feng-tao
Copy link
Member

feng-tao commented Sep 4, 2020

the DOC check fails, could you do git rebase HEAD~2 --signoff in your local branch and repush it again?

@Alagappan
Copy link
Contributor

Alagappan commented Sep 4, 2020

Hello @instazackwu I thought about this some more and I think the best way to handle null is to prune those rows out in the SQL itself with a where condition as opposed to converting nulls to 0. We should not display last_updated_ts metadata for a table if it is not available. What do you think?

@feng-tao would like to get your thoughts on this as well.

@instazackwu
Copy link
Contributor Author

Hello @instazackwu I thought about this some more and I think the best way to handle null is to prune those rows out in the SQL itself with a where condition as opposed to converting nulls to 0. We should not display last_updated_ts metadata for a table if it is not available. What do you think?

@feng-tao would like to get your thoughts on this as well.

👍🏻 yeah I agree that's actually a better way to make the data more clean, seems like the frontend doesn't display last_updated_time if the value is 0 anyways.

I'll wait for Feng to weigh in, but the approach I was thinking was to set the default where clause to:
WHERE_CLAUSE_SUFFIX_KEY: ' WHERE t.last_altered IS NOT NULL '

@feng-tao
Copy link
Member

feng-tao commented Sep 4, 2020

yeah, having a where clause seems to be better.

Copy link
Contributor

@Alagappan Alagappan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@feng-tao feng-tao merged commit c3e713e into amundsen-io:master Sep 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants