Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(ingest/bigquery): Add performance testing framework for bigquery usage #7690

Merged

Conversation

asikowitz
Copy link
Collaborator

Performance testing framework that contains:

  • A simplified model of a data warehouse's metadata (tables, views, queries, etc.)
  • A function for generating examples of such a model of different sizes
  • A function for converting generic Querys into BigQuery AuditEvents
  • An unfinished test for testing bigquery usage performance

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 24, 2023
@codecov-commenter
Copy link

codecov-commenter commented Mar 24, 2023

Codecov Report

Patch coverage: 93.33% and project coverage change: -7.81 ⚠️

Comparison is base (301c861) 74.39% compared to head (7d7bb67) 66.59%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7690      +/-   ##
==========================================
- Coverage   74.39%   66.59%   -7.81%     
==========================================
  Files         353      353              
  Lines       35386    35433      +47     
==========================================
- Hits        26327    23596    -2731     
- Misses       9059    11837    +2778     
Flag Coverage Δ
pytest-testIntegration ?
pytest-testIntegrationBatch1 36.50% <40.00%> (+0.03%) ⬆️
pytest-testQuick 63.60% <93.33%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...hub/ingestion/source/bigquery_v2/bigquery_audit.py 54.88% <87.50%> (+0.83%) ⬆️
...ub/ingestion/source/bigquery_v2/bigquery_report.py 100.00% <100.00%> (+4.76%) ⬆️

... and 87 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.


@pytest.fixture
def report_log_level_info():
yield from set_log_level(report_logger, logging.INFO)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this stuff necessary?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see any error logs which made debugging these tests pretty annoying. Is this not usually the case and I'm doing something wrong? I also want to see INFO logs from the report class, so that when you run these tests, you can see how long each stage took

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this stuff is built into pytest: https://docs.pytest.org/en/7.1.x/how-to/logging.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice. I don't think it's unreasonable to set the log level for a specific logger for a specific test, so e.g. you can set overall log level to DEBUG but INFO for a certain logger. What do you think about removing the root logger changes (although, how is it possible that by default we don't show error logs?) but keeping the call to set the report log level. I also don't want someone running these tests and not seeing half the reporting it's supposed to show because they forgot to call pytest with --log-level INFO or something


report.set_project_state("All", "Event Ingestion")
with PerfTimer() as timer:
assert usage_extractor, table_refs # TODO: Replace with call to usage extractor
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's going on here? don't we need to call the usage_extractor to actually test it's performance?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I wrote this to work with the cross-project usage implementation that uses FileBackedDict, or at least the in-memory cross-project usage implementation that I have living in a branch. I don't think the current bigquery code is set up to take in a list of events and table_refs like those 2 are, so just leaving the call out here. Then I can add it in in those other branches, once I merge these changes in

@asikowitz
Copy link
Collaborator Author

Added changes to metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_report.py that I forgot to add before

from datahub.utilities.perf_timer import PerfTimer


def set_log_level(logger: logging.Logger, level: int) -> Generator[None, None, None]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

Copy link
Collaborator

@hsheth2 hsheth2 Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll probably need to add a rule to exclude this file from our docs site - can add that in the generateDocsDir script in docs-website

@asikowitz asikowitz merged commit 54a3727 into datahub-project:master Mar 29, 2023
@asikowitz asikowitz deleted the ingestion-performance-testing-bq branch March 29, 2023 21:13
yoonhyejin pushed a commit that referenced this pull request Apr 3, 2023
… usage (#7690)

- Creates metadata-ingestion/tests/performance directory
- Excludes metadata-ingestion/tests from docs generation
- Updates bigquery reporting around project state
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants