
fix resource level max_table_nesting and normalizer performance tuning #2026

Merged: 6 commits into devel from fix/2009-fix-max-table-nesting on Nov 7, 2024

Conversation

sh-rp
Collaborator

@sh-rp sh-rp commented Nov 5, 2024

Description

The resource-level max_table_nesting setting was not passed down to child tables; this PR fixes that. I also completely rewrote the tests to be much more readable (imho); they also were not testing various cases that they claimed to test.
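The failure mode can be illustrated with a minimal, self-contained sketch; the function and names below are illustrative stand-ins, not dlt's actual normalizer code:

```python
# Hypothetical sketch of the behavior being fixed: a per-resource
# max_table_nesting limit must apply while descending into child (nested)
# tables, not only at the root table.

def flatten(row, max_nesting, level=0, prefix=""):
    """Flatten nested dicts into `parent__child` columns, but stop
    descending once `level` reaches `max_nesting` and keep the value as-is."""
    out = {}
    for key, value in row.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict) and level < max_nesting:
            out.update(flatten(value, max_nesting, level + 1, name))
        else:
            out[name] = value  # nesting limit reached: keep the raw object
    return out

row = {"id": 1, "meta": {"a": {"b": 2}}}
print(flatten(row, max_nesting=1))  # {'id': 1, 'meta__a': {'b': 2}}
```

If the limit is only consulted for the top-level table, child tables keep unnesting past it, which is the bug class this PR addresses.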



netlify bot commented Nov 5, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 8246012
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/672b5f8ce20de6000810b67f

@sh-rp sh-rp linked an issue Nov 5, 2024 that may be closed by this pull request
@sh-rp sh-rp force-pushed the fix/2009-fix-max-table-nesting branch from 9aa43be to 179fcb1 Compare November 5, 2024 13:52

# if table is not found, try to get it from root path
if not table and parent_path:
    table = schema.tables.get(parent_path[0])
Collaborator Author

Should we check intermediary tables here too? So if parent_path is ["one", "two"], should we also check table "one__two", or is it always correct to use the top table? I'm not sure there is a case where there would be settings in an intermediary table.

Collaborator

We need only the parent table. The overall fix is good, but maybe we could optimize this block a little? It is always the top-level table that has this setting, so maybe we should pass the root table as table_name here? Then no changes are needed.
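The fallback being discussed can be sketched roughly as follows; all names here, including the "x-normalizer" and "max_nesting" keys and the plain-dict schema, are simplified stand-ins for dlt's actual schema structures:

```python
# Illustrative lookup: nesting settings live only on the top-level (root)
# table, so when a child table is missing from the schema we fall back to
# the first element of parent_path instead of checking intermediary tables.

def get_max_nesting(tables, table_name, parent_path, default):
    table = tables.get(table_name)
    if not table and parent_path:
        # only the root table carries the setting, so probing
        # intermediary names like "one__two" is unnecessary
        table = tables.get(parent_path[0])
    return (table or {}).get("x-normalizer", {}).get("max_nesting", default)

tables = {"events": {"x-normalizer": {"max_nesting": 2}}}
print(get_max_nesting(tables, "events__payload", ("events",), 1000))  # 2
```

Passing the root table name directly, as suggested above, would make the fallback branch unnecessary altogether.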

@sh-rp sh-rp requested a review from rudolfix November 5, 2024 17:50
@sh-rp sh-rp self-assigned this Nov 5, 2024
@sh-rp sh-rp added the bug Something isn't working label Nov 5, 2024
@sh-rp sh-rp marked this pull request as ready for review November 5, 2024 22:50
Collaborator

@rudolfix rudolfix left a comment

This is good; just look at performance again, because the functions here are called often.


@@ -96,7 +96,7 @@ def _reset(self) -> None:
         # self.primary_keys = Dict[str, ]

     def _flatten(
-        self, table: str, dict_row: DictStrAny, _r_lvl: int
+        self, table: str, dict_row: DictStrAny, parent_path: Tuple[str, ...], _r_lvl: int
Collaborator

This is probably the most frequently called function in our code. Maybe we can skip passing the additional argument here, or do something else to make it faster?


  • count the recursion level (_r_lvl) down, with the initial value set to the per-table max nesting, so we do not unnest once _r_lvl runs out
  • pass just the parent table name; that should be faster than a tuple
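The first suggestion, counting the level down instead of threading a parent-path tuple through every call, could look roughly like this (illustrative names, not the actual dlt code):

```python
# Sketch of the reviewer's idea: initialize _r_lvl to the table's max
# nesting and count DOWN, stopping when it hits zero. No extra parent_path
# argument needs to be passed through the hot recursion.

def flatten(row, _r_lvl, prefix=""):
    out = {}
    for key, value in row.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict) and _r_lvl > 0:
            out.update(flatten(value, _r_lvl - 1, name))  # one level consumed
        else:
            out[name] = value
    return out

# start the recursion with the per-table limit already resolved
print(flatten({"a": {"b": {"c": 1}}}, _r_lvl=2))  # {'a__b__c': 1}
```

The per-table limit is resolved once at the top of the recursion, so the innermost calls stay as cheap as before the fix.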

@sh-rp sh-rp force-pushed the fix/2009-fix-max-table-nesting branch from 88f7bec to bfa7b0a Compare November 6, 2024 11:31
@sh-rp sh-rp force-pushed the fix/2009-fix-max-table-nesting branch from bfa7b0a to 2ad9591 Compare November 6, 2024 11:33
@sh-rp
Collaborator Author

sh-rp commented Nov 6, 2024

@rudolfix I think I have taken all your ideas into account, and it should be fast now; I still need to run some benchmarks, though.

@sh-rp sh-rp requested a review from rudolfix November 6, 2024 11:37
@sh-rp sh-rp changed the title fix resource level max_table_nesting fix resource level max_table_nesting and normalizer performance tuning Nov 6, 2024
@sh-rp
Collaborator Author

sh-rp commented Nov 6, 2024

For the following code, the time goes from 4.5 s on devel to about 3.1 s on this branch with the normalization caching. The boost comes mainly from the cached normalization; I think only a very small part comes from the nesting changes. String operations are simply expensive.

import json
import time

from dlt.common.schema import Schema
from tests.common.utils import json_case_path

def rasa_event_bot_metadata():
    with open(json_case_path("rasa_event_bot_metadata"), "rb") as f:
        return json.load(f)

def norm():
    return Schema("default").data_item_normalizer

if __name__ == "__main__":
    payload = rasa_event_bot_metadata()

    for _ in range(1):
        n = norm()
        start = time.time()
        for i in range(1000):
            payload["_id"] = i
            list(n.normalize_data_item(payload, "load_id", "table"))
        print(f"Time taken: {time.time() - start}")

# Cached helper methods for all operations that are called often
#
@staticmethod
@lru_cache(maxsize=None)
Collaborator Author

This adds a considerable speed boost. We could also consider adding caching support on the naming convention itself rather than here, so that all normalizers and other places can benefit; I'm not quite sure there are other places where this gets called as often as here, though.

Collaborator

This is interesting, because this function is already cached. Are you using the snake_case convention? Please look at the underlying code again.

Collaborator

@rudolfix rudolfix left a comment

Please see my comments. Let's check where the improvements are coming from, because I already cache the identifier normalizers...

@staticmethod
@lru_cache(maxsize=None)
def _normalize_table_identifier(schema: Schema, table_name: str) -> str:
    return schema.naming.normalize_table_identifier(table_name)
Collaborator

this is also cached already

@sh-rp
Collaborator Author

sh-rp commented Nov 7, 2024

@rudolfix it's using the snake_case naming. There are a few operations that get called over and over again because they are not cached. For example, if you call normalize_identifier, that call is not cached; only _normalize_identifier is. The snake_case class calls the superclass, where strip is called and some checks are made; the strip alone accounts for 0.2 s of those roughly 1.7 s of gains. Then, for example, in shorten_fragments, make_paths is recalculated over and over, which also accounts for a bit of the savings, etc. I can try to move all the caching to the naming convention if you like.
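The point about the uncached outer wrapper can be shown with a rough stand-in (these functions are simplified illustrations, not dlt's actual NamingConvention methods):

```python
# Illustration: when only the inner _normalize_identifier is cached, the
# public wrapper still repeats strip() and validation on every call.
# Caching the outer entry point amortizes that work too.
from functools import lru_cache

def _normalize_identifier(ident: str) -> str:
    return ident.lower().replace(" ", "_")  # stand-in for the snake_case core

def normalize_identifier(ident: str) -> str:
    ident = ident.strip()            # repeated on every call when uncached
    if not ident:
        raise ValueError("empty identifier")
    return _normalize_identifier(ident)

# caching the outer function skips strip + validation on repeat inputs
cached_normalize = lru_cache(maxsize=None)(normalize_identifier)

print(cached_normalize("  Event Name "))  # event_name
```

On a repetitive payload like the benchmark above, almost every identifier is a repeat, so the cached path is hit nearly every time.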

Collaborator

@rudolfix rudolfix left a comment

OK! There are other naming conventions that are not cached, so that will speed things up. All non-deterministic naming conventions will stop working, but no one should write those :)
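The caveat about non-deterministic naming conventions follows directly from how lru_cache works; a hypothetical illustration (not dlt code):

```python
# Why caching breaks a non-deterministic naming convention: lru_cache pins
# the first result for a given input forever, so a convention whose output
# varies between calls silently stops varying.
from functools import lru_cache
from itertools import count

_counter = count()

def nondeterministic_normalize(ident: str) -> str:
    # returns a different suffix on each call - a convention no one should write
    return f"{ident}_{next(_counter)}"

cached = lru_cache(maxsize=None)(nondeterministic_normalize)

first = cached("col")
second = cached("col")
print(first == second)  # True: the second call is served from the cache
```

For deterministic conventions like snake_case this pinning is exactly what makes the cache safe.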

@rudolfix rudolfix merged commit 62f46db into devel Nov 7, 2024
58 of 61 checks passed
@rudolfix rudolfix deleted the fix/2009-fix-max-table-nesting branch November 7, 2024 16:01
Successfully merging this pull request may close these issues.

max_table_nesting not working as expected