fix resource level max_table_nesting and normalizer performance tuning #2026
Conversation
# if table is not found, try to get it from root path
if not table and parent_path:
    table = schema.tables.get(parent_path[0])
should we check intermediary tables here too? so if the parent path is ["one", "two"], should we also check table "one__two", or is it always correct to use the top table? I'm not sure there is a case where there would be settings on an intermediary table.
we need only the parent table. the overall fix is good, but maybe we could optimize this block a little? it is always the top-level table that has this setting, so maybe we should pass the root table as table_name here? then no changes are needed
this is good, just look at performance again because the functions here are called very often
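To make the suggestion concrete, here is a hypothetical sketch (function name and the setting key are illustrative, not the actual dlt internals) of resolving the setting from the root table only, so no intermediary-table lookups are needed:

```python
from typing import Optional, Tuple

def max_nesting_for(schema, table_name: str, parent_path: Tuple[str, ...]) -> Optional[int]:
    # only the top-level (root) table can carry the nesting setting, so prefer
    # the root of the parent path over the (possibly missing) child table
    root_name = parent_path[0] if parent_path else table_name
    root_table = schema.tables.get(root_name)
    if root_table is None:
        return None
    # "max_table_nesting" is a stand-in key; the real schema hint name may differ
    return root_table.get("max_table_nesting")
```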
@@ -96,7 +96,7 @@ def _reset(self) -> None:
         # self.primary_keys = Dict[str, ]

     def _flatten(
-        self, table: str, dict_row: DictStrAny, _r_lvl: int
+        self, table: str, dict_row: DictStrAny, parent_path: Tuple[str, ...], _r_lvl: int
this is probably the most often called function in our code. maybe we can skip passing the additional argument here or do something else to make it faster?
- count the recursion level down (_r_lvl), with the initial value set to the max nesting per table, so we do not unnest once _r_lvl is exhausted
- pass just the parent table name, which should be faster than a tuple
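A minimal sketch of the counting-down idea (schematic, not the actual _flatten implementation): the counter starts at the per-table max nesting and is decremented per level, so the hot path only compares an int and never consults the schema or a parent path tuple:

```python
from typing import Any, Dict

def flatten(row: Dict[str, Any], _r_lvl: int) -> Dict[str, Any]:
    out: Dict[str, Any] = {}
    for key, value in row.items():
        if isinstance(value, dict) and _r_lvl > 0:
            # nesting budget left: recurse with the counter decremented
            for child_key, child_value in flatten(value, _r_lvl - 1).items():
                out[f"{key}__{child_key}"] = child_value
        else:
            # leaf value, or budget exhausted: keep the value as-is
            out[key] = value
    return out

# initialize the counter once per root table, e.g. flatten(row, max_table_nesting)
```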
@rudolfix I think I have taken all your ideas into account and it should be fast now; I still need to run some benchmarks though.
For the following code, the time goes from 4.5s on devel to about 3.1s on this branch with the normalization caching. The boost comes mainly from the cached normalization and, I think, only to a very small degree from the nesting changes. String operations are just expensive.
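The benchmark code itself is not shown in the thread; a purely illustrative way to time the normalize step on nested rows (pipeline names and row shapes are made up) could look like this:

```python
import time
import dlt

def nested_rows(n: int = 50_000):
    for i in range(n):
        yield {"id": i, "customer": {"name": "a", "address": {"city": "b"}}, "items": [{"k": i}]}

pipeline = dlt.pipeline(pipeline_name="normalize_bench", destination="duckdb", dataset_name="bench")
pipeline.extract(nested_rows(), table_name="rows")

start = time.monotonic()
pipeline.normalize()
print(f"normalize took {time.monotonic() - start:.2f}s")
```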
# Cached helper methods for all operations that are called often
#
@staticmethod
@lru_cache(maxsize=None)
this adds a considerable speed boost. we could also consider adding caching support on the naming convention itself rather than here, so that all normalizers and other places can benefit. I'm not quite sure there are other places where this gets called as often as here, though.
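A rough sketch of what caching on the naming convention itself could look like (cache_naming is a hypothetical helper, not part of dlt, and this is only safe for deterministic conventions):

```python
from functools import lru_cache

def cache_naming(naming, maxsize=None):
    # wrap the public methods of a naming convention instance so every caller
    # (normalizers, schema code) benefits; identifiers repeat heavily during
    # normalization, so the hit rate is very high
    naming.normalize_identifier = lru_cache(maxsize=maxsize)(naming.normalize_identifier)
    naming.normalize_table_identifier = lru_cache(maxsize=maxsize)(naming.normalize_table_identifier)
    return naming
```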
this is interesting because this function is already cached. are you using the snake_case convention? please look at the underlying code again
please see my comments. let's check where the improvements are coming from, because I already cache ident normalizers...
@staticmethod
@lru_cache(maxsize=None)
def _normalize_table_identifier(schema: Schema, table_name: str) -> str:
    return schema.naming.normalize_table_identifier(table_name)
this is also cached already
@rudolfix it's using snake_case naming. There are a few operations that get called over and over again because they are not cached. For example, if you call normalize_identifier, that is not cached; only _normalize_identifier is. The snake_case class calls the super class, where strip is called and some checks are made. The strip alone accounts for 0.2s of those roughly 1.7s of gains. Then for example in …
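A simplified illustration of that caching gap (class and method bodies are schematic, not the actual dlt sources): the cache sits on the private helper, so every call still pays for strip() and the validation in the public method, which is the overhead the call-site cache in this PR avoids:

```python
from functools import lru_cache

class SomeNamingConvention:
    def normalize_identifier(self, identifier: str) -> str:
        identifier = identifier.strip()           # paid on every call, cache or not
        if not identifier:
            raise ValueError("empty identifier")
        return self._normalize_identifier(identifier)

    @lru_cache(maxsize=None)
    def _normalize_identifier(self, identifier: str) -> str:
        # only the expensive casing/charset work is cached
        return identifier.lower()
```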
OK! there are other naming conventions that are not cached so that will speed things up. All non-deterministic naming conventions will stop working but no one should write those :)
Description
The resource-level max_table_nesting setting was not passed down to child tables; this PR fixes that. I also completely rewrote the tests to be much more readable (imho); they were also not testing several of the cases they claimed to test.
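As an illustration of the fixed behavior (a hedged sketch; the exact table layout depends on the destination and naming convention), setting max_table_nesting on a resource should now also cap how deeply nested objects in child tables are unnested:

```python
import dlt

@dlt.resource(max_table_nesting=1)
def orders():
    yield {
        "id": 1,
        "customer": {"name": "a", "address": {"city": "b"}},   # nested dict on the root row
        "items": [{"sku": "x", "attrs": {"color": "red"}}],    # produces the orders__items child table
    }

pipeline = dlt.pipeline(pipeline_name="nesting_demo", destination="duckdb", dataset_name="demo")
pipeline.run(orders())
# before this fix, the limit was honored on the root table only; child tables such as
# orders__items could still be unnested beyond the configured level
```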