
fix resource level max_table_nesting and normalizer performance tuning #2026

Merged: 6 commits into devel from fix/2009-fix-max-table-nesting on Nov 7, 2024

Conversation

sh-rp
Collaborator

@sh-rp sh-rp commented Nov 5, 2024

Description

The resource-level max_table_nesting setting was not passed down to child tables; this PR fixes that. I also completely rewrote the tests to be much more readable (imho); they also were not testing various cases that they claimed to test.
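The failure mode can be illustrated with a minimal, self-contained sketch; the function and names below are illustrative stand-ins, not dlt's actual normalizer code:

```python
# Hypothetical sketch of the behavior being fixed: a per-resource
# max_table_nesting limit must apply while descending into child (nested)
# tables, not only at the root table.

def flatten(row, max_nesting, level=0, prefix=""):
    """Flatten nested dicts into `parent__child` columns, but stop
    descending once `level` reaches `max_nesting` and keep the value as-is."""
    out = {}
    for key, value in row.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict) and level < max_nesting:
            out.update(flatten(value, max_nesting, level + 1, name))
        else:
            out[name] = value  # nesting limit reached: keep the raw object
    return out

row = {"id": 1, "meta": {"a": {"b": 2}}}
print(flatten(row, max_nesting=1))  # {'id': 1, 'meta__a': {'b': 2}}
```

If the limit is only consulted for the top-level table, child tables keep unnesting past it, which is the bug class this PR addresses.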



netlify bot commented Nov 5, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 8246012
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/672b5f8ce20de6000810b67f

@sh-rp sh-rp linked an issue Nov 5, 2024 that may be closed by this pull request
@sh-rp sh-rp force-pushed the fix/2009-fix-max-table-nesting branch from 9aa43be to 179fcb1 Compare November 5, 2024 13:52

# if table is not found, try to get it from root path
if not table and parent_path:
    table = schema.tables.get(parent_path[0])
Collaborator Author

Should we check intermediary tables here too? So if parent_path is ["one", "two"], should we also check table "one__two", or is it always correct to use the top table? I'm not sure there is a case where there would be settings in an intermediary table.

Collaborator

We need only the parent table. The overall fix is good, but maybe we could optimize this block a little? It is always the top-level table that has this setting, so maybe we should pass the root table as table_name here? Then no changes are needed.
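The fallback being discussed can be sketched roughly as follows; all names here, including the "x-normalizer" and "max_nesting" keys and the plain-dict schema, are simplified stand-ins for dlt's actual schema structures:

```python
# Illustrative lookup: nesting settings live only on the top-level (root)
# table, so when a child table is missing from the schema we fall back to
# the first element of parent_path instead of checking intermediary tables.

def get_max_nesting(tables, table_name, parent_path, default):
    table = tables.get(table_name)
    if not table and parent_path:
        # only the root table carries the setting, so probing
        # intermediary names like "one__two" is unnecessary
        table = tables.get(parent_path[0])
    return (table or {}).get("x-normalizer", {}).get("max_nesting", default)

tables = {"events": {"x-normalizer": {"max_nesting": 2}}}
print(get_max_nesting(tables, "events__payload", ("events",), 1000))  # 2
```

Passing the root table name directly, as suggested above, would make the fallback branch unnecessary altogether.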

@sh-rp sh-rp requested a review from rudolfix November 5, 2024 17:50
@sh-rp sh-rp self-assigned this Nov 5, 2024
@sh-rp sh-rp added the bug Something isn't working label Nov 5, 2024
@sh-rp sh-rp marked this pull request as ready for review November 5, 2024 22:50
Collaborator

@rudolfix rudolfix left a comment

This is good; just look at performance again, because the functions here are called often.


@@ -96,7 +96,7 @@ def _reset(self) -> None:
         # self.primary_keys = Dict[str, ]

     def _flatten(
-        self, table: str, dict_row: DictStrAny, _r_lvl: int
+        self, table: str, dict_row: DictStrAny, parent_path: Tuple[str, ...], _r_lvl: int
Collaborator

This is probably the most frequently called function in our code. Maybe we can skip passing the additional argument here, or do something else to make it faster?


  • count the recursion level (_r_lvl) down, with the initial value set to the per-table max nesting, so we do not unnest once _r_lvl runs out
  • pass just the parent table name; that should be faster than a tuple
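The first suggestion, counting the level down instead of threading a parent-path tuple through every call, could look roughly like this (illustrative names, not the actual dlt code):

```python
# Sketch of the reviewer's idea: initialize _r_lvl to the table's max
# nesting and count DOWN, stopping when it hits zero. No extra parent_path
# argument needs to be passed through the hot recursion.

def flatten(row, _r_lvl, prefix=""):
    out = {}
    for key, value in row.items():
        name = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict) and _r_lvl > 0:
            out.update(flatten(value, _r_lvl - 1, name))  # one level consumed
        else:
            out[name] = value
    return out

# start the recursion with the per-table limit already resolved
print(flatten({"a": {"b": {"c": 1}}}, _r_lvl=2))  # {'a__b__c': 1}
```

The per-table limit is resolved once at the top of the recursion, so the innermost calls stay as cheap as before the fix.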

@sh-rp sh-rp force-pushed the fix/2009-fix-max-table-nesting branch from 88f7bec to bfa7b0a Compare November 6, 2024 11:31
@sh-rp sh-rp force-pushed the fix/2009-fix-max-table-nesting branch from bfa7b0a to 2ad9591 Compare November 6, 2024 11:33
@sh-rp
Collaborator Author

sh-rp commented Nov 6, 2024

@rudolfix I think I have taken all your ideas into account, and it should be fast now; I still need to run some benchmarks, though.

@sh-rp sh-rp requested a review from rudolfix November 6, 2024 11:37
@sh-rp sh-rp changed the title fix resource level max_table_nesting fix resource level max_table_nesting and normalizer performance tuning Nov 6, 2024
@sh-rp
Collaborator Author

sh-rp commented Nov 6, 2024

For the following code, the time goes from 4.5 s on devel to about 3.1 s on this branch with the normalization caching. The boost comes mainly from the cached normalization; I think only a very small part comes from the nesting changes. String operations are simply expensive.

import json
import time

from dlt.common.schema import Schema
from tests.common.utils import json_case_path

def rasa_event_bot_metadata():
    with open(json_case_path("rasa_event_bot_metadata"), "rb") as f:
        return json.load(f)

def norm():
    return Schema("default").data_item_normalizer

if __name__ == "__main__":
    payload = rasa_event_bot_metadata()

    for _ in range(1):
        n = norm()
        start = time.time()
        for i in range(1000):
            payload["_id"] = i
            list(n.normalize_data_item(payload, "load_id", "table"))
        print(f"Time taken: {time.time() - start}")

# Cached helper methods for all operations that are called often
#
@staticmethod
@lru_cache(maxsize=None)
Collaborator Author

This adds a considerable speed boost. We could also consider adding caching support on the naming convention itself rather than here, so that all normalizers and other places can benefit; I'm not quite sure there are other places where this gets called as often as here, though.

Collaborator

This is interesting, because this function is already cached. Are you using the snake_case convention? Please look at the underlying code again.

Collaborator

@rudolfix rudolfix left a comment

Please see my comments. Let's check where the improvements are coming from, because I already cache the identifier normalizers...

@staticmethod
@lru_cache(maxsize=None)
def _normalize_table_identifier(schema: Schema, table_name: str) -> str:
    return schema.naming.normalize_table_identifier(table_name)
Collaborator

this is also cached already

@sh-rp
Collaborator Author

sh-rp commented Nov 7, 2024

@rudolfix it's using the snake_case naming. There are a few operations that get called over and over again because they are not cached. For example, if you call normalize_identifier, that call is not cached; only _normalize_identifier is. The snake_case class calls the superclass, where strip is called and some checks are made; the strip alone accounts for 0.2 s of those roughly 1.7 s of gains. Then, for example, in shorten_fragments, make_paths is recalculated over and over, which also accounts for a bit of the savings, etc. I can try to move all the caching to the naming convention if you like.
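The point about the uncached outer wrapper can be shown with a rough stand-in (these functions are simplified illustrations, not dlt's actual NamingConvention methods):

```python
# Illustration: when only the inner _normalize_identifier is cached, the
# public wrapper still repeats strip() and validation on every call.
# Caching the outer entry point amortizes that work too.
from functools import lru_cache

def _normalize_identifier(ident: str) -> str:
    return ident.lower().replace(" ", "_")  # stand-in for the snake_case core

def normalize_identifier(ident: str) -> str:
    ident = ident.strip()            # repeated on every call when uncached
    if not ident:
        raise ValueError("empty identifier")
    return _normalize_identifier(ident)

# caching the outer function skips strip + validation on repeat inputs
cached_normalize = lru_cache(maxsize=None)(normalize_identifier)

print(cached_normalize("  Event Name "))  # event_name
```

On a repetitive payload like the benchmark above, almost every identifier is a repeat, so the cached path is hit nearly every time.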

Collaborator

@rudolfix rudolfix left a comment

OK! There are other naming conventions that are not cached, so that will speed things up. All non-deterministic naming conventions will stop working, but no one should write those :)
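The caveat about non-deterministic naming conventions follows directly from how lru_cache works; a hypothetical illustration (not dlt code):

```python
# Why caching breaks a non-deterministic naming convention: lru_cache pins
# the first result for a given input forever, so a convention whose output
# varies between calls silently stops varying.
from functools import lru_cache
from itertools import count

_counter = count()

def nondeterministic_normalize(ident: str) -> str:
    # returns a different suffix on each call - a convention no one should write
    return f"{ident}_{next(_counter)}"

cached = lru_cache(maxsize=None)(nondeterministic_normalize)

first = cached("col")
second = cached("col")
print(first == second)  # True: the second call is served from the cache
```

For deterministic conventions like snake_case this pinning is exactly what makes the cache safe.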

@rudolfix rudolfix merged commit 62f46db into devel Nov 7, 2024
58 of 61 checks passed
@rudolfix rudolfix deleted the fix/2009-fix-max-table-nesting branch November 7, 2024 16:01
Successfully merging this pull request may close these issues.

max_table_nesting not working as expected