Transformer: cache result of flattening schema #1086
Flattening
During the transformation (of data coming either from a batch or directly from a stream) into the shredded TSV output format, the flatten method is used. It is called for every entity of every transformed event.
In order to build the proper 'flattened' structure, the transformer executes the following steps:
1) list all schemas matching the entity's (vendor, name, model) using the Iglu resolver;
2) fetch the content of each schema on that list;
3) build the ordered Properties from the resolved schemas.
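As an illustrative sketch (Python rather than the transformer's Scala, and all names hypothetical), the end result of those steps is a list of property paths against which each entity is flattened into ordered TSV cells:

```python
# Hypothetical sketch: flatten a nested entity into one TSV cell per
# schema-derived property path. The real transformer derives these paths
# from the JSON schema via the schema-ddl library.

def flatten(data: dict, properties: list[str]) -> list[str]:
    """Produce one TSV cell per schema-derived property path."""
    cells = []
    for path in properties:
        value = data
        for key in path.split("."):
            # Walk down the nested structure; missing keys yield an empty cell.
            value = value.get(key, {}) if isinstance(value, dict) else {}
        cells.append("" if value == {} else str(value))
    return cells

properties = ["user.id", "user.name", "page"]  # derived from the schema
event = {"user": {"id": 42, "name": "ada"}, "page": "/home"}
print(flatten(event, properties))  # ['42', 'ada', '/home']
```

The property paths depend only on the schema, not on the event, which is what makes the caching discussed below possible.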
Caching
Step 3) is an expensive operation from a CPU point of view, as it dives deep into the JSON schema structure.
It turns out step 3) can be cached, as it depends only on the schema's content. The event's data is not needed to build Properties, so they can be cached and shared between different entities using the same schema key (in fact, the same vendor, name and model is sufficient).
Iglu client
There is already a cache implemented for schema lookups in the Iglu client. It is possible to define a TTL to let items expire. This is necessary because items stored in the cache can become stale, for example when a schema is amended in the registry.
When an item expires, a new one is fetched from the registry.
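A minimal sketch of such a TTL cache (Python, hypothetical names; the real one lives in iglu-scala-client) shows the expiry behaviour: an entry older than the TTL is treated as a miss and re-fetched, so stale registry data is eventually replaced.

```python
import time

class TtlCache:
    """Minimal TTL cache sketch, assumed to mirror the Iglu resolver's
    schema-lookup cache behaviour: entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, cached_at)

    def get_or_fetch(self, key, fetch):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]          # fresh enough, reuse
        value = fetch(key)         # expired or absent: re-fetch from registry
        self.store[key] = (value, now)
        return value
```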
Transformer
The basic structure of a new cache, built on top of the one from the Iglu client, could look like:
Key: (vendor, name, model), a tuple consisting of schema key details. Note: just the model, not the full version, because step 1) of flattening lists schemas by model only.
Value: Properties, coming from the schema-ddl library.
This approach effectively removes steps 1), 2) and 3) for most of the flattened entities, which could result in a performance improvement.
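The basic cache could be sketched as follows (Python, hypothetical names): the expensive derivation runs once per (vendor, name, model) key and is reused by every subsequent entity with the same key.

```python
# Hypothetical sketch of the proposed transformer-side cache:
# (vendor, name, model) -> Properties.
properties_cache: dict[tuple[str, str, int], list[str]] = {}

def get_properties(vendor: str, name: str, model: int, build) -> list[str]:
    key = (vendor, name, model)
    if key not in properties_cache:
        # Cache miss: run steps 1)-3), i.e. list schemas by model,
        # fetch them, and build the Properties.
        properties_cache[key] = build(key)
    return properties_cache[key]
```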
The only problem is that Properties would never expire, as opposed to items stored in the Iglu resolver's cache. This could lead to potential issues: even though the data in Iglu has been refreshed, the transformer would still rely on stale data.
Timestamps
Recent improvements in the Iglu client allow keeping both caches in sync. New variants of the lookup methods in the resolver provide more details about the returned value, e.g. a timestamp indicating the moment in time when the given value was originally cached.
The improved cache structure could be:
Key: ((vendor, name, model), timestamp), a compound key consisting of schema key details (as before) and the timestamp returned by the Iglu lookup.
Value: Properties, same as before.
With this model it is possible to 'notify' the transformer cache about changes in the correlated Iglu cache.
E.g. if a new schema list is stored in the Iglu cache (for instance after the original one expires), a new timestamp is stored alongside it. Since this new timestamp is not yet present in the properties cache (even for the same schema key), it forces recalculation of the Properties in the transformer, preventing the use of a stale value.
Without timestamps we would end up with a single Properties calculation for every (vendor, name, model) key during the application's lifetime, regardless of any changes to the values stored in the resolver's lookup cache.
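The timestamped variant could be sketched like this (Python, hypothetical names): a fresh Iglu lookup timestamp yields a new compound key, which misses the cache and forces the Properties to be rebuilt.

```python
# Hypothetical sketch of the improved cache:
# ((vendor, name, model), timestamp) -> Properties.
properties_cache: dict[tuple[tuple[str, str, int], float], list[str]] = {}

def get_properties(schema_key, iglu_timestamp, build):
    # The timestamp comes from the Iglu resolver's lookup and changes
    # whenever the resolver re-caches the schema list.
    key = (schema_key, iglu_timestamp)
    if key not in properties_cache:
        properties_cache[key] = build(schema_key)
    return properties_cache[key]
```

A real implementation would likely also evict entries carrying older timestamps for the same schema key, so that superseded Properties do not accumulate over the application's lifetime.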
Related Iglu client issue: snowplow/iglu-scala-client#207