Transformer: cache result of flattening schema #1086

Closed
pondzix opened this issue Sep 30, 2022 · 0 comments

pondzix commented Sep 30, 2022

Flattening

During transformation to the shredded TSV output format (whether the data comes from a batch or directly from a stream), the flatten method is used. It is called for every entity of every transformed event.

To build the proper 'flattened' structure, the transformer executes the following steps:

  1. Listing schemas matching the vendor, name and model of the schema referenced by an entity of the transformed event.
  2. Fetching the content of every matching schema.
  3. Extracting properties based on the schemas' content (in other words: getting TSV columns in the correct order).
  4. Extracting entity data based on the properties (in other words: filling out the TSV columns with data).
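
The four steps can be sketched roughly as follows. This is only an illustration in Python, not the transformer's code: the in-memory `REGISTRY`, the function names and the column ordering rule are all assumptions made for the example (the real ordering is decided by schema-ddl).

```python
from typing import Any, Dict, List, Tuple

# Hypothetical in-memory registry: (vendor, name, model) -> schemas of that model.
REGISTRY: Dict[Tuple[str, str, int], List[Dict[str, Any]]] = {
    ("com.acme", "click", 1): [
        {"version": "1-0-0", "properties": {"x": {}, "y": {}}},
        {"version": "1-0-1", "properties": {"x": {}, "y": {}, "target": {}}},
    ]
}


def list_and_fetch(vendor: str, name: str, model: int) -> List[Dict[str, Any]]:
    # Steps 1) and 2): list the matching schemas and fetch their content.
    return REGISTRY[(vendor, name, model)]


def extract_properties(schemas: List[Dict[str, Any]]) -> List[str]:
    # Step 3): derive the ordered columns from schema content only.
    # Placeholder ordering: alphabetical per schema, new columns appended.
    columns: List[str] = []
    for schema in schemas:
        for prop in sorted(schema["properties"]):
            if prop not in columns:
                columns.append(prop)
    return columns


def flatten(entity_data: Dict[str, Any], columns: List[str]) -> List[str]:
    # Step 4): fill the columns with entity data; missing fields become "".
    return [str(entity_data.get(column, "")) for column in columns]


schemas = list_and_fetch("com.acme", "click", 1)
columns = extract_properties(schemas)
row = flatten({"x": 10, "target": "button"}, columns)
```

Note that `extract_properties` never looks at the entity data, only at the schemas.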

Caching

Step 3) is an expensive operation from a CPU point of view, as it descends deep into the JSON schema structure.
It turns out that step 3) can be cached, as it depends only on the schema's content. The event's data is not needed to build properties, so they can be cached and shared between different entities using the same schema key (in fact, the same vendor, name and model is sufficient).

Iglu client

There is already a cache implemented for schema lookup in the Iglu client. It is possible to define a TTL to let items expire. This is necessary because items stored in the cache can become stale, e.g. when:

  • a new schema is added to a registry
  • a schema is patched in a registry

When an item expires, a new one is fetched from the registry.
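
For illustration, the expiry behaviour described above could look like the sketch below. It is not the Iglu client's implementation or API: `TtlCache`, `fetch` and `clock` are hypothetical names, and the clock is injected only to make the behaviour easy to follow.

```python
from typing import Callable, Dict, Hashable, Tuple


class TtlCache:
    """Minimal TTL cache: an entry is refetched once it is older than the TTL."""

    def __init__(
        self,
        ttl_seconds: float,
        fetch: Callable[[Hashable], object],
        clock: Callable[[], float],
    ) -> None:
        self._ttl = ttl_seconds
        self._fetch = fetch  # stands in for a registry request
        self._clock = clock  # e.g. time.monotonic in a real application
        self._entries: Dict[Hashable, Tuple[float, object]] = {}

    def lookup(self, key: Hashable) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[0] >= self._ttl:
            # Missing or expired: fetch again and remember when it was cached.
            entry = (now, self._fetch(key))
            self._entries[key] = entry
        return entry[1]
```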

Transformer

The basic structure of a new cache, built on top of the one from Iglu, could look like:

  • Key - (vendor, name, model) tuple consisting of schema key details. Note: just the model, not the full version, because step 1) of flattening lists schemas by model only.
  • Value - Properties coming from the schema-ddl library.

This approach effectively removes steps 1), 2) and 3) for most of the flattened entities, which could improve performance.

The only problem is that properties would never expire, as opposed to items stored in the Iglu resolver's cache. This could lead to issues: even though the data in Iglu has been refreshed, the transformer would still rely on stale data.
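
The staleness problem can be reproduced with a minimal sketch of such a cache. Everything here is hypothetical (the `NaivePropertiesCache` name, the toy `registry`, and plain string lists standing in for schema-ddl properties); it only shows the shape of the issue.

```python
from typing import Callable, Dict, List, Tuple

# (vendor, name, model): hypothetical alias for the cache key described above.
SchemaModelKey = Tuple[str, str, int]


class NaivePropertiesCache:
    """Computes properties once per key and never expires them."""

    def __init__(self, compute: Callable[[SchemaModelKey], List[str]]) -> None:
        self._compute = compute  # stands in for steps 1), 2) and 3)
        self._cache: Dict[SchemaModelKey, List[str]] = {}

    def get(self, key: SchemaModelKey) -> List[str]:
        if key not in self._cache:
            self._cache[key] = self._compute(key)
        return self._cache[key]


# Toy registry content; a real registry would be reached through the Iglu client.
registry = {("com.acme", "click", 1): ["x", "y"]}
cache = NaivePropertiesCache(lambda key: list(registry[key]))

key = ("com.acme", "click", 1)
first = cache.get(key)                # computed from the registry
registry[key] = ["x", "y", "target"]  # a new schema version adds a column
second = cache.get(key)               # still the old columns: stale
```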

Timestamps

Recent improvements in the Iglu client allow keeping both caches in sync. New variants of the lookup methods in the resolver provide more details about the returned value, e.g. a timestamp indicating the moment when the given value was originally cached.

An improved cache structure could be:

  • Key - ((vendor, name, model), timestamp) compound key consisting of the schema key details (as before) and the timestamp returned by the Iglu lookup.
  • Value - Properties, same as before.

With this model it is possible to 'notify' the transformer cache about changes in the corresponding Iglu cache.

E.g. if a new schema list is stored in the Iglu cache (e.g. after the original one expires), a new timestamp is stored alongside it. As this new timestamp is not yet present in the properties cache (even for the same schema key), it forces recalculation of properties in the transformer, thereby preventing use of a stale value.

Without timestamps we would end up with a single properties calculation for every (vendor, name, model) key during the application's lifetime, regardless of any changes to the values stored in the resolver's lookup cache.
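
A minimal sketch of the timestamp-keyed variant, under stated assumptions: `lookup_schemas` is a hypothetical stand-in for a resolver lookup that returns the cached value together with the time it was cached (it is not the Iglu client's actual signature), and `extract_properties` stands in for step 3).

```python
from typing import Callable, Dict, List, Tuple

SchemaModelKey = Tuple[str, str, int]  # (vendor, name, model)
Schemas = List[List[str]]              # toy stand-in for fetched schema content


class TimestampedPropertiesCache:
    """Properties cache keyed by (schema key, timestamp of the resolver entry)."""

    def __init__(
        self,
        lookup_schemas: Callable[[SchemaModelKey], Tuple[int, Schemas]],
        extract_properties: Callable[[Schemas], List[str]],
    ) -> None:
        self._lookup = lookup_schemas
        self._extract = extract_properties
        self._cache: Dict[Tuple[SchemaModelKey, int], List[str]] = {}

    def get(self, key: SchemaModelKey) -> List[str]:
        # The resolver lookup is assumed to be cheap when served from its cache.
        cached_at, schemas = self._lookup(key)
        compound_key = (key, cached_at)
        if compound_key not in self._cache:
            # Unseen timestamp: the resolver entry was refreshed, so recompute.
            self._cache[compound_key] = self._extract(schemas)
        return self._cache[compound_key]
```

One thing this sketch leaves out is eviction: entries stored under older timestamps are never removed, so a bounded structure of some kind would be needed in a long-running application.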

Related Iglu client issue: snowplow/iglu-scala-client#207

@spenes spenes closed this as completed in 756b4a7 Nov 21, 2022