
Releases: snowplow/snowplow-rdb-loader

5.2.0

21 Nov 13:47

This release brings Parquet support to Transformer Pubsub. It also brings various new features and improvements to the RDB Loader applications.

Scheduler for running the OPTIMIZE command in the Databricks Loader

Loader applications use a manifest table to keep track of the folders loaded so far. However, we've found that frequent manifest table updates result in a growing number of files backing this table in Databricks, which severely degrades loading performance. A similar problem impacts the events table too.

The remedy for the issue is to run an OPTIMIZE command on the table, compacting all updates into a small number of files.
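
For reference, the command the scheduler issues is the Databricks OPTIMIZE statement; a minimal sketch, using illustrative schema and table names:

-- Compact the small files backing the events and manifest tables
OPTIMIZE snowplow.events;
OPTIMIZE snowplow.manifest;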

To make this process smoother for users, we've added a scheduler that runs the OPTIMIZE command regularly, according to the configured CRON expression.

The OPTIMIZE scheduler can be configured like so:

  "schedules": {
    ...
    # Run the OPTIMIZE command on the events table every day at 00:00 (JVM timezone)
    "optimizeEvents": "0 0 0 ? * *",
    # Run the OPTIMIZE command on the manifest table every day at 05:00 (JVM timezone)
    "optimizeManifest": "0 0 5 ? * *"
  }

The Databricks Loader uses the above values by default. If you want to disable these schedulers completely, set them to 'null':

  "schedules": {
    ...
    "optimizeEvents": null,
    "optimizeManifest": null
  }

Note: this feature requires the collector_tstamp_date generated column in the events table. We recommend disabling the feature if you don't have this column in your events table. If the feature is enabled and the collector_tstamp_date column doesn't exist, you might see some errors in the application logs; however, those errors shouldn't interfere with the normal operation of the application.
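
If you are unsure whether the column exists, one way to check is to describe the table; a minimal sketch, using an illustrative schema and table name:

-- Look for collector_tstamp_date in the output
DESCRIBE TABLE snowplow.events;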

Parquet support in Transformer Pubsub

In this release, Transformer Pubsub gains the ability to output in Parquet format. To enable Parquet output, add the following section to the Transformer Pubsub config:

"formats": {
  "fileFormat": "parquet"
}

You can find the configuration reference for preparing the configuration file, as well as instructions for deploying the application, in the docs.

New authorization method in Redshift Loader

In version 4.2.0, we introduced a new authorization method in the Snowflake Loader and Databricks Loader. In this release, we are adding the same method to the Redshift Loader.

This method allows the loader to generate temporary credentials using STS and pass them to Redshift. This removes the need to pre-configure the warehouse with access permissions.

To start using the new authorization method, you must add a loadAuthMethod to the storage block in your config file:

"storage": {
  // other required fields go here

  "loadAuthMethod": {
    "type": "TempCreds"
    "roleArn": "arn:aws:iam::123456789:role/example_role_name"
  }
}

...where roleArn is a role with permission to read files from the S3 bucket. The loader must have permission to assume this role. More information about authorization options can be found in the docs.
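
To make the second requirement concrete, here is a minimal sketch of the trust policy that the role referenced by roleArn might carry, assuming the loader itself runs under an IAM role named rdb-loader-role (that principal name is purely illustrative):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789:role/rdb-loader-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}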

Full Changelog

  • Redshift: Use STS tokens for copying from S3 (#998)
  • Snowflake Loader: auto-configure staging location paths (#1059)
  • Transformer Pubsub: support Parquet output option (#1124)
  • Common Stream Transformer: add Sentry (#881)
  • Common: add code format check as CI/CD step (#1089)
  • Loader: Improve minimum_age_of_loaded_data metric (#1111)
  • Transformer: cache result of flattening schema (#1086)
  • Databricks loader: Performance degradation for long running application (#1029)
  • Kinesis transformer: flaky wide row test (#1030)

5.1.2

15 Nov 14:11

This patch release brings a bug fix to the Snowflake Loader. The bug can affect you if you have a JSON schema whose name is styled like camelCase but contains consecutive capital letters, for example: testTESTSchema. The bug doesn't affect schemas with the usual snake_case, camelCase, or PascalCase style names.

Changelog

  • Snowflake Loader: create column name with same method in everywhere (#1128)

5.1.1

02 Nov 18:27

Changelog

  • Transformer Kinesis: include hadoop-aws runtime dependency (#1121)
  • Bump aws sdk to 2.18.8 (#1123)
  • Bump slf4j to 2.0.3 (#1119)
  • Bump hadoop-client to 3.3.4 (#1117)
  • Bump amazon-kinesis-client to 2.4.3 (#1116)
  • Bump google-cloud-storage to 2.14.0 (#1115)
  • Common: fix snyk monitoring (#1114)

5.1.0

26 Oct 12:36

This release brings SSH tunnel connection recovery to the Redshift Loader. It also makes it possible to disable in-batch natural deduplication in the Batch Transformer.

Option to disable in-batch natural deduplication in Batch Transformer

Previously, it wasn't possible to disable in-batch natural deduplication in the Batch Transformer. We have found that in-batch natural deduplication affects performance, so we have made it possible to disable it. If duplicate events aren't a problem for you, we suggest disabling deduplication.

It can be disabled by adding the following section to the config:

  "deduplication": {
    # When natural deduplication is disabled, 'synthetic' deduplication needs to be disabled too. 
    "synthetic": {
      "type": "NONE"
    }
    "natural": false
  }

More information about deduplication in Batch Transformer can be found here.

SSH tunnel connection recovery in Redshift Loader

The Redshift Loader can connect to a private Redshift cluster through an SSH tunnel. Previously, if the SSH tunnel session was disconnected, the loader had no way to detect it. We've added a retry around the SSH tunnel connection so that the loader can recover from this problem, making it more robust.

Upgrading to 5.1.0

If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 5.1.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.

docker pull snowplow/transformer-kinesis:5.1.0
docker pull snowplow/rdb-loader-redshift:5.1.0
docker pull snowplow/rdb-loader-snowflake:5.1.0
docker pull snowplow/rdb-loader-databricks:5.1.0

The Snowplow docs site has a full guide to running the RDB Loader.

Changelog

  • Transformer Batch: make in-batch natural deduplication optional (#1108)
  • Recover from disconnected SSH tunnel (#1084)

5.0.0

18 Oct 09:38

This release brings the first GCP-supported applications in the RDB Loader application family: Snowflake Loader and Transformer Pubsub.

Additionally, this release brings a few bug fixes to the Databricks Loader and Transformer Kinesis.

GCP support on Snowflake Loader and Transformer Pubsub

From its inception, the RDB Loader applications have been developed to run on AWS. Making it possible to run them on GCP has been on our roadmap for a long time. In this release, we pave the way by integrating GCP services into the Snowflake Loader, so that it can be run entirely with GCP services. We have also created the GCP counterpart of the transformer, Transformer Pubsub. With these additions, it is possible to load Snowplow data from a GCP pipeline into Snowflake.

At the moment, Transformer Pubsub can't output in Parquet format. Adding support for it is on our roadmap as well, and that change will also make the Databricks Loader possible on GCP.

How to start loading into Snowflake on GCP

First, you need to deploy Transformer Pubsub. A minimal configuration file for Transformer Pubsub looks like the following:

{
  # Name of the Pubsub subscription with the enriched events
  "input": {
    "subscription": "projects/project-id/subscriptions/subscription-id"
  }
  # Path to transformed archive
  "output": {
    "path": "gs://bucket/transformed/"
  }
  # Name of the Pubsub topic used to communicate with Loader
  "queue": {
    "topic": "projects/project-id/topics/topic-id"
  }
}

You can find the configuration reference for preparing the configuration file, as well as instructions for deploying the application, in the docs.

Then, for the Snowflake Loader part you'll need to:

  • set up the necessary Snowflake resources
  • prepare configuration files for the loader
  • deploy the Snowflake Loader app

The important bit in the Snowflake Loader config is that Pubsub should be used as the message queue:

  ...
  "messageQueue": {
    "type": "pubsub"
    "subscription": "projects/project-id/subscriptions/subscription-id"
  }
  ...

Full documentation for Snowflake Loader can be found here.

Bug fixes on Databricks Loader and Transformer Kinesis

  • It was reported that the Databricks Loader had an issue when loading a batch containing multiple Parquet files with different schemas, where an optional column exists only in some of the files. This issue is fixed in version 5.0.0. Thanks to drphrozen for reporting the issue and submitting a PR!

  • It was reported that Transformer Kinesis throws an exception when the Kinesis stream shard count is increased. This issue is fixed in version 5.0.0. Thanks to sdbeans for reporting the issue!

Adding telemetry to loader apps and Transformer Pubsub

At Snowplow, we are trying to improve our products every day, and understanding what is popular is an important part of focusing our development effort in the right place. Therefore, we are adding telemetry to the loader apps and Transformer Pubsub. It basically sends heartbeats with some minimal meta-information about the application.

You can help us by providing a userProvidedId in the config file:

"telemetry" {
  "userProvidedId": "myCompany"
}

Telemetry can be deactivated by putting the following section in the configuration file:

"telemetry": {
  "disable": true
}

More information about telemetry in RDB Loader project can be found here.

Upgrading to 5.0.0

If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 5.0.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.

docker pull snowplow/transformer-kinesis:5.0.0
docker pull snowplow/rdb-loader-redshift:5.0.0
docker pull snowplow/rdb-loader-snowflake:5.0.0
docker pull snowplow/rdb-loader-databricks:5.0.0

The Snowplow docs site has a full guide to running the RDB Loader.

4.3.0

26 Sep 10:22

This release brings some important bug fixes, especially around table migrations. Table migrations are an important feature of RDB Loader: if you update an Iglu schema (e.g. from version 1-0-0 to 1-0-1) then the loader automatically alters the target table to accommodate the newer events. However, we discovered a number of edge cases where migrations did not work as expected.

Redshift loader: Fix to alter max length of text fields

This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 3.0.0 and does not affect older versions.

If you update an Iglu schema by raising the maxLength setting for a string field, then RDB Loader should respond by altering the table e.g. from VARCHAR(10) to VARCHAR(20). Because of this bug, RDB Loader did not attempt to alter the column length; it would instead attempt to load the newer events into the table without running the migrations. You might be affected by this bug if you have recently updated an Iglu schema by raising the max length of a field. If you think you have been affected by this bug, we suggest you check your entity tables and manually alter the table if needed:

ALTER TABLE <SHREDDED_TABLE_NAME> ALTER COLUMN <EXTENDED_COLUMN> TYPE VARCHAR(<NEW_SIZE>);

The bug is fixed in this new 4.3.0 release. Once you upgrade to 4.3.0, RDB Loader will be prepared to correctly migrate your table in response to future field length changes.

Redshift loader: Fix to recover from failed migrations

This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 1.2.0.

If a table migration is immediately followed by a batch which cannot be loaded for any reason, then a table could be left in an inconsistent state where a migration was partially applied. If this ever happened, then RDB Loader could get stuck on successive loads with error messages like:

Invalid operation: cannot alter column "CCCCCCC" of relation "XXXXXXX", target column size should be different; - SqlState: 0A000

With this new 4.3.0 release, the inconsistent state is still reachable (due to Redshift limitations), but the loader is robust to recover from it.

Redshift loader: Fix migrations for batches with multiple versions of the same schema

This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 1.3.0.

It is possible and completely allowed for a batch of events to contain multiple versions of the same schema, e.g. both 1-0-0 and 1-0-1. However, because of this bug, the loader was in danger of trying to perform table migrations twice. This could result in an error message like (same error as in previous case):

Invalid operation: cannot alter column "CCCCCCC" of relation "XXXXXXX", target column size should be different; - SqlState: 0A000

or the following one, depending on the schema migration:

Invalid operation: cannot add column "CCCCCCC" of relation "XXXXXXX", already exists; - SqlState: 0A000

This is fixed in the 4.3.0 release, and now the loader will not enter this failure state if a batch contains multiple versions of the same schema.

Snowflake loader: Configure folder monitoring without a stage, while loading with a stage

This is a new feature you can benefit from if you load into Snowflake with RDB Loader. The Snowflake loader allows two alternative methods for authentication between the warehouse and the S3 bucket: either using Snowflake storage integration, or using temporary credentials generated with AWS STS. Previously, you were forced to pick the same method for loading events and for folder monitoring. With this change, it is possible to use the storage integration for loading events, but temporary credentials for folder monitoring. This is beneficial if you want the faster load times from using a storage integration, but do not want to go through the overhead of setting up a storage integration just for folder monitoring.

Take a look at the github issue for more details on the different ways to configure the loader to use the different authentication methods.

Snowflake and Databricks loaders: Fix inserting timestamps with wrong timezone to manifest table

This is a low-impact bug that is not expected to have any detrimental effect on loading. It could affect your pipeline if you load into Snowflake or Databricks, and if your warehouse is set to have a non-UTC timezone by default.

This bug affects the manifest table, which is the table the loader uses to track which batches have been loaded already. Because of this bug, timestamps in the manifest table were stored using the default timezone of the warehouse, not UTC. This bug could only affect you in the unlikely case you use the manifest table for some other purpose.

Starting from this 4.3.0 release, we take care to insert timestamps with the UTC timezone.

Upgrading to 4.3.0

If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 4.3.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.

docker pull snowplow/transformer-kinesis:4.3.0
docker pull snowplow/rdb-loader-redshift:4.3.0
docker pull snowplow/rdb-loader-snowflake:4.3.0
docker pull snowplow/rdb-loader-databricks:4.3.0

The Snowplow docs site has a full guide to running the RDB Loader.

4.2.2

06 Sep 10:34

A minor release that brings stability improvements and better error messaging to the loaders.

Changelog

  • Loader: feature flag to disable adding load_tstamp column (#1041)
  • Loader: catch and ignore errors when adding load_tstamp column (#1039)
  • Loader: improvements to initialization logs and monitoring (#1040)
  • Databricks loader: make catalog prepended to queries optional (#992)

4.2.1

25 Jul 10:31

A patch release that fixes a bug in the Snowflake Loader’s new temp credentials auth method.

Changelog

  • Snowflake Loader: drop temp table in the first step of the load operation (#1001)
  • Snowflake Loader: fix loading path used with temp creds auth method (#1002)

4.2.0

19 Jul 12:54

New authorization options for Snowflake and Databricks

Up until version 4.1.0, RDB Loader required that the warehouse had been pre-configured to have read-access to the data files in S3.
For Snowflake, this meant setting up an external stage with a storage integration.
For Databricks, it meant setting up a cluster to assume an AWS instance profile.

Starting with version 4.2.0, RDB Loader is able to generate temporary credentials using STS and pass these credentials to Snowflake/Databricks. This removes the need to pre-configure the warehouse with access permission.

To start using the new authorization method, you must add a loadAuthMethod to the storage block in your config file:

"storage": {
  // other required fields go here
  "loadAuthMethod": {
    "type": "TempCreds"
    "roleArn": "arn:aws:iam::123456789:role/example_role_name"
  }
}

...where roleArn is a role with permission to read files from the S3 bucket. The loader must have permission to assume this role.
Our Github repo has some examples of this configuration for Snowflake and for Databricks.

Note, for Snowflake loading, depending on your event volume and warehouse configuration, there may still be an advantage to setting up the storage integration, because the underlying COPY INTO statement is more efficient.
For Databricks loading, though, there should be no impact from changing to the new authorization method.

Retry on target initialization

The initialization block is now surrounded by a retry block, so that if an exception is thrown during initialization, the operation is retried according to the specified backoff strategy instead of crashing the application.

To enable this feature, initRetries must be added to the config file:

  "initRetries": {
    "backoff": "30 seconds"
    "strategy": "EXPONENTIAL"
    "attempts": 3,
    "cumulativeBound": "1 hour"
  },

Adjusting the path appended to Snowflake stage in Snowflake Loader

As mentioned in the first section, we've added a new authorization option to the Snowflake Loader. However, the old Snowflake stage method can still be used.

Previously, the Snowflake stage path needed to be exactly the path where the transformed run folders reside. If the path of a parent folder was given as the stage path, loading wouldn't work.

We've fixed this issue in this release. Even if the stage path is set to a parent directory of the transformed folder, loading will still work correctly.

To use this feature, you need to update transformedStage and folderMonitoringStage blocks:

  "transformedStage":  {
    # The name of the stage
    "name": "snowplow_stage"
    # The S3 path used as stage location
    "location": "s3://bucket/transformed/"
  }
  "folderMonitoringStage": {
    # The name of the stage
    "name": "snowplow_folders_stage"
    # The S3 path used as stage location
    "location": "s3://bucket/monitoring/"
  }

Bug fix for streaming transformer on multiple instances

In the previous release of RDB Loader we announced that the streaming transformer can now scale to multiple instances, which was a really important requirement for high volume pipelines.
We got one little thing wrong though, and it led to some app crashes with error messages about lost Kinesis leases. This bug is now fixed in version 4.2.0, and we hope this unblocks scaling your pipeline to higher event volumes with the streaming transformer.

Upgrading to 4.2.0

If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 4.2.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.

docker pull snowplow/transformer-kinesis:4.2.0
docker pull snowplow/rdb-loader-redshift:4.2.0
docker pull snowplow/rdb-loader-snowflake:4.2.0
docker pull snowplow/rdb-loader-databricks:4.2.0

The Snowplow docs site has a full guide to running the RDB Loader.

Changelog

  • Transformer kinesis: Recover from losing lease to a new worker (#962)
  • Loader: make the part appended to folder monitoring staging path configurable (#969)
  • Loader: retry on target initialization (#964)
  • Loader: Trim alert message payloads to 4096 characters (#956)
  • Snowflake loader: make on_error continue when type of the incoming data is parquet (#970)
  • Snowflake loader: make the path used with stage adjustable (#968)
  • Snowflake loader: use STS tokens for copying from S3 (#955)
  • Snowflake loader: Specify file format in the load statement (#957)
  • Databricks loader: Generate STS tokens for copying from S3 (#954)

4.1.0

04 Jul 12:07

Concurrent streaming transformers for horizontal scaling

Before version 4.1.0, it was only possible to run a single instance of the streaming transformer at any one time. If you tried to run multiple instances at the same time, then there was a race condition, which was described in detail in a previous Discourse thread. The old setup worked great for low volume pipelines, but it meant the streaming solution was not ideal for scaling up to higher volumes.

In version 4.1.0 we have worked around the problem simply by changing the directory names in S3 to contain a UUID unique to each running transformer. Before version 4.1.0, an output directory might be called run=2022-05-01-00-00-14, but in 4.1.0 the output directory might be called something like run=2022-05-01-00-00-14-b4cac3e5-9948-40e3-bd68-38abcf01cdf9. Directory names for the batch transformer are not affected.

With this simple change, you can now safely scale out your streaming transformer to have multiple instances running in parallel.
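
As a rough illustration, listing the transformed output prefix after the upgrade might look like the following, with one UUID-suffixed run directory per window (the bucket path and run IDs are made up):

aws s3 ls s3://bucket/transformed/
                           PRE run=2022-05-01-00-00-14-b4cac3e5-9948-40e3-bd68-38abcf01cdf9/
                           PRE run=2022-05-01-00-10-14-1f2e3d4c-5b6a-4789-a0b1-c2d3e4f5a6b7/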

Databricks loader supports generated columns

If you load into Databricks, a great way to set up your table is to partition based on the date of the event using a generated column:

CREATE TABLE IF NOT EXISTS snowplow.events (
  app_id                      VARCHAR(255),
  collector_tstamp            TIMESTAMP       NOT NULL,
  event_name                  VARCHAR(1000),
  -- Lots of other fields go here

  -- Collector timestamp date for partitioning
  collector_tstamp_date       DATE GENERATED ALWAYS AS (DATE(collector_tstamp))
)
PARTITIONED BY (collector_tstamp_date, event_name);

This partitioning strategy is very efficient for analytic queries that filter by collector_tstamp. The Snowplow/Databricks dbt web model works particularly well with this partitioning scheme.
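
To illustrate why, a query along the following lines (assuming the table above) only needs to scan the partitions matching the date and event name filters:

-- Prunes down to the matching collector_tstamp_date / event_name partitions
SELECT count(*)
FROM snowplow.events
WHERE collector_tstamp_date = '2022-05-01'
  AND event_name = 'page_view';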

In RDB Loader version 4.1.0 we made a small change to the Databricks loading to account for these generated columns.

Upgrading to 4.1.0

If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 4.1.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.

docker pull snowplow/transformer-kinesis:4.1.0
docker pull snowplow/rdb-loader-redshift:4.1.0
docker pull snowplow/rdb-loader-snowflake:4.1.0
docker pull snowplow/rdb-loader-databricks:4.1.0

The Snowplow docs site has a full guide to running the RDB Loader.

Changelog

  • Databricks loader: Support for generated columns (#951)
  • Loader: Use explicit schema name everywhere (#952)
  • Loader: Jars cannot load jsch (#942)
  • Snowflake loader: region and account configuration fields should be optional (#947)
  • Loader: Include the SQLState when logging a SQLException (#941)
  • Loader: Handle run directories with UUID suffix in folder monitoring (#949)
  • Add UUID to streaming transformer directory structure (#945)