Address Nick's feedback - v2
spenes committed Nov 21, 2022
1 parent 4e67523 commit 436866e
Showing 1 changed file with 11 additions and 7 deletions.
@@ -34,7 +34,7 @@ This is a complete list of the options that can be configured:
| `maxError` | Optional. Configures the [Redshift MAXERROR load option](https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html#copy-maxerror). The default is 10. |
| `loadAuthMethod.*` (since 5.2.0) | Optional, default method is `NoCreds`. Specifies the auth method to use with the `COPY` statement. |
| `loadAuthMethod.type` | Required if `loadAuthMethod` section is included. Specifies the type of the authentication method. The possible values are `NoCreds` and `TempCreds`. <br/><br/>With `NoCreds`, no credentials will be passed to the `COPY` statement. Instead, the Redshift cluster needs to be configured with an AWS Role ARN that allows it to load data from S3. This Role ARN needs to be passed in the `roleArn` setting above. You can find more information [here](https://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-access-permissions.html). <br/><br/>With `TempCreds`, temporary credentials will be created for every load operation and passed to the `COPY` statement. See the example after this table. |
| `loadAuthMethod.roleArn` | Required if `loadAuthMethod.type` is `TempCreds`.IAM role that is used while creating temporary credentials. Created credentials will allow to access resources specified in the given role. `s3:GetObject*`, `s3:ListBucket`, and `s3:GetBucketLocation` permissions for transformer output S3 bucket should be specified in the role.
| `loadAuthMethod.roleArn` | Required if `loadAuthMethod.type` is `TempCreds`. IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to, with the following permissions: `s3:GetObject*`, `s3:ListBucket`, and `s3:GetBucketLocation`. |
| `jdbc.*` | Optional. Custom JDBC configuration. The default value is `{"ssl": true}`. |
| `jdbc.BlockingRowsMode` | Optional. Refer to the [Redshift JDBC driver reference](https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.54.1082/Amazon+Redshift+JDBC+Connector+Install+Guide.pdf). |
| `jdbc.DisableIsValidQuery` | Optional. Refer to the [Redshift JDBC driver reference](https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.54.1082/Amazon+Redshift+JDBC+Connector+Install+Guide.pdf). |
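
To see how these options fit together, here is a minimal sketch of the relevant part of the Redshift `storage` section in the loader's HOCON configuration. The surrounding `storage` block and its `type` field are assumed from the standard loader configuration, the role ARNs are placeholder values, and the connection settings documented further up in this table are omitted.

```hocon
"storage": {
  "type": "redshift",

  # Role ARN the Redshift cluster uses to read from S3 when loadAuthMethod is NoCreds (placeholder value)
  "roleArn": "arn:aws:iam::123456789012:role/redshift-load-role",

  # Skip an input file once it produces more than 10 load errors (the default)
  "maxError": 10,

  # Custom JDBC settings; the default is {"ssl": true}
  "jdbc": {
    "ssl": true
  },

  # Since 5.2.0: create temporary credentials for every COPY statement
  "loadAuthMethod": {
    "type": "TempCreds",
    # Role granting s3:GetObject*, s3:ListBucket and s3:GetBucketLocation
    # on the transformer output bucket (placeholder value)
    "roleArn": "arn:aws:iam::123456789012:role/transformer-output-access"
  }
}
```
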
@@ -63,14 +63,18 @@ This is a complete list of the options that can be configured:
| `warehouse` | Required. Snowflake warehouse which the SQL statements submitted by Snowflake Loader will run on. |
| `database` | Required. Snowflake database which the data will be loaded to. |
| `schema` | Required. Target schema. |
| `transformedStage` | Required if `NoCreds` is chosen as load auth method. Snowflake stage for transformed events. |
| `folderMonitoringStage` | Required if `monitoring.folders` section is configured and `NoCreds` is chosen as load auth method. Snowflake stage to load folder monitoring entries into temporary Snowflake table. |
| `transformedStage.*` | Required if `NoCreds` is chosen as load auth method. Snowflake stage for transformed events. |
| `transformedStage.name` | Required if `transformedStage` is included. The name of the stage. |
| `transformedStage.location` | Required if `transformedStage` is included. The S3 path used as the stage location (not needed since 5.2.0, because it is auto-configured). |
| `folderMonitoringStage.*` | Required if `monitoring.folders` section is configured and `NoCreds` is chosen as load auth method. Snowflake stage to load folder monitoring entries into temporary Snowflake table. |
| `folderMonitoringStage.name` | Required if `folderMonitoringStage` is included. The name of the stage. |
| `folderMonitoringStage.location` | Required if `folderMonitoringStage` is included. The S3 path used as the stage location (not needed since 5.2.0, because it is auto-configured). |
| `appName` | Optional. Name passed as the `application` property when creating the Snowflake connection. The default is `Snowplow_OSS`. |
| `maxError` | Optional. A table copy statement will skip an input file when the number of errors in it exceeds the specified number. This setting is only used during the initial load and thus can only filter out invalid JSONs (which should be an impossible situation when the data comes from the Transformer). |
| `jdbcHost` | Optional. Host for the JDBC driver that takes priority over the automatically derived host. If it is not given, the host will be derived automatically from the given `snowflakeRegion`. |
| `loadAuthMethod.*` | Optional, default method is `NoCreds`. Specifies the auth method to use with the `COPY INTO` statement. Note that the `TempCreds` auth method doesn't work when data is loaded from GCS. |
| `loadAuthMethod.type` | Required if `loadAuthMethod` section is included. Specifies the type of the auth method. The possible values are `NoCreds` and `TempCreds`. <br/><br/>With `NoCreds`, no credentials will be passed to the `COPY INTO` statement. Instead, the `transformedStage` and `folderMonitoringStage` specified above will be used. More information can be found [here](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html). <br/><br/>With `TempCreds`, temporary credentials will be created for every load operation and passed to the `COPY INTO` statement. See the example after this table. |
| `loadAuthMethod.roleArn` | Required if `loadAuthMethod.type` is `TempCreds`.IAM role that is used while creating temporary credentials. Created credentials will allow to access resources specified in the given role. List of necessary permissions needs to be given to role specified in [here](https://docs.snowflake.com/en/user-guide/data-load-s3-config-aws-iam-user.html). |
| `loadAuthMethod.roleArn` | Required if `loadAuthMethod.type` is `TempCreds`. IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to. You can find the list of permissions that need to be granted to this role [here](https://docs.snowflake.com/en/user-guide/data-load-s3-config-aws-iam-user.html). |
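
As a rough sketch, the Snowflake `storage` options above could be combined like this, under the same assumptions as the Redshift sketch: the `storage` wrapper and its `type` field are assumed, the warehouse, database, schema, stage names and S3 locations are placeholder values, and the connection settings documented further up in this table are omitted.

```hocon
"storage": {
  "type": "snowflake",

  # Warehouse, database and schema the loader will use (placeholder values)
  "warehouse": "snowplow_wh",
  "database": "snowplow",
  "schema": "atomic",

  # Stage for transformed events; the location is auto-configured since 5.2.0
  "transformedStage": {
    "name": "snowplow_stage",
    "location": "s3://example-transformer-output/transformed/"
  },

  # Only needed when the monitoring.folders section is configured
  "folderMonitoringStage": {
    "name": "snowplow_folders_stage",
    "location": "s3://example-transformer-output/monitoring/"
  },

  # 'application' property for the Snowflake connection (the default)
  "appName": "Snowplow_OSS",

  # Default: rely on the stages above instead of temporary credentials
  "loadAuthMethod": {
    "type": "NoCreds"
  }
}
```

With `loadAuthMethod.type` set to `TempCreds`, the two stage settings are not needed; instead, `loadAuthMethod.roleArn` must point to a role with the S3 permissions required by Snowflake.
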

## Databricks Loader `storage` section

@@ -88,7 +92,7 @@ This is a complete list of the options that can be configured:
| `userAgent` | Optional. The default value is `snowplow-rdbloader-oss`. User agent name for Databricks connection. |
| `loadAuthMethod.*` | Optional, default method is `NoCreds`. Specifies the auth method to use with the `COPY INTO` statement. |
| `loadAuthMethod.type` | Required if `loadAuthMethod` section is included. Specifies the type of the auth method. The possible values are `NoCreds` and `TempCreds`. <br/><br/>With `NoCreds`, no credentials will be passed to the `COPY INTO` statement. The Databricks cluster needs to have permission to access the transformer output S3 bucket. More information can be found [here](https://docs.databricks.com/administration-guide/cloud-configurations/aws/instance-profiles.html). <br/><br/>With `TempCreds`, temporary credentials will be created for every load operation and passed to the `COPY INTO` statement. This way, the Databricks cluster doesn't need permission to access the transformer output S3 bucket; this access is provided by the temporary credentials. See the example after this table. |
| `loadAuthMethod.roleArn` | Required if `loadAuthMethod.type` is `TempCreds`. IAM role that is used while creating temporary credentials. Created credentials will allow to access resources specified in the given role. In our case, `s3:GetObject\*`, `s3:ListBucket`, and `s3:GetBucketLocation` permissions for transformer output S3 bucket should be specified in the role. |
| `loadAuthMethod.roleArn` | Required if `loadAuthMethod.type` is `TempCreds`. IAM role that is used while creating temporary credentials. This role should allow access to the S3 bucket the transformer will write data to, with the following permissions: `s3:GetObject*`, `s3:ListBucket`, and `s3:GetBucketLocation`. |
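
A minimal sketch of this part of the Databricks `storage` section, under the same assumptions as the sketches above: the `storage` wrapper and its `type` field are assumed, the role ARN is a placeholder value, and the connection settings documented further up in this table are omitted.

```hocon
"storage": {
  "type": "databricks",

  # User agent name for the Databricks connection (the default)
  "userAgent": "snowplow-rdbloader-oss",

  # Create temporary credentials for every COPY INTO statement, so the
  # Databricks cluster itself needs no access to the transformer output bucket
  "loadAuthMethod": {
    "type": "TempCreds",
    # Role granting s3:GetObject*, s3:ListBucket and s3:GetBucketLocation
    # on the transformer output bucket (placeholder value)
    "roleArn": "arn:aws:iam::123456789012:role/transformer-output-access"
  }
}
```
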

## AWS specific settings

@@ -118,8 +122,8 @@ Only Snowflake Loader can be run on GCP at the moment.
| `schedules.noOperation.[*].name` | Human-readable name of the no-op window. |
| `schedules.noOperation.[*].when` | Cron expression with second granularity. |
| `schedules.noOperation.[*].duration` | For how long the loader should be paused. |
| `schedules.optimizeEvents` | Optional. The default value is `"0 0 0 ? * *"`. Cron expression with second granularity that specifies the schedule to run periodically an `OPTIMIZE` statement on event table. (Only for Databricks Loader) |
| `schedules.optimizeManifest` | Optional. The default value is `"0 0 5 ? * *"`. Cron expression with second granularity that specifies the schedule to run periodically an `OPTIMIZE` statement on manifest table. (Only for Databricks Loader) |
| `schedules.optimizeEvents` | Optional. The default value is `"0 0 0 ? * *"` (i.e. every day at 00:00, JVM timezone). Cron expression with second granularity that specifies the schedule for periodically running an `OPTIMIZE` statement on the events table. (Only for Databricks Loader) |
| `schedules.optimizeManifest` | Optional. The default value is `"0 0 5 ? * *"` (i.e. every day at 05:00 AM, JVM timezone). Cron expression with second granularity that specifies the schedule for periodically running an `OPTIMIZE` statement on the manifest table. (Only for Databricks Loader) |
| `retryQueue.*` | Optional. Additional backlog of recently failed folders that could be automatically retried. The retry queue saves a failed folder and then re-reads the info from the `shredding_complete` S3 file. (Despite the legacy name of the message, which is required for backward compatibility, this also works with wide row format data.) See the example after this table. |
| `retryQueue.period` | Required if `retryQueue` section is configured. How often a batch of failed folders should be pulled into the discovery queue. |
| `retryQueue.size` | Required if `retryQueue` section is configured. How many failures should be kept in memory. After the limit is reached, new failures are dropped. |
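
For illustration, the `schedules` and `retryQueue` settings above might look like this in the loader's HOCON configuration. The no-op window and the `retryQueue` values are placeholder choices, the cron expressions for `optimizeEvents` and `optimizeManifest` are the defaults quoted above, and any `retryQueue` settings not listed in this table are omitted.

```hocon
# Pause loading every Sunday between 12:00 and 13:00 (JVM timezone),
# e.g. for a scheduled maintenance window
"schedules": {
  "noOperation": [
    {
      "name": "Weekly maintenance window",
      "when": "0 0 12 ? * SUN",
      "duration": "1 hour"
    }
  ],

  # Databricks Loader only: OPTIMIZE the events table daily at 00:00
  # and the manifest table daily at 05:00 (the defaults)
  "optimizeEvents": "0 0 0 ? * *",
  "optimizeManifest": "0 0 5 ? * *"
},

# Keep up to 64 recently failed folders in memory and re-check them every 30 minutes
"retryQueue": {
  "period": "30 minutes",
  "size": 64
}
```
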
