diff --git a/airbyte-integrations/connectors/destination-s3/Dockerfile b/airbyte-integrations/connectors/destination-s3/Dockerfile
index 6f97def52b5e..852934d1f068 100644
--- a/airbyte-integrations/connectors/destination-s3/Dockerfile
+++ b/airbyte-integrations/connectors/destination-s3/Dockerfile
@@ -16,5 +16,5 @@ ENV APPLICATION destination-s3
 
 COPY --from=build /airbyte /airbyte
 
-LABEL io.airbyte.version=0.2.12
+LABEL io.airbyte.version=0.2.13
 LABEL io.airbyte.name=airbyte/destination-s3
diff --git a/airbyte-integrations/connectors/destination-s3/src/main/java/io/airbyte/integrations/destination/s3/S3ConsumerFactory.java b/airbyte-integrations/connectors/destination-s3/src/main/java/io/airbyte/integrations/destination/s3/S3ConsumerFactory.java
index 0bd210852d83..c709e2ed547c 100644
--- a/airbyte-integrations/connectors/destination-s3/src/main/java/io/airbyte/integrations/destination/s3/S3ConsumerFactory.java
+++ b/airbyte-integrations/connectors/destination-s3/src/main/java/io/airbyte/integrations/destination/s3/S3ConsumerFactory.java
@@ -82,9 +82,9 @@ private static Function toWriteConfig(
       final String streamName = abStream.getName();
       final String bucketPath = config.get(BUCKET_PATH_FIELD).asText();
       final String customOutputFormat = String.join("/",
-          bucketPath,
-          config.has(PATH_FORMAT_FIELD) && !config.get(PATH_FORMAT_FIELD).asText().isBlank() ?
-              config.get(PATH_FORMAT_FIELD).asText() : S3DestinationConstants.DEFAULT_PATH_FORMAT);
+          bucketPath,
+          config.has(PATH_FORMAT_FIELD) && !config.get(PATH_FORMAT_FIELD).asText().isBlank() ? config.get(PATH_FORMAT_FIELD).asText()
+              : S3DestinationConstants.DEFAULT_PATH_FORMAT);
       final String outputBucketPath = storageOperations.getBucketObjectPath(namespace, streamName, SYNC_DATETIME, customOutputFormat);
       final DestinationSyncMode syncMode = stream.getDestinationSyncMode();
       final WriteConfig writeConfig = new WriteConfig(namespace, streamName, outputBucketPath, syncMode);
diff --git a/docs/integrations/destinations/s3.md b/docs/integrations/destinations/s3.md
index 5dab2017035f..b96b2ab208cf 100644
--- a/docs/integrations/destinations/s3.md
+++ b/docs/integrations/destinations/s3.md
@@ -22,6 +22,7 @@ Check out common troubleshooting issues for the S3 destination connector on our
 | S3 Endpoint | string | URL to S3. If using AWS S3, just leave this blank. |
 | S3 Bucket Name | string | Name of the bucket to sync data into. |
 | S3 Bucket Path | string | Subdirectory under the above bucket to sync the data into. |
+| S3 Bucket Format | string | Format for additional subdirectories under the S3 Bucket Path. The default value is `${NAMESPACE}/${STREAM_NAME}/`, and it can be further customized with variables such as `${YEAR}`, `${MONTH}`, `${DAY}`, and `${HOUR}`, which refer to the datetime of the write. |
 | S3 Region | string | See [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions) for all region codes. |
 | Access Key ID | string | AWS/Minio credential. |
 | Secret Access Key | string | AWS/Minio credential. |
@@ -29,21 +30,21 @@ Check out common troubleshooting issues for the S3 destination connector on our
 ⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend provisioning a dedicated S3 resource for this sync to prevent unexpected data deletion from misconfiguration.
 ⚠️
 
-The full path of the output data is:
+The full path of the output data with the S3 path format `${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}` is:
 
 ```text
-<bucket-name>/<source-namespace-if-exists>/<stream-name>/<upload-date>-<upload-millis>-<partition-id>.<format-extension>
+<bucket-name>/<bucket-path>/<source-namespace-if-exists>/<stream-name>/<upload-date>/<uuid>.<format-extension>
 ```
 
 For example:
 
 ```text
-testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.csv
-↑              ↑                ↑      ↑     ↑          ↑             ↑ ↑
-|              |                |      |     |          |             | format extension
-|              |                |      |     |          |             partition id
-|              |                |      |     |          upload time in millis
-|              |                |      |     upload date in YYYY-MM-DD
+testing_bucket/data_output_path/public/users/2021_01_01/123e4567-e89b-12d3-a456-426614174000.csv.gz
+↑              ↑                ↑      ↑     ↑          ↑                                   ↑
+|              |                |      |     |          |                                   format extension
+|              |                |      |     |          |
+|              |                |      |     |          uuid
+|              |                |      |     upload date in YYYY_MM_DD
 |              |                |      stream name
 |              |                source namespace (if it exists)
 |              bucket path
@@ -51,10 +52,7 @@ bucket name
 ```
 
 Please note that the stream name may contain a prefix if one is configured on the connection.
-
-The rationales behind this naming pattern are: 1. Each stream has its own directory. 2. The data output files can be sorted by upload time. 3. The upload time composes of a date part and millis part so that it is both readable and unique.
-
-Currently, each data sync will only create one file per stream. In the future, the output file can be partitioned by size. Each partition is identifiable by the partition ID, which is always 0 for now.
+A data sync may create multiple files, as the output files can be partitioned by size (targeting a compressed size of 200MB or lower).
 
 ## Output Schema
 
@@ -133,6 +131,8 @@ With root level normalization, the output CSV is:
 | :--- | :--- | :--- | :--- |
 | `26d73cde-7eb1-4e1e-b7db-a4c03b4cf206` | 1622135805000 | 123 | `{ "first": "John", "last": "Doe" }` |
 
+Output CSV files will always be compressed using GZIP compression.
+
 ### JSON Lines \(JSONL\)
 
 [Json Lines](https://jsonlines.org/) is a text format with one JSON object per line. Each line has a structure as follows:
@@ -173,6 +173,8 @@ They will be like this in the output file:
 { "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_emitted_at": "1631948170000", "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }
 ```
 
+Output JSONL files will always be compressed using GZIP compression.
+
 ### Parquet
 
 #### Configuration
@@ -226,6 +228,7 @@ Under the hood, an Airbyte data stream in Json schema is first converted to an A
 
 | Version | Date | Pull Request | Subject |
 |:--------| :--- | :--- |:---------------------------------------------------------------------------------------------------------------------------|
+| 0.2.13 | 2022-03-29 | [\#11496](https://github.com/airbytehq/airbyte/pull/11496) | Fix S3 bucket path to be included in the S3 bucket format |
 | 0.2.12 | 2022-03-28 | [\#11294](https://github.com/airbytehq/airbyte/pull/11294) | Change to serialized buffering strategy to reduce memory consumption |
 | 0.2.11 | 2022-03-23 | [\#11173](https://github.com/airbytehq/airbyte/pull/11173) | Added support for AWS Glue crawler |
 | 0.2.10 | 2022-03-07 | [\#10856](https://github.com/airbytehq/airbyte/pull/10856) | `check` method now tests for listObjects permissions on the target bucket |
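
Reviewer note on the documented path behavior: the sketch below is a minimal, hypothetical Java illustration of how the `${...}` path format variables could be resolved into a final object key, matching the docs example above. It is not the connector's actual implementation; the class name, method name, and the use of `java.time` and `UUID` here are assumptions made purely for illustration.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.UUID;

// Hypothetical sketch: resolves the documented ${...} variables into an S3 object key.
// Not the connector's real code; names and formatting choices are assumptions.
public class PathFormatSketch {

  static String resolveObjectKey(final String bucketPath,
                                 final String pathFormat,
                                 final String namespace,
                                 final String streamName,
                                 final ZonedDateTime writeDatetime,
                                 final String formatExtension) {
    // Substitute the datetime-based variables from the time of writing.
    final String resolved = pathFormat
        .replace("${NAMESPACE}", namespace)
        .replace("${STREAM_NAME}", streamName)
        .replace("${YEAR}", String.format("%04d", writeDatetime.getYear()))
        .replace("${MONTH}", String.format("%02d", writeDatetime.getMonthValue()))
        .replace("${DAY}", String.format("%02d", writeDatetime.getDayOfMonth()))
        .replace("${HOUR}", String.format("%02d", writeDatetime.getHour()));
    // The bucket path is prepended to the resolved format (the behavior this PR fixes),
    // and each output file gets a UUID-based name plus the format extension.
    return String.join("/", bucketPath, resolved, UUID.randomUUID() + formatExtension);
  }

  public static void main(final String[] args) {
    final ZonedDateTime dt = ZonedDateTime.of(2021, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC);
    // Prints e.g.: data_output_path/public/users/2021_01_01/<uuid>.csv.gz
    System.out.println(resolveObjectKey(
        "data_output_path",
        "${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}",
        "public", "users", dt, ".csv.gz"));
  }
}
```

Prepending `bucketPath` via `String.join("/", ...)` mirrors the shape of the `customOutputFormat` expression in the `S3ConsumerFactory` hunk above.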
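The docs changes also state that CSV and JSONL outputs are always GZIP-compressed. The snippet below is only a rough sketch of the kind of stream wrapping that produces such a `.jsonl.gz` file; it deliberately ignores the connector's serialized buffering strategy and size-based partitioning, and the file name is made up for the example.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: writes one JSONL record into a gzipped file.
public class GzipJsonlSketch {

  public static void main(final String[] args) throws IOException {
    // Wrapping the file stream in GZIPOutputStream yields gzip-compressed output.
    try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
        new GZIPOutputStream(new FileOutputStream("part-0.jsonl.gz")), StandardCharsets.UTF_8))) {
      // One JSON object per line, mirroring the documented JSONL layout.
      writer.write("{ \"_airbyte_ab_id\": \"0a61de1b-9cdd-4455-a739-93572c9a5f20\", "
          + "\"_airbyte_emitted_at\": \"1631948170000\", "
          + "\"_airbyte_data\": { \"user_id\": 456 } }");
      writer.newLine();
    }
  }
}
```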