Source Amazon S3: Refactored docs (#12534)
* Refactored spec and docs

* Updated spec.json

* Rollback spec formatting

* Rollback spec formatting

* Rollback spec formatting
lazebnyi authored May 9, 2022
1 parent 15aef24 commit 27e6ce2
Showing 2 changed files with 42 additions and 54 deletions.
@@ -221,4 +221,4 @@
},
"supportsIncremental": true,
"supported_destination_sync_modes": ["overwrite", "append", "append_dedup"]
}
}
94 changes: 41 additions & 53 deletions docs/integrations/sources/s3.md
@@ -1,28 +1,45 @@
# S3
# Amazon S3

## Overview
This page contains the setup guide and reference information for the Amazon S3 source connector.

The S3 source enables syncing of **file-based tables** with support for multiple files using glob-like pattern matching, and both Full Refresh and Incremental syncs, using the last\_modified property of files to determine incremental batches.
**This connector does not support syncing unstructured data files such as raw text, audio, or videos.**
You can choose if this connector will read only the new/updated files, or all the matching files, every time a sync is run.
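
To illustrate the idea behind incremental batches, here is a minimal sketch — not the connector's actual implementation — that lists objects with boto3 and keeps only files modified after a saved cursor. The bucket name and cursor value are placeholders.

```python
from datetime import datetime, timezone

import boto3

# Hypothetical example: keep only files modified since the last sync,
# mirroring how incremental batches are determined from last_modified.
s3 = boto3.client("s3")
last_cursor = datetime(2022, 5, 1, tzinfo=timezone.utc)  # saved from the previous sync

new_files = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-test-bucket"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] > last_cursor:
            new_files.append(obj["Key"])

print(sorted(new_files))
```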
## Prerequisites

The connector works with either Amazon S3 storage or a third-party S3-compatible service such as Wasabi, as well as custom S3 services set up with MinIO, LeoFS, Ceph, etc.
* Access to the S3 bucket containing the files to replicate.
* If the bucket is private, [AWS credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) \(`aws_access_key_id` and `aws_secret_access_key`\) with both `read` and `list` permissions on the bucket.

### Output Schema

At this time, this source produces only a single stream \(table\) for the target files.
## Setup guide

By default, the schema will be automatically inferred from all the relevant files present when setting up the connection, however you can also specify a schema in the source settings to enforce desired columns and datatypes. Any additional columns found \(on any sync\) are packed into an extra mapping field called `_ab_additional_properties`. Any missing columns will be added and null-filled.
### Step 1: Set up Amazon S3

We'll be considering extending these behaviours in the future and welcome your feedback!
* If syncing from a private bucket, the credentials you use for the connection must have both `read` and `list` access on the S3 bucket. `list` is required to discover files based on the provided pattern\(s\).
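
As a quick way to confirm that a set of credentials has the required `list` and `read` permissions, you could run a sketch like the one below with boto3; the bucket name, key names, and credentials are placeholders.

```python
import boto3

# Hypothetical check: the credentials must be able to both list the bucket
# (needed for pattern-based discovery) and read individual objects.
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)

listing = s3.list_objects_v2(Bucket="my-test-bucket", MaxKeys=1)  # exercises `list`
first_key = listing["Contents"][0]["Key"]
body = s3.get_object(Bucket="my-test-bucket", Key=first_key)["Body"].read()  # exercises `read`
print(f"Listed and read {first_key} ({len(body)} bytes)")
```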

## Step 2: Set up the Amazon S3 connector in Airbyte

### For Airbyte Cloud:

1. [Log into your Airbyte Cloud](https://cloud.airbyte.io/workspaces) account.
2. In the left navigation bar, click **Sources**. In the top-right corner, click **+ new source**.
3. On the Set up the source page, enter the name for the Amazon S3 connector and select **Amazon S3** from the Source type dropdown.
4. Set `dataset` appropriately. This will be the name of the table in the destination.
5. If your bucket contains _only_ files containing data for this table, use `**` as path\_pattern. See the [Path Patterns section](s3.md#path-patterns) for more specific pattern matching.
6. Leave schema as `{}` to automatically infer it from the file\(s\). For details on providing a schema, see the [User Schema section](s3.md#user-schema).
7. Fill in the fields within the provider box appropriately. If your bucket is not public, add [credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) with sufficient permissions under `aws_access_key_id` and `aws_secret_access_key`.
8. Choose the format corresponding to the format of your files and fill in fields as required. If unsure about values, try out the defaults and come back if needed. Find details on these settings [here](s3.md#file-format-settings).

### For Airbyte OSS:

Note that you should provide the `dataset` which dictates how the table will be identified in the destination.
1. Create a new S3 source with a suitable name. Since each S3 source maps to just a single table, it may be worth including that in the name.
2. Set `dataset` appropriately. This will be the name of the table in the destination.
3. If your bucket contains _only_ files containing data for this table, use `**` as path\_pattern. See the [Path Patterns section](s3.md#path-patterns) for more specific pattern matching.
4. Leave schema as `{}` to automatically infer it from the file\(s\). For details on providing a schema, see the [User Schema section](s3.md#user-schema).
5. Fill in the fields within the provider box appropriately. If your bucket is not public, add [credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) with sufficient permissions under `aws_access_key_id` and `aws_secret_access_key`.
6. Choose the format corresponding to the format of your files and fill in fields as required. If unsure about values, try out the defaults and come back if needed. Find details on these settings [here](s3.md#file-format-settings).
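
For orientation, here is a hypothetical configuration assembled as a Python dict using the field names described in the steps above. The bucket name, credentials, and dataset are placeholders, and the exact spec keys may differ between connector versions.

```python
# Hypothetical example of the values described above, gathered in one place.
# Field names follow the steps in this guide; actual spec keys may differ by version.
source_config = {
    "dataset": "my_table",            # name of the table in the destination
    "path_pattern": "**",             # match every file in the bucket
    "schema": "{}",                   # let the connector infer the schema
    "format": {"filetype": "csv"},    # see File Format Settings below
    "provider": {
        "bucket": "my-test-bucket",
        "aws_access_key_id": "YOUR_ACCESS_KEY_ID",          # omit for public buckets
        "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",  # omit for public buckets
    },
}
```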

### Data Types

Currently, complex types \(array and object\) are coerced to string, but we'll be looking to improve support for this in the future!
## Supported sync modes

### Features
The Amazon S3 source connector supports the following [sync modes](https://docs.airbyte.com/cloud/core-concepts#connection-sync-modes):

| Feature | Supported? |
| :--- | :--- |
@@ -33,7 +50,8 @@ Currently, complex types \(array and object\) are coerced to string, but we'll b
| Replicate Multiple Streams \(distinct tables\) | No |
| Namespaces | No |

### File Compressions

## File Compressions

| Compression | Supported? |
| :--- | :--- |
@@ -46,41 +64,8 @@ Currently, complex types \(array and object\) are coerced to string, but we'll b

Please let us know any specific compressions you'd like to see support for next!

### File Formats

File Formats are mostly enabled \(and further tested\) thanks to other open-source libraries that we are using under the hood such as:

* [PyArrow](https://arrow.apache.org/docs/python/csv.html)

| Format | Supported? |
| :--- | :--- |
| CSV | Yes |
| Parquet | Yes |
| JSON | No |
| HTML | No |
| XML | No |
| Excel | No |
| Feather | No |
| Pickle | No |

We're looking to enable these other formats very soon, so watch this space!

## Getting started

### Requirements

* If syncing from a private bucket, the credentials you use for the connection must have both `read` and `list` access on the S3 bucket. `list` is required to discover files based on the provided pattern\(s\).

### Quickstart

1. Create a new S3 source with a suitable name. Since each S3 source maps to just a single table, it may be worth including that in the name.
2. Set `dataset` appropriately. This will be the name of the table in the destination.
3. If your bucket contains _only_ files containing data for this table, use `**` as path\_pattern. See the [Path Patterns section](s3.md#path-patterns) for more specific pattern matching.
4. Leave schema as `{}` to automatically infer it from the file\(s\). For details on providing a schema, see the [User Schema section](s3.md#user-schema).
5. Fill in the fields within the provider box appropriately. If your bucket is not public, add [credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) with sufficient permissions under `aws_access_key_id` and `aws_secret_access_key`.
6. Choose the format corresponding to the format of your files and fill in fields as required. If unsure about values, try out the defaults and come back if needed. Find details on these settings [here](s3.md#file-format-settings).

### Path Pattern
## Path Pattern

\(tl;dr -&gt; path pattern syntax using [wcmatch.glob](https://facelessuser.github.io/wcmatch/glob/). GLOBSTAR and SPLIT flags are enabled.\)

@@ -130,7 +115,8 @@ We want to pick up part1.csv, part2.csv and part3.csv \(excluding another\_part1

As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
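
If you want to test a pattern locally before configuring the source, a small sketch with wcmatch — using the same GLOBSTAR and SPLIT flags mentioned above — can help. The sample keys below are made up for illustration.

```python
from wcmatch import glob

# Sample object keys to test a pattern against (made up for illustration).
keys = [
    "csv_tables/part1.csv",
    "csv_tables/part2.csv",
    "logs/another_part1.csv",
]

# GLOBSTAR lets ** span directories; SPLIT allows "pattern1|pattern2" style patterns.
pattern = "csv_tables/*.csv"
matches = [k for k in keys if glob.globmatch(k, pattern, flags=glob.GLOBSTAR | glob.SPLIT)]
print(matches)  # ['csv_tables/part1.csv', 'csv_tables/part2.csv']
```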

### User Schema

## User Schema

Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from each file and a superset schema created. This will probably be fine in most cases but there may be situations where you want to enforce a schema instead, e.g.:

@@ -154,7 +140,8 @@ For example:
* {"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}
* {"username": "string", "friends": "array", "information": "object"}
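
To make the behaviour described above concrete, here is a rough sketch — not the connector's code — of how a provided schema could be applied to a record: columns missing from the record are null-filled, and columns not in the schema are packed into `_ab_additional_properties`.

```python
import json

# Rough sketch (not the connector's implementation) of enforcing a user schema:
# missing columns are null-filled, unexpected columns go to _ab_additional_properties.
user_schema = json.loads('{"id": "integer", "location": "string"}')

def apply_schema(record: dict, schema: dict) -> dict:
    shaped = {column: record.get(column) for column in schema}   # null-fill missing columns
    extras = {k: v for k, v in record.items() if k not in schema}
    if extras:
        shaped["_ab_additional_properties"] = extras             # pack unknown columns
    return shaped

print(apply_schema({"id": 1, "comment": "unexpected column"}, user_schema))
# {'id': 1, 'location': None, '_ab_additional_properties': {'comment': 'unexpected column'}}
```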

### S3 Provider Settings

## S3 Provider Settings

* `bucket` : name of the bucket your files are in
* `aws_access_key_id` : one half of the [required credentials](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) for accessing a private bucket.
@@ -170,7 +157,7 @@ For example:

Note that all files within one stream must adhere to the same read options for every provided format.

#### CSV
### CSV

Since CSV files are effectively plain text, providing specific reader options is often required for correct parsing of the files. These settings are applied when a CSV is created or exported so please ensure that this process happens consistently over time.

@@ -192,7 +179,7 @@ Since CSV files are effectively plain text, providing specific reader options is
{"column_names": ["column1", "column2", "column3"]}
```
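
For reference, CSV reader options like these ultimately map onto PyArrow's CSV reader. Below is a hedged sketch of reading a file with an explicit delimiter and column names; the file path is a placeholder, and the exact mapping from the connector spec to PyArrow option names is an assumption.

```python
import pyarrow.csv as pa_csv

# Sketch of how settings like delimiter and column_names translate to PyArrow.
# "data/part1.csv" is a placeholder path; option names here are PyArrow's, not the spec's.
read_options = pa_csv.ReadOptions(column_names=["column1", "column2", "column3"])
parse_options = pa_csv.ParseOptions(delimiter=",", quote_char='"')
convert_options = pa_csv.ConvertOptions(strings_can_be_null=True)

table = pa_csv.read_csv(
    "data/part1.csv",
    read_options=read_options,
    parse_options=parse_options,
    convert_options=convert_options,
)
print(table.schema)
```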

#### Parquet
### Parquet

Apache Parquet is a column-oriented data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. For now, this solution iterates through individual files at the abstract level, so partitioned Parquet datasets are unsupported. The following settings are available:

@@ -202,6 +189,7 @@ Apache Parquet file is a column-oriented data storage format of the Apache Hadoo

You can find more details [here](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches).
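
As a quick illustration of the batch-wise reading linked above, the sketch below iterates over the record batches of a single \(non-partitioned\) Parquet file with PyArrow; the path, columns, and batch size are placeholder values.

```python
import pyarrow.parquet as pq

# Reads one (non-partitioned) Parquet file batch by batch, mirroring the
# settings above; path, columns, and batch_size are placeholder values.
parquet_file = pq.ParquetFile("data/part1.parquet")

for batch in parquet_file.iter_batches(batch_size=65_536, columns=["id", "location"]):
    print(batch.num_rows, batch.schema.names)
```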


## Changelog

| Version | Date | Pull Request | Subject |
