Source S3 - fix schema inference (#17991)
* #678 oncall. Source S3 - fix schema inference

* source s3: upd changelog

* auto-bump connector version [ci skip]

Co-authored-by: Octavia Squidington III <[email protected]>
davydov-d and octavia-squidington-iii authored Oct 14, 2022
1 parent 888347a commit 5aa25a1
Showing 9 changed files with 24 additions and 7 deletions.
@@ -915,7 +915,7 @@
 - name: S3
   sourceDefinitionId: 69589781-7828-43c5-9f63-8925b1c1ccc2
   dockerRepository: airbyte/source-s3
-  dockerImageTag: 0.1.23
+  dockerImageTag: 0.1.24
   documentationUrl: https://docs.airbyte.com/integrations/sources/s3
   icon: s3.svg
   sourceType: file
@@ -9436,7 +9436,7 @@
   supportsNormalization: false
   supportsDBT: false
   supported_destination_sync_modes: []
-- dockerImage: "airbyte/source-s3:0.1.23"
+- dockerImage: "airbyte/source-s3:0.1.24"
   spec:
     documentationUrl: "https://docs.airbyte.com/integrations/sources/s3"
     changelogUrl: "https://docs.airbyte.com/integrations/sources/s3"
2 changes: 1 addition & 1 deletion airbyte-integrations/connectors/source-s3/Dockerfile
@@ -17,5 +17,5 @@ COPY source_s3 ./source_s3
 ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
 ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]

-LABEL io.airbyte.version=0.1.23
+LABEL io.airbyte.version=0.1.24
 LABEL io.airbyte.name=airbyte/source-s3
@@ -6,7 +6,7 @@
     "aws_access_key_id": "123456",
     "aws_secret_access_key": "123456key",
     "path_prefix": "",
-    "endpoint": "http://10.0.167.14:9000"
+    "endpoint": "http://10.0.92.4:9000"
   },
   "format": {
     "filetype": "csv"
@@ -13,7 +13,7 @@
 class AbstractFileParser(ABC):
     logger = AirbyteLogger()

-    NON_SCALAR_TYPES = {"struct": "struct"}
+    NON_SCALAR_TYPES = {"struct": "struct", "list": "list"}
     TYPE_MAP = {
         "boolean": ("bool_", "bool"),
         "integer": ("int64", "int8", "int16", "int32", "uint8", "uint16", "uint32", "uint64"),
@@ -24,7 +24,10 @@ class JsonlParser(AbstractFileParser):
             "large_string",
         ),
         # TODO: support array type rather than coercing to string
-        "array": ("large_string",),
+        "array": (
+            "list",
+            "large_string",
+        ),
         "null": ("large_string",),
     }

@@ -80,6 +83,8 @@ def get_inferred_schema(self, file: Union[TextIO, BinaryIO]) -> Mapping[str, Any
 def field_type_to_str(type_: Any) -> str:
     if isinstance(type_, pa.lib.StructType):
         return "struct"
+    if isinstance(type_, pa.lib.ListType):
+        return "list"
     if isinstance(type_, pa.lib.DataType):
         return str(type_)
     raise Exception(f"Unknown PyArrow Type: {type_}")
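
The two parser changes appear to work together: `field_type_to_str` now collapses any PyArrow list type to the bare name `list` (instead of a parameterized name such as `list<item: string>`), and the `array` entry of `TYPE_MAP` now accepts that name. A minimal stdlib-only sketch of the resulting name-to-JSON-type lookup is below. The helper `json_type_for` and the trimmed `OLD_TYPE_MAP` / `NEW_TYPE_MAP` are hypothetical stand-ins for illustration, not the connector's actual code, and PyArrow is not imported.

```python
from typing import Mapping, Optional, Tuple

TypeMap = Mapping[str, Tuple[str, ...]]

# Trimmed, illustrative maps (JSON type -> accepted PyArrow type names).
# They mirror only the "array" change from the diff; other entries are assumptions.
OLD_TYPE_MAP: TypeMap = {
    "integer": ("int64",),
    "string": ("large_string",),
    "array": ("large_string",),
}
NEW_TYPE_MAP: TypeMap = {
    "integer": ("int64",),
    "string": ("large_string",),
    "array": ("list", "large_string"),
}


def json_type_for(pyarrow_name: str, type_map: TypeMap) -> Optional[str]:
    """Return the first JSON type whose tuple contains pyarrow_name, else None."""
    for json_type, pyarrow_names in type_map.items():
        if pyarrow_name in pyarrow_names:
            return json_type
    return None


# A parameterized list name matches nothing in the old map.
print(json_type_for("list<item: string>", OLD_TYPE_MAP))  # None
# The bare name "list" resolves once "array" accepts it.
print(json_type_for("list", NEW_TYPE_MAP))  # array
# Earlier entries win when a name appears under several JSON types.
print(json_type_for("large_string", NEW_TYPE_MAP))  # string
```

Because the lookup is by exact name, normalizing every list type to one bare name keeps the map small regardless of the element type.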
@@ -0,0 +1,3 @@
+{"id": 1, "name": "Erich", "books": ["Shadows in Paradise", "The Dream Room", "The Night in Lisbon"]}
+{"id": 2, "name": "Maria", "books": ["All Quiet on the Western Front"]}
+{"id": 3, "name": "Remarque", "books": ["The Road Back", "Three Comrades"]}
@@ -168,4 +168,12 @@ def cases(cls) -> Mapping[str, Any]:
         "line_checks": {},
         "fails": [],
     },
+    "array_in_schema_test": {
+        "AbstractFileParser": JsonlParser(format={"filetype": "jsonl"}),
+        "filepath": os.path.join(SAMPLE_DIRECTORY, "jsonl/test_file_11_array_in_schema.jsonl"),
+        "num_records": 3,
+        "inferred_schema": {"id": "integer", "name": "string", "books": "array"},
+        "line_checks": {},
+        "fails": [],
+    },
 }
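
To see why the new test case expects `{"id": "integer", "name": "string", "books": "array"}`, the sketch below infers JSON types for the three sample lines using only the standard library. It is a simplified stand-in for the PyArrow-based inference in `JsonlParser`, not a reproduction of it; `json_type_of`, `infer_schema`, and `SAMPLE` are hypothetical names, and the first type seen for a key simply wins.

```python
import json
from typing import Any, Dict, Iterable


def json_type_of(value: Any) -> str:
    """Map a decoded JSON value to its JSON Schema type name."""
    if value is None:
        return "null"
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"Unsupported JSON value: {value!r}")


def infer_schema(lines: Iterable[str]) -> Dict[str, str]:
    """Infer a flat {field: json_type} schema from JSONL lines (first type seen wins)."""
    schema: Dict[str, str] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        for key, value in json.loads(line).items():
            schema.setdefault(key, json_type_of(value))
    return schema


# The three records added in test_file_11_array_in_schema.jsonl.
SAMPLE = [
    '{"id": 1, "name": "Erich", "books": ["Shadows in Paradise", "The Dream Room", "The Night in Lisbon"]}',
    '{"id": 2, "name": "Maria", "books": ["All Quiet on the Western Front"]}',
    '{"id": 3, "name": "Remarque", "books": ["The Road Back", "Three Comrades"]}',
]

print(infer_schema(SAMPLE))  # {'id': 'integer', 'name': 'string', 'books': 'array'}
```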
3 changes: 2 additions & 1 deletion docs/integrations/sources/s3.md
@@ -204,7 +204,8 @@ The Jsonl parser uses pyarrow hence,only the line-delimited JSON format is suppo
 ## Changelog

 | Version | Date | Pull Request | Subject |
-| :------ | :--------- | :-------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------- |
+|:--------|:-----------|:----------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------|
+| 0.1.23 | 2022-10-10 | [17991](https://github.com/airbytehq/airbyte/pull/17991) | Fix pyarrow to JSON schema type conversion for arrays |
 | 0.1.23 | 2022-10-10 | [17800](https://github.com/airbytehq/airbyte/pull/17800) | Deleted `use_ssl` and `verify_ssl_cert` flags and hardcoded to `True` |
 | 0.1.22 | 2022-09-28 | [17304](https://github.com/airbytehq/airbyte/pull/17304) | Migrate to per-stream state |
 | 0.1.21 | 2022-09-20 | [16921](https://github.com/airbytehq/airbyte/pull/16921) | Upgrade pyarrow |