🐛 BigQuery source: Fix nested arrays #4981

Merged
merged 80 commits into from
Jul 27, 2021
286e327
unfinished jdbcsource separation
DoNotPanicUA Jun 11, 2021
6a62eaa
creation AbstactRelation
DoNotPanicUA Jun 11, 2021
0aaf904
Migrate StateManager to new abstract level (JdbcSource -> RelationalS…
DoNotPanicUA Jun 14, 2021
6dcedf5
fix imports
DoNotPanicUA Jun 14, 2021
6d8e976
move configs to Database level + fix MySql source
DoNotPanicUA Jun 14, 2021
042c527
make in line jdbc source with a new impl
DoNotPanicUA Jun 15, 2021
147e166
Fix ScaffoldJavaJdbcSource template
DoNotPanicUA Jun 15, 2021
246528a
rename `AbstractField` to `CommonField`. Now it
DoNotPanicUA Jun 17, 2021
e8d29c9
format
DoNotPanicUA Jun 17, 2021
111abae
rename generated files in line with their location
DoNotPanicUA Jun 17, 2021
3bc83b2
bonus renaming
DoNotPanicUA Jun 17, 2021
18e86d8
move utility methods specific for jdbc source to a proper module
DoNotPanicUA Jun 18, 2021
67f1e6a
internal review update
DoNotPanicUA Jun 22, 2021
42eb2c5
BigQueryDatabase impl without row transformation
DoNotPanicUA Jun 23, 2021
9326d35
add Static method for BigQueryDatabase instancing
DoNotPanicUA Jun 23, 2021
c7d2eff
remove data type parameter limitation + rename class parameters
DoNotPanicUA Jun 23, 2021
c78a441
Merge remote-tracking branch 'origin/aleonets/4024-abstract-source' i…
DoNotPanicUA Jun 23, 2021
06f5d13
Move DataTypeUtils from jdbs to common + impl basic types BigQueryUtils
DoNotPanicUA Jun 23, 2021
d16f883
Merge remote-tracking branch 'origin/master' into aleonets/4024-abstr…
DoNotPanicUA Jun 23, 2021
75e7c99
make DB2 in line with new relational abstract classes
DoNotPanicUA Jun 23, 2021
3dc5383
add missing import
DoNotPanicUA Jun 23, 2021
fd60d14
cover all biqquery classes + add type transformation method from Stan…
DoNotPanicUA Jun 29, 2021
b3e0801
Merge remote-tracking branch 'origin/master' into 4024-abstract-source
DoNotPanicUA Jun 30, 2021
5421313
close unused connections
DoNotPanicUA Jun 30, 2021
22356d0
Merge branch 'master' into aleonets/1876-source-bigquery
heade Jul 1, 2021
32b53ee
Merge remote-tracking branch 'origin/aleonets/4024-abstract-source' i…
DoNotPanicUA Jul 1, 2021
8c706b4
Merge remote-tracking branch 'origin/aleonets/1876-source-bigquery' i…
heade Jul 1, 2021
aa38921
add table list extract method
DoNotPanicUA Jul 1, 2021
5d31771
Merge remote-tracking branch 'origin/aleonets/1876-source-bigquery' i…
heade Jul 1, 2021
4fb2f24
bigquery source connector
heade Jul 1, 2021
f4d6aa0
return all tables for a whole project instead of a dataset
DoNotPanicUA Jul 1, 2021
7f76db9
impl incremental fetch
DoNotPanicUA Jul 1, 2021
0495b35
bigquery source connector
heade Jul 2, 2021
d764c16
bigquery source connector
heade Jul 2, 2021
33a447e
remove unnecessary databaseid
DoNotPanicUA Jul 5, 2021
e114a18
add primitive type filtering
DoNotPanicUA Jul 5, 2021
c27a744
Merge remote-tracking branch 'origin/master' into aleonets/1876-sourc…
DoNotPanicUA Jul 5, 2021
74f8350
add temporary workaround for test database.
DoNotPanicUA Jul 6, 2021
2a5703e
add dataset location
DoNotPanicUA Jul 7, 2021
ae5f059
fix table info retrieving
DoNotPanicUA Jul 7, 2021
904f054
handle dataset config
DoNotPanicUA Jul 8, 2021
094fa82
Add working comprehensive test without data cases
DoNotPanicUA Jul 8, 2021
32bd999
minor changes in the source processing
DoNotPanicUA Jul 9, 2021
5541017
acceptance tests; discover method fix
heade Jul 9, 2021
667018b
Merge remote-tracking branch 'origin/aleonets/1876-source-bigquery' i…
heade Jul 9, 2021
4e8910f
discover method fix
heade Jul 9, 2021
36693ed
first comprehensinve test
DoNotPanicUA Jul 9, 2021
5468d54
Merge branch 'aleonets/1876-source-bigquery' of https://github.com/ai…
DoNotPanicUA Jul 9, 2021
8dc3f44
Comprehensive tests for the BigQuery source + database timeout config
DoNotPanicUA Jul 11, 2021
194af3f
bigquery acceptance tests fix; formatting
heade Jul 12, 2021
62b3f89
fix incremental sync using date, datetime, time and timestamp types
DoNotPanicUA Jul 13, 2021
954995a
Implement source checks: basic and dataset
DoNotPanicUA Jul 13, 2021
d90f96c
Merge remote-tracking branch 'origin/master' into aleonets/1876-sourc…
DoNotPanicUA Jul 13, 2021
e258c82
format
DoNotPanicUA Jul 13, 2021
6107f30
revert: airbyte_protocol.by
DoNotPanicUA Jul 13, 2021
f95c1af
Merge remote-tracking branch 'origin/master' into aleonets/1876-sourc…
DoNotPanicUA Jul 14, 2021
bfa5cf3
internal review update
DoNotPanicUA Jul 14, 2021
d64ce0b
Add possibility to get list of comprehensive tests in a Markdown tabl…
DoNotPanicUA Jul 14, 2021
fd33eed
Merge branch 'master' into aleonets/1876-source-bigquery
heade Jul 15, 2021
6a58540
Update airbyte-integrations/connectors/source-bigquery/src/main/resou…
DoNotPanicUA Jul 16, 2021
d6053f9
review update
DoNotPanicUA Jul 16, 2021
e545247
Implement processing for arrays and structures
DoNotPanicUA Jul 16, 2021
2fd9199
format
DoNotPanicUA Jul 16, 2021
45c7f0c
Merge remote-tracking branch 'origin/master' into aleonets/1876-sourc…
DoNotPanicUA Jul 16, 2021
ebce19c
Merge remote-tracking branch 'origin/aleonets/1876-source-bigquery' i…
heade Jul 20, 2021
46f5b3e
added bigquery secrets
heade Jul 20, 2021
e493468
added bigquery secrets
heade Jul 20, 2021
05067c3
spec fix
heade Jul 22, 2021
449a0b5
test configs fix
heade Jul 22, 2021
a1c02d8
extend mapping for Arrays and Structs
DoNotPanicUA Jul 20, 2021
365b761
Process nested arrays
DoNotPanicUA Jul 26, 2021
749ecba
Merge remote-tracking branch 'origin/master' into aleonets/1876-sourc…
DoNotPanicUA Jul 26, 2021
4a08717
handle arrays of records properly.
DoNotPanicUA Jul 26, 2021
80d541d
format
DoNotPanicUA Jul 26, 2021
5c8c65f
BigQuery source docs
DoNotPanicUA Jul 27, 2021
fd957ab
docs readme update
DoNotPanicUA Jul 27, 2021
8d76778
hide evidences
DoNotPanicUA Jul 27, 2021
8f59837
fix changlog order
DoNotPanicUA Jul 27, 2021
32eeb7f
Merge remote-tracking branch 'origin/master' into aleonets/1876-sourc…
DoNotPanicUA Jul 27, 2021
864580a
Add bigquery to source_defintions yaml
DoNotPanicUA Jul 27, 2021
@@ -0,0 +1,7 @@
{
"sourceDefinitionId": "bfd1ddf8-ae8a-4620-b1d7-55597d2ba08c",
"name": "BigQuery",
"dockerRepository": "airbyte/source-bigquery",
"dockerImageTag": "0.1.1",
"documentationUrl": "https://docs.airbyte.io/integrations/sources/bigquery"
}
@@ -404,3 +404,8 @@
   dockerRepository: airbyte/source-prestashop
   dockerImageTag: 0.1.0
   documentationUrl: https://docs.airbyte.io/integrations/sources/prestashop
+- sourceDefinitionId: bfd1ddf8-ae8a-4620-b1d7-55597d2ba08c
+  name: BigQuery
+  dockerRepository: airbyte/source-bigquery
+  dockerImageTag: 0.1.1
+  documentationUrl: https://docs.airbyte.io/integrations/sources/bigquery
22 changes: 20 additions & 2 deletions airbyte-db/src/main/java/io/airbyte/db/bigquery/BigQueryUtils.java
@@ -90,10 +90,26 @@ private static void setJsonField(Field field, FieldValue fieldValue, ObjectNode
     } else if (fieldValue.getAttribute().equals(Attribute.REPEATED)) {
       ArrayNode arrayNode = node.putArray(fieldName);
       StandardSQLTypeName fieldType = field.getType().getStandardType();
-      fieldValue.getRepeatedValue().forEach(arrayFieldValue -> fillObjectNode(fieldName, fieldType, arrayFieldValue, arrayNode.addObject()));
+      FieldList subFields = field.getSubFields();
+      // Array of primitive
+      if (subFields == null || subFields.isEmpty()) {
+        fieldValue.getRepeatedValue().forEach(arrayFieldValue -> fillObjectNode(fieldName, fieldType, arrayFieldValue, arrayNode.addObject()));
+        // Array of records
+      } else {
+        for (FieldValue arrayFieldValue : fieldValue.getRepeatedValue()) {
+          int count = 0; // named get doesn't work here for some reasons.
+          ObjectNode newNode = arrayNode.addObject();
+          for (Field repeatedField : subFields) {
+            setJsonField(repeatedField, arrayFieldValue.getRecordValue().get(count++),
+                newNode);
+          }
+        }
+      }
     } else if (fieldValue.getAttribute().equals(Attribute.RECORD)) {
       ObjectNode newNode = node.putObject(fieldName);
-      field.getSubFields().forEach(recordField -> setJsonField(recordField, fieldValue.getRecordValue().get(recordField.getName()), newNode));
+      field.getSubFields().forEach(recordField -> {
+        setJsonField(recordField, fieldValue.getRecordValue().get(recordField.getName()), newNode);
+      });
     }
   }

@@ -113,6 +129,8 @@ public static JsonSchemaPrimitive getType(StandardSQLTypeName bigQueryType) {
       case BOOL -> JsonSchemaPrimitive.BOOLEAN;
       case INT64, FLOAT64, NUMERIC, BIGNUMERIC -> JsonSchemaPrimitive.NUMBER;
       case STRING, BYTES, TIMESTAMP, DATE, TIME, DATETIME -> JsonSchemaPrimitive.STRING;
+      case ARRAY -> JsonSchemaPrimitive.ARRAY;
+      case STRUCT -> JsonSchemaPrimitive.OBJECT;
       default -> JsonSchemaPrimitive.STRING;
     };
}
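
The array handling in `setJsonField` above can be sketched without the BigQuery or Jackson classes. What follows is a minimal, hypothetical model and not the connector's code: `SchemaField`, `toJson`, `demo`, and the use of plain `List`/`Map` values are all assumptions introduced for illustration. It tries to show the same idea as the fix: a repeated field with no sub-fields is treated as an array of primitives, while a repeated field with sub-fields is treated as an array of records whose values are paired with sub-fields by position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NestedArraySketch {

  // Hypothetical stand-in for a schema field: a name, a repeated flag, and optional sub-fields.
  static final class SchemaField {
    final String name;
    final boolean repeated;
    final List<SchemaField> subFields;

    SchemaField(String name, boolean repeated, List<SchemaField> subFields) {
      this.name = name;
      this.repeated = repeated;
      this.subFields = subFields;
    }

    static SchemaField primitive(String name) {
      return new SchemaField(name, false, Collections.<SchemaField>emptyList());
    }
  }

  // Converts a raw value into a JSON-like structure (Map for records, List for arrays).
  static Object toJson(SchemaField field, Object value) {
    if (field.repeated) {
      List<Object> out = new ArrayList<>();
      // Each element has the same shape as the field, minus the "repeated" flag.
      SchemaField element = new SchemaField(field.name, false, field.subFields);
      for (Object item : (List<?>) value) {
        out.add(toJson(element, item));
      }
      return out;
    }
    if (!field.subFields.isEmpty()) {
      // Record: values arrive as a positional list, so pair them with sub-fields by index.
      List<?> values = (List<?>) value;
      Map<String, Object> node = new LinkedHashMap<>();
      int index = 0;
      for (SchemaField sub : field.subFields) {
        node.put(sub.name, toJson(sub, values.get(index++)));
      }
      return node;
    }
    return value; // primitive: pass through unchanged
  }

  // Models array<STRUCT<fff String, ggg array<STRUCT<ooo String, kkk int64>>>>.
  static Object demo() {
    SchemaField inner = new SchemaField("ggg", true,
        Arrays.asList(SchemaField.primitive("ooo"), SchemaField.primitive("kkk")));
    SchemaField outer = new SchemaField("col", true,
        Arrays.asList(SchemaField.primitive("fff"), inner));
    List<Object> row = Arrays.<Object>asList(
        Arrays.<Object>asList("qqq", Arrays.<Object>asList(
            Arrays.<Object>asList("fff", 1L),
            Arrays.<Object>asList("hhh", 2L))));
    return toJson(outer, row);
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Running `main` prints `[{fff=qqq, ggg=[{ooo=fff, kkk=1}, {ooo=hhh, kkk=2}]}]`, which has the same nesting as the expected value in the second array test case added by this PR.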
2 changes: 1 addition & 1 deletion airbyte-integrations/connectors/source-bigquery/Dockerfile
@@ -9,5 +9,5 @@ COPY build/distributions/${APPLICATION}*.tar ${APPLICATION}.tar
 RUN tar xf ${APPLICATION}.tar --strip-components=1

 # Airbyte's build system uses these labels to know what to name and tag the docker images produced by this Dockerfile.
-LABEL io.airbyte.version=0.1.0
+LABEL io.airbyte.version=0.1.1
 LABEL io.airbyte.name=airbyte/source-bigquery
21 changes: 21 additions & 0 deletions airbyte-integrations/connectors/source-bigquery/README.md
@@ -0,0 +1,21 @@
# BigQuery Test Configuration

In order to test the BigQuery source, you need a service account key file.

## Community Contributor

As a community contributor, you will need access to a GCP project and BigQuery to run tests.

1. Go to the `Service Accounts` page on the GCP console
1. Click on the `+ Create Service Account` button
1. Fill out a descriptive name/id/description
1. Click the edit icon next to the service account you created on the `IAM` page
1. Add the `BigQuery Data Editor` and `BigQuery User` roles
1. Go back to the `Service Accounts` page and use the actions modal to `Create Key`
1. Download this key as a JSON file
1. Move and rename this file to `secrets/credentials.json`

## Airbyte Employee

1. Access the `BigQuery Integration Test User` secret on Rippling under the `Engineering` folder
1. Create a file with the contents at `secrets/credentials.json`
@@ -17,8 +17,6 @@ dependencies {
     implementation project(':airbyte-integrations:connectors:source-jdbc')
     implementation project(':airbyte-integrations:connectors:source-relational-db')

-    //TODO Add jdbc driver import here. Ex: implementation 'com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre14'
-
     testImplementation testFixtures(project(':airbyte-integrations:connectors:source-jdbc'))

     testImplementation 'org.apache.commons:commons-lang3:3.11'
@@ -316,6 +316,26 @@ protected void initTests() {
             .addInsertValues("STRUCT('s' as frst, 1 as sec, STRUCT(555 as id_col, STRUCT(TIME(15, 30, 00) as time) as mega_obbj) as obbj)")
             .addExpectedValues("{\"frst\":\"s\",\"sec\":1,\"obbj\":{\"id_col\":555,\"mega_obbj\":{\"last_col\":\"15:30:00\"}}}")
             .build());
+
+    addDataTypeTestData(
+        TestDataHolder.builder()
+            .sourceType("array")
+            .fullSourceDataType("array<STRUCT<fff String, ggg int64>>")
+            .airbyteType(JsonSchemaPrimitive.STRING)
+            .createTablePatternSql(CREATE_SQL_PATTERN)
+            .addInsertValues("[STRUCT('qqq' as fff, 1 as ggg), STRUCT('kkk' as fff, 2 as ggg)]")
+            .addExpectedValues("[{\"fff\":\"qqq\",\"ggg\":1},{\"fff\":\"kkk\",\"ggg\":2}]")
+            .build());
+
+    addDataTypeTestData(
+        TestDataHolder.builder()
+            .sourceType("array")
+            .fullSourceDataType("array<STRUCT<fff String, ggg array<STRUCT<ooo String, kkk int64>>>>")
+            .airbyteType(JsonSchemaPrimitive.STRING)
+            .createTablePatternSql(CREATE_SQL_PATTERN)
+            .addInsertValues("[STRUCT('qqq' as fff, [STRUCT('fff' as ooo, 1 as kkk), STRUCT('hhh' as ooo, 2 as kkk)] as ggg)]")
+            .addExpectedValues("[{\"fff\":\"qqq\",\"ggg\":[{\"ooo\":\"fff\",\"kkk\":1},{\"ooo\":\"hhh\",\"kkk\":2}]}]")
+            .build());
   }

   @Override
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -33,6 +33,7 @@
 * [Asana](integrations/sources/asana.md)
 * [AWS CloudTrail](integrations/sources/aws-cloudtrail.md)
 * [Braintree](integrations/sources/braintree.md)
+* [BigQuery](integrations/sources/bigquery.md)
 * [Cart](integrations/sources/cart.md)
 * [ClickHouse](integrations/sources/clickhouse.md)
 * [CockroachDB](integrations/sources/cockroachdb.md)
1 change: 1 addition & 0 deletions docs/integrations/README.md
@@ -18,6 +18,7 @@ Airbyte uses a grading system for connectors to help users understand what to ex
 |[Asana](./sources/asana.md) | Beta |
 |[AWS CloudTrail](./sources/aws-cloudtrail.md)| Beta |
 |[Braintree](./sources/braintree.md)| Alpha |
+|[BigQuery](./sources/bigquery.md)| Beta |
 |[Cart](./sources/cart.md)| Beta |
 |[ClickHouse](./sources/clickhouse.md)| Beta |
 |[CockroachDB](./sources/cockroachdb.md)| Beta |
92 changes: 92 additions & 0 deletions docs/integrations/sources/bigquery.md
@@ -0,0 +1,92 @@
---
description: >-
BigQuery is a serverless, highly scalable, and cost-effective data warehouse
offered by Google Cloud Platform.
---

# BigQuery

## Overview

The BigQuery source supports both Full Refresh and Incremental syncs. You can choose whether this connector copies only new or updated data, or all rows in the tables and columns you set up for replication, every time a sync runs.

### Resulting schema

The BigQuery source does not alter the schema present in your database. Depending on the destination connected to this source, however, the schema may be altered. See the destination's documentation for more details.

### Data type mapping

BigQuery data types are mapped as follows:

| BigQuery Type | Resulting Type | Notes |
| :--- | :--- | :--- |
| `BOOL` | Boolean | |
| `INT64` | Number | |
| `FLOAT64` | Number | |
| `NUMERIC` | Number | |
| `BIGNUMERIC` | Number | |
| `STRING` | String | |
| `BYTES` | String | |
| `DATE` | String | In ISO8601 format |
| `DATETIME` | String | In ISO8601 format |
| `TIMESTAMP` | String | In ISO8601 format |
| `TIME` | String | |
| `ARRAY` | Array | |
| `STRUCT` | Object | |
| `GEOGRAPHY` | String | |
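
The table above can also be read as a simple lookup. The following is a minimal sketch, not the connector's implementation: the class `BigQueryTypeMappingSketch` and the method `resultingType` are hypothetical names, and the lowercase result strings are an assumption made for illustration. Falling back to a string for any unlisted type is likewise an assumption, chosen because every type not handled explicitly in the table resolves to String.

```java
import java.util.Locale;

class BigQueryTypeMappingSketch {

  // Hypothetical helper mirroring the mapping table: BigQuery type name -> resulting type name.
  static String resultingType(String bigQueryType) {
    switch (bigQueryType.toUpperCase(Locale.ROOT)) {
      case "BOOL":
        return "boolean";
      case "INT64":
      case "FLOAT64":
      case "NUMERIC":
      case "BIGNUMERIC":
        return "number";
      case "ARRAY":
        return "array";
      case "STRUCT":
        return "object";
      default:
        // STRING, BYTES, DATE, DATETIME, TIMESTAMP, TIME, GEOGRAPHY, and anything unlisted.
        return "string";
    }
  }

  public static void main(String[] args) {
    System.out.println(resultingType("INT64"));
    System.out.println(resultingType("GEOGRAPHY"));
  }
}
```

Running `main` prints `number` and then `string`.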

### Features

| Feature | Supported | Notes |
| :--- | :--- | :--- |
| Full Refresh Sync | Yes | |
| Incremental Sync| Yes | |
| Change Data Capture | No | |
| SSL Support | Yes | |

## Getting started

### Requirements

To use the BigQuery source, you'll need:

* A Google Cloud Project with BigQuery enabled
* A Google Cloud Service Account with the "BigQuery User" and "BigQuery Data Editor" roles in your GCP project
* A Service Account Key to authenticate into your Service Account

See the setup guide for more information about how to create the required resources.

#### Service account

In order for Airbyte to sync data from BigQuery, it needs credentials for a [Service Account](https://cloud.google.com/iam/docs/service-accounts) with the "BigQuery User" and "BigQuery Data Editor" roles, which grant permissions to run BigQuery jobs, write to BigQuery Datasets, and read table metadata. We highly recommend that this Service Account be exclusive to Airbyte for ease of permissioning and auditing. However, you can use a pre-existing Service Account if you already have one with the correct permissions.

The easiest way to create a Service Account is to follow GCP's guide for [Creating a Service Account](https://cloud.google.com/iam/docs/creating-managing-service-accounts). Once you've created the Service Account, make sure to keep its ID handy as you will need to reference it when granting roles. Service Account IDs typically take the form `<account-name>@<project-name>.iam.gserviceaccount.com`.

Then, add the service account as a Member in your Google Cloud Project with the "BigQuery User" role. To do this, follow the instructions for [Granting Access](https://cloud.google.com/iam/docs/granting-changing-revoking-access#granting-console) in the Google documentation. The email address of the member you are adding is the same as the Service Account ID you just created.

At this point you should have a service account with the "BigQuery User" project-level permission.

#### Service account key

Service Account Keys are used to authenticate as Google Service Accounts. For Airbyte to leverage the permissions you granted to the Service Account in the previous step, you'll need to provide its Service Account Keys. See the [Google documentation](https://cloud.google.com/iam/docs/service-accounts#service_account_keys) for more information about Keys.

Follow the [Creating and Managing Service Account Keys](https://cloud.google.com/iam/docs/creating-managing-service-account-keys) guide to create a key. Airbyte currently supports JSON Keys only, so make sure you create your key in that format. As soon as you have created the key, make sure to download it, as that is the only time Google will allow you to see its contents. Once you've successfully configured BigQuery as a source in Airbyte, delete this key from your computer.

### Set up the BigQuery source in Airbyte

You should now have all the requirements needed to configure BigQuery as a source in the UI. You'll need the following information to configure the BigQuery source:

* **Project ID**
* **Default Dataset ID [Optional]**: the dataset (schema) name, if you are only interested in a single dataset. Setting this dramatically speeds up the source's discover operation.
* **Credentials JSON**: the contents of your Service Account Key JSON file

Once you've configured BigQuery as a source, delete the Service Account Key from your computer.

## CHANGELOG

### source-bigquery

| Version | Date | Pull Request | Subject |
| :--- | :--- | :--- | :--- |
| 0.1.1 | 2021-07-28 | [#4981](https://github.com/airbytehq/airbyte/pull/4981) | 🐛 BigQuery source: Fix nested arrays |
| 0.1.0 | 2021-07-22 | [#4457](https://github.com/airbytehq/airbyte/pull/4457) | 🎉 New Source: Big Query. |