
Add support for existing BigQuery Schema (replace #40) #57

Closed
wants to merge 39 commits into from
41f24cc
Add support for starting from a known schema_map
bozzzzo Mar 22, 2020
99c5168
A SCHEMA tag can not be followed by DATA, ERROR or SCHEMA tag
bozzzzo Mar 24, 2020
783e1f1
Convert tests to pytest.
bozzzzo Mar 22, 2020
bbbb35c
Add support for starting from a known BQ schema
bozzzzo Mar 24, 2020
040d801
Add support for tox to test across python versions
bozzzzo Mar 24, 2020
ea5107e
BQ schema is case insensitive, track keys by their lowercase value
bozzzzo Mar 24, 2020
35d5cec
lowercase field names when generating schema map from schema
bozzzzo Apr 7, 2020
444a5b9
Keep schema map index case insensitive, but preserve original field case
bozzzzo Apr 23, 2020
265b3c3
case sensitivity fix - preserve key name from old/original schema, if…
matevz-digiverse Jun 16, 2020
ee47847
add an optional callback to SchemaGenerator - allow the caller to pro…
matevz-digiverse Aug 19, 2020
8fbe52d
Modified data_reader class to read in an existing schema
abroglesc Nov 6, 2020
e15bfe9
Fixed bug related to sanitization within CSV datasets
abroglesc Nov 6, 2020
1e286f1
Migrated off of pytest and back to unittest. Coverted fixtures into f…
abroglesc Nov 6, 2020
7afdf41
Adding command-line flag for starting from existing bigquery schema
abroglesc Nov 6, 2020
96ca4ae
Added default NULLABLE mode when bigquery does not provide one
abroglesc Nov 9, 2020
fff6f5b
Removing errors informed from tests. Adding additional test cases inc…
abroglesc Nov 9, 2020
c5a57de
Removing type_mismatch_callback as this was untested
abroglesc Nov 9, 2020
5ef1427
Merge branch 'develop' into abroglesc-existing-schema
abroglesc Nov 9, 2020
098dbba
Fixing tests post merging from develop
abroglesc Nov 9, 2020
608b3fd
Removing tox from gitignore
abroglesc Nov 9, 2020
dd0db5f
Updating README to include details on existing_schema_path
abroglesc Nov 9, 2020
de27694
Fixing Flake8 errors
abroglesc Nov 9, 2020
7d1b4ce
Actually fully fixing flake8 tests
abroglesc Nov 9, 2020
ed60a7f
Removing old generator.run function call
abroglesc Nov 16, 2020
bb35faf
Removing unused test function
abroglesc Nov 16, 2020
23f8c40
Keeping case sensitivity rather than converting everything to lowerca…
abroglesc Nov 17, 2020
39f8222
Fixing error logging bug related to base_path not being passed to get…
abroglesc Nov 18, 2020
efcb2fa
Adding additional json_full_path error locations
abroglesc Nov 18, 2020
865e270
Fixing flake8 error
abroglesc Nov 18, 2020
bb5745c
Allow infer_schema to control relaxing mode when using existing_schem…
abroglesc Dec 1, 2020
411f402
Renaming line --> line_number
abroglesc Dec 1, 2020
9829bb0
Updating make flake8 task to also scan tests/ folder since CI/CD does…
abroglesc Dec 1, 2020
b556e0b
Fix flake8 on tests/
abroglesc Dec 1, 2020
be3a37a
Revert read_errors_section logic to original
abroglesc Dec 1, 2020
fae9581
Convert .format into f strings
abroglesc Dec 1, 2020
d41556e
Revert generator for testcases to original loop method
abroglesc Dec 1, 2020
38523e2
Added a test for standard sql types to legacy type conversion FLOAT64…
abroglesc Dec 1, 2020
989c2f8
Added additional 2 standard to legacy type conversions to test
abroglesc Dec 1, 2020
82593a4
Fixed bug where we used infer_mode to set a field as REQUIRED for a j…
abroglesc Dec 1, 2020
9 changes: 8 additions & 1 deletion Makefile
@@ -6,9 +6,16 @@ tests:
python3 -m unittest

flake8:
flake8 bigquery_schema_generator/ \
--count \
--ignore W503 \
--show-source \
--statistics \
--max-line-length=80
flake8 tests/ \
Comment by the repository owner:
No need to duplicate the entire command; just add `tests` to the previous flake8 command:

flake8:
    flake8 bigquery_schema_generator tests \
        --count \
        --ignore W503 \
        --show-source \
        --statistics \
        --max-line-length=80

Small nit: I prefer my directories to not have a trailing /, since it is not part of their name. And trailing slashes are actually meaningful in some programs (e.g. rsync(1)), so I'd rather not get in the habit of using them without explicit reasons.

--count \
--ignore W503 \
--show-source \
--statistics \
--max-line-length=80

59 changes: 42 additions & 17 deletions README.md
@@ -235,13 +235,14 @@ as shown by the `--help` flag below.

Print the built-in help strings:

```bash
$ generate-schema --help
usage: generate-schema [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
                       [--quoted_values_are_strings] [--infer_mode]
                       [--debugging_interval DEBUGGING_INTERVAL]
                       [--debugging_map] [--sanitize_names]
                       [--ignore_invalid_lines]
                       [--existing_schema_path EXISTING_SCHEMA_PATH]

Generate BigQuery schema from JSON or CSV file.

@@ -261,6 +262,10 @@ optional arguments:
standard
--ignore_invalid_lines
Ignore lines that cannot be parsed instead of stopping
--existing_schema_path EXISTING_SCHEMA_PATH
File that contains the existing BigQuery schema for a
table. This can be fetched with: `bq show --schema
<project_id>:<dataset>.<table_name>`
```

#### Input Format (`--input_format`)
@@ -282,7 +287,7 @@ array or empty record as its value, the field is suppressed in the schema file.
This flag enables this field to be included in the schema file.

In other words, using a data file containing just nulls and empty values:
```bash
$ generate_schema
{ "s": null, "a": [], "m": {} }
^D
Expand All @@ -291,7 +296,7 @@ INFO:root:Processed 1 lines
```

With the `keep_nulls` flag, we get:
```bash
$ generate-schema --keep_nulls
{ "s": null, "a": [], "m": {} }
^D
@@ -331,7 +336,7 @@ consistent with the algorithm used by `bq load`. However, for the `BOOLEAN`,
normal strings instead. This flag disables type inference for `BOOLEAN`,
`INTEGER` and `FLOAT` types inside quoted strings.
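To make the rule concrete, here is an illustrative Python sketch of this inference logic (a simplification, not the library's actual implementation; the function name is hypothetical):

```python
def infer_quoted_type(value: str, quoted_values_are_strings: bool = False) -> str:
    """Infer the BigQuery type of a quoted (string-typed) value."""
    if quoted_values_are_strings:
        # The flag disables type inference inside quoted strings.
        return "STRING"
    if value.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        return "STRING"

print(infer_quoted_type("1"))                                  # INTEGER
print(infer_quoted_type("1", quoted_values_are_strings=True))  # STRING
```

With the flag off, `"1"` is treated as an `INTEGER`; with the flag on, it stays a `STRING`, matching the behavior described above.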

```bash
$ generate-schema
{ "name": "1" }
^D
@@ -365,6 +370,12 @@ feature for JSON files, but too difficult to implement in practice because
fields are often completely missing from a given JSON record (instead of
explicitly being defined to be `null`).

In addition to the above, when this option is used in conjunction with
`--existing_schema_path`, fields can be relaxed from `REQUIRED` to `NULLABLE`
if they were `REQUIRED` in the existing schema and null values are found in the
new data from which we are inferring the schema. In this case, the option can
be used with either input format, CSV or JSON.
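A minimal sketch of that relaxation rule, assuming a per-field check (the function and its parameters are hypothetical, not the library's API):

```python
def relaxed_mode(existing_mode: str, saw_null: bool, infer_mode: bool) -> str:
    """Relax a field's mode from REQUIRED to NULLABLE when the new data
    contains nulls for it and --infer_mode is enabled."""
    if existing_mode == "REQUIRED" and saw_null and infer_mode:
        return "NULLABLE"
    return existing_mode

print(relaxed_mode("REQUIRED", saw_null=True, infer_mode=True))   # NULLABLE
print(relaxed_mode("REQUIRED", saw_null=True, infer_mode=False))  # REQUIRED
```

Without `--infer_mode`, a `REQUIRED` field encountering a null would instead be an error, as in the existing behavior.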

See [Issue #28](https://github.com/bxparks/bigquery-schema-generator/issues/28)
for implementation details.

@@ -374,7 +385,7 @@ By default, the `generate_schema.py` script prints a short progress message
every 1000 lines of input data. This interval can be changed using the
`--debugging_interval` flag.

```bash
$ generate-schema --debugging_interval 50 < file.data.json > file.schema.json
```

@@ -385,7 +396,7 @@ the bookkeeping metadata map which is used internally to keep track of the
various fields and their types that were inferred using the data file. This
flag is intended to be used for debugging.

```bash
$ generate-schema --debugging_map < file.data.json > file.schema.json
```

@@ -435,6 +446,20 @@ deduction logic will handle any missing or extra columns gracefully.
Fixes [Issue
#49](https://github.com/bxparks/bigquery-schema-generator/issues/49).

#### Existing Schema Path (`--existing_schema_path`)
There are cases where we would like to start from an existing BigQuery table
schema rather than from scratch with a new batch of data to load. In that case,
we can specify the path to a local file on disk containing the existing
BigQuery table schema. The file can be generated with the following `bq` CLI
command:
```bash
bq show --schema <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME> > existing_table_schema.json
```

We can then run `generate-schema` with the additional option:
```bash
--existing_schema_path existing_table_schema.json
```
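For reference, the schema file produced by `bq show --schema` is a JSON array of field definitions; a minimal example (with hypothetical field names) looks like this:

```json
[
  {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"}
]
```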

## Schema Types

### Supported Types
@@ -534,7 +559,7 @@ compatibility rules implemented by **bq load**:
Here is an example of a single JSON data record on the STDIN (the `^D` below
means typing Control-D, which indicates "end of file" under Linux and MacOS):

```bash
$ generate-schema
{ "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
^D
@@ -569,7 +594,7 @@ INFO:root:Processed 1 lines
```

In most cases, the data file will be stored in a file:
```bash
$ cat > file.data.json
{ "a": [1, 2] }
{ "i": 3 }
@@ -596,7 +621,7 @@ $ cat file.schema.json
Here is the schema generated from a CSV input file. The first line is the header
containing the names of the columns, and the schema lists the columns in the
same order as the header:
```bash
$ generate-schema --input_format csv
e,b,c,d,a
1,x,true,,2.0
@@ -634,7 +659,7 @@ INFO:root:Processed 3 lines
```

Here is an example of the schema generated with the `--infer_mode` flag:
```bash
$ generate-schema --input_format csv --infer_mode
name,surname,age
John
@@ -701,15 +726,15 @@ json.dump(schema, output_file, indent=2)

I wrote the `bigquery_schema_generator/anonymize.py` script to create an
anonymized data file `tests/testdata/anon1.data.json.gz`:
```bash
$ ./bigquery_schema_generator/anonymize.py < original.data.json \
> anon1.data.json
$ gzip anon1.data.json
```
This data file is 290MB (5.6MB compressed) with 103080 data records.

Generating the schema using
```bash
$ bigquery_schema_generator/generate_schema.py < anon1.data.json \
> anon1.schema.json
```