
Add support for existing BigQuery Schema (replace #40) #57

Closed
wants to merge 39 commits

Conversation

@abroglesc (Contributor) commented Nov 9, 2020

This takes the code proposed in PR #40 and attempts to address all outstanding items of concern raised when @bxparks reviewed it. #40 is very stale at this point, and it does not seem that it will ever be merged. I would like to request that we close #40 in favor of this PR, but we should use it as the baseline for reviewing the changes I have made since that last review.

The changes I have made as requested are:

  1. Extend the testdata structure to enable interpreting the existing schema as a string
    commit id: 8fbe52d
  2. Convert the added pytest tests to normal unittest (replaced fixtures with nested for loops, which is less elegant but avoids introducing a second test framework)
    commit id: 1e286f1
  3. Remove the pytest dependency and tox
    commit id: 1e286f1
  4. Add the --existing_schema command line flag for reading an existing schema from a file (super easy); see the sketch after this list
    commit id: 7afdf419a2c4a61c3b356d1b59ad5baaed8cc171
  5. Write some tests that include an existing schema in testdata.txt
    commit id: fff6f5b
  6. Remove ERRORS INFORMED from the data_reader. This didn't make sense to me and I did not see its value; all errors should be part of the returned error_logs.
    commit id: fff6f5b
  7. Remove type_mismatch_callback. It was not properly tested and I didn't feel like maintaining/testing it.
    commit id: c5a57de
  8. Fix a bug where BigQuery does not provide a mode for some nested JSON fields, by defaulting to NULLABLE
    commit id: 96ca4ae
  9. Not requested, but while updating the README I added the bash syntax flag to the generate-schema command examples
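
A minimal sketch of how the flag in item 4 could be wired up. This is hypothetical code, not the actual commit; later review comments refer to the flag as --existing_schema_path, so that spelling is used here:

    import argparse
    import json

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--existing_schema_path',
        help='File containing the existing BigQuery table schema (JSON)')
    args = parser.parse_args()
    if args.existing_schema_path:
        with open(args.existing_schema_path) as f:
            # A BigQuery schema is a JSON array of field definitions.
            existing_schema = json.load(f)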

To summarize the original changes from @bozzzzo:

  1. Break sanitize_names into its own function and call it earlier in the schema generation process, not just during flattening.
  2. Add the ability to begin from an existing BigQuery table schema by reconstituting a schema_map from it (see the sketch below).
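
To make the second item concrete, here is a minimal sketch of reconstituting a schema_map from an existing BigQuery schema. This is assumed code, not the PR's actual implementation; the function name is hypothetical, and the schema_entry shape follows the one quoted later in this thread:

    from collections import OrderedDict

    def existing_schema_to_schema_map(fields):
        # 'fields' is a BigQuery schema: a list of dicts with 'name',
        # 'type', an optional 'mode', and optional nested 'fields'.
        schema_map = OrderedDict()
        for field in fields:
            entry = OrderedDict([
                ('status', 'hard'),  # entries from an existing schema are 'hard'
                ('filled', True),
                ('info', OrderedDict([
                    # BigQuery may omit 'mode' on nested fields (item 8 above);
                    # default to NULLABLE.
                    ('mode', field.get('mode', 'NULLABLE')),
                    ('name', field['name']),
                    ('type', field['type']),
                ])),
            ])
            if field.get('type') == 'RECORD':
                entry['info']['fields'] = existing_schema_to_schema_map(
                    field.get('fields', []))
            # Lower-cased key, matching the casing discussion later in this PR.
            schema_map[field['name'].lower()] = entry
        return schema_map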

@abroglesc (Contributor, Author):

@bxparks This is ready for review. Happy to have your feedback and incorporate anything else that is needed to get this merged. I know this is a big change but it's a particularly useful one when you are loading data into existing tables.

@bozzzzo commented Nov 10, 2020

@abroglesc Thanks for picking this up!

@bxparks (Owner) left a comment:


This is fantastic, thanks for doing this. I tell you, no matter how many recursive functions I write and review, they still make my brain hurt trying to figure out if they are doing the right thing. Most of this looks great, I just have some small nits here and there. The major discussion point is inside merge_schema_entry() where you added some code to handle the situation of a field going from REQUIRED -> NULLABLE due to an existing schema. Please take a look at my notes and proposal, and let me know what you think.

            f'old=({old_status},{full_old_name},{old_mode},{old_type}); '
            f'new=({new_status},{full_new_name},{new_mode},{new_type})')
        return None
    # primitive-types are conservatively deduced NULLABLE. In case we
@bxparks (Owner):

Lines 358-364: This code here is very subtle, it took me a while to figure out what's going on. Let me see if I understand this correctly.

Without an existing schema, the only legal transitions are NULLABLE -> NULLABLE and REPEATED -> REPEATED. When we add an existing schema through --existing_schema_path, we can now have a field going from REQUIRED -> NULLABLE. (The reverse, NULLABLE -> REQUIRED, cannot happen because a REQUIRED field is produced only during flatten_schema(), never during deduce_schema_for_line()).

So now we have 2 cases:

Case 1) REQUIRED -> NULLABLE(filled=True)
Case 2) REQUIRED -> NULLABLE(filled=False).

In Case 1 (filled=True), the obvious thing is to keep the field as REQUIRED (regardless of whether --infer_mode was selected).

In Case 2 (filled=False), we have 2 choices:
Choice 2.1) We can revert the existing REQUIRED into a NULLABLE.
Choice 2.2) We can print out an error message, ignore this field, and continue processing.

I feel like the User should be able to choose 2.1 or 2.2 with a flag, because it's not obvious which behavior is the correct one. And here, it seems like we can overload the meaning of the --infer_mode to choose 2.1. It was originally meant to upgrade NULLABLE -> REQUIRED, but I think it makes sense to overload its meaning to allow the reverse.

Putting all this together, I think it's worth pulling this logic into its own section, above Line 357, and I think it would look something like this:

# If the old field is a REQUIRED primitive (which could only have come from
# an existing schema), the new field can be either a NULLABLE(filled) or a
# NULLABLE(unfilled).
if old_mode == 'REQUIRED' and new_mode == 'NULLABLE':
    # If the new field is filled, then retain the REQUIRED.
    if new_schema_entry['filled']:
        new_info['mode'] = old_mode  # REQUIRED
        new_mode = old_mode
    else:
        # The new field is not filled (i.e. an empty or null field).
        # If --infer_mode is active, then we allow the REQUIRED to revert
        # back to NULLABLE.
        if self.infer_mode:
            old_info['mode'] = new_mode  # NULLABLE
            old_mode = new_mode
        else:
            # Leave the old_mode and new_mode unchanged and different.
            # The mismatch will be detected in the code below.
            pass

[...The following code remains unchanged...]
# For all other types...
if old_mode != new_mode:
    self.log_error(...)
    return None     

@abroglesc (Contributor, Author):

Hi @bxparks, this came from @bozzzzo's initial implementation.

I implemented your suggestion, and then, when writing a unit test case going from REQUIRED --> NULLABLE so I could ensure the errors fired properly, I could never trigger it. After a lot of tracing, this is because, when we deduce an existing schema, we assume the status is 'hard' (since we can't change a type), and the check on line 286 short-circuits:

        # new 'soft' does not clobber old 'hard'
        if old_status == 'hard' and new_status == 'soft':
            return old_schema_entry

Given that we assume all existing-schema entries are 'hard', and that a NULL field in a schema_entry is instantiated with a status of 'soft', we will never reach the condition that relaxes REQUIRED --> NULLABLE.

Code for instantiating a NULL schema_entry:

    elif value_type == '__null__':
        schema_entry = OrderedDict([
            ('status', 'soft'),
            ('filled', False),
            ('info', OrderedDict([
                ('mode', 'NULLABLE'),
                ('name', sanitized_key),
                ('type', 'STRING'),
            ])),
        ])

Should we move the logic for relaxing a field from REQUIRED --> NULLABLE (or NULLABLE --> REQUIRED) further up in this function, or should we delete this status check and use it only when doing type conversions?
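
To illustrate the short-circuit, here is a simplified, hypothetical trace of the values involved:

    # old entry (from the existing schema):    status='hard', mode='REQUIRED'
    # new entry (from a __null__ data value):  status='soft', mode='NULLABLE'
    if old_status == 'hard' and new_status == 'soft':
        return old_schema_entry  # returns before any REQUIRED -> NULLABLE logic runs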

@abroglesc (Contributor, Author):

I broke out the mode determination into its own function; I call it within this soft/hard check and lower in the code where this comment was initially anchored, and it seems to be working. Expect a commit with the change and an updated test case soon.
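
Roughly, the extracted helper might look like this. This is a sketch of its shape only, combining the REQUIRED/NULLABLE rules from the review comment above; the committed version in bb5745c may differ:

    def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
        # Resolve the merged mode, or return None on an irreconcilable mismatch.
        old_mode = old_schema_entry['info']['mode']
        new_mode = new_schema_entry['info']['mode']
        if old_mode == 'REQUIRED' and new_mode == 'NULLABLE':
            if new_schema_entry['filled']:
                return old_mode  # filled value: keep REQUIRED
            if self.infer_mode:
                return new_mode  # --infer_mode: relax REQUIRED to NULLABLE
        if old_mode != new_mode:
            return None  # caller logs the error with base_path context
        return old_mode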

@abroglesc (Contributor, Author):

This has now been implemented with tests and pushed in bb5745c

[Several more review threads on bigquery_schema_generator/generate_schema.py, tests/test_generate_schema.py, and tests/data_reader.py were marked resolved.]
    @@ -216,8 +264,8 @@ def read_schema_section(self):
                     break
                 (tag, _) = self.parse_tag_line(line)
                 if tag in self.TAG_TOKENS:
    -                if tag == 'SCHEMA':
    -                    raise Exception("Unexpected SCHEMA tag")
    +                if tag in ('DATA', 'ERRORS', 'EXISTING_SCHEMA', 'SCHEMA'):
@bxparks (Owner):

Hmm, it's getting harder to keep track of which tags are allowed in which sections. Originally, the order of the tags was just: DATA, [ERRORS], SCHEMA, END. But now it's DATA, [EXISTING_SCHEMA], [ERRORS], SCHEMA, END. A better way would be to allow these sections to appear in any order. But that's a bit out of scope for this PR. If I get motivated, maybe I'll take a crack at it after merging this PR... but realistically, it will probably not rise high enough on my priority list with so many other things going on. Too bad. At least this tidbit is recorded here.
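
For reference, the chunk layout in testdata.txt implied by this ordering (tag skeleton only; section bodies elided, and the optional markers are annotations, not part of the format):

    DATA
    ...
    EXISTING_SCHEMA    (optional)
    ...
    ERRORS             (optional)
    ...
    SCHEMA
    ...
    END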

@abroglesc (Contributor, Author):

Created #59 to keep that nice-to-have visible after we merge this PR.

[Another review thread on tests/test_generate_schema.py was marked resolved.]
@ZebraCat commented Nov 17, 2020

When is this expected to be merged and included in a stable release? We are working with BigQuery in an environment where the schema can change every day, and this solution is perfect for us.

@abroglesc (Contributor, Author):

Fixed the casing issue and added a couple of tests for it as well. I just need to address the rest of the comments. Overall, the logic fixes are mostly implemented; only small changes remain.

@bxparks (Owner) commented Nov 28, 2020

Hi, just want to check in to get a status on this PR. The sooner we get this in, the better, because it takes a fair bit of effort to pull all the context back into my head to review this code properly. I don't work on this code frequently, so if we wait too long, I start forgetting things.

BTW, there's a Makefile checked in at the root. You can run the unit tests and check for flake8 errors by typing make at the top. You don't have to wait until you push to GitHub.

@abroglesc (Contributor, Author):

Hi @bxparks,

I will try to have this all addressed today. I had a bit of time off recently where I disconnected, and I am just getting back into the swing of things.

@abroglesc (Contributor, Author) commented Dec 1, 2020

@bxparks re: using the Makefile to check flake8: I started doing that, but it wasn't scanning the tests/ folder, whereas the CI/CD in GitHub was. I added tests/ to the flake8 Make target in commit 9829bb0.

@abroglesc (Contributor, Author):

@bxparks At this time I believe I have addressed all of your comments. Let me know if there is anything else we need to get this PR merged :)

@abroglesc (Contributor, Author):

I now truly believe this is ready for a final review after addressing the one final comment I missed.

@bxparks (Owner) left a comment:


I replied in one of my embedded comments that this PR has so many commits, including Bozo's commits which are not rebased off develop, that I can no longer follow what it's doing. I added a request in there asking you to squash all these changes into a single commit, rebased off develop, and then send me another PR, which will make it possible to get a sensible git diff.

    -            new_schema_entry = self.get_schema_entry(key, value)
    +            new_schema_entry = self.get_schema_entry(key,
    +                                                     value,
    +                                                     base_path=base_path)
@bxparks (Owner):

Weird, this is fixed on the develop branch... In fact, you fixed it with 0c2b3a0. It looks like the problem is that you took Bozo's #40 but didn't rebase it off develop, so now, this PR has diverged.

    	--count \
    	--ignore W503 \
    	--show-source \
    	--statistics \
    	--max-line-length=80
    	flake8 tests/ \
@bxparks (Owner):

No need to duplicate the entire command, just add tests to the previous flake8 command:

flake8:
    flake8 bigquery_schema_generator tests \
        --count \
        --ignore W503 \
        --show-source \
        --statistics \
        --max-line-length=80

Small nit: I prefer my directories to not have a trailing /, since it is not part of their name. And trailing slashes are actually meaningful in some programs (e.g. rsync(1)), so I'd rather not get in the habit of using them without explicit reasons.

    -        sanitized_key = self.sanitize_name(key)
    +        # We want to use a lowercase version of the key in the schema map so
    +        # that we can aggregate keys with slightly different casing together
    +        sanitized_key = self.sanitize_name(key).lower()
@bxparks (Owner):
Can you rename sanitized_key in this method to something like canonical_key, to avoid confusion with sanitized_key in get_schema_entry()? Also, update the comment to say something like: "The canonical key is the lower-cased version of the sanitized key so that the case of the field name is preserved when generating the schema."
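
That is, something like the following sketch of the requested change (the schema_map lookup line is an assumed usage, added for context):

    # The canonical key is the lower-cased version of the sanitized key so
    # that the case of the field name is preserved when generating the schema.
    canonical_key = self.sanitize_name(key).lower()
    schema_entry = schema_map.get(canonical_key)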

            new_info['type'] = candidate_type
            return new_schema_entry

        def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
@bxparks (Owner):

Ok, this section is the crux of this PR, but unfortunately I am having an incredibly difficult time reviewing it. The GitHub code review tool is so clunky and difficult compared to what I'm used to. And the git diff output is completely messed up because the previous PR was not rebased off the current develop branch, so it's giving me diffs with lines that don't exist anymore, and I can't make any sense of it.

Request: Can you squash merge all your commits and Bozo's commits into a single commit, and send me another PR? I don't know if GitHub allows you to push those changes into this PR, or whether it will create another PR. I suspect the latter. This way, all your changes will be rebased off the current state of the code, so I can figure out what actually changed. Right now, I just cannot make any sense of it...

I think what you need to do is create a new branch, then do something like:

$ git co develop
$ git pull # make sure that you are synced with the repo
$ git co -b {new_branch}
$ git merge --squash abroglesc-existing-schema
$ git push

Then create a PR request on GitHub from {new_branch}.

@abroglesc (Contributor, Author):

@bxparks created #61 as requested.

        # impossible, but more effort than I want to expend right now) to figure
        # out which fields are missing so that we can mark the appropriate
        # schema entries with 'filled=False'.
        if infer_mode and input_format != 'csv':
@bxparks (Owner):

I think this might be a bad idea. I know that you tried to recover some of this functionality in flatten_schema_map(), but I am not sure that's sufficient. The reason is that if --infer_mode is enabled in JSON mode, it will do the wrong thing when JSON lines are missing those fields completely. All of the records will have a filled=True state, so I think the program will incorrectly infer the mode to be REQUIRED. This will break bq load when loading that dataset with this generated schema.

I realize that I was the one who suggested overloading --infer_mode for the REQUIRED -> NULLABLE transition, but I'm starting to think that we need a separate flag for this, to avoid enabling --infer_mode for the JSON format.
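
A contrived illustration of the failure mode described above (hypothetical data, not project code):

    records = [
        {"name": "alice", "age": 11},
        {"name": "bob"},  # "age" is absent entirely, not null
    ]
    # deduce_schema_for_line() only sees the keys present in each record, so
    # "age" never produces a filled=False entry. With --infer_mode, "age"
    # would be deduced REQUIRED, and `bq load` would then reject record 2.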

@abroglesc (Contributor, Author):

Will continue on the next PR I submit with squashed commits, but why would this be bad, given that I added one more check to apply this in flatten_schema_map() only if the input_format is CSV? If it's a JSON input format, it will not elevate these fields to REQUIRED.
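
Presumably the guard looks something like this (an assumed shape inside flatten_schema_map(), not the exact committed diff):

    if infer_mode and input_format == 'csv' \
            and info['mode'] == 'NULLABLE' and schema_entry['filled']:
        # Only CSV inputs get the NULLABLE -> REQUIRED upgrade, since only
        # CSV reliably produces filled=False for missing fields.
        info['mode'] = 'REQUIRED'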

bxparks added a commit that referenced this pull request Dec 5, 2020
Add support for existing BigQuery Schema with a single squashed commit (replace #57)
@bxparks (Owner) commented Dec 5, 2020

Superseded by #61, closing.

@bxparks closed this Dec 5, 2020
@bxparks mentioned this pull request Dec 5, 2020