Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infer 'REQUIRED' mode with a flag for consistently filled in values #28

Merged
merged 6 commits into from
Mar 4, 2019

Conversation

korotkevics
Copy link
Contributor

Hello Brian!

Sorry for making you wait.

Here is what I wanted to have.

I've been rereading code and I've understood your confusion.
You use bq load and it expects one to loose mode to either NULLABLE or REPEATED. The thing is that I do not use bq load. :) I first analyze a few given files, decide on a schema, create a table, and then start loading a lot of data continuously (I wouldn't like to discuss it too deeply). If we allow users to decide on mode only when a given flag is set, it should remain backward compatible, however please do check if I haven't broken things :)

To me it seems like a general purpose functionality therefore I would prefer it to be open source so I really hope you won't mind having it.

If you agree to this, feel free to commit directly to my branch or branch out unless you strongly prefer to rewrite :) Also, if you just do a code review could be helpful as I am less experienced with Python.

Have a nice weekend.

@bxparks
Copy link
Owner

bxparks commented Mar 4, 2019

Thanks. I made a small change to implement this during the schema output (i.e. in flatten_schema_map()) instead of converting the NULLABLE into a REQUIRED during input data processing. I tried to upload the change directly into your forked repo, but didn't have permission (I'm not sure I fully understand how GitHub implements this workflow), so I uploaded it to my repo here:
https://github.com/bxparks/bigquery-schema-generator/tree/korotkevics-feature/infer_mode
(i.e. the branch name is the same as yours).

I also had to restrict --infer_mode to CSV files only. It doesn't really make sense for JSON files since JSON fields are often completely missing in some records compared to other records.

@korotkevics
Copy link
Contributor Author

korotkevics commented Mar 4, 2019

@bxparks nice, I've run it locally, it seems to be what I wanted and written in a shorter way :) You may consider adding tests :)

Could you please tell me when it would normally become available on pip (to be installed from repo).

@bxparks
Copy link
Owner

bxparks commented Mar 4, 2019

Cool. I'll merge into develop, then let it sit there for a day or two, to give me a chance to clean up small loose ends (There always something). Then push to PyPI in a few days.

@bxparks bxparks merged commit 6c52455 into bxparks:develop Mar 4, 2019
@korotkevics
Copy link
Contributor Author

@bxparks hm, yes it seems there's :P, I've just run it over a different file and it does not seem to support TIMESTAMP (example: "2018-07-23 12:42:21.6830000") as it recognizes it as STRING. Does the library support this type?

@bxparks
Copy link
Owner

bxparks commented Mar 4, 2019

BigQuery (more precisely 'bq load') supports only 6 decimal places for the 'second' component. You should write a filter to normalize the data. You might say that this belongs in bigquery-schema-generator but I would push back a little on that. To me, this is a data-cleansing operation that belongs in the 'T' part of the ETL pipeline.

@korotkevics
Copy link
Contributor Author

@bxparks ah I see it is not a TIMESTAMP which'd be recognized by BQ as such. Thank you!

@korotkevics korotkevics deleted the feature/infer_mode branch March 4, 2019 08:20
@bxparks bxparks mentioned this pull request Mar 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants