[FEATURE] Starting with an existing schema, exclude rows that do not match the existing schema #98
Comments
Can you attach a minimal data sample that illustrates the issue, say 3-4 records? I don't think I understand the expected behavior. This part of the codebase is tricky, and I don't remember all the edge cases because I no longer use this project personally.
For example, rows 2-5 are problematic, but the script reports only the 2nd and the 6th (which is not problematic).
Thanks for that sample data. It helped to track down a latent bug in the handling of multiple type mismatches. The bug has probably existed since the very beginning of the script. The new code fixes the problem with multiple warning messages: it now prints only the first mismatch. The script then ignores that particular column for all subsequent records, and the resulting schema does not contain the problematic column. Your proposed solution is unfortunately out of scope for bigquery-schema-generator. I understand that it is what you eventually want to do, but I have made the conscious decision to restrict bigquery-schema-generator to be strictly a schema generator, not a data cleanser. There are too many ways people want to sanitize, filter, and massage their data sets, and I don't want to be in the business of supporting those endless variations.
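The first-mismatch behavior described in this comment can be pictured with a toy model. This is only a sketch under simplifying assumptions, not the code in bigquery-schema-generator: the function name `infer_types`, the use of Python type names instead of BigQuery types, and the message wording are all made up for illustration.

```python
def infer_types(records):
    """Toy model: remember the first type seen for each column; on the first
    conflict, log one message and ignore that column from then on."""
    schema = {}       # column name -> type name of the first value seen
    ignored = set()   # columns dropped after a mismatch
    messages = []
    for line_no, record in enumerate(records, start=1):
        for key, value in record.items():
            if key in ignored:
                continue  # already reported once; stay silent afterwards
            new_type = type(value).__name__
            old_type = schema.setdefault(key, new_type)
            if old_type != new_type:
                messages.append(
                    f"line {line_no}: ignoring column {key!r}: "
                    f"old={old_type}; new={new_type}"
                )
                ignored.add(key)
                del schema[key]  # the final schema omits the column
    return schema, messages


# Made-up records: column "a" flips between int and str several times.
records = [
    {"a": 1, "b": "x"},
    {"a": "s", "b": "y"},
    {"a": 2, "b": "z"},
    {"a": "t"},
]
schema, messages = infer_types(records)
```

In this model the flip-flopping column produces exactly one message, for the record where the mismatch first appears, and the column is absent from the resulting schema.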
I finally got around to releasing v1.6.1 yesterday. PyPI apparently changed its authentication system, so I had to update the release process. I had released v1.6.0 to GitHub back on April 1, 2023, but apparently I never got around to pushing it to PyPI. So PyPI releases jump from v1.5.1 to v1.6.1. This release fixes the problem of multiple warning messages. The tool has no opinions on what to do about those misbehaving records. Every project will want to do different things with them. Some will want to drop those records. Some will want to ignore just those columns, instead of ignoring the entire record. Some will want to convert the problematic field values into something else, using rules that are specific to the project rather than contained in the dataset itself. Each downstream project needs to figure out how to do the data cleansing. The bigquery-schema-generator tool is not a data cleanser, so that feature is out of scope for this tool. This project can be used as a Python library, which may be helpful in some of those data cleansing scripts.
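As one illustration of what such a downstream cleansing script might look like, here is a minimal standard-library sketch that splits newline-delimited JSON rows into those that fit a BigQuery-style schema (a list of `name`/`type`/`mode` entries) and those that do not. Everything in it is an assumption for illustration: the helpers `row_matches` and `split_rows` are hypothetical and not part of bigquery-schema-generator, only a handful of types are checked, and the sample rows are made up (the `dimensionValue` name is borrowed from the log in this issue).

```python
import json


def _matches(value, field):
    """Return True if one JSON value fits one schema field."""
    mode = field.get("mode", "NULLABLE")
    if value is None:
        return mode != "REQUIRED"
    if mode == "REPEATED":
        if not isinstance(value, list):
            return False
        item_field = dict(field, mode="NULLABLE")  # check each element alone
        return all(v is not None and _matches(v, item_field) for v in value)
    ftype = field["type"]
    if ftype == "RECORD":
        return isinstance(value, dict) and row_matches(value, field.get("fields", []))
    if ftype == "STRING":
        return isinstance(value, str)
    if ftype == "BOOLEAN":
        return isinstance(value, bool)
    if ftype == "INTEGER":
        return isinstance(value, int) and not isinstance(value, bool)
    if ftype == "FLOAT":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return True  # types not modeled in this sketch are accepted unchecked


def row_matches(row, schema):
    """A row matches if it has no unknown keys and every field fits."""
    by_name = {f["name"]: f for f in schema}
    if any(key not in by_name for key in row):
        return False
    return all(_matches(row.get(f["name"]), f) for f in schema)


def split_rows(lines, schema):
    """Partition NDJSON lines into (matching, nonmatching) lists."""
    good, bad = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        ok = isinstance(row, dict) and row_matches(row, schema)
        (good if ok else bad).append(line)
    return good, bad


# Made-up schema and rows; the second row has RECORD items where STRING is expected.
schema = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "dimensionValue", "type": "STRING", "mode": "REPEATED"},
]
lines = [
    '{"id": 1, "dimensionValue": ["a"]}',
    '{"id": 2, "dimensionValue": [{"k": "v"}]}',
    '{"id": 3, "dimensionValue": ["b", "c"]}',
]
good, bad = split_rows(lines, schema)
```

The `bad` list can then be written to a separate file, dropped, or repaired, depending on what the project needs. Each row is checked independently against the fixed schema, so a matching row that follows a run of nonmatching rows is not affected by them.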
Current Behavior
Whether or not it starts with an existing schema, when the script encounters a type change it logs the offending line, producing messages like:
Error Log
INFO:root:Problem on line 47730: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,RECORD); new=(hard,dimensionValue,REPEATED,STRING)
INFO:root:Problem on line 47732: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,STRING); new=(hard,dimensionValue,REPEATED,RECORD)
Expected Behavior
For example, our file has about 100,000 rows, of which only 100 do not match the existing schema. When those nonmatching rows are consecutive, the script flags only the first of them as problematic, and then flags the first matching row that follows the run, even though that row actually matches the existing schema.
Suggested solution
Add a new feature that checks an existing file against a schema file, excludes the rows that do not match the schema, and writes them to a separate JSON/CSV file.