[FEATURE] Starting with an existing schema, exclude rows that do not match the existing schema #98

Closed
yusuffgur opened this issue Oct 31, 2023 · 4 comments

Comments

@yusuffgur

Current Behavior

Whether or not the script starts from an existing schema, it logs every line on which a field's type changes, producing errors like:

Error Log

INFO:root:Problem on line 47730: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,RECORD); new=(hard,dimensionValue,REPEATED,STRING)
INFO:root:Problem on line 47732: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,STRING); new=(hard,dimensionValue,REPEATED,RECORD)

Expected Behavior

For example, our file contains about 100,000 rows, of which only 100 do not match the existing schema. When those nonmatching rows appear consecutively, the script flags only the first of them as problematic, and then flags the first matching row that follows the run, even though that row actually matches the existing schema.

Suggested solution

Add a feature that checks an existing data file against a schema file, excludes the rows that do not match the schema, and writes those rows to a separate JSON/CSV file.

@bxparks
Owner

bxparks commented Oct 31, 2023

Can you attach a minimal data sample that illustrates the issue, say something with 3-4 records? I don't think I understand the expected behavior, and this part of the codebase is tricky. I don't remember all the edge cases, since I no longer use this project personally.

@yusuffgur
Author

For example, in the sample below, records 2-5 are problematic, but the script reports only the 2nd and the 6th (which is not problematic):

{"landingPageClicks":2.0,"costInLocalCurrency":3.211797061516033,"impressions":331.0,"clicks":2.0,"totalEngagements":3.0,"account_id":0000000,"account_name":"dummy","date":"2021-07-10","dimensionValue":[{"servingHoldReasons":["STOPPED","CAMPAIGN_STOPPED"],"lastModifiedAt":1656605739000.0,"content":{"reference":"urn:li:share:6803578573446815744"},"createdAt":1622099537000.0,"review":{"status":"APPROVED"},"id":"urn:li:sponsoredCreative:133406534","lastModifiedBy":"urn:li:system:0","createdBy":"urn:li:person:ghhgfhfgh","isTest":false,"isServing":false,"campaign":"urn:li:sponsoredCampaign:3423525","intendedStatus":"PAUSED","account":"urn:li:sponsoredAccount:234578"}],"dimension":"creative"}
{"landingPageClicks":17.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":25.41813751523781,"impressions":2300.0,"clicks":17.0,"totalEngagements":45.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-03","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":11.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":8.094764692519716,"impressions":602.0,"clicks":11.0,"totalEngagements":15.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-27","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":10.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":15.095423445027421,"impressions":999.0,"clicks":10.0,"totalEngagements":19.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-06","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":19.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":26.39521675040559,"impressions":2982.0,"clicks":19.0,"totalEngagements":45.0,"account_id":0000000,"account_name":"dummy","date":"2021-07-19","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":5.0,"costInLocalCurrency":7.54,"impressions":430.0,"clicks":5.0,"totalEngagements":12.0,"account_id":0000000,"account_name":"dummy","date":"2021-12-25","dimensionValue":[{"servingHoldReasons":["CAMPAIGN_STOPPED","CAMPAIGN_TOTAL_BUDGET_HOLD"],"lastModifiedAt":1656583173000.0,"content":{"reference":"urn:li:share:6879344597777084416"},"createdAt":1640163564000.0,"review":{"status":"APPROVED"},"id":"urn:li:sponsoredCreative:157081644","lastModifiedBy":"urn:li:system:0","createdBy":"urn:li:person:ghhgfhfgh","isTest":false,"isServing":false,"campaign":"urn:li:sponsoredCampaign:235325","intendedStatus":"ACTIVE","account":"urn:li:sponsoredAccount:37865567"}],"dimension":"creative"}

bxparks added a commit that referenced this issue Nov 2, 2023
@bxparks
Owner

bxparks commented Nov 2, 2023

Thanks for that sample data. It helped to track down a latent bug in the handling of multiple type mismatches. The bug has probably existed since the very beginning of the script. The new code fixes the problem with multiple warning messages: it now prints only the first mismatch. The script will now ignore that particular column for all subsequent records, and the resulting schema will not contain the problematic column.
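To make the behavior described above concrete, here is a toy model of it in plain Python. It is only a sketch and not the actual code in bigquery-schema-generator: the names `simple_type` and `deduce_types` are hypothetical, the type names are simplified, and a repeated field is assumed to take the type of its first element. It records the first type mismatch for a column, ignores that column in every later record, and leaves it out of the result.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple


def simple_type(value: Any) -> str:
    """Classify a JSON value into a coarse type name (hypothetical helper)."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, (int, float)):
        return "NUMBER"
    if isinstance(value, str):
        return "STRING"
    if isinstance(value, dict):
        return "RECORD"
    if isinstance(value, list):
        # Assumption: a repeated field takes the type of its first element.
        return "REPEATED " + simple_type(value[0])
    return "UNKNOWN"


def deduce_types(
    records: Iterable[Dict[str, Any]],
) -> Tuple[Dict[str, str], List[str]]:
    """Return (column -> type, problems), reporting one problem per column."""
    schema: Dict[str, str] = {}
    ignored: Set[str] = set()
    problems: List[str] = []
    for line_no, record in enumerate(records, start=1):
        for name, value in record.items():
            # Nulls and empty lists carry no type information in this toy model.
            if name in ignored or value is None or value == []:
                continue
            new_type = simple_type(value)
            old_type = schema.setdefault(name, new_type)
            if old_type != new_type:
                problems.append(
                    f"line {line_no}: ignoring field {name!r}: "
                    f"old={old_type}; new={new_type}"
                )
                # Drop the column and skip it in all subsequent records.
                ignored.add(name)
                del schema[name]
    return schema, problems
```

With records shaped like the sample above (a RECORD list, then STRING lists, then a RECORD list again), this model reports a single problem on the second record and returns a schema without `dimensionValue`.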

Your proposed solution is unfortunately out of scope for bigquery-schema-generator. I understand that it is what you eventually want to do, but I have made the conscious decision to restrict bigquery-schema-generator to be strictly a schema generator, not a data cleanser. There are too many ways people want to sanitize, filter, and massage their data sets, and I don't want to be in the business of supporting those endless variations.

@bxparks
Owner

bxparks commented Jan 13, 2024

I finally got around to releasing v1.6.1 yesterday. PyPI apparently changed its authentication system, so I had to update the release process. I had released v1.6.0 to GitHub back on April 1, 2023, but apparently I never got around to pushing it to PyPI. So the PyPI releases jump from v1.5.1 to v1.6.1.

This release fixes the problem of multiple warning messages. The tool has no opinions on what to do about those misbehaving records. Every project will want to do different things with them. Some will want to drop those records. Some will want to ignore just those columns, instead of ignoring the entire record. Some will want to convert the problematic field values into something else, using rules that are appropriate for the specific project rather than contained in the dataset itself.

Each downstream project needs to figure out how to do its own data cleansing. The bigquery-schema-generator tool is not a data cleanser, so that feature is out of scope for this tool. This project can be used as a Python library, which may be helpful in some of those data cleansing scripts.
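As one illustration of what such a downstream cleansing script might look like, here is a minimal sketch that splits newline-delimited JSON into rows that match an expected field shape and rows that do not. It uses only the standard library and does not call bigquery-schema-generator. The function names `is_list_of_records` and `partition_lines` are hypothetical, and the policy (keep a row when the field is absent, reject lines that do not parse) is an assumption that a real project would adjust.

```python
import json
from typing import Any, Callable, Iterable, List, Tuple


def is_list_of_records(value: Any) -> bool:
    """True when the value looks like a REPEATED RECORD (a list of objects)."""
    return isinstance(value, list) and all(isinstance(v, dict) for v in value)


def partition_lines(
    lines: Iterable[str],
    field: str,
    predicate: Callable[[Any], bool],
) -> Tuple[List[str], List[str]]:
    """Split newline-delimited JSON lines into (kept, rejected)."""
    kept: List[str] = []
    rejected: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: lines that are not valid JSON are rejected.
            rejected.append(line)
            continue
        # Assumption: a row without the field is treated as matching.
        if isinstance(record, dict) and (
            field not in record or predicate(record[field])
        ):
            kept.append(line)
        else:
            rejected.append(line)
    return kept, rejected
```

Calling `partition_lines(open_file, "dimensionValue", is_list_of_records)` would keep rows like the first and last sample records and set aside the ones where `dimensionValue` is a list of strings. The two lists can then be written to separate files. Note that the redacted `"account_id":0000000` in the sample above is not valid JSON because of the leading zeros, so those exact lines would fail to parse and end up in the rejected list.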

@bxparks bxparks closed this as completed Jan 13, 2024