[FEATURE] Starting with an existing schema, exclude rows that do not match the existing schema #98

yusuffgur · 2023-10-31T10:49:20Z

Current Behavior

Whether or not the script starts with an existing schema, when it encounters a field whose type has changed, it logs the offending line and reports errors like:

Error Log

INFO:root:Problem on line 47730: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,RECORD); new=(hard,dimensionValue,REPEATED,STRING)
INFO:root:Problem on line 47732: Ignoring field with mismatched type: old=(hard,dimensionValue,REPEATED,STRING); new=(hard,dimensionValue,REPEATED,RECORD)

Expected Behavior

For example, our file contains about 100,000 rows, of which only 100 do not match the existing schema. When those non-matching rows are consecutive, the script flags only the first of them as problematic, and then flags the first matching row after the run as problematic, even though that row actually matches the existing schema.

Suggested solution

Add a feature that checks an existing file against a schema file, excludes the rows that do not match the schema, and writes those rows to a separate JSON/CSV file.

bxparks · 2023-10-31T17:05:04Z

Can you attach a minimal data sample that illustrates the issue, say something with 3-4 records? I don't think I understand the expected behavior. This part of the codebase is tricky, and I don't remember all the edge cases since I don't use this project personally anymore.

yusuffgur · 2023-10-31T18:50:14Z

For example, in the sample below records 2-5 are problematic, but the script reports only record 2 and record 6 (which is not problematic).

{"landingPageClicks":2.0,"costInLocalCurrency":3.211797061516033,"impressions":331.0,"clicks":2.0,"totalEngagements":3.0,"account_id":0000000,"account_name":"dummy","date":"2021-07-10","dimensionValue":[{"servingHoldReasons":["STOPPED","CAMPAIGN_STOPPED"],"lastModifiedAt":1656605739000.0,"content":{"reference":"urn:li:share:6803578573446815744"},"createdAt":1622099537000.0,"review":{"status":"APPROVED"},"id":"urn:li:sponsoredCreative:133406534","lastModifiedBy":"urn:li:system:0","createdBy":"urn:li:person:ghhgfhfgh","isTest":false,"isServing":false,"campaign":"urn:li:sponsoredCampaign:3423525","intendedStatus":"PAUSED","account":"urn:li:sponsoredAccount:234578"}],"dimension":"creative"}
{"landingPageClicks":17.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":25.41813751523781,"impressions":2300.0,"clicks":17.0,"totalEngagements":45.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-03","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":11.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":8.094764692519716,"impressions":602.0,"clicks":11.0,"totalEngagements":15.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-27","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":10.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":15.095423445027421,"impressions":999.0,"clicks":10.0,"totalEngagements":19.0,"account_id":0000000,"account_name":"dummy","date":"2021-06-06","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":19.0,"pivotValues_":[{"message":"Call to downstream service failed. Downstream Service Exception: Cannot fetch this creative because the referenced post does not exist.","status":400.0}],"costInLocalCurrency":26.39521675040559,"impressions":2982.0,"clicks":19.0,"totalEngagements":45.0,"account_id":0000000,"account_name":"dummy","date":"2021-07-19","dimensionValue":["urn:li:sponsoredCreative:133406804"],"dimension":"creative"}
{"landingPageClicks":5.0,"costInLocalCurrency":7.54,"impressions":430.0,"clicks":5.0,"totalEngagements":12.0,"account_id":0000000,"account_name":"dummy","date":"2021-12-25","dimensionValue":[{"servingHoldReasons":["CAMPAIGN_STOPPED","CAMPAIGN_TOTAL_BUDGET_HOLD"],"lastModifiedAt":1656583173000.0,"content":{"reference":"urn:li:share:6879344597777084416"},"createdAt":1640163564000.0,"review":{"status":"APPROVED"},"id":"urn:li:sponsoredCreative:157081644","lastModifiedBy":"urn:li:system:0","createdBy":"urn:li:person:ghhgfhfgh","isTest":false,"isServing":false,"campaign":"urn:li:sponsoredCampaign:235325","intendedStatus":"ACTIVE","account":"urn:li:sponsoredAccount:37865567"}],"dimension":"creative"}

bxparks · 2023-11-02T15:57:49Z

Thanks for that sample data. It helped track down a latent bug in the handling of multiple type mismatches. The bug has probably existed since the very beginning of the script. The new code fixes the problem with multiple warning messages. It now prints only the first mismatch. The script will now ignore that particular column for all subsequent records, and the resulting schema will not contain the problematic column.
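To make the described behavior concrete, here is a toy model of "report the first mismatch, then ignore that column". It is only a sketch: the function name `deduce_types` and the use of Python type names in place of BigQuery types are assumptions made for illustration, and this is not the actual code in generate_schema.py.

```python
def deduce_types(records):
    """Toy model (hypothetical, not the real generate_schema.py logic).

    Infers one type name per top-level field. A field whose type changes
    is reported once, ignored for all later records, and left out of the
    resulting schema.
    """
    schema = {}      # field name -> type name seen so far
    ignored = set()  # fields already reported as mismatched
    warnings = []    # (line_number, message) pairs

    for line_number, record in enumerate(records, start=1):
        for name, value in record.items():
            if name in ignored:
                # Already reported once; stay silent for later records.
                continue
            new_type = type(value).__name__
            old_type = schema.setdefault(name, new_type)
            if old_type != new_type:
                warnings.append((
                    line_number,
                    f"Ignoring field with mismatched type: "
                    f"old=({name},{old_type}); new=({name},{new_type})",
                ))
                ignored.add(name)
                del schema[name]  # the problematic column is dropped
    return schema, warnings
```

In this model, a run of records whose `dimensionValue` flips between a list of objects and a list of strings would need a finer-grained type check than `type(value).__name__`, which is one reason this should be read as an illustration of the bookkeeping only.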

Your proposed solution is unfortunately out of scope for bigquery-schema-generator. I understand that it is what you eventually want to do, but I have made the conscious decision to restrict bigquery-schema-generator to be strictly a schema generator, not a data cleanser. There are too many ways people want to sanitize, filter, and massage their data sets, and I don't want to be in the business of supporting those endless variations.

bxparks · 2024-01-13T15:50:12Z

I finally got around to releasing v1.6.1 yesterday. PyPI apparently changed its authentication system, so I had to update the release process. I had released v1.6.0 to GitHub back on April 1, 2023, but apparently I never got around to pushing it to PyPI. So the PyPI releases jump from v1.5.1 to v1.6.1.

This release fixes the problem of multiple warning messages. The tool has no opinions on what to do about those misbehaving records. Every project will want to do different things with them. Some will want to drop those records. Some will want to ignore just those columns, instead of ignoring the entire record. Some will want to convert the problematic field values into something else, using rules that are appropriate for the specific project rather than contained in the dataset itself.

Each downstream project needs to figure out how to do the data cleansing. The bigquery-schema-generator tool is not a data cleanser, so that feature is out of scope of this tool. This project can be used as a Python library, which may be helpful in some of those data cleansing scripts.
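As one possible shape for such a downstream script, the sketch below splits newline-delimited JSON rows into those where every element of a repeated field has the expected Python type and those where it does not. It uses only the standard library; the helper name `split_rows`, its parameters, and the choice to treat a missing field as matching are assumptions for illustration, not part of bigquery-schema-generator.

```python
import json


def split_rows(lines, field="dimensionValue", expected=dict):
    """Hypothetical helper: partition JSON lines by the shape of one field.

    A row "matches" when `field` is a list whose elements are all instances
    of `expected` (dict stands in for a REPEATED RECORD). Rows where the
    field is absent are treated as matching, which is an assumption.
    """
    matching, rejected = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        values = row.get(field, [])
        if isinstance(values, list) and all(
            isinstance(v, expected) for v in values
        ):
            matching.append(row)
        else:
            rejected.append(row)
    return matching, rejected
```

The rejected rows could then be written to a separate file with `json.dumps`, one row per line, while only the matching rows are loaded into BigQuery.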

bxparks added a commit that referenced this issue Nov 2, 2023

generate_schema.py: fix schema amnesia which prints multiple type mis…

79c7ef2

…match warnings (see #98)

bxparks mentioned this issue Jan 12, 2024

merge 1.6.1 into master #99

Merged

renovate bot mentioned this issue Jan 13, 2024

chore(deps): update dependency bigquery-schema-generator to v1.6.1 bharatgoelharness/harness-core#10

Open

1 task

bxparks closed this as completed Jan 13, 2024
