-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🏗️ Python CDK: add schema transformer class #6139
Conversation
a99b8c1
to
a21e032
Compare
a21e032
to
abde8f9
Compare
abde8f9
to
d912dd4
Compare
c445fde
to
14316be
Compare
14316be
to
d63db55
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may add some additional tests to check the the poorly formatted schemas.
For instance:
type: "object"
, butproperties
is not declaredtype: "array"
with and withoutitems
prop declared
I don't also see the transform case foranyOf
prop.
Overall it would nice to see all of the possible schema props covered (see http://json-schema.org/draft-07/schema#). Or at least we should document what is supported and what's not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like this as a way to make syncs more robust against API changes. Nice approach.
Added various comments with some requests for changes/additions.
Two more high-level points I'd like to understand:
- How impactful is this on performance? We're adding a non-trivial set of steps and field mutation to every single airbyte record so I think it would be good diligence to check how severely this impacts performance. If it's major, there may need to be some optimisation-led refactoring to achieve the same result.
- How does this handle additional fields in the record that aren't present in the schema? Fail sync? Would be good to make this very clear so as a developer I would know how including Transformer usage affects additional properties.
Ive done some performance measures with ads_insights facebook schema (most complex schema I found) and example object (see under Test data spoiler).
On my PC (AMD Ryzen 7 5800X) it took 0.8 milliseconds per one object. As you can see most time (~75%) is taken by jsonschema traverse/validation routine and very little (less than 10 %) by actual converting. Test dataobject:{
"account_currency": "USD",
"account_id": "212551616838260",
"account_name": "Airbyte",
"ad_id": "23846784938030398",
"ad_name": "Stock photo ad 2",
"adset_id": "23846765228280398",
"adset_name": "Vanilla awareness ad set",
"buying_type": "AUCTION",
"campaign_id": "23846765228240398",
"campaign_name": "Airbyte Awareness Campaign 1 (sherif)",
"clicks": 0,
"conversion_rate_ranking": "UNKNOWN",
"cost_per_estimated_ad_recallers": 0.04,
"cpm": 6.779661,
"cpp": 6.837607,
"created_time": "2021-02-11",
"ctr": 0.0,
"date_start": "2021-08-27",
"date_stop": "2021-08-27",
"engagement_rate_ranking": "UNKNOWN",
"estimated_ad_recall_rate": 17.094017,
"estimated_ad_recallers": 20.0,
"frequency": 1.008547,
"impressions": 118,
"inline_link_clicks": 0,
"inline_post_engagement": 0,
"objective": "BRAND_AWARENESS",
"optimization_goal": "AD_RECALL_LIFT",
"quality_ranking": "UNKNOWN",
"reach": 117,
"social_spend": 0.0,
"spend": 0.8,
"unique_clicks": 0,
"unique_ctr": 0.0,
"unique_inline_link_clicks": 0,
"unique_link_clicks_ctr": 0.0,
"updated_time": "2021-08-27",
"wish_bid": 0.0,
"action_values": [
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"}
],
"ad_click_actions": [
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
{"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"}
]
} schema {
"properties": {
"account_currency": {
"type": [
"null",
"string"
]
},
"account_id": {
"type": [
"null",
"string"
]
},
"account_name": {
"type": [
"null",
"string"
]
},
"action_values": {
"$ref": "#/definitions/ads_action_stats_"
},
"actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"ad_click_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"ad_id": {
"type": [
"null",
"string"
]
},
"ad_impression_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"ad_name": {
"type": [
"null",
"string"
]
},
"adset_id": {
"type": [
"null",
"string"
]
},
"adset_name": {
"type": [
"null",
"string"
]
},
"age_targeting": {
"type": [
"null",
"string"
]
},
"attribution_setting": {
"type": [
"null",
"string"
]
},
"auction_bid": {
"type": [
"null",
"number"
]
},
"auction_competitiveness": {
"type": [
"null",
"number"
]
},
"auction_max_competitor_bid": {
"type": [
"null",
"number"
]
},
"buying_type": {
"type": [
"null",
"string"
]
},
"campaign_id": {
"type": [
"null",
"string"
]
},
"campaign_name": {
"type": [
"null",
"string"
]
},
"canvas_avg_view_percent": {
"type": [
"null",
"number"
]
},
"canvas_avg_view_time": {
"type": [
"null",
"number"
]
},
"catalog_segment_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"catalog_segment_value": {
"$ref": "#/definitions/ads_action_stats_"
},
"catalog_segment_value_mobile_purchase_roas": {
"$ref": "#/definitions/ads_action_stats_"
},
"catalog_segment_value_omni_purchase_roas": {
"$ref": "#/definitions/ads_action_stats_"
},
"catalog_segment_value_website_purchase_roas": {
"$ref": "#/definitions/ads_action_stats_"
},
"clicks": {
"type": [
"null",
"integer"
]
},
"conversion_rate_ranking": {
"type": [
"null",
"string"
]
},
"conversion_values": {
"$ref": "#/definitions/ads_action_stats_"
},
"conversions": {
"$ref": "#/definitions/ads_action_stats_"
},
"converted_product_quantity": {
"$ref": "#/definitions/ads_action_stats_"
},
"converted_product_value": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_15_sec_video_view": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_2_sec_continuous_video_view": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_action_type": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_ad_click": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_conversion": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_estimated_ad_recallers": {
"type": [
"null",
"number"
]
},
"cost_per_inline_link_click": {
"type": [
"null",
"number"
]
},
"cost_per_inline_post_engagement": {
"type": [
"null",
"number"
]
},
"cost_per_outbound_click": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_thruplay": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_unique_action_type": {
"$ref": "#/definitions/ads_action_stats_"
},
"cost_per_unique_click": {
"type": [
"null",
"number"
]
},
"cost_per_unique_inline_link_click": {
"type": [
"null",
"number"
]
},
"cost_per_unique_outbound_click": {
"$ref": "#/definitions/ads_action_stats_"
},
"cpc": {
"type": [
"null",
"number"
]
},
"cpm": {
"type": [
"null",
"number"
]
},
"cpp": {
"type": [
"null",
"number"
]
},
"created_time": {
"format": "date",
"type": [
"null",
"string"
]
},
"ctr": {
"type": [
"null",
"number"
]
},
"date_start": {
"format": "date",
"type": [
"null",
"string"
]
},
"date_stop": {
"format": "date",
"type": [
"null",
"string"
]
},
"engagement_rate_ranking": {
"type": [
"null",
"string"
]
},
"estimated_ad_recall_rate": {
"type": [
"null",
"number"
]
},
"estimated_ad_recall_rate_lower_bound": {
"type": [
"null",
"number"
]
},
"estimated_ad_recall_rate_upper_bound": {
"type": [
"null",
"number"
]
},
"estimated_ad_recallers": {
"type": [
"null",
"number"
]
},
"estimated_ad_recallers_lower_bound": {
"type": [
"null",
"number"
]
},
"estimated_ad_recallers_upper_bound": {
"type": [
"null",
"number"
]
},
"frequency": {
"type": [
"null",
"number"
]
},
"full_view_impressions": {
"type": [
"null",
"number"
]
},
"full_view_reach": {
"type": [
"null",
"number"
]
},
"gender_targeting": {
"type": [
"null",
"string"
]
},
"impressions": {
"type": [
"null",
"integer"
]
},
"inline_link_click_ctr": {
"type": [
"null",
"number"
]
},
"inline_link_clicks": {
"type": [
"null",
"integer"
]
},
"inline_post_engagement": {
"type": [
"null",
"integer"
]
},
"instant_experience_clicks_to_open": {
"type": [
"null",
"number"
]
},
"instant_experience_clicks_to_start": {
"type": [
"null",
"number"
]
},
"instant_experience_outbound_clicks": {
"$ref": "#/definitions/ads_action_stats_"
},
"labels": {
"type": [
"null",
"string"
]
},
"location": {
"type": [
"null",
"string"
]
},
"mobile_app_purchase_roas": {
"$ref": "#/definitions/ads_action_stats_"
},
"objective": {
"type": [
"null",
"string"
]
},
"optimization_goal": {
"type": [
"null",
"string"
]
},
"outbound_clicks": {
"$ref": "#/definitions/ads_action_stats_"
},
"outbound_clicks_ctr": {
"$ref": "#/definitions/ads_action_stats_"
},
"purchase_roas": {
"$ref": "#/definitions/ads_action_stats_"
},
"qualifying_question_qualify_answer_rate": {
"type": [
"null",
"number"
]
},
"quality_ranking": {
"type": [
"null",
"string"
]
},
"reach": {
"type": [
"null",
"integer"
]
},
"social_spend": {
"type": [
"null",
"number"
]
},
"spend": {
"type": [
"null",
"number"
]
},
"unique_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"unique_clicks": {
"type": [
"null",
"integer"
]
},
"unique_ctr": {
"type": [
"null",
"number"
]
},
"unique_inline_link_click_ctr": {
"type": [
"null",
"number"
]
},
"unique_inline_link_clicks": {
"type": [
"null",
"integer"
]
},
"unique_link_clicks_ctr": {
"type": [
"null",
"number"
]
},
"unique_outbound_clicks": {
"$ref": "#/definitions/ads_action_stats_"
},
"unique_outbound_clicks_ctr": {
"$ref": "#/definitions/ads_action_stats_"
},
"updated_time": {
"format": "date",
"type": [
"null",
"string"
]
},
"video_15_sec_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_30_sec_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_avg_time_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_continuous_2_sec_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_p100_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_p25_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_p50_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_p75_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_p95_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_play_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"video_play_curve_actions": {
"$ref": "#/definitions/ads_histogram_stats_"
},
"video_play_retention_0_to_15s_actions": {
"$ref": "#/definitions/ads_histogram_stats_"
},
"video_play_retention_20_to_60s_actions": {
"$ref": "#/definitions/ads_histogram_stats_"
},
"video_play_retention_graph_actions": {
"$ref": "#/definitions/ads_histogram_stats_"
},
"video_time_watched_actions": {
"$ref": "#/definitions/ads_action_stats_"
},
"website_ctr": {
"$ref": "#/definitions/ads_action_stats_"
},
"website_purchase_roas": {
"$ref": "#/definitions/ads_action_stats_"
},
"wish_bid": {
"type": [
"null",
"number"
]
}
},
"type": [
"null",
"object"
],
"definitions": {
"ads_action_stats_": {
"type": [
"null",
"array"
],
"items": {
"type": [
"null",
"object"
],
"properties": {
"1d_click": {
"type": [
"null",
"number"
]
},
"7d_click": {
"type": [
"null",
"number"
]
},
"28d_click": {
"type": [
"null",
"number"
]
},
"1d_view": {
"type": [
"null",
"number"
]
},
"7d_view": {
"type": [
"null",
"number"
]
},
"28d_view": {
"type": [
"null",
"number"
]
},
"action_destination": {
"type": [
"null",
"string"
]
},
"action_target_id": {
"type": [
"null",
"string"
]
},
"action_type": {
"type": [
"null",
"string"
]
},
"value": {
"type": [
"null",
"number"
]
}
}
}
},
"ads_histogram_stats_": {
"type": [
"null",
"array"
],
"items": {
"type": [
"null",
"object"
],
"properties": {
"action_type": {
"type": [
"null",
"string"
]
},
"value": {
"type": [
"null",
"array"
],
"items": {
"type": [
"null",
"integer"
]
}
}
}
}
}
}
} update: we can reduce single object processing by 2 times faster by skipping jsonschema type checking, but in this case we wouldn't receive warning on object does not fit schema.
It would be ignored. Any errors while validating object against schema will be ignored (only print warning in the logs). |
65b2620
to
7592571
Compare
7592571
to
1fbf534
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for changes and running those performance benchmarks, that's really useful.
Based on that, I think it is at least worth mentioning in the .../schemas.md
that utilising this behaviour will introduce a slight slowdown, and you could link to your comment in this PR with those performance tests.
if TransformConfig.NoTransform in config and config != TransformConfig.NoTransform: | ||
raise Exception("NoTransform option cannot be combined with other flags.") | ||
self._config = config | ||
all_validators = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
piggybacking on jsonschema native validators is a clever idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Im proud of it :)
} | ||
self._normalizer = validators.create(meta_schema=Draft7Validator.META_SCHEMA, validators=all_validators) | ||
|
||
def register(self, normalization_callback: Callable) -> Callable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coming from the point of view of a developer leveraging this class, it's not clear what the contract of the registered normalizer is. What is the input to the callable I provide? is it just a single value? Does it also take in a schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add type annotation and link to docs.
``` | ||
Also it works on complex types: | ||
```json | ||
{"value": {"unexpected_object": "value"}} -> {"value": "{'unexpected_object': 'value'}"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this example the types are unchanged right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
object {"unexpected_object": "value"} has been casted to a string representing object "{'unexpected_object': 'value'}"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH I see it now. thanks!
# need it to normalize values against json schema. By default no action | ||
# taken unless configured. See | ||
# docs/connector-development/cdk-python/schemas.md for details. | ||
transformer.transform(data, schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am conflicted between providing this by default in the source itself and therefore increasing the scope of the source vs. just telling the user they are responsible for using this helper class.
How do you think about this decision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it would be a lot of duplicate code like
class MyStream:
transformer = TypeTransformer(DefaultTransform)
def read_records(...):
for r in super().read_records(...):
# conversion done in-place, we cant just do yield from ...
transformer.transform(r)
yield r
instead of just
class MyStream:
transformer = TypeTransformer(DefaultTransform)
I would like to do this in source code to keep connectors code cleaner. Also connector's developer may forget about get_json_schema caching and this would reduce performance A LOT.
@avida also can we retain your benchmarks somewhere? Maybe even just in the docs? I am sure this question will come up by users |
@@ -25,3 +25,85 @@ def get_json_schema(self): | |||
return schema | |||
``` | |||
|
|||
## Schema normalization | |||
|
|||
It is important to ensure output data conforms to the declared json schema. This is because the destination receiving this data to load into tables may strictly enforce schema (e.g. when data is stored in a SQL database, you can't put CHAT type into INTEGER column). In the case of changes to API output (which is almost guaranteed to happen over time) or a minor mistake in jsonschema definition, data syncs could thus break because of mismatched datatype schemas. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docs are fantastic! very thorough and user friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
f1add24
to
28318a5
Compare
28318a5
to
daeb1bb
Compare
What
Resolves #5224
How
Describe the solution
Recommended reading order
x.java
y.python
Pre-merge Checklist
Expand the relevant checklist and delete the others.
New Connector
Community member or Airbyter
airbyte_secret
./gradlew :airbyte-integrations:connectors:<name>:integrationTest
.README.md
bootstrap.md
. See description and examplesdocs/SUMMARY.md
docs/integrations/<source or destination>/<name>.md
including changelog. See changelog exampledocs/integrations/README.md
airbyte-integrations/builds.md
Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
/test connector=connectors/<name>
command is passing./publish
command described hereUpdating a connector
Community member or Airbyter
airbyte_secret
./gradlew :airbyte-integrations:connectors:<name>:integrationTest
.README.md
bootstrap.md
. See description and examplesdocs/integrations/<source or destination>/<name>.md
including changelog. See changelog exampleAirbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
/test connector=connectors/<name>
command is passing./publish
command described hereConnector Generator
-scaffold
in their name) have been updated with the latest scaffold by running./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates
then checking in your changes