Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🏗️ Python CDK: add schema transformer class #6139

Merged
merged 7 commits into from
Sep 27, 2021

Conversation

avida
Copy link
Contributor

@avida avida commented Sep 16, 2021

What

Resolves #5224

How

Describe the solution

Recommended reading order

  1. x.java
  2. y.python

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions
  • Connector added to connector index like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions
  • Connector version bumped like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

Connector Generator

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed.

@github-actions github-actions bot added the CDK Connector Development Kit label Sep 16, 2021
@avida avida changed the title WIP: Python CDK: add schema transformer class WIP: 🏗️ Python CDK: add schema transformer class Sep 16, 2021
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Sep 17, 2021
@avida avida changed the title WIP: 🏗️ Python CDK: add schema transformer class 🏗️ Python CDK: add schema transformer class Sep 17, 2021
@avida avida force-pushed the drezchykov/cdk-transform branch 3 times, most recently from c445fde to 14316be Compare September 17, 2021 10:44
Copy link
Contributor

@htrueman htrueman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may add some additional tests to check the the poorly formatted schemas.
For instance:

  • type: "object", but properties is not declared
  • type: "array" with and without items prop declared
    I don't also see the transform case for anyOf prop.
    Overall it would nice to see all of the possible schema props covered (see http://json-schema.org/draft-07/schema#). Or at least we should document what is supported and what's not.

Copy link
Contributor

@Phlair Phlair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like this as a way to make syncs more robust against API changes. Nice approach.

Added various comments with some requests for changes/additions.

Two more high-level points I'd like to understand:

  • How impactful is this on performance? We're adding a non-trivial set of steps and field mutation to every single airbyte record so I think it would be good diligence to check how severely this impacts performance. If it's major, there may need to be some optimisation-led refactoring to achieve the same result.
  • How does this handle additional fields in the record that aren't present in the schema? Fail sync? Would be good to make this very clear so as a developer I would know how including Transformer usage affects additional properties.

airbyte-cdk/python/CHANGELOG.md Outdated Show resolved Hide resolved
airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py Outdated Show resolved Hide resolved
airbyte-cdk/python/airbyte_cdk/sources/abstract_source.py Outdated Show resolved Hide resolved
airbyte-cdk/python/airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
airbyte-cdk/python/airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
@avida
Copy link
Contributor Author

avida commented Sep 22, 2021

How impactful is this on performance? We're adding a non-trivial set of steps and field mutation to every single airbyte record so I think it would be good diligence to check how severely this impacts performance. If it's major, there may need to be some optimisation-led refactoring to achieve the same result.

Ive done some performance measures with ads_insights facebook schema (most complex schema I found) and example object (see under Test data spoiler).
I run 3 times loop with 1000 object transformation. Here is average transform time per single object, seconds:

regular transform:
0.0008423403530008121

transform without type casting (but value still being write to dict/array):
0.000776215762666349

transform without actual value setting  (but iterating through object properties):
0.0006788729513330812

just traverse/validate through json schema and object fields:
0.0006139181846665452

On my PC (AMD Ryzen 7 5800X) it took 0.8 milliseconds per one object. As you can see most time (~75%) is taken by jsonschema traverse/validation routine and very little (less than 10 %) by actual converting.

Test data object:
{
    "account_currency": "USD",
    "account_id": "212551616838260",
    "account_name": "Airbyte",
    "ad_id": "23846784938030398",
    "ad_name": "Stock photo ad 2",
    "adset_id": "23846765228280398",
    "adset_name": "Vanilla awareness ad set",
    "buying_type": "AUCTION",
    "campaign_id": "23846765228240398",
    "campaign_name": "Airbyte Awareness Campaign 1 (sherif)",
    "clicks": 0,
    "conversion_rate_ranking": "UNKNOWN",
    "cost_per_estimated_ad_recallers": 0.04,
    "cpm": 6.779661,
    "cpp": 6.837607,
    "created_time": "2021-02-11",
    "ctr": 0.0,
    "date_start": "2021-08-27",
    "date_stop": "2021-08-27",
    "engagement_rate_ranking": "UNKNOWN",
    "estimated_ad_recall_rate": 17.094017,
    "estimated_ad_recallers": 20.0,
    "frequency": 1.008547,
    "impressions": 118,
    "inline_link_clicks": 0,
    "inline_post_engagement": 0,
    "objective": "BRAND_AWARENESS",
    "optimization_goal": "AD_RECALL_LIFT",
    "quality_ranking": "UNKNOWN",
    "reach": 117,
    "social_spend": 0.0,
    "spend": 0.8,
    "unique_clicks": 0,
    "unique_ctr": 0.0,
    "unique_inline_link_clicks": 0,
    "unique_link_clicks_ctr": 0.0,
    "updated_time": "2021-08-27",
    "wish_bid": 0.0,
    "action_values": [
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"}
    ],
    "ad_click_actions": [
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"},
        {"1d_click": 1123, "7d_click": 2232, "28d_click": 1213, "1d_view": 122, "7d_view": 21312, "28d_view": 1213, "action_destination": "1231", "action_target_id": "123", "action_type": "1231"}

    ]
}

schema

{
  "properties": {
    "account_currency": {
      "type": [
        "null",
        "string"
      ]
    },
    "account_id": {
      "type": [
        "null",
        "string"
      ]
    },
    "account_name": {
      "type": [
        "null",
        "string"
      ]
    },
    "action_values": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "ad_click_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "ad_id": {
      "type": [
        "null",
        "string"
      ]
    },
    "ad_impression_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "ad_name": {
      "type": [
        "null",
        "string"
      ]
    },
    "adset_id": {
      "type": [
        "null",
        "string"
      ]
    },
    "adset_name": {
      "type": [
        "null",
        "string"
      ]
    },
    "age_targeting": {
      "type": [
        "null",
        "string"
      ]
    },
    "attribution_setting": {
      "type": [
        "null",
        "string"
      ]
    },
    "auction_bid": {
      "type": [
        "null",
        "number"
      ]
    },
    "auction_competitiveness": {
      "type": [
        "null",
        "number"
      ]
    },
    "auction_max_competitor_bid": {
      "type": [
        "null",
        "number"
      ]
    },
    "buying_type": {
      "type": [
        "null",
        "string"
      ]
    },
    "campaign_id": {
      "type": [
        "null",
        "string"
      ]
    },
    "campaign_name": {
      "type": [
        "null",
        "string"
      ]
    },
    "canvas_avg_view_percent": {
      "type": [
        "null",
        "number"
      ]
    },
    "canvas_avg_view_time": {
      "type": [
        "null",
        "number"
      ]
    },
    "catalog_segment_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "catalog_segment_value": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "catalog_segment_value_mobile_purchase_roas": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "catalog_segment_value_omni_purchase_roas": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "catalog_segment_value_website_purchase_roas": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "clicks": {
      "type": [
        "null",
        "integer"
      ]
    },
    "conversion_rate_ranking": {
      "type": [
        "null",
        "string"
      ]
    },
    "conversion_values": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "conversions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "converted_product_quantity": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "converted_product_value": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_15_sec_video_view": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_2_sec_continuous_video_view": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_action_type": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_ad_click": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_conversion": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_estimated_ad_recallers": {
      "type": [
        "null",
        "number"
      ]
    },
    "cost_per_inline_link_click": {
      "type": [
        "null",
        "number"
      ]
    },
    "cost_per_inline_post_engagement": {
      "type": [
        "null",
        "number"
      ]
    },
    "cost_per_outbound_click": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_thruplay": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_unique_action_type": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cost_per_unique_click": {
      "type": [
        "null",
        "number"
      ]
    },
    "cost_per_unique_inline_link_click": {
      "type": [
        "null",
        "number"
      ]
    },
    "cost_per_unique_outbound_click": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "cpc": {
      "type": [
        "null",
        "number"
      ]
    },
    "cpm": {
      "type": [
        "null",
        "number"
      ]
    },
    "cpp": {
      "type": [
        "null",
        "number"
      ]
    },
    "created_time": {
      "format": "date",
      "type": [
        "null",
        "string"
      ]
    },
    "ctr": {
      "type": [
        "null",
        "number"
      ]
    },
    "date_start": {
      "format": "date",
      "type": [
        "null",
        "string"
      ]
    },
    "date_stop": {
      "format": "date",
      "type": [
        "null",
        "string"
      ]
    },
    "engagement_rate_ranking": {
      "type": [
        "null",
        "string"
      ]
    },
    "estimated_ad_recall_rate": {
      "type": [
        "null",
        "number"
      ]
    },
    "estimated_ad_recall_rate_lower_bound": {
      "type": [
        "null",
        "number"
      ]
    },
    "estimated_ad_recall_rate_upper_bound": {
      "type": [
        "null",
        "number"
      ]
    },
    "estimated_ad_recallers": {
      "type": [
        "null",
        "number"
      ]
    },
    "estimated_ad_recallers_lower_bound": {
      "type": [
        "null",
        "number"
      ]
    },
    "estimated_ad_recallers_upper_bound": {
      "type": [
        "null",
        "number"
      ]
    },
    "frequency": {
      "type": [
        "null",
        "number"
      ]
    },
    "full_view_impressions": {
      "type": [
        "null",
        "number"
      ]
    },
    "full_view_reach": {
      "type": [
        "null",
        "number"
      ]
    },
    "gender_targeting": {
      "type": [
        "null",
        "string"
      ]
    },
    "impressions": {
      "type": [
        "null",
        "integer"
      ]
    },
    "inline_link_click_ctr": {
      "type": [
        "null",
        "number"
      ]
    },
    "inline_link_clicks": {
      "type": [
        "null",
        "integer"
      ]
    },
    "inline_post_engagement": {
      "type": [
        "null",
        "integer"
      ]
    },
    "instant_experience_clicks_to_open": {
      "type": [
        "null",
        "number"
      ]
    },
    "instant_experience_clicks_to_start": {
      "type": [
        "null",
        "number"
      ]
    },
    "instant_experience_outbound_clicks": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "labels": {
      "type": [
        "null",
        "string"
      ]
    },
    "location": {
      "type": [
        "null",
        "string"
      ]
    },
    "mobile_app_purchase_roas": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "objective": {
      "type": [
        "null",
        "string"
      ]
    },
    "optimization_goal": {
      "type": [
        "null",
        "string"
      ]
    },
    "outbound_clicks": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "outbound_clicks_ctr": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "purchase_roas": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "qualifying_question_qualify_answer_rate": {
      "type": [
        "null",
        "number"
      ]
    },
    "quality_ranking": {
      "type": [
        "null",
        "string"
      ]
    },
    "reach": {
      "type": [
        "null",
        "integer"
      ]
    },
    "social_spend": {
      "type": [
        "null",
        "number"
      ]
    },
    "spend": {
      "type": [
        "null",
        "number"
      ]
    },
    "unique_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "unique_clicks": {
      "type": [
        "null",
        "integer"
      ]
    },
    "unique_ctr": {
      "type": [
        "null",
        "number"
      ]
    },
    "unique_inline_link_click_ctr": {
      "type": [
        "null",
        "number"
      ]
    },
    "unique_inline_link_clicks": {
      "type": [
        "null",
        "integer"
      ]
    },
    "unique_link_clicks_ctr": {
      "type": [
        "null",
        "number"
      ]
    },
    "unique_outbound_clicks": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "unique_outbound_clicks_ctr": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "updated_time": {
      "format": "date",
      "type": [
        "null",
        "string"
      ]
    },
    "video_15_sec_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_30_sec_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_avg_time_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_continuous_2_sec_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_p100_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_p25_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_p50_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_p75_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_p95_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_play_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "video_play_curve_actions": {
      "$ref": "#/definitions/ads_histogram_stats_"
    },
    "video_play_retention_0_to_15s_actions": {
      "$ref": "#/definitions/ads_histogram_stats_"
    },
    "video_play_retention_20_to_60s_actions": {
      "$ref": "#/definitions/ads_histogram_stats_"
    },
    "video_play_retention_graph_actions": {
      "$ref": "#/definitions/ads_histogram_stats_"
    },
    "video_time_watched_actions": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "website_ctr": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "website_purchase_roas": {
      "$ref": "#/definitions/ads_action_stats_"
    },
    "wish_bid": {
      "type": [
        "null",
        "number"
      ]
    }
  },
  "type": [
    "null",
    "object"
  ],
  "definitions": {
    "ads_action_stats_": {
      "type": [
        "null",
        "array"
      ],
      "items": {
        "type": [
          "null",
          "object"
        ],
        "properties": {
          "1d_click": {
            "type": [
              "null",
              "number"
            ]
          },
          "7d_click": {
            "type": [
              "null",
              "number"
            ]
          },
          "28d_click": {
            "type": [
              "null",
              "number"
            ]
          },
          "1d_view": {
            "type": [
              "null",
              "number"
            ]
          },
          "7d_view": {
            "type": [
              "null",
              "number"
            ]
          },
          "28d_view": {
            "type": [
              "null",
              "number"
            ]
          },
          "action_destination": {
            "type": [
              "null",
              "string"
            ]
          },
          "action_target_id": {
            "type": [
              "null",
              "string"
            ]
          },
          "action_type": {
            "type": [
              "null",
              "string"
            ]
          },
          "value": {
            "type": [
              "null",
              "number"
            ]
          }
        }
      }
    },
    "ads_histogram_stats_": {
      "type": [
        "null",
        "array"
      ],
      "items": {
        "type": [
          "null",
          "object"
        ],
        "properties": {
          "action_type": {
            "type": [
              "null",
              "string"
            ]
          },
          "value": {
            "type": [
              "null",
              "array"
            ],
            "items": {
              "type": [
                "null",
                "integer"
              ]
            }
          }
        }
      }
    }
  }
}

update: we can reduce single object processing by 2 times faster by skipping jsonschema type checking, but in this case we wouldn't receive warning on object does not fit schema.

How does this handle additional fields in the record that aren't present in the schema? Fail sync? Would be good to make this very clear so as a developer I would know how including Transformer usage affects additional properties.

It would be ignored. Any errors while validating object against schema will be ignored (only print warning in the logs).

Copy link
Contributor

@Phlair Phlair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for changes and running those performance benchmarks, that's really useful.

Based on that, I think it is at least worth mentioning in the .../schemas.md that utilising this behaviour will introduce a slight slowdown, and you could link to your comment in this PR with those performance tests.

airbyte-cdk/python/airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
airbyte-cdk/python/airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
if TransformConfig.NoTransform in config and config != TransformConfig.NoTransform:
raise Exception("NoTransform option cannot be combined with other flags.")
self._config = config
all_validators = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

piggybacking on jsonschema native validators is a clever idea!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Im proud of it :)

airbyte-cdk/python/airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
}
self._normalizer = validators.create(meta_schema=Draft7Validator.META_SCHEMA, validators=all_validators)

def register(self, normalization_callback: Callable) -> Callable:
Copy link
Contributor

@sherifnada sherifnada Sep 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coming from the point of view of a developer leveraging this class, it's not clear what the contract of the registered normalizer is. What is the input to the callable I provide? is it just a single value? Does it also take in a schema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add type annotation and link to docs.

```
Also it works on complex types:
```json
{"value": {"unexpected_object": "value"}} -> {"value": "{'unexpected_object': 'value'}"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this example the types are unchanged right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object {"unexpected_object": "value"} has been casted to a string representing object "{'unexpected_object': 'value'}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AH I see it now. thanks!

# need it to normalize values against json schema. By default no action
# taken unless configured. See
# docs/connector-development/cdk-python/schemas.md for details.
transformer.transform(data, schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am conflicted between providing this by default in the source itself and therefore increasing the scope of the source vs. just telling the user they are responsible for using this helper class.

How do you think about this decision?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it would be a lot of duplicate code like

class MyStream:
    transformer = TypeTransformer(DefaultTransform)
    def read_records(...):
        for r in super().read_records(...):
            # conversion done in-place, we cant just do yield from ...
             transformer.transform(r)
             yield r

instead of just

class MyStream:
    transformer = TypeTransformer(DefaultTransform)

I would like to do this in source code to keep connectors code cleaner. Also connector's developer may forget about get_json_schema caching and this would reduce performance A LOT.

@sherifnada
Copy link
Contributor

@avida also can we retain your benchmarks somewhere? Maybe even just in the docs? I am sure this question will come up by users

@avida avida temporarily deployed to more-secrets September 24, 2021 07:58 Inactive
airbyte-cdk/python/airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
airbyte-cdk/python/airbyte_cdk/sources/utils/transform.py Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
docs/connector-development/cdk-python/schemas.md Outdated Show resolved Hide resolved
@@ -25,3 +25,85 @@ def get_json_schema(self):
return schema
```

## Schema normalization

It is important to ensure output data conforms to the declared json schema. This is because the destination receiving this data to load into tables may strictly enforce schema (e.g. when data is stored in a SQL database, you can't put CHAT type into INTEGER column). In the case of changes to API output (which is almost guaranteed to happen over time) or a minor mistake in jsonschema definition, data syncs could thus break because of mismatched datatype schemas.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These docs are fantastic! very thorough and user friendly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@avida avida temporarily deployed to more-secrets September 27, 2021 09:54 Inactive
@avida
Copy link
Contributor Author

avida commented Sep 27, 2021

/publish-cdk dry-run=false

🕑 https://github.com/airbytehq/airbyte/actions/runs/1277728358
https://github.com/airbytehq/airbyte/actions/runs/1277728358

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/documentation Improvements or additions to documentation CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CDK: Create a generic method for transforming records fields in accordance with streams schemas
5 participants