Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error rate improvement on airbyte server - correctly classify 4xx vs 5xx errors #14427

Closed
xiaohansong opened this issue Jul 5, 2022 · 2 comments
Assignees
Labels
area/platform issues related to the platform team/compose

Comments

@xiaohansong
Copy link
Contributor

xiaohansong commented Jul 5, 2022

The error rate on the following APIs are a slightly higher than desired, from datadog monitoring page:

/v1/web_backend/connections/get
/v1/connections/reset
/v1/source_oauths/complete_oauth (and destination_oauths)
/v1/scheduler/sources/check_connection

A closer look on the traces suggested some of them might not come from a server error. For example:

  • /v1/source_oauths/complete_oauth failed because java.io.IOException: Undefined 'code' from consent redirected url. (failed oauth?)
  • /v1/web_backend/connections/get failed because Duplicate key [deals] (attempted merging values {"type":["null","array"],"items":{"type":["null","string"]}} and {"type":["null","string"]}) (invalid schema maybe?)
  • /v1/connections/reset failed because Could not find job with id: -1 (has the job existed ever?)
  • /v1/scheduler/sources/check_connection failed due to com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Secret Payload cannot be empty. (maybe invalid auth info from customer?)
  • /v1/connections/sync failed because a sync has been running at the moment. (can be reproduced by openning the same workspace in two tabs, and clicking sync on both pages)

For some errors we could validate the status before running the workflow, such as checking if a sync has been running, while for others such as validating oauth we need to actually start the workflow and rely on the result.

It seems that we forward errors from worker as illegalStateException which later gets translated into 500 error code. https://github.com/airbytehq/airbyte/blob/master/airbyte-server/src/main/java/io/airbyte/server/handlers/SchedulerHandler.java#L384

Or simply throwing a RuntimeException - https://github.com/airbytehq/airbyte/blob/master/airbyte-scheduler/client/src/main/java/io/airbyte/scheduler/client/DefaultSynchronousSchedulerClient.java#L129-L131

We should classify them into correct errors so we maintain a healthy monitoring environment, and we could spot errors from server side easier.

@davinchia FYI since we have discussed this :)

Implementation goals:

  1. move sanity state checks ahead of the actual logic; so we could fail a request early
  2. Add a ErrorCode enum class, change worker to throw exceptions with error code, and propagate this all the way to schedule_handler.
@xiaohansong
Copy link
Contributor Author

@lmossman or @benmoriceau From the git blame it seems you two know the context about the code I'm going to change - can you take a look on this ticket and see if you think this proposal makes sense and doable? I appreciate any comments before actually working on it. Thanks a lot!

@xiaohansong xiaohansong added the area/platform issues related to the platform label Jul 13, 2022
@xiaohansong
Copy link
Contributor Author

A closer look found out the workflow issue was correctly handled. The underlying issue needs to be resolved individually. The previous PR was sufficient to close this issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform team/compose
Projects
None yet
Development

No branches or pull requests

2 participants