
Actor Catalog Fetch Event Duplicate Key exception #21208

Open
michelgalle opened this issue Jan 10, 2023 · 5 comments
Labels
area/platform (issues related to the platform), autoteam, community, frozen (not being actively worked on), team/platform-move, type/bug (something isn't working)

Comments


Environment

  • Airbyte version: 0.40.26
  • OS Version / Instance: AMI 2 / AWS EC2
  • Deployment: Docker
  • Step where error happened: After Setting up a connector

Current Behavior

After setting up a connector (unfortunately I don't have the logs, nor do I remember which connector it was), I started getting the following error in the UI:
[screenshot of the UI error]

Expected Behavior

The UI behaves as expected.

Logs

The only logs I can get are the ones posted in a Slack thread.

2022-12-22 21:49:01 ERROR i.a.s.a.ApiHelper(execute):28 - Unexpected Exception
java.lang.IllegalStateException: Duplicate key b601e028-e003-4290-9acc-5fc72ca44776 (attempted merging values io.airbyte.config.ActorCatalogFetchEvent@50671700[id=<null>,actorId=b601e028-e003-4290-9acc-5fc72ca44776,actorCatalogId=29181719-0c50-4489-b5aa-03c0e10cb227,configHash=<null>,connectorVersion=<null>,createdAt=1671739270] and io.airbyte.config.ActorCatalogFetchEvent@69081582[id=<null>,actorId=b601e028-e003-4290-9acc-5fc72ca44776,actorCatalogId=d664d0dd-2ded-4a9e-9737-fba55e1e79eb,configHash=<null>,connectorVersion=<null>,createdAt=1671739270])

Steps to Reproduce

  1. Create a new ingestion.
  2. During schema discovery, Airbyte somehow inserted two rows with the same timestamp into the actor_catalog_fetch_event table in the metadata DB. I am not sure why that happened.

I already did some troubleshooting and know the reason for the error. Since there are two rows with the same timestamp, the query should use row_number() instead of rank(). Since I am new to Airbyte, I cannot say whether the code that inserts into the table should be fixed so we do not get two rows with the same timestamp for the same actor, or whether fixing the query I mentioned above is enough.
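The rank()-versus-row_number() distinction can be sketched in plain Java (this is only an analogue of the SQL semantics, not the actual jOOQ query; the `Row` record and method names are illustrative): with two rows tied on created_at, RANK() assigns rank 1 to both, so a "keep rank 1" filter returns two rows per actor, while ROW_NUMBER() numbers the tie 1 and 2 and keeps exactly one.

```java
import java.util.Comparator;
import java.util.List;

public class RankVsRowNumber {
    record Row(String actorCatalogId, long createdAt) {}

    // RANK()-style: every row tied for the newest createdAt survives.
    static List<Row> latestByRank(List<Row> rows) {
        long newest = rows.stream().mapToLong(Row::createdAt).max().orElseThrow();
        return rows.stream().filter(r -> r.createdAt() == newest).toList();
    }

    // ROW_NUMBER()-style: ties get distinct numbers, so exactly one row survives.
    static List<Row> latestByRowNumber(List<Row> rows) {
        return rows.stream()
            .sorted(Comparator.comparingLong(Row::createdAt).reversed())
            .limit(1)
            .toList();
    }

    public static void main(String[] args) {
        // Two fetch events with the same timestamp, as in the reported error.
        List<Row> rows = List.of(
            new Row("29181719", 1671739270L),
            new Row("d664d0dd", 1671739270L));
        System.out.println(latestByRank(rows).size());      // 2: triggers the duplicate-key merge
        System.out.println(latestByRowNumber(rows).size()); // 1: one event per actor
    }
}
```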

Are you willing to submit a PR?

No

@michelgalle michelgalle added needs-triage type/bug Something isn't working labels Jan 10, 2023
@sh4sh sh4sh added area/platform issues related to the platform team/platform-move and removed needs-triage team/tse Technical Support Engineers labels Jan 18, 2023
hugozap added a commit to hugozap/airbyte that referenced this issue Jan 19, 2023

hugozap commented Jan 19, 2023

I confirm that using row_number, as @michelgalle suggested, makes the problem go away, at least from a UI standpoint.

@mfsiega-airbyte (Contributor)

Thanks for the report, and the investigation!

It seems like the two rows with the same timestamp are happening because the frontend is actually requesting schema discovery twice. This isn't intended, but it shouldn't cause the backend to fail like this. Indeed, as you note, the combination of actor_catalog_id, actor_id, and created_at isn't guaranteed to be unique.

Using row_number seems like a good option. We can also include the id column which is intended to be unique. I'll send a PR with those changes.


hugozap commented Jan 21, 2023

@mfsiega-airbyte I found that both getMostRecentActorCatalogFetchEventForSource and getMostRecentActorCatalogFetchEventForSources need to be updated as they both fail when there are duplicate records.

For getMostRecentActorCatalogFetchEventForSource I had to include the ID as part of the groupBy (as stated in the error I got).

After both methods were patched, I could set up a connection successfully.

@mfsiega-airbyte (Contributor)

@hugozap thanks for the pointer! I put up a PR for getMostRecentActorCatalogFetchEventForSources since that one was a bit clearer.

For getMostRecentActorCatalogFetchEventForSource I'm not totally following - I don't see in the code that it's doing a groupBy. Am I misunderstanding? (If you have your patch in a fork/branch somewhere, feel free to point me there instead?)


hugozap commented Jan 23, 2023

@mfsiega-airbyte Sorry, I was not clear.

With getMostRecentActorCatalogFetchEventForSource, the problem is that it will also fail if duplicate records were written. I know this because I first patched getMostRecentActorCatalogFetchEventForSources locally; the UI then worked, but the moment I tried to set up a new connection I got an error again. The cause of the error is that limit(1) fails because it doesn't know which record to choose when rows tie.

I fixed it locally by adding ACTOR_CATALOG_FETCH_EVENT.id.desc() to the ORDER BY to break the tie and limit(1) will work again.

To test that it fails, you would have to simulate the current write bug by inserting a duplicate record (same actor_id and created_at).
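The tiebreak described above can be sketched in plain Java (a sketch only, not the actual jOOQ query; the `Event` record and its fields are illustrative): ordering by created_at and then by the unique id makes "take the first row" deterministic even when timestamps collide.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class TieBreakDemo {
    // Illustrative stand-in for a fetch-event row with a unique id.
    record Event(long id, long createdAt) {}

    // Analogue of ORDER BY created_at DESC, id DESC LIMIT 1:
    // the unique id breaks ties on created_at deterministically.
    static Optional<Event> mostRecent(List<Event> events) {
        return events.stream()
            .max(Comparator.comparingLong(Event::createdAt)
                           .thenComparingLong(Event::id));
    }

    public static void main(String[] args) {
        // Two events with the same createdAt: without the id tiebreak the
        // "most recent" row would be ambiguous; with it, id=2 always wins.
        List<Event> events = List.of(new Event(1L, 1671739270L),
                                     new Event(2L, 1671739270L));
        System.out.println(mostRecent(events).orElseThrow().id()); // prints 2
    }
}
```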

@bleonard bleonard added the frozen Not being actively worked on label Mar 22, 2024
7 participants