Gitlab Source is only loading 50 rows per project #21076

Closed · Tracked by #14152
eflorico opened this issue Jan 5, 2023 · 9 comments · Fixed by #21713
@eflorico commented Jan 5, 2023

Environment

  • Airbyte version: 0.40.26
  • OS Version / Instance: Debian 11
  • Deployment: Docker
  • Source Connector and version: Gitlab 0.1.12
  • Destination Connector and version: BigQuery 1.2.9
  • Step where error happened: Sync job

Current Behavior

For each GitLab project, a maximum of only 50 pipelines, merge requests, and commits is retrieved. Most rows are therefore missing from these streams; in our case, tens of thousands of rows. The data retrieved from GitLab is consequently unusable.

I do see >50 branches for some projects, so I think branches may be fetched correctly.

Additionally, I see lots of duplicates in the users stream (all fields have the same values except for the airbyte* metadata fields). In total, there are 617 rows for 12 distinct users. I'm mentioning this because it may be related; however, it is not the main issue for me.

Expected Behavior

All pipelines, merge requests, and commits since the specified start date should be retrieved.

Logs

log.txt

Notable findings in the logs:

2023-01-05 18:05:46 source > Syncing stream: merge_requests
[...]
2023-01-05 18:05:51 source > Got 403 error when accessing URL https://gitlab.com/api/v4/projects/14806542/merge_requests?per_page=50&scope=all&updated_after=2019-01-01T00%3A00%3A00Z. Very likely the feature is disabled for this project and/or group. Please double check it, or report a bug otherwise.
[...]
2023-01-05 18:06:08 source > Syncing stream: pipelines
[...]
2023-01-05 18:06:12 source > Got 403 error when accessing URL https://gitlab.com/api/v4/projects/14806542/pipelines?per_page=50&updated_after=2019-01-01T00%3A00%3A00Z. Very likely the feature is disabled for this project and/or group. Please double check it, or report a bug otherwise.
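
For reference, the per_page=50 in these URLs reflects GitLab's offset pagination: each request returns at most per_page records, and the X-Next-Page response header indicates whether more pages exist, so a client that stops after the first request sees exactly the 50-row cap described above. Below is a minimal sketch of following that header; the access token is a placeholder, and the project id is taken from the log lines.

```python
# Minimal sketch of GitLab offset pagination: keep requesting pages until the
# X-Next-Page response header is empty. The token is a placeholder.
import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = "14806542"            # project id from the log lines above
TOKEN = "<personal-access-token>"  # placeholder


def fetch_all(path, params):
    """Return every record for `path`, following GitLab's pagination headers."""
    items = []
    page = "1"
    while page:
        resp = requests.get(
            f"{GITLAB_API}/{path}",
            headers={"PRIVATE-TOKEN": TOKEN},
            params={**params, "per_page": 50, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        items.extend(resp.json())
        # An empty or missing X-Next-Page header means there are no more pages.
        page = resp.headers.get("X-Next-Page", "")
    return items


if __name__ == "__main__":
    mrs = fetch_all(
        f"projects/{PROJECT_ID}/merge_requests",
        {"scope": "all", "updated_after": "2019-01-01T00:00:00Z"},
    )
    print(f"fetched {len(mrs)} merge requests")
```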

Steps to Reproduce

  1. Add Gitlab source. Config:
  • API URL: gitlab.com
  • Start Date: 2019-01-01T00:00:00Z
  • Groups: (name of our group)
  • Projects: (empty)
  2. Add BigQuery destination. Configured with GCS staging.
  3. Add GitLab-BigQuery connection. Sync the following streams:
  • branches - full refresh/overwrite
  • commits - incremental/deduped+history
  • merge_requests - incremental/deduped+history
  • pipelines - incremental/deduped+history
  • projects - full refresh/overwrite
  • users - full refresh/overwrite

Other observations

I am currently evaluating both Airbyte and Meltano. Meltano seems to fetch data from Gitlab mostly fine after fixing some configuration issues. Interestingly, Meltano has the same issue with duplicate users.

This may be the same issue as #12476.

Are you willing to submit a PR?

Probably not.

@davydov-d (Collaborator)

Hey @eflorico, thanks for the feedback.
I'm looking at the log file you provided, and I see far more than 50 records per stream emitted and committed: 415 merge requests, 456 pipelines, 1071 commits. Is it possible you're filtering the results in your destination? Given you have 56 projects and 415 merge requests, the average is about 8 merge requests per project. I understand it's just an average, but could you please double-check that you're looking at the whole result, not just a part of it?
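
One quick way to cross-check those numbers, independent of the destination, is to ask the GitLab API for its own totals per project: with offset pagination, GitLab reports the matching record count in the X-Total response header (gitlab.com omits it for collections larger than 10,000 items). A rough sketch, using a placeholder token and the project id from the logs:

```python
# Rough verification sketch: read GitLab's X-Total header to see how many
# records the API itself reports per project, then compare with the synced
# row counts. Token is a placeholder; gitlab.com omits X-Total above 10,000.
import requests

GITLAB_API = "https://gitlab.com/api/v4"
TOKEN = "<personal-access-token>"  # placeholder


def expected_count(project_id, resource, params):
    resp = requests.get(
        f"{GITLAB_API}/projects/{project_id}/{resource}",
        headers={"PRIVATE-TOKEN": TOKEN},
        params={"per_page": 1, **params},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.headers.get("X-Total", "unknown (header omitted)")


if __name__ == "__main__":
    start = "2019-01-01T00:00:00Z"
    checks = {
        "merge_requests": {"scope": "all", "updated_after": start},
        "pipelines": {"updated_after": start},
        "repository/commits": {"since": start},
    }
    for resource, params in checks.items():
        print(resource, expected_count("14806542", resource, params))
```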

@emilsedgh

Hi @davydov-d

I can reliably reproduce this issue. I have a group with ~16.8K issues across a few different projects. When I pull the group, I get a total of only 1,923 issues.

As a workaround, I currently have one source per project, which does fetch all issues.

Not sure if you have access, but here is a job I executed on a brand-new source a few minutes ago (using the new OAuth authentication, by the way: well done and congrats).

(The source is deleted now, but you should still be able to see the logs.)

I'd love to help you reproduce and resolve this.

@davydov-d (Collaborator)

@emilsedgh I got it, thanks. I'll take a deep dive into it then.

@eflorico (Author)

@davydov-d I checked and can confirm I'm not filtering the results by accident. You're right that there are 56 projects total, but most pipelines, merge requests, and commits belong to only a handful of projects, which should each have thousands of them.

davydov-d linked a pull request Jan 23, 2023 that will close this issue
davydov-d added a commit that referenced this issue Jan 23, 2023
davydov-d added a commit that referenced this issue Jan 24, 2023
* #21076 source gitlab: fix missing data issue

* #21076 source gitlab: upd changelog

* auto-bump connector version

Co-authored-by: Octavia Squidington III <[email protected]>
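
For context, pagination in the Airbyte Python CDK is driven by a stream's next_page_token hook: returning None after the first response stops the sync at one page, which matches the 50-row symptom reported here. The sketch below shows the general shape such a fix takes against the CDK's HttpStream API; it is illustrative only and not necessarily the exact change made in #21713.

```python
# Illustrative sketch only: the general shape of a pagination fix for a GitLab
# stream built on the Airbyte Python CDK. Not necessarily the code in #21713.
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class MergeRequests(HttpStream):
    url_base = "https://gitlab.com/api/v4/"
    primary_key = "id"

    def __init__(self, project_id: str, **kwargs):
        super().__init__(**kwargs)
        self.project_id = project_id

    def path(self, **kwargs) -> str:
        return f"projects/{self.project_id}/merge_requests"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        # The crux of the fix: keep paginating until GitLab stops advertising
        # a next page. Returning None here ends the pagination loop.
        next_page = response.headers.get("X-Next-Page")
        return {"page": next_page} if next_page else None

    def request_params(self, next_page_token: Optional[Mapping[str, Any]] = None, **kwargs) -> Mapping[str, Any]:
        params = {"per_page": 50, "scope": "all"}
        if next_page_token:
            params.update(next_page_token)
        return params

    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping[str, Any]]:
        yield from response.json()
```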
@davydov-d (Collaborator)

@eflorico @emilsedgh hey, the fix has been released. Could you please check it and provide your feedback?

@emilsedgh

Checking right now.

@emilsedgh

Fantastic job and well done! It synchronized all my 16K issues.

@davydov-d (Collaborator)

Glad to know! Thanks for your cooperation 🤝

@eflorico (Author) commented Jan 24, 2023

Just tested it here as well; at first glance everything looks fine! It does appear to synchronize all commits, MRs, and branches for my projects 😊

Thank you for the speedy fix @davydov-d, I appreciate it!
