Public repos in a github org dont return all streams #3953

Closed
Moofasax opened this issue Jun 8, 2021 · 32 comments
Assignees
Labels
type/bug Something isn't working

Comments

@Moofasax

Moofasax commented Jun 8, 2021

Expected Behavior

Disabling the collaborators, projects, events, and teams streams for a public repo I don't have access to should still pull data from that repo. I expect the streams to output data into Postgres tables.

Current Behavior

Only the issue_events and assignees tables are populated; every other table is blank. Not sure if this is because the repo I am pulling from is large and Airbyte can't handle it, or if it's because of permissions issues.

Logs

If applicable, please upload the logs from the failing operation. For sync jobs, you can download the full logs from the UI by going to the sync attempt page and clicking the download logs button at the top right of the logs display window.

Steps to Reproduce

  1. create a github source with this public repo: https://github.com/department-of-veterans-affairs/vets-website
  2. disable the collaborators, projects, events, and teams streams
  3. create a postgres destination
  4. run the sync job, it will succeed, but no data is returned.

Severity of the bug for you

Critical

Airbyte Version

0.24.7-alpha

Connector Version (if applicable)

Github singer: 0.2.7

Additional context

Environment, version, integration...

@Moofasax Moofasax added the type/bug Something isn't working label Jun 8, 2021
@Moofasax
Author

Moofasax commented Jun 8, 2021

logs-2-0 (1).txt
Attached logs

@Moofasax
Author

Moofasax commented Jun 8, 2021

Just attempted this with a smaller repo, https://github.com/ablevets/test-public, and I was able to pull data into all expected tables. Is this a performance issue with Airbyte? The repo I mentioned in the initial issue, https://github.com/department-of-veterans-affairs/vets-website, is really large.

@marcosmarxm
Member

@Moofasax from your logs

io.airbyte.config.ReplicationAttemptSummary@29f1d914[status=completed,recordsSynced=40444,bytesSynced=227838114,startTime=1623163689523,endTime=1623164818856]
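
For anyone reading the summary line above, here is a minimal sketch of how its fields could be pulled apart to see how many records the attempt reported and how long it ran. The helper name `parse_summary` is made up for illustration, and the sketch assumes the line keeps the `key=value` layout shown above with epoch-millisecond timestamps.

```python
import re

# The summary line exactly as it appears in the attached logs.
SUMMARY_LINE = (
    "io.airbyte.config.ReplicationAttemptSummary@29f1d914[status=completed,"
    "recordsSynced=40444,bytesSynced=227838114,"
    "startTime=1623163689523,endTime=1623164818856]"
)


def parse_summary(line: str) -> dict:
    """Extract key=value pairs from the bracketed part of the summary line.

    Hypothetical helper: it assumes comma-separated key=value pairs inside
    a single pair of square brackets.
    """
    match = re.search(r"\[(.*)\]", line)
    if match is None:
        raise ValueError("no bracketed summary found")
    fields = dict(pair.split("=", 1) for pair in match.group(1).split(","))
    # Convert purely numeric values to int, keep everything else as text.
    return {k: int(v) if v.isdigit() else v for k, v in fields.items()}


summary = parse_summary(SUMMARY_LINE)
# Timestamps are assumed to be epoch milliseconds.
duration_s = (summary["endTime"] - summary["startTime"]) / 1000
print(summary["status"], summary["recordsSynced"], f"{duration_s:.0f}s")
```

A completed status with a non-zero `recordsSynced` only says that records left the source; it does not say which streams they belong to, which is why checking individual tables is still needed.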

First, could you check the *_stargazers table?

@Moofasax
Author

Moofasax commented Jun 9, 2021

All stargazers tables are empty. The only two tables that have any data in them are the issue_events and assignees tables; all other tables are empty.

@Moofasax
Author

Moofasax commented Jun 9, 2021

Are you seeing this if you try what was stated in the original issue, i.e. exporting data from this public repo, https://github.com/department-of-veterans-affairs/vets-website, with the collaborators, projects, and events streams disabled?

@garden-of-delete

@Moofasax I may be experiencing the same thing. Try syncing the stargazers stream on its own

(connection settings -> update latest source schema button -> deselect all except stargazers -> save changes and reset data button -> update latest source schema button in the pop up -> save changes button -> reset button -> let the reset finish and start a new sync)

Also how many records does Airbyte say it synced when the task returns successful?

@Moofasax
Author

Moofasax commented Jun 9, 2021

It returns 40,444 and i will try to reset and sync only the stargazers stream now:
image

I noticed there's a new GitHub connector version, 0.2.8; should I try that one next?

@Moofasax
Author

Moofasax commented Jun 9, 2021

With only stargazers checked, it does return 100 records for the _stargazers tables!!

@garden-of-delete

With only stargazers checked, it does return 100 records for the _stargazers tables!!

@Moofasax @marcosmarxm this lines up with my experience of this issue

@Moofasax
Author

Moofasax commented Jun 9, 2021

thanks for reporting, i am trying this with the newer github singer 0.2.8 and will report. Also nice username @garden-of-delete

@Moofasax
Author

Moofasax commented Jun 9, 2021

Same issue on 0.2.8

@marcosmarxm
Member

@Moofasax and @garden-of-delete I had run an integration sync yesterday. Is it possible to try again?
image
image

@Moofasax
Author

What versions of everything? I can try!!

@marcosmarxm
Member

marcosmarxm commented Jun 10, 2021

What versions of everything? I can try!!

Airbyte 0.24.7-alpha
Github 0.2.7
Postgres 0.3.5

@Moofasax
Author

Moofasax commented Jun 10, 2021

@marcosmarxm I am still seeing the failed behavior, no other tables get created. Did you test with this repo? https://github.com/department-of-veterans-affairs/vets-website
image

@Moofasax
Author

Moofasax commented Jun 10, 2021

I tested with the latest version too, same result:
Airbyte: 0.25.0-alpha
Github: 0.2.8
postgres: 0.3.4

@marcosmarxm I don't seem to have the option for Postgres 0.3.5

@garden-of-delete

Worth adding to the conversation. I'm getting the same result as @Moofasax on github 0.2.8. I have tried with bigquery and postgres destinations and have the exact same result in both cases. Just FYI.

@Moofasax
Author

Based on @garden-of-delete's comment, does that pinpoint this as a GitHub singer issue and not a destination issue?

@marcosmarxm
Member

@Moofasax probably. I'm running a sync against apache/superset. It hasn't finished yet, but I already have records in the _raw tables. I'll update here once it finishes. This is the current state of my GitHub => Postgres sync connection, by the way.
image

@Moofasax
Author

I'm not seeing what apache/superset has to do with this, but that data does not look right; there should not be that many stargazers...

@marcosmarxm
Member

@Moofasax superset is the project @garden-of-delete works on, and it has 39k stars 😄
When I used my credentials with a repository I had full access to, such as Airbyte, I got all the data. When I pulled data from another repository that I don't have full access to, I faced the same issue as you. Looking at the logs, there are a lot of errors because Airbyte is trying to pull data from API endpoints that are not allowed for that credential. That shouldn't be a problem, but GitHub closed the connection, and Airbyte doesn't handle that error and finishes the sync.
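
To illustrate the failure mode described above, here is a minimal sketch of reading each stream independently, so that an endpoint the credential cannot access is skipped instead of ending the whole sync. This is not Airbyte's or the connector's actual code: `StreamAccessError`, `sync_streams`, and the two reader functions are hypothetical stand-ins, and no real GitHub calls are made.

```python
from typing import Callable, Dict, Iterable, List


class StreamAccessError(Exception):
    """Hypothetical error for an endpoint that rejects the credential."""


def sync_streams(
    readers: Dict[str, Callable[[], Iterable[dict]]],
) -> Dict[str, List[dict]]:
    """Read every stream on its own; skip streams the token cannot access."""
    results: Dict[str, List[dict]] = {}
    for name, read in readers.items():
        try:
            results[name] = list(read())
        except StreamAccessError as err:
            # Note the failure and continue with the remaining streams
            # rather than treating the sync as finished.
            print(f"skipping stream {name!r}: {err}")
            results[name] = []
    return results


def read_collaborators() -> Iterable[dict]:
    # Stand-in for an endpoint that needs more than read access.
    raise StreamAccessError("403 Forbidden")


def read_stargazers() -> Iterable[dict]:
    # Stand-in for an endpoint that is readable on a public repo.
    return [{"user": "octocat"}]


# The restricted stream is listed first to show that later streams still sync.
readers = {"collaborators": read_collaborators, "stargazers": read_stargazers}
results = sync_streams(readers)
```

With this shape, a rejected stream ends up as an empty result while the streams after it are still read.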

@Moofasax
Author

Oh I see I'm sorry.

So any idea why I can pull data from this public repo, https://github.com/ablevets/test-public, and not this public repo: https://github.com/department-of-veterans-affairs/vets-website? My access to both is exactly the same.

@marcosmarxm
Member

> So any idea on why I can pull data from this public repo, https://github.com/ablevets/test-public and not this public repo: https://github.com/department-of-veterans-affairs/vets-website my access to both are the exact same.

Probably because the first one is small, so the number of errors didn't exceed the limit at which the connection gets closed; the second one reached it.

@Moofasax #3695 is a WIP and will probably solve the issue. Is it possible to wait until that release? I'll keep you updated on any news.

@Moofasax
Author

Understood, keep me posted and I'll test!! Appreciate the support!

@garden-of-delete

@Moofasax @marcosmarxm

> @Moofasax superset is the project @garden-of-delete works and it has 39k stars 😄
> When I used my credentials with a repository I had full access like Airbyte to get all data. When I pull data from another repository that I don't have full access I faced the same issue as you. Looking at the logs there are a lot of errors because Airbyte is trying to pull data from API endpoints that are not allowed for that credential. Should be a problem, but Github closed the connection and Airbyte doesn't handle that error and finish the sync.

What permissions the token being used has on the repo could be a factor, but it does not explain the full issue. Specifically, it doesn't explain why I pull empty tables when syncing all streams, but can pull a populated table with the same token when syncing one stream at a time.

I pull data regularly from apache/superset, but i don't have write access to that repo, so my results should be the same as any of yours.

@keu keu added the blocked label Jun 14, 2021
@keu
Contributor

keu commented Jun 14, 2021

blocked by #3695

@Moofasax
Author

Moofasax commented Jul 8, 2021

Just tested this with the new native github source, still having this issue.

Streams that did not pull data: issues, pull_requests, reviews

Going to test with events streams enabled and see if that helps

@marcosmarxm
Member

@Moofasax #4612 was opened to solve the problem. You can remove both events streams and the sync should work.

@sherifnada sherifnada removed the blocked label Jul 9, 2021
@Moofasax
Author

Moofasax commented Jul 9, 2021

@marcosmarxm with the events streams disabled I still do not get data for issues, pull_requests, or reviews with this repo:
https://github.com/department-of-veterans-affairs/vets-api

@Zirochkaa Zirochkaa removed their assignment Jul 16, 2021
@yevhenii-ldv yevhenii-ldv self-assigned this Jul 19, 2021
@yevhenii-ldv
Contributor

Hi @Moofasax
Six days ago we released a new version of the GitHub native connector, 0.1.2, where we fixed the error described in ticket #4612.
That fix should also correct this issue.
You can upgrade your connector from the Admin page.
Please let us know if the issue was fixed. Thanks

@sherifnada
Contributor

sherifnada commented Jul 23, 2021

@Moofasax could you please reopen this issue if it persists? As far as we can tell, we cannot reproduce this on the new connector version.

@garden-of-delete

garden-of-delete commented Jul 23, 2021

I've tested 0.1.2 and am getting data for all streams now. thanks for all the hard work!

7 participants