-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix(source-s3): Optimize File-Based performance in Python CDK and Source-S3 #45721
base: master
Are you sure you want to change the base?
Conversation
airbyte-integrations/connectors/source-s3/acceptance-test-config.yml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this! Left a bunch of comments.
@aaronsteers what's your thinking here, do you want to try and ship this next week, if we figure out how to make it non-breaking?
@@ -30,7 +30,7 @@ | |||
) | |||
from airbyte_cdk.test import entrypoint_wrapper | |||
from pydantic import BaseModel | |||
from source_s3.v4.source import SourceS3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were you able to delete the whole v4 namespace and move it into the root one? I was very convfused by us having two namespaces there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@natikgadzhi - Yes, that's right. This removes the v4
namespace entirely. When I changed the Cursor to fully use the Concurrent model, it more-or-less forced us to either spend a bunch more time debugging migrations, or move entirely to the new model. And the benefit of removing support is that there's a lot less code to support, and the connector itself should be more maintainable going forward.
@@ -10,7 +10,7 @@ data: | |||
connectorSubtype: file | |||
connectorType: source | |||
definitionId: 69589781-7828-43c5-9f63-8925b1c1ccc2 | |||
dockerImageTag: 4.9.0 | |||
dockerImageTag: 5.0.0-rc.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beware that as or fight now that would trigger a breaking change, which would mean that people would need to opt into it.
This should not stop us, but I wonder if we can clarify whether the platform would treat a major connector version as breaking by default, and if it will not — perhaps we can change our metadata validation as well to not require major version to come with releases.breakingChanges entry. /cc @alafanechere @pedroslopez
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't strictly need to be a 'v5'. But in my head at least the benefits are: not supporting 'v3' state anymore, plus giving a narrative around "migrate to v4 before migrating to v5" seems a very clear and easy-to-communicate message to deliver for users.
We could do this without bumping to v5, and the message would be "v3 users need to migrate to v4.9 before migrating to v4.10+". Not as concise, but still reasonable.
Another factor in the conversation is that its been a year now since v4 came out, and the upgrade deadline passed earlier this month.
breakingChanges: | ||
4.0.0: | ||
message: UX improvement, multi-stream support and deprecation of some parsing features | ||
upgradeDeadline: "2023-10-05" | ||
4.0.4: | ||
message: Following 4.0.0 config change, we are eliminating the `streams.*.file_type` field which was redundant with `streams.*.format` | ||
upgradeDeadline: "2023-10-18" | ||
5.0.0: | ||
# Technically, this may not actually be a breaking change because the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. Let's avoid making this breaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rationale: every breaking change would mean stopped syncs. We don't want to stop syncs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agreed. And I think we can without issues. (See other comments above.)
@@ -0,0 +1,11 @@ | |||
performance_tests: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to avoid this in favor of having metadata entries and doing the slash command wrapper around cloud for performance syncs. I won't block the PR, but I think it's best to do the right thing here. Perhaps a path forward is to keep this local and not have another test config committed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. I was actually removing the performance-test artifacts but missed this particular file.
For context: I built this in reverse order to what we are reviewing and merging. First wrote a major Polars rewrite, then spun off smaller scope here, then spun out an even smaller scope to get the v5 CDK bump done. The perf tests harness is from several weeks ago now, and I have other ways to test that now: the PyAirbyte validate
command and in the near future we should have the slash-command.
What would be super helpful right now is a semantic language and "home" for how we want to define performance tests. My two best bets are (1) add to metadata.yaml or (2) add into the acceptance-test-config.yaml
(which this performance-test-config.yml
was modeled after). Even if they are a no-op right now, would be helpful to start putting those somewhere in the meanwhile...
source-s3 = "source_s3.run:run" | ||
[tool.poetry.dependencies.airbyte-cdk] | ||
extras = ["file-based",] | ||
version = "^5.7.4" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might as well update the explicit piece here
version = "^5.7.4" | |
version = "^5.14.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YES! There was a critical CDK update in 5.7 that drove me to make sure this was the minimum - because otherwise a version conflict was forcing it silently to an older CDK version. Happy to bump to the latest, and will apply this suggestion. 👍
@@ -37,4 +42,6 @@ moto = "==4.2.14" | |||
docker = "^7.0.0" | |||
pytest-mock = "^3.6.1" | |||
requests-mock = "^1.9.3" | |||
pandas = "^2.0.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The pandas pin was a hack because --use-local-cdk doesn't properly pin/re-pin dependencies, which forces connector maintainers to "fake it" by forcing their dependencies list to deterministically resolve to the same thing that their local CDK resolves to.
We should consider updating airbyte-ci
to re-lock dependencies when using --use-local-cdk
, as this will reduce the need to redundantly pin dependencies in the connector when consuming the local CDK.
@@ -0,0 +1,100 @@ | |||
# Copyright (c) 2024 Airbyte, Inc., all rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's either leave this out of the PR, or just make the slashcommand with this as we discussed with @johnny-schmidt. I'm happy to help you speedrun / pair on it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be convinved to just leave this in as the first step and then just not forget to migrate from this in favor of the slash command as a follow-up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This slipped in - sorry. See my comment above. Will remove before merging.
On the one hand: I'm excited to test this and get it running.
On the other: This is admittedly a hack, and 100% just being used as an iteration space. No need to commit in this particular PR, although I do want to see it working and have easy-access when it's working smoothly.
@natikgadzhi re:
Would love to ship next week. I marked as "breaking" just to force the conversation. I'm hoping we can do without actually needing to mark as breaking/blocking for users. |
What
This spins out all the fixes, improvements, and new tests from the Polars S3 PR:
How
Review guide
User Impact
Can this PR be safely reverted and rolled back?