Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix(source-s3): Optimize File-Based performance in Python CDK and Source-S3 #45721

Draft
wants to merge 105 commits into
base: master
Choose a base branch
from

Conversation

aaronsteers
Copy link
Collaborator

@aaronsteers aaronsteers commented Sep 21, 2024

What

This spins out all the fixes, improvements, and new tests from the Polars S3 PR:

How

Review guide

User Impact

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@aaronsteers aaronsteers marked this pull request as ready for review October 18, 2024 17:36
@octavia-squidington-iv octavia-squidington-iv requested a review from a team October 18, 2024 17:47
Copy link
Contributor

@natikgadzhi natikgadzhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this! Left a bunch of comments.

@aaronsteers what's your thinking here, do you want to try and ship this next week, if we figure out how to make it non-breaking?

.gitignore Show resolved Hide resolved
@@ -30,7 +30,7 @@
)
from airbyte_cdk.test import entrypoint_wrapper
from pydantic import BaseModel
from source_s3.v4.source import SourceS3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you able to delete the whole v4 namespace and move it into the root one? I was very convfused by us having two namespaces there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@natikgadzhi - Yes, that's right. This removes the v4 namespace entirely. When I changed the Cursor to fully use the Concurrent model, it more-or-less forced us to either spend a bunch more time debugging migrations, or move entirely to the new model. And the benefit of removing support is that there's a lot less code to support, and the connector itself should be more maintainable going forward.

@@ -10,7 +10,7 @@ data:
connectorSubtype: file
connectorType: source
definitionId: 69589781-7828-43c5-9f63-8925b1c1ccc2
dockerImageTag: 4.9.0
dockerImageTag: 5.0.0-rc.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beware that as or fight now that would trigger a breaking change, which would mean that people would need to opt into it.

This should not stop us, but I wonder if we can clarify whether the platform would treat a major connector version as breaking by default, and if it will not — perhaps we can change our metadata validation as well to not require major version to come with releases.breakingChanges entry. /cc @alafanechere @pedroslopez

Copy link
Collaborator Author

@aaronsteers aaronsteers Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't strictly need to be a 'v5'. But in my head at least the benefits are: not supporting 'v3' state anymore, plus giving a narrative around "migrate to v4 before migrating to v5" seems a very clear and easy-to-communicate message to deliver for users.

We could do this without bumping to v5, and the message would be "v3 users need to migrate to v4.9 before migrating to v4.10+". Not as concise, but still reasonable.

Another factor in the conversation is that its been a year now since v4 came out, and the upgrade deadline passed earlier this month.

breakingChanges:
4.0.0:
message: UX improvement, multi-stream support and deprecation of some parsing features
upgradeDeadline: "2023-10-05"
4.0.4:
message: Following 4.0.0 config change, we are eliminating the `streams.*.file_type` field which was redundant with `streams.*.format`
upgradeDeadline: "2023-10-18"
5.0.0:
# Technically, this may not actually be a breaking change because the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. Let's avoid making this breaking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rationale: every breaking change would mean stopped syncs. We don't want to stop syncs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agreed. And I think we can without issues. (See other comments above.)

@@ -0,0 +1,11 @@
performance_tests:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid this in favor of having metadata entries and doing the slash command wrapper around cloud for performance syncs. I won't block the PR, but I think it's best to do the right thing here. Perhaps a path forward is to keep this local and not have another test config committed.

Copy link
Collaborator Author

@aaronsteers aaronsteers Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. I was actually removing the performance-test artifacts but missed this particular file.

For context: I built this in reverse order to what we are reviewing and merging. First wrote a major Polars rewrite, then spun off smaller scope here, then spun out an even smaller scope to get the v5 CDK bump done. The perf tests harness is from several weeks ago now, and I have other ways to test that now: the PyAirbyte validate command and in the near future we should have the slash-command.

What would be super helpful right now is a semantic language and "home" for how we want to define performance tests. My two best bets are (1) add to metadata.yaml or (2) add into the acceptance-test-config.yaml (which this performance-test-config.yml was modeled after). Even if they are a no-op right now, would be helpful to start putting those somewhere in the meanwhile...

cc @alafanechere

source-s3 = "source_s3.run:run"
[tool.poetry.dependencies.airbyte-cdk]
extras = ["file-based",]
version = "^5.7.4"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well update the explicit piece here

Suggested change
version = "^5.7.4"
version = "^5.14.0"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YES! There was a critical CDK update in 5.7 that drove me to make sure this was the minimum - because otherwise a version conflict was forcing it silently to an older CDK version. Happy to bump to the latest, and will apply this suggestion. 👍

@@ -37,4 +42,6 @@ moto = "==4.2.14"
docker = "^7.0.0"
pytest-mock = "^3.6.1"
requests-mock = "^1.9.3"
pandas = "^2.0.3"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The pandas pin was a hack because --use-local-cdk doesn't properly pin/re-pin dependencies, which forces connector maintainers to "fake it" by forcing their dependencies list to deterministically resolve to the same thing that their local CDK resolves to.

We should consider updating airbyte-ci to re-lock dependencies when using --use-local-cdk, as this will reduce the need to redundantly pin dependencies in the connector when consuming the local CDK.

@@ -0,0 +1,100 @@
# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's either leave this out of the PR, or just make the slashcommand with this as we discussed with @johnny-schmidt. I'm happy to help you speedrun / pair on it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be convinved to just leave this in as the first step and then just not forget to migrate from this in favor of the slash command as a follow-up

Copy link
Collaborator Author

@aaronsteers aaronsteers Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slipped in - sorry. See my comment above. Will remove before merging.

On the one hand: I'm excited to test this and get it running.

On the other: This is admittedly a hack, and 100% just being used as an iteration space. No need to commit in this particular PR, although I do want to see it working and have easy-access when it's working smoothly.

@aaronsteers
Copy link
Collaborator Author

aaronsteers commented Oct 18, 2024

@natikgadzhi re:

@aaronsteers what's your thinking here, do you want to try and ship this next week, if we figure out how to make it non-breaking?

Would love to ship next week. I marked as "breaking" just to force the conversation. I'm hoping we can do without actually needing to mark as breaking/blocking for users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues CDK Connector Development Kit connectors/source/s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants