Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(airbyte-cdk): Add Per Partition with Global fallback Cursor #45125
base: master
Are you sure you want to change the base?
feat(airbyte-cdk): Add Per Partition with Global fallback Cursor #45125
Changes from 23 commits
49e75bf
6874c76
16847c1
10c09a8
01a55a7
5c90376
baa8af9
766bd72
e18d430
bfd4a45
2c60ef0
506d296
d8e27d4
0dc55e1
fadb1fd
dee105f
379f11d
6f68b1b
96f1150
3042d28
8cfc0ba
bd507e0
e6976fc
c998f5c
58c96f2
1fb01a9
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably outside of the scope of this PR but could we eventually have just one cursor as a parameter here? I'm trying to understand why we need both cursor and it seems like we could just have one of the interfaice
Cursor
and the filtering code would look like:If we agree that this is a path forward, I'll create an issue for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not possible now. The issue is that
_substream_cursor
doesn't have methods to work with the cursor, for example:select_best_end_datetime
,parse_date
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand: if we use
if self._cursor.should_be_synced(record)
,select_best_end_datetime
andparse_date
can be private, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be
self._over_limit
? From my understanding, we are incrementing this value while we're streaming the slices. And we only start incrementing the moment the number of cursors exceeds 10,000.So unless I'm misunderstanding, if we have hypothetically 10,500 cursors, max of 10,000, then the final resulting value of
self._over_limit
= 500.And then in
limit_reached()
we compare 500 >self.DEFAULT_MAX_PARTITIONS_NUMBER
which will be false. So in order for this to return true we would need 20,000 partitions.My question is why do we use
_over_limit
instead oflen(self._cursor_per_partition)
when we calllimit_reached()
from thePerPartitionWithGlobalCursor
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right—we increment
self._over_limit
each time we drop an old partition after exceeding ourDEFAULT_MAX_PARTITIONS_NUMBER
(10,000). So if we have 10,500 partitions, we'll drop 500 oldest ones, andself._over_limit
becomes 500.We compare
self._over_limit
toDEFAULT_MAX_PARTITIONS_NUMBER
inlimit_reached()
. We only switch to a global cursor whenself._over_limit
exceeds 10,000 (i.e., we've had to drop over 10,000 partitions). This way, we avoid switching too early due to temporary spikes in partition counts.Using
len(self._cursor_per_partition)
wouldn't help here because we've capped it at 10,000—it'll never exceed the limit since we keep removing old partitions to stay within that number.I updated the doc with this explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. I think it wasn't clear to me that the intent was to switch to global cursor state once we've dropped X number of records (in this case 10,000). I thought the intent was soley based on once we exceeded 10,000 records to return
True
.Thanks for updating the docs, but lets also add a small docstring here too. Something along the lines of saying that
this method returns true after the number of dropped partitions from state exceeds the default max partitions. This is used to prevent against spikes in partition counts