Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reinstate CheckpointAsync #616

Closed
bartelink opened this issue Aug 1, 2019 · 15 comments · Fixed by #2331
Closed

Reinstate CheckpointAsync #616

bartelink opened this issue Aug 1, 2019 · 15 comments · Fixed by #2331
Assignees

Comments

@bartelink
Copy link
Contributor

bartelink commented Aug 1, 2019

Is your feature request related to a problem? Please describe.

I've decoupled my consumption from my checkpointing (i.e., I don't necessarily synchronously process all change feed items, instead letting reading get ahead of the checkpointable position in order that I can manage buffering and retries for performance)

Describe the solution you'd like

Provide an overload that exposes something akin to the IChangeFeedProcessorContext.CheckpointAsync method which the v2 CFP API formerly exposed (it's present but internal atm)

Describe alternatives you've considered

Only alternative is to use the v2 CFP SDK, but that closes off tonnes of options and is not tenable from my perspective.

Additional context

  • I flagged this to @ealsur some time ago in the context of some other work in this repo. While Port to Azure.Cosmos / v3 SDK jet/equinox#144 illustrates that the V3 SDK provides some very nice cleanup in general for a relatively complex use case, not being able to port the CFP aspect easily presents a problem in the medium term with having client teams adopt the V3 SDK.

  • Consumption code

    • shows how we use (and need, CheckpointAsync)
    • illustrates that in general, the V3 API is good - it will cut out a lot of boilerplate wrapping we formerly had to provide.
    • never lets an exception escape from the processor observer function, avoiding the fact that the CFP logic does not retry and/or otherwise correctly handle exceptions.
    • illustrates the level of instrumentation (i.e. context re logging Range Ids etc) which cannot be exposed at the present time due to the absence of IChangeFeedProcessorContext (see related: Provide partitionid to ChangeFeedObserverFactory.Create #400)
  • Document parsing code

    • this code cannot be directly ported due to the fact that the Document class has become internal in V3 - it'd be nice to have a sample illustrate how one might most cheaply probe documents to determine whether they are parseable as a given type as was formerly possible.
@ealsur
Copy link
Member

ealsur commented Aug 2, 2019

@bartelink Is your request to add support for Manual Checkpointing?

@bartelink
Copy link
Contributor Author

Yes, in essence; that's the key facility that's missing compared to the CFP v2 API.

To do the full port, I will also need a way to do an equivalent of GetPropertyValue and a way to identify the partition/range id associated with each set of Documents from the changefeed so I can control the max read ahead on a per range basis.

@bartelink
Copy link
Contributor Author

bartelink commented Oct 4, 2019

I really want start migrating various apps of ours to the V3 SDK wrt CFP logic. I really can't use it as it is. Is there any rough roadmap re when there'll be space to consider this?

@ealsur
Copy link
Member

ealsur commented Oct 4, 2019

The CheckpointAsync method I'm not sure there will be a point if we move forward with enabling a higher degree of parallelization beyond the partition (ie, multiple threads can be reading the same partition, each thread with a different range of partition key values within the same partition). Won't this higher degree of processing solve the need for buffering?

Exposing the partition information won't happen, as this is something that is not exposed anywhere in the V3 SDK. At max, we could think about exposing the LeaseToken for context. While it currently works as 1 lease 1 partition, we are looking into expanding that as mentioned in the first paragraph, so it cannot be inferred from the LeaseToken.

Regarding your Document point, you have multiple options:

  • You could use dynamic, or JObject, or the class of your choice
  • You could use your own custom Serializer to manage the deserialization process.

@bartelink
Copy link
Contributor Author

CheckpointAsync ... Won't this higher degree of processing solve the need for buffering?

That depends on exactly what you have up your sleeves ;) The benefits of being able to decouple checkpointing from read/process/write cycle include:

  • not waiting for roundtrips to (what can be a highly contended) aux container
  • being able to overlap transmitting, parsing and filtering of data with processing
  • being able to maintain balanced throughput in the face of an item early in a sequence of batches which takes inordinately long to process (let's say I'm processing a reaction to something which happens to hit rate limiting due to a hotspot)
  • if one has algorithms that benefit from grouping and/or deduplicating by reading ahead, they can't succeed if only one batch from a given logical subset can ever be read at a time
  • if you restrict by max items, you can get very lumpy amounts of data, i.e. actual number of events and/or 'weight' of a batch can vary quite a bit
  • if you restrict by the proposed 'max RUs' limit, one can end up with a low number of items

Unfortunately, I could go on. The V2 API provides a very powerful scheme; archiving homegrown solutions was possible as a direct result. Can you share some more information as to the design of this scheme please in order to allay my concerns? While it was not my first choice to end up implementing a scheme leaning on this facility, it ultimately provides a very high throughput facility which would be a significant loss.

Inferring things from LeaseTokens definitely does not interest me. My desire for information as to which partition a received batch comes from arises from:

  • it's in V2
  • its very valuable to see if some partition is stalled from a troubleshooting/diagnostics perspective
  • when reading ahead, you don't want to get too far ahead on any individual partition in the interests of fairness
    In short, if I have 100 partitions and this processes is covering 33 of them, I want to a) know b) be able to control my progress on all 33 of them, not just through indirect means like the Estimator.

Thanks for the serialization suggestions. Might I suggest the CFP migration example show Document parsing vs doing that with Dynamic (have not tried to attack it or looked at the code, but it would seem that it would be relatively easy yet valuable to demo?)?

@bartelink
Copy link
Contributor Author

@ealsur still really interested on this - we're looking to move to V4 but can't even get off V2 until this API comes back

@ealsur
Copy link
Member

ealsur commented Feb 11, 2020

This is coming back in V3, right now we got jumped by high-pri work. We want to have that and Context and Estimator with all leases for March.

@ealsur ealsur self-assigned this Feb 11, 2020
@ylibrach
Copy link

Hi @ealsur , just wanted to check if ETA is still sometime in March?

@ealsur
Copy link
Member

ealsur commented Mar 19, 2020

Sadly it got a bit pushed back but prioritized still. We are working to release Change Feed pull support and this will be worked right after.

@ylibrach
Copy link

@ealsur, aside from bringing CheckpointAsync into v3 and v4, I'm also curious as to what is the current path for creating a ChangeFeedProcessor in v4? It seems that ChangeFeedProcessorBuilder<T> has been made internal in v4.

@ealsur
Copy link
Member

ealsur commented Apr 28, 2020

V4 won't have Change Feed Processor for the time being (short time), we are working on the base API surface. V4 is not ready for production and it needs first to pass review of base APIs, once base APIs are approved, we can start to onboard features.

@ylibrach
Copy link

I see. By "short time", what do you estimate?

@bartelink
Copy link
Contributor Author

Now that the RU cost discrepancies in V3 are finally resolved, the most significant blocker for moving to V3 for transactional processing has been removed from my perspective.

This brings this Issue back into focus for me; @ealsur

For my roadmap purposes, indicative dates are naturally always welcome, but I guess I'm also wondering when all 3 concerns ([RU consumption]((#990 (comment)), the reintroduction of CheckpointAsync, reintroduction of Context info) will be covered in a single release

I'm ready to validate all of these in the context of V4 when they land.

@ealsur
Copy link
Member

ealsur commented May 18, 2020

Once Change Feed pull model goes GA, I have one PR to enable Estimator per lease, and then another PR to introduce Checkpoint. The blocker for all this is GA of Change Feed pull model.

@bartelink
Copy link
Contributor Author

Any chance of a quick update on how this is all looking in terms of dependencies?

We've been long-fingering various CFP issues on the basis that we'll be moving from V2 to V3 within a reasonable timeframe.

(CheckpointAsync and source context for delivered batches are the critical items of interest that represent blockers)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants