Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: limit frequency of controller publications #1532

Closed
wants to merge 1 commit into from

Conversation

psFried
Copy link
Member

@psFried psFried commented Jul 23, 2024

Description:

Fixes #1520

We've observed a scenario where there's a very large capture with hundreds of collections all using schema inference. Among all of the collections, publications due to inferred schema updates are quite frequent. Because each collection publication results in a subsequent publication of the capture and materialization, it results in extremely frequent publications of those tasks, which has two important consequences:

  • It can prevent user-initiated publications from completing successfully due to PublicationSuperceded or ExpectPubIdNotMatched errors.
  • It can prevent the materialization from making progress, because every time it's activated it must re-start any in-progress reads of source journals. There may also be a data-plane bug that's affecting this, but the frequent publications are certainly a factor.

So we've decided to slow down the rate of controller-initiated publications to try to alleviate those symptoms. This introduces two new fields to the publication status: last_publication_time and min_publication_interval. Controllers will now wait at least min_publication_interval since the last_publication_time before publishing a spec in response to publication of dependencies.

The default value of min_publication_interval is 30 minutes, which is somewhat arbitrary, but aligns with the default update_delay of materializations. Since this value is part of the controller status, we can update it on a per-task basis.


This change is Reviewable

We've observed a scenario where there's a very large capture with
hundreds of collections all using schema inference. Among all of the
collections, publications due to inferred schema updates are quite
frequent. Because each collection publication results in a subsequent
publication of the capture and materialization, it results in
extremely frequent publications of those tasks, which has two
important consequences:

- It can prevent user-initiated publications from completing
  successfully due to `PublicationSuperceded` or `ExpectPubIdNotMatched`
  errors.
- It can prevent the materialization from making progress, because every
  time it's activated it must re-start any in-progress reads of source
  journals. There may also be a data-plane bug that's affecting this,
  but the frequent publications are certainly a factor.

So we've decided to slow down the rate of controller-initiated
publications to try to alleviate those symptoms. This introduces two new
fields to the publication status: `last_publication_time` and
`min_publication_interval`. Controllers will now wait at least
`min_publication_interval` since the `last_publication_time` before
publishing a spec in response to publication of dependencies.

The default value of `min_publication_interval` is 30 minutes, which is
somewhat arbitrary, but alignes with the default `update_delay` of
materializations. Since this value is part of the controller status,
we can update it on a per-task basis.
@psFried psFried marked this pull request as ready for review July 23, 2024 18:18
Copy link
Member

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (I didn't review deeply but makes sense)

}

// Note that publications in response to the source_capture being updated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: trails off

@psFried
Copy link
Member Author

psFried commented Jul 25, 2024

After further e2e testing, it seems that this approach does not really help overall. It just shifts the symptoms so that materializations will stay in a failed state for a longer time. We've discussed another option, which is to ensure that inferred schema updates are published more promptly, and I'm closing this PR in favor of that.

@psFried psFried closed this Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inferred schema updates blocking interactive publications of captures and materializations
2 participants