
feat(server): Org rate limit per metric bucket #2758

Merged
merged 42 commits into from
Dec 6, 2023

Conversation

@TBS1996 TBS1996 commented Nov 23, 2023

part of: #2716

In order to protect our Kafka metric consumers, we want a way of rate limiting based on the number of buckets, as that is what determines the load placed on our Kafka topics.

We are starting with just the org throughput limits; this will be expanded further as outlined in the linked epic.
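The idea in the description can be sketched as a per-org counter over a fixed time window: admit a flush only if the org's bucket count for the current window stays under its limit. This is a minimal illustrative sketch, not Relay's actual implementation (which enforces quotas via Redis); all names here are hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical org-level bucket rate limiter using a fixed window.
struct OrgBucketLimiter {
    limit: usize,                            // max buckets per org per window
    window: Duration,                        // length of the fixed window
    counts: HashMap<u64, (Instant, usize)>,  // org_id -> (window start, count)
}

impl OrgBucketLimiter {
    fn new(limit: usize, window: Duration) -> Self {
        Self { limit, window, counts: HashMap::new() }
    }

    /// Returns true if `bucket_qty` buckets from `org_id` are admitted;
    /// false means the flush should be dropped and outcomes emitted.
    fn check(&mut self, org_id: u64, bucket_qty: usize) -> bool {
        let now = Instant::now();
        let entry = self.counts.entry(org_id).or_insert((now, 0));
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0); // window expired: start a new one
        }
        if entry.1 + bucket_qty > self.limit {
            return false; // rate limited
        }
        entry.1 += bucket_qty;
        true
    }
}
```

Note the limit is applied to the whole flush at once, matching the "binary basis" behavior discussed later in this thread: a flush is either fully admitted or fully rejected.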

@TBS1996 TBS1996 self-assigned this Nov 24, 2023
@TBS1996 TBS1996 marked this pull request as ready for review November 30, 2023 11:08
@TBS1996 TBS1996 requested a review from a team November 30, 2023 11:08
@TBS1996 TBS1996 changed the title from feat(server): org total rate limit to feat(server): Org rate limit per batch Nov 30, 2023
@TBS1996 TBS1996 requested a review from a team as a code owner December 1, 2023 09:29

@Dav1dde Dav1dde left a comment


Just some quick nits; I'll have to look into the outcome logging more carefully later.

I would like it if we didn't have to duplicate this and could somehow share as much code as possible with the ManagedEnvelope logic (when we drop the envelope, transaction and profile outcomes are created). But I'll need more time to think about this.

let _bucket_qty = buckets.len();
let bucket_partitions = partition_buckets(scoping.project_key, buckets, partitions);

#[cfg(feature = "processing")]
Member
I think it makes sense to check the cached rate limits even on non-processing relays

Contributor Author

Well, at the moment they just get updated from Redis, which isn't available on non-processing relays, so I'm not sure that would make sense.

Member

I am not sure exactly how propagation of quotas/limits to PoPs works, but I don't see a reason why we can't check it now; once we actually have working propagation, everything will just work.


@iker-barriocanal iker-barriocanal left a comment


Does this PR rate-limit buckets on a binary basis? Do we want to consider scenarios in which some buckets are rate-limited but others aren't?


self.drop_buckets_with_outcomes(
reason_code,
total_batches,
Contributor

What's the difference between total_batches and bucket_qty (function param), the maximum size of the bucket? Can we use just one of the two? It seems odd to me that sometimes we use one and sometimes the other (see when emitting outcomes and on log messages).

@Dav1dde Dav1dde Dec 5, 2023
bucket_qty is the total number of buckets in the flush; total_batches is the number of batches we are writing upstream (i.e. the number of requests to upstream, or the number of messages on the Kafka topic).
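The distinction can be illustrated with a small helper: batching a flush of `bucket_qty` buckets with a fixed maximum batch size yields `total_batches` via ceiling division. This is an illustrative sketch, not Relay's actual code; the function name and parameters are hypothetical.

```rust
/// Hypothetical helper: given the total number of buckets in a flush
/// (`bucket_qty`) and the maximum number of buckets per upstream batch,
/// compute `total_batches`, the number of requests to upstream (or
/// messages on the Kafka topic).
fn total_batches(bucket_qty: usize, max_batch_size: usize) -> usize {
    if bucket_qty == 0 {
        0
    } else {
        // ceiling division: a partially filled final batch still counts
        (bucket_qty + max_batch_size - 1) / max_batch_size
    }
}
```

So the two quantities usually differ: a flush of 100 buckets sent in batches of at most 30 produces 4 batches, which is why using one for outcomes and the other for log messages reads inconsistently.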

event_id: None,
remote_addr: None,
category: DataCategory::Metrics,
quantity: total_batches as u32,
Contributor

Should we exclude buckets with the usage metric from rate-limited outcomes? These metrics are data we (Sentry) generate that the user doesn't care about, and I assume these outcomes will be presented to them.

@Dav1dde Dav1dde commented Dec 5, 2023

@iker-barriocanal

Does this PR rate-limit buckets on a binary basis? Do we want to consider scenarios in which some buckets are rate-limited but others aren't?

This is on the roadmap but not planned for this PR; the idea is to rate limit based on namespace (there is some info in the epic: #2716).

@jan-auer jan-auer changed the title from feat(server): Org rate limit per batch to feat(server): Org rate limit per metric bucket Dec 6, 2023

@jan-auer jan-auer left a comment


This looks good to me. Could you add an integration test?

/**
* Metric bucket.
*/
RELAY_DATA_CATEGORY_METRIC_BUCKET = 15,
Member

Note that this will require a release of the python library and bump in Sentry and Snuba.

CHANGELOG.md
@@ -10,6 +10,8 @@
- Return global config ready status to downstream relays. ([#2765](https://github.com/getsentry/relay/pull/2765))
- Add Mixed JS/Android Profiles events processing. ([#2706](https://github.com/getsentry/relay/pull/2706))
- Allow to ingest measurements on a span. ([#2792](https://github.com/getsentry/relay/pull/2792))
- Add size limits on metric related envelope items. ([#2800](https://github.com/getsentry/relay/pull/2800))
- Include the size offending item in the size limit error message. ([#2801](https://github.com/getsentry/relay/pull/2801))
Member

These changes shouldn't be here.

source_quantities += source_quantities_from_buckets(&BucketsView::new(buckets), mode);
}

let timestamp = UnixTimestamp::now().as_datetime().unwrap_or_else(Utc::now);
Member

Is this not just Utc::now()? Why the roundtrip through UnixTimestamp?

]

def generate_ticks():
# Generate a new timestamp for every bucket, so they do not get merged by the aggregator
Member

We can also use tags for this; it should be easier than having timestamps tied to bucket_interval.
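The suggestion works because the aggregator merges buckets by key, and that key includes the tags: two buckets with the same metric name and timestamp but different tag values stay separate. This is an illustrative sketch of that merge behavior with hypothetical types, not Relay's actual aggregator code.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical bucket key: buckets are merged only when the whole key
/// matches, so a unique tag value keeps buckets apart just as a unique
/// timestamp would.
#[derive(Hash, PartialEq, Eq)]
struct BucketKey {
    metric_name: String,
    timestamp: u64,
    tags: BTreeMap<String, String>, // BTreeMap so the key is hashable
}

/// Merge counter buckets by key, summing their values.
fn aggregate(buckets: Vec<(BucketKey, f64)>) -> HashMap<BucketKey, f64> {
    let mut merged: HashMap<BucketKey, f64> = HashMap::new();
    for (key, value) in buckets {
        *merged.entry(key).or_insert(0.0) += value;
    }
    merged
}
```

In a test, giving each generated bucket a distinct tag value therefore prevents merging without having to tie timestamps to bucket_interval.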

@TBS1996 TBS1996 merged commit 8353064 into master Dec 6, 2023
20 checks passed
@TBS1996 TBS1996 deleted the tor/orgtotal branch December 6, 2023 14:07
jjbayer added a commit that referenced this pull request Dec 6, 2023
jjbayer added a commit that referenced this pull request Dec 6, 2023
Reverts #2758

This was deployed before Sentry updated librelay, so Sentry crashed with an unknown data category.
TBS1996 added a commit that referenced this pull request Dec 7, 2023
We had an incident when deploying
#2758 because we unexpectedly
sent outcomes of type ItemBucket without having updated Sentry.

This PR only adds the new data category so we can properly synchronize on
both ends before adding the business logic.

#2821 (comment)
jjbayer added a commit that referenced this pull request Dec 13, 2023
part of: #2716

In order to protect our Kafka metric consumers, we want a way of
rate limiting based on the number of buckets, as that is what determines
the load placed on our Kafka topics.

We are starting with just the org throughput limits; this will be
expanded further as outlined in the linked epic.

This is a re-revert of #2758
(reverted in #2821).

---------

Co-authored-by: Joris Bayer <[email protected]>
5 participants