Introduce an analytics sampling rate for VIP sites #48

Merged
ingeniumed merged 15 commits into trunk from modify/reduce-analytics-sampling-rate
Sep 6, 2023

Conversation

ingeniumed
Contributor

Description

This PR introduces a sampling algorithm for the analytics that are sent when this plugin is used on a VIP site. The sampling algorithm is as follows:

  • Sample every 10s, or
  • Sample every 10m.

This will ensure that we catch burst usage, or regular usage throughout the day without losing the core purpose behind the analytics.

Fixes #45

Steps to Test

  1. Spin up a local dev environment using vip dev-env.
  2. Adjust is_wpvip_site() to return true.
  3. Set a breakpoint inside is_it_sampling_time.
  4. Call http://<slug>.vipdev.lndo.site/wp-json/vip-block-data-api/v1/posts/1/blocks while keeping an eye on the time.
  5. The breakpoint should trigger every 10m or 10s, and at no other time.

@ingeniumed ingeniumed requested a review from a team as a code owner September 4, 2023 06:26
@@ -107,7 +107,7 @@ public function test_rest_api_returns_blocks_for_post() {
 'mediaId' => 6,
 'mediaLink' => 'https://gutenberg-block-data-api-test.go-vip.net/?attachment_id=6',
 'mediaType' => 'image',
-'align' => 'wide',
+'align' => 'none',

Contributor Author

From my local testing, it looks like the default value for this has changed so this keeps it in line with that. Wonder if it's a 6.3 change?

Contributor

Good catch! Looks like this was changed 6 months ago in Gutenberg, which was later merged into WordPress 6.3 like you said.

$minutes = $current_timestamp->format( 'i' );

// Get the seconds from the date.
$seconds = $current_timestamp->format( 's' );

Contributor

format() produces strings, so we should probably intval() these or expressions like 0 !== $seconds (i.e. 0 !== "0") won't work as expected.
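
To make the type issue concrete, here is a minimal sketch. It assumes a fixed timestamp in place of the real request time, and the variable names are illustrative rather than taken from the plugin. It shows that the strict `0 !==` check is always true for a string, while the modulo half of the condition still works because PHP coerces numeric strings.

```php
<?php
// Assumption: a fixed timestamp stands in for the request time.
$current_timestamp = new DateTimeImmutable( '2023-09-04 06:26:00', new DateTimeZone( 'UTC' ) );

$seconds_raw = $current_timestamp->format( 's' );           // string "00"
$seconds_int = intval( $current_timestamp->format( 's' ) ); // int 0

// Strict comparison checks type as well as value, so int 0 is never
// identical to a string, whatever the string contains.
$raw_is_nonzero = ( 0 !== $seconds_raw ); // true, even though the value is zero
$int_is_nonzero = ( 0 !== $seconds_int ); // false

// The modulo operator coerces numeric strings to integers, so this half
// of the check behaves the same with or without intval().
$raw_mod = $seconds_raw % 10; // int 0
```

Casting once with `intval()` keeps both halves of the condition working on the same type.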

Contributor

Actually, thinking about this a bit more we may not need checks if seconds/minutes is 0. If I'm doing my math right, the code (with proper types) would be skipping 1/6th of the sampling times:

00:00:00 -> Skip, seconds is 0
00:00:10 -> Send analytics
00:00:20 -> Send analytics
00:00:30 -> Send analytics
00:00:40 -> Send analytics
00:00:50 -> Send analytics
00:01:00 -> Skip, seconds is 0
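
The 1/6th figure can be checked with a short sketch that walks one minute of seconds. The `$rate` variable is a stand-in for the sampling-rate constant, and its value of 10 is taken from the PR description.

```php
<?php
// Assumption: $rate mirrors the 10-second sampling rate discussed above.
$rate = 10;

$with_zero_check    = array(); // seconds sampled when 0 is excluded
$without_zero_check = array(); // seconds sampled when 0 is allowed

foreach ( range( 0, 59 ) as $second ) {
	if ( 0 === $second % $rate ) {
		$without_zero_check[] = $second;

		// The extra check under discussion: skip when seconds is exactly 0.
		if ( 0 !== $second ) {
			$with_zero_check[] = $second;
		}
	}
}
```

With the zero check, 5 of the 6 candidate seconds per minute are sampled (10 through 50); without it, all 6 are (0 through 50).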

Contributor Author

@ingeniumed ingeniumed Sep 4, 2023

Yip, updated that! Good catch 👍🏾

*
* Current sampling algorithm is that every 10s or 10m, we send analytics.
*
* Max calls possible based on 1 call every 10s: 8640.

Contributor

@ingeniumed Please correct me if I'm wrong, but does this math work? We could have several requests come through in the same second which would result in multiple analytics calls in the same second. I think that's fine for sampling purposes, but it doesn't necessarily give us an upper limit on analytics requests.
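
A small simulation illustrates the point. `sketch_is_sampling_time()` is a hypothetical stand-in for the plugin's time check, not its actual implementation, and the request timestamps are made up. Because the check runs per request, every request that lands in a sampled second triggers an analytics call.

```php
<?php
// Hypothetical stand-in for the plugin's time check; not the actual implementation.
function sketch_is_sampling_time( int $unix_timestamp, int $rate_sec = 10 ): bool {
	$seconds = intval( gmdate( 's', $unix_timestamp ) );
	return 0 === ( $seconds % $rate_sec );
}

// Five requests arrive within the same sampled second, then one a second later.
$base     = gmmktime( 12, 0, 10, 9, 4, 2023 ); // 12:00:10 UTC on Sep 4, 2023
$requests = array( $base, $base, $base, $base, $base, $base + 1 );

$analytics_calls = 0;
foreach ( $requests as $request_time ) {
	if ( sketch_is_sampling_time( $request_time ) ) {
		++$analytics_calls;
	}
}
```

Here `$analytics_calls` ends up at 5 for a single sampled second, so the number of sampled seconds per day is not an upper bound on analytics requests.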

Contributor Author

You are right. I added that mainly for reference, but I don't want us to think that we can easily use it to determine actual usage. I've taken it out.

$seconds = $current_timestamp->format( 's' );

// Only send analytics every 10 minutes or 10 seconds.
if ( ( 0 !== $seconds && 0 === $seconds % WPCOMVIP__BLOCK_DATA_API__STAT_SAMPLING_RATE_SEC ) || ( 0 !== $minutes && 0 === $minutes % WPCOMVIP__BLOCK_DATA_API__STAT_SAMPLING_RATE_MIN ) ) {

Contributor

@ingeniumed Can you explain a bit why we're doing one sample every 10 seconds and another every 10 minutes? It seems like the sampling math is easier to estimate real usage if we stick with just one of these. Even just doing seconds would reduce analytics by 90% but still give us relatively good coverage.
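
The 90% figure can be sanity-checked by counting matching seconds over one hour. This is a rough sketch that assumes traffic is spread evenly across seconds and ignores the `0 !==` checks in the condition above. `$rate` is an illustrative variable, set to 10 for both the seconds and minutes rates.

```php
<?php
// Assumption: both sampling rates are 10, as described in the PR.
$rate = 10;

$total              = 0;
$seconds_only       = 0; // seconds matched by the 10s rule alone
$seconds_or_minutes = 0; // seconds matched by the 10s rule or the 10m rule

foreach ( range( 0, 59 ) as $minute ) {
	foreach ( range( 0, 59 ) as $second ) {
		++$total;

		$second_hit = ( 0 === $second % $rate );
		$minute_hit = ( 0 === $minute % $rate );

		if ( $second_hit ) {
			++$seconds_only;
		}
		if ( $second_hit || $minute_hit ) {
			++$seconds_or_minutes;
		}
	}
}

$seconds_only_share       = $seconds_only / $total;       // 0.10
$seconds_or_minutes_share = $seconds_or_minutes / $total; // 0.19
```

Under these assumptions the seconds-only rule samples 360 of 3600 seconds (10%), so scaling a sampled count back up is a simple multiplication by 10. The combined rule samples 684 of 3600 seconds (19%) and does so unevenly, since every second of each tenth minute is included.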

Contributor Author

Discussed, and I have dropped the 10m leaving just 10s in place.

Contributor

@alecgeatches alecgeatches left a comment

Sampling is a great way to reduce our analytics overhead with no extra db/caching cost, I love it. I have a few questions above about the implementation.

@@ -942,6 +942,8 @@ The plugin records two data points for analytics, on VIP sites:

Both of these data points are a counter that is incremented, and do not contain any other telemetry or sensitive data. You can see what's being [collected in code here][repo-analytics], and WPVIP's privacy policy [here](https://wpvip.com/privacy/).

In addition, the analytics are sent every 10 seconds only.

Contributor

We could say "analytics are sampled" instead; that way we don't have to update the README if we change the sampling rate.

Contributor Author

While I do see what you mean, I would lean towards being open about the sampling rate as well as the method with which we do that in the README. That way, we are being transparent about it and when we do change it we also have to update the README to reflect that.

$seconds = intval( $current_timestamp->format( 's' ) );

// Only send analytics every 10 seconds.
if ( 0 === ( $seconds % WPCOMVIP__BLOCK_DATA_API__STAT_SAMPLING_RATE_SEC ) ) {

Contributor

Someday we may want a hook so that customers can set their own sampling rate, no lower than X, but that's for a different day.

Contributor Author

I do see the value of that, especially if they are at a scale where even the tiniest change could have a large impact. We would need to figure out a good floor that we can use, that would still make the data useful to us. I'll add this to the backlog.

Contributor

@smithjw1 smithjw1 left a comment

Does what it says on the tin!

@ingeniumed ingeniumed merged commit 54ceb30 into trunk Sep 6, 2023
2 checks passed
@ingeniumed ingeniumed deleted the modify/reduce-analytics-sampling-rate branch September 6, 2023 22:11
Development

Successfully merging this pull request may close these issues.

Make analytics calls faster
3 participants