From 1aec5fba88f8be0f28cc4679b4b8775740ea0b64 Mon Sep 17 00:00:00 2001 From: Neil Kakkar Date: Fri, 25 Nov 2022 11:20:15 +0000 Subject: [PATCH 1/4] Create resilient-feature-flags --- requests-for-comments/resilient-feature-flags | 52 +++++++++++++++++++ 1 file changed, 52 insertions(+) create mode 100644 requests-for-comments/resilient-feature-flags diff --git a/requests-for-comments/resilient-feature-flags b/requests-for-comments/resilient-feature-flags new file mode 100644 index 0000000..d80a8b7 --- /dev/null +++ b/requests-for-comments/resilient-feature-flags @@ -0,0 +1,52 @@ +# Request for comments: Resilient feature flags + +## Problem statement +*Who are we building for, what are their needs (include example cases), why is this important?* + +We've recently realised a few limitations with feature flags when PostHog goes down. This isn't great for permanent feature flags as it causes things to break +in unexpected ways & for clients to think really hard about defaults & write code defensively. + +Ideally, our client libraries and APIs should 'just work', even if there's no error handling on the user's side. + +So, how do we make this happen? + +## Context + +The way things currently work is: TODO + +Decide can run on any arbitrary anonymous ID. + +Caching decide responses doesn't make too much sense, since it's called usually only once per distinct ID per day. Every anonymous user has a new ID, and except +for users coming back to the app again & again on the same day, we don't achieve much; while introducing new problems like cache invalidation. + +## Solution + +Introducing, decide v3. + +This is a good opportunity to fix other customer frustrations around this endpoint, and pave the way for breaking changes we've wanted to introduce for a while. + +In short: + +1. Decide v3 returns all flags, with a `true` or `false` based on whether they're enabled or not, vs. implicitly just not returning the false flags. +2. Decide v3 does a best effort calculation of flags when PostHog DBs are down, and tells the client that it faced issues computing all flags. +3. Decide v3 can return a JSON payload with the flag variant (sneaking this into the major version API update) + +On the SDK side: + +1. We stop obliterating all flags if decide returns a 500, and based on the parameter in (2), update only flags that were computed again, while keeping old values of other flags. +2. We add timeouts to flushes & shutdowns + + +To make this possible, we cache flag definitions (and not the decide responses) on our side, so these are available even if the DB is down. + + + + +## Context +*What are our competitors doing, what are the technical constraints, what are customers asking for, what does the data tell us, are there external motivations (e.g. launch week, enterprise contract)?* + +## Design +*What are the key user experience and technical design decisions / trade-offs?* + +## Sprints +*How do we break this into discrete and valuable chunks for the folks shipping it? How do we ensure it's high quality and fast?* From c3bb81d57c17bea5d4f80a4d4b09e9c5f11faf3d Mon Sep 17 00:00:00 2001 From: Neil Kakkar Date: Fri, 25 Nov 2022 11:26:42 +0000 Subject: [PATCH 2/4] Update requests-for-comments/resilient-feature-flags --- requests-for-comments/resilient-feature-flags | 2 ++ 1 file changed, 2 insertions(+) diff --git a/requests-for-comments/resilient-feature-flags b/requests-for-comments/resilient-feature-flags index d80a8b7..3ef1313 100644 --- a/requests-for-comments/resilient-feature-flags +++ b/requests-for-comments/resilient-feature-flags @@ -39,7 +39,9 @@ On the SDK side: To make this possible, we cache flag definitions (and not the decide responses) on our side, so these are available even if the DB is down. +It makes sense to include (3) with (1) for decide because the shape of the response is changing anyways, and it's much easier to handle both at the same time in all client libraries, than doing these one by one. +We don't have to build the rest of it out right now, but having this functionality here means when do build it, it will be extraordinarily faster to do. ## Context From 483770338f5d580692e8304530221e6de72d2ba1 Mon Sep 17 00:00:00 2001 From: Neil Kakkar Date: Tue, 29 Nov 2022 15:06:00 +0000 Subject: [PATCH 3/4] Update resilient-feature-flags --- requests-for-comments/resilient-feature-flags | 33 +++++++++++++++---- 1 file changed, 26 insertions(+), 7 deletions(-) diff --git a/requests-for-comments/resilient-feature-flags b/requests-for-comments/resilient-feature-flags index 3ef1313..bd6d5e1 100644 --- a/requests-for-comments/resilient-feature-flags +++ b/requests-for-comments/resilient-feature-flags @@ -4,7 +4,7 @@ *Who are we building for, what are their needs (include example cases), why is this important?* We've recently realised a few limitations with feature flags when PostHog goes down. This isn't great for permanent feature flags as it causes things to break -in unexpected ways & for clients to think really hard about defaults & write code defensively. +in unexpected ways & forces clients to think really hard about defaults & write code defensively. Ideally, our client libraries and APIs should 'just work', even if there's no error handling on the user's side. @@ -12,12 +12,31 @@ So, how do we make this happen? ## Context -The way things currently work is: TODO +The (simplified) way things currently work is: -Decide can run on any arbitrary anonymous ID. +Whenever a user shows up on a page, we send a request to `/decide` with their randomly generated anonymous ID to parse feature flags. Usually these people don't have +properties attached to them, so only flags that don't depend on user properties resolve to `true` (assuming they're within the rollout bound). -Caching decide responses doesn't make too much sense, since it's called usually only once per distinct ID per day. Every anonymous user has a new ID, and except -for users coming back to the app again & again on the same day, we don't achieve much; while introducing new problems like cache invalidation. +Things change when we call `identify()` on a user, as this updates their ID, and fires another request to `/decide` to get new flags based on not only current +user properties, but any past properties that were set for this user. + +We need the database to: +(a) Validate project tokens and find relevant team +(b) Get feature flag definitions for the given team +(c) Evaluate above flags using the person's properties stored in the DB + +Note that `decide` can run on any arbitrary anonymous ID. + +Caching decide responses doesn't make too much sense, since it's called usually only once per distinct ID per session. Every anonymous user has a new ID, and except +for users coming back to the app again & again on the same day, we don't achieve resiliency as we can't cache for users never seen before, nor can we assume that properties +haven't changed; while introducing new problems like cache invalidation. + +### Possible failure modes + +1. The database is down/unreachable: `decide` evaluation fails and returns 500 +2. The servers are down/unreachable: requests from client libraries time out / error out. + +(2) seems hard to defend against without a distributed app distribution (and then a resolver to go to the 'correct' app), but (1) is possible. ## Solution @@ -36,8 +55,8 @@ On the SDK side: 1. We stop obliterating all flags if decide returns a 500, and based on the parameter in (2), update only flags that were computed again, while keeping old values of other flags. 2. We add timeouts to flushes & shutdowns - -To make this possible, we cache flag definitions (and not the decide responses) on our side, so these are available even if the DB is down. +To make this possible, we cache flag definitions (and not the decide responses) on our side, so these are available even if the DB is down. We need to do the +same for determining the team given a project API key. It makes sense to include (3) with (1) for decide because the shape of the response is changing anyways, and it's much easier to handle both at the same time in all client libraries, than doing these one by one. From cbd71990f210ee77c998fd093ba6500097ecf7ee Mon Sep 17 00:00:00 2001 From: Neil Kakkar Date: Tue, 29 Nov 2022 15:10:29 +0000 Subject: [PATCH 4/4] add motivation --- requests-for-comments/resilient-feature-flags | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/requests-for-comments/resilient-feature-flags b/requests-for-comments/resilient-feature-flags index bd6d5e1..33c2996 100644 --- a/requests-for-comments/resilient-feature-flags +++ b/requests-for-comments/resilient-feature-flags @@ -62,6 +62,14 @@ It makes sense to include (3) with (1) for decide because the shape of the respo We don't have to build the rest of it out right now, but having this functionality here means when do build it, it will be extraordinarily faster to do. +## Is it worth doing this? + +If PostHog never goes down, none of this gives us an advantage (except maybe speeding up `/decide` responses if we decide to put the cache in front). + +However, for the times we do go down, doing this ensures that our customers don't have a terrible experience & don't have their apps break. It's a trust-building +activity that ensures customers stick around. + +It's an extremely big churn risk. If PostHog going down makes their app go down, we become an extra dependency for them being up. ## Context *What are our competitors doing, what are the technical constraints, what are customers asking for, what does the data tell us, are there external motivations (e.g. launch week, enterprise contract)?*