I realised there was a small issue: the postgres adapter wouldn't work when using helm. So I passed in the postgres connection pool instead of DATABASE_URL... and added another env var to specify a custom retry DB if needed.
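For illustration, here's a minimal sketch of what reusing an existing `pg` pool can look like, assuming graphile-worker's `run()` accepts a `pgPool` option; the task name and option values are hypothetical, not the actual ones from this PR:

```ts
import { Pool } from 'pg'
import { run } from 'graphile-worker'

async function startRetryQueue(pool: Pool) {
    // Reuse the server's existing postgres pool instead of a DATABASE_URL string,
    // so helm-style deployments that only expose pool settings still work.
    const runner = await run({
        pgPool: pool,
        concurrency: 1,
        taskList: {
            // 'pluginRetry' is a made-up task name for this sketch.
            pluginRetry: async (payload) => {
                console.log('running retried job with payload', payload)
            },
        },
    })
    return runner
}
```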
@yakkomajuri regarding your questions, you understand correctly. This is just the first PR and the simplest system that lets us run code in the background. Next we'll build abstractions on top of this thing that will retry...
@mariusandra right, yes, it completely makes sense to build this backbone first. I guess that helps me answer your question though: I think the reason this feels like a scheduler is in large part because we don't yet have those abstractions on top of it.

As for exposing a scheduler, I think it could make sense, but I'd go through this retry stuff first, with a more limited API, given all the tooling we already have in place.

So I'd focus on the retry API for now.
Alternatively, we could get creative and morph this into something like this:

```js
export const tasks = {
    checkSessionEnd: async ({ distinct_id }, { cache, tasks }) => {
        const ping = await cache.get(`session_${distinct_id}`, null)
        if (!ping) {
            posthog.capture("session end", { distinct_id })
        } else {
            await tasks.runIn(60, 'seconds').checkSessionEnd({ distinct_id })
        }
    },
}

export async function processEvent(event, { cache, tasks }) {
    if ((await cache.incr(`session_${event.distinct_id}`)) === 1) {
        posthog.capture("session start", { distinct_id: event.distinct_id })
        await tasks.runIn(30, 'minutes').checkSessionEnd({ distinct_id: event.distinct_id })
    }
    await cache.expire(`session_${event.distinct_id}`, 30 * 60)
}
```

This would work with async retries as well:

```js
import { createBuffer } from '@posthog/plugin-contrib'
import fetch from 'node-fetch'
export const tasks = {
    flushBatch: async ({ batch, retryCount = 0 }, { tasks }) => {
        const resp = await fetch('https://httpbin.org/post', {
            method: 'post',
            body: JSON.stringify(batch),
            headers: { 'Content-Type': 'application/json' },
        })
        if (resp.status !== 200) {
            if (retryCount > 5) {
                console.error('Could not post batch', batch)
                return
            }
            tasks.runIn(30, 'seconds').flushBatch({ batch, retryCount: retryCount + 1 })
        }
    },
}

export function setupPlugin({ global, tasks }) {
    global.buffer = createBuffer({
        limit: 10 * 1024 * 1024, // 10 MB
        timeoutSeconds: 10 * 60, // 10 minutes
        onFlush: async (batch) => {
            await tasks.runDirectly().flushBatch({ batch })
        },
    })
}

export function teardownPlugin({ global }) {
    global.buffer.flush()
}

export function processEvent(event, { config, global }) {
    global.buffer.add(event, JSON.stringify(event).length)
    return event
}
```

The TypeScript story there seems ambitious though :D
Having slept on it, I think we should go with some "task" or "job" naming scheme. I don't yet know what the best API for this is though... and I suggest tackling that in another PR. All retry queues are disabled by default, so merging this in should have no effect for anyone, rendering the new retry functionality a no-op for now.

I'd suggest thus:
The TS story for the different job/task system was quite fine at the end of the day: PostHog/plugin-scaffold#16
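As a rough illustration of why the typing works out (this is a hypothetical sketch, not the actual plugin-scaffold types from that PR), a payload map can drive the type of whatever `runIn` returns:

```ts
// Hypothetical payload map; a real plugin would define its own job names.
type JobPayloads = {
    checkSessionEnd: { distinct_id: string }
    flushBatch: { batch: unknown[]; retryCount?: number }
}

// Every job name becomes a callable that only accepts its own payload.
type JobRunner = {
    [K in keyof JobPayloads]: (payload: JobPayloads[K]) => Promise<void>
}

interface Tasks {
    runIn(duration: number, unit: 'seconds' | 'minutes'): JobRunner
    runDirectly(): JobRunner
}

// Usage: payloads are checked at compile time.
async function example(tasks: Tasks) {
    await tasks.runIn(30, 'minutes').checkSessionEnd({ distinct_id: 'user-1' })
}
```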
```ts
const consoleFile = path.join(process.cwd(), 'tmp', 'test-console.txt')

export const writeToFile = {
    console: {
```
Shouldn't this be plugin logs instead of a custom tests-only solution?
Yeah, this should be removed as soon as plugin logs land... ⌛
```ts
const minRetry = process.env.NODE_ENV === 'test' ? 1 : 30
```

```ts
// TODO: add type to scaffold
```
If this is already ready in PostHog/plugin-scaffold#16, then I think the TODO can be deleted
To be fair, it'll be deleted in the next PR (or the one after that?); it's in any case already gone in #351.
```ts
    pluginConfig: PluginConfig
): (type: string, payload: any, retry_in?: number) => Promise<void> {
    return async (type: string, payload: any, retry_in = 30) => {
        if (retry_in < minRetry || retry_in > 86400) {
```
Nit: retry_in → retryIn
Agreed, though can we keep as is? #351 will refactor this totally. We'll get into bad merge conflicts otherwise...
I just modified the example earlier slightly:

```js
export async function onRetry(type, payload, meta) {
    if (type === 'processEvent') {
        console.log('retrying event!', type)
    }
}

export async function processEvent(event, meta) {
    console.log('queuing retry')
    meta.retry('processEvent', event, 3)
    return event
}
```
I would definitely start up a new RDS instance, just for the resource isolation, and it's one less thing to migrate from Heroku to AWS eventually.
This seems reasonable to me. Keep things simple.
I love this idea, but there is a heavy cost to be paid here. If we go down this route, we will no longer be able to split the task queue to the worker level, meaning we won't be able to localize these retry/task queues to the plugin servers themselves. Normally each worker is responsible for the tasks it is working on and can have its own personally managed queue to retry with later. In this example, every worker that has seen a certain distinct_id is going to schedule a 'checkIfSessionOver' every minute. This also means a race condition on figuring out whether a session is over, and similar issues for other tasks like this. I think we should have two separate task queues:
👍 for the RDS instance. The postgres "graphile" worker is configured to automatically write to a schema of its own.
Nope, they won't 😉. This is a nifty piece of code. We're using `cache.incr` and `cache.expire`, and we also guarantee the 30min timing because that's when the key we incr'd expires:

```js
export async function processEvent(event, { cache, tasks }) {
    if ((await cache.incr(`session_${event.distinct_id}`)) === 1) {
        posthog.capture("session start", { distinct_id: event.distinct_id })
        await tasks.runIn(30, 'minutes').checkSessionEnd({ distinct_id: event.distinct_id })
    }
    await cache.expire(`session_${event.distinct_id}`, 30 * 60)
}
```

Inside the task we check if the key is expired and if not, schedule another task to check it a bit later:

```js
export const tasks = {
    checkSessionEnd: async ({ distinct_id }, { cache, tasks }) => {
        const ping = await cache.get(`session_${distinct_id}`, null)
        if (!ping) {
            posthog.capture("session end", { distinct_id })
        } else {
            await tasks.runIn(1, 'minute').checkSessionEnd({ distinct_id })
        }
    },
}
```

This task will be picked up by just one worker, meaning that for the entire duration this user is on the website, we have just one job somewhere in the job queue to run. This could even be extended further. If that's not uniquely powerful, I'm not sure what is :).
Also nope 😉, and I'm 10^6% with you on this! Imagine the plugin server's job queue as an array of queues. Currently we have just one adapter, the graphile/postgres one.

As the first queue I'd like to have an in-memory "do this soon" queue, with some sensible max number of items it can hold. This would be used for all the "try this again in 5 seconds" jobs. The second level is postgres, for all those "run this next hour" jobs. Finally, if postgres gets full or throws other errors, we have a backup S3 queue, which just saves text files with a timestamp in the filename. It'll be horribly slow to read from, but it'll be durable. Unless the network is down and we can't write anywhere (and then we will retry for a long time). What's more, when we get a SIGTERM, we can tell the memory queue to quickly empty itself into postgres.
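To make the "array of queues" idea concrete, here's an illustrative sketch (the interface and class names are made up for this comment, not the plugin server's actual code) of enqueueing with fall-through from memory to postgres to S3:

```ts
interface EnqueuedJob {
    type: string
    payload: Record<string, any>
    runAt: Date
}

interface JobQueueAdapter {
    enqueue(job: EnqueuedJob): Promise<void>
    // Called on SIGTERM so e.g. the memory queue can dump its jobs into postgres.
    flush?(into: JobQueueAdapter): Promise<void>
}

class FallthroughJobQueue implements JobQueueAdapter {
    // Ordered cheapest/fastest first, e.g. [memoryQueue, postgresQueue, s3Queue].
    constructor(private adapters: JobQueueAdapter[]) {}

    async enqueue(job: EnqueuedJob): Promise<void> {
        let lastError: unknown = new Error('no job queue adapters configured')
        for (const adapter of this.adapters) {
            try {
                await adapter.enqueue(job)
                return // the first adapter that accepts the job wins
            } catch (error) {
                lastError = error // full or unreachable: fall through to the next queue
            }
        }
        throw lastError // nothing accepted the job; the caller keeps retrying
    }
}
```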
@Twixes I managed to catch this error locally and will fix it on Monday. In the third PR this function will no longer throw, so the problem is... um... avoided? :P Assuming I can fix this here on Monday (when you're off), 👍 for merging this and at least the next one in?
After reading your comment and a re-read of the code, this looks awesome 👍 🚢 it
I was told to be slightly less code cowboyish, but if you want to go for it then I guess… yeehaw?
🤠
Ah, but if you mean merging after the fix, then yeehaw for sure.
I figured out the issue. You're most likely running NodeJS 15 or 16. Node 15 changed the behaviour of uncaught exceptions in promises to crash the entire app. So guess what happened:
Boom 💥

The fix is here: #352

Since we're mostly just using NodeJS 14, including in all Dockerfiles and thus on cloud, this is not that urgent to get in now. Hence I'll merge this PR before 😱 the fix, and add the refactoring PR or PRs on top. I'll also merge the fix somewhere in there.
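For reference, a minimal repro of the Node 15+ behaviour change (not code from this repo): an un-awaited promise that rejects only logs a warning on Node 14, but terminates the process on Node 15 and 16 unless a handler is registered:

```ts
async function enqueueJob(): Promise<void> {
    throw new Error('postgres connection failed')
}

// No await and no .catch() here, so the rejection goes unhandled.
enqueueJob()

// Registering a handler keeps Node 15+ from exiting with a non-zero code.
process.on('unhandledRejection', (reason) => {
    console.error('Unhandled promise rejection:', reason)
})
```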
* extract redlock from schedule
* implement generic retrying
* capture console.log in tests via a temp file
* add graphile queue
* make it prettier and safe
* style fixes
* fix some tests
* release if there
* split postgres tests
* don't make a graphile worker in all tests
* revert "split postgres tests"
* skip retries if pluginConfig not found
* reset graphile schema before test
* fix failing tests by clearing the retry consumer redlock
* bust github actions cache
* slight cleanup
* fix github/eslint complaining about an `any`
* separate url for graphile retry queue, otherwise use existing postgres pool (fixes helm connection string issue)
* convert startRedlock params to options object
* move type around
* use an enum
* update typo in comment

Co-authored-by: Michael Matloka <[email protected]>
Changes
Still needs a lot of testing...
Checklist