This repository has been archived by the owner on Apr 11, 2022. It is now read-only.

Set memory and cpu limit to last-50k #306

Open
Tbaut opened this issue Apr 19, 2020 · 10 comments

Comments

@Tbaut
Contributor

Tbaut commented Apr 19, 2020

The same thing that was done for the nodewatcher deployment (#298 and #299) should be done for the job:

Two pods were recently evicted with the following in their describe output:

Pod The node had condition: [MemoryPressure].

and the other

The node was low on resource: memory. Container nomidotwatcher was using 1703264Ki, which exceeds its request of 0.

edit: the job's logs show nothing (one has no logs at all, the other shows no error and just stops logging tasks abruptly).

@pmespresso
Contributor

OK, so the MemoryPressure complaint is coming from the Node level:
https://kubernetes.io/docs/concepts/overview/components/#node-components

so #298 should have been applied at that level, not to a deployment. I'm surprised the requests: { memory, cpu } fields were valid for the deployment in #298 though.

However, "Unlike pods and services, a node is not inherently created by Kubernetes: it is created externally by cloud providers like Google Compute Engine, or it exists in your pool of physical or virtual machines." - https://kubernetes.io/docs/concepts/architecture/nodes/#manual-node-administration

My best guess: I think this is meant to be configured from our GKE settings rather than with kubectl, and at the Node level, not on the Pods it manages.
[Screenshot 2020-04-20 at 18 27 32]
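
As a rough way to check this (not from the thread; the top command assumes metrics-server is installed in the cluster, and the node name is a placeholder):

# Allocatable resources and the requests/limits already scheduled on a node
kubectl describe node <node-name>

# Live node usage (requires metrics-server)
kubectl top nodes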

So my questions before proceeding are:

  1. First of all, is the best guess above correct?
  2. If so, how do we/should we manually adjust these resource requests?
    3a. We also have 3 such nodes in our Node Pool... do we need this many?
    3b. More generally, what sort of metrics do we look at to decide how many we need?
    3c. Should that be manually fiddled with at all?
    [Screenshot 2020-04-20 at 18 29 42]

cc @fevo1971: Sorry to bombard you with more Nodewatcher DevOps ruckus, but I'd really appreciate your expert eyes on this when you get a chance so that we can approach something more stable with our dashboards cluster. Thanks!

@fevo1971
Contributor


Will take some time later today to figure out what's happening here and what limits we are actually running into. Will keep you posted!

@pmespresso
Contributor

Thank you! 🙏

@fevo1971
Contributor

fevo1971 commented Apr 22, 2020

Yes, you are right, the final limit here is the memory of the node the pod is running on, in our case ~2.7 GB. We can set the limits in the container spec:

spec:
  containers:
  - name: nomidotwatcher
    resources:
      requests:
        memory: "2Gi"
[...]
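
For completeness, a fuller sketch of that container spec with both requests and limits; the numbers here are illustrative assumptions, not measured values:

spec:
  containers:
  - name: nomidotwatcher
    resources:
      requests:
        memory: "1536Mi"   # what the scheduler reserves on the node (assumed value)
        cpu: "250m"        # assumed value
      limits:
        memory: "2Gi"      # the container is OOM-killed if it grows past this
        cpu: "500m"        # assumed value

Setting a request would also avoid the "exceeds its request of 0" situation quoted in the eviction message above.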

When I look at the chart, it looks to me like the app is leaking memory, gets killed (at ~1.8 GB), gets restarted (until it again allocates up to ~2 GB), and so on.

Do we know for sure that this is not a bug in the software? From what I understand, the app is reading data from the RPC node and writing it into the database; is it expected to require 2+ GB of memory? And if so, do we have an idea of what the required limit is, so we can set up a new node pool accordingly?
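
One rough way to get a number for that limit (a sketch, assuming metrics-server is running in the cluster; the pod name is a placeholder) is to watch actual usage over time:

# Current memory/CPU usage per pod, and per container within a pod
kubectl top pods
kubectl top pod <pod-name> --containers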

@Tbaut
Contributor Author

Tbaut commented Apr 23, 2020

Do we know for sure that this is not a bug in the software?

Not sure at all, if you ask me. This memory usage (from your link) doesn't look normal (the two jumps to the bottom are when the pod gets killed):
[image: memory usage chart]

@pmespresso
Contributor

pmespresso commented Apr 23, 2020

Indeed, this looks like a common pattern resulting from a memory leak:

[Screenshot 2020-04-23 at 18 21 07]

src https://docs.google.com/presentation/d/1wUVmf78gG-ra5aOxvTfYdiLkdGaR9OhXRnOlIcEmu2s/pub?start=false&loop=false&delayms=3000&slide=id.ge1a6c70_1_14

@Tbaut
Contributor Author

Tbaut commented Apr 24, 2020

My "findings", based mostly on what you can read at https://tech.residebrokerage.com/debugging-node-js-memory-problems-d450787d9253:

I have a local node with ~700 blocks containing activity such as governance proposals.

  • Launch the following to be able to inspect the node process: yarn build && PRISMA_ENDPOINT=http://0.0.0.0:4466 yarn prisma reset && PRISMA_ENDPOINT=http://0.0.0.0:4466 ARCHIVE_NODE_ENDPOINT="ws://0.0.0.0:9944" MAX_LAG=1 node --inspect ./lib/index.js
  • In Chrome, open chrome://inspect/#devices
  • Take heap snapshots at launch, around block 350, and close to block 700.
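
As a side note (not part of the original procedure): with Node 12+ the snapshots can also be dumped without Chrome by starting the process with a snapshot signal, roughly:

# Same launch as above, but dump a heap snapshot whenever the process receives SIGUSR2
PRISMA_ENDPOINT=http://0.0.0.0:4466 ARCHIVE_NODE_ENDPOINT="ws://0.0.0.0:9944" MAX_LAG=1 \
  node --heapsnapshot-signal=SIGUSR2 ./lib/index.js

# From another terminal: writes a Heap.*.heapsnapshot file to the working directory
kill -USR2 <pid>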

We can see the following (comparing snapshots 1 and 3):
[image: heap snapshot comparison]
^The memory leak is clearly noticeable even with so little data; the heap keeps increasing. Unfortunately, the heap explorer didn't allow me to pinpoint a particular constant in our code. It seems that an array keeps growing without being freed (in this example, 1.2 MB allocated, 0.2 MB freed between snapshots 1 and 3). This is how the biggest array looks in detail; you may find something interesting.

I then did the same, removing all tasks other than createBlockNumber; the leak is still there:
[image: heap snapshot comparison]
Note that keeping createBlockNumber was an arbitrary choice.

Another experiment: my Kusama node got stuck at block 700, so I let nodewatcher catch up to block 700 and then just let it sit at "Waiting for finalization or a max lag of 1 blocks.", while taking heap snapshots every minute or so. Note that this run had no tasks at all (not even createBlockNumber):
[image: heap snapshots]

The heap increases more slowly than with tasks, but most importantly, as soon as it waits, it seems that the GC does a big cleanup.

I did the same test with all tasks again, and we can see the same behaviour:
[image: heap snapshots]

My goal was to see if I could reproduce our problem easily --> definitely.
Now we need to look at the code first and identify what could leak. I will also test older commits (before max_lag was introduced) and see how the heap behaves.

@Tbaut
Contributor Author

Tbaut commented Apr 30, 2020

Just realized we're launching our node with node -r ts-node/register --max-old-space-size=8192 ./src/index.ts. Doesn't 8192 mean that this process can take up to 8 GB of memory? Isn't that telling the GC not to care much unless we are around this value?
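
For reference (my understanding, not a change made in the repo): --max-old-space-size is V8's old-generation heap cap in MiB, so 8192 does allow roughly 8 GB of heap before the GC gets aggressive. If the container stays capped at ~2Gi, a value somewhat below that limit would make V8 collect earlier, e.g.:

# Illustrative only: keep the V8 heap cap below the container's memory limit
node --max-old-space-size=1536 ./lib/index.js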

@niklabh
Contributor

niklabh commented Apr 30, 2020

Also, ts-node is not recommended for production use. TypeStrong/ts-node#104

We should build with tsc and run node lib/index.js instead.
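
A minimal sketch of that, assuming the existing yarn build script already runs tsc and emits to ./lib (as the command earlier in this thread suggests):

# Compile ahead of time instead of transpiling with ts-node at runtime
yarn build

# Run the compiled JavaScript in production
node ./lib/index.js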

@pmespresso
Contributor

Just realized we're launching our node with node -r ts-node/register --max-old-space-size=8192 ./src/index.ts. Doesn't 8192 mean that this process can take up to 8 GB of memory? Isn't that telling the GC not to care much unless we are around this value?

It should be so, I think, but looking at the console, the nodewatcher deployment hasn't used more than 1 GB of memory in the past 30 days...
