This repository has been archived by the owner on Apr 11, 2022. It is now read-only.

Set memory and cpu limit to last-50k #306

Open
Tbaut opened this issue Apr 19, 2020 · 10 comments

Comments

@Tbaut
Contributor

Tbaut commented Apr 19, 2020

The same thing that was done for the nodewatcher deployment (#298 and #299) should be done for the job:

Two pods were recently evicted with the following in their describe output:

Pod The node had condition: [MemoryPressure].

and the other

The node was low on resource: memory. Container nomidotwatcher was using 1703264Ki, which exceeds its request of 0.

edit: the job's logs show nothing (one has no logs at all, the other shows no error and just stops logging tasks abruptly).

@pmespresso
Contributor

OK, so the MemoryPressure complaint is coming from the Node level:
https://kubernetes.io/docs/concepts/overview/components/#node-components

so #298 should have been applied at that level, not to a deployment. I'm surprised the requests: { memory, cpu } fields were valid for the deployment in #298 though.

However, "Unlike pods and services, a node is not inherently created by Kubernetes: it is created externally by cloud providers like Google Compute Engine, or it exists in your pool of physical or virtual machines." - https://kubernetes.io/docs/concepts/architecture/nodes/#manual-node-administration

My best guess: I think this is meant to be configured from our GKE settings rather than with kubectl, and at the Node level, not on the Pods it manages.
[Screenshot 2020-04-20 at 18 27 32]
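
As a rough way to check this (not from the thread; the top command assumes metrics-server is installed in the cluster, and the node name is a placeholder):

# Allocatable resources and the requests/limits already scheduled on a node
kubectl describe node <node-name>

# Live node usage (requires metrics-server)
kubectl top nodes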

So my questions before proceeding are:

  1. First of all, is the best guess above correct?
  2. If so, how do we/should we manually adjust these resource requests?
    3a. We also have 3 such nodes in our Node Pool... do we need this many?
    3b. More generally, what sort of metrics do we look at to decide how many we need?
    3c. Should that be manually fiddled with at all?
    [Screenshot 2020-04-20 at 18 29 42]

cc @fevo1971: Sorry to bombard you with more Nodewatcher DevOps ruckus, but I'd really appreciate your expert eyes on this when you get a chance so that we can approach something more stable with our dashboards cluster. Thanks!

@fevo1971
Contributor


Will take some time later today to figure out what's happening here and what limits we are actually running into. Will keep you posted!

@pmespresso
Contributor

Thank you! 🙏

@fevo1971
Contributor

fevo1971 commented Apr 22, 2020

Yes, you are right, the final limit here is the memory of the node the pod is running on, in our case ~2.7 GB. We can set the limits in the container spec:

spec:
  containers:
  - name: nomidotwatcher
    resources:
      requests:
        memory: "2Gi"
[...]
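
For completeness, a fuller sketch of that container spec with both requests and limits; the numbers here are illustrative assumptions, not measured values:

spec:
  containers:
  - name: nomidotwatcher
    resources:
      requests:
        memory: "1536Mi"   # what the scheduler reserves on the node (assumed value)
        cpu: "250m"        # assumed value
      limits:
        memory: "2Gi"      # the container is OOM-killed if it grows past this
        cpu: "500m"        # assumed value

Setting a request would also avoid the "exceeds its request of 0" situation quoted in the eviction message above.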

When I look at the chart, it looks to me like the app is leaking memory, gets killed (at ~1.8 GB), gets restarted (until it again allocates up to ~2 GB), and so on.

Do we know for sure that this is not a bug in the software? From what I understand, the app is reading data from the RPC node and writing it into the database; is it expected to require 2+ GB of memory? And if so, do we have an idea of what the required limit is, so we can set up a new node pool accordingly?
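
One rough way to get a number for that limit (a sketch, assuming metrics-server is running in the cluster; the pod name is a placeholder) is to watch actual usage over time:

# Current memory/CPU usage per pod, and per container within a pod
kubectl top pods
kubectl top pod <pod-name> --containers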

@Tbaut
Contributor Author

Tbaut commented Apr 23, 2020

Do we know for sure that this is not a bug in the software?

Not sure at all, if you ask me. This memory usage (from your link) doesn't look normal (the two jumps to the bottom are when the pod gets killed):
[image: memory usage chart]

@pmespresso
Contributor

pmespresso commented Apr 23, 2020

Indeed, this looks like a common pattern resulting from a memory leak:

[Screenshot 2020-04-23 at 18 21 07]

src https://docs.google.com/presentation/d/1wUVmf78gG-ra5aOxvTfYdiLkdGaR9OhXRnOlIcEmu2s/pub?start=false&loop=false&delayms=3000&slide=id.ge1a6c70_1_14

@Tbaut
Contributor Author

Tbaut commented Apr 24, 2020

My "findings", based mostly on what you can read at https://tech.residebrokerage.com/debugging-node-js-memory-problems-d450787d9253:

I have a local node with ~700 blocks containing activity such as governance proposals.

  • Launch the following to be able to inspect the node process: yarn build && PRISMA_ENDPOINT=http://0.0.0.0:4466 yarn prisma reset && PRISMA_ENDPOINT=http://0.0.0.0:4466 ARCHIVE_NODE_ENDPOINT="ws://0.0.0.0:9944" MAX_LAG=1 node --inspect ./lib/index.js
  • In Chrome, open chrome://inspect/#devices
  • Take heap snapshots at launch, around block 350, and close to block 700.
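
As a side note (not part of the original procedure): with Node 12+ the snapshots can also be dumped without Chrome by starting the process with a snapshot signal, roughly:

# Same launch as above, but dump a heap snapshot whenever the process receives SIGUSR2
PRISMA_ENDPOINT=http://0.0.0.0:4466 ARCHIVE_NODE_ENDPOINT="ws://0.0.0.0:9944" MAX_LAG=1 \
  node --heapsnapshot-signal=SIGUSR2 ./lib/index.js

# From another terminal: writes a Heap.*.heapsnapshot file to the working directory
kill -USR2 <pid>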

We can see the following (comparing snapshots 1 and 3):
[image: heap snapshot comparison]
^The memory leak is clearly noticeable even with so little data; the heap keeps increasing. Unfortunately, the heap explorer didn't allow me to pinpoint a particular constant in our code. It seems that an array keeps growing without being freed (in this example, 1.2 MB allocated, 0.2 MB freed between snapshots 1 and 3). This is how the biggest array looks in detail; you may find something interesting.

I then did the same, removing all tasks other than createBlockNumber; the leak is still there:
[image: heap snapshot comparison]
Note that keeping createBlockNumber was an arbitrary choice.

Another experiment: my Kusama node got stuck at block 700, so I let nodewatcher catch up to block 700 and then just let it sit at "Waiting for finalization or a max lag of 1 blocks.", while taking heap snapshots every minute or so. Note that this run had no tasks at all (not even createBlockNumber):
[image: heap snapshots]

The heap increases more slowly than with tasks, but most importantly, as soon as it waits, it seems that the GC does a big cleanup.

I did the same test with all tasks again, and we can see the same behaviour:
[image: heap snapshots]

My goal was to see if I could reproduce our problem easily --> definitely.
Now we need to look at the code first and identify what could leak. I will also test older commits (before max_lag was introduced) and see how the heap behaves.

@Tbaut
Contributor Author

Tbaut commented Apr 30, 2020

Just realized we're launching our node with node -r ts-node/register --max-old-space-size=8192 ./src/index.ts. Doesn't 8192 mean that this process can take up to 8 GB of memory? Isn't that telling the GC not to care much unless we are around this value?
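
For reference (my understanding, not a change made in the repo): --max-old-space-size is V8's old-generation heap cap in MiB, so 8192 does allow roughly 8 GB of heap before the GC gets aggressive. If the container stays capped at ~2Gi, a value somewhat below that limit would make V8 collect earlier, e.g.:

# Illustrative only: keep the V8 heap cap below the container's memory limit
node --max-old-space-size=1536 ./lib/index.js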

@niklabh
Contributor

niklabh commented Apr 30, 2020

Also, ts-node is not recommended for production use. TypeStrong/ts-node#104

We should build with tsc and run node lib/index.js instead.
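
A minimal sketch of that, assuming the existing yarn build script already runs tsc and emits to ./lib (as the command earlier in this thread suggests):

# Compile ahead of time instead of transpiling with ts-node at runtime
yarn build

# Run the compiled JavaScript in production
node ./lib/index.js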

@pmespresso
Contributor

Just realized we're launching our node with node -r ts-node/register --max-old-space-size=8192 ./src/index.ts. Doesn't 8192 mean that this process can take up to 8 GB of memory? Isn't that telling the GC not to care much unless we are around this value?

It should be so, I think, but looking at the console, the nodewatcher deployment hasn't used more than 1 GB of memory in the past 30 days...
