
Limit policy maximum age didn't cleanup resulting storage fill [v2.10.18] #5795

Open
b2broker-yperfilov opened this issue Aug 16, 2024 · 6 comments
Labels: defect (Suspected defect such as a bug or regression), stale (This issue has had no activity in a while)

Comments

@b2broker-yperfilov

Observed behavior

We are using the limits retention policy with a maximum age of 15 minutes. However, 1 of 3 nodes did not clean up its storage in time, so the storage filled up and the node crashed.

The screenshot below shows the storage usage stats of the 3 nodes. Notice that the blue node has much larger storage usage than the red and yellow nodes.
[Screenshot: CleanShot 2024-08-16 at 13 09 03@2x]

The screenshot below is from the NATS dashboard; you can see that the stream message count also rose significantly.
[Screenshot: CleanShot 2024-08-16 at 13 14 27@2x]

The stream configuration is shown in the screenshot below. The stream was recreated while attempting to fix the issue, but it has exactly the same settings. Notice the max age of 15 minutes, as well as the typical byte size and message count.
[Screenshot: CleanShot 2024-08-16 at 13 10 59@2x]
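
For reference, a stream with these settings would be roughly equivalent to one created like this (a sketch using the nats CLI; the stream and subject names are placeholders and the remaining limits are left at their defaults):

    # 3-replica, file-backed stream whose only retention limit is a 15 minute max age
    # (EVENTS and "events.>" are placeholder names)
    nats stream add EVENTS \
      --subjects "events.>" \
      --storage file \
      --replicas 3 \
      --retention limits \
      --max-age 15m \
      --discard old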

In the logs, there were errors (repeated several times):

2024-08-15 19:50:32.572	{"time":"2024-08-15T16:50:32.57225584Z","_p":"F","log":"[181] 2024/08/15 16:50:32.572168 [ERR] JetStream resource limits exceeded for server"}
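
For context, this message appears to be logged when the server-wide JetStream reservations are exhausted rather than a per-stream limit; those reservations come from the jetstream block of the server configuration (a sketch only; the values below are placeholders, not our actual settings):

    # nats-server configuration sketch: server-wide JetStream resource reservations
    # (store_dir and sizes are placeholders)
    jetstream {
      store_dir: /data
      max_memory_store: 1GB
      max_file_store: 10GB
    }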

Please let me know if you need any additional details.

Expected behavior

The limit policy cleans up expired messages as expected on all nodes.

Server and client version

Server 2.10.18

Host environment

K8s

      resources:
        limits:
          cpu: 400m
          memory: 768Mi
        requests:
          cpu: 400m
          memory: 768Mi

Steps to reproduce

Not clear.

b2broker-yperfilov added the defect label on Aug 16, 2024
@derekcollison (Member)

When something like this happens, we ask the developer to capture some profiles for us, specifically CPU, memory (heap), and stacksz / goroutines.
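
For example, assuming the profiling port is enabled on the server (prof_port in the server configuration), the profiles can be pulled from the standard Go pprof endpoints (the port below is a placeholder):

    # CPU profile (30 second sample), heap profile, and goroutine dump
    # (65432 is a placeholder for whatever prof_port is set to)
    curl -o cpu.prof "http://localhost:65432/debug/pprof/profile?seconds=30"
    curl -o mem.prof "http://localhost:65432/debug/pprof/heap"
    curl -o goroutines.txt "http://localhost:65432/debug/pprof/goroutine?debug=2"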

@b2broker-yperfilov (Author)

@derekcollison here are screenshots of some metrics. I went through many memory metrics, and all of them look quite stable.

[Screenshots: six metric panels, CleanShot 2024-08-19 at 08 46 00@2x through 08 49 27@2x]

@derekcollison (Member)

The stream info shows that the only limit you have in place, which is age, appears to be working correctly. What do you think is not working correctly?

Also, do you have GOMEMLIMIT set properly?
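
GOMEMLIMIT is the Go runtime's soft memory limit and is usually set somewhat below the container memory limit; a sketch for a 768Mi container (the exact value and config path are placeholders):

    # Give the Go runtime a soft limit below the 768Mi container limit
    # (the value and the config path are placeholders)
    GOMEMLIMIT=700MiB nats-server -c /etc/nats/nats-server.conf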

@b2broker-yperfilov (Author)

@derekcollison we do not have GOMEMLIMIT set. At the same time, the issue is not with the pod's memory; the issue is with disk storage.

We have replication across 3 nodes for this stream, which means each message should be copied to 3 nodes, and at any point in time roughly the same amount of space should be occupied on each node (assuming all other streams also have a replication factor of 3). However, one of the nodes did not follow this rule, as can be seen from the initial message, resulting in disk leakage.
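
For comparison, per-server JetStream storage usage can also be checked from the nats CLI (this requires system-account credentials; the context name below is a placeholder):

    # Per-server memory and file store usage across the cluster
    # ("sys" is a placeholder for a context with system-account credentials)
    nats --context sys server report jetstream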

@derekcollison (Member)

Can you share a du -sh of the store directory from the node that has increased disk usage?
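
For example (the store directory path and layout below are assumptions; substitute whatever store_dir your deployment uses):

    # Total JetStream usage on this node, plus a rough per-stream breakdown
    # (/data is a placeholder for the configured store_dir)
    du -sh /data/jetstream
    du -sh /data/jetstream/*/streams/*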

@b2broker-yperfilov (Author)

@derekcollison
Now it is 1.3G. Another node is at 102.0M, and the third is at 97.4M.

wallyqs changed the title from "Limit policy maximum age didn't cleanup resulting storage fill" to "Limit policy maximum age didn't cleanup resulting storage fill [v2.10.18]" on Sep 5, 2024
github-actions bot added the stale label on Nov 1, 2024