
Server is crashing due to memory swapping #1

Closed · josecelano opened this issue Apr 24, 2024 · 16 comments
Labels: Bug Incorrect Behaviour

@josecelano

The droplet has started crashing, probably after the last update.

[screenshot]
[screenshot]

Just after rebooting:

[screenshot]

After a while:

[screenshot]

It looks like the Index is using a lot of CPU.

I don't know the reason yet. I think this recent PR could cause it:

torrust/torrust-index#530

Maybe the execution interval (100 milliseconds) is too short for this machine.

cc @da2ce7

josecelano added the Bug Incorrect Behaviour label Apr 24, 2024
@josecelano

On my PC, with the current interval (100 milliseconds):

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1342564 josecel+  20   0 2355216  38912  29440 S  20.2   0.1   1:26.22 torrust-index

With 1000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1346972 josecel+  20   0 2355216  38656  29696 S   2.3   0.1   0:03.38 torrust-index

With 2000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1348942 josecel+  20   0 2353168  37888  29952 S   1.3   0.1   0:00.59 torrust-index

With 3000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1350156 josecel+  20   0 2353288  37632  29696 S   1.0   0.1   0:00.61 torrust-index

With 4000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1351449 josecel+  20   0 2355216  38400  29440 S   0.7   0.1   0:01.23 torrust-index

With 5000 milliseconds:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1353683 josecel+  20   0 2355224  37632  29440 S   1.0   0.1   0:02.17 torrust-index

Interval (secs)   Number of torrents imported per hour
1 sec             50 * 3600     = 180000
2 sec             50 * (3600/2) = 90000
3 sec             50 * (3600/3) = 60000
4 sec             50 * (3600/4) = 45000
5 sec             50 * (3600/5) = 36000
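
For reference, the arithmetic behind the table is just batch size times executions per hour. A minimal sketch, assuming (as the table above does) that the importer processes 50 torrents per execution:

// Torrents imported per hour for a given importer execution interval,
// assuming a fixed batch of 50 torrents per execution (the figure used
// in the table above).
fn torrents_per_hour(interval_secs: u64) -> u64 {
    const BATCH_SIZE: u64 = 50;
    const SECONDS_PER_HOUR: u64 = 3600;
    BATCH_SIZE * (SECONDS_PER_HOUR / interval_secs)
}

fn main() {
    for interval_secs in 1..=5 {
        println!("{interval_secs} sec -> {} torrents/hour", torrents_per_hour(interval_secs));
    }
}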

josecelano added a commit to josecelano/torrust-index that referenced this issue Apr 24, 2024
We are having problems in the live demo server:

torrust/torrust-demo#1

due to high CPU and memory usage.
josecelano added a commit to josecelano/torrust-index that referenced this issue Apr 24, 2024
We are having problems with the live demo server:

torrust/torrust-demo#1

Due to high CPU and memory usage.
josecelano added a commit to torrust/torrust-index that referenced this issue Apr 24, 2024
b2c2ce7 feat: increase the tracker stats importer exec interval (Jose Celano)

Pull request description:

  We are having problems with the live demo server. See:

  torrust/torrust-demo#1

  Due to high CPU and memory usage.

  This increases the tracker stats importer execution interval from 100 milliseconds to 2000 milliseconds.

ACKs for top commit:
  josecelano:
    ACK b2c2ce7

Tree-SHA512: ea5a23e4250378c2cb5df6f3d5d81e989dfb1e8f3490e4e42acc70bb77b972a599bbdc0051738a45648aea37d38fe7703a6ca65177bb68b56f21d2b056cdfe19
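
For context, the stats importer is essentially a periodic background task. A minimal sketch of that pattern with tokio, assuming a configurable interval; this is not the actual torrust-index code, and import_tracker_stats_batch is a hypothetical placeholder:

use std::time::Duration;

// Hypothetical placeholder for the real import step (fetching stats for a
// batch of torrents from the tracker API and storing them in the database).
async fn import_tracker_stats_batch() {
    // ...
}

#[tokio::main]
async fn main() {
    // Interval raised from 100 ms to 2000 ms, as in the PR above.
    let mut interval = tokio::time::interval(Duration::from_millis(2000));
    loop {
        interval.tick().await; // the first tick completes immediately
        import_tracker_stats_batch().await;
    }
}
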
@josecelano

I deployed a new version with the increased interval 2 hours ago, and it hasn't crashed:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3721 torrust   20   0  750016 499496      0 S  20.5  50.9  23:08.13 torrust-tracker
   3867 torrust   20   0  577576  47760   6164 S   6.3   4.9   6:18.35 torrust-index

[screenshot]
[screenshot]

@josecelano

The problem is not solved yet :-/:

[screenshot]

[screenshot]

@josecelano

It looks like the problem is memory consumption. The memory consumption increases until the server starts swapping. I think the problem could be:

  1. The new tracker statistics importer (even with the new 2-second interval).
  2. The new torrent repository implementation with the SkipMap. I don't think so because, in the previous implementation with BTreeMap, we were not limiting the memory consumption either.
  3. The tracker is receiving too many requests, so we need to limit the number of requests or the memory consumption.

The good thing is that the service restarts with the docker healthcheck after 2 hours. Looking at the docker containers, only the tracker was restarted, so I guess the problem is with the tracker (option 2 or 3):

docker ps
CONTAINER ID   IMAGE                       COMMAND                  CREATED        STATUS                    PORTS                                                                                                                                       NAMES
ee67d0951541   nginx:mainline-alpine       "/docker-entrypoint.…"   14 hours ago   Up 14 hours               0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp                                                                    proxy
31c65c66be26   torrust/index-gui:develop   "/usr/local/bin/entr…"   14 hours ago   Up 14 hours (healthy)     0.0.0.0:3000->3000/tcp, :::3000->3000/tcp                                                                                                   index-gui
a8b2e8977c5c   torrust/index:develop       "/usr/local/bin/entr…"   14 hours ago   Up 14 hours (healthy)     0.0.0.0:3001->3001/tcp, :::3001->3001/tcp                                                                                                   index
0cc1eefec6b6   torrust/tracker:develop     "/usr/local/bin/entr…"   14 hours ago   Up 10 minutes (healthy)   0.0.0.0:1212->1212/tcp, :::1212->1212/tcp, 0.0.0.0:7070->7070/tcp, :::7070->7070/tcp, 1313/tcp, 0.0.0.0:6969->6969/udp, :::6969->6969/udp   tracker

I'm going to first try switching back to the previous repository implementation with the BTreeMap (there could be some deadlock with the SkipMap). If we still have the problem, we need to limit requests and/or memory consumption.

On the other hand, I think we should improve the container healthchecks. We should not wait for 2 hours to restart the container.

@josecelano

I've opened another issue to discover why there are so many zombie processes. I think there is an easy way to know whether the healthchecks cause the crash on the server or not: I can run a private server with the same resources without requests.

Another thing I could check is the number of requests the tracker is handling. Maybe the number of requests per second has increased a lot, and the problem is that the server can't handle them. In that case, the only solutions would be:

  1. Limit the number of requests (reject requests when the server is busy; see the sketch after this list).
  2. Limit the memory consumption (we try to handle requests but remove torrents/peers when the torrent repository is full).
  3. Resize the server when CPU or memory usage goes over 85%.
  4. Any or all of the previous ones.
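
A minimal sketch of option 1 (rejecting requests when the server is busy), using a tokio Semaphore as a fixed admission limit; this is not the tracker's actual code, and handle_announce is a hypothetical placeholder:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical placeholder for the real UDP announce handler.
async fn handle_announce(_request: Vec<u8>) {
    // ... parse the announce, update the torrent repository, build the response ...
}

// Reject the request instead of queueing it when all permits are taken,
// so load spikes do not grow memory without bound.
async fn try_handle(limiter: Arc<Semaphore>, request: Vec<u8>) -> Result<(), &'static str> {
    match limiter.try_acquire() {
        Ok(_permit) => {
            // The permit is held for the whole handler and released on drop.
            handle_announce(request).await;
            Ok(())
        }
        Err(_) => Err("server busy: request rejected"),
    }
}

A limit such as Arc::new(Semaphore::new(512)) would cap the number of in-flight requests; the right number for this droplet would have to be found by measurement.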

@josecelano

josecelano commented Apr 25, 2024

5 minutes after restarting the tracker container, I checked the stats:

udp4_connections_handled: 44661,
udp4_announces_handled: 25175,
udp4_scrapes_handled: 652,

The tracker is handling 234.96 req/sec. I think, in this case, the server could be simply busy, and we need to implement one of the solutions above (1 to 3).

cc @da2ce7

UPDATE:

Those are only the tracker client requests. We also have the tracker API, the Index, the stats importer, ...
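
For clarity, the req/sec figures in this thread are just the three UDP counters added together and divided by the elapsed time; a minimal sketch:

// Requests per second from the tracker stats counters.
fn requests_per_second(connections: u64, announces: u64, scrapes: u64, elapsed_secs: u64) -> f64 {
    (connections + announces + scrapes) as f64 / elapsed_secs as f64
}

fn main() {
    // Figures from the comment above: 5 minutes (300 seconds) after the restart.
    println!("{:.2} req/sec", requests_per_second(44_661, 25_175, 652, 300));
    // prints "234.96 req/sec"
}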

@josecelano

It looks like the period without restarting is longer during the night. Maybe it's because there are fewer tracker requests.

[screenshot]

Now it is handling about 350 req/sec.

In 2 hours
udp4_connections_handled: 824422,
udp4_announces_handled: 1646227,
udp4_scrapes_handled: 55240,

2525889 total requests in 2 hours -> 350.817916667 req/sec

@josecelano

Quoting my earlier comment:

It looks like the problem is memory consumption. The memory consumption increases until the server starts swapping. I think the problem could be:

  1. The new tracker statistics importer (even with the new 2-second interval).
  2. The new torrent repository implementation with the SkipMap. I don't think so because, in the previous implementation with BTreeMap, we were not limiting the memory consumption either.
  3. The tracker is receiving too many requests, so we need to limit the number of requests or the memory consumption.

The good thing is that the service restarts with the docker healthcheck after 2 hours. Looking at the docker containers, only the tracker was restarted, so I guess the problem is with the tracker (option 2 or 3):

I've been thinking again about why this is happening now. Regarding the SkipMap, maybe it is consuming more memory than the BTreeMap, and that is making the container restart at the current load level.

Regarding the load level, it would be nice to have historical statistics. I would like to know whether we have more requests than one or two weeks ago. That would help in understanding whether the problem is simply that we have more requests and we need to resize the server.

I'm going to resize the droplet from 1GB of memory ($6) to 2GB ($12) to see what happens:

[screenshot]

@josecelano

josecelano commented Apr 26, 2024

Just after resizing the droplet:

[screenshot]

@josecelano

josecelano commented Apr 26, 2024

The server was restarted at approximately 15:45 (it is now 17:43):

[screenshot]

After restarting the server, the tracker has handled:

udp4_connections_handled: 1022134,
udp4_announces_handled: 1864182,
udp4_scrapes_handled: 63645,

2949961 total requests in 2 hours -> 409.716805556 req/sec

It used to crash at 25 req/sec with the previous instance size.

Stats after two hours running the tracker:

{
  "torrents": 196455,
  "seeders": 120032,
  "completed": 1241,
  "leechers": 200406,
  "tcp4_connections_handled": 0,
  "tcp4_announces_handled": 0,
  "tcp4_scrapes_handled": 0,
  "tcp6_connections_handled": 0,
  "tcp6_announces_handled": 0,
  "tcp6_scrapes_handled": 0,
  "udp4_connections_handled": 1022134,
  "udp4_announces_handled": 1864182,
  "udp4_scrapes_handled": 63645,
  "udp6_connections_handled": 0,
  "udp6_announces_handled": 0,
  "udp6_scrapes_handled": 0
}

@josecelano

josecelano commented Apr 26, 2024

22:26

[screenshot]

{
  "torrents": 344237,
  "seeders": 256095,
  "completed": 4079,
  "leechers": 420604,
  "tcp4_connections_handled": 0,
  "tcp4_announces_handled": 0,
  "tcp4_scrapes_handled": 0,
  "tcp6_connections_handled": 0,
  "tcp6_announces_handled": 0,
  "tcp6_scrapes_handled": 0,
  "udp4_connections_handled": 3313661,
  "udp4_announces_handled": 7467544,
  "udp4_scrapes_handled": 236375,
  "udp6_connections_handled": 0,
  "udp6_announces_handled": 0,
  "udp6_scrapes_handled": 0
}
From 17:43 to 22:28 -> 17100 sec

"udp4_connections_handled": 3313661,
"udp4_announces_handled": 7467544,
"udp4_scrapes_handled": 236375,

11017580 total requests in 17100 sec -> 644.302923977 req/sec

@josecelano

08:08.

The tracker has been restarted again.

24h graph:

[screenshot]

Last-hour graph:

[screenshot]

I could not see how many requests it was handling before it restarted. I want to know how many req/sec this instance can handle. It would be nice to collect statistics every 5 minutes. Maybe I can write a simple script to import them and update a CSV file until we implement something like this. I can run it on the server.
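
A rough sketch of such a collection script (run, for example, from cron every 5 minutes), assuming the tracker API serves the stats shown in these comments as JSON from a token-protected stats endpoint; the URL, path, and token below are placeholders, not the real ones:

use std::fs::OpenOptions;
use std::io::Write;
use std::time::{SystemTime, UNIX_EPOCH};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder endpoint and token; adjust to the real tracker API.
    let url = "http://127.0.0.1:1212/api/v1/stats?token=MyAccessToken";

    // Fetch the raw JSON stats (reqwest with the "blocking" feature enabled).
    let body = reqwest::blocking::get(url)?.text()?;

    let timestamp = SystemTime::now().duration_since(UNIX_EPOCH)?.as_secs();

    // Append one line per sample: unix timestamp plus the raw JSON payload.
    // A fuller version would extract the individual counters into columns.
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open("tracker-stats.csv")?;
    writeln!(file, "{timestamp},{}", body.replace('\n', ""))?;

    Ok(())
}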

@josecelano

The tracker was restarted after running from 8:08 to 1:25.

Stats at 22:30:

Req/sec: 505.0198659
Seeders + leechers = 1,390,088
{
  "torrents": 521222,
  "seeders": 497614,
  "completed": 8410,
  "leechers": 892474,
  "tcp4_connections_handled": 0,
  "tcp4_announces_handled": 0,
  "tcp4_scrapes_handled": 0,
  "tcp6_connections_handled": 0,
  "tcp6_announces_handled": 0,
  "tcp6_scrapes_handled": 0,
  "udp4_connections_handled": 7924576,
  "udp4_announces_handled": 17939819,
  "udp4_scrapes_handled": 497642,
  "udp6_connections_handled": 0,
  "udp6_announces_handled": 0,
  "udp6_scrapes_handled": 0
}
top - 21:25:22 up 1 day,  6:42,  1 user,  load average: 2.73, 2.37, 2.19
Tasks: 141 total,   2 running, 128 sleeping,   0 stopped,  11 zombie
%Cpu(s): 24.2 us, 28.8 sy,  0.0 ni,  1.0 id, 37.4 wa,  0.0 hi,  7.0 si,  1.7 st
MiB Mem :   1963.9 total,     93.5 free,   1758.5 used,    111.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     69.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 683491 torrust   20   0 1733056   1.4g   3300 S  17.6  72.3 177:40.05 torrust-tracker
 683460 root      20   0  719640   3784      0 S  15.3   0.2 135:15.98 containerd-shim
    724 root      20   0 1537756  46328   2360 S  14.3   2.3 249:15.92 dockerd
   1296 torrust   20   0  582848  48888   5772 S   4.7   2.4  72:18.78 torrust-index

[screenshot]

I think I can confirm the problem: the number of peers and torrents keeps increasing, consuming more memory until the server starts swapping too much and the container is restarted by the healthcheck.

I will open a discussion on the Tracker repo. We could limit resource consumption by limiting requests or memory, but that would lead to worse responses. In a production environment, I guess we should just monitor the resources and scale up. Maybe we could try a mixed solution:

  1. Limit resource consumption to avoid restarting the service.
  2. But monitor it, so we can scale up before the service degrades (see the sketch below).

In order to monitor this better, it would be nice to have more statistics and to show some graphs.
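
As a sketch of point 2 (monitoring so we can scale up before the service degrades): a minimal, Linux-only check of available memory against a threshold, reading /proc/meminfo; the 15% threshold is an arbitrary assumption:

use std::fs;

// Read MemTotal and MemAvailable from /proc/meminfo (values are in kB, but
// the units cancel in the ratio) and flag when available memory drops below
// a threshold, before the server starts swapping.
fn mem_available_ratio() -> Option<f64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;
    let field = |name: &str| -> Option<f64> {
        meminfo
            .lines()
            .find(|l| l.starts_with(name))?
            .split_whitespace()
            .nth(1)?
            .parse::<f64>()
            .ok()
    };
    Some(field("MemAvailable:")? / field("MemTotal:")?)
}

fn main() {
    match mem_available_ratio() {
        Some(ratio) if ratio < 0.15 => println!("WARNING: only {:.1}% memory available", ratio * 100.0),
        Some(ratio) => println!("OK: {:.1}% memory available", ratio * 100.0),
        None => eprintln!("could not read /proc/meminfo"),
    }
}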

@josecelano

Server resized to:

[screenshot]

@josecelano

josecelano commented May 1, 2024

On the 29th of April, the server was resized.

[screenshot]

{
  "torrents": 1236779,
  "seeders": 1460674,
  "completed": 26980,
  "leechers": 2691284,
  "tcp4_connections_handled": 0,
  "tcp4_announces_handled": 0,
  "tcp4_scrapes_handled": 0,
  "tcp6_connections_handled": 0,
  "tcp6_announces_handled": 0,
  "tcp6_scrapes_handled": 0,
  "udp4_connections_handled": 24167291,
  "udp4_announces_handled": 58160995,
  "udp4_scrapes_handled": 1485013,
  "udp6_connections_handled": 0,
  "udp6_announces_handled": 0,
  "udp6_scrapes_handled": 0
}

501.75 req/sec

@josecelano

The new server was running (without restarting the tracker) from the 29th of April at 8:10 am to the 2nd of May at 5:30 am.

[screenshot]
