[BUG?] very slow in creating library #427

Closed
Cantello opened this issue Aug 31, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@Cantello

Cantello commented Aug 31, 2024

Describe the bug
When adding a library, the media entities are sometimes added quickly, but thumbnails take forever or are not generated at all; sometimes the entities are not created at all.

To Reproduce

  1. Start the docker container (on Debian 12); docker-compose.yml is below
  stump:
    image: aaronleopold/stump
    container_name: stump
    volumes:
      - /opt/stump:/config
      - /mnt/_nfs/Stuff:/data_nfs
      - /tmp:/temporary
    ports:
      - 10801:10801
    environment:
      - STUMP_CONFIG_DIR=/config
      - PUID=1000
      - PGID=1000
      - STUMP_VERBOSITY=2
    restart: unless-stopped
  2. Initialize the server
  3. Create a library (single folder, 40 .cbz files, series-based, thumbnails enabled with 200x350)
  4. The library is either created but thumbnails are not generated (or generated extremely slowly, like 1 per minute), OR
  5. The library is not created at all ("No books to display"; the series is created but with 0 books in it, although the files are there in the Files tab within the series)

Expected behavior
Library is created with thumbnails

Screenshots

(Three screenshots attached: 2024-08-31 18_08_46-Stump _ Test, 2024-08-31 18_09_03-Stump, 2024-08-31 18_07_38-Stump _ Home)

Desktop (please complete the following information):
Accessed from Windows with Vivaldi as the browser; Stump is hosted in Docker on Debian 12.

Additional context
Log file here (maybe not enough and/or too chatty, was using verbosity 3):

https://logpaste.com/6lqmqRvU

The container stats are also interesting, nearly no CPU usage (1-2%) but huge memory usage (2.5GB after 15 minutes and rising).

@Cantello Cantello added the bug Something isn't working label Aug 31, 2024
@aaronleopold
Collaborator

aaronleopold commented Aug 31, 2024

Hey 👋 Thanks for the detailed report!

thumbnails take forever or are not generated at all

Coincidentally, someone else on Discord reported the really slow thumbnail generation. I'm not sure if you're on there, but if you're interested, the discussions can be found here.

There is already a draft PR up which I'm hoping will help with this.

Create library (single folder, 40 .cbz files, series-based, thumbnails enabled with 200x350)

Can you share additional details about your library structure? You said a single folder, so something like this:

Root
├── Single Folder of Stuff
│   ├── 40 .cbz files

OR

Root
├── 40 .cbz files

Or something else? I'm curious if perhaps you are being affected by a recently fixed bug in the scanner (not in latest yet): #423

Library is not created at all ("No books to display", the series is created but with 0 books in it, the files are there in the files tab within the series)

This is an odd scenario. I won't have time to review the logs you provided today, but do you notice the server doing anything when this happens? I'm curious if there is a panic happening somewhere.

The container stats are also interesting, nearly no CPU usage (1-2%) but huge memory usage (2.5GB after 15 minutes and rising).

If I had to guess, CPU usage is low during a scan but huge during thumbnail generation. A scan doesn't load much of the files into memory, but generating thumbnails means we must load a (potentially large) image into memory, convert it to another format, and apply transformations (scale, quality, etc.). This is another area I'm hopeful the aforementioned PR will help with, though.
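For illustration, the shape of that work looks roughly like the following (a minimal sketch assuming the `image` crate; not Stump's actual implementation):

use image::{imageops::FilterType, ImageFormat};
use std::io::Cursor;

// Hypothetical helper: decode a full page, resize it, and re-encode it as a JPEG
// thumbnail. Both the decoded page and the resized copy live in memory.
fn make_thumbnail(page_bytes: &[u8], width: u32, height: u32) -> image::ImageResult<Vec<u8>> {
    let page = image::load_from_memory(page_bytes)?; // full-size decode
    let thumb = page.resize(width, height, FilterType::Triangle); // second, smaller buffer
    let mut out = Cursor::new(Vec::new());
    thumb.write_to(&mut out, ImageFormat::Jpeg)?; // re-encode to the target format
    Ok(out.into_inner())
}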

Edit to add: I assume you are also using a network drive for your books (/mnt/_nfs/Stuff:/data_nfs). This is going to slow things down even more, in general.

@github-project-automation github-project-automation bot moved this to Backlog in v0.1.x Aug 31, 2024
@aaronleopold aaronleopold moved this from Backlog to Triage in v0.1.x Aug 31, 2024
@Cantello
Author

Thanks for the quick reply, I will just experiment around. Stump is not (yet) my main comic reader, but it will probably become that, as Komga is probably never going to implement smart filters.

As for your remarks:

Can you share additional details about your library structure? You said a single folder, so something like?:

The first test library was just 40 books in the root folder; my regular library is roughly as follows, one folder per series:

.
└── Comics/
    ├── English/
    │   ├── A Gift for a Ghost (2020)
    │   ├── Alan Moore's Neonomicon (2010-2011)
    │   ├── Alan Moore's The Courtyard (2004)
    │   ├── Aliens - Rescue (2019)
    │   └── American Gods/
    │       ├── American Gods - My Ainsel (2018)
    │       ├── American Gods - Shadows (2017)
    │       └── American Gods - The Moment of the Storm (2019-2020)
    └── German/
        ├── Ab Irato (2016)
        ├── Adler ohne Krallen (2012)
        ├── Akira (1991)
        ├── Aldebaran (2010)
        ├── Alice im Wunderland (2010)
        ├── Alice Matheson (2016-2020)
        └── Allein (2008-)

This is an odd scenario. I won't have time to review the logs you provided today, but do you notice the server doing anything when this happens? I'm curious if there is a panic happening somewhere.

Nothing that I am aware of, any idea where I could be looking?

Edit to add: I assume you are also using a network drive for your books (/mnt/_nfs/Stuff:/data_nfs). This is going to slow things down even more, in general.

I know that this could be a bottleneck (which is why I switched to NFS instead of SMB; the files are served by a Synology NAS), but other comic readers I tried (Kavita & Komga) currently do not suffer from any problems.

P.S.: memory usage is now up to 5.8GB, which somehow does not seem quite right... Is this due to the very verbose logging?

@aaronleopold
Collaborator

The first test library was just 40 books in the root folder, my regular library is roughly as follows, one folder per

This sounds like it would trigger that bug I mentioned. The bug is specifically for media files existing in the root of a library. Since it was just a test, it shouldn't otherwise affect you (unless you have files in the library root elsewhere). It should be fixed as of the last nightly.

Nothing that I am aware of, any idea where I could be looking?

Nothing specific; I really am curious what the entire scan would output, though. No books at all is odd, but it happening only sporadically is even stranger. In general, keywords to look for in logs include error, warn, or panic.

P.S.: memory usage is now up to 5.8GB, this does somehow not seem quite right... Is this due to the very verbose logging?

I wouldn't think so; you're mapping to a directory on your host where the logs are written. I am not sure how long the logs would remain in a buffer somewhere if nobody is attached to it.

Would you be open to spinning up a test container if I were to push an image with the changes from #426? I think your setup is a good one for working out some of these issues.

@Cantello
Author

Cantello commented Aug 31, 2024

Would you be open to spinning up a test container if I were to push an image with the changes from #426? I think your setup is a good one for working out some of these issues.

Absolutely, just tell me what to do. Might take until Monday, though.

One thing that I noticed after starting a new library scan with reduced log verbosity (the only real error I found in the log):

2024-08-31T18:59:04.918447Z  INFO stump_core::job: Starting job, job_id: 532f68d0-50c8-419f-8b7b-17fdd35e65c0, job_name: "library_scan"
    at core/src/job/mod.rs:619
2024-08-31T18:59:22.479440Z  INFO stump_core::filesystem::scanner::library_scan_job: Executing the init task for library scan
    at core/src/filesystem/scanner/library_scan_job.rs:269
2024-08-31T18:59:22.638665Z  INFO stump_core::filesystem::scanner::library_scan_job: Executing the walk series task for library scan
    at core/src/filesystem/scanner/library_scan_job.rs:407
2024-08-31T19:00:10.251108Z ERROR tower_http::trace::on_failure: response failed, classification: Status code: 500 Internal Server Error, latency: 0 ms
    at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-http-0.4.4/src/trace/on_failure.rs:93

@aaronleopold
Collaborator

aaronleopold commented Aug 31, 2024

Absolutely, just tell me what to do. Might take until Monday, though.

I appreciate it! No rush on my end. I'll be pushing a perf-experimental tag for each commit on that PR. The first job should finish shortly: https://github.com/stumpapp/stump/actions/runs/10647624711/job/29515774379?pr=426

Edit to add: it just got pushed https://hub.docker.com/r/aaronleopold/stump/tags

There are two new configuration options to try out (a rough sketch of how such a concurrency cap typically works follows the list):

  • STUMP_MAX_SCANNER_CONCURRENCY: The maximum number of concurrent files which may be processed by a scanner (default 200)
  • STUMP_MAX_THUMBNAIL_CONCURRENCY: The maximum number of concurrent files which may be processed by a thumbnail generator (default 50)
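For context, a cap like this is commonly enforced with a semaphore. A rough sketch (assuming tokio and a hypothetical generate_thumbnail function; not Stump's actual code):

use std::sync::Arc;
use tokio::sync::Semaphore;

async fn generate_all(paths: Vec<String>) {
    // Read the documented option, falling back to its default of 50.
    let limit: usize = std::env::var("STUMP_MAX_THUMBNAIL_CONCURRENCY")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(50);
    let permits = Arc::new(Semaphore::new(limit));

    let mut handles = Vec::new();
    for path in paths {
        let permits = Arc::clone(&permits);
        handles.push(tokio::spawn(async move {
            // At most `limit` tasks hold a permit (and their image buffers) at once.
            let _permit = permits.acquire_owned().await.expect("semaphore closed");
            generate_thumbnail(&path).await;
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}

async fn generate_thumbnail(_path: &str) { /* decode, resize, encode */ }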

One thing that I noticed after starting a new library scan with reduced log verbosity (the only real error I found in the log):

Without more verbose logs it's hard to say what could have triggered a 500. I wouldn't necessarily attribute it to the scanner, though; it's likely a red herring. When you start a scan via an HTTP call, it sends an internal message to queue the job. The HTTP request should end by the time the job would start. The only time a 500 would happen is if the message to queue the job failed, but since the log shows some info-level lines from the scanner, I don't think that is what is happening.
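Roughly, the flow described above looks like this (a hypothetical sketch assuming an axum-style handler and a tokio channel; names and types are illustrative, not Stump's actual API):

use axum::{extract::State, http::StatusCode};
use tokio::sync::mpsc;

#[derive(Clone)]
struct Ctx {
    job_tx: mpsc::Sender<JobRequest>, // channel into the job worker
}

struct JobRequest {
    library_id: String,
}

// The HTTP request only enqueues the job and returns; the scan itself runs later on a worker.
async fn scan_library(State(ctx): State<Ctx>, library_id: String) -> StatusCode {
    match ctx.job_tx.send(JobRequest { library_id }).await {
        Ok(_) => StatusCode::OK,
        // Only a failure to enqueue would surface as a 500 from this endpoint itself.
        Err(_) => StatusCode::INTERNAL_SERVER_ERROR,
    }
}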

@aaronleopold
Collaborator

See #426 (comment) for some more context

@Cantello
Author

Cantello commented Sep 4, 2024

I tried the experimental docker image, but I guess there is something more afoot here with my setup than just slow thumbnail creation. Somehow, even after two days, the library creation & scanning process was still running and the status just showed 0% complete (0 of 1745 books), 0 in progress. It also seemed as if the frontend was still working but the backend did not provide any updates, e.g. the settings page was not displayed even though the web page switched to settings. According to docker ps the container was still running, and after logging into the container and installing procps, stump showed as running there as well.

I recreated the container after deleting all user data and it is currently working: 2/125 tasks done after 50 minutes for a library with ~700 books/100GB, and memory consumption is steadily rising with VERBOSITY=2 (now at 1.2GB RAM).
edit: restarted the container with verbosity=0 and still, after 2 hours, memory consumption rose to 6-7GB and the container was unresponsive.

@aaronleopold
Collaborator

aaronleopold commented Sep 4, 2024

To be clear, when you say you tried the experimental docker image you mean the literal experimental tag or the perf-experimental tag I pushed with (potential) improvements?

It also seemed as if the frontend was still working but the backend did not provide any updates, e.g. the settings page was not displayed even though the web page switched to settings

Yeah, that sounds like it was overwhelmed, e.g. no resources available to issue the queries that display the misc settings sections, so the suspense boundary stays up (nothing gets rendered). If you don't have the Live refetch preference enabled (settings > appearance), that is supposed to help. I agree, there is clearly something going on here with how your setup is interacting with Stump.

Is there any way for you to profile what is eating the memory, specifically? E.g. if the process is holding any files open, etc.

@Cantello
Author

Cantello commented Sep 5, 2024

To be clear, when you say you tried the experimental docker image you mean the literal experimental tag or the perf-experimental tag I pushed with (potential) improvements?

I was using perf-experimental.

Is there any way for you to profile what is eating the memory, specifically? E.g. if the process is holding any files open, etc.

Sorry, I'm not much of a developer (not at all) and thus cannot really help there. If you have any advice on what you'd need, I can go from there. From what I tried:

root@3e6fdecd9b67:/# free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       6.4Gi       3.4Gi        80Mi       5.7Gi       8.4Gi
Swap:            0B          0B          0B
root@3e6fdecd9b67:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   3736  1852 ?        Ss   06:05   0:00 /bin/bash /entrypoint.sh
root          36  0.0  0.0   6144  1408 ?        S    06:05   0:00 su stump -c /app/stump
stump         37  1.0  5.3 3232524 871560 ?      Ssl  06:05   1:09 /app/stump
root         107  0.0  0.0   3976  2952 pts/0    Ss   06:08   0:00 /bin/bash
root         715  0.0  0.0   7640  2584 pts/0    R+   07:57   0:00 ps aux
root@3e6fdecd9b67:/# lsof -p 37
COMMAND PID  USER   FD      TYPE DEVICE SIZE/OFF NODE NAME
stump    37 stump  cwd   unknown                      /proc/37/cwd (readlink: Permission denied)
stump    37 stump  rtd   unknown                      /proc/37/root (readlink: Permission denied)
stump    37 stump  txt   unknown                      /proc/37/exe (readlink: Permission denied)
stump    37 stump    0   unknown                      /proc/37/fd/0 (readlink: Permission denied)
stump    37 stump    1   unknown                      /proc/37/fd/1 (readlink: Permission denied)
stump    37 stump    2   unknown                      /proc/37/fd/2 (readlink: Permission denied)
[... lots more lines, 45 file descriptors in total]
root@3e6fdecd9b67:/var/log# lsof -p 36
COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
su       36 root  cwd    DIR   0,79     4096  4915529 /
su       36 root  rtd    DIR   0,79     4096  4915529 /
su       36 root  txt    REG   0,79    63568  7117593 /bin/su
su       36 root  mem    REG    8,1           7117593 /bin/su (path dev=0,79)
su       36 root  mem    REG    8,1           7117908 /lib/x86_64-linux-gnu/security/pam_permit.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117888 /lib/x86_64-linux-gnu/security/pam_deny.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117837 /lib/x86_64-linux-gnu/libnsl-2.28.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117811 /lib/x86_64-linux-gnu/libcrypt-2.28.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117924 /lib/x86_64-linux-gnu/security/pam_unix.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117899 /lib/x86_64-linux-gnu/security/pam_limits.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117903 /lib/x86_64-linux-gnu/security/pam_mail.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117890 /lib/x86_64-linux-gnu/security/pam_env.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117860 /lib/x86_64-linux-gnu/libpthread-2.28.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117859 /lib/x86_64-linux-gnu/libpcre.so.3.13.3 (path dev=0,79)
su       36 root  mem    REG    8,1           7117866 /lib/x86_64-linux-gnu/libselinux.so.1 (path dev=0,79)
su       36 root  mem    REG    8,1           7117911 /lib/x86_64-linux-gnu/security/pam_rootok.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117843 /lib/x86_64-linux-gnu/libnss_files-2.28.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117808 /lib/x86_64-linux-gnu/libcap-ng.so.0.0.0 (path dev=0,79)
su       36 root  mem    REG    8,1           7117813 /lib/x86_64-linux-gnu/libdl-2.28.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117799 /lib/x86_64-linux-gnu/libaudit.so.1.0.0 (path dev=0,79)
su       36 root  mem    REG    8,1           7117805 /lib/x86_64-linux-gnu/libc-2.28.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117880 /lib/x86_64-linux-gnu/libutil-2.28.so (path dev=0,79)
su       36 root  mem    REG    8,1           7117854 /lib/x86_64-linux-gnu/libpam_misc.so.0.82.1 (path dev=0,79)
su       36 root  mem    REG    8,1           7117852 /lib/x86_64-linux-gnu/libpam.so.0.84.2 (path dev=0,79)
su       36 root  mem    REG    8,1           7117791 /lib/x86_64-linux-gnu/ld-2.28.so (path dev=0,79)
su       36 root    0u   CHR    1,3      0t0        5 /dev/null
su       36 root    1w  FIFO   0,13      0t0 46841355 pipe
su       36 root    2w  FIFO   0,13      0t0 46841356 pipe

edit: silly me thought that the percentage indicator in the upper right corner was the progress of the library build, but it is actually the reading completion...

@aaronleopold
Collaborator

Sorry, I'm not much of a developer (not at all) and thus cannot really help there

No worries! I appreciate what you provided. Unfortunately nothing really sticks out to me, aside from curiosity around those Permission denieds. However, I was finally able to reproduce your issues! It's my fault; I should have really tried my initial investigations in a docker container, as that turned out to be the key part, but everything I tried was local for convenience.

I was able to get quickly growing memory usage in pretty much any scenario (host mount, SMB, NFS), easily reaching 10GB within 20 minutes. When cancelling the scan, the memory stabilizes and the OS does not free it; I just killed the container after it sat around 10GB idle for 30 minutes. I'm not 100% sure what the next steps should be, TBH. Since I am only able to reproduce this in docker, I'm thinking:

  • Experiment with allocators (e.g. https://crates.io/crates/jemallocator) when running in docker (see the snippet after this list)
  • Experiment with different base image(s)
  • Remove all IO operations and slowly add them back (this is extremely tedious since I have to rebuild an image each time, so it should be a last resort)
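For reference, swapping in jemalloc via the jemallocator crate mentioned above only takes a few lines (this is the standard usage from that crate's documentation):

use jemallocator::Jemalloc;

// Route all heap allocations through jemalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Everything in the program now allocates via jemalloc.
}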

@aaronleopold
Collaborator

aaronleopold commented Sep 5, 2024

Okay so I might have found something!

I think the root cause of this is an issue in the zip crate. I initially became suspicious of this when:

  1. Using a different allocator (tried mimalloc and jemalloc) didn't help or really change anything
  2. Changing to distroless image didn't help
  3. Scanning a (granted, much smaller) ebook library didn't trigger the issue
  4. Refreshing the books page for a library filled with ZIP files just kept increasing memory usage
  5. Refreshing the books page for a library filled with ebooks didn't increase memory usage 👀

It's important to note that the epub-rs library uses a different, older version of the zip crate. The zip crate was updated in this project a couple of months ago (#353). So, to verify that there is something funky going on with the zip crate, I downgraded and tested twice:

  1. Downgraded the zip crate to 0.6.6
  2. Downgraded the zip crate to 1.1.3 (the version used in epub-rs)

The results were pretty much the same between the two versions. Memory usage definitely did still increase over the duration of the scan, but we're talking orders of magnitude less than before. I was getting to over 10GB of usage within 20 minutes of scanning the NFS mount, but with the downgraded zip crate I was at 256.2MiB after 8 minutes, at which point the scan was complete. Memory then dropped back down by about 10MiB, so it's still holding on to some memory. In case the numbers aren't clear wrt the timing either: the NFS mount took 8 minutes to scan 1546 books after downgrading. If you recall/read any of the comments in the linked PR, this is a substantial difference to say the least lol

I think the next step will be to push another perf-experimental image with the downgraded zip crate and see if the issue is improved at all on your end. I might eagerly downgrade for the 0.0.6 release I was planning on doing this weekend, as well. As a side note, I'm contemplating keeping the distroless base image since it shrank the image size by 67%; IDK if anyone has thoughts on that.

I'll also try to find time to create a good reproduction to hand off to the zip crate maintainers to hopefully figure out what might be happening. Edit to add that I found an issue which might be relevant: zip-rs/zip2#138
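For anyone following along, the kind of archive access a .cbz scan performs looks roughly like this (a minimal sketch against the zip crate's standard API, not Stump's actual code); every page read decompresses into an in-memory buffer, which is where a regression in the crate or allocator shows up as steadily growing memory:

use std::{fs::File, io::Read};
use zip::ZipArchive;

// Hypothetical helper: pull the first image entry out of a .cbz archive.
fn first_page_bytes(path: &str) -> zip::result::ZipResult<Vec<u8>> {
    let file = File::open(path)?;
    let mut archive = ZipArchive::new(file)?;
    for i in 0..archive.len() {
        let mut entry = archive.by_index(i)?;
        let name = entry.name().to_lowercase();
        if name.ends_with(".jpg") || name.ends_with(".jpeg") || name.ends_with(".png") {
            // The entry is decompressed into this buffer in full.
            let mut buf = Vec::with_capacity(entry.size() as usize);
            entry.read_to_end(&mut buf)?;
            return Ok(buf);
        }
    }
    Err(zip::result::ZipError::FileNotFound)
}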

aaronleopold added a commit that referenced this issue Sep 5, 2024
aaronleopold added a commit that referenced this issue Sep 6, 2024
* ⬇️ Downgrade `zip` crate to `1.1.3`

Relates to #427

* fix test
@aaronleopold
Collaborator

@Cantello I'm going to plan on merging the linked PR sometime over the weekend. If you're able to, let me know whether the latest perf-experimental improves the situation on your end at all

@Cantello
Author

Cantello commented Sep 9, 2024

@Cantello I'm going to plan on merging the linked PR sometime over the weekend. If you're able to, let me know whether the latest perf-experimental improves the situation on your end at all

I just tried the latest perf-experimental container again and yes, it has greatly improved the situation here. The whole library (for one language: 700 files, 100GB) was scanned and thumbnails were created in ~14 minutes. Memory usage is high (1.8GB) but not dramatically so.
So, thanks for the extremely quick fix and for your hard work identifying this problem! Two thumbs up!

@aaronleopold
Collaborator

Awesome, thanks for re-testing with the updates! I'll go ahead and close out this issue now that the slowness aspect has been resolved. I'll try to keep an eye on the memory usage in docker; it's odd that running Stump outside docker doesn't have the same issue. If you find the memory usage worsens, feel free to create a separate issue.

Thanks for outlining your experience and pointing the slowness out! You directly helped in figuring out the huge performance regression 🤝

@github-project-automation github-project-automation bot moved this from Triage to Done in v0.1.x Sep 9, 2024