docker kiwix-serve crashing #579

Closed
monstermaker opened this issue Oct 10, 2022 · 18 comments · Fixed by kiwix/libkiwix#834

@monstermaker

I have been running the stackoverflow ZIM file and found that the server crashes intermittently. Initially, it appeared to crash whenever the search was used with the keywords "add" and "delete", but then it stopped crashing on these and now just crashes from time to time.
I cannot see the exit message: on crashing, it appears to list the contents of the /data folder, which shows up as a very long list in my console window. I cannot scroll back far enough to see any messages before the listing of /data/.npm and /data/.cache, so I cannot see what may be causing it.

We are running on Ubuntu 18.04 using the latest docker image.

@kelson42 kelson42 added this to the 3.3.1 milestone Oct 10, 2022
@kelson42 kelson42 added the bug label Oct 10, 2022
@kelson42
Contributor

Sounds really similar to #573. It seems there is some kind of instability, but so far it is pretty unclear where.

@monstermaker
Author

After running docker logs on the container and paging the output through less, I found that there is no error output. I get just the standard output telling me the IP, port, etc. and that it is running, and then immediately after that the debug output "the content of /data is" followed by hundreds of lines listing that content, as described above.
Additionally, docker stats on the container shows heavy resource usage when using the search tool, with long delays before the autosuggestions come up and sometimes CPU usage of over 100%. After a large spike in usage, it crashes.
I believe that perhaps it is the larger packages that are causing the issue. To test this, I have now loaded only the Wiki100 ZIM file, which is one of the smallest, and have 14 users continually searching to see if it crashes.
I will run this for a few hours and post the results.

@monstermaker
Author

OK, so I ran this for 4 hours with up to 14 users trying it, sometimes together and sometimes separately, with no issues. The server stayed up, and while I monitored the Docker container, the CPU and memory usage were quite low, although higher in the stats than I would expect (Docker is still fairly new to me). Everything appears stable on a small ZIM file.
I tried the stackoverflow file again and, sure enough, it still crashed. I monitored the container with docker stats and it appears to crash when it consistently shows CPU usage of over 100%. I am not sure how you can use over 100% CPU, but that is what is said. As I understand it, the CPU usage stated should be how much of the host's CPU the container uses, so I was confused by it being over 100%. I have monitored the host machine and its CPU usage has not gone over 11% at its peak, so I am very confused about this.
I think it is crashing due to overusing its resources, but it is not crashing the host machine or even using excess resources there. Another note is that the host machine is in fact a VM running on VMware. It also runs other systems, such as GitLab and Moodle, at the same time.

I hope this gives some clues as to the issue. I never managed to get kiwix-serve working outside a Docker container, so I cannot see how it performs there.

@kelson42
Contributor

@veloman-yunkan We should really try to reproduce the error with the specific SO file and get the core dump. From there it should be easier to diagnose the problem... hopefully.

@veloman-yunkan
Collaborator

@kelson42 I will try to reproduce the crash

@veloman-yunkan
Collaborator

veloman-yunkan commented Oct 12, 2022

Crash was reproduced using the latest (3.3.0-1) docker image of kiwix-serve and http://download.kiwix.org/zim/stack_exchange/stackoverflow.com_en_all_2022-05.zim. Will debug it.

@mgautierfr
Member

> I am not sure how you can use over 100% CPU, but that is what is said

CPU percentage is relative to one core. If you have 4 cores, the maximum CPU usage is 400%. So a percentage above 100% just means that we use more than one core (and with multithreading, that is easy).

@monstermaker
Author

Thanks @mgautierfr, that makes sense.

@veloman-yunkan
Collaborator

veloman-yunkan commented Oct 12, 2022

After kiwix-serve is started, it can take as little as only two properly timed requests to the /suggest endpoint to result in a segmentation fault.

@veloman-yunkan
Collaborator

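On my machine, with a hot filesystem cache, the following script crashes kiwix-serve with quite high probability:

```bash
#!/usr/bin/env bash

# Start kiwix-serve in the background on port 8080
./kiwix-serve --verbose -p 8080 stackoverflow.com_en_all_2022-05.zim &
# Give the server a moment to start listening
sleep 1
(
  # Two /suggest requests for the same book, the second sent 0.2 s after the first,
  # so that both are likely to be in flight at the same time
  curl 'http://localhost:8080/suggest?content=stackoverflow.com_en_all_2022-05&userlang=en&term=c' &
  sleep 0.2; curl 'http://localhost:8080/suggest?content=stackoverflow.com_en_all_2022-05&userlang=en&term=co' &
)
# Wait for the background kiwix-serve job (returns when it exits, e.g. on a crash)
wait
```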

@veloman-yunkan
Collaborator

Such a crash scenario should greatly facilitate debugging.

@veloman-yunkan
Collaborator

veloman-yunkan commented Oct 13, 2022

These crashes caused by concurrent suggestion requests on the same book are most likely due to the combination of:

  1. the class zim::SuggestionSearcher and friends (zim::SuggestionSearch, zim::SuggestionDataBase, etc.) not being thread safe, and
  2. caching of the searcher objects introduced by kiwix/libkiwix#620 (Introduce Search/Searcher Caching to Internal Server)

Therefore concurrent /search requests on the same book should be subject to a similar bug. /suggest is simply more vulnerable to it because of its usage pattern. While /search requests are not temporally clustered in any particular way, /suggest requests (for the same book) tend to arrive in bursts as the user types in the search box. On large ZIM files, where fulfilling a /suggest request may take quite long, two or more successive requests may be processed concurrently using the same zim::SuggestionSearcher object, violating Xapian's requirements on concurrent access:

> If you really want to access the same Xapian object from multiple threads, then you need to ensure that it won’t ever be accessed concurrently (if you don’t ensure this bad things are likely to happen - for example crashes or even data corruption). One way to prevent concurrent access is to require that a thread gets an exclusive lock on a mutex while the access is made.

@kelson42
Contributor

This is obviously a top priority to fix, and no new release of libzim/libkiwix should be done before it is fixed. I would appreciate it if such scenarios were introduced in automated tests too.

@mgautierfr
Member

The search should be protected against race conditions: https://github.com/kiwix/libkiwix/blob/master/include/library.h#L145-L158 and https://github.com/kiwix/libkiwix/blob/master/src/server/internalServer.cpp#L781

I don't know why this is not the same case for suggestion.

@veloman-yunkan
Copy link
Collaborator

> The search should be protected against race conditions: https://github.com/kiwix/libkiwix/blob/master/include/library.h#L145-L158 and https://github.com/kiwix/libkiwix/blob/master/src/server/internalServer.cpp#L781
>
> I don't know why this is not the same case for suggestion.

Protection against race conditions in the /search endpoint was introduced later, in kiwix/libkiwix#729, in the context of implementing a significant enhancement to search. The fact that a similar bug existed in a similar piece of code went unnoticed.

@mgautierfr
Member

Yes, I saw the issue for search (and so implemented the protection there), but I totally missed the case for suggestion.

@monstermaker
Author

Will the docker image be updated with this fix?

@kelson42
Contributor

@monstermaker Yes, once the release is done. That should be soon, but there is no clear date for the moment.
