Tobira seems to freeze randomly, not responding to HTTP requests at all anymore #1129
Comments
I skimmed the changelogs for tokio and hyper (starting from the version used in Tobira v2.5). The following issues are potentially relevant. But I haven't looked into any of those in detail:
FWIW: I saw that problem once or twice before we upgraded, at least if it is the problem we saw at our meeting last week / discussed with Piri and Waldemar.
Fixes elan-ev#1129

The "frozen Tobira" bug was caused by the following: we had two in-memory hashmaps that cached different things. Each entry in those caches has a timestamp and is evicted/ignored once it is too old. The eviction is done by a long-running task that calls `retain` on the cache. Of course, the cache is also read. In the case of the user cache, these steps are regularly performed:

1. Does an entry for this username exist? If yes:
2. Is this entry too old, or does it differ from the new data? If yes:
3. Write the new data to the database.
4. Update the timestamp in the cache entry to mark it as "brand new".

The tricky part is this: step 3 contains an `.await` point (waiting for the DB to reply). At that point, the Tokio runtime will pause that task and run a different task on that same thread. If that different task happens to be the maintenance task that calls `retain` on the hashmap, then we are in trouble: the documentation of `retain` says:

> May deadlock if called when holding any sort of reference into the map

And in [this issue](xacrimon/dashmap#233) it is clarified that this means:

> May deadlock if called when **the current thread is** holding any sort of reference into the map.

That is usually not a problem in non-async multi-threaded code, but in async code, when a lock/reference is held across an await point, it can happen that the same thread holds a lock and tries to call `retain`. (A minimal sketch of this pattern is included below.) The solution is to not do that: avoid holding locks across await points (good advice generally anyway). However, that isn't trivial in our case, and much more importantly: this is a subtle mistake that is easy to make! I don't want bugs like this cropping up accidentally again. I'm sure there is a clippy lint for that, but still, it feels wrong to me. So I decided to use a different library that does not have this problem. It's not yet as widely used, but it seems very promising.

The auth callback cache could also use `HashIndex`, which might be faster. But for simplicity I kept both `HashMap`s for now. Should be plenty fast.
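To illustrate, here is a minimal, self-contained sketch of the problematic pattern. This is hypothetical code, not the actual Tobira cache: the names, values, and timings are made up, and the sleep stands in for the DB round-trip. On a single-threaded runtime this hangs by design.

```rust
// Hypothetical reduction of the bug: NOT the actual Tobira code.
// Running this on a current-thread runtime hangs, which is the point.
use std::{sync::Arc, time::Duration};
use dashmap::DashMap;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let cache: Arc<DashMap<String, u64>> = Arc::new(DashMap::new());
    cache.insert("alice".into(), 123);

    // "User cache" task: reads an entry and then awaits (standing in for the
    // DB write of step 3) while the `Ref` guard into the map is still alive.
    let user_task = {
        let cache = Arc::clone(&cache);
        tokio::spawn(async move {
            if let Some(entry) = cache.get("alice") {
                // The guard (and thus a shard lock) is held across this await.
                tokio::time::sleep(Duration::from_millis(500)).await;
                println!("cached timestamp: {}", *entry);
            }
        })
    };

    // "Maintenance" task: evicts entries. If it gets scheduled on the same
    // thread while the guard above is alive, `retain` blocks the thread and
    // the guard can never be dropped -> deadlock.
    let eviction_task = {
        let cache = Arc::clone(&cache);
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(100)).await;
            cache.retain(|_, timestamp| *timestamp > 0);
        })
    };

    let _ = tokio::join!(user_task, eviction_task);
    println!("never reached");
}
```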
Fixes #1129

See the first commit description for the technical details on how this was caused. The gist is: I incorrectly used `DashMap`, holding locks across await points. This causes a deadlock if the timing is right and two specific tasks are scheduled to run on the same thread. I could have fixed the code around the await point, but since this is a very easy and subtle mistake to make, I decided to use a different concurrent hashmap that can better deal with these scenarios.

The second commit also fixes the cache dealing with a cache duration of 0 (which is supposed to disable the cache). See the commits for more details.

On the ETH test system I deployed v2.6 plus this patch and verified that the freeze no longer happens in the only situation where I could reliably reproduce it before: starting Tobira and immediately making an authenticated request. Since the cache_duration was set to 0, the timing somehow worked out most of the time. With this patch, doing that does not freeze Tobira anymore (I tried several times).

---

<details>
<summary>Some additional tests/details</summary>

The `scc` hashmap has no problem when a lock is held on the same thread that `retain` is called on. I tested it like this:

```rust
use std::{sync::Arc, time::{Duration, Instant}};
use scc::HashMap;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();
    let map = HashMap::new();
    let _ = map.insert_async("foo", 4).await;
    let _ = map.insert_async("bar", 27).await;
    let map = Arc::new(map);

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            println!("--- {:.2?} calling entry", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry", start.elapsed());
        });
    }
    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(1500)).await;
            println!("--- {:.2?} calling entry 2", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry 2", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry 2", start.elapsed());
        });
    }

    tokio::time::sleep(Duration::from_millis(1000)).await;
    println!("--- {:.2?} calling retain", start.elapsed());
    map.retain_async(|_, v| *v % 2 != 0).await;
    println!("--- {:.2?} done retain", start.elapsed());
}
```

This outputs:

```
--- 31.88µs calling entry
--- 38.91µs acquired entry
--- 1.00s calling retain
--- 1.50s calling entry 2
--- 3.00s done with entry
--- 3.00s acquired entry 2
--- 6.00s done with entry 2
--- 6.00s done retain
```

Of course a single test doesn't guarantee that this is supported by the library, but the docs also indicate that the library can deal with these situations: "async" is used everywhere, and these kinds of situations regularly occur in async code.

Edit: thinking about it more, I suppose the key feature here is that `retain_async` can yield, whereas dashmap's `retain` could only block when it couldn't make any progress.

</details>
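For comparison, the alternative fix mentioned above ("fixing the code around the await point") would amount to never holding a guard across the await: copy what is needed out of the map, drop the guard, do the DB write, and only then touch the map again. The following is only a sketch under that assumption; the names (`CachedUser`, `update_user`, `write_to_db`) and the 60-second threshold are hypothetical, not taken from the Tobira code.

```rust
// Sketch of the "fix around the await point" alternative (hypothetical names).
use std::time::Instant;
use dashmap::DashMap;

struct CachedUser {
    last_written: Instant,
    display_name: String,
}

async fn update_user(cache: &DashMap<String, CachedUser>, username: &str, new_name: &str) {
    // Decide whether a DB write is needed, then let the guard go out of scope
    // so that no lock is held across the `.await` below.
    let needs_write = match cache.get(username) {
        Some(entry) => {
            entry.last_written.elapsed().as_secs() > 60 || entry.display_name != new_name
        }
        None => true,
    }; // guard dropped here

    if needs_write {
        write_to_db(username, new_name).await;
        // Touch the map again only after the await to refresh the entry.
        cache.insert(username.to_owned(), CachedUser {
            last_written: Instant::now(),
            display_name: new_name.to_owned(),
        });
    }
}

async fn write_to_db(_username: &str, _new_name: &str) {
    // Stand-in for the actual DB round-trip.
}
```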
See the linked PR for an in-depth explanation of the cause. To add some more useful detail:
From ETH and Bern, we got reports of Tobira randomly ceasing to work: no log output is produced anymore, and HTTP requests are not responded to at all. The process still runs, but it seems like it's not doing anything anymore. The log does not contain any relevant entries.
Bern is running v2.5 (where it first occurred AFAIK), while ETH is running v2.6 (and they apparently did not see that problem on v2.5). Both are running it behind an nginx. From ETH I heard that it could happen immediately, or after hours or days. Bern reported it failing on roughly 26.02. 17:27, 22.02. 21:10 and 10.02. 16:53.
Playing around on the ETH system right now, I could somewhat reliably cause the problem to occur by immediately refreshing the page after restarting Tobira. I couldn't yet identify a relevant HTTP request, but assume that it's a GraphQL request. The `index.html` was almost (I think?) delivered correctly. The problem also occurred with the `-musl` build of v2.6. ETH observed that the memory usage as reported by `systemctl status` was only 784KB for the frozen Tobira, but 50MB for a normally running one. I did not see those numbers in my experiments, but it might still be useful information. The frozen process does not consume any noticeable CPU time (i.e. not a full core, for example).

I now deployed a MUSL build of 97d28b2, which is just v2.6 + #1120. There we update Tokio and hyper, two prime candidates for causing things like that. So far I could not reproduce the problem, I think? I will see if it continues to work for a few days.
I have not yet gone through the changelogs of Tokio and hyper, but will do so soon. Other things I have planned to debug this:
- Use `strace` and friends to see what's happening.