Queues not terminating gracefully #23050
For debugging purposes: … Edit: this is also showing up on gitea.com, although not during shutdowns.
I'm also not seeing … So I guess there is some problem with the cache context in the workers?
If I had to guess, this is going to be a misreport: there's going to be a non-deterministic hierarchy of cancelled contexts, the wrong one of which is being picked and reported. I'd need to spend a good few hours looking through the code and thinking about the concurrency again, which I won't have time to do until the weekend.
The context cache should currently only be used at the HTTP request level. Do you have a Redis cache server?
@lunny they are using memcached for cache.
This may occur when Gitea starts up.
I see these … Here is the part of the log from the production server where it got the shutdown signal and then ran into errors, if it helps:
@lunny the same ctx error can be seen in the Gitea.com logs, and occurs at times other than startup and shutdown.
I have sent PR #23054, but I don't think it will actually fix this problem; it only removes the unnecessary warning logs.
From the logs, it looks like a normal termination process.
Everything seems to shut down within 1s while … If it's not meant for queues, is the mechanism that work gets added back to the queue on such errors, to be redone after restart? It wasn't doing that as far as I could tell, but I may well have missed something.
I think the logic should be: if all workers shut down in less than 60s, the process exits immediately. If some workers' shutdown times out (exceeds 60s), a hammer shutdown is executed and the program exits at that point.
See comment and upstream issue go-gitea#23050 for details
I found more details (and a very hacky workaround) for pull requests getting stuck in conflict checking.
The persistent queue in this case seems unnecessary as …
The steps to reproduce this are: …
This is a test for this problem. The results are quite surprising.
func TestPullRequest_QueueStuck(t *testing.T) {
setting_module.AppWorkPath = "/tmp"
_ = util.RemoveAll("/tmp/data")
unittest.PrepareTestEnv(t)
setting_module.InitProviderAndLoadCommonSettingsForTest()
setting_module.LoadQueueSettings()
q1 := func() (completedTasks []string) {
startWhen100Ready := make(chan struct{}) // only start consuming data when all 100 tasks have been pushed into the queue
stopAt20Shutdown := make(chan struct{}) // stop and shutdown at the 20th item
testHandler := func(data ...queue.Data) []queue.Data {
<-startWhen100Ready
time.Sleep(100 * time.Millisecond)
for _, datum := range data {
s := datum.(string)
completedTasks = append(completedTasks, s)
if s == "task-20" {
close(stopAt20Shutdown)
return nil
}
}
return nil
}
q := queue.CreateUniqueQueue("pr_patch_checker_test", testHandler, "")
q.Run(func(atShutdown func()) { go func() { <-stopAt20Shutdown; atShutdown() }() }, func(atTerminate func()) {})
// add 100 tasks to the queue
for i := 0; i < 100; i++ {
_ = q.Push("task-" + strconv.Itoa(i))
}
close(startWhen100Ready)
<-stopAt20Shutdown
return
}
q2 := func() (executedTasks []string, hasTasks []string) {
stop := make(chan struct{})
// collect the tasks that have been executed
testHandler := func(data ...queue.Data) []queue.Data {
for _, datum := range data {
executedTasks = append(executedTasks, datum.(string))
}
return nil
}
q := queue.CreateUniqueQueue("pr_patch_checker_test", testHandler, "")
q.Run(func(atShutdown func()) { go func() { <-stop; atShutdown() }() }, func(atTerminate func()) {})
// wait for a while to see whether there are tasks to get executed.
time.Sleep(1 * time.Second)
// check whether the tasks are still in the queue
for i := 0; i < 100; i++ {
if has, _ := q.Has("task-" + strconv.Itoa(i)); has {
hasTasks = append(hasTasks, "task-"+strconv.Itoa(i))
}
}
close(stop)
return
}
q3 := func() (executedTasks []string, hasTasks []string) {
stop := make(chan struct{})
testHandler := func(data ...queue.Data) []queue.Data {
for _, datum := range data {
executedTasks = append(executedTasks, datum.(string))
}
return nil
}
q := queue.CreateUniqueQueue("pr_patch_checker_test", testHandler, "")
q.Run(func(atShutdown func()) { go func() { <-stop; atShutdown() }() }, func(atTerminate func()) {})
// re-run all tasks
for i := 0; i < 100; i++ {
_ = q.Push("task-" + strconv.Itoa(i))
}
// wait for a while
time.Sleep(1 * time.Second)
// check whether the tasks are still in the queue
for i := 0; i < 100; i++ {
if has, _ := q.Has("task-" + strconv.Itoa(i)); has {
hasTasks = append(hasTasks, "task-"+strconv.Itoa(i))
}
}
close(stop)
return
}
completedTasks1 := q1() // run some tasks and shutdown at an intermediate point
time.Sleep(time.Second)
executedTasks2, hasTasks2 := q2() // restart the queue to check the tasks in it
time.Sleep(time.Second)
executedTasks3, hasTasks3 := q3() // try to re-run all tasks
log.Error("TestPullRequest_QueueStuck completed1=%v, executed2=%v, has2=%v, executed3=%v, has3=%v",
len(completedTasks1), len(executedTasks2), len(hasTasks2), len(executedTasks3), len(hasTasks3))
}
Partially fix #23050 After #22294 merged, it always has a warning log like `cannot get context cache` when starting up. This should not affect any real life but it's annoying. This PR will fix the problem. That means when starting up, getting the system settings will not try from the cache but will read from the database directly. --------- Co-authored-by: Lauris BH <[email protected]>
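The fix described above amounts to a cache-or-database fallback along these lines. This is a hypothetical sketch, not Gitea's actual code; the function names and setting key are invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// errNoCache stands in for the startup situation where no context cache
// exists yet.
var errNoCache = errors.New("cannot get context cache")

// getSetting tries the cache first and silently falls back to reading from
// the database when no cache is available, instead of logging a warning for
// a situation that is expected during startup.
func getSetting(key string, fromCache func(string) (string, error), fromDB func(string) string) string {
	if v, err := fromCache(key); err == nil {
		return v
	}
	return fromDB(key)
}

func main() {
	fromCache := func(string) (string, error) { return "", errNoCache }
	fromDB := func(key string) string { return "db:" + key }
	// During startup there is no cache yet, so the value comes from the DB.
	fmt.Println(getSetting("some.setting", fromCache, fromDB)) // prints: db:some.setting
}
```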
Backport #23054 Partially fix #23050 After #22294 merged, it always has a warning log like `cannot get context cache` when starting up. This should not affect any real life but it's annoying. This PR will fix the problem. That means when starting up, getting the system settings will not try from the cache but will read from the database directly. Co-authored-by: Lunny Xiao <[email protected]> Co-authored-by: Lauris BH <[email protected]>
@wxiaoguang Thanks for your testcase. It's not quite right, but it has pointed me in the correct direction. Queues are extremely concurrent, and the handler in q1 is going to drop a lot of data: you cannot expect that no more data will be handled as soon as you tell a queue to shut down; in fact you can almost guarantee that at least a few more items will be handled. I'm working through it to make it properly concurrency-safe, and I'll place it in modules/queue once I've slightly rewritten it. (I'm not sure what q3 is supposed to reliably do.)

Now to address the warning. If you switch on trace logging for modules/queue:

gitea manager logging add console --name traceconsole --level TRACE --expression modules/queue

[log]
MODE = ..., traceconsole
...

[log.traceconsole]
LEVEL=trace
MODE=console
EXPRESSION=modules/queue

you would see that the internal channelqueue is explicitly drained at Terminate. There should be no loss of data; the warning is not the cause of the issue you're investigating. I assume you're actually investigating things not being emptied on startup, and that's a real bug. If you look at: gitea/modules/queue/queue_disk_channel.go Lines 175 to 197 in 8540fc4
and compare with: gitea/modules/queue/unique_queue_disk_channel.go Lines 212 to 226 in 8540fc4
I think you'll see the real bug.
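The restore path being compared above is, in spirit, a startup drain loop like the following. This is a heavily simplified sketch under my own assumptions (the real queues are concurrent and leveldb-backed); it only illustrates why a skipped or mis-wired restore leaves persisted tasks stuck:

```go
package main

import "fmt"

// restore drains items persisted by a previous run back into the live
// in-memory queue at startup. If this loop is skipped or wired to the wrong
// underlying queue, persisted tasks stay stuck forever.
func restore(persisted *[]string, push func(string)) int {
	n := 0
	for len(*persisted) > 0 {
		item := (*persisted)[0]
		*persisted = (*persisted)[1:]
		push(item)
		n++
	}
	return n
}

func main() {
	persisted := []string{"task-21", "task-22", "task-23"} // left over from the last run
	var live []string
	restored := restore(&persisted, func(s string) { live = append(live, s) })
	fmt.Println(restored, len(persisted), len(live)) // prints: 3 0 3
}
```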
Actually, I do not care about whether the queue is reliable or not. And in most cases, the queue doesn't need to be reliable. But …
As I said, your testcase is somewhat incorrect in form. The reason you're seeing somewhat inconsistent behaviour is partially because of this, but also because of the bug that I mentioned above. Thank you for the testcase though: whilst it's not quite right, it made me realise that the bug is the one I've mentioned above, which I've fixed in the attached PR. The testcase has also been fixed.
There have been a number of reports of blocked PRs being checked which have been difficult to debug. In investigating go-gitea#23050 I have realised that whilst the Warn there is somewhat of a miscall there was a real bug in the way that the LevelUniqueQueue was being restored on start-up of the PersistableChannelUniqueQueue. This PR fixes this bug and adds a testcase. Fix go-gitea#23050 and others Signed-off-by: Andrew Thornton <[email protected]>
After applying #23154, I can still see some strange results (just strange; I do not know whether they are right or wrong).
Are these two behaviors by design? If yes, I think some comments would be very useful. If no, should they be fixed or not? Feel free to ignore my comment if you think they are unrelated. P.S.: I didn't get the point about what you mean by "incorrect" or "not quite right". That test code is just used to test the queue's behavior; there is no assertion in it, and I didn't say anything about "correct" or "right" before.

// q3 test handler: use `Has` so each task is only executed once
testHandler := func(data ...queue.Data) []queue.Data {
for _, datum := range data {
s := datum.(string)
if has, _ := q.Has(s); !has {
executedTasks = append(executedTasks, s)
}
}
return nil
}
There have been a number of reports of PRs being blocked whilst being checked which have been difficult to debug. In investigating #23050 I have realised that whilst the Warn there is somewhat of a miscall there was a real bug in the way that the LevelUniqueQueue was being restored on start-up of the PersistableChannelUniqueQueue. Next there is a conflict in the setting of the internal leveldb queue name - This wasn't being set so it was being overridden by other unique queues. This PR fixes these bugs and adds a testcase. Thanks to @brechtvl for noticing the second issue. Fix #23050 and others --------- Signed-off-by: Andrew Thornton <[email protected]> Co-authored-by: techknowlogick <[email protected]>
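The second issue mentioned above, the unset internal leveldb queue name, can be illustrated with a toy model. These are hypothetical names and the real code resolves the backing queue differently; the sketch only shows how an empty per-queue name makes distinct queues collide on one backing store:

```go
package main

import "fmt"

// backingName returns the key used to select the underlying leveldb queue.
// When the per-queue name is left empty, every queue collapses onto the same
// default key, so distinct unique queues end up sharing (and overriding)
// each other's backing storage.
func backingName(queueName string) string {
	if queueName == "" {
		return "default" // the conflicting fallback
	}
	return queueName
}

func main() {
	stores := map[string][]string{}
	push := func(queueName, item string) {
		key := backingName(queueName)
		stores[key] = append(stores[key], item)
	}
	// Two different queues that both leave their internal name unset:
	push("", "pr_patch_checker:task-1")
	push("", "mail:task-1")
	fmt.Println(len(stores), len(stores["default"])) // prints: 1 2
}
```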
Backport #23154 There have been a number of reports of PRs being blocked whilst being checked which have been difficult to debug. In investigating #23050 I have realised that whilst the Warn there is somewhat of a miscall there was a real bug in the way that the LevelUniqueQueue was being restored on start-up of the PersistableChannelUniqueQueue. Next there is a conflict in the setting of the internal leveldb queue name - This wasn't being set so it was being overridden by other unique queues. This PR fixes these bugs and adds a testcase. Thanks to @brechtvl for noticing the second issue. Fix #23050 and others Signed-off-by: Andrew Thornton <[email protected]> Co-authored-by: zeripath <[email protected]> Co-authored-by: techknowlogick <[email protected]> Co-authored-by: delvh <[email protected]>
Description
We are running into the issue where, if a Gitea restart happens during merge conflict checking, PRs remain stuck in the conflict state forever. I think there are two distinct issues here; one is that these don't get unstuck on new pushes, and perhaps that's best left for another report.
However, the reason things get into this state in the first place seems to be a problem in shutdown: it terminates all the queues immediately instead of respecting the hammer time.
When starting `./gitea web` and then doing `killall gitea`, I see the following in the log: … This log is from my local test instance with configuration set to defaults as much as possible.
On our production instance, that `Terminated before completed flushing` is leading to a lot of different errors as workers get terminated in the middle of what they're doing. I can provide more detailed logs if needed, but maybe this is easy to reproduce on any instance.
Gitea Version
main (43405c3)
Can you reproduce the bug on the Gitea demo site?
No
Log Gist
No response
Screenshots
No response
Git Version
No response
Operating System
No response
How are you running Gitea?
Own build from main.
Database
None