
implement data dump/crash dump for diagnosis of rare panics/crashes on (probably) data-driven defects #6684

Closed
christophcemper opened this issue Nov 11, 2024 · 4 comments
Labels
upstream ⬆️ Relates to some dependency of this project

Comments

@christophcemper

1. What version of Caddy are you running (caddy -version)?

now 2.8.4 with

xcaddy build --output bin/LRTcaddy --with github.com/christophcemper/[email protected] --with github.com/caddyserver/transform-encoder --with github.com/pteich/caddy-tlsconsul --with github.com/caddyserver/cache-handler

after 10 months on stable 2.7.6 with

xcaddy build --output bin/LRTcaddy --with github.com/christophcemper/[email protected] --with github.com/caddyserver/transform-encoder --with github.com/pteich/caddy-tlsconsul

which never experienced this very rare, previously unseen crash that has now occurred only 2 times in 3 weeks.

It's unreproducible, yet it takes down production for multiple projects when it happens.

Such rare crashes should at least provide more details via a panic handler when they happen.

2. What are you trying to do?

Find out how to create a scenario for reproducing it.
Find out how to create a trimmed-down config by dumping the payload data being processed - not just static line numbers.

Some ideas of what to dump, at least:

  • which sub/domain
  • which client
  • which input URLs/params were processed?
  • other helpful details for conditions.

Currently we only have a naked stack trace as posted in #6683
which is not helpful (enough) to do any config reduction or create a reproducible case.
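For illustration, here is a minimal sketch of the kind of panic handler I have in mind (my own hypothetical code, written against plain net/http rather than Caddy internals): it recovers, dumps the request details listed above, then re-panics so the stack trace is preserved.

// Hypothetical recovery middleware (sketch only, not Caddy code):
// on panic it logs which host, client, and URL/params were being
// processed, then re-panics so the normal stack trace still prints.
package main

import (
	"log"
	"net/http"
)

func withPanicDump(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				log.Printf("panic=%v host=%s client=%s uri=%s",
					rec, r.Host, r.RemoteAddr, r.RequestURI)
				panic(rec) // re-panic to keep the original stack trace
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		panic("TEST-PANIC: Simulated panic for log testing")
	})
	log.Fatal(http.ListenAndServe(":8080", withPanicDump(mux)))
}

(Of course this would only catch real panics, not runtime fatal errors.)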

3. What is your entire Caddyfile?

NA

4. How did you run Caddy (give the full command and describe the execution environment)?

/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile

via a systemd service

systemctl status caddy
● caddy.service - Caddy
Loaded: loaded (/lib/systemd/system/caddy.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2024-11-10 11:45:13 UTC; 7h ago
Docs: https://caddyserver.com/docs/
Process: 1929477 ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile --force (code=exited, status=0/SUCCESS)
Main PID: 1879405 (caddy)
Tasks: 39 (limit: 154411)
Memory: 2.2G
CGroup: /system.slice/caddy.service
└─1879405 /usr/bin/caddy run --environ --config /etc/caddy/Caddyfile

5. What did you expect to see?

Actually processed data, more than the stack trace below.

6. What did you see instead (give full error messages and/or log)?

Oct 28 (intro for 1st goroutine - longer log attached)
Oct 28 15:05:08 web01 caddy[2621539]: fatal error: concurrent map read and map write

Oct 28 15:05:08 web01 caddy[2621539]: goroutine 3873976361 [running]:

Oct 28 15:05:08 web01 caddy[2621539]: github.com/caddyserver/caddy/v2/modules/caddyhttp.GetVar({0x1f9bd08?, 0xc015b73290?}, {0x1a68ef6, 0xe})

Right after
Oct 28 15:05:08 web01 caddy[2621539]: fatal error: concurrent map read and map write

it could print all possible details known, based on the ideas in point 2.

7. How can someone who is starting from scratch reproduce this behavior as minimally as possible?

Due to the lack of data/config details in the panic log, we are not able to reproduce the issue in any way yet, nor do we know how to isolate it down to the host/domain that is causing it.

@mholt
Member

mholt commented Nov 11, 2024

I agree that can be frustrating; we can at least see where the read or write is happening, but without the race detector, we can't get more information (there is a performance hit to keeping track of all that).

The Go runtime is crashing the program, there's nothing we can do to recover it or get more information.

If you build Caddy with the Go race detector enabled (be advised there is a significant performance penalty), it will give you the information you desire. Set -race when compiling, or use XCADDY_RACE_DETECTOR=1 if building with xcaddy.
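For example, adapting the build command from this issue:

XCADDY_RACE_DETECTOR=1 xcaddy build --output bin/LRTcaddy --with github.com/christophcemper/[email protected] --with github.com/caddyserver/transform-encoder --with github.com/pteich/caddy-tlsconsul --with github.com/caddyserver/cache-handler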

Closing, since there's nothing actionable for us, but feel free to continue the discussion if needed.

@mholt closed this as not planned Nov 11, 2024
@mholt added the upstream ⬆️ label Nov 11, 2024
@christophcemper
Author

Thanks for the hints and advice, we'll investigate this for sure.

@christophcemper
Author

Thanks @mholt - I've added this to our custom builds and tests.

Unfortunately it doesn't look like Caddy shuts down cleanly, even for a single HTTP request, at least in our latest known-stable version that we put through QA and will soon put back to PROD:

CADDY_VERSION=v2.7.6
TLS_CONSUL_VERSION=v1.4.0
CACHE_HANDLER_VERSION=v0.12.0

which, for a simulated panic triggered with

panic("TEST-PANIC: Simulated panic for log testing")

just returns to curl a

curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)

or simply an ugly browser error.


and we're not seeing any kind of benefits/outputs from the XCADDY_RACE_DETECTOR but I understood that as only related to the go race detection feature, not dumps on panic

In the dev/test Caddyfile we have this at global scope, and expected a separate "panic.log" to be written:

	log {
		output file ./log/debug.log {
			roll_size 1gb
			roll_keep 5
			roll_keep_for 720h
		}
		level DEBUG
		format json
	}

	# Complete separate logger just for panics
	log panic {
		output file ./log/panic.log
		format json
		level PANIC
	}

	# Complete separate logger just for errors
	log error {
		output file ./log/error.log
		format json
		level ERROR
	}

but it only leads to one more line on stdout, with severity "debug" and not "panic", so it ends up in debug.log, the first logger. It also doesn't show any useful payload data, as I was hoping for:

{"level":"debug","ts":1731354660.0268118,"logger":"http.stdlib","msg":"http2: panic serving 127.0.0.1:59823: TEST-PANIC: Simulated panic for log testing\ngoroutine 773 [running]:\ngolang.org/x/net/http2.(*serverConn).runHandler.func1()\n\t/Users/christophc/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2361 +0x13c\npanic({0x1046ffae0?, 0x104bc1bd0?})\n\t/opt/homebrew/Cellar/[email protected]/1.21.8/libexec/src/runtime/panic.go:914 +0x218\ngithub.com/caddyserver/caddy/v2/modules/caddyhttp.simulatePanic(0x1)\n\t/Users/christophc/Workspace/common/CaddyLRT/caddy-fork/modules/caddyhttp/vars.go:375 +0x52c\ngithub.com/caddyserver/caddy/v2/modules/caddyhttp.GetVar({0x104bde450, 0x140202288a0}, {0x10400b75b, 0x9})\n\t/Users/christophc/Workspace/common/CaddyLRT/caddy-fork/modules/caddyhttp/vars.go:314 +0x1b4\ngithub.com/caddyserver/caddy/v2/modules/caddyhttp.LoggableHTTPRequest.MarshalLogObject({0x1401c6cb900?, 0x20?}, {0x104bfc718, 0x1402022a380})\n\t/Users/christophc/Workspace/common/CaddyLRT/caddy-fork/modules/caddyhttp/marshalers.go:43 +0xec\ngo.uber.org/zap/zapcore.(*jsonEncoder).AppendObject(0x1402022a380, {0x104bc9d60, 0x1402021f5a0})\n\t/Users/christophc/go/pkg/mod/go.uber.org/[email protected]/zapcore/json_encoder.go:225 +0x24c\ngo.uber.org/zap/zapcore.(*jsonEncoder).AddObject(0x1401fbd91c8?, {0x104004a3d?, 0x1401fbd91f8?}, {0x104bc9d60, 0x1402021f5a0})\n\t/Users/christophc/go/pkg/mod/go.uber.org/[email protected]/zapcore/json_encoder.go:107 +0x3c\ngo.uber.org/zap/zapcore.Field.AddTo({{0x104004a3d, 0x7}, 0x2, 0x0, {0x0, 0x0}, {0x104b53500, 0x1402021f5a0}}, {0x104bfc718, 0x1402022a380?})\n\t/Users/christophc/go/pkg/mod/go.uber.org/[email protected]/zapcore/field.go:121 +0x100\ngo.uber.org/zap/zapcore.addFields(...)\n\t/Users/christophc/go/pkg/mod/go.uber.org/[email protected]/zapcore/field.go:210\ngo.uber.org/zap/zapcore.(*ioCore).With(0x140007a3320, {0x1402022a300, 0x1, 0x1048c8a80?})\n\t/Users/christophc/go/pkg/mod/go.uber.org/[email protected]/zapcore/core.go:83 +0x14c\ngithub.com/caddyserver/caddy/v2.(*filteringCore).With(0x140004b6978, {0x1402022a300?, 0x1401fbd9398?, 0x8a18?})\n\t/Users/christophc/Workspace/common/CaddyLRT/caddy-fork/logging.go:535 +0x34\ngo.uber.org/zap/zapcore.multiCore.With({0x140003c1b80, 0x4, 0x70?}, {0x1402022a300, 0x1, 0x1})\n\t/Users/christophc/go/pkg/mod/go.uber.org/[email protected]/zapcore/tee.go:51 +0x90\ngo.uber.org/zap.(*Logger).With(0x14000342af0, {0x1402022a300, 0x1, 0x1})\n\t/Users/christophc/go/pkg/mod/go.uber.org/[email protected]/logger.go:184 +0xac\ngithub.com/caddyserver/caddy/v2/modules/caddyhttp.(*Server).ServeHTTP(0x14000228dc0, {0x104bd8490, 0x14000399fa0}, 0x14020de9800)\n\t/Users/christophc/Workspace/common/CaddyLRT/caddy-fork/modules/caddyhttp/server.go:309 +0x96c\nnet/http.serverHandler.ServeHTTP({0x0?}, {0x104bd8490?, 0x14000399fa0?}, 0x0?)\n\t/opt/homebrew/Cellar/[email protected]/1.21.8/libexec/src/net/http/server.go:2938 +0xbc\nnet/http.initALPNRequest.ServeHTTP({{0x104bde450?, 0x140233ec270?}, 0x140233b7880?, {0x140009a6690?}}, {0x104bd8490, 0x14000399fa0}, 0x14020de9800)\n\t/opt/homebrew/Cellar/[email protected]/1.21.8/libexec/src/net/http/server.go:3546 +0x1b4\ngolang.org/x/net/http2.(*serverConn).runHandler(0x0?, 0x0?, 0x0?, 0x0?)\n\t/Users/christophc/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2368 +0xc0\ncreated by golang.org/x/net/http2.(*serverConn).scheduleHandler in goroutine 474\n\t/Users/christophc/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2303 +0x208"}

What I suggested/expected was something like what this quick Dump Context helper produces:

Context Dump:
- ReplacerCtxKey: &{[0x102d560a0 0x102d5eef0 0x10349b8c0] map[]}
- VarsCtxKey: map[client_ip:127.0.0.1 start_time:2024-11-11 20:51:00.026389 +0100 CET m=+335.252142043 trusted_proxy:false uuid:8dfd63df-0ef9-43da-939c-3e1822e22311]
- OriginalRequestCtxKey: {GET /prompts?page=1&pageSize=4&topic=All&activity=All&sortBy=1&search=PANICNOW2&promptList=1  0 0 map[] <nil> <nil> 0 [] false  map[] map[] <nil> map[] 127.0.0.1:59823 /prompts?page=1&pageSize=4&topic=All&activity=All&sortBy=1&search=PANICNOW2&promptList=1 <nil> <nil> <nil> <nil>}
- ExtraLogFieldsCtxKey: &{[]}
- route_group: map[]
- simulatePanic: true
TEST-PANIC: Simulated panic for log testing

(maybe minus the ASCII art for TEST/QA/PROD :-)
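A minimal sketch of what such a helper could look like (hypothetical code, not actual Caddy internals; I believe the context keys are the exported ones from the caddy and caddyhttp packages):

// Hypothetical dumpContext helper (sketch only): prints the
// request-scoped values Caddy stores on the context, so a deferred
// recover() could show what was being processed when a handler panicked.
package panicdump

import (
	"fmt"
	"net/http"

	"github.com/caddyserver/caddy/v2"
	"github.com/caddyserver/caddy/v2/modules/caddyhttp"
)

func DumpContext(r *http.Request) {
	ctx := r.Context()
	fmt.Println("Context Dump:")
	fmt.Printf("- ReplacerCtxKey: %v\n", ctx.Value(caddy.ReplacerCtxKey))
	fmt.Printf("- VarsCtxKey: %v\n", ctx.Value(caddyhttp.VarsCtxKey))
	fmt.Printf("- OriginalRequestCtxKey: %v\n", ctx.Value(caddyhttp.OriginalRequestCtxKey))
	fmt.Printf("- ExtraLogFieldsCtxKey: %v\n", ctx.Value(caddyhttp.ExtraLogFieldsCtxKey))
}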

This at least leads me to the assumption that we may have had this specific issue, or a wider one, for a longer time already, and that it only surfaced due to the code changes across Caddy and the used modules from 2.7.6 in January to 2.8.4 now.

@mholt
Member

mholt commented Nov 11, 2024

As you suspect, the race detector will only detect races, not panics. Panics, even in ordinary builds, already print a stack trace. (Although, I can't remember whether we strip so many debugging symbols that production stack traces become unhelpful; I think we don't, though, because the traces we've received in bug reports have all been helpful.) All we need to pinpoint most panics is a stack trace.

Races need two bits of information; the second one only comes with the race detector enabled.

Also, the crash reporting a "concurrent map read/write" is not a panic; it's a hard-coded exit in the Go runtime, so we can't recover from it.
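To illustrate with a standalone toy program (my illustration, not Caddy code): the deferred recover() below never runs, because the runtime throws a fatal error instead of panicking.

// Toy demonstration that a concurrent map read/write is unrecoverable:
// the Go runtime throws (fatal error), it does not panic, so the
// deferred recover() never executes and the process exits.
package main

import "fmt"

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // never reached
		}
	}()
	m := make(map[int]int)
	go func() {
		for i := 0; ; i++ {
			m[i] = i // concurrent writer
		}
	}()
	for {
		_ = m[0] // concurrent reader -> fatal error: concurrent map read and map write
	}
}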

So, if you see that the race detector caught a data race (you'll know -- it'll say "DATA RACE" in all caps), then put that info into your linked issue. (There I also ask for caddy build-info and caddy list-modules output and the config; all of that will be necessary for us to help.)
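For reference, a race detector report looks roughly like this (illustrative paths and addresses, not from this issue) -- note that both accesses are shown, which is exactly the second bit of information a plain crash lacks:

==================
WARNING: DATA RACE
Write at 0x00c00011c018 by goroutine 7:
  main.main.func1()
      /tmp/example/main.go:12 +0x44

Previous read at 0x00c00011c018 by main goroutine:
  main.main()
      /tmp/example/main.go:16 +0x88
==================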
