Watchdog timer (WDT) support #8284

Closed
Tracked by #8481
dsseng opened this issue Feb 8, 2024 · 11 comments · Fixed by #8313

Comments

@dsseng
Member

dsseng commented Feb 8, 2024

Feature Request

Support hardware/firmware watchdog timers. This could increase resilience and help meet requirements that some users place on their systems.

Description

Just as systemd does, the Talos init process should be able to arm and feed the watchdog if one is detected and configured by the user. If the system hangs (for example, due to a driver crashing the kernel or hardware instability), the WDT trips after not being fed for the timeout period and resets the system, recovering it from the freeze. As hardware or firmware might be quirky, I wouldn't risk enabling it in the default config.

I'd like to try implementing this feature myself if it's approved.
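
To make the arm/feed cycle described above concrete, here is a minimal Go sketch. It relies on the Linux watchdog character-device convention: opening the device arms the timer, any write counts as a keepalive, and writing the character V just before closing (the "magic close") asks the driver to stop the timer, provided the driver supports it and the kernel was not built with nowayout. The helper names feed and disarm, the fakeDevice type, and the /dev/watchdog0 path in the comment are illustrative assumptions, not Talos code.

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// feed resets the hardware countdown. The Linux watchdog API treats any
// write to the device as a keepalive.
func feed(dev io.Writer) error {
	_, err := dev.Write([]byte{0})
	return err
}

// disarm performs the "magic close": writing 'V' right before closing asks
// the driver to stop the timer instead of letting it expire.
func disarm(dev io.WriteCloser) error {
	_, werr := dev.Write([]byte{'V'})
	cerr := dev.Close()
	if werr != nil {
		return werr
	}
	return cerr
}

// fakeDevice is a hypothetical in-memory stand-in that records writes, so
// the sketch runs without watchdog hardware.
type fakeDevice struct {
	written []byte
	closed  bool
}

func (f *fakeDevice) Write(p []byte) (int, error) {
	f.written = append(f.written, p...)
	return len(p), nil
}

func (f *fakeDevice) Close() error {
	f.closed = true
	return nil
}

func main() {
	// On real hardware this would be something like
	//   os.OpenFile("/dev/watchdog0", os.O_WRONLY, 0)
	// and opening the device is what arms the timer.
	dev := &fakeDevice{}

	// Assumption: feed at a fraction of the timeout. The interval here is
	// shortened so the demo finishes quickly.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 3; i++ {
		<-ticker.C
		if err := feed(dev); err != nil {
			fmt.Println("feed failed:", err)
			return
		}
	}

	if err := disarm(dev); err != nil {
		fmt.Println("disarm failed:", err)
		return
	}
	fmt.Printf("wrote %d bytes, closed=%v\n", len(dev.written), dev.closed)
}
```

If the feeding goroutine stops running (because the whole system is frozen), the writes stop and the hardware resets the machine once the timeout elapses.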

@smira
Member

smira commented Feb 8, 2024

That'd be nice!

@smira
Member

smira commented Feb 8, 2024

If you have some high-level overview of API changes required (e.g. new configuration, kernel args, etc.), it might be nice to bootstrap the discussion.

@dsseng
Member Author

dsseng commented Feb 8, 2024

Perhaps just an option in machine or machine.install for a timeout, measured in seconds. systemd has multiple timeout options, but I'm unsure how often they are used and whether they'd be more useful than confusing.
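
As a rough illustration of the single-option idea, the following Go sketch models a config section with one timeout in seconds, where the zero value leaves the watchdog disabled (matching the earlier point about not enabling it by default). The WatchdogConfig type and its field and method names are hypothetical and are not taken from the Talos machine config schema.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// WatchdogConfig is a hypothetical config section; the names are
// illustrative only.
type WatchdogConfig struct {
	// TimeoutSeconds is how long the hardware waits without a feed before
	// resetting the machine. Zero leaves the watchdog disabled.
	TimeoutSeconds int
}

// Enabled reports whether the user asked for the watchdog to be armed.
func (c WatchdogConfig) Enabled() bool {
	return c.TimeoutSeconds > 0
}

// Validate rejects values that cannot be a timeout.
func (c WatchdogConfig) Validate() error {
	if c.TimeoutSeconds < 0 {
		return errors.New("watchdog timeout must not be negative")
	}
	return nil
}

// Timeout converts the configured seconds into a time.Duration.
func (c WatchdogConfig) Timeout() time.Duration {
	return time.Duration(c.TimeoutSeconds) * time.Second
}

func main() {
	configs := []WatchdogConfig{
		{},                   // default: disabled
		{TimeoutSeconds: 60}, // enabled with a one-minute timeout
		{TimeoutSeconds: -5}, // invalid
	}

	for _, cfg := range configs {
		fmt.Printf("%+v enabled=%v timeout=%s err=%v\n",
			cfg, cfg.Enabled(), cfg.Timeout(), cfg.Validate())
	}
}
```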

@dsseng
Member Author

dsseng commented Feb 11, 2024

What would be the right place in the internal/app/machined code to enable the watchdog and start a goroutine + ticker to feed it?

Perhaps a controller, yes?

@smira
Member

smira commented Feb 12, 2024

The controller is the best path, as it would consume the machine config once it's available, and control the timer based on the config (which might change on the fly).
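
The shape of such a controller can be sketched as a loop that reacts to config updates and feeds the device while it is armed. This is only a simplified model under stated assumptions: it uses a plain Go channel instead of the actual Talos controller runtime, and the desired, device, recorder, and run names are hypothetical. Feeding at half the timeout is likewise an assumed safety margin.

```go
package main

import (
	"fmt"
	"time"
)

// desired is a hypothetical snapshot of the watchdog settings taken from
// the machine config. TimeoutSeconds == 0 means the watchdog should be off.
type desired struct {
	TimeoutSeconds int
}

// device abstracts the watchdog so the loop can run without hardware.
type device interface {
	Arm(timeoutSeconds int) error
	Feed() error
	Disarm() error
}

// run applies each config update to the device and feeds it while armed.
// It returns when updates is closed, disarming first if needed.
func run(updates <-chan desired, dev device) error {
	var (
		armed  bool
		ticker *time.Ticker
		ticks  <-chan time.Time // nil while disarmed: a nil channel never fires
	)

	stopTicker := func() {
		if ticker != nil {
			ticker.Stop()
		}
		ticker, ticks = nil, nil
	}
	defer stopTicker()

	for {
		select {
		case cfg, ok := <-updates:
			stopTicker()

			if !ok || cfg.TimeoutSeconds <= 0 {
				if armed {
					if err := dev.Disarm(); err != nil {
						return err
					}
					armed = false
				}
				if !ok {
					return nil
				}
				continue
			}

			// A changed timeout is applied by arming again with the new value.
			if err := dev.Arm(cfg.TimeoutSeconds); err != nil {
				return err
			}
			armed = true

			// Assumption: feed at half the timeout to leave a safety margin.
			ticker = time.NewTicker(time.Duration(cfg.TimeoutSeconds) * time.Second / 2)
			ticks = ticker.C

		case <-ticks:
			if err := dev.Feed(); err != nil {
				return err
			}
		}
	}
}

// recorder is a fake device that logs the calls it receives.
type recorder struct{ events []string }

func (r *recorder) Arm(s int) error {
	r.events = append(r.events, fmt.Sprintf("arm %ds", s))
	return nil
}
func (r *recorder) Feed() error   { r.events = append(r.events, "feed"); return nil }
func (r *recorder) Disarm() error { r.events = append(r.events, "disarm"); return nil }

func main() {
	// Scripted config changes: enable, change the timeout, then disable.
	updates := make(chan desired, 3)
	updates <- desired{TimeoutSeconds: 60}
	updates <- desired{TimeoutSeconds: 120}
	updates <- desired{}
	close(updates)

	rec := &recorder{}
	if err := run(updates, rec); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rec.events)
}
```

Keeping the ticker channel nil while the watchdog is disarmed means the feed branch of the select can never fire unless the config has asked for it.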

@dsseng
Member Author

dsseng commented Feb 12, 2024

I did it, but I'm unable to rebuild the kernel to include WDT drivers such as i6300esb, which I enabled in QEMU. I get this weird error from make kernel-menuconfig (and the same from make kernel if I add the necessary configs manually), and the only solution googling turned up was "we stopped using BuildKit":

Pkgfile:1
--------------------
   1 | >>> # syntax = ghcr.io/siderolabs/bldr:v0.2.3
   2 |     
   3 |     format: v1alpha2
--------------------
ERROR: failed to solve: failed to solve LLB: requested experimental feature mergeop  has been disabled on the build server: only enabled with containerd image store backend

Using rootful Docker 24.0.7 from the openSUSE packages:

Client:
 Version:    24.0.7-ce
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  0.12.1
    Path:     /usr/lib/docker/cli-plugins/docker-buildx

Server:
 Containers: 6
  Running: 6
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 24.0.7-ce
 Storage Driver: overlay2
  Backing Filesystem: tmpfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 oci
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 4e1fe7492b9df85914c389d1f15a3ceedbb280ac
 runc version: v1.1.12-0-g51d5e94601ce
 init version: 
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.7.4-1-default
 Operating System: openSUSE Tumbleweed
 OSType: linux
 Architecture: x86_64
 CPUs: 20
 Total Memory: 62.57GiB
 Name: REDACTED
 ID: REDACTED
 Docker Root Dir: /tmp/docker-storage
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

There's no problem building things in the Talos repo, however. If you don't know a quick fix, could you please just enable the needed watchdog drivers and push that image?

@smira
Member

smira commented Feb 12, 2024

You need to use a modern BuildKit runner; you can look here, and also at make help in the pkgs repo.

@dsseng
Member Author

dsseng commented Feb 12, 2024

That fixed it. I'm not sure why that command didn't persist from the Talos build (perhaps an env var or something). Thank you!

@dsseng
Member Author

dsseng commented Feb 12, 2024

For some reason I can't get modules installed using these commands:
pkgs > make kernel REGISTRY=127.0.0.1:5005 PUSH=true PLATFORM=linux/amd64
talos > make kernel initramfs PKG_KERNEL=127.0.0.1:5005/siderolabs/kernel:v1.7.0-alpha.0-22-g0ec4cc3-dirty (rebuilt installer after that as well)
talos > same qemu run command as previously, with the built installer tag

Maybe modules are handled in a similar fashion to other packages from the pkgs repo?

I'll build i6300esb into the kernel for now to test.

@smira
Member

smira commented Feb 12, 2024

Hard to say; it should work.

@dsseng
Member Author

dsseng commented Feb 12, 2024

I'm also failing to update the kernel: the image has the same tag (v1.7.0-alpha.0-22-g0ec4cc3-dirty) because it was rebuilt within the same uncommitted changeset, so BuildKit uses the cached old one. Maybe it isn't even updated in the container registry. Anyway, it should work after I reinitialize those things.

dsseng added a commit to dsseng/talos that referenced this issue Feb 13, 2024
Only enabled when activated by config, disabled on shutdown/reboot

Fixes siderolabs#8284

Signed-off-by: Dmitry Sharshakov <[email protected]>
dsseng added a commit to dsseng/talos that referenced this issue Mar 7, 2024
Only enabled when activated by config, disabled on shutdown/reboot

Fixes siderolabs#8284

Signed-off-by: Dmitry Sharshakov <[email protected]>
dsseng added a commit to dsseng/talos that referenced this issue Mar 8, 2024
Only enabled when activated by config, disabled on shutdown/reboot

Fixes siderolabs#8284

Signed-off-by: Dmitry Sharshakov <[email protected]>
dsseng added a commit to dsseng/talos that referenced this issue Mar 19, 2024
Only enabled when activated by config, disabled on shutdown/reboot

Fixes siderolabs#8284

Signed-off-by: Dmitry Sharshakov <[email protected]>
Signed-off-by: Dmitry Sharshakov <[email protected]>
smira pushed a commit to dsseng/talos that referenced this issue Mar 21, 2024
Only enabled when activated by config, disabled on shutdown/reboot

Fixes siderolabs#8284

Signed-off-by: Dmitry Sharshakov <[email protected]>
Signed-off-by: Dmitry Sharshakov <[email protected]>
smira pushed a commit to dsseng/talos that referenced this issue Mar 21, 2024
Only enabled when activated by config, disabled on shutdown/reboot

Fixes siderolabs#8284

Signed-off-by: Dmitry Sharshakov <[email protected]>
Signed-off-by: Dmitry Sharshakov <[email protected]>
Signed-off-by: Andrey Smirnov <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 5, 2024