Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

newest beta 3185.1.0 does not boot #710

Closed
goochjj opened this issue Apr 7, 2022 · 23 comments · Fixed by flatcar-archive/coreos-overlay#1812
Closed

newest beta 3185.1.0 does not boot #710

goochjj opened this issue Apr 7, 2022 · 23 comments · Fixed by flatcar-archive/coreos-overlay#1812
Labels
channel/beta Issue concerns the Beta channel. kind/bug Something isn't working

Comments

@goochjj
Copy link

goochjj commented Apr 7, 2022

Description

Dependency failed for ensure-sysext.service (over and over and over and over again until I reboot)

Impact

Machines do not reboot cleanly after the update is applied. I have to reboot the VM manually (which boots on the old OS image)

Environment and steps to reproduce

  1. Have a machine running beta or stable
  2. Set group to beta
  3. Apply update
  4. reboot
  5. fail

Expected behavior

Doesn't fail.

@goochjj goochjj added the kind/bug Something isn't working label Apr 7, 2022
@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

Probably related to flatcar/init#65

@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

masking the service before rebooting solves the problem; however, knowing you HAVE to do that (across all my VMs) is a whole other story.

@tormath1
Copy link
Contributor

tormath1 commented Apr 7, 2022

Hi @goochjj,

First thanks a lot for running beta nodes - do you have any log / information to provide for this:

Dependency failed for ensure-sysext.service

@tormath1 tormath1 added the channel/beta Issue concerns the Beta channel. label Apr 7, 2022
@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

Since my system is complete non-functional when I reboot, I have nothing.

Now that I have the unit masked I can try to collect some logs

@tormath1
Copy link
Contributor

tormath1 commented Apr 7, 2022

Awesome, thanks. In the meantime, we are going to try to reproduce the issue - just to confirm: you were on a Stable and you switched to Beta ?

@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

Beta to Beta and Stable to Beta both do it

@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

It may be part of some of my customizations, i.e.:

systemctl cat ensure-sysext.service

/usr/lib/systemd/system/ensure-sysext.service

[Unit]
BindsTo=systemd-sysext.service
After=systemd-sysext.service
DefaultDependencies=no
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/systemctl daemon-reload
ExecStart=/usr/bin/systemctl restart --no-block sockets.target timers.target multi-user.target
[Install]
WantedBy=sysinit.target

/etc/systemd/system/service.d/email-failure.conf

[Unit]
OnFailure=failuremgmt-failure@%n.service

systemctl start ensure-sysext

A dependency job for ensure-sysext.service failed. See 'journalctl -xe' for details.

Apr 07 15:31:33 NetFlyTrap.redacted bashcmd(root|0)[2310]: /root# systemctl start ensure-sysext
Apr 07 15:31:33 NetFlyTrap.redacted systemd[1]: Merge System Extension Images into /usr/ and /opt/ was skipped because all trigger condition checks failed.
░░ Subject: A start job for unit systemd-sysext.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://groups.google.com/forum/#!forum/flatcar-linux-user
░░
░░ A start job for unit systemd-sysext.service has finished successfully.
░░
░░ The job identifier is 1006.
Apr 07 15:31:33 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Bound to unit systemd-sysext.service, but unit isn't active.
Apr 07 15:31:33 NetFlyTrap.redacted systemd[1]: Dependency failed for ensure-sysext.service.
░░ Subject: A start job for unit ensure-sysext.service has failed
░░ Defined-By: systemd
░░ Support: https://groups.google.com/forum/#!forum/flatcar-linux-user
░░
░░ A start job for unit ensure-sysext.service has finished with a failure.
░░
░░ The job identifier is 1005 and the job result is dependency.
Apr 07 15:31:33 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Job ensure-sysext.service/start failed with result 'dependency'.
Apr 07 15:31:33 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Triggering OnFailure= dependencies.
Apr 07 15:31:33 NetFlyTrap.redacted systemd[1]: Started Unit Status Failure Management.
░░ Subject: A start job for unit [email protected] has finished successfully
░░ Defined-By: systemd
░░ Support: https://groups.google.com/forum/#!forum/flatcar-linux-user
░░
░░ A start job for unit [email protected] has finished successfully.
░░
░░ The job identifier is 1010.
Apr 07 15:31:33 NetFlyTrap.redacted msmtp[2480]: host=smtpout.redacted tls=off auth=off from=root recipients=mrwizard@redacted mailsize=1434 smtpstatus=250 smtpmsg='250 2.0.0 Ok: queued as 34A708CAE' e>
Apr 07 15:31:33 NetFlyTrap.redacted systemd[1]: [email protected]: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://groups.google.com/forum/#!forum/flatcar-linux-user
░░
░░ The unit [email protected] has successfully entered the 'dead' state.
Apr 07 15:31:34 NetFlyTrap.redacted launcher[891]: {"caller":"request_queries.go:159","err":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: de>
Apr 07 15:31:43 NetFlyTrap.redacted bashcmd(root|0)[2310]: /root# journalctl -xe

@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

Removing my onfailure entry doesn't help

@tormath1
Copy link
Contributor

tormath1 commented Apr 7, 2022

Ok, thanks for the logs.

I tried to reproduce locally on QEMU environment from beta-3139.1.1 -> beta-3185.1.0 and it boots fine. However, I do see the dependency issue but it should not prevent the instance to boot.

What does systemctl status [email protected] say ?

@tormath1
Copy link
Contributor

tormath1 commented Apr 7, 2022

I've been able to reproduce with the following ignition:

{
  "ignition": {
    "version": "3.3.0"
  },
  "passwd": {
    "users": [
      {
        "name": "core",
        "sshAuthorizedKeys": [
          "ssh-rsa ..."
        ]
      }
    ]
  },
  "systemd": {
    "units": [
      {
        "dropins": [
          {
            "contents": "[Unit]\nOnFailure=echo@%n.service\n",
            "name": "failure.conf"
          }
        ],
        "name": "ensure-sysext.service"
      },
      {
        "contents": "[Unit]\nDescription=Show error\n[Service]\nType=fork\nExecStart=/usr/bin/echo error\n[Install]\nWantedBy=multi-user.target\n",
        "name": "[email protected]"
      }
    ]
  }
}

I think the OnFailure makes the unit seen as "failed".

@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

failuremgmt basically does this
[Unit]
Description=Unit Status Failure Management
After=network.target

[Service]
Type=simple
ExecStart=/opt/bin/failuremgmt-unit-failed %i

Which just emails me :-D

@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

Status looks ok
[email protected] - Unit Status Failure Management
Loaded: loaded (/etc/systemd/system/[email protected]; static)
Drop-In: /etc/systemd/system/[email protected]
└─email-failure.conf
Active: inactive (dead) since Thu 2022-04-07 16:28:34 EDT; 3min 3s ago
Process: 2827 ExecStart=/opt/bin/failuremgmt-unit-failed ensure-sysext.service (code=exited, status=0/SUCCESS)
Main PID: 2827 (code=exited, status=0/SUCCESS)
CPU: 26ms

Apr 07 16:28:33 NetFlyTrap.redacted systemd[1]: Started Unit Status Failure Management.
Apr 07 16:28:34 NetFlyTrap.redacted msmtp[2834]: host=smtpout.redacted tls=off auth=off from=root recipients=mrwizard@redacted mailsize=1681 smtpstatus=250 smtpmsg='250 2.0.0 Ok: queued as F391A90C9' exitcode=EX_OK
Apr 07 16:28:34 NetFlyTrap.redacted systemd[1]: [email protected]: Deactivated successfully.

@goochjj
Copy link
Author

goochjj commented Apr 7, 2022

The part I can't really test is a dependency on sysinit - to get it to boot I have to mask the ensure-sysext service, which I think is probably invalidating some of my testing, because it's entirely possible something just isn't there yet when it's running in realtime.

And since I've now definitely booted this VM into the new OS, it'd be hard to back out I suppose.

I can see about snapshotting this VM so I can recover quicker.

@goochjj
Copy link
Author

goochjj commented Apr 8, 2022

More logs mined from journalctl

Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Bound to unit systemd-sysext.service, but unit isn't active.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Triggering OnFailure= dependencies.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Bound to unit systemd-sysext.service, but unit isn't active.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Triggering OnFailure= dependencies.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Platform Persistent Storage Archival was skipped because of a failed condition check (ConditionDirectoryNotEmpty=/sys/fs/pstore).
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Repartition Root Disk was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Merge System Extension Images into /usr/ and /opt/ was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Bound to unit systemd-sysext.service, but unit isn't active.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Dependency failed for ensure-sysext.service.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Job ensure-sysext.service/start failed with result 'dependency'.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Triggering OnFailure= dependencies.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: /proc/xen was skipped because of a failed condition check (ConditionVirtualization=xen).
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Set Up Additional Binary Formats was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Load Kernel Modules was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Platform Persistent Storage Archival was skipped because of a failed condition check (ConditionDirectoryNotEmpty=/sys/fs/pstore).
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Repartition Root Disk was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Merge System Extension Images into /usr/ and /opt/ was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Bound to unit systemd-sysext.service, but unit isn't active.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Dependency failed for ensure-sysext.service.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Job ensure-sysext.service/start failed with result 'dependency'.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Triggering OnFailure= dependencies.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: /proc/xen was skipped because of a failed condition check (ConditionVirtualization=xen).
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Set Up Additional Binary Formats was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Load Kernel Modules was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Platform Persistent Storage Archival was skipped because of a failed condition check (ConditionDirectoryNotEmpty=/sys/fs/pstore).
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Repartition Root Disk was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Merge System Extension Images into /usr/ and /opt/ was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Bound to unit systemd-sysext.service, but unit isn't active.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Dependency failed for ensure-sysext.service.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Job ensure-sysext.service/start failed with result 'dependency'.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Triggering OnFailure= dependencies.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: /proc/xen was skipped because of a failed condition check (ConditionVirtualization=xen).
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Set Up Additional Binary Formats was skipped because all trigger condition checks failed.
Apr 07 15:04:51 NetFlyTrap.redacted systemd[1]: Load Kernel Modules was skipped because all trigger condition checks failed.

@goochjj
Copy link
Author

goochjj commented Apr 8, 2022

Confirmed working without my OnFailure=

Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Mounted Mount sdb to /var/lib/docker.
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Reached target Local File Systems.
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Rebuild Dynamic Linker Cache was skipped because all trigger condition checks failed.
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Set Up Additional Binary Formats was skipped because all trigger condition checks failed.
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Starting Automatic Boot Loader Update...
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Merge System Extension Images into /usr/ and /opt/ was skipped because all trigger condition checks failed.
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Bound to unit systemd-sysext.service, but unit isn't active.
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Dependency failed for ensure-sysext.service.
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: ensure-sysext.service: Job ensure-sysext.service/start failed with result 'dependency'.
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Starting Create Volatile Files and Directories...
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: boot.automount: Got automount request for /boot, triggered by 846 (bootctl)
Apr 07 20:50:30 NetFlyTrap.redacted systemd[1]: Starting File System Check on /dev/disk/by-label/EFI-SYSTEM...
Apr 07 20:50:30 NetFlyTrap.redacted systemd-tmpfiles[847]: /usr/lib/tmpfiles.d/legacy.conf:13: Duplicate line for path "/run/lock", ignoring.
Apr 07 20:50:30 NetFlyTrap.redacted systemd-tmpfiles[847]: /usr/lib/tmpfiles.d/systemd.conf:33: Duplicate line for path "/var/lib/systemd", ignoring.
Apr 07 20:50:30 NetFlyTrap.redacted systemd-fsck[850]: fsck.fat 4.2 (2021-01-31)
Apr 07 20:50:30 NetFlyTrap.redacted systemd-fsck[850]: /dev/sda1: 791 files, 225482/258078 clusters

I will look into other options, unless you have some suggestions. Really I have the email notices in there to detect boot-up issues, but I'd love for that dropin to only apply after sysinit... or at a specific point in the boot sequence. I'll have to dig further and come up with a solution.

For that matter if sysext is firing, /usr and /opt may not be completely consistent yet - so ... it's really not appropriate for my drop-in to be using them that early in the boot. I'm not sure that occurred to me until right now, it's just always worked.

Thanks!

@goochjj
Copy link
Author

goochjj commented Apr 8, 2022

I've made a unit that fires after network-online.target that dumps my service.d/ dropin into /run, instead of having it in etc. This seems to work quite nicely, actually, and eliminates the problem.

I'll leave it to you whether you want to investigate further. :-D

@pothos
Copy link
Member

pothos commented Apr 12, 2022

Seems we need to duplicate the conditions from systemd-sysext.service into this unit:

ConditionDirectoryNotEmpty=|/etc/extensions
ConditionDirectoryNotEmpty=|/run/extensions
ConditionDirectoryNotEmpty=|/var/lib/extensions
ConditionDirectoryNotEmpty=|/usr/local/lib/extensions
ConditionDirectoryNotEmpty=|/usr/lib/extensions

pothos added a commit to flatcar/init that referenced this issue Apr 12, 2022
The unit failed to execute when the dependency systemd-sysext.service
was skipped because it's not needed. Normally this is harmless but it
still triggers a OnFailure event as reported in
flatcar/Flatcar#710

Use the same logic for skipping execution as done in
systemd-sysext.service - we can't use "systemctl is-enabled" because
Flatcar's inbuilt unit symlinks are not reported as "enabled".
pothos added a commit to flatcar-archive/coreos-overlay that referenced this issue Apr 12, 2022
This pulls in flatcar/init#68 to skip
the ensure-sysext unit when systemd-sysext is skipped to prevent a
dependency failure being reported.

Closes: flatcar/Flatcar#710
@jepio
Copy link
Member

jepio commented Apr 12, 2022

This will be a stupid questions but could the ensure-sysext.service commands be added via drop-in to systemd-systext.service as ExecStartPost?

@pothos
Copy link
Member

pothos commented Apr 12, 2022

This will be a stupid questions but could the ensure-sysext.service commands be added via drop-in to systemd-systext.service as ExecStartPost?

Seems also to be a good option, I've tested it and it works well even if other ExecStartPost entries are present.

@pothos
Copy link
Member

pothos commented Apr 12, 2022

(Not sure how a user would disable a drop-in, maybe with an empty one of the same name)

@jepio
Copy link
Member

jepio commented Apr 12, 2022

Oh, if disabling is a concern then a separate unit is probably more sane.

@pothos
Copy link
Member

pothos commented Apr 12, 2022

Not really the main concern, the main concern was to make clear that this is something custom to Flatcar. The ExecStartPost action is nice because it's something we could try to upstream but I'm not really sure it gets accepted… I'll take a note.

pothos added a commit to flatcar-archive/coreos-overlay that referenced this issue Apr 13, 2022
This pulls in flatcar/init#68 to skip
the ensure-sysext unit when systemd-sysext is skipped to prevent a
dependency failure being reported.

Closes: flatcar/Flatcar#710
pothos added a commit to flatcar-archive/coreos-overlay that referenced this issue Apr 13, 2022
This pulls in flatcar/init#68 to skip
the ensure-sysext unit when systemd-sysext is skipped to prevent a
dependency failure being reported.

Closes: flatcar/Flatcar#710
pothos added a commit to flatcar-archive/coreos-overlay that referenced this issue Apr 13, 2022
This pulls in flatcar/init#68 to skip
the ensure-sysext unit when systemd-sysext is skipped to prevent a
dependency failure being reported.

Closes: flatcar/Flatcar#710
@pothos
Copy link
Member

pothos commented Apr 13, 2022

Will be resolved in the next bug fix releases for Alpha/Beta

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
channel/beta Issue concerns the Beta channel. kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants