rawhide: ext.config.var-mount.scsi-id fails #1670

Closed
c4rt0 opened this issue Feb 14, 2024 · 15 comments
@c4rt0
Member

c4rt0 commented Feb 14, 2024

Kola test ext.config.var-mount.scsi-id is failing in Jenkins since Feb-13-2024 with the following error:

[2024-02-14T13:09:05.272Z] --- FAIL: ext.config.var-mount.scsi-id (56.26s)
[2024-02-14T13:09:05.272Z]         harness.go:1820: mach.Start() failed: machine 82d6a3cd-5a42-4c15-9b80-3f2439561b35 entered emergency.target in initramfs

Full console log of the Jenkins job can be found here.

When looking at the console.txt of the failed test we see:

[    5.964318] ignition[977]: Stage: disks
...
[    5.974301] ignition[977]: disks: createPartitions: op(1): [started]  waiting for devices [/dev/disk/by-id/scsi-0NVME_VirtualMultipath_disk1]
[    5.979497] systemd[1]: Expecting device dev-disk-by\x2did-scsi\x2d0NVME_VirtualMultipath_disk1.device - /dev/disk/by-id/scsi-0NVME_VirtualMultipath_disk1...

and further:

[51.057963] systemd[1]: dev-disk-by\x2did-scsi\x2d0NVME_VirtualMultipath_disk1.device: Job dev-disk-by\x2did-scsi\x2d0NVME_VirtualMultipath_disk1.device/start timed out.
[51.062464] ignition disks: createPartitions: op(1): [failed]   waiting for devices [/dev/disk/by-id/scsi-0NVME_VirtualMultipath_disk1]: device unit dev-disk-by\x2did-scsi\x2d0NVME_VirtualMultipath_disk1.device timeout
Failed to start ignition-disks.service Ignition (disks).

See 'systemctl status ignition-disks.service' for details.

Dependency failed for ignition-complete.target - Ignition Complete.

Dependency failed for initrd.target - Initrd Default Target.

[   51.072377] systemd[1]: Timed out waiting for device dev-disk-by\x2did-scsi\x2d0NVME_VirtualMultipath_disk1.device - /dev/disk/by-id/scsi-0NVME_VirtualMultipath_disk1.
[   51.076315] ignition[977]: disks failed
[   51.237327] ignition[977]: Ignition failed: create partitions failed: failed to wait on disks devs: device unit dev-disk-by\x2did-scsi\x2d0NVME_VirtualMultipath_disk1.device timeout
[   51.239411] systemd[1]: dev-disk-by\x2did-scsi\x2d0NVME_VirtualMultipath_disk1.device: Job dev-disk-by\x2did-scsi\x2d0NVME_VirtualMultipath_disk1.device/start failed with result 'timeout'.
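
The escaped unit name in the log above follows systemd's path-escaping rules. As a minimal sketch (an approximation of what `systemd-escape --path` does, not systemd's own code; the function name `systemd_escape_path` is made up for illustration), the mapping from the device path to the `.device` unit name looks roughly like this:

```python
def systemd_escape_path(path: str) -> str:
    """Approximate systemd's path escaping for an absolute path.

    Assumption: the path has no duplicate or relative components ("//", "..").
    """
    trimmed = path.strip("/")
    if not trimmed:
        return "-"  # the root directory maps to a single dash
    out = []
    for i, byte in enumerate(trimmed.encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")  # path separators become dashes
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)  # characters that are kept as-is
        else:
            out.append(f"\\x{byte:02x}")  # everything else, including "-", is hex-escaped
    return "".join(out)


if __name__ == "__main__":
    device = "/dev/disk/by-id/scsi-0NVME_VirtualMultipath_disk1"
    print(systemd_escape_path(device) + ".device")
```

This is why the literal dashes in `by-id` and `scsi-0NVME...` show up as `\x2d` in the unit name that Ignition waits on.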
@dustymabe
Member

Can you share any relevant bits from the test logs? For example, what is in the console.txt for that test?

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Feb 15, 2024
This started failing in rawhide and we haven't had time to fully
investigate yet.

coreos/fedora-coreos-tracker#1670
@dustymabe
Member

Added a denial/snooze for now while you investigate, @c4rt0: coreos/fedora-coreos-config#2854

dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Feb 15, 2024
This started failing in rawhide and we haven't had time to fully
investigate yet.

coreos/fedora-coreos-tracker#1670
@c4rt0
Member Author

c4rt0 commented Feb 15, 2024

Thanks for the denial, @dustymabe.
Here's the console.txt from this test.

@c4rt0 c4rt0 self-assigned this Feb 19, 2024
@c4rt0 c4rt0 added the jira for syncing to jira label Feb 26, 2024
marmijo added a commit to marmijo/fedora-coreos-config that referenced this issue Mar 1, 2024
This test is still failing. Let's extend the snooze while we
continue to investigate coreos/fedora-coreos-tracker#1670
marmijo added a commit to coreos/fedora-coreos-config that referenced this issue Mar 1, 2024
This test is still failing. Let's extend the snooze while we
continue to investigate coreos/fedora-coreos-tracker#1670
gursewak1997 added a commit to gursewak1997/fedora-coreos-config that referenced this issue Mar 18, 2024
This test is still failing. Let's extend the snooze while we
continue to investigate coreos/fedora-coreos-tracker#1670
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue Mar 18, 2024
This test is still failing. Let's extend the snooze while we
continue to investigate coreos/fedora-coreos-tracker#1670
aaradhak pushed a commit to aaradhak/fedora-coreos-config that referenced this issue Mar 18, 2024
This started failing in rawhide and we haven't had time to fully
investigate yet.

coreos/fedora-coreos-tracker#1670
aaradhak pushed a commit to aaradhak/fedora-coreos-config that referenced this issue Mar 18, 2024
This test is still failing. Let's extend the snooze while we
continue to investigate coreos/fedora-coreos-tracker#1670
aaradhak pushed a commit to aaradhak/fedora-coreos-config that referenced this issue Mar 18, 2024
This test is still failing. Let's extend the snooze while we
continue to investigate coreos/fedora-coreos-tracker#1670
@jbtrystram

This comment was marked as outdated.

@jlebon
Member

jlebon commented Mar 22, 2024

Nice find! I would check for changes in udev rules (/usr/lib/udev/rules.d). Probably 60-persistent-storage.rules or one of the *sg3*.rules files.

To know which package a file comes from, you can use rpm -qf /path/to/file.
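
A minimal sketch of that check, assuming a Fedora-like host where `rpm` is installed. The helper names `candidate_rules` and `owning_package` are hypothetical; only `rpm -qf` and the rules directory come from the comment above:

```python
import fnmatch
import shutil
import subprocess
from pathlib import Path

RULES_DIR = Path("/usr/lib/udev/rules.d")
# Patterns for the files suggested above as likely culprits.
PATTERNS = ("60-persistent-storage.rules", "*sg3*.rules")


def candidate_rules(names):
    """Return the rule file names matching the patterns, sorted."""
    return sorted(n for n in names if any(fnmatch.fnmatch(n, p) for p in PATTERNS))


def owning_package(path):
    """Ask rpm which package owns a file; None if rpm is missing or the query fails."""
    if shutil.which("rpm") is None:
        return None
    result = subprocess.run(["rpm", "-qf", str(path)], capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else None


if __name__ == "__main__":
    names = [p.name for p in RULES_DIR.glob("*.rules")] if RULES_DIR.is_dir() else []
    for name in candidate_rules(names):
        print(name, "->", owning_package(RULES_DIR / name))
```

Running this on both a working and a failing build and comparing the reported package versions is one way to narrow down which update changed the rules.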

@jbtrystram
Contributor

jbtrystram commented Mar 22, 2024

Thanks @jlebon for the hint!

The symlink that this test relies on is generated by 63-scsi-sg3_symlink.rules.

Indeed, 63-scsi-sg3_symlink.rules changed in the 1.48 release of sg3_utils.

The new version uses different udev attributes to generate the symlink: https://github.com/doug-gilbert/sg3_utils/blob/2355dc4b451989291df695148cd8d8d03b3d987e/scripts/58-scsi-sg3_symlink.rules#L10

Unfortunately, the new way of creating the symlink uses only TLVS letters (see udevadm info output).

So I think a fix would be to attach a label to the QEMU disk and use that. Example
Or maybe there is a better way to have QEMU attach the info to the device in a way that complies with that rule.

Looking at how cosa starts the VM, the fix is probably there: -device virtio-scsi-pci,id=scsi_mpath10 -device scsi-hd,bus=scsi_mpath10.0,drive=mpath10,vendor=NVME,product=VirtualMultipath,wwn=13645024510734875233,serial=disk1

@jbtrystram
Contributor

Looks like we may be able to fix how cosa generates the qemu args

@dustymabe
Member

I guess the question we have to ask ourselves is if this is behavior we think people are relying on or was it just behavior that was convenient that we were relying on for our tests?

If it's the former then we might need to engage upstream to fix it or at least consider the regression. If it's the latter then we can safely just change our usage and move on.

@jbtrystram
Contributor

I intuitively think people may rely on this, but they have "real" disks that would expose the values in the correct fields, and the issue is more about how QEMU "fakes" the disk.
However, I did a quick check on my PC and I only see E fields (i.e. the same issue would occur), like I pasted above. My system is only SATA and NVMe though; SCSI disks may be different, I am not sure.

@c4rt0

This comment was marked as off-topic.

@c4rt0
Member Author

c4rt0 commented Mar 26, 2024

The next step (after posting the above) was to run rawhide on this server and compare the results. Unfortunately during the cosa build process, my machine became unresponsive to the point that I could no longer produce what was intended (unrelated hardware issue).

@c4rt0
Member Author

c4rt0 commented Mar 27, 2024

> I guess the question we have to ask ourselves is if this is behavior we think people are relying on or was it just behavior that was convenient that we were relying on for our tests?
>
> If it's the former then we might need to engage upstream to fix it or at least consider the regression. If it's the latter then we can safely just change our usage and move on.

How to find out if it's one or the other? What would be the next step here?

Let's say we start implementing the fix on the assumption that it was just convenient for us. Rather quickly someone would express dissatisfaction, right?
I've been chewing on all of the above since yesterday and, just to be transparent, at this stage I am not quite sure how to proceed.

@jlebon
Member

jlebon commented Apr 4, 2024

Mailing list thread for the upstream changes: https://listman.redhat.com/archives/dm-devel/2023-March/053645.html

The upstream changes seem reasonable to me but I would not at all be surprised if there are users/customers making use of the symlinks that are being removed. They would have to migrate to using the WWN-based ones, but if it's widespread enough, we (Fedora, Red Hat) might have to decide to re-enable some of those symlinks (or maybe only during upgrades e.g. via LEAPP). This will land in c10s soon also if it hasn't already and I think upgrade testing will also happen in that context at some point. I don't think there's a need to differ at the CoreOS level; we should just follow the distro.

I think ext.config.var-mount.scsi-id should be improved regardless. What it's testing is valid (verifying that scsi-* symlinks are created in the initramfs), but how it's testing it is an issue. It doesn't need to be a multipath device, I think that was just a convenient way to get a SCSI block device added to test the rules (probably based on a suggestion from me). But in fact, that hits one of the issues that the upstream change warns about (/dev/disk/by-id/scsi-0NVME_VirtualMultipath_disk1 in that test is racy and can point to either /dev/sda or /dev/sdb). I think a better fix there is to support something like 5G:channel=scsi,wwn=... and have kola actually attach the disk as a SCSI device instead of virtio-blk. Then the test would verify that an Ignition config can reference the by-id/scsi-3$WWN symlink (which matches what the reporter was trying to do in https://bugzilla.redhat.com/show_bug.cgi?id=1990506).
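
To make the proposed `5G:channel=scsi,wwn=...` syntax concrete, here is a small sketch of how such a disk spec string could be split into a size and options. This is a hypothetical Python illustration, not kola's actual parser (kola is written in Go), and the choice to keep bare flags with a `None` value is an assumption:

```python
def parse_disk_spec(spec: str):
    """Split '<size>[:opt,key=value,...]' into (size, options).

    Hypothetical helper for illustration only; not kola's implementation.
    """
    size, _, rest = spec.partition(":")
    if not size:
        raise ValueError(f"missing size in disk spec: {spec!r}")
    options = {}
    for item in filter(None, rest.split(",")):
        key, sep, value = item.partition("=")
        if not key:
            raise ValueError(f"empty option name in {item!r}")
        # Bare flags (no '=') are kept with a None value.
        options[key] = value if sep else None
    return size, options


if __name__ == "__main__":
    print(parse_disk_spec("5G:channel=scsi,wwn=007"))
```

With the channel and WWN available as separate options, the test harness can attach the disk as a real SCSI device and give it a stable identifier, rather than relying on the multipath setup.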

@jbtrystram
Contributor

Thanks @jlebon for the really detailed answer!

> Or maybe there is a better way to have QEMU attach the info to the device in a way that complies with that rule

So I was not too far off there :)
I'll cook up something based on your suggestion.

jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this issue Apr 5, 2024
Add a customizable WWN option for kola DiskSpec to have reliable links
under `/dev/disk/by-id/`. With this change kola qemuexec can be run like:
`kola qemuexec -D "5G:channel=scsi,wwn=007"`

Resulting in the following links:
```
[core@localhost ~]$ rpm -qa sg3_utils
sg3_utils-1.48-1.fc40.x86_64
[core@localhost ~]$ ls -l /dev/disk/by-id
total 0
lrwxrwxrwx. 1 root root  9 Apr  5 09:05 scsi-30000000000000007 -> ../../sda
lrwxrwxrwx. 1 root root  9 Apr  5 09:05 wwn-0x0000000000000007 -> ../../sda
```

This is motivated by recent changes in sg3_utils [1] which
removed some udev links.
At least one of our tests [2] relying on this started failing.
This patch was suggested by @jlebon [3]

[1] https://listman.redhat.com/archives/dm-devel/2023-March/053645.html
[2] coreos/fedora-coreos-tracker#1670
[3] coreos/fedora-coreos-tracker#1670 (comment)
jbtrystram added a commit to jbtrystram/fedora-coreos-config that referenced this issue Apr 5, 2024
Update the scsi-id test to set a WWN for the disk, and use reliable
udev symlinks to adjust for a change in sg3_utils [1]

Note that the `wwn` value set is converted to base 16 by QEMU,
so the symlink in the Ignition config must reflect it.
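
As a rough sketch of that conversion: the link format below (16 zero-padded hex digits after `scsi-3` and `wwn-0x`) is inferred from the `ls -l /dev/disk/by-id` output quoted in the coreos-assembler commit message, and the helper name `wwn_symlinks` is made up for illustration.

```python
def wwn_symlinks(wwn: int):
    """Return the by-id links expected for a disk with the given decimal WWN.

    Assumption: links use 16 zero-padded lowercase hex digits, as seen in the
    sample output for wwn=007.
    """
    if not 0 <= wwn < 2**64:
        raise ValueError("WWN must fit in 64 bits")
    hex_wwn = f"{wwn:016x}"
    return [
        f"/dev/disk/by-id/scsi-3{hex_wwn}",
        f"/dev/disk/by-id/wwn-0x{hex_wwn}",
    ]


if __name__ == "__main__":
    for link in wwn_symlinks(int("007")):
        print(link)
```

For small values such as `007` the decimal and hex forms look the same, so the difference only becomes visible once the WWN is 10 or larger (for example, 255 becomes `...00ff`).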

This requires coreos/coreos-assembler#3772
See coreos/fedora-coreos-tracker#1670

[1] https://listman.redhat.com/archives/dm-devel/2023-March/053645.html
jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this issue Apr 5, 2024
jbtrystram added a commit to jbtrystram/coreos-assembler that referenced this issue Apr 8, 2024
jbtrystram added a commit to jbtrystram/fedora-coreos-config that referenced this issue Apr 11, 2024
jlebon pushed a commit to coreos/coreos-assembler that referenced this issue Apr 11, 2024
jbtrystram added a commit to coreos/fedora-coreos-config that referenced this issue Apr 12, 2024