Silent corruption writing files to network share over cifs, from Raspberry Pi, while using certain compressors #6335

Open
pronoiac opened this issue Sep 4, 2024 · 19 comments

Comments

@pronoiac

pronoiac commented Sep 4, 2024

Describe the bug

wrong location?

First off, I got this repo from the package description for the installed kernel. Apologies if I'm not in the right spot.

Short version

I was benchmarking some compressors on Debian on a Raspberry Pi, piping to and from a network share on a NAS, and found that the output of some of them was consistently corrupted when written to the NAS.
Specifically: lzop, pigz (parallel gzip), and pbzip2 (parallel bzip2).
This seems dependent on kernel version:
Debian 11, bullseye, kernel 6.1.21, was ok.
Debian 12, bookworm, kernel versions 6.6.20 and 6.6.31, were impacted.

Compiling and running a mainline kernel 6.1.21 on bookworm avoided the issue. I don’t think Debian patches are at fault.

There's over a year between those kernel releases. Bisecting won’t be quick, but it is doable.

Steps to reproduce the behaviour

It looks like this, on a mounted network share:

cat 1tb-rust-ext4.img.tar.gz  | \
  gzip -d | \
  lzop -1 > \
  1tb-rust-ext4.img.tar.lzop
# wait 40 minutes

cat 1tb-rust-ext4.img.tar.lzop | \
  lzop -d | \
  sha1sum
# it crashes, due to a corrupt file
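
The round trip above can also be checked by digest instead of waiting for the decompressor to crash. The following is a minimal sketch, not the reporter's actual procedure: `roundtrip_check` is a hypothetical helper, and the paths in the usage comment are placeholders. It assumes the uncompressed input is available as a plain file, and that the compressor and decompressor act as stdin-to-stdout filters, as lzop and gzip do in the commands above.

```shell
# roundtrip_check SRC OUT COMPRESS DECOMPRESS
# Hypothetical helper: compresses SRC into OUT, reads OUT back through the
# decompressor, and compares SHA-1 digests of the original and recovered data.
# COMPRESS and DECOMPRESS are deliberately left unquoted so that "lzop -1"
# splits into a command and its arguments.
roundtrip_check() {
  rc_src=$1; rc_out=$2; rc_comp=$3; rc_decomp=$4
  rc_want=$(sha1sum < "$rc_src" | cut -d' ' -f1)
  $rc_comp < "$rc_src" > "$rc_out" || return 2
  sync  # ask the kernel to flush dirty pages before reading the file back
  rc_got=$($rc_decomp < "$rc_out" 2>/dev/null | sha1sum | cut -d' ' -f1)
  if [ "$rc_want" = "$rc_got" ]; then
    echo "OK $rc_want"
  else
    echo "CORRUPT want=$rc_want got=$rc_got"
    return 1
  fi
}

# Usage (placeholder paths; OUT would live on the mounted share):
#   roundtrip_check sample.img /mnt/share/sample.img.lzo "lzop -1" "lzop -d"
```

A truncated or failed decompression produces a different digest, so the function reports it as corrupt without needing the decompressor's exit status. Note that a read-back shortly after the write may be served from the client's cache rather than from the NAS, so a mismatch is meaningful but a match right after writing is weaker evidence.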

Device(s)

Raspberry Pi 4 Mod. B

System

OS & version:

Raspberry Pi reference 2024-07-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 48efb5fc5485fafdc9de8ad481eb5c09e1182656, stage4

Firmware version:

Can't open device file: /dev/vcio
Try creating a device file with: sudo mknod /dev/vcio c 100 0

(That device file didn't help)

Kernel version:

Linux pillions 6.6.31+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

Logs

No response

Additional context

More details

The Pi and NAS are directly connected by Gigabit Ethernet. Both sides are using self-assigned IP addresses.
The files in question are file systems, about 270 gig.
Compression seems to work, without complaint; decompression crashes the process, usually within the first gig of the compressed file. It looks like the compressed files are corrupt.
Trying decompression during compression gets further along than it does after compression finishes; this might point toward something with writes and caches.
This is a Raspberry Pi 4, with 4 GiB RAM.
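
One way to confirm that the stored file itself is damaged, and to see where, is to write the same compressed stream both to local storage and to the share and compare the two copies. This is a sketch under assumptions: `first_diff` is a hypothetical helper, the paths in the usage comment are placeholders, and it presumes there is enough local disk space for a second copy (a smaller sample input may be more practical than the full image).

```shell
# first_diff A B
# Hypothetical helper: reports the 1-based offset of the first differing byte
# between two files, using the POSIX "cmp -l" listing (offset, then the two
# byte values in octal). awk exits after the first line, which stops cmp early.
first_diff() {
  fd_off=$(cmp -l "$1" "$2" 2>/dev/null | awk 'NR==1 { print $1; exit }')
  if [ -n "$fd_off" ]; then
    echo "first difference at byte $fd_off"
    return 1
  fi
  # No differing byte was listed: either identical, or one file is a prefix
  # of the other (cmp -l reports that only on stderr).
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "contents match up to the shorter length, but sizes differ"
    return 1
  fi
}

# Usage (placeholder paths):
#   lzop -1 < sample.img | tee /var/tmp/local.lzo > /mnt/share/remote.lzo
#   first_diff /var/tmp/local.lzo /mnt/share/remote.lzo
```

Identical files are scanned twice here, which is slow for very large inputs; the sketch trades speed for a clear three-way answer.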

Wrong location, more details

I reported the issue to Debian, and they closed it:

Debian does not ship 1:6.6.31-1+rpt1 version.

My impression:

  • the Linux kernel is upstream from Debian
  • this repo - not sure what to call the org - is downstream of both Debian and the Linux kernel

Giving a heads up to the most likely impacted people makes sense -

  • if that's here, yay
  • if it's the forums, ok
  • if I just need to go directly to the Linux kernel folks, ok
@6by9
Contributor

6by9 commented Sep 4, 2024

I vaguely recall an issue with CIFS in mainline not so long back - a fix had been backported in mainline erroneously.

We do build more recent kernels than are packaged into apt. On Raspberry Pi OS you can use sudo rpi-update to get the latest build of the current LTS branch (6.6 at the moment), or use e.g. sudo rpi-update rpi-6.10.y to grab the 6.10 kernel.
It would be useful if you could tell us if the issue is still present in the latest 6.6 branch, and on 6.10.

Please be aware that there is a low-but-non-zero risk of regressions in taking these builds, so please test on a non-critical system, or at least backup first. Having a backup copy of the /boot/firmware/kernel*.img files to restore is generally sufficient, as rpi-update does not delete the old modules.
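
The backup step described above might look like the following sketch. `backup_kernels` is a hypothetical helper and the backup directory in the usage comment is a placeholder; only the /boot/firmware/kernel*.img pattern and the rpi-update invocation come from the comment itself.

```shell
# backup_kernels BOOT_DIR BACKUP_DIR
# Hypothetical helper: copies every kernel*.img from BOOT_DIR into a new
# timestamped directory under BACKUP_DIR and prints that directory's path.
backup_kernels() {
  bk_dest="$2/kernels-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$bk_dest" || return 1
  cp -p "$1"/kernel*.img "$bk_dest"/ || return 1
  echo "$bk_dest"
}

# Usage (run as root; the backup location is a placeholder):
#   backup_kernels /boot/firmware /root/kernel-backups
#   rpi-update rpi-6.10.y
# To restore, copy the saved kernel*.img files back into /boot/firmware.
```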

NB These CI builds are only available for 90 days after the last update on that branch, so generally it's only the LTS branch (6.6), the latest released branch (6.10), and the prepatch branch (6.11) that will be available.

@pronoiac
Author

pronoiac commented Sep 4, 2024

From rpi-update:

  • 6.6.47: crashed on decompression.
  • 6.10.7: worked.
  • a mainline 6.6.48 I built: crashed on decompression.
    (so it's not simply a firmware thing, I think.)
  • 6.11.0-rc6: worked.

@6by9
Contributor

6by9 commented Sep 5, 2024

Interesting that it appears to be something that was broken by 6.6 and now fixed, but not backported.

If you're happy rebuilding the kernel, identifying whether the rpi-6.7.y, rpi-6.8.y, and rpi-6.9.y branches are good or not would be very useful. Unfortunately the CI build artifacts are likely to have expired for those branches, so it needs to be manual builds.

Sorry to ask you to do the investigative work, but you have a system setup that you can get to fail.

@pelwell
Contributor

pelwell commented Sep 5, 2024

I've forced rebuilds of rpi-6.7.y, rpi-6.8.y and rpi-6.9.y. Wait about 45 minutes then try sudo rpi-update rpi-6.7.y etc.

@pelwell
Contributor

pelwell commented Sep 5, 2024

(You can see the in-progress builds here: https://github.com/raspberrypi/linux/actions?query=is%3Ain_progress)

@pelwell
Contributor

pelwell commented Sep 5, 2024

They should be ready now.

@pronoiac
Author

pronoiac commented Sep 5, 2024

My Internet connection's misbehaving today, but I will investigate when I can.

@pronoiac
Author

pronoiac commented Sep 5, 2024

Possibly of note: the issue might go as far back as v6.3.
Those builds are very helpful; building on my Pi takes about two hours.

  • 6.7.12 - crashed
  • 6.8.12 - worked
  • 6.9.12 - crashed. Possibly relevant: rpi-update fetched a new EEPROM as I set this up.

@pronoiac
Author

pronoiac commented Sep 6, 2024

I re-ran 6.8.12 - after the new eeprom - and it didn't work.

@pronoiac
Author

pronoiac commented Sep 9, 2024

I've been looking for the fix for 6.10; I'm bisecting into its rc1.

@pronoiac
Author

pronoiac commented Sep 11, 2024

Reading the rpi-update page (edit: new repo), it looks like it can pull in bleeding edge firmware, with risk of regressions. I intended to use it to pull in kernel 6.6.50, but then checking some kernels I'd built, I'm seeing breakage where it worked before.

Any suggestions?

@popcornmix
Collaborator

Reading the rpi-update page

Check the first line of the readme.

I intended to use it to pull in kernel 6.6.50, but then checking some kernels I'd built, I'm seeing breakage where it worked before.
Any suggestions?

Not based on what you've posted. If you post exactly what you did, and exactly what the breakage was it's possible there will be suggestions.

@pronoiac
Author

I updated the link, in case you were thinking of the deprecated rpi-update repo.

What I did:

  • bisect, keeping notes - "did lzop decompression work?" - and kernels around
  • sudo rpi-update rpi-6.6.y
  • re-ran the test on some kernels I'd built previously, and noted that decompression now broke: an unexpected change in behavior

Vaguely, some options I see:

  • apt / dpkg reinstall something to reset /boot to what came with bookworm
  • grab a new MicroSD card, and start over

@popcornmix
Collaborator

I'm still not following which cases are which in "I'm seeing breakage where it worked before."

Is the breakage here the "Silent corruption writing files to network share over cifs" or something else?
Are you saying rpi-update kernel behaves the same or differently to your self built one?

@pronoiac
Author

The network share breakage manifests as lzop failing to decompress, and that works, or doesn't, depending on the Linux kernel version. I've attempted bisection of the Linux kernel. rpi-update appears to change something in addition to the Linux kernel version, so that a kernel I'd tested, will stop working.

@popcornmix
Collaborator

rpi-update may update bootloader and/or firmware (start.elf).
There are options to disable that.
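
Because more than the kernel can change between runs, it may help to record the kernel, firmware, and bootloader identity next to each test result. This is a sketch under assumptions: `log_result` and the log file name are hypothetical, and it falls back to "unknown" when vcgencmd is missing or cannot query the firmware (as with the /dev/vcio error reported earlier in this issue).

```shell
# log_result RESULT LOGFILE
# Hypothetical helper: appends one tab-separated line per test run with a UTC
# timestamp, the running kernel release, and (when vcgencmd can report them)
# the firmware and bootloader versions.
log_result() {
  lr_result=$1; lr_log=$2
  lr_fw=; lr_bl=
  if command -v vcgencmd >/dev/null 2>&1; then
    lr_fw=$(vcgencmd version 2>/dev/null | tr '\n' ' ')
    lr_bl=$(vcgencmd bootloader_version 2>/dev/null | tr '\n' ' ')
  fi
  printf '%s\tkernel=%s\tfirmware=%s\tbootloader=%s\tresult=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(uname -r)" \
    "${lr_fw:-unknown}" "${lr_bl:-unknown}" "$lr_result" >> "$lr_log"
}

# Usage (placeholder log path), after each compress/decompress test:
#   log_result pass bisect-notes.tsv
#   log_result fail bisect-notes.tsv
```

With such a log, a kernel that passed once and later failed can be checked against any firmware or bootloader change between the two runs.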

@pelwell
Contributor

pelwell commented Sep 11, 2024

so that a kernel I'd tested, will stop working.

Stop working in what way? Try to be less vague.

@pronoiac
Author

Stop working in what way? Try to be less vague.

When I re-run the compression and decompression on a kernel where both worked before, the decompression now fails because the written file is corrupt.

@pelwell
Contributor

pelwell commented Sep 11, 2024

What you are describing sounds a lot like a random/timing-related issue, which would make testing challenging.
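
If the failure is intermittent, a single passing run says little about a kernel, so each kernel could be tested several times before being marked good. This is a sketch only: `repeat_test` is a hypothetical helper, and the command it wraps is whatever compress-and-verify test is in use.

```shell
# repeat_test N COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND N times, counts passes and failures by
# exit status, prints a summary, and returns non-zero if any run failed.
repeat_test() {
  rt_n=$1; shift
  rt_pass=0; rt_fail=0; rt_i=0
  while [ "$rt_i" -lt "$rt_n" ]; do
    if "$@" >/dev/null 2>&1; then
      rt_pass=$((rt_pass + 1))
    else
      rt_fail=$((rt_fail + 1))
    fi
    rt_i=$((rt_i + 1))
  done
  echo "passes=$rt_pass failures=$rt_fail"
  [ "$rt_fail" -eq 0 ]
}

# Usage (the test script name is a placeholder):
#   repeat_test 5 ./compress-and-verify.sh
```

How many repetitions are enough depends on the failure rate, which this thread has not established.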
