
archive: report error from input stream #2012

Conversation

giuseppe (Member)

If there is an error reading from the input stream, prefer to report it instead of the error from the filter program itself.

We have a test in the buildah CI that expects the "no space left on device" error that comes from the input stream; to avoid changing the test, just fix it here.

Reported here: containers/buildah#5585

pkg/archive/filter.go (review threads below)
@giuseppe force-pushed the archive-filter-reports-error-from-read branch from 9883769 to 22fa550 on July 11, 2024 20:04
@rhatdan (Member) commented Jul 11, 2024

LGTM

@rhatdan (Member) commented Jul 11, 2024

@mtrmac PTAL

```go
err = fmt.Errorf("%s: %w", strings.TrimRight(stderrBuf.String(), "\n"), err)
b := make([]byte, 1)
// if there is an error reading from input, prefer to return that error
_, errRead := input.Read(b)
```
Contributor (cgwalters)

Anyways here... is it generally defined behavior in Go to read from a reader after it has returned EOF? It would seem unusual to cache errors, right?

Actually I am now a bit more confused as to the flow here: how did this break the ENOSPC in the first place? And how does this change fix it? The input here is the HTTP request reader, right? It wouldn't be the thing giving us ENOSPC...

Member Author (giuseppe)

I think it is just a race condition. The decompressor is faster now and detects the error earlier, while before we relied on catching the ENOSPC before the decompressor did.

Contributor (cgwalters)

At a high level I don't understand: why, in a test case where we're checking for ENOSPC, are we hitting corruption in the compression stream? That doesn't make sense to me.


Different topic: My semi-understanding here is that the previous code was synchronously doing decompression and writes in a single goroutine. That will make the errors quite predictable indeed (in practice, they could only change if the decompressor happened to change to start reading larger chunks or something).

But here, we're suddenly doing decompression in a separate process (but logically, we could have done it in a separate goroutine before, sending decompressed chunks via a pipe or channel) - and it's that parallelization that makes things racy, not speed of decompression itself.

Collaborator (mtrmac)

> Anyways here... is it generally defined behavior in Go to read from a reader after it has returned EOF?

The API does not explicitly say, but I would expect a Reader to infinitely return EOF on end-of-file.

> It would seem unusual to cache errors, right?

I wouldn’t generally expect a Read to continue returning a previously-reported error. Some implementations do that, but e.g. for a raw file access, I’d expect it to just issue a syscall again. So, yes, that might plausibly lose data, but we don’t care here — if we don’t see any error, we would report the one from the subprocess.
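For illustration, a minimal sketch of that "keep returning EOF" behavior, using strings.Reader from the standard library; whether a non-EOF error gets cached is implementation-specific:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	r := strings.NewReader("x")
	buf := make([]byte, 1)

	r.Read(buf)            // consumes the only byte: n=1, err=nil
	_, err1 := r.Read(buf) // end of data: io.EOF
	_, err2 := r.Read(buf) // reading again past the end: still io.EOF

	fmt.Println(err1 == io.EOF, err2 == io.EOF) // true true
}
```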


> Actually I am now a bit more confused as to the flow here: how did this break the ENOSPC in the first place? And how does this change fix it?

> The input here is the HTTP request reader, right? It wouldn't be the thing giving us ENOSPC...

I think here the input is an io.TeeReader actually writing to the restricted-space directory. The storageImageDestination.putBlobToPendingFile pipeline is

HTTP -> … -> Tee -> digest compressed -> decompress -> digest uncompressed
              |
              -> count size -> store to file

and a read from the Tee causes a write to the side branch, which fails with ENOSPC, and is reported to the consumer of the Tee.
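For illustration, a minimal sketch of that failure mode, with a hypothetical failingWriter standing in for the full filesystem (the real pipeline tees an HTTP body into an actual file):

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"syscall"
)

// failingWriter stands in for the side branch writing to a full
// filesystem: every Write fails with ENOSPC (Linux errno, matching
// the scenario here).
type failingWriter struct{}

func (failingWriter) Write(p []byte) (int, error) {
	return 0, syscall.ENOSPC
}

func main() {
	src := strings.NewReader("pretend this is the compressed blob")
	tee := io.TeeReader(src, failingWriter{})

	// Reading from the Tee triggers the side-branch Write; its error
	// is what the consumer of the Tee sees.
	_, err := io.ReadAll(tee)
	fmt.Println(err) // no space left on device
}
```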

(Yes, it would have been nice to have that explanation at the start of the PR, and to avoid the “let’s merge because we are in a rush, what do you mean we don’t understand whether this is actually a fix” controversy.)


> and it's that parallelization that makes things racy, not speed of decompression itself.

It’s not really a race at all, it’s that going across the external process loses the Go error information available for the input of the decompression. The external process is just going to see an EOF on a pipe (or maybe we could send a signal?), there is no way to stuff a “we encountered ENOSPC when providing you data” side channel into the pipe.

Collaborator (mtrmac)

> Anyways here... is it generally defined behavior in Go to read from a reader after it has returned EOF? It would seem unusual to cache errors, right?

> It’s not really a race at all, it’s that going across the external process loses the Go error information available for the input of the decompression.

On second thought, I think this should be implemented in a different way, one that is clearly correct, reporting all errors and not introducing any risk of extra hangs:

Wrap the input stream in a new errorRecordingReader, which stores the first error value reported, if any. Pass that wrapped stream as cmd.Stdin. Then we can consult the reported error if cmd.Run() fails. (cmd.Run() ensures the goroutine consuming cmd.Stdin terminates before returning, so we don’t need to worry about concurrency.)

(It’s a bit ironic that cmd.Run() already has such an error-recording code in cmd.childStdin; but that error is only reported if cmd returns with exit status 0, the ExitError cause is preferred.)
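A minimal sketch of the proposed wrapper; the name and shape follow this comment, and the actual implementation in #2025 may differ in detail:

```go
package archive // hypothetical placement next to tryProcFilter

import "io"

// errorRecordingReader remembers the first non-EOF error returned by
// the wrapped Reader, so the caller can prefer it over the filter
// subprocess's own exit error after cmd.Run() fails.
type errorRecordingReader struct {
	r   io.Reader
	err error // first non-EOF error seen, if any
}

func (e *errorRecordingReader) Read(p []byte) (int, error) {
	n, err := e.r.Read(p)
	if e.err == nil && err != nil && err != io.EOF {
		e.err = err
	}
	return n, err
}
```

Used roughly as: wrap the input, set cmd.Stdin to the wrapper, and if cmd.Run() fails and the recorded err is non-nil, report that instead of the subprocess's ExitError.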

Collaborator (mtrmac)

> I think this should be implemented in a different way

I need to focus on something else; I have filed at least a placeholder #2022 for visibility.

Contributor (cgwalters)

> and a read from the Tee causes a write to the side branch, which fails with ENOSPC, and is reported to the consumer of the Tee.

Thank you! That makes sense to me now. I was unaware of the use of Tee here as I currently just have a superficial awareness of this code, but I'm here to learn. The "reads cause writes via side effect" is definitely not something I am used to seeing.

This is a tangent... and I am sure you guys know what's coming next 😄 but in Go, TeeReader is pretty popular, whereas there's no such thing in the Rust stdlib; people have written it, but it's vanishingly rare by comparison. It's way more common IME to do this type of thing via a custom Write implementation that writes to two places; e.g. ocidir-rs (my code), and there's something similar in coreos-installer I think.

I suspect one reason for this is the "ownership" of things and error handling is much clearer if there's a single "super writer" instead of a tee.
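For comparison, Go's stdlib also offers the "single writer fanning out" shape as io.MultiWriter; a minimal sketch of hashing while storing (the temp-file destination is just for illustration):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	src := strings.NewReader("payload")
	hasher := sha256.New()

	// Hypothetical destination file; one Write fans out to both
	// sinks, and the first error from either stops the copy.
	dst, err := os.CreateTemp("", "blob-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(dst.Name())

	if _, err := io.Copy(io.MultiWriter(dst, hasher), src); err != nil {
		panic(err)
	}
	fmt.Printf("%x\n", hasher.Sum(nil))
}
```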

@mtrmac (Collaborator) commented Jul 12, 2024

*shrug* STREAMS (…) are hard. Go has ended up with two separate Reader and Writer abstractions, which nicely compose individually, but connecting a codebase built around a Reader to a codebase built around a Writer requires an awkward API translation layer; that might be io.Copy or io.TeeReader (easy), or a separate goroutine and an io.Pipe (doable but annoying and comparatively risky).

Whether something interfaces using a Reader or a Writer is frequently, but not always, fairly arbitrary, and often out of the caller's control. Here, the decompression is built around Readers, and using a TeeReader to adapt in that direction is much easier than using pipes to adapt in the other direction.

[This is sort of isomorphic to the “XML streaming API design” dilemma, where, without concurrency, it’s easy to write an XML parser triggering hard-to-consume callbacks, vs. it’s harder to write an XML parser which returns an easy-to-consume stream of events; with XML it’s so much more visible because the API surface is much larger than just some byte arrays and errors.]

Member Author (giuseppe)

> On second thought, I think this should be implemented in a different way […]
>
> Wrap the input stream in a new errorRecordingReader, which stores the first error value reported, if any. Pass that wrapped stream as cmd.Stdin. Then we can consult the reported error if cmd.Run() fails.

Thanks for the hint. Opened a PR: #2025

@@ -46,7 +46,14 @@ func tryProcFilter(args []string, input io.Reader, cleanup func()) (io.ReadClose
go func() {
Contributor (cgwalters)

This is a bit of a tangent, but I am not sure why we're not using https://pkg.go.dev/os/exec#Cmd.StdinPipe above?

Member Author (giuseppe)

I've not used StdoutPipe because I report the error explicitly with CloseWithError on an io.Pipe. I'll need to check whether StdoutPipe allows the same behavior.
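For context, a minimal sketch of the io.Pipe behavior in question: CloseWithError hands the real cause to the reading side, which a plain OS pipe (as returned by StdinPipe/StdoutPipe) cannot do, since its reader only ever sees EOF:

```go
package main

import (
	"fmt"
	"io"
	"syscall"
)

func main() {
	pr, pw := io.Pipe()

	go func() {
		pw.Write([]byte("partial output"))
		// Propagate the real cause to the reader: after the buffered
		// data is consumed, pr.Read returns ENOSPC, not a bare EOF.
		pw.CloseWithError(syscall.ENOSPC)
	}()

	data, err := io.ReadAll(pr)
	fmt.Printf("%q %v\n", data, err) // "partial output" no space left on device
}
```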

@rhatdan (Member) commented Jul 12, 2024

I would like to get this in, so we can move forward with the vendoring into Buildah and then finally into Podman.

@cgwalters (Contributor)

> I would like to get this in, so we can move forward with the vendoring into Buildah and then finally into Podman.

Well, /approve from me at least, FWIW... I don't think this change will make things worse; I just personally don't understand the problem and how this will fix it, but I may be missing something. But if you or someone else is happy with it, then from my PoV feel free to lgtm!

openshift-ci bot commented Jul 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, giuseppe


@rhatdan (Member) commented Jul 12, 2024

Buildah is expecting to get an out-of-space error, which it gets from the non-pigz path. Hopefully this fixes it so the user ends up with an understandable error when the system is out of space.

@rhatdan (Member) commented Jul 12, 2024

/lgtm

@cgwalters (Contributor)

Yes, I understood the problem well enough on that end. I just don't quite understand how this change as written will actually fix it. If you (or someone else) can, feel free to explain... but we can also just see if it actually works when retrying the vendored update.

@openshift-merge-bot merged commit 98ad80d into containers:main on Jul 12, 2024
18 checks passed
giuseppe added a commit to giuseppe/storage that referenced this pull request Jul 12, 2024
follow-up for containers#2012

report the error as seen by the input stream, instead of attempting
another read.

Closes: containers#2022

Signed-off-by: Giuseppe Scrivano <[email protected]>
giuseppe added a commit to giuseppe/storage that referenced this pull request Jul 15, 2024 (same follow-up commit as above)