bug: encounter "writer has not been closed or aborted, must be a bug" #5101

hzxa21 opened this issue Sep 6, 2024 · 6 comments
hzxa21 commented Sep 6, 2024

Describe the bug

Version: 0.47.2
With the following writer configuration, we encounter "writer has not been closed or aborted, must be a bug":

    let writer = op
        .clone()
        // I/O timeout applied to each underlying operation.
        .layer(TimeoutLayer::new().with_io_timeout(Duration::from_millis(
            config.retry.streaming_upload_attempt_timeout_ms,
        )))
        // Exponential backoff with jitter, at most 3 retries.
        .layer(
            RetryLayer::new()
                .with_min_delay(Duration::from_millis(1000))
                .with_max_delay(Duration::from_millis(10000))
                .with_max_times(3)
                .with_factor(2.0)
                .with_jitter(),
        )
        .writer_with(&path)
        // Allow up to 8 concurrent part uploads.
        .concurrent(8)
        // Run background upload tasks on a custom executor.
        .executor(Executor::with(monitored_execute))
        .await?;

It seems that this happens when the OpenDAL retry is triggered on writer.close().

Steps to Reproduce

Expected Behavior

Additional Context

Logs:

2024-09-04T07:42:02.770667396Z WARN opendal::layers::retry: will retry after 1.604547739s because: Unexpected (temporary) at Writer::close, context: { timeout: 10 } => io operation timeout reached

2024-09-04T07:42:04.377279911Z WARN opendal::services: service=s3 operation=Writer::close path=xxx -> data close failed: NotFound (permanent) at Writer::close, context: { uri: ..., response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "cache-control": "no-cache", "content-length": "467", "content-security-policy": "block-all-mixed-content", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-accel-buffering": "no", "x-amz-id-2": "..."} }, service: s3, path: xxx, written: 138426184 } => S3Error { code: "NoSuchUpload", message: "The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.", resource: "xxx", request_id: "xxx" }

2024-09-04T07:42:04.377323167Z WARN opendal::layers::complete: writer has not been closed or aborted, must be a bug

2024-09-04T07:42:04.37733207Z ERROR risingwave_object_store::object: streaming_upload_finish failed error=NotFound (persistent) at Writer::close, context: { uri:..., response: Parts { status: 404, version: HTTP/1.1, headers: {"accept-ranges": "bytes", "cache-control": "no-cache", "content-length": "467", "content-security-policy": "block-all-mixed-content", "content-type": "application/xml", "server": "MinIO", "strict-transport-security": "max-age=31536000; includeSubDomains", "vary": "Origin", "vary": "Accept-Encoding", "x-accel-buffering": "no", "x-amz-id-2": "..."} }, service: s3, path: ... } => S3Error { code: "NoSuchUpload", message: "The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.", resource: "...", request_id: "..." }

Are you willing to submit a PR to fix this bug?

  • Yes, I would like to submit a PR.
hzxa21 added the bug label Sep 6, 2024

wcy-fdu commented Sep 6, 2024

Also reproduced in version 0.49.


Xuanwo commented Sep 7, 2024

Hi, could you check if this file has been created successfully? I'm wondering if this is the case:

  • A CompleteMultipartUpload request is sent, which returns a timeout.
  • We cancel this request and send a new one.
  • But the old request has been processed successfully, leading to the new request returning an error.

Do you think this is possible? Can you reproduce this over AWS S3, or is it only on MinIO?


wcy-fdu commented Sep 9, 2024

> Can you reproduce this over AWS S3, or is it only on MinIO?

This error occurs in our CI, which is based on MinIO. The issue has not yet been reproduced on AWS S3, and it is a bit hard to check whether the file was created successfully on MinIO.


Xuanwo commented Sep 9, 2024

> This error occurs in our CI, which is based on MinIO. The issue has not yet been reproduced on AWS S3, and it is a bit hard to check whether the file was created successfully on MinIO.

Hi, can you run op.stat() on this file when NotFound occurs during w.close(), and ignore the error if the file was created? If the error disappears, I believe that's the issue.
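
A minimal sketch of that check, for illustration only (it assumes OpenDAL's public Operator/Writer API; the helper name is hypothetical, not project code):

    use opendal::{ErrorKind, Operator, Writer};

    // Hypothetical helper: treat a NotFound returned by close() as success when
    // the object actually exists, since an earlier CompleteMultipartUpload
    // attempt may have already succeeded before the retried request arrived.
    async fn close_ignoring_stale_not_found(
        op: &Operator,
        path: &str,
        mut writer: Writer,
    ) -> opendal::Result<()> {
        match writer.close().await {
            Ok(_) => Ok(()),
            Err(e) if e.kind() == ErrorKind::NotFound => {
                // If stat() finds the object, the upload completed; ignore the error.
                op.stat(path).await.map(|_| ())
            }
            Err(e) => Err(e),
        }
    }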


wcy-fdu commented Sep 9, 2024

> This error occurs in our CI, which is based on MinIO. The issue has not yet been reproduced on AWS S3, and it is a bit hard to check whether the file was created successfully on MinIO.

> Hi, can you run op.stat() on this file when NotFound occurs during w.close(), and ignore the error if the file was created? If the error disappears, I believe that's the issue.

The file is not written into MinIO.


wcy-fdu commented Sep 18, 2024

Update:
OpenDAL and the AWS SDK handle multipart uploads differently.

  • The AWS SDK uploads parts completely in parallel.
  • OpenDAL processes uploads based on a concurrency limit, using an internal task queue; during the final close, it keeps polling this queue until all tasks are completed.

As a result, in extreme cases the timeout in OpenDAL may cover two rounds of I/O: the first round waits for task.execute, and the second waits for all tasks in the batch to finish. Currently the timeout is not applied per individual request; it also includes the time spent waiting for queued tasks. Therefore, when the number of tasks exceeds the concurrency, the timeout is more likely to be triggered than with the SDK (see the sketch at the end of this comment).

Additionally, there may be a MinIO issue: the CompleteMultipartUpload operation is not necessarily idempotent.
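
Not OpenDAL's actual code, but a deliberately simplified sketch of this effect (sequential drain, made-up durations, tokio runtime assumed): a timeout wrapped around close() can fire even though every individual simulated request stays well under the limit, because close() must first work through the queued parts.

    use std::time::Duration;
    use tokio::time::{sleep, timeout};

    // Toy model of a writer that queues parts and only uploads them at close().
    struct MockWriter {
        // Parts queued but not yet uploaded (each value is its upload time).
        queued: Vec<Duration>,
    }

    impl MockWriter {
        fn write(&mut self, part_upload_time: Duration) {
            // write() only queues the part; the actual upload happens later.
            self.queued.push(part_upload_time);
        }

        async fn close(&mut self) {
            // close() must finish every queued upload before it can send
            // CompleteMultipartUpload, so its duration grows with the backlog.
            for upload_time in self.queued.drain(..) {
                sleep(upload_time).await; // simulate one UploadPart request
            }
            sleep(Duration::from_millis(50)).await; // simulate CompleteMultipartUpload
        }
    }

    #[tokio::main]
    async fn main() {
        let mut w = MockWriter { queued: Vec::new() };
        for _ in 0..8 {
            w.write(Duration::from_millis(100)); // each part alone is fast
        }
        // A 500ms "io timeout" wrapped around close() fires here, even though no
        // single simulated request takes longer than 100ms, because the timeout
        // also covers draining the whole backlog (8 x 100ms + completion).
        match timeout(Duration::from_millis(500), w.close()).await {
            Ok(()) => println!("closed"),
            Err(_) => println!("io timeout reached while draining queued parts"),
        }
    }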
