Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF: "failed to read dir" error running cargo bench #2495

Closed
mrcnski opened this issue Nov 25, 2023 · 4 comments · Fixed by #2486
Closed

PVF: "failed to read dir" error running cargo bench #2495

mrcnski opened this issue Nov 25, 2023 · 4 comments · Fixed by #2486
Assignees
Labels
I2-bug The node fails to follow expected behavior. T0-node This PR/Issue is related to the topic “node”.

Comments

@mrcnski
Copy link
Contributor

mrcnski commented Nov 25, 2023

Getting the following error after #1918:

Nov 24 15:44:09.323 ERROR parachain::pvf: failed to read dir "/tmp/.tmpxxQbgX" err=Os { code: 2, kind: NotFound, message: "No such file or directory" }

The PR has not been released yet, so we should find out whether this affects production (and also have a test to reproduce this).

The weird thing is, we do successfully create the dir beforehand with create_dir_all. But I've reproduced the error from read_dir on two machines already. 🤔

// Make sure that the cache path directory and all its parents are created.
if let Err(err) = tokio::fs::create_dir_all(cache_path).await {
    // NOTE: This log is not in `master`, I added it to verify that the dir exists.
    if err.kind() != io::ErrorKind::AlreadyExists {
        gum::error!(
            target: LOG_TARGET,
            ?err,
            "failed to create dir {:?}",
            cache_path,
        );
        return
    }
}

let mut dir = match tokio::fs::read_dir(cache_path).await {
    Ok(dir) => dir,
    Err(err) => {
        gum::error!(
            target: LOG_TARGET,
            ?err,
            "failed to read dir {:?}",
            cache_path,
        );
        return
    },
};

cc @eagr if you have any ideas. I'll try to get to it later today.

@eagr
Copy link
Contributor

eagr commented Nov 26, 2023

I don't have access to a linux machine (considering getting one), so I couldn't really do cargo bench right now.

Is there any other way to trigger this? What about cargo test -r on those machines, do the tests pass? There is a test that covers the related code. I tried cargo test -r, not problems no macos.

If we assume read_dir works ok, apparently the tmp dir is removed somewhere. Here are some other possibilities I could think of, likely or not. I guess we couldn't take anything for granted when having parallel tasks.

  • create_dir_all() failed to create the dir but it didn't return Err so you wouldn't see it
  • the dir got removed again after create_dir_all().await
  • read_dir() is called too early

Have you tried the blocking std::fs::create_dir_all or std::fs::read_dir? Or tokio::time::sleep before reading the dir? Let's try one thing at a time. Not suggesting we should do any of these in production, just for ruling out possibilities.

@mrcnski
Copy link
Contributor Author

mrcnski commented Nov 26, 2023

Thanks for the ideas @eagr! I think the problem occurs when the directory already exists, but can't be read. Previously we were deleting the directory before recreating it, so we didn't have this problem.

Oh, I think it has to do with the benchmark doing it multiple times in parallel in iter_batched:

group.bench_function("host: prepare Rococo runtime", |b| {
    b.to_async(&rt).iter_batched(
        || async {
            (
                TestHost::new_with_config(rt.handle(), |cfg| {
                    cfg.prepare_workers_hard_max_num = 1;
                })
                .await,
                pvf.clone().code(),
            )
        },
        |result| async move {
            let (host, pvf_code) = result.await;
            [...]

BTW, does cargo bench not work on your Mac? I've reproduced this both on my Mac and on my Linux VPS.

@mrcnski
Copy link
Contributor Author

mrcnski commented Nov 26, 2023

I think I see the problem. Luckily this doesn't affect production.

@mrcnski
Copy link
Contributor Author

mrcnski commented Nov 26, 2023

The problem is that here we use a cache path that gets deleted at the end of the end of the function (see https://docs.rs/tempfile/latest/tempfile/fn.tempdir.html). Not a big deal, I fixed it here.

This brings up an interesting question though, what happens if the cache path gets deleted at runtime? It's an unlikely concern in practice, but I'm curious.

@mrcnski mrcnski self-assigned this Dec 5, 2023
@mrcnski mrcnski added T0-node This PR/Issue is related to the topic “node”. I2-bug The node fails to follow expected behavior. labels Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior. T0-node This PR/Issue is related to the topic “node”.
Projects
Status: Completed
Development

Successfully merging a pull request may close this issue.

2 participants