Checkpoint is never created after upgrade (both with Julia 1.5 and 1.6) #1547
#1514 is failing so at this point you are using Julia v1.6 "at your own risk" :) |
I know, which is why I eventually reverted to 1.5 :) |
Yes definitely. :) I was just making the point that if you are using v1.6 for "production-research ready" runs then you might be in trouble... |
To clarify, the run appears to hang at the point that the |
Yes! It's hard for me to say exactly where it's getting stuck in this case, but it hangs before running even a single step of the simulation. Everything that happens up until the checkpointer seems to be fine. That is, the model and simulation are created fine, and I see files created for all my NetCDF outputs, but I never see a file created for the checkpointer. |
Hmm... I'm going to test whether JLD2 works with 1.6... |
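One way to run such a check in isolation from Oceananigans is a small write/read round trip. This is only a sketch: it assumes the JLD2 package is installed in the active environment, the file name and array are made up for illustration, and passing it says nothing about whether the Checkpointer itself works.

```julia
using JLD2

# Write a small array to a temporary JLD2 file (hypothetical file name).
path = joinpath(mktempdir(), "jld2_check.jld2")
data = rand(4, 4, 4)

jldopen(path, "w") do file
    file["data"] = data
end

# Read it back and compare with what was written.
restored = jldopen(path, "r") do file
    file["data"]
end

@assert restored == data
```

If this round trip succeeds on a given Julia version, a hang is more likely to come from how the checkpointer is driven than from JLD2 file I/O itself.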
I should also note that I've had an issue with the |
It would be super super helpful if you document this on the Oceananigans issue tracker! |
So, I also tried upgrading my whole project without upgrading Julia (so still using Julia 1.5.2) and the error persists. To be clear, before the upgrade below everything was working normally and after the upgrade the checkpointer stopped being created. Here's the upgrade:
I was able to create a MWE this time:
using Printf
using Oceananigans
using Oceananigans: Utils, Units
using Oceananigans.OutputWriters
using Oceanostics: SingleLineProgressMessenger
grid = RegularRectilinearGrid(size=(4, 4, 4), extent=(1,1,1))
model = IncompressibleModel(architecture = CPU(), grid = grid)
start_time = 1e-9*time_ns()
simulation = Simulation(model, Δt=1, stop_time=50, iteration_interval=5,
                        progress=SingleLineProgressMessenger(LES=false, initial_wall_time_seconds=start_time),
                        )
println("\n", simulation,"\n",)
@info "Setting up chk writer"
simulation.output_writers[:chk_writer] = Checkpointer(model; dir=".",
                                                      prefix = "chk.test",
                                                      schedule = TimeInterval(5),
                                                      force = true, cleanup = true,
                                                      )
println("\n", simulation,"\n",)
@printf("---> Starting run!\n")
run!(simulation, pickup=true)
This results in the following output:
Simulation{IncompressibleModel{CPU, Float64}}
├── Model clock: time = 0 seconds, iteration = 0
├── Next time step (Int64): 1 second
├── Iteration interval: 5
├── Stop criteria: Any[Oceananigans.Simulations.iteration_limit_exceeded, Oceananigans.Simulations.stop_time_exceeded, Oceananigans.Simulations.wall_time_limit_exceeded]
├── Run time: 0 seconds, wall time limit: Inf
├── Stop time: 50 seconds, stop iteration: Inf
├── Diagnostics: OrderedCollections.OrderedDict with 1 entry:
│ └── nan_checker => NaNChecker
└── Output writers: OrderedCollections.OrderedDict with no entries
[ Info: Setting up chk writer
Simulation{IncompressibleModel{CPU, Float64}}
├── Model clock: time = 0 seconds, iteration = 0
├── Next time step (Int64): 1 second
├── Iteration interval: 5
├── Stop criteria: Any[Oceananigans.Simulations.iteration_limit_exceeded, Oceananigans.Simulations.stop_time_exceeded, Oceananigans.Simulations.wall_time_limit_exceeded]
├── Run time: 0 seconds, wall time limit: Inf
├── Stop time: 50 seconds, stop iteration: Inf
├── Diagnostics: OrderedCollections.OrderedDict with 1 entry:
│ └── nan_checker => NaNChecker
└── Output writers: OrderedCollections.OrderedDict with 1 entry:
│ └── chk_writer => Checkpointer
---> Starting run!
And then the REPL just hangs there and nothing happens. I also see no checkpoint file created. I checked
Here's
julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)
(ISI_jet) pkg> st
Status `/glade/scratch/tomasc/ISI_jet/Project.toml`
[c7e460c6] ArgParse v1.1.4
[63c18a36] KernelAbstractions v0.5.5
[9e8cae18] Oceananigans v0.54.0
[d0ccf422] Oceanostics v0.3.0
[5fb14364] OhMyREPL v0.5.10
[d96e819e] Parameters v0.12.2
[91a5bcdd] Plots v1.11.2
[276daf66] SpecialFunctions v1.3.0
[de0858da] Printf
[10745b16] Statistics |
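To check from within Julia whether any checkpoint file was written, something like the sketch below could be used. `find_checkpoints` is a hypothetical helper, not part of Oceananigans, and the `<prefix>_iteration<N>.jld2` naming pattern is an assumption that is not confirmed anywhere in this thread.

```julia
# Hypothetical helper: list files in `dir` that look like checkpoints for `prefix`.
# ASSUMPTION: checkpoint files are named "<prefix>_iteration<N>.jld2".
function find_checkpoints(dir, prefix)
    # \Q...\E makes the regex treat the prefix literally (e.g. the dot in "chk.test").
    pattern = Regex("^\\Q$(prefix)\\E_iteration\\d+\\.jld2\$")
    return sort(filter(name -> occursin(pattern, name), readdir(dir)))
end

# Demonstration on a temporary directory with made-up file names.
dir = mktempdir()
touch(joinpath(dir, "chk.test_iteration0.jld2"))
touch(joinpath(dir, "chkXtest_iteration5.jld2"))  # not a match: the dot is literal
touch(joinpath(dir, "output.nc"))                 # not a checkpoint

found = find_checkpoints(dir, "chk.test")
println(found)
```

For the MWE above, calling `find_checkpoints(".", "chk.test")` before and after `run!` would show whether anything was written; an empty result matches the behavior described here.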
This appears to be fixed with the latest version of Oceananigans and Julia 1.5.4 (probably due to #1621). |
I'm not sure if we're officially supporting Julia 1.6 yet, but I noticed that when I create a Checkpointer in Julia 1.6 things don't really work. All other output files are created normally (for me, NetCDF files), but the checkpoint file never is, and the simulation just hangs there. I waited for over 15 min, but the next lines (which are supposed to come from the progress messenger) never come up. Everything works normally when I revert to Julia 1.5.3.
I don't have time to create a clean reproducible MWE at the moment, but I can do so later if needed. I just thought I should post this while it's fresh in my head.