Checkpoint is never created after upgrade (both with Julia 1.5 and 1.6) #1547

Closed
tomchor opened this issue Apr 5, 2021 · 10 comments


@tomchor
Collaborator

tomchor commented Apr 5, 2021

I'm not sure if we're officially supporting Julia 1.6 yet, but I noticed that when I create a Checkpointer in Julia 1.6 things don't really work. All other output files (NetCDF files, in my case) are created normally, but the checkpoint file never is, and the simulation just hangs there.

Simulation{typename(IncompressibleModel){typename(CPU), Float64}}
├── Model clock: time = 0 seconds, iteration = 0 
├── Next time step (TimeStepWizard{Float64}): 2.747 seconds 
├── Iteration interval: 5
├── Stop criteria: Any[Oceananigans.Simulations.iteration_limit_exceeded, Oceananigans.Simulations.stop_time_exceeded, Oceananigans.Simulations.wall_time_limit_exceeded]
├── Run time: 0 seconds, wall time limit: 300.0
├── Stop time: 7.272 days, stop iteration: Inf
├── Diagnostics: typename(OrderedCollections.OrderedDict) with 1 entry:
│   └── nan_checker => typename(NaNChecker)
└── Output writers: typename(OrderedCollections.OrderedDict) with 4 entries:
│   ├── out_writer => typename(NetCDFOutputWriter)
│   ├── vid_writer => typename(NetCDFOutputWriter)
│   ├── avg_writer => typename(NetCDFOutputWriter)
│   └── chk_writer => typename(Checkpointer)

---> Starting run!

I waited for over 15 minutes, but the next lines (which should come from the progress messenger) never appear. Everything works normally when I revert to Julia 1.5.3.

I don't have time to create a clean reproducible MWE at the moment, but I can do so later if needed. I just thought I should post this while it's fresh in my head.

@tomchor tomchor changed the title Checkpointer is never created using Julia 1.6 Checkpoint is never created using Julia 1.6 Apr 5, 2021
@navidcy
Collaborator

navidcy commented Apr 5, 2021

#1514 is failing so at this point you are using Julia v1.6 "at your own risk" :)

@tomchor
Collaborator Author

tomchor commented Apr 6, 2021

I know, which is why I eventually reverted to 1.5 :)
But eventually these issues will have to be addressed, no?

@navidcy
Collaborator

navidcy commented Apr 6, 2021

Yes definitely. :)

I was just making the point that if you are using v1.6 for "production-research ready" runs then you might be in trouble...

@glwagner
Member

glwagner commented Apr 6, 2021

To clarify, the run appears to hang at the point that the Checkpointer attempts to write something to disk (rather than the simulation running normally, but with no output created)? We should also check whether JLD2OutputWriter works or hangs.

@tomchor
Collaborator Author

tomchor commented Apr 6, 2021

Yes! It's hard for me to say exactly where it's getting stuck in this case, but it hangs before running even a single step of the simulation. Everything that happens up until the checkpointer seems to be fine. That is, the model and simulation are created fine, and I see files created for all my NetCDF outputs, but I never see a file created for the checkpointer.

@glwagner
Member

glwagner commented Apr 6, 2021

Hmm... I'm going to test whether JLD2 works with 1.6...
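
A minimal sketch of such a check, independent of Oceananigans: it writes a small array to a temporary JLD2 file and reads it back. It assumes JLD2 is installed in the active environment; the function name `jld2_roundtrip`, the file name, and the key `"u"` are hypothetical and chosen only for illustration.

```julia
using JLD2  # assumption: JLD2 is available in the active environment

# Write a small 3D array to a temporary JLD2 file and read it back.
# If this function returns, plain JLD2 file I/O is not what hangs.
function jld2_roundtrip(dir = mktempdir())
    path = joinpath(dir, "roundtrip_test.jld2")   # hypothetical file name
    data = collect(reshape(1.0:64.0, 4, 4, 4))    # 4×4×4 Float64 array

    # Write the array under the (hypothetical) key "u"
    jldopen(path, "w") do file
        file["u"] = data
    end

    # Read it back; the do-block's value is returned by jldopen
    loaded = jldopen(path, "r") do file
        file["u"]
    end

    return loaded == data
end

@info "JLD2 round trip finished" ok = jld2_roundtrip()
```

If this returns promptly on both Julia versions, the hang is more likely in how the simulation drives the writer than in JLD2 itself.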

@mukund-gupta
Contributor

I should also note that I've had an issue with the set! function in Julia 1.6 when running on GPUs.

@glwagner
Member

glwagner commented Apr 7, 2021

It would be super super helpful if you document this on the Oceananigans issue tracker!

@tomchor
Collaborator Author

tomchor commented Apr 13, 2021

So, I also tried upgrading my whole project without upgrading Julia (so still using Julia 1.5.3) and the error persists.

To be clear, before the upgrade below everything was working normally and after the upgrade the checkpointer stopped being created. Here's the upgrade:

   Updating registry at `~/.julia/registries/General`
######################################################################## 100.0%
  Installed Showoff ───── v1.0.2
  Installed StructTypes ─ v1.6.0
  Installed Tables ────── v1.4.2
  Installed Plots ─────── v1.11.2
  Installed ArgParse ──── v1.1.4
  Installed GR ────────── v0.57.3
Updating `/glade/scratch/tomasc/ISI_jet/Project.toml`
  [c7e460c6] ↑ ArgParse v1.1.2 ⇒ v1.1.4
  [63c18a36] ↑ KernelAbstractions v0.5.4 ⇒ v0.5.5
  [9e8cae18] ↑ Oceananigans v0.53.2 ⇒ v0.54.0
  [91a5bcdd] ↑ Plots v1.11.0 ⇒ v1.11.2
Updating `/glade/scratch/tomasc/ISI_jet/Manifest.toml`
  [79e6a3ab] ↑ Adapt v3.2.0 ⇒ v3.3.0
  [c7e460c6] ↑ ArgParse v1.1.2 ⇒ v1.1.4
  [4fba245c] ↑ ArrayInterface v3.1.6 ⇒ v3.1.7
  [052768ef] ↑ CUDA v2.4.1 ⇒ v2.4.3
  [d360d2e6] ↑ ChainRulesCore v0.9.34 ⇒ v0.9.37
  [35d6a980] ↑ ColorSchemes v3.10.2 ⇒ v3.11.0
  [5ae59095] ↑ Colors v0.12.6 ⇒ v0.12.7
  [34da2185] ↑ Compat v3.25.0 ⇒ v3.27.0
  [0c68f7d7] ↑ GPUArrays v6.2.0 ⇒ v6.2.2
  [28b8d3ca] ↑ GR v0.55.0 ⇒ v0.57.3
  [d2c73de3] ↑ GR_jll v0.56.1+0 ⇒ v0.57.2+0
  [63c18a36] ↑ KernelAbstractions v0.5.4 ⇒ v0.5.5
  [da04e1cc] ↑ MPI v0.17.1 ⇒ v0.17.2
  [872c559c] ↑ NNlib v0.7.17 ⇒ v0.7.18
  [9e8cae18] ↑ Oceananigans v0.53.2 ⇒ v0.54.0
  [91a5bcdd] ↑ Plots v1.11.0 ⇒ v1.11.2
  [ea2cea3b] + Qt5Base_jll v5.15.2+0
  [ede63266] - Qt_jll v5.15.2+3
  [01d81517] ↑ RecipesPipeline v0.3.1 ⇒ v0.3.2
  [992d4aef] ↑ Showoff v0.3.2 ⇒ v1.0.2
  [2913bbd2] ↑ StatsBase v0.33.4 ⇒ v0.33.5
  [09ab397b] ↑ StructArrays v0.5.0 ⇒ v0.5.1
  [856f2bd8] ↑ StructTypes v1.5.0 ⇒ v1.6.0
  [3783bdb8] ↑ TableTraits v1.0.0 ⇒ v1.0.1
  [bd369af6] ↑ Tables v1.4.1 ⇒ v1.4.2
  [6aa5eb33] ↑ TaylorSeries v0.10.11 ⇒ v0.10.12
  [0796e94c] ↑ Tokenize v0.5.15 ⇒ v0.5.16
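
One way to confirm that the package upgrade, rather than the Julia version, triggers the hang is to roll the main packages back to the versions on the left-hand side of the log above and rerun. A minimal sketch using the standard `Pkg` API follows; the choice of which packages to roll back and the helper name `rollback!` are assumptions for illustration, not something this thread prescribes.

```julia
using Pkg

# "Before" versions copied from the Project.toml part of the upgrade log.
previous_versions = [
    Pkg.PackageSpec(name = "Oceananigans", version = "0.53.2"),
    Pkg.PackageSpec(name = "KernelAbstractions", version = "0.5.4"),
]

# Install the given versions in the active project and pin them.
function rollback!(specs)
    Pkg.add(specs)   # resolve the environment with these versions
    Pkg.pin(specs)   # keep later `Pkg.update()` calls from moving them
    return nothing
end

# rollback!(previous_versions)   # uncomment to apply, then rerun the simulation
```

Rolling back one package at a time narrows down which upgrade matters; `Pkg.free` undoes the pin afterwards.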

I was able to create a MWE this time:

using Printf
using Oceananigans
using Oceananigans: Utils, Units
using Oceananigans.OutputWriters
using Oceanostics: SingleLineProgressMessenger

# Tiny 4×4×4 grid on the CPU so the example runs quickly
grid = RegularRectilinearGrid(size=(4, 4, 4), extent=(1, 1, 1))
model = IncompressibleModel(architecture = CPU(), grid = grid)

start_time = 1e-9*time_ns()  # reference time in seconds (time_ns() returns nanoseconds)
simulation = Simulation(model, Δt=1, stop_time=50, iteration_interval=5,
                        progress=SingleLineProgressMessenger(LES=false, initial_wall_time_seconds=start_time),
                        )
println("\n", simulation, "\n")

@info "Setting up chk writer"
# Checkpoint every 5 seconds of model time
simulation.output_writers[:chk_writer] = Checkpointer(model; dir=".",
                                         prefix = "chk.test",
                                         schedule = TimeInterval(5),
                                         force = true, cleanup = true,
                                         )

println("\n", simulation, "\n")

@printf("---> Starting run!\n")
# pickup=true asks run! to resume from the latest checkpoint
run!(simulation, pickup=true)

This results in the following output:

Simulation{IncompressibleModel{CPU, Float64}}
├── Model clock: time = 0 seconds, iteration = 0 
├── Next time step (Int64): 1 second 
├── Iteration interval: 5
├── Stop criteria: Any[Oceananigans.Simulations.iteration_limit_exceeded, Oceananigans.Simulations.stop_time_exceeded, Oceananigans.Simulations.wall_time_limit_exceeded]
├── Run time: 0 seconds, wall time limit: Inf
├── Stop time: 50 seconds, stop iteration: Inf
├── Diagnostics: OrderedCollections.OrderedDict with 1 entry:
│   └── nan_checker => NaNChecker
└── Output writers: OrderedCollections.OrderedDict with no entries

[ Info: Setting up chk writer

Simulation{IncompressibleModel{CPU, Float64}}
├── Model clock: time = 0 seconds, iteration = 0 
├── Next time step (Int64): 1 second 
├── Iteration interval: 5
├── Stop criteria: Any[Oceananigans.Simulations.iteration_limit_exceeded, Oceananigans.Simulations.stop_time_exceeded, Oceananigans.Simulations.wall_time_limit_exceeded]
├── Run time: 0 seconds, wall time limit: Inf
├── Stop time: 50 seconds, stop iteration: Inf
├── Diagnostics: OrderedCollections.OrderedDict with 1 entry:
│   └── nan_checker => NaNChecker
└── Output writers: OrderedCollections.OrderedDict with 1 entry:
│   └── chk_writer => Checkpointer

---> Starting run!

And then the REPL just hangs there and nothing happens. I also see no checkpoint file created.
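
To make the "no checkpoint file" check repeatable, the output directory can be scanned programmatically. The sketch below uses only Julia's standard library and assumes checkpoint files are named `<prefix>_iteration<N>.jld2`; that naming pattern and the helper name `checkpoint_files` are assumptions, so adjust them if your Oceananigans version names its files differently.

```julia
# Return the sorted names of files in `dir` that look like checkpoints
# written with `prefix`. Assumed pattern: "<prefix>_iteration<N>.jld2".
function checkpoint_files(dir, prefix)
    is_checkpoint(name) = startswith(name, prefix * "_iteration") &&
                          endswith(name, ".jld2")
    return sort(filter(is_checkpoint, readdir(dir)))
end

# Same directory and prefix as in the MWE above
@show checkpoint_files(".", "chk.test")
```

An empty result after the run has been stuck for a while matches the behaviour described here.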

I checked checkpointer.jl and the only package it seems to use that the rest of Oceananigans doesn't is Glob, which wasn't even updated. So I'm not really sure what's causing this.

Here's versioninfo() and ]st for completeness. Let me know if something else is needed:

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)

(ISI_jet) pkg> st
Status `/glade/scratch/tomasc/ISI_jet/Project.toml`
  [c7e460c6] ArgParse v1.1.4
  [63c18a36] KernelAbstractions v0.5.5
  [9e8cae18] Oceananigans v0.54.0
  [d0ccf422] Oceanostics v0.3.0
  [5fb14364] OhMyREPL v0.5.10
  [d96e819e] Parameters v0.12.2
  [91a5bcdd] Plots v1.11.2
  [276daf66] SpecialFunctions v1.3.0
  [de0858da] Printf
  [10745b16] Statistics

@tomchor tomchor changed the title Checkpoint is never created using Julia 1.6 Checkpoint is never created after upgrade to latest version (both with Julia 1.5 and 1,6) Apr 13, 2021
@tomchor tomchor changed the title Checkpoint is never created after upgrade to latest version (both with Julia 1.5 and 1,6) Checkpoint is never created after upgrade (both with Julia 1.5 and 1,6) Apr 13, 2021
@tomchor tomchor changed the title Checkpoint is never created after upgrade (both with Julia 1.5 and 1,6) Checkpoint is never created after upgrade (both with Julia 1.5 and 1.6) Apr 13, 2021
@tomchor
Collaborator Author

tomchor commented Apr 28, 2021

This appears to be fixed with the latest version of Oceananigans and Julia 1.5.4 (probably due to #1621).

@tomchor tomchor closed this as completed Apr 28, 2021
4 participants