Execution error #104

Open · brey opened this issue May 30, 2023 · 24 comments

brey (Contributor) commented May 30, 2023

We are testing our high-resolution (HR) global model with nSCHISM_hgrid_node: 11880520 and nSCHISM_hgrid_face: 14840567.

When executing, we get:

0: ABORT: AQUIRE_HGRID: ilnd_global allocation failure

This happens both for the sanity check (ipre=1) and for a normal run (ipre=0).

Any ideas why this could be happening?

We have tried up to 10 nodes (960 cores) on Azure HPC and the failure persists.

Please note that this mesh uses the full-resolution GSHHS coastline and has 180491 land boundaries. Could that be the reason?

josephzhang8 (Member) commented May 30, 2023 via email

brey (Contributor, Author) commented May 30, 2023

Thanks @josephzhang8. Two questions:

1. Is the mesh loaded in its entirety on one node before partitioning? If so, how much memory might we need for such a big mesh? I suppose this creates a high memory demand on the master node, no?

2. Can you point me to the documentation on how to use static partitioning?

josephzhang8 (Member) commented May 30, 2023 via email

pmav99 (Contributor) commented May 30, 2023

@josephzhang8 Thank you! The VMs we use have 448 GB RAM and 120 cores, but we can use fewer cores, e.g. 96 per node, which would give us roughly 4.7 GB/core (448/96). So this amount of RAM per core should be doable.

Are the instructions the same for SCHISM 5.9? That's what we currently use.

josephzhang8 (Member) commented May 30, 2023 via email

pmav99 (Contributor) commented May 31, 2023

I tried to follow the instructions for the static partitioning, but the METIS preparation step (i.e. step 2) is failing with a segmentation fault. The problem is that:

  1. We have a global mesh with full-resolution coastlines. This translates to 180491 land boundaries, with the smallest one consisting of 3 nodes and the largest one (Eurasia+Africa) consisting of 1181108 nodes.
  2. The code tries to allocate a single 2D array for all the land boundaries, so it needs enough RAM for 8 * 180491 * 1181108 bytes ≈ 1705 GB, and this obviously fails (arithmetic sketched below).
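
For reference, a minimal sketch of that arithmetic (assuming 8-byte array elements; with default 4-byte integers the numbers halve, but the allocation would still fail):

program ilnd_mem
  implicit none
  !mesh figures from above; 8 bytes/element is an assumption
  integer(8), parameter :: nland = 180491, mnlnd = 1181108
  integer(8), parameter :: bytes_per_elem = 8
  write(*,'(a,f8.1,a)') 'ilnd(nland,mnlnd) needs ~', &
    real(nland*mnlnd*bytes_per_elem,8)/1.0d9, ' GB per process'
end program ilnd_mem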

The good news is that, from what I understood, the METIS preparation script does not really use the ilnd table. So if we comment out the lines referencing ilnd, the script runs and produces the graphinfo file. The relevant block of code is:

allocate(ilnd(nland,mnlnd),stat=stat)
! Aquire global land boundary segments and nodes
rewind(14); read(14,*); read(14,*);
do i=1,np; read(14,*); enddo; !skip node block
do i=1,ne; read(14,*); enddo; !skip element block
read(14,*); read(14,*);
do k=1,nope; read(14,*) nn; do i=1,nn; read(14,*); enddo; enddo; !skip open bnds
read(14,*); read(14,*);
nlnd=0; ilnd=0; !first assignment to ilnd; the segfault happens here
do k=1,nland
  read(14,*) nn
  do i=1,nn
    read(14,*) ip
    nlnd(k)=nlnd(k)+1
    ilnd(k,nlnd(k))=ip
    if(isbnd(ip)==0) isbnd(ip)=-1 !overlap of open bnd
  enddo !i
enddo !k

@josephzhang8 Can you confirm that ilnd is indeed not needed for the METIS preparation?

BTW, the segmentation fault happens when we first try to assign a value to ilnd (i.e. line 287). Checking stat after the allocation (line 272) would make it a bit easier to figure out what is going on.
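
A sketch of what I mean (names as in the snippet above; the message text is just an example):

allocate(ilnd(nland,mnlnd),stat=stat)
if(stat/=0) then
  write(*,*) 'metis prep: ilnd allocation failure; nland=',nland,' mnlnd=',mnlnd
  stop
endif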

If you want, I can make a PR to remove ilnd or to add a check after the allocation; no problem if you'd rather fix it on your end.

All that being said, I think that our main problem remains. If I understand the code correctly (and I should mention that my Fortran knowledge is nothing to speak of), the main SCHISM code also tries to do the exact same allocation. The relevant lines are:

allocate(ilnd_global(nland_global,mnlnd_global),stat=stat);
if(stat/=0) call parallel_abort('AQUIRE_HGRID: ilnd_global allocation failure')

If this is true, then for the grid in question we do need 1705 GB per process, which unfortunately is not really feasible...

josephzhang8 (Member) commented May 31, 2023 via email

josephzhang8 (Member) commented Jun 6, 2023 via email

pmav99 (Contributor) commented Jun 8, 2023

Thank you @josephzhang8.
We did test dividing the land boundaries on a smaller model, and it does seem to work fine. We will let you know how it goes after testing it on the global model, too.
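
For the record, the splitting is conceptually something like this (a rough sketch, not the actual tool we used; max_seg is an arbitrary cap, and consecutive segments share one node so the coastline polyline stays continuous):

!Sketch: split one land boundary of nn nodes into segments of
!at most max_seg nodes each, for the hgrid.gr3 land-boundary block.
program split_land_bnd
  implicit none
  integer, parameter :: nn = 1181108   !largest boundary from above
  integer, parameter :: max_seg = 1000 !assumed, tunable cap
  integer :: istart, iend, nseg
  nseg = 0; istart = 1
  do while (istart < nn)
    iend = min(istart + max_seg - 1, nn)
    nseg = nseg + 1
    !here one would write iend-istart+1 and the node ids istart..iend
    !of the original segment as a new land-boundary segment
    istart = iend !last node is reused as first node of next segment
  enddo
  write(*,*) 'segments:', nseg, ', new mnlnd <=', max_seg
end program split_land_bnd

With max_seg = 1000 the ilnd table for this mesh would need roughly 8 * (180491 + ~1200) * 1000 bytes ≈ 1.5 GB instead of ~1705 GB.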

brey (Contributor, Author) commented Jun 12, 2023

Dear @josephzhang8. We have split the boundaries on the big mesh, and although the sanity check seems to work, we were unable to effectively run it on Azure. You can find the model here. Hopefully you can use it as a test case for possible modifications in SCHISM. If you manage to make it work on your end, we would be interested to try it out. In the meantime we'll try something simpler. Thanks.

josephzhang8 (Member) commented Jun 12, 2023 via email (four consecutive replies)

josephzhang8 (Member) commented Jun 13, 2023 via email

brey (Contributor, Author) commented Jun 13, 2023

Great news!

I know that by forcing it to follow such a convoluted coastline I am asking for trouble. I will try some ways to make it more manageable and let you know. We'll also try pre-partitioning, and with your estimate of RAM/core we'll give it another try.

Based on your experience, could such a mesh work? I know that the mesh is not balanced, and I wonder whether the skewness of the elements might also cause problems, both in terms of stability and accuracy.

josephzhang8 (Member) commented Jun 13, 2023 via email

pmav99 (Contributor) commented Jun 13, 2023

@josephzhang8 The NetCDF file is indeed 24 GB, but it is uncompressed. Does SCHISM support reading compressed/deflated NetCDF files?

josephzhang8 (Member) commented Jun 13, 2023 via email

brey (Contributor, Author) commented Jun 13, 2023

Joseph, indeed the meteo forcing is every hour. Does that mean that wtiminc should be 3600?
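
i.e. in param.nml something like the following (a sketch based on the sample param.nml; worth double-checking the section and units for 5.9):

&OPT
  wtiminc = 3600. !time step (s) of the atmospheric forcing in sflux
/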

josephzhang8 (Member) commented Jun 13, 2023 via email

josephzhang8 (Member) commented Jun 14, 2023 via email
