Time averaged field/moment output #57

Open · jackmatt16 opened this issue Aug 6, 2019 · 15 comments

@jackmatt16

Is this in the code? If so, how do we turn it on? Sorry if I'm missing something obvious that we discussed already!

@germasch
Contributor

germasch commented Aug 6, 2019

It should be supported by just setting the corresponding parameters, in the same place where pfield_step is set:

// ======================================================================
// OutputFieldsCParams

struct OutputFieldsCParams
{
  const char *data_dir = {"."};

  int pfield_step = 0;          // how often to write the instantaneous ("p") fields
  int pfield_first = 0;

  int tfield_step = 0;          // how often to write the time-averaged ("t") fields
  int tfield_first = 0;
  int tfield_length = 1000000;  // number of steps before the output over which to average
  int tfield_every = 1;         // add the current fields into the average every this many steps

  Int3 rn = {};
  Int3 rx = {1000000, 1000000, 100000};
};

I think this is the way it works (you can look at src/include/output_fields_x.hxx for the actual code):

  • tfield_step says how often you want to write the averaged fields
  • tfield_every is how often the current fields are added into the average
  • tfield_length lets you limit the averaging to the given number of steps before the output

So, as an example, tfield_step = 100, tfield_every = 2, tfield_length = 10 should average the output from steps 92, 94, 96, 98, and 100 and write it out at step 100.
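To make that concrete, here's a minimal sketch of what those example values would look like when set in the case (the pfield_step value is just a placeholder):

OutputFieldsCParams outf_params{};
outf_params.pfield_step = 100;   // placeholder: instantaneous field output interval
outf_params.tfield_step = 100;   // write the averaged fields every 100 steps
outf_params.tfield_every = 2;    // add the current fields into the average every 2nd step
outf_params.tfield_length = 10;  // only average over the last 10 steps before each output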

Having said this, I haven't used this in a while, so there's definitely a chance that something may not work as expected; let me know if you see any issues.

Also, for what it's worth, I'm in the process of simplifying that whole area in a way that should make it more straightforward to implement the output directly in the case yourself, making it easier to adapt to specific needs.

@jackmatt16
Author

The code you've shown above looks different from what's in my psc_flatfoil_xz.cxx:

// -- output fields
OutputFieldsCParams outf_params{};
outf_params.pfield_step = 2000;
std::vector<std::unique_ptr> outf_items;
outf_items.emplace_back(new FieldsItem_E_cc(grid));
outf_items.emplace_back(new FieldsItem_H_cc(grid));
outf_items.emplace_back(new FieldsItem_J_cc(grid));
outf_items.emplace_back(new FieldsItem_n_1st_cc(grid));
outf_items.emplace_back(new FieldsItem_v_1st_cc(grid));
outf_items.emplace_back(new FieldsItem_T_1st_cc(grid));
OutputFieldsC outf{grid, outf_params, std::move(outf_items)};

// -- output particles
OutputParticlesParams outp_params{};
outp_params.every_step = 0;
outp_params.data_dir = ".";
outp_params.basename = "prt";
OutputParticles outp{grid, outp_params};

I am up to date on the branch - is this located in another file?

@germasch
Contributor

germasch commented Aug 6, 2019

Yeah, sorry, I wasn't clear. The above is from output_fields_c.hxx, where the parameters are defined, but they're set by the case.

So here in psc_flatfoil_yz.cxx:

outf_params.pfield_step = 2000;

you can add

outf_params.tfield_step = 2000;
outf_params.tfield_every = 100;

or something like that.

@jackmatt16
Author

Ahh, that makes perfect sense. Thanks for the quick replies, Kai!

@jackmatt16
Author

After trying to use the tfields, I find I get a segfault, right as the tfd-000000_h5 file is created (nothing is written into it). This occurs right after field balancing, before the first time step. The log file is located (and should be accessible) at /gpfs/alpine/proj-shared/ast147/jmatteuc/flatfoil-summit-547640/flatfoil_summit004.547640

@germasch
Contributor

germasch commented Aug 7, 2019

Can you try again with the latest master? I managed to reproduce a crash that was likely it, but maybe that wasn't the only problem.

@jackmatt16
Author

Hmmm, just tried but I got another crash. This time the output file only contains this:


[c03n17:140751] *** An error occurred in MPI_Alltoallv
[c03n17:140751] *** reported by process [2154496201,3]
[c03n17:140751] *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
[c03n17:140751] *** MPI_ERR_ARG: invalid argument of some other kind
[c03n17:140751] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c03n17:140751] *** and potentially your MPI job)
[c03n17:140750] *** An error occurred in MPI_Alltoallv
[c03n17:140750] *** reported by process [2154496201,2]
[c03n17:140750] *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
[c03n17:140750] *** MPI_ERR_ARG: invalid argument of some other kind
[c03n17:140750] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c03n17:140750] *** and potentially your MPI job)
[c03n17:140749] *** An error occurred in MPI_Alltoallv
[c03n17:140749] *** reported by process [2154496201,1]
[c03n17:140749] *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
[c03n17:140749] *** MPI_ERR_ARG: invalid argument of some other kind
[c03n17:140749] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c03n17:140749] *** and potentially your MPI job)


Sender: LSF System lsfadmin@batch5
Subject: Job 553200: <flatfoil_summit004> in cluster Exited

Job <flatfoil_summit004> was submitted from host by user in cluster at Thu Aug 8 10:49:34 2019
Job was executed on host(s) <1*batch5>, in queue , as user in cluster at Thu Aug 8 10:52:35 2019
                            <42*c03n17>
</ccs/home/jmatteuc> was used as the home directory.
</ccs/home/jmatteuc/PSC_code/cver_new/psc/submissions> was used as the working directory.
Started at Thu Aug 8 10:52:35 2019
Terminated at Thu Aug 8 10:53:12 2019
Results reported at Thu Aug 8 10:53:12 2019

The output (if any) is above this job summary.


I am running the small case on GPUs with tfd and pfd on.

@germasch
Contributor

germasch commented Aug 8, 2019

Can you attach the psc_flatfoil_yz.cxx you're using here, and also say how many procs you're running on? So far I haven't been able to reproduce it.

@jackmatt16
Author

psc_flatfoil_yz.txt

@jackmatt16
Author

I've narrowed it down: it still breaks on 4 GPUs on 1 node.

@germasch
Contributor

germasch commented Aug 9, 2019

Well, I still haven't been able to reproduce it. I ran the current master with your psc_flatfoil_yz.cxx, compiled with CUDA support, on 4 procs, and the 1/2 hour run made it to step 3500 without issue. When did the crash happen for you? Did you run the code with any command line options (there are some, though very few, left)?

The error you're seeing indicates that it happens while redistributing field data for writing output, but it's not very specific beyond that. Can you point me to where the log file is? I see other runs in your dir on summit, but those other runs don't seem to have the logs that go with them.

@jackmatt16
Author

Hmmm, I'm pretty stumped. I just cloned a whole new version, added only the two lines for the tfields, and ran it, and I get the same error. All the files should be here:

/gpfs/alpine/proj-shared/ast147/jmatteuc/flatfoil-summit-556325

Also, just a heads up: running for half an hour and only getting to step 3500 seems a bit slower than normal - I think something changed with the heating (or injection, but I'm pretty sure it's the heating) that's now taking an inordinate amount of time.

Thanks!

@germasch
Contributor

germasch commented Aug 9, 2019

Okay, I can reproduce it now. Somehow the problem doesn't seem to happen for a Debug build (and I actually tried both yesterday, but I think I ended up confusing myself by having both a build-summit-dbg-gpu and a build-summit-gpu-dbg directory...).

On the heating performance, I saw the other issue. It sounds like it's related to the change where I now generate unique random numbers for each CUDA thread, rather than just 1024 of them, but that shouldn't have a lasting impact on performance, so I'll need to look at that next.

@germasch
Contributor

So unfortunately, I'm still stumped about what's happening. You can work around the issue by adding -DCMAKE_CXX_FLAGS_RELEASE="-O2 -UNDEBUG" to your cmake invocation. This is by no means a real fix, though, and it's still totally unclear to me why it even makes a difference.

To give a bit of background, normally the release flags contain -DNDEBUG, which does the same as #define NDEBUG inside the code. The only place that should have an impact is assert statements, which are used quite a bit throughout the code. If NDEBUG is defined, meaning "no debug", these assert statements are dropped from the code entirely, i.e., there will be no assertion failures since they never get checked. By default (or with -UNDEBUG, which really does nothing), those statements are checked at runtime. But with the assertions enabled, the code appears to run fine, so it doesn't make much sense that dropping statements which never get triggered anyway makes things go bad.

There is one exception to this logic, though I don't think I've done anything like this: if you write

// ...
assert(i++ < 100);
// ...

having NDEBUG defined vs. not will make a difference in behavior, i.e., if NDEBUG is defined, the i++ will not happen at all.
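A minimal, self-contained sketch (not from the PSC code) that shows the difference:

#include <cassert>
#include <cstdio>

int main()
{
  int i = 0;
  // With -DNDEBUG (the usual Release setting), the whole assert, including
  // the i++ side effect, is compiled out, so i stays 0.
  // Without NDEBUG, the condition is evaluated and i ends up as 1.
  assert(i++ < 100);
  std::printf("i = %d\n", i);
  return 0;
}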

Anyway, there are upcoming changes to the workings of the output, so maybe they'll happen to make a difference. Hopefully the workaround will be enough to allow you to move on for now.

@jackmatt16
Author

Ok, thanks Kai. I think for now this will suffice, and it's not the worst deal to just output p-fields. The delay that happens in the heating operator is pretty crippling right now - it kinda disables the ability to do any bigger runs, so I'd say that's the top-priority bug now. Thanks again.
