convert to pure SoA particle containers #515

BenWibking · 2024-01-31T00:56:56Z

Describe the proposal
For performance reasons, we should convert the CICParticles to "pure" SoA particles, where all of the particle data is stored in memory in a "structure of arrays" layout. This improves performance on GPUs, with no measurable effect on CPUs, for unclear reasons ~~on all platforms, due to this data layout allowing for vectorization (on CPU) and for memory coalescing (on GPU)~~.
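To make the layout difference concrete, here is a minimal sketch in plain C++. It does not use the AMReX particle classes; `ParticleAoS`, `ParticlesSoA`, and `total_mass` are hypothetical names chosen for illustration, and the field set (position, mass, packed id+cpu word) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// AoS: one struct per particle, so the fields of a particle are interleaved
// in memory. Reading only `mass` still strides over every other field.
// (Hypothetical type, not the AMReX particle struct.)
struct ParticleAoS {
    double x, y, z;
    double mass;
    std::uint64_t idcpu;
};

// Pure SoA: one contiguous array per field. Reading only `mass` touches
// only the mass array, with unit stride.
struct ParticlesSoA {
    std::vector<double> x, y, z;
    std::vector<double> mass;
    std::vector<std::uint64_t> idcpu;

    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), mass(n), idcpu(n) {}
    std::size_t size() const { return mass.size(); }
};

// Same reduction over both layouts; only the memory access pattern differs.
double total_mass(const std::vector<ParticleAoS>& p) {
    double sum = 0.0;
    for (const ParticleAoS& q : p) {
        sum += q.mass;  // stride of sizeof(ParticleAoS) bytes between reads
    }
    return sum;
}

double total_mass(const ParticlesSoA& p) {
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        sum += p.mass[i];  // stride of sizeof(double) bytes between reads
    }
    return sum;
}
```

On a GPU, the SoA version lets neighboring threads read neighboring addresses, which is what makes coalesced loads possible.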

Describe alternatives you've considered
We could keep the current layout, which is slower than pure SoA. This might be okay, since we are probably not dominated by the cost of particle operations.

Additional context
WarpX has done so here: ECP-WarpX/WarpX#4653

BenWibking · 2024-02-09T02:06:09Z

SoA performance on GPUs is 1.73x to 2.25x faster than the default (AoS) particle layout for PIC codes.

See performance benchmarks: ECP-WarpX/impactx#348.

ax3l · 2024-02-10T07:48:25Z

To give more details:

> This improves performance on GPUs, with no measurable effect on CPUs

Generally, this can improve performance on both CPUs and GPUs (see the ImpactX link), because memory access for positions and IDs is better aligned and because memory bandwidth is saved when id+cpu are not accessed.

I write "can" because some kernels (as seen in the WarpX PR) that are very register-heavy (low occupancy) or are bottlenecked by other parts, e.g., atomics, do not show an immediate improvement from this change on its own.

We also see performance improvements on CPU (see: ImpactX Drift vs. Quad), but there is one other aspect to be aware of: CPU performance these days depends mostly on vectorization. An SoA layout is a prerequisite for easier autovectorization and/or manual vectorization (with the old AoS layout, the first step was packing into SIMD vectors).
Simple functions now auto-vectorize with your compiler; more complex ones will be easier to vectorize manually or semi-manually.
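As a sketch of the kind of "easy function" meant here, consider a drift-style position update over SoA arrays. This is plain C++ written for illustration, not code from ImpactX or AMReX; the function name `drift` and its signature are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Advance positions by one drift step: x[i] += vx[i] * dt.
// Both arrays are contiguous with unit stride and the iterations are
// independent, so optimizing compilers can typically vectorize this loop
// (possibly after inserting a runtime check that the arrays do not overlap).
void drift(std::vector<double>& x, const std::vector<double>& vx, double dt) {
    const std::size_t n = x.size();
    double* xp = x.data();
    const double* vp = vx.data();
    for (std::size_t i = 0; i < n; ++i) {
        xp[i] += vp[i] * dt;
    }
}
```

With an AoS layout, the same update would first have to gather `x` and `vx` from strided struct fields into SIMD registers, which is the packing step mentioned above.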

So all in all: there is no downside to transitioning to a pure SoA layout.

Other things to consider

- The only known performance regression so far (which is easily solvable!): ParticleContainer::RedistributeCPU for Pure SoA AMReX-Codes/amrex#3744
- Also consider using the new idcpu wrapper helpers while transitioning to pure SoA: Add ParticleIDWrapper::make_invalid() AMReX-Codes/amrex#3735

BenWibking · 2024-02-12T18:01:01Z

@ax3l Is there an SoA equivalent of amrex::ParticleInterpolator?

We use it here:

quokka/src/simulation.hpp

Line 1153 in 2d863d6

amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE(int64_t idx) {

We could copy and paste the implementation we are using and rewrite it for SoA particle tiles, but ideally that would be avoided.
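For reference, the core of a linear (cloud-in-cell) interpolator only needs a particle position, so it can be written against SoA arrays directly. The following is a minimal 1D sketch in plain C++ under that assumption. It does not reproduce the amrex::ParticleInterpolator API; `LinearWeights1D`, `linear_weights_1d`, and `deposit_mass_1d` are hypothetical names, and the grid is assumed to be cell-centered and periodic.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Linear (cloud-in-cell) weights for cell-centered data in 1D.
struct LinearWeights1D {
    int lo;       // index of the cell whose center lies at or left of x
    double w_lo;  // weight assigned to cell lo
    double w_hi;  // weight assigned to cell lo + 1
};

// plo is the lower domain edge, dxi the inverse cell size.
LinearWeights1D linear_weights_1d(double x, double plo, double dxi) {
    // Position in units of cells, shifted so cell centers sit at integers.
    const double s = (x - plo) * dxi - 0.5;
    const double fl = std::floor(s);
    const double frac = s - fl;
    return {static_cast<int>(fl), 1.0 - frac, frac};
}

// Deposit particle mass as density onto a periodic 1D grid.
// The particle data arrives as separate arrays (SoA), not as structs.
void deposit_mass_1d(const std::vector<double>& x,
                     const std::vector<double>& mass,
                     double plo, double dx,
                     std::vector<double>& rho) {
    const int ncell = static_cast<int>(rho.size());
    const double dxi = 1.0 / dx;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const LinearWeights1D w = linear_weights_1d(x[i], plo, dxi);
        const int lo = ((w.lo % ncell) + ncell) % ncell;  // periodic wrap
        const int hi = (lo + 1) % ncell;
        rho[lo] += mass[i] * w.w_lo * dxi;
        rho[hi] += mass[i] * w.w_hi * dxi;
    }
}
```

This serial loop is only meant to show the data flow. A GPU version would need atomic adds for the deposit, since several particles can write to the same cell.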

BenWibking added the enhancement (New feature or request) and particles labels · Jan 31, 2024
