convert to pure SoA particle containers #515
Comments
On GPUs, the SoA particle layout is 1.73x to 2.25x faster than the default (AoS) particle layout for PIC codes. See performance benchmarks: ECP-WarpX/impactx#348.
To give more details:
Generally, this can improve performance on both CPUs and GPUs (see: ImpactX link) because of better-aligned memory access for positions and IDs, and because of memory bandwidth savings when the id+cpu are not accessed. I write "can" because some kernels (as seen in the WarpX PR) are very register heavy (have low occupancy) or are bottlenecked by other parts, e.g., atomics, and for those the change does not show an immediate improvement on its own.
We also see performance improvements on CPU (see: ImpactX Drift vs. Quad), but notably there is one other effort to be aware of: CPU performance these days is mostly about vectorization. Using an SoA layout is a prerequisite for easier autovectorization and/or manual vectorization (the first step with the old AoS layout was packing into SIMD vectors).
So all in all: there is no downside to transitioning to a pure SoA layout.
Other things to consider
@ax3l Is there an SoA equivalent of amrex::ParticleInterpolator? We use it here: Line 1153 in 2d863d6
We could copy and paste the implementation we are using and rewrite it for SoA particle tiles, but ideally that would be avoided.
Describe the proposal
For performance reasons, we should convert the CICParticles to "pure" SoA particles, where all of the particle data is stored in memory in "structure of arrays" layout. This improves performance on all platforms, because this data layout allows for vectorization (on CPU) and for memory coalescing (on GPU).
Describe alternatives you've considered
We could keep it as is, which does not achieve high performance (compared to pure SoA). This might be okay, since we are probably not dominated by the cost of particle operations.
Additional context
WarpX has done so here: ECP-WarpX/WarpX#4653
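To make the layout difference concrete, here is a minimal, self-contained C++ sketch of the same drift update written against an AoS and a pure SoA container. It does not use AMReX: `ParticleAoS`, `ParticlesSoA`, `drift_aos`, and `drift_soa` are hypothetical names chosen for illustration, and the set of components (position, velocity, a packed `idcpu` word) is an assumption, not the actual CICParticles definition.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical AoS particle (not the AMReX type): all components of one
// particle are interleaved in memory, including the packed id/cpu word.
struct ParticleAoS {
    double x, y, z;
    double vx, vy, vz;
    std::uint64_t idcpu;
};

// Hypothetical pure SoA container: one contiguous array per component.
struct ParticlesSoA {
    std::vector<double> x, y, z;
    std::vector<double> vx, vy, vz;
    std::vector<std::uint64_t> idcpu;

    explicit ParticlesSoA(std::size_t n)
        : x(n), y(n), z(n), vx(n), vy(n), vz(n), idcpu(n) {}

    std::size_t size() const { return x.size(); }
};

// AoS drift: consecutive x values are a full struct apart, and the unused
// idcpu word shares cache lines with the data the kernel actually reads.
void drift_aos(std::vector<ParticleAoS>& p, double dt) {
    for (auto& q : p) {
        q.x += q.vx * dt;
        q.y += q.vy * dt;
        q.z += q.vz * dt;
    }
}

// SoA drift: each loop walks unit-stride arrays, which is the access
// pattern compilers can autovectorize and GPUs can coalesce. The idcpu
// array is never touched by this kernel.
void drift_soa(ParticlesSoA& p, double dt) {
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) { p.x[i] += p.vx[i] * dt; }
    for (std::size_t i = 0; i < n; ++i) { p.y[i] += p.vy[i] * dt; }
    for (std::size_t i = 0; i < n; ++i) { p.z[i] += p.vz[i] * dt; }
}
```

Both functions compute the same result; only the memory access pattern differs. Whether the SoA version is actually faster for a given kernel still depends on the caveats raised in the comments (register pressure, atomics).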