Improve NEST performance by revised connection exchange and spike delivery #2926

heplesser · 2023-09-07T21:45:53Z

This pull request significantly improves NEST performance by

compressing and transmitting connections in an improved way;
separating spike gathering (end of update loop) and spike delivery (beginning of next update loop);
exchanging all spikes by at most two MPI exchanges.

The changes are described in detail below.

Note that transmission of SecondaryEvents is essentially not affected by this PR since they are written directly into ready-made buffers on the receiver side. There is only some effect on connection transmission to the presynaptic side.

Many NEST developers have contributed to this work, especially @diesmann, @suku248, @JoseJVS, @med-ayssar, @mlober, @hakonsbm, @ackurth and @JanVogelsang.

Breaking changes in NEST

Spikes from last slice not delivered

Because spikes emitted during the final time slice of a simulation are no longer delivered, devices that collect data through the normal spike transmission mechanism (e.g., correlation detector) will miss input from the last slice and return different results.
Normal recording devices (e.g., spike_recorder) are not affected since they receive spikes locally.
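
The effect can be illustrated with a toy model of the revised update loop. This is only a sketch: `simulate` and `emit` are hypothetical names, and the list `buffer` stands in for the spike transmission buffer. Delivery happens at the start of a slice and gathering at its end, so whatever is gathered in the final slice stays undelivered.

```python
def simulate(n_slices, emit):
    """Toy update loop: deliver at the start of a slice, gather at its end.

    emit(slice_index) returns the spikes emitted during that slice.
    Returns (delivered, pending); pending holds the spikes gathered in the
    final slice, which are never delivered.
    """
    delivered = []
    buffer = []  # stands in for the spike transmission buffer
    for s in range(n_slices):
        if s > 0:
            delivered.extend(buffer)  # deliver spikes gathered in the previous slice
        buffer = list(emit(s))  # node update followed by gathering
    return delivered, buffer


delivered, pending = simulate(3, lambda s: [f"spike@slice{s}"])
print(delivered)  # ['spike@slice0', 'spike@slice1']
print(pending)    # ['spike@slice2']
```

Simulating one slice longer delivers the pending spikes, which is why tests relying on transmitted spikes may need a slightly longer simulation time.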

Removed kernel parameters

sort_connections_by_source — use_compressed_spikes remains, which automatically activates connection sorting. There is simply no relevant use case where sorting but not compressing would make sense.
adaptive_spike_buffers — spike buffers are now always adaptive, see section on Buffer growth and shrinking
max_buffer_size_spike_data — there is no upper limit since all spikes need to be transmitted in one round

New kernel parameters

The following parameters control or report spike buffer resizing (see Buffer growth and shrinking for details):

spike_buffer_grow_extra
spike_buffer_shrink_limit
spike_buffer_shrink_spare
spike_buffer_resize_log

Things to check in particular

Formatting of kernel docstring (kernel_manager.h)
Placement of timers

Modified tests

test_mip_corrdet — need to simulate one step longer due to delivery at beginning of next step
test_regression_issue-1034 — need to subtract min delay due to moved delivery

Connection compression and transmission

There is now only the choice between compressed spikes including sorting by sources and fully unsorted raw spike transmission.
Code for "raw" connection transmission to the pre-synaptic side is essentially unchanged
There is now a separate method gather_target_data_compressed().
Construction and transmission of the compressed target tables is completely revised. The eventual compression is the same as before.
Note: the use of spike in a number of method and data structure names below has historic roots and should be cleaned up in a follow-up step.

Data Structures

`SourceTable::sources_`

sources_[target_thread][syn_id][lcid] stores raw connection info, built during connection creation

`SourceTable::compressible_sources_`

Used for first level compression on each thread
compressible_sources_[target_thread][syn_id][idx] contains one pair<source_node_id, SpikeData(target_thread, syn_id, conn_lcid, 0)> entry, mapping each source node id to a SpikeData entry identifying the first sources_ entry for that source node in the sorted sources_ array
Is cleared after second-level compression

`ConnectionManager::compressed_spike_data`

compressed_spike_data[syn_id][source_index][target_thread] is the result of the second compression step
For each synapse ID and unique source neuron with targets on a given thread, it stores the pair from compressible_sources_.
SourceTable::compressed_spike_data_map_ provides an index from source node id to source_index in compressed_spike_data.

`SourceTable::compressed_spike_data_map_`

compressed_spike_data_map_[syn_id] maps each unique source node id onto the corresponding source_index in the compressed_spike_data
After this mapping has been transferred to the pre-synaptic side, the map can be deleted

`CSDMapEntry`

Compact representation for entries in compressed_spike_data_map_.

`ConnectionManager::iteration_state_`

vector< pair< syn_id, map< source_gid, CSDMapEntry >::const_iterator > >
For each thread and syn_id, maintains an iterator to the corresponding part of compressed_spike_data_map_.
Is used while filling connection transmission buffers

Source compression

Compression works as follows:
1. Sort connections by source neuron (thread parallel)
2. Per-thread compression (SourceTable::collect_compressible_sources(), thread parallel)
  - In sorted SourceTable::sources_, create entry in compressible_sources_ connecting source ID to info about first entry for that source in sources_ and mark sequence of connections from that source neuron as source_has_more_targets.
3. Compression across threads (SourceTable::fill_compressed_spike_data(), serial)
  1. For each syn_id, iterate over connections on all threads in compressible_sources_
  2. If the source node is not known in the compressed_spike_data_map_ (for that syn_id)
    1. Append an empty entry to compressed_spike_data with one slot per thread
    2. Register this entry in compressed_spike_data_map_
  3. Register the connection under the right thread, writing the SpikeData entry created in the first compression step.
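
The two compression levels can be sketched as follows. This is a simplified model, not the kernel code: plain lists and dicts replace the C++ containers, only the lcid of the first connection is stored instead of a full SpikeData entry, the source_has_more_targets marking is omitted, and the function signatures are assumptions made for illustration.

```python
def collect_compressible_sources(sources):
    """First level: per thread and syn_id, map each source to its first lcid.

    sources[thread][syn_id] is a list of source node ids, already sorted.
    """
    compressible = []
    for per_syn in sources:
        thread_result = {}
        for syn_id, src_list in per_syn.items():
            first = {}
            for lcid, src in enumerate(src_list):
                first.setdefault(src, lcid)  # keep only the first entry per source
            thread_result[syn_id] = first
        compressible.append(thread_result)
    return compressible


def fill_compressed_spike_data(compressible):
    """Second level: one entry per (syn_id, source) with one slot per thread."""
    num_threads = len(compressible)
    data = {}   # syn_id -> list of per-thread slot lists
    index = {}  # syn_id -> {source: source_index}
    syn_ids = sorted({syn_id for per_syn in compressible for syn_id in per_syn})
    for syn_id in syn_ids:
        data[syn_id], index[syn_id] = [], {}
        for tid, per_syn in enumerate(compressible):
            for src, lcid in per_syn.get(syn_id, {}).items():
                if src not in index[syn_id]:
                    # Unknown source: append an empty entry and register it.
                    index[syn_id][src] = len(data[syn_id])
                    data[syn_id].append([None] * num_threads)
                data[syn_id][index[syn_id][src]][tid] = lcid
    return data, index


sources = [
    {0: [3, 3, 7]},  # thread 0, syn_id 0
    {0: [3, 9, 9]},  # thread 1, syn_id 0
]
compressible = collect_compressible_sources(sources)
data, index = fill_compressed_spike_data(compressible)
print(index[0])  # {3: 0, 7: 1, 9: 2}
print(data[0])   # [[0, 0], [2, None], [None, 1]]
```

In this toy example source 3 has targets on both threads, so its entry has both slots filled, while sources 7 and 9 each occupy a single slot.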

Connection transmission

Connection transmission works in multiple rounds if necessary, buffer size may be adjusted
Collocation of data is assigned to "assigned ranks"
Writing is mainly done by ConnectionManager::fill_target_buffer as follows:
1. Iterate through entire compressed_spike_data_map_, outermost by syn_id, then over source entries.
2. Find rank for given source—this is where we must send information.
3. If no space left in buffer chunk for this rank
  1. Store iteration state for iteration through csdmap and target buffer
  2. Inform gather that we did not write all data
4. Otherwise, write data and move to next entry
5. If all data has been written, signal this to gather
Gather MPI-exchanges data that has been written to buffers
If not all data has been transmitted, do more rounds until all data has been transmitted.
For each compressed set of connections, we send to the presynaptic side
- thread and thread-local id of source neuron
- syn_id
- index into compressed_spike_data
NOTE: The iteration scheme is different from the original approach. We stop as soon as a single rank has filled its part of the buffer. In the original, iteration would continue until the last rank had filled its chunk. CSDMap entries were marked as processed when written. On the next round, iteration through CSDMap would start at the point where the first rank had to stop writing, skipping all entries that had been written.
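
The resumable iteration described above can be sketched in a single process. The names `transmit_all` and `rank_of` are hypothetical, entries are reduced to plain source ids, extending the `received` lists stands in for the MPI exchange, and buffer resizing between rounds is not modelled.

```python
def fill_target_buffer(entries, rank_of, num_ranks, chunk_size, start):
    """Write entries[start:] into per-rank chunks until one chunk is full.

    Returns (chunks, next_start, complete), where next_start is the saved
    iteration state for the following round.
    """
    chunks = [[] for _ in range(num_ranks)]
    pos = start
    while pos < len(entries):
        rank = rank_of(entries[pos])
        if len(chunks[rank]) == chunk_size:
            return chunks, pos, False  # no space left: stop and ask for another round
        chunks[rank].append(entries[pos])
        pos += 1
    return chunks, pos, True


def transmit_all(entries, rank_of, num_ranks, chunk_size):
    """Repeat fill-and-exchange rounds until every entry has been sent."""
    assert chunk_size >= 1  # otherwise no round could make progress
    received = [[] for _ in range(num_ranks)]
    start, complete, rounds = 0, False, 0
    while not complete:
        chunks, start, complete = fill_target_buffer(
            entries, rank_of, num_ranks, chunk_size, start
        )
        for rank, chunk in enumerate(chunks):
            received[rank].extend(chunk)  # stands in for the MPI exchange
        rounds += 1
    return received, rounds


received, rounds = transmit_all(list(range(7)), lambda src: src % 2, 2, 2)
print(received)  # [[0, 2, 4, 6], [1, 3, 5]]
print(rounds)    # 2
```

Because iteration stops at the first full chunk, the saved state is a single position and no per-entry "processed" flag is needed.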

Spike transmission

Spikes written to emitted_spikes_register_ during node update are gathered at end of time slice and exchanged between ranks by gather_spike_data() and collocate_spike_data_buffers_().
Spike gathering is done by the master thread only. Benchmarks up to 64 nodes showed no gain from the "assigned ranks" approach (assigned ranks approach available in https://github.com/heplesser/nest-simulator/tree/def_ar)
At beginning of a time slice, the spike transmission buffer contains all spikes a given rank needs to deliver
Delivery happens thread-parallel in deliver_events_()

Data structures

Spike register

During node update, all spikes that need to be transmitted globally are written to the emitted_spikes_register
This is much simplified compared to master and only a 2D structure
- Outer dimension is the node-updating thread
- This contains pointers to the inner dimension to ensure that the metadata of the inner vector (begin/end markers etc) is local to the thread updating the neurons and pushing elements into the inner vector
- Contains entries combining SpikeData (to be written directly to transmission buffer) and rank of target neuron (for writing to correct section of target buffer)
Is doubled up for data with/without offgrid times
Data is written into emitted_spikes_register_ in EventDeliveryManager::send_remote(), which is called when a node sends a spike
Data is read out from emitted_spikes_register_ in EventDeliveryManager::collocate_spike_data_buffers_(), called from gather_spike_data()
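
A minimal sketch of the register and its read-out, under simplifying assumptions: the Python class and functions only mirror the names used above, a tuple stands in for the SpikeData payload, and the list `num_spikes` plays the role that SendBufferPosition has for the write position in each rank's section.

```python
from dataclasses import dataclass


@dataclass
class SpikeDataWithRank:
    rank: int     # rank hosting the target neuron
    spike: tuple  # stands in for the SpikeData payload


def send_remote(register, tid, rank, spike):
    """Append a spike to the inner vector owned by the updating thread tid."""
    register[tid].append(SpikeDataWithRank(rank, spike))


def collocate(register, num_ranks, chunk_size):
    """Copy register entries into per-rank sections and count spikes per rank."""
    buffer = [[None] * chunk_size for _ in range(num_ranks)]
    num_spikes = [0] * num_ranks
    for inner in register:  # outer dimension: node-updating thread
        for entry in inner:
            pos = num_spikes[entry.rank]
            if pos < chunk_size:
                buffer[entry.rank][pos] = entry.spike
            num_spikes[entry.rank] += 1  # counted even if it did not fit
    return buffer, num_spikes


register = [[], []]  # two updating threads
send_remote(register, 0, 1, ("a",))
send_remote(register, 1, 1, ("b",))
send_remote(register, 1, 0, ("c",))
send_remote(register, 1, 1, ("d",))
buffer, num_spikes = collocate(register, num_ranks=2, chunk_size=2)
print(buffer)      # [[('c',), None], [('a',), ('b',)]]
print(num_spikes)  # [1, 3]
```

The count for rank 1 exceeds the section size, which is the situation that triggers buffer growth in the gathering step.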

`SendBufferPosition`

Marks where next spike should be written in each target-rank-specific section of spike transmission buffer
Much simplified since we dropped the assigned ranks approach
Constructor moved to cpp-file
New separate class TargetSendBufferPosition used for connection communication with assigned ranks.

`SpikeData`

Compact representation of spike for transmission in MPI buffer
Largely unchanged
Also in OffGrid version
2-bit marker definition and usage changed, see section on Spike Gathering and Transmission below
Note: As soon as spike entry is created by send_remote(), we immediately create the eventual SpikeData entry, which is later copied to the transmission buffer by collocate_spike_data_buffers_(), no more re-coding in the process
- New constructor from (Target, lag) for direct insertion to emitted_spikes_register as part of SpikeDataWithRank entry
- SpikeDataWithRank combines SpikeData with target rank information needed for eventual writing to transmission buffer.
- Construction allows emplace_back() into emitted_spikes_register, i.e., direct construction instead of construct and copy.
New method set_lcid() to allow transmission of locally required buffer size per rank in LCID field
New method get_marker()
Corrected copy constructor and assignment operator
Copy and assignment operators for OffGridSpike data
New struct SpikeDataWithRank for emitted_spikes_register, also variant for OffGrid
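
To illustrate what a compact representation with a 2-bit marker looks like, here is a bit-packing sketch. The field widths in `FIELDS` are hypothetical and chosen only for illustration; the actual SpikeData layout is defined in spike_data.h and may differ. The helper `set_lcid` mimics reusing the LCID field to carry a buffer size.

```python
# Hypothetical field widths (in bits); the real SpikeData layout may differ.
FIELDS = (("lcid", 27), ("marker", 2), ("lag", 14), ("tid", 10), ("syn_id", 9))


def pack(**values):
    """Pack all fields into one integer, lowest field first."""
    word, shift = 0, 0
    for name, bits in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit into {bits} bits")
        word |= value << shift
        shift += bits
    return word


def unpack(word):
    """Recover all fields from a packed integer."""
    out, shift = {}, 0
    for name, bits in FIELDS:
        out[name] = (word >> shift) & ((1 << bits) - 1)
        shift += bits
    return out


def set_lcid(word, lcid):
    """Overwrite only the lcid field, e.g. to carry a required buffer size."""
    bits = FIELDS[0][1]
    if not 0 <= lcid < (1 << bits):
        raise ValueError(f"lcid={lcid} does not fit into {bits} bits")
    return (word & ~((1 << bits) - 1)) | lcid


word = pack(lcid=5, marker=1, lag=3, tid=2, syn_id=7)
print(unpack(word))
print(unpack(set_lcid(word, 1000))["lcid"])  # 1000
```

The width of the LCID field bounds the buffer size that can be signalled this way.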

Deliver events first

Since NEST 2.16 (introduction of 5G kernel), presynaptic gathering and postsynaptic delivery of spikes were interleaved at the end of the simulation update loop.
This PR again separates deliver_events_() into a separate step at the beginning of each update loop.
This means that the simulation clock_ has advanced by one min_delay when spikes are delivered compared to when they were sent.
Therefore, min_delay needs to be subtracted from clock_ when computing arrival times
- Done for normal spikes when computing prepared_timestamps in deliver_events_()
- Done explicitly in several rate_*_impl.h files
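
The correction can be written out for a single slice. The function below is a sketch with hypothetical names; it assumes that a spike emitted at a given lag is stamped with the end of its update step, i.e. slice origin plus lag plus one step, which is a convention assumed here rather than taken from the text above.

```python
def arrival_steps(clock_at_delivery, min_delay, lags):
    """Time stamps (in steps) for spikes emitted at the given lags.

    clock_at_delivery is the clock when delivery runs, i.e. one min_delay
    after the start of the slice in which the spikes were emitted.
    Assumes stamp = slice_origin + lag + 1 (end of the update step).
    """
    slice_origin = clock_at_delivery - min_delay  # undo the clock advance
    return [slice_origin + lag + 1 for lag in lags]


# Spikes emitted in the slice starting at step 20 with min_delay = 10 are
# delivered when the clock already reads 30.
print(arrival_steps(30, 10, [0, 3, 9]))  # [21, 24, 30]
```

Without subtracting min_delay, every stamp would be shifted by one full slice.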

Spike gathering and transmission

All spike gathering and transmission is done by the master thread alone
If all data can be transmitted with current transmission buffer size, this is done in a single round of collocating spikes followed by MPI transmission
If, for any rank-rank pair more spikes need to be transmitted than fit into a buffer chunk, buffers are made larger so that all data will fit and the entire communication is repeated.

Marking completeness and required buffer size

All signaling needs to be done with two bits for the SpikeData::marker_ field
All entries in the transmission buffer must be usable for spike payload to avoid waste
To achieve maximum information with minimal bits, we also exploit where in a buffer certain markers occur.
We require that each per-rank chunk of the spike transmission buffer has at least two different entries.
The first entry of a chunk is called begpos, the last endpos (this position is included in the chunk, it is not one beyond)
local_max_spikes_per_rank is the largest number of spikes a given rank needs to transmit to any other rank.
global_max_spikes_per_rank is the maximum of all local_max_spikes_per_rank values. It determines the minimum required buffer chunk size.
SpikeData marker values are defined as follows:
- DEFAULT: Normal entry, cannot occur in endpos
- END: Marks last entry containing data.
  - If it occurs in endpos,
    - it implies COMPLETE
    - it indicates that local_max_spikes_per_rank of the sending rank is equal to the current buffer size
- COMPLETE: Can only occur in endpos and indicates that the sending rank could write all emitted spikes to the transmission buffer.
  - END is then in earlier position.
  - The LCID entry of endpos contains the local_max_spikes_per_rank of the corresponding sending rank.
- INVALID:
  - In begpos, indicates that no spikes are transmitted (note: END at begpos means one spike transmitted)
  - In endpos, indicates that the pertaining rank could not send all spikes.
    - The LCID entry of endpos contains the local_max_spikes_per_rank of the corresponding sending rank.

Collocation, transmission and signalling now proceeds as follows:
1. Each rank writes all spikes that fit into transmission buffer chunks and counts all spikes, also those that do not fit.
2. Each rank determines its local_max_spikes_per_rank.
3. If local_max > chunk_size, set endpos markers to INVALID and store local_max there.
4. Otherwise,
  1. if no spikes were written to a chunk, set begpos for chunk to INVALID,
    otherwise set END marker on last position written to.
  2. if some spikes were written to a chunk but the chunk not filled, set COMPLETE on endpos for chunk and store local_max in endpos LCID
5. Perform MPI exchange
6. Obtain global_max_spikes_per_rank from all local_max information obtained
7. If global_max > chunk_size, grow buffer and repeat entire process.
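
The marker rules and steps 3 to 6 can be sketched for a single per-rank chunk. This is an illustration, not the kernel code: entries are dicts with only a marker and an lcid field, `write_chunk` and `read_chunk` are hypothetical names, and the sketch additionally assumes that an empty chunk also carries COMPLETE and local_max in endpos, which the list above does not state explicitly.

```python
DEFAULT, END, COMPLETE, INVALID = range(4)  # four states fit into two bits


def write_chunk(lcids, chunk_size, local_max):
    """Sender side: fill one per-rank chunk and set its markers."""
    assert chunk_size >= 2  # begpos and endpos must be different entries
    chunk = [{"marker": DEFAULT, "lcid": 0} for _ in range(chunk_size)]
    written = min(len(lcids), chunk_size)
    for pos in range(written):
        chunk[pos]["lcid"] = lcids[pos]
    endpos = chunk_size - 1
    if local_max > chunk_size:
        # Some chunk on this rank overflowed: report the required size.
        chunk[endpos] = {"marker": INVALID, "lcid": local_max}
        return chunk
    if written == 0:
        chunk[0]["marker"] = INVALID  # nothing to send to this rank
    else:
        chunk[written - 1]["marker"] = END
    if written < chunk_size:
        chunk[endpos] = {"marker": COMPLETE, "lcid": local_max}
    return chunk


def read_chunk(chunk):
    """Receiver side: return (lcids, sender_local_max, complete)."""
    end = chunk[-1]
    if end["marker"] == INVALID:
        return [], end["lcid"], False
    if end["marker"] == END:
        # Full chunk: implies COMPLETE and local_max equal to the chunk size.
        return [e["lcid"] for e in chunk], len(chunk), True
    # COMPLETE in endpos: END or INVALID sits at an earlier position.
    if chunk[0]["marker"] == INVALID:
        return [], end["lcid"], True
    last = next(i for i, e in enumerate(chunk) if e["marker"] == END)
    return [e["lcid"] for e in chunk[: last + 1]], end["lcid"], True


print(read_chunk(write_chunk([11, 12], 4, 3)))              # ([11, 12], 3, True)
print(read_chunk(write_chunk([11, 12, 13, 14, 15], 4, 5)))  # ([], 5, False)
```

A receiver takes the maximum of the reported local_max values over all chunks to obtain global_max_spikes_per_rank and repeats the round with larger buffers if any chunk was not complete.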

Buffer growth and shrinking

This is implemented in gather_spike_data_()
All MPI ranks need to make the same changes based on available information
Key quantity is global_max_spikes_per_rank_, i.e., the largest number of spikes that any rank has sent to any other rank. The individual sections of the spike transmission buffer must be at least this size.
Shrinking happens at beginning of gather based on global_max_spikes_per_rank_, growth during gathering if required.
Growing is costly because it means extra round of communication; shrinking is cheap because it is a local operation on each rank.
Here, buffer size is always the size of a section for transmitting from one rank to another rank.

Growing

Grow if spikes that need to be transmitted do not fit into buffer
Minimum required size would be global_max_spikes_per_rank_
Grow to (1 + spike_buffer_grow_extra) * global_max_spikes_per_rank_ to keep number of grow operations small
Some experimentation suggests that sizeable extra is advantageous, therefore default spike_buffer_grow_extra == 0.5

Shrinking

Avoid shrinking too often so that we do not need to grow too frequently.
Shrink only if number of spikes is well below current size, i.e., if
global_max_spikes_per_rank_ < spike_buffer_shrink_limit * buffer_size
When shrinking, leave some spare space in buffer to avoid having to grow quickly:
new_size = ( 1 + spike_buffer_shrink_spare ) * global_max_spikes_per_rank_
Shrinking can be turned off by setting spike_buffer_shrink_limit = 0
Some experimentation suggests the following defaults
- spike_buffer_shrink_limit == 0.3
- spike_buffer_shrink_spare == 0.1
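
The growth and shrink rules amount to two small formulas. The sketch below uses hypothetical function names, truncates to an integer, and keeps at least two entries per section; the rounding mode and the minimum are assumptions. The parameter values in the example are the defaults quoted above.

```python
def grown_size(global_max, grow_extra):
    """New section size when the spikes did not fit into the buffer."""
    return max(2, int((1 + grow_extra) * global_max))


def size_after_shrink_check(buffer_size, global_max, shrink_limit, shrink_spare):
    """Section size at the beginning of gather; unchanged unless well underused."""
    if global_max < shrink_limit * buffer_size:
        return max(2, int((1 + shrink_spare) * global_max))
    return buffer_size


size = grown_size(100, grow_extra=0.5)
print(size)  # 150
print(size_after_shrink_check(size, 60, shrink_limit=0.3, shrink_spare=0.1))  # 150
print(size_after_shrink_check(size, 20, shrink_limit=0.3, shrink_spare=0.1))  # 22
print(size_after_shrink_check(size, 20, shrink_limit=0.0, shrink_spare=0.1))  # 150
```

With a shrink limit of 0 the condition can never hold, which matches the statement that this setting turns shrinking off.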

Logging

For each resize operation, a time stamp (always a full time slice, in steps), the value of global_max_spikes_per_rank_ and the new buffer size are recorded.
These data are available through kernel attribute spike_buffer_resize_log, which is a dictionary with the same structure as events dictionaries of recorders, i.e., containing one array for each of the three quantities recorded.

Spike delivery

deliver_events_() is called at beginning of each time slice except for the very first time slice (time 0, nothing to deliver)
deliver_events_() is called in a thread-parallel context
First, count the total number of spikes to be delivered
- Based on end_marker in section from each rank
Then split spikes into batches (currently 8 spikes per batch)
For each batch, transfer spike data into per-quantity C-style arrays, then deliver from those arrays
Finally deliver remaining spikes
This batching gave noticeable performance gains.
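
The control flow of batched delivery can be sketched as follows. The function and the tuple layout (tid, syn_id, lcid, lag) are assumptions for illustration, and a Python version naturally does not reproduce the performance effect; it only shows the split into full batches, the transfer into per-quantity arrays, and the handling of the remainder.

```python
BATCH_SIZE = 8


def deliver_in_batches(spikes, deliver):
    """Deliver spikes, given as (tid, syn_id, lcid, lag) tuples, batch by batch.

    Returns (number of full batches, number of remaining spikes).
    """
    num_batches, remainder = divmod(len(spikes), BATCH_SIZE)
    for b in range(num_batches):
        batch = spikes[b * BATCH_SIZE:(b + 1) * BATCH_SIZE]
        # First transfer the batch into one array per quantity ...
        tids = [s[0] for s in batch]
        syn_ids = [s[1] for s in batch]
        lcids = [s[2] for s in batch]
        lags = [s[3] for s in batch]
        # ... then deliver from those arrays.
        for j in range(BATCH_SIZE):
            deliver(tids[j], syn_ids[j], lcids[j], lags[j])
    for s in spikes[num_batches * BATCH_SIZE:]:
        deliver(*s)  # remaining spikes that do not fill a batch
    return num_batches, remainder


spikes = [(0, 1, i, i % 3) for i in range(19)]
log = []
counts = deliver_in_batches(spikes, lambda *args: log.append(args))
print(counts)         # (2, 3)
print(log == spikes)  # True
```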

Minor changes

Limit on LCID values

MAX_LCID now used to mark invalid_lcid
Therefore, maximum usable lcid is MAX_LCID-1
Changed in nest_types.h
See also Systematize definition of INVALID_* and MAX_* constants #2529 for follow-up

`MPIManager` changes

Code related to adaptive spike buffers removed. This is now handled in EventDeliveryManager.

`FULL_LOGGING()` macro

Macro supporting logging, entirely disabled by default
Especially for debugging output from multi-rank, multi-thread simulations
Also write_to_dump() method for logging output
If used, forces sync'ing of threads through critical sections for output
Files containing implementation
- CMakeLists.txt
- cmake/ProcessOptions.cmake
- libnestutil/config.h.in
- kernel_manager.h,cpp
Question: Should all uses of this macro be removed before merging?

Touch ups

Replace int by size_t for spike multiplicity
- music_event_out_proxy
- spike_recorder
- event
Whitespace fixes
- stimulation_backend_mpi.h
- recording_backend_mpi.h

Updated tests

test_stdp_synapse — modernization, no change to logic
ticket-85.sli — slight simplification, no change to logic

SLI unittest

Add option to print results of distributed_process_invariant_events...

Open issues to be followed up

Consistent semantics for MAX_ and invalid_ constants, see Systematize definition of INVALID_* and MAX_* constants #2529
Considerable bits of code for transmitting connection information to the pre-synaptic side are in EventDeliveryManager. Move to ConnectionManager.
Buffer resizing is split between MPIManager and code using the buffers in complicated ways. This should be made more systematic.
Remove considerable code duplication and templatization due to on-/off-grid distinction (see Move precise spike time offset from Event to Time class #2035)
Consider if the hard-coded buffer-resizing strategy could be replaced by a more flexible solution and what better solutions could be, e.g., one taking simulation history into account.
See if batch code in deliver_events_() can be simplified by use of functions.
Could SendBufferPosition be turned into a proper iterator (or array of iterators), and should TargetSendBufferPosition be moved to a file of its own?
Clean up use of spike in names in connection infrastructure building
See if SourceTable::compressed_spike_data_map_ can be cleared after connection transmission

This PR replaces #2617.

jougs

LGTM! Many thanks!

suku248 and others added 30 commits March 7, 2022 21:49

Remove while-loop and multi-threading (except for deliver) in gather_…

a4fbdbe

…spike_data; use max spike-data buffer size

Ensure spike register is thread-local.

472affb

Fixed formatting.

e8d52f4

Merge pull request #2 from heplesser/suku_test_single_threading_in_ga…

9d993f9

…ther_spike_data Ensure thread-local memory allocation

Remove reading thread dimension in spike register for testing

5ee6d82

Merge branch 'test_single_threading_in_gather_spike_data' of github.c…

e647604

…om:suku248/nest-simulator into test_single_threading_in_gather_spike_data

Change spike register to use thread-specific pointers to inner vectors

4b772b1

Merge branch 'master' of github.com:nest/nest-simulator into test_sin…

ece715d

…gle_threading_in_gather_spike_data

Fixed merge error

f0610ea

Process spike receive buffer in a batchwise fashion

eb5851f

Merge branch 'batchwise_processing_spike_receive_buffer' of github.co…

ce0ab45

…m:suku248/nest-simulator into single_batchwise

Compressed spike mapping with fixed slot per spike.

5e06435

Merge remote-tracking branch 'origin/single_batchwise' into hep_test_…

c1883dd

…single_threading_in_gather_spike_data Conflicts: nestkernel/event_delivery_manager.cpp

Fixed thread bugs.

b4a1283

More brute force serialization.

2302845

Added SIONLIB omp barrier skip

afb2ec7

Rearranged deliver events and gather events.

f743f03

Moved deliver events conditional block.

cef6d8a

Shift event_delivery after updating nodes and gather_spikes

02cee95

Shift spikes timestemaps

c7a1e50

Merge branch 'debug_nest' of github.com:med-ayssar/nest-simulator int…

79d4ebd

…o debug_nest

Put the delivery event back at the top of the main while-loop

3c3caab

Substract lag from mindelay in deliver_event

98196e8

Fix static check error

09f5446

Fix assertion error

d7052e0

Fix static check error

0d20dfd

Clean implementation of time stamps on delivery.

daccac5

Attempt to fix time stamps for secondary events.

d257e6c

Added minimal unit test for problems when using one-to-one connectivity.

f3b8033

Converted one-to-one-multithreading test to pytest

f009e45

Merge branch 'def_nolag_mrg' of github.com:heplesser/nest-simulator i…

afba22f

…nto def_nolag_mrg

mlober reviewed Sep 11, 2023

Update cmake/ProcessOptions.cmake

042b5b2

Co-authored-by: Jochen Martin Eppler <[email protected]>

mlober reviewed Sep 11, 2023

heplesser and others added 7 commits September 11, 2023 16:50

Update libnestutil/config.h.in

05ce424

Co-authored-by: Jochen Martin Eppler <[email protected]>

Update nestkernel/event_delivery_manager.h

44900f6

Co-authored-by: Melissa Lober <[email protected]>

Update nestkernel/event_delivery_manager.h

87cfdc7

Co-authored-by: Melissa Lober <[email protected]>

Some improvements around FULL_LOGGING_ONLY

073d3d7

Merge branch 'def_nolag_mrg' of github.com:heplesser/nest-simulator i…

f92b21a

…nto def_nolag_mrg

Merge branch 'master' of github.com:nest/nest-simulator into def_nola…

503b7de

…g_mrg

Fix C++ formatting

f7a7af3

mlober reviewed Sep 11, 2023

heplesser and others added 3 commits September 11, 2023 17:39

Fix tests after merge

a92b995

Fix formatting

2a17bdd

Update nestkernel/spike_data.h

8b3488a

Co-authored-by: Melissa Lober <[email protected]>

heplesser requested review from jougs and removed request for suku248 and JanVogelsang September 13, 2023 08:09

jougs approved these changes Sep 13, 2023

heplesser added 3 commits September 13, 2023 10:26

Fix formatting of kernel SLI documentation

c4c00fe

Merge branch 'def_nolag_mrg' of github.com:heplesser/nest-simulator i…

bca6f28

…nto def_nolag_mrg

Fix typos

e57b9a1

heplesser requested review from mlober and jessica-mitchell September 13, 2023 08:35

jessica-mitchell approved these changes Sep 13, 2023

mlober approved these changes Sep 13, 2023

heplesser merged commit c2258df into nest:master Sep 13, 2023
20 checks passed

heplesser mentioned this pull request Sep 14, 2023

Correct limit for test on target fields and remove duplicate assertions #2945

Merged

heplesser deleted the def_nolag_mrg branch April 24, 2024 20:21
