Improve NEST performance by revised connection exchange and spike delivery #2926

Merged: 146 commits merged into nest:master on Sep 13, 2023

Conversation

@heplesser (Contributor) commented on Sep 7, 2023

This pull request significantly improves NEST performance by

  • compressing and transmitting connections in an improved way;
  • separating spike gathering (end of update loop) and spike delivery (beginning of next update loop);
  • exchanging all spikes by at most two MPI exchanges.

The changes are described in detail below.

Note that transmission of SecondaryEvents is essentially not affected by this PR since they are written directly into ready-made buffers on the receiver side. There is only some effect on connection transmission to the presynaptic side.

Many NEST developers have contributed to this work, especially @diesmann, @suku248, @JoseJVS, @med-ayssar, @mlober, @hakonsbm, @ackurth and @JanVogelsang.

Breaking changes in NEST

Spikes from last slice not delivered

  • Because spikes emitted during the final time slice of a simulation are no longer delivered, devices that collect data through the normal spike transmission mechanism (e.g., the correlation detector) will miss input from the last slice and return different results.
  • Normal recording devices (e.g., spike_recorder) are not affected since they receive spikes locally.

Removed kernel parameters

  • sort_connections_by_source — use_compressed_spikes remains and automatically activates connection sorting; there is simply no relevant use case in which sorting without compressing would make sense.
  • adaptive_spike_buffers — spike buffers are now always adaptive, see section on Buffer growth and shrinking
  • max_buffer_size_spike_data — there is no upper limit since all spikes need to be transmitted in one round

New kernel parameters

The following parameters control or report spike buffer resizing (see Buffer growth and shrinking for details):

  • spike_buffer_grow_extra
  • spike_buffer_shrink_limit
  • spike_buffer_shrink_spare
  • spike_buffer_resize_log

Things to check in particular

  • Formatting of kernel docstring (kernel_manager.h)
  • Placement of timers

Modified tests

  • test_mip_corrdet — needs to simulate one step longer due to delivery at the beginning of the next step
  • test_regression_issue-1034 — needs to subtract min_delay due to the moved delivery

Connection compression and transmission

  • There is now only the choice between compressed spike transmission, which includes sorting connections by source, and fully unsorted raw transmission.
  • Code for "raw" connection transmission to the pre-synaptic side is essentially unchanged.
  • There is now a separate method gather_target_data_compressed().
  • Construction and transmission of the compressed target tables is completely revised. The eventual compression is the same as before.
  • Note: the use of spike in a number of method and data structure names below has historic roots and should be cleaned up in a follow-up step.

Data Structures

SourceTable::sources_

  • sources_[target_thread][syn_id][lcid] stores raw connection info, built during connection creation

SourceTable::compressible_sources_

  • Used for first level compression on each thread
  • compressible_sources_[target_thread][syn_id][idx] contains one pair< source_node_id, SpikeData( target_thread, syn_id, conn_lcid, 0 ) > entry, mapping each source node id to a SpikeData entry identifying the first sources_ entry for that source node in the sorted sources_ array
  • Is cleared after second-level compression

ConnectionManager::compressed_spike_data

  • compressed_spike_data[syn_id][source_index][target_thread] is the result of the second compression step
  • For each synapse ID and unique source neuron with targets on a given thread, it stores the pair from compressible_sources_.
  • SourceTable::compressed_spike_data_map_ provides an index from source node id to source_index in compressed_spike_data.

SourceTable::compressed_spike_data_map_

  • compressed_spike_data_map_[syn_id] maps each unique source node id onto the corresponding source_index in the compressed_spike_data
  • After this mapping has been transferred to the pre-synaptic side, the map can be deleted

CSDMapEntry

  • Compact representation for entries in compressed_spike_data_map_.

ConnectionManager::iteration_state_

  • vector< pair< syn_id, map< source_gid, CSDMapEntry >::const_iterator > >
  • For each thread and syn_id, maintains an iterator to the corresponding part of compressed_spike_data_map_.
  • Is used while filling connection transmission buffers

Source compression

  • Compression works as follows:
    1. Sort connections by source neuron (thread parallel)
    2. Per-thread compression (SourceTable::collect_compressible_sources(), thread parallel)
      • In the sorted SourceTable::sources_, create an entry in compressible_sources_ connecting each source ID to information about the first entry for that source in sources_, and mark the sequence of connections from that source neuron via source_has_more_targets.
    3. Compression across threads (SourceTable::fill_compressed_spike_data(), serial)
      1. For each syn_id, iterate over connections on all threads in compressible_sources_
      2. If the source node is not known in the compressed_spike_data_map_ (for that syn_id)
        1. Append an empty entry to compressed_spike_data with one slot per thread
        2. Register this entry in compressed_spike_data_map_
      3. Register the connection under the right thread, writing the SpikeData entry created in the first compression step.
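
For illustration, here is a minimal sketch of the cross-thread compression step described above, using simplified stand-in types; SimpleSpikeData, CompressibleSources and the function name are assumptions for this example, not the actual NEST interfaces.

```cpp
// Sketch of SourceTable::fill_compressed_spike_data()-style logic with simplified types.
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct SimpleSpikeData // stand-in for nest::SpikeData( target_thread, syn_id, conn_lcid, 0 )
{
  size_t target_thread = 0;
  size_t syn_id = 0;
  size_t lcid = 0;
};

// per thread and syn_id: pairs of ( source_node_id, first SpikeData entry for that source )
using CompressibleSources = std::vector< std::pair< size_t, SimpleSpikeData > >;
// [syn_id][source_index][target_thread]
using CompressedSpikeData = std::vector< std::vector< std::vector< SimpleSpikeData > > >;

void
fill_compressed_spike_data_sketch( const size_t num_threads,
  const size_t num_syn_ids,
  const std::vector< std::vector< CompressibleSources > >& compressible_sources, // [thread][syn_id]
  CompressedSpikeData& compressed_spike_data,
  std::vector< std::map< size_t, size_t > >& compressed_spike_data_map ) // [syn_id]: source gid -> source_index
{
  compressed_spike_data.assign( num_syn_ids, {} );
  compressed_spike_data_map.assign( num_syn_ids, {} );

  for ( size_t syn_id = 0; syn_id < num_syn_ids; ++syn_id )
  {
    for ( size_t tid = 0; tid < num_threads; ++tid )
    {
      for ( const auto& [ source_gid, first_target ] : compressible_sources[ tid ][ syn_id ] )
      {
        auto& map = compressed_spike_data_map[ syn_id ];
        auto it = map.find( source_gid );
        if ( it == map.end() )
        {
          // source not seen before: append an empty entry with one slot per thread and register it
          compressed_spike_data[ syn_id ].emplace_back( num_threads );
          it = map.emplace( source_gid, compressed_spike_data[ syn_id ].size() - 1 ).first;
        }
        // register the connection block of this source on this thread
        compressed_spike_data[ syn_id ][ it->second ][ tid ] = first_target;
      }
    }
  }
}
```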

Connection transmission

  • Connection transmission works in multiple rounds if necessary; the buffer size may be adjusted

  • Collocation of data is assigned to "assigned ranks"

  • Writing is mainly done by ConnectionManager::fill_target_buffer() as follows (see the sketch after this list):

    1. Iterate through entire compressed_spike_data_map_, outermost by syn_id, then over source entries.
    2. Find rank for given source—this is where we must send information.
    3. If no space left in buffer chunk for this rank
      1. Store iteration state for iteration through csdmap and target buffer
      2. Inform gather that we did not write all data
    4. Otherwise, write data and move to next entry
    5. If all data has been written, signal this to gather
  • Gather MPI-exchanges data that has been written to buffers

  • If not all data has been transmitted, do more rounds until all data has been transmitted.

  • For each compressed set of connections, we send to the presynaptic side

    • thread and thread-local id of source neuron
    • syn_id
    • index into compressed_spike_data
  • NOTE: The iteration scheme is different from the original approach. We stop as soon as a single rank has filled its part of the buffer. In the original, iteration would continue until the last rank had filled its chunk. CSDMap entries were marked as processed when written. On the next round, iteration through CSDMap would start at the point where the first rank had to stop writing, skipping all entries that had been written.
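
A compact sketch of this resumable buffer-filling loop, under the simplifying assumption that the per-rank buffer sections are plain vectors and the connection payload is reduced to three indices; TargetDataEntry, rank_of_gid and the function name are illustrative only.

```cpp
// Sketch of ConnectionManager::fill_target_buffer()-style iteration with saved state.
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

struct TargetDataEntry // simplified payload sent to the presynaptic side
{
  size_t source_gid;   // stands in for ( source thread, thread-local id )
  size_t syn_id;
  size_t source_index; // index into compressed_spike_data on the target rank
};

using CSDMap = std::map< size_t, size_t >; // source gid -> source_index
using IterationState = std::vector< std::pair< size_t, CSDMap::const_iterator > >; // ( syn_id, resume point )

// Returns true if all entries were written; false if some rank's chunk ran full.
bool
fill_target_buffer_sketch( const std::vector< CSDMap >& csd_maps,  // [syn_id]
  std::vector< std::vector< TargetDataEntry > >& send_chunks,      // one section per target rank
  const size_t chunk_capacity,
  IterationState& state,                                           // initialized once with begin() iterators
  const std::function< size_t( size_t ) >& rank_of_gid )           // rank hosting a given source gid
{
  while ( not state.empty() )
  {
    auto& [ syn_id, it ] = state.back();
    const CSDMap& csd_map = csd_maps[ syn_id ];
    for ( ; it != csd_map.end(); ++it )
    {
      const size_t rank = rank_of_gid( it->first );
      if ( send_chunks[ rank ].size() >= chunk_capacity )
      {
        return false; // stop as soon as one rank's section is full; state remembers where to resume
      }
      send_chunks[ rank ].push_back( TargetDataEntry{ it->first, syn_id, it->second } );
    }
    state.pop_back(); // all sources for this syn_id have been written
  }
  return true;
}
```

Before the first round, the state would hold one ( syn_id, csd_maps[ syn_id ].begin() ) pair per synapse type; gather then calls this repeatedly, exchanging buffers in between, until it returns true.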

Spike transmission

  • Spikes written to emitted_spikes_register_ during node update are gathered at the end of the time slice and exchanged between ranks by gather_spike_data() and collocate_spike_data_buffers_().
  • Spike gathering is done by the master thread only. Benchmarks up to 64 nodes showed no gain from the "assigned ranks" approach (available in https://github.com/heplesser/nest-simulator/tree/def_ar).
  • At beginning of a time slice, the spike transmission buffer contains all spikes a given rank needs to deliver
  • Delivery happens thread-parallel in deliver_events_()

Data structures

Spike register

  • During node update, all spikes that need to be transmitted globally are written to the emitted_spikes_register
  • This is much simplified compared to master and only a 2D structure
    • Outer dimension is the node-updating thread
    • This contains pointers to the inner dimension to ensure that the metadata of the inner vector (begin/end markers etc) is local to the thread updating the neurons and pushing elements into the inner vector
    • Contains entries combining SpikeData (to be written directly to the transmission buffer) and the rank of the target neuron (for writing to the correct section of the target buffer)
  • Is doubled up for data with/without offgrid times
  • Data is written into emitted_spikes_register_ in EventDeliverManager::send_remote(), which is called when a node sends a spike
  • Data is read out from emitted_spikes_register_ in EventDeliveryManager::collocate_spike_data_buffers_(), called from gather_spike_data()
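
As a rough illustration of this layout (the entry type and helper names are assumptions, not the actual kernel code), the register can be pictured as follows:

```cpp
// Sketch of the 2D emitted-spikes register with thread-local inner vectors.
#include <cstddef>
#include <memory>
#include <vector>

struct RegisterEntry // stands in for SpikeDataWithRank (see SpikeData below)
{
  size_t target_rank; // selects the section of the transmission buffer
  // ... plus the compact SpikeData payload
};

// outer index: updating thread; the inner vectors are held via pointers so their
// begin/end/capacity bookkeeping lives in memory owned by the thread that fills them
using EmittedSpikesRegisterSketch = std::vector< std::unique_ptr< std::vector< RegisterEntry > > >;

void
prepare_register_sketch( EmittedSpikesRegisterSketch& reg, const size_t num_threads )
{
  reg.resize( num_threads );
  for ( size_t tid = 0; tid < num_threads; ++tid )
  {
    // in NEST each thread allocates its own inner vector inside a parallel region,
    // so the allocation is local to that thread; the serial loop here is a simplification
    reg[ tid ] = std::make_unique< std::vector< RegisterEntry > >();
  }
}

// called from send_remote(): only thread tid ever writes to reg[ tid ]
void
push_spike_sketch( EmittedSpikesRegisterSketch& reg, const size_t tid, const RegisterEntry& entry )
{
  reg[ tid ]->push_back( entry );
}
```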

SendBufferPosition

  • Marks where next spike should be written in each target-rank-specific section of spike transmission buffer
  • Much simplified since we dropped the assigned ranks approach
  • Constructor moved to cpp-file
  • New separate class TargetSendBufferPosition used for connection communication with assigned ranks.

SpikeData

  • Compact representation of spike for transmission in MPI buffer
  • Largely unchanged
  • Also in OffGrid version
  • 2-bit marker definition and usage changed, see section on Spike Gathering and Transmission below
  • Note: As soon as a spike is sent via send_remote(), we immediately create the eventual SpikeData entry, which is later copied to the transmission buffer by collocate_spike_data_buffers_(); there is no more re-coding in the process
    • New constructor from (Target, lag) for direct insertion to emitted_spikes_register as part of SpikeDataWithRank entry
    • SpikeDataWithRank combines SpikeData with target rank information needed for eventual writing to transmission buffer.
    • Construction allows emplace_back() into emitted_spikes_register, i.e., direct construction instead of construct and copy.
  • New method set_lcid() to allow transmission of locally required
    buffer size per rank in LCID field
  • New method get_marker()
  • Corrected copy constructor and assignment operator
  • Copy and assignment operators for OffGridSpike data
  • New struct SpikeDataWithRank for emitted_spikes_register, also
    variant for OffGrid
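
The sketch below illustrates the idea of the compact spike entry with a 2-bit marker and the SpikeDataWithRank wrapper that is emplaced directly into the register; the field widths, names and constructor signatures are assumptions chosen for the example, not NEST's actual definitions.

```cpp
// Sketch of a compact SpikeData-like entry plus the SpikeDataWithRank wrapper.
#include <cstddef>
#include <vector>

struct SpikeDataSketch
{
  unsigned int tid : 10;   // target thread
  unsigned int syn_id : 9; // synapse type id
  unsigned int lcid : 27;  // local connection id; also reused to transmit required buffer sizes
  unsigned int marker : 2; // DEFAULT / END / COMPLETE / INVALID
  unsigned int lag : 14;   // position of the spike within the min_delay time slice

  SpikeDataSketch( const unsigned int tid_, const unsigned int syn_id_, const unsigned int lcid_, const unsigned int lag_ )
    : tid( tid_ )
    , syn_id( syn_id_ )
    , lcid( lcid_ )
    , marker( 0 )
    , lag( lag_ )
  {
  }

  void set_lcid( const unsigned int value ) { lcid = value; } // e.g. to ship local_max_spikes_per_rank
  unsigned int get_marker() const { return marker; }
};

struct SpikeDataWithRankSketch
{
  SpikeDataSketch spike_data; // later copied verbatim into the MPI transmission buffer
  size_t rank;                // target rank, used only to pick the buffer section

  SpikeDataWithRankSketch( const SpikeDataSketch& sd, const size_t rank_ )
    : spike_data( sd )
    , rank( rank_ )
  {
  }
};

// direct in-place construction in the register, no separate construct-and-copy step
void
send_remote_sketch( std::vector< SpikeDataWithRankSketch >& register_for_this_thread,
  const unsigned int target_tid,
  const unsigned int syn_id,
  const unsigned int lcid,
  const unsigned int lag,
  const size_t target_rank )
{
  register_for_this_thread.emplace_back( SpikeDataSketch( target_tid, syn_id, lcid, lag ), target_rank );
}
```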

Deliver events first

  • Since NEST 2.16 (introduction of 5G kernel), presynaptic gathering and postsynaptic delivery of spikes were interleaved at the end of the simulation update loop.
  • This PR again moves delivery into its own method, deliver_events_(), called at the beginning of each update loop.
  • This means that the simulation clock_ is advanced by one min_delay when spikes are delivered compared to when they were sent.
  • Therefore, min_delay needs to be subtracted from clock_ when computing arrival times
    • Done for normal spikes when computing prepared_timestamps in deliver_events_()
    • Done explicitly in several rate_*_impl.h files
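
A minimal sketch of this correction using plain integer steps instead of NEST's Time class; the exact offset convention and the name prepare_timestamps_sketch are assumptions for illustration.

```cpp
// Spikes delivered at the start of a slice were emitted during the previous slice,
// so arrival times are computed relative to clock_ minus one min_delay.
#include <vector>

std::vector< long >
prepare_timestamps_sketch( const long clock_steps, const long min_delay_steps )
{
  const long slice_origin = clock_steps - min_delay_steps; // start of the slice in which the spikes were emitted

  std::vector< long > prepared_timestamps( min_delay_steps );
  for ( long lag = 0; lag < min_delay_steps; ++lag )
  {
    prepared_timestamps[ lag ] = slice_origin + lag; // per-lag emission step; synaptic delays are added during delivery
  }
  return prepared_timestamps;
}
```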

Spike gathering and transmission

  • All spike gathering and transmission is done by the master thread alone
  • If all data can be transmitted with current transmission buffer size, this is done in a single round of collocating spikes followed by MPI transmission
  • If, for any rank-rank pair more spikes need to be transmitted than fit into a buffer chunk, buffers are made larger so that all data will fit and the entire communication is repeated.

Marking completeness and required buffer size

  • All signaling needs to be done with the two bits of the SpikeData::marker_ field
  • All entries in the transmission buffer must be usable for spike payload to avoid waste
  • To achieve maximum information with minimal bits, we also exploit where in a buffer certain markers occur.
  • We require that each per-rank chunk of the spike transmission buffer has at least two different entries.
  • The first entry of a chunk is called begpos, the last endpos (this position is included in the chunk, it is not one beyond)
  • local_max_spikes_per_rank is the largest number of spikes a given rank needs to transmit to any other rank.
  • global_max_spikes_per_rank is the maximum of all local_max_spikes_per_rank values. It determines the minimum required buffer chunk size.
  • SpikeData marker values are defined as follows:
    • DEFAULT: Normal entry; cannot occur in endpos.
    • END: Marks the last entry containing data.
      • If it occurs in endpos, it implies COMPLETE and indicates that the local_max_spikes_per_rank of the sending rank equals the current buffer size.
    • COMPLETE: Can only occur in endpos and indicates that the sending rank could write all emitted spikes to the transmission buffer.
      • END is then in an earlier position.
      • The LCID entry of endpos contains the local_max_spikes_per_rank of the corresponding sending rank.
    • INVALID:
      • In begpos, indicates that no spikes are transmitted (note: END at begpos means one spike transmitted).
      • In endpos, indicates that the pertaining rank could not send all spikes; the LCID entry of endpos then contains the local_max_spikes_per_rank of the corresponding sending rank.
  • Collocation, transmission and signalling now proceeds as follows:
    1. Each rank writes all spikes that fit into transmission buffer chunks and counts all spikes, also those that do not fit.
    2. Each rank determines its local_max_spikes_per_rank.
    3. If local_max >= chunk_size, set endpos markers to INVALID and store local_max in the endpos LCID field.
    4. Otherwise,
      1. if no spikes were written to a chunk, set begpos for chunk to INVALID,
        otherwise set END marker on last position written to.
      2. if some spikes were written to a chunk but the chunk not filled, set COMPLETE on endpos for chunk and store local_max in endpos LCID
    5. Perform MPI exchange
    6. Obtain global_max_spikes_per_rank from all local_max information obtained
    7. If global_max > chunk_size, grow buffer and repeat entire process.
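
To make the protocol above concrete, here is a sketch of how a sending rank might set the markers for its per-target-rank buffer sections after collocation; the enum, the Entry layout and the treatment of the boundary case local_max == chunk_size are simplifying assumptions.

```cpp
// Sketch of per-chunk marker setting on the sending side after collocation.
#include <cstddef>
#include <vector>

enum class Marker : unsigned char
{
  DEFAULT,
  END,
  COMPLETE,
  INVALID
};

struct Entry
{
  Marker marker = Marker::DEFAULT;
  size_t lcid = 0; // LCID field, reused to transmit local_max_spikes_per_rank where allowed
};

void
mark_chunks_sketch( std::vector< std::vector< Entry > >& chunks, // one section per target rank, at least two entries each
  const std::vector< size_t >& num_written,                      // spikes actually written into each section
  const size_t local_max_spikes_per_rank )                       // max spikes this rank must send to any rank
{
  for ( size_t rank = 0; rank < chunks.size(); ++rank )
  {
    std::vector< Entry >& chunk = chunks[ rank ];
    const size_t chunk_size = chunk.size();
    Entry& endpos = chunk.back();

    if ( local_max_spikes_per_rank > chunk_size ) // exact boundary handling is an assumption here
    {
      // buffer too small somewhere: flag endpos INVALID and ship the required size in its LCID field
      endpos.marker = Marker::INVALID;
      endpos.lcid = local_max_spikes_per_rank;
    }
    else if ( num_written[ rank ] == 0 )
    {
      chunk.front().marker = Marker::INVALID; // no spikes for this target rank
    }
    else if ( num_written[ rank ] == chunk_size )
    {
      endpos.marker = Marker::END; // chunk exactly full; END at endpos implies COMPLETE
    }
    else
    {
      chunk[ num_written[ rank ] - 1 ].marker = Marker::END; // last entry containing data
      endpos.marker = Marker::COMPLETE;                      // all spikes for this rank written
      endpos.lcid = local_max_spikes_per_rank;               // report local max in the endpos LCID
    }
  }
}
```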

Buffer growth and shrinking

  • This is implemented in gather_spike_data_()
  • All MPI ranks need to make the same changes based on available information
  • Key quantity is global_max_spikes_per_rank_, i.e., the largest number of spikes that any rank has sent to any other rank. The individual sections of the spike transmission buffer must be at least this size.
  • Shrinking happens at the beginning of gather based on global_max_spikes_per_rank_, growth during gathering if required.
  • Growing is costly because it means extra round of communication; shrinking is cheap because it is a local operation on each rank.
  • Here, buffer size is always the size of a section for transmitting from one rank to another rank.
Growing
  • Grow if spikes that need to be transmitted do not fit into buffer
  • Minimum required size would be global_max_spikes_per_rank_
  • Grow to (1 + spike_buffer_grow_extra) * global_max_spikes_per_rank_ to keep number of grow operations small
  • Some experimentation suggests that a sizeable extra is advantageous, therefore the default is spike_buffer_grow_extra == 0.5
Shrinking
  • Avoid shrinking too often so that we do not need to grow too frequently.
  • Shrink only if number of spikes is well below current size, i.e., if
    global_max_spikes_per_rank_ < spike_buffer_shrink_limit * buffer_size
  • When shrinking, leave some spare space in buffer to avoid having to grow quickly:
    new_size = ( 1 + spike_buffer_shrink_spare ) * global_max_spikes_per_rank_
  • Shrinking can be turned off by setting spike_buffer_shrink_limit = 0
  • Some experimentation suggests the following defaults
    • spike_buffer_shrink_limit == 0.3
    • spike_buffer_shrink_spare == 0.1
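
The resize rules can be summarized in a small policy sketch; the struct and method names are illustrative, but the parameter names and defaults mirror the kernel parameters introduced by this PR.

```cpp
// Sketch of the grow/shrink policy for per-rank spike buffer sections.
#include <cstddef>

struct SpikeBufferResizePolicy
{
  double grow_extra = 0.5;   // spike_buffer_grow_extra
  double shrink_limit = 0.3; // spike_buffer_shrink_limit (0 disables shrinking)
  double shrink_spare = 0.1; // spike_buffer_shrink_spare

  // called during gathering when the spikes to be sent do not fit
  size_t
  grown_size( const size_t global_max_spikes_per_rank ) const
  {
    return static_cast< size_t >( ( 1.0 + grow_extra ) * global_max_spikes_per_rank );
  }

  // called at the beginning of gather; returns the current size if no shrink is due
  size_t
  maybe_shrunk_size( const size_t current_size, const size_t global_max_spikes_per_rank ) const
  {
    if ( shrink_limit > 0.0 and global_max_spikes_per_rank < shrink_limit * current_size )
    {
      return static_cast< size_t >( ( 1.0 + shrink_spare ) * global_max_spikes_per_rank );
    }
    return current_size;
  }
};
```
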
Logging
  • For each resize operation, a time stamp (always a full time slice, in steps), the value of global_max_spikes_per_rank_ and the new buffer size are recorded.
  • These data are available through kernel attribute spike_buffer_resize_log, which is a dictionary with the same structure as events dictionaries of recorders, i.e., containing one array for each of the three quantities recorded.

Spike delivery

  • deliver_events_() is called at beginning of each time slice except for the very first time slice (time 0, nothing to deliver)
  • deliver_events_() is called in a thread-parallel context
  • First, count the total number of spikes to be delivered
    • Based on end_marker in section from each rank
  • Then split spikes into batches (currently 8 spikes per batch)
  • For each batch, transfer spike data into per-quantity C-style arrays, then deliver from those arrays
  • Finally deliver remaining spikes
  • This batching gave noticeable performance gains.
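
A simplified sketch of the batching scheme described above; Spike, deliver_one() and the exact payload fields are assumptions, while the batch size of 8 follows the text.

```cpp
// Sketch of batched spike delivery from per-quantity C-style arrays.
#include <cstddef>
#include <vector>

struct Spike
{
  size_t syn_id;
  size_t lcid;
  size_t lag;
};

// stand-in for the call that routes one spike into its target connection
void
deliver_one( const size_t /* syn_id */, const size_t /* lcid */, const size_t /* lag */ )
{
}

void
deliver_events_sketch( const std::vector< Spike >& spikes )
{
  constexpr size_t batch_size = 8;
  size_t syn_ids[ batch_size ];
  size_t lcids[ batch_size ];
  size_t lags[ batch_size ];

  const size_t num_batches = spikes.size() / batch_size;
  for ( size_t b = 0; b < num_batches; ++b )
  {
    // unpack one batch into per-quantity arrays ...
    for ( size_t i = 0; i < batch_size; ++i )
    {
      const Spike& s = spikes[ b * batch_size + i ];
      syn_ids[ i ] = s.syn_id;
      lcids[ i ] = s.lcid;
      lags[ i ] = s.lag;
    }
    // ... then deliver from those arrays; the fixed-size inner loops are what gave the gains
    for ( size_t i = 0; i < batch_size; ++i )
    {
      deliver_one( syn_ids[ i ], lcids[ i ], lags[ i ] );
    }
  }
  // finally deliver the remaining spikes that do not fill a complete batch
  for ( size_t i = num_batches * batch_size; i < spikes.size(); ++i )
  {
    deliver_one( spikes[ i ].syn_id, spikes[ i ].lcid, spikes[ i ].lag );
  }
}
```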

Minor changes

Limit on LCID values

MPIManager changes

  • Code related to adaptive spike buffers removed. This is now handled in EventDeliveryManager.

FULL_LOGGING() macro

  • Macro supporting logging, entirely disabled by default
  • Especially for debugging output from multi-rank, multi-thread simulations
  • Also write_to_dump() method for logging output
  • If used, forces sync'ing of threads through critical sections for output
  • Files containing implementation
    • CMakeLists.txt
    • cmake/ProcessOptions.cmake
    • libnestutil/config.h.in
    • kernel_manager.h,cpp
  • Question: Should all uses of this macro be removed before merging?

Touch ups

  • Replace int by size_t for spike multiplicity
    • music_event_out_proxy
    • spike_recorder
    • event
  • Whitespace fixes
    • stimulation_backend_mpi.h
    • recording_backend_mpi.h

Updated tests

  • test_stdp_synapse — modernization, no change to logic
  • ticket-85.sli — slight simplification, no change to logic

SLI unittest

  • Add option to print results of distributed_process_invariant_events...

Open issues to be followed up

  • Consistent semantics for MAX_ and invalid_ constants, see Systematize definition of INVALID_* and MAX_* constants #2529
  • Considerable bits of code for transmitting connection information to the pre-synaptic side are in EventDeliveryManager. Move to ConnectionManager.
  • Buffer resizing is split between MPIManager and code using the buffers in complicated ways. This should be made more systematic.
  • Remove considerable code duplication and templatization due to on-/off-grid distinction (see Move precise spike time offset from Event to Time class #2035)
  • Consider whether the hard-coded buffer-resizing strategy could be replaced by a more flexible solution and what better solutions could be, e.g., one taking simulation history into account.
  • See if batch code in deliver_events_() can be simplified by use of functions.
  • Could SendBufferPosition be turned into a proper iterator (or array of iterators), and should TargetSendBufferPosition be moved to a file of its own?
  • Clean up use of spike in names in connection infrastructure building
  • See if SourceTable::compressed_spike_data_map_ can be cleared after connection transmission

This PR replaces #2617.

@heplesser requested review from jougs and removed the review request for suku248 and JanVogelsang on September 13, 2023
@jougs (Contributor) left a comment:

LGTM! Many thanks!

@heplesser merged commit c2258df into nest:master on Sep 13, 2023 (20 checks passed)
@heplesser deleted the def_nolag_mrg branch on April 24, 2024
Labels

  • I: Behavior changes (introduces changes that produce different results for some users)
  • S: High (should be handled next)
  • T: Enhancement (new functionality, model or documentation)