Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix threading issue when connecting #1276

Merged
merged 22 commits into from
Jan 9, 2020
Merged

Conversation

stinebuu
Copy link
Contributor

@stinebuu stinebuu commented Aug 27, 2019

A little while ago @heplesser found that when trying to reproduce the figures from Ippen et al (2017), the strong scaling by threads variant behaved badly with master. Luckily, he found the problem as well, which is that a lot of threads are trying to write to have_connections_changed_ at once. This PR use CompletedChecker for have_connections_changed_ to make sure that all threads do not write to the same variable.

A benchmark can be found here: Fig_5_results_one_node_noinst.pdf The benchmark is from July, 0ac2385 was master at the time, and 8fa2e37 contains the fix. With the fix we see that strong scaling by threads and strong scaling by mpi again are equal when we connect.

This PR also cleans up some of our thread-safe connection routines. Instead of passing around one static empty dummy dictionary when connecting, we use a list of empty dictionaries, one dict per thread.

@heplesser heplesser requested review from jakobj and jarsi August 27, 2019 12:11
@heplesser heplesser added ZC: Kernel DO NOT USE THIS LABEL I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) ZP: PR Created DO NOT USE THIS LABEL S: High Should be handled next T: Maintenance Work to keep up the quality of the code and documentation. labels Aug 27, 2019
@terhorstd terhorstd added this to the NEST 2.18.1 milestone Sep 2, 2019
Copy link
Contributor

@jakobj jakobj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks already quite good, thanks for fixing this! i've added a few comments that should be addressed

nestkernel/conn_builder.cpp Outdated Show resolved Hide resolved
@@ -427,7 +426,10 @@ nest::ConnectionManager::connect( const index sgid,
{
kernel().model_manager.assert_valid_syn_id( syn_id );

have_connections_changed_ = true;
if ( not have_connections_changed_[ target_thread ] )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this check here? isn't it cheaper to just write in all cases? the logic wouldn't change, would it?

Copy link
Contributor Author

@stinebuu stinebuu Dec 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakobj I tested on Blaustein, and if we do not check before setting the parameter, then the performance gets a lot worse. You can see it as 4c388f4 here. It must have something to do with all the threads trying to access at once and maybe something with the omp automatic write in the set function. I wrote a comment before every if to try to explain and make sure nobody removes the if without testing the performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok,thanks for clarifying

const thread tid = kernel().vp_manager.get_thread_id();

if ( not have_connections_changed_[ tid ] )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see above)

nestkernel/connection_manager.cpp Outdated Show resolved Hide resolved
@@ -694,7 +693,10 @@ nest::ConnectionManager::find_connection( const thread tid, const synindex syn_i
void
nest::ConnectionManager::disconnect( const thread tid, const synindex syn_id, const index sgid, const index tgid )
{
have_connections_changed_ = true;
if ( not have_connections_changed_[ tid ] )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see above)

nestkernel/connector_base.h Outdated Show resolved Hide resolved
nestkernel/connector_model_impl.h Outdated Show resolved Hide resolved
nestkernel/sp_manager.cpp Outdated Show resolved Hide resolved
topology/connection_creator.cpp Outdated Show resolved Hide resolved
ConnectionCreator::ConnectionCreator( DictionaryDatum dict )
: allow_autapses_( true )
, allow_multapses_( true )
, allow_oversized_( false )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I just added it because I noticed it wasn't initialized. I can remove it again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakobj If allow_oversized_ is not initialized here, the test test_oversized_mask.sli fails. I don't understand why the test suddenly fails here and not in master though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uninitialized is always bad, thanks for fixing this.

@stinebuu
Copy link
Contributor Author

stinebuu commented Nov 5, 2019

@jakobj First of all, sorry for taking way too long to address your comments! I have made some changes based on your comments. Concerning your comments about why I don't set have_connections_changed_ directly, I want to do some benchmarking again. I feel like there was a reason for this, but as I can't remember it I will test it out, and then either add a comment or fix it.

So no need to do another review yet.

@heplesser
Copy link
Contributor

@jakobj @jarsi Could you (re-)review this PR in the near future? It is one of two PRs for 2.18.1 still open.

@stinebuu
Copy link
Contributor Author

@heplesser It is hanging on me I'm afraid.

@stinebuu
Copy link
Contributor Author

@jakobj @jarsi The PR is again ready for the review, sorry for the delay.

Somewhere along the way the benchmarks showed worse performance again, and it took some time to detect the problem, but that is fixed now: benchmark_results.pdf

Copy link
Contributor

@jakobj jakobj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stinebuu thanks! everything looks good, just one minor comment where we could use std function. otherwise 👍

@@ -64,6 +64,34 @@ CompletedChecker::all_true() const
return true;
}

bool
CompletedChecker::any_false() const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use std::any_of since we use c++11 now? we should probably also replace the other functions, maybe in a separate PR though

}

bool
CompletedChecker::any_true() const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

@@ -427,7 +426,10 @@ nest::ConnectionManager::connect( const index sgid,
{
kernel().model_manager.assert_valid_syn_id( syn_id );

have_connections_changed_ = true;
if ( not have_connections_changed_[ target_thread ] )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok,thanks for clarifying

ConnectionCreator::ConnectionCreator( DictionaryDatum dict )
: allow_autapses_( true )
, allow_multapses_( true )
, allow_oversized_( false )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uninitialized is always bad, thanks for fixing this.

Copy link
Contributor

@jarsi jarsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work, the pr looks fine to me.

@heplesser
Copy link
Contributor

@jakobj Thanks for you comments! We cannot use std::any_of for two reasons: The underlying data type of a_ is a C-style array to allow OpenMP atomic updates and we need and OpenMP barrier before the "any" check.

As all reviewers have approved now, I will merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: High Should be handled next T: Maintenance Work to keep up the quality of the code and documentation. ZC: Kernel DO NOT USE THIS LABEL ZP: PR Created DO NOT USE THIS LABEL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants