
Adding new parties and retraining #52

Open
Enrique-Marmol opened this issue Feb 8, 2021 · 6 comments
Labels
question Further information is requested

Comments

@Enrique-Marmol

Hi! I was wondering what IBMFL can do after a training process. When aggregator.start_training() finishes, is it possible to call start_training() again so that the model leverages what was trained before? Between one start_training() and the next, is it possible to add new parties and remove others? And lastly, between calls to start_training(), is it possible to change the data of one party? I mean, maybe that party gets more data during the training phase and wants to use that data in the next training phases.

Regards and thanks in advance.

@Yi-Zoey Yi-Zoey added the question Further information is requested label Feb 8, 2021
@Yi-Zoey
Member

Yi-Zoey commented Feb 8, 2021

Hi, thanks for using IBM Federated Learning Library!

is it possible to make start_training() again and the model leverage what was make before?

Yes, that's possible. You can issue TRAIN again once the global training finishes; the aggregator will continue training for additional rounds (depending on your configuration).
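Roughly, the driver-side flow could look like the sketch below. This is a minimal sketch based on the example notebooks; the module path, the `config_file` argument, and the `start()` call are assumptions and may differ in your version.

```python
# Minimal sketch, assuming the example-style Aggregator API
# (module path, config_file argument and start() are assumptions).
from ibmfl.aggregator.aggregator import Aggregator

agg = Aggregator(config_file='config_agg.yml')
agg.start()              # START: bring the aggregator up

# ... each party issues START and REGISTER here ...

agg.start_training()     # first TRAIN: runs the configured number of rounds
agg.start_training()     # second TRAIN: continues from the current global model
```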

between start_training() and start_training() is it possible to add new parties and quit others?

Parties can drop out at any time during training; as long as the quorum is met, the training process keeps going. New parties can join between TRAIN commands: once the new party starts, use the REGISTER command to join the training.
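For a party that joins between TRAIN commands, the party-side steps would look roughly like this. The method names below mirror the START / REGISTER / STOP commands and are assumptions based on the example notebooks; your version may expose a different interface.

```python
# Hypothetical party-side sketch mirroring the START / REGISTER / STOP commands.
from ibmfl.party.party import Party

new_party = Party(config_file='config_party10.yml')  # config file name is illustrative
new_party.start()            # START: connect to the aggregator
new_party.register_party()   # REGISTER: join before the next TRAIN command

# Later, if this party wants to leave between rounds:
# new_party.stop()           # STOP: drop out of the federation
```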

between start_training() and start_training() is it possible to change the data of one party?

This really depends on how the party loads its dataset. If the data handler looks for and loads the local dataset each time the IBM FL local training module calls get_data() to access the local training dataset, then the party can use new data in the current training round, since the data handler reloads the dataset instead of keeping it in memory. See the sketch below.
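For illustration, a data handler along these lines would pick up data added between TRAIN commands. The base-class path, the constructor argument, and the return shape are assumptions modeled on the data handler examples; `train_file` and `test_file` are hypothetical config keys.

```python
# Sketch of a data handler that re-reads the local dataset on every call,
# so samples added between TRAIN commands are used in the next round.
import numpy as np
from ibmfl.data.data_handler import DataHandler  # base-class path is an assumption


class ReloadingDataHandler(DataHandler):
    def __init__(self, data_config=None):
        super().__init__()
        # hypothetical keys pointing at files the party can update on disk
        self.train_file = data_config['train_file']
        self.test_file = data_config['test_file']

    def get_data(self):
        # Reload from disk each time instead of caching in memory.
        train = np.load(self.train_file)
        test = np.load(self.test_file)
        return (train['x'], train['y']), (test['x'], test['y'])
```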

Let us know if you have further questions.

@Enrique-Marmol
Author

Hi, I have a problem related to what I mentioned above. I registered 10 parties and started the training, and everything went perfectly. Then I chose one party and issued STOP. After that I called start_training() again to train the model with the remaining 9 parties, leveraging the work done before (I did not stop the aggregator). However, I got an error saying it could not connect to one party, and above it printed:
(screenshot of the aggregator log)
It shows the responses of the 9 parties that remain, but it still has the 10 parties from the beginning registered. Is there a way to solve this issue?

Thanks in advance

@Yi-Zoey
Member

Yi-Zoey commented Feb 18, 2021

Hi, I see you are using IBM FL 1.0.2. Can you try upgrading to version 1.0.3 and see if the issue still exists? Thanks.

@Yi-Zoey
Member

Yi-Zoey commented Feb 18, 2021

The logic in IBM FL is that the aggregator waits until max_timeout for everyone to reply, even after the quorum is met. Therefore, if you did not set a max_timeout variable, the aggregator will keep waiting for all registered parties to reply, since it uses max_timeout to identify dropped-out parties.
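If it helps, a quick way to check whether max_timeout is set is to load config_agg.yml and look under the global hyperparameters. The exact key layout below is an assumption; compare it against the config shipped with the examples.

```python
# Hypothetical check of where max_timeout would sit in config_agg.yml
# (the 'hyperparams' / 'global' layout is an assumption).
import yaml

with open('config_agg.yml') as f:
    cfg = yaml.safe_load(f)

global_hp = cfg.get('hyperparams', {}).get('global', {})
print('num_parties:', global_hp.get('num_parties'))
print('rounds:     ', global_hp.get('rounds'))
print('max_timeout:', global_hp.get('max_timeout'))  # seconds to wait for stragglers
```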

@Enrique-Marmol
Author

Enrique-Marmol commented Mar 1, 2021

Hi again! I upgraded to version 1.0.3 and the problem persists. Reading the previous responses, I think I did not explain myself well. What I am trying to say is that I start with 10 parties, I register all of them, and then I have the aggregator start training. The training finishes successfully. Now, for instance, I disconnect party 7, party7.stop(), and it disconnects successfully. Party 7 disconnects after the training, not during the training. Finally, I have the aggregator start training again, this time with the 9 remaining parties, and that is when the issue comes up and the training cannot be completed. I would like to know whether this can be done, or whether it cannot be done at the moment.
Thanks in advance.

@Yi-Zoey
Member

Yi-Zoey commented Mar 22, 2021

Hi, can you share the config_agg.yml you are using? I just want to check the setting you selected for the quorum check.
