
Update TensorFlow Task Runner and related workspaces #985

Open · wants to merge 26 commits into base: develop
Conversation

@kta-intel (Collaborator) commented Jun 6, 2024

Related Issue: #973

Summary:
This PR updates the TensorFlow Task Runner to use Keras as the high-level API, in line with current best practices, and updates the existing TF workspaces accordingly. This enables the use of non-legacy optimizers (the legacy optimizers will be deprecated in future versions of TF/Keras).

Specifically, this PR:

  • Creates a new TensorFlowTaskRunner class in openfl.federated.task.runner_tf, which borrows heavily from the KerasTaskRunner class. The major difference is in handling the optimizer weights, necessitated by the removal of the .get_weights() and .weights attributes from the optimizer. The new TensorFlowTaskRunner extracts weights from the .variables attribute instead
    • Also updated the train and validation task names to train_validation and task_validation to be consistent with the torch task runner
  • Archived the old TensorFlowTaskRunner as TensorFlowTaskRunner_v1 within openfl.federated.task.runner_tf and updated the __init__ files to keep it callable. The rationale is to avoid breaking changes for tutorials or upstream applications that still rely on the low-level TF task runner. It can be removed entirely in a future release as needed
    • Also updated the train and validation task names to train_validation and task_validation to be consistent with the torch task runner
  • Created a new tf_cnn_mnist workspace and updated the tf_cnn_histology workspace to run on the new TensorFlowTaskRunner using the src/dataloader.py and src/taskrunner.py convention
    • Updated to TensorFlow v2.15.1 (the latest TensorFlow that does not use Keras 3.x by default)
  • Minor updates to tf_3dunet_brats to use the new TensorFlowTaskRunner (did not change the src files because I did not have the BraTS 3D dataset to verify a larger update)
  • Minor updates to tf_2dunet to run on the archived TensorFlowTaskRunner_v1

Future work

  • Consolidation step still needed:
    • Migrate tf_2dunet from TensorFlowTaskRunner_v1 to the new TensorFlowTaskRunner
    • Migrate all Keras workspaces from KerasTaskRunner to the new TensorFlowTaskRunner and remove/archive KerasTaskRunner
  • Look into updating TensorFlowTaskRunner to run on TF v2.16+ with Keras 3.x (this may require large changes to weight handling that will likely not be backwards compatible)
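The optimizer weight handling described above can be sketched as follows. This is a hypothetical illustration of splitting a flat tensor dict into model and optimizer parts by variable name; the function and dict-key names are illustrative, not OpenFL's actual implementation.

```python
# Hypothetical sketch: model and optimizer weights travel in one flat
# {name: tensor} dict, and optimizer entries are told apart by their names.
def split_tensor_dict(tensor_dict, opt_weight_names):
    """Split a flat {name: tensor} dict into model and optimizer parts."""
    opt_weights = {name: tensor_dict[name] for name in opt_weight_names}
    model_weights = {
        name: tensor for name, tensor in tensor_dict.items()
        if name not in opt_weight_names
    }
    return model_weights, opt_weights

# Illustrative entries; "Adam/m/..." stands in for an optimizer moment slot.
tensor_dict = {
    "dense/kernel:0": [[0.1]],
    "dense/bias:0": [0.0],
    "Adam/m/dense/kernel:0": [[0.0]],
}
model_w, opt_w = split_tensor_dict(tensor_dict, ["Adam/m/dense/kernel:0"])
```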

Files with review comments:

  • openfl-workspace/tf_2dunet/src/tf_2dunet.py
  • openfl-workspace/tf_3dunet_brats/src/tf_3dunet_model.py
  • openfl/federated/__init__.py
  • openfl-workspace/tf_cnn_histology/src/dataloader.py
  • openfl-workspace/tf_cnn_histology/src/taskrunner.py
  • openfl-workspace/tf_cnn_mnist/plan/data.yaml
  • openfl-workspace/tf_cnn_mnist/plan/plan.yaml
  • openfl-workspace/tf_cnn_mnist/src/__init__.py
  • openfl-workspace/tf_cnn_mnist/src/dataloader.py
@MasterSkepticista (Collaborator) commented:
Most of my comments are nitpicks around naming/formatting and/or structure. Please disposition as you find relevant.
Also, do we need TF v1 (tf.Session API) task runners? These are ancient APIs that none of the community uses. Does OpenFL have users on this legacy TF API?

@psfoley (Contributor) commented Jun 12, 2024

> Most of my comments are nitpicks around naming/formatting and/or structure. Please disposition as you find relevant. Also, do we need TF v1 (tf.Session API) task runners? These are ancient APIs that none of the community uses. Does OpenFL have users on this legacy TF API?

@MasterSkepticista I agree the TF v1 / Legacy task runners can be removed at this point.

@kta-intel (Collaborator, Author) commented:
Thanks @MasterSkepticista @psfoley !
I will get back to addressing these comments once I'm done with some other pressing tasks. I appreciate the review

> Most of my comments are nitpicks around naming/formatting and/or structure. Please disposition as you find relevant. Also, do we need TF v1 (tf.Session API) task runners? These are ancient APIs that none of the community uses. Does OpenFL have users on this legacy TF API?

> @MasterSkepticista I agree the TF v1 / Legacy task runners can be removed at this point.

Great, I am also in agreement with removing the old TF runner. One reason I did not propose it directly in this PR is that the tf_2dunet workspace uses the old TF runner. I can update it, but since it uses the BraTS dataset, I will have to get approval before I can test it.

@teoparvanov (Collaborator) left a comment:
Looks great, @kta-intel !

Please consider my review as a learning exercise of OpenFL and TF/Keras, so do take my comments and questions from that perspective 😊

```python
for param in metrics:
    if param not in model_metrics_names:
        raise ValueError(
            f'KerasTaskRunner does not support specifying new metrics. '
```
@teoparvanov commented Jul 18, 2024:
nit - I suppose you mean TensorFlowTaskRunner here?
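A minimal, self-contained sketch of the metric-validation check quoted above (names and the metric list are illustrative): the runner only reports metrics the compiled model already tracks, so unknown metric names are rejected early.

```python
# Illustrative stand-in for self.model.metrics_names on a compiled model.
model_metrics_names = ["loss", "accuracy"]

def check_metrics(metrics):
    # Reject any metric the compiled model does not already track.
    for param in metrics:
        if param not in model_metrics_names:
            raise ValueError(
                f"TensorFlowTaskRunner does not support specifying new metrics. "
                f"Supported metrics: {model_metrics_names}"
            )

check_metrics(["accuracy"])       # already tracked by the model, passes
try:
    check_metrics(["f1_score"])   # not tracked by the model
    raised = False
except ValueError:
    raised = True
```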

Comment on lines +201 to +204

```python
model_metrics_names = self.model.metrics_names
if type(vals) is not list:
    vals = [vals]
ret_dict = dict(zip(model_metrics_names, vals))
```
I suppose TF guarantees that the metric names and values are in the same order, so we can safely zip together the two lists?
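To the reviewer's question: Keras documents that `model.evaluate()` returns values in the same order as `model.metrics_names`, which is what makes the zip safe. The following sketch (with illustrative values) also shows why the single-value normalization is needed: a model with only a loss returns a bare scalar rather than a list.

```python
# Illustrative values mimicking model.metrics_names and model.evaluate() output.
model_metrics_names = ["loss", "accuracy"]
vals = [0.3, 0.9]
if type(vals) is not list:
    vals = [vals]
ret_dict = dict(zip(model_metrics_names, vals))

# A loss-only model returns a bare scalar, hence the list normalization:
single = 0.3
if type(single) is not list:
    single = [single]
single_dict = dict(zip(["loss"], single))
```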

```python
for param in param_metrics:
    if param not in model_metrics_names:
        raise ValueError(
            f'KerasTaskRunner does not support specifying new metrics. '
```
@teoparvanov commented Jul 18, 2024:
same question as above - shouldn't this rather say TensorFlowTaskRunner?

Comment on lines +256 to +259

```python
# (docstring context from the diff:
#  with_opt_vars (bool): Specify if we also want to get the variables
#  of the optimizer)
if with_opt_vars:
    weight_names = [weight.name for weight in obj.variables]

weight_names = [weight.name for weight in obj.weights]
```
For improved readability, I'd suggest adding an explicit else: block, as in

```python
if with_opt_vars:
    ...
else:
    ...
```


@teoparvanov commented Jul 18, 2024:
Could you also document the suffix parameter? (its meaning isn't very obvious for the inexperienced OpenFL reader)

```python
opt_weights_dict = {
    name: tensor_dict[name] for name in opt_weight_names
}
self._set_weights_dict(self.model, model_weights_dict)
```
@teoparvanov commented Jul 18, 2024:

I guess this statement can also be moved outside of the if statement, as it's executed in both cases:

```python
self._set_weights_dict(self.model, model_weights_dict)
```
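The hoisting suggested in this thread can be sketched with hypothetical stubs (the function bodies and names here are illustrative, not OpenFL code): the call common to both branches moves after the conditional.

```python
calls = []

def _set_weights_dict(target, weights_dict):
    # Hypothetical stub standing in for the real method; records its calls.
    calls.append((target, tuple(sorted(weights_dict))))

def rebuild_weights(tensor_dict, with_opt_vars, opt_weight_names=()):
    model_weights_dict = {
        name: val for name, val in tensor_dict.items()
        if name not in opt_weight_names
    }
    if with_opt_vars:
        opt_weights_dict = {name: tensor_dict[name] for name in opt_weight_names}
        _set_weights_dict("optimizer", opt_weights_dict)
    # Executed in both branches, so it lives outside the if:
    _set_weights_dict("model", model_weights_dict)

rebuild_weights({"w:0": 1, "Adam/m/w:0": 2}, with_opt_vars=True,
                opt_weight_names=("Adam/m/w:0",))
```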

Comment on lines +137 to +138

```python
y_train = np.eye(num_classes)[y_train]
y_valid = np.eye(num_classes)[y_valid]
```
would this be equivalent to the following syntax (a bit more explicit IMHO)?

```python
from keras.utils import to_categorical
y_train = to_categorical(y_train, num_classes)
y_valid = to_categorical(y_valid, num_classes)
```
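The `np.eye()` trick in the diff works by indexing an identity matrix with the integer labels, which selects one-hot rows; for integer labels, `keras.utils.to_categorical` produces the same (float) array. A quick standalone demonstration with illustrative labels:

```python
import numpy as np

num_classes = 4
y = np.array([0, 2, 3, 2])

# Row i of the identity matrix is the one-hot vector for class i,
# so fancy-indexing with the labels yields the one-hot encoding.
one_hot = np.eye(num_classes)[y]

expected = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
```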

Comment on lines +122 to +124

```python
for metric in metrics:
    value = np.mean([history.history[metric]])
    results.append(Metric(name=metric, value=np.array(value)))
```
@teoparvanov commented Jul 18, 2024:
Out of curiosity, why do we take the mean value, as opposed to the latest one?

If we consider accuracy for instance, wouldn't we be interested in the accuracy at the end of the training (which would certainly be higher than the mean value)?
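The reviewer's point can be made concrete with illustrative history values: for a metric that improves over training, the mean across epochs understates the end-of-training value.

```python
import numpy as np

# Illustrative per-epoch history for an improving metric (not real results).
history = {"accuracy": [0.61, 0.74, 0.82, 0.88]}

# What the quoted code computes: the mean over all epochs.
mean_acc = float(np.mean([history["accuracy"]]))

# What the reviewer suggests considering: the final epoch's value.
last_acc = history["accuracy"][-1]
```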

Comment on lines +126 to +127

```python
y_train = np.eye(num_classes)[y_train]
y_valid = np.eye(num_classes)[y_valid]
```
@teoparvanov commented Jul 18, 2024:
same here - would it make sense to use to_categorical from keras.utils instead?


self.opt_vars = self.optimizer.variables()
print(f'optimizer vars: {self.opt_vars}')
def train_(self, batch_generator, metrics: list = None, **kwargs):
@teoparvanov commented Jul 18, 2024:

In this particular case, could we consider not overriding _train() at all? It looks like this override does practically the same as the base method, seemingly differing only in its validity checks.
