Hi
I'm trying to measure the performance of each worker independently in a multi-machine cluster where all workers train the same model. My goal is to record each worker's own training performance.
With every setup and config I try, I get the same time for all workers (probably because of synchronization). So even if one of my workers is a machine that is 4x faster, it still records the same time as the slowest machine in the cluster.
Does anyone have an idea how I can do that?
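The symptom can be reproduced outside any training framework, which may help narrow down what to measure. The following is a minimal sketch under stated assumptions, not the actual training setup: plain Python threads stand in for workers, `time.sleep` stands in for the local forward/backward pass, and `threading.Barrier` stands in for whatever per-step synchronization the real cluster performs. The function name `run_workers` and its parameters are hypothetical. It times the local compute separately from the full step, so the wait for the slowest worker is excluded from the compute figure.

```python
import threading
import time


def run_workers(compute_seconds, steps=3):
    """Simulate synchronous training.

    compute_seconds[i] is the (hypothetical) local compute time per step
    of worker i. Returns a list of (compute_total, wall_total) per worker.
    """
    n = len(compute_seconds)
    barrier = threading.Barrier(n)  # stand-in for the per-step synchronization
    results = [None] * n

    def worker(rank):
        compute_total = 0.0
        wall_start = time.perf_counter()
        for _ in range(steps):
            t0 = time.perf_counter()
            time.sleep(compute_seconds[rank])  # stand-in for local compute
            compute_total += time.perf_counter() - t0  # stop timing BEFORE syncing
            barrier.wait()  # every worker blocks here until the slowest arrives
        wall_total = time.perf_counter() - wall_start
        results[rank] = (compute_total, wall_total)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Worker 0 is roughly 4x faster than worker 1 in this simulation.
    for rank, (compute, wall) in enumerate(run_workers([0.01, 0.04])):
        print(f"worker {rank}: compute={compute:.3f}s wall={wall:.3f}s")
```

In this simulation the wall-clock totals come out nearly identical for both workers, while the compute totals differ. That matches the behaviour described above when the timer wraps the whole synchronized step. Whether the same split is possible in a real cluster depends on where the framework performs its synchronization, which this sketch does not model.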