
Fluid distributed training TODO #10279

Closed
Yancey1989 opened this issue Apr 28, 2018 · 9 comments

@Yancey1989
Contributor

Yancey1989 commented Apr 28, 2018

Fluid Distributed Training Features

EDL

  • implement the master process to schedule tasks
  • etcd operator
  • implement a CRD to support Kubernetes v1.8

Support different communication libraries

  • gRPC performance enhancement
  • OpenMPI with RDMA and GPU direct
  • NCCL2 with multiple nodes
  • follow up on bRPC

Experiment

  • evaluate how different distributed training strategies (sync, async, etc.) affect model accuracy/throughput

CE

  • automatically run benchmark jobs on AWS and generate a report

Future

  • differences between multi-machine-single-device and multi-machine-multi-device
  • better integration with single-machine training
  • think about more flexible user-customized device placement for multi-machine training
  • discuss whether we need the remote executor
@panyx0718
Contributor

panyx0718 commented Apr 28, 2018

Some extras that might be worth adding:

  • distributed data reader (should be unified with the single-machine reader); see the sketch after this list
  • evaluate how different distributed training strategies (sync, async, etc.) influence model accuracy
  • sort out differences between multi-machine-single-device and multi-machine-multi-device
  • better integration with single-machine training
  • think about more flexible user-customized device placement for multi-machine training
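A minimal sketch of what a sharded distributed reader could look like on top of the existing single-machine reader interface (a no-argument callable returning a sample generator). The sharding scheme and the name `shard_reader` are only illustrative, not an existing Fluid API:

```python
# Hypothetical sketch: wrap a single-machine reader so that each trainer
# only consumes its own shard, keeping one code path for both modes.
def shard_reader(reader, trainer_id, trainer_num):
    """Round-robin sharding: trainer i keeps samples with index % trainer_num == i."""
    assert 0 <= trainer_id < trainer_num

    def sharded():
        for i, sample in enumerate(reader()):
            if i % trainer_num == trainer_id:
                yield sample

    return sharded


def demo_reader():
    for i in range(10):
        yield i


# Single-machine training is just the trainer_num=1 special case.
local_reader = shard_reader(demo_reader, trainer_id=0, trainer_num=2)
print(list(local_reader()))  # [0, 2, 4, 6, 8]
```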

@panyx0718
Contributor

Fault tolerance is a basic distributed training feature that probably shouldn't belong only to EDL.

@seiriosPlus
Collaborator

seiriosPlus commented Apr 28, 2018

Checkpointing needs to be added as a training feature.

@typhoonzero
Contributor

Maybe we should divide fault tolerance into several parts:

  • Base features
    • checkpointing and recovery on the pserver (see the sketch after this list)
    • trainers pull checkpoints from the pserver
    • recover the reader offset (requires master and etcd)
  • Clustering features
    • automatically restart failed trainers via the cluster system (Kubernetes, etc.)
    • autoscale trainers via the cluster system
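To make the base features more concrete, here is a minimal sketch of pserver-side checkpointing plus reader-offset recovery, assuming plain numpy tensors saved to a local directory. The names `save_checkpoint` and `load_latest_checkpoint` are illustrative only, not existing Fluid APIs:

```python
import json
import os

import numpy as np


# Hypothetical sketch: persist parameter tensors together with the reader
# offset so a restarted trainer can pull the checkpoint and resume.
def save_checkpoint(ckpt_dir, step, params, reader_offset):
    if not os.path.isdir(ckpt_dir):
        os.makedirs(ckpt_dir)
    np.savez(os.path.join(ckpt_dir, "params_%d.npz" % step), **params)
    meta = {"step": step, "reader_offset": reader_offset}
    with open(os.path.join(ckpt_dir, "meta_%d.json" % step), "w") as f:
        json.dump(meta, f)


def load_latest_checkpoint(ckpt_dir):
    steps = sorted(
        int(f[len("meta_"):-len(".json")])
        for f in os.listdir(ckpt_dir)
        if f.startswith("meta_") and f.endswith(".json")
    )
    if not steps:
        return None  # cold start, nothing to recover
    step = steps[-1]
    with open(os.path.join(ckpt_dir, "meta_%d.json" % step)) as f:
        meta = json.load(f)
    npz = np.load(os.path.join(ckpt_dir, "params_%d.npz" % step))
    params = {k: npz[k] for k in npz.files}
    return step, params, meta["reader_offset"]
```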

@typhoonzero
Contributor

The overall future roadmap should include the following parts:

  1. Complete the features of Fluid distributed training
  • code cleanup and polish
  • implement LARS -- @typhoonzero doing
  • pserver checkpointing
  • init trainer weights from the pserver
  • distributed lookup table
  • full overlap between the parallel executor and distributed training
  • complete async training; the pserver uses the parallel executor
  • remote executor runs a ProgramDesc (depends on "Complete Fluid")
  2. Be able to switch between communication libraries for different use cases
  • gRPC performance enhancement
  • OpenMPI with RDMA and GPU Direct
  • NCCL2 multi-node implementation
  • follow up on bRPC
  3. EDL
  • master implementation
  • etcd operators
  4. CE

@Yancey1989
Contributor Author

Thanks @panyx0718, @seiriosPlus, @typhoonzero; I have updated this issue following your comments.

@gongweibao
Contributor

gongweibao commented Apr 28, 2018

Do we need to design an abstract interface for the communication backend so that it is compatible with various implementations (a rough sketch follows this list):

  • Sync: NCCL, MPI, ...
  • Async: RPC
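A rough sketch of what such an abstraction might look like, assuming a minimal interface that both collective (sync) and RPC-style (async) backends could implement. The class and method names are illustrative only, not existing Fluid code:

```python
import abc


class CommBackend(abc.ABC):
    """Hypothetical backend interface hiding whether communication is a
    collective allreduce (NCCL2/MPI) or send/recv against pservers (RPC)."""

    @abc.abstractmethod
    def allreduce(self, grads):
        """Sync path: aggregate gradients across all trainers."""

    @abc.abstractmethod
    def send(self, name, tensor):
        """Async path: push a gradient/variable to a pserver."""

    @abc.abstractmethod
    def recv(self, name):
        """Async path: pull the updated parameter back from a pserver."""


class DummyRpcBackend(CommBackend):
    """Toy in-process 'pserver' used only to show the interface shape."""

    def __init__(self):
        self._store = {}

    def allreduce(self, grads):
        return grads  # single process: nothing to reduce

    def send(self, name, tensor):
        self._store[name] = tensor

    def recv(self, name):
        return self._store[name]
```

The trainer-side program would then only depend on this interface, so swapping gRPC for bRPC or NCCL2 becomes a configuration choice rather than a code change.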

@gongweibao
Contributor

I think there are many things to do here, and we had better tackle them with ordering, classification, and priorities.

@typhoonzero
Contributor

Closing this issue; most of the work is done except for the bRPC- and EDL-related items.
