Available pipes

Although our publication mainly refers to one or two pipeline approaches for fine-tuning giant models on commodity hardware (mainly, the Pareto frontiers of the discussed setting), the framework we implemented (quite a while before the publication) supports training models of all sizes, for which different sweet spots of course apply.

We implemented many pipeline optimization algorithms to study the tradeoffs of DNN training with asynchronous pipeline-parallelism.

The following pipeline configurations are available:

  • stale: no staleness mitigation.

  • weight prediction (wp): {msnag, aggmsnag}

    • supported for the {sgd, adam, adamw} optimizers
    • msnag is momentum-based weight prediction (a minimal sketch follows this list)
    • aggmsnag adapts momentum-based weight prediction to gradient accumulation
  • recomputation

    • See Table 1 in the FTPipe paper for the effect on stale pipelines
  • no recomputation (nr or norecomp)

  • weight stashing (ws)

  • Gap Aware staleness mitigation (ga)

    • for the {sgd, adam, adamw} optimizers (a rough sketch follows this list)
  • scheduler-aware prediction: making the weight prediction aware of the learning-rate scheduler.

  • gradient aggregation in the pipeline (step_every)

  • combinations of most of the above: {wp, ws, ga}

Note: Weight prediction is often called msnag in the code.
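
As an illustration of what momentum-based weight prediction does, here is a minimal sketch for plain SGD with momentum. The function name, signature, and the exact extrapolation coefficient are illustrative assumptions, not this repository's actual API: the idea is to run a stale stage's forward and backward passes on weights extrapolated `staleness` optimizer steps ahead, approximating each future update by the decayed momentum buffer.

```python
import torch

@torch.no_grad()
def predict_weights(params, optimizer, staleness):
    """Return copies of `params` extrapolated `staleness` SGD-with-momentum
    steps ahead, assuming each future update is roughly the decayed momentum
    buffer (illustrative sketch, not the repository's API)."""
    lr = optimizer.param_groups[0]["lr"]
    gamma = optimizer.param_groups[0]["momentum"]
    # Expected contribution of the current momentum buffer over the next
    # `staleness` steps: gamma + gamma^2 + ... + gamma^staleness.
    coeff = sum(gamma ** i for i in range(1, staleness + 1))
    predicted = []
    for p in params:
        buf = optimizer.state.get(p, {}).get("momentum_buffer")
        if buf is None:  # no optimizer step taken yet, nothing to extrapolate
            predicted.append(p.detach().clone())
        else:
            predicted.append(p.detach() - lr * coeff * buf)
    return predicted
```

The predicted weights are only used to compute the stale stage's forward/backward pass; the real parameters are still updated by the optimizer once the delayed gradients arrive.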
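
Gap Aware mitigation can be sketched in a similarly hedged way. The normalization used here (`avg_step`, an assumed running estimate of a typical per-parameter update magnitude) is illustrative and may differ from the exact formula used in the code; the intuition is to damp each component of a stale gradient in proportion to how far the current weights have drifted from the stashed weights that produced it.

```python
import torch

@torch.no_grad()
def gap_aware_scale(grad, current_w, stashed_w, avg_step, eps=1e-12):
    """Damp a stale gradient component-wise: the larger the gap between the
    current and the stashed weights (relative to a typical step size), the
    stronger the damping. A rough sketch of the idea, not the exact formula."""
    gap = (current_w - stashed_w).abs() / (avg_step + eps)  # normalized drift
    return grad / (1.0 + gap)
```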

Fully-synchronous

  • gpipe
  • DistributedDataParallel (DDP): synchronous SGD (SSGD)
  • Sequential (seq): naive inter-layer model parallelism (multi-GPU); a minimal sketch follows this list
  • and, of course, a single GPU for small models.
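
For contrast with the pipelined variants, here is a minimal sketch of the seq baseline (layer sizes and device names are illustrative): naive inter-layer model parallelism places consecutive stages on different GPUs and moves activations between them, so without pipelining only one GPU is busy at a time.

```python
import torch.nn as nn

class TwoStageNet(nn.Module):
    """Naive two-GPU inter-layer model parallelism: stage 1 idles while
    stage 0 computes, and vice versa."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))
```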

Note: Tied weights are handled (decorated) per use-case.