Support Distribute Lookup Table #9211
Sorry, I tried to understand more about the distributed lookup table but still have some questions; maybe it's because I don't understand the full picture. In our ListenAndServeOp, we already have the lookup table functionality: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/listen_and_serv_op.cc#L119
Hi @helinwang
You're right; for the plan with Fluid, we don't need to do much more work with ListenAndServeOp.
@Yancey1989 Thank you!
Project
https://github.com/PaddlePaddle/Paddle/projects/56
Tasks
- Add distributed lookup table design (with Abacus): #9075
- Add design doc for lookup remote table in Fluid: #9068
- Support empty tensor: #9338
Operators
- prefetch_op: gets values from the pserver by ids and outputs a SelectedRows as the parameter for lookup_table_op. @jacquesqiao: use split_ids_op -> prefetch_op -> concat_op to compose a prefetch_op (see the sketch after this list).
- sum_op (used on the pserver to sum the split gradients before sgd_op; see Transpilers below).
- lookup_table_op: this op should take its parameter (SelectedRows) from prefetch_op. When using prefetch, we should remove the initialize_op for its parameter W. (Lookup table support selected rows as parameter #9575)
- sgd_op: apply the gradient (SelectedRows) to the table parameter (SelectedRows). (Sgd support update selected rows #9597)
- distribute_table_initialize_op: should initialize a shard of SelectedRows on the parameter server by shard_id. In the future, it may need to read the parameter from a distributed file system. (Initialize large table value randomly #9787)
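To make the prefetch composition concrete, here is a minimal, runnable model of the split_ids_op -> prefetch_op -> concat_op dataflow in plain Python with numpy. This is not the Fluid API: the sharding rule (`id % num_shards`), `shard_rows`, and `EMB_DIM` are assumptions for illustration only.

```python
# Model of split_ids_op -> prefetch_op -> concat_op (plain Python + numpy,
# NOT the Fluid API; names and the sharding rule are hypothetical).
import numpy as np

EMB_DIM = 4
NUM_SHARDS = 2
# Each pserver shard holds a slice of the table, keyed by row id.
shard_rows = [dict() for _ in range(NUM_SHARDS)]
for i in range(10):
    shard_rows[i % NUM_SHARDS][i] = np.full(EMB_DIM, float(i))

def split_ids(ids, num_shards):
    """split_ids_op: route each id to the shard that owns it (id % num_shards)."""
    buckets = [[] for _ in range(num_shards)]
    for pos, i in enumerate(ids):
        buckets[i % num_shards].append((pos, i))
    return buckets

def prefetch(bucket, shard):
    """prefetch_op: fetch the requested rows from one pserver shard."""
    return [(pos, shard[i]) for pos, i in bucket]

def concat(parts, n):
    """concat_op: restore the original order of the looked-up rows."""
    out = np.zeros((n, EMB_DIM))
    for part in parts:
        for pos, row in part:
            out[pos] = row
    return out

ids = [3, 0, 7, 3]
buckets = split_ids(ids, NUM_SHARDS)
parts = [prefetch(b, s) for b, s in zip(buckets, shard_rows)]
print(concat(parts, len(ids)))  # rows for ids 3, 0, 7, 3, in their original order
```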
Sparse Table
- Support an auto-grown sparse table and lookup of nonexistent keys (a sketch follows).
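A hedged sketch of the auto-grown semantics: looking up a nonexistent key creates the row on the fly instead of failing, and the SGD step touches only the looked-up rows, in the spirit of sgd_op on SelectedRows. This is plain Python with numpy, not pserver code; the class and method names are hypothetical.

```python
# Sketch of an auto-grown sparse table with SelectedRows-style sparse update.
import numpy as np

class SparseTable:
    def __init__(self, emb_dim, seed=0):
        self.emb_dim = emb_dim
        self.rows = {}                      # id -> embedding row
        self.rng = np.random.RandomState(seed)

    def lookup(self, ids):
        # Auto-grow: initialize a missing row randomly instead of failing.
        for i in ids:
            if i not in self.rows:
                self.rows[i] = self.rng.uniform(-0.1, 0.1, self.emb_dim)
        return np.stack([self.rows[i] for i in ids])

    def sgd_update(self, ids, grads, lr=0.1):
        # Sparse update: only the rows named by `ids` change.
        for i, g in zip(ids, grads):
            self.rows[i] -= lr * g

table = SparseTable(emb_dim=4)
emb = table.lookup([42, 7, 42])             # keys 42 and 7 are created on first use
table.sgd_update([42, 7], np.ones((2, 4)))
print(sorted(table.rows))                    # [7, 42]
```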
Transpilers (Dist transpiler support prefetch #9714)
- Replace lookup_table_op with split_ids_op -> prefetch_op -> concat_op (a toy model of this rewrite follows the list).
- Add split_ids_op -> send_vars_op to split table@grad and send the pieces to the pserver.
- Add table_optimize_block [sum(splited_grad) -> sgd_op] to the pserver_program.
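A toy model of the rewrite described above. The real dist transpiler works on Fluid program descs, not on lists of dicts; `transpile_prefetch` and the variable names are hypothetical, chosen only to show the shape of the transformation.

```python
# Toy model of the transpiler pass: replace each lookup_table op with
# split_ids -> prefetch -> concat. Ops are plain dicts for illustration.
def transpile_prefetch(ops):
    out = []
    for op in ops:
        if op["type"] == "lookup_table":
            ids, result = op["ids"], op["out"]
            out.append({"type": "split_ids", "ids": ids, "out": "ids@shards"})
            out.append({"type": "prefetch", "ids": "ids@shards", "out": "rows@shards"})
            out.append({"type": "concat", "in": "rows@shards", "out": result})
        else:
            out.append(op)
    return out

prog = [{"type": "lookup_table", "ids": "Ids", "out": "Emb"},
        {"type": "mul", "in": "Emb", "out": "FC"}]
for op in transpile_prefetch(prog):
    print(op["type"])   # split_ids, prefetch, concat, mul
```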
Problems with the current design
- Problem: all prefetch inputs and outputs must share the same variables, because there is only one prefetch thread block and one prefetch op on the pserver, which has to take a single input and output. As a result, the split_ids_op -> prefetch_op -> concat_op sets must be executed one by one and cannot run in parallel, and the dist transpiler needs a lot of code to insert and delete ops.
- Solution: a better solution may be to have only one prefetch_op and prefetch_grad_op that do not depend on Variables, but use some internal data structure to communicate with the pserver.
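One way to read that proposal, sketched under assumptions: each prefetch call is an independent request carried by an internal structure (here, a thread pool returning futures) rather than a shared scope variable, so lookups can overlap freely. All names here are hypothetical and this is not a design commitment.

```python
# Sketch: one prefetcher whose requests go through an internal structure
# (futures) instead of shared Variables, allowing parallel prefetch.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

TABLE = {i: np.full(4, float(i)) for i in range(100)}  # stand-in for a pserver shard

class Prefetcher:
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def prefetch(self, ids):
        # Each call is an independent request; nothing is written into a
        # shared variable, so concurrent calls do not serialize on state.
        return self.pool.submit(lambda: np.stack([TABLE[i] for i in ids]))

p = Prefetcher()
futures = [p.prefetch(ids) for ids in ([1, 2], [3, 4], [5, 6])]  # issued concurrently
print([f.result().shape for f in futures])  # [(2, 4), (2, 4), (2, 4)]
```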