-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New design for launch/run #40086
New design for launch/run #40086
Conversation
Thanks for your contribution! |
076107d
to
dbd20ff
Compare
"--rank", type=int, default=-1, help="the peer rank") | ||
|
||
base_group.add_argument( | ||
"--log", type=str, default="INFO", help="log level. Default INFO") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think log_level or verbose is more meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
"--np", | ||
type=str, | ||
default="1", | ||
help="the number of peers, i.e. pod/node number") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In MPI, np means the total number of processes instead of number of nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think it is better to use another name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, replace with nnodes later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, everything looks good to me. But something may needs to be improved when porting auto parallel.
"--np", | ||
type=str, | ||
default="1", | ||
help="the number of peers, i.e. pod/node number") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think it is better to use another name.
sys.exit(sigint) | ||
|
||
|
||
class Controller(ControllerBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need both Controller
and ControllerBase
as the base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ControllerBase refer to really basic workflow, Controller holds more API. Nothing but separation.
def replicas(self, replicas): | ||
self._replicas = replicas | ||
|
||
def set_replicas(self, np: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the replicas also mean the number of nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little diff, replicas refers to logical nodes number, since one nodes may deploy two or more replicas.
|
||
|
||
if __name__ == '__main__': | ||
cli = PKVClient("http://localhost:8090") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PKVClient->KVClient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never Mind, just for local testing.
return 0 | ||
|
||
def deploy(self): | ||
for i in self._init_containers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my understanding is right, the init_containers and conatiners are called by this launch module and users don't need to do the relaunch in their own scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly
|
||
assert self.client.wait_server_ready(timeout=600), 'server is not ready' | ||
|
||
# 'aaaaaa' make suer main pod (master server) as rank 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make suer > make sure ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
help="seconds to wait before elastic perform training") | ||
return parser.parse_args() | ||
|
||
def _valide_env(self, key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valide -> valid ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
return parser.parse_args() | ||
|
||
def _valide_env(self, key): | ||
if key in ['POD_IP']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
POD_IP(历史遗留)是k8s里的概念,用在这里其实并不是特别合适,是否要用一个新的名称,同时保留兼容POD_IP,后续逐步废弃POD_IP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是的,不过目前 role_maker 里面用了,目前需要兼容,后续可以考虑替换
def set_env_in_args(self): | ||
env_args = { | ||
'POD_IP': 'host', | ||
'PADDLE_MASTER': 'master', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
相同名称的环境变名称字符串分散在多个py文件中,是否需要将相关环境变量统一定义?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
目前所有除device相关的环境变量都统一在这里了,且在这里定义 args 和 env 的对应关系
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可以在check一下,"PADDLE_MASTER"�在其他地方也出现了�
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
出现的另一个 PADDLE_MASTER 并不是同一个,后者取决于 fleet api 里面的定义。launch/run 模块用到的环境变量在 context。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
# python -m paddle.distributed.run train.py | ||
|
||
use specified devices | ||
# python -m paddle.distributed.run --devices=0,1,2,3 train.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between "paddle.distributed.launch fleetrun" and "paddle.distributed.run"? Why not reuse launch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As some args may not compatible with launch, we use run in this version to avoid breaking change.
In the new design, run module is a self contained directory, which can replace launch by renaming in the stable version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
在代码里注释说明一下TODO
正式发布时需要重点review下兼容性情况。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for set_tests_properties(test_run PROPERTIES TIMEOUT 200)
PR types
New features
PR changes
Others
Describe
This pr add new launch method for paddle, including elastic support.
More details in python/paddle/distributed/run/init.py