ICC workshop '20 | DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs #283

Open
jasperzhong opened this issue Mar 15, 2022 · 3 comments
jasperzhong commented Mar 15, 2022

https://arxiv.org/pdf/2010.05337

@jasperzhong jasperzhong added framework open source gnn labels Mar 15, 2022
@jasperzhong jasperzhong self-assigned this Mar 15, 2022

jasperzhong commented Mar 15, 2022

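This is very different from the large-model training covered earlier. Distributed GNN training is hard mainly because the graph is too large and the vertices have very complex dependencies on one another, whereas in traditional training the samples are independent of each other. So the challenge here is how to ...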

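The solution does not look very complicated. First the trainer issues an RPC asking the sampler to sample, and gets back a sampled subgraph. The trainer then pulls the corresponding node features from the KV store that holds them, and runs data-parallel training. See the figure below.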

image
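The three steps above (sample via RPC, pull features, train) can be sketched as follows. This is a minimal single-process sketch, not DistDGL's API: `ToySampler`, `ToyKVStore`, and `make_minibatch` are hypothetical names, and the "RPC" is just a local method call.

```python
import random


class ToySampler:
    """Hypothetical stand-in for a remote sampler reached via RPC."""

    def __init__(self, adjacency):
        # adjacency maps a vertex to the list of its in-neighbors
        self.adjacency = adjacency

    def sample_neighbors(self, seeds, fanout, rng):
        # Return a one-hop sampled subgraph as a list of (src, dst) edges.
        edges = []
        for dst in seeds:
            neighbors = self.adjacency.get(dst, [])
            picked = rng.sample(neighbors, min(fanout, len(neighbors)))
            edges.extend((src, dst) for src in picked)
        return edges


class ToyKVStore:
    """Hypothetical stand-in for the store holding node features."""

    def __init__(self, features):
        self.features = features

    def pull(self, vertex_ids):
        return {v: self.features[v] for v in vertex_ids}


def make_minibatch(seeds, sampler, kvstore, fanout, rng):
    # Step 1: ask the sampler for a sampled subgraph (an RPC in the real system).
    edges = sampler.sample_neighbors(seeds, fanout, rng)
    # Step 2: pull features for every vertex that appears in the subgraph.
    needed = sorted(set(seeds) | {src for src, _ in edges})
    feats = kvstore.pull(needed)
    # Step 3 (not shown): feed (edges, feats) to the data-parallel training step.
    return edges, feats
```

Each trainer would run this loop on its own seed vertices and synchronize gradients afterwards, which is what makes the scheme data-parallel.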

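The graph is first partitioned into multiple subgraphs that are stored on the individual machines, and the vertex/edge features are partitioned along with it. Each machine has a graph sampler responsible for sampling from the subgraph on that machine.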

I see, so sampling is basically done locally.

graph partitioning
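The goal of the partitioning algorithm is to minimize the number of cross-partition edges. It is run once, up front. For each cross-partition edge, the vertices at both ends are copied. So across the whole system each edge exists exactly once, while vertices may be duplicated. The duplicated vertices are called HALO vertices, and the rest are called core vertices. An illustration is below.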

image
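To make the core/HALO distinction concrete, here is a small sketch that derives both sets from an edge list and a vertex-to-partition assignment. It assumes each edge is stored in the partition that owns its destination vertex; that rule, and the name `build_partitions`, are assumptions of this sketch rather than something stated in these notes.

```python
def build_partitions(edges, owner):
    """Split a graph into partitions with core and HALO vertices.

    edges: list of (src, dst) pairs.
    owner: dict mapping each vertex to its partition id.
    Assumption: an edge lives in the partition that owns its destination.
    """
    parts = {p: {"core": set(), "halo": set(), "edges": []}
             for p in set(owner.values())}
    for v, p in owner.items():
        parts[p]["core"].add(v)
    for src, dst in edges:
        p = owner[dst]
        parts[p]["edges"].append((src, dst))  # every edge is stored once
        if owner[src] != p:
            # The source is owned elsewhere, so keep a HALO copy here.
            parts[p]["halo"].add(src)
    return parts
```

On a 4-cycle split into two halves, each partition ends up with two core vertices and one HALO vertex, and the edge lists of the two partitions together contain every edge exactly once.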

One issue with partitioning the graph is load balancing. They formulate it as a multi-constraint partitioning problem, but do not formalize it.

After the graph is partitioned, the vertex features and edge features are partitioned along with it. However, the features of HALO vertices are not duplicated. As a result, no vertex features or edge features are duplicated anywhere.

Distributed KV-Store
Within a machine it uses shared memory for IPC. Cross-machine communication is still needed.
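A rough sketch of how such a store might route reads: features live only in the shard of the owning partition, so a pull is either a local read (shared memory in the real system) or a remote request. `ToyDistKVStore` and its counters are hypothetical and run in a single process; nothing here is DistDGL's actual interface.

```python
class ToyDistKVStore:
    """Hypothetical sharded feature store, simulated in one process."""

    def __init__(self, owner, features, local_part):
        self.owner = owner            # vertex -> owning partition id
        self.local_part = local_part  # partition of the calling trainer
        # Each feature is stored exactly once, in its owner's shard.
        self.shards = {}
        for v, feat in features.items():
            self.shards.setdefault(owner[v], {})[v] = feat
        self.local_reads = 0
        self.remote_reads = 0

    def pull(self, vertex_ids):
        out = {}
        for v in vertex_ids:
            p = self.owner[v]
            if p == self.local_part:
                self.local_reads += 1   # would be a shared-memory read
            else:
                self.remote_reads += 1  # would be a network request
            out[v] = self.shards[p][v]
        return out
```

Pulling the feature of a HALO vertex falls into the remote branch, which is why cross-machine traffic remains even when sampling is local.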

Distributed Sampler
The trainer sends requests to the sampler via RPC. The sampler's work can overlap with the trainer's training. Neat. This requires the RPC to be asynchronous.
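The overlap can be illustrated with a simple prefetch loop: while the trainer works on batch i, the request for batch i+1 is already in flight. This sketch uses a thread pool in place of an asynchronous RPC; `train_with_prefetch`, `sample_fn`, and `train_fn` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def train_with_prefetch(batches, sample_fn, train_fn):
    """Overlap sampling of the next batch with training on the current one."""
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Issue the first sampling request before training starts.
        pending = pool.submit(sample_fn, batches[0])
        for next_seeds in batches[1:]:
            current = pending.result()                    # wait for batch i
            pending = pool.submit(sample_fn, next_seeds)  # request batch i+1
            results.append(train_fn(current))             # train on batch i
        results.append(train_fn(pending.result()))
    return results
```

With a blocking RPC the trainer would sit idle during every sampling call; submitting the next request before training is what hides the sampling latency.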

Sampling should only be applied to core vertices.

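One thing I am unsure about: the cross-partition parts of the graph seem hard to learn, since a partition extends by at most one node (the HALO vertex) beyond its boundary.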

Mini-batch Trainer

I think I get it now. Still, I do not quite understand why they do not just assign samples to machines in a balanced way up front.

image


jasperzhong commented Mar 16, 2022

image

Linear scalability
image

Convergence is not affected.
image

They ran an ablation study on METIS. It looks like load balancing matters a lot.
image

@jasperzhong jasperzhong added the rating (5/5) must read label Mar 31, 2022

yzh119 commented Apr 5, 2022

There is also a later v2 version: https://arxiv.org/pdf/2112.15345.pdf

@jasperzhong jasperzhong reopened this Apr 6, 2022