
[AIR] Add distributed torch_geometric example #23580
Merged: 13 commits into ray-project:master, Apr 21, 2022

Conversation

amogkam (Contributor) commented Mar 30, 2022

Add example for distributed pytorch geometric (graph learning) with Ray AIR

This only showcases distributed training, but with data small enough that it can be loaded in by each training worker individually. Distributed data ingest is out of scope for this PR.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


# Disable distributed sampler since the train_loader has already been split above.
train_loader = train.torch.prepare_data_loader(train_loader, add_dist_sampler=False)
Reviewer (Contributor):

Dumb question: why do we do the split separately instead of combining it into prepare_data_loader?

amogkam (Author) replied Apr 4, 2022:

You need to use torch geometric's NeighborSampler to sample subgraphs from the overall graph, instead of the standard DistributedSampler.

return x.log_softmax(dim=-1)

@torch.no_grad()
def inference(self, x_all, subgraph_loader):
Reviewer (Contributor):

Is this planned to be used for predictor impl?

amogkam (Author):

Eventually yes, but the challenge for prediction is how to add "fresh data" to the graph to do inference on.

scaling_config={"num_workers": num_workers, "use_gpu": use_gpu},
)
result = trainer.fit()
print(result.metrics)
Reviewer (Contributor):

what does prediction look like?

amogkam (Author):

Prediction is not supported for now; we need to be able to add "fresh data" to the existing graph and then re-run the inference algorithm on the new data.

@@ -8,3 +8,10 @@ tblib
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.9.0+cu111
torchvision==0.10.0+cu111

-f https://data.pyg.org/whl/torch-1.9.0+cu111.html
Reviewer (Contributor):

curious, what is this for?

amogkam (Author):

These are required dependencies for pytorch geometric
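For context, PyG's extra wheel index is typically paired with its compiled companion packages in the requirements file; a hedged sketch (the exact package set used by this example may differ):

```
# Wheels for the compiled PyG companion packages live on this index.
-f https://data.pyg.org/whl/torch-1.9.0+cu111.html
torch-scatter
torch-sparse
torch-geometric
```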

self.convs.append(SAGEConv(hidden_channels, out_channels))

def forward(self, x, adjs):
for i, (edge_index, _, size) in enumerate(adjs):
Reviewer (Member):

can you comment a bit about the format of this adjs matrix?
especially 1. what does size mean in this context? and 2. how do we make sure there are always enough hidden layers to handle the adjacency links in adjs?

amogkam (Author):

Added a comment here, but more information is in the torch geometric docs.

For 2, we pass a sizes list to the NeighborSampler, so the length of this list should match the number of layers in the model.
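That contract can be sketched with a minimal stand-in (plain Linear layers substitute for SAGEConv; each adj mimics NeighborSampler's (edge_index, e_id, size) output, all values hypothetical):

```python
import torch
import torch.nn as nn

num_layers = 2  # must equal len(sizes) passed to NeighborSampler
convs = nn.ModuleList([nn.Linear(8, 8) for _ in range(num_layers)])

def forward(x, adjs):
    # One (edge_index, e_id, size) entry per hop/layer; size is
    # (num_source_nodes, num_target_nodes) for that hop.
    assert len(adjs) == num_layers
    for i, (edge_index, _, size) in enumerate(adjs):
        x = x[: size[1]]        # target nodes come first in the batch
        x = convs[i](x)         # stand-in for SAGEConv aggregation
        if i != num_layers - 1:
            x = x.relu()
    return x

# Hypothetical adjs: 6 source nodes -> 4 targets, then 4 -> 2 targets.
adjs = [(None, None, (6, 4)), (None, None, (4, 2))]
out = forward(torch.randn(6, 8), adjs)
```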

x = F.relu(x)
xs.append(x.cpu())

x_all = torch.cat(xs, dim=0)
Reviewer (Member):

Looks a bit weird to me; I think I am just clueless.
If we overwrite the entire x_all here, we will only have features for the nodes that we scored with the last layer.
It feels more appropriate to update the corresponding entries of x_all from xs, rather than simply assigning x_all = ...?

amogkam (Author):

This is just the inference code so no weights updating. I think this works the same way as a standard feed-forward neural network. We only want the output of the last layer, and we don't care about the hidden states during inference.

Reviewer (Member):

Ok, I understand this now. subgraph_loader actually samples a subgraph for every single node in the graph.
So if there are n nodes in the graph, the inner loop will run n times; each time, we are essentially aggregating data from all neighboring nodes into this specific node.
So at the end, torch.cat(xs) gives us a new updated graph, since xs will contain data for every single node at that point.
Interesting design.
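The loop described above can be sketched without PyG (a toy stand-in: plain Linear layers, and simple batching over all nodes in place of subgraph_loader; all shapes are illustrative):

```python
import torch
import torch.nn as nn

num_nodes, in_dim, hid_dim, out_dim = 8, 4, 6, 3
layers = nn.ModuleList([nn.Linear(in_dim, hid_dim),
                        nn.Linear(hid_dim, out_dim)])

@torch.no_grad()
def inference(x_all, batch_size=3):
    for i, layer in enumerate(layers):
        xs = []
        # Stand-in for subgraph_loader: visit every node, in batches.
        for start in range(0, num_nodes, batch_size):
            x = layer(x_all[start:start + batch_size])
            if i != len(layers) - 1:
                x = x.relu()
            xs.append(x.cpu())
        # Overwriting x_all is fine: after this inner loop it holds the
        # current layer's output for ALL nodes, ready for the next layer.
        x_all = torch.cat(xs, dim=0)
    return x_all

out = inference(torch.randn(num_nodes, in_dim))
```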

@richardliaw richardliaw added this to the Ray AIR milestone Apr 8, 2022
Comment on lines +11 to +12

-f https://data.pyg.org/whl/torch-1.9.0+cu111.html
Reviewer (Contributor):

Do we need to make these changes to requirements_dl.txt (line 6 above)?

amogkam (Author):

Since there's only a GPU test, I think it should be fine for now

Reviewer (Contributor):

Oh but doesn't that make the instruction in line 6 no longer true? Do we actually want these to be in CPU docker as well? Alternative solution: move these above that line.

amogkam (Author):

Updated the comment to reflect the changes!

@@ -36,6 +36,6 @@
conditions: ["RAY_CI_ML_AFFECTED"]
commands:
- cleanup() { if [ "${BUILDKITE_PULL_REQUEST}" = "false" ]; then ./ci/travis/upload_build_info.sh; fi }; trap cleanup EXIT
- DATA_PROCESSING_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 ./ci/travis/install-dependencies.sh
- DATA_PROCESSING_TESTING=1 TRAIN_TESTING=1 TUNE_TESTING=1 PYTHON=3.7 ./ci/travis/install-dependencies.sh
Reviewer (Contributor):

For my learning, is this needed?

amogkam (Author):

Torch geometric does not support python 3.6.

We could make a separate build just for 3.7, but I thought it would be better to upgrade everything to 3.7, since that's what we currently do for Tune anyway.

Reviewer (Contributor):

Oh wait isn't the default value 3.7?

amogkam (Author):

No it's 3.6 I believe.

Reviewer (Contributor):

Ah I believe it was updated for GPU images here

But similar to my comment on that PR, having it explicit makes sense (in case we change default version in the future)

amogkam (Author):

Ah, got it. Actually, earlier versions of torch geometric do support Python 3.6, but the later versions don't. In any case, it's fine to have this be explicit.

@amogkam amogkam requested a review from matthewdeng April 8, 2022 20:03
@amogkam amogkam added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 20, 2022
@amogkam amogkam merged commit 732175e into ray-project:master Apr 21, 2022
@amogkam amogkam deleted the torch-geometric-examples branch April 21, 2022 16:48
gjoliver (Member) left a comment:

sorry about the delay, have a few minor questions/comments.

@@ -504,7 +504,8 @@ def _wait_for_batch(self, item):
# the tensor might be freed once it is no longer used by
# the creator stream.
for i in item:
    if isinstance(i, torch.Tensor):
        i.record_stream(curr_stream)
Reviewer (Member):

can you comment what may show up here as well, and why you need this if statement now?

amogkam (Author) replied Apr 21, 2022:

The pytorch dataloader can actually return a batch of anything. In all of our examples and tests so far, our data loaders return batches of tensors, but in this case the torch geometric data loader also returns the batch size, node ids, etc., which are not all tensors.
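A small sketch of why the guard is needed (hypothetical mixed batch, mimicking what a PyG loader yields; record_stream itself is CUDA-only, so here we only track which elements would qualify):

```python
import torch

def tensors_in_batch(item):
    # Only torch.Tensor elements support record_stream(); skip ints,
    # node-id lists, and other metadata the loader may return.
    return [i for i in item if isinstance(i, torch.Tensor)]

# Mixed batch like (batch_size, n_id, adjs) from a PyG NeighborSampler.
batch = (32, torch.arange(4), [(None, None, (4, 2))])
qualifying = tensors_in_batch(batch)
```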

# Use 10% of nodes for validation and 10% for testing.
fake_dataset = FakeDataset(transform=RandomNodeSplit(num_val=0.1, num_test=0.1))

def gen_dataset():
Reviewer (Member):

Feels a little unnecessary.
Why don't we simply return fake_dataset here, and below in the configuration say "dataset_fn": gen_fake_dataset?

amogkam (Author):

Good point 😅. Made a follow up PR here #24080!
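The reviewer's suggestion might look like the following hypothetical sketch (gen_fake_dataset and the config key are illustrative names, not the PR's actual code):

```python
# A top-level factory is cheap to ship to workers; each training worker
# then materializes its own copy of the dataset inside the train loop.
def gen_fake_dataset():
    # Stand-in for FakeDataset(transform=RandomNodeSplit(...)).
    return list(range(10))

train_loop_config = {"dataset_fn": gen_fake_dataset}

# Inside the per-worker training function:
dataset = train_loop_config["dataset_fn"]()
```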

def inference(self, x_all, subgraph_loader):
for i in range(self.num_layers):
xs = []
for batch_size, n_id, adj in subgraph_loader:
Reviewer (Member):

Actually, reading this again now, I am still a bit curious how a user should use this inference call.
This will only work if subgraph_loader iterates through all nodes in a graph, so:

  1. how does a user construct such a subgraph loader?
  2. is it really a common case that someone would want to score an entire graph?

amogkam (Author):

I think the intent is to use this just for validation and testing, not for actual live predictions.

We will need to figure out the inference/prediction story later. This was copied over from the torch geometric example, but let me rename it to "test" to make this clearer.
