Data Augmentation for Graph Neural Networks using Gromov-Wasserstein Distances and Graphon Inference


This repository contains code for augmenting datasets for graph neural networks. The code currently supports topology-based graph classification problems.

This code was developed as the final project for ELEC573: Network Science and Analytics, offered at Rice University.

Click on the image to see the video of the project presentation:
[image]

Project Authors:

Timofey Efimov, Robbie Kenworthy

About The Project

Data augmentation is a common technique used in machine learning to increase the size and diversity of a dataset. However, it can be more complicated to perform data augmentation on graph datasets, as graph neural networks operate on non-Euclidean structures. In this paper, we propose a data augmentation model that uses graphons to generate new graphs and Gromov-Wasserstein distance to measure the similarity between the original and newly generated graphs. This approach allows us to augment our graph dataset in a way that preserves the structure and relationships within the data.

Data Description

Dataset Link

The dataset we selected contained approximately 200,000 Reddit threads, and the task was to distinguish whether each thread was discussion-based or not. The structure of these graphs is based on replies to the initial post. Discussion posts often have much longer chains of comments, as people reply to one another in a dialog, while non-discussion posts are much more centralized around the initial post. These distinctive structures make for a much easier graph classification problem, which also lets us gain insight into the effectiveness of our graph sampling technique.

The binary classification can be represented as follows:

Input for the binary graph classification problem:
[image]

Result of binary graph classification:
[image]

Example of a graph from the Reddit dataset:
[image]

Example of a newly generated graph, 10 times larger and similar in structure to the previous one:
[image]

Graph Neural Network

Neural Network Structure:
[image]

To provide context, this structure defines the feature dimensionality at each iteration; it is not the entire graph neural network architecture. Each layer is a graph convolutional layer with a trainable skip connection.
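The repository defines the network in its own code; the following is only a minimal numpy sketch of one graph convolutional layer with a trainable skip connection. The function and parameter names are assumptions, and the update rule follows the common GCN formulation with an added skip term:

```python
import numpy as np

def gcn_layer_with_skip(A, H, W, W_skip):
    """One graph convolutional layer with a trainable skip connection:
    H_out = ReLU(A_hat @ H @ W + H @ W_skip), where A_hat is the
    symmetrically normalized adjacency matrix with self-loops."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # D^(-1/2)
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Neighborhood aggregation plus the skip term, followed by ReLU.
    return np.maximum(A_hat @ H @ W + H @ W_skip, 0.0)
```

The skip term `H @ W_skip` lets each node keep a transformed copy of its own features independently of the neighborhood averaging.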

Algorithm Description

The main idea behind this data augmentation technique is to separate a small group of graphs from the original dataset and, for each graph in the group, infer a graphon. Then, by taking a linear combination with equal weights, we find the "average" graphon for the group. In our case, we defined a group as all graphs with the same number of nodes.
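This step can be sketched as follows, assuming a simple step-function (block-averaged) graphon estimator over degree-sorted adjacency matrices; the function names and the block-count parameter `k` are illustrative, not the repository's actual API:

```python
import numpy as np

def infer_graphon(A, k):
    """Step-function graphon estimate: sort nodes by decreasing degree,
    split them into k blocks, and record the edge density per block pair."""
    order = np.argsort(-A.sum(axis=1), kind="stable")
    A = A[np.ix_(order, order)]
    blocks = np.array_split(np.arange(A.shape[0]), k)
    W = np.empty((k, k))
    for i, bi in enumerate(blocks):
        for j, bj in enumerate(blocks):
            W[i, j] = A[np.ix_(bi, bj)].mean()
    return W

def average_graphon(graphs, k):
    """Equal-weight linear combination of the per-graph graphon
    estimates for one group (e.g., all graphs with the same node count)."""
    return np.mean([infer_graphon(A, k) for A in graphs], axis=0)
```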

Once the group graphon is found, we sample graphs with three times the original number of nodes. The number of samples is a variable that can be tuned in the code. Among these samples, using Gromov-Wasserstein distances, we find those closest to the stochastic block model that represents the "average" graphon.
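Sampling from a step-function graphon can be sketched like this (a hypothetical helper, not the repository's code; the Gromov-Wasserstein ranking step that follows it is omitted here):

```python
import numpy as np

def sample_from_graphon(W, n, rng):
    """Sample an n-node graph from a k-block step-function graphon W:
    each node draws a uniform latent block index, and each pair of nodes
    is connected independently with probability W at the pair's block
    coordinates."""
    k = W.shape[0]
    u = rng.integers(0, k, size=n)        # latent block of each node
    P = W[np.ix_(u, u)]                   # pairwise edge probabilities
    A = (rng.random((n, n)) < P).astype(int)
    A = np.triu(A, 1)                     # keep one triangle, no self-loops
    return A + A.T                        # symmetrize

# To triple the node count as described above, one could call, e.g.:
#   samples = [sample_from_graphon(W_avg, 3 * n_orig, rng) for _ in range(m)]
# and then keep the samples with the smallest Gromov-Wasserstein distance
# to the SBM representing the "average" graphon.
```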

A more precise, mathematical description of the algorithm is below:

[image]

[image]

It is important to note that, for graphon inference, adjacency matrices are sorted by node degree so that the inferred graphons can be meaningfully combined (aligned). An example graphon, inferred from a Barabási-Albert graph with 1000 nodes and 600 preferentially attached nodes, is shown below:
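The degree-based canonical ordering can be sketched in a few lines (a hypothetical helper, following the description above):

```python
import numpy as np

def degree_sort(A):
    """Permute an adjacency matrix so that nodes appear in decreasing
    degree order. Applying this before inference gives every graph a
    comparable node ordering, so the inferred graphons can be averaged
    entrywise (i.e., aligned)."""
    order = np.argsort(-A.sum(axis=1), kind="stable")
    return A[np.ix_(order, order)]
```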

[image]

Built With

Python
