Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pyos] The representaton of self-loops in undirected graphs #398

Open
szhorvat opened this issue Mar 7, 2023 · 15 comments
Open

[pyos] The representaton of self-loops in undirected graphs #398

szhorvat opened this issue Mar 7, 2023 · 15 comments
Labels
discussion Discussing a topic with no specific actions yet enhancement Improve existing functionality or make things work better io Data input, output, and conversions

Comments

@szhorvat
Copy link

szhorvat commented Mar 7, 2023

I noticed that when converting an undirected and unweighted graph from NetworkX, each self loop is represented by 1 (and not 2) on the diagonal. Example:

g = nx.Graph([(0,1), (1,1)])
gb.io.from_networkx(g)

The result is this matrix:

0 1
1 1

I would have expected:

0 1 
1 2

Adding self-loops once to the diagonal is not incorrect per se, but it tends to be mathematically inconvenient as many identities that people expect to hold will be broken. For example, do we consider an undirected graph to be equivalent to a directed one where each undirected edge was replaced by two reciprocal directed ones? That would no longer be the case. Are the degrees obtained as row or column sums? That does not work. Is the number of edges in an undirected graph half the sum of the matrix elements? No. Consider a random walk process on the graph, which is the underlying model for many network analysis concepts such as PageRank. Can the walker traverse a self-loop in both directions? Adding one to the diagonal implies no.

One can certainly build a consistent system while considering self-loops only once in undirected graphs, but the outcome tends to be a bit less intuitive, and a bit harder to understand for people who just want to do network analysis. Several of the network analysis libraries are unfortunately inconsistent: some analysis functions implicitly consider self-loops once, some others twice. Thus be very careful with following others' example.

I would suggest considering this choice very carefully, and making sure that the systems you design (e.g. graphblas-algorithms) behave consistently and intuitively.

pyOpenSci/software-submission#81

@eriknw
Copy link
Member

eriknw commented Mar 7, 2023

Yeah, this is worth attention, thanks @szhorvat. Right now we defer to networkx.to_scipy_sparse, but they "solved" this problem by adding to the documentation and providing a recipe for how to double the diagonals. Their recipe involves creating a dense array, which is not so great.

So, at the very minimum, we should document how self-edges are handled in different cases. We should also consider adding a keyword argument to control behavior and changing the default behavior.

See NetworkX discussion about this here: networkx/networkx#1078
And their docs here: https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.to_scipy_sparse_array.html

As for how we handle self-edges in graphblas-algorithms... this has been a big question mark. Our current plan is to match networkx, and to create tests that run and compare against networkx to help us determine what networkx does.

@szhorvat
Copy link
Author

szhorvat commented Mar 7, 2023

Does the GraphBLAS standard say anything about the interpretation of symmetric matrices as undirected graphs? Or does it sidestep the problem by effectively treating everything as directed? I'm new to GraphBLAS, but I guess the latter?

Be careful about consistency, and with adding converters to/from other network packages (or other network packages implementing converters themselves). igraph and graph-tool both use 2. Delegating to the adjacency matrix function of each package would be problematic.

NetworkX uses 1 but one may argue that this is inconsistent and if you do so, you should also define "degree" as the number of incident edges. So this graph would have degrees (1, 3, 1):

image

But that's not how NetworkX handles degrees, it counts edge endpoints, as is typical in most textbooks. So the degrees it returns are (1, 4, 1). Then it would make sense for the adjacency matrix to also contain the number of edge endpoints, not number of edges and maintain the relation that row (or column) sums give the degrees.

Or consider a random walk on an undirected graph. What is the probability of the random walker being in a given vertex? You'll find in many textbooks that this is proportional to the degree. This scenario is basically PageRank with a damping factor of 1. Do the calculation with networkx.pagerank for the above graph, and you get (0.2, 0.6, 0.2) as if the degrees were (1, 3, 1) ...

@eriknw
Copy link
Member

eriknw commented Mar 7, 2023

I agree with your points @szhorvat, and thanks for the clear explanations. It's helpful to have a "paper trail", and I'm looking forward on your perspective on more things. We'll discuss this issue in our weekly meeting tomorrow, but here are my current thoughts:

  • the current behavior for undirected is "wrong" for the reasons you state and the default behavior should be as you describe
  • how self-edges are converted to GraphBLAS adjacency matrix should be explicitly documented
  • we should consider adding a keyword argument to control behavior of undirected self-edges
    • are there strong reasons for using the current behavior?
    • having a keyword can serve as a "flag" to bring attention to how to handle self-edges
  • we should be consistent when converting undirected graphs from other graph packages
    • converting to/from igraph and graph-tool are within scope
  • we should consider upstreaming this concern back to NetworkX, b/c they are using scipy.sparse more often
    • since the default behavior is "wrong", it may be easy for them to introduce subtle errors

Does the GraphBLAS standard say anything about the interpretation of symmetric matrices as undirected graphs? Or does it sidestep the problem by effectively treating everything as directed? I'm new to GraphBLAS, but I guess the latter?

Despite having "Graph" in the name, GraphBLAS says very little about the interpretation of matrices. The same matrix can be interpreted differently whether it's directed or undirected, and the same graph can have different associated matrices: adjacency, incidence, Laplacian, etc. So, I would say the GraphBLAS approach is to allow users to adhere to well-established theory relating matrices, linear algebra, and graphs. If you know you have an adjecency matrix for an undirected graph, you can treat it as such or wrap it in a Graph class as is done in LAGraph and graphblas_algorithms.

So, we should follow the spirit of adhering to well-established theory, which I think you've convinced me that in this case means changing how we convert self-edges for undirected graphs from networkx.

@eriknw
Copy link
Member

eriknw commented Mar 16, 2023

Perhaps the easiest thing for us to do here (at least for now) is to double the diagonals ourselves when converting from NetworkX, such as:

if not G.is_directed():
    A += gb.select.diag(A)

@eriknw
Copy link
Member

eriknw commented Mar 17, 2023

If we double the diagonals by default when converting from NetworkX, then we ought to halve the diagonals when converting to a NetworkX undirected Graph by default.

What are good keyword arguments to control this behavior?

@szhorvat
Copy link
Author

igraph and graph-tool both use 2.

I have to make a correction here. The Python interface of igraph has not fully transitioned yet, and currently uses 1 by default. graph-tool does use 2.

What are good keyword arguments to control this behavior?

What we are heading towards in igraph is using a loops parameter with settings "ignore", "once", "twice" for all functions it makes sense to control this:

https://python.igraph.org/en/stable/api/igraph.GraphBase.html#Adjacency

@szhorvat
Copy link
Author

I notice that to_networkx() always creates an nx.DiGraph. This issue is only become complicated for undirected multigraphs.

For directed graphs, self-loops should not be counted twice.

For simple undirected graphs, the weight simply needs to be halved, if you choose to go with taking self-loops twice.

This brings up the question:

  • Can one control what class to_networkx() uses out of Graph, DiGraph, MultiGraph and MultiDiGraph?
  • If converting to a Graph (undirected), what if the matrix is not symmetric?
  • Should multigraphs be supported at all?

It seems to me that the GraphBLAS approach of using adjacency matrices is not very suitable for multigraphs. One may certainly represent multiple paralell edges with integers higher than 1 in the adjacency matrix, but then weights cannot be handled, and several parallel edges cannot be distinguished from each other.

@szhorvat
Copy link
Author

I wanted to give you a heads up that I won't be reachable until the week after next. I will think more carefully about these issues when I'm back and will try to finish the review by April 2nd.

@eriknw
Copy link
Member

eriknw commented Mar 17, 2023

Thanks for the suggestions, questions, and heads up @szhorvat. I expect to be especially busy the next two weeks, so this is good timing for a pause. I do appreciate your scrutiny and involvement with your careful review--you're helping to improve things and push things forward :)

We would like to redo from_networkx/to_networkx and also add igraph conversions to try to be consistent. We've discussed this at our previous two meetings, but don't have specific APIs yet. We want to support:

  • directed/undirected
  • keyword argument for how to handle self-edges
  • adjacency and incidence matrices / graphs and multigraphs (and maybe someday hypergraphs)

It seems to me that the GraphBLAS approach of using adjacency matrices is not very suitable for multigraphs.

Right, adjacency matrices aren't the way to go for weighted multigraphs. For unweighted, using the multiplicity for the edge weight can sometimes work as you indicated. We'll need to use incidence matrices in general though. Now I wonder: how are self-edges handled by incidence matrices? Instead of a single row (or column) for an edge with two values to indicate the two end points, I suppose you'd need two rows (or columns) for an edge with one value each to indicate the self-loop.

@eriknw
Copy link
Member

eriknw commented Mar 17, 2023

@jim22k, regarding the shape of the incidence matrix, since we assume row-oriented-ness, I think it makes the most sense to have nodes X edges (node ids as row indices and edge id as column indices) by default. This way, it's easy to ask the question "which edges are incident upon node i" by simply taking B[i, :], which ought to be faster than B[:, i] if the underlying storage is CSR or DCSR. This also seems to be the more common convention.

@eriknw
Copy link
Member

eriknw commented Mar 19, 2023

@SultanOrazbayev @jim22k for the next meeting or two, let's try to work out a proposal for going to/from networkx and adjacency matrix. If we have the time, I think it would be helpful if we each come up with an API and behaviors that we think make the most sense, then we can compare.

I think we can separate and postpone incidence matrix conversions until after we do adjacency matrix conversions, but of course we can consider incidence matrices too if anyone gets inspired!

@SultanOrazbayev
Copy link
Member

Right, adjacency matrices aren't the way to go for weighted multigraphs. For unweighted, using the multiplicity for the edge weight can sometimes work as you indicated. We'll need to use incidence matrices in general though. Now I wonder: how are self-edges handled by incidence matrices? Instead of a single row (or column) for an edge with two values to indicate the two end points, I suppose you'd need two rows (or columns) for an edge with one value each to indicate the self-loop.

Wouldn't this be handled by a simple one-valued row? (so the column with the only value is the self-edge)

@SultanOrazbayev
Copy link
Member

SultanOrazbayev commented Mar 20, 2023

This is not directly relevant to the discussion here, but since incidence matrices were brought up, there is a relationship between incidence and adjacency matrix given here:

The incidence matrix C of a graph and adjacency matrix L of its line graph are related by
$L=C^{T}C-2I $

I don't have the book referenced there, but at least for the special case of $ L=2I $, the corresponding value for incidence matrix is $ C=2I $.

Edit: this math doesn't seem to work with a fully connected graph, so would be great to see the proof of that relationship (I don't have access to the book). Oops, didn't read the link carefully. :/

@eriknw
Copy link
Member

eriknw commented Mar 20, 2023

The relationship I'm more familiar with the relationship connecting the incidence matrix B to the Laplacian L, adjacency matrix A, and diagonal degree matrix D.

For unweighted, undirected graph:

B @ B.T = D + A

For unweighted, directed graph:

B @ B.T = D - A = L

So, for these to hold, I think the single value in a column for a self-edge should be 2 for unweighted/undirected, and 0 for unweighted/directed.

@eriknw eriknw added enhancement Improve existing functionality or make things work better discussion Discussing a topic with no specific actions yet io Data input, output, and conversions labels Mar 25, 2023
@szhorvat
Copy link
Author

Hi @eriknw, I am unable to find an email address for you. Could you please contact me at szhorvat at gmail.com, to discuss a proposal using python-graphblas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discussing a topic with no specific actions yet enhancement Improve existing functionality or make things work better io Data input, output, and conversions
Projects
None yet
Development

No branches or pull requests

3 participants