## In a directed graph, how to measure whether a node is more "upstream" or "downstream"?


Let's say that I have a directed graph, and for each node I want to have some measure of whether it is more of an "upstream" node (that lies at the beginning of many paths), or a "downstream" node (to which many paths converge).

Say, in a sequence A->B->C I want A to have the highest value, and C the lowest value, while B would be "in between". In a perfect cycle A->B->C->A I want all nodes to have the same "in between" value. The metric should tolerate a mix of cycles and acyclic elements.

Ideally, it would also give more extreme values to nodes from longer acyclic paths: higher values to nodes with larger estuaries (the 'mother of all' node), and lower values to nodes with larger watersheds (the 'all roads lead to Rome' node that integrates all flows).

Ideally, it should also generalize to weighted graphs, in a way that does not depend on the scaling factor for the weights: if you multiply all weights in the graph by 2, the metric for each node shouldn't change, as the topology clearly didn't change with this scaling.

For an unweighted graph I can come up with a metric like this: find all ancestors of a node (every node from which it can be reached), find all its descendants (every node reachable from it), then divide the number of descendants by the sum of the numbers of descendants and ancestors. This value is 1 for any point of origin, 0 for dead ends, and 0.5 for cycles. So it's mostly OK. What I don't like about this metric is that 1) it does not care about the length of the path, 2) it will be computationally slow, 3) I don't know how to generalize it to weighted graphs in a scale-invariant way.
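To make the ratio above concrete, here is a minimal sketch in plain Python. The function names `reachable` and `reach_ratio` and the edge-list input format are assumptions made for illustration, and the 0.5 returned for a node with only a self-loop is an arbitrary choice.

```python
from collections import deque

def reachable(adj, start):
    """Return every node that can be reached from `start` by following `adj`."""
    seen = set()
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def reach_ratio(edges):
    """descendants / (descendants + ancestors) for every node of an edge list."""
    fwd, rev, nodes = {}, {}, set()
    for u, v in edges:
        fwd.setdefault(u, []).append(v)   # forward adjacency: descendants
        rev.setdefault(v, []).append(u)   # reversed adjacency: ancestors
        nodes.update((u, v))
    scores = {}
    for n in nodes:
        down = len(reachable(fwd, n) - {n})
        up = len(reachable(rev, n) - {n})
        total = down + up
        # total == 0 only for a node whose sole edge is a self-loop
        scores[n] = down / total if total else 0.5
    return scores

print(reach_ratio([("A", "B"), ("B", "C")]))
print(reach_ratio([("A", "B"), ("B", "C"), ("C", "A")]))
```

It runs two traversals per node, so the cost grows roughly as the number of nodes times the number of edges, which is the slowness mentioned in point 2.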

So I was wondering whether there's a known metric with approximately these properties that has been described and studied before. It feels like a logical thing to calculate that many people would use when analyzing social networks, for example; so it feels like it should have a name and published algorithms. Thank you!

Edit: I think it's fair to say that the PageRank metric has many of the properties I described (with the values reversed): sinks are high, sinks with larger watersheds are higher, nodes of origin are low, cycles tend to have "in-between" values, and the algorithm clearly supports weighted graphs. The part it does not care about is whether a node of origin has a large estuary or no estuary at all. Now I'm wondering whether I actually need 2 metrics: PageRank for watersheds, and a different one for estuaries. Like a weighted share of nodes visited by random walks initiated in the node of interest, or something like that. Or are there simpler metrics?
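For reference, a minimal power-iteration sketch of weighted PageRank in plain Python. The function name `pagerank`, the `(source, target, weight)` edge format, the damping value 0.85, and the convention of spreading the rank of sinks uniformly over all nodes are assumptions of this sketch, not something fixed by the discussion above.

```python
def pagerank(weighted_edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration over (source, target, weight) triples."""
    nodes = sorted({n for u, v, _ in weighted_edges for n in (u, v)})
    count = len(nodes)
    out_weight = {n: 0.0 for n in nodes}
    for u, _, w in weighted_edges:
        out_weight[u] += w
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Assumption: rank sitting on sinks is redistributed uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping) / count + damping * dangling / count
        new = {n: base for n in nodes}
        for u, v, w in weighted_edges:
            # Only the share w / out_weight[u] matters, so rescaling all
            # weights by a common factor leaves the result unchanged.
            new[v] += damping * rank[u] * w / out_weight[u]
        rank = new
    return rank

chain = pagerank([("A", "B", 1.0), ("B", "C", 1.0)])
print(sorted(chain, key=chain.get))  # nodes from lowest to highest rank
```

Because each edge contributes only its share of the source's total outgoing weight, the scale-invariance requirement holds by construction. One option to try for the estuary side, offered here only as an untested suggestion, is to call the same function on the reversed edge list, which ranks origins instead of sinks.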

How about the mean distance? – Emre – 2017-08-24T19:37:09.780

Mean distance to all other elements of the graph that can be reached from this point? It qualifies in some ways: is max for origin, is 0 for dead-end, is in between for acyclic in-between, may work for weighted graphs. However I'm not sure how to make it work for cycles (wouldn't it go to infinity?), and also it's unpleasantly asymmetrical: I can calculate mean distance FROM the point, or TO the point, but how to incorporate both? – ampanmdagaba – 2017-08-24T19:50:47.427

If I understand your network flow analogy, you could modify PageRank to sum the $L^p$ norm of the inbound links' PageRanks for $0 \leq p \lt 1$ (instead of summing the PageRanks without the norm). This would prevent large PageRanked nodes from swamping the contribution of smaller ones. Then you would probably lose some of the nice computational properties but that's something you can think about separately. – Emre – 2017-08-26T17:33:56.400

2

Please clarify what you are looking for in the presence of cycles. I'd assume the cycles aren't standalone: any node in a cycle could also have inputs from further upstream, and could also send output further downstream. Even in the unweighted case, it seems to me that some nodes in a cycle can be more "upstream" than others. For instance, most of the nodes in the cycle might be downstream of a set of source nodes. One node in that cycle might be the only path out of the cycle to a sink. Surely that node is more "downstream" than the others in the cycle?

Sorry the above isn't just a comment - I'm too new to have high enough reputation in this group for that.

So, to give an answer as well. Take the "flow network" analogy to an extreme. Imagine each source upstream of the target node puts out a water flow, with dye added at the first instant only. Treat the weights as time delays. For each node, calculate the first time the dye reaches it. Among all the sinks downstream of your target node, pick either the longest time or the mean time (or the node's own time, if the node is a sink). The target node's position is then just its earliest arrival time divided by the sinks' mean or longest arrival time. Computationally, this earliest dye arrival time is found by some modification of breadth-first traversal of the directed graph (with unequal delays, this amounts to Dijkstra's algorithm). Start at the sources with arrival times of 0; each node's arrival time is then the minimum, over all its predecessors, of the predecessor's arrival time plus the delay on the connecting edge. (As in any graph traversal, visited and completed nodes have to be remembered so cycles aren't followed more than once.) This would also be called the shortest distance from a set of nodes to a given set of nodes.
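A minimal sketch of this dye-arrival idea, using a multi-source Dijkstra search from the standard library's `heapq`. The names `earliest_arrival` and `position`, the `(source, target, delay)` edge format, and the choice of the longest (rather than mean) sink time are assumptions of the sketch. Nodes that no source can reach, or that have no sink downstream, are simply left out of the result. Note the direction: 0 means most upstream and 1 means most downstream.

```python
import heapq

def earliest_arrival(weighted_edges):
    """Earliest dye arrival time at each node reachable from a source."""
    adj, nodes, has_pred = {}, set(), set()
    for u, v, delay in weighted_edges:
        adj.setdefault(u, []).append((v, delay))
        nodes.update((u, v))
        has_pred.add(v)
    sources = nodes - has_pred            # nodes with no incoming edge
    arrival = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sorted(sources)]
    heapq.heapify(heap)
    while heap:
        t, u = heapq.heappop(heap)
        if t > arrival.get(u, float("inf")):
            continue                      # stale queue entry
        for v, delay in adj.get(u, ()):
            cand = t + delay
            if cand < arrival.get(v, float("inf")):
                arrival[v] = cand
                heapq.heappush(heap, (cand, v))
    return arrival

def position(weighted_edges):
    """Arrival time of a node divided by the latest arrival at its downstream sinks."""
    arrival = earliest_arrival(weighted_edges)
    adj = {}
    for u, v, _ in weighted_edges:
        adj.setdefault(u, []).append(v)
    scores = {}
    for n in arrival:
        stack, seen = [n], {n}            # collect n and everything downstream
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sink_times = [arrival[s] for s in seen if s not in adj]
        if sink_times and max(sink_times) > 0:
            scores[n] = arrival[n] / max(sink_times)
    return scores

# Hypothetical example: source S feeds a cycle A->B->C->A, and B drains to sink T.
edges = [("S", "A", 1), ("A", "B", 1), ("B", "C", 1), ("C", "A", 1), ("B", "T", 1)]
print(position(edges))
```

In this example the three cycle nodes get different values, which illustrates the point that a cycle embedded in a larger graph is not uniformly "in between".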

But note that cycles aren't given any particular special treatment. If you really want cycles to matter, you could assume that output flows are split equally, and do mixing calculations, so that the dye concentration would decay gradually over time. But that would seem ridiculously complex, and take time, for probably no benefit. But in either case, the nodes in cycles won't all have the same value - some will be more "upstream" than others.

If you really want all the nodes in a cycle to have the same value, you might need a separate pre-processing step to find the directed cycles, and aggregate each cycle into a single node (one for each cycle). That adds complexity, although it simplifies the shortest distance calculations since then the graph will be acyclic and a trivial breadth first traversal will work.
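One standard way to do this aggregation is to collapse each strongly connected component into a single node. The sketch below uses Kosaraju's two-pass algorithm in plain Python; the function names and the choice of one member node as the label of each component are assumptions made for illustration.

```python
def strongly_connected_components(edges):
    """Map every node to a representative of its strongly connected component."""
    fwd, rev, nodes, known = {}, {}, [], set()
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        rev.setdefault(v, []).append(u)
        for n in (u, v):
            if n not in known:
                known.add(n)
                nodes.append(n)
    # Pass 1: iterative depth-first search, recording nodes as they finish.
    order, visited = [], set()
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(fwd.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(fwd.get(nxt, ()))))
                    break
            else:
                order.append(node)        # all successors done
                stack.pop()
    # Pass 2: sweep the reversed graph in decreasing finish order.
    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for nxt in rev.get(node, ()):
                if nxt not in comp:
                    comp[nxt] = root
                    stack.append(nxt)
    return comp

def condense(edges):
    """Edges of the acyclic graph obtained by merging each component."""
    comp = strongly_connected_components(edges)
    return sorted({(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]})

# Hypothetical example: source S, cycle A->B->C->A, sink T.
edges = [("S", "A"), ("A", "B"), ("B", "C"), ("C", "A"), ("B", "T")]
print(condense(edges))
```

Any metric computed on the condensed graph can then be copied back to all members of a component, so every node of a cycle receives the same value.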

You probably don't really want to require a global analysis of a graph to get your feature value, although that's what you had to do in your initial analysis too. If this is something you would run on a regular basis, you could probably just do local updating when changes occur, with limited propagation of earliest arrival times, assuming you could tolerate some "error" in the somewhat arbitrarily-chosen metric anyway.

1

I believe what you are looking for are graph centrality measures. The betweenness centrality of a vertex reflects the share of shortest paths that go through that vertex:

$g(v)=\sum _{{s\neq v\neq t}}{\frac {\sigma _{{st}}(v)}{\sigma _{{st}}}}$

where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through $v$. Also check graph drawing algorithms. Some of them, such as force-based and spectral methods, can give reasonable results for your purpose, but with no theoretical guarantees.
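As an illustration of the betweenness formula, here is a minimal sketch for unweighted directed graphs, following the accumulation scheme of Brandes' algorithm. The function name `betweenness` and the edge-list input are assumptions, and the sketch assumes a simple graph with no repeated edges.

```python
from collections import deque

def betweenness(edges):
    """Unnormalized betweenness centrality g(v) on an unweighted directed graph."""
    adj, nodes = {}, set()
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        nodes.update((u, v))
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        sigma = {n: 0 for n in nodes}     # number of shortest s->n paths
        sigma[s] = 1
        dist = {s: 0}
        preds = {n: [] for n in nodes}    # predecessors on shortest paths
        order = []                        # nodes in order of discovery
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Back-propagate path dependencies from the farthest nodes inward.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

print(betweenness([("A", "B"), ("B", "C")]))
```

On the chain A->B->C both endpoints score 0 and only B scores 1, so betweenness on its own does not separate a source from a sink; it would need to be combined with a directional measure.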