Reduce size of a network graph for bipartite projection

1

I have a graph that I created from a pandas DataFrame; len(g) reports ~450k nodes. When I try to run the weighted_projected_graph function, it runs for a long time (I have never seen it finish), presumably because of the size of this data set. What is a good method for reducing the size of this data set before computing the projection?

I have tried narrowing it down by keeping only the largest connected components:

trim1 = len([c for c in net.connected_component_subgraphs(g) if len(c) > 10])
# union of the trim1 largest components (note the slice, not a single index)
components = sorted(net.connected_component_subgraphs(g), key=len, reverse=True)[:trim1]
gg = net.compose_all(components)

but I don't think this is giving me the results I want and, further, I'm not confident this is an analytically sound strategy. Does anybody have any other recommendations for reducing the size of this set?
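For reference, the same component-size filter can be written against the modern NetworkX API (connected_component_subgraphs was removed in NetworkX 2.4, so this sketch assumes 2.4+; the toy graph is purely illustrative):

```python
import networkx as nx

# Toy graph: one large component (a 13-node path) and one tiny one.
g = nx.Graph()
g.add_edges_from((i, i + 1) for i in range(12))
g.add_edge("a", "b")

# Keep only components with more than 10 nodes, then take their union.
big = [c for c in nx.connected_components(g) if len(c) > 10]
gg = g.subgraph(set().union(*big)).copy()

print(gg.number_of_nodes())  # 13 -- the tiny component is dropped
```

connected_components yields node sets rather than subgraphs, so the subgraph/union step replaces the old compose-based approach.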

EDIT: full code, without any reduction. The idea with the snippet above would be to swap gg in for g in the call to bi.projected_graph below.

import pandas as pd
import networkx as net
from networkx.algorithms import bipartite as bi

reviews = pd.read_csv(r"\BX-Book-Ratings.csv", sep=";", encoding="ISO-8859-1")
users = reviews['User-ID'].values.tolist()
books = reviews['ISBN'].values.tolist()

g = net.from_pandas_edgelist(reviews, 'User-ID', 'ISBN', ['Book-Rating'])

print(len(g))  # 445839 -- note len(g) is the number of nodes, not edges

p = bi.projected_graph(g, users)

Sam cd

Posted 2020-10-12T13:18:28.630

Reputation: 121

How are you constructing the graph from the frame? weighted_projected_graph takes a bipartite graph as input (but doesn't check whether the input really is bipartite), and generally will not produce a bipartite output. The result should be the same as running the function on each component separately and taking the union, but since the implementation looks at second neighborhoods to get the counts, that shouldn't actually be much faster(?). – Ben Reiniger – 2020-10-12T14:50:16.343

@BenReiniger thanks for that clarification. However, now using projected_graph to create the projection, that is timing out as well. I edited the post to include my entire code. – Sam cd – 2020-10-13T14:21:59.823

Answers

2

I think the problem is not the data size but the presence of high-degree nodes (the degree of a node is its number of neighbours).

Indeed, all neighbours of a node in the bipartite graph are linked together in the projection; they form a clique (complete subgraph). This is terrible, as the number of links in a clique of $d$ nodes is $\frac{d\cdot(d-1)}{2} \sim d^2$.
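This clique effect is easy to see on a toy bipartite graph (node names here are illustrative): projecting a single "book" shared by four "users" onto the user side yields the complete graph on those four users.

```python
import networkx as nx
from networkx.algorithms import bipartite

# One book node "b" reviewed by four users u1..u4.
B = nx.Graph()
users = ["u1", "u2", "u3", "u4"]
B.add_edges_from(("b", u) for u in users)

# Projecting onto the users links every pair that shares the book:
# a clique on 4 nodes, i.e. 4*3/2 = 6 edges.
P = bipartite.projected_graph(B, users)
print(P.number_of_edges())  # 6
```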

In concrete terms, a node of degree 2,000 in the bipartite graph will induce almost two million links in the projection... A node of degree 20,000 will induce almost two hundred million links...
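Because of this, you can bound the size of the projection before computing it, just from the degree distribution. A minimal sketch (the function name is mine, not a NetworkX API; cliques from different nodes overlap, so this is an upper bound, not an exact count):

```python
import networkx as nx

def projection_edge_upper_bound(g, other_side_nodes):
    """Sum d*(d-1)/2 over the nodes being projected *away*:
    each such node of degree d turns its neighbourhood into a clique."""
    return sum(g.degree(n) * (g.degree(n) - 1) // 2 for n in other_side_nodes)

# Tiny demo: book b1 shared by 3 users, book b2 by 2 of the same users.
g = nx.Graph()
g.add_edges_from([("b1", "u1"), ("b1", "u2"), ("b1", "u3"),
                  ("b2", "u1"), ("b2", "u2")])
print(projection_edge_upper_bound(g, ["b1", "b2"]))  # 3 + 1 = 4
```

If this bound is in the hundreds of millions for your data, the slowness is explained by the output size itself, not by the algorithm.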

For more discussion of the problems with projections of bipartite graphs, see for instance this paper: Basic Notions for the Analysis of Large Affiliation Networks / Bipartite Graphs

Matthieu Latapy

Posted 2020-10-12T13:18:28.630

Reputation: 131