Representing a community as a vector


My setup is this:

Suppose I have transactional data over a long period of time. The parties of each transaction are labeled, and I use the Louvain algorithm to detect communities (and sub-communities) at each timestep (a day, i.e., a 24-hour period).
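To make the setup concrete, here is a minimal sketch of the per-timestep detection step, assuming one day's transactions are available as (party, party) pairs; the names and the `seed` are illustrative, not part of my actual pipeline:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def detect_communities(transactions):
    """Build a graph from one day's transactions and run Louvain on it."""
    G = nx.Graph()
    G.add_edges_from(transactions)  # parties become nodes, transactions edges
    # Returns a list of sets of nodes, one set per community (networkx >= 2.8).
    return louvain_communities(G, seed=42)

day_1 = [("alice", "bob"), ("bob", "carol"), ("dave", "erin")]
communities = detect_communities(day_1)
```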

I also used the parties' labels and tf-idf to produce a literal, summarized description of each community at each timestep. This has allowed me to manually focus on several communities that are interesting for my specific research and that appear regularly at each timestep (they are fairly consistent in terms of the nodes that make them up - mostly the same nodes each day). Such a community might also fail to appear at all in some timestep (usually because it wasn't "consolidated" enough that day, so Louvain only detected isolated components of it).
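A hedged sketch of what I mean by the tf-idf step: each community becomes one "document" made of its node labels, and the top-weighted terms serve as its summary (the helper name and `top_k` are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize_communities(community_labels, top_k=3):
    """community_labels: list of lists of node labels, one list per community."""
    docs = [" ".join(labels) for labels in community_labels]
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()
    summaries = []
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:top_k]  # indices of the highest-weighted terms
        summaries.append([terms[i] for i in top])
    return summaries
```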

Based on previous knowledge, I can also label each timestep with a label relevant to my research; for example: "Wedding", "Funeral", "Birthday", "Ordinary day". This is crucial: by distinguishing communities detected on an ordinary day from those detected on a special day, my goal is to recognize what distinguishes them - and eventually use that in a predictive model.

Finally, at each timestep I also computed centrality measures for each node in each community: degree, but also betweenness and closeness.
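A sketch of that per-node centrality step, computed on the subgraph induced by one community instance (the function name is illustrative):

```python
import networkx as nx

def node_centralities(G, community):
    """Return per-node centralities for one community instance."""
    sub = G.subgraph(community)
    return {
        "degree": nx.degree_centrality(sub),
        "betweenness": nx.betweenness_centrality(sub),
        "closeness": nx.closeness_centrality(sub),
    }
```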

Given this setup, I realized I'm dealing with a new kind of dataset: community instances (i.e., a specific community in a specific timestep). This has led me to think that representing each community instance as a feature vector would allow me to cluster them, in an attempt to achieve my goal.

There are some direct properties that can be included in such a vector: number of nodes, number of edges, number of sub-communities, average degree, etc.
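A minimal sketch of that "direct properties" vector for one community instance; `sub_communities` is assumed to come from a second Louvain pass within the community:

```python
import networkx as nx

def direct_features(G, community, sub_communities):
    """Basic structural features of one community instance."""
    sub = G.subgraph(community)
    n, m = sub.number_of_nodes(), sub.number_of_edges()
    return [
        n,                        # number of nodes
        m,                        # number of edges
        len(sub_communities),     # number of sub-communities
        2 * m / n if n else 0.0,  # average degree of an undirected graph
    ]
```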

A less naive representation could be an ordered list of the number of nodes in each sub-community (of the parent community instance), padded with zeros (the vector length being the maximal number of sub-communities found across all instances). A more sophisticated method would be to calculate the variance (and mean) of the centrality measures of the nodes in each community instance and use those as features. Both are sketched below.
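A sketch of these two richer representations, under the assumption that `max_subs` is fixed across all community instances so the vectors are comparable:

```python
import numpy as np

def subcommunity_size_vector(sub_communities, max_subs):
    """Sub-community sizes in descending order, zero-padded to max_subs."""
    sizes = sorted((len(s) for s in sub_communities), reverse=True)
    return np.pad(sizes, (0, max_subs - len(sizes)))

def centrality_stats(centrality):
    """Mean and variance of one centrality measure over a community's nodes.

    centrality: dict mapping node -> centrality value, e.g., the output of
    nx.betweenness_centrality on the community subgraph.
    """
    values = np.fromiter(centrality.values(), dtype=float)
    return [values.mean(), values.var()]
```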

In general, I would like to know whether there are any successful case studies based on a similar approach (embedding community instances as feature vectors), and if so, what features were used. Alternatively, are there any flaws in this kind of approach, or more specifically, in the features I suggested?

Any help would be appreciated.

gbi1977

Posted 2018-08-16T21:22:03.730

Reputation: 121

A community is a subgraph, so I would use a (sub)graph embedding; here is a relevant survey.

– Emre – 2018-08-16T21:24:54.193

Apparently some work has already been done on this subject, such as this: https://arxiv.org/pdf/1610.09950.pdf. The main claim is that a community cannot be represented as a single vector but rather as a distribution of node vectors, which makes sense - but since I'm trying to compare discrete-time instances of essentially the same communities, over and over, I would much rather find a compact vector representation of a single community, with each entry of the vector being a number associated with a global, computed property of the community.

– gbi1977 – 2018-08-19T10:27:33.663

That can be done; the vector would simply represent the parameters of the density, so a vector representation loses no generality. – Emre – 2018-08-19T21:59:59.573

@Emre by parameters of density do you mean the centrality measures I mentioned? or other parameters? – gbi1977 – 2018-08-20T11:26:25.237

I mean that any pdf can be characterized by a set of parameters; e.g., the scale and location. If you don't know the family of the distribution, you can use non-parametric density estimation. Naturally, there are deep models for this too. In fact, many modern models are all about learning these high-dimensional densities and sampling from them. However, this might be overkill for your classification problem. Did you try a seq2seq model with the node embeddings as the input and the class as the output?

– Emre – 2018-08-20T16:28:36.060
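For reference, a hedged sketch of the non-parametric route Emre mentions: fit a kernel density estimate over a community's node vectors (e.g., its per-node centralities) and use evaluations on a fixed grid as a compact, fixed-length vector; the bandwidth and grid here are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def density_features(node_vectors, grid):
    """node_vectors: (n_nodes, d) array; grid: (n_grid, d) evaluation points."""
    kde = KernelDensity(bandwidth=0.5).fit(node_vectors)
    # Log-densities at fixed grid points: same length for every community.
    return kde.score_samples(grid)
```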

No answers