2

Recently I was reading this paper Skeleton Based Action RecognitionUsing Spatio Temporal Graph Convolution. In this paper, the authors claim (below equation (\ref{9})) that we can perform graph convolution with the following formula

$$ \mathbf{f}_{o u t}=\mathbf{\Lambda}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I}) \mathbf{\Lambda}^{-\frac{1}{2}} \mathbf{f}_{i n} \mathbf{W} \label{9}\tag{9} $$

using the standard 2d convolution with kernels of shape $1 \times \Gamma$ (where $\Gamma$ is defined under equation 6 of the paper), and then multiplying it with the normalised adjacency matrix

$$\mathbf{\Lambda}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I}) \mathbf{\Lambda}^{-\frac{1}{2}}$$

For the past few days, I was thinking about his claim but I can't find an answer. Does anyone read this paper and can help me to find it out, please?

Isn't this a duplicate of https://ai.stackexchange.com/q/22650/2444? Or are you interested only in the 1d convolution? If yes, maybe read my answer https://ai.stackexchange.com/a/21824/2444. I talk about 1d convolutions there, but I don't remember anymore the level of detail I used.

– nbro – 2020-07-24T19:58:18.273I am interested in (1xD) convolution specifically. But can you please explain your answer? – Swakshar Deb – 2020-07-24T20:01:23.630

Have you read my answer: https://ai.stackexchange.com/a/21824/2444? (Btw, I edited the comment above to include a link to this answer). I don't know if it answers your question because I don't remember what I wrote there, but I think it should give you some intuition behind 1d convolutions, as far as I remember :P

– nbro – 2020-07-24T20:02:46.990Yes, I go through the answer you gave previously. I think you are more focused on FCN and segmentation but I am trying to know can 1xD standard convolution and then multiplying it with adjacency matrix is the same as graph convolution in the skeleton based graphs? – Swakshar Deb – 2020-07-24T20:07:45.070

It's been a while since I read something about graph NNs, but can you point me exactly to the part of the paper where the author claims that, so that I have some context? – nbro – 2020-07-24T20:14:14.220

obviously, section 3.6, after equation 9. – Swakshar Deb – 2020-07-24T20:17:28.313

So, if I understand correctly, they are performing a 2d convolution between $\mathbf{f}

{i n}$ and $\mathbf{W}$ with kernels of shape $1 \times \Gamma$ to implement the multiplication $\mathbf{f}{i n} \mathbf{W}$. What is $\mathbf{f}{i n}$? The features at the nodes? I think so because they say "_In practice, under the spatial temporal cases, we can represent the input feature map as a tensor of $(C, V, T)$ dimensions" – nbro – 2020-07-24T20:30:25.267Yes, you're right. $f_{in}$ is the feature vector of the nodes. – Swakshar Deb – 2020-07-24T20:33:53.567

I can't think about it right now (cause I don't have much time), but, hopefully, this discussion and clarification was somehow useful. Maybe the next step to understand is what are $C$, $V$ and $T$ (if you haven't already) and why the feature map can be represented as a 3d tensor of dimensions $(C, V, T)$. If I remember correctly, algorithms that apply the so-called "graph convolution" are actually

message-passingalgorithms, where you "flow the information between nodes" for some time steps, and that should explain the parameter $\Gamma$. – nbro – 2020-07-24T20:36:55.243It;s okay. $C$ = coordinate of a perticular joint(x,y,z), $V$ = vertex(in this case joints), $T$ = Timestep. – Swakshar Deb – 2020-07-24T20:39:22.827

Ok. I just want to clarify one thing. They are apparently not talking about 1d convolutions (as I initially thought from your explanation), but 2d convolutions with kernels of shape $1 \times \Gamma$, i.e. rather than having the usual $3 \times 3$ (or $5 \times 5$), you have a $1 \times \Gamma$ kernel. If they really apply a 2d convolution, then the actual kernel will have the same $1 \times \Gamma \times N$, where $N$ is the depth of the feature map, which should be $T$, from what I understand, so the kernel should be $1 \times \Gamma \times T$. – nbro – 2020-07-24T20:43:23.390

Then you apply (i.e. convolve) this kernel $1 \times \Gamma \times T$ to the spatial dimensions of the input, where the spatial dimensions (height and width, or vice-versa) should be $C$ and $V$. I don't know which values $\Gamma$ is supposed to take, but you may want to try to solve values, like $3$, and see why the graph convolution would be equivalent to this 2d convolution with kernel $1 \times 3 \times T$. (I hope I didn't confuse you more). – nbro – 2020-07-24T20:44:42.863

yes, you are right. I initally talk about 1D convolution since in their code $\Gamma$ parameter is set to 1 – Swakshar Deb – 2020-07-24T20:48:33.163

I wrote "kernel will have the same", but I meant "kernel will have the

shape" and I wrote "want to try to solve values, like 3", but I meant "want to try to solve this with values like $\Gamma = 3$". – nbro – 2020-07-24T20:51:23.787Ha, so they set $\Gamma = 1$. I see that you wrote that only now. Well, in that case, you're really applying a $1\times 1 \times T$ kernel to the input feature map (i.e. 1d convolution). The application of a $1 \times 1$ kernel to the input produces a tensor with the same spatial dimensions as the input feature map. So, if you apply only one $1 \times 1 \times T$ kernel to an input feature map of shape $(C, V, T)$, then you will get an output feature map of shape $(C, V)$. – nbro – 2020-07-24T20:59:18.550

What are the dimensions of the normalized adjacency matrix, $\mathbf{\Lambda}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I}) \mathbf{\Lambda}^{-\frac{1}{2}}$? – nbro – 2020-07-24T21:02:19.270

After the convolution operation, they got 4-dimensional tensor, where extra 2 dimensions are batch size and number of filter. The normalized adjacency matrix is a 3 dimensional tensor – Swakshar Deb – 2020-07-24T21:29:52.410

What are the 3rd and 4th dimensions? – nbro – 2020-07-24T21:46:27.183

After convolution the dimension of the feature map is $(N,C,V,F)$ where $N$=no of batch,$F$ = number of filters and $C,V$ are cordiality and joints. The dimension of the normalized adjacency matrix is $(K,V,V)$ where $V$= no of joints, I am not sure about $K$. – Swakshar Deb – 2020-07-24T22:05:44.457

Then that's consistent with what I said, and it somehow seems to be consistent with what is written below that equation 9, i.e. "

and multiplies the resulting tensor with the normalized adjacency matrix" – nbro – 2020-07-24T22:08:11.890on the second dimensionI think $K=1$, so the dimension of the adjacency matrix is $(1,V.V)$ – Swakshar Deb – 2020-07-24T22:17:38.263

Ha, wait, I misinterpreted your last comment or you changed it. A tensor of shape $(N, C, V, F)$ needs to be multiplied with a tensor of shape $(K, V, V)$, on the second dimension. That looks problematic, actually, because the dimensions don't seem to match for matrix multiplication. – nbro – 2020-07-24T22:20:19.147

My bad input tensor will be $(N,F,C,V)$. They are doing the multiplication like this: torch.einsum('nkctv,kvw->nctw', (x, A)) – Swakshar Deb – 2020-07-24T22:25:03.273

I am not familiar with that function (although I once used PyTorch). I would need to check the documentation, but I don't have time now. – nbro – 2020-07-26T15:26:47.370