Finding the energy function given the update rule of a single-layer nonlinear neural network

Consider a network with $$n$$ neurons, each of which takes a $$2 \times k$$ input specified by the tuple $$(\vec c_t, \vec \theta_t)$$ and produces the output $$\vec{R}_t$$, together with an update rule on the pairwise weights $$\mathbf{W_t}$$ between the neurons:

\begin{align} g(x) &= \lfloor x \rfloor^2, \forall x \in R,\\ \vec r_t &= g\left(\begin{bmatrix}\vec c_t & b\end{bmatrix}^\intercal \begin{bmatrix}\mathbf{f}(\vec\theta_t) \\ \mathbf{1}\end{bmatrix}\right),\\ \vec R_t &= \vec r_t^\intercal [\mathrm{diag}(\sigma^2\mathbf{1} + \mathbf{W_t} \vec r_t/n)]^{-1},\\ \Delta \mathbf{W_{t+1}} &= \mathbf{W_{t+1}} - \mathbf{W_{t}} = \alpha (\vec R_t^\intercal \vec R_t - \mathbf{C}) + \beta (\mathbf{W_{t}} - \mathbf{C}). \end{align}

where $$\vec c \in R^k, \vec \theta \in R^k, \vec r \in R^n, \alpha, \beta, b \in [0,1]$$, the nonlinear activation satisfies $$\mathbf{f} \in [c_0, 1]^{nk}$$, the convex map $$g$$ truncates all negative elements of a vector to 0 and then squares each element, and $$\mathrm{diag}$$ converts its input vector to a diagonal matrix. At $$t=0$$, the network is initialized using the inputs $$(\vec c_0, \vec \theta_0)$$ as $$\mathbf{W_0}=\mathbf{1}, \mathbf{C}:=\vec R_0^\intercal \vec R_0$$.
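To make the setup concrete, here is a minimal NumPy sketch of Eqs. 1 to 4. It rests on assumptions that the equations leave open: $$\mathbf{f}(\vec\theta_t)$$ is represented by a fixed $$n \times k$$ matrix `F` with entries in $$[c_0, 1]$$, Eq. 2 is read as $$\vec r = g(\mathbf{F}\vec c + b\mathbf{1})$$, the inputs are held constant over time, and all numerical values are placeholders.

```python
import numpy as np

def g(x):
    """Eq. 1: truncate negative entries to 0, then square elementwise."""
    return np.maximum(x, 0.0) ** 2

def outputs(W, r, sigma):
    """Eq. 3: R_i = r_i / (sigma^2 + (W r)_i / n)."""
    n = r.shape[0]
    return r / (sigma**2 + W @ r / n)

def step(W, C, r, sigma, alpha, beta):
    """Eq. 4: return (W_{t+1}, Delta W_{t+1})."""
    R = outputs(W, r, sigma)
    dW = alpha * (np.outer(R, R) - C) + beta * (W - C)
    return W + dW, dW

# Placeholder sizes and parameters (assumptions, not from the question).
rng = np.random.default_rng(0)
n, k = 4, 3
F = rng.uniform(0.2, 1.0, size=(n, k))  # assumed stand-in for f(theta), c_0 = 0.2
c = rng.uniform(0.0, 1.0, size=k)
b, sigma, alpha, beta = 0.1, 0.5, 0.1, 0.05

r = g(F @ c + b)                 # assumed reading of Eq. 2
W0 = np.ones((n, n))             # W_0 = 1 (all-ones matrix)
R0 = outputs(W0, r, sigma)
C = np.outer(R0, R0)             # C := R_0^T R_0

W1, dW0 = step(W0, C, r, sigma, alpha, beta)

# Track the size of the update for a few steps with constant inputs.
W = W0
for t in range(5):
    W, dW = step(W, C, r, sigma, alpha, beta)
    print(t, np.linalg.norm(dW))
```

With constant inputs, the $$\alpha$$ term vanishes at $$t=0$$ by construction of $$\mathbf{C}$$, so the first update reduces to $$\beta(\mathbf{1} - \mathbf{C})$$.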

• Premise question on the existence of stable state(s): Which values of $$b, \alpha, \beta, \sigma$$ and which inputs $$(\vec c_t, \vec\theta_t)$$ lead to stable states? That is, when does Eq. 4 lead to no update as $$t\to+\infty$$?
• Main question: Suppose the update rule Eq. 4 is applying some "backprop" trying to "drive" some network state $$E$$ to an "optimal" state $$E^*$$ computed by a cost function $$E(R, ...)$$ from the network outputs $$R$$ and the network state $$W, C, b, \sigma, \alpha, \beta$$ [1]. Can we write $$E(R, ...)$$ explicitly in terms of the output $$R$$ (and the initial conditions)?

My attempts

The first question requires the right-hand side of Eq. 4 to vanish, for which it suffices to solve $$\vec R_t^\intercal \vec R_t = \mathbf{C},\mathbf{W_{t}} = \mathbf{C}$$. Plugging in Eq. 3 doesn't seem to simplify much. I tried taking the SVD of $$\vec R_t^\intercal \vec R_t$$ and $$\mathbf{W_{t}}$$, among other things, but this didn't lead to much progress. I am not sure how to solve for the equilibrium states of a dynamical system specified by matrices like this.
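Written out, the fixed-point condition and the recursion for the deviation $$\mathbf{W_t} - \mathbf{C}$$ follow directly from Eq. 4. This is only a rearrangement under the assumption $$\beta \neq 0$$; the symbols $$\mathbf{W}^*$$ and $$\vec R^*$$ (the output of Eq. 3 evaluated at $$\mathbf{W}^*$$) are introduced here for illustration.

\begin{align*} \Delta \mathbf{W_{t+1}} = 0 \;&\Longleftrightarrow\; \mathbf{W}^* = \mathbf{C} - \frac{\alpha}{\beta}\left(\vec R^{*\intercal} \vec R^* - \mathbf{C}\right),\\ \mathbf{W_{t+1}} - \mathbf{C} &= (1+\beta)(\mathbf{W_t} - \mathbf{C}) + \alpha\left(\vec R_t^\intercal \vec R_t - \mathbf{C}\right). \end{align*}

The first line is implicit in $$\mathbf{W}^*$$, because $$\vec R^*$$ depends on $$\mathbf{W}^*$$ through Eq. 3. The second line shows that, for $$\beta > 0$$, the linear part alone multiplies the deviation from $$\mathbf{C}$$ by $$1+\beta$$ at every step, so any stable state would have to rely on the $$\alpha$$ term counteracting that growth.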

For the second question, suppose Eq. 4 describes moving along the steepest gradient in the state space $$E(R, ...)$$. Then it should correspond to maximizing the magnitude of the derivative $$\left|\frac{\partial E_t}{\partial\mathbf{W_t}}\right|$$, i.e. $$\mathrm{argmax}_{\mathcal{E}}\left|\frac{\partial E_t}{\partial\mathbf{W_t}}\right|=\alpha (\vec R_t^\intercal \vec R_t - \mathbf{C}) + \beta (\mathbf{W_{t}} - \mathbf{C})$$.
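One way to probe this numerically: if the right-hand side of Eq. 4 were proportional to $$\partial E/\partial \mathbf{W}$$ for some scalar $$E$$, its Jacobian with respect to the entries of $$\mathbf{W}$$ would have to be symmetric, since it would be a Hessian of $$E$$. The sketch below estimates that Jacobian by central differences and reports the largest asymmetry. It assumes constant inputs, treats all $$n^2$$ entries of $$\mathbf{W}$$ as independent variables, and uses placeholder values; `field` and `jacobian_asymmetry` are hypothetical helper names.

```python
import numpy as np

def field(w_flat, C, r, sigma, alpha, beta):
    """Right-hand side of Eq. 4 as a function of the flattened W."""
    n = r.shape[0]
    W = w_flat.reshape(n, n)
    R = r / (sigma**2 + W @ r / n)          # Eq. 3
    return (alpha * (np.outer(R, R) - C) + beta * (W - C)).ravel()

def jacobian_asymmetry(W, C, r, sigma, alpha, beta, eps=1e-6):
    """Central-difference Jacobian J of the field; returns max |J - J^T|."""
    w = W.astype(float).ravel()
    m = w.size
    J = np.empty((m, m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = eps
        f_plus = field(w + e, C, r, sigma, alpha, beta)
        f_minus = field(w - e, C, r, sigma, alpha, beta)
        J[:, j] = (f_plus - f_minus) / (2 * eps)
    return float(np.max(np.abs(J - J.T)))

# Placeholder example (assumed values, not from the question).
r = np.array([1.0, 2.0, 0.5])
sigma = 1.0
W = np.ones((3, 3))
R0 = r / (sigma**2 + W @ r / 3)
C = np.outer(R0, R0)

print("alpha = 0  :", jacobian_asymmetry(W, C, r, sigma, 0.0, 0.3))
print("alpha = 0.1:", jacobian_asymmetry(W, C, r, sigma, 0.1, 0.3))
```

For $$\alpha = 0$$ the field is linear in $$\mathbf{W}$$ with Jacobian $$\beta \mathbf{I}$$, which is symmetric and integrates to a quadratic in $$\mathbf{W} - \mathbf{C}$$. For $$\alpha > 0$$, entry $$(i,j)$$ of the update depends on row $$j$$ of $$\mathbf{W}$$ through $$R_j$$, while entry $$(j,l)$$ with $$l \neq i$$ does not depend on $$W_{ij}$$, so a nonzero asymmetry there would suggest that no such $$E$$ exists as an ordinary function of $$\mathbf{W}$$ alone.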

[1]: Inspired by https://cs.nyu.edu/~yann/research/ebm/