How to reduce dimensionality of encoder decoder output?


I have an encoder decoder architecture where the output $ \bar{\bf{y}}_t $ is a sequence of integers of maximum length $n$. Each integer in the sequence is representative of a category so the sequence $ {0 ,1 ,3 ,4 ,6} $ could mean $\text{Car , Train , Plane , House , Dog}$. There are $m$ possible categories. The current output of the network is an $n \times m$ matrix where the entry $(i,j)$ is meant to represent the probability that the $i^\text{th}$ element of the output sequence belongs to category $j$. I was wondering is there a way to reduce the dimensionality of this problem by sharing weights among the rows of the output matrix. I was thinking there may be a way of predicting the outputs sequentially so there is weight sharing among the rows


Posted 2020-11-19T12:42:49.737

Reputation: 1

Hi @KaneM, welcome to the site. What architecture are you currently using? You tagged lstm but this would imply that the predictions are done sequentially and this is precisely what you seem to be willing to achieve, suggesting you currently have an architecture where there is no weight sharing. – noe – 2020-11-19T14:04:20.437

Hi , I am currently using an encoder decoder architecture but the output is a sequence of sequences so I am not sure how to reduce the dimensionality of the output. I feel that I am not being 100% clear in my question so please do ask more questions if it lacks clarity – KaneM – 2020-11-19T14:40:35.917

Encoder-decoder is an architectural pattern. You can have an encoder-decoder based on LSTMs, GRUs, muti-head attention or convolutions. Please provide more details in that respect. – noe – 2020-11-19T15:50:19.697

The encoder decoder is based on LSTM units. My problem stems from their being multiple outputs per time step. In traditional encoder decoder architecture, my understanding is that the output is generally a probability distribution over the dictionary of words. My problem is more akin to their being multiple outputs per time step – KaneM – 2020-11-19T16:32:29.687

Ahhh, now I see what you mean – noe – 2020-11-19T16:44:22.423

Do you you have any ideas on how to change the architecture or reformulate the problem to deal with this? – KaneM – 2020-11-19T17:28:17.387

Not really. I assume that the elements depend on one another in a double autoregressive way. The only thing I can think of is to apply techniques used in semi-autoregressive and non-autoregressive NMT, but normally these come with a decrease in quality. – noe – 2020-11-19T23:15:39.167

No answers