I am trying to design a sequence-to-sequence autoencoder where the encoded sequence is shorter than the input sequence, with the encoding's dimensionality allowed to be larger to compensate.
Each unit of the resulting encoded sequence should encode many timesteps of the input based on their similarity or relatedness to each other, and the decoder should be able to reconstruct the exact sequence from that encoding. Basically, if we expect an input sequence to consist of shorter sub-sequences of variable length, I want an encoder that identifies the similarity between the timesteps of a sub-sequence and encodes that pattern into a single unit, so that the decoder can use this information to "work out" how to output all the timesteps of that short sub-sequence. All of this has to be done in an unsupervised fashion, as these shorter sub-sequences are not labelled; the short-term patterns have to be learned.
As an example, consider the task of phoneme extraction, where a sound utterance has to be encoded into a phoneme sequence (shorter in length but denser) and then decoded back into the original utterance.
Mainly, I would like to know what kind of bottleneck could bring out this type of encoding in an autoencoder. I'm also open to other models/architectures that might help.
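To make the setup concrete, here is a minimal sketch (in PyTorch) of the kind of architecture I have in mind: an RNN encoder whose hidden states are downsampled in time by a strided 1-D convolution, so each latent step must summarise several adjacent input steps, followed by an upsampling decoder trained with a plain reconstruction loss. All layer choices, sizes, and the downsampling rate here are my own illustrative assumptions, not a known-good recipe.

```python
import torch
import torch.nn as nn

class TemporalBottleneckAE(nn.Module):
    """Sequence autoencoder whose latent sequence is shorter than the input.

    The encoder compresses time by a factor of `rate` with a strided Conv1d,
    forcing each latent unit to summarise `rate` adjacent timesteps; the
    decoder upsamples back to full length and reconstructs the sequence.
    """
    def __init__(self, feat_dim=8, latent_dim=32, rate=4):
        super().__init__()
        self.enc_rnn = nn.GRU(feat_dim, latent_dim, batch_first=True)
        # (B, latent_dim, T) -> (B, latent_dim, T // rate)
        self.down = nn.Conv1d(latent_dim, latent_dim, kernel_size=rate, stride=rate)
        self.up = nn.Upsample(scale_factor=rate, mode="nearest")
        self.dec_rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, feat_dim)

    def forward(self, x):                       # x: (B, T, feat_dim), T divisible by rate
        h, _ = self.enc_rnn(x)                  # (B, T, latent_dim)
        z = self.down(h.transpose(1, 2))        # (B, latent_dim, T // rate): the short code
        u = self.up(z).transpose(1, 2)          # (B, T, latent_dim)
        d, _ = self.dec_rnn(u)
        return self.out(d), z.transpose(1, 2)   # reconstruction and latent sequence

x = torch.randn(2, 64, 8)                       # batch of 2 sequences, 64 steps, 8 features
model = TemporalBottleneckAE()
recon, code = model(x)
print(recon.shape, code.shape)                  # (2, 64, 8) and (2, 16, 32)
loss = nn.functional.mse_loss(recon, x)         # reconstruction objective
```

The fixed stride is the obvious weakness: it compresses every `rate` timesteps regardless of content, whereas I would want the segmentation itself (sub-sequences of variable length) to be learned.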
NOTE: The input is a time-series sequence, and individual timesteps can be expected to be similar to their few adjacent timesteps.
For instance, if the input is a speech utterance, I would want to capture the relation between timesteps at the phonemic level, where related timesteps are localised contiguously in the input. If the input utterance is "cat", the red circles could be the timesteps of /k/, the green ones /æ/, and the blue ones /t/. It is then easy to see that the red circles would be clumped together, as would the green and blue ones among themselves.