I'm looking to parse human-generated cyber-physical sensor tags into a series of standardised tags and identifiers.
For example "Rm3.ZnT-SP" refers to ROOM ID:3 ZONE TEMPERATURE SETPOINT
The "state of the art" involves using a CRF to tag each character with BIO labels, with active learning used to acquire those labels. Note it's not always this easy: the same concept can be encoded many different ways (T, temp, tmp, etc.)
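To make the labelling burden concrete, here is a minimal sketch of what character-level BIO labels look like for the example tag above. The label names (ROOM, ROOM_ID, ZONE_TEMP, SETPOINT) are my own assumptions for illustration, not a real schema:

```python
# Hypothetical character-level BIO labelling for "Rm3.ZnT-SP".
# Label names are illustrative assumptions, not a standard schema.
tag = "Rm3.ZnT-SP"
labels = [
    "B-ROOM", "I-ROOM",                            # "Rm"
    "B-ROOM_ID",                                   # "3"
    "O",                                           # "."
    "B-ZONE_TEMP", "I-ZONE_TEMP", "I-ZONE_TEMP",   # "ZnT"
    "O",                                           # "-"
    "B-SETPOINT", "I-SETPOINT",                    # "SP"
]
assert len(labels) == len(tag)  # one label per character
for ch, lab in zip(tag, labels):
    print(f"{ch!r} -> {lab}")
```

Every character needs one of these labels, which is exactly the annotation cost I'd like to avoid.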
It seems there should be a way of avoiding the CRF step: I'd like to map directly from the input character sequence to an output sequence of embeddings without having to provide BIO labels. That in itself is a fairly standard Seq2Seq problem. What I'm unsure about is that the output sequence is a combination of (known-in-advance) embeddings (tags/tokens) and (unknown) identifiers copied from the original sequence.
This seems somewhat similar to NLP. For example, if I were translating from one language to another, presumably you translate the words, but some literals are passed through verbatim (e.g. a street number or a driver's license number). It also shares characteristics with Named Entity Recognition, but from what I've seen, NER only identifies the named entities.
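One way to picture the mixed output I have in mind is a copy/pointer-style target sequence, where each decoding step either generates a token from a fixed vocabulary or copies a character from the source (in the spirit of pointer-generator / CopyNet-style models). This is a hand-written sketch of the data representation only, not model code, and all names in it are assumptions:

```python
# Hypothetical target representation mixing a fixed tag vocabulary with
# copy actions pointing back into the source string. Illustrative only.
src = "Rm3.ZnT-SP"

# Each target step is either ("GEN", vocab_token) or ("COPY", src_index).
target = [
    ("GEN", "ROOM"),
    ("COPY", 2),                # copies src[2], the identifier "3", verbatim
    ("GEN", "ZONE_TEMPERATURE"),
    ("GEN", "SETPOINT"),
]

def realise(src, target):
    """Expand a GEN/COPY action sequence into the decoded output tokens."""
    out = []
    for op, arg in target:
        out.append(src[arg] if op == "COPY" else arg)
    return out

print(realise(src, target))
```

The attraction is that the decoder never needs per-character BIO labels; it only needs (input string, decoded output) pairs, with identifiers handled by the copy action rather than by the vocabulary.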
Any pointers to similar problems are appreciated.