How to approach a text parsing problem


I'm looking to parse human-generated cyber-physical sensor tags into a series of standardised tags and identifiers.

For example, "Rm3.ZnT-SP" refers to ROOM ID:3 ZONE TEMPERATURE SETPOINT.

The "state of the art" uses a conditional random field (CRF) to tag each character with BIO labels, with active learning used to supply the labels. Note that it's not always as easy as the example above: the same output can be encoded in many different ways (T, temp, tmp, etc.).
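
To make the labelling burden concrete, here is a minimal sketch of character-level BIO tags for "Rm3.ZnT-SP". The label names (ROOM, ID, ZONE, TEMP, SETPOINT), the span boundaries, and the helper `char_bio` are my own illustrative assumptions, not taken from any particular paper or library.

```python
def char_bio(text, spans):
    """Build one BIO tag per character.

    spans is a list of (start, end, label) tuples with end exclusive.
    Characters not covered by any span (separators here) get "O".
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first character of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the span
    return tags


text = "Rm3.ZnT-SP"
# Hypothetical hand annotation of the example tag.
spans = [
    (0, 2, "ROOM"),       # "Rm"
    (2, 3, "ID"),         # "3"
    (4, 6, "ZONE"),       # "Zn"
    (6, 7, "TEMP"),       # "T"
    (8, 10, "SETPOINT"),  # "SP"
]

for ch, tag in zip(text, char_bio(text, spans)):
    print(ch, tag)
```

Every character needs a label under this scheme, which is the annotation cost I'd like to avoid.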

It seems there should be a way of avoiding the CRF step - I'd like to map directly from the input character sequence to an output sequence of embeddings without having to provide BIO labels. This in itself seems to be a fairly standard Seq2Seq problem. What I'm unsure about is that the output sequence is a combination of embeddings known in advance (tags/tokens) and unknown identifiers taken from the original sequence.
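
To illustrate what I mean by a mixed output sequence, here is a rough sketch of one possible target representation: known tags are emitted from a fixed vocabulary, while identifiers are represented as spans copied from the input. The vocabulary `KNOWN_TAGS`, the `("TAG", ...)` / `("COPY", ...)` tuples, and both helper functions are hypothetical names of mine, and this only shows the data format, not a model.

```python
KNOWN_TAGS = {"ROOM", "ZONE", "TEMPERATURE", "SETPOINT"}  # hypothetical fixed vocabulary


def encode_target(source, target_tokens):
    """Turn target tokens into TAG entries or COPY spans over the source."""
    encoded = []
    cursor = 0  # search left to right so repeated identifiers map in order
    for tok in target_tokens:
        if tok in KNOWN_TAGS:
            encoded.append(("TAG", tok))
        else:
            start = source.find(tok, cursor)
            if start == -1:
                raise ValueError(f"{tok!r} is neither a known tag nor found in source")
            end = start + len(tok)
            encoded.append(("COPY", start, end))
            cursor = end
    return encoded


def decode_target(source, encoded):
    """Inverse of encode_target: resolve COPY spans back to source text."""
    out = []
    for item in encoded:
        if item[0] == "TAG":
            out.append(item[1])
        else:
            _, start, end = item
            out.append(source[start:end])
    return out


source = "Rm3.ZnT-SP"
target = ["ROOM", "3", "ZONE", "TEMPERATURE", "SETPOINT"]
encoded = encode_target(source, target)
print(encoded)
print(decode_target(source, encoded))
```

The open question is how a Seq2Seq model should learn to produce the copy entries alongside the vocabulary entries.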

This seems somewhat similar to NLP - for example, when translating from one language to another, presumably you translate the words, but some literals are passed through verbatim (e.g. a street number or a driver's license number). It also has some characteristics of Named Entity Recognition (NER), but from what I've seen, NER only identifies the named entities.

Any pointers to similar problems are appreciated.

David Waterworth

Posted 2020-07-30T22:55:38.827

Reputation: 362

How many different patterns are there? Is this not doable with simple regular expressions? At first sight, training a complex model for this looks like overkill: you risk spending more time designing and implementing the model than you would just listing the patterns. – Erwan – 2020-07-31T12:13:07.710
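
For comparison, the regular-expression approach suggested in the comment might look like the sketch below for the single example pattern. The abbreviation lists, the separator set, and the `parse_tag` helper are assumptions made for illustration; real tag conventions would need their own patterns.

```python
import re

# One hand-written pattern covering a few assumed spellings of
# "room <id> zone temperature setpoint".
PATTERN = re.compile(
    r"^(?:Rm|Room)(?P<room_id>\d+)"  # room abbreviation followed by its identifier
    r"[.\-_ ]"                       # assumed separator characters
    r"(?:Zn|Zone)"
    r"(?:Temp|Tmp|T)"                # longer alternatives first
    r"[.\-_ ]"
    r"(?:SP|Setpoint)$",
    re.IGNORECASE,
)


def parse_tag(raw):
    """Return standardised tokens, or None if the pattern does not match."""
    m = PATTERN.match(raw)
    if m is None:
        return None
    return ["ROOM", f"ID:{m.group('room_id')}", "ZONE", "TEMPERATURE", "SETPOINT"]


print(parse_tag("Rm3.ZnT-SP"))
print(parse_tag("room12_zonetemp-setpoint"))
print(parse_tag("AHU1.SaT"))  # not covered by this pattern
```

Each new naming convention needs another pattern, so whether this scales depends on how many distinct patterns occur in practice.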

No answers