semi-structured text parsing using machine learning



I am looking for a method to parse semi-structured textual data, i.e. data that is poorly formatted but usually has the visual structure of a matrix. The content and the number of items may vary a lot, headers may or may not be present, the matrix may have to be interpreted column-wise or row-wise, and so on.

I have read the WHISK information extraction paper:

Unfortunately, it is not very detailed, and I have not been able to find a real system that implements it, or even snippets of code.

Does anybody have an idea where I can find such help? Or can anyone suggest an alternative approach that may be suited to my problem?

Thank you in advance for your reply!


Posted 2015-01-26T17:45:02.203

Reputation: 513


Sorry, I forgot to link to UTAH ;)

– None – 2016-04-02T15:51:32.827

Without a sample it's hard to see which tool could help you. UTAH is a Java library that's able to parse semi-structured text. It might be useful for you. – None – 2016-04-02T15:48:54.780



Without a sample of your data, it's unclear what its structure is and which tool is suitable to process it.

Here are some blind recommendations based on my experience:

  • If you just need some flexibility in parsing the text records, such as a variable repeat count for certain fields or conditional parsing of fields, then you should check out this Python library: it allows you to first define a hierarchical structure for your data, and then apply this structure to parse information from a text file. It's especially useful when parsing binary files.
  • If you're looking for an algorithm that can actually "understand" your text data and "infer" the structure in a smart way, then you might want to try a graph-based approach.
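
To make the first recommendation more concrete, here is a minimal sketch in plain Python of what "define a hierarchical structure, then apply it" can look like, with one field repeated a variable number of times and one field parsed conditionally. It does not use the library mentioned above. The `Field` class, the `parse_record` function, the schema and the sample record are all hypothetical names and data made up for illustration, and the sketch assumes one field per line.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class Field:
    """Hypothetical description of one line (or group of lines) in a record."""
    name: str
    # Converts the raw line text into a value.
    convert: Callable[[str], object] = str
    # Name of an earlier field holding how many times this line repeats.
    repeat: Optional[str] = None
    # Predicate on the fields parsed so far; the line is read only if it is true.
    when: Optional[Callable[[dict], bool]] = None


def parse_record(lines: Iterator[str], schema: list) -> dict:
    """Consume lines according to the schema and return the parsed record."""
    record: dict = {}
    for field in schema:
        # Conditional field: skip it when the predicate says it is absent.
        if field.when is not None and not field.when(record):
            continue
        if field.repeat is not None:
            # Variable repeat: the count comes from a previously parsed field.
            count = record[field.repeat]
            record[field.name] = [
                field.convert(next(lines).strip()) for _ in range(count)
            ]
        else:
            record[field.name] = field.convert(next(lines).strip())
    return record


# Made-up record layout: a title, a yes/no flag, a row count,
# that many rows of numbers, and a note only if the flag is "yes".
SCHEMA = [
    Field("title"),
    Field("has_note", lambda s: s == "yes"),
    Field("n_rows", int),
    Field("rows", lambda s: [float(x) for x in s.split()], repeat="n_rows"),
    Field("note", when=lambda rec: rec["has_note"]),
]

SAMPLE = """\
temperatures
yes
2
1.5 2.5 3.5
4.0 5.0 6.0
sensor B was offline
"""

record = parse_record(iter(SAMPLE.splitlines()), SCHEMA)
print(record)
```

Keeping the layout in a data structure rather than in hand-written parsing code means that a new record variant only needs a new schema. Real data that is as irregular as described in the question would probably also need error handling and some tolerance for missing or extra lines, which this sketch leaves out.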


Posted 2015-01-26T17:45:02.203

Reputation: 136