semi-structured text parsing using machine learning

5

4

I am looking for a method to parse semi-structured textual data, i.e. data that is poorly formatted but usually has the visual structure of a matrix. The content and number of items may vary a lot, headers may or may not be present, and the data may sometimes have to be interpreted column-wise and sometimes row-wise, and so on.

I have read the WHISK information extraction paper: https://homes.cs.washington.edu/~soderlan/soderland_ml99.pdf

Unfortunately, it is not very detailed, and I have not been able to find a real system that implements it, or even snippets of code.

Does anybody have an idea where I can find such help? Or can anyone suggest an alternative approach that may be suited to my problem?

Thank you in advance for your reply!

mic

Posted 2015-01-26T17:45:02.203

Reputation: 513

2

Sorry, I forgot to link to UTAH ;) https://github.com/sonalake/utah-parser

– None – 2016-04-02T15:51:32.827

Without a sample it's hard to see what tool could help you. UTAH is a Java library that can parse semi-structured text. It might be useful for you. – None – 2016-04-02T15:48:54.780

Answers

5

Without a sample of your data, it's unclear how the data is structured and which tool is suitable for processing it.

Here are some blind recommendations based on my experience:

  • If you just need some flexibility when parsing a text record, such as a field that repeats a variable number of times or fields that are parsed conditionally, then you should check out this Python library: http://construct.readthedocs.org/en/latest/ It lets you first define a hierarchical structure for your data and then apply that structure to parse information from a text file. It's especially useful when parsing binary files.
  • If you're looking for an algorithm that can actually "understand" your text data and "infer" the structure in a smart way, then you might want to try a graph-based approach: http://kavita-ganesan.com/opinosis
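
To make the first option a bit more concrete, here is a minimal sketch of what "variable repeat count" and "conditional field" mean in practice. It uses only the Python standard library and is not the construct API. The record layout (`NAME COUNT v1 ... vCOUNT [UNIT]`) and the function name `parse_record` are made up for illustration, since I don't know what your data looks like.

```python
def parse_record(line):
    """Parse a hypothetical record of the form: NAME COUNT v1 ... vCOUNT [UNIT]."""
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("record needs at least a name and a count")

    name, count = tokens[0], int(tokens[1])

    # Variable repeat: the COUNT field says how many values follow.
    values = [float(t) for t in tokens[2:2 + count]]
    if len(values) != count:
        raise ValueError(f"expected {count} values, found {len(values)}")

    record = {"name": name, "values": values}

    # Conditional field: a unit is stored only if a trailing token is present.
    rest = tokens[2 + count:]
    if rest:
        record["unit"] = rest[0]
    return record


print(parse_record("temp 3 1.5 2.0 2.5 C"))
print(parse_record("flag 0"))
```

A declarative library expresses the same idea as a structure definition instead of hand-written slicing, which tends to pay off once records are nested or the layout changes often.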

imadcat

Posted 2015-01-26T17:45:02.203

Reputation: 136