Chunker/shallow parser for spoken language


I'm trying to extract NPs from transcribed spoken text, such as

um it's the bl- it's the blue one in the right no left hand corner

which contains phenomena such as fillers (e.g. um) and disfluencies (e.g. bl-, right no left hand corner) that are not commonly seen in written text. Ideally, I'd like to get something like the three sequences it, the blue one and the left hand corner (or, at the very least, the right no left hand corner).

I'm currently using Stanford CoreNLP's pre-trained shift-reduce parser with a beam size of 4 (englishSR.beam.ser.gz) and bidirectional dependency network POS tagging (english-bidirectional-distsim.tagger), after filtering out fillers and duplicated tokens (e.g. uh it's it's that one → it's that one). This performs okay but seems to fail a lot more often than I'd expect. Are there any widely available chunkers or (shallow) parsers tailored specifically to spoken English as opposed to written English? The language the chunker/parser is written in is irrelevant (i.e. it needn't have a Java API). I've also tried Stanford CoreNLP's caseless models, which actually seem to perform a bit worse (although I haven't done any rigorous comparison).
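For concreteness, the filtering step described above might look roughly like the following. This is only a minimal sketch and not the actual code in use: the filler list and the function names (`remove_fillers`, `collapse_repeats`, `preprocess`) are made up for illustration, and tokenization is assumed to be plain whitespace splitting.

```python
# Hypothetical filler inventory; a real one would depend on the transcription scheme.
FILLERS = {"um", "uh", "er", "erm"}


def remove_fillers(tokens):
    """Drop tokens that appear in the (assumed) filler list."""
    return [t for t in tokens if t.lower() not in FILLERS]


def collapse_repeats(tokens, max_n=3):
    """Collapse immediately repeated n-grams (n <= max_n), keeping one copy."""
    out = []
    for tok in tokens:
        out.append(tok)
        changed = True
        while changed:
            changed = False
            # Try longer repeats first, then shorter ones.
            for n in range(max_n, 0, -1):
                if len(out) >= 2 * n and out[-n:] == out[-2 * n:-n]:
                    del out[-n:]  # remove the second copy
                    changed = True
                    break
    return out


def preprocess(utterance):
    """Whitespace-tokenize, strip fillers, then collapse exact repeats."""
    return collapse_repeats(remove_fillers(utterance.split()))


print(preprocess("uh it's it's that one"))
```

Note that exact-match filtering of this kind leaves the example utterance from the top of the question almost untouched: only um is removed, because the restart it's the bl- it's the blue is not a verbatim repetition and the repair right no left contains no repetition at all. Those are exactly the cases where the downstream parser struggles.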


Posted 2017-10-12T20:57:37.467

Would the downvoter not like to make a useful comment? – errantlinguist – 2017-10-13T06:50:18.627

No answers