I'm trying to extract NPs from transcribed spoken text, such as
um it's the bl- it's the blue one in the right no left hand corner
which contains fillers (e.g. "um") and disfluencies (e.g. "bl-", "right no left hand corner") that are not commonly seen in written text. Ideally, I'd like to get something like the three sequences "it", "the blue one" and "the left hand corner" (or at the very least "the right no left hand corner").
I'm currently using Stanford CoreNLP's pre-trained shift-reduce parser with a beam size of 4 (englishSR.beam.ser.gz) and bidirectional dependency network POS tagging (english-bidirectional-distsim.tagger) after filtering out fillers and duplicated tokens (e.g. "uh it's it's that one" → "it's that one"). This performs okay but seems to fail a lot more often than I'd expect. Are there no widely available chunkers or (shallow) parsers that are tailored specifically to spoken English as opposed to written English? The language the chunker/parser is written in is irrelevant (i.e. it needn't have a Java API). I've also tried Stanford CoreNLP's caseless models, which actually seem to perform a bit worse (although I haven't done any rigorous comparisons).
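The pre-filtering step mentioned above (dropping fillers and collapsing duplicated tokens before parsing) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exact code in use: the filler list, the function name clean_utterance and the max_ngram parameter are all hypothetical, tokens are assumed to be whitespace-separated, and dropping cut-off fragments such as "bl-" is an extra step added here for the sake of the example.

```python
# Hypothetical filler inventory; adjust to the transcription conventions in use.
FILLERS = {"um", "uh", "er", "erm", "hm", "mm"}


def clean_utterance(text, max_ngram=3):
    """Remove fillers and cut-off fragments, then collapse immediate repetitions."""
    tokens = text.split()

    # Drop fillers and truncated words (assumed to be marked with a trailing hyphen).
    tokens = [t for t in tokens if t.lower() not in FILLERS and not t.endswith("-")]

    # Collapse immediately repeated n-grams, trying longer n-grams first.
    for n in range(max_ngram, 0, -1):
        kept = []
        i = 0
        while i < len(tokens):
            if tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                # The next n tokens repeat the current n: skip the first copy.
                i += n
            else:
                kept.append(tokens[i])
                i += 1
        tokens = kept

    return " ".join(tokens)


print(clean_utterance("uh it's it's that one"))
# it's that one
print(clean_utterance("um it's the bl- it's the blue one in the right no left hand corner"))
# it's the blue one in the right no left hand corner
```

Note that a filter like this only catches exact repetitions; self-repairs such as "right no left" pass through untouched, so the parser still sees them.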