Which machine learning algorithms can be used for unsupervised POS tagging?


I am interested in an unsupervised approach to training a POS-tagger.

Labeling is very difficult, and I would like to test a tagger for my specific domain (chats), where users typically write in lowercase, etc. If it matters, the data is mostly in German.

I read about older techniques like HMMs, but maybe there are newer and better ways?


Posted 2018-07-17T15:13:42.290

Reputation: 173

Did you try spaCy on your data? What were the issues? – Adam Bittlingmayer – 2018-07-19T06:06:40.307



No unsupervised method for training a POS tagger comes close to the accuracy of human annotation or of supervised methods.

The current state-of-the-art supervised POS taggers are based on long short-term memory (LSTM) neural networks.

Brian Spiering

Posted 2018-07-17T15:13:42.290

Reputation: 10 864

@Tido, hey, how did it go? Any progress to share with us? – avocado – 2019-03-13T05:08:52.213


I'm very interested to hear what you need a tagger for in the context of chatbots.

Maybe you just need a stemmer, to produce a 'base form' for an inflected word?

In that case, you can check this.


Posted 2018-07-17T15:13:42.290

Reputation: 196

like for NER. But indeed, I wanted to use POSs for a lemmatizer: https://github.com/WZBSocialScienceCenter/germalemma

and spaCy's lemmatizer for German is very bad. Do you know about the accuracy of this stemmer? Sounds interesting.

– Tido – 2018-07-17T18:38:04.307

Snowball is a stemmer, so it does not produce a lemma (base form) but a kind of 'simplified form' common to all inflected forms. That can lead to non-existent forms like 'company' -> 'compani', but it's enough to have a common form (a common dictionary identifier for all inflected forms) for further processing, e.g. text classification. It may be a bit awkward, though, when you want to visualize results.

Having looked at Snowball for German, it does the same: it tries to cut off typical conjugation and declension suffixes, among others.

E.g. Hunde -> Hund but also Katze, Katzen -> Katz. – MkL – 2018-07-18T16:47:04.053
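To make the stemmer-versus-lemmatizer point concrete, here is a minimal sketch of suffix stripping. It is a toy, not Snowball's actual rule set: the `SUFFIXES` list, the `toy_stem` function and the `min_stem` threshold are all made up for illustration. A real Snowball stemmer for German is available, for example, through NLTK's `SnowballStemmer("german")`.

```python
from collections import defaultdict

# Hypothetical toy suffix list, longest first. Snowball's real German
# rules are considerably more elaborate than this.
SUFFIXES = ("en", "e", "n", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the inflected forms that collapse onto it."""
    groups = defaultdict(list)
    for w in words:
        groups[toy_stem(w)].append(w)
    return dict(groups)


groups = group_by_stem(["hunde", "hund", "katze", "katzen"])
print(groups)
```

The stem only has to be a shared key for all inflected forms; as with 'Katz', it does not need to be a real word.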


Fortunately, you don't need unsupervised methods for POS tagging in most languages, and especially not in German. There are semi-supervised or "weakly" supervised methods, such as the older HMM/EM approaches you mention. There is also a more recent solution based on Error-Correcting Output-Code classification: Weakly supervised POS tagging without disambiguation.

Of course, the accuracy of fully supervised methods such as LSTMs is far better than that of semi-supervised ones, but because of the known drawbacks of full supervision (e.g. a lot of manual annotation work), people still look for cheaper approaches. Excellent accuracy always comes at a higher cost.
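For readers who have not seen the HMM/EM idea mentioned above, the following is a rough sketch of Baum-Welch (EM) training of an HMM on untagged text, in plain Python. It is the fully unsupervised variant, without the tag dictionary that weakly supervised setups typically add, and it is not the method of the paper named above. The toy sentences, the number of states and all function names are assumptions made for illustration. The induced states are anonymous clusters, not named POS tags.

```python
import math
import random


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def random_dist(size, rng):
    # Random positive starting point; EM only finds a local optimum.
    return normalize([rng.random() + 0.1 for _ in range(size)])


def forward_backward(obs, start, trans, emit):
    """Scaled forward and backward passes for one sentence."""
    K, T = len(start), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [start[s] * emit[s][obs[0]] for s in range(K)]
        else:
            row = [emit[s][obs[t]]
                   * sum(alpha[t - 1][r] * trans[r][s] for r in range(K))
                   for s in range(K)]
        scale.append(sum(row))
        alpha.append(normalize(row))
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for s in range(K):
            beta[t][s] = sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r]
                             for r in range(K)) / scale[t + 1]
    return alpha, beta, scale


def baum_welch(sentences, n_states, n_iter=10, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for sent in sentences for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    data = [[index[w] for w in sent] for sent in sentences]
    K, V = n_states, len(vocab)
    start = random_dist(K, rng)
    trans = [random_dist(K, rng) for _ in range(K)]
    emit = [random_dist(V, rng) for _ in range(K)]
    history = []  # log-likelihood before each parameter update
    for _ in range(n_iter):
        start_c = [0.0] * K
        trans_c = [[0.0] * K for _ in range(K)]
        emit_c = [[0.0] * V for _ in range(K)]
        loglik = 0.0
        for obs in data:
            # E-step: expected state and transition counts.
            alpha, beta, scale = forward_backward(obs, start, trans, emit)
            loglik += sum(math.log(c) for c in scale)
            for t, o in enumerate(obs):
                for s in range(K):
                    gamma = alpha[t][s] * beta[t][s]
                    emit_c[s][o] += gamma
                    if t == 0:
                        start_c[s] += gamma
                    if t + 1 < len(obs):
                        for r in range(K):
                            trans_c[s][r] += (alpha[t][s] * trans[s][r]
                                              * emit[r][obs[t + 1]]
                                              * beta[t + 1][r] / scale[t + 1])
        # M-step: re-estimate probabilities from the expected counts.
        start = normalize(start_c)
        trans = [normalize(row) for row in trans_c]
        emit = [normalize(row) for row in emit_c]
        history.append(loglik)
    return start, trans, emit, vocab, history


# Made-up lowercase sentences standing in for untagged chat data.
sentences = [s.split() for s in [
    "der hund schläft", "die katze schläft", "der hund bellt",
    "die katze miaut", "der hund frisst", "die katze frisst",
]]
start, trans, emit, vocab, history = baum_welch(sentences, n_states=3)
print("log-likelihood per iteration:", [round(x, 3) for x in history])
```

The log-likelihood should not decrease from one iteration to the next, which is a useful sanity check on the implementation. Mapping the learned states to real tags still requires some labeled data or manual inspection, which is part of why purely unsupervised tagging lags behind supervised methods.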

Edward Weinert

Posted 2018-07-17T15:13:42.290

Reputation: 101