Finding themes in text documents


I have text documents that contain thousands of abstracts from medical white papers. I want to find themes in that text. Are there any suggestions other than text clustering? Clustering only gave me keywords grouped into clusters. I also tried to auto-summarize the text with the markovify library in Python, but the sentences it generated did not make much sense. Any suitable suggestions are welcome. Thank you.


Posted 2017-07-17T18:56:00.993

Reputation: 33

Could you clarify precisely what the expected outcome would be? You said you are not satisfied with keywords or with a cluster of keywords. But a cluster of keywords is essentially a "topic" or "theme", which is also your desired outcome. So, would your target be a human-like summary, key sentences stitched together, or maybe a mapping to biomedical entities from ontologies (UMLS concepts, etc.)? – Bogas – 2017-07-18T08:50:07.537

I understand what you are saying, but if I could get the cluster of keywords in the form of sentences in some way, that would be my expected outcome. The abstracts are from papers related to any type of disease, such as cancer, diabetes, heart disease, etc. I am not too familiar with UMLS concepts. – sona_1105 – 2017-07-18T13:16:48.197



The standard method for finding themes in a collection of documents is topic modeling. Topic modeling finds hidden (latent) themes from the way words co-occur across documents, going beyond plain keyword counts.

There are many approaches to topic modeling; latent Dirichlet allocation (LDA) is a standard one. LDA is a probabilistic graphical model that assumes each document is a mixture of a small number of topics and that each word in a document is generated by one of that document's topics. The number of topics is a hyperparameter that you choose in advance.
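To make those two assumptions concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is an illustration under stated assumptions, not a production implementation: the function names `fit_lda` and `top_words`, the default values for `alpha`, `beta`, and `iterations`, and the tiny `toy_docs` corpus are all made up for this example. Each token is assigned to exactly one topic, and each document ends up with a count of tokens per topic, which is its topic mixture.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on pre-tokenized documents.

    Returns (doc_topic, topic_word, vocab), where doc_topic[d][k] counts the
    tokens of document d assigned to topic k, and topic_word[k][w] counts how
    often vocabulary word w is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]

                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight of each topic: how much this document uses the topic
                # times how much the topic uses this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n most frequently assigned words for each topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(range(len(vocab)), key=lambda i: counts[i], reverse=True)
        result.append([vocab[i] for i in ranked[:n]])
    return result


# Made-up toy corpus standing in for tokenized abstracts.
toy_docs = [
    "tumor chemotherapy tumor biopsy cancer".split(),
    "cancer tumor radiation chemotherapy".split(),
    "insulin glucose diabetes insulin".split(),
    "diabetes glucose insulin diet".split(),
]

doc_topic, topic_word, vocab = fit_lda(toy_docs, num_topics=2)

for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {words}")
for d, counts in enumerate(doc_topic):
    print(f"document {d} topic counts: {counts}")
```

For thousands of real abstracts, a library implementation such as scikit-learn's `LatentDirichletAllocation` or gensim's `LdaModel` is likely a better choice than this sketch, and the text would first need tokenizing and stop-word removal. Note that the output is still a ranked word list per topic plus a topic mixture per document, not full sentences.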

Brian Spiering

Posted 2017-07-17T18:56:00.993

Reputation: 10 864