Public dataset for news articles with their associated categories



I am wondering if there are any public datasets of Google News articles labeled with news categories such as politics, entertainment, lifestyle, general news, sports, etc.

I want to use such a dataset for topic detection on sentences or paragraphs. My plan is to train a classifier on it and use that classifier for predictions. However, I couldn't find any. Are there any known datasets of this kind available?


Posted 2017-09-26T08:56:30.433

Reputation: 203



Here is a massive dataset of news with categories which I created for exactly such a reason.

It includes all the headlines published by the Times of India from 2001 to 2019, with categories.

It contains ~3 million entries.


Posted 2017-09-26T08:56:30.433

Reputation: 176

It's an interesting dataset, but too specific for my use case. It probably won't generalize well to other news websites. Moreover, headlines do not have enough words for topic modeling. I have tested another such dataset with a million headlines and it performed poorly. – utengr – 2018-02-14T12:21:37.933


This dataset is included with scikit-learn, a popular ML library for Python.

It consists of postings to Usenet, categorized by newsgroup. The group titles are not exactly "categories" like you would see on Google News, but each newsgroup is supposed to cover a specific topic, as indicated by its name, so the concepts are similar. For example:

  • alt.atheism - Atheism
  • Computer Graphics
  • ...
  • Automobiles
  • Motorcycles
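As a rough sketch of how this could be used for topic detection, assuming the dataset meant here is the 20 Newsgroups collection that scikit-learn exposes through `fetch_20newsgroups`: the helper names `load_newsgroups` and `train_topic_classifier`, the two chosen categories, and the TF-IDF plus Naive Bayes pipeline are all illustrative choices, not something prescribed by the dataset.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def load_newsgroups(categories=("rec.autos", "rec.motorcycles")):
    """Return (texts, labels) for the chosen newsgroups.

    Hypothetical helper: the data is downloaded on the first call.
    """
    bunch = fetch_20newsgroups(
        subset="train",
        categories=list(categories),
        # Drop metadata so the model learns from the message body only.
        remove=("headers", "footers", "quotes"),
    )
    # Map integer targets back to newsgroup names for readable predictions.
    labels = [bunch.target_names[i] for i in bunch.target]
    return bunch.data, labels


def train_topic_classifier(texts, labels):
    """Fit a simple TF-IDF + multinomial Naive Bayes text classifier."""
    return make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)
```

Something like `model = train_topic_classifier(*load_newsgroups())` followed by `model.predict(["some paragraph"])` would then return the newsgroup name predicted for each input text.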


Posted 2017-09-26T08:56:30.433

Reputation: 1 548

It is a relevant dataset, but very small, with limited categories. – utengr – 2017-09-28T10:23:49.410


There is another big news dataset on Kaggle called All The News; you can download it here.

The articles mostly date from 2016 to July 2017. They were scraped with Beautiful Soup from big US news sites such as the New York Times, Breitbart, CNN, Business Insider, the Atlantic, Fox News, Talking Points Memo, Buzzfeed News and many more.


Posted 2017-09-26T08:56:30.433

Reputation: 153

I couldn't find the news categories (classes) for this dataset. – cemsazara – 2018-08-21T17:33:09.853