Publicly Available Datasets

190

192

One of the common problems in data science is gathering data from various sources in a reasonably clean (semi-structured) format and combining metrics from those sources for higher-level analysis. Looking at other people's efforts, especially other questions on this site, it appears that many people in this field are doing somewhat repetitive work. For example, analyzing tweets, Facebook posts, Wikipedia articles, etc. is part of a lot of big data problems.

Some of these data sets are accessible through public APIs offered by the provider site, but usually some valuable information or metrics are missing from these APIs, and everyone has to do the same analyses again and again. For example, although clustering users may depend on the use case and the selection of features, a base clustering of Twitter/Facebook users could be useful in many big data applications; yet it is neither provided by the API nor available publicly in independent data sets.

Is there any index or publicly available data set hosting site containing valuable data sets that can be reused in solving other big data problems? I mean something like GitHub (or a group of sites/public datasets, or at least a comprehensive listing) for data science. If not, what are the reasons for not having such a platform for data science? The commercial value of data, the need to frequently update data sets, ...? Can we not have an open-source model for sharing data sets devised for data scientists?

Amir Ali Akbari

Posted 2014-05-18T18:45:38.957

Reputation: 1 373

2

Cross-link: A database of open databases?

– WBT – 2016-03-24T04:22:56.613

23

This question might be more appropriate on the dedicated opendata.SE. That said, I cross my fingers for dat, which aspires to become a "Git for data".

– ojdo – 2014-05-18T21:23:52.687

4

@ojdo Thanks, I never heard of opendata.SE before, I also found this interesting (and very similar) question there.

– Amir Ali Akbari – 2014-05-19T08:28:56.713

4

See http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public.

– Piotr Migdal – 2014-05-26T17:08:57.360

1

https://zenodo.org/

– Martin Thoma – 2017-08-31T20:25:46.190

1

The Reserve Bank of India has a huge database about India, and the World Bank has a huge data set.

– Panjikaran – 2018-05-11T04:05:41.567

I haven't found any good, free, comprehensive datasets for typical Business Intelligence applications. The Microsoft Contoso BI Demo Dataset for Retail Industry, available from the official Microsoft Download Center, works with some Microsoft products (see AndyGett on SharePoint and Other Business Software), but I don't see any plain SQL or CSV dumps of it, nor any license info.

– nealmcb – 2015-04-30T17:42:11.410

A great place to find public data sets is http://opendata.stackexchange.com/

– sheldonkreger – 2015-05-29T18:00:14.140

3

Have you joined the Open Data Stack Exchange? http://opendata.stackexchange.com

– sss4r – 2015-06-21T21:00:42.177

Answers

105

There is, in fact, a very reasonable list of publicly-available datasets, supported by different enterprises/sources.

Some of them are below:

Now, two considerations on your question. First one, regarding policies of database sharing. From personal experience, there are some databases that can't be made publicly available, either for involving privacy restraints (as for some social network information) or for concerning government information (like health system databases).

Another point concerns the usage/application of the dataset. Although some bases can be reprocessed to suit the needs of the application, it would be great to have some nice organization of the datasets by purpose. The taxonomy should involve social graph analysis, itemset mining, classification, and lots of other research areas there may be.

Rubens

Posted 2014-05-18T18:45:38.957

Reputation: 3 967

71

IharS

Posted 2014-05-18T18:45:38.957

Reputation: 4 894

42

There are many openly available data sets; one that many people overlook is data.gov. As mentioned previously, Freebase is great, and so are all the examples posted by @Rubens.

MCP_infiltrator

Posted 2014-05-18T18:45:38.957

Reputation: 986

37

Freebase is a free, community-driven database that spans many interesting topics and contains about 2.5 billion facts in machine-readable format. It also has a good API for performing data queries.

Here is another compiled list of open data sets: http://www.datapure.co/open-data-sets

Konstantin V. Salikhov

Posted 2014-05-18T18:45:38.957

Reputation: 634

Freebase is closing down and its database will move to Wikidata soon.

– cynddl – 2014-12-17T14:39:02.783

34

Jakubee

Posted 2014-05-18T18:45:38.957

Reputation: 401

25

For time series data in particular, Quandl is an excellent resource -- an easily browsable directory of (mostly) clean time series.

One of their coolest features is open-data stock prices -- i.e. financial data that can be edited wiki-style, and isn't encumbered by licensing.

azza-bazoo

Posted 2014-05-18T18:45:38.957

Reputation: 131

21

Enigma is a repository of publicly available datasets. Its free plan offers public data search, with 10k API calls per month. Not all public databases are listed, but the list is enough for common cases.

I used it for academic research and it saved me a lot of time.


Another interesting source of data is the @unitedstates project, which contains data about the United States (members of Congress, geographic shapes…) along with tools to collect it.

cynddl

Posted 2014-05-18T18:45:38.957

Reputation: 101

20

I would like to point to The Open Data Census. It is an initiative of the Open Knowledge Foundation based on contributions from open data advocates and experts around the world.

The value of the Open Data Census lies in its open, community-driven, and systematic effort to collect and update a database of open datasets globally at the country level and, in some cases such as the U.S., at the city level.

It also presents an opportunity to compare different countries and cities in selected areas of interest.

tomaskazemekas

Posted 2014-05-18T18:45:38.957

Reputation: 313

19

Another resource is provided by The Guardian, the British daily, on its website. The datasets published by the Guardian Datablog are all hosted there and include Premier League football clubs' accounts, UK inflation and GDP figures, Grammy Awards data, and more. The datasets are available at

Some more resources. Some of the datasets are in R format, or R commands exist for importing the data directly into R.

binga

Posted 2014-05-18T18:45:38.957

Reputation: 674

18

Custom Google Search

You can use the Custom Google Search for datasets:

Google Custom Search: Datasets

It includes 230 sources and meta-sources of datasets, including all mentioned in this question. Please, feel free to exclude .gov and any other websites from results by adding " -.gov" or " -site.com" to the search line. Other Google Search Operators work.

Don't hesitate to contact me if you have ideas about which websites to add.
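The exclusion operators mentioned above can be illustrated with a small sketch. This is only an illustration of how such a search line might be assembled; `build_query` is a hypothetical helper, not part of any Google API, and the search terms are made up for the example.

```python
from urllib.parse import quote_plus


def build_query(terms, exclude=()):
    """Join search terms and append a '-pattern' exclusion for each unwanted site or suffix."""
    parts = list(terms) + [f"-{pattern}" for pattern in exclude]
    return " ".join(parts)


# Hypothetical search: look for air quality datasets, skipping .gov sites and site.com.
query = build_query(["air", "quality", "dataset"], exclude=[".gov", "site.com"])
print(query)              # air quality dataset -.gov -site.com
print(quote_plus(query))  # air+quality+dataset+-.gov+-site.com
```

The plain string is what you would type into the search line; the URL-encoded form is only relevant if you want to paste the query into a link.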

IOGDS

The following service categorizes more than 1,000,000 public datasets:

IOGDS: International Open Government Dataset Search

Anton Tarasenko

Posted 2014-05-18T18:45:38.957

Reputation: 631

What are the parameters for the custom search link you provided? Does it search in a list of websites, keywords, etc.? – Amir Ali Akbari – 2014-12-05T07:42:26.397

@AmirAliAkbari It searches through sources like Data.gov, Quandl, and other major data warehouses. – Anton Tarasenko – 2014-12-05T12:39:34.210

17

Late answer, but here is an eclectic list of 100+ Interesting Data Sets

The blog post is fun and easy to read through (I have no affiliation). It's worth scanning through; here are a few from the top:

  • Last words of every Texas inmate executed since 1984

  • 10,000 annotated images of cats

  • 2.2 million chess matches

philshem

Posted 2014-05-18T18:45:38.957

Reputation: 205

16

I've found this link in Data Science Central with a list of free datasets: Big data sets available for free

lafdez

Posted 2014-05-18T18:45:38.957

Reputation: 101

16

Did you know about the PUMA Benchmarks and dataset downloads? https://sites.google.com/site/farazahmad/pumadatasets

It does include the following:

  1. TeraSort
  2. Wikipedia
  3. Self-Join
  4. Adjacency-List
  5. Movies-database
  6. Ranked-Inverted-Index

algarecu

Posted 2014-05-18T18:45:38.957

Reputation: 181

16

The UK Government provide an excellent source of non-personal data collected throughout government departments: http://data.gov.uk

Federer

Posted 2014-05-18T18:45:38.957

Reputation: 181

15

I am new to this forum and chiming in late on this question. I have been maintaining (I am a co-founder of) a catalog of publicly available data portals. There are over 1,000 now listed, covering portals at international, federal, state, municipal and academic levels across the globe.

http://www.opengeocode.org/opendata/

Andrew - OpenGeoCode

Posted 2014-05-18T18:45:38.957

Reputation: 261

15

chenrui333

Posted 2014-05-18T18:45:38.957

Reputation: 203

1

Can you please provide us with some information on both datasets/links? This will indeed ease the burden of those looking for specific types of data set. Take a look at other posts to see what kind of information your references are missing.

– Rubens – 2015-01-30T21:31:13.203

15

I'm surprised no one has mentioned this, as it seems fairly obvious: http://www.kaggle.com consistently has new and very interesting datasets. Information is considered an asset, so companies often don't want to release their data (plus there are privacy concerns). Kaggle gives you data, and in exchange the companies hope you solve business problems with it.

Ram

Posted 2014-05-18T18:45:38.957

Reputation: 323

12

As you mentioned, the API is the hard part, not the data. Quandl seems to solve this problem by providing over 10 million publicly available data sets under one easy, RESTful API. If programming isn't your strong suit, there is a free tool to make loading data into Excel very easy. Additionally, if you do enjoy programming, there are several native libraries in R, Python, Java and more.

Brian Risk

Posted 2014-05-18T18:45:38.957

Reputation: 171

12

To add to a possibly never ending list:

as mentioned by cynddl, there is Wikidata,

and for curated structured knowledge, Wolfram Alpha.

image_doctor

Posted 2014-05-18T18:45:38.957

Reputation: 405

12

I came across this collection on GitHub. The collection is categorised as well.

https://github.com/caesar0301/awesome-public-datasets

And for the part regarding

Can we not have an open-source model for sharing data sets devised for data scientists?

you can refer to the Leek group guide to data sharing

Shagun Sodhani

Posted 2014-05-18T18:45:38.957

Reputation: 722

10

Not all government data is listed on data.gov - Sunlight Foundation put together a set of spreadsheets back in February describing sets of available data.

Steve Kallestad

Posted 2014-05-18T18:45:38.957

Reputation: 3 078

9

One other data source I didn't see listed is The GDELT Project. From the site:

GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

dvdnglnd

Posted 2014-05-18T18:45:38.957

Reputation: 171

8

This subreddit lists a lot of known Datasets

Reddit Datasets

There are a lot of dataset requests on that subreddit, several of which have been answered.

Some guy

Posted 2014-05-18T18:45:38.957

Reputation: 131

6

I created a GitHub repo for this. The datasets are not big; they are minimal examples meant for practicing and exploring predictive-modeling techniques, which can then be extended to big datasets.

Machine Learning Problem Bible (MLPB)

The cool/unique thing about this repo is that every problem is tagged with tags like [multi-class], [unbalanced-data], [regression], etc. making it easy to find certain types of problems/datasets.

Ben

Posted 2014-05-18T18:45:38.957

Reputation: 131

6

Eurostat http://ec.europa.eu/eurostat and the European Central Bank https://www.ecb.europa.eu/stats/html/index.en.html provide a great variety of datasets which I use quite often in my work projects.

Juha

Posted 2014-05-18T18:45:38.957

Reputation: 21

6

Besides all these datasets, if you are interested in data related to India, the official public site of the Indian Government is

It provides datasets from different departments of Indian government which can be well used for Big Data Analysis & Machine Learning.

Gaurav

Posted 2014-05-18T18:45:38.957

Reputation: 1

4

If we load the MASS package in R, we can access multiple data frames (data sets).

install.packages("MASS")
library("MASS")

dileep balineni

Posted 2014-05-18T18:45:38.957

Reputation: 333

4

Yahoo just released a huge dataset for the research community. Enjoy it!

Kasra Manshaei

Posted 2014-05-18T18:45:38.957

Reputation: 5 323

3

3 datasets from https://www.jc-bingo.com/about

  • visitor-interests.csv Aggregated visitor interests compiled based on 1 week web access logs. Includes visitor IP address, user-agent string, visitor country, accessed page languages and topics. 19,926 records, 2.9 Mb.
  • user-agents.csv Real visitor user agents ordered by popularity. 4,826 records, 716 Kb.
  • bots.csv Robot IP addresses and user-agent strings extracted from web access logs. 1,293 records, 122 Kb.

Yuri

Posted 2014-05-18T18:45:38.957

Reputation: 1

3

Obviously, there exists a large set of public databases.

One not yet mentioned, is from the FAO (Food and Agriculture Organization of the United Nations), accessible at:

http://www.fao.org/faostat/

It contains data about food production for countries worldwide.

setempler

Posted 2014-05-18T18:45:38.957

Reputation: 101

3

Quite late to the party, but I believe the following might be helpful: http://data.worldbank.org/indicator
https://www.quandl.com/

Srini

Posted 2014-05-18T18:45:38.957

Reputation: 1

2

This question already has many answers, but I think machine-learning practitioners will find the many CSV files here very useful.

Media

Posted 2014-05-18T18:45:38.957

Reputation: 12 077

2

These are the most used ones:

  1. Kaggle: The Home of Data Science & Machine Learning. Kaggle helps you learn, work, and play.
  2. Govt. Repository Data: USCG Marine Casualty and Pollution Data
  3. Quandl: Quandl is a marketplace for financial, economic and alternative data delivered in modern formats for today's analysts, including Python, Excel, Matlab, R, and via our API.
  4. UCI WEB: Classification - Regression - Adult - Abalone - Life Sciences

Anmol Kumar

Posted 2014-05-18T18:45:38.957

Reputation: 53

2

tenshi

Posted 2014-05-18T18:45:38.957

Reputation: 606

1

Google now also provides a dataset search!

https://toolbox.google.com/datasetsearch

It gives you links to all the other sites. It's like one search to search them all!

Itachi

Posted 2014-05-18T18:45:38.957

Reputation: 201