Tools for merging similar datasets continuously

25

12

How do I merge similar datasets, not just once but on an ongoing basis? This seems like a reasonably common problem so I'd be interested if there is a tool out there aimed at this problem, as a guided workflow. Or if any general purpose data tools might suit.

The exact problem I'm trying to solve is to curate a dataset of UK public bodies that is based on a number of existing datasets that change over time. The sorts of issues are:

  • The names of the bodies vary subtly between the datasets
  • The datasets all get updates which I'd like to add in regularly.
  • There are thousands of bodies, so plenty of automation is required to match them, yet manual matching is essential too.
  • Once it is established that an item needs copying into my dataset, the columns of the source datasets (name, abbreviation, website, etc.) will need to be mapped to my dataset's columns. I might want to do some manual tidying of the values, and not have this overwritten on the next time I 'run' the tool.
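The name-variation issue in the first bullet can often be handled with automated fuzzy matching before falling back to manual review. A minimal sketch using Python's standard-library difflib (the body names and threshold here are illustrative, not from any real dataset):

```python
import difflib

# Canonical names already in the curated dataset (illustrative values).
canonical = ["Environment Agency", "Food Standards Agency", "Ordnance Survey"]

def match_name(source_name, cutoff=0.85):
    """Return the best canonical match, or None to flag for manual review."""
    hits = difflib.get_close_matches(source_name, canonical, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_name("Environment Agency (EA)"))  # close enough to auto-match
print(match_name("UK Space Agency"))          # no close match: manual review
```

Anything falling below the cutoff can be queued for the manual matching step rather than silently dropped.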

I'm very keen to hear other ideas, as the idea I've come up with seems a little esoteric:

  1. Use crowd-sourcing to match up public body names using Nomenklatura.
  2. Write a Python script to take each data source, substitute the standardised names, map the columns to a standard structure, and save the output as CSV.
  3. Keep each CSV in git as the same 'file' but on different git branches. This allows me to merge changes into the master manually, using visual git tools. Git keeps a full record of changes in each dataset and the merge.
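Step 2 above can be sketched with nothing but the standard library's csv module. The source column names, standard schema, and name-lookup table below are hypothetical placeholders for whatever the real datasets use:

```python
import csv

# Hypothetical mapping from one source's columns to the standard schema;
# each real source would get its own map.
COLUMN_MAP = {"OrganisationName": "name", "Acronym": "abbreviation", "URL": "website"}
STANDARD_COLUMNS = ["name", "abbreviation", "website"]

def normalise(in_path, out_path, name_lookup=None):
    """Rewrite one source CSV into the standard column structure,
    substituting a standardised name wherever the lookup has an entry."""
    name_lookup = name_lookup or {}
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=STANDARD_COLUMNS)
        writer.writeheader()
        for row in reader:
            out = {std: row.get(src, "") for src, std in COLUMN_MAP.items()}
            out["name"] = name_lookup.get(out["name"], out["name"])
            writer.writerow(out)
```

Committing each script's output on its own branch then gives git something stable to diff and merge.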

D Read

Posted 2013-05-09T09:50:34.613

Reputation: 1 372

1I totally subscribe to your idea! – RSFalcon7 – 2013-05-09T10:25:52.930

Answers

15

OpenRefine (formerly Google Refine) offers nice tools for data cleansing, e.g. correcting slight spelling variations. You can also script all transformations on the data and re-apply them later for updated datasets.

Is this what you're looking for?

Patrick Hoefler

Posted 2013-05-09T09:50:34.613

Reputation: 4 945

6

OpenRefine automates really badly. There are client libraries for applying the JSON generated by Refine from Python or Ruby, but I am still looking for strong examples of their usage. See the Known Client Libraries for Refine page for details.

– magdmartin – 2013-05-09T17:13:07.843

9

Duplicate reply to a previous question, but you might like this as it's all in Python and very easy to automate.

I would like to recommend an alternative that I have found that I prefer to open refine.

https://github.com/datamade/dedupe

It is a very easy-to-use Python program (I recommend using it on Linux) that provides many custom options for merging your datasets. You are able to decide how each field is compared. I like this feature because I have a field, source (with values doc1 or doc2), where I only want matches to occur when the values are not equal. It handles lat/lon, string, and custom comparisons. The author is very quick to reply to any questions you might have.

The thing I like most is that it lets you help train the program: it takes samples of your data and asks you, "Is this a match?" (Yes, No, Unsure, or Finished).

So even if you don't want to get into the code and write your own custom comparators, you're able to quickly teach the program what you consider a match.

It's very easy to install using Python's pip:

pip install "numpy>=1.9"
pip install dedupe

There are many examples of how to use dedupe.
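To give a flavour of the per-field control described above, here is a hypothetical field definition in the style of dedupe's dict-based configuration (the exact API varies between dedupe versions, so treat this as a sketch and check the project's docs):

```python
# Hypothetical field definitions in the style of dedupe's dict-based API
# (version-dependent; check the docs for the API your installed version expects).
fields = [
    {"field": "name", "type": "String"},
    {"field": "website", "type": "String", "has missing": True},
    # An Exact comparison on a 'source' column (doc1/doc2) is one way to
    # express the "only match across sources" idea described above.
    {"field": "source", "type": "Exact"},
]

# With dedupe installed, the interactive training loop would look roughly like:
#   import dedupe
#   deduper = dedupe.Dedupe(fields)
#   deduper.prepare_training(data)
#   dedupe.console_label(deduper)   # yes/no/unsure labelling in the console
#   deduper.train()
```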

Dan Ciborowski - MSFT

Posted 2013-05-09T09:50:34.613

Reputation: 433

7

Topic Maps is a relationship technology -- standardized in ISO13250 -- that expressly deals with the issue of needing to repeatedly merge disparate data sources.

When I say "expressly" I mean it has excellent structures for solving the identification problem we all recognize from interacting with the web and merging two topics that are about the same thing.

Each of the specific issues you cite are handled directly in Topic Maps:

  • Identity is based on URIs as public subject identifiers, not just names

  • Deduping is automatically handled by the Topic Map engine

  • Unlike database-oriented solutions, constant manual tweaking of the data model and identifying disparate sources as being about the same thing are natural actions when working with Topic Maps

  • Any piece of information can itself become a first-class member of the Topic Map, with any amount of attendant metadata

Ontopia is an open-source implementation of Topic Maps which was previously sold commercially as Ontopia Knowledge Suite (OKS). The Topic Maps community seems to be larger in Europe than in the United States so training and conferences are most easily found there.

Roger_S

Posted 2013-05-09T09:50:34.613

Reputation: 434

6

I am currently working on a similar project, and we went with Talend Open Studio with no regrets so far (after a year on the project).

Talend offers:

  1. tons of connectors to get your data from almost any source
  2. nice data cleaning and transformation components to massage your data
  3. a fuzzy-match option (using Levenshtein and Metaphone) for reconciliation
  4. jobs can be exported as shell scripts and included in a cron job.

magdmartin

Posted 2013-05-09T09:50:34.613

Reputation: 1 410

So similar to Refine, but runs better from the shell automatically, and you have to pay for it? – D Read – 2013-05-10T10:34:21.933

3Talend has two solutions: a free open-source one and a paid one (for support, collaboration features, and some extra components). Please note that Talend is more complex than Refine and the learning curve is a bit longer. – magdmartin – 2013-05-10T14:43:19.567

4

There was a talk at TransparencyCamp on cleaning open data that touched on merging datasets. You can see the notes for the talk here. The speaker recommends: deDupe, superfastmatch, saucebrush, name-cleaver. There's also a long list of tutorials for using OpenRefine.

ShaunaGM

Posted 2013-05-09T09:50:34.613

Reputation: 71

3

csv2rdf4lod is a tool designed to integrate multiple versions of multiple datasets from multiple source organizations. It produces RDF and provides a hierarchical URI design that preserves where data elements came from. It uses a unix/shell environment (no GUI) and is focused on server-based automation and replication.

Even if you don't use it, flip through the hierarchical URI design convention. Each dataset "update" that you receive is modelled as a new version, which aligns with the previous versions in One Big RDF Graph, depending on how you model the RDF creation.

Best of luck, Tim

p.s. As suggested, I note that I'm the author of the tool.

user696

Posted 2013-05-09T09:50:34.613

Reputation: 31

1Tim, if you are the author, please disclose that. Otherwise your post will be tagged as spam. – Deer Hunter – 2013-05-31T16:15:13.987

1It would be useful to provide an example where you or others have used this tool and how it worked out. – Jeanne Holm – 2013-05-31T20:20:38.823

1

Google Cloud DataPrep is a great data preparation tool, now in beta. Pros:

  • no size limit
  • GCP integration
  • great UX
  • Serverless

OpenRefine offers nice tools for data cleansing, but it's ancient and limiting, and in my personal opinion it can't compare to DataPrep in awesomeness.

talp

Posted 2013-05-09T09:50:34.613

Reputation: 11

Great, but does it address any of the functionality or challenges described in the question? – D Read – 2017-09-18T08:25:15.807

Yep. You work against the datasets by creating flows. There is plenty of automatic inference of schema mismatches, plus recommended solutions; it's easy to union/join datasets, and once you've made "recipes" for each dataset, you can also run the flows automatically. It's the first time I'm using it as a data scientist and I'm really thrilled. – talp – 2017-09-24T16:12:59.530

1

Throwing my weight into this topic: I work on an open-data project that produces dataset compilations from multiple dataset sources. We tackle issues like:

  • language (we handle the 6 languages used by the UN: en, fr, es, ru, ar, zh)
  • language script
  • character encodings and error detection
  • formal, short, variant, and abbreviated name matching
  • ISO, US (FIPS/USGS/NGA/CBSA), and Canadian (SGC) geographic identifiers
  • unit handling
  • data representation, etc.

Below is our on-going proposed spec (CUDE):

http://www.opengeocode.org/cude1.1/csv1.1.php

Below are the governmental datasets (500+) we have processed and released using the spec: http://www.opengeocode.org/cude1.1/index.php

We haven't released the code yet, just the datasets.

Andrew - OpenGeoCode

Posted 2013-05-09T09:50:34.613

Reputation: 8 237

1From a quick read of the link, you describe some unreleased tools, that are not intended to be openly licensed, that convert tabular data into your tightly-defined CSV format. It appears to be aimed at a very narrow type of dataset. It's not clear how you describe the schema of a file to be imported. A more helpful answer would actually describe the tool and cover how it solves the difficulties described in the question. – D Read – 2014-01-27T10:28:32.740