Organized processes to clean data

31

18

From my limited dabbling with data science using R, I realized that cleaning bad data is a very important part of preparing data for analysis.

Are there any best practices or processes for cleaning data before processing it? If so, are there any automated or semi-automated tools which implement some of these best practices?

Jay Godse

Posted 2014-05-14T15:25:21.700

Reputation: 256

1probably it's good idea to clear a bit, what you mean under cleaning data , looks a bit confusing for my opinionMolbOrg 2014-05-14T18:11:45.363

2Explaining further what cleaning data means would be helpful. In the context where I work, cleaning has nothing to do with formatting - I'd just call that parsing/importing - But rather it would mean talking noisy user data and verifying it for coherence. The techniques use are dataset specific, from simple statistical rules, to fuzzy algorithms, especially when the data is sparse.AdrianBR 2014-05-16T05:17:27.677

Answers

21

R contains some standard functions for data manipulation, which can be used for data cleaning, in its base package (gsub, transform, etc.), as well as in various third-party packages, such as stringr, reshape/reshape2, and plyr/dplyr. Examples and best practices of usage for these packages and their functions are described in the following paper: http://vita.had.co.nz/papers/tidy-data.pdf.

Additionally, R offers some packages specifically focused on data cleaning and transformation:

A comprehensive and coherent approach to data cleaning in R, including examples and use of editrules and deducorrect packages, as well as a description of workflow (framework) of data cleaning in R, is presented in the following paper, which I highly recommend: http://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf.

Aleksandr Blekh

Posted 2014-05-14T15:25:21.700

Reputation: 5 573

2I would also add dplyr, which is an optimized rephrasing of certain plyr tools, and data.table which is a completely different approach to manipulating data. Both by Hadley Wickham.shadowtalker 2014-08-02T05:22:07.200

@ssdecontrol: I agree - updated the answer. Hadley is the author of the dplyr package, but not of the data.table one.Aleksandr Blekh 2014-08-02T08:22:18.307

1Funny, I just kind of assumed he was. Thanks.shadowtalker 2014-08-02T11:46:36.047

16

From my point of view, this question is suitable for a two-step answer. The first part, let us call it soft preprocessing, could be taken as the usage of different data mining algorithms to preprocess data in such a way that makes it suitable for further analyses. Notice that this could be the analysis itself, in case the goal is simple enough to be tackled in a single shot.

The second part, the hard preprocessing, actually comes prior to any other process, and is may be taken as the usage of simple tools or scripts to clean up data, selecting specific contents to be processed. To this problem, POSIX provides us with a wonderous set of magic tools, which can be used to compose concise -- and very powerful -- preprocessing scripts.

For example, for people who deal with data coming from social websites (twitter, facebook, ...), the data retrieval usually yields files with very specific format -- although not always nicely structure, as they may contain missing fields, and so. For these cases, a simple awk script could clean up the data, producing a valid input file for later processing. From the magic set, one may also point out grep, sed, cut, join, paste, sort, and a whole multitude of other tools.

In case simple the source file has too many nitty-gritties, it may also be necessary to produce a bundle of methods to clean up data. In such cases, it is usually better to use scripting languages (other than shell ones), such as Python, Ruby, and Perl. This allows for building up API's to select specific data in a very straightforward and reusable way. Such API's are sometimes made public by their writers, such as IMDbPY, Stack Exchange API, and many others.

So, answering the question: are there any best practices? It usually depends on your task. If you will always deal with the same data format, it's commonly best to write an organized script to preprocess it; whereas, if you just need a simple and fast clean up on some dataset, count on POSIX tools for concise shell scripts that will do the whole job much faster than a Python script, or so. Since the clean up depends both on the dataset and on your purposes, it's hard to have everything already done. Yet, there are lots of API's that puts you halfway through with the problem.

Rubens

Posted 2014-05-14T15:25:21.700

Reputation: 2 452

1This is a very good post. R is excellent for data manipulation but can be pretty cumbersome with data cleaning because of its verbose syntax for string manipulation and fairly rigid adherence to lists and matrices for data structures.shadowtalker 2014-08-05T13:01:12.313

hm - get data, format data(awk sed grep stuff), remove noise as first step, then go deeper. so is't hard preprocessing comes at first, if use your thermsMolbOrg 2014-05-14T18:09:54.360

@MolbOrg Yes, that's what I meant. I called hard preprocessing the scripting side, and soft preprocessing the use of data mining algorithms that generally reduce the size of the problem (cleans up the database). I also noted that the second part, hard preprocessing, actually comes prior to any other process. If it's not very clear with such terms, I'd gladly consider any other suggestions to improve the answer.Rubens 2014-05-14T18:20:28.510

1

ah yes, did not paid enough attention, raw data preprocessing. Tested atm - yes perl oneliner is 3times slower then grep ) for 3.5kk strings in 300MB, for perl it took 1.1 sec, for grep 0.31 sec . I saw article where points that perl regexp is slow, much slower then it may be in practice, (i suspect that is also for grep too) http://swtch.com/~rsc/regexp/regexp1.html

MolbOrg 2014-05-14T18:34:15.603

@MolbOrg Nice reference! AFAIK, grep uses POSIX basic regex by default, and allows for extended POSIX regex when run as grep -E, and for PCRE when run as grep -P.Rubens 2014-05-14T18:44:15.457

yes, I used both variants, default was 6 times faster(just two attempts because lost a window with results, and second pattern was not suitable for default grep), and in the link above states that grep uses different(rigth?) approach, did't mean it's new for me, but new angle on that stuff for sure, have to think about ))MolbOrg 2014-05-14T18:49:49.287

11

One reason that data cleaning is rarely fully automated is that there is so much judgment required to define what "clean" means given your particular problem, methods, and goals.

It may be as simple as imputing values for any missing data, or it might be as complex as diagnosing data entry errors or data transformation errors from previous automated processes (e.g. coding, censoring, transforming). In these last two cases, the data looks good by outward appearance but it's really erroneous. Such diagnosis often requires manual analysis and inspection, and also out-of-band information such as information about the data sources and methods they used.

Also, some data analysis methods work better when erroneous or missing data is left blank (or N/A) rather than imputed or given a default value. This is true when there is explicit representations of uncertainty and ignorance, such as Dempster-Shafer Belief functions.

Finally, it's useful to have specific diagnostics and metrics for the cleaning process. Are missing or erroneous values randomly distributed or are they concentrated in any way that might affect the outcome of the analysis. It's useful to test the effects of alternative cleaning strategies or algorithms to see if they affect the final results.

Given these concerns, I'm very suspicious of any method or process that treats data cleaning in a superficial, cavalier or full-automated fashion. There are many devils hiding in those details and it pays to give them serious attention.

MrMeritology

Posted 2014-05-14T15:25:21.700

Reputation: 1 690

1As a side comment: I think spotting errenous data caused by some problem further up the pipeline is a gold skill. Many a time I have wondered why my analysis produced weird results and when I looked at the pipeline i found some kind of error . E.g: I wondered why all my data where heavily skewed towards high prices - WAY out of my mental model. When I asked around, I found out that some subcontractor misunderstood the briefing and delivered data for high income groups, while we whanted mixed data...Christian Sauer 2014-07-15T09:33:38.703

Yes! Data errors are frequently signs of process problems. Knowing where in the process the errors were introduced and also the mechanism, will greatly help with the cleaning process. But better still is to fix the process problems so that they produce clean (or cleaner) data.MrMeritology 2014-07-15T19:12:57.530

5

About automatic cleaning: You really cannot clean data automatically, because the number of errors and the definition of an error is often dependent on the data. E.g.: Your column "Income" might contain negative values, which are an error - you have to do something about the cases. On the other hand a column "monthly savings" could reasonably contain negative values.

Such errors are highly domain dependent - so to find them, you must have domain knowledge, something at which humans excel, automated processes not so much.

Where you can and should automate is repeated projects. E.g. a report which has to produced monthly. If you spot errors, you should place some automated process which can spot these kinds of errors in subsequent months, freeing your time.

Christian Sauer

Posted 2014-05-14T15:25:21.700

Reputation: 407

5

I think that there is no universal technique for "cleaning" data before doing actual research. On the other hand, I'm aiming for doing as much reproducible research as possible. By doing reproducible research, if you used cleaning techniques with bugs or with poor parameters/assumptions it could be spot by others.

There is nice R package knitr which helps a lot in reproducible research.

Of course, not all research could be fully reproduced (for example live Twitter data) , but at least you can document cleaning, formating and preprocessing steps easily.

You can check my assessment prepared for Reproducible Research course at Coursera.

Damian Melniczuk

Posted 2014-05-14T15:25:21.700

Reputation: 370

3

OpenRefine (formerly Google Refine) is a great tool where you can interactively perform data cleanup and transformations. Your final output can be exported to a variety of standard formats (json, csv) or any custom format(eg: wiki).

Exact duplicates can be handled in OpenRefine by sorting + using the "blank down" feature. Also it has has approximate string (fuzzy) clustering feature where you can group similar looking duplicates such as company names or addresses. This feature alone makes it awesome.

Thyag

Posted 2014-05-14T15:25:21.700

Reputation: 131

1This is a fantastic, under-reported tool.thanks_in_advance 2016-07-17T06:22:58.977

2

There is an entire course devoted to this in Coursera. You might want to go over the techniques they mention and the important part is to know when to use what.

https://www.coursera.org/learn/data-cleaning

It's always better to do it manually instead of automating since each dataset has its own problems and issues and not all steps are applicable to all the datasets.

Ram

Posted 2014-05-14T15:25:21.700

Reputation: 221