Tag: data-cleaning

33 How can I transform names in a confidential data set to make it anonymous, but preserve some of the characteristics of the names? 2014-06-16T19:48:31.797

31 Organized processes to clean data 2014-05-14T15:25:21.700

15 How to annotate text documents with meta-data? 2014-05-29T20:11:16.327

12 General approach to extract key text from sentence (nlp) 2015-03-13T16:41:29.280

11 How to do postal addresses fuzzy matching? 2016-03-21T12:01:23.043

10 Is there any data tidying tool for Python/pandas similar to R's tidyr tool? 2016-03-02T08:54:10.503

7 What are the best practices to anonymize user names in data? 2014-12-12T03:00:57.507

7 Removing strings after a certain character in a given text 2015-11-19T12:59:40.403

7 Neural Networks: How to prepare real world data to detect low probability events? 2016-03-16T01:03:29.313

7 Fixing data inconsistencies 2016-04-07T22:38:43.297

6 Dealing with training set of questionable quality 2015-11-16T10:57:53.013

6 Creating new columns by iterating over rows in pandas dataframe 2015-12-07T21:39:27.877

6 How to extract paragraphs from text document? 2016-11-11T06:06:35.760

6 How much data are sufficient to train my machine learning model? 2017-06-26T21:26:04.680

5 Good practices for manual modifications of data 2014-06-23T11:03:09.767

5 Amalgamating multiple datasets with different variables coding 2014-10-02T06:18:11.397

5 Anonymizing Datasets 2015-07-30T07:16:07.637

5 Why would you split your train data to compute a value on half of the data to then fill the NaN values on the other half? 2016-10-27T18:59:21.527

5 Do modern R and/or Python libraries make SQL obsolete? 2017-02-24T19:33:34.840

5 Missing Values in Data 2017-08-31T10:08:51.103

4 Data scheduling for recommender 2014-11-19T12:38:05.597

4 Alignment of square nonorientable images/data 2015-07-30T14:07:31.163

4 Choosing data clustering method to visualize data 2016-06-29T21:24:01.147

4 Convert a pandas column of int to timestamp datatype 2016-10-19T21:22:43.257

4 What methods can be used to detect anomalies in temporal textual data? 2016-11-07T07:34:29.327

4 How can I fill NaN values in a pandas data frame? 2016-12-25T22:29:59.157

3 Looking for smallest set of rows that form a natural key in a data set 2015-05-18T05:59:41.207

3 Feature Scaling and Mean Normalization 2015-11-20T20:52:08.420

3 Handling a feature with multiple categorical values for the same instance value 2015-12-15T05:54:18.307

3 Create training data 2016-03-04T18:31:17.653

3 About data cleansing, to what extent should we do our work? 2016-04-29T04:19:50.950

3 Unable to parse XML in Pig 2016-05-10T16:35:06.310

3 How to fix inconsistent (variable spelling) categorical data and "fill in" missing data 2016-05-30T19:31:13.520

3 Technical name for this data wrangling process? Multiple columns into multi-factor single column 2016-06-07T01:42:41.457

3 Successful ETL Automation: Libraries, Review papers, Use Cases 2016-07-15T08:00:32.163

3 Machine Learning or Survival Analysis? 2016-07-20T21:08:35.813

3 How can I detect events on fuel tank 2016-10-18T19:24:46.373

3 Collapse a list to most common spellings 2016-11-02T05:05:13.200

3 What is the difference between dcast and recast in R? 2016-11-08T17:28:15.877

3 Remove Outliers - Market Basket Analysis 2016-11-14T19:16:26.197

3 Detecting spammers with artificially generated target class 2017-08-20T19:35:34.767

2 Data preparation and machine learning algorithm for click prediction 2014-06-30T12:05:38.597

2 Method to create master product database to validate entries, and enrich data set 2014-11-12T01:31:01.377

2 Simple Excel Question: VLookup Error 2014-12-05T00:26:12.983

2 OCR / Text Recognition and Recovery Problem 2015-02-04T22:03:48.507

2 InterquartileRange takes up most instances in data set 2015-02-10T01:04:15.963

2 Data cleaning: Relationships between columns 2015-05-15T08:02:41.523

2 How do you ascertain the quality of your data? 2015-05-26T22:11:54.650

2 Data.frame vs Data.table in R? 2015-07-17T08:58:09.017

2 Do I have to standardize my new polynomial features? 2015-11-25T11:11:25.923

2 What are the best ways to tune multiple parameters? 2015-12-02T13:56:10.107

2 What would be the best way to structure and mine this set of data? 2016-01-25T23:45:28.583

2 Is there a tool/ library/ algorithm that can learn from the steps I perform in cleaning text and apply them to similar data? 2016-02-15T17:41:46.373

2 Sort by average votes/ratings 2016-02-17T20:43:08.740

2 Ignoring symbols and select only numerical values with pandas 2016-02-23T18:25:24.797

2 Encoding feature with tree-path data 2016-03-09T00:24:50.050

2 Use test data as train: does it make sense? 2016-03-28T17:38:49.710

2 Stackoverflow API Structure data storage 2016-06-21T08:39:52.053

2 How to perform Cleaning of a very large set of addresses 2016-06-25T21:52:13.020

2 Remove indifferent respondents in survey data 2016-08-02T14:20:43.947

2 How to read data in DSX from Data hub object storage container? 2016-09-16T19:26:36.247

2 Any method to filter out objective statements (or say, facts)? 2016-10-13T09:00:25.460

2 Are there libraries or techniques for 'noisifying' text data? 2016-10-28T21:47:16.713

2 How to handle data collecting bias in machine model training 2016-11-08T16:25:57.343

2 Which classifier should I use for sparse boolean features? 2016-12-10T19:51:29.767

2 Correct sequence of data prep steps? 2017-03-16T03:13:47.567

2 Preparing data, choosing algorithm 2017-03-30T16:24:43.480

2 Aggregation of Discount 2017-08-10T10:32:46.393

2 Pandas: how to fill missing values in one column if the values in another column are equal 2017-09-21T18:40:05.183

2 Are there deduplication algorithms that do not work on a metric space? 2017-11-02T20:19:58.583

2 Dealing with a dataset where a subset of points live in a higher dimensional space 2017-12-04T06:30:36.457

2 One hot encoding of target space 2018-01-12T19:04:18.553

2 Preprocess list data 2018-02-06T19:58:15.913

1 Modelling on one Population and Evaluating on another Population 2014-08-02T00:07:09.267

1 Cross-sell models and additional holders 2014-12-03T14:18:34.003

1 Web Framework Built for Recommendations 2014-12-09T02:28:51.430

1 Correcting Datasets with artificially low starting values 2015-01-08T13:44:20.160

1 How do I get Twitter Dataset for Visualization 2015-03-11T05:25:47.053

1 Clustering uncertain data with independent uncertainty per dimension 2015-07-02T16:45:35.833

1 Any reason for this project to use Hadoop/Spark? 2015-07-08T18:58:25.897

1 Data frame mutation in R 2015-08-20T09:49:51.230

1 Preprocessing in Data mining? 2015-08-26T17:37:49.187

1 Looking for language and framework for data munging/wrangling 2015-08-30T05:14:24.837

1 How to start analysing and modelling data for an academic project, when not a statistician or data scientist 2015-09-19T04:02:11.133

1 R: Revalue multiple special characters in a data.frame 2015-10-17T18:37:01.587

1 Removing words based on a predefined vector 2015-11-27T09:26:20.967

1 Detecting boilerplate in text samples 2015-11-30T17:45:34.930

1 Code Vectorization of gsub in R 2015-12-19T16:57:31.583

1 Natural Language text categorization using RapidMiner 2016-01-16T18:54:01.930

1 Combining parameters for Douglas-Peucker Simplification 2016-02-14T16:21:21.510

1 What metrics must I use in my (unstructured) data preprocessing research? 2016-02-20T10:11:17.397

1 What tools are available for semi-automated matching of dirty columnar data 2016-02-29T23:32:53.733

1 Convert some observations into variables 2016-03-03T00:09:55.310

1 What's the best way to rank aggregate IMDb rating data? 2016-03-05T17:43:06.277

1 Scoring consistency within dataset 2016-04-18T03:50:41.830

1 How to model this variable? 2016-04-24T08:35:17.450

1 Prepping Data For Usage Clustering 2016-04-27T17:12:40.743

1 What is a good way to start Data Analysis of unknown dataset (JSON data) 2016-06-22T18:09:12.363

1 Which one is the better performer at wrangling big data, R or Python? 2016-07-26T21:09:57.130