Tag: data-cleaning

41 How can I transform names in a confidential data set to make it anonymous, but preserve some of the characteristics of the names? 2014-06-16T19:48:31.797

35 Organized processes to clean data 2014-05-14T15:25:21.700

29 General approach to extract key text from sentence (nlp) 2015-03-13T16:41:29.280

22 Is there any data tidying tool for python/pandas similar to R tidyr tool? 2016-03-02T08:54:10.503

21 How to annotate text documents with meta-data? 2014-05-29T20:11:16.327

19 Removing strings after a certain character in a given text 2015-11-19T12:59:40.403

15 Do modern R and/or Python libraries make SQL obsolete? 2017-02-24T19:33:34.840

15 How much data are sufficient to train my machine learning model? 2017-06-26T21:26:04.680

15 When to use Standard Scaler and when Normalizer? 2019-02-20T16:38:05.920

14 How to do postal addresses fuzzy matching? 2016-03-21T12:01:23.043

14 Convert a pandas column of int to timestamp datatype 2016-10-19T21:22:43.257

14 How can I appropriately handle cleaning of gender data? 2020-03-20T04:23:51.880

11 Creating new columns by iterating over rows in pandas dataframe 2015-12-07T21:39:27.877

11 One Hot Encoding vs Word Embedding - When to choose one or another? 2018-04-03T14:13:28.643

10 Please review my sketch of the Machine Learning process 2020-04-06T01:10:56.257

9 How can I perform stratified sampling for multi-label multi-class classification? 2018-06-13T11:18:12.543

9 Math PhD (Nonlinear Programming) switching to Data Science? 2019-10-02T09:07:10.363

8 What are the best practices to anonymize user names in data? 2014-12-12T03:00:57.507

8 Do I have to standardize my new polynomial features? 2015-11-25T11:11:25.923

8 Fixing data inconsistencies 2016-04-07T22:38:43.297

8 How to extract paragraphs from text document? 2016-11-11T06:06:35.760

8 Why should I normalize also the output data? 2017-10-31T09:39:50.023

8 How to delete entire row if values in a column are NaN 2018-04-13T01:28:07.543

8 Splitting train/test sets by an identifier? 2019-05-03T22:42:39.580

7 Neural Networks: How to prepare real world data to detect low probability events? 2016-03-16T01:03:29.313

7 Audio Analysis : Segment audio based on speaker recognition 2018-06-18T00:50:57.867

7 Under what circumstance is lemmatization not an advisable step when working with text data? 2018-08-08T22:26:50.310

7 Data anonymization in Python 2019-10-23T23:40:54.757

6 Good practices for manual modifications of data 2014-06-23T11:03:09.767

6 Dealing with training set of questionable quality 2015-11-16T10:57:53.013

6 What methods can be used to detect anomalies in temporal textual data? 2016-11-07T07:34:29.327

6 How can I fill NaN values in a Pandas DataFrame in Python? 2016-12-25T22:29:59.157

6 When to use missing data imputation in the data analysis problem? 2019-08-11T22:39:52.013

6 Does label encoding an entire dataset cause data leakage? 2020-07-22T18:50:12.367

5 Amalgamating multiple datasets with different variables coding 2014-10-02T06:18:11.397

5 Data scheduling for recommender 2014-11-19T12:38:05.597

5 Anonymizing Datasets 2015-07-30T07:16:07.637

5 Missing Values in Data 2017-08-31T10:08:51.103

5 Pandas how to fill missing values in one column if the values in another column are equal 2017-09-21T18:40:05.183

5 TypeError: float() argument must be a string or a number, not 'function' 2018-05-21T10:13:25.087

5 How do I find the relevant features out of 11,000+ possibilities? 2019-02-20T18:08:59.893

5 Pyspark: Filter dataframe based on separate specific conditions 2019-06-09T06:22:53.393

4 Looking for smallest set of rows that form a natural key in a data set 2015-05-18T05:59:41.207

4 Alignment of square nonorientable images/data 2015-07-30T14:07:31.163

4 Creating training data 2016-03-04T18:31:17.653

4 Choosing data clustering method to visualize data 2016-06-29T21:24:01.147

4 Why would you split your train data to compute a value on half of the data to then fill the NaN values on the other half? 2016-10-27T18:59:21.527

4 How to count grouped occurrences? 2018-04-03T07:55:17.413

4 Tidy data in pandas dataframe 2018-06-04T08:51:19.360

4 Understanding data normalisation 2018-08-21T06:48:32.513

4 Can someone please explain what this sample function is up to? 2018-09-20T21:34:46.670

4 Pandas Conditional Fill NaN Forward/Backward 2018-10-16T15:45:27.543

4 Should I remove outliers if accuracy and Cross-Validation Score drop after removing them? 2018-12-20T15:09:00.010

4 Merging dataframes in Pandas is taking a surprisingly long time 2019-01-24T04:59:16.090

4 How to filter this signal to get a heartbeat signal? 2019-02-11T15:03:15.443

4 How to deal with count data in random forest 2019-02-12T22:59:07.520

4 Dealing with a dataset with a mix of continuous and categorical variables 2019-02-22T07:53:48.327

4 Unformatted data entries 2020-03-26T14:05:16.147

4 Data-preprocessing for Machine Learning model 2020-05-21T11:35:46.697

4 How to perform data scaling/standardization on dataset containing grouped values? 2020-05-21T12:10:20.760

4 If there are no missing values in our training set, should we accommodate missing values in an unseen test set? 2020-09-09T06:45:06.503

3 Data preparation and machine learning algorithm for click prediction 2014-06-30T12:05:38.597

3 Method to create master product database to validate entries, and enrich data set 2014-11-12T01:31:01.377

3 Simple Excel Question: VLookup Error 2014-12-05T00:26:12.983

3 Feature Scaling and Mean Normalization 2015-11-20T20:52:08.420

3 Handling a feature with multiple categorical values for the same instance value 2015-12-15T05:54:18.307

3 About data cleansing, to what extent should we do our work? 2016-04-29T04:19:50.950

3 How to fix inconsistent (variable spelling) categorical data and "fill in" missing data 2016-05-30T19:31:13.520

3 Machine Learning or Survival Analysis? 2016-07-20T21:08:35.813

3 Checking for skewness in data 2016-10-21T17:29:49.993

3 Are there libraries or techniques for 'noisifying' text data? 2016-10-28T21:47:16.713

3 How to handle data collecting bias in machine model training 2016-11-08T16:25:57.343

3 What is the difference between dcast and recast in R? 2016-11-08T17:28:15.877

3 Aggregation of Discount 2017-08-10T10:32:46.393

3 Detecting spammers with artificially generated target class 2017-08-20T19:35:34.767

3 Is Java or Python a better choice for an application involving data intensive algorithms employing natural language processing? 2017-10-03T07:05:12.847

3 One hot encoding of target space 2018-01-12T19:04:18.553

3 Should I Impute target values? 2018-01-12T21:08:05.720

3 When to remove outlier in preparing features for machine learning algorithm 2018-03-06T03:02:27.333

3 What data treatment/transformation should be applied if there are a lot of outliers and features lack normal distribution? 2018-04-23T11:38:56.540

3 Convert from many sub-tables to a single tidy dataframe 2018-05-24T10:01:53.623

3 How to remove unwanted characters from data 2018-05-26T01:31:56.930

3 Find most important features which differentiate two sets 2018-05-28T09:41:38.437

3 R: Checklist for data checking 2018-07-11T22:03:11.633

3 Extract features from a survey 2018-07-13T10:06:12.393

3 A single column has many values per row, separated by a comma. How to create an individual column for each of these? 2018-09-28T14:41:38.027

3 How to get a count of the number of observations for each year with a Pandas datetime column? 2018-10-01T00:46:14.590

3 Dealing with NaN (missing) values for Logistic Regression- Best practices? 2018-10-02T09:17:55.730

3 How to fill in missing value of the mean of the other columns? 2019-02-11T14:12:26.630

3 Possible Challenges for a Data Science Escape Room 2019-04-10T14:45:03.103

3 How to normalize data from multiple sources? 2019-04-23T11:53:15.417

3 Jaccard Similarity with Binary Data 2019-05-14T14:19:08.467

3 Tool for clustering and cleansing data set 2019-06-05T15:42:56.310

3 Tips on how to preprocess data and outliers for churn analysis 2019-07-03T03:14:57.057

3 How to handle values that only appear once in a column? 2019-08-23T18:05:13.823

3 Why are RANDOM noise images always predicted as BIRD? 2019-12-28T10:32:15.533

3 How can I handle a column with list data? 2020-04-11T08:09:03.127

3 Treating missing data in categorical features 2020-08-21T08:35:22.900

3 IterativeImputer - Returning -0 and other weird results 2021-01-06T20:22:48.003