Which process step in KDD or CRISP-DM includes labeling of the data?


KDD and CRISP-DM are both processes for structuring a Data Mining project. Isn't data labeling also an important part of Data Mining?

In unsupervised learning, for example, labeling the data is itself the goal of the Data Mining process. So if I want to classify a data set that I labelled myself beforehand, do I just run the process twice? In my opinion the labeling is sometimes quite trivial, so running the whole process twice seems unnecessary.

Is it possible to include the labeling in the data exploration or preprocessing phase? E.g. the CRISP-DM preprocessing phase includes something like generating a new parameter. Can this parameter also be a new target/label?

I know this question is quite process-oriented, and in Data Mining you are quite free, but just assume that in this case you have to follow the process.

Mimi Müller

Posted 2017-11-19T20:21:24.917

Reputation: 128

Could you elaborate a bit more on what you do twice in the process? – Toros91 – 2017-11-20T09:27:19.397

I would first go through the process as an unsupervised machine learning process whose goal is to label the data, and then start again and treat it as a supervised machine learning process. – Mimi Müller – 2017-11-20T09:44:38.700

Yes, that is good practice when you don’t know the target variable. This is the traditional process! Everything depends on how well your features are defined in your clusters. – Toros91 – 2017-11-20T10:42:21.973



Data Labeling is a very trivial process as you have mentioned.

As far as I know, it falls under Data Understanding (exploratory analysis). When you don't know anything about the data, you do exploratory analysis to understand it and derive some insights. If you don't know the target variable, the problem falls under unsupervised learning, as you mentioned in the question. So you don't know your target variable and are trying to create a new feature/dimension to gain some insight. Regardless of whether you derive the target variable or any other feature, creating it falls under Data Preparation (as a new derived variable), which covers whichever variables we think are important for the analysis.
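
To make the "derived variable as target" idea concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the field names (`monthly_spend`, `visits`, `segment`), the threshold rule, and the nearest-centroid model are hypothetical and are not prescribed by CRISP-DM or KDD. It only shows that a trivially derived label can be added during Data Preparation and then used as the target in Modeling within a single pass.

```python
from statistics import mean

# Hypothetical raw records; field names and values are made up for illustration.
records = [
    {"monthly_spend": 12.0, "visits": 1},
    {"monthly_spend": 15.5, "visits": 2},
    {"monthly_spend": 80.0, "visits": 9},
    {"monthly_spend": 95.0, "visits": 12},
]


def derive_label(record, spend_threshold=50.0):
    """Data Preparation step: derive a new attribute from a simple rule."""
    if record["monthly_spend"] >= spend_threshold:
        return "high_value"
    return "low_value"


# Data Preparation: add the derived attribute "segment" to every record.
prepared = [{**r, "segment": derive_label(r)} for r in records]


def fit_centroids(rows, feature, target):
    """Modeling step: compute the mean feature value per label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[target], []).append(row[feature])
    return {label: mean(values) for label, values in groups.items()}


def predict(centroids, value):
    """Assign the label whose centroid is closest to the given value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


# Modeling: the derived attribute is now the target; "visits" is the feature.
centroids = fit_centroids(prepared, feature="visits", target="segment")
prediction = predict(centroids, 10)
```

If the label instead has to come from a clustering run, that run is a modeling task of its own, which is closer to the two-pass approach discussed in the comments.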


Posted 2017-11-19T20:21:24.917

Reputation: 2 237

Is it that trivial? How do I know that the labeling is done right? Several factors can result in bad labeling: e.g. if an expert labels the data, there are human factors that should be included in the analysis; if a clustering algorithm is used, there is a certain error rate. So if I start with an unlabeled data set, should an unsupervised learning process be run first? Or does it depend on the effort you put into the data labeling? Is expert-dependent labeling also an unsupervised learning method, because rules are defined? – Mimi Müller – 2017-11-20T09:47:58.883

So in this scenario, suppose you did the labeling yourself and compared it with the labeling done by an expert. At the start it would be a problem, but down the line you will learn and improve. The same applies to the model: in the 1st iteration it might classify wrongly, but in the 2nd and 3rd iterations your model's classification will improve depending on how good your feature engineering is. It is directly proportional to the amount of time you spend playing around with the data. Yes, even an expert hasn't fallen from the sky; expertise comes purely from experience, and the same goes for the model. – Toros91 – 2017-11-20T10:06:58.580
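
One way to put a number on the comparison between two labelings is an agreement statistic such as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration only and was not proposed in the thread: the two label lists are made-up data, and it assumes both labelings already use the same label names (cluster IDs from a clustering algorithm would first have to be mapped onto the expert's categories).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labelers assigned labels independently
    # with their own observed label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six items: an expert's and a derived labeling.
expert = ["a", "a", "b", "b", "a", "b"]
derived = ["a", "a", "b", "a", "a", "b"]
kappa = cohens_kappa(expert, derived)
```

A value near 1 means the two labelings agree far more than chance would explain, and a value near 0 means the agreement is about what chance alone would produce.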

@Toros91 Could you please answer this question: https://datascience.stackexchange.com/questions/33265/what-to-report-in-the-build-model-asses-model-and-evaluate-results-steps-of-cri Thanks. – ebrahimi – 2018-06-17T06:02:27.663