numerical or categorical data

3

I have a feature for machine learning (using methods like SVM, naive Bayes, neural networks and random forests) called member duration, shown below. Should I treat it as numerical or categorical data?

[Image: member duration data]

william007

Posted 2017-02-23T03:51:58.727

Reputation: 585

It depends on your application; to me it seems to be a discrete numerical feature. – Icyblade – 2017-02-23T03:59:01.907

Can you tell us what member duration represents? Is it measured in days? – gingermander – 2017-12-20T21:00:52.497

Answers

1

You definitely have discrete data, that is, data which takes on distinct, countable values, as opposed to continuous data, which takes on values along a continuum.

It may also be worth determining whether the data is ordinal, meaning that the order of the values matters, for example if [0, 1, 2] signifies [small, medium, large] or some analogous system.

In the case of ordinal data, it may be best to pass the variable to the SVM training process in integer form, as the integer representation encodes some information about the relationship between the categories.
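
As a minimal sketch of that integer encoding, assuming the [small, medium, large] example above (the labels, the `SIZE_ORDER` list and the `encode_ordinal` helper are hypothetical, not taken from the question's data):

```python
# Hypothetical ordered categories; labels and codes are illustrative only.
SIZE_ORDER = ["small", "medium", "large"]

# Assign each label its rank, so that small < medium < large numerically.
SIZE_TO_CODE = {label: code for code, label in enumerate(SIZE_ORDER)}


def encode_ordinal(labels):
    """Map each ordered label to its integer rank."""
    return [SIZE_TO_CODE[label] for label in labels]


raw = ["medium", "small", "large", "small"]  # made-up sample values
encoded = encode_ordinal(raw)
print(encoded)  # [1, 0, 2, 0]
```

The resulting single integer column keeps the ordering, so a model can see that "large" lies further from "small" than "medium" does.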

This approach would also be more reasonable if the values the variable can take in a production setting could expand beyond those already observed in the training set; a categorical approach would be less able to handle new values in that context.
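
To illustrate the point about unseen values, here is a rough sketch with made-up numbers. The `one_hot` helper and the training values are assumptions for illustration, and real encoders may instead raise an error on unknown categories:

```python
def one_hot(value, vocabulary):
    """One-hot encode `value` against a fixed vocabulary.

    A value that is not in the vocabulary yields an all-zero vector.
    """
    return [1 if value == known else 0 for known in vocabulary]


train_values = [0, 1, 2, 3]             # made-up values seen during training
vocabulary = sorted(set(train_values))  # categories fixed at training time

new_value = 7  # hypothetical value that first appears in production

as_number = float(new_value)                  # numeric: still a meaningful input
as_category = one_hot(new_value, vocabulary)  # categorical: no matching column

print(as_number)    # 7.0
print(as_category)  # [0, 0, 0, 0]
```

In numeric form the new value simply extends the scale, whereas the one-hot vector carries no information about it at all.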

If there are no ordinal relationships and you suspect all of the possible values are enumerated in the training set, treating the variable as categorical would be appropriate.
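
One way to see what the categorical treatment changes is to compare distances, which matter for distance-based methods such as an SVM with an RBF kernel. This is only a sketch with hypothetical values; `one_hot` and `squared_distance` are illustrative helpers:

```python
from itertools import combinations


def one_hot(value, vocabulary):
    """One-hot encode `value` against a fixed vocabulary."""
    return [1 if value == known else 0 for known in vocabulary]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


vocabulary = [0, 1, 2]  # hypothetical category codes

# One-hot: every pair of distinct categories is equally far apart.
onehot_dists = {
    (a, b): squared_distance(one_hot(a, vocabulary), one_hot(b, vocabulary))
    for a, b in combinations(vocabulary, 2)
}

# Integer codes: the distance depends on how far apart the codes are.
integer_dists = {(a, b): (a - b) ** 2 for a, b in combinations(vocabulary, 2)}

print(onehot_dists)   # {(0, 1): 2, (0, 2): 2, (1, 2): 2}
print(integer_dists)  # {(0, 1): 1, (0, 2): 4, (1, 2): 1}
```

With one-hot columns no category is "closer" to another, which is what you want when the values carry no order.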

Thomas Cleberg

Posted 2017-02-23T03:51:58.727

Reputation: 1 437

0

It looks like count data to me. Without further information in the question, I'd keep it as categorical data and model it with discrete techniques (e.g. a Poisson GLM).

SmallChess

Posted 2017-02-23T03:51:58.727

Reputation: 3 050

This is to pass to machine learning methods like SVM etc. I have updated the question; let me know if anything requires clarification. – william007 – 2017-02-23T06:33:53.843

@william007 To me, my answer is valid unless you want to run an algorithm that only works for numerical data. At the end of the day, your distribution is very discrete. – SmallChess – 2017-02-23T08:18:19.133