Extract features from a survey



I need to use the answers from a questionnaire to train a classifier. I discovered that some questions can have nested sub-questions. Let's say (just as an example) that I want to predict whether a person is going to buy a house based on the following questions:

1) What is your gender?
[] male
[x] female
[] I prefer not to answer

If the answer is female (as in the example above), a sub-question is asked:

1_female) are you pregnant?
[x] yes
[] no

Then the questionnaire continues.

How should I use these features to train my model?

Option 1) Treat them separately and transform them with one-hot encoding. I would then have the feature vector:

gender_male - gender_female - gender_not_answered - pregnant_empty - pregnant_yes - pregnant_no
     0      -        1      -         0           -        0       -       1      -        0

Obviously, the feature pregnant_empty will be coded as 1 for all males.
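
A minimal sketch of Option 1 in plain Python, assuming each respondent is a dict of answers and a sub-question that was never asked is simply missing from the dict. The names `CATEGORIES` and `one_hot` are made up for illustration and do not come from any library.

```python
# Hypothetical answer options; "empty" stands for a sub-question that was not asked.
CATEGORIES = {
    "gender": ["male", "female", "not_answered"],
    "pregnant": ["empty", "yes", "no"],
}

def one_hot(answers):
    """Encode one respondent's answers as a flat dict of 0/1 features."""
    features = {}
    for question, options in CATEGORIES.items():
        # A question missing from the dict falls back to the "empty" option.
        given = answers.get(question, "empty")
        for option in options:
            features[f"{question}_{option}"] = int(given == option)
    return features

female = one_hot({"gender": "female", "pregnant": "yes"})
male = one_hot({"gender": "male"})
```

Here `female` reproduces the vector shown above, and `male` has `pregnant_empty` set to 1 because the sub-question was skipped.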

Option 2) Merge the two answers and encode the concatenation:

gender_female_pregnant_yes - gender_female_pregnant_no - gender_male - gender_not_answered
             1             -             0             -      0      -          0
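
A rough sketch of Option 2 in plain Python, again assuming each respondent is a dict of answers. `MERGED`, `merged_category` and `encode_merged` are hypothetical names used only for this illustration.

```python
# Hypothetical merged labels, one per valid path through the question and its sub-question.
MERGED = [
    "gender_female_pregnant_yes",
    "gender_female_pregnant_no",
    "gender_male",
    "gender_not_answered",
]

def merged_category(answers):
    """Collapse the parent answer and its sub-answer into a single label."""
    gender = answers["gender"]
    if gender == "female":
        return f"gender_female_pregnant_{answers['pregnant']}"
    return f"gender_{gender}"

def encode_merged(answers):
    """One-hot encode the merged label against the fixed label list."""
    label = merged_category(answers)
    return [int(label == name) for name in MERGED]

vector = encode_merged({"gender": "female", "pregnant": "yes"})
```

Each additional nested question under the same parent answer multiplies the number of merged labels, which is where the growth in the feature vector comes from.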

Other options?

Please treat this just as an example. The problem is that in a real scenario:

  • the nested question could appear with 2 or more answers
  • expanding the features as in option 2 will make my feature vector explode

I hope my question was clear enough


Posted 2018-07-13T10:06:12.393

Reputation: 131

Well, how is house buying related to gender? There are a lot of other, more sensible questions that will play a major role, like revenue, location, proximity to different places, recent build, etc. Moreover, it would be better to use a JSON or dict-like format to store the answers, and then create embeddings for your keywords. – Aditya – 2018-07-13T10:25:56.433

@Aditya it is just an example... I wrote it 2 times in the question :) Also, my question is not about storing the answers but about how to use them as features in a classification problem – gabboshow – 2018-07-13T10:29:11.470

It depends a lot on your model: CatBoost will take care of categorical features automatically, while XGBoost will need them label encoded, target encoded, turned into embeddings, one-hot encoded, etc. We need to try and see what works and what doesn't. I would go with the 1st option, as the second option won't make valid sense, though we can build some interactions by combining specific columns. That's all I know; I'm still exploring ML, so pardon my limited knowledge – Aditya – 2018-07-13T16:19:23.513

How about just adding another option to each of the nested questions, namely "N/A" (not applicable)? That way you can treat them like any other question and, in your example above, set pregnant_na to 1 wherever gender_male is 1 – Ken Syme – 2018-07-13T20:24:50.263

@KenSyme isn't what you suggest the same as option 1? Instead of calling it pregnant_na, I called it pregnant_empty – gabboshow – 2018-07-13T21:51:14.737

@gabboshow You are quite right, sorry I must have misread that. That is the approach I would take and think it is a perfectly valid way of doing it. – Ken Syme – 2018-07-17T10:37:21.077



The simplest approach is to keep your features separate and add a synthetic feature, a feature cross, that captures the relationship between the features that can be nested.

For example, in a neural-network-based classifier (e.g., one built with TensorFlow), the model will learn the 'correct' weight for combinations of feature values that cannot occur (e.g., male and pregnant), leaving aside erroneous data, obviously.

In the end, you just want the Cartesian product of the features that you need to 'cross'. And yes, your vector will grow.
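
As a rough sketch of that Cartesian product in plain Python, assuming the answer options from the question plus an "empty" option for a skipped sub-question. The names `GENDER`, `PREGNANT`, `CROSSED` and `cross_features` are illustrative only, not a library API.

```python
from itertools import product

# Hypothetical answer options for the two questions being crossed.
GENDER = ["male", "female", "not_answered"]
PREGNANT = ["empty", "yes", "no"]

# Every (gender, pregnant) pair becomes one crossed feature: 3 x 3 = 9 columns.
CROSSED = [f"gender_{g}_x_pregnant_{p}" for g, p in product(GENDER, PREGNANT)]

def cross_features(gender, pregnant="empty"):
    """Return the 0/1 crossed columns, with exactly one column active."""
    active = f"gender_{gender}_x_pregnant_{pregnant}"
    return [int(name == active) for name in CROSSED]

crossed = cross_features("female", "yes")
```

These crossed columns would be appended to the separate one-hot columns rather than replace them. Impossible pairs such as male and pregnant still get a column; they just never fire in clean data.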


Posted 2018-07-13T10:06:12.393

Reputation: 101