I need to use the answers from a questionnaire to train a classifier. I discovered that some questions can have nested sub-questions. Let's say (just as an example) that I want to predict whether a person is going to buy a house based on the following questions:
1) What is your gender?  [ ] male  [x] female  [ ] I prefer not to answer
If the answer is female (as in the example above), a sub-question is asked:
1_female) Are you pregnant?  [x] yes  [ ] no
Then the questionnaire continues.
How should I use these features to train my model?
Option 1) Treat the two questions separately and one-hot encode each of them. The feature vector then becomes:
    gender_male  gender_female  gender_not_answered  pregnant_empty  pregnant_yes  pregnant_no
    0            1              0                    0               1             0
Obviously, the feature pregnant_empty will be 1 for all males, since they are never asked the sub-question.
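To make Option 1 concrete, here is a minimal sketch in plain Python. The helper names (one_hot, encode_option1) and the category lists are made up for illustration, and it assumes a skipped sub-question is simply mapped to the "empty" category:

```python
# Hypothetical answer categories; "empty" means the sub-question was never asked.
GENDER = ["male", "female", "not_answered"]
PREGNANT = ["empty", "yes", "no"]


def one_hot(prefix, categories, value):
    """Return {column_name: 0 or 1} with exactly one column set to 1."""
    if value not in categories:
        raise ValueError(f"unexpected answer {value!r} for {prefix}")
    return {f"{prefix}_{c}": int(c == value) for c in categories}


def encode_option1(gender, pregnant=None):
    """Encode both questions independently of each other."""
    features = one_hot("gender", GENDER, gender)
    # A sub-question that was not asked (None) falls into the "empty" category.
    features.update(one_hot("pregnant", PREGNANT, pregnant or "empty"))
    return features


print(encode_option1("female", "yes"))
print(encode_option1("male"))
```

With this encoding the number of columns grows additively: each question or sub-question adds its own answers plus one "empty" column.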
Option 2) Merge the two answers and one-hot encode the concatenation:
    gender_female_pregnant_yes  gender_female_pregnant_not  gender_male  gender_not_answered
    1                           0                           0            0
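Option 2 could be sketched like this, again in plain Python. The lookup table and the function name encode_option2 are hypothetical, and the sketch only covers the four merged categories listed above (a female respondent who skips the sub-question is not handled and raises an error):

```python
# Merged categories from the example, keyed by (gender, pregnant) answer pairs.
# None means the sub-question was not asked.
MERGED_COLUMNS = {
    ("female", "yes"): "gender_female_pregnant_yes",
    ("female", "no"): "gender_female_pregnant_not",
    ("male", None): "gender_male",
    ("not_answered", None): "gender_not_answered",
}


def encode_option2(gender, pregnant=None):
    """One-hot encode the combination of the answer and its sub-answer."""
    key = (gender, pregnant)
    if key not in MERGED_COLUMNS:
        raise ValueError(f"combination {key!r} is not covered by this sketch")
    active = MERGED_COLUMNS[key]
    # Exactly one merged column is 1; all others are 0.
    return {column: int(column == active) for column in MERGED_COLUMNS.values()}


print(encode_option2("female", "yes"))
print(encode_option2("male"))
```

Here every valid combination of answer and sub-answer needs its own column, so the table has to be extended for each new combination.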
Please treat this just as an example. The problem is that in a real scenario:
- the nested question could appear with 2 or more answers
- expanding the features as in Option 2 will make my feature vector explode.
I hope my question was clear enough