Feature engineering lists/vectors as values in a dataframe



Let's say I have a dataframe where some of the columns have lists of strings as values. I would like to use ML algorithms on this dataframe.

In this case, the options I can see are:

  • Add a column of 1's and 0's for each string that may appear in some list. This seems terribly inefficient to me, as some of the items appear in only very few lists. If there were a maximum of 10 different items that could appear across all of the lists, I guess it would be quite all right to use.
  • Use SVD, as I know some do in these cases (even though I don't completely understand it yet).
  • Put pandas Series inside the dataframe as values, like some kind of vectors.

So I ask:

  1. What is the best way to feature engineer this, assuming only a few distinct items could appear in the lists?
  2. What is the best way to feature engineer this, assuming lots of distinct items could appear in the lists?
  3. Is there a certain way to do it that runs most efficiently with the popular ML/DL Python packages?


Posted 2019-02-26T12:12:37.107




I think a good starting point is what you have mentioned. For every distinct element that appears in any of the lists, create a feature for that element. If the element is present in the list for a data point, the feature is set to 1 in that data point's feature vector; if it is not present, the feature is set to 0. This is very much a bag-of-words way of creating features. You can limit the number of features by keeping only the top k most frequently occurring elements, where you choose k. Another variant is to use counts instead of 1's and 0's when an element can appear multiple times in a list, but it doesn't sound like that is the case here.
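As a rough illustration of the idea above, here is a minimal sketch in plain Python. The function name `multi_hot_top_k`, the parameter `k`, and the example data are made up for this answer; it assumes each row is a list of strings and that "top k" means the k items that appear in the most rows.

```python
from collections import Counter


def multi_hot_top_k(rows, k=None):
    """Encode lists of strings as 0/1 vectors over the k most frequent items.

    Hypothetical helper for illustration; k=None keeps every item.
    """
    # Count in how many rows each item appears (set() ignores repeats in a row).
    row_freq = Counter(item for row in rows for item in set(row))
    # Most frequent first; ties broken alphabetically for a stable column order.
    ranked = sorted(row_freq, key=lambda item: (-row_freq[item], item))
    vocab = ranked if k is None else ranked[:k]
    column = {item: j for j, item in enumerate(vocab)}

    matrix = []
    for row in rows:
        vec = [0] * len(vocab)
        for item in row:
            j = column.get(item)
            if j is not None:  # items outside the top k are dropped
                vec[j] = 1
        matrix.append(vec)
    return vocab, matrix


# Made-up example data: one list of strings per data point.
rows = [["red", "blue"], ["blue"], ["green", "blue", "red"], []]
vocab, matrix = multi_hot_top_k(rows, k=2)
print(vocab)   # column names of the kept features
print(matrix)  # one 0/1 vector per data point
```

For the count variant, `vec[j] = 1` would become `vec[j] += 1`. The result can be turned back into columns with `pd.DataFrame(matrix, columns=vocab)`. If scikit-learn is available, `sklearn.preprocessing.MultiLabelBinarizer` produces the same kind of encoding, and its `sparse_output=True` option avoids storing all the zeros when there are many distinct items.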


Posted 2019-02-26T12:12:37.107
