## How can I handle a column with list data?

3

1

I have a dataset which I processed and created six features:

['session_id', 'startTime', 'endTime', 'timeSpent', 'ProductList',
'totalProducts']


And the target variable is a binary class (gender).

The feature 'ProductList' is a list:

    df['ProductList'].head()
    Out[169]:
    0    [13, 25, 113, 13793, 2, 25, 113, 1946, 2, 25, ...
    1    [12, 31, 138, 14221, 1, 31, 138, 1979, 1, 31, ...
    2                               [13, 23, 127, 8754, 0]
    3    [13, 26, 125, 5726, 2, 26, 125, 5727, 2, 26, 1...
    4           [12, 23, 119, 14805, 1, 23, 119, 14806, 0]
    Name: ProductList, dtype: object


Now, it is obvious that I can't use this feature as it is. How do I handle this feature? I can explode the list and create a row for each list item, but will it serve my purpose?

Update: I applied OHE after exploding the list, and it results in 10k+ columns, which neither my GCP instance nor my computer can handle when applying PCA.

PS: There are over 17,000 unique products.
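For illustration, the explode-and-encode step could look roughly like this (toy data; sklearn's MultiLabelBinarizer with sparse output is one assumed way to keep 17,000+ columns from exhausting memory):

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    # Toy stand-in for the real frame
    df = pd.DataFrame({
        'session_id': [0, 1, 2],
        'ProductList': [[13, 25, 113], [12, 31, 138, 14221], [13, 23, 127]],
    })

    # Dense route: one row per (session, product), then presence/absence columns
    exploded = df.explode('ProductList')
    dense_ohe = pd.crosstab(exploded['session_id'], exploded['ProductList']).clip(upper=1)

    # Sparse route: one column per product, stored as a scipy sparse matrix
    mlb = MultiLabelBinarizer(sparse_output=True)
    sparse_ohe = mlb.fit_transform(df['ProductList'])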

Try grouping them as suggested earlier, maybe by category, e.g. Cosmetics, Dairy, etc., or something similar based on the product domain – 10xAI – 2020-04-11T17:58:14.460

6

You basically want to create a column for each product bought, as the presence or absence of each in the list is a feature in itself. See Hadley Wickham’s definition of tidy data.

That being said, you seem to have a large number of products. To avoid the curse of dimensionality, what I would do is take your binary bought/not-bought features (or count values, which might be even more effective if you have that data) and do dimensionality reduction to get a reasonable set of features. Latent Dirichlet Allocation (which comes from topic modelling), PCA, t-SNE, or UMAP are all easy to implement and worth trying. PCA is the least sophisticated and the fastest to run and would be a good baseline.

When you have your smaller list of features, you might want to try a classifier that further selects the most relevant features, such as gradient boosted trees.
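One possible sketch of that pipeline, with toy stand-ins for the session-by-product matrix X and the gender labels y; TruncatedSVD stands in for plain PCA here only because it accepts sparse input, and every parameter is illustrative:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Toy stand-ins: 1000 sessions x 17000 products (sparse, binary) and random labels
    X = sparse_random(1000, 17000, density=0.001, format='csr', random_state=0)
    X.data[:] = 1                                       # bought / not bought
    y = np.random.RandomState(0).randint(0, 2, 1000)

    model = make_pipeline(
        TruncatedSVD(n_components=50, random_state=0),  # reduce the product columns
        GradientBoostingClassifier(),                   # then let the trees pick what matters
    )
    print(cross_val_score(model, X, y, cv=3).mean())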

You’d have columns for all the products that were or weren’t in the session, where the value for products not bought would be zero. So you would have a column for every possible product :) – Nicholas James Bailey – 2020-04-11T08:36:55.960

Hi, @Nicholas. I tried to implement PCA on my data, but the computer runs out of memory while applying PCA on 10k+ columns. I tried it in a VM Instance with 32GB memory and 8vCPUs, and the issue persists. – Danish Shakeel – 2020-04-11T14:33:09.390

Are there really that many products?! Goodness me. What happens if you exclude everything that hasn't been bought more than 5 times? Anything bought fewer times than that in its history is really noise, as even a fair coin can throw five heads in a row. You could actually do the math to work out the best number to use as a cut-off. If that's not possible, you could cluster the products (e.g. on department and price) and train on whether products from a cluster are bought in a session. – Nicholas James Bailey – 2020-04-11T15:55:34.967

I engineered it another way: I only consider the product the customer starts the session with and the product they close the session with. – Danish Shakeel – 2020-04-11T15:58:36.340

Great. Hopefully that’s enough information to get good accuracy. Are you happy to mark my answer as correct? – Nicholas James Bailey – 2020-04-12T16:49:44.513
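For reference, the frequency cut-off suggested in the comments above could be applied roughly like this (toy data; the threshold of 5 is the one floated above and worth tuning on the real data):

    from collections import Counter
    import pandas as pd

    # Toy stand-in for the real frame
    df = pd.DataFrame({'ProductList': [[13, 25, 113], [13, 25], [12, 31], [13]]})

    # Count, for each product, how many sessions it appears in
    counts = Counter(p for products in df['ProductList'] for p in set(products))

    # Keep only products seen in more than min_sessions sessions; rarer ones are noise
    min_sessions = 5
    keep = {p for p, c in counts.items() if c > min_sessions}
    df['ProductList'] = df['ProductList'].apply(lambda ps: [p for p in ps if p in keep])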

1

You can think of ProductList as a sentence and treat it the same way language is treated in NLP.

So yes, if your set of unique products is not too big, then exploding the list and writing each product as a unique column is an approach that can work quite well. You can also look into embedding layers, which extend this idea to lists of items that are "too big".

If the order of items in the list matters, you probably want to decompose the list into individual rows and look for prediction on sequences.

Edit: In response to your comment, here is an analogy with semantic analysis on tweets:

We can think of a tweet as a list of words, e.g., "I am happy" -> ["I", "am", "happy"]. These lists vary in length, but each word (presumably) comes from the English language (+ some slang and neologisms which we will conveniently ignore). We can take a dictionary of the English language, look up the position of each word in that dictionary, and replace the word with its index in said dictionary. In our running example, this might look like [23, 54, 219]. This is the same as your list of product ids relating to individual products.

The dictionary only has a finite number of words in it (similarly you only have a finite number of products), so we can OneHot encode each index in the list ([[0,0,..,1,...], [0,...,1,...,0,..], ...]).

Now there are two options: (1) the order of the vectors in the list does not matter, in which case we sum them up to obtain a single vector for each example, with which you can proceed as described; or (2) the order of the vectors in the list does matter, in which case you split the array into multiple examples, one for each vector in the list, and add another feature denoting the position at which it was found in said list. You now have a dataset where the list column contains a vector of the same size in every row, which you can rewrite as a set of many columns.
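As a small illustration of the two options (toy numbers; the product ids are assumed to already be dictionary indices):

    import numpy as np

    n_products = 6                 # size of the "dictionary" of products
    session = [3, 1, 3, 5]         # one session's ProductList, as dictionary indices

    # One-hot encode every item in the list: shape (len(session), n_products)
    one_hot = np.eye(n_products, dtype=int)[session]

    # Option (1): order does not matter -> sum into a single bag-of-products vector
    bag = one_hot.sum(axis=0)      # array([0, 1, 0, 2, 0, 1])

    # Option (2): order matters -> one example per item, with its position as a feature
    sequence = [(pos, vec) for pos, vec in enumerate(one_hot)]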

You can then proceed with any analysis you think is reasonable for your data, e.g., clustering using simple methods, or training a non-linear embedding.

I would be grateful if you could enlighten me about the NLP process of treating a sentence. I have never delved into NLP. – Danish Shakeel – 2020-04-11T08:31:28.347

1

As soon as you do OHE on the products, it will add too many extra dimensions. To handle that, you can take one of two approaches:

1. Reduce the dimension using standard techniques as suggested by Nicholas

2. You can also try to cluster the product list using knowledge about the products and their relation to the target variable (i.e. gender).
A typical example of this scenario is converting zip codes into state codes.
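A sketch of that second approach, with a made-up product-to-category lookup (in practice it would come from the product catalogue or domain knowledge):

    import pandas as pd

    # Hypothetical lookup, analogous to mapping zip codes to state codes
    product_to_category = {13: 'Cosmetics', 25: 'Dairy', 113: 'Dairy', 13793: 'Electronics'}

    df = pd.DataFrame({'ProductList': [[13, 25, 113], [13793, 13]]})
    df['CategoryList'] = df['ProductList'].apply(
        lambda products: [product_to_category.get(p, 'Other') for p in products]
    )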

A crude way to build the one-hot matrix itself (rows = data count, columns = product count):

    import numpy as np, pandas as pd

    # Simulated data: 1000 rows of random product ids between 1 and 14806
    productlist = pd.DataFrame(np.random.randint(1, 14807, (1000, 14806)))

    # Zero matrix with column count equal to product count, rows = data count
    productlist_ohe = np.zeros((1000, 14806))

    # Loop over productlist and set the OHE entry to 1 based on row and product id
    for index, row in productlist.iterrows():
        for elem in row:
            productlist_ohe[index][elem - 1] = 1


The latter is not possible, but the former solution seems good. I can OHE the list and follow it with some dimensionality reduction, but I am confused as to how to explode a list column into dedicated columns and follow it with OHE. – Danish Shakeel – 2020-04-11T09:03:12.103

One row can have multiple products. You will need a df with column count = product count and then 1,0 filled accordingly. See my other answer to achieve that in a crude way – 10xAI – 2020-04-11T09:42:30.760

1

What exactly is the aim? Prediction of a binary outcome (gender in this case)? If so, you can go down the route suggested by Nicholas, but instead of doing dimensionality reduction (yourself), you could also treat the problem as a high-dimensional one and use Lasso / Ridge / Elastic Net to "automatically" select features. In this case there is no need for any feature engineering.

Here is an R implementation of the method. Similar packages exist for Python. Also see Ch. 6.3 in Introduction to Statistical Learning for a good overview.
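The implementation referenced above is in R; a comparable sketch in Python with scikit-learn might look like this (the sparse product matrix X and gender labels y are toy stand-ins, and the regularisation strength C is illustrative):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.linear_model import LogisticRegression

    # Toy stand-ins for the sparse session-by-product matrix and the gender labels
    X = sparse_random(1000, 17000, density=0.001, format='csr', random_state=0)
    X.data[:] = 1
    y = np.random.RandomState(0).randint(0, 2, 1000)

    # L1-penalised logistic regression (the classification analogue of the Lasso):
    # irrelevant product columns get a coefficient of exactly zero.
    clf = LogisticRegression(penalty='l1', solver='saga', C=1.0, max_iter=1000)
    clf.fit(X, y)
    n_selected = int((clf.coef_ != 0).sum())   # how many products the model actually kept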

0

There is a technique called Association Analysis where the prototypical example is a grocery store looking for associated products. A typical grocery store may have half a million distinct items being sold. Each 'grocery cart' is a list of items bought. You treat the grocery cart purchases across some period of time as your initial dataset. Your data has shape [count of total items] (columns) x [count of different grocery carts] (rows).

It's a sparse dataset, and the correlation matrix would have shape (columns x columns): far too massive, and often not helpful since most products aren't correlated. What is done instead is to set some small support threshold and simply not evaluate itemsets that fall below it. This allows you to actually mine the data for metrics of interest. The Apriori algorithm (or perhaps others if you are sophisticated) is used here (behind the scenes, if you import the correct module in Python) and allows a regular computer to handle the number crunching.

The interesting metrics gained are typically:

- Support
- Confidence
- Lift
- Conviction (definitions easily found online)

I have used the following module to do this in the past:

    from mlxtend.frequent_patterns import apriori, association_rules
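A minimal usage sketch with made-up baskets (the column names and thresholds are illustrative):

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # Toy one-hot basket data: rows = sessions, columns = products, True = product in session
    baskets = pd.DataFrame(
        [[True, True, False], [True, False, True], [True, True, True], [False, True, False]],
        columns=['p13', 'p25', 'p113'],
    )

    # Only itemsets above the support threshold are kept, which keeps the search tractable
    frequent = apriori(baskets, min_support=0.5, use_colnames=True)

    # Derive rules and the metrics listed above (support, confidence, lift, conviction)
    rules = association_rules(frequent, metric='lift', min_threshold=1.0)
    print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift', 'conviction']])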


Hope this helps