Machine learning techniques for estimating users' age based on Facebook sites they like

25

10

I have a database from my Facebook application and I am trying to use machine learning to estimate users' age based on what Facebook sites they like.

There are three crucial characteristics of my database:

  • the age distribution in my training set (12k users in total) is skewed towards younger users (e.g. I have 1157 users aged 27, but only 23 users aged 65);

  • many sites have no more than 5 likers (I have already filtered out the FB sites with fewer than 5 likers);

  • there are many more features than samples.

So, my questions are: what strategy would you suggest to prepare the data for further analysis? Should I perform some sort of dimensionality reduction? Which ML method would be most appropriate to use in this case?

I mainly use Python, so Python-specific hints would be greatly appreciated.

Wojciech Walczak

Posted 2014-05-17T09:16:18.823

Reputation: 643

When you say "many more features than samples" I assume you mean the unique number of liked sites is >> num users. Is that also the case for the root domain of the sites? i.e., are there a number of youtube.com or cnn.com URLs in the sites, or are they already stemmed to the domain? I'm leaning towards dimensionality reduction by collapsing URLs to domain roots rather than specific pages, if that's possible. – cwharland 2014-05-17T18:12:20.767

Thanks for the answer. The number of features (unique liked sites) is 32k, while the number of samples (users) is 12k. The features are Facebook Pages, so there's no need to stem the URLs. A user may either like facebook.com/cnn or not. I like the idea of trying to estimate users' age based on the links they share, though :) – Wojciech Walczak 2014-05-17T18:29:41.743

Ahhh, I misread the liked sites description. Thanks for the clarification. – cwharland 2014-05-17T18:47:06.270

Answers

17

One thing to start off with would be k-NN. The idea here is that you have a user/item matrix, and for some of the users you have a reported age. The age for a person in the user/item matrix might be well approximated by something like the mean or median age of their nearest neighbors in item space.

So you have each user expressed as a vector in item space; find the k nearest neighbors, and assign the vector in question some summary statistic of the nearest-neighbor ages. You can choose k based on a distance cutoff, or more realistically by iteratively assigning ages to a training hold-out set and choosing the k that minimizes the error in that assignment.

If the dimensionality is a problem, you can easily perform reduction in this setup via singular value decomposition, choosing the m vectors that capture the most variance across the group.

In all cases, since each feature is binary, it seems that cosine similarity would be your go-to distance metric.
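The approach above (SVD-reduced item space, cosine distance, choosing k on a hold-out set) can be sketched as follows. The toy likes matrix, the ages, and all parameter values are illustrative, not from the actual dataset:

```python
# Sketch of the k-NN approach: binary user x page "likes" matrix X,
# known ages y, SVD reduction, cosine metric, k chosen on a hold-out set.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users, n_pages = 500, 2000
X = (rng.random((n_users, n_pages)) < 0.02).astype(float)  # sparse binary likes
y = rng.integers(18, 66, size=n_users).astype(float)       # ages 18-65

X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Dimensionality reduction: keep the m components capturing most variance
svd = TruncatedSVD(n_components=50, random_state=0)
Z_train = svd.fit_transform(X_train)
Z_hold = svd.transform(X_hold)

# Choose k by minimizing mean absolute error on the hold-out set
best_k, best_mae = None, np.inf
for k in (5, 10, 20, 40):
    knn = KNeighborsRegressor(n_neighbors=k, metric="cosine").fit(Z_train, y_train)
    mae = np.mean(np.abs(knn.predict(Z_hold) - y_hold))
    if mae < best_mae:
        best_k, best_mae = k, mae
print(best_k, round(best_mae, 2))
```

With random data the error is of course meaningless; the point is the scaffolding, which matches what the OP later reports doing with sklearn.neighbors.KNeighborsRegressor.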

I need to think a bit more about other approaches (regression, random forests, etc.), but given the narrow focus of your feature space (all variants of the same action: liking), I think the user/item approach might be the best.

One note of caution: if the ages you have for training are self-reported, you might need to correct some of them. People on Facebook tend to report ages in the decade they were born. Plot a histogram of the birth years (derived from the ages) and see if you have spikes at decades like the '70s, '80s, and '90s.
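The decade-spike check is simple to run. A minimal sketch, where the survey year and the self-reported ages are made up for illustration:

```python
# Derive birth years from self-reported ages and measure what share of
# users fall exactly on a decade boundary (1960, 1970, ...) -- a large
# share suggests people reported "the decade they were born".
from collections import Counter

survey_year = 2014                                # assumed collection year
ages = [27, 24, 34, 44, 27, 24, 34, 24, 54, 27]   # hypothetical reported ages
birth_years = [survey_year - a for a in ages]

counts = Counter(birth_years)
decade_share = sum(c for y, c in counts.items() if y % 10 == 0) / len(birth_years)
print(sorted(counts.items()), round(decade_share, 2))
```

In this toy sample 70% of birth years land exactly on a decade, which would be a strong hint that the reported ages need correction.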

cwharland

Posted 2014-05-17T09:16:18.823

Reputation: 811

Hi, your answer is quite similar to my actual strategy. I used sklearn.neighbors.KNeighborsRegressor with the cosine metric on an SVD-reduced space (after applying SVD, the average estimation error went down from ~6 years to ~4).

Users in my database are aged 18-65 (older users were filtered out), so there are 48 possible classes. I wonder whether that's not too many classes for kNN, and whether I should treat it as a regression or a classification problem (I think both are applicable). – Wojciech Walczak 2014-05-18T12:09:17.057

I can say, anecdotally, that I have used per-class random forests to fit a number of classes individually, then combined the results of those models in various ways. In this case you might even think about assigning prior probabilities to each user's age with the kNN, then running through each class-based model, using those scores to update the prior probabilities for each class, and choosing the most probable class from those posteriors. It sounds like over-complicating things a bit, but at worst you would have the kNN accuracy. – cwharland 2014-05-22T20:36:20.637

7

I recently did a similar project in Python (predicting opinions using FB like data), and had good results with the following basic process:

  1. Read in the training set (n = N) by iterating over comma-delimited like records line-by-line and use a counter to identify the most popular pages
  2. For each of the K most popular pages (I used about 5000, but you can play around with different values), use pandas.DataFrame.isin to test whether each individual in the training set likes each page, then make a N x K dataframe of the results (I'll call it xdata_train)
  3. Create a series (I'll call it ydata_train) containing all of the outcome variables (in my case opinions, in yours age) with the same index as xdata_train
  4. Set up a random forest classifier through scikit-learn to predict ydata_train based on xdata_train
  5. Use scikit-learn's cross-validation testing to tweak parameters and refine accuracy (tweaking number of popular pages, number of trees, min leaf size, etc.)
  6. Output random forest classifier and list of most popular pages with pickle (or keep in memory if you are doing everything at once)
  7. Load in the rest of your data, load the list of popular pages (if necessary), and repeat step 2 to produce xdata_new
  8. Load the random forest classifier (if necessary) and use it to predict values for the xdata_new data
  9. Output the predicted scores to a new CSV or other output format of your choosing
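Steps 1-5 above can be condensed into a short sketch. Everything here (the synthetic like records, N = 200 users, K = 20 pages, the age bins) stands in for the real data, and the indicator frame is built with a comprehension rather than pandas.DataFrame.isin, though either works:

```python
# Condensed sketch of the training pipeline: count page popularity,
# build an N x K indicator DataFrame, fit a random forest, cross-validate.
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
users = [f"user{i}" for i in range(200)]
pages = [f"page{i}" for i in range(50)]
likes = {u: set(rng.choice(pages, size=5, replace=False)) for u in users}

# Step 1: a counter identifies the K most popular pages
counts = Counter(p for liked in likes.values() for p in liked)
top_pages = [p for p, _ in counts.most_common(20)]  # K = 20 here

# Steps 2-3: N x K indicator frame plus the outcome series (age bins)
xdata_train = pd.DataFrame(
    {p: [p in likes[u] for u in users] for p in top_pages}, index=users
)
ydata_train = pd.Series(
    rng.choice(["13-17", "18-24", "25+"], size=len(users)), index=users
)

# Steps 4-5: random forest plus cross-validation (n_jobs=-1 parallelizes)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
scores = cross_val_score(clf, xdata_train, ydata_train, cv=3)
print(round(scores.mean(), 2))
```

The remaining steps are just pickling the fitted model and `top_pages`, rebuilding the same indicator frame for new users, and calling `clf.predict`.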

In your case, you'd need to swap out the classifier for a regressor (see http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), but otherwise the same process should work without much trouble.

Also, you should be aware of the most amazing feature of random forests in Python: instant parallelization! Those of us who started out doing this in R and then moved over are always amazed, especially when you get to work on a machine with a few dozen cores (see here: http://blog.yhathq.com/posts/comparing-random-forests-in-python-and-r.html).

Finally, note that this would be a perfect application for network analysis if you have the data on friends as well as the individuals themselves. If you can analyze the ages of a user's friends, the age of the user will almost certainly be within a year or two of the median among his or her friends, particularly if the users are young enough to have built their friend networks while still in school (since most will be classmates). That prediction would likely trump any you would get from modeling; this is a textbook example of a problem where the right data beats the right model every time.
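The friend-network heuristic is a one-liner once you have friends' ages. A minimal sketch with entirely hypothetical data:

```python
# Predict each user's age as the median of their friends' known ages,
# as suggested above. The friend lists and ages are made up.
from statistics import median

friend_ages = {
    "alice": [22, 23, 23, 24, 25],
    "bob": [31, 29, 35, 30],
}
predicted = {u: median(ages) for u, ages in friend_ages.items()}
print(predicted)  # alice -> 23, bob -> 30.5
```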

Good luck!

Therriault

Posted 2014-05-17T09:16:18.823

Reputation: 533

One interesting aspect of using the top 5000 sites is that they may not be good at segmenting users by age. The top sites, by construction, are ones that everyone visits, so they are not very good at segmenting your users, since all possible classifications (ages) have engaged with those sites. This is a similar notion to the idf part of tf-idf: idf helps filter out the "everyone has this feature" noise. How do the most visited sites rank as features in your variable-importance plots with your RF? – cwharland 2014-05-24T04:59:04.420

Good point. An easy fix for this would be to stratify the training dataset into J age bins (e.g., 13-16, 17-20, 21-24, etc.) and take the top (K/J) pages for each group. That would ensure you have significant representation for each group. There will certainly be some overlap across groups, so if you were really picky you might want to take the top (K/J) unique pages for each group, but I think that might be overkill. – Therriault 2014-05-27T13:37:09.423
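The stratified selection suggested in this comment can be sketched as follows; the per-bin like lists, the bin labels, and K are all made up for illustration:

```python
# Take the top K/J pages within each age bin instead of globally,
# so every age group contributes discriminative features.
from collections import Counter

likes_by_bin = {  # hypothetical: flat list of liked pages per age bin
    "13-16": ["pageA", "pageA", "pageB", "pageC"],
    "17-20": ["pageA", "pageD", "pageD", "pageE"],
    "21-24": ["pageF", "pageF", "pageA", "pageG"],
}
K, J = 6, len(likes_by_bin)  # K total features, J bins -> K/J per bin

selected = set()
for age_bin, bin_pages in likes_by_bin.items():
    top = [p for p, _ in Counter(bin_pages).most_common(K // J)]
    selected.update(top)
print(sorted(selected))
```

Because of the overlap the comment mentions, the final feature set may end up smaller than K; deduplicating before topping up each bin would fix that if it matters.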

5

Another suggestion is to try logistic regression. As an added bonus, the weights (coefficients) of the model will give you an idea of which sites are age-discriminant.

Sklearn offers the sklearn.linear_model.LogisticRegression class, which is designed to handle sparse data as well.

As mentioned in the comments, in the present case, with more input variables than samples, you need to regularize the model (with sklearn.linear_model.LogisticRegression, use the penalty='l1' argument).
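A minimal sketch of this on sparse binary like data; the toy matrix, the binarized age target, and the C value are assumptions for illustration:

```python
# L1-regularized logistic regression on a sparse binary likes matrix.
# penalty='l1' drives most coefficients to exactly zero, acting as
# embedded feature selection; the liblinear solver supports L1 + sparse.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = csr_matrix((rng.random((300, 1000)) < 0.02).astype(float))  # likes
y = rng.integers(0, 2, size=300)  # e.g. "under 25" vs "25 and over"

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
n_used = np.count_nonzero(clf.coef_)
print(n_used, "of", X.shape[1], "pages kept")
```

The surviving non-zero coefficients are exactly the age-discriminant sites the answer refers to, with sign indicating direction.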

damienfrancois

Posted 2014-05-17T09:16:18.823

Reputation: 1,226

With LR you would have to make multiple models for age bins, I think. How would you compare two models for different age bins that predict the same probability of inclusion for a user? – cwharland 2014-05-20T15:18:13.717

Note that LR fails when there are more variables than observations and performs poorly if the assumptions of the model are not met. To use it, dimensionality reduction must be a first step. – Christopher Louden 2014-05-20T17:06:04.223

@cwharland You should not consider the response variable categorical, as it is continuous by nature and only discretized by the problem definition. Considering it categorical would mean telling the algorithm that predicting age 16 when it is actually 17 is as serious an error as predicting 30 when it is actually 17. Considering it continuous ensures that small errors (16 vs. 17) are considered small and large errors (30 vs. 17) are considered large. Logistic regression is used in this case to predict the continuous value, not to estimate posterior probabilities. – damienfrancois 2014-05-20T19:28:01.027

@ChristopherLouden You are right that the vanilla version of logistic regression is not suited to the 'large p, small n' case; I should have mentioned that regularization is important here. I have updated my answer. But L1-regularized LR is itself a sort of feature selection, so I see no need for a preliminary FS step. – damienfrancois 2014-05-20T19:33:50.253

@damienfrancois: Agreed, thank you. – Christopher Louden 2014-05-20T19:48:42.710

@damienfrancois: I definitely agree. I'm just a little concerned that in this case LR will penalize intermediate values too harshly. There seems to be no motivation to map to a sigmoid-like curve given that you are not particularly interested in extreme age values. Perhaps I'm misinterpreting the use, though. – cwharland 2014-05-22T20:32:52.293

4

Some research by D. Nguyen et al. tries to predict Twitter users' age based on their tweets; you may find it useful. They use logistic and linear regression.

lgylym

Posted 2014-05-17T09:16:18.823

Reputation: 306

3

Apart from the fancier methods, you could try the Bayes formula:

P(I | p1 ... pn) = P(p1 ... pn | I) P(I) / sum_i (P(p1 ... pn | i) P(i))

where:

  • P(I | p1 ... pn) is the probability that a user belongs to age group I given that he liked pages p1, ..., pn;
  • P(i) is the probability that a user belongs to age group i;
  • P(p1 ... pn | i) is the probability that a user liked pages p1, ..., pn given that he belongs to age group i.

  • You already have estimates for P(i) from your data: it is just the proportion of users in age group i.

  • To estimate P(p1 ... pn | i), for each age group i estimate the probability (frequency) p_ij of liking page j. To keep p_ij non-zero for all j, you can mix in the frequency for the whole population with a small weight.

  • Then log P(p1 ... pn | i) = sum(log p_ij, j = p1, ..., pn), summing over all pages that the new user likes. This formula is approximately true under the assumption that a user likes the pages in his age group independently.

  • Theoretically, you should also add log(1 - p_ij) for all pages j that he hasn't liked, but in practice you should find that the sum of log(1 - p_ij) is negligibly small, so you can skip those terms and save a lot of computation.
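This scheme is essentially Bernoulli naive Bayes with smoothing (the "mix in the frequency for the whole population" step), which scikit-learn implements directly as BernoulliNB. A minimal sketch with an illustrative like matrix and age groups:

```python
# Bernoulli naive Bayes over binary like features: alpha is the
# Laplace/Lidstone smoothing that keeps every p_ij non-zero, and the
# model includes the log(1 - p_ij) terms for unliked pages as well.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = (rng.random((200, 100)) < 0.1).astype(int)        # binary like matrix
y = rng.choice(["18-24", "25-34", "35+"], size=200)   # age groups i

nb = BernoulliNB(alpha=1.0).fit(X, y)
posterior = nb.predict_proba(X[:1])  # P(i | p1 ... pn) for one user
print(nb.classes_, posterior.round(3))
```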

If you or someone else has tried this, please comment about the result.

Valentas

Posted 2014-05-17T09:16:18.823

Reputation: 296

2

This is a very interesting problem.

I faced a similar one when analyzing the pictures users upload to a social network. My approach was the following:

  • Rather than associating data with exact ages (15 y.o., 27 y.o., ...), what I did was establish different age groups: less than 18, from 18 to 30, and greater than 30 (this was due to the specific problem we were facing, but you can choose whatever intervals you want). This division helps a lot in solving the problem.
  • Afterwards, I created a hierarchical clustering (divisive or agglomerative). Then I chose those branches where I had users with known ages (or age groups), and extended the same age to the rest of that branch.

This approach is semi-supervised learning, and I recommend it in case you only have some of your data labeled.
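This cluster-then-propagate idea can be sketched with scipy's hierarchical clustering. The two-blob data, the handful of known labels, and the cluster count are all illustrative:

```python
# Semi-supervised sketch: agglomerative (Ward) clustering of all users,
# then extend the majority known age-group label within each cluster
# to its unlabeled members.
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(5, 1, (30, 5))])
known = {0: "<18", 1: "<18", 35: "18-30", 36: "18-30"}  # few labeled users

Z = linkage(X, method="ward")                 # agglomerative clustering
clusters = fcluster(Z, t=2, criterion="maxclust")

labels = {}
for c in set(clusters):
    members = [i for i in range(len(X)) if clusters[i] == c]
    votes = Counter(known[i] for i in members if i in known)
    if votes:
        majority = votes.most_common(1)[0][0]
        for i in members:
            labels[i] = known.get(i, majority)
print(labels[5], labels[40])
```

Real like-vectors would need a suitable distance (e.g. cosine on an SVD-reduced space, as in the accepted answer) rather than Euclidean on raw features.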

Please notice that on a social network, people often lie about their age (just for fun, or sometimes because they want to camouflage themselves on the social net).

adesantos

Posted 2014-05-17T09:16:18.823

Reputation: 393