Fuzzy name and nickname match



I have a dataset with the following structure:

full_name,nickname,match
Christian Douglas,Chris,1
Jhon Stevens,Charlie,0
David Jr Simpson,Junior,1
Anastasia Williams,Stacie,1
Lara Williams,Ana,0
John Williams,Willy,1

where each row contains a pair (full name, nickname) and the target variable, match, which is 1 when the nickname corresponds to the person with that name and 0 otherwise. As you can see, the way the nickname is obtained from the full name doesn't follow a particular pattern.

I want to train an ML algorithm that, given the pair (full name, nickname), predicts the probability of a match.

My baseline just counts the number of matching characters, plus features like that. However, I am thinking about an NLP approach using deep learning. My question is whether there are neural network architectures that are specific to this problem.

David Masip

Posted 2019-03-19T13:36:15.100

Reputation: 5 101

How many samples are there for training? – Shamit Verma – 2019-03-19T13:38:33.677

Around 100k, but only 17% have match = 0 – David Masip – 2019-03-19T13:46:30.883

Is this an open dataset that you can share a link to? – Adarsh Chavakula – 2019-03-22T18:35:40.083

How do you find a relationship between a name and a nickname when there is none, for instance "Anastasia Williams|Stacie"? You need more features to make this work, I think. – iamklaus – 2019-03-24T06:27:07.963



I had a similar problem in my last job. My solution was to build features from many (transformation + comparison) combinations and feed them to models, then aggregate and model again, i.e. a two-layer model. The key is using encodings and similarity scores as features.

Transforms: remove vowels (great for certain roots), remove end vowels, remove double characters, convert to a phonetic string (IPA, Soundex, https://pypi.org/project/Fuzzy/), replace characters that either sound similar or have different sounds in other languages ($J$ in Eastern Europe sounds like $Y$ in the US, $C$ can sound like $K$, $D \sim T$, $T \sim TH$, etc.), ... The strategy is to handle the lots of weirdness/irregularity in people's names.
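A minimal sketch of a few such transforms in plain Python (the sound-alike substitution table here is purely illustrative; you would extend it based on the languages in your data):

```python
def remove_vowels(s: str) -> str:
    # Keep the first character even if it is a vowel, e.g. "Anastasia" -> "Ansts"
    return s[0] + "".join(c for c in s[1:] if c.lower() not in "aeiou")

def remove_doubles(s: str) -> str:
    # Collapse runs of the same character, e.g. "Willy" -> "Wily"
    out = []
    for c in s:
        if not out or out[-1].lower() != c.lower():
            out.append(c)
    return "".join(out)

def normalize_sounds(s: str) -> str:
    # Crude sound-alike substitutions (J~Y, C~K, D~T); an assumed starting point
    table = str.maketrans({"j": "y", "c": "k", "d": "t"})
    return s.lower().translate(table)
```

Applying several of these in sequence before comparing strings is what makes pairs like "Jhon"/"John" or "Willy"/"William" start to look alike.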

Comparisons (similarity and difference): try similarity and difference scores at the character level, the block/root/prefix/suffix level, and the word level (which may not apply to you). Try Dice's coefficient, Levenshtein, Needleman–Wunsch, longest common (non)contiguous substring, character histogram similarity, number of characters matching and not matching (on each side), etc. You could try using an RNN/LSTM and have it learn similarity for each transform. Use the output of the trained model(s) as another feature.
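Two of those comparison scores, Levenshtein distance and Dice's coefficient over character bigrams, fit in a few lines of plain Python (a sketch, not a tuned implementation):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def dice_bigrams(a: str, b: str) -> float:
    # Dice's coefficient over sets of character bigrams
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    if not A or not B:
        return 0.0
    return 2 * len(A & B) / (len(A) + len(B))
```

Running each comparison both on the raw strings and on their transformed versions is what multiplies a handful of functions into many features.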

Experiment with different combos of the above and select a few that seem to have value. You could simply take all the scores and fit a Logistic Regression (or a neural net), or you could build statistical models and output a percent rank based on a small training set to normalize each score. Another way to preprocess the raw scores is calibration via a logistic function. Then add summary stats computed from the normalized scores as additional features. Push all of this into the final model.
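The two normalization options mentioned above can be sketched like this (the midpoint and steepness values are placeholders you would fit or tune on your own data):

```python
import math

def percent_rank(train_scores, x):
    # Empirical CDF: fraction of reference scores strictly below x
    return sum(s < x for s in train_scores) / len(train_scores)

def logistic_calibrate(x, midpoint=0.5, steepness=10.0):
    # Squash a raw similarity score into (0, 1) via a logistic curve
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))
```

Either way, every raw score ends up on a comparable 0-to-1 scale before the final model sees it, which keeps no single comparison from dominating on magnitude alone.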

Will you handle names derived from Arabic, Spanish, French, etc.? This is just extra, but consider downloading the Social Security and US Census name statistics data to enhance your project with more name variations. I'll leave the how to you, but it helps to know about the likely possibilities. Be aware that simply using Levenshtein doesn't work so well with William->Bill, Dianne->Di, Larry->Lawrence, Mohammed->Muhamed and Hamed, Danielle->Daniela, Thomas->Tom, and Jimmy->James. The strategy I mentioned should help you with all that variation.

Additional resources to explore:

  • https://github.com/jamesturk/jellyfish
  • https://nameberry.com/list/276/If-You-Like-Danielle-You-Might-Love
  • https://pypi.org/project/phonetics/


Posted 2019-03-19T13:36:15.100

Reputation: 216


I couldn't find any useful literature on using deep learning for this specific problem. Most methods seem to rely on non-machine-learning techniques like string similarities and Levenshtein distance. A reasonable deep-learning approach to this problem would be a Recurrent Neural Network. An LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) would be ideal. The idea is to have an RNN that maintains an internal state and respects the order in which the inputs are fed.

Unlike text classification, sentiment analysis, or sequence generation, the preferred encoding here would be at the character level instead of the word level.

For example

Christian Douglas,Chris,1
Jhon Stevens,Charlie,0

would become

[C,h,r,i,s,t,i,a,n, ,D,o,u,g,l,a,s, ,C,h,r,i,s] --> [1]
[J,h,o,n, ,S,t,e,v,e,n,s, ,C,h,a,r,l,i,e]       --> [0]

The two strings to be matched are concatenated into a single sequence. The intuition here is that the RNN would process the sequence character by character and learn (read: update its weights) that the characters at the end follow a pattern similar to what it saw earlier in the same sequence, and so deduce that the label should be 1 instead of 0.

The vector of [1/0] is the target variable.

The standard RNN pre-processing steps apply as usual: we'll pad the sequences at the beginning so that they're all the same length (say 50), the characters will be encoded as numbers instead of strings, etc.
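That preprocessing can be sketched in plain Python (the vocabulary ids and the max length of 50 are arbitrary choices for illustration):

```python
# Map pad -> 0, space -> 1, a..z -> 2..27; unknown characters fall back to pad
VOCAB = {"<pad>": 0, " ": 1}
VOCAB.update({c: i + 2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")})

def encode(full_name, nickname, max_len=50):
    # Concatenate the pair, lowercase, map characters to ids, left-pad with 0s
    seq = [VOCAB.get(c, 0) for c in (full_name + " " + nickname).lower()]
    seq = seq[:max_len]
    return [0] * (max_len - len(seq)) + seq
```

Padding at the front rather than the back keeps the informative characters closest to the final time step, which is the state the classifier reads.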

Since the vocabulary here is pretty small (26 letters + space + pad), the network architecture can be fairly simple. A single embedding layer plus a recurrent layer should suffice.

Framing the problem in this manner allows us to use a vanilla RNN or an out-of-the-box LSTM/GRU, instead of creating a custom architecture that takes two separate strings as input for each data point and outputs a number.

You could give this approach a shot and see if it is able to satisfactorily beat your baseline models.

A good read on character-level RNNs is Andrej Karpathy's blog and code. The problem he's trying to solve is different and the code is in pure numpy, but it still captures the idea pretty well.

Adarsh Chavakula

Posted 2019-03-19T13:36:15.100

Reputation: 365


I would use some normalization as preprocessing, such as:

  • Converting Jr into Junior
  • Converting the name into its Soundex code, as ldmtwo said

And then use string algorithms instead of ML, such as the Z algorithm, the KMP algorithm, or Levenshtein distance, and apply a threshold to the score.
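For the Soundex step, a compact implementation of the standard American Soundex rules looks like this (a sketch; a library such as jellyfish covers the edge cases more thoroughly):

```python
def soundex(name: str) -> str:
    # Consonant -> digit table from the standard Soundex coding
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    first = name[0].upper()
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # h and w do not separate duplicate codes
            prev = code     # vowels reset prev, so repeats across vowels count
    return (first + "".join(digits) + "000")[:4]
```

Two names would then be called a match when their codes agree (or when an edit distance between the codes falls under your chosen threshold), e.g. "Robert" and "Rupert" both encode to R163.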

Rizky Luthfianto

Posted 2019-03-19T13:36:15.100

Reputation: 1 968