Extract canonical string from a list of noisy strings

10

2

I have thousands of lists of strings, and each list has about 10 strings. Most strings in a given list are very similar, though some strings are (rarely) completely unrelated to the others and some strings contain irrelevant words. They can be considered to be noisy variations of a canonical string. I am looking for an algorithm or a library that will convert each list into this canonical string.

Here is one such list.

  • Star Wars: Episode IV A New Hope | StarWars.com
  • Star Wars Episode IV - A New Hope (1977)
  • Star Wars: Episode IV - A New Hope - Rotten Tomatoes
  • Watch Star Wars: Episode IV - A New Hope Online Free
  • Star Wars (1977) - Greatest Films
  • [REC] 4 poster promises death by outboard motor - SciFiNow

For this list, any string matching the regular expression ^Star Wars:? Episode IV (- )?A New Hope$ would be acceptable.

I have looked at Andrew Ng's course on Machine Learning on Coursera, but I was not able to find a similar problem.

lacton

Posted 2014-08-22T15:59:07.097

Reputation: 201

2PS I think the term you're looking for is "canonical" – Sean Owen – 2014-08-23T09:23:00.063

Is the "most probable"/"most consensual" string you are looking to identify a regular expression? Or one of the strings on the list? – MrMeritology – 2014-08-24T21:47:38.487

@MrMeritology I am not looking for a regular expression. I have shown a regular expression in my question just to illustrate how flexible I am in the kind of strings I would consider to be correct. – lacton – 2014-08-25T08:07:58.927

OK. Then the answer I gave below should work for you. – MrMeritology – 2014-08-25T18:53:19.470

Would this come under NER (named entity recognition)? – hippietrail – 2014-09-26T03:30:30.803

Answers

4

As a naive solution, I would suggest first selecting the strings that contain the most frequent tokens in the list. This way you can get rid of the irrelevant strings.
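A minimal sketch of this filtering step, assuming a simple lowercase alphanumeric tokenizer. The function names and both thresholds (`min_share`, `min_overlap`) are made up for illustration and would need tuning on real data:

```python
import re
from collections import Counter


def tokenize(s):
    # Lowercased alphanumeric runs; punctuation is dropped (an assumption).
    return re.findall(r"[a-z0-9]+", s.lower())


def filter_by_frequent_tokens(strings, min_share=0.5, min_overlap=0.5):
    """Keep strings whose tokens are mostly 'frequent' across the list.

    A token is frequent if it occurs in at least `min_share` of the strings.
    A string is kept if at least `min_overlap` of its distinct tokens are frequent.
    """
    token_sets = [set(tokenize(s)) for s in strings]
    # Document frequency: in how many strings does each token occur?
    doc_freq = Counter(tok for toks in token_sets for tok in toks)
    frequent = {t for t, n in doc_freq.items() if n / len(strings) >= min_share}
    kept = []
    for s, toks in zip(strings, token_sets):
        if toks and len(toks & frequent) / len(toks) >= min_overlap:
            kept.append(s)
    return kept


titles = [
    "Star Wars: Episode IV A New Hope | StarWars.com",
    "Star Wars Episode IV - A New Hope (1977)",
    "Star Wars: Episode IV - A New Hope - Rotten Tomatoes",
    "Watch Star Wars: Episode IV - A New Hope Online Free",
    "Star Wars (1977) - Greatest Films",
    "[REC] 4 poster promises death by outboard motor - SciFiNow",
]
kept = filter_by_frequent_tokens(titles)
```

With these thresholds the unrelated "[REC] 4" title is dropped, and so is the short "Greatest Films" title, because most of its tokens are rare in the list.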

In the second phase I would do majority voting. Assume these 3 strings:

  • Star Wars: Episode IV A New Hope | StarWars.com
  • Star Wars Episode IV - A New Hope (1977)
  • Star Wars: Episode IV - A New Hope - Rotten Tomatoes

I would go through the tokens one by one. We start with "Star". It wins, as all the strings start with it. "Wars" will also win. The next one is ":". It will also win, since two of the three strings have it.

All the tokens will win the majority vote until "Hope". The next token after "Hope" will be either "|", "(" or "-". None of them will win the majority vote, so I will stop here!
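Here is a rough sketch of that voting loop. It is one possible reading of the idea, not a tested recipe: each string keeps its own cursor, so a string that lacks the winning token (for example the one without ":") simply waits. The tokenizer and the `lookahead` parameter are assumptions added for illustration.

```python
import re
from collections import Counter


def tokenize(s):
    # Words and single punctuation marks become separate tokens (an assumption).
    return re.findall(r"\w+|[^\w\s]", s)


def majority_vote(strings, lookahead=2):
    token_lists = [tokenize(s) for s in strings]
    cursors = [0] * len(token_lists)
    result = []
    while True:
        # Collect the token each string is currently pointing at.
        current = [t[c] for t, c in zip(token_lists, cursors) if c < len(t)]
        if not current:
            break
        token, votes = Counter(current).most_common(1)[0]
        # Stop as soon as no token is backed by a strict majority of strings.
        if votes * 2 <= len(token_lists):
            break
        result.append(token)
        for i, toks in enumerate(token_lists):
            # Advance past the winner if it appears within a small window;
            # otherwise the string stays where it is.
            window = toks[cursors[i]:cursors[i] + 1 + lookahead]
            if token in window:
                cursors[i] += window.index(token) + 1
    return result


titles = [
    "Star Wars: Episode IV A New Hope | StarWars.com",
    "Star Wars Episode IV - A New Hope (1977)",
    "Star Wars: Episode IV - A New Hope - Rotten Tomatoes",
]
winners = majority_vote(titles)
print(" ".join(winners))
```

Joining with spaces leaves a space before the ":", so the spacing around punctuation would still need a small cleanup step.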

Another solution would be probably to use Longest common subsequence.
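A small sketch of the longest common subsequence idea, working on word tokens rather than characters. Folding the pairwise LCS over the list is a heuristic (it is not guaranteed to find the true LCS of all strings at once), and it assumes the unrelated strings were removed beforehand, since a single outlier would wipe out the result.

```python
import re
from functools import reduce


def tokenize(s):
    # Word tokens only; punctuation is ignored (an assumption).
    return re.findall(r"\w+", s)


def lcs(a, b):
    """Longest common subsequence of two token lists (classic DP)."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


titles = [
    "Star Wars: Episode IV A New Hope | StarWars.com",
    "Star Wars Episode IV - A New Hope (1977)",
    "Star Wars: Episode IV - A New Hope - Rotten Tomatoes",
]
common = reduce(lcs, (tokenize(t) for t in titles))
print(" ".join(common))  # Star Wars Episode IV A New Hope
```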

As I said, I have not thought about it much, so there might be much better solutions to your problem :-)

Pasmod Turing

Posted 2014-08-22T15:59:07.097

Reputation: 463

3

First compute the edit distance between all pairs of strings. See http://en.wikipedia.org/wiki/Edit_distance and http://web.stanford.edu/class/cs124/lec/med.pdf. Then exclude any outlier strings based on some distance threshold.

With the remaining strings, you can use the distance matrix to identify the most central string. Depending on the method you use, you might get ambiguous results for some data. No method is perfect for all possibilities. For your purposes, all you need is some heuristic rules to resolve ambiguities, i.e. to pick between two or more candidates.
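To make the two steps concrete, here is a rough sketch using plain Levenshtein distance. Several choices are assumptions of this sketch rather than part of the method described: distances are normalized by the longer string's length, a string counts as an outlier when its median distance to the others exceeds a `threshold` of 0.6, "most central" means the smallest sum of distances (the medoid), and ties go to the string that appears first in the list.

```python
from statistics import median


def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def most_central(strings, threshold=0.6):
    """Return (medoid, kept_strings) after dropping outliers."""
    n = len(strings)
    if n < 2:
        return (strings[0] if strings else None), list(strings)
    dist = [[normalized_distance(a, b) for b in strings] for a in strings]
    # Outlier rule: median distance to all other strings is above threshold.
    keep = [i for i in range(n)
            if median([dist[i][j] for j in range(n) if j != i]) <= threshold]
    if not keep:  # everything looked like an outlier; fall back to all strings
        keep = list(range(n))
    # Medoid: smallest total distance to the other kept strings.
    best = min(keep, key=lambda i: sum(dist[i][j] for j in keep))
    return strings[best], [strings[i] for i in keep]


central, kept = most_central(["abcd", "abcde", "abcdef", "zzzzzzzz"])
print(central, kept)  # abcde ['abcd', 'abcde', 'abcdef']
```

On real titles the threshold would have to be tuned, and the medoid is always one of the input strings, so it may still carry a site-name suffix.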

Maybe you don't want to pick the "most central" string from your list, but instead want to generate a regular expression that captures the pattern common to all the non-outlier strings. One way to do this is to synthesize a string that is equidistant from all the non-outlier strings. You can work out the required edit distance from the matrix, and then you'd randomly generate regular expressions using those distances as constraints. Then you'd test candidate regular expressions and accept the first one that fits the constraints and also accepts all the strings in your non-outlier list. (Start building regular expressions from longest common substring lists, because those are non-wildcard characters.)

MrMeritology

Posted 2014-08-22T15:59:07.097

Reputation: 1 850