I have thousands of lists of strings, and each list has about 10 strings. Most strings in a given list are very similar, though some strings are (rarely) completely unrelated to the others and some strings contain irrelevant words. They can be considered to be noisy variations of a canonical string. I am looking for an algorithm or a library that will convert each list into this canonical string.
Here is one such list.
- Star Wars: Episode IV A New Hope | StarWars.com
- Star Wars Episode IV - A New Hope (1977)
- Star Wars: Episode IV - A New Hope - Rotten Tomatoes
- Watch Star Wars: Episode IV - A New Hope Online Free
- Star Wars (1977) - Greatest Films
- [REC] 4 poster promises death by outboard motor - SciFiNow
For this list, any string matching the regular expression
^Star Wars:? Episode IV (- )?A New Hope$ would be acceptable.
I have looked at Andrew Ng's course on Machine Learning on Coursera, but I was not able to find a similar problem.