Looking for a 'CITY, STATE' within a body of text (from a CITY-STATE database)


I'm looking for an efficient way to search a large body of text for a combination of words that resembles any CITY, STATE combination in a separate CITY-STATE database.

My only idea is to run a separate search against the body of text for each CITY, STATE in the database, but that would take a long time given the number of CITY, STATE combinations the database contains. The desired result is a single CITY, STATE for each body of text I am analyzing, to tell the geographical side of the story for this data subset.

Anyone know of an optimal way/process to do such a query?


Posted 2015-09-28T22:39:45.037

Reputation: 103

Not as a query in a database. Using text-mining tools might be an option; they are optimized for this kind of task. – phiver – 2015-09-29T06:45:43.660

Use a Named Entity Recognition (NER) method. – Vladislavs Dovgalecs – 2015-09-30T16:05:40.570

Thanks for the recommendation @xeon. I found http://opennlp.apache.org/ that can perform such NER methods. – Collarbone – 2015-09-30T18:42:19.737

@Collarbone I am building lots of custom NER models myself. I can walk you through every step so that you can learn as well :) – Vladislavs Dovgalecs – 2015-09-30T23:19:14.200

@xeon That'd be great. Feel free to reach out to me via email. My email is on my website http://collarbone.com. :) – Collarbone – 2015-10-05T16:40:53.620

@Collarbone I just sent you an email. – Vladislavs Dovgalecs – 2015-10-05T17:16:56.530



The only thing I can see would be to split the database into separate city and state lists and treat the problem as an automaton. Parse your text and run through its n-grams. Whenever you detect a CITY token (an n-gram that is in your list of cities, or close to one in a similarity sense, since there might be misspellings), look for a STATE token in its neighbourhood, again by checking a list of states with an edit distance metric to allow for misspellings. If you find one, you can tag your text with that geographical location.
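As a rough illustration of that scan, here is a minimal Python sketch. The function names (`edit_distance`, `closest`, `find_city_state`) and the parameters `max_ngram`, `window`, and `max_dist` are my own hypothetical choices, not part of any library, and the sketch assumes both lists are non-empty and lowercased. It compares every n-gram against every list entry, so for a large database you would likely want an index or a precomputed automaton instead.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def closest(candidate, vocabulary, max_dist):
    """Return the vocabulary entry nearest to candidate, or None if too far."""
    best = min(vocabulary, key=lambda v: edit_distance(candidate, v))
    return best if edit_distance(candidate, best) <= max_dist else None

def find_city_state(text, cities, states, max_ngram=3, window=3, max_dist=1):
    """Scan n-grams for a CITY, then look for a STATE in the tokens that follow.

    Assumed (hypothetical) parameters: max_ngram is the longest name in tokens,
    window is how far after the city a state may start, max_dist is the
    tolerated number of character edits.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    for n in range(max_ngram, 0, -1):  # try longer city names first
        for i in range(len(tokens) - n + 1):
            city = closest(" ".join(tokens[i:i + n]), cities, max_dist)
            if city is None:
                continue
            start = i + n
            for m in range(max_ngram, 0, -1):  # state names can be multi-word too
                for j in range(start, min(start + window, len(tokens) - m + 1)):
                    state = closest(" ".join(tokens[j:j + m]), states, max_dist)
                    if state is not None:
                        return city, state
    return None

cities = {"salem", "new york"}
states = {"oregon", "new york"}
print(find_city_state("We drove to Salem, Oregan last week.", cities, states))
```

The sketch returns the first pair it finds and does not check that the pair actually exists in the CITY-STATE database; a lookup against the set of valid pairs would be a natural extra filter.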

Of course, allowing for misspellings will bring some false positives, but you could filter most of them out with a quick lookup through your corpus to see that "SALAMI, OREGANO" is different from "SALEM, OREGON" (because the frequency of the latter will hopefully be higher than that of the former).
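One possible reading of that frequency check, sketched in Python: count how often each candidate surface form occurs across the corpus and prefer the most frequent one. The helper names `phrase_counts` and `most_plausible` and the toy corpus are made up for illustration, and the "most frequent wins" rule is an assumption on my part rather than a tested heuristic.

```python
import re
from collections import Counter

def phrase_counts(corpus, phrases):
    """Count case-insensitive whole-word occurrences of each phrase in the corpus."""
    counts = Counter()
    for doc in corpus:
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            counts[phrase] += len(re.findall(pattern, doc, flags=re.IGNORECASE))
    return counts

def most_plausible(corpus, candidates):
    """Pick the candidate surface form seen most often in the corpus."""
    counts = phrase_counts(corpus, candidates)
    return max(candidates, key=lambda c: counts[c])

# Toy corpus, invented for this example.
corpus = [
    "Flights to Salem, Oregon are cheap.",
    "Salem, Oregon hosts the fair.",
    "Add salami, oregano and basil.",
]
candidates = ["salem, oregon", "salami, oregano"]
print(phrase_counts(corpus, candidates))
print(most_plausible(corpus, candidates))
```

On a real corpus you would probably want a minimum count threshold as well, since a genuine but rare place name could otherwise lose to a common false positive.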

Jérémie Clos

Posted 2015-09-28T22:39:45.037

Reputation: 330