## How to create vectors from text for address matching using binary classification?

4

I want to compare two addresses for their similarity (purely similarity in terms of textual occurrences rather than semantic similarity). I have a labelled dataset having "Address1", "Address2" and "Similar", where "Similar" takes a value of 0 (not similar) or 1 (similar).

Now, how would I convert the text in "Address1" and "Address2" into numerical vectors, so as to treat this as a binary classification problem?

Can you give some example rows of this dataset? – apnorton – 2016-12-20T16:13:07.313

2

I would rather recommend considering each address as a separate instance (a separate row in the feature matrix).

Each address can be further broken down into attributes such as house no., street, city, state, and country.

These attributes can be vectorised based on the geographic distance between them. For example, vectorise each country from west to east (assigning incremental numeric values), then vectorise each state within a country from west to east, and similarly for each street. If two entities fall on the same longitude, assign them consecutive numerical values.

Once each attribute is vectorised, the similarity between two addresses can be measured as the Euclidean distance between their vectors.
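A minimal sketch of that idea in Python. The west-to-east ordinal codes and the addresses below are purely illustrative, not real geographic data:

```python
import math

# Hypothetical west-to-east ordinal codes; real values would come from map data
COUNTRY_CODE = {"USA": 1, "UK": 2, "India": 3}
STATE_CODE = {"CA": 1, "NV": 2, "NY": 5}

def address_to_vector(house_no, street_code, state, country):
    """Map one parsed address to a numeric attribute vector."""
    return [house_no, street_code, STATE_CODE[state], COUNTRY_CODE[country]]

def euclidean(v1, v2):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

a1 = address_to_vector(12, 3, "CA", "USA")
a2 = address_to_vector(14, 3, "CA", "USA")
print(euclidean(a1, a2))  # → 2.0, a small distance suggesting similar addresses
```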

Anyway, leaving aside the data science approach, this problem can be solved even more simply by getting the distance between the two addresses using the Google Maps API.

Good suggestion. I would try using something smarter than Euclidean distance though. For instance, differences in country should weight more than door number, and differences in door should not be important when the road is different. But all depends on the context. Converting address to Geo coordinates may be the smartest though. – Ricardo Cruz – 2016-12-23T10:39:57.180

1

This falls into the category of "distance metric learning" or "similarity learning"; specifically, it is what Wikipedia calls a "classification similarity learning problem".

In your specific case, I suggest you use logistic regression to classify pairs of addresses as either "Similar" or "Not Similar". In particular, map a pair of addresses (Address1,Address2) to a $d$-dimensional feature vector $x \in \mathbb{R}^d$. Then, train a logistic regression model to predict "Similar" vs "Not Similar" from the feature vector $x$. This turns your problem into a boolean classification task, and solves it by applying logistic regression.

How do you form a feature vector? That's where you'll use your domain knowledge to identify features, where each feature represents a factor that might be helpful in determining whether the two addresses match. You could parse the address into elements (house number, street name, city, state, zip code), and have one feature (one dimension) per element of the address:

• One feature is 0 or 1 according to whether the house number matches in both addresses.

• Another feature has the Levenshtein edit distance between the street name in the first address and the street name in the second address (after normalizing both, e.g., by lowercasing them).

• Another feature has the Levenshtein edit distance between the two cities in the two addresses.

• Another has 0 or 1 according to whether the state matches.

• Another has 0 or 1 according to whether the first five digits of ZIP code match.

• Another has 0 or 1 according to whether both have the same ZIP+4.

Then, train a classifier using logistic regression to predict "Similar" vs "Not Similar" from these features, using your labelled training set.
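As a sketch, the feature extraction and training step might look like the following. To keep the example dependency-free, it uses a hand-rolled Levenshtein function and a toy gradient-descent logistic regression; the address rows and field names are invented, and in practice you would likely use a library such as scikit-learn instead:

```python
import math

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming (after lowercasing)."""
    s, t = s.lower(), t.lower()
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def features(a1, a2):
    """Feature vector for a pair of addresses, each a dict of parsed fields."""
    return [
        float(a1["house"] == a2["house"]),          # house number matches
        edit_distance(a1["street"], a2["street"]),  # street-name edit distance
        edit_distance(a1["city"], a2["city"]),      # city edit distance
        float(a1["state"] == a2["state"]),          # state matches
        float(a1["zip5"] == a2["zip5"]),            # first five ZIP digits match
    ]

def predict(w, x):
    """Probability that the pair is 'Similar' (last weight is the bias)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=500):
    """Toy logistic regression trained by stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)  # one weight per feature, plus a bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = predict(w, xi)
            for j, xj in enumerate(xi):
                w[j] -= lr * (p - yi) * xj
            w[-1] -= lr * (p - yi)
    return w

# Invented example rows from a labelled dataset
a = {"house": "12", "street": "Main St", "city": "Springfield", "state": "IL", "zip5": "62701"}
b = {"house": "12", "street": "Main Street", "city": "Springfield", "state": "IL", "zip5": "62701"}
c = {"house": "99", "street": "Oak Ave", "city": "Portland", "state": "OR", "zip5": "97035"}

w = train_logreg([features(a, b), features(a, c)], [1, 0])
print(predict(w, features(a, b)))  # close to 1 -> "Similar"
```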

One nice thing about this approach is that it is easy to add other features, if you discover that the classifier makes mistakes. For instance, it is easy to add the edit distance between both addresses as another feature, or to replace the edit distance with a thresholded version of the edit distance, or feed each address into a geolocation service to get back (latitude,longitude) and then compute the distance between those two locations, or any other scheme of your choice.
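For the geolocation feature, once a service has returned a (latitude, longitude) pair for each address, the distance between them can be computed with the haversine formula. The coordinates below are hypothetical stand-ins for geocoder output:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical geocoder output for the two addresses being compared
dist = haversine_km(40.7128, -74.0060, 40.7138, -74.0065)
print(dist < 1.0)  # True: well under a kilometre apart -> strong match signal
```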

And, of course, you can replace logistic regression with any other boolean classification method, such as random forests. However, logistic regression is probably a nice simple starting point for this problem.

0

Mainly to add on to the previous answer: firstly, what do you mean by similarity in textual occurrences? Do you mean which other words the addresses frequently occur with, or do you want to recognize how closely a piece of text resembles an address? If you are just looking for a metric to compare addresses, the best option would be the geolocation API mentioned in the previous answer. If it is just about converting them to vectors, you can take the entire vocabulary and assign an index to each token; if you later need semantic similarity, you can apply word2vec on top of the indexed tokens.
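A minimal sketch of that token-index encoding (the two sample addresses are invented):

```python
# Invented corpus of addresses; in practice this is your full dataset
addresses = ["12 main st springfield", "99 oak ave portland"]

# Assign each distinct token an integer index
vocab = {}
for addr in addresses:
    for tok in addr.split():
        vocab.setdefault(tok, len(vocab))

def encode(addr):
    """Replace each token with its vocabulary index; unseen tokens get -1."""
    return [vocab.get(tok, -1) for tok in addr.split()]

print(encode("12 oak st"))  # → [0, 5, 2]
```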