This falls into the category of "distance metric learning" or "similarity learning"; specifically, it is what Wikipedia calls a "classification similarity learning problem".

In your specific case, I suggest you use logistic regression to classify pairs of addresses as either "Similar" or "Not Similar". In particular, map each pair of addresses (Address1, Address2) to a $d$-dimensional feature vector $x \in \mathbb{R}^d$, then train a logistic regression model to predict "Similar" vs "Not Similar" from $x$. This turns your problem into a binary classification task that logistic regression can solve directly.
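For concreteness, here is the standard form of the model, written with a weight vector $w \in \mathbb{R}^d$ and a bias $b \in \mathbb{R}$ (these symbol names are my own notation, not something fixed by the problem):

$$P(\text{Similar} \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$

Training chooses $w$ and $b$ to fit the labelled pairs. A common convention, assumed here, is to predict "Similar" when this probability is at least $0.5$; you can move that threshold if false matches and missed matches have different costs.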

How do you form a feature vector? That's where you'll use your domain knowledge to identify features, where each feature represents a factor that might help determine whether the two addresses match. You could parse each address into elements (house number, street name, city, state, ZIP code) and derive one or more features (dimensions) from each element. For example:

- One feature is 0 or 1 according to whether the house number matches in both addresses.
- Another feature has the Levenshtein edit distance between the street name in the first address and the street name in the second address (after normalizing both, e.g., by lowercasing them).
- Another feature has the Levenshtein edit distance between the two cities in the two addresses.
- Another has 0 or 1 according to whether the state matches.
- Another has 0 or 1 according to whether the first five digits of the ZIP code match.
- Another has 0 or 1 according to whether both have the same ZIP+4.
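The features listed above could be computed roughly as follows. This is only a sketch: the `Address` record, its field names, and the `normalize` and `pair_features` helpers are hypothetical names I made up for illustration, and it assumes the addresses have already been parsed into their elements. Treating a missing ZIP+4 as "no match" is also my assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Address:
    # Hypothetical parsed-address record; the parsing step itself is not shown.
    house_number: str
    street: str
    city: str
    state: str
    zip5: str
    zip4: str = ""  # empty when the +4 extension is unknown


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalize(s: str) -> str:
    # Lowercase and collapse runs of whitespace before comparing.
    return " ".join(s.lower().split())


def pair_features(a: Address, b: Address) -> List[float]:
    """Map a pair of parsed addresses to a six-dimensional feature vector."""
    same_zip5 = a.zip5 == b.zip5
    return [
        float(a.house_number == b.house_number),
        float(levenshtein(normalize(a.street), normalize(b.street))),
        float(levenshtein(normalize(a.city), normalize(b.city))),
        float(normalize(a.state) == normalize(b.state)),
        float(same_zip5),
        # ZIP+4 counts as a match only when both extensions are present.
        float(same_zip5 and a.zip4 != "" and a.zip4 == b.zip4),
    ]


a = Address("12", "Main Street", "Springfield", "IL", "62701", "1234")
b = Address("12", "main  stret", "Springfield", "il", "62701")
print(pair_features(a, b))  # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```

In the example pair, the only differences are a one-letter typo in the street name and a missing ZIP+4, which is exactly the kind of near-match the classifier is meant to learn to accept.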

Then, train a classifier using logistic regression to predict "Similar" vs "Not Similar" from these features, using your labelled training set.
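To make the training step concrete, here is a minimal dependency-free sketch of logistic regression fitted by batch gradient descent. In practice you would more likely use a library implementation (for example scikit-learn's `LogisticRegression`); the function names `train_logistic` and `predict_similar`, the learning rate, the epoch count, and the tiny training set below are all made up for illustration and are not real data.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.1, epochs=5000):
    """Fit weights w and bias b by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: p(Similar | xi) - label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_similar(w, b, x, threshold=0.5):
    """Return True ("Similar") when the predicted probability reaches the threshold."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold


# Made-up labelled pairs, for illustration only. Columns:
# [house_match, street_dist, city_dist, state_match, zip5_match, zip4_match]
X = [
    [1, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 2, 1, 1, 1, 0],
    [0, 5, 4, 0, 0, 0],
    [0, 4, 0, 1, 0, 0],
    [1, 6, 3, 0, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = "Similar", 0 = "Not Similar"

w, b = train_logistic(X, y)
print(predict_similar(w, b, [1, 1, 0, 1, 1, 0]))
```

With a real labelled set you would hold out some pairs for evaluation rather than checking predictions on the training data, and the learned weights give a rough indication of which features the model relies on.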

One nice thing about this approach is that it is easy to add other features if you discover that the classifier makes mistakes. For instance, you could add the edit distance between the two full address strings as another feature, replace an edit distance with a thresholded version of it, feed each address into a geolocation service to get back (latitude, longitude) and then compute the distance between those two locations, or use any other scheme of your choice.
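Two of those extra features might look like the sketch below. It assumes you have already obtained coordinates in degrees from whatever geolocation service you use (that call is not shown, since it depends on the service). The helper names are hypothetical, and the haversine formula with a mean Earth radius of 6371 km is one common choice for the distance, not the only one.

```python
import math


def thresholded(distance: float, threshold: float) -> float:
    """Binary version of a distance feature: 1.0 if within the threshold, else 0.0."""
    return 1.0 if distance <= threshold else 0.0


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))
```

For example, `thresholded(street_distance, 2)` would turn a street-name edit distance into a "at most two edits apart" flag (the cutoff of 2 is arbitrary), and `haversine_km(...)` could be appended to the feature vector as one more dimension.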

And, of course, you can replace logistic regression with any other binary classification method, such as random forests. However, logistic regression is probably a nice, simple starting point for this problem.

Can you give some example rows of this dataset? – apnorton – 2016-12-20T16:13:07.313