I have text documents which contain mainly lists of Items.
Each Item is a group of several tokens of different types: FirstName, LastName, BirthDate, PhoneNumber, City, Occupation, etc. A token is a group of words.
Items can lie on several lines.
Items from a document have about the same token syntax, but they don't have to be exactly the same.
There may be extra or missing tokens between Items, as well as within Items.
FirstName LastName BirthDate PhoneNumber Occupation City
FirstName LastName BirthDate PhoneNumber PhoneNumber Occupation City
FirstName LastName BirthDate PhoneNumber Occupation UnrecognizedToken
FirstName LastName PhoneNumber Occupation City
FirstName LastName BirthDate PhoneNumber City Occupation
The goal is to identify the grammar used and, in the end, to identify all the Items, even though they don't match exactly.
In order to stay short and readable, let's use instead some aliases A, B, C, D, ... to designate those token types.
A B C D F
A B C D E F F
A B C D E E F
A C B D E F
A B D C D E F
A B C D E F G
Here we can see that the Item syntax is
A B C D E F
because it is the one that best matches the sequence.
The syntax (token types and their order) can vary a lot from one document to another. For example, another document may contain this list:
D A D A D D A B D A
The goal is to figure out that syntax without prior knowledge of it.
From now on, a new line is considered a token as well. A document can then be represented as a one-dimensional sequence of tokens:
Here the repeated sequence would be
A B C B because it is the sequence that creates the fewest conflicts.
Let's make it a bit more complex. From now on, a token no longer has a single fixed type. In the real world, we are not always 100% sure of a token's type. Instead, we assign each token a probability of being of each type.
A 0.2    A 0.0    A 0.1
B 0.5    B 0.5    B 0.9    etc.
C 0.0    C 0.0    C 0.0
D 0.3    D 0.5    D 0.0
Here is an abstract graphic of what I want to achieve:
Solution considered A: Convolution of a patch of tokens
This solution consists of sliding several candidate patches of tokens along the sequence (a convolution) and taking the one that creates the fewest conflicts.
The hard part here is finding candidate patches to slide along the observed sequence. I have a few ideas for this, but none is very satisfying:
Build a Markov model of the transitions between tokens
Drawback: since a Markov model has no memory, we lose the order of the transitions.
E.g. if the repeated sequence is A B C B D, we lose the fact that A->B occurs before C->B.
Build a suffix tree
This seems to be used extensively in biology to analyse nucleobases (GTAC) in DNA/RNA. Drawback: suffix trees are good for exact matching of exact tokens (e.g. characters). We have neither exact sequences nor exact tokens.
Brute force
Try every combination of every size. Could actually work, but would take some (long (long)) time.
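To make the Markov-model drawback above concrete, here is a minimal Python sketch. The function name `transition_counts` is hypothetical and only serves this illustration. It counts first-order transitions over three repetitions of A B C B D:

```python
from collections import Counter, defaultdict


def transition_counts(tokens):
    """Count first-order transitions (current token -> next token)."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


# Three repetitions of the Item syntax A B C B D (illustrative data).
tokens = "A B C B D".split() * 3
counts = transition_counts(tokens)

for token in sorted(counts):
    print(token, dict(counts[token]))
```

In this sample, B is followed by C as often as by D, so the transition table alone cannot tell that A->B comes before C->B; it is equally compatible with A B D or A B C B C B D.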
Solution considered B: Build a table of Levenshtein distances of suffixes
The intuition is that there may exist some local minima of distance when computing the distance from every suffix to every suffix.
The distance function is the Levenshtein distance, but we will be able to customize it later to take into account the probability of a token being of a certain type, instead of having a fixed type for each token.
To keep this demonstration simple, we will use fixed-type tokens and the classic Levenshtein distance to compare sequences of tokens.
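As a rough idea of what such a customized distance could look like (the rest of the demonstration still uses fixed types), here is a sketch in Python. Everything in it is an assumption of mine: each token is a dict mapping type to probability, the names `substitution_cost` and `soft_levenshtein` are hypothetical, and the substitution cost is taken as one minus the probability that both tokens have the same type, treating the two tokens as independent.

```python
def substitution_cost(p, q):
    """1 - P(both tokens have the same type), assuming independence."""
    return 1.0 - sum(prob * q.get(token_type, 0.0) for token_type, prob in p.items())


def soft_levenshtein(a, b, indel_cost=1.0):
    """Levenshtein distance where substitutions use substitution_cost."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, p in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, q in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                  # delete p
                curr[j - 1] + indel_cost,              # insert q
                prev[j - 1] + substitution_cost(p, q)  # substitute p by q
            ))
        prev = curr
    return prev[-1]


def one_hot(types):
    """Fixed-type tokens expressed as probability 1.0 for a single type."""
    return [{t: 1.0} for t in types]


# The first two tokens of the probability table above.
token1 = {"A": 0.2, "B": 0.5, "C": 0.0, "D": 0.3}
token2 = {"A": 0.0, "B": 0.5, "C": 0.0, "D": 0.5}
print(substitution_cost(token1, token2))
print(soft_levenshtein(one_hot("ABCD"), one_hot("ABD")))
```

With one-hot tokens this reduces to the classic Levenshtein distance. One caveat of this particular cost: an uncertain token has a non-zero cost against itself, which may or may not be desirable.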
e.g. Let's have the input sequence
ABCGDEFGH ABCDEFGH ABCDNEFGH.
We compute the distance of every suffix with every suffix (cropped to be of equal size):
for i = 0 to sequence.length
    for j = i to sequence.length
        # Create the suffixes
        suffixA = sequence.substr(i)
        suffixB = sequence.substr(j)
        # Make the suffixes the same size
        chunkLen = Math.min(suffixA.length, suffixB.length)
        suffixA = suffixA.substr(0, chunkLen)
        suffixB = suffixB.substr(0, chunkLen)
        # Compute the distance
        distance[i][j] = LevenshteinDistance(suffixA, suffixB)
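For anyone who wants to reproduce this, here is a runnable Python sketch of the same computation. The function names are mine, the Levenshtein implementation is the textbook dynamic-programming version with unit costs, and cells below the diagonal are left as `None` because only `j >= i` is computed.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]


def suffix_distance_matrix(sequence):
    """distance[i][j] for j >= i: suffix i vs suffix j, cropped to equal size."""
    n = len(sequence)
    distance = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            chunk_len = n - j  # the suffix starting at j is the shorter one
            suffix_a = sequence[i:i + chunk_len]
            suffix_b = sequence[j:]
            distance[i][j] = levenshtein(suffix_a, suffix_b)
    return distance


matrix = suffix_distance_matrix("ABCGDEFGH ABCDEFGH ABCDNEFGH")
print(len(matrix), matrix[0][0])
```

This is quadratic in the number of suffix pairs and quadratic again per pair, which is fine for a short demonstration sequence but would need optimizing for real documents.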
We get e.g. the following result (white is small distance, black is big):
Now, it's obvious that any suffix compared to itself will have a distance of zero. But we are not interested in a suffix (exactly or partially) matching itself, so we crop that part.
Since the suffixes are cropped to the same size, comparing long strings will always yield a bigger distance than comparing shorter strings.
We need to compensate for that with a smooth penalty starting from the right (+P) and fading out linearly to the left.
I am not sure yet how to choose a good penalty function that would fit all cases.
Here we apply a (+P=6) penalty on the extreme right, fading out to 0 to the left.
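One possible reading of this penalty, sketched in Python: the amount added depends only on the column index, going linearly from 0 in the leftmost column to P in the rightmost one. The function name `apply_linear_penalty` is hypothetical, and the column-only ramp is an assumption; other shapes may well fit better.

```python
def apply_linear_penalty(distance, max_penalty):
    """Add a penalty growing linearly from 0 (left column) to max_penalty (right column).

    Only cells with j >= i are touched; cells below the diagonal are copied as-is.
    """
    n = len(distance)
    penalized = [row[:] for row in distance]
    for i in range(n):
        for j in range(i, n):
            ramp = max_penalty * j / (n - 1) if n > 1 else 0.0
            penalized[i][j] = distance[i][j] + ramp
    return penalized


# Tiny hand-made upper-triangular matrix, for illustration only.
example = [
    [0, 2, 4],
    [None, 0, 2],
    [None, None, 0],
]
print(apply_linear_penalty(example, 6))
```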
Now two clear diagonal lines emerge. There are 3 Items (Item1, Item2, Item3) in that sequence. The longest line represents the matching between Item1 vs Item2 and Item2 vs Item3. The second longest represents the matching between Item1 vs Item3.
Now I am not sure about the best way to exploit that data. Is it as simple as taking the highest diagonal lines?
Let's assume it is.
Let's compute the average value of the diagonal line that starts from each token. We can see the result in the following picture (the vector below the matrix):
There are clearly 3 local minima, that match the beginning of each Item. Looks fantastic!
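The diagonal averaging can be sketched as follows in Python. I am assuming here that "the diagonal that starts from token k" means the cells (i, i + k) of the matrix, starting at the top row in column k; the names `diagonal_means` and `local_minima` are hypothetical.

```python
def diagonal_means(matrix):
    """Mean of each diagonal (i, i + k), for k = 0 .. n - 1."""
    n = len(matrix)
    means = []
    for k in range(n):
        diagonal = [matrix[i][i + k] for i in range(n - k)]
        means.append(sum(diagonal) / len(diagonal))
    return means


def local_minima(values):
    """Indices strictly lower than both neighbours (endpoints are skipped)."""
    return [
        k for k in range(1, len(values) - 1)
        if values[k] < values[k - 1] and values[k] < values[k + 1]
    ]


# Tiny hand-made upper-triangular matrix, for illustration only.
example = [
    [0, 4, 1, 5],
    [None, 0, 4, 2],
    [None, None, 0, 4],
    [None, None, None, 0],
]
means = diagonal_means(example)
print(means, local_minima(means))
```

Note that this simple minimum detection skips the two endpoints and is sensitive to noise, so a real sequence would probably need some smoothing or a threshold.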
Now let's add some more imperfections in the sequence:
ABCGDEFGH ABCDEFGH TROLL ABCDEFGH
Clearly now, our vector of diagonal averages is messed up, and we cannot exploit it anymore...
My assumption is that this could be solved by a customized distance function (instead of Levenshtein), where the insertion of a whole block may not be so much penalized. That is what I am not sure of.
None of the explored convolution-based solutions seem to fit our problem.
The Levenshtein-distance-based solution seems promising, especially because it is compatible with probability-based-type tokens. But I am not sure yet about how to exploit the results of it.
I would be very grateful if you have experience in a related field, and a couple of good hints to give us, or other techniques to explore. Thank you very much in advance.