## Identifying templates with parameters in text fragments


I have a data set of text fragments that share a fixed structure and can contain parameters. Examples are:

 Temperature today is 20 centigrades


or

 Her eyes are blue and hair black.
Her eyes are green and hair brown.


The first example shows a template with one numerical parameter; the second is a template with two factor parameters.

Neither the number of templates nor the number of parameters is known.

The problem is to identify the templates and assign each text fragment to the corresponding template.

The obvious first idea is to use clustering. The distance measure is defined as the number of non-matching words, i.e. the records in example one have distance 1, and in example two the distance is 2. The distance between a record from example one and a record from example two is 7. This approach works fine provided the number of clusters is known, which is not the case here, so it is not useful.
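For illustration, the word-mismatch distance described above can be sketched in plain Python (this is my own illustrative code, not something from the question; positions present only in the longer fragment count as mismatches, which reproduces the distances 1, 2 and 7 quoted above):

```python
def template_distance(a, b):
    """Position-wise count of non-matching words between two fragments.
    Positions present only in the longer fragment count as mismatches."""
    wa, wb = a.split(), b.split()
    matches = sum(1 for x, y in zip(wa, wb) if x == y)
    return max(len(wa), len(wb)) - matches

print(template_distance("Temperature today is 20 centigrades",
                        "Temperature today is 24 centigrades"))   # 1
print(template_distance("Her eyes are blue and hair black.",
                        "Her eyes are green and hair brown."))    # 2
print(template_distance("Temperature today is 20 centigrades",
                        "Her eyes are blue and hair black."))     # 7
```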

I can imagine a programmatic approach that scans the distance matrix for records with many neighbors at distance 1 (or 2, 3, ...), but I am curious whether some unsupervised machine learning algorithm can solve the problem. R is preferred, but not required.
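The neighbor-scanning idea can be prototyped without fixing the number of clusters in advance: greedily assign each fragment to the first existing group whose representative lies within a small distance, and open a new group otherwise. A minimal stdlib-Python sketch (the threshold `max_dist` is a free parameter I introduce here, not something given in the question):

```python
def template_distance(a, b):
    """Position-wise word mismatch count (as defined above)."""
    wa, wb = a.split(), b.split()
    matches = sum(1 for x, y in zip(wa, wb) if x == y)
    return max(len(wa), len(wb)) - matches

def group_fragments(fragments, max_dist=2):
    """Greedy single-pass grouping: each group is represented by its
    first member; a fragment joins the first group within max_dist,
    otherwise it starts a new group."""
    reps, labels = [], []
    for frag in fragments:
        for i, rep in enumerate(reps):
            if template_distance(frag, rep) <= max_dist:
                labels.append(i)
                break
        else:
            reps.append(frag)
            labels.append(len(reps) - 1)
    return labels

frags = ["Temperature today is 20 centigrades",
         "Temperature today is 24 centigrades",
         "Her eyes are blue and hair black.",
         "Her eyes are green and hair brown."]
print(group_fragments(frags))   # [0, 0, 1, 1]
```

The number of groups falls out of the threshold rather than being specified up front; the price is that the result depends on the input order and on `max_dist`.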

I am not an NLP expert, but why wouldn't you consider doing an LSA on the whole corpus and then using the corresponding scores in a clustering algorithm? That would easily take care of the fact that different templates have different numerical parameters. Finding the number of clusters $k$ would then follow a standard methodological procedure (e.g. $k$-means/AIC). – usεr11852 – 2015-09-21T07:27:24.557

@usεr11852 I appreciate your proposal, but interestingly I get better results using a simple distance on the TermDocumentMatrix (tm) than with LSA. I suppose this is due to the fixed structure of my templates, i.e. the order of the terms is relevant for the distance. I intuitively guess there must be a simple elementary solution to my problem, probably based on the distance matrix, but feel free to formulate your proposal as an answer, so I can honor it. – Marmite Bomber – 2015-09-21T20:06:25.347

@usεr11852 The difficulties with LSA were caused by: 1) problems with ignoring short terms (the wordLengths parameter); 2) a non-transformed IDF application. I added an answer addressing both problems. – Marmite Bomber – 2015-09-27T09:48:31.660


The basic rationale behind the following suggestion is to associate "eigenvectors" with "templates".

In particular, one could run LSA on the whole corpus based on a bag-of-words representation. The resulting eigenvectors would serve as surrogate templates; these eigenvectors should not be directly affected by the number of words in each template. Subsequently, the scores could be used to cluster the documents together following a standard procedure (e.g. $k$-means in conjunction with AIC). As an alternative to LSA one could use NNMF. Let me point out that the LSA (or NNMF) would probably need to be applied to the TF-IDF-transformed matrix rather than the raw word-count matrix.
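To make the eigenvector/template association concrete, here is an illustrative stdlib-Python sketch (mine, not the answerer's): it builds a raw bag-of-words matrix (skipping the TF-IDF transform for brevity), forms the document-document Gram matrix, and extracts its leading eigenvector by power iteration (equivalently, the scores along the first LSA dimension). Documents generated from the same template receive (near-)identical scores, so the scores can feed any standard clustering step:

```python
def doc_scores(docs, iterations=200):
    """Scores of each document along the leading eigenvector of the
    document-document Gram matrix, found by power iteration.
    Documents sharing a template end up with (near-)equal scores."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = [[d.split().count(w) for w in vocab] for d in docs]
    n = len(docs)
    # Gram matrix: dot products between bag-of-words vectors
    gram = [[sum(a * b for a, b in zip(counts[i], counts[j]))
             for j in range(n)] for i in range(n)]
    v = [1.0] * n
    for _ in range(iterations):
        v = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

docs = ["Temperature today is 20 centigrades",
        "Temperature today is 24 centigrades",
        "Her eyes are blue and hair black.",
        "Her eyes are green and hair brown."]
scores = doc_scores(docs)
# scores for docs 1-2 coincide and differ clearly from docs 3-4
```

With real data one would of course take several leading dimensions (a truncated SVD) and apply the TF-IDF weighting the answer recommends; this toy version only shows why documents from one template collapse onto the same point in the reduced space.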

I added an answer based on transformed IDF to distinguish between the parameters and the template terms. As this is my first use of transformed IDF, I'd appreciate comments on the approach. – Marmite Bomber – 2015-09-27T09:55:21.533

I am glad I could help. – usεr11852 – 2015-09-28T17:03:12.127


You might consider using word2vec to identify phrases in the corpus. The presence of a phrase (instead of single tokens) is likely to indicate a 'template.'

From here, the tokens most similar to your template phrase are likely to be the values for your parameters.

Thanks for the suggestion. This will work fine when the fixed part (the template) comes first, followed by the parameters. It will recognize "Temperature today is" as a phrase, but it will probably have problems telling whether "20" and "centigrades" are parameters or part of the template in my example one. – Marmite Bomber – 2015-09-21T19:49:28.147

True - but if you then list out the words most similar to "Temperature today is," you'd expect that "centigrade" is essentially the exact same (and hence part of the template), whereas "20" (or other temps) will be fairly similar, but much less so. – jamesmf – 2015-09-21T20:33:53.027


The script below uses the lsa package's weighting functions with a transformed IDF to cut the parameters off from the templates. The idea is that all terms with an IDF higher than some threshold are considered parameters, and their frequency is reset to zero. The threshold can be approximated by the average template occurrence in the corpus. After eliminating the parameters, the distance between records of the same template is zero.

 library(tm)
 library(lsa)
 df <- data.frame(TEMPLATE = c(rep("A",3),rep("B",3),rep("C",3)),
                  TEXT = c(
                    paste("Temperature today is",c(28,24,20),"centigrades"),
                    paste("Temperature today is",c(82,75,68),"Fahrenheit"),
                    paste("Her eyes are ",c("blue","black","green"),"and hair",c("grey","brown","white"))),
                  stringsAsFactors = FALSE)
> df
TEMPLATE                                TEXT
1        A Temperature today is 28 centigrades
2        A Temperature today is 24 centigrades
3        A Temperature today is 20 centigrades
4        B  Temperature today is 82 Fahrenheit
5        B  Temperature today is 75 Fahrenheit
6        B  Temperature today is 68 Fahrenheit
7        C    Her eyes are  blue and hair grey
8        C  Her eyes are  black and hair brown
9        C  Her eyes are  green and hair white

 corpus <- Corpus(VectorSource(df$TEXT))
td <- as.matrix(TermDocumentMatrix(corpus,control=list(wordLengths = c(1, Inf)) ))

> td
             Docs
Terms         1 2 3 4 5 6 7 8 9
20          0 0 1 0 0 0 0 0 0
24          0 1 0 0 0 0 0 0 0
28          1 0 0 0 0 0 0 0 0
68          0 0 0 0 0 1 0 0 0
75          0 0 0 0 1 0 0 0 0
82          0 0 0 1 0 0 0 0 0
and         0 0 0 0 0 0 1 1 1
are         0 0 0 0 0 0 1 1 1
black       0 0 0 0 0 0 0 1 0
blue        0 0 0 0 0 0 1 0 0
brown       0 0 0 0 0 0 0 1 0
centigrades 1 1 1 0 0 0 0 0 0
eyes        0 0 0 0 0 0 1 1 1
fahrenheit  0 0 0 1 1 1 0 0 0
green       0 0 0 0 0 0 0 0 1
grey        0 0 0 0 0 0 1 0 0
hair        0 0 0 0 0 0 1 1 1
her         0 0 0 0 0 0 1 1 1
is          1 1 1 1 1 1 0 0 0
temperature 1 1 1 1 1 1 0 0 0
today       1 1 1 1 1 1 0 0 0
white       0 0 0 0 0 0 0 0 1

## suppress terms with IDF higher than the template-frequency threshold;
## those terms are treated as parameters
template_freq <- 3
tdw <- lw_bintf(td) * ifelse(gw_idf(td)> template_freq,0, gw_idf(td))
dist <- dist(t(as.matrix(tdw)))

> dist
         1        2        3        4        5        6        7        8
2 0.000000
3 0.000000 0.000000
4 3.655689 3.655689 3.655689
5 3.655689 3.655689 3.655689 0.000000
6 3.655689 3.655689 3.655689 0.000000 0.000000
7 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341
8 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341 0.000000
9 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341 0.000000 0.000000


The distance matrix clearly shows that records 1, 2, 3 come from the same template (distance = 0 with this synthetic data; in a real case a small threshold should be used). The same holds for records 4, 5, 6 and for records 7, 8, 9.
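As a cross-check of the weighting logic (my own re-implementation, not the R code above), the same binary-TF × thresholded-IDF scheme can be reproduced in stdlib Python, assuming IDF of the form 1 + log2(ndocs/df), which reproduces the distances in the R output above; with the cutoff at the template frequency, within-template distances collapse to exactly zero:

```python
from math import log2, sqrt

def weighted_matrix(docs, idf_cutoff):
    """Binary term frequency times IDF, with high-IDF terms
    (the presumed parameters) zeroed out."""
    token_docs = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*token_docs))
    n = len(docs)
    # IDF = 1 + log2(n / document frequency); assumed form of the transform
    idf = {w: 1 + log2(n / sum(w in t for t in token_docs)) for w in vocab}
    weight = {w: 0.0 if idf[w] > idf_cutoff else idf[w] for w in vocab}
    return [[weight[w] if w in t else 0.0 for w in vocab] for t in token_docs]

def euclid(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

docs = ["Temperature today is 28 centigrades",
        "Temperature today is 24 centigrades",
        "Temperature today is 82 Fahrenheit",
        "Temperature today is 75 Fahrenheit",
        "Her eyes are blue and hair grey",
        "Her eyes are black and hair brown"]
m = weighted_matrix(docs, idf_cutoff=3)
print(euclid(m[0], m[1]))  # 0.0 -- same template, parameters zeroed out
print(euclid(m[4], m[5]))  # 0.0
```

The parameter terms ("28", "blue", ...) each occur in a single document, so their IDF exceeds the cutoff and they are removed; only the shared template words contribute to the distances, exactly as in the R distance matrix.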