Sequence extraction in a dataset


I am looking for a way to extract sequences/patterns from a dataset such as this one:

dataset = ['sample1', 'sample2', 'sample3', 'sample1', 'sample2', 'sample3', 'sample3', 'sample2'...]

My goal is to find out that the sequence ['sample1', 'sample2', 'sample3'] occurs 2 times in this dataset. Ideally, I would also like to find every sequence that occurs more than once in my dataset.

Is there a library (sklearn...) that could help me do that or do I just have to iterate over my dataset and test each and every possible combination? I assume there must be a more intelligent way to do that.

Thanks for your help!

Nes

Posted 2019-03-12T08:58:26.270

Reputation: 3

Answers


You can use nltk.util.ngrams to extract n-grams (contiguous runs of n items) and collections.Counter to count how often each one occurs. See the example below:

To extract bigrams:

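import collections

from nltk.util import ngrams

# No backslash is needed: a line break inside the brackets is allowed
dataset = ['sample1', 'sample2', 'sample3', 'sample1', 'sample2', 'sample3',
           'sample3', 'sample2', 'sample2', 'sample3', 'sample1', 'sample2']

# Slide a window of length 2 over the list and count each distinct pair
bigrams = ngrams(dataset, 2)
result = collections.Counter(bigrams)
result.most_common()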

Out[1]: 
[(('sample1', 'sample2'), 3),
 (('sample2', 'sample3'), 3),
 (('sample3', 'sample1'), 2),
 (('sample3', 'sample3'), 1),
 (('sample3', 'sample2'), 1),
 (('sample2', 'sample2'), 1)]

To extract trigrams:

trigrams = ngrams(dataset, 3)
result = collections.Counter(trigrams)
result.most_common()

Out[2]: 
[(('sample1', 'sample2', 'sample3'), 2),
 (('sample2', 'sample3', 'sample1'), 2),
 (('sample3', 'sample1', 'sample2'), 2),
 (('sample2', 'sample3', 'sample3'), 1),
 (('sample3', 'sample3', 'sample2'), 1),
 (('sample3', 'sample2', 'sample2'), 1),
 (('sample2', 'sample2', 'sample3'), 1)]

To extract four-grams:

fourgrams = ngrams(dataset, 4)
result = collections.Counter(fourgrams)
result.most_common()

Out[3]: 
[(('sample2', 'sample3', 'sample1', 'sample2'), 2),
 (('sample1', 'sample2', 'sample3', 'sample1'), 1),
 (('sample3', 'sample1', 'sample2', 'sample3'), 1),
 (('sample1', 'sample2', 'sample3', 'sample3'), 1),
 (('sample2', 'sample3', 'sample3', 'sample2'), 1),
 (('sample3', 'sample3', 'sample2', 'sample2'), 1),
 (('sample3', 'sample2', 'sample2', 'sample3'), 1),
 (('sample2', 'sample2', 'sample3', 'sample1'), 1)]
 ...

You only need to choose the length of your n-grams and the minimum number of repetitions that counts as meaningful in your case.
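For the second part of the question (all sequences that occur more than once), one option is to loop over n and keep only the n-grams whose count reaches a threshold. The following is a minimal sketch rather than a library routine: the function name repeated_sequences and its parameters are my own invention, and the zip-based sliding window is a standard-library stand-in that should yield the same tuples as nltk.util.ngrams called with its default arguments. Overlapping occurrences are counted, just as in the examples above.

```python
from collections import Counter


def repeated_sequences(items, min_len=2, max_len=None, min_count=2):
    """Map each contiguous sequence (as a tuple) seen at least
    min_count times to its number of occurrences.

    Hypothetical helper for illustration; items must be a list.
    """
    if max_len is None:
        max_len = len(items)
    repeated = {}
    for n in range(min_len, max_len + 1):
        # Sliding window of width n over the list
        windows = zip(*(items[i:] for i in range(n)))
        counts = Counter(windows)
        frequent = {seq: c for seq, c in counts.items() if c >= min_count}
        if not frequent:
            # Any repeated longer sequence would contain a repeated
            # sequence of length n, so we can stop here
            break
        repeated.update(frequent)
    return repeated


dataset = ['sample1', 'sample2', 'sample3', 'sample1', 'sample2', 'sample3',
           'sample3', 'sample2', 'sample2', 'sample3', 'sample1', 'sample2']

found = repeated_sequences(dataset)
# Show the longest repeated sequences first
for seq, count in sorted(found.items(), key=lambda kv: -len(kv[0])):
    print(count, seq)
```

The early exit keeps the loop short on data with few long repeats, but each pass still builds a Counter over the whole list, so for very large datasets you may want to cap max_len.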

Hope this helps!

TitoOrt

Posted 2019-03-12T08:58:26.270

Reputation: 1 482