Using Information from the Rest of a Sequence to Predict the Label for Any One Item

1

I have a dictionary of variable-length sequences:


[(file_name[-10:], len(tag_is_header_list)) for file_name,
 tag_is_header_list in HEADER_PATTERN_DICT.items()]
[('37bd1.html', 25),
 ('0bcce.html', 40),
 ('90364.html', 28),
 ('8f9c7.html', 24),
 ('d12d4.html', 73),
 ('46837.html', 37),
 ('adb92.html', 53),
 ('0a1e7.html', 69),
 ('da077.html', 43),
 ('9366a.html', 21),
 ('6ae4d.html', 37),
 ('f62ee.html', 19),
 ('73aee.html', 33),
 ('e090a.html', 35),
 ('8b093.html', 44)]

Each sequence holds one (tag, is_header) pair per item, where the label says whether that item is a subject heading:


# Show the shortest sequence in the dictionary
HEADER_PATTERN_DICT[min(HEADER_PATTERN_DICT,
                        key=lambda file_name: len(HEADER_PATTERN_DICT[file_name]))]
[(None, True),
 ('<div', False),
 ('<div', False),
 (None, True),
 (None, False),
 ('<li', False),
 ('<li', False),
 ('<li', False),
 (None, False),
 (None, False),
 ('<li', False),
 ('<li', False),
 ('<li', False),
 (None, True),
 (None, True),
 ('<li', False),
 ('<li', False),
 ('<li', False),
 ('<div', False)]

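Every item in the sequence is an instance for which the label should be predicted, and the goal is to use information from the rest of the sequence when predicting the label for any one item. So, what is the best way to vectorize these variable-length sequences and train a model to predict each label?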

Dave Babbitt

Posted 2020-12-02T16:33:52.503

Reputation: 131

The task is not very clear to me: every item in the sequence is an instance for which the label should be predicted, right? And I assume the goal is to use information from the rest of the sequence when predicting the label for one item? If so, this looks like a good case for sequence labeling; Conditional Random Fields are the standard approach.

– Erwan – 2020-12-02T22:55:48.920

Yes, and I will add your guess as to what I meant to the question. – Dave Babbitt – 2020-12-04T01:01:36.153

Hey @erwan, I got CRF working thanks to your help. How do you or I write this up as an answer? – Dave Babbitt – 2020-12-05T22:16:13.543

Please go ahead and write an answer. It's perfectly fine to answer your own question, and it would certainly be more specific than mine since I don't use Python a lot. – Erwan – 2020-12-05T23:13:06.917

Answers

1

The fastest way to train a model to predict each item's label is to use Conditional Random Fields (CRF), as in this example. h/t @Erwan
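CRF toolkits generally expect each sequence as a list of per-item feature dictionaries plus a parallel list of string labels. Below is a minimal sketch of that preparation step, not the exact code from the linked example. The names `sample_sequence`, `item_features` and `sequence_to_xy` are hypothetical, and the feature choices (own tag, neighbouring tags, relative position) are assumptions made for illustration. The neighbouring tags are how information from the rest of the sequence reaches each item.

```python
# Hypothetical sample in the same shape as one HEADER_PATTERN_DICT value:
# a list of (tag, is_header) pairs.
sample_sequence = [(None, True), ('<div', False), ('<div', False),
                   (None, True), (None, False), ('<li', False)]


def item_features(sequence, i):
    """Describe item i by its own tag and the tags of its neighbours."""
    features = {
        'bias': 1.0,
        'tag': str(sequence[i][0]),        # None becomes the string 'None'
        'position': i / len(sequence),     # relative position in the sequence
    }
    if i > 0:
        features['prev_tag'] = str(sequence[i - 1][0])
    else:
        features['BOS'] = True             # beginning of sequence
    if i < len(sequence) - 1:
        features['next_tag'] = str(sequence[i + 1][0])
    else:
        features['EOS'] = True             # end of sequence
    return features


def sequence_to_xy(sequence):
    """Split one labelled sequence into feature dicts and string labels."""
    X = [item_features(sequence, i) for i in range(len(sequence))]
    y = [str(is_header) for _, is_header in sequence]
    return X, y


X_seq, y_seq = sequence_to_xy(sample_sequence)
```

Applying `sequence_to_xy` to every value in `HEADER_PATTERN_DICT` gives one list of feature sequences and one list of label sequences. Assuming the `sklearn-crfsuite` package is installed, those two lists can be passed to `sklearn_crfsuite.CRF(...).fit(X, y)`, and `predict` then returns one label per item for each sequence.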

Dave Babbitt

Posted 2020-12-02T16:33:52.503

Reputation: 131