Mining timelines in a long text


I am trying to detect timelines in brand histories. For my specific case, I believe it is easy because the data is already clustered: for each Wikipedia article, I can spot the sentences surrounding dates. Here is an example:

McDonald's Corporation is an American fast food company, founded in 1940 as a restaurant operated by Richard and Maurice McDonald, in San Bernardino, California, United States. They rechristened their business as a hamburger stand, and later turned the company into a franchise, with the Golden Arches logo being introduced in 1953 at a location in Phoenix, Arizona. In 1955, Ray Kroc, a businessman, joined the company as a franchise agent

From this, it is easy to narrow the results programmatically to:

McDonald's is founded in 1940

Golden Arches logo introduced in 1953

Ray Kroc, a businessman, joined the company in 1955
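The first half of that narrowing, finding the sentences that surround a date, could be sketched as follows. This is only a minimal illustration: the `YEAR` pattern, the naive sentence splitter and the `year_sentences` helper are assumptions of this sketch, and condensing each sentence into a short event description would still need real NLP.

```python
import re

# Assumption: a "date" is a four-digit year between 1000 and 2099.
YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")

def year_sentences(text):
    """Return (year, sentence) pairs in order of appearance."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = []
    for sentence in sentences:
        for match in YEAR.finditer(sentence):
            pairs.append((int(match.group(1)), sentence))
    return pairs

sample = (
    "McDonald's Corporation is an American fast food company, founded in 1940 "
    "as a restaurant operated by Richard and Maurice McDonald. "
    "The Golden Arches logo was introduced in 1953 at a location in Phoenix, Arizona. "
    "In 1955, Ray Kroc, a businessman, joined the company as a franchise agent"
)

for year, sentence in year_sentences(sample):
    print(year, "|", sentence)
```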

This seems easy if the documents are clustered. If not, I am thinking of a basic algorithm to mine timelines (or natural numbers). So I want to discuss existing studies and my intuition here.


  1. Timeline: a logical succession of events on one single subject.
  2. Dates in a timeline are natural numbers, and can be relatively unordered.
  3. Timelines are continuous (each one covers a single range) and cannot intersect.

Let's ignore the NLP-related part and try to find timelines in a sequence of natural numbers, ignoring topics (the 1st definition).

Distance (dist): the initial timeline length. It represents the minimum length of a timeline.


Step A

1, 4, 2, 5, 3, 8, 7, 9, 20, 21, 23, 24, 1, 5, 7, 9

dist = 4
  • Becomes:

1, 4, 2, 5 / 3, 8, 7, 9 / 20, 21, 23, 24 / 1, 5, 7, 9

  • Score each set (of 4 elements): the choice of scoring is critical, but let's think of a bubble sort score, where score = 1 / number of exchange operations.

    1,4,2,5 => 1/1 | 3,8,7,9 => 1/1
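Step A can be sketched in a few lines. The function names below are hypothetical, and one detail is an assumption that the description leaves open: a set that is already sorted needs zero exchanges, so 1 / 0 is treated here as an infinite (best possible) score.

```python
def chunk(seq, dist):
    """Split seq into consecutive sets of dist elements (the last may be shorter)."""
    return [seq[i:i + dist] for i in range(0, len(seq), dist)]

def count_swaps(values):
    """Count the exchanges an adjacent-swap bubble sort performs."""
    items = list(values)
    swaps = 0
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swaps += 1
    return swaps

def score(values):
    """score = 1 / number of exchanges; assumption: sorted sets score infinity."""
    swaps = count_swaps(values)
    return 1 / swaps if swaps else float("inf")

sequence = [1, 4, 2, 5, 3, 8, 7, 9, 20, 21, 23, 24, 1, 5, 7, 9]
for part in chunk(sequence, 4):
    print(part, count_swaps(part), score(part))
```

The exchange count of a bubble sort equals the number of inversions in the set, so any inversion counter would give the same scores.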

Step B

The reason to score sets is to identify whether a single set represents a timeline or whether the combination of two sets represents one. To decide, we score the combined set and divide by two:

1,4,2,5,3,8,7,9 => 5/2

We conclude that 1,4,2,5 and 3,8,7,9 are two separate sets, while 1,4,2,5,3,8,7,9 is not a single one.

We then move on sequentially to process the next sets.
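One possible reading of this decision, written as code: compare the exchange count of the combined set, divided by two, with the exchange counts of the two separate sets. The `should_merge` rule and its threshold are assumptions of this sketch, not a settled definition. Counted with adjacent swaps, the combined list needs 4 exchanges, so the exact figure can differ from a hand count, but the conclusion is the same.

```python
def count_swaps(values):
    """Count the exchanges an adjacent-swap bubble sort performs."""
    items = list(values)
    swaps = 0
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swaps += 1
    return swaps

def should_merge(left, right):
    """Assumed rule: merge only if the combined cost per set is no worse
    than the cost of the worse of the two separate sets."""
    combined_cost = count_swaps(left + right) / 2
    separate_cost = max(count_swaps(left), count_swaps(right))
    return combined_cost <= separate_cost

print(should_merge([1, 4, 2, 5], [3, 8, 7, 9]))   # combined cost 4 / 2 = 2 vs 1
print(should_merge([1, 2, 3, 4], [5, 6, 7, 8]))   # combined cost 0 vs 0
```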

The reason I said Distance is a minimum is that, before comparing the scores of the initial sets, we first build sets of 4, 5, 6 or more elements, score them (step A), and only keep the separate sets with the better score (here, the fewest bubble sort exchanges).

Any thoughts?
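A rough sketch of that search over set lengths, assuming the candidate lengths are given and that "better" means the fewest exchanges summed over all sets. Both the `best_dist` helper and the use of a plain total are assumptions; a total favours short sets, so some normalisation by set length would probably be needed in practice.

```python
def count_swaps(values):
    """Count the exchanges an adjacent-swap bubble sort performs."""
    items = list(values)
    swaps = 0
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swaps += 1
    return swaps

def total_swaps(seq, dist):
    """Sum the exchange counts over consecutive sets of dist elements."""
    return sum(count_swaps(seq[i:i + dist]) for i in range(0, len(seq), dist))

def best_dist(seq, candidates):
    """Return the candidate length with the fewest total exchanges
    (the first such candidate on ties)."""
    return min(candidates, key=lambda dist: total_swaps(seq, dist))

sequence = [1, 4, 2, 5, 3, 8, 7, 9, 20, 21, 23, 24, 1, 5, 7, 9]
for dist in (4, 5, 6):
    print(dist, total_swaps(sequence, dist))
print("best:", best_dist(sequence, [4, 5, 6]))
```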


Posted 2020-11-11T01:43:49.500

Reputation: 361



It looks to me like what you propose makes sense, but there has been some research done around these questions of time representation already. I'd suggest you check the state of the art in this domain, if only not to reinvent the wheel or miss important cases.

I'm not very knowledgeable about it but I can at least point you to TimeML and the related publications. There are certainly other recent works building on TimeML, for example this one (disclaimer: I know the author).


Posted 2020-11-11T01:43:49.500

Reputation: 12 600

Sure Erwan, I tried some keywords and couldn't find anything. I will look at your links – bacloud14 – 2020-11-12T13:41:49.977