I am trying to detect timelines of brand histories. For my specific case, I believe it is easy because the data is already clustered: for each Wikipedia article, I can spot the sentences surrounding dates. Here is an example:
McDonald's Corporation is an American fast food company, founded in 1940 as a restaurant operated by Richard and Maurice McDonald, in San Bernardino, California, United States. They rechristened their business as a hamburger stand, and later turned the company into a franchise, with the Golden Arches logo being introduced in 1953 at a location in Phoenix, Arizona. In 1955, Ray Kroc, a businessman, joined the company as a franchise agent
From this, it is easy to narrow the results programmatically to:
McDonald's is founded in 1940
Golden Arches logo introduced in 1953
Ray Kroc, a businessman, joined the company in 1955
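As a rough illustration of that first step, here is a minimal sketch that pulls out every sentence mentioning a year. The function name `dated_sentences`, the naive sentence splitter, and the "four-digit number between 1000 and 2099" rule are all my own assumptions, not a reference method; real articles would need a proper sentence tokenizer and date parser.

```python
import re

# Assumption: a "date" is a standalone four-digit year from 1000 to 2099.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def dated_sentences(text):
    """Return (year, sentence) pairs for every sentence that mentions a year."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = []
    for sentence in sentences:
        for match in YEAR.finditer(sentence):
            pairs.append((int(match.group(1)), sentence))
    return pairs


if __name__ == "__main__":
    sample = (
        "The company was founded in 1940 as a restaurant. "
        "The logo was introduced in 1953 at a location in Phoenix. "
        "In 1955, Ray Kroc joined the company as a franchise agent"
    )
    for year, sentence in dated_sentences(sample):
        print(year, "|", sentence)
```

Turning each sentence into a short event such as "founded in 1940" is the NLP part and is not covered by this sketch.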
This seems easy if documents are clustered. If not, I am thinking of a basic algorithm to mine timelines (or natural numbers). So I want to discuss existing studies and my intuition here, starting from a few definitions:
- Timeline: a logical succession of events on a single subject.
- Dates in a timeline are natural numbers and can be "relatively unordered", i.e. not strictly sorted within the timeline.
- Timelines are continuous (each one covers a single range) and cannot intersect.
Let's ignore the NLP-related part and try to find timelines in a sequence of natural numbers, ignoring topics (1st definition).
Distance (dist): the initial timeline length. It represents the minimum length of a timeline. For example, with dist = 4:
1, 4, 2, 5, 3, 8, 7, 9, 20, 21, 23, 24, 1, 5, 7, 9
is first split into
1, 4, 2, 5 / 3, 8, 7, 9 / 20, 21, 23, 24 / 1, 5, 7, 9
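The initial split can be sketched in a couple of lines. The helper name `split_into_sets` is hypothetical, and I am assuming that a trailing group shorter than dist is simply kept as it is.

```python
def split_into_sets(numbers, dist):
    """Cut the sequence into consecutive sets of `dist` elements each."""
    # Assumption: a shorter leftover set at the end is kept, not dropped.
    return [numbers[i:i + dist] for i in range(0, len(numbers), dist)]


if __name__ == "__main__":
    sequence = [1, 4, 2, 5, 3, 8, 7, 9, 20, 21, 23, 24, 1, 5, 7, 9]
    print(split_into_sets(sequence, 4))
```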
Score each set (of 4 elements): scoring is critical, but let's think of a bubble sort score, where score = 1 / number of exchange operations.
1,4,2,5 => 1/1 | 3,8,7,9 => 1/1
The reason to score sets is to decide whether a set represents a timeline on its own or whether the combination of two sets represents a single timeline. To decide, we score the combined set and divide by two:
1,4,2,5,3,8,7,9 => 5/2
1,4,2,5 and 3,8,7,9 are two separate sets, while
1,4,2,5,3,8,7,9 is not a single timeline.
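Here is a minimal sketch of the scoring and the merge decision. To avoid dividing by zero on an already sorted set, it works with the raw number of exchanges (lower is better) instead of 1 / exchanges; that substitution, the names `bubble_sort_swaps` and `should_merge`, and the exact merge rule are my assumptions. With this counting the combined list needs 4 exchanges rather than 5, so the hand count above may follow a slightly different convention.

```python
def bubble_sort_swaps(values):
    """Count the adjacent exchanges a plain bubble sort performs."""
    items = list(values)  # work on a copy, leave the input untouched
    swaps = 0
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swaps += 1
    return swaps


def should_merge(first, second):
    """Merge two sets when the combined cost, divided by two, is not worse
    than the average cost of the two sets taken separately (assumed rule)."""
    combined = bubble_sort_swaps(first + second) / 2
    separate = (bubble_sort_swaps(first) + bubble_sort_swaps(second)) / 2
    return combined <= separate


if __name__ == "__main__":
    a, b, c = [1, 4, 2, 5], [3, 8, 7, 9], [20, 21, 23, 24]
    print(bubble_sort_swaps(a), bubble_sort_swaps(b))  # 1 1
    print(bubble_sort_swaps(a + b))                    # 4
    print(should_merge(a, b), should_merge(b, c))      # False True
```

Since the number of exchanges equals the number of out-of-order pairs, this rule merges two sets exactly when no element of the first set is larger than an element of the second, which may be a cheaper test to run directly.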
We then move on sequentially to process the next sets.
The reason I said Distance is a minimum is that, before comparing the scores of the initial sets, we first identify sets of 4, 5, 6 or more elements and score them (step A), and only keep the separate sets with the better score (the fewest bubble sort exchanges here).
Any thoughts?
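One possible reading of step A, sketched below, is a greedy pass: every timeline starts with dist elements and keeps absorbing the next number as long as that does not increase its exchange count. The function name `grow_timelines` and the stopping rule are assumptions on my side, not an established algorithm, and the last timeline may end up shorter than dist.

```python
def bubble_sort_swaps(values):
    """Count the adjacent exchanges a plain bubble sort performs."""
    items = list(values)
    swaps = 0
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swaps += 1
    return swaps


def grow_timelines(numbers, dist):
    """Greedy segmentation: start each timeline with `dist` elements, then
    extend it while the next element adds no extra exchange (assumed rule)."""
    timelines = []
    start = 0
    while start < len(numbers):
        end = min(start + dist, len(numbers))
        cost = bubble_sort_swaps(numbers[start:end])
        while end < len(numbers):
            extended = bubble_sort_swaps(numbers[start:end + 1])
            if extended > cost:
                break  # the next number belongs to another timeline
            end += 1
        timelines.append(numbers[start:end])
        start = end
    return timelines


if __name__ == "__main__":
    sequence = [1, 4, 2, 5, 3, 8, 7, 9, 20, 21, 23, 24, 1, 5, 7, 9]
    for timeline in grow_timelines(sequence, 4):
        print(timeline)
```

On the example sequence this yields three timelines: 1,4,2,5 then 3,8,7,9,20,21,23,24 then 1,5,7,9. Recomputing the exchange count for every extension is quadratic per step; comparing the next number with the current maximum would give the same decision faster.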