## How to split temporal sequences into sub-sequences in a meaningful yet unsupervised manner?

0

I have a biological process that undergoes some cellular event which I am observing. I have a series of events, with different temporal gaps between them. For example

30, 0, 65, 0, 100, 0, 300, 25, 5, 5, 5, 235, 0, 5, 480, 0


That is, 30 seconds pass between the first event and the second one, the third event occurs together with the second one (activation events can occur in different locations in the cell, and I classify them as distinct events), the fourth event occurs 65 seconds after the third event, and so on.
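To make the format concrete, here is a minimal sketch that turns the gaps into absolute event times. It assumes the first event is placed at t = 0 (an arbitrary origin); the names `gaps` and `event_times` are only illustrative.

```python
from itertools import accumulate

# Gaps (in seconds) between consecutive events, as listed above.
gaps = [30, 0, 65, 0, 100, 0, 300, 25, 5, 5, 5, 235, 0, 5, 480, 0]

# Assumption: the first event sits at t = 0. Each later event occurs
# gaps[i] seconds after the previous one, so a running sum gives its time.
event_times = [0] + list(accumulate(gaps))

for event_id, t in enumerate(event_times):
    print(f"event {event_id}: t = {t} s")
```

Sixteen gaps describe seventeen events, and a gap of 0 shows up as two events sharing the same timestamp.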

I assume that one event can trigger the next one (spatial distance also has an effect, but we have decided to ignore it at this point), forming a sub-sequence of events. Alternatively, the events can be unrelated, and thus belong to different sub-sequences. I assume that this can generally be determined by temporal distance.

The question is - how? I know that the temporal distances between different sub-sequences should be larger than those between events in the same sub-sequence, but I still need to decide what the threshold is.
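For illustration, this is roughly what the splitting step would look like once a threshold exists. It is only a sketch: `split_by_gap` is a hypothetical helper, and the value 60 is an arbitrary placeholder, not a recommended threshold. The loop at the end shows how the number of sub-sequences changes with each candidate threshold, which may help in spotting a range where the split is stable.

```python
def split_by_gap(gaps, threshold):
    """Group event indices into sub-sequences.

    Event 0 opens the first group; a gap strictly greater than
    `threshold` starts a new group, otherwise the event joins the
    current one.
    """
    groups = [[0]]
    for event_id, gap in enumerate(gaps, start=1):
        if gap > threshold:
            groups.append([event_id])
        else:
            groups[-1].append(event_id)
    return groups


gaps = [30, 0, 65, 0, 100, 0, 300, 25, 5, 5, 5, 235, 0, 5, 480, 0]

# Placeholder threshold (seconds), chosen only to demonstrate the call.
print(split_by_gap(gaps, threshold=60))

# How many sub-sequences does each candidate threshold produce?
for candidate in sorted(set(gaps)):
    n_groups = len(split_by_gap(gaps, candidate))
    print(f"threshold {candidate:>3} s -> {n_groups} sub-sequences")
```

With so few events, any threshold read off such a sweep is a hypothesis to test experimentally rather than a result.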

Seeing as I don't know what a good separation would be, I think that an unsupervised approach might be useful here.

Beyond that, however, I have no idea how to approach the problem.

Do any of you have any methodologies suitable for this endeavor, or any other insights and tips?

Many thanks.

3

It seems like your problem is not a typical "time series" problem: in data science and related problems, we normally look at evenly-spaced time series (that is, a measurement every interval, with each interval being the same length, such as 1 second, 3 months, 24 microseconds, etc).

There are methods for working with events that take a random amount of time to occur; see survival analysis. If the coordinate domain (e.g. cell location) is important, you probably need a much more structured statistical model. Either way, it feels like unsupervised learning won't help you much.
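As a first look in that direction, one could compute the empirical survival function of the gaps, i.e. the fraction of gaps longer than each observed value. This is only a sketch assuming every gap is fully observed (no censoring), so it is a plain empirical estimate rather than a full survival model; `empirical_survival` is a hypothetical helper name.

```python
def empirical_survival(gaps):
    """Return (t, S(t)) pairs, where S(t) is the fraction of gaps > t."""
    n = len(gaps)
    return [(t, sum(g > t for g in gaps) / n) for t in sorted(set(gaps))]


# Gaps in seconds, taken from the question.
gaps = [30, 0, 65, 0, 100, 0, 300, 25, 5, 5, 5, 235, 0, 5, 480, 0]

survival = empirical_survival(gaps)
for t, s in survival:
    print(f"S({t}) = {s:.3f}")
```

A long flat stretch in this curve would suggest a range of durations with no observed gaps, which is one place to look for a separation between within-sequence and between-sequence gaps.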

Unfortunately, you haven't explained your data format very well. Without a good understanding of your $x \rightarrow y$ relationship, nobody will really be able to help you. What is being observed? What do the numbers mean? (Time until the next event?) What groups are you trying to get? (Sequences of cell events?)

Thanks, I will look into survival analysis. Sorry if I was not clear: the times above are the gaps (in seconds) between events. Events are the formation of protein-protein interactions as detected by a fluorescent signal (FRET, to be specific, but I don't think it is relevant to my question, so I did not include these details above).

Indeed, the x->y relationship is unclear; if I had a cost function, I would not have needed to ask this question, after all. – Lafayette – 2018-09-03T09:35:30.410

The groups I am trying to get are events that are sufficiently close to each other in time for me to suggest that they are part of the same occurrence (of course, I [or, more probably, another student in our lab] will need to test this hypothesis in reality). – Lafayette – 2018-09-03T09:38:31.823

I don't understand what the 0's are for, then. Concurrent events? I guess a better representation would be something akin to: [event ID - start time - end time]. Survival analysis or related statistical models (such as insurance claims models) might point you in the right direction. If these events occur only one at a time, but one can start immediately after another, then you could try grouping them as you initially proposed; however, without external predictors your model will be very poor. (edit because accidentally pressed enter :P) – Anatoly Makarevich – 2018-09-03T19:27:06.360