## Evaluating the result of topic modeling in a way that time matters


I have run several topic modeling approaches on my data (clinical data related to cognitive impairment (CI) diseases; we want to find out which factors are important in the progression to a more severe disease). First, I divided my data into 6-month periods (going back from a starting point in 6-month steps) and then ran topic modeling on each period, in order to see how the derived topics differ between periods.

For example, the first six months yield 20 topics, the second six months yield another 20 topics, and so on up to the tenth period (5 years in total). Given my use case, I hoped to see different topics in each six-month period, or at least each year, but unfortunately most of the words are repeated in every period; only the number of times each word appears changes.

For example, in the first six months the word "sleeping" appears 10 times across different topics, but in the second six months it appears only 4 times.

My point is that, if we look at this as something where time matters, I cannot see any visible pattern in my data unless I rely on how the word counts change every six months.

Do you think analyzing my output and plotting the word counts across the 6-month periods makes sense at all, or is it unreliable?

Also, could you let me know what other approaches I could apply to get insight out of my topic modeling output (keeping in mind that the change over each six-month period matters)?


I think the issue comes from the fact that the item you are looking at (the word "sleeping") is a rare event, so the probability of observing one at any given moment is close to zero. Technically, this is a Poisson process.

One way to circumvent this is what you did: aggregate over a period (6 months in your case) so that the number of events becomes significant.

You do not need to cut your observation time into disjoint 6-month periods, though; you can use moving windows: months 1-6, 2-7, 3-8, 4-9, ... If there is a temporal pattern, it will be more visible.
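A minimal sketch of this moving-window idea, assuming you already have per-month counts of the target word; the `monthly_counts` data and variable names here are made up for illustration:

```python
# Overlapping 6-month windows instead of disjoint ones.
# Hypothetical count of the word "sleeping" in each month's
# topic-model output, months 1..12:
monthly_counts = [10, 7, 4, 6, 8, 3, 5, 9, 2, 4, 6, 7]

def sliding_window_sums(counts, window=6):
    """Sum counts over overlapping windows: months 1-6, 2-7, 3-8, ..."""
    return [sum(counts[i:i + window]) for i in range(len(counts) - window + 1)]

print(sliding_window_sums(monthly_counts))  # [38, 33, 35, 33, 31, 29, 33]
```

Each value shares five months of data with its neighbor, so a temporal trend shows up as a smooth drift rather than a jump between disjoint periods.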

Another way is to use cumulative data: the number of occurrences up to time t. If you get a logistic "S"-shaped curve, then you are onto something important.
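The cumulative-count variant can be sketched the same way (again with invented monthly counts); plotting the result would show whether the curve has the logistic "S" shape:

```python
from itertools import accumulate

# Hypothetical monthly counts: usage ramps up, then saturates.
monthly_counts = [1, 2, 4, 8, 12, 14, 15, 15, 16, 16]

# Running total: occurrences up to and including month t.
cumulative = list(accumulate(monthly_counts))
print(cumulative)  # [1, 3, 7, 15, 27, 41, 56, 71, 87, 103]
```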

Basically, you want to see whether a certain word ("sleeping") is more frequent in the CI sub-population than in the non-CI sub-population. You can use the t-test (unpaired observations, unequal sample sizes, equal variance) to check whether the count of the word differs significantly between the two sub-populations, and you can do this over time.
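A sketch of this test with `scipy.stats.ttest_ind`, using invented per-document counts for the two sub-populations:

```python
from scipy.stats import ttest_ind

# Hypothetical per-document counts of "sleeping" in one time period.
ci_counts    = [3, 5, 4, 6, 2, 5, 7, 4]            # CI documents
nonci_counts = [1, 0, 2, 1, 3, 0, 1, 2, 1, 0]      # non-CI documents

# equal_var=True matches the "equal variance" assumption stated above;
# the two samples may have different sizes, which this test allows.
t_stat, p_value = ttest_ind(ci_counts, nonci_counts, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Repeating this per 6-month period (or per cumulative window) gives a time series of p-values showing when the difference between the two sub-populations becomes significant.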

More suitable tests can be found in this paper, which treats the similar problem of testing whether documents have the same author. Here again, nothing prevents you from running a test per time period, cumulative or not.

Thank you so much for sharing your idea with me. Actually, I cannot change the way I divided the data, as we are interested in seeing in which period there may be hidden information related to CI when compared with non-CI. I think your approach may give us some insight, but I'm running out of time, so I will keep the way I divided the data. – sariii – 2018-07-30T15:15:02.117

Regarding the other way you mentioned, may I have your opinion: is it correct to do that on the result of topic modeling? One may ask why not count the occurrences of all words in each six months, since most topic modeling approaches work based on the co-occurrence of words. My main concern is: do you think counting the word occurrences in the topic modeling output for each six months and then comparing CI and non-CI makes sense to you? I do not want to do something silly :) – sariii – 2018-07-30T15:15:11.050

The basic idea is that, as 6 months seems a short time for topic modeling, do it on cumulative data: 1-6, 1-12, 1-18, 1-24, ... You will have a larger statistical base (and thus more stable and significant results), respect the topic model analysis, and be able to see some evolution (if there is any). – AlainD – 2018-07-30T15:30:19.997

Thanks for following up with me. Fortunately I have very large data, such that even after dividing into 6-month periods I still have 5k to 15k documents per period. My main concern is the approach: counting the occurrences of each word in each six months and then comparing between CI and non-CI. I want to make sure this makes sense. – sariii – 2018-07-30T15:38:49.737

Yes, it does make sense. You are comparing the values of one random variable (the count of occurrences of the word "sleeping") in two populations (CI and non-CI) and checking whether they differ significantly. For this you should also estimate their standard deviations, which leads to the so-called t-test. If you want to check whether the occurrences of a set of words are globally correlated with the CI/non-CI sub-populations, that is an ANOVA (analysis of variance). – AlainD – 2018-07-30T16:03:41.180
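A short sketch of the ANOVA suggested in this comment, using `scipy.stats.f_oneway` (with only two groups it is equivalent to the t-test; the per-document counts are hypothetical):

```python
from scipy.stats import f_oneway

# Hypothetical per-document counts of one word in the two groups.
ci    = [3, 5, 4, 6, 2, 5]
nonci = [1, 0, 2, 1, 3, 0]

# One-way ANOVA: does the group explain the variance in the counts?
f_stat, p_value = f_oneway(ci, nonci)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```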

Thank you so much for sharing your view. I will go apply this and accept your answer. I think it would be great if you could update your answer with the last comment. I hope you do not mind if I ask you some questions later if I face an issue. Again, many thanks :) – sariii – 2018-07-30T16:06:54.187