Video engagement analysis with deep learning



I am trying to rank video scenes/frames by how appealing they are to a viewer: basically, how "interesting" or "attractive" a scene inside a video is. My final goal is to generate, say, a 10-second short summary given a video as input, such as those seen on YouTube when you hover your mouse over a video.

I previously asked a similar question here. But the "aesthetics" model is good for ranking artistic images, not frames of videos, so it failed for my use case. I need a score based on "engagement for a general audience": basically, which scenes/frames of a video will drive more clicks, likes, and shares when selected as a thumbnail.

Is there an available deep-learning model or prototype that does this? I am looking for a ready-to-use prototype/model that I can test, as opposed to a paper that I would need to implement myself. A paper is fine as long as the code is open source. I'm new and can't yet write code from a paper.
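The last step of the goal described above (turning per-frame scores into a 10-second preview) can be sketched independently of whichever model produces the scores. The snippet below is a minimal sketch, assuming some model has already assigned one engagement score per frame; the function name `best_window` and its parameters are hypothetical and only for illustration. It picks the contiguous window with the highest total score using a sliding sum.

```python
from typing import List, Tuple


def best_window(scores: List[float], fps: float, seconds: float = 10.0) -> Tuple[int, int]:
    """Return (start, end) frame indices of the contiguous window with the
    highest total score. `end` is exclusive. Hypothetical helper."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Window length in frames, clipped to the length of the video.
    window = max(1, int(round(fps * seconds)))
    window = min(window, len(scores))

    # Sliding sum: add the frame entering the window, drop the one leaving it.
    current = sum(scores[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(scores) - window + 1):
        current += scores[start + window - 1] - scores[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_start + window


if __name__ == "__main__":
    # Made-up scores at 1 frame per second, looking for a 3-second clip.
    demo_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.0, 0.3]
    print(best_window(demo_scores, fps=1.0, seconds=3.0))
```

A single best window is only one possible design; a preview stitched from several short high-scoring segments would need a different selection step.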

Tina J

Posted 2019-08-27T03:39:25.507

Reputation: 889

In your model, are you looking for an accurate summary, or want to maximise interest (whilst still limiting output to an edit from the referenced video)? The two goals are often not compatible, witness any film trailer, YouTube "clickbait" etc. I am asking because I think I have seen references to work on the goal of generating accurate summaries, and might be able to find something. But that doesn't appear to be what you want? – Neil Slater – 2019-08-27T21:08:14.150

Not really accurate summary, but to maximize interest. Yes, it's highly subjective. We don't know the best solution. We just need "a" solution! As long as a model is targeting that concept, it should be fine. – Tina J – 2019-08-27T21:14:56.263

@NeilSlater Something like this: they claim their deep models find thumbnails that will drive more clicks, likes, and shares. But their code is client/server based and not easy for me to run. – Tina J – 2019-08-27T21:16:07.033


OK, I don't know that area well enough. So I was going to suggest things like (although note GPL licensing, which may not suit a commercial product), which is geared around creating an accurate summary. It's an interesting area though, so I hope you find an answer. Sometimes accuracy is an attractive end goal of course, if for instance you were indexing video with the goal of helping an end user find something that they were looking for. – Neil Slater – 2019-08-27T21:35:44.540

Thanks Neil. I will look into that repo. If anything comes to mind in the short term, please let me know. – Tina J – 2019-08-27T22:02:25.007

@NeilSlater Neil, did you try building the repo yourself? cmake runs OK, but make does not. – Tina J – 2019-08-29T22:21:34.603

No, I have not tried building Vis-DSS. – Neil Slater – 2019-08-30T07:08:06.937

This strikes me as an extremely difficult problem due to the subjectivity of the response that you are trying to evaluate. At the very least, you would need to include the demographics of the viewer in the model. – DrMcCleod – 2019-09-12T07:52:32.413



One of the key terms in the literature that you are looking for is video captioning.

You can have a look at some of the relevant papers with code on this subject. In short, it is an active area of research and a difficult problem. One reason is that videos are still difficult to learn from (more data, larger models, etc.), and the model has to work with two modalities of data: text and images.

A paper that you might want to start with is Deep Visual-Semantic Alignments for Generating Image Descriptions, which works on single images. In short, you can use something similar to the approach in the paper: an object detector (e.g. Faster R-CNN) extracts visual features, which are fed into the state of an RNN (LSTM) that outputs the sequence of words in your summary (see picture below).

[Figure: image captioning model]

Anuar Y

Posted 2019-08-27T03:39:25.507

Reputation: 309

Thanks. But how is video captioning related to video scoring? I like to know how a scene is interesting for a viewer. – Tina J – 2019-09-11T23:31:44.800

Right, I focused on "generate say a 10-second short summary given a video as input". Yeah, I see your problem. It sounds like a specific problem for which there wouldn't be a dataset available online (I don't know of one off the top of my head). If you have the opportunity to create your own dataset, then you can perform regression on your score directly. For example, the YouTube-8M paper performs video classification; instead of predicting a class, you would predict a score (and change the loss to an L1 or L2 loss). – Anuar Y – 2019-09-11T23:40:31.787
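To make the "predict a score instead of a class" idea concrete, here is a minimal sketch in plain Python. It is not the YouTube-8M model: a linear model stands in for the deep network, the feature vectors are assumed to be pre-extracted per frame (for example by a pretrained network), and the target scores are made-up labels from a hypothetical self-collected dataset. The only point is the change of objective: a single real-valued output trained with an L2 (squared error) or L1 (absolute error) loss.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Single real-valued output: the predicted engagement score."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def fit_score_regressor(
    X: List[Sequence[float]],
    y: Sequence[float],
    loss: str = "l2",
    lr: float = 0.05,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Full-batch gradient descent on mean L2 or mean L1 loss.

    A linear stand-in for a deep network (an assumption for illustration)."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, target in zip(X, y):
            error = predict(weights, bias, features) - target
            # Derivative of the loss w.r.t. the prediction:
            # 2 * error for L2, sign(error) for L1.
            g = 2.0 * error if loss == "l2" else float((error > 0) - (error < 0))
            for j, x in enumerate(features):
                grad_w[j] += g * x / n
            grad_b += g / n
        weights = [w - lr * gw for w, gw in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


if __name__ == "__main__":
    # Toy 2-dimensional "frame features" with made-up engagement scores.
    X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [0.5, 2.5, 1.5, 3.5]
    w, b = fit_score_regressor(X, y, loss="l2")
    print(predict(w, b, [1.0, 1.0]))
```

With a real network the same swap would apply: replace the classification output and loss with one regression output and an L1 or L2 loss, then rank frames by the predicted score.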

This is a useful answer for a different question, which is why I downvoted it. – DrMcCleod – 2019-09-12T07:53:32.923