Aesthetics analysis with deep learning



I'm trying to score video scenes in terms of aesthetics and cinematography features: basically, how "interesting" a scene or video frame is for a viewer or, put more simply, how attractive a scene is. My final goal is to tag intervals of video that are likely to be more interesting to viewers. It could be a "temporal attention" model as well.

Is there an available model or prototype to score cinematographic features of an image or a video? I need a starter tutorial on that. Basically, I'm looking for a ready-to-use prototype/model that I can test, as opposed to a paper that I need to implement myself. A paper is fine as long as the code is open-source. I'm new and can't yet write code from a paper.

Tina J

Posted 2019-08-15T21:45:42.813

Reputation: 889

Generally we don't just find and link tutorials here. Could you rephrase the question so that it is more about the problem you face? Any initial thoughts or partial solutions you have would help too, as they help pitch the answer correctly for where you are technically. For instance, how familiar are you with CNNs used for image classification and regression? Have you tried collecting a bunch of images with subjective scores and training a regression model to predict the subjective score of new images, or failing that, have you looked for such a model? – Neil Slater – 2019-08-15T22:03:08.313

I'm new to CNN/DL. I couldn't find an available tool. Was hoping to get some useful insights here. – Tina J – 2019-08-15T22:31:17.520

OK, I think the question needs something more to go on than looking for a "starter tutorial" and "some useful insights". Could you add a little more detail to the question about what you already know about this topic, and what kind of model you are looking for? For instance, if someone were to find a mathematical paper discussing the maths of loss functions and the complexity of the problem (Google published one or two in this area, amongst others), would it be any use to you? Or are you really looking for some Python code that you can hack? Use the edit link to add details. – Neil Slater – 2019-08-16T08:43:29.033

More of a ready-to-use prototype/model, as opposed to a paper that I need to implement myself. I'm new and can't yet write code from a paper. – Tina J – 2019-08-16T15:54:02.470

@NeilSlater By the way, do you happen to know the license terms for using NIMA? Is it OK to use? There are some open-source implementations of it on GitHub. – Tina J – 2019-08-16T21:25:14.333

The license of the repo I linked is Apache 2.0, which is a permissive free license that, in theory, also protects you from patent issues (provided it is licensed correctly by the author). Google quite often publishes code under Apache 2.0, so that seems quite normal, and I would not think twice about using the code in a personal project. If you want to be thorough, though, you may want to do additional research. If you are using a different repo from the one I linked, then check its license file. Open source does not always mean free to use. – Neil Slater – 2019-08-16T22:06:26.647

I want to use it for my friend's start-up. There is another repo that also implements NIMA. Should we check the repo's license, NIMA's license, or both? – Tina J – 2019-08-17T02:10:49.263

You should check the license of the repo you want to use, and check whether use of NIMA is encumbered by any patents. The Apache 2.0 license protects you against copyright and patent claims by the publisher of the repo, so that's a good start. Also, a lot of published ML methods are fine to use in practice; it is a relatively open environment (unlike some areas of research). I would say you are very probably OK to go ahead, but I am not a lawyer; at some point you will want to do legal due diligence. Personally, I would add that to the list of things to check for the start-up. Good luck – Neil Slater – 2019-08-17T06:16:42.157



Image aesthetics has a strong subjective element, and may have multiple dimensions depending on the purpose of the media. That means:

  • It is hard to define what we mean by scoring aesthetics.

  • Given any well-constrained definition, it is then time-consuming to collect relevant data.

However, there is some interest in the machine-learning community, as media quality would be a very useful metric for sorting and filtering data (provided the metric is close enough to what the end user selecting the media actually wants). As a result, there are datasets, research papers and pre-built models for this.

Media quality training data can be crowdsourced in a variety of ways, ranging from looking at the popularity of items on social media to paying experts to assess large numbers of images. One example of an open dataset compiled by researchers for this purpose is AVA.

This data can be reduced to image/quality pairs, which you can then use to train a CNN model to predict the quality metric (a score out of 10, for example). This might just be a regression, but other, more complex loss functions are also considered.
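As a minimal sketch of that reduction step, the Python below assumes each image comes with vote counts for ratings 1 to 10, one whitespace-separated line per image (row index, image id, ten counts, then extra columns that are ignored). I believe AVA's ratings file is roughly of this shape, but check its own documentation; the function names and example lines are hypothetical. It turns the votes into a mean score (a plain regression target) and a normalised rating distribution. It also includes one example of a more complex loss: an Earth Mover's Distance between rating distributions, computed from cumulative sums, which as I understand it is what the NIMA paper uses (with r = 2).

```python
from typing import List, Tuple

RATINGS = list(range(1, 11))  # possible ratings: 1..10


def parse_vote_line(line: str) -> Tuple[str, List[int]]:
    """Split one line into (image_id, vote_counts).

    Assumed layout (check your dataset's documentation): row index,
    image id, ten vote counts for ratings 1..10, then ignored columns.
    """
    fields = line.split()
    counts = [int(c) for c in fields[2:12]]
    if len(counts) != len(RATINGS):
        raise ValueError(f"expected {len(RATINGS)} vote counts: {line!r}")
    return fields[1], counts


def to_distribution(counts: List[int]) -> List[float]:
    """Normalise vote counts into probabilities over the ratings."""
    total = sum(counts)
    if total == 0:
        raise ValueError("image has no votes")
    return [c / total for c in counts]


def mean_score(dist: List[float]) -> float:
    """Expected rating under the distribution: a single regression target."""
    return sum(r * p for r, p in zip(RATINGS, dist))


def emd_loss(pred: List[float], target: List[float], r: float = 2.0) -> float:
    """Earth Mover's Distance between two distributions over ordered ratings,
    computed from the difference of their cumulative sums."""
    if len(pred) != len(target):
        raise ValueError("distributions must have the same length")
    cdf_pred = cdf_target = total = 0.0
    for p, t in zip(pred, target):
        cdf_pred += p
        cdf_target += t
        total += abs(cdf_pred - cdf_target) ** r
    return (total / len(pred)) ** (1.0 / r)


def to_training_pairs(lines: List[str]) -> List[Tuple[str, float, List[float]]]:
    """Reduce raw vote lines to (image_id, mean_score, distribution)."""
    pairs = []
    for line in lines:
        if line.strip():
            image_id, counts = parse_vote_line(line)
            dist = to_distribution(counts)
            pairs.append((image_id, mean_score(dist), dist))
    return pairs


# Hypothetical lines with made-up votes (not real AVA data).
example = [
    "1 img001 0 0 0 0 10 0 0 0 0 0 1 2 3",  # everyone votes 5
    "2 img002 5 0 0 0 0 0 0 0 0 5 0 0 3",   # split between 1 and 10
]
for image_id, score, _ in to_training_pairs(example):
    print(image_id, score)  # img001 5.0, then img002 5.5

# Same mean (5.5), different shape: only the distribution loss sees it.
split = [0.5] + [0.0] * 8 + [0.5]
middle = [0.0] * 4 + [0.5, 0.5] + [0.0] * 4
print(mean_score(split) - mean_score(middle))  # 0.0
print(round(emd_loss(split, middle), 3))       # 0.447
```

The last two lines show why a distribution loss can be worth the extra complexity: a squared error on the mean score cannot tell those two rating distributions apart, while the EMD can.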

A quick search for existing models brings up Google's NIMA project, which has more than one implementation available as open-source code. NIMA appears to use a multiclass classification approach to predict which ratings humans would most likely give the image, and the resulting score is then the average of the possible ratings, weighted by their predicted probabilities. The claimed benefit seems to be that this better matches how the quality ratings are sourced, and that it better captures split opinions (e.g. an image that half of people think is terrible and half think is great is a different type of image from one that everyone thinks is just average).
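To make the weighted-average idea concrete, here is a minimal sketch of the scoring step only, assuming a model whose output head produces ten logits, one per rating 1 to 10. The logits below are made up, and this is not code from Google or from any of the NIMA repos. A softmax turns the logits into a predicted rating distribution; the score is the probability-weighted mean rating, and the standard deviation of the same distribution is one simple way to expose split opinions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw class scores into probabilities (shifted by max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_from_distribution(
    probs: Sequence[float], ratings: Sequence[int] = range(1, 11)
) -> Tuple[float, float]:
    """Return (mean, std) of the ratings weighted by predicted probabilities."""
    ratings = list(ratings)
    if len(probs) != len(ratings):
        raise ValueError("need one probability per rating")
    mean = sum(r * p for r, p in zip(ratings, probs))
    variance = sum(p * (r - mean) ** 2 for r, p in zip(ratings, probs))
    return mean, math.sqrt(variance)


# Two hypothetical predicted distributions with the same mean score.
split = [0.5] + [0.0] * 8 + [0.5]             # half say 1, half say 10
average = [0.0] * 4 + [0.5, 0.5] + [0.0] * 4  # everyone says 5 or 6
print(score_from_distribution(split))    # (5.5, 4.5)
print(score_from_distribution(average))  # (5.5, 0.5)

# With a real model, the probabilities would come from its output logits:
made_up_logits = [0.1, 0.2, 0.5, 1.0, 2.0, 2.2, 1.1, 0.4, 0.2, 0.1]
mean, std = score_from_distribution(softmax(made_up_logits))
print(round(mean, 2), round(std, 2))
```

Both example distributions get the same mean score of 5.5, so the mean alone hides the "split opinion" case; the standard deviation (4.5 versus 0.5) is what distinguishes them.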

Here is an implementation of NIMA by GitHub account "idealo"; it looks complete, with documentation, and ready to use with pre-built scripts.

Just to show this is not a one-off, here's a blog post by Andrej Karpathy about using CNNs to rate selfies, which includes some introduction to core CNN concepts.

Neil Slater

Posted 2019-08-15T21:45:42.813

Reputation: 14 632

What is aesthetically pleasing can vary from region to region, and is also non-stationary with time. I think whether something becomes aesthetically pleasing depends on a huge number of factors. I am not sure, but I think Dawkins first put forth the idea of memes as a kind of cultural tool which propagates/popularises an idea. I don't think ML can handle all these factors. – DuttaA – 2019-08-16T17:43:54.283

Thanks. Yes, I actually found NIMA yesterday, and the pre-trained models are really helping me. It's more of a quality assessment (not an attention model), but good enough to start with. It's photo-based (not video), but still applicable enough. – Tina J – 2019-08-16T19:18:17.060
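Since a photo-based model like NIMA scores single images, one simple way to get back to the original goal of tagging video intervals is to score sampled frames and then post-process the per-frame scores. The sketch below assumes you already have one score per sampled frame; the scores are invented, and `threshold`, `window` and `min_frames` are hypothetical tuning parameters. It smooths the scores with a centred moving average and returns the time ranges that stay above the threshold. This is plain thresholding, not a temporal attention model.

```python
from typing import List, Optional, Sequence, Tuple


def moving_average(scores: Sequence[float], window: int) -> List[float]:
    """Centred moving average; the window is truncated at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def interesting_intervals(
    scores: Sequence[float],
    frames_per_second: float,
    threshold: float,
    window: int = 3,
    min_frames: int = 2,
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans where the smoothed score stays at or
    above threshold for at least min_frames sampled frames (end exclusive)."""
    smoothed = moving_average(scores, window)
    intervals = []
    start: Optional[int] = None
    # A -inf sentinel guarantees that a run still open at the end gets closed.
    for i, s in enumerate(smoothed + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_frames:
                intervals.append((start / frames_per_second, i / frames_per_second))
            start = None
    return intervals


# Invented scores for 11 frames sampled at 1 frame per second.
frame_scores = [4.0, 4.2, 6.5, 6.8, 7.0, 6.9, 4.1, 3.9, 4.0, 6.6, 4.0]
print(interesting_intervals(frame_scores, frames_per_second=1.0, threshold=5.5))
# [(2.0, 6.0)]: the isolated spike at frame 9 is smoothed away
```

Smoothing before thresholding is a deliberate choice here: it suppresses single-frame spikes, so tagged intervals tend to correspond to sustained stretches of higher-scoring footage rather than one lucky frame.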

Your inputs are welcome: – Tina J – 2019-08-27T03:40:13.780