No General Movie Search Yet
There have been successes in recognizing a very narrow sequence from a very narrow set of possible actions, but nothing like a general movie search system exists yet. Such a system would return a set of matches, each with a start time, an end time, and a movie instance, for any one of the search criteria listed in this question.
- Somebody was driving a car
- Talking over the phone
Normalizing the List
First of all, "Was scared" is not the description of an action. It should be "Becoming scared." Secondly, "Talking over the phone" is not a proper action description. It should be a conjunctive action such as "Talking into a phone AND listening to the same phone." To make the list homogeneous in format, the first item should be "Car driving," since the actor is human in every other case.
- Car driving
- Becoming scared
- Talking into a phone and listening to the same phone.
Realistic System Design Expectations
It is unrealistic to think that an artificial neural net, by itself, can be trained to take a database of movies and one of the above list items as input and return the set of start and stop ranges, with their associated movie instances, as output. This will require a complex system with many ANNs and other ML devices, and may require other AI components that are not activation type networks at all. Certainly convolution kernels and various types of encoders should be considered as key system components.
You will need a large amount of training data to cover the above six cases (the last of the five items actually being two distinct actions that we normally associate and consider one). If you want to detect more actions, you will need a large amount of training data for them too.
Verbs and Nouns
This question is interesting to me because recognizing ACTIONS is not the same as recognizing ITEMS. All mammals learn ITEMS first and ACTIONS later. Linguistically, nouns come before verbs in child language development. That is because, just as detecting edges is preliminary to detecting shapes, which is preliminary to detecting objects, detecting motion is preliminary to detecting action.
Verbs like "eating" are an abstraction on top of motion, and, in the case of eating, the motion is complex. Also, eating is not the same thing as gum chewing, so the sequence detected must be as follows:
- Insertion of food into face through mouth
The probability of a sequence is the product of the probabilities of its parts (treating the parts as independent), so that math is simple and easy to implement. Concurrency, as in the case of conjunctive actions like talking into and listening to the same phone, is also relatively easy to handle in general.
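As a minimal sketch of that arithmetic, assuming each part is scored by its own detector and that the parts are independent (a simplification), the product rule could look like this. The function names and the example confidences are hypothetical and only for illustration.

```python
import math

def sequence_probability(step_probs):
    """Probability that every step of a sequence occurred.

    Assumes the steps are independent, which is a simplification.
    """
    return math.prod(step_probs)

def conjunction_probability(p_a, p_b):
    """Probability that two concurrent sub-actions both occur,
    again assuming independence."""
    return p_a * p_b

# Hypothetical detector confidences for three steps of one action.
steps = [0.9, 0.8, 0.5]
print(sequence_probability(steps))

# Hypothetical confidences for "talking into" and "listening to" a phone.
print(conjunction_probability(0.7, 0.6))
```

For long sequences, summing log probabilities instead of multiplying avoids numerical underflow, but the idea is the same.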
A Realistic Approach
Certainly generalization (and more specifically feature extraction) will need to occur in object recognition, collision detection, motion detection, facial recognition, and other planes simultaneously. A complex topology, perhaps employing equilibria as in GAN design, will most likely be necessary to assemble elements of criteria associated with the movie query string and to run windows over the frames of each movie.
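Running windows over the frames of a movie can be sketched as follows. This is only an illustration under stated assumptions: `score_window` is a hypothetical stand-in for whatever combination of networks scores a range of frames against the query, and the window size, stride, and threshold are arbitrary.

```python
from typing import Callable, Iterator, List, Tuple

def sliding_windows(num_frames: int, window: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame ranges, end exclusive.

    Trailing frames that do not line up with the stride are skipped
    in this sketch.
    """
    last_start = max(num_frames - window, 0)
    for start in range(0, last_start + 1, stride):
        yield start, min(start + window, num_frames)

def find_matches(
    num_frames: int,
    window: int,
    stride: int,
    fps: float,
    score_window: Callable[[int, int], float],  # hypothetical scoring model
    threshold: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Return (start_seconds, end_seconds, score) for windows scoring at or above threshold."""
    matches = []
    for start, end in sliding_windows(num_frames, window, stride):
        score = score_window(start, end)
        if score >= threshold:
            matches.append((start / fps, end / fps, score))
    return matches

# Toy scorer that only "recognizes" the window starting at frame 3.
toy_scorer = lambda start, end: 1.0 if start == 3 else 0.0
print(find_matches(num_frames=10, window=4, stride=3, fps=2.0, score_window=toy_scorer))
```

A real system would repeat this per movie and attach the movie instance to each match.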
To provide a service that returns results within a few days or weeks will probably require a cluster and DSP hardware (perhaps leveraging GPUs).
Special Cases that Human Brains Handle
Determining how long one of the two elements of concurrency can go undetected before it invalidates the conjunction can be tricky. (How long can someone not speak into the phone before the scene no longer counts as a phone conversation?)
If, in the movie, only the swallowing is shown, a human can infer eating. Drawing reliable conclusions from such sparse data is a huge AI challenge discussed in various contexts throughout the literature.
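One simple way to express such a tolerance, assuming each sub-action detector outputs time intervals in seconds, is to merge detections separated by no more than a chosen gap and then intersect the two sets of intervals. The function names and the five-second tolerance below are hypothetical; choosing the right tolerance is exactly the tricky part.

```python
def merge_with_gap(intervals, max_gap):
    """Merge (start, end) intervals whose separation is at most max_gap seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous detection: treat as one stretch.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def intersect(a, b):
    """Return the overlapping portions of two lists of (start, end) intervals."""
    overlaps = []
    for s1, e1 in a:
        for s2, e2 in b:
            start, end = max(s1, s2), min(e1, e2)
            if start < end:
                overlaps.append((start, end))
    return sorted(overlaps)

# Hypothetical detections, in seconds.
talking = [(0, 5), (8, 12), (30, 35)]
phone_at_ear = [(0, 20)]

talking_merged = merge_with_gap(talking, max_gap=5)
print(talking_merged)
print(intersect(talking_merged, phone_at_ear))
```

With these toy numbers, the three-second pause in talking is bridged, while the eighteen-second pause splits the detections into separate stretches.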
The Emergence of Associated Technology — A Projection
I suspect that a system topology composed of ANNs, encoders, convolution kernels, and other components that can search for any of a select set of actions will emerge within the next ten years. Work seems to be tracking in that direction in the literature.
A system that will acquire its own training information, sustainably grow in knowledge, and perform general searches of increasing breadth and complexity may be anywhere from forty to two hundred years out. It is difficult to predict.
Gross Overoptimistic Predictions
Every generation seems to view knowledge growth as an exponential function and tends to make unrealistic predictions about the advent of certain coveted technology capabilities. Most of the predictions fail dramatically. I have come to believe that the exponential growth is an illusion created by the exponential decay of interest in the past with respect to time.
We lose track of the energy and rate of growth in eras before us because they become socially irrelevant. People into scientific history, like Whitehead, Kuhn, and Ellul, know that technology has moved forward quickly for at least a few hundred years. Vernadski inferred in his The Biosphere that life may not have arisen, that, like matter and energy, it may always have existed. I wonder if technology has been moving at an essentially constant rate for the last 50,000 years.
Germany decided to double its solar panel energy output every year and published its exponential success, until a few years ago when doubling it again would cost a hundred billion dollars more than what they had to spend. They stopped publishing the exponential growth graphs.