Object IN/OUT counting using CNN+RNN


I am building a video analytics program for counting moving things in a video. I am detecting bicycles and nothing else. I run object detection using the SSD mobile-net model in all the frames and store the bounding box coordinates (x,y,w,h) of each detection to a CSV file.

So for each video I have a CSV file with one row per frame, and each row holds multiple detections D1, D2, D3, ..., Dn. Each detection consists of its bounding box coordinates, so D1 is x,y,w,h.

Based on the x,y values of each detection, I am trying to find the direction of the bicycles and if the bicycle crosses the whole frame to do a UP/DOWN count.

How do I count/track these bounding boxes moving in the video? (I don't want to use classic tracking algorithms.)

I see LSTM/RNN coming up in my search results when I search for video analytics. Being a noob, I am not able to find any tutorial that suits my needs.

I would like to check if my approach towards the problem is correct.

I don't want to use the classical tracking solutions for two reasons

  1. I feel the tracking and counting conditions that I program in Python are always leaky and fail in certain conditions, hence I want to see how AI manages to count the objects.

  2. The video stream I am using has heavy distortion on the objects that I track, hence the shape and size of an object change drastically within 10 to 20 frames.

Any help or suggestion towards other better approaches is much appreciated.

Edit 1: The area of view under the camera is fixed, and we expect the bicycles to move from one entry side. Let's assume that the view and entry/exit are as shown in this video: https://www.youtube.com/watch?v=tW7Pl3bSzR4


Posted 2019-05-06T19:20:08.353

Reputation: 71

What do you mean by "classic algos"? Which algorithms do you consider classic? – malioboro – 2019-05-10T04:12:43.317

I mean object tracking algorithms like MedianFlow, MOSSE, GOTURN, kernelized correlation filters, and discriminative correlation filters – 55597 – 2019-05-10T11:39:50.000



Assessing the Question

Based on the x,y values of each detection, I am trying to find the direction of the bicycles and if the bicycle crosses the whole frame to do a UP/DOWN count.

It appears from the question that there is an interest in both counting bicycles and determining the direction of travel of each from within the frames of a video stream or file. We can assume that each bicycle has at least one rider. It also appears that this is not a problem involving a fixed optical system pointed at a path that is tangential to the optical path, which would make the problem much easier by locking the distance from the bicycle wheels to the optical system to a near constant.

The use of the SSD mobile-net model seems reasonable as a starting point for developing expertise.

Starting With ML Design Basics

Let's consider the purpose of CNN and RNN designs.

  • The purpose of a convolutional network is to deal equally with regions in a multi-dimensional array of values in discrete samples of $\mathbb{R}^n$ during an adaptive (learning) process.
  • The purpose of a recurrent network is to adapt to (learn) potentially complex temporal (time-wise) trends in potentially complex nonlinear systems.

Understand that the SSD class of algorithms does not do what natural visual systems do. These algorithms do not zoom in and out on independent objects within the network seamlessly. They cannot note that a baseball player is running to first base and a ball is coming from the catcher at the same time, which would require independent conceptual zoom operations within the neural network. This cannot be done with a zoom lens. That is why the Director of Photography is such a key role in movie making. The visual data must contribute well to the storytelling, using lighting, camera orientation, panning, zooming, and depth of focus.

Although one can create several bicycle concept classes to cover various bicycle sizes, orientations relative to the optics, and distances away, there are limitations to this approach, which can be diminished with circuit parallelism in hardware. Multi-threading and serial evaluation can, depending on resources and patience factors, increase training time beyond what is practical. The challenge is to create seamlessness between low level concept classes of a bicycle in a frame as the bicycle angle and distance changes relative to the optical path.

Deeper into Details

The "heavy distortion on the objects" could be a show-stopper if the root cause of the distortion mentioned is poor resolution in the time, horizontal, or vertical dimensions. The most significant and consistent image-oriented feature of a bicycle is the two wheels: two ellipses (not always circles) in close horizontal proximity and even closer vertical proximity. The wheels need to be recognizable.

Two general categories of networks were mentioned in the question, CNNs and RNNs, which are in general the two most relevant categories of components in a visual system that recognizes motion. The question also introduces some nomenclature, which is the starting point for the mathematical theory behind the design of the training and the real-time requirements on the network components once they are trained.

... each ... frame ... has multiple detections of D1, D2, D3, ..., Dn

$$ D_i \land i \in \{1, 2, \ldots, n\} \\ \Downarrow \\ D_{xywh} $$
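The nomenclature above can be made concrete with a small parsing sketch. The question does not show the actual CSV layout, so this assumes each row is a flat list `x1,y1,w1,h1,x2,y2,w2,h2,...` with a blank row for a frame without detections; the names `Detection`, `parse_row`, and `read_frames` are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple


class Detection(NamedTuple):
    """One detection D_i, stored as its bounding box (x, y, w, h)."""
    x: float
    y: float
    w: float
    h: float


def parse_row(row: List[str]) -> List[Detection]:
    """Split one flat CSV row into detections (assumed layout: x,y,w,h repeated)."""
    values = [float(v) for v in row if v.strip()]
    usable = len(values) - len(values) % 4  # drop a trailing incomplete group
    return [Detection(*values[i:i + 4]) for i in range(0, usable, 4)]


def read_frames(text: str) -> List[List[Detection]]:
    """Return one list of detections per frame (one CSV row per frame)."""
    return [parse_row(row) for row in csv.reader(io.StringIO(text))]


# Two detections in frame 0, one in frame 1, none in frame 2.
sample = "10,20,30,40,50,60,30,40\n12,24,30,40\n\n"
frames = read_frames(sample)
print([len(f) for f in frames])
```

Note that the index i only orders detections within a frame; nothing in this layout says that D1 in one frame is the same bicycle as D1 in the next.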

The above nomenclature presumably refers to a post-learning detection of concept classes $C_a, C_b, C_d$, where there is a many-to-one relationship between the above numeric indices for detections and these letter indices for the concepts of a bicycle to be recognized. Each concept class $C$ might correspond to a particular recognizable bicycle feature set given a particular range of distances to the optics and orientation of the wheels relative to the direction of light rays between the wheels and the camera. The designer, considering this correspondence, cannot dismiss the turning of the front wheel. We cannot assume that the eccentricity of the visual representations of the two wheels will be the same, since the bicycle may be turning. Even in this more complex case, the ellipses are likely the two most differentiating features of bicycles in common scenes.

This may be a good time to point out that tricycle recognition may require the recognition of an entirely distinct set of concept classes.

Also notice that, if the optics (camera) is at a drastically different altitude than the wheels of the bicycles, such as images from a drone or a camera on a tall pole, the problem is a different one. This is the intensely effective quality of natural vision systems. Over millions of years, the kind of training that recognizes a bicycle from a drone video stream, having only been trained to recognize a bicycle from ground level, has emerged. Nature's ability to apply cognitive abilities to visual sequence recognition for use in trajectory prediction is not yet realized in software and hardware, and it is the main problem in automated vehicle piloting and driving.

Two Different Output Requirements

There are two somewhat distinct problems that must be considered in analysis of the project requirements. The output could be either of these two, depending on whether counting bicycles is the primary goal or whether metrics of travel is its own independent objective.

  • Unit vector $\vec{r}$, presumably in $\mathbb{R}^2$ as a pixel vector, essentially a normalized vector of the first derivative of pixel position with respect to time. The center of the bike, in this case, would be based on the features of the bike in the field of view.
  • Unit vector $\vec{r}$, presumably in $\mathbb{R}^2$ as a geocoordinate unit vector, essentially a normalized vector of the first derivative of position with respect to time. The center of the bike, in this case, would be based on the features of the bike in geo-space without the altitude coordinate.
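The first (pixel-space) output can be sketched as follows. This is only an illustration under assumptions not stated in the question: the boxes are `(x, y, w, h)` with `(x, y)` the top-left corner, the box center stands in for the center of the bike, all boxes in `track` already belong to the same bicycle, and the frame rate is known. The names `centroid` and `unit_direction` are hypothetical. The geocoordinate variant would additionally need a camera-to-ground mapping, which is not covered here.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner assumed


def centroid(box: Box) -> Tuple[float, float]:
    """Center of a bounding box in pixel coordinates."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def unit_direction(track: List[Box], fps: float = 30.0) -> Optional[Tuple[float, float]]:
    """Normalized mean velocity of the box center over the track.

    Returns None when the track is too short or the center did not move.
    """
    if len(track) < 2:
        return None
    (x0, y0), (x1, y1) = centroid(track[0]), centroid(track[-1])
    dt = (len(track) - 1) / fps  # elapsed time in seconds
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt  # finite-difference velocity, px/s
    speed = math.hypot(vx, vy)
    if speed == 0.0:
        return None
    return (vx / speed, vy / speed)


track = [(0.0, 0.0, 10.0, 10.0), (3.0, 4.0, 10.0, 10.0), (6.0, 8.0, 10.0, 10.0)]
r = unit_direction(track)
print(r)
```

In image coordinates the y axis usually points down, so a positive second component here would mean the bicycle is moving toward the bottom of the frame.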

Approaches to AI Design

The common, but inefficient, artificial network approach is to locate the bicycle in each frame with a CNN and then use one of the progressive RNN types (either a GRU or a b-LSTM network) to recognize motion trends. One of the largest drawbacks is that you may have many concept classes that represent adjacent size-distance-orientation concepts (kernel based recognition models) of a bicycle to train into the CNN. If the bike is traveling toward or away from the optics at some angle, then the disappearance of the bike from $D_a$ and its appearance in $D_b$ needs to be construed as the contiguous motion of one bike. This is not an easy challenge but is heavily covered in the literature.
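As a rough illustration of the hand-off between the detector and the recurrent stage in that common approach, the per-frame boxes have to become fixed-length feature sequences before a GRU or LSTM can consume them. The sketch below covers only that preparation step, not the network. The slot count, window length, stride, and box ordering are arbitrary assumptions, and `frame_features` and `windows` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h) in pixels


def frame_features(boxes: Sequence[Box], width: float, height: float,
                   max_boxes: int = 4) -> List[float]:
    """Encode one frame as max_boxes slots of (cx, cy, w, h, present), scaled to [0, 1]."""
    # Sort top-to-bottom, then left-to-right, so the slot order is at least repeatable.
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))[:max_boxes]
    feats: List[float] = []
    for x, y, w, h in ordered:
        feats += [(x + w / 2) / width, (y + h / 2) / height, w / width, h / height, 1.0]
    feats += [0.0] * (5 * (max_boxes - len(ordered)))  # zero-pad the empty slots
    return feats


def windows(frames: Sequence[Sequence[Box]], width: float, height: float,
            length: int = 16, stride: int = 8, max_boxes: int = 4) -> List[List[List[float]]]:
    """Cut the video into overlapping windows of `length` frames for a recurrent network."""
    feats = [frame_features(f, width, height, max_boxes) for f in frames]
    return [feats[s:s + length] for s in range(0, len(feats) - length + 1, stride)]


# 24 frames with one box drifting downward in a 640x480 image.
video = [[(100.0, 10.0 * t, 40.0, 60.0)] for t in range(24)]
seqs = windows(video, 640.0, 480.0)
print(len(seqs), len(seqs[0]), len(seqs[0][0]))
```

Each window would still need a hand-labelled target (for example the number of up and down crossings it contains) before any training can happen, and the slot ordering gives the network no guarantee that a slot holds the same bicycle from frame to frame.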

It is recommended to use web searches designed to search scholarly articles, not dummies guides, which are not reliable. There are many academic publications that can be found with the search term, "Image recognition changing distance orientation." Looking at old articles from the 90s will provide a good historical context. Looking at new ones from the last three years will provide a survey of the current state of research.

Other Questions Within the Primary Question

The original types of recurrent networks are of interest mainly for historical context. The dominant in-field recurrent network successes are often of the LSTM, b-LSTM, or GRU types.

The language (Scala, Python, Java, C, C++) is not particularly relevant if you are delegating the training to a GPU (where the kernels that do the work are typically compiled from C/C++ regardless of the host language), so it may be unwise to consider reliability concerns as a primary driver for programming language selection.

Regarding "how AI manages to count the objects," AI doesn't, not at the current state of technology. There is no single approach or algorithm across AI technology that dominates all other approaches for all domains, into which bicycle data can simply be plugged and expected to work.

Currently, the AI engineer designs how the objects will be counted based on the characteristics of the objects to be counted, the meta-features of the incoming stream or data set, and the specifics of the recognition challenge. This is again because the wider capabilities of natural vision systems, which use the more sophisticated neural nets in animals and people, have not yet been invented.

Final Recommendations Regarding System Design

The division between the use of kernels, in the CNN context, and the use of one of the recurrent network types is critical. If the engineer tries to delegate too much to the kernel, the above issue of bicycle distance to the optics and turning corners is exacerbated, because kernel operations do not lend themselves well to orientation and distance complexity. However, the CNN approach is excellent for the most upstream operations, such as edge detection and primitive object detection.

Let the recurrent network (of the more advanced types mentioned above) detect the bicycles as their distance and orientation in relation to the optical path changes, unless you have a sizable GPU farm that will perform CNN operations covering many distance and orientation ranges in parallel. Even if you do or have the patience of a saint, it may be best to delegate total bicycle recognition to the recurrent network anyway, since this is likely closer to the way natural systems do it and the modeling of bicycle travel between distance and orientation categories can be made more naturally seamless.

Recapping the comment above in this context, the problem complexity would be much lower if the bicycles were on a path, could not turn corners, and had to travel either right to left or left to right within a speed range governed by the flow of traffic.
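For that constrained case, counting reduces to checking on which side of a virtual line each bicycle starts and ends. The sketch below assumes that the positions of each bicycle along the travel axis have already been grouped into one sequence per bicycle, which is exactly the hard association step in the general problem. The function names, the line position, and the `margin` dead band are hypothetical choices.

```python
from typing import Iterable, Sequence, Tuple


def crossing_direction(positions: Sequence[float], line: float, margin: float = 5.0) -> int:
    """Return +1 if the track ends on the far side of the line, -1 for the reverse, 0 if no net crossing.

    Positions within `margin` of the line are ignored so detector jitter does not count.
    """
    sides = [(-1 if p < line else 1) for p in positions if abs(p - line) > margin]
    if not sides or sides[0] == sides[-1]:
        return 0  # never clearly on both sides, or turned back
    return sides[-1]


def count_crossings(tracks: Iterable[Sequence[float]], line: float,
                    margin: float = 5.0) -> Tuple[int, int]:
    """Count tracks crossing in the positive and in the negative direction."""
    forward = backward = 0
    for positions in tracks:
        direction = crossing_direction(positions, line, margin)
        if direction > 0:
            forward += 1
        elif direction < 0:
            backward += 1
    return forward, backward


tracks = [
    [10.0, 40.0, 60.0, 90.0],  # crosses in the positive direction
    [90.0, 55.0, 45.0, 10.0],  # crosses in the negative direction
    [10.0, 48.0, 52.0, 20.0],  # hovers near the line, then turns back
]
counts = count_crossings(tracks, line=50.0)
print(counts)
```

Whether the positive direction means UP, DOWN, left, or right depends on which image axis is fed in and on the camera orientation.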

Response to Comments

Regarding Edit 1,

Edit 1: The area of view under the camera is fixed. And we expect the bicycles to move from one entry side. Lets assume that the view and entry/exit is like shown in this video,

YouTube.com reports, "This video does not exist." That the camera is fixed had been assumed in the writing of this answer because no camera trajectory was described in the question. Had the expectation that bicycles will move from one entry side of the frame been included in the question, the answer would have addressed that case, but there was no hint of that requirement prior to Edit 1. Nonetheless, much of the content in this answer to the more general case still applies.

Regarding practicality, let's differentiate practical from prefabricated. The problem originally described before Edit 1 has no prefabricated solution into which one can plug bicycle data and tweak a few parameters to achieve success. In fact, the general interest on the web in seeking prefabricated solutions is usually met with the practical reality that such plug and play cases in machine learning are rare. Most of the time an approach that involves design and experimentation is required.

Those who have hired and managed human beings know that even the hoped-for high-level AI of the future, although possibly quite practical, may not be as prefabricated as idealized. For instance, hiring an EE does not mean that the electrical engineering department will immediately see a practical improvement in throughput. Management, training, and workflow design will still be prerequisites to employee effectiveness. To have practical and reliable bicycle counting, some comprehension of concepts to guide initial experimentation, design, and tuning of the training and use scenarios will likely be necessary.

If the author of the question has one of those very rare cases where bicycles travel in exactly one direction and in series, with no foot traffic, trikes, walking pedestrians, or pets, then AI is not necessary at all. A simple LED and photo-transistor with a passive low pass filter and a digital input to a counter circuit will perform a reasonable count. However, if two bicycles might pass in parallel, then we are back with the camera, the need for concept classes, and challenges much like those discussed in detail above.
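The behaviour of such a sensor circuit can be imitated in software to see why it is sufficient for the single-file case. The sketch below models the passive low pass filter as a first-order exponential smoother and the digital input as a threshold with hysteresis. The sample values, `alpha`, and the two thresholds are illustrative assumptions, not measured component values, and `count_beam_breaks` is a hypothetical name.

```python
from typing import Iterable, Optional


def count_beam_breaks(samples: Iterable[float], alpha: float = 0.2,
                      low: float = 0.3, high: float = 0.7) -> int:
    """Count beam interruptions in a sampled photo-transistor signal.

    Samples are assumed to be near 1.0 when the beam is seen and near 0.0 when blocked.
    """
    filtered: Optional[float] = None
    blocked = False
    count = 0
    for s in samples:
        # First-order low pass filter: suppresses glitches shorter than a few samples.
        filtered = s if filtered is None else filtered + alpha * (s - filtered)
        # Hysteresis: count once on the way down, re-arm only after a clear recovery.
        if not blocked and filtered < low:
            blocked = True
            count += 1
        elif blocked and filtered > high:
            blocked = False
    return count


signal = (
    [1.0] * 20 + [0.0] * 20   # first bicycle blocks the beam
    + [1.0] * 20 + [0.0] * 1  # one-sample glitch, filtered out
    + [1.0] * 20 + [0.0] * 20 # second bicycle
    + [1.0] * 20
)
n = count_beam_breaks(signal)
print(n)
```

Two bicycles passing side by side would block the beam once and be counted as one, which is the limitation noted in the paragraph above.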

Regarding steps to a solution, if approaching from the side of a technology to investigate, this answer includes a recommendable sequence, although there is usually quite a bit of overlap and occasional backtracking in actual practice. If this is not a learning exercise and the bicycle problem is one that actually needs to operate in the field, then the above answer is the correct background for comprehending what the machine is required to perform. Following that comprehension would be the investigation of various designs and algorithms, beginning with a search of scholarly articles using the key phrases in the original answer.

Douglas Daseeco

Posted 2019-05-06T19:20:08.353

Reputation: 7 174

Can you check the edit and Thanks for the detailed explanation, however, I am looking for a more practical solution. I started the bounty because I am in need of a practical solution or steps towards how to build a practical solution. – 55597 – 2019-05-12T18:34:20.677