Can one use an Artificial Neural Network to determine the size of an object in a photograph?


My question relates to but doesn't duplicate a question that has been asked here.

I've Googled a lot for an answer to the question: Can you find the dimensions of an object in a photo if you don't know the distance between the lens and the object, and there are no "scales" in the image?

The overwhelming answer to this has been "no". From my understanding, this is because, in order to solve the problem with the following equation,

$$\text{distance to object (mm)} = \frac{f\ (\text{mm}) \times \text{real height (mm)} \times \text{image height (px)}}{\text{object height (px)} \times \text{sensor height (mm)}}$$

you need to know either the "real height" or the "distance to object". It's the age-old issue of two unknowns and one equation: unsolvable. A way around this is to place an object of known dimensions in the photo, in the same plane as the unknown object, find the distance to that reference object, and use that distance to calculate the size of the unknown (this relates to the answer from the question I linked above). This is the equivalent of putting a ruler in the photo, and it's a fine way to solve the problem easily.
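To make the ruler workaround concrete, here is a minimal Python sketch (the function name and the example numbers are my own, purely illustrative): once a reference object fixes the millimetres-per-pixel scale at its plane, any other object in the same plane can be measured without knowing the camera distance at all.

```python
def size_from_reference(ref_real_mm, ref_px, target_px):
    """Estimate a target object's real size using a reference object
    ("ruler") lying in the same plane as the target.

    ref_real_mm : known real-world size of the reference (mm)
    ref_px      : extent of the reference in the image (pixels)
    target_px   : extent of the target in the image (pixels)
    """
    mm_per_pixel = ref_real_mm / ref_px  # scale at that plane
    return target_px * mm_per_pixel

# Example: a US quarter (24.26 mm diameter) spans 120 px in the photo,
# and the hail stone next to it spans 210 px.
print(size_from_reference(24.26, 120, 210))  # ≈ 42.5 mm
```

Note that this works only because both objects sit at the same distance from the lens; the distance itself cancels out of the equation.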

This is where my question remains unanswered. What if there is no ruler? What if you want to find a way to solve the unsolvable problem? Can we train an Artificial Neural Network to approximate the real height without knowing the object distance and without using a scale? Is there a way to leverage the unexpected solutions we can get from AI to solve a problem that is seemingly unsolvable?

Here is an example to solidify the nature of my question:

I would like to make an application where someone can pull out their phone, take a photo of a hail stone against the ground at a distance of ~1-3 ft, and have the application give them the hail stone dimensions. My project leader wants to make the application accessible, which means he doesn't want to force users to carry around a quarter or a special object of known dimensions to use as a scale.

In order to avoid the use of a scale, would it be possible to use all of the EXIF metadata from these photos to train a neural network to approximate the size of the hail stone within a reasonable error tolerance? For some reason, I have it in my head that if there are enough relevant variables, we can design an ANN that picks out some pattern in this problem that we humans are simply unable to identify. Does anyone know if this is possible? If so, is there a deep learning model best suited to this problem? If not, please put me out of my misery and tell me why it's impossible.


Posted 2018-08-31T19:28:58.630


Interesting question! My first instinct is "no" if we're talking about a solution that would be robust against "adversarial" inputs (e.g. if we're taking pictures of a cube, there's probably an infinite number of different combinations of cube size + distance to camera that would all look identical, and so would be impossible to reliably distinguish from just a 2D image). My instinct is "yes / kind of" if we're just talking about a solution that would work decently well for "natural" / "real-world" pictures, since objects tend to have certain typical sizes in "natural" pictures. – Dennis Soemers – 2018-08-31T19:57:02.733

Those are just my instincts though; I'm not sure enough about them to put them in an answer. Food for thought for anyone who does want to address the question with a full answer, though! – Dennis Soemers – 2018-08-31T19:57:30.463

Welcome to SE:AI! (I took the liberty of converting your formula to MathJax for convenience of potential answerers--feel free to tweak, or roll back the edit if I got anything wrong.) – DukeZhou – 2018-08-31T20:58:34.583

I don't know whether there is research on this or not, but a natural approach would be a sort of transfer learning: train the model on the sizes of known objects, then show it pictures containing both known and unknown objects interacting. I think @DennisSoemers is right that this won't work for adversarial inputs, but then again, neither do our own eyes! – John Doucette – 2018-09-01T11:58:41.953

@JohnDoucette What about known objects of variable size? Like a hail stone? Would this still be considered adversarial input? I was hoping there might be some combination of inputs like focal length or image depth or ISO that the learning network might be able to pick up on and accurately predict hail size with only knowing a range of distances... – dingFAching – 2018-09-01T20:04:36.967

Our eyes can estimate the size of an object because there are two of them! In fact, you can calculate the distance to an object using two pictures taken from different points of view. Once you have the distance, you can obtain the size of the object. But there are some hard constraints on the pictures. More details here:

– Jérémy Blain – 2018-09-04T12:57:02.627



In my thesis I actually tackle the problem of depth estimation with a CNN from a single monocular image, so I can share my experience with this problem.

As you already stated, in general you cannot recover the absolute scale of the scene in an image through geometric approaches alone. That remains true even if you know the properties of your camera and lens, such as the focal length, but don't know any absolute sizes in the scene. However, a neural network is still able to solve the task of depth estimation from monocular images (at least for fixed camera properties) because it learns the sizes of known objects during training on the dataset. That means it can use the learned sizes of specific objects, together with relative depth relations, to give a fairly good approximation of the depth in the scene.

However, if I understand you correctly, this approach would not work in your special case. If you just take a photo of a stone that can have an arbitrary size, and no depth cues or unique patterns related to depth are present in the image, there is no chance of ever estimating the absolute depth. A CNN would probably just learn some average depth values or recurring depth patterns of the dataset you used, or memorize the whole training set to minimize the training error, since it simply cannot solve this task. So you would not get a tool that generalizes to new scenes. A neural network is still just a function approximator, not something magical that can solve the unsolvable.

For your use case there could be some (complex) solutions that would give you a more or less accurate depth estimate. For example, you could use a structure-from-motion approach where you measure the absolute camera movement with the phone's accelerometer. Better still would be a stereo-camera setup where you know the absolute displacement between the camera positions, which could solve the task as long as your images contain texture. With that, you could find the absolute depth of specific points through classical stereo depth estimation or with a CNN that estimates depth from the stereo image pair. Another approach would be to let the user input the phone's height above the ground, or approximate it with the smartphone's accelerometer, and then estimate the stone's size from its size in the image and the known absolute height above the ground (probably inaccurate).
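As a rough illustration of that last idea, here is a minimal Python sketch (the function name and all numbers are illustrative assumptions, not measured values): with the phone held roughly parallel to the ground, the height above the ground plays the role of the distance to the object, and the pinhole formula from the question can be rearranged to solve for the real size.

```python
def object_size_mm(height_mm, obj_px, focal_mm, sensor_mm, image_px):
    """Pinhole-model size estimate for a camera pointing straight down.

    height_mm : camera height above the ground (user input or an
                accelerometer-based estimate) = distance to the object
    obj_px    : extent of the object in the image (pixels)
    focal_mm  : focal length, e.g. read from EXIF
    sensor_mm : physical sensor size along the same axis (mm)
    image_px  : image resolution along the same axis (pixels)
    """
    # Rearranged pinhole formula:
    # real height = distance * obj_px * sensor_mm / (focal_mm * image_px)
    return height_mm * obj_px * sensor_mm / (focal_mm * image_px)

# Phone held ~60 cm above the ground, 4.2 mm focal length,
# 4.8 mm sensor height, 3024 px image height, stone spans 250 px:
print(object_size_mm(600, 250, 4.2, 4.8, 3024))  # ≈ 56.7 mm
```

The weak point is exactly the one mentioned above: the height estimate itself, and any tilt of the phone, feed directly into the error of the result.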




Can one use an Artificial Neural Network to determine the size of an object in a photograph?

Yes: Learning Depth from Single Monocular Images

In the end, depth is just one special form of size.

Of course, you need something partially known, e.g. another car. You don't need to know the exact size of the car, but you know roughly what size cars have in general. If you have an image without any reference, it is impossible.

Martin Thoma


And if the reference car is a limousine? – Jérémy Blain – 2018-09-24T07:08:55.980

Then it is also a reference. The important point is not what the reference is, but that there is one. For this task, just think of yourself: if I saw a specific scene, could I estimate the size of objects? How would I do it? – Martin Thoma – 2018-09-24T07:46:59.913

Does that mean the model has to learn the "broad" sizes of reference objects? – Jérémy Blain – 2018-09-24T07:56:42.670

Yes. If you train the network on car data and then apply it to close-ups of fruit, I would expect it to utterly fail. – Martin Thoma – 2018-09-24T08:33:21.720