Training a CNN from scratch on the COCO dataset


I am using the TensorFlow Object Detection API to train a CNN from scratch on the COCO dataset. I need to use this specific configuration. There is no pre-trained model on COCO with that configuration, which is why I am training from scratch.

However, after one week of training and evaluating each checkpoint generated by the training phase, this is how my learning curve appears on TensorBoard:

[Image: TensorBoard evaluation curve]

Thus, my questions are:

  • Does anyone know approximately how many iterations will be necessary? So far I have done more than 500,000 iterations.
  • How is it possible that after 500,000 iterations the evaluation is at 0.8%? I would have expected something like 60-70%.
  • Why is there a sudden drop after 500k iterations? I thought the evaluation was supposed to converge to some limit (this is what SGD should do).
  • Is there any 'trick' to speed up the training phase (e.g. increasing the learning rate)?

Giacomo Bartoli

Posted 2018-08-13T21:22:19.097

Reputation: 141

It might be useful to know a little more about your problem. What would the precision of guessing uniformly at random look like? How much data do you have?

Note too, that SGD is not guaranteed to converge smoothly, because you might update repeatedly based on an unfortunate random sample. It converges only in expectation. – John Doucette – 2018-08-14T01:27:12.517
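The point that SGD converges only in expectation is easy to see in a toy example. Here is a minimal, hypothetical sketch (not from the thread): SGD estimating the mean of a noisy signal by minimizing a squared error, where any single update can move the estimate the wrong way, yet the average drift still points at the target.

```python
import random

def sgd_noisy_mean(target=3.0, lr=0.05, steps=2000, seed=0):
    """Estimate `target` by SGD on (x - y)^2 with noisy samples y.

    Each step uses one random sample y = target + noise, so individual
    updates can increase the error; convergence holds only on average.
    """
    rng = random.Random(seed)
    x = 0.0
    losses = []
    for _ in range(steps):
        y = target + rng.gauss(0.0, 1.0)   # one noisy observation
        grad = 2.0 * (x - y)               # gradient of (x - y)^2 w.r.t. x
        x -= lr * grad
        losses.append((x - target) ** 2)
    return x, losses

x, losses = sgd_noisy_mean()
# The loss curve is not monotone: some steps make the error worse.
non_monotone = any(b > a for a, b in zip(losses, losses[1:]))
```

The estimate ends up close to the target even though the per-step loss bounces around, which is exactly the non-smooth convergence described above.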

I'm training on the COCO dataset, which is 18GB of labeled data. I have no idea what the precision of guessing uniformly at random would look like. – Giacomo Bartoli – 2018-08-14T07:36:28.903



It's hard to know for sure what's gone wrong, but here are some possibilities:

  1. The problem is difficult. The COCO paper reports that a typical object covers just 4-6% of the image. A randomly initialized model is therefore likely to do extremely poorly, with an expected precision of between 4 and 6% for detecting the object of interest in a frame. You also have 90 classes in your configuration file. It's not clear whether the model has access to the correct label, but if it's also inferring the class, we'd expect initial accuracy somewhere around 0.06%. That's actually about the precision of your starting model.

  2. You're training on mini-batches of size 32. It's not clear to me from your config whether the lower axis of that graph is iterations or epochs. If it's the former, then your model will have seen about 16 million examples during the entire training period (500,000 steps × 32). COCO contains only 328k unique images, so the model would have seen each of them roughly 49 times (about 49 epochs). For a detector trained from scratch on a problem this hard, that may still not be enough. If you've done 500,000 epochs, though, then that ought to be more than enough.

  3. It's possible that the hyper-parameters are not well set. I'm not an expert at training CNNs, but deep networks are notoriously finicky. I have difficulty reading your config file, but your learning rate looks reasonable to me. It also appears to undergo exponential decay over time though, and it's not clear to me that you want to be doing that, or whether the schedule you're using for the decay makes sense. This might be worth reviewing.
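Since the exact config file isn't visible here, the decay schedule mentioned in point 3 can be sketched in plain Python. This mirrors TensorFlow's `tf.train.exponential_decay`; the numeric defaults below are illustrative values of the kind found in common SSD-on-COCO configs, not necessarily the ones in the asker's file.

```python
def exponential_decay_lr(step, initial_lr=0.004, decay_steps=800720,
                         decay_factor=0.95, staircase=True):
    """Learning rate after `step` gradient updates under exponential decay.

    Implements lr = initial_lr * decay_factor ** (step / decay_steps).
    With staircase=True the exponent is truncated to an integer, so the
    rate drops in discrete jumps instead of decaying continuously.
    """
    exponent = step / decay_steps
    if staircase:
        exponent = step // decay_steps
    return initial_lr * decay_factor ** exponent
```

With `decay_steps` this large, the rate barely moves during the first few hundred thousand iterations, which is one way to sanity-check whether a decay schedule is actually relevant at the point in training where a plot misbehaves.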

John Doucette

Posted 2018-08-13T21:22:19.097

Reputation: 7 904

Thank you, John Doucette, for such a detailed answer. Anyway:

  1. Right now I am at 8% accuracy.
  2. The lower axis represents a training step (= one gradient update).
  3. I can't change the hyperparameters. I can't change anything in the configuration. This is because I'm working with the new Google Vision Kit, and this is the only compatible configuration.
– Giacomo Bartoli – 2018-08-14T17:57:23.750


PS: the drop in accuracy happened because I tried setting a learning rate 10 times higher than the original. As soon as I realized that accuracy was decreasing, I reset the learning rate to its original value, and accuracy started to increase again. This is the result (almost 700k iterations):

– Giacomo Bartoli – 2018-08-14T17:57:39.847

Checking the TF documentation, I can say that an iteration means processing one batch: iteration 50, for instance, means we're processing the 50th batch. – Giacomo Bartoli – 2018-08-14T18:30:27.690

@GiacomoBartoli Ah, if 50 iterations = 50 batches, you'll need to train a while longer! Probably on the order of 100 times longer would be my guess. – John Doucette – 2018-08-14T21:15:09.943
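The back-of-the-envelope numbers traded in this thread are simple multiplication and division. A minimal helper, assuming batch size 32 and COCO's roughly 328k training images (both figures from the discussion above):

```python
def training_progress(steps, batch_size=32, dataset_size=328_000):
    """Return (examples seen, fractional epochs) after `steps` gradient updates."""
    examples_seen = steps * batch_size
    epochs = examples_seen / dataset_size
    return examples_seen, epochs

# Progress after the 500,000 steps reported in the question.
seen, epochs = training_progress(500_000)
```

Plugging in 500,000 steps gives 16 million examples, i.e. a few dozen passes over the dataset; scaling any "train N times longer" estimate is then just multiplying the step count.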

100 times longer means something like 100 weeks = 23 months of training. The real question is: when someone trains a network from scratch, what accuracy level is it expected to reach before moving on to transfer learning? What developers typically do is start from a pre-trained model and then train only the last layers on a custom dataset (this is what I mean by transfer learning). How much accuracy does the pre-trained model have? – Giacomo Bartoli – 2018-08-14T21:43:36.363