Overfitting due to features correlating with training set generation rules


As background, I am using a Deep Neural Network built using Keras to classify inputs into 5 categories.

The current structure of the network is:

  • Input layer (~450 nodes)
  • Dense layer (750 nodes)
  • Dropout layer (750 nodes, dropout rate = 0.5)
  • Dense layer (5 nodes)

The issue I'm having is one of overfitting. My model performs well on the held-out test set (a proportion of my training set), with accuracy sitting right around 99%. However, when I apply the model to unlabelled data, it is only able to classify ~67% of observations into any category, before even considering the correctness of those classifications!

I think the issue may be around my feature and training set generation process. I generated the training set using a rules-based string matching method. This generated a training set of around 3.6 million observations (10% of population).

However, one of the largest features for my input layer is an embedding of the same text used to generate the training set. Therefore, the words matched to generate the training set are also embedded and used as features. Worth noting the text is around 140 characters per observation, and I matched bigrams from that text (so there is other information in that text that would be useful as a feature).

I would remove this feature altogether, but it is the richest information associated with each observation.

Is there a way to solve this without removing that feature altogether?

Hope this makes sense and happy to provide more clarification.

Simplified explanation:

  • My model performs well on my training and test sets.
  • Performs badly on unlabelled data.
  • Each observation is associated with a block of text.
  • To label my training set I used string matching on that text.
  • The text is also a feature (embedding).
  • Is this causing my poor performance on unlabelled data (model learning those string matches?).
  • If so what can I do?

EDIT: Also happy to hear if you think the issue is something else.


Posted 2019-12-04T12:07:45.953

Reputation: 53

Not quite sure what is going on, could you maybe post the relevant code? – matthiaw91 – 2019-12-04T13:13:10.923

Not sure the code would help because it's not a coding issue, more conceptual (and this is part of quite a large project). But will add some clarification. – Jamie – 2019-12-04T13:19:31.870

Probably irrelevant to your overfitting problem, but how does the model fail to classify observations? – Ben Reiniger – 2019-12-04T13:33:53.537

@BenReiniger The final dense layer has a sigmoid activation function, and then I round the results. When I say fail, all 5 nodes are rounded to 0, so cannot place the observation in a category. Sorry for the fluffy language! – Jamie – 2019-12-04T13:38:12.513
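
To illustrate the behaviour described in this comment: with five independent sigmoid outputs thresholded at 0.5, nothing forces any node to fire, so an observation can end up in no category at all. A minimal sketch, assuming made-up logits (the numbers and function names are hypothetical, not taken from the actual model):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rounded_labels(logits):
    """Threshold each independent sigmoid output at 0.5."""
    return [1 if sigmoid(z) >= 0.5 else 0 for z in logits]

def argmax_label(logits):
    """Always pick the single highest-scoring category."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits for one observation: every category scores below 0,
# so every sigmoid output is below 0.5.
logits = [-1.2, -0.3, -2.0, -0.8, -1.5]
print(rounded_labels(logits))  # [0, 0, 0, 0, 0]
print(argmax_label(logits))    # 1
```

Whether forcing a prediction with an argmax (or a softmax output layer) is preferable depends on whether "none of the five categories" is a legitimate outcome for the data.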

I think the main confusion here is the process of "generating training set" and I assume the problem lies there. What do you mean by "generating data"? What is the original raw data? – Kasra Manshaei – 2019-12-04T14:23:08.797

I mean the process of getting labelled examples to use as training data. I had a set of rules - i.e. any observation whose text contained "mortgage repayment" was a "housing" observation. By using these rules I created a set of labelled examples to train my model on. The trained model is used to identify those observations that aren't captured by my rules. – Jamie – 2019-12-04T14:34:04.087
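
The labelling process described in this comment could look roughly like the sketch below. Only the "mortgage repayment" to "housing" rule comes from the comment; the second rule, the example texts, and all names are invented for illustration.

```python
# Hypothetical rule table: bigram -> category.
# Only "mortgage repayment" -> "housing" is from the thread.
RULES = {
    "mortgage repayment": "housing",
    "car insurance": "transport",
}

def label_by_rules(text):
    """Return the category of the first matching rule, or None if no rule fires."""
    lowered = text.lower()
    for phrase, category in RULES.items():
        if phrase in lowered:
            return category
    return None

observations = [
    "Monthly mortgage repayment to bank",
    "Annual car insurance renewal",
    "Rent paid to landlord",
]

# Every labelled example contains a rule phrase; every unlabelled one does not.
labelled = [(t, label_by_rules(t)) for t in observations if label_by_rules(t)]
unlabelled = [t for t in observations if label_by_rules(t) is None]
```

Under this setup the training and held-out test sets both consist only of texts that contain a rule phrase, while the data the model is later applied to consists only of texts that do not.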

Ah, so the model is (more-or-less) successfully learning your rules, but little else because it has little incentive. Can you remove just the bigrams that appear in the rules before training? Or manually label some data without the rules to train on (probably too hard to label enough)? – Ben Reiniger – 2019-12-04T16:11:01.217

@BenReiniger That sounds like a good idea, essentially removing the bigrams used for the rules, and embedding what remains. I'll have a go and see if it has an impact. With my training set already at 3.6million not sure hand labelling will have much of an impact. – Jamie – 2019-12-04T16:19:49.117
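
A minimal sketch of the idea discussed in these two comments, stripping the rule bigrams from the text before it is embedded. The rule list and the function name are hypothetical, and whether this helps in practice is untested here.

```python
import re

RULE_BIGRAMS = ["mortgage repayment", "car insurance"]  # hypothetical rule list

# One case-insensitive pattern matching any rule bigram as whole words.
_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(b) for b in RULE_BIGRAMS) + r")\b",
    flags=re.IGNORECASE,
)

def strip_rule_bigrams(text):
    """Delete rule bigrams, then collapse the leftover whitespace."""
    without = _pattern.sub(" ", text)
    return " ".join(without.split())

print(strip_rule_bigrams("Monthly mortgage repayment to Acme Bank"))
# Monthly to Acme Bank
```

The remaining words are then the only text signal available to the network, which is closer to what it will see on observations that no rule matched.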



I would say that there are no "problems", in the sense that what is happening is to be expected.

First of all, here is a key ML reminder which somehow often gets lost:

  • Performing well on the test set is pointless if that test set is not representative (i.e. is not similar to unlabeled instances)
  • Adding more training data does not help if that training data isn't covering new situations

You say you created your training set using string matching rules. I assume that these rules are similar to the one you pointed out in the comments: any observation containing "mortgage repayment" is a "housing" observation.

Since your model considers bigrams as input, it is not surprising that it found a way to reverse-engineer which string matching rules you used.

To improve your model, I would look at what mistakes it currently makes on unlabeled data. This should provide you with training instances that do not fit your string matching rules.
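
One possible way to start on this, sketched under the assumption that the model outputs five sigmoid scores per observation and that "fails to classify" means no score reaches 0.5: collect the unlabeled observations the model is least sure about and inspect those first. The texts, scores, and names below are made up.

```python
def needs_review(probabilities, threshold=0.5):
    """True when no category's sigmoid output reaches the threshold."""
    return max(probabilities) < threshold

# Hypothetical model outputs on unlabeled text (5 sigmoid scores each).
predictions = {
    "Rent paid to landlord": [0.10, 0.30, 0.05, 0.20, 0.02],
    "Mortgage repayment May": [0.98, 0.01, 0.00, 0.03, 0.01],
    "Council tax direct debit": [0.40, 0.10, 0.45, 0.05, 0.12],
}

# Least confident observations first.
review_queue = sorted(
    (text for text, probs in predictions.items() if needs_review(probs)),
    key=lambda text: max(predictions[text]),
)
print(review_queue)
# ['Rent paid to landlord', 'Council tax direct debit']
```

Observations in this queue contain none of the rule phrases by construction, so labelling even a small sample of them adds situations the current training set does not cover.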

Valentin Calomme

Posted 2019-12-04T12:07:45.953

Reputation: 4 666

Thanks Valentin, this is useful. When looking at misclassified instances, should I look to generate rules based on common mistakes? Hand labelling would take a long time to have an effect with 3.6m current training examples. But then couldn’t my model just learn that new rule! – Jamie – 2019-12-05T09:08:02.103

Indeed, manual labeling is challenging. I would perhaps refine your rules so that your training labels are less noisy. – Valentin Calomme – 2019-12-05T09:09:17.977

Thanks again - one last thing. Do you think removing the bigrams used in rules from the text before embedding could also be a good strategy? – Jamie – 2019-12-05T09:11:44.727

Perhaps not. One approach could be to add noise (typos) to your data so that it's harder for your network to learn the rules you used to label it. Or even train with less data to avoid overfitting. – Valentin Calomme – 2019-12-05T09:16:15.540
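
The typo-noise suggestion could be sketched as follows. Swapping adjacent letters is just one assumed form of noise, and the function name, the rate, and the example text are hypothetical.

```python
import random

def add_typos(text, rate=0.1, rng=None):
    """Swap adjacent letters with probability `rate` per position."""
    if rng is None:
        rng = random.Random()
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most one place
        else:
            i += 1
    return "".join(chars)

# Seeded generator so the noisy copy is reproducible between runs.
noisy = add_typos("mortgage repayment to bank", rate=0.2, rng=random.Random(0))
print(noisy)
```

Applied to the training text only, after the rules have assigned the labels, this would keep the labels unchanged while making the exact rule bigrams less reliable as a shortcut.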