As background, I am using a Deep Neural Network built using Keras to classify inputs into 5 categories.
The current structure of the network is:
- Input layer (~450 nodes)
- Dense layer (750 nodes)
- Dropout layer (750 nodes, dropout rate = 0.5)
- Dense layer (5 nodes)
The issue I'm having is one of overfitting. My model performs well on the held-out test set (a proportion of my training set), with accuracy sitting right around 99%. However, when I apply the model to unlabelled data, it is only able to classify ~67% of observations into any category, before even considering the correctness of those classifications!
I think the issue may lie in how I generated my features and training set. I built the training set using a rules-based string matching method, which produced around 3.6 million labelled observations (10% of the population).
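For concreteness, here is a minimal sketch of what rule-based labelling of this kind could look like. The bigrams, category names, whitespace tokenisation, and the `label_text` helper are all hypothetical placeholders, not the actual rules used.

```python
from typing import List, Optional, Tuple

# Hypothetical rules: bigram -> category (placeholders, not the real rule set).
RULES = {
    ("credit", "card"): "payments",
    ("late", "delivery"): "shipping",
}

def bigrams(text: str) -> List[Tuple[str, str]]:
    """Lower-case, split on whitespace, and pair up adjacent tokens."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def label_text(text: str) -> Optional[str]:
    """Return the category of the first matching bigram, or None if no rule fires."""
    for bg in bigrams(text):
        if bg in RULES:
            return RULES[bg]
    return None

print(label_text("My credit card was declined"))
print(label_text("Package arrived fine"))
```

Observations where no rule fires get no label, so under this scheme they never appear in the training or test sets.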
However, one of the largest features for my input layer is an embedding of the same text used to generate the training set. The words matched to generate the labels are therefore also embedded and used as features. It is worth noting that the text is around 140 characters per observation, and I only matched bigrams from it (so there is other information in that text that would be useful as a feature).
I would remove this feature altogether, but it is the richest information associated with each observation.
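To make the overlap concrete, the sketch below replaces rule-matched bigrams with a placeholder token, so that only the remaining text would reach the embedding. This is an illustration under assumptions: the rule bigrams, the `<rule>` token, and `mask_rule_bigrams` are hypothetical, and whether masking like this actually helps is untested.

```python
from typing import List, Set, Tuple

# Hypothetical rule bigrams (placeholders, not the real rules).
RULE_BIGRAMS: Set[Tuple[str, str]] = {("credit", "card"), ("late", "delivery")}
MASK = "<rule>"  # assumed placeholder token

def mask_rule_bigrams(text: str) -> str:
    """Replace each rule-matched bigram with MASK, keeping all other tokens."""
    tokens: List[str] = text.lower().split()
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in RULE_BIGRAMS:
            out.append(MASK)
            i += 2  # skip both tokens of the matched bigram
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(mask_rule_bigrams("My credit card was declined twice"))
```

If held-out accuracy dropped sharply when training on masked text, that would suggest the model is mostly learning the labelling rules rather than the rest of the text.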
Is there a way to solve this without removing that feature altogether?
I hope this makes sense; I'm happy to provide more clarification. To summarise:
- My model performs well on my training and test sets.
- Performs badly on unlabelled data.
- Each observation is associated with a block of text.
- To label my training set I used string matching on that text.
- The text is also a feature (embedding).
- Is this causing my poor performance on unlabelled data (i.e. is the model just learning those string matches)?
- If so, what can I do?
EDIT: Also happy to hear if you think the issue is something else.