These numbers suggest that the first model is not, in fact, overfit. Rather, they suggest that your training data had few data points near the decision boundary. Suppose you're trying to classify everyone as older or younger than 13 years old. If your test set contains only infants and sumo wrestlers, then "older if weight > 100 kg, otherwise younger" is going to work really well on the test set but not so well on the general population.
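To make that concrete, here's a minimal sketch of the weight-rule scenario. All the numbers are invented for illustration; the point is only that a rule can score perfectly on an unrepresentative sample and still fail on typical cases:

```python
# Toy rule from the example: "older than 13 iff weight > 100 kg".
def rule(weight_kg):
    return weight_kg > 100

# Skewed test set: only infants and sumo wrestlers.
# Each pair is (weight_kg, is_older_than_13).
skewed = [(5, False), (7, False), (150, True), (160, True)]

# More representative sample: includes ordinary adults under 100 kg,
# whom the rule wrongly classifies as younger than 13.
general = [(5, False), (70, True), (85, True), (150, True), (40, False)]

def accuracy(data):
    return sum(rule(w) == y for w, y in data) / len(data)

print(accuracy(skewed))   # 1.0 -- perfect on the skewed test set
print(accuracy(general))  # 0.6 -- fails on typical adults
```

The rule never saw anyone near the decision boundary, so nothing in the skewed test set could expose it.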
The bad part of overfitting isn't that the model does really well on the test set; it's that it does poorly in the real world. Doing really well on the test set is an indicator of this possibility, not a bad thing in and of itself.
If I absolutely had to choose one, I would take the first, but with trepidation; I'd really want to do more investigation first. What are the differences between the train and test sets that are producing such discrepancies? The two models are both wrong on about 16% of the cases. Are they the same 16% of cases, or are they different? If different, are there any patterns in where the models disagree? Is there a meta-model that can predict better than chance which one is right when they disagree?
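The error-overlap check described above is easy to run directly. A sketch, assuming you have the two models' predictions and the true labels as equal-length sequences (the labels and predictions below are hypothetical, just to show the output shape):

```python
# Partition the cases by which model(s) got them wrong, and find
# the disagreement set where a meta-model might add value.
def compare_errors(y_true, pred_a, pred_b):
    wrong_a = {i for i, (y, p) in enumerate(zip(y_true, pred_a)) if p != y}
    wrong_b = {i for i, (y, p) in enumerate(zip(y_true, pred_b)) if p != y}
    return {
        "both_wrong": wrong_a & wrong_b,
        "only_a_wrong": wrong_a - wrong_b,
        "only_b_wrong": wrong_b - wrong_a,
        "disagree": {i for i in range(len(y_true)) if pred_a[i] != pred_b[i]},
    }

# Hypothetical example data.
y_true = [0, 1, 1, 0, 1, 0]
pred_a = [0, 1, 0, 0, 1, 1]   # wrong on cases 2 and 5
pred_b = [1, 1, 0, 0, 1, 0]   # wrong on cases 0 and 2

report = compare_errors(y_true, pred_a, pred_b)
print(report["both_wrong"])    # {2} -- a shared blind spot
print(report["disagree"])      # {0, 5} -- where a meta-model could help
```

If `both_wrong` covers most of the 16%, the models share the same blind spots and there's little to gain from combining them; if the `only_*_wrong` sets are large and show patterns (by feature, subgroup, etc.), a meta-model over the `disagree` cases becomes worth trying.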