What helped me understand the problem with overfitting was imagining what the most overfit model possible would be. Essentially, it would be a simple look-up table.
You tell the model what attributes each piece of data has, and it simply remembers them and does nothing more. If you give it a piece of data that it's seen before, it looks it up and regurgitates what you told it earlier. If you give it data it hasn't seen before, the outcome is unpredictable or random. But the point of machine learning isn't to tell you what happened; it's to understand the patterns and use those patterns to predict what will happen.
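To make the look-up table idea concrete, here is a minimal sketch in Python. The class name `LookupTableModel`, the toy features, and the "spam"/"ham" labels are all made up for illustration. It memorizes the training pairs and falls back to a random guess for anything it has not seen.

```python
import random


class LookupTableModel:
    """Hypothetical 'most overfit' model: memorizes every training example verbatim."""

    def __init__(self, labels, seed=0):
        self.table = {}
        self.labels = list(labels)
        self.rng = random.Random(seed)  # seeded only so the demo is repeatable

    def fit(self, X, y):
        # "Training" is just storing each (features -> label) pair.
        for features, label in zip(X, y):
            self.table[tuple(features)] = label
        return self

    def predict(self, features):
        key = tuple(features)
        if key in self.table:
            # Seen before: regurgitate the stored answer.
            return self.table[key]
        # Never seen: no pattern was learned, so the answer is a random guess.
        return self.rng.choice(self.labels)


# Made-up toy data, purely illustrative.
X_train = [(1, 0), (0, 1), (1, 1)]
y_train = ["spam", "ham", "spam"]

model = LookupTableModel(labels=["spam", "ham"]).fit(X_train, y_train)

hits = sum(model.predict(x) == y for x, y in zip(X_train, y_train))
train_accuracy = hits / len(X_train)       # perfect on data it has memorized
unseen_prediction = model.predict((0, 0))  # arbitrary on data it has not
```

Training accuracy is perfect by construction, which is exactly why it tells you nothing about how the model will do on new data.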
So think of a decision tree. If you keep growing your decision tree bigger and bigger, eventually you'll wind up with a tree in which every leaf node is based on exactly one data point. You've just found a backdoor way of creating a look-up table.
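Here is a rough sketch of that effect, using a toy regression tree on a single feature. This is a simplified stand-in written for this answer, not any particular library's implementation, and the data points are invented. With no depth limit it keeps splitting until every leaf holds one training point; with a depth limit it is forced to average over several points.

```python
def sse(ys):
    """Sum of squared errors of the values around their mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def grow(points, max_depth=None, depth=0):
    """Grow a tree on (x, y) pairs sorted by x, assuming distinct x values."""
    ys = [y for _, y in points]
    at_limit = max_depth is not None and depth >= max_depth
    if len(points) == 1 or at_limit:
        return {"leaf": True, "value": sum(ys) / len(ys), "n": len(points)}
    # Pick the split position that minimizes the children's total squared error.
    best = min(range(1, len(points)), key=lambda i: sse(ys[:i]) + sse(ys[i:]))
    threshold = (points[best - 1][0] + points[best][0]) / 2
    return {
        "leaf": False,
        "threshold": threshold,
        "left": grow(points[:best], max_depth, depth + 1),
        "right": grow(points[best:], max_depth, depth + 1),
    }


def predict(tree, x):
    """Walk from the root down to a leaf and return that leaf's value."""
    while not tree["leaf"]:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["value"]


def leaves(tree):
    """Collect all leaf nodes of the tree."""
    if tree["leaf"]:
        return [tree]
    return leaves(tree["left"]) + leaves(tree["right"])


# Invented training data: roughly y = 2x plus some noise.
train = [(0, 0.3), (1, 1.6), (2, 4.5), (3, 5.8),
         (4, 8.4), (5, 9.7), (6, 12.2), (7, 13.9)]

full_tree = grow(train)                  # no limit: grows until leaves are single points
shallow_tree = grow(train, max_depth=2)  # limited: at most four leaves

full_leaf_count = len(leaves(full_tree))
shallow_leaf_count = len(leaves(shallow_tree))
memorized = all(predict(full_tree, x) == y for x, y in train)
```

The unrestricted tree ends up with one leaf per training point and reproduces every training target exactly, noise included. That is the look-up table again, just stored as a tree.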
To figure out what might happen in the future, you need a model that captures the general pattern in your training set rather than its individual points. Overfit models do a great job of describing the data you already have, but descriptive models are not necessarily predictive models.
The No Free Lunch Theorem says, roughly, that no model can outperform any other model when performance is averaged over the set of all possible instances. If you want to predict what will come next in the sequence of numbers "2, 4, 16, 32", you can't build a model more accurate than any other unless you assume that there's an underlying pattern. A model that's overfit isn't really evaluating the patterns; it's simply modeling what it knows is possible and giving you back the observations. You get predictive power by assuming that there is some underlying function and that, if you can determine what that function is, you can predict the outcome of events. But if there really is no pattern, then you're out of luck, and all you can hope for is a look-up table to tell you what you know is possible.
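One way to see why the assumption matters: the sketch below takes the four numbers above and fits three different "models" to them. The two pattern rules (alternating multipliers, and a cubic polynomial through the points) are hypotheses I made up for illustration; nothing in the data says either one is right. All three agree perfectly on the observed values, and they disagree about what comes next.

```python
from fractions import Fraction

observed = [2, 4, 16, 32]  # the values at positions 1..4


def alternating(n):
    """Hypothesis A (made up): start at 2, then multiply by 2, 4, 2, 4, ..."""
    value = 2
    for step in range(1, n):
        value *= 2 if step % 2 == 1 else 4
    return value


def cubic(n):
    """Hypothesis B (made up): the cubic through all four points (Lagrange form)."""
    xs = range(1, len(observed) + 1)
    total = Fraction(0)
    for xi, yi in zip(xs, observed):
        term = Fraction(yi)
        for xj in xs:
            if xj != xi:
                term *= Fraction(n - xj, xi - xj)
        total += term
    return total


def lookup(n):
    """The overfit 'model': knows the observed positions and nothing else."""
    table = dict(enumerate(observed, start=1))
    return table.get(n)  # None for any position it has not seen


positions = range(1, len(observed) + 1)
all_fit = all(
    alternating(n) == cubic(n) == lookup(n) == observed[n - 1]
    for n in positions
)

next_alternating = alternating(5)  # 128
next_cubic = cubic(5)              # 46
next_lookup = lookup(5)            # None: the table has no opinion
```

Every model has zero error on the data you have. Which one predicts the fifth number best depends entirely on which assumption about the underlying function happens to be true, and the look-up table doesn't make one at all.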