I'm going to assume here that you're using the standard, simple variant of $Q$-learning: tabular $Q$-learning, where all of the state-action pairs for which you're learning $Q(s, a)$ values are represented in a table. For example, if you have 4 actions, your $Q(s, a)$ values are likely represented by 4 matrices (one per action), where every matrix has the same dimensionality as your maze (I'm assuming here that your maze is a grid of discrete cells).
With such an approach, you are learning $Q$ values separately for every single individual state-action pair. Such learned values will only ever be valid for one particular maze (the one you have been training in), as you seem to have already noticed. This is a direct consequence of learning individual values for specific state-action pairs: the learned $Q$ values cannot be transferred directly to a different maze, because those particular states from the first maze do not even exist in the second maze!
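To make the tabular setup concrete, here is a minimal sketch of that representation and the standard $Q$-learning update. The maze dimensions, reward, and hyperparameter values are all placeholder assumptions for illustration:

```python
import numpy as np

# Hypothetical 10x10 grid maze with 4 actions (up, right, down, left).
HEIGHT, WIDTH, NUM_ACTIONS = 10, 10, 4

# One entry per (row, col, action): effectively 4 matrices, each the
# same shape as the maze, exactly as described above.
Q = np.zeros((HEIGHT, WIDTH, NUM_ACTIONS))

ALPHA, GAMMA = 0.1, 0.9  # assumed learning rate and discount factor

def q_update(state, action, reward, next_state):
    """Standard tabular Q-learning update for a single transition."""
    r, c = state
    nr, nc = next_state
    td_target = reward + GAMMA * Q[nr, nc].max()
    Q[r, c, action] += ALPHA * (td_target - Q[r, c, action])

# Example transition: action 1 (right) from (0, 0) to (0, 1), reward -1.
q_update((0, 0), 1, -1.0, (0, 1))
```

Note that the update only ever touches the entry for the exact cell visited, which is precisely why nothing learned here carries over to a maze with a different layout.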
Better results may be achievable with different state representations. Instead of representing states by their coordinates in a grid (as you would likely do with a tabular approach), you'd want to describe states in a more general way. For example, a state could be described by features such as:
- Is there a wall right in front of me?
- Is there a wall immediately to my right?
- Is there a wall immediately to my left?
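A feature extractor for those three wall indicators could look something like the sketch below. The maze encoding (a 2D list with `1` for walls) and the agent's heading convention are assumptions for illustration:

```python
# Movement offsets (row delta, col delta) for headings
# 0=up, 1=right, 2=down, 3=left (assumed convention).
DELTAS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def is_wall(maze, r, c):
    """Treat out-of-bounds cells as walls."""
    if r < 0 or c < 0 or r >= len(maze) or c >= len(maze[0]):
        return 1
    return maze[r][c]

def features(maze, r, c, heading):
    """[wall ahead?, wall to the right?, wall to the left?],
    all relative to the agent's current heading."""
    ahead = DELTAS[heading]
    right = DELTAS[(heading + 1) % 4]
    left = DELTAS[(heading - 1) % 4]
    return [is_wall(maze, r + dr, c + dc) for dr, dc in (ahead, right, left)]

maze = [[0, 1],
        [0, 0]]
# Agent at (1, 0) facing up: open ahead, open to the right,
# out of bounds (a wall) to the left.
print(features(maze, 1, 0, 0))  # -> [0, 0, 1]
```

The key property is that the same feature vector can occur in many different mazes, so whatever is learned about it transfers.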
An alternative that may also generalize to some extent is pixel-based input, if you have images (a top-down view or even a first-person view).
When states are represented by such features, you can no longer use the tabular RL algorithms you're likely familiar with, though. You'd have to use function approximation instead. With good state representations, those techniques have a chance of generalizing. You'd probably also want to use a variety of different mazes during training; otherwise they'd likely still overfit to the single maze used in training.
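The simplest form of function approximation is a linear one: one weight vector per action, with $Q(s, a)$ computed as a dot product between the weights for $a$ and the feature vector of $s$. A rough sketch of the semi-gradient $Q$-learning update in that setting (feature count, rewards, and hyperparameters are again placeholder assumptions):

```python
import numpy as np

NUM_ACTIONS, NUM_FEATURES = 4, 3  # e.g. the three wall indicators
ALPHA, GAMMA = 0.1, 0.9

# One weight vector per action; Q(s, a) = w[a] . phi(s).
w = np.zeros((NUM_ACTIONS, NUM_FEATURES))

def q_value(phi, action):
    return w[action] @ phi

def update(phi, action, reward, next_phi, done):
    """Semi-gradient Q-learning update on the weights."""
    target = reward if done else reward + GAMMA * max(
        q_value(next_phi, a) for a in range(NUM_ACTIONS))
    td_error = target - q_value(phi, action)
    w[action] += ALPHA * td_error * phi

# Example transition between two feature vectors (maze-independent).
phi = np.array([1.0, 0.0, 1.0])       # wall ahead and to the left
next_phi = np.array([0.0, 1.0, 0.0])  # wall to the right only
update(phi, 2, -1.0, next_phi, done=False)
```

Because the weights are tied to features rather than to grid cells, every update affects all states sharing those features, across every maze you train on.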