How do reinforcement learning algorithms work without the assumption of (PO)MDPs?
They don't. The theory of reinforcement learning is tied very strongly to an underlying MDP framework. The RNN-based solutions that you are referring to are fully compatible with such an MDP model, and don't even require a POMDP to be useful.
Without the core guarantees of a (PO)MDP model, or something closely equivalent, it is not clear that any learning could occur, with any kind of agent. The MDP model of an environment describes consistent behaviour: predictable to some degree, and random/stochastic otherwise, where the predictable parts make it amenable to at least some optimisation. The split into states, actions, time steps and rewards helps organise the thinking around this. These elements are not strictly necessary for other kinds of policy-search approaches, such as genetic algorithms. However, if you try to break away from anything that would fit a (PO)MDP, you break any other kind of meaningful policy too:
If actions had no consequences, then you could learn the value of being in a particular state, but you could not optimise an agent. This could be modelled as a Markov Reward Process, provided state transitions were not completely random; otherwise, just learning associations between state and reward with a supervised learning approach would be the best you could do.
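To make the Markov Reward Process case concrete, here is a minimal sketch of TD(0) value learning in a toy MRP: there are no actions, so the agent can estimate how good each state is, but has nothing to optimise. The transition probabilities and rewards are invented purely for illustration.

```python
import random

# Hypothetical 3-state Markov Reward Process (no actions).
# P maps state -> list of (next_state, probability); R maps state -> reward.
P = {0: [(0, 0.5), (1, 0.5)],
     1: [(1, 0.2), (2, 0.8)],
     2: [(0, 1.0)]}
R = {0: 0.0, 1: 1.0, 2: 5.0}

def step(s):
    """Sample the next state from the transition distribution of s."""
    r = random.random()
    cum = 0.0
    for s_next, p in P[s]:
        cum += p
        if r < cum:
            return s_next
    return P[s][-1][0]

def td0_values(episodes=200, steps=100, alpha=0.1, gamma=0.9):
    """Learn state values with TD(0); with no actions, prediction is all we can do."""
    V = {s: 0.0 for s in P}
    for _ in range(episodes):
        s = 0
        for _ in range(steps):
            s_next = step(s)
            # TD(0): move V[s] towards the reward plus discounted bootstrap estimate
            V[s] += alpha * (R[s] + gamma * V[s_next] - V[s])
            s = s_next
    return V
```

The learned values reflect the predictable structure of the transitions, which is exactly the part of the MRP guarantees that makes any learning possible here.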
If rewards were not consistently based on any data available to the agent (not even its history), yet were not random either, then there would be no way to learn how to predict or optimise them.
Similarly for state transitions: if they bear no relation to any information known about the environment, current state or history, but are not random, then there is no way to learn about the non-randomness, and no kind of agent could generate a meaningful policy, because the knowledge available is simply not relevant. However, if the current state still influenced which rewards were available to which actions, then a contextual bandit approach might work (plus a supervised learning approach could predict currently available rewards).
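The contextual bandit case can be sketched as follows: the context (current state) arrives independently of the agent's actions, yet still determines which action pays off, so per-context action values can be learned. The contexts, reward function and epsilon-greedy scheme here are all illustrative choices, not part of any particular library.

```python
import random
from collections import defaultdict

def contextual_bandit(contexts, reward_fn, n_actions, rounds=5000, epsilon=0.1):
    """Epsilon-greedy contextual bandit: actions do not influence the next
    context, but the context influences the reward of each action.
    reward_fn(context, action) stands in for the environment."""
    counts = defaultdict(int)    # (context, action) -> number of pulls
    values = defaultdict(float)  # (context, action) -> running mean reward
    total = 0.0
    for _ in range(rounds):
        c = random.choice(contexts)  # context arrives regardless of past actions
        if random.random() < epsilon:
            a = random.randrange(n_actions)                          # explore
        else:
            a = max(range(n_actions), key=lambda act: values[(c, act)])  # exploit
        r = reward_fn(c, a)
        counts[(c, a)] += 1
        # incremental mean update for this (context, action) pair
        values[(c, a)] += (r - values[(c, a)]) / counts[(c, a)]
        total += r
    return values, total / rounds
```

For example, with `reward_fn = lambda c, a: 1.0 if a == c % 2 else 0.0`, the agent quickly learns the best action per context, even though no state transitions are being modelled at all.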
When the information about rewards or state transitions is not directly available, but can be inferred or guessed at least partially from history or context, then you can model this as a POMDP.
One common scenario you can face is that you have some observations of the environment available, but are not sure how to construct a state description that has the Markov property. The velocity of an object might be such a detail, when your observations only give you positions. Technically, a POMDP and this observation/state mismatch are the same basic issue, if you arbitrarily decide that your observation is the state.
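The velocity example can be sketched directly: stacking the last two position observations recovers an approximately Markov state, because their difference encodes velocity. The class name and fixed stack depth of two are illustrative assumptions.

```python
from collections import deque

class PositionToState:
    """Wrap position-only observations into an approximately Markov state
    by keeping the last two positions; their difference encodes velocity."""
    def __init__(self):
        self.history = deque(maxlen=2)

    def observe(self, position):
        self.history.append(position)
        if len(self.history) < 2:
            # Not enough history yet: assume zero velocity at the start
            return (position, 0.0)
        prev, curr = self.history
        velocity = curr - prev  # displacement per time step
        return (curr, velocity)
```

Feeding it positions 0.0, 1.0, 3.0 yields states (0.0, 0.0), (1.0, 1.0), (3.0, 2.0): the second component is the hand-engineered feature that restores the Markov property the raw observation lacked.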
When faced with this mismatch between easily-available observations and a state description based on history that would be more useful, you can either try to engineer useful features, or you can turn to learning models that will infer them. That is where RNNs can come in useful as part of RL: they can help both with observation-to-state mapping and with inferring more complex hidden state variables in POMDPs. Using hidden Markov models to model a "belief state" that augments the observed state is similar to the latter use of RNNs.
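To illustrate the RNN idea at its smallest: a recurrent unit folds the whole observation history into a hidden value that can serve as (part of) a belief state. This toy scalar cell with hand-picked weights is purely a sketch; a real agent would use a learned, vector-valued RNN.

```python
import math

def rnn_belief(observations, w_in=0.8, w_rec=0.5):
    """Fold an observation history into a single recurrent hidden value.
    The scalar weights are arbitrary illustrative constants, not learned."""
    h = 0.0
    for obs in observations:
        # tanh keeps the state bounded while mixing the new observation
        # with the summary of everything seen so far
        h = math.tanh(w_in * obs + w_rec * h)
    return h
```

Two histories that end in the same observation, such as `[1.0, 0.0]` and `[-1.0, 0.0]`, produce different hidden values, so a policy conditioned on this recurrent state can distinguish situations that a memoryless policy acting on the raw observation could not.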