> The output layer of the network contains one unit, telling me the Q value of the provided state with the assumption that the action taken in that state will be determined by the policy.

Typically in Reinforcement Learning, the symbol $Q$ is used when you calculate an *action value*, and if you are evaluating for a specific policy, it is denoted $q_{\pi}(s,a)$ where $\pi$ is the policy, $s$ is the current state, and $a$ the action to be taken.

What you appear to be calculating with your network is not the action value, but the *state value*, $v_{\pi}(s)$. Note that $v_{\pi}(s) = q_{\pi}(s,\pi(s))$ when you have a deterministic policy, and there is a similar relationship for stochastic policies.
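For a stochastic policy $\pi(a|s)$, the relationship is an expectation over the action distribution:

```latex
$$v_{\pi}(s) = \sum_a \pi(a|s)\, q_{\pi}(s,a)$$
```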

With your setup, it would be possible to learn the value of any fixed policy that you provided as input. It is also *possible* to drive policy improvements, with some caveats, but that would not be Q learning. The rest of this answer assumes you want to implement Q learning, and probably something like Deep Q Networks (DQN).

> However, I don't know what policy would make the agent learn to play the best. I can't just make it choose random actions because that would make the policy non-stationary.

Actually that would make the policy *stochastic*, not non-stationary. Non-stationary means that the policy would change over time. Choosing random actions is a perfectly valid stationary policy (provided the probabilities in each state remain the same).

In addition, in an optimal control scenario you actually will have, and *want*, a non-stationary policy. The goal is to start with some poor guess at a policy, like completely random behaviour, and improve it based on experience. The policy is going to change over time. That makes learning Q values in RL a non-stationary problem.

Typically you really do start control problems with a random policy. At least in safe environments such as games and simulations.

> I'm trying to train an agent to play it at a reasonable skill level using TFLearn. What can I do?

First, modify your network so that it estimates action values $Q(s,a)$. There are two ways to do that:

Take the action $a$ as an input, e.g. one hot encoded, concatenated with the state $s$ to make the complete inputs to the neural network. This is the simplest approach conceptually.

Assign a number in range $[0,N_{actions})$ to each action. Change the network to output a value for each possible action, so that your estimate $\hat{q}(s,a)$ is obtained by looking at the output indexed by $a$. This is a common choice because it is more efficient for selecting best actions to drive the policy later.
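As an illustrative sketch of the two arrangements (the "network" here is just a toy linear map, and `N_ACTIONS`/`STATE_DIM` are assumed values, not taken from your question):

```python
import numpy as np

N_ACTIONS = 2  # assumed: e.g. flap / do nothing
STATE_DIM = 4  # assumed state feature size

def encode_state_action(state, action):
    """Form 1: concatenate the state with a one-hot action vector.

    The combined vector (length STATE_DIM + N_ACTIONS) is the NN input,
    and the network has a single Q-value output."""
    one_hot = np.zeros(N_ACTIONS)
    one_hot[action] = 1.0
    return np.concatenate([state, one_hot])

def q_values_all_actions(state, weights, bias):
    """Form 2: the state alone is the input, and the network has one
    output per action (here a stand-in linear layer)."""
    return state @ weights + bias  # shape (N_ACTIONS,)

state = np.array([0.1, -0.2, 0.3, 0.0])
x = encode_state_action(state, 1)  # length 6, one-hot at the end
```

With form 2, $\hat{q}(s,a)$ is simply `q_values_all_actions(s, W, b)[a]`.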

Using action values for $Q(s,a)$, not state values $V(s)$, is an important part of Q learning. Technically it is possible to use state values if you have a model of the environment, and you could add one here. But it would not really be Q learning in that case, but some other related version of Temporal Difference (TD) learning.

Once you start working with action values, then you have a way to determine a policy. Your best estimate of the optimal policy is to take the action in state $s$ with the highest action value. This is:

$$\pi(s) = \text{argmax}_a \hat{q}(s,a)$$

So run your neural network for the current state and all actions. With the first form of NN above this means running a minibatch; with the second form you run it once, which is more efficient (but constructing the training data later is more complex).

If you always take the best action, this is called acting *greedily*. Usually in Q learning, you want the agent to explore other possibilities instead of always acting the same way - because you need to know whether changing a policy would be better. A very common approach in DQN is to act $\epsilon$-greedily, which means take this greedy action by default, but with some probability $\epsilon$ take a completely random action. Usually $\epsilon$ starts high, often at $1$, and then is decayed relatively quickly down to some lower value e.g. $0.1$ or $0.01$, where it stays during learning. You set $\epsilon = 0$ for fully greedy behaviour when evaluating or using the policy in production.
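A minimal sketch of $\epsilon$-greedy action selection with a linear decay schedule (the decay constants and function names here are assumptions for illustration, not part of any specific library):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a uniform random action,
    otherwise take the greedy (argmax) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(step, start=1.0, end=0.1, decay_steps=10_000):
    """Linearly decay epsilon from start to end over decay_steps,
    then hold it constant at end."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

Setting `epsilon=0.0` recovers the fully greedy policy used at evaluation time.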

So that has set your policy, as per your question. There is still more to implement in DQN, summarised below:

Unfortunately, you cannot train the agent directly online on each step and expect good results in practice. DQN requires *experience replay* as the source of training data. For each step of experience, you should store the start state, action, immediate reward, end state, and whether this was the last step in an episode ($s,a,r,s',done$).
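A simple replay buffer could be sketched like this (the capacity and class name are assumptions; a bounded `deque` silently drops the oldest experience when full):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        # Oldest experience is discarded automatically once full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random minibatch of (s, a, r, s', done) tuples
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```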

To train the network (and thereby improve the Q estimates and the policy), after each step you use the Bellman equation for the optimal policy as a target, $q^{*}(s,a) = r + \gamma \text{max}_{a'} q^{*}(s',a')$:

- Construct a minibatch of training data from the experience replay table, using your NN to calculate the TD target for each item, $\hat{g} = r + \gamma \text{max}_{a'} \hat{q}(s',a')$ (or simply $\hat{g} = r$ when $done$ is true)
- Train your NN towards $\hat{q}(s,a) \rightarrow \hat{g}$ on this minibatch.

- It is usual to use a separate NN to calculate the TD target, and every so many steps (e.g. 1000 steps) make this "target network" a clone of the current learning network. It helps with stability, but you may not need that for a simple environment like Flappy Bird.
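The target construction above could be sketched as follows (here `q_target_fn` is a stand-in for the target network's forward pass over all actions in $s'$, and `GAMMA` is an assumed discount factor):

```python
import numpy as np

GAMMA = 0.99  # assumed discount factor

def td_targets(batch, q_target_fn):
    """Compute r + GAMMA * max_a' q(s', a') for each experience tuple,
    falling back to just r on terminal steps (done=True)."""
    targets = []
    for s, a, r, s_next, done in batch:
        if done:
            targets.append(r)
        else:
            targets.append(r + GAMMA * np.max(q_target_fn(s_next)))
    return np.array(targets)
```

Every so many steps (e.g. 1000) you would copy the learning network's weights into the network behind `q_target_fn` to refresh the target.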

This answer has already got quite long, so if you need more details on these last parts then you can search for existing answers here or please ask a new question about whichever detail is not making sense. – Neil Slater – 2019-07-14T05:40:35.307