
In deep Q-learning, we run the algorithm over a number of episodes, and at each step within an episode we take an action and record a reward.

I have a situation where my action is a 2-tuple $a=(a_1,a_2)$. In episode $i$, I first take the first half of the action, $a_1$, once; then, at each step of the episode, I take the second half, $a_2$.

More specifically, suppose episode $i$ has $T$ timesteps. First, I take $a_1(i)$, where the argument $i$ refers to the episode. Then, for each $t_i\in\{1,2,\ldots,T\}$, I take action $a_2(t_i)$. Once I choose $a_2(t_i)$, I get an observation and a reward for the global action $(a_1(i), a_2(t_i))$.
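To make the protocol concrete, here is a minimal Python sketch of the interaction loop. The `ToyEnv` class, its reward rule, and the `choose_a1` / `choose_a2` callables are hypothetical placeholders, not part of the actual problem; the sketch also assumes that some initial observation is available when $a_1$ is picked.

```python
class ToyEnv:
    """Hypothetical environment: a_1 is fixed per episode, a_2 is chosen each step."""

    def __init__(self, horizon=5):
        self.horizon = horizon  # T, the number of timesteps per episode
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # assumed initial observation

    def step(self, a1, a2):
        # Made-up reward rule: 1 when the two halves of the action agree.
        self.t += 1
        reward = 1.0 if a1 == a2 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, choose_a1, choose_a2):
    """Choose a_1 once, then a_2 at every step; collect the transitions."""
    obs = env.reset()
    a1 = choose_a1(obs)  # taken once, at the start of episode i
    transitions = []
    done = False
    while not done:
        a2 = choose_a2(obs, a1)  # taken at every timestep t_i
        next_obs, reward, done = env.step(a1, a2)
        # The reward is for the global action (a_1(i), a_2(t_i)).
        transitions.append((obs, a1, a2, reward, next_obs, done))
        obs = next_obs
    return transitions
```

Each stored transition carries both halves of the action, so a replay buffer built this way would see $a_1$ repeated across all $T$ steps of an episode.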

Is it possible to apply deep Q-learning here? If so, how? Should I apply $\epsilon$-greedy twice, once for $a_1$ and once for $a_2$?

Do you only wait one timestep after taking action $a_1$ before action $a_2$ is executed? Do you have to choose $a_1$ and $a_2$ simultaneously, or are you given an observation of the environment after executing $a_1$ and then allowed to choose $a_2$? – DeepQZero – 2020-06-05T21:11:35.497

At episode $i$, I choose $a_1(i)$. Then, for each time step $t_i$ in episode $i$, I choose $a_2(t_i)$. Only after choosing action $a_2(t_i)$ do I get an observation and receive a reward for my chosen action $(a_1(i), a_2(t_i))$ at $(i, t_i)$. – zdm – 2020-06-05T21:16:18.270

Just to be sure: after you choose action $a = (a_1, a_2)$, does the episode terminate (i.e. you don't choose another action for the remainder of the episode)? – DeepQZero – 2020-06-05T21:21:25.617

No, the episode has, say, $T$ steps. After choosing action $a_1$, I may choose $a_2$ at time $t$ and action $a_2'$ at time $t'$. – zdm – 2020-06-05T21:22:40.477

Are $t$ and $t'$ in the same episode? If they are, it seems to contradict what is stated in the question "then for each step of the episode, I have to take the second half of the action $a_2$." It might help to edit your original question with some of this information. It seems like a very interesting question, and I'm eager to try and solve it. – DeepQZero – 2020-06-05T21:29:56.107

Yes, $t$ and $t'$ are in the same episode. I am editing the question. – zdm – 2020-06-05T21:40:38.330

Can you explain your MDP further? Once you've chosen $(a_1, a_2)$, do you then have to choose another 2-tuple before getting the next state and reward? – David Ireland – 2020-06-05T21:58:08.407

Once I choose $(a_1, a_2)$ at episode $i$ and timestep $t_i$, I get a reward and move to the next state at $t_i+1$ and I have to choose another $a_2$, etc. – zdm – 2020-06-05T22:04:07.643

To me, the fact that you are taking actions at two different time scales suggests a similarity with hierarchical reinforcement learning (HRL) ([1], [2]). Have you considered looking at your problem from this perspective? I do not have deep knowledge of HRL, so I don't know whether applying existing HRL algorithms to your problem would run into issues, but it was the first thing that came to mind when reading your question.

– user5093249 – 2020-06-08T11:00:23.077
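One possible reading of the "two time scales" idea, and of applying $\epsilon$-greedy twice, is sketched below in Python. This is only an illustration under several assumptions, not an established algorithm for this problem: it uses lookup tables instead of a deep network for brevity, the environment `toy_step` and the sizes `N_A1`, `N_A2`, `HORIZON` are made up, the estimate for $a_1$ is updated from the episode return (a Monte Carlo-style choice), and $a_1$ is treated as part of the state when learning values for $a_2$.

```python
import random
from collections import defaultdict

N_A1, N_A2, HORIZON = 2, 2, 4  # hypothetical problem sizes


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random index with probability epsilon, else the argmax index."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda k: q_values[k])


def toy_step(t, a1, a2):
    """Hypothetical dynamics: the state is the step index, reward 1 if a2 == a1."""
    reward = 1.0 if a2 == a1 else 0.0
    return t + 1, reward, t + 1 >= HORIZON


def train(episodes=500, epsilon=0.1, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q_high = [0.0] * N_A1                      # one value per a_1 (slow time scale)
    q_low = defaultdict(lambda: [0.0] * N_A2)  # Q((state, a_1), a_2) (fast time scale)
    for _ in range(episodes):
        # First epsilon-greedy choice: a_1, once per episode.
        a1 = epsilon_greedy(q_high, epsilon, rng)
        state, done, ep_return = 0, False, 0.0
        while not done:
            # Second epsilon-greedy choice: a_2, at every step, conditioned on a_1.
            a2 = epsilon_greedy(q_low[(state, a1)], epsilon, rng)
            next_state, reward, done = toy_step(state, a1, a2)
            # Standard Q-learning target on the augmented state (state, a_1).
            target = reward if done else reward + gamma * max(q_low[(next_state, a1)])
            q_low[(state, a1)][a2] += alpha * (target - q_low[(state, a1)][a2])
            state, ep_return = next_state, ep_return + reward
        # Assumed update rule: move the a_1 estimate toward the episode return.
        q_high[a1] += alpha * (ep_return - q_high[a1])
    return q_high, q_low
```

In a deep variant, `q_low` would presumably become a network taking the state and $a_1$ as input; whether the episode return is the right learning signal for $a_1$ depends on the actual MDP.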