In the paper "Reinforcement learning methods for continuous-time Markov decision problems", the authors give the following update rule for Q-learning applied to semi-Markov decision processes (SMDPs):

$Q^{(k+1)}(x,a) = Q^{(k)}(x,a) + \alpha_k \left[ \frac{1-e^{-\beta \tau}}{\beta} r(x,y,a) + e^{-\beta \tau} \max_{a'} Q^{(k)}(y,a') - Q^{(k)}(x,a) \right]$

where $\alpha_k$ is the learning rate, $\beta$ is the continuous-time discount factor, and $\tau$ is the time taken to transition from state $x$ to state $y$.
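To make sure I am reading the rule correctly, here is a minimal sketch of a single update step as I understand it. The function name `smdp_q_update`, the dictionary-based Q-table, and the example numbers are my own assumptions and not taken from the paper; `r` is simply passed in as a number, since how to obtain it is exactly what I am unsure about.

```python
import math
from collections import defaultdict


def smdp_q_update(Q, x, a, r, y, tau, actions, alpha, beta):
    """Apply one SMDP Q-learning update in place and return the new Q[(x, a)].

    Q       : mapping from (state, action) to value (hypothetical layout)
    r       : sampled reward r(x, y, a), however it is obtained
    tau     : sojourn time of the transition x -> y
    actions : actions available in state y (assumed finite)
    beta    : continuous-time discount parameter, assumed > 0
    """
    discount = math.exp(-beta * tau)          # e^{-beta * tau}
    reward_weight = (1.0 - discount) / beta   # (1 - e^{-beta * tau}) / beta
    best_next = max(Q[(y, b)] for b in actions)
    target = reward_weight * r + discount * best_next
    Q[(x, a)] += alpha * (target - Q[(x, a)])
    return Q[(x, a)]


# Illustrative call with made-up numbers.
Q = defaultdict(float)
new_value = smdp_q_update(Q, x="s0", a=0, r=1.0, y="s1", tau=2.0,
                          actions=[0, 1], alpha=0.5, beta=0.1)
```

Note that the maximum is taken over the actions in the next state $y$, and that the reward is weighted by $(1-e^{-\beta\tau})/\beta$ rather than by $\tau$.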

It is not clear to me how the sampled reward $r(x,y,a)$ relates to the reward rate $\rho(x,a)$ that appears in the objective function $\mathbb{E}\left[ \int_{0}^{\infty} e^{-\beta t}\rho(x(t),a(t)) \, dt \right]$.

In particular, how do they determine $r(x,y,a)$ in the experiments in Section 6? There, they consider a routing problem in an M/M/2 queuing system, where the reward rate is $c_1 n_1(t) + c_2 n_2(t)$. Here $c_1$ and $c_2$ are scalar cost factors, and $n_1(t)$ and $n_2(t)$ are the numbers of customers in queues 1 and 2, respectively.