
I am trying to model operational decisions in inventory control. The control policy is base stock with a fixed stock level $S$: a replenishment order is placed at every demand arrival to raise the inventory position back to $S$. Replenishments arrive after a constant lead time $L$. There is an upper limit $D$ on the allowed stock-out time, which is measured every $T$ periods; if the limit is exceeded, a penalty cost $C_p$ is incurred. The system behaves much like an M/G/S queue: the stock-out time can be thought of as the customer waiting time when all servers are busy. So every $R$ periods ($R < T$), the inventory level and the pipeline of outstanding orders are monitored, and a decision is taken about whether or not to expedite an outstanding order (at a cost $C_e$), in order to control the waiting/stock-out time and minimize total cost.

I feel it is a time- and state-dependent problem and would like to use $Q$-learning to solve this MDP. The time period $T$ is typically a quarter, i.e. 3 months, and I plan to simulate demand as Poisson arrivals. My apprehension is whether simulating arrivals would help to evaluate the Q-values, because the simulation covers such a short period. Am I not overestimating the Q-values this way? I request some help on how I should proceed with the implementation.
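To make the intended setup concrete, a tabular Q-learning sketch might look like the following. Every number, the lost-sales handling, the immediate-arrival effect of expediting, and the state encoding (on-hand stock, pipeline size, stock-out count, time step) are illustrative assumptions, not part of the actual model:

```python
import math
import random
from collections import defaultdict

# All numbers below are illustrative placeholders, not values from the question.
S = 5            # base-stock level
L = 3            # replenishment lead time, measured in review intervals R
T_STEPS = 12     # number of review intervals R inside one measurement period T
D_LIMIT = 2      # allowed number of stock-out intervals per period T
C_E, C_P = 10.0, 100.0   # expedite cost and period-end penalty cost
LAM = 1.0        # Poisson demand rate per review interval
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def poisson(lam):
    """Sample a Poisson variate with Knuth's method (no numpy needed)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

Q = defaultdict(float)   # Q[(state, action)]; actions: 0 = wait, 1 = expedite

def run_episode():
    """Simulate one period T (one episode), updating Q online; return total cost."""
    on_hand, pipeline, stockouts = S, [], 0
    total_cost = 0.0
    for t in range(T_STEPS):
        state = (on_hand, len(pipeline), stockouts, t)
        # epsilon-greedy action selection
        if random.random() < EPS:
            action = random.randint(0, 1)
        else:
            action = max((0, 1), key=lambda a: Q[(state, a)])
        cost = 0.0
        if action == 1 and pipeline:
            on_hand += len(pipeline)   # assume expedited orders arrive immediately
            pipeline = []
            cost += C_E
        # regular replenishments: age the pipeline and receive orders that are due
        pipeline = [r - 1 for r in pipeline]
        on_hand += sum(1 for r in pipeline if r == 0)
        pipeline = [r for r in pipeline if r > 0]
        # demand arrives; base stock: one order is placed per unit demanded
        d = poisson(LAM)
        if d > on_hand:
            stockouts += 1             # count this interval as a stock-out interval
        on_hand = max(on_hand - d, 0)  # simplified lost-sales handling
        pipeline += [L] * d
        if t == T_STEPS - 1 and stockouts > D_LIMIT:
            cost += C_P                # penalty if the stock-out limit D is exceeded
        total_cost += cost
        next_state = (on_hand, len(pipeline), stockouts, t + 1)
        best_next = 0.0 if t == T_STEPS - 1 else max(Q[(next_state, a)] for a in (0, 1))
        # Q-learning update on the reward (negative cost)
        Q[(state, action)] += ALPHA * (-cost + GAMMA * best_next - Q[(state, action)])
    return total_cost
```

Training would then call `run_episode()` many times, so the short length of a single period $T$ is not itself a problem; the Q-values are estimated across many simulated periods rather than within one.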

I have edited the question. I think the question earlier reflected my confused state of mind; I hope there is some clarity now. Basically, I am having trouble understanding how I should implement Q-learning for my problem! – ranya – 2019-07-20T10:35:29.793

@ranya: Yes, it is clearer now, although for those of us not familiar with your stock-control problem, it may help to give still more of your MDP model and experiment design. I am not clear how you end up with a short simulated time period, for instance - surely you can just run a simulation that includes very many periods of length $T$, in theory simulating thousands of years of stock management, in order to learn the value functions? – Neil Slater – 2019-07-20T10:42:07.527

@NeilSlater Thank you. I understand that I need to simulate a much longer horizon to learn the value function. But the stock-out time is measured only every $R$ periods (typically a quarter), and the counter is reset at every $R$ interval within $T$ (typically a year). So I was doubting whether the simulation would in essence become repetitions of only the $R$ period, which is small. – ranya – 2019-07-20T11:22:21.333

I'm not 100% sure from your description, but I think that makes each $R$ an episode, and you will want to simulate very many episodes. Whether this is true depends on whether you also reset stock levels at every $R$ based on some fixed rule unrelated to your stock-control agent's job. If $R$ is just an event that measures and manages ongoing stock while the process continues, then that's different - you will still want to allow for it in your model, but it is essentially part of the environment that the agent needs to learn (making your simulations, and achieving your goals, more complex). – Neil Slater – 2019-07-20T11:33:03.023