Should I update action value functions when there is no change?


Suppose a decision-maker wants to recommend products to each customer who visits a website. The time interval between two consecutive visits is a random variable: sometimes it is one time unit and sometimes several, as in a semi-Markov decision process (SMDP). When a customer visits the website but makes no purchase, there is no reward. Should I update the $Q$ function even when there is no reward? That is, should I set $R_{t+1}=0$ and apply an update rule such as Q-learning as usual? Moreover, if I use eligibility traces and $\tau$ denotes the number of time units between two consecutive arrivals, should I update the eligibility trace as $E(s,a)=\gamma^\tau \lambda E(s,a)+1$?
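To make the question concrete, here is a minimal sketch of the update I have in mind. It assumes a tabular setting with integer state and action indices, and it is a naive Q($\lambda$) variant that does not reset traces after exploratory actions. The function name `smdp_q_lambda_step` and the default values of `alpha`, `gamma`, and `lam` are hypothetical and chosen only for illustration.

```python
import numpy as np

def smdp_q_lambda_step(Q, E, s, a, r, s_next, tau,
                       alpha=0.1, gamma=0.95, lam=0.8):
    """One tabular update after a visit (illustrative sketch, not a reference implementation).

    Q, E   : float arrays of shape (n_states, n_actions), modified in place
    r      : reward for this visit; 0.0 when the customer buys nothing
    tau    : time units elapsed until the next visit (assumed observed)
    """
    discount = gamma ** tau                      # discount over tau time units
    # Q-learning style TD error; it can be nonzero even when r == 0
    delta = r + discount * Q[s_next].max() - Q[s, a]
    E[s, a] += 1.0                               # accumulating trace for the visited pair
    Q += alpha * delta * E                       # update every pair in proportion to its trace
    E *= discount * lam                          # decay all traces by gamma^tau * lambda
    return delta
```

Incrementing the trace before the update and decaying it afterwards is equivalent to the recursion $E(s,a)=\gamma^\tau \lambda E(s,a)+1$ for the visited pair, with $\tau$ taken from the preceding transition. In this sketch a visit with $R_{t+1}=0$ still changes $Q$ whenever the discounted value of the next state differs from the current estimate.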



Posted 2020-05-31T00:18:50.067


No answers