What does the term $|\mathcal{A}(s)|$ mean in the $\epsilon$-greedy policy?


I've been looking online for a while for a source that explains these computations, but I can't find anywhere what $|\mathcal{A}(s)|$ means. I guess $\mathcal{A}$ is the action set, but I'm not sure about that notation:

$$\frac{\varepsilon}{|\mathcal{A}(s)|} \sum_{a} Q^{\pi}(s, a)+(1-\varepsilon) \max _{a} Q^{\pi}(s, a)$$

Here is the source of the formula.

I also want to clarify that I understand the idea behind the $\epsilon$-greedy approach and the motivation behind on-policy methods. I just had a problem understanding this notation (and some other minor things). The author omitted some steps, so it felt like there was a continuity jump, which is why I didn't get the notation. I'd be more than glad to be pointed towards a source where this is covered in more detail.


Posted 2020-07-14T20:11:35.197

Reputation: 93

Where did you take that formula from? And what is it supposed to represent? The action selection mechanism? Normally, epsilon-greedy simply means that with probability epsilon you choose a random action instead of the greedily selected (i.e. best possible) action. – Daniel B. – 2020-07-14T20:19:52.480

Sorry for that. Here's the source of the formula: http://www.incompleteideas.net/book/first/ebook/node54.html.

It also shows up in the practical implementation of the epsilon-greedy algorithm (bottom of the page).

– Metrician – 2020-07-14T20:28:47.227

From the pseudocode, it is pretty clear that $\mathcal{A}(s)$ refers to the set of all possible actions, since in step c) the algorithm iterates through all actions $a$ taken from that set. That it is about actions is apparent from the use of $a$. – Daniel B. – 2020-07-14T20:34:53.303

Yes, I realize I was asking more about the notation $|\mathcal{A}(s)|$ specifically, but I get it now. Thanks. – Metrician – 2020-07-14T20:44:04.983



This expression: $|\mathcal{A}(s)|$ means

  • $|\quad|$ the size of

  • $\mathcal{A}(s)$ the set of actions in state $s$

or, more simply, the number of actions allowed in that state.

This makes sense in the given formula because $\frac{\epsilon}{|\mathcal{A}(s)|}$ is then the probability of taking each individual exploratory action under an $\epsilon$-greedy policy. The overall expression is the expected return when following that policy, summing the contributions of the exploratory and greedy actions.
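As a small numerical check (with made-up Q-values and $\epsilon$, just for illustration), you can build the per-action probabilities of an $\epsilon$-greedy policy and confirm that they sum to 1 and that the dot product with the Q-values reproduces the formula above:

```python
import numpy as np

# Hypothetical Q-values for one state with |A(s)| = 4 actions.
Q = np.array([1.0, 3.0, 2.0, 0.5])
epsilon = 0.1
n_actions = len(Q)  # |A(s)|

# Per-action probabilities under an epsilon-greedy policy:
# every action gets epsilon / |A(s)|, and the greedy action
# additionally receives the remaining (1 - epsilon) mass,
# so its total probability is 1 - epsilon + epsilon / |A(s)|.
probs = np.full(n_actions, epsilon / n_actions)
probs[np.argmax(Q)] += 1 - epsilon

assert np.isclose(probs.sum(), 1.0)  # a valid probability distribution

# Expected action value of the policy, matching the formula:
# (eps / |A(s)|) * sum_a Q(s,a)  +  (1 - eps) * max_a Q(s,a)
expected = (epsilon / n_actions) * Q.sum() + (1 - epsilon) * Q.max()
assert np.isclose(expected, (probs * Q).sum())
```

Note how the greedy action appears in both terms: it contributes $\frac{\epsilon}{|\mathcal{A}(s)|}$ through the sum and $(1-\epsilon)$ through the max, which is why the two terms together account for probability 1.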

Neil Slater

Posted 2020-07-14T20:11:35.197

Reputation: 14 632

Thanks for the answer! I suspected that, but I was reading deeper into it. Can you recommend anything that delves deeper into what motivates structuring that formula as $\frac{\epsilon}{|\mathcal{A}(s)|}$ and $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}$? – Metrician – 2020-07-14T20:41:23.603

@Metrician I added an explanation for the formula. I expect it is being broken apart to help with an expansion or substitution. – Neil Slater – 2020-07-14T20:53:28.683

But shouldn't it simply be $1 - \frac{\epsilon}{|\mathcal{A}(s)|}$ (instead of $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}$) so that it sums to 1 (by definition of a probability) when added to its complement $\frac{\epsilon}{|\mathcal{A}(s)|}$? – Metrician – 2020-07-14T21:01:34.927

@Metrician: It does add up to 1, because the first term includes all $a$ values, including the maximising one, so the probability mass of the maximising action is split across both parts. – Neil Slater – 2020-07-14T21:09:51.343

Ah okay, I get it now. Can you please point me towards a source that has detailed proofs of these expressions? I've been looking everywhere for them. – Metrician – 2020-07-14T21:15:36.027

The book you are reading has the proofs - it seems you are partway through one of them, in fact. It should refer to the earlier results it is using - the id (5.2) in the text is a reference to where that result was proven in the same book. The rest of the lines are typically just re-arrangements of terms. If you have jumped straight into that page from a search, I suggest you read the whole book; it's a very good introduction to RL. – Neil Slater – 2020-07-14T21:18:40.667

Okay, thanks a lot! – Metrician – 2020-07-14T21:25:14.240