1

In reinforcement learning, there are the concepts of stochastic (or probabilistic) and deterministic policies. What is the difference between them?

4

A **deterministic policy** is a function of the form $\pi_{\mathbb{d}}: S \rightarrow A$, that is, a function from the set of states of the environment, $S$, to the set of actions, $A$. The subscript $_{\mathbb{d}}$ only indicates that this is a ${\mathbb{d}}$eterministic policy.

For example, in a grid world, the set of states of the environment, $S$, is composed of the cells of the grid, and the set of actions, $A$, is composed of the actions "left", "right", "up" and "down". Given a state $s \in S$, $\pi_{\mathbb{d}}(s)$ is always the same action (e.g. "up"), unless the policy itself changes.
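As a minimal sketch of the grid-world example, a deterministic policy can be stored as a plain lookup table. The 2x2 grid, the cell coordinates and the chosen actions below are assumptions made only for illustration.

```python
# Hypothetical 2x2 grid world: states are (row, col) cells.
ACTIONS = ("left", "right", "up", "down")

# A deterministic policy maps each state to exactly one action.
deterministic_policy = {
    (0, 0): "right",
    (0, 1): "down",
    (1, 0): "up",
    (1, 1): "left",
}

def act(state):
    """Return the single action the policy prescribes for `state`."""
    return deterministic_policy[state]

# Querying the same state repeatedly always yields the same action.
actions_seen = {act((0, 0)) for _ in range(100)}
print(actions_seen)  # {'right'}
```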

A **stochastic policy** is often represented as a *family* of conditional probability distributions, $\pi_{\mathbb{s}}(A \mid S)$, from the set of states, $S$, to the set of actions, $A$. A probability distribution is a function that assigns a probability for each event (in this case, the events are actions in certain states) and such that the sum of all the probabilities is $1$.

A stochastic policy is a family and not just one conditional probability distribution because, for a fixed state $s \in S$, $\pi_{\mathbb{s}}(A \mid S = s)$ is a possibly distinct conditional probability distribution. In other words, $\pi_{\mathbb{s}}(A \mid S) = \{ \pi_{\mathbb{s}}(A \mid S = s_1), \dots, \pi_{\mathbb{s}}(A \mid S = s_{|S|})\}$, where $\pi_{\mathbb{s}}(A \mid S = s)$ is a conditional probability distribution over actions given that the state is $s \in S$ and $|S|$ is the size of the set of states of the environment.
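The "family of distributions" view can be sketched as one dictionary of action probabilities per state. The state names `s1` and `s2` and the probability values are illustrative assumptions, not part of the definition.

```python
import random

ACTIONS = ("left", "right", "up", "down")

# One conditional distribution over actions per state (values are illustrative).
stochastic_policy = {
    "s1": {"left": 0.1, "right": 0.6, "up": 0.2, "down": 0.1},
    "s2": {"left": 0.25, "right": 0.25, "up": 0.25, "down": 0.25},
}

def sample_action(state, rng=random):
    """Draw an action from the distribution pi(A | S = state)."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# Each member of the family must sum to 1.
for state, dist in stochastic_policy.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

print(sample_action("s1"))  # may differ between calls
```

Here `stochastic_policy["s1"]` plays the role of $\pi_{\mathbb{s}}(A \mid S = s_1)$, while `stochastic_policy["s1"]["right"]` is the single number $\pi_{\mathbb{s}}(a \mid s)$ for one action and one state.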

Often, in the reinforcement learning context, a stochastic policy is misleadingly denoted by $\pi_{\mathbb{s}}(a \mid s)$, where $a \in A$ and $s \in S$ are respectively a specific action and state, so $\pi_{\mathbb{s}}(a \mid s)$ is just a number and not a conditional probability distribution. A single conditional probability distribution can be denoted by $\pi_{\mathbb{s}}(A \mid S = s)$, for some fixed state $s \in S$. However, $\pi_{\mathbb{s}}(a \mid s)$ can also denote a family of conditional probability distributions, that is, $\pi_{\mathbb{s}}(A \mid S) = \pi_{\mathbb{s}}(a \mid s)$, if $a$ and $s$ are arbitrary.

In the particular case of games of chance (e.g. poker), where there are sources of randomness, a deterministic policy might not always be appropriate. For example, in poker, not all information (e.g. the cards of the other players) is available. In those circumstances, the agent might decide to play differently in the same situation depending on the round (time step). More concretely, whenever it has a hand with two aces and there are two uncovered aces on the table, the agent could decide to go "all in" $\frac{2}{3}$ of the time and to just "raise" the remaining $\frac{1}{3}$ of the time.
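The poker example can be sketched as a two-outcome random choice. The function name `play_strong_hand` and the fixed seed are assumptions for the sake of a repeatable illustration.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed only so the run is repeatable

def play_strong_hand(rng):
    """Go "all in" with probability 2/3, otherwise "raise"."""
    return "all in" if rng.random() < 2 / 3 else "raise"

n_rounds = 30_000
counts = Counter(play_strong_hand(rng) for _ in range(n_rounds))
print(counts["all in"] / n_rounds)  # roughly 2/3
```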

A deterministic policy can be interpreted as a stochastic policy that gives the probability of $1$ to one of the available actions (and $0$ to the remaining actions), for each state.
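This interpretation can be sketched as a conversion from a state-to-action map into a state-to-distribution map. The helper name `as_stochastic` and the states `s1` and `s2` are hypothetical.

```python
ACTIONS = ("left", "right", "up", "down")

def as_stochastic(deterministic_policy, actions=ACTIONS):
    """Turn a state -> action map into a state -> {action: probability} map."""
    return {
        state: {a: 1.0 if a == chosen else 0.0 for a in actions}
        for state, chosen in deterministic_policy.items()
    }

pi_d = {"s1": "up", "s2": "left"}
pi_s = as_stochastic(pi_d)
print(pi_s["s1"])  # {'left': 0.0, 'right': 0.0, 'up': 1.0, 'down': 0.0}
```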

0

A deterministic policy means that for every state there is one clearly defined action that you will take.

**For Example:** We 100% know we will take action **A** from state **X**.

A stochastic policy means that for every state you do not have one clearly defined action to take; instead, you have a probability distribution over the actions available in that state.

**For example:** there is a 10% chance of taking action **A** from state **S**, a 20% chance of taking action **B** from state **S**, and a 70% chance of taking action **C** from state **S**. This means we don't have one clearly defined action to take, only a probability of taking each action.
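As a rough check of this example, one can sample many actions in state **S** and compare the observed frequencies with the stated percentages. The seed and the number of draws are arbitrary choices.

```python
import random
from collections import Counter

rng = random.Random(42)  # seeded only for repeatability

# Action probabilities in state S, taken from the example above.
policy_in_S = {"A": 0.10, "B": 0.20, "C": 0.70}

draws = rng.choices(list(policy_in_S), weights=list(policy_in_S.values()), k=10_000)
frequencies = {a: n / len(draws) for a, n in Counter(draws).items()}
print(frequencies)  # roughly 0.1 for A, 0.2 for B, 0.7 for C
```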

How does this add something new apart from the info in my answer? – nbro – 2020-05-06T10:56:30.823

I just gave simple understanding you or other people looking for simple answer can get help from this content. because for me it took lots of time to understand simple meaning that is why i shared my understanding. – Mr. Laeeq Khan – 2020-05-06T16:09:34.033

0

Apart from the answers above,

*Stochastic Policy function*: $\pi (s_1s_2 \dots s_n, a_1 a_2 \dots a_n): \mathcal S \times \mathcal A \rightarrow [0,1]$ is a probability distribution function that gives the probability that the action sequence $a_1a_2 \dots a_n$ is chosen given the state sequence $s_1 s_2 \dots s_n$ [2][3].

In a *Markov Decision Process (MDP)*, this reduces to $\pi (s, a)$ because of the Markov assumption [1]:
$$ \mathbb P(\omega_{t+1} \mid \omega_t, a_t) = \mathbb P(\omega_{t+1} \mid \omega_t, a_t, \dots, \omega_0, a_0)$$
where $\omega \in \Omega$ and $\Omega$ is the set of observations, while $\mathcal A$ and $\mathcal S$ denote the sets of actions and states, respectively. Since the next observation depends only on the present state and action and not on the past, the policy function only needs the present state and action as parameters.

The next action is chosen as[2]: $$ a^* = \arg \max_a \pi(s_{t+1}, a) \quad\forall a \in \mathcal A $$

*Deterministic Policy function* [3]: a special case of the stochastic policy function where, for a particular $a_o \in \mathcal A$, $\pi(s, a_n) = \delta^o_n$ for all $a_n \in \mathcal A$, where $\delta$ is the Kronecker delta. In other words, in an arbitrary state $s$ we are certain to choose the particular action $a_o$ and no other. Since all of the probability mass sits on a single action, the policy is often written in the form $\pi(s): \mathcal S \rightarrow \mathcal A$, a function that takes an arbitrary state $s$ and maps it to the action $a$ that is chosen with probability $1$.

The **Stochastic Policy function** should not be confused with the **Transition Function** [2] (which is also a probability distribution function), $T(s_t, a_t, s_{t+1}): \mathcal S \times \mathcal A \times \mathcal S \rightarrow [0, 1]$, which gives the probability that taking action $a_t$ in state $s_t$ leads to the next state $s_{t+1}$.
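The distinction can be sketched with a toy two-state MDP in which the policy and the transition function are separate tables, sampled one after the other. All state names, action names and probabilities below are made up for illustration.

```python
import random

rng = random.Random(1)  # seeded only for repeatability

# Policy pi(s, a): probability of choosing action a in state s.
policy = {
    "s0": {"go": 0.8, "stay": 0.2},
    "s1": {"go": 0.5, "stay": 0.5},
}

# Transition function T(s, a, s'): probability of landing in s' after a in s.
transition = {
    ("s0", "go"): {"s0": 0.1, "s1": 0.9},
    ("s0", "stay"): {"s0": 1.0, "s1": 0.0},
    ("s1", "go"): {"s0": 0.9, "s1": 0.1},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
}

def draw(dist, rng):
    """Sample one key of `dist` in proportion to its probability."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

def step(state, rng):
    """Sample an action from the policy, then a next state from T."""
    action = draw(policy[state], rng)                    # the agent's randomness
    next_state = draw(transition[(state, action)], rng)  # the environment's randomness
    return action, next_state

state = "s0"
for _ in range(3):
    action, state = step(state, rng)
    print(action, state)
```

The policy belongs to the agent and can be changed by learning, whereas the transition table belongs to the environment; a deterministic policy can still produce random trajectories if `transition` is stochastic.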

[1] https://ocw.mit.edu. *6.825 Techniques in Artificial Intelligence*. Page Number - 6. Web. 6 May 2020.

[2] Simonini, Thomas. https://www.freecodecamp.org. *An introduction to Policy Gradients with Cartpole and Doom*. 9 May 2018. Web. 6 May 2020.

[3] https://www.computing.dcu.ie/. Reinforcement Learning. *2.1.1 Special case - Deterministic worlds*. Web. 6 May 2020.

How does this add something new apart from the info in my answer? – nbro – 2020-05-06T10:56:27.457

@nbro Generalization to Non-Markov Models and Difference between Deterministic policy and Transition function (often confused).... – abhas_RewCie – 2020-05-06T13:10:06.810

The question was not about the difference between transition function and deterministic policy. – nbro – 2020-05-06T13:29:37.547

'A deterministic policy can be interpreted as a stochastic policy that gives the probability of 1 to one of the available actions (and 0 to the remaining actions), for each state.' don't think this is correct since the definition of stochasticity is that the event can't be predicted. Meaning : "having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely." – DuttaA – 2019-05-12T19:21:40.313

@DuttaA You can have a probability distribution that assigns $1$ to one event and $0$ to everything else. This is mathematically possible. I said "you can interpret". I am not saying that this is a good way of thinking about it. – nbro – 2019-05-12T19:29:17.530

The definition of stochasticity is that you cannot predict, which is not the case here. – DuttaA – 2019-05-12T19:39:23.197

@DuttaA This is one definition of stochasticity. I have actually read one definition of PMF that states that each event must have a probability greater than 0 (but they likely meant $\geq 0$). What happens if there is only one event? In that case, that event must have probability $1$. So, a probability distribution can give probability $1$ to one event (and $0$ to the others). – nbro – 2019-05-12T19:46:54.927