Q-learning seems to be related to A*. I am wondering if there are (and what are) the differences between them.
Q-learning and A* can both be viewed as search algorithms, but, apart from that, they are not very similar.

Q-learning is a **reinforcement learning** algorithm, i.e. an algorithm that attempts to find a *policy* or, more precisely, a **value function** (from which a policy can be derived). It does so by taking stochastic moves (or actions) according to some behaviour policy (which is different from the policy you want to learn), such as the $\epsilon$-greedy policy, given the current estimate of the value function. Q-learning is a **numerical** (stochastic optimization) algorithm that can be shown to converge to the optimal solution in the tabular case (but it does not necessarily converge when you use a function approximator, such as a neural network, to represent the value function). Q-learning can be viewed as a search algorithm, where the solutions are value functions (or policies) and the search space is some space of value functions (or policies).
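To make the description above concrete, here is a minimal sketch of tabular Q-learning on a hypothetical 5-state chain environment (states $0$ to $4$; the agent starts in state $0$ and receives reward $1$ on reaching state $4$). The environment, hyperparameters, and function names are illustrative assumptions, not part of the original answer.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)  # actions: 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON, EPISODES = 0.5, 0.9, 0.1, 500

def step(state, action):
    """Deterministic chain transition; reaching the last state is terminal."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def greedy(Q, s):
    """Greedy action w.r.t. the current value-function estimate (random tie-break)."""
    best = max(Q[s])
    return random.choice([a for a in ACTIONS if Q[s][a] == best])

def q_learning(seed=0):
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # tabular value function Q[s][a]
    for _ in range(EPISODES):
        s, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy (not the greedy policy being learned)
            a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(Q, s)
            s2, r, done = step(s, a)
            # Q-learning update: bootstrap from the greedy (max) action in s2
            Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
# The policy derived from the learned Q moves right in every non-terminal state.
policy = [greedy(Q, s) for s in range(N_STATES - 1)]
print(policy)
```

Note that the learned table `Q` is the solution here; the greedy policy is only derived from it at the end, which is the sense in which Q-learning "searches" a space of value functions rather than a graph of positions.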

On the other hand, A* is a general **search algorithm** that can be applied to any search problem where the search space can be represented as a **graph**, where nodes are positions (or locations) and edges connect these positions with associated weights (or costs). A* is an **informed** search algorithm, given that you can use an (informed) heuristic to guide the search, i.e. you can use domain knowledge to guide the search. A* belongs to the *best-first search* (BFS) family of algorithms, which explore the search space by always expanding the next best node according to some objective function, which varies depending on the specific BFS algorithm. In the case of A*, the objective function is $f(n) = g(n) + h(n)$, where $n$ is a node, $g$ is the function that calculates the cost of the path from the starting node to $n$, and $h$ is the heuristic function. A* is also known to be optimal, provided that the heuristic function is *admissible* (i.e. it never overestimates the true cost to the goal).
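For comparison, here is a minimal sketch of A* on a hypothetical 2D grid (`0` = free cell, `1` = wall), using the Manhattan distance as an admissible heuristic. The grid, helper names, and unit edge costs are illustrative assumptions.

```python
import heapq

def a_star(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    # Manhattan distance: admissible on a 4-connected grid with unit costs
    h = lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1])
    g = {start: 0}                  # g(n): cost of the best known path to n
    frontier = [(h(start), start)]  # priority queue ordered by f(n) = g(n) + h(n)
    parent = {}
    while frontier:
        _, n = heapq.heappop(frontier)
        if n == goal:               # reconstruct the path by following parents
            path = [n]
            while n in parent:
                n = parent[n]
                path.append(n)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (n[0] + dr, n[1] + dc)
            if 0 <= nb[0] < rows and 0 <= nb[1] < cols and grid[nb[0]][nb[1]] == 0:
                new_g = g[n] + 1    # unit edge cost between adjacent cells
                if new_g < g.get(nb, float("inf")):
                    g[nb] = new_g
                    parent[nb] = n
                    heapq.heappush(frontier, (new_g + h(nb), nb))
    return None                     # goal unreachable

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = a_star(grid, (0, 0), (2, 0))
print(len(path) - 1)  # → 6 moves: the wall forces a detour around the right side
```

The contrast with the Q-learning sketch is the key point: A* searches a graph of positions for a single path using a fixed, hand-supplied heuristic, whereas Q-learning learns a value function over all state/action pairs from sampled experience.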

Note that there are algorithms like RTA* and LRTA* (and many others) that are much closer to Q-learning while keeping many of the principles of A*. (Although typically they learn a value function for states instead of a q-value for state/action pairs.) – Nathan S. – 2020-08-18T03:48:33.147

@NathanS. I was not aware of these algorithms. Thanks for mentioning them. Maybe you could provide an answer that briefly explains them and the difference between them, A*, and Q-learning, although that wasn't the question. – nbro – 2020-08-18T10:10:23.830