Both finite differences and the parameter-shift rule can be used to compute quantum gradients on quantum hardware. However, there are several reasons why the parameter-shift rule is preferred.

### Numerical differentiation

One method to compute gradients is finite differences, a form of numerical differentiation. Here we treat the function to be differentiated as a 'black box', and approximate its derivative via:

$$\frac{d f(x)}{dx}\approx \frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}.$$

Note that it gives an approximation of the derivative, whose quality depends on $\epsilon$.

- In general, we require $\epsilon\ll 1$ for the result to be accurate.
- However, finite differences can be quite unstable when used iteratively, e.g., for gradient descent.
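To make the $\epsilon$-dependence concrete, here is a small sketch (using $\sin$ as the black-box function, an assumption for illustration) showing how the central-difference error behaves as $\epsilon$ shrinks:

```python
import math

def central_difference(f, x, eps):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.5
exact = math.cos(x)  # the true derivative of sin is cos

# The truncation error shrinks as eps decreases, until floating-point
# round-off in f(x + eps) - f(x - eps) starts to dominate.
for eps in (1e-1, 1e-3, 1e-7, 1e-12):
    approx = central_difference(math.sin, x, eps)
    print(f"eps={eps:g}: error={abs(approx - exact):.2e}")
```

On noiseless floating-point arithmetic this trade-off is merely annoying; on noisy hardware, where each evaluation of $f$ carries statistical error far larger than machine precision, it becomes fatal, as discussed below.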

### Parameter shift

Another method is the parameter-shift rule. On the surface, the idea looks similar: we express the derivative as a linear combination of the function evaluated at two shifted points. However, there are several important differences from finite differences:

- Parameter shift is **exact**; it is not an approximation.
- The shift does *not* have to be small or infinitesimal, as it does in finite differences.
- Whereas finite differences can be applied to any function $f(x)$, the parameter-shift rule can only be used to determine $f'(x)$ if the function satisfies certain constraints.

For example, consider $f(x)=\sin(x)$. Using the trig identity $\sin(a+b)-\sin(a-b) = 2\cos(a)\sin(b)$, its derivative, $\cos(x)$, can be written as

$$\frac{df(x)}{dx} = \cos(x) = \frac{\sin(x+s)-\sin(x-s)}{2\sin(s)},$$

for **any value** of $s$ with $\sin(s)\neq 0$. So in this case, we have

$$\frac{d f(x)}{dx}=\frac{f(x+s)-f(x-s)}{2\sin(s)}.$$

This is simply taking advantage of the fact that we know that $f(x)$ satisfies a particular trig identity. Notice that the "shift" from $x$ is not required to be tiny and the equation is exact, i.e., it is not an approximation. Typically, we would choose a value like $s=\pi/2$:

$$\frac{d f(x)}{dx}=\frac{f(x+\pi/2)-f(x-\pi/2)}{2}.$$
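This is easy to check numerically. The sketch below evaluates the shift-rule expression for several (non-small!) values of $s$ and compares against the exact derivative $\cos(x)$; the agreement is at the level of floating-point round-off, not a truncation error:

```python
import math

def param_shift_derivative(x, s=math.pi / 2):
    """Exact derivative of sin(x) via the shift rule; valid whenever sin(s) != 0."""
    return (math.sin(x + s) - math.sin(x - s)) / (2 * math.sin(s))

x = 1.3
for s in (0.1, 1.0, math.pi / 2):
    d = param_shift_derivative(x, s)
    print(f"s={s:.4f}: shift rule={d:.12f}, cos(x)={math.cos(x):.12f}")
```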

A similar trick can be used for gradients of quantum gates.
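As a minimal illustration of the quantum case (simulated here with NumPy matrices rather than real hardware, and using the standard single-qubit $RX$ gate as an assumed example), the expectation $\langle Z\rangle$ after $RX(\theta)\lvert 0\rangle$ equals $\cos\theta$, so the two-term shift rule recovers its gradient exactly:

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rx(theta):
    """Single-qubit RX(theta) gate matrix, exp(-i theta X / 2)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def expval(theta):
    """<Z> after applying RX(theta) to |0>; analytically this is cos(theta)."""
    state = rx(theta) @ np.array([1, 0], dtype=complex)
    return np.real(state.conj() @ Z @ state)

def parameter_shift(theta, s=np.pi / 2):
    """Gradient of expval via the two-term parameter-shift rule (exact)."""
    return (expval(theta + s) - expval(theta - s)) / (2 * np.sin(s))

theta = 0.7
print(parameter_shift(theta), -np.sin(theta))  # agree up to float round-off
```

Note that the circuit is run twice at macroscopically shifted parameter values; no small-$\epsilon$ limit is involved.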

Why do we prefer the parameter shift rule over finite differences?

Since near-term quantum hardware is noisy, we typically cannot rely on finite differences to provide us with accurate results; if we choose $\epsilon$ to be sufficiently small (usually we aim for $\epsilon\approx 10^{-7}$ or so), the difference between the two shifted circuit evaluations is swamped by noise.

In contrast, the parameter shift rule is (a) exact, and (b) doesn't require a small shift. We can optimize by choosing the shift so as to *maximise* the distance in the parameter-space between the two circuit evaluations.

**Note:** of course, if using near term quantum hardware, we are restricted to a finite number of shots, so we can only ever approximate the gradient, even with the parameter-shift rule (with the approximation becoming more accurate as the number of shots increases).

The macroscopic shift, as well as the fact that the parameter-shift rule provides an unbiased estimator of the gradient, still makes it preferable to finite differences. In fact, convergence is still guaranteed *even in the extreme case of estimating the gradient with a single shot*.
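The unbiasedness is easy to see in simulation. The sketch below (an assumed toy model: sampling $\pm 1$ outcomes for the $RX(\theta)\lvert 0\rangle$, $\langle Z\rangle=\cos\theta$ circuit from above) estimates the gradient from *single-shot* circuit evaluations; each individual estimate is very noisy, but their average converges to the true gradient $-\sin\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_expval(theta, shots):
    """Estimate <Z> for RX(theta)|0> from finitely many shots.
    Each measurement yields +1 with probability (1 + cos(theta)) / 2."""
    p_plus = (1 + np.cos(theta)) / 2
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p_plus, 1 - p_plus])
    return outcomes.mean()

def shift_grad(theta, shots):
    """Parameter-shift gradient estimate from two shot-based evaluations."""
    s = np.pi / 2
    return (sampled_expval(theta + s, shots) - sampled_expval(theta - s, shots)) / (2 * np.sin(s))

theta = 0.7
# Single-shot estimates are unbiased: their mean approaches -sin(theta).
estimates = [shift_grad(theta, shots=1) for _ in range(20000)]
print(np.mean(estimates), -np.sin(theta))
```

A finite-difference estimator with tiny $\epsilon$ would instead divide shot noise of order one by $2\epsilon$, producing a variance that explodes as $\epsilon\to 0$.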

Do other methods exist for taking the derivative, besides these two?

Yes!

If we are restricted to simulators, we can use standard backpropagation or a form of backpropagation that is more memory efficient for reversible/unitary computations.

If you mean methods that support quantum hardware, other methods also exist:

There are extensions or generalizations of the parameter-shift rule, including a stochastic parameter-shift rule that applies to additional gates, a parameter-shift rule for noisy channels, and parameter-shift rules for controlled operations that require more than the standard two terms.

Further, there are approaches that involve modifying the circuit structure to compute the gradient (as opposed to the parameter-shift rule, which requires multiple evaluations of the same circuit structure).


Thank you for the amazing answer. Is it correct to say that you cannot use backpropagation within a variational quantum circuit, on real hardware, because you're not allowed to know the state of the system at any step but the last one? – incud – 2021-01-07T14:10:01.293

Yes, I believe that is the main impediment --- we are unable to cache intermediate computations on quantum hardware, a requirement for reverse-mode backpropagation used in most machine learning frameworks. – Josh Izaac – 2021-01-08T08:16:30.133