The basic idea of a quantum feature map is that you use a quantum computer to map each input datapoint $x$ from your training domain $\mathcal{X}$ into a quantum state $|\phi(x)\rangle = U(x)|0\rangle$ in the (presumably) high-dimensional quantum state space, and then evaluate the kernel function:

$$
k_Q(x_i, x_j) = |\langle 0|U(x_j)^\dagger U(x_i)|0\rangle|^2
$$

for all pairs $x_i,x_j \in \mathcal{X}$. The Support Vector Machine (SVM) that classifies $\mathcal{X}$ can then be treated as a black box that takes in the kernel matrix $K_{ij} = k_Q(x_i, x_j)$ and returns a model $f_Q: \mathcal{X} \rightarrow \{0,1\}$.
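As a toy illustration (not a feature map anyone uses in practice), here's a NumPy sketch that takes $U(x)$ to be a single-qubit $R_Y(x)$ rotation and evaluates $k_Q$ by direct statevector simulation. The helper names (`ry`, `feature_state`, `quantum_kernel`) are my own:

```python
import numpy as np

def ry(theta):
    """Single-qubit Y-rotation unitary R_Y(theta)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def feature_state(x):
    """|phi(x)> = U(x)|0> with the toy choice U(x) = R_Y(x)."""
    return ry(x) @ np.array([1.0, 0.0])

def quantum_kernel(xi, xj):
    """k_Q(x_i, x_j) = |<0| U(x_j)^dag U(x_i) |0>|^2."""
    return np.abs(np.vdot(feature_state(xj), feature_state(xi))) ** 2

xs = np.array([0.1, 0.7, 1.3])
K = np.array([[quantum_kernel(a, b) for b in xs] for a in xs])
# K is symmetric with ones on the diagonal, since k_Q(x, x) = 1
```

For this particular map the kernel works out analytically to $\cos^2((x_i - x_j)/2)$, which makes it easy to sanity-check the simulation.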

The "kernel trick" is the substitution that allows us to use the SVM (ordinarily a linear classifier on $\mathcal{X}$) to classify the data using $K$ to achieve non-linear decision boundaries. But regardless of whether $K$ is generated by a classical or a quantum computer, this doesn't guarantee an effective classifier. An example of a kernel that can be shown to fail is the quadratic polynomial kernel

$$
k_C(x_i, x_j) = (\langle x_i, x_j \rangle + b)^2
$$

which is generally incapable of classifying data labeled by a function of degree 3 or higher. So if you can find a quantum kernel $k_Q$ whose kernelized SVM successfully classifies data labeled by functions for which $k_C$ fails, you've found at least some evidence supporting the use of your quantum feature map.
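To see why the quadratic kernel is limited, note that $k_C$ is exactly an inner product in a feature space of monomials of degree at most 2, so a kernelized linear classifier built on it can only draw degree-2 decision boundaries. The following sketch (for a 1D input, with helper names of my own) checks that identity numerically:

```python
import numpy as np

def poly2_kernel(xi, xj, b=1.0):
    """Quadratic polynomial kernel k_C(x_i, x_j) = (<x_i, x_j> + b)^2."""
    return (np.dot(xi, xj) + b) ** 2

def poly2_features(x, b=1.0):
    """Explicit degree-2 feature map phi with <phi(x), phi(x')> = k_C(x, x').
    For a scalar input x: phi(x) = (x^2, sqrt(2b) * x, b)."""
    x = float(x)
    return np.array([x ** 2, np.sqrt(2 * b) * x, b])

x1, x2 = 0.5, -1.2
lhs = poly2_kernel(np.array([x1]), np.array([x2]))
rhs = np.dot(poly2_features(x1), poly2_features(x2))
# lhs == rhs: the kernel "lives" in a space spanned by monomials of
# degree <= 2, which is why it cannot express degree-3 labeling functions.
```

Expanding $(x_1 x_2 + b)^2 = x_1^2 x_2^2 + 2b\,x_1 x_2 + b^2$ term-by-term matches the inner product of the explicit features.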

More generally, the motivation for using quantum feature maps is that they *might* be more expressive than some classical counterparts. For instance, (Schuld, 2020) uses Fourier analysis to connect the spectrum of a classifier $f_Q$ to the number of local rotations in the circuit $U(x)$.

But justifying the use of quantum kernels also requires finding a feature map that is inefficient to compute classically; otherwise you could just simulate the unitaries $U(x)$ for all $x$ to evaluate your kernels$^\dagger$. Some recent work (Huang, 2020) takes steps towards evaluating the power of quantum kernels compared to some classical counterparts, but overall this is still a very open question.

$^\dagger$ Keep in mind that if you can simulate $U(x)$ efficiently, then you can do "one-shot" evaluation of the kernel matrix, so that the number of circuit simulations is only $O(n)$ instead of the $O(n^2)$ circuit evaluations needed to compute $k_Q(x, x')$ pairwise on hardware. This raises the bar for demonstrating a speedup using this kind of quantum SVM.
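A minimal sketch of that counting argument, again with a toy single-qubit $R_Y$ feature map of my own choosing: simulating each state once ($n$ simulations) suffices to fill in all $n^2$ kernel entries with cheap classical inner products:

```python
import numpy as np

def ry(theta):
    """Single-qubit Y-rotation unitary R_Y(theta)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

xs = np.linspace(0.0, 2.0, 5)

# "One-shot" classical simulation: one circuit simulation per datapoint,
# storing each statevector |phi(x_k)> -- O(n) simulations total.
states = np.stack([ry(x) @ np.array([1.0, 0.0]) for x in xs])

# All n^2 kernel entries K_ij = |<phi(x_j)|phi(x_i)>|^2 then follow from
# classical inner products, with no further circuit evaluations.
K = np.abs(states.conj() @ states.T) ** 2
```

On hardware, by contrast, each entry $K_{ij}$ requires its own circuit executions, since the statevector is never directly accessible.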

How is the kernel matrix an input? – Sinestro 38 – 2021-01-24T04:46:36.217

What is a kernelized SVM? – Sinestro 38 – 2021-01-24T11:25:48.110

Hey Forky so I'm trying to get a high level understanding of what the QC does so correct me if I'm wrong here. In a SVM, the classical computer computes through kernel functions of n degrees to find the most suitable space with a clear hyperplane dividing the two classes. But since some datasets can be computationally expensive to calculate the kernels of (or inner products with the kernel trick), quantum computers can efficiently apply that kernel function to obtain a speedup (theoretically). – Sinestro 38 – 2021-01-24T11:32:14.570

"The 'kernel trick' is the substitution that allows us to use the SVM (ordinarily a linear classifier on $X$) to classify the data using $K$ to achieve non-linear decision boundaries." --> I thought that the kernel trick was used to efficiently determine the utility of casting data points to a higher dimension through calculating the inner products of each pair. – Sinestro 38 – 2021-01-24T11:36:40.287

The most basic formulation of an SVM is as a constrained optimization problem that looks for a linear ("hyperplane") boundary that separates two classes of linearly separable data while enforcing that the chosen hyperplane maximizes its perpendicular distance ("margin") from members of either class. After some rearranging this can be stated as a maximization problem with respect to a Lagrangian $L = \sum_i \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$. The "kernel trick" refers to substituting the $\langle x_i , x_j \rangle$ for a positive definite symmetric $k(x_i, x_j)$ – forky40 – 2021-01-24T20:57:09.123

when $k(x_i, x_j)$ is evaluated over a fixed dataset $\{x_k\}_{k=1}^n$ it just becomes an $n \times n$ matrix of values with entries $K_{ij} = k(x_i, x_j)$. This is called the kernel matrix or sometimes the "Gram matrix". The kernelized SVM is just an SVM that you've applied the kernel trick to. – forky40 – 2021-01-24T21:01:29.950

Regarding your third comment, that's not quite right. Usually a researcher will start with some choice of $k(x, x')$ to try to classify their data, and then play around with different functions to see which gives the best SVM performance for a fixed dataset. When you compute a quantum kernel it's usually not to reproduce some classical $k_C$, but rather to find a choice of $k_Q$ that is better suited to the problem at hand. So the goal is to find a $k_Q$ that is a better choice for a given dataset, but to actually justify the use of the QC you need to confirm that $k_Q$ is classically hard – forky40 – 2021-01-24T21:13:32.700