What is the guarantee this implementation is efficient? Is there any
rule regarding when implementing such POVMs is efficient?

The implementation of such a gate will only depend on the parameter $k$ (which I assume you mean to be fixed), not $n$. Since efficiency is generally phrased in terms of scaling with $n$, and you have no dependence on that, it is efficient.

How do I implement this POVM using a fixed universal gate set and the
ability to measure in the standard basis? What is the unitary that I
have to apply before measuring in the standard basis?

Let $H_i=UDU^\dagger$, where $D$ is diagonal (with entries between 0 and 1 on the diagonal) and $U$ is a unitary. Apply $U^\dagger$ to the appropriate set of qubits. This now reduces you to the problem of performing the measurement $\{D,1-D\}$.
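As a concrete sketch of this diagonalization step (in numpy, with a hypothetical single-qubit POVM element chosen purely for illustration):

```python
import numpy as np

# Hypothetical 1-qubit POVM element with eigenvalues in [0, 1].
H1 = np.array([[0.7, 0.2],
               [0.2, 0.4]])
d, U = np.linalg.eigh(H1)  # H1 = U @ diag(d) @ U^dagger

assert np.allclose(U @ np.diag(d) @ U.conj().T, H1)
assert np.all((0 <= d) & (d <= 1))  # valid element of the POVM {H1, 1 - H1}
```

The eigenvalues `d` are the diagonal of $D$, and the columns of `U` give the basis change you undo with $U^\dagger$ before measuring.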

You'll need to introduce a single ancilla qubit, prepared in the $|0\rangle$ state. It is this ancilla that you will measure in the computational basis, with the two outcomes corresponding to the two different measurement operators. But before that, we need to construct a unitary between the original system (S) and the ancilla (A). Let $D=\sum_id_i|i\rangle\langle i|$, and let $V|i\rangle_S|0\rangle_A=\sqrt{d_i}|i\rangle|0\rangle+\sqrt{1-d_i}|i\rangle|1\rangle$. You can decompose this unitary via standard techniques. Apply $V$, and measure the ancilla.
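A minimal numpy sketch of constructing $V$, assuming hypothetical diagonal entries $d_i$ and taking the ancilla as the least-significant qubit (the $|i\rangle_S|1\rangle_A$ columns are one valid unitary completion among many):

```python
import numpy as np

# Hypothetical diagonal entries of D (eigenvalues of the POVM element), all in [0, 1].
d = np.array([0.8, 0.3, 0.5, 0.1])
n = len(d)

# With the ancilla as the least-significant qubit, V acts on |i>_S |a>_A as a
# 2x2 rotation of the ancilla, controlled on |i>:
#   |i,0> -> sqrt(d_i)|i,0> + sqrt(1-d_i)|i,1>
V = np.zeros((2 * n, 2 * n))
for i, di in enumerate(d):
    V[2*i:2*i+2, 2*i:2*i+2] = [[np.sqrt(di),     -np.sqrt(1 - di)],
                               [np.sqrt(1 - di),  np.sqrt(di)]]

assert np.allclose(V @ V.T, np.eye(2 * n))  # V is unitary
```

Since $V$ is block-diagonal in controlled single-qubit rotations, decomposing it into a universal gate set is a standard multiplexed-rotation synthesis task.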

To see that this works, let your input state be $|\psi\rangle=U\sum_i\alpha_i|i\rangle$. You should get the first measurement outcome with probability
$$
\langle\psi|H_i|\psi\rangle=\sum_j|\alpha_j|^2d_j.
$$
This is the value our simulation must reproduce. The simulation first applies $U^\dagger$, giving
$$
\sum_i\alpha_i|i\rangle_S|0\rangle_A.
$$
We apply $V$ to prepare
$$
|\Psi\rangle=\sum_i\alpha_i|i\rangle_S(\sqrt{d_i}|0\rangle_A+\sqrt{1-d_i}|1\rangle_A).
$$
We calculate the probability of the 0 outcome:
$$
\langle\Psi|\left(1_S\otimes|0\rangle\langle 0|_A\right)|\Psi\rangle=\sum_i|\alpha_i|^2d_i,
$$
as required.
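The whole derivation can be checked numerically. A sketch with a hypothetical two-qubit POVM element built from a random unitary and chosen eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical POVM element H = U diag(d) U^dagger, eigenvalues d in [0, 1].
d = np.array([0.9, 0.4, 0.6, 0.2])
U, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
H = U @ np.diag(d) @ U.conj().T

# Random normalized input state |psi>.
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)

# Simulation: apply U^dagger, then V (ancilla as least-significant qubit).
alpha = U.conj().T @ psi  # amplitudes alpha_i
# |Psi> = sum_i alpha_i |i>_S (sqrt(d_i)|0>_A + sqrt(1-d_i)|1>_A)
Psi = np.zeros(8, dtype=complex)
Psi[0::2] = alpha * np.sqrt(d)      # ancilla |0> components
Psi[1::2] = alpha * np.sqrt(1 - d)  # ancilla |1> components

# Probability of measuring the ancilla in |0> matches <psi|H|psi>.
p0 = np.sum(np.abs(Psi[0::2]) ** 2)
assert np.isclose(p0, (psi.conj() @ H @ psi).real)
```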

Note that I've not worried about the state after the measurement because you've only specified a POVM, which immediately implies you're only interested in the measurement probability, not the output state.

and how much error can I tolerate?

This depends on what you mean, and is probably an entirely separate question to do justice to.

Regarding the error, what I meant is, when I am decomposing the unitaries with gates from my universal gate set (comprising H, S, T, and the CNOT gate, let's say), what is the trade-off between the error incurred and the size of the circuit? – BlackHat18 – 2020-10-12T09:56:28.950

Also, what is the cost (in terms of circuit size) of implementing the unitary $U$? Why do we assume it is not high? – BlackHat18 – 2020-10-12T10:09:47.960

Because everything only acts on $k$ qubits. It might be exponentially large in $k$, but if $k$ is fixed as $n$ scales, we don't care about that. – DaftWullie – 2020-10-12T10:14:27.780

Regarding the issue of error, why is this different to any standard "build a unitary" task, where we know the optimal solution and how the error scales (run time is $\log(1/\epsilon)$ for each single-qubit gate)? – DaftWullie – 2020-10-12T10:15:35.717

Thanks! It's clear now. One thing though, in the analysis, I am not quite sure why you used the state $|\psi\rangle=U\sum_i\alpha_i|i\rangle$ for your analysis. Won't the analysis work for states like $|\psi\rangle=\sum_i\alpha_i|i\rangle$? – BlackHat18 – 2020-10-12T10:35:20.020

Yes, it just made it easier because I knew the first thing I would do was apply $U^\dagger$. But either form is a completely arbitrary state. – DaftWullie – 2020-10-12T13:02:49.380