I'm listening on headphones in a relatively noisy environment... but I'll try my best to answer.
The voice recording is mono, band-limited and compressed/distorted. The background music is spectrally rich and carries a lot of stereo information too. As for the stereo separation vs mono, your brain is doing much of the legwork here, Iggy!
I don't think the author had to put much thought into it. The two sources are simply different in nature to begin with.
If you're struggling, you could try some subtle side-chain compression, whereby your voice triggers a slight, varying compression of the background music.
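To make the side-chain idea concrete, here is a rough sketch in plain Python. It is not how any particular DAW or plugin does it. The function name `duck_music` and every parameter value (threshold, reduction, attack and release times) are my own illustrative assumptions, and it expects both signals as lists of floats in the range -1 to 1 at the same sample rate.

```python
import math

def duck_music(voice, music, sample_rate=44100, threshold=0.05,
               max_reduction_db=4.0, attack_ms=10.0, release_ms=250.0):
    """Turn `music` down slightly whenever the level of `voice` rises.

    All default values are illustrative guesses, not recommendations.
    """
    # One-pole smoothing coefficients for a simple envelope follower.
    attack = math.exp(-1.0 / (sample_rate * attack_ms / 1000.0))
    release = math.exp(-1.0 / (sample_rate * release_ms / 1000.0))
    # Lowest gain the music is ever reduced to (e.g. -4 dB is about 0.63).
    floor_gain = 10.0 ** (-max_reduction_db / 20.0)

    envelope = 0.0
    ducked = []
    for v, m in zip(voice, music):
        level = abs(v)
        # Follow the voice level quickly on the way up, slowly on the way down.
        coeff = attack if level > envelope else release
        envelope = coeff * envelope + (1.0 - coeff) * level

        if envelope <= threshold:
            gain = 1.0  # voice is quiet: leave the music alone
        else:
            # Ramp from no reduction at the threshold to the full
            # reduction at twice the threshold.
            amount = min(1.0, (envelope - threshold) / threshold)
            gain = 1.0 - amount * (1.0 - floor_gain)
        ducked.append(m * gain)
    return ducked
```

Keeping the maximum reduction to a few dB and the release fairly slow is what keeps it subtle. Push either too far and the music will audibly pump.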
Beyond that, if your voice recording setup has better fidelity than the video you've linked, consider compressing/distorting your recording and band-limiting it afterwards. Telephone transmissions (300 Hz to 3.4 kHz) don't even aim to capture the fundamental frequencies of a person's voice: https://en.m.wikipedia.org/wiki/Voice_frequency
Maybe try 200 Hz to 5 kHz or so for better clarity and less obvious processing.
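For what it's worth, the band-limiting step could be sketched like this in plain Python. I'm assuming a second-order high-pass followed by a second-order low-pass, using the widely published "Audio EQ Cookbook" biquad formulas. The names `biquad` and `band_limit`, the filter order and the Q value are my own choices for illustration, and in practice any EQ or filter plugin will do the same job.

```python
import math

def biquad(samples, kind, cutoff_hz, sample_rate, q=0.7071):
    """Second-order high-pass or low-pass filter (Audio EQ Cookbook form)."""
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * q)

    if kind == "highpass":
        b0 = (1.0 + cos_w0) / 2.0
        b1 = -(1.0 + cos_w0)
    elif kind == "lowpass":
        b0 = (1.0 - cos_w0) / 2.0
        b1 = 1.0 - cos_w0
    else:
        raise ValueError("kind must be 'highpass' or 'lowpass'")
    b2 = b0
    a0 = 1.0 + alpha
    a1 = -2.0 * cos_w0
    a2 = 1.0 - alpha

    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    filtered = []
    for x in samples:
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        filtered.append(y)
    return filtered

def band_limit(samples, sample_rate=44100, low_hz=200.0, high_hz=5000.0):
    """Keep roughly low_hz to high_hz: high-pass first, then low-pass."""
    without_lows = biquad(samples, "highpass", low_hz, sample_rate)
    return biquad(without_lows, "lowpass", high_hz, sample_rate)
```

Calling `band_limit(voice)` on a list of float samples gives the 200 Hz to 5 kHz version. Passing `low_hz=300.0, high_hz=3400.0` gets you closer to the telephone band. The slopes here are gentle (12 dB per octave), so run it twice if you want a more obvious effect.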
I hope this helps.
In addition, the background music is also fairly rich in reverb and delay. So the music sits further back, and wide. The voice recording is upfront, and central.
The reason I suggested compressing/distorting your vocals before band-limiting is that it will better emphasise the fundamentals in your voice.
We recognise vowels in speech through formants, which are moving resonant band-pass filter shapes, created by the shape of your mouth, the position of the tongue and the pharyngeal space. The consonants belong in higher frequencies, and are often transient in nature.
Your voice contains both odd- and even-order harmonics. Distortion and aggressive compression will create mainly odd-order harmonics of the existing partials. There is a fine balance between emphasising the existing characteristics of a voice and ruining it by creating too many higher-frequency harmonics, which will mostly interfere with the clarity of consonants.
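If you want to see the odd-harmonic behaviour for yourself, here is a small sketch in plain Python. It assumes the distortion is a symmetric soft clipper, which I've modelled with `tanh`. That is my own stand-in, not a model of any specific compressor or plugin, and the `drive` value and helper names are made up for illustration. It feeds in a pure sine and measures how much energy lands on each of the first five harmonics.

```python
import math

def soft_clip(samples, drive=4.0):
    """Symmetric soft clipping: positive and negative halves are treated alike."""
    return [math.tanh(drive * s) for s in samples]

def harmonic_level(samples, harmonic, cycles):
    """Amplitude of one harmonic of a tone that fits `cycles` whole periods
    into `samples`, measured with a single-bin Fourier sum."""
    n = len(samples)
    k = harmonic * cycles
    re = sum(s * math.cos(2.0 * math.pi * k * i / n) for i, s in enumerate(samples))
    im = sum(s * math.sin(2.0 * math.pi * k * i / n) for i, s in enumerate(samples))
    return 2.0 * math.hypot(re, im) / n

# A pure sine that completes exactly 8 periods in 4096 samples.
n, cycles = 4096, 8
tone = [math.sin(2.0 * math.pi * cycles * i / n) for i in range(n)]

clipped = soft_clip(tone)
levels = {h: harmonic_level(clipped, h, cycles) for h in range(1, 6)}

for h, level in levels.items():
    print(f"harmonic {h}: {level:.4f}")
```

With a symmetric clipper like this, harmonics 2 and 4 come out at effectively zero, while 3 and 5 are clearly present. Raising `drive` pushes more energy into the higher odd harmonics, which is the "too much" end of that fine balance.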
Imitating a higher fidelity 'telephone' type of recording will best suit your purposes in my opinion.