Why is a CPU "better" for video encoding than a GPU?



I was reading this article and I saw that a CPU is better for video compression than a GPU.

The article only says that this happens because the processor can handle more complex algorithms than the GPU, but I want a more technical explanation. I did some searching on the internet but didn't find anything.

So, can anyone explain this, or link a site where I can get a deeper explanation?

Mateus Felipe Martins Da Costa

Posted 2015-01-24T00:59:43.700

Reputation: 48



The article you linked is not very good.

It claims: "Normally, single pass bitrate encodings convert your bitrate into a RF value with a maximum bitrate limit and takes it from there."

x264's one-pass ABR ratecontrol is not implemented as CRF + limit. He's right that 2pass is by far the best way to hit a target bitrate, though.

And he apparently doesn't realize that he could start x264 with threads=3 or something, to leave some CPU time free for other tasks. Or set x264's priority to verylow, so it only gets CPU time that no other task wants.
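As a sketch of what I mean (this assumes the x264 command-line tool on Linux; the option names are from x264's own help, and nice is the usual way to get "lowest priority" there):

```shell
# Use 3 worker threads instead of one per core, and run at the lowest
# CPU priority so any other task always wins the scheduler.
nice -n 19 x264 --threads 3 --preset veryslow --crf 26 -o out.mkv in.m2ts
```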

He also mixes up threads=1 with using CUDA, or something. No wonder you have questions: that article has a TERRIBLE explanation. The whole article basically boils down to: use x264 --preset veryslow --tune film --crf 26 -o out.mkv in.m2ts, or maybe do some light filtering with an input AviSynth script. He actually recommends "placebo". That's hilarious. I've never seen a pirated file encoded with placebo. (You can tell from me=esa or me=tesa, instead of me=umh for all the good quality presets, right up to veryslow.)

He also doesn't mention using 10bit color depth. Slower to encode and decode, but even after downconverting back to 8bit, you get better 8-bit SSIM. Having more precision for motion vectors apparently helps. Also, not having to round off to exactly a whole 8 bit value helps. You can think of 8-bit per component as a speed-hack; quantizing in the frequency-domain and then compressing that with CABAC means that higher bit-depth coefficients don't have to take more space.
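A sketch of requesting a 10-bit encode through ffmpeg (this assumes a build whose libx264 supports 10-bit output; older builds needed a separate 10-bit binary):

```shell
# Encode 8-bit input as 10-bit H.264: -pix_fmt yuv420p10le asks the
# encoder for the 10-bit 4:2:0 pixel format, so coefficients and motion
# compensation get the extra precision discussed above.
ffmpeg -i in.m2ts -c:v libx264 -preset veryslow -crf 26 -pix_fmt yuv420p10le out.mkv
```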

(BTW, h.265 gets less benefit from 10-bit encodes for 8-bit video because it already has more precision for motion vectors. If there is a benefit to using 10-bit x265 for 8-bit video inputs, it's smaller than with x264. So it's less likely that the speed penalty will be worth it.)

To answer your actual question:

edit: doom9 is up again now, so I'll tidy up the link. Go to it for proper quoting of who said what.


google only caches the stupid print version which doesn't properly show the quoting. I'm not quite sure which parts of these messages are quotes, and which are attributed to the person themselves.

Highly irregular branching patterns (skip modes) and bit manipulation (quantization/entropy coding) don't suit present GPUs. IMO the only really good application at the moment are full search ME algorithms, in the end though accelerated full search is still slow even if it's faster than on the CPU.
-- MfA

Actually, basically everything can be reasonably done on the GPU except CABAC (which could be done, it just couldn't be parallelized).

x264 CUDA will implement a fullpel and subpel ME algorithm initially; later on we could do something like RDO with a bit-cost approximation instead of CABAC.

Because it has to do everything at single precision floating point
-- MfA

Wrong, CUDA supports integer math.

-- Dark Shikari

Dark Shikari is the x264 maintainer, and developer of most of the features since 2007 or so.

AFAIK, this CUDA project didn't pan out. There is support for using OpenCL to offload some work from the lookahead thread (quick I/P/B decision, not a high quality final encode of the frame).

My understanding is that the search space for video encoding is SO big that smart heuristics for early-termination of search paths on CPUs beat the brute-force GPUs bring to the table, at least for high quality encoding. It's only compared to -preset ultrafast where you might reasonably choose HW encoding over x264, esp. if you have a slow CPU (like laptop with dual core and no hyperthreading). On a fast CPU (i7 quad core with hyperthreading), x264 superfast is probably going to be as fast, and look better (at the same bitrate).

If you're making an encode where rate-distortion (quality per file size) matters at all, you should use x264 -preset medium or slower. If you're archiving something, spending a bit more CPU time now will save bytes for as long as you're keeping that file around.
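To make that concrete, here's a rough comparison you can run yourself (filenames are made up; CRF 23 is just ffmpeg's libx264 default quality target):

```shell
# Same quality target, different speed / efficiency trade-offs:
ffmpeg -i in.mp4 -c:v libx264 -preset superfast -crf 23 fast.mkv
ffmpeg -i in.mp4 -c:v libx264 -preset medium    -crf 23 medium.mkv
# medium.mkv will typically be noticeably smaller at similar quality,
# at the cost of more CPU time spent searching the encoding space.
```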

side note, if you ever see messages from deadrats on a video forum, it's not going to be helpful. He's been wrong about most stuff he's talking about in every thread I've ever seen. His posts turned up in a couple threads I googled about x264 GPU encoding. Apparently he doesn't understand why it isn't easy, and has posted several times to tell the x264 developers why they're dumb...

Peter Cordes


Reputation: 1 993


2017 update:

ffmpeg supports h264 and h265 NVENC GPU-accelerated video encoding. You can do 1-pass or 2-pass encoding at the quality that you choose, with either hevc_nvenc or h264_nvenc, and even with an entry-level GPU it's much faster than non-accelerated encoding or Intel Quick Sync accelerated encoding.

2-pass high-quality encoding:

ffmpeg -i in.mp4 -vcodec h264_nvenc -preset slow out.mp4

1-pass default encoding:

ffmpeg -i in.mp4 -vcodec h264_nvenc out.mp4

NVENC ffmpeg help and options:

ffmpeg -h encoder=h264_nvenc

Use it, it's much faster than CPU encoding.

If you don't have an NVIDIA GPU you can use the Intel Quick Sync encoders, h264_qsv, hevc_qsv, or mpeg2_qsv, which are also much faster than non-accelerated encoding.
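For example (a sketch, assuming an Intel iGPU and an ffmpeg build with QSV support; -global_quality selects Quick Sync's constant-quality mode):

```shell
# Hardware H.264 encode on the Intel iGPU; lower -global_quality
# means higher quality / larger files, similar in spirit to CRF.
ffmpeg -i in.mp4 -c:v h264_qsv -global_quality 25 out.mp4
```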



Reputation: 41

Use it if you value speed (and low CPU usage) over quality per filesize. In some use-cases, e.g. streaming to Twitch, that's what you want (especially the low CPU usage). In others, e.g. encoding once to create a file that will be streamed / watched many times, you still aren't going to beat -c:v libx264 -preset slower (which is not that slow, like near realtime for 1920x1080p24 on a Skylake i7-6700k). -- Peter Cordes 2018-02-20T12:31:35.937


To elaborate a little further on what Peter says, in general using multiple processors helps in cases where you have several independent tasks that all need to be done but don't have dependencies on each other, or one task where you're performing the same math on massive amounts of data.

If, however, you need the output of calculation A as the input of calculation B, and the output of calculation B as the input to calculation C, then you can't speed it up by having a different core work on each task (A, B, or C), because one can't start until the other finishes.

However, even in the above case, you may be able to parallelize it another way. If you can break your input data into chunks, you may have one core work on doing A, then B, then C with one chunk of data, while another core works on doing A, then B, then C on a different chunk of data.
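As a toy sketch of that idea in shell (the three steps are made-up arithmetic standing in for A, B, and C): each job must run its chunk through A, then B, then C in order, but the four chunks have no dependencies on each other, so they run concurrently.

```shell
# A -> B -> C must run in order *within* a chunk, but separate chunks
# are independent, so each gets its own background job.
process_chunk() {
  local x=$1
  x=$((x + 1))   # step A
  x=$((x * 2))   # step B (needs A's output)
  x=$((x - 3))   # step C (needs B's output)
  echo "$x" > "out_$1.txt"
}
for chunk in 1 2 3 4; do
  process_chunk "$chunk" &   # chunks run in parallel
done
wait                         # join all background jobs
cat out_1.txt out_2.txt out_3.txt out_4.txt   # prints 1, 3, 5, 7, one per line
```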

There are other considerations, too. Maybe you could find a way to parallelize the calculations, but just reading the data from disk, or over the network, or sending it to the GPU will take longer than doing the calculations. In that case, it doesn't make sense to parallelize it because just getting the data into memory takes longer than the amount of time you save by doing the calculation in parallel.

In other words, it's as much an art as it is a science.



Reputation: 2 116

Oh, yes, x264 parallelizes quite well on multicore CPUs. It scales nearly linearly up to at least 8 cores, and decently even beyond 32. Motion estimation can be done in parallel, leaving only the necessarily-serial work for another thread, and similar tricks. -- Peter Cordes 2015-01-24T21:43:15.363

The question isn't parallelism in general, it's GPUs in particular. They're much more restrictive in the code you can get them to run than CPUs. I think it's because you can't have code with branches that go different ways on different blocks of the image. I don't understand exactly why, but I think it's something like that. Each stream processor is so simple, and with such limited means of having it run independently of the others, that either you always have to wait for the slowest one to finish, or you are limited in branching at all, or both. -- Peter Cordes 2015-01-24T21:46:46.467

If you had a cluster of computers (CPUs with independent RAM that didn't compete with each other for memory bandwidth and CPU cache), you'd break your input video into GOPs, and send sections of the still-compressed input video to be decoded and compressed on other machines in the cluster. So only compressed input or output video would have to be transferred. On a multicore shared-cache/RAM system, like even a multi-socket x86 workstation, you have multiple threads operate on the same frames at once. (It also means you don't need new code to do global ratecontrol for segmented encodes.) -- Peter Cordes 2015-01-24T21:50:56.097
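A rough sketch of that split-into-sections idea using ffmpeg's segment and concat muxers (the filenames and the 60-second segment length are made up, and the background jobs stand in for cluster machines; a real setup would also need to handle ratecontrol across segments):

```shell
# Split the compressed input on keyframes into ~60 s chunks (stream copy,
# no re-encode), so only compressed video would cross the network.
ffmpeg -i in.mkv -c copy -f segment -segment_time 60 chunk_%03d.mkv
# Each "machine" (here: a background job) encodes its own chunk...
for f in chunk_*.mkv; do
  ffmpeg -i "$f" -c:v libx264 -preset slow -crf 22 "enc_$f" &
done
wait
# ...then the encoded chunks are concatenated back together losslessly.
for f in enc_chunk_*.mkv; do echo "file '$f'"; done > list.txt
ffmpeg -f concat -i list.txt -c copy out.mkv
```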