Why is thread safety such a huge deal for Graphics APIs?



Both Vulkan and DirectX 12 are claimed to be usable in a thread-safe manner. People seem to be excited about that.

Why is this considered such a huge feature? The "real" processing gets thrown over the memory bridge to a separate processing unit anyway.

Also, if it is such a big deal, why has a thread-safe graphics API only appeared now?

ratchet freak

Posted 2015-08-04T21:47:25.783

Reputation: 4 695

This article is much more "gamer focused" but it might give you some insights... http://www.pcgamer.com/what-directx-12-means-for-gamers-and-developers/

– glampert – 2015-08-04T22:18:49.450



The main gain is that it becomes easier to divide CPU tasks across multiple threads without having to solve all the difficult issues around accessing the graphics API. Normally you would either have to make the context current on each thread (which can have bad performance implications) or provide a queue and call the graphics API from a single thread. I don't think any performance is gained that way, because the GPU processes the commands sequentially anyway, but it makes the developer's job a lot easier.
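The "provide a queue and call the graphics API from a single thread" pattern mentioned above can be sketched roughly like this. This is a toy C++ sketch, not any real engine's API; the name `RenderCommandQueue` and the closure-based commands are purely illustrative:

```cpp
#include <functional>
#include <mutex>
#include <queue>

// Game threads push closures describing API work; a single "render thread"
// owns the context and is the only one that ever touches the graphics API.
class RenderCommandQueue {
public:
    void push(std::function<void()> cmd) {
        std::lock_guard<std::mutex> lk(m_);
        q_.push(std::move(cmd));
    }
    // Called only from the render thread, e.g. once per frame.
    void drain() {
        for (;;) {
            std::function<void()> cmd;
            {
                std::lock_guard<std::mutex> lk(m_);
                if (q_.empty()) return;
                cmd = std::move(q_.front());
                q_.pop();
            }
            cmd();  // in a real engine: the actual GL/DX call
        }
    }
private:
    std::mutex m_;
    std::queue<std::function<void()>> q_;
};
```

Note that this pattern gives thread-safe *submission* but still serializes the actual API work on one thread, which is exactly why it adds convenience rather than performance.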

The reason it was not done until now is probably that DirectX and OpenGL were created at a time when multithreading was not really prevalent. Also, the Khronos board is very conservative about changing the API. Their view is that Vulkan will coexist alongside OpenGL, because the two serve different purposes. It probably was not until recently that parallelism became so important, as consumers gained access to more and more processor cores.

EDIT: I don't mean that no performance is gained from doing work on multiple CPU cores. Splitting your calls across multiple threads won't create textures or shaders any faster; rather, performance is gained by keeping more processor cores busy and keeping the GPU fed with work to perform.

Maurice Laveaux

Posted 2015-08-04T21:47:25.783

Reputation: 411

As an extra note, OpenGL generally only works on one thread, so a graphics-intensive app could max out one core. Something like Vulkan allows multiple threads to dispatch commands to a queue, which means many graphics calls can be made from multiple threads. – Soapy – 2015-08-05T07:46:37.030


There's a lot of work needed on the CPU to set up a frame for the GPU, and a good chunk of that work is inside the graphics driver. Prior to DX12 / Vulkan, that graphics driver work was essentially forced to be single-threaded by the design of the API.

The hope is that DX12 / Vulkan lift that restriction, allowing driver work to be performed in parallel on multiple CPU threads within a frame. This will enable more efficient use of multicore CPUs, allowing game engines to push more complex scenes without becoming CPU-bound. That's the hope—whether it will be realized in practice is something we'll have to wait to see over the next few years.

To elaborate a bit: the output of a game engine renderer is a stream of DX/GL API calls that describe the sequence of operations to render a frame. However, there's a great distance between the stream of API calls and the actual binary command buffers that the GPU hardware consumes. The driver has to "compile" the API calls into the GPU's machine language, so to speak. That isn't a trivial process—it involves a lot of translation of API concepts into low-level hardware realities, validation to make sure the GPU is never set into an invalid state, wrangling memory allocations and data, tracking state changes to issue the correct low-level commands, and so on and on. The graphics driver is responsible for all this stuff.

In DX11 / GL4 and earlier APIs, this work is typically done by a single driver thread. Even if you call the API from multiple threads (which you can do using DX11 deferred command lists, for example), it just adds some work to a queue for the driver thread to chew through later. One big reason for this is the state tracking I mentioned before. Many of the hardware-level GPU configuration details require knowledge of the current graphics pipeline state, so there's no good way to break up the command list into chunks that can be processed in parallel—each chunk would have to know exactly what state it should start with, even though the previous chunk hasn't been processed yet.
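The state-tracking problem described above can be made concrete with a toy model (all names here are hypothetical and this is nothing like real driver code): because each draw's encoding depends on whatever state was set earlier in the stream, a chunk starting in the middle of the stream cannot be compiled without replaying everything before it.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Toy model of a DX11/GL-style stateful driver. 'S' = set shader,
// 'B' = set blend mode, 'D' = draw N vertices. The encoding of each
// draw depends on all the state set before it, so the stream can only
// be "compiled" front to back.
struct DriverState { uint32_t shader = 0; uint32_t blendMode = 0; };

std::vector<uint32_t> compileStream(
        const std::vector<std::pair<char, uint32_t>>& apiCalls) {
    DriverState state;                  // mutable state carried across calls
    std::vector<uint32_t> gpuCmds;
    for (const auto& call : apiCalls) {
        switch (call.first) {
        case 'S': state.shader = call.second; break;     // only records state
        case 'B': state.blendMode = call.second; break;  // only records state
        case 'D':                       // draw: snapshots *all* current state
            gpuCmds.push_back(state.shader);
            gpuCmds.push_back(state.blendMode);
            gpuCmds.push_back(call.second);              // vertex count
            break;
        }
    }
    return gpuCmds;
}
```

A chunk beginning at call *k* cannot be compiled without first replaying calls 0..*k*-1 to reconstruct `state` — which is exactly the serialization described above.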

That's one of the big things that changed in DX12 / Vulkan. For one thing, they incorporate almost all the graphics pipeline state into one object, and for another (at least in DX12) when you start creating a command list you must provide an initial pipeline state; the state isn't inherited from one command list to the next. In principle, this allows the driver not to have to know anything about previous command lists before it can start compiling—and that in turn allows the application to break up its rendering into parallelizable chunks, producing fully-compiled command lists, which can then be concatenated together and sent to the GPU with a minimum of fuss.
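The DX12-style fix can be sketched with the same kind of toy model (again, illustrative names only, not the real API): because each chunk carries its initial pipeline state explicitly, every chunk can be compiled on a worker thread independently, and the results simply concatenated in submission order.

```cpp
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// DX12-style toy model: each command list (chunk) is created with an
// explicit initial pipeline state and inherits nothing from its
// predecessor, so chunks can be compiled independently.
struct Chunk {
    uint32_t initialPipeline;          // known up front, like a PSO
    std::vector<uint32_t> draws;
};

std::vector<uint32_t> compileChunk(const Chunk& c) {
    std::vector<uint32_t> cmds;
    cmds.push_back(c.initialPipeline); // no need to look at earlier chunks
    for (uint32_t d : c.draws) cmds.push_back(d);
    return cmds;
}

std::vector<uint32_t> compileParallel(const std::vector<Chunk>& chunks) {
    std::vector<std::future<std::vector<uint32_t>>> jobs;
    for (const Chunk& c : chunks)      // compile every chunk concurrently
        jobs.push_back(std::async(std::launch::async, compileChunk,
                                  std::cref(c)));
    std::vector<uint32_t> stream;      // then concatenate in submission order
    for (auto& j : jobs) {
        std::vector<uint32_t> part = j.get();
        stream.insert(stream.end(), part.begin(), part.end());
    }
    return stream;
}
```

Contrast this with the stateful model: here `compileChunk` never needs to look outside its own chunk, so there is nothing forcing the work onto one thread.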

Of course, there are many other changes in the new APIs, but as far as multithreading goes, that's the most important part.

Nathan Reed

Posted 2015-08-04T21:47:25.783

Reputation: 15 036


Modern GPUs generally have a single frontend section that processes an entirely linear stream of commands from the CPU. Whether this is a natural hardware design or if it simply evolved out of the days when there was a single CPU core generating commands for the GPU is debatable, but it's the reality for now. So if you generate a single linear stream of stateful commands, of course it makes sense to generate that stream linearly on a single thread on the CPU! Right?

Well, modern GPUs also generally have a very flexible unified backend that can work on lots of different things at once. Generally speaking, the GPU works on vertices and pixels at fairly fine granularity. There's not a whole lot of difference between a GPU processing 1024 vertices in one draw and 512+512 vertices in two different draws.

That suggests a fairly natural way to do less work: instead of throwing a huge number of vertices at the GPU in a single draw call, split your model into sections, do cheap coarse culling on those sections, and submit each chunk individually if it passes the culling test. If you do it at the right granularity you should get a nice speedup!
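The coarse culling idea can be sketched like this (a toy sketch with made-up types; a real engine would test bounding spheres against the full view frustum rather than a single plane):

```cpp
#include <vector>

// Toy coarse culling. Each section of the model carries a bounding sphere;
// we only submit sections whose sphere touches the visible side of a view
// plane given as (nx, ny, nz, d).
struct Section {
    float cx, cy, cz, radius;          // bounding sphere
    int drawId;                        // stand-in for the section's draw call
};

std::vector<int> cullAndSubmit(const std::vector<Section>& sections,
                               float nx, float ny, float nz, float d) {
    std::vector<int> submitted;
    for (const Section& s : sections) {
        float dist = nx * s.cx + ny * s.cy + nz * s.cz + d;
        if (dist > -s.radius)          // sphere intersects the visible side
            submitted.push_back(s.drawId);   // would issue a draw call here
    }
    return submitted;
}
```

The catch, as explained next, is that each of those extra submissions costs real CPU time under the older APIs.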

Unfortunately, in the current graphics API reality, draw calls are extremely expensive on the CPU. A simplified explanation of why: state changes on the GPU may not correspond directly to graphics API calls. Many API calls therefore simply set some state inside the driver, and the draw call that depends on that new state looks at all the state marked as changed since the last draw, writes it into the command stream for the GPU, and only then actually initiates the draw. All this work is done in an attempt to get a lean and mean command stream for the GPU frontend unit.
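That deferred "flush dirty state at draw time" behavior can be modeled in a few lines (a toy model only; the opcodes and names are invented, real drivers are vastly more complex):

```cpp
#include <cstdint>
#include <vector>

// Toy model of driver-side dirty-state tracking: Set* calls are cheap and
// only mark state as dirty; the draw call pays to flush whatever changed
// into the GPU command stream before issuing the draw itself.
struct ToyDriver {
    uint32_t shader = 0, blend = 0;
    bool shaderDirty = false, blendDirty = false;
    std::vector<uint32_t> stream;      // stand-in for the GPU command stream

    void setShader(uint32_t s) { shader = s; shaderDirty = true; }
    void setBlend(uint32_t b)  { blend = b;  blendDirty = true; }
    void draw(uint32_t verts) {
        if (shaderDirty) { stream.push_back(0x100u | shader); shaderDirty = false; }
        if (blendDirty)  { stream.push_back(0x200u | blend);  blendDirty = false; }
        stream.push_back(0x300u | verts);  // the draw command itself
    }
};
```

Only changed state gets re-emitted, which is why the cost lands on the draw call: a draw after many Set* calls does a pile of flushing work, while back-to-back draws with no state changes between them are comparatively cheap.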

What this boils down to is that you have a budget for draw calls which is entirely imposed by the overhead of the driver. (I think I heard that these days you can get away with about 5,000 per frame for a 60 FPS title.) You can increase that by a large percentage by building this command stream in parallel chunks.

There are other reasons too (for example, asynchronous timewarp for VR latency improvements), but this is a big one for CPU-bound games and other drawcall-heavy software (like 3D modeling packages).

John Calsbeek

Posted 2015-08-04T21:47:25.783

Reputation: 2 927