Vulkan now available as a rendering backend!

Posted on Jan 11, 2017

We are happy to announce that, as of earlier today, Banshee fully supports Vulkan as a graphics sub-system. Vulkan allows us to push performance even higher by providing lower-overhead graphics, multi-threaded rendering and much more control over the rendering process. We have also taken this opportunity to overhaul our low-level rendering API, making it more modern and capable of taking advantage of Vulkan-specific features.

At this point we can safely say Banshee has the most advanced open source rendering backend out there. We are now continuing work on the front-end side of things and will be adding features like physically based shading, HDR and gamma correct rendering, as well as a variety of other high fidelity graphical enhancements. Expect a major release with all these features in a few months!

Our entire Vulkan implementation is available on our GitHub right now, under the Source/BansheeVulkanRenderAPI folder! It's pretty sweet.

For those that want all the juicy details of the Vulkan implementation, keep on reading.

Implementation details

All the rendering in Banshee is done through an abstraction API. We call this the low-level rendering API. It allows us to perform rendering using different backends like DirectX or OpenGL without having to change any of the code that uses them. This allows systems like our renderer to work completely transparently across all available backends, giving us a massive amount of flexibility when it comes to adding or changing backends.

The API is very similar to DirectX or OpenGL (it's mostly based on DirectX 11 specifically). But Vulkan is a much more complex beast, requiring considerable resource state tracking and manual memory (de)allocation, and exposing features like command buffers, GPU queues and multiple GPUs. DirectX and OpenGL abstract all of those away from you, making their interfaces much simpler and more intuitive, but at a performance cost.

We wanted our API to remove the tedium of Vulkan's state tracking, manual memory allocation and other boilerplate, making it just as easy to use as DirectX/OpenGL, while still allowing the most important Vulkan additions to shine.

It was a complex problem to figure out, but we are very happy with the final result. Our implementation covers all the major DirectX/OpenGL functionality (everything but the most obscure features), with some important Vulkan additions:

  • First class support for GPU pipeline states and resource descriptors.
  • Support for command buffers, ensuring all rendering commands can be executed on worker threads and the rendering workload can be distributed across cores.
  • Support for multiple GPU queues, enabling async upload and async compute.
  • Support for multiple GPUs so you can execute commands and create resources on any available GPU.

Why do we believe it is better than any other open source Vulkan implementation currently out there?

  1. It's a complete implementation. Many implementations focus on a very limited subset of functionality and/or expect very specific usage. We have made our implementation very general and tried to cover a similar scope to existing DirectX/OpenGL APIs.
  2. No other implementation seems to treat GPU pipeline states and resource descriptors as first-class citizens in the API. Instead they abstract them behind APIs that look like DirectX/OpenGL. This means they must do much of the tracking under the hood, losing the reduced overhead they would get by using these features directly.
  3. Most open source implementations don't support multi-threaded rendering with command buffers. Rendering can be expensive and take up an entire CPU core, bottlenecking the GPU. This is also one of the most talked-about Vulkan features, yet a surprising number of implementations skip it!
  4. Support for multiple GPU queues seems to be non-existent in other implementations, even though it is a massively beneficial feature. Just a year ago async compute was all the rage, yet most engines seem to forget this is the feature that enables it!
  5. Other implementations don't support multiple devices. With Banshee you can explicitly control every available GPU, creating resources and executing commands on any of them. (Note: the Vulkan spec doesn't yet have full multi-GPU support and is expected to be expanded, so current usability is limited - other implementations missing out on this can therefore be forgiven.)

It's easy to say you have implemented Vulkan while skipping over all the important new features. We claim we have implemented it properly, but we'll gladly have you challenge us on that!

Those curious can check out this example, which demonstrates how to use our low-level rendering API in ~550 lines of code. If you're familiar with DirectX or OpenGL it should be a piece of cake to pick up, yet you'll be using the full power of Vulkan.

Want even more details? Oh, you're a curious one! Let's go into specifics about individual changes.

GPU pipelines

A GPU pipeline is a set of states that control either rendering or compute operations on the GPU. The pipeline includes a set of programmable stages and a set of fixed stages. The programmable stages are known as shaders (vertex, fragment, etc.), while the fixed stages handle features like rasterization, depth-stencil operations and blending.

Programmable stages can be programmed (duh) using high-level languages like GLSL or HLSL. Fixed stages cannot be programmed, but can often be customized via a set of parameters (e.g. enable or disable blending, choose the blending operation, etc.).
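
For instance, here is roughly how a fixed rasterizer stage could be customized in Banshee. This is only a brief sketch: treat the exact RASTERIZER_STATE_DESC field names as illustrative.

// Customize the fixed rasterizer stage via a set of parameters (illustrative sketch)
RASTERIZER_STATE_DESC rasterizerDesc;
rasterizerDesc.cullMode = CULL_NONE;       // Disable back-face culling
rasterizerDesc.polygonMode = PM_WIREFRAME; // Draw geometry as wireframe

// The resulting state object plugs into a pipeline, as the larger example below shows
SPtr<RasterizerState> rasterizerState = RasterizerState::create(rasterizerDesc);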

Older APIs like OpenGL and DirectX 11 have no external concept of a GPU pipeline. Instead you individually bind the GPU programs, and individually bind the state objects (containing customization parameters) for the fixed stages. With OpenGL it's even more fine-grained: you cannot (without extensions) set fixed-stage states as monolithic objects, but instead control individual parameters.

But this is a problem! Even though those APIs present those states and programs as something you can change on an individual basis, that's usually not how a modern GPU works internally. Instead there is a monolithic pipeline state that the GPU can switch between - and this switch comes with a cost.

This means that the driver is the one that needs to keep track of all these individual program and state changes, since internally even a single changed stage parameter might result in an entirely new pipeline state. That amounts to a lot of book-keeping by the driver, and to switching states on the fly as you draw objects.

These state transitions are also hidden from the user, who might not be aware that changing some state parameter will internally result in a pipeline switch, adding to the overhead even more. There are tricks developers can use to reduce this behaviour, but they are guesswork at best, since developers don't know how the driver operates, nor is there a specification that guarantees certain behaviour.

All this amounts to additional overhead and more work for the CPU.

However, if the GPU pipeline object is exposed through the render API, the engine can manually create and bind it, and therefore knows exactly when and how these pipeline state transitions happen. The engine usually knows exactly which programs and states it will use, allowing these objects to be created once, without any additional book-keeping. This is why modern APIs like Vulkan expose the GPU pipeline state as one monolithic object: it gives the engine full control, minimizes driver overhead and makes the whole process fully transparent to the engine.

So far, Banshee used the DX11 approach, where individual GPU programs and blend, rasterizer and depth-stencil states were bindable via the RenderAPI. With the addition of Vulkan we have refactored the API so it also uses monolithic GPU pipeline objects: GraphicsPipelineState for graphics operations and ComputePipelineState for compute operations. We have removed the RenderAPI methods for binding individual GPU programs or fixed states, and instead expect the user to create a pipeline object.

Here's an example of how simple it is to set up a graphics pipeline with a couple of shaders:

//////////////////////////////////////////////////////////////////////////////////////////////
// Example of a graphics pipeline state with a fragment + vertex shader, and enabled blending
//////////////////////////////////////////////////////////////////////////////////////////////

// Vertex program GLSL source
const char* vertProgSrc = R"(
    layout (binding = 0, std140) uniform GUIParams
    {
        mat4 gWorldTransform;
        float gInvViewportWidth;
        float gInvViewportHeight;
        vec4 gTint;
    };      

    layout (location = 0) in vec3 bs_position;
    layout (location = 1) in vec2 bs_texcoord0;

    layout (location = 0) out vec2 texcoord0;

    out gl_PerVertex
    {
        vec4 gl_Position;
    };

    void main()
    {
        vec4 tfrmdPos = gWorldTransform * vec4(bs_position.xy, 0, 1);

        float tfrmdX = -1.0f + (tfrmdPos.x * gInvViewportWidth);
        float tfrmdY = 1.0f - (tfrmdPos.y * gInvViewportHeight);    

        gl_Position = vec4(tfrmdX, tfrmdY, 0, 1);
        texcoord0 = bs_texcoord0;
    }
)";

// Fragment program GLSL source
const char* fragProgSrc = R"(
    layout (binding = 0, std140) uniform GUIParams
    {
        mat4 gWorldTransform;
        float gInvViewportWidth;
        float gInvViewportHeight;
        vec4 gTint;
    };  

    layout (binding = 1) uniform sampler2D gMainTexture;

    layout (location = 0) in vec2 texcoord0;
    layout (location = 0) out vec4 fragColor;

    void main()
    {
        vec4 color = texture(gMainTexture, texcoord0.st);
        fragColor = color * gTint;
    }
)";

// Descriptor structures used for creating the GPU programs
GPU_PROGRAM_DESC vertProgDesc;
vertProgDesc.type = GPT_VERTEX_PROGRAM;
vertProgDesc.entryPoint = "main";
vertProgDesc.language = "GLSL";
vertProgDesc.source = vertProgSrc;

GPU_PROGRAM_DESC fragProgDesc;
fragProgDesc.type = GPT_FRAGMENT_PROGRAM;
fragProgDesc.entryPoint = "main";
fragProgDesc.language = "GLSL";
fragProgDesc.source = fragProgSrc;

// Descriptor structures used for setting blend and depth-stencil states
BLEND_STATE_DESC blendDesc;
blendDesc.renderTargetDesc[0].blendEnable = true;
blendDesc.renderTargetDesc[0].renderTargetWriteMask = 0b0111; // RGB, don't write to alpha
blendDesc.renderTargetDesc[0].blendOp = BO_ADD;
blendDesc.renderTargetDesc[0].srcBlend = BF_SOURCE_ALPHA;
blendDesc.renderTargetDesc[0].dstBlend = BF_INV_SOURCE_ALPHA;

DEPTH_STENCIL_STATE_DESC depthStencilDesc;
depthStencilDesc.depthWriteEnable = false;
depthStencilDesc.depthReadEnable = false;

// Create pipeline state descriptor
PIPELINE_STATE_DESC pipelineDesc;
pipelineDesc.blendState = BlendState::create(blendDesc);
pipelineDesc.depthStencilState = DepthStencilState::create(depthStencilDesc);
pipelineDesc.vertexProgram = GpuProgram::create(vertProgDesc);
pipelineDesc.fragmentProgram = GpuProgram::create(fragProgDesc);

// And finally, create the pipeline
SPtr<GraphicsPipelineState> pipelineState = GraphicsPipelineState::create(pipelineDesc);

Once created, the pipeline state can be bound for rendering using RenderAPI. This ensures there is no magic behind the scenes: the user is in full control of state transitions, making draw/dispatch calls much cheaper.

RenderAPI::setGraphicsPipeline(pipelineState);
// At this point you would bind GPU program parameters, vertex and index buffers, and issue a draw call
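
To flesh out that comment, a complete draw might look roughly as follows. This is only a sketch: vertexBuffer, indexBuffer, vertexDecl, numIndices and numVertices are assumed to have been created elsewhere, and gpuParams is the parameter object covered in the next section.

// A rough sketch of a complete draw (all bound objects assumed created earlier)
RenderAPI::setGraphicsPipeline(pipelineState);
RenderAPI::setGpuParams(gpuParams);               // GPU program parameters (see the next section)
RenderAPI::setVertexBuffers(0, &vertexBuffer, 1); // Bind a single vertex buffer to slot 0
RenderAPI::setIndexBuffer(indexBuffer);
RenderAPI::setVertexDeclaration(vertexDecl);      // Describes the layout of the vertex buffer
RenderAPI::setDrawOperation(DOT_TRIANGLE_LIST);
RenderAPI::drawIndexed(0, numIndices, 0, numVertices, 0);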

If you use Banshee Shading Language for your shaders, you can fully define a GPU pipeline state (or multiple) within the .bsl file.

//////////////////////////////////////////////////////////////////////////////////////////////
// Example of the pipeline state we created above, in BSL
//////////////////////////////////////////////////////////////////////////////////////////////
Parameters =
{
    mat4x4  gWorldTransform;
    float   gInvViewportWidth;
    float   gInvViewportHeight;
    color   gTint;

    Sampler2D   gMainTexSamp : alias("gMainTexture");
    Texture2D   gMainTexture;
};

Blocks = 
{
    Block GUIParams;
};

Technique =
{
    Language = "GLSL";

    // Single pass == single pipeline state
    Pass =
    {
        Target = 
        {
            Blend = true;
            Color = { SRCA, SRCIA, ADD };
            WriteMask = RGB;
        };  

        DepthRead = false;
        DepthWrite = false;

        Common =
        {
            layout (binding = 0, std140) uniform GUIParams
            {
                mat4 gWorldTransform;
                float gInvViewportWidth;
                float gInvViewportHeight;
                vec4 gTint;
            };          
        };      

        Vertex =
        {
            layout (location = 0) in vec3 bs_position;
            layout (location = 1) in vec2 bs_texcoord0;

            layout (location = 0) out vec2 texcoord0;

            out gl_PerVertex
            {
                vec4 gl_Position;
            };

            void main()
            {
                vec4 tfrmdPos = gWorldTransform * vec4(bs_position.xy, 0, 1);

                float tfrmdX = -1.0f + (tfrmdPos.x * gInvViewportWidth);
                float tfrmdY = 1.0f - (tfrmdPos.y * gInvViewportHeight);    

                gl_Position = vec4(tfrmdX, tfrmdY, 0, 1);
                texcoord0 = bs_texcoord0;
            }
        };

        Fragment =
        {
            layout (binding = 1) uniform sampler2D gMainTexture;

            layout (location = 0) in vec2 texcoord0;
            layout (location = 0) out vec4 fragColor;

            void main()
            {
                vec4 color = texture(gMainTexture, texcoord0.st);
                fragColor = color * gTint;
            }
        };
    };
};

Using the API in this way requires almost no extra work from the user, yet it can result in significant performance gains due to lower CPU usage. Additionally, because pipeline states are monolithic, it's easy to check whether objects are rendered using identical states, which makes it easier to sort them and reduce the number of pipeline state transitions even further, as the sketch below illustrates.
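
A renderer could, for example, sort its draw list by pipeline identity before submission. The RenderEntry type below is hypothetical and purely for illustration (std::sort requires <algorithm>):

// Hypothetical per-draw entry, not part of Banshee's API
struct RenderEntry
{
    SPtr<GraphicsPipelineState> pipeline;
    // ... GPU parameters, buffers, etc.
};

// Group entries sharing a pipeline so consecutive draws avoid state transitions
void sortByPipeline(Vector<RenderEntry>& entries)
{
    std::sort(entries.begin(), entries.end(),
        [](const RenderEntry& a, const RenderEntry& b)
    {
        return a.pipeline.get() < b.pipeline.get();
    });
}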

Resource descriptors

The second major API change we made concerns how GPU program parameters work. Almost all GPU programs need some input parameters, like albedo or normal maps to access in your fragment program, or a buffer of bone matrices for your vertex program. Older APIs required you to set those resources manually, often individually, before every draw/dispatch call. This has an overhead, as redundant operations are performed for each call; more importantly, the driver has to perform book-keeping and verification of those resources each time they're bound, adding to the overhead.

In most real-time applications, many draw/dispatch calls will use the same set of parameters across multiple frames (e.g. most objects in your game level will have a constant texture, position, bind poses, etc.). Often those parameters won't change during the entire runtime of the application. Therefore modern APIs expose the concept of resource descriptors, which represent sets of GPU parameters that can be permanently stored on the GPU, without individually binding them for every draw/dispatch call.

They offer two main benefits:

  1. Their structure is known beforehand and their layout is shared with a specific GPU pipeline. This means there is less overhead when binding parameters, as the driver needs to do less work to actually verify what you have bound is valid.
  2. They are grouped in sets, which are permanently stored on the GPU. The sets only need to be updated when a parameter within it actually changes. The sets themselves are then bound before draw/dispatch calls (instead of individual parameters), ensuring all parameters for the entire pipeline are bound at once.

To make use of resource descriptors, Banshee has extended the GpuParams object. Previously this object held all the parameters for a single GPU program, but we have now refactored it slightly:

  • It used to be a simple storage container, but it is now part of the render API interface that backends can implement. This means APIs like Vulkan can override it and use it to manage resource descriptors internally, while APIs that don't use resource descriptors can treat it as a plain parameter container.
  • It no longer stores parameters for individual GPU programs, but rather for entire pipelines. It must receive a pipeline when created, ensuring its layout matches the pipeline perfectly.

An example of how to create, set and use GPU parameters:

//////////////////////////////////////////////////////////////////////////////////////////////
// Creating and setting GPU parameters for the pipeline we created above
////////////////////////////////////////////////////////////////////////////////////////////// 

// Create a container object to hold the parameters for all GPU programs in the pipeline state
SPtr<GpuParams> gpuParams = GpuParams::create(pipelineState);

// Create a structure that will hold our uniform block variables
struct UniformBlock
{
    Matrix4 gWorldTransform;
    float gInvViewportWidth;
    float gInvViewportHeight;
    Color gTint;
};

// Fill out the uniform block variables
UniformBlock uniformBlock;
uniformBlock.gWorldTransform = Matrix4::IDENTITY;
uniformBlock.gInvViewportWidth = 1.0f / 1920.0f;
uniformBlock.gInvViewportHeight = 1.0f / 1080.0f;
uniformBlock.gTint = Color::White;

// Create a uniform block buffer for holding the uniform variables
SPtr<GpuParamBlockBuffer> uniformBuffer = GpuParamBlockBuffer::create(sizeof(UniformBlock));
uniformBuffer->write(0, &uniformBlock, sizeof(uniformBlock));

// Assign the uniform buffer to set 0, binding 0
gpuParams->setParamBlockBuffer(0, 0, uniformBuffer);

// Import a texture to assign to the gMainTexture parameter
HTexture texture = gImporter().import<Texture>("myTexture.png");

// Assign the texture to set 0, binding 1
gpuParams->setTexture(0, 1, texture);

// Bind the GPU parameters for use
RenderAPI::setGpuParams(gpuParams);
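
Once created, the same GpuParams object can be reused across frames: only data that actually changed needs rewriting, and under Vulkan the underlying descriptor sets stay untouched. A short sketch, reusing the objects from the example above:

// On a later frame, update just the changed uniform data...
uniformBlock.gTint = Color::Red; // Hypothetical per-frame change
uniformBuffer->write(0, &uniformBlock, sizeof(uniformBlock));

// ...then re-bind; no per-parameter re-binding or descriptor rebuild is required
RenderAPI::setGpuParams(gpuParams);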

Command buffers

Rendering using the DX11/OpenGL APIs can be a very CPU intensive operation, often becoming a bottleneck and taking up an entire CPU core. This is why engines concerned with performance perform rendering on a separate thread; those that don't are usually severely limited when it comes to scene complexity, as rendering has to share a thread with other operations.

We have tried to make Banshee as multi-threaded as possible from the start, and it already has a separate rendering thread. But even with a separate rendering thread, engines will often still be bottlenecked by the CPU if the scene is complex enough.

Vulkan helps solve this issue with the introduction of command buffers. Instead of executing render commands through a single access point on the render thread, command buffers allow rendering commands to be queued from multiple threads. Each command buffer can be populated on a completely separate thread, so the rendering work can be broken up across multiple worker threads. This distributes the CPU load over multiple cores and significantly reduces the chance that the CPU will bottleneck the GPU.

To support this, we have extended our render API so every rendering command accepts a command buffer argument. All of those commands are now thread safe and can be executed from worker threads in parallel. Once a command buffer's commands have been generated, it is submitted for execution on the rendering thread.

//////////////////////////////////////////////////////////////////////////////////////////////
// Example for executing 8000 draw calls distributed over eight threads
////////////////////////////////////////////////////////////////////////////////////////////// 

// Retrieve the core thread's render API interface
RenderAPI& rapi = RenderAPI::instance();

// Create eight command buffers we'll use for parallel command submission
SPtr<CommandBuffer> commandBuffers[8];
for (UINT32 i = 0; i < 8; i++)
    commandBuffers[i] = CommandBuffer::create(GQT_GRAPHICS); // Command buffers running on the graphics queue

// Worker that queues 1000 different draw calls on a command buffer with the provided index
// (For simplicity, assuming you have created relevant pipeline states, GPU parameters, index/vertex buffers
// and vertex declarations earlier)
auto renderWorker = [&](UINT32 idx)
{
    SPtr<CommandBuffer> cb = commandBuffers[idx];
    for(UINT32 i = 0; i < 1000; i++)
    {
        UINT32 entryIdx = idx * 1000 + i;

        rapi.setGraphicsPipeline(pipelineStates[entryIdx], cb);
        rapi.setGpuParams(gpuParams[entryIdx], cb);
        rapi.setVertexBuffers(0, &vertexBuffers[entryIdx], 1, cb);
        rapi.setIndexBuffer(indexBuffers[entryIdx], cb);
        rapi.setVertexDeclaration(vertexDeclarations[entryIdx], cb);
        rapi.setDrawOperation(DOT_TRIANGLE_LIST, cb);
        rapi.drawIndexed(0, numIndices[entryIdx], 0, numVertices[entryIdx], 0, cb);
    }
};

// Run all of our worker threads
SPtr<Task> tasks[8];
for(UINT32 i = 0; i < 8; i++)
{
    tasks[i] = Task::create("Render", std::bind(renderWorker, i));
    TaskScheduler::instance().addTask(tasks[i]);
}

// Block this thread until all workers are done populating the command buffers
for (UINT32 i = 0; i < 8; i++)
    tasks[i]->wait();

// Submit all the command buffers for execution
for (UINT32 i = 0; i < 8; i++)
    rapi.submitCommandBuffer(commandBuffers[i]);

Async shaders

The concept of multiple GPU engines and multiple queues is something that isn't talked about much. In fact, most current Vulkan implementations I've seen completely skip over this functionality, even though it is very useful. In the mainstream media this feature is often called "async compute" or "async shaders", but it is in fact a bit more general than that.

Modern GPUs usually have a multi-engine architecture. These engines can be separated into three categories:

  • Graphics - Performs shading, rasterization and blending. Also performs everything the Compute and Transfer engines do.
  • Compute - Performs compute operations. Also performs everything the Transfer engine does.
  • Transfer - Performs data transfers between the CPU and GPU.

The exact multi-engine setup is GPU specific, so this is just a rough picture, but it is how Vulkan represents those engines.

Why are these engines useful? Let's look at a couple of examples:

  • When an application requires some data to be streamed to the GPU, the transfer will interrupt rendering unless it is done on an engine separate from the one rendering is happening on. Unless the rendering actually depends on that data, this is completely unnecessary, and it will often introduce serious stuttering while data is being streamed. Modern desktop GPUs almost always have at least one separate transfer engine, and we should strive to make use of it.
  • Modern GPUs don't perform all computation on the same hardware; there are different sets of units for different tasks. All shaders execute on programmable compute units, but operations like blending and rasterization use their own dedicated units. So if an operation is bottlenecked by the blend or rasterization units, there is no reason an unrelated compute shader couldn't be utilizing the compute units in the meantime. With a single engine this is not possible. This is where compute engines, or multiple graphics engines, come in: by queuing operations on different engines you ensure the GPU can utilize those otherwise idle resources. Support for this is a little less widespread in modern GPUs, but it's likely to become more common.

Vulkan exposes all available GPU engines in the form of queues. There are graphics, compute and transfer queues, mapping to the engines mentioned above. When command buffers are submitted, the developer can choose which queue to submit them to. By submitting command buffers on different queues you ensure the GPU can best utilize its resources, keeping your game running smoothly even during transfer operations.

In Banshee, command buffers can be created on graphics or compute queue types, with up to eight queues per type. We also allow users to run any transfer operation on a queue of their choice. This way you can have command buffers dedicated to compute operations that run alongside graphics operations, or upload/download data alongside either. Internally your GPU might not support that many queues, or might not support some types, but Banshee takes care of distributing command buffer submissions over all available queues.

Use of multiple queues does require some special consideration from the developer. When multiple queues reference the same resources, the user must set up dependencies between those queues. For example, you cannot issue a transfer operation on one queue and use the targeted resource on another without synchronizing access; otherwise the second queue might attempt to read the resource before the transfer completes. We handle such dependencies via "sync masks", which are provided during command buffer submission. A sync mask consists of bits, each mapping to a queue the submission depends upon. Once a mask with a set of queues is provided, the current submission will wait until the dependent queues complete before executing.

//////////////////////////////////////////////////////////////////////////////////////////////
// Example that uses three different queues in order to allow the GPU to maximize its usage:
// - On a graphics queue we perform shadow mapping, notable for being heavy on the fixed portion of the pipeline,
//   while leaving the programmable units mostly unused.
// - On the compute queue we perform some GPU particle simulation. This allows the GPU to make use of those
//   unused programmable units.
// - Finally, our game is an open world and we need to stream in new meshes & textures so we can display them
//   on the next frame, which we do on the upload queue.
////////////////////////////////////////////////////////////////////////////////////////////// 

while (!gameQuit()) // Main rendering loop
{
    // Create a command buffer to render on
    SPtr<CommandBuffer> renderCB = CommandBuffer::create(GQT_GRAPHICS);

    // And one to execute compute on
    SPtr<CommandBuffer> computeCB = CommandBuffer::create(GQT_COMPUTE);

    // Create a pipeline for shadow mapping
    PIPELINE_STATE_CORE_DESC pipelineDesc = ...;
    SPtr<GraphicsPipelineStateCore> shadowPipeline = GraphicsPipelineStateCore::create(pipelineDesc);

    // Create a pipeline for compute
    SPtr<GpuProgramCore> computeProgram = ...;
    SPtr<ComputePipelineStateCore> computePipeline = ComputePipelineStateCore::create(computeProgram);

    // Retrieve the core thread's render API interface
    RenderAPICore& rapi = RenderAPICore::instance();

    // Add commands for rendering
    rapi.setGraphicsPipeline(shadowPipeline, renderCB);
    for (UINT32 i = 0; i < numSceneObjects; i++)
    {
        // Bind GPU params, index & vertex buffers, vertex declaration...
        rapi.drawIndexed(...);
    }

    // Add commands for compute
    rapi.setComputePipeline(computePipeline, computeCB);
    rapi.dispatchCompute(...);

    // Submit command buffers for execution
    //// Note that we add a sync mask for the render command buffer. This sync mask ensures that all upload operations
    //// from the previous frame complete before the rendering executes (these are the operations we perform down below).
    UINT32 syncMask = CommandSyncMask::getGlobalQueueIdx(GQT_UPLOAD, 0);

    rapi.submitCommandBuffer(renderCB, syncMask);
    rapi.submitCommandBuffer(computeCB);

    // Stream new texture data (assuming we created the textures and loaded the data into system memory already)
    UINT32 uploadQueueIdx = CommandSyncMask::getGlobalQueueIdx(GQT_UPLOAD, 0); // Note this is the same queue we used in the sync mask

    Vector<SPtr<TextureCore>> texturesToStream = ...;
    UINT32 idx = 0;
    for (auto& entry : texturesToStream)
    {
        // Note that we don't create a command buffer for load & store operations. Instead such operations accept
        // a queue index on which to execute.
        entry->writeData(textureData[idx], 0, 0, true, uploadQueueIdx);
        idx++;
    }
}

Multi-GPU

Vulkan doesn't yet fully support the use of multiple GPUs, but we felt it was important to support them from a design standpoint, as it would be difficult to retrofit later.

Until DX12/Vulkan came along, developers were (mostly) at the mercy of driver developers as to whether their game supported multi-GPU technology like SLI or Crossfire. The best they could do was create a renderer following certain guidelines that make enabling multi-GPU possible, but they could never explicitly manage it from within their application, simply because APIs like DX11/OpenGL didn't provide that functionality.

DirectX 12 already supports multi-GPU within the application, and Vulkan supports it in a limited fashion (with better support promised). More specifically, you are allowed to create resources and execute commands on multiple GPUs, but you cannot directly utilize SLI/Crossfire functionality: the GPUs cannot communicate directly, and only one GPU can be used for presenting images. This means that right now you can use a non-primary GPU for various helper tasks, then transfer the results to the primary GPU through the CPU.

Currently, Banshee lets you use multiple GPUs in this explicit manner. We provide an interface for sharing resources between GPUs and using such resources transparently across them. You can create command buffers to render or execute compute commands on multiple GPUs. So far the interface is similar to what DX12 offers, using bitmasks to specify which devices resources/operations are intended for.

As more support is added we will keep working on this API to make its use as simple and transparent as possible.

//////////////////////////////////////////////////////////////////////////////////////////////
// Example that creates a resource shared among multiple GPUs, and executes two separate
// command buffers on two separate GPUs.
////////////////////////////////////////////////////////////////////////////////////////////// 
TEXTURE_DESC desc;
desc.width = 1024;
desc.height = 1024;
desc.type = TEX_TYPE_2D;
desc.format = PF_R8G8B8A8;

// Create a texture on the primary and secondary GPU
SPtr<TextureCore> texture = TextureCore::create(desc, (GpuDeviceFlags)(GDF_PRIMARY | GDF_GPU2));

SPtr<PixelData> whitePixelData = PixelData::create(1024, 1024, 1, PF_R8G8B8A8);
for (UINT32 y = 0; y < 1024; y++)
    for (UINT32 x = 0; x < 1024; x++)
        whitePixelData->setColorAt(Color::White, x, y);

// Any data writes happen on both GPUs
texture->writeData(*whitePixelData);

// Create a command buffer on the main GPU
SPtr<CommandBuffer> mainCB = CommandBuffer::create(GQT_GRAPHICS, 0); // Main GPU = index 0
// ... do main rendering on the main command buffer

// If user has a secondary GPU, potentially use it for something
RenderAPICore& rapi = RenderAPICore::instance();
if (rapi.getNumDevices() > 1)
{
    SPtr<CommandBuffer> secondaryCB = CommandBuffer::create(GQT_GRAPHICS, 1);

    // Queue and submit commands as normal. Useful for offloading operations that are not on the critical path.

    // The texture we created above can be used on both devices. The caller must ensure appropriate device
    // flags are set, depending on where the resource is going to be used.

    rapi.submitCommandBuffer(secondaryCB);
}

rapi.submitCommandBuffer(mainCB);

Conclusion

Did you seriously just read through all of that? Jeez! Hopefully it was informative.

Also, Banshee is looking for contributors with C++ experience. There's a lot of stuff to work on. We have huge plans for the future! See if there's anything you like and contact us.
