Multithreaded Rendering

CS / Game Programming, 2024. 3. 29. 15:11

Motivations for Multithreaded Rendering

Resource Generation in Serial Execution

Applications must generate all resources before rendering operations begin.
Some resources are available only after additional processing of the original data.
More complex applications may need more resources than fit in memory, so they stall while reading data from the hard disk during execution.

Pipeline Operation

A set of API calls required to render a frame
- The CPU can become the bottleneck for the overall performance of the application.
- Need to increase the batch size.
- Batch : a set of objects that can be rendered in a single draw call without changing state.

Parallelize API calls required to render a frame
- In the ideal case, the processing time approaches the total API call time divided by the number of processing cores.
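
As a rough model (an assumption for illustration, not a measured result), let T_api be the total CPU time of one frame's API calls, N the number of cores recording in parallel, and T_exec the serial time the main thread still needs to execute the finished command lists. The best case is then:

```latex
T_{\text{frame}} \approx \frac{T_{\text{api}}}{N} + T_{\text{exec}}
% Illustrative numbers only: T_api = 8 ms, N = 4, T_exec = 1 ms
% T_frame \approx 8/4 + 1 = 3 ms, compared with 8 ms on a single thread
```

Real gains are smaller than this bound, because the work rarely splits evenly and recording a command list has its own overhead.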

Command list : encapsulation of a series of API calls into an object.
The basic goal : increasing the scalability of the rendering software.
- Distributing API calls across multiple threads.
- Reducing the number of sequential API calls using a command list.

Direct3D 11 Threading Model

Device Interface

Interface segmentation
- Device interface : responsible for resource generation.
- Device context interface : operates the pipeline directly.
Free-threaded
- Safe for multiple threads to call its methods simultaneously.
- Designed to be reentrant instead of relying on heavy synchronization.
A thread that loads resources from the disk can create texture and buffer resources directly from the loaded data.
This minimizes the amount of communication required between threads.

Device Context Interface

Immediate context
- State-setting methods are passed to the driver right after the call, i.e. executed 'immediately'.
- Not thread-safe, because the interface does not use any synchronization primitives.
Deferred context
- Used by only one thread.
- 'Defers' all state and pipeline requests to a later point in time.
- Each state-setting, draw, and dispatch call is added to the command list instead of being executed directly.
- Does not support direct query execution.
- Cannot read data back from resources.

Command List

Immutable : once a command list is created, its recorded operations cannot be changed.
No limit on the number of times it can be executed.
FinishCommandList() : called on a deferred context; creates a command list from the graphics commands recorded so far.
ExecuteCommandList() : submits the commands in the command list on the immediate context or on another deferred context.
Release() : releases the command list after use.
A command list fixes which resources its commands reference, but not the contents of those resources.

Use of the command list object
1. Create a command list every frame and release it after use.
  - Any state used to render an object can be updated dynamically and easily.
  - Distributes API call costs across multiple threads.
2. Create a command list once and use it over multiple frames, updating only 'resources' each frame if necessary.
  - Reduces the cost of generating a command list every frame.
  - All dynamic state for rendering objects must live in resources, so the application may become more complex.
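
The second usage pattern can be sketched with a plain C++ model. Nothing here is a D3D11 call: `ConstantBuffer`, `CommandList`, `recordDraw`, and `renderFrames` are hypothetical stand-ins, chosen only to show that a recorded list keeps a reference to a resource while the resource contents stay free to change between executions.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GPU constant buffer: just a CPU-side value.
struct ConstantBuffer {
    float worldX = 0.0f;
};

// Hypothetical stand-in for a command list: a sequence of recorded calls
// with no interface for changing them after construction.
class CommandList {
public:
    using Command = std::function<void(std::vector<float>&)>;
    explicit CommandList(std::vector<Command> commands)
        : commands_(std::move(commands)) {}

    // Replays the recorded commands; the list itself is never modified.
    void execute(std::vector<float>& log) const {
        for (const Command& c : commands_) c(log);
    }

private:
    std::vector<Command> commands_;
};

// Records one "draw" that captures a reference to the buffer, not a copy
// of its contents.
CommandList recordDraw(const std::shared_ptr<ConstantBuffer>& cb) {
    std::vector<CommandList::Command> cmds;
    cmds.push_back([cb](std::vector<float>& log) { log.push_back(cb->worldX); });
    return CommandList(std::move(cmds));
}

// Reuses one command list for several frames, updating only the buffer.
std::vector<float> renderFrames(int frameCount) {
    auto cb = std::make_shared<ConstantBuffer>();
    const CommandList list = recordDraw(cb);  // created once
    std::vector<float> log;
    for (int frame = 0; frame < frameCount; ++frame) {
        cb->worldX = static_cast<float>(frame);  // per-frame resource update
        list.execute(log);                       // same list, new contents
    }
    return log;
}
```

In this model `renderFrames(3)` logs the values 0, 1, and 2 even though the list was recorded only once, which is the behavior the reuse pattern relies on.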

Use of Device and Device Interface

At most one context / thread per CPU core is recommended.
Each deferred-context thread creates a command list containing its share of the rendering operations.
When the command lists are complete, the immediate context executes them one by one in the appropriate order.
Example : creating command lists from 4 deferred contexts and executing them in the immediate context.
- Reduces the overall time to submit rendering requests.
- The command list is created in a form that the driver can execute quickly and efficiently.
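
The record-in-parallel, execute-in-order flow described above can be sketched in portable C++. This is a model under stated assumptions, not D3D11 code: `CommandList` is a hypothetical list of call names, `recordChunk` plays the role of a worker using its own deferred context, and the final loop plays the role of the immediate context.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command list: the recorded call names.
using CommandList = std::vector<std::string>;

// Work done by one worker on its own "deferred context". Each worker
// writes only to its own slot, so no locking is needed.
void recordChunk(int chunkIndex, CommandList& out) {
    out.push_back("SetState(chunk " + std::to_string(chunkIndex) + ")");
    out.push_back("Draw(chunk " + std::to_string(chunkIndex) + ")");
}

// Records `workerCount` command lists in parallel, then "executes" them
// on the calling thread in index order, as the immediate context would.
std::vector<std::string> buildAndSubmitFrame(int workerCount) {
    std::vector<CommandList> lists(workerCount);
    std::vector<std::thread> workers;
    for (int i = 0; i < workerCount; ++i) {
        workers.emplace_back(recordChunk, i, std::ref(lists[i]));
    }
    for (std::thread& t : workers) t.join();  // all lists are complete

    std::vector<std::string> submitted;  // what the "GPU" receives
    for (const CommandList& list : lists) {
        submitted.insert(submitted.end(), list.begin(), list.end());
    }
    return submitted;
}
```

The submission order depends only on the index of each list, not on which worker finished first, so the frame is deterministic even though recording is not.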

Context Pipeline State Propagation

Context State

Immediate context : bool parameter of ExecuteCommandList()
- True
  1. The current pipeline state of the immediate context is saved.
  2. The saved pipeline state is restored after the command list executes.
- False
  1. The current pipeline state is simply discarded.
  2. The immediate context returns to the default state after the command list executes.
Deferred context : bool parameter of FinishCommandList()
- True : the state from before the call is preserved.
- False : the context is reset to the default pipeline state.
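
The effect of the bool parameter on the immediate context can be illustrated with a small model. This is not the D3D11 API: `PipelineState`, `ImmediateContextModel`, and `stateAfterExecute` are hypothetical names, and the two integer fields are an arbitrary reduction of the real pipeline state.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical, heavily reduced pipeline state.
struct PipelineState {
    int renderTarget = 0;  // 0 stands for "default / nothing bound"
    int shader = 0;
};

using CommandList = std::vector<std::function<void(PipelineState&)>>;

// Model of an immediate context. It only mimics the effect of the
// restore flag described above.
struct ImmediateContextModel {
    PipelineState state;

    void executeCommandList(const CommandList& list, bool restoreState) {
        const PipelineState saved = state;  // only needed when restoring
        state = PipelineState{};            // the list starts from defaults
        for (const auto& cmd : list) cmd(state);
        state = restoreState ? saved : PipelineState{};
    }
};

// Helper for experimenting: the context state after a one-command list.
PipelineState stateAfterExecute(bool restoreState) {
    ImmediateContextModel ctx;
    ctx.state.renderTarget = 7;  // state set before the call
    ctx.state.shader = 3;
    CommandList list;
    list.push_back([](PipelineState& s) { s.shader = 42; });
    ctx.executeCommandList(list, restoreState);
    return ctx.state;
}
```

With `true` the model returns to render target 7 and shader 3; with `false` both fields fall back to 0. In neither case does the state set inside the command list leak into the context afterwards.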

Performance Consideration

The main factors to consider
- Command list size
- How the application uses command lists
Reset the pipeline state
- If the command lists are very long.
- If the command lists do not share settings.
Preserve the pipeline state
- If the application uses multiple small command lists in one frame.
- If the command lists have pipeline configuration settings in common.
Resetting the pipeline state after execution is faster than saving and restoring it, so there is no reason to save the state when executing multiple command lists back to back.
The following call checks whether the driver supports multithreading.

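D3D11_FEATURE_DATA_THREADING ThreadingOptions = {};

// Ask the driver which threading features it implements itself.
HRESULT hr = m_pDevice->CheckFeatureSupport(
    D3D11_FEATURE_THREADING,
    &ThreadingOptions,
    sizeof(ThreadingOptions)
);

// DriverConcurrentCreates : resources can be created concurrently on multiple threads.
// DriverCommandLists : command lists are supported by the driver.
bool bDriverCommandLists = SUCCEEDED(hr) && ThreadingOptions.DriverCommandLists;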

Potential Usage Scenarios

Apply Multiple Threads to Terrain Page Management

Not all necessary terrain data can be created as resources and used at once.
- The resident pages must be swapped according to the viewer's current location at runtime.
Possible design example
1. Create 'worker' threads during application startup.
2. Allocate a deferred context for each worker thread and give it a reference to the device.
  - Each thread has to be responsible for multiple terrain pages.
3. At startup, each worker thread loads only one terrain page from the disk.
4. Each worker allocates a vertex buffer and fills it with the loaded terrain page, in parallel with the others.
  - Speed is likely to be limited only by the available I/O bandwidth.
  - The application can begin rendering simple objects at the same time.
5. At runtime, each worker thread generates command lists.
  - They consist of the API calls that set the various states required to render terrain pages.
  - Each frame, the immediate context executes the command lists of all terrain pages visible to the current viewer.
6. The worker threads dynamically load new terrain pages from the disk as needed.

Apply Multiple Threads to Shader Generation

Compiling shader programs is relatively CPU-intensive.
After compilation, multiple shader objects can be created simultaneously through the free-threaded methods of the device interface.
Process for compiling shader programs and creating shader objects in parallel
1. Prepare a thread pool consisting of one worker thread per CPU core.
2. Load a shader program list.
3. Dispatch the program source code and information to each worker thread.
4. Each worker thread compiles its shader program and creates a shader object, isolated from the other threads.
5. After all the work is done, the created shader objects are collected, stored, and used in the subsequent process.
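
The five steps above can be sketched as a small fixed-size worker pool in portable C++. The compile step is a placeholder: `compileShader`, `ShaderSource`, and `ShaderObject` are hypothetical, and a real implementation would call the shader compiler and the device's shader-creation methods at that point.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct ShaderSource {
    std::string name;
    std::string code;
};

// Hypothetical result; a real one would hold bytecode and a shader object.
struct ShaderObject {
    std::string name;
    std::size_t bytecodeSize;
};

// Placeholder for the CPU-heavy compile step. It touches no shared data,
// so many calls can run at the same time.
ShaderObject compileShader(const ShaderSource& src) {
    return ShaderObject{src.name, src.code.size()};
}

// Runs `workerCount` threads that pull shader indices from a shared
// counter and write each result into its own slot of the output vector.
std::vector<ShaderObject> compileAll(const std::vector<ShaderSource>& sources,
                                     unsigned workerCount) {
    if (workerCount == 0) workerCount = 1;
    std::vector<ShaderObject> results(sources.size());
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (std::size_t i = next.fetch_add(1); i < sources.size();
             i = next.fetch_add(1)) {
            results[i] = compileShader(sources[i]);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workerCount; ++w) pool.emplace_back(worker);
    for (std::thread& t : pool) t.join();  // step 5: collect when all done
    return results;
}
```

Because every index is claimed by exactly one worker, the only shared mutable state is the atomic counter, which matches the goal of keeping workers isolated from each other.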

Apply Multiple Threads to Submit Draw Operations

A scenario for submitting draw operations
1. One main thread uses the immediate context, and several worker threads each use one deferred context.
2. Divide the entire task among the worker threads in some way.
3. Each worker thread generates a command list.
4. The main thread executes the command lists in the appropriate order on the immediate context.
5. The final rendered image is created.
Techniques for dividing scene rendering operations
1. Command list by views
  - Create a command list for each view of a scene.
  - Create a new command list whenever the render target changes or is cleared. (in the case of compute pipelines, whenever the UAV changes)
  - Each worker thread generates a command list per view by calling the methods of its deferred context.
  - All updates for a given frame must be completed before the rendering pass, because any data modification during the rendering pass would have to be planned very carefully so as not to disturb the work of other threads.
  - Suitable for reducing the CPU burden, because similar rendering effects are used for all objects in the view.
  - Command lists are unlikely to be reused across frames, because the command lists are large.
2. Command list by objects
  - Create a separate command list for each object in the scene.
  - Increases the number of command lists to be created and executed.
  - Propagating the pipeline state of the deferred context may improve performance.
  - Reduces the amount of batching that can be performed.
  - Once the command list for an object is generated, it does not need to be released and re-created every frame.
  - All per-frame dynamic rendering data is stored in a constant buffer.
  - The command list does not change between frames if the same constant buffer is updated and used in every frame.
  - A command list can also be executed on a deferred context.
  - This reduces the number of command lists executed on the immediate context, by combining multiple small command lists into one.
3. Command list by materials
  - Each worker thread creates a command list per material.
  - The main thread executes all command lists on the immediate context.
  - More objects can be processed with a relatively small number of state changes.
  - Classifying objects by material before creating a command list adds a cost.
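
The trade-off in the material-based approach, a sorting cost in exchange for fewer state changes, can be sketched as follows. `RenderObject` and both functions are hypothetical helpers for illustration; the integer ids stand in for real object and material handles.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical scene object: only the fields needed for sorting.
struct RenderObject {
    int objectId;
    int materialId;
};

// Counts how often the material would have to be re-bound if the objects
// were drawn in the given order.
int countMaterialChanges(const std::vector<RenderObject>& objects) {
    int changes = 0;
    int current = -1;  // assumes real material ids are non-negative
    for (const RenderObject& o : objects) {
        if (o.materialId != current) {
            current = o.materialId;
            ++changes;
        }
    }
    return changes;
}

// The classification step: group objects by material before recording.
// stable_sort keeps the original order inside each material group.
std::vector<RenderObject> sortByMaterial(std::vector<RenderObject> objects) {
    std::stable_sort(objects.begin(), objects.end(),
                     [](const RenderObject& a, const RenderObject& b) {
                         return a.materialId < b.materialId;
                     });
    return objects;
}
```

For four objects that alternate between two materials, drawing in the original order needs four material bindings, while the sorted order needs two. Each material group can then be handed to one worker to record as a single command list.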

Practical Consideration and Tips

To Do

Apply multiple levels of threads
- The number of threads used by an application or rendering framework should not be fixed to a specific number.
- The number of CPU cores in the user system is unknown in advance.
  1. The number of threads to create and use must be determined dynamically at runtime.
  2. Threads should be added or removed according to the rendering workload.
- Program debugging convenience
- Combinations of components to test
  1. Multithreaded / deferred rendering : execute command lists generated with deferred contexts in multiple threads -> exposes data races between threads
  2. Single-threaded / deferred rendering : execute command lists generated with a deferred context in a single thread -> exposes state configuration mismatches between contexts
  3. Completely single-threaded : render using only an immediate context in one thread (no command lists) -> exposes common rendering errors
- Flexibility of device context : the application can decide dynamically how each thread is used. (immediate / deferred)
Use PIX
- PIX : a tool that shows what an application has done with an API in a given frame.
- Able to identify the time and order of API calls, which helps catch bugs caused by an incorrect API call order.
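
For the advice above on choosing the thread count at runtime, a minimal sketch using the C++ standard library is shown below. `chooseWorkerCount` and its `pendingJobs` parameter are hypothetical; the clamping policy is one reasonable choice, not a rule from Direct3D.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Picks a worker count at runtime. `pendingJobs` is a hypothetical measure
// of the current rendering workload (for example, command lists to record).
unsigned chooseWorkerCount(unsigned pendingJobs) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // the call may return 0 when unknown
    // Never more workers than cores, never more workers than jobs,
    // and always at least one so rendering can proceed.
    return std::max(1u, std::min(cores, pendingJobs));
}
```

Calling this once per frame, or whenever the workload changes noticeably, lets the same build scale from a dual-core laptop to a many-core desktop and fall back to a single worker when debugging.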

To Avoid

Pipeline stalls : the execution order of the command lists must be chosen so that the pipeline keeps running without stalling.
Mode switching : minimize the number of times the application switches the GPU between rendering mode and compute mode.
Context state assumptions : remove redundant state changes by tracking the pipeline state in the rendering system.
- Maintain references to the current render targets.
- Whenever the application sets a render target, check whether the current render target already matches the desired one.
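
The render-target check can be sketched as a thin caching wrapper. `CachedContext` and `RenderTargetId` are hypothetical; the counter stands in for the real state-setting call on the device context.

```cpp
#include <cassert>

// Hypothetical handle type; a real renderer would store a view pointer.
using RenderTargetId = int;

// Wraps a context and remembers the last render target it was given.
class CachedContext {
public:
    // Returns true if the call was forwarded, false if it was skipped
    // because the requested target is already bound.
    bool setRenderTarget(RenderTargetId target) {
        if (hasTarget_ && target == current_) return false;
        current_ = target;
        hasTarget_ = true;
        ++apiCalls_;  // stands in for the real state-setting API call
        return true;
    }

    // Forget the cached target, e.g. after the context state was reset.
    void invalidate() { hasTarget_ = false; }

    int apiCalls() const { return apiCalls_; }

private:
    RenderTargetId current_ = 0;
    bool hasTarget_ = false;
    int apiCalls_ = 0;
};
```

Such a cache has to be invalidated whenever the context state is cleared behind its back, for example after a command list is executed without restoring state, otherwise the next request for the same target would be skipped incorrectly.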

License : Attribution, NonCommercial, NoDerivatives