The Computation Pipeline (CS/Game Programming, 2024. 3. 19. 21:46)
Introduction
DirectCompute
- A new processing paradigm that makes the massively parallel computational power of the GPU available for tasks outside of the normal raster-based rendering domain.
- Can easily be used directly to supply input to rendering operations.
- Uses the exact same resources that are used in rendering, which makes interoperation between the rendering pipeline and DirectCompute very simple.
- The performance of a particular algorithm can easily scale with a user's hardware.
Compute Shader
- A pipeline stage available for performing flexible computations that can be applied to a wider range of applications.
- Follow the same general usage concept as the other programmable shader stages.
- Compile a shader program.
- Use a shader program to create a shader object through the device interface.
- Connect the state object to the compute shader stage by using the device context interface.
- The same types of resources, constant buffers, and samplers can be used as in the other shader stages.
- Connect those resources to the compute shader stage; writable resources are bound through UAVs (unordered access views).
- All data inputs and outputs occur only through resources, not through other stages, so that the compute shader implements complete algorithms within a single program.
DirectCompute Threading Model
Kernel Processing
- The shader program provides a kernel that will be used to process one unit of work.
- Very easy and intuitive to program an algorithm to run on thousands of threads on a GPU.
- Focus on the best way to split a problem into a series of many instances of the same problem for every thread.
Dispatching Work
- The methods to execute a batch of work with the desired number of threads.
- ID3D11DeviceContext::Dispatch(), ID3D11DeviceContext::DispatchIndirect()
- Executes the compute shader kernel across a grid of thread groups; the indirect version reads its arguments from a GPU buffer.
- Total number of thread groups $= x \times y \times z$
```cpp
// Direct version of Dispatch
void Dispatch(
    // The number of groups dispatched in the x, y, z directions (1 ~ 65535)
    UINT ThreadGroupCountX,
    UINT ThreadGroupCountY,
    UINT ThreadGroupCountZ );

// Indirect version of Dispatch
void DispatchIndirect(
    // A pointer to an ID3D11Buffer, which must be loaded with data that
    // matches the argument list for ID3D11DeviceContext::Dispatch
    ID3D11Buffer *pBufferForArgs,
    // A byte-aligned offset between the start of the buffer and the arguments
    UINT AlignedByteOffsetForArgs );
```
- The function attribute that designates the number of threads per thread group.
```hlsl
// total number of threads per group : a x b x c
[numthreads(a, b, c)]
// shader kernel definition follows...
```
Thread Addressing System
- A number of system value semantics are available as input parameters for the shader program.
- SV_GroupID : defines the 3D identifier (uint3) of the thread group within a dispatch that a thread belongs to.
- SV_GroupThreadID : defines the 3D identifier (uint3) of a thread within its own thread group.
- SV_DispatchThreadID : defines the 3D identifier (uint3) of a thread within the entire dispatch.
- SV_GroupIndex : defines a flattened 1D index (uint) of a thread within its own thread group.
- A sample compute shader for doubling the contents of a custom 4D resource.
```hlsl
Buffer<float>   InputBuf  : register(t0);
RWBuffer<float> OutputBuf : register(u0);

// group size
#define size_x 10
#define size_y 10
#define size_z 10
#define size_w 10

// declare one thread for each texel of the input texture
[numthreads(size_x, size_y, size_z)]
void CSMAIN(uint3 GroupThreadID : SV_GroupThreadID, uint3 GroupID : SV_GroupID)
{
    // Flatten the 4D coordinate: the first three dimensions come from the
    // thread's position within its group, and the fourth from the group ID
    // (the dispatch is assumed to be Dispatch(size_w, 1, 1)).
    int index = GroupThreadID.x
              + GroupThreadID.y * size_x
              + GroupThreadID.z * size_x * size_y
              + GroupID.x * size_x * size_y * size_z;

    float Value = InputBuf.Load(index);
    OutputBuf[index] = 2.0f * Value;
}
```
Thread Execution Patterns
- If the GPU has fewer processing cores than the declared number of threads, they cannot all be executed in parallel at once.
- Instead, the threads are executed in a manner that ensures that they behave as if they were operating at the same time.
DirectCompute Memory Model
Register-Based Memory
- The set of registers that the compute shader supports.
- v# : input attribute registers
- t# : texture registers
- cb# : constant buffer registers
- u# : unordered registers
- r#, x# : temporary registers
- Temporary registers
- Can be used to hold intermediate calculations during execution of a shader program.
- Only accessible to the thread that is currently executing, and typically extremely fast.
- Up to 4096 temporary registers (r# and x# combined) are specified in the common shader core.
- Once data has been loaded into the shader core, it will use temporary registers as much as possible.
- The drawback of register-based memory
- Finite in size.
- The desired data must be loaded into the shader core before the registers can be used.
- After a thread has completed its shader program, the contents of these registers are reset for the next shader program.
Device Memory
- Device memory resources : resources that are much larger and are maintained between executions of shader programs; they are stored in device memory.
- SRV, constant buffer : provide read-only access to device memory resources.
- UAV : provide read-write access to device memory resources.
- Device memory resources are considerably slower than register-based memory, because of a relatively high latency between the time when a value is requested and when it is returned.
- Access to device memory resources is provided to all threads that are executing the current shader program.
- Require manual synchronization of access to the resource.
- By using atomic operations or
- By defining an access paradigm that can ensure that threads will not overwrite each other's desired data ranges.
Group Shared Memory
- Group shared memory : every thread in a complete thread group is allowed to access the same memory data.
- Allows for much faster access than the device memory resources; it is intended to reside on the GPU processor die.
- Declared in the global scope of the compute shader with groupshared.
- The compute shader program must determine how threads use and interact with the memory.
- The compute shader program must synchronize memory access.
- Several limits of the shared memory
- The group shared memory is limited to 32KB for each thread group, so other means are needed to share more data.
- Information sharing does not cross the boundaries of a single thread group, so other means are needed for more threads to access a common memory pool.
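The pattern described above — load into group shared memory, synchronize, then cooperate within the group — can be sketched as a per-group parallel sum. The resource names and the 256-thread group size are illustrative; 256 floats (1KB) stays well under the 32KB limit:

```hlsl
Buffer<float>   InputBuf  : register(t0);
RWBuffer<float> OutputBuf : register(u0);

groupshared float SharedData[256];

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint3 gid  : SV_GroupID,
            uint  gi   : SV_GroupIndex)
{
    // Each thread loads one value from device memory into shared memory.
    SharedData[gi] = InputBuf.Load(dtid.x);

    // Wait until every thread in the group has finished its write.
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction within the group: halve the active range each pass.
    for (uint stride = 128; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            SharedData[gi] += SharedData[gi + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    // One thread per group writes the group's sum back to device memory.
    if (gi == 0)
        OutputBuf[gid.x] = SharedData[0];
}
```

Note that information never crosses group boundaries here: producing a single total over all groups would need a second pass or atomics on a device resource.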
Thread Synchronization
Memory Barriers
- The combination of two function attributes.
- The class of memory that the threads are synchronizing.
- Whether all of the threads in a given thread group are synchronized.
- Each function will block a thread from continuing until that function's particular conditions have been met.
- GroupMemoryBarrier()
- Block a thread's execution until all writes to the group shared memory from all threads in a thread group have been completed.
- GroupMemoryBarrierWithGroupSync()
- Block a thread from continuing until all group shared memory writes are completed.
- Block execution until all of the threads in the group have advanced to this function.
- DeviceMemoryBarrier(), DeviceMemoryBarrierWithGroupSync()
- Synchronize write operations for resources connected by an unordered access view.
- Operate in the same way as previous functions.
- AllMemoryBarrier(), AllMemoryBarrierWithGroupSync()
- Functions that essentially perform both of the previous types of synchronization.
Atomic Functions
- Functions that enable more fine-grained synchronization.
- Call atomic functions in one thread.
- The result of the function is propagated to all other threads trying to access the same memory location.
- Available atomic functions
- InterlockedAdd()
- InterlockedMin()
- InterlockedMax()
- InterlockedAnd()
- InterlockedOr()
- InterlockedXor()
- InterlockedCompareStore()
- InterlockedCompareExchange()
- InterlockedExchange()
- Can be used for device memory as well as group-shared memory.
- Can be used in the pixel shader stage.
- Thread access to resources may be synchronized even in the pixel shader stage.
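A classic use of these intrinsics is a histogram, where many threads may increment the same bin concurrently. A minimal sketch (resource names, the bin count, and the assumption that the histogram buffer starts cleared to zero are all illustrative):

```hlsl
Buffer<uint>   Values    : register(t0);
RWBuffer<uint> Histogram : register(u0); // assumed cleared to zero beforehand

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint bin = Values.Load(dtid.x) % 256;

    // A plain Histogram[bin] += 1 would lose increments when two threads
    // hit the same bin; InterlockedAdd serializes the read-modify-write.
    InterlockedAdd(Histogram[bin], 1);
}
```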
Implicit Synchronization
- Avoid synchronization by designing the algorithm to prevent potential competition or conflict between threads.
- This technique keeps synchronization overhead from degrading the algorithm's performance.
- No additional functions required.
- No expensive thread context switching required.
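The simplest conflict-free design gives each thread exclusive ownership of one output element, so no two threads ever touch the same location. A sketch (resource names are illustrative):

```hlsl
Buffer<float>   Src : register(t0);
RWBuffer<float> Dst : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Thread i reads only Src[i] and writes only Dst[i]: the one-to-one
    // mapping makes races impossible, so no barriers or atomics are needed.
    Dst[dtid.x] = Src.Load(dtid.x) + 1.0f;
}
```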
Algorithm Design
Parallelism
- Maximizing parallelism should be an explicit design goal.
- Data should be organized in a form that can be processed with minimal memory access and calculation.
- Minimize synchronization
- All synchronization techniques cause some processing burden during compute shader execution.
- Unless it is used to increase efficiency, synchronization is often damaging to performance.
- Sharing between threads
- Algorithm performance can be improved by including explicit synchronization in algorithm design.
- Sharing loaded memory can reduce device memory bandwidth.
- Share long interim calculation results.
- When the cost of accessing group shared memory increases, performance may decrease.
Choose Appropriate Resource Types
- Memory access patterns
- Necessary to check whether algorithms can make the most of the functions provided by hardware and software.
- Thread group and dispatch size
- Select a size in which threads can:
- Access coalesced memory locations, or
- Share intermediate calculation results with each other.
- Find the dispatch size that allows the entire resource to be properly processed.
- Combination of computation and rendering
- Find the balance between efficiently calculating in the compute shader and efficiently using output data in rendering operation.