The Computation Pipeline (CS/Game Programming, 2024. 3. 19. 21:46)
Introduction
DirectCompute
- A new processing paradigm that makes the massively parallel computational power of the GPU available for tasks outside of the normal raster-based rendering domain.
- Can easily be used directly to supply input to rendering operations.
- Uses the exact same resources that are used in rendering, which makes interoperation between the rendering pipeline and DirectCompute very simple.
- The performance of a particular algorithm can easily scale with a user's hardware.
Compute Shader
- A pipeline stage available for performing flexible computations that can be applied to a wider range of applications.
- Follow the same general usage concept as the other programmable shader stages.
- Compile a shader program.
- Use a shader program to create a shader object through the device interface.
- Connect the state object to the compute shader stage by using the device context interface.
- The same types of resources, constant buffers, and samplers can be used as in the other shader stages.
- Connect those resources to the compute shader stage; writable resources are bound through UAVs (unordered access views).
- All data inputs and outputs occur only through resources, not through other stages, so that the compute shader implements complete algorithms within a single program.
DirectCompute Threading Model
Kernel Processing
- The shader program provides a kernel that will be used to process one unit of work.
- Very easy and intuitive to program an algorithm to run on thousands of threads on a GPU.
- Focus on the best way to split a problem into a series of many instances of the same problem for every thread.
Dispatching Work
- The methods to execute a batch of work with the desired number of threads.
- ID3D11DeviceContext::Dispatch(), ID3D11DeviceContext::DispatchIndirect()
- Executes the compute shader kernel across a grid of thread groups; the indirect version reads its arguments from a GPU buffer.
- Total number of thread groups $= x \times y \times z$
```cpp
// Direct version of Dispatch
void Dispatch(
    // The number of groups dispatched in the x, y, z directions (1 ~ 65535)
    UINT ThreadGroupCountX,
    UINT ThreadGroupCountY,
    UINT ThreadGroupCountZ );

// Indirect version of Dispatch
void DispatchIndirect(
    // A pointer to an ID3D11Buffer, which must be loaded with data that
    // matches the argument list for ID3D11DeviceContext::Dispatch
    ID3D11Buffer *pBufferForArgs,
    // A byte-aligned offset between the start of the buffer and the arguments
    UINT AlignedByteOffsetForArgs );
```
- The function attribute that designates the number of threads per thread group.
```hlsl
// total number of threads per group : a x b x c
[numthreads(a, b, c)]
// shader kernel definition follows...
```
Thread Addressing System
- A number of system value semantics are available as input parameters for the shader program.
- SV_GroupID : defines the 3D identifier (uint3) of the thread group within a dispatch that a thread belongs to.
- SV_GroupThreadID : defines the 3D identifier (uint3) of a thread within its own thread group.
- SV_DispatchThreadID : defines the 3D identifier (uint3) of a thread within the entire dispatch.
- SV_GroupIndex : defines a flattened 1D index (uint) of a thread within its own thread group.
- A sample compute shader for doubling the contents of a custom 4D resource.
```hlsl
Buffer<float>   InputBuf  : register(t0);
RWBuffer<float> OutputBuf : register(u0);

// group size
#define size_x 10
#define size_y 10
#define size_z 10
#define size_w 10

// declare one thread for each texel of the input texture
[numthreads(size_x, size_y, size_z)]
void CSMAIN(uint3 GroupThreadID : SV_GroupThreadID, uint3 GroupID : SV_GroupID)
{
    // Flatten the 4D coordinate: the first three dimensions come from the
    // thread's position within its group, and the fourth from the group ID
    // (the dispatch is assumed to be Dispatch(size_w, 1, 1)).
    int index = GroupThreadID.x
              + GroupThreadID.y * size_x
              + GroupThreadID.z * size_x * size_y
              + GroupID.x * size_x * size_y * size_z;

    float Value = InputBuf.Load(index);
    OutputBuf[index] = 2.0f * Value;
}
```
Thread Execution Patterns
- If the GPU has fewer processing cores than the declared number of threads, they cannot all be executed in parallel at once.
- Instead, the threads are executed in a manner that ensures that they behave as if they were operating at the same time.
DirectCompute Memory Model
Register-Based Memory
- The set of registers that the compute shader supports.
- v# : input attribute registers
- t# : texture registers
- cb# : constant buffer registers
- u# : unordered registers
- r#, x# : temporary registers
- Temporary registers
- Can be used to hold intermediate calculations during execution of a shader program.
- Only accessible to the thread that is currently executing, and typically extremely fast.
- Up to 4096 temporary registers (r# and x# combined) are specified in the common shader core.
- Once data has been loaded into the shader core, it will use temporary registers as much as possible.
- The drawback of register-based memory
- Finite in size.
- The desired data must be loaded into the shader core before the registers can be used.
- After a thread has completed its shader program, the contents of these registers are reset for the next shader program.
Device Memory
- Device memory resources : resources that are much larger and are maintained between executions of shader programs; they are stored in device memory.
- SRV, constant buffer : provide read-only access to device memory resources.
- UAV : provide read-write access to device memory resources.
- Device memory resources are considerably slower than register-based memory, because of a relatively high latency between the time when a value is requested and when it is returned.
- Access to device memory resources is provided to all threads that are executing the current shader program.
- Require manual synchronization of access to the resource.
- By using atomic operations or
- By defining an access paradigm that can ensure that threads will not overwrite each other's desired data ranges.
Group Shared Memory
- Group shared memory : every thread in a complete thread group is allowed to access the same memory data.
- Allows for much faster access than the device memory resources; it is intended to reside on the GPU processor die.
- Declared in the global scope of the compute shader with groupshared.
- The compute shader program must determine how threads use and interact with the memory.
- The compute shader program must synchronize memory access.
- Several limits of the shared memory
- The group shared memory is limited to 32KB for each thread group, so other means are needed to share more data.
- Information sharing does not cross the boundaries of a single thread group, so other means are needed for more threads to access a common memory pool.
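The pattern described above — load into group shared memory, synchronize, then cooperate within the group — can be sketched as a per-group parallel sum. The resource names and the 256-thread group size are illustrative; 256 floats (1KB) stays well under the 32KB limit:

```hlsl
Buffer<float>   InputBuf  : register(t0);
RWBuffer<float> OutputBuf : register(u0);

groupshared float SharedData[256];

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint3 gid  : SV_GroupID,
            uint  gi   : SV_GroupIndex)
{
    // Each thread loads one value from device memory into shared memory.
    SharedData[gi] = InputBuf.Load(dtid.x);

    // Wait until every thread in the group has finished its write.
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction within the group: halve the active range each pass.
    for (uint stride = 128; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            SharedData[gi] += SharedData[gi + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    // One thread per group writes the group's sum back to device memory.
    if (gi == 0)
        OutputBuf[gid.x] = SharedData[0];
}
```

Note that information never crosses group boundaries here: producing a single total over all groups would need a second pass or atomics on a device resource.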
Thread Synchronization
Memory Barriers
- The combination of two function attributes.
- The class of memory that the threads are synchronizing.
- Whether all of the threads in a given thread group are synchronized.
- Each function will block a thread from continuing until that function's particular conditions have been met.
- GroupMemoryBarrier()
- Block a thread's execution until all writes to the group shared memory from all threads in a thread group have been completed.
- GroupMemoryBarrierWithGroupSync()
- Block a thread from continuing until all group shared memory writes are completed.
- Block execution until all of the threads in the group have advanced to this function.
- DeviceMemoryBarrier(), DeviceMemoryBarrierWithGroupSync()
- Synchronize write operations for resources connected by an unordered access view.
- Operate in the same way as previous functions.
- AllMemoryBarrier(), AllMemoryBarrierWithGroupSync()
- Functions that essentially perform both of the previous types of synchronization.
Atomic Functions
- Functions that enable more fine-grained synchronization.
- Call atomic functions in one thread.
- The result of the function is propagated to all other threads trying to access the same memory location.
- Available atomic functions
- InterlockedAdd()
- InterlockedMin()
- InterlockedMax()
- InterlockedAnd()
- InterlockedOr()
- InterlockedXor()
- InterlockedCompareStore()
- InterlockedCompareExchange()
- InterlockedExchange()
- Can be used for device memory as well as group-shared memory.
- Can be used in the pixel shader stage.
- Thread access to resources may be synchronized even in the pixel shader stage.
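A classic use of these intrinsics is a histogram, where many threads may increment the same bin concurrently. A minimal sketch (resource names, the bin count, and the assumption that the histogram buffer starts cleared to zero are all illustrative):

```hlsl
Buffer<uint>   Values    : register(t0);
RWBuffer<uint> Histogram : register(u0); // assumed cleared to zero beforehand

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint bin = Values.Load(dtid.x) % 256;

    // A plain Histogram[bin] += 1 would lose increments when two threads
    // hit the same bin; InterlockedAdd serializes the read-modify-write.
    InterlockedAdd(Histogram[bin], 1);
}
```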
Implicit Synchronization
- Avoid synchronization by designing the algorithm to prevent potential competition or conflict between threads.
- This technique keeps synchronization overhead from degrading the algorithm's performance.
- No additional functions required.
- No expensive thread context switching required.
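The simplest conflict-free design gives each thread exclusive ownership of one output element, so no two threads ever touch the same location. A sketch (resource names are illustrative):

```hlsl
Buffer<float>   Src : register(t0);
RWBuffer<float> Dst : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Thread i reads only Src[i] and writes only Dst[i]: the one-to-one
    // mapping makes races impossible, so no barriers or atomics are needed.
    Dst[dtid.x] = Src.Load(dtid.x) + 1.0f;
}
```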
Algorithm Design
Parallelism
- Maximizing parallelism should be an explicit design goal.
- Data should be organized in a form that can be processed with minimal memory access and calculation.
- Minimize synchronization
- All synchronization techniques cause some processing burden during compute shader execution.
- Unless it is used to increase efficiency, synchronization is often damaging to performance.
- Sharing between threads
- Algorithm performance can be improved by including explicit synchronization in algorithm design.
- Sharing loaded memory can reduce device memory bandwidth.
- Share long interim calculation results.
- When the cost of accessing group shared memory increases, performance may decrease.
Choose Appropriate Resource Types
- Memory access patterns
- Necessary to check whether algorithms can make the most of the functions provided by hardware and software.
- Thread group and dispatch size
- Select a size in which threads can:
- Access coalesced memory locations, or
- Share intermediate calculation results with each other.
- Find the dispatch size that allows the entire resource to be properly processed.
- Combination of computation and rendering
- Find the balance between efficiently calculating in the compute shader and efficiently using output data in rendering operation.