【笔记】CUDA（二） - （异步）SIMT 架构

（异步）SIMT 架构

在 CUDA 编程模型中，线程是进行计算或内存操作的最低抽象级别。从基于 NVIDIA Ampere GPU 架构的设备开始，CUDA 编程模型通过异步编程模型为内存操作提供加速。异步编程模型定义了与 CUDA 线程相关的异步操作的行为。

异步编程模型为 CUDA 线程之间的同步定义了异步屏障的行为。该模型还解释并定义了如何使用 cuda::memcpy_async 在 GPU计算时从全局内存中异步移动数据。

异步操作

异步操作定义为由CUDA线程发起的操作，并且与其他线程一样异步执行。在结构良好的程序中，一个或多个CUDA线程与异步操作同步。发起异步操作的CUDA线程不需要在同步线程中.

这样的异步线程（as-if 线程）总是与发起异步操作的 CUDA 线程相关联。异步操作使用同步对象来同步操作的完成。这样的同步对象可以由用户显式管理（例如，cuda::memcpy_async）或在库中隐式管理（例如，cooperative_groups::memcpy_async）。

同步对象可以是 cuda::barrier 或 cuda::pipeline。这些对象在 Asynchronous Barrier 和 Asynchronous Data Copies using cuda::pipeline 中进行了详细说明。这些同步对象可以在不同的线程范围内使用。作用域定义了一组线程，这些线程可以使用同步对象与异步操作进行同步。

下表定义了CUDA c++中可用的线程作用域，以及可以与每个线程同步的线程。

Thread Scope	Description
cuda::thread_scope::thread_scope_thread	Only the CUDA thread which initiated asynchronous operations synchronizes.
cuda::thread_scope::thread_scope_block	All or any CUDA threads within the same thread block as the initiating thread synchronizes.
cuda::thread_scope::thread_scope_device	All or any CUDA threads in the same GPU device as the initiating thread synchronizes.
cuda::thread_scope::thread_scope_system	All or any CUDA or CPU threads in the same system as the initiating thread synchronizes.

这些线程作用域是在CUDA 标准c++库中作为标准c++的扩展实现的。