Problem Description

I am running a memory coalescing experiment on Pascal and getting unexpected nvprof results. I have one kernel that copies 4 GB of floats from one array to another one. nvprof reports confusing numbers for gld_transactions_per_request and gst_transactions_per_request.
I ran the experiment on a TITAN Xp and a GeForce GTX 1080 Ti. Same results.
#include <stdio.h>
#include <cstdint>
#include <assert.h>

#define N (1ULL * 1024 * 1024 * 1024)

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void copy_kernel(
    const float* __restrict__ data, float* __restrict__ data2) {
    for (unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < N; tid += blockDim.x * gridDim.x) {
        data2[tid] = data[tid];
    }
}

int main() {
    float* d_data;
    gpuErrchk(cudaMalloc(&d_data, sizeof(float) * N));
    assert(d_data != nullptr);
    uintptr_t d = reinterpret_cast<uintptr_t>(d_data);
    assert(d % 128 == 0); // check alignment, just to be sure
    float* d_data2;
    gpuErrchk(cudaMalloc(&d_data2, sizeof(float) * N));
    assert(d_data2 != nullptr);
    copy_kernel<<<1024, 1024>>>(d_data, d_data2);
    gpuErrchk(cudaDeviceSynchronize());
}
Compiled with CUDA version 10.1:
nvcc coalescing.cu -std=c++11 -Xptxas -dlcm=ca -gencode arch=compute_61,code=sm_61 -O3
Profiled with:
nvprof -m all ./a.out
There are a few confusing parts in the profiling results:

- gld_transactions = 536870914, which means that every global load transaction should on average be 4 GB / 536870914 = 8 bytes. This is consistent with gld_transactions_per_request = 16.000000: each warp reads 128 bytes (1 request), and if every transaction is 8 bytes, then we need 128 / 8 = 16 transactions per request. Why is this value so high? I would expect perfect coalescing, so something along the lines of 4 (or even 1) transactions per request.
- gst_transactions = 134217728 and gst_transactions_per_request = 4.000000, so storing memory is more efficient?
- Requested and achieved global load/store throughput (gld_requested_throughput, gst_requested_throughput, gld_throughput, gst_throughput) is 150.32 GB/s each. I would expect a lower throughput for loads than for stores, since we have more transactions per request.
- gld_transactions = 536870914 but l2_read_transactions = 134218800. Global memory is always accessed through the L1/L2 caches. Why is the number of L2 read transactions so much lower? It can't all be cached in the L1 (global_hit_rate = 0%).
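The back-of-the-envelope arithmetic in the first two bullets can be checked directly from the profiler counters. A quick Python sketch (counter values copied from the nvprof output below):

```python
# Sanity-check the transaction-size arithmetic from the nvprof counters.
TOTAL_BYTES = 4 * 1024**3           # 4 GB of floats copied

gld_transactions = 536870914        # from nvprof
gst_transactions = 134217728        # from nvprof

# Average bytes moved per global load transaction.
load_tx_size = TOTAL_BYTES / gld_transactions
print(load_tx_size)                 # ~8 bytes per load transaction

# A fully coalesced warp request is 32 threads x 4 bytes = 128 bytes.
bytes_per_request = 32 * 4
print(bytes_per_request / load_tx_size)   # ~16 transactions per request

# Same arithmetic for stores.
store_tx_size = TOTAL_BYTES / gst_transactions
print(store_tx_size)                      # 32 bytes per store transaction
print(bytes_per_request / store_tx_size)  # 4 transactions per request
```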
I think I am reading the nvprof results wrong. Any suggestions would be appreciated.
Here are the full profiling results:
Device "GeForce GTX 1080 Ti (0)"
Kernel: copy_kernel(float const *, float*)
1 inst_per_warp Instructions per warp 1.4346e+04 1.4346e+04 1.4346e+04
1 branch_efficiency Branch Efficiency 100.00% 100.00% 100.00%
1 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
1 warp_nonpred_execution_efficiency Warp Non-Predicated Execution Efficiency 99.99% 99.99% 99.99%
1 inst_replay_overhead Instruction Replay Overhead 0.000178 0.000178 0.000178
1 shared_load_transactions_per_request Shared Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 shared_store_transactions_per_request Shared Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 local_load_transactions_per_request Local Memory Load Transactions Per Request 0.000000 0.000000 0.000000
1 local_store_transactions_per_request Local Memory Store Transactions Per Request 0.000000 0.000000 0.000000
1 gld_transactions_per_request Global Load Transactions Per Request 16.000000 16.000000 16.000000
1 gst_transactions_per_request Global Store Transactions Per Request 4.000000 4.000000 4.000000
1 shared_store_transactions Shared Store Transactions 0 0 0
1 shared_load_transactions Shared Load Transactions 0 0 0
1 local_load_transactions Local Load Transactions 0 0 0
1 local_store_transactions Local Store Transactions 0 0 0
1 gld_transactions Global Load Transactions 536870914 536870914 536870914
1 gst_transactions Global Store Transactions 134217728 134217728 134217728
1 sysmem_read_transactions System Memory Read Transactions 0 0 0
1 sysmem_write_transactions System Memory Write Transactions 5 5 5
1 l2_read_transactions L2 Read Transactions 134218800 134218800 134218800
1 l2_write_transactions L2 Write Transactions 134217741 134217741 134217741
1 global_hit_rate Global Hit Rate in unified l1/tex 0.00% 0.00% 0.00%
1 local_hit_rate Local Hit Rate 0.00% 0.00% 0.00%
1 gld_requested_throughput Requested Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_requested_throughput Requested Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gld_throughput Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_throughput Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 local_memory_overhead Local Memory Overhead 0.00% 0.00% 0.00%
1 tex_cache_hit_rate Unified Cache Hit Rate 50.00% 50.00% 50.00%
1 l2_tex_read_hit_rate L2 Hit Rate (Texture Reads) 0.00% 0.00% 0.00%
1 l2_tex_write_hit_rate L2 Hit Rate (Texture Writes) 0.00% 0.00% 0.00%
1 tex_cache_throughput Unified Cache Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_tex_read_throughput L2 Throughput (Texture Reads) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_tex_write_throughput L2 Throughput (Texture Writes) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_read_throughput L2 Throughput (Reads) 150.32GB/s 150.32GB/s 150.32GB/s
1 l2_write_throughput L2 Throughput (Writes) 150.32GB/s 150.32GB/s 150.32GB/s
1 sysmem_read_throughput System Memory Read Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 sysmem_write_throughput System Memory Write Throughput 5.8711KB/s 5.8711KB/s 5.8701KB/s
1 local_load_throughput Local Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 local_store_throughput Local Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_load_throughput Shared Memory Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 shared_store_throughput Shared Memory Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 gld_efficiency Global Memory Load Efficiency 100.00% 100.00% 100.00%
1 gst_efficiency Global Memory Store Efficiency 100.00% 100.00% 100.00%
1 tex_cache_transactions Unified Cache Transactions 134217728 134217728 134217728
1 flop_count_dp Floating Point Operations(Double Precision) 0 0 0
1 flop_count_dp_add Floating Point Operations(Double Precision Add) 0 0 0
1 flop_count_dp_fma Floating Point Operations(Double Precision FMA) 0 0 0
1 flop_count_dp_mul Floating Point Operations(Double Precision Mul) 0 0 0
1 flop_count_sp Floating Point Operations(Single Precision) 0 0 0
1 flop_count_sp_add Floating Point Operations(Single Precision Add) 0 0 0
1 flop_count_sp_fma Floating Point Operations(Single Precision FMA) 0 0 0
1 flop_count_sp_mul Floating Point Operation(Single Precision Mul) 0 0 0
1 flop_count_sp_special Floating Point Operations(Single Precision Special) 0 0 0
1 inst_executed Instructions Executed 470089728 470089728 470089728
1 inst_issued Instructions Issued 470173430 470173430 470173430
1 sysmem_utilization System Memory Utilization Low (1) Low (1) Low (1)
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 0.79% 0.79% 0.79%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 1.46% 1.46% 1.46%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 96.16% 96.16% 96.16%
1 stall_texture Issue Stall Reasons (Texture) 0.00% 0.00% 0.00%
1 stall_sync Issue Stall Reasons (Synchronization) 0.00% 0.00% 0.00%
1 stall_other Issue Stall Reasons (Other) 1.13% 1.13% 1.13%
1 stall_constant_memory_dependency Issue Stall Reasons (Immediate constant) 0.00% 0.00% 0.00%
1 stall_pipe_busy Issue Stall Reasons (Pipe Busy) 0.07% 0.07% 0.07%
1 shared_efficiency Shared Memory Efficiency 0.00% 0.00% 0.00%
1 inst_fp_32 FP Instructions(Single) 0 0 0
1 inst_fp_64 FP Instructions(Double) 0 0 0
1 inst_integer Integer Instructions 1.0742e+10 1.0742e+10 1.0742e+10
1 inst_bit_convert Bit-Convert Instructions 0 0 0
1 inst_control Control-Flow Instructions 1073741824 1073741824 1073741824
1 inst_compute_ld_st Load/Store Instructions 2147483648 2147483648 2147483648
1 inst_misc Misc Instructions 1077936128 1077936128 1077936128
1 inst_inter_thread_communication Inter-Thread Instructions 0 0 0
1 issue_slots Issue Slots 470173430 470173430 470173430
1 cf_issued Issued Control-Flow Instructions 33619968 33619968 33619968
1 cf_executed Executed Control-Flow Instructions 33619968 33619968 33619968
1 ldst_issued Issued Load/Store Instructions 268500992 268500992 268500992
1 ldst_executed Executed Load/Store Instructions 67174400 67174400 67174400
1 atomic_transactions Atomic Transactions 0 0 0
1 atomic_transactions_per_request Atomic Transactions Per Request 0.000000 0.000000 0.000000
1 l2_atomic_throughput L2 Throughput (Atomic requests) 0.00000B/s 0.00000B/s 0.00000B/s
1 l2_atomic_transactions L2 Transactions (Atomic requests) 0 0 0
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 134217728 134217728 134217728
1 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.00% 0.00% 0.00%
1 stall_not_selected Issue Stall Reasons (Not Selected) 0.39% 0.39% 0.39%
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 134217728 134217728 134217728
1 flop_count_hp Floating Point Operations(Half Precision) 0 0 0
1 flop_count_hp_add Floating Point Operations(Half Precision Add) 0 0 0
1 flop_count_hp_mul Floating Point Operation(Half Precision Mul) 0 0 0
1 flop_count_hp_fma Floating Point Operations(Half Precision FMA) 0 0 0
1 inst_fp_16 HP Instructions(Half) 0 0 0
1 sysmem_read_utilization System Memory Read Utilization Idle (0) Idle (0) Idle (0)
1 sysmem_write_utilization System Memory Write Utilization Low (1) Low (1) Low (1)
1 pcie_total_data_transmitted PCIe Total Data Transmitted 1024 1024 1024
1 pcie_total_data_received PCIe Total Data Received 0 0 0
1 inst_executed_global_loads Warp level instructions for global loads 33554432 33554432 33554432
1 inst_executed_local_loads Warp level instructions for local loads 0 0 0
1 inst_executed_shared_loads Warp level instructions for shared loads 0 0 0
1 inst_executed_surface_loads Warp level instructions for surface loads 0 0 0
1 inst_executed_global_stores Warp level instructions for global stores 33554432 33554432 33554432
1 inst_executed_local_stores Warp level instructions for local stores 0 0 0
1 inst_executed_shared_stores Warp level instructions for shared stores 0 0 0
1 inst_executed_surface_stores Warp level instructions for surface stores 0 0 0
1 inst_executed_global_atomics Warp level instructions for global atom and atom cas 0 0 0
1 inst_executed_global_reductions Warp level instructions for global reductions 0 0 0
1 inst_executed_surface_atomics Warp level instructions for surface atom and atom cas 0 0 0
1 inst_executed_surface_reductions Warp level instructions for surface reductions 0 0 0
1 inst_executed_shared_atomics Warp level shared instructions for atom and atom CAS 0 0 0
1 inst_executed_tex_ops Warp level instructions for texture 0 0 0
1 l2_global_load_bytes Bytes read from L2 for misses in Unified Cache for global loads 4294967296 4294967296 4294967296
1 l2_local_load_bytes Bytes read from L2 for misses in Unified Cache for local loads 0 0 0
1 l2_surface_load_bytes Bytes read from L2 for misses in Unified Cache for surface loads 0 0 0
1 l2_local_global_store_bytes Bytes written to L2 from Unified Cache for local and global stores. 4294967296 4294967296 4294967296
1 l2_global_reduction_bytes Bytes written to L2 from Unified cache for global reductions 0 0 0
1 l2_global_atomic_store_bytes Bytes written to L2 from Unified cache for global atomics 0 0 0
1 l2_surface_store_bytes Bytes written to L2 from Unified Cache for surface stores. 0 0 0
1 l2_surface_reduction_bytes Bytes written to L2 from Unified Cache for surface reductions 0 0 0
1 l2_surface_atomic_store_bytes Bytes transferred between Unified Cache and L2 for surface atomics 0 0 0
1 global_load_requests Total number of global load requests from Multiprocessor 134217728 134217728 134217728
1 local_load_requests Total number of local load requests from Multiprocessor 0 0 0
1 surface_load_requests Total number of surface load requests from Multiprocessor 0 0 0
1 global_store_requests Total number of global store requests from Multiprocessor 134217728 134217728 134217728
1 local_store_requests Total number of local store requests from Multiprocessor 0 0 0
1 surface_store_requests Total number of surface store requests from Multiprocessor 0 0 0
1 global_atomic_requests Total number of global atomic requests from Multiprocessor 0 0 0
1 global_reduction_requests Total number of global reduction requests from Multiprocessor 0 0 0
1 surface_atomic_requests Total number of surface atomic requests from Multiprocessor 0 0 0
1 surface_reduction_requests Total number of surface reduction requests from Multiprocessor 0 0 0
1 sysmem_read_bytes System Memory Read Bytes 0 0 0
1 sysmem_write_bytes System Memory Write Bytes 160 160 160
1 l2_tex_hit_rate L2 Cache Hit Rate 0.00% 0.00% 0.00%
1 texture_load_requests Total number of texture Load requests from Multiprocessor 0 0 0
1 unique_warps_launched Number of warps launched 32768 32768 32768
1 sm_efficiency Multiprocessor Activity 99.63% 99.63% 99.63%
1 achieved_occupancy Achieved Occupancy 0.986477 0.986477 0.986477
1 ipc Executed IPC 0.344513 0.344513 0.344513
1 issued_ipc Issued IPC 0.344574 0.344574 0.344574
1 issue_slot_utilization Issue Slot Utilization 8.61% 8.61% 8.61%
1 eligible_warps_per_cycle Eligible Warps Per Active Cycle 0.592326 0.592326 0.592326
1 tex_utilization Unified Cache Utilization Low (1) Low (1) Low (1)
1 l2_utilization L2 Cache Utilization Low (2) Low (2) Low (2)
1 shared_utilization Shared Memory Utilization Idle (0) Idle (0) Idle (0)
1 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
1 special_fu_utilization Special Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 tex_fu_utilization Texture Function Unit Utilization Low (1) Low (1) Low (1)
1 single_precision_fu_utilization Single-Precision Function Unit Utilization Low (1) Low (1) Low (1)
1 double_precision_fu_utilization Double-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 flop_hp_efficiency FLOP Efficiency(Peak Half) 0.00% 0.00% 0.00%
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 0.00% 0.00% 0.00%
1 flop_dp_efficiency FLOP Efficiency(Peak Double) 0.00% 0.00% 0.00%
1 dram_read_transactions Device Memory Read Transactions 134218560 134218560 134218560
1 dram_write_transactions Device Memory Write Transactions 134176900 134176900 134176900
1 dram_read_throughput Device Memory Read Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 dram_write_throughput Device Memory Write Throughput 150.27GB/s 150.27GB/s 150.27GB/s
1 dram_utilization Device Memory Utilization High (7) High (7) High (7)
1 half_precision_fu_utilization Half-Precision Function Unit Utilization Idle (0) Idle (0) Idle (0)
1 ecc_transactions ECC Transactions 0 0 0
1 ecc_throughput ECC Throughput 0.00000B/s 0.00000B/s 0.00000B/s
1 dram_read_bytes Total bytes read from DRAM to L2 cache 4294993920 4294993920 4294993920
1 dram_write_bytes Total bytes written from L2 cache to DRAM 4293660800 4293660800 4293660800
Accepted Answer
With Fermi and Kepler GPUs, when a global transaction was issued, it was always for 128 bytes, and the L1 cacheline size (if enabled) was 128 bytes. With Maxwell and Pascal, these characteristics changed. In particular, a read of a portion of an L1 cacheline does not necessarily trigger a full 128-byte width transaction. This is fairly easily discoverable/provable with microbenchmarking.
Effectively, the size of a global load transaction changed, subject to a certain quantum of granularity. Based on this change of transaction size, it's possible that multiple transactions could be required, where previously only 1 was required. As far as I know, none of this is clearly published or detailed, and I won't be able to do that here. However I think we can address a number of your questions without giving a precise description of how global load transactions are calculated.
This mindset (1 transaction per request for fully coalesced loads of a 32-bit quantity per thread) would have been correct in the Fermi/Kepler timeframe. It is no longer correct for Maxwell and Pascal GPUs. As you've already calculated, the transaction size appears to be smaller than 128 bytes, and therefore the number of transactions per request is higher than 1. But this doesn't indicate an efficiency problem per se (as it would have in Fermi/Kepler timeframe). So let's just acknowledge that the transaction size can be smaller and therefore transactions per request can be higher, even though the underlying traffic is essentially 100% efficient.
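Under this model, shrinking the minimum transaction size simply scales up the transactions-per-request count without implying any wasted bandwidth. An illustrative sketch (the three sizes are examples, not a documented list):

```python
# Transactions needed to satisfy one fully coalesced warp request
# (32 threads x 4-byte float = 128 bytes) under different minimum
# transaction sizes. Illustrative only.
request_bytes = 32 * 4

for tx_size in (128, 32, 8):
    print(tx_size, request_bytes // tx_size)
# 128 -> 1 transaction/request  (the Fermi/Kepler mental model)
#  32 -> 4
#   8 -> 16 (matches gld_transactions_per_request observed on Pascal here)
```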
No, that's not what this means. It simply means that the subdivision quanta can be different for loads (load transactions) and stores (store transactions). These happen to be 32-byte transactions. In either case, loads or stores, the transactions are and should be fully efficient in this case. The requested traffic is consistent with the actual traffic, and other profiler metrics confirm this. If the actual traffic were much higher than the requested traffic, that would be a good indication of inefficient loads or stores:
1 gld_requested_throughput Requested Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_requested_throughput Requested Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gld_throughput Global Load Throughput 150.32GB/s 150.32GB/s 150.32GB/s
1 gst_throughput Global Store Throughput 150.32GB/s 150.32GB/s 150.32GB/s
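The requested-versus-actual comparison described above reduces to a simple ratio; when the two match, the access pattern wastes no bandwidth. A small sketch, with a hypothetical badly-strided pattern for contrast:

```python
# Efficiency as the ratio of requested to actual traffic.
# Values taken from the nvprof output above.
gld_requested = 150.32   # GB/s, gld_requested_throughput
gld_actual    = 150.32   # GB/s, gld_throughput

gld_efficiency = gld_requested / gld_actual
print(f"{gld_efficiency:.0%}")   # 100% -> fully coalesced loads

# Hypothetical worst case: each 4-byte element pulled in its own
# 32-byte transaction requests 4 bytes but moves 32.
bad_efficiency = 4 / 32
print(bad_efficiency)            # 0.125 -> 8x wasted bandwidth
```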
Again, you'll have to adjust your way of thinking to account for variable transaction sizes. Throughput is driven by the needs and efficiency associated with fulfilling those needs. Both loads and stores are fully efficient for your code design, so there is no reason to think there is or should be an imbalance in efficiency.
This is simply due to the different size of the transactions. You've already calculated that the apparent global load transaction size is 8 bytes, and I've already indicated that the L2 transaction size is 32 bytes, so it makes sense that there would be a 4:1 ratio between the total number of transactions, since they reflect the same movement of the same data, viewed through two different lenses. Note that there has always been a disparity between the size of global transactions and the size of L2 transactions, or transactions to DRAM. It's simply that the ratios of these may vary by GPU architecture, and possibly other factors, such as load patterns.
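The two-lenses point checks out numerically against the counters in the question:

```python
# Same data movement viewed through two lenses: ~8-byte global load
# transactions at the L1/TEX level vs 32-byte transactions at L2.
gld_transactions = 536870914     # from nvprof
l2_read_transactions = 134218800 # from nvprof

ratio = gld_transactions / l2_read_transactions
print(ratio)   # ~4.0, matching the 32 B / 8 B transaction-size ratio
```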
Some notes:
I won't be able to answer questions such as "why is it this way?", or "why did Pascal change from Fermi/Kepler?" or "given this particular code, what would you predict as the needed global load transactions on this particular GPU?", or "generally, for this particular GPU, how would I calculate or predict transaction size?"
As an aside, there are new profiling tools (Nsight Compute and Nsight Systems) being advanced by NVIDIA for GPU work. Many of the efficiency and transactions per request metrics which are available in nvprof
are gone under the new toolchain. So these mindsets will have to be broken anyway, because these methods of ascertaining efficiency won't be available moving forward, based on the current metric set.
Note that the use of compile switches such as -Xptxas -dlcm=ca
may affect (L1) caching behavior. I don't expect caches to have much performance or efficiency impact on this particular copy code, however.
This possible reduction in transaction size is generally a good thing. It results in no loss of efficiency for traffic patterns such as presented in this code, and for certain other codes it allows (less-than-128byte) requests to be satisfied with less wasted bandwidth.
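The "less wasted bandwidth" claim can be illustrated with a hypothetical scattered access pattern (the 128-byte vs 32-byte figures follow the fixed-size vs reduced-size models discussed above; this is a sketch, not a documented hardware description):

```python
# Why smaller transactions help scattered access: suppose a warp loads
# 32 4-byte elements that each land in a distinct 128-byte line.
elements = 32
bytes_needed = elements * 4       # 128 bytes actually requested

# Fixed 128-byte transactions (Fermi/Kepler model): a full line per element.
moved_128 = elements * 128
print(bytes_needed / moved_128)   # 0.03125 -> ~3% efficiency

# 32-byte transactions (Maxwell/Pascal model): one sector per element.
moved_32 = elements * 32
print(bytes_needed / moved_32)    # 0.125 -> 12.5% efficiency, 4x better
```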
Although not specifically Pascal, here is a better defined example of the possible variability in these measurements for Maxwell. Pascal will have similar variability. Also, some small hint of this change (especially for Pascal) was given in the Pascal Tuning Guide. It by no means offers a complete description or explains all of your observations, but it does hint at the general idea that the global transactions are no longer fixed to a 128-byte size.