This article looks at how to define the stream of the destination device in cudaMemcpyPeerAsync(); the question and its answer are reproduced below for reference.

Problem description

I am doing an asynchronous memcpy from gpu0 to gpu1 using cudaMemcpyPeerAsync().

cudaMemcpyPeerAsync() provides a stream option for gpu0, but not for gpu1. Can I somehow define the stream of the receiving device too?
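
For reference, a minimal sketch of the call in question; the buffer names, the size, and the stream are illustrative, not taken from the asker's code:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t nbytes = 1 << 20;
    float *src_gpu0 = nullptr, *dst_gpu1 = nullptr;
    cudaStream_t stream0;

    cudaSetDevice(0);
    cudaMalloc((void**)&src_gpu0, nbytes);
    cudaStreamCreate(&stream0);            // stream created on gpu0

    cudaSetDevice(1);
    cudaMalloc((void**)&dst_gpu1, nbytes);

    cudaSetDevice(0);
    // The signature takes exactly one stream argument, which orders the
    // copy on the issuing side; there is no parameter for a stream on
    // the receiving device.
    cudaMemcpyPeerAsync(dst_gpu1, 1,       // destination pointer, device
                        src_gpu0, 0,       // source pointer, device
                        nbytes, stream0);
    cudaStreamSynchronize(stream0);
    return 0;
}
```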

I am using OpenMP threads to manage each of the devices (so they are in separate contexts).
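
A minimal sketch of that one-thread-per-device pattern, assuming one OpenMP thread drives each GPU (the stream variable and the work placeholder are illustrative); it would be compiled with OpenMP enabled, e.g. nvcc -Xcompiler -fopenmp:

```cuda
#include <cuda_runtime.h>
#include <omp.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    // One OpenMP thread per device; each thread makes its own device
    // current, so subsequent CUDA calls on that thread target that device.
    #pragma omp parallel num_threads(ndev)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        cudaStream_t s;
        cudaStreamCreate(&s);     // per-device stream owned by this thread
        // ... enqueue this device's work into s ...
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }
    return 0;
}
```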

Visual Profiler shows the stream for the sending device, but for the receiving device this memcpy is shown only under MemCpy (PtoP) and not in any of the streams (not even in the default stream).

PS: My current implementation works fine. I just want to overlap the sending and receiving communication.

Solution

There is no API call for a CUDA peer copy that allows you to specify streams on both ends. The simple answer to your question is no.

Streams are a way of organizing activity. The cudaMemcpyPeerAsync call will show up in the stream (and device) to which it is assigned. This is the level of control you have with the API.

Since streams dictate (i.e. control, regulate) behavior, being able to assign a CUDA task to separate streams (on more than one device, in this case) is a level of control that is not exposed in CUDA. Devices (and streams) are intended to operate asynchronously, and requiring that a particular CUDA task satisfy the requirements of two separate streams (on two separate devices in this case) would introduce a type of synchronization that is not appropriate, and could lead to various kinds of activity stalls, and perhaps even deadlock.
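
Within the control level that is exposed, cross-device ordering can still be expressed explicitly with events: cudaStreamWaitEvent accepts an event recorded on another device, so a stream on the receiving GPU can be made to wait for the copy. A hedged sketch, assuming stream0 (on gpu0), stream1 (on gpu1), and the buffers have been set up elsewhere:

```cuda
// Sketch: make gpu1's stream wait for a peer copy issued into gpu0's stream.
cudaEvent_t copy_done;
cudaSetDevice(0);
cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

cudaMemcpyPeerAsync(dst_gpu1, 1, src_gpu0, 0, nbytes, stream0);
cudaEventRecord(copy_done, stream0);       // event in the copy's stream

cudaSetDevice(1);
// cudaStreamWaitEvent works across devices: work enqueued into stream1
// after this call will not start until the recorded copy has completed.
cudaStreamWaitEvent(stream1, copy_done, 0);
// ... launch consumer kernels on gpu1 into stream1 ...
```

This does not bind the copy itself to two streams; it only adds a dependency edge from the copy to the receiving device's stream.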

None of the description here, nor the behavior of cudaMemcpyPeerAsync, should prevent you from overlapping copy operations in various directions. In fact, in my opinion, assigning a CUDA task to more than one stream would make flexible overlap more difficult to achieve.
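
As an illustration of overlap in both directions, a sketch that issues two opposite-direction peer copies into different streams (all names are illustrative, and the buffers must be distinct to avoid races):

```cuda
// Sketch: opposite-direction peer copies in separate streams so the
// copy engines are free to overlap them. a0/b1 are sources on gpu0/gpu1;
// a1/b0 are destinations on gpu1/gpu0.
cudaSetDevice(0);
cudaMemcpyPeerAsync(a1, 1, a0, 0, nbytes, stream0);   // gpu0 -> gpu1

cudaSetDevice(1);
cudaMemcpyPeerAsync(b0, 0, b1, 1, nbytes, stream1);   // gpu1 -> gpu0

// Each copy is ordered only by the single stream it was issued into,
// so nothing forces the two transfers to serialize against each other.
cudaSetDevice(0); cudaStreamSynchronize(stream0);
cudaSetDevice(1); cudaStreamSynchronize(stream1);
```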

If you have difficulty achieving a particular overlap, you should probably describe the problem (i.e., provide a simple reproducer: complete, compilable code, as described at SSCCE.org), show the current overlap scenario that the visual profiler displays, and describe the desired overlap scenario.
