

I am working on a Python code that uses MPI (mpi4py), and I want to run it across several nodes (each node has 16 processors) through a queue on an HPC cluster.


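from mpi4py import MPI

comm = MPI.COMM_WORLD  # communicator spanning all launched processes
size = comm.Get_size()
rank = comm.Get_rank()

count = 0  # message tag
# Every rank except 0 computes five results and sends each one to rank 0.
if rank != 0:
    for j in range(5):
        res = some_function(some_argument)  # placeholder for the actual work
        comm.send(res, dest=0, tag=count)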


I am able to run this code perfectly fine on the head node of the cluster using the command

$ mpirun -np 48 python codename.py

Here codename.py is the name of the Python script, and in this example I am using 48 processors. On the head node, for my specific task, the job takes about 1 second to finish and gives the desired output.


However, when I submit this exact same code as a job to one of the queues of the HPC cluster, it keeps running for many hours without finishing, and I have to kill the job manually after a day or so. It also doesn't give the expected output.


Here is the PBS file that I am using:


#PBS -l nodes=3:ppn=16
#PBS -N phy
#PBS -m abe
#PBS -l walltime=23:00:00
#PBS -j eo
#PBS -q queue_name

echo 'This job started on: ' `date`

module load python27-extras
mpirun -np 48 python codename.py

I submit the job using the command qsub jobname.pbs.


I am confused as to why the code runs perfectly fine on the head node but runs into this problem when I submit it as a job across many processors in a queue. I presume that I may need to change the PBS script. I would be really thankful if someone could suggest what I should do to run such an MPI script as a job on a queue in an HPC cluster.


I didn't need to change my code. This is the PBS script that worked. =)


Apparently, I needed to call the appropriate mpirun in the job script, so that when the code runs on the compute nodes it uses the same mpirun that was being used on the head node.

This is the line that made the difference: /opt/intel/impi/



#PBS -l nodes=3:ppn=16
#PBS -N phy
#PBS -m abe
#PBS -l walltime=23:00:00
#PBS -j eo
#PBS -q queue_name

export I_MPI_PIN=off
echo 'This job started on: ' `date`

/opt/intel/impi/ -np 48 python codename.py
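
One way to confirm this kind of mismatch is to print the same environment details on the head node and from inside the job, then compare them. The following is only a minimal sketch: the function name describe_mpi_environment is made up for illustration, it assumes Python 3 (shutil.which is not available in Python 2.7), and it falls back to None when mpi4py cannot be imported.

```python
import shutil
import socket
import sys


def describe_mpi_environment():
    """Collect details worth comparing between the head node and a job.

    Hypothetical helper for illustration, not part of mpi4py.
    """
    info = {
        "host": socket.gethostname(),
        "python": sys.executable,
        # First mpirun found on PATH, or None if none is found.
        "mpirun": shutil.which("mpirun"),
    }
    try:
        from mpi4py import MPI
    except ImportError:
        # mpi4py is not importable in this environment.
        info["mpi_vendor"] = None
        info["rank"] = None
    else:
        # get_vendor() reports the MPI implementation mpi4py was built against.
        info["mpi_vendor"] = MPI.get_vendor()
        info["rank"] = MPI.COMM_WORLD.Get_rank()
    return info


if __name__ == "__main__":
    for key, value in sorted(describe_mpi_environment().items()):
        print("{}: {}".format(key, value))
```

If the mpirun path or the MPI vendor reported inside the job differs from what the head node reports, the job is probably launching with a different MPI installation than the one mpi4py was built against.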


06-29 15:05