This post covers a fix for submitting Python (mpi4py) jobs to a queue on an HPC cluster; hopefully it is useful to anyone hitting the same problem.

Problem Description

I am working on a Python code with MPI (mpi4py), and I want to run it across many nodes (each node has 16 processors) through a queue on an HPC cluster.

My code has the following structure:

from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

count = 0
for i in range(1, size):
    if rank == i:  # every rank except 0 does the work below
        for j in range(5):
            res = some_function(some_argument)  # placeholder for the real computation
            comm.send(res, dest=0, tag=count)

I am able to run this code perfectly fine on the head node of the cluster using the command

$mpirun -np 48 python codename.py

Here codename.py is the name of the Python script, and in the given example I am choosing 48 processors. On the head node, for my specific task, the job takes about 1 second to finish (and it successfully gives the desired output).

However, when I try to submit this exact same code as a job on one of the queues of the HPC cluster, it keeps running for a very long time (many hours) without finishing, and I have to kill the job manually after a day or so. It also does not give the expected output.

Here is the PBS file that I am using:

#!/bin/sh

#PBS -l nodes=3:ppn=16
#PBS -N phy
#PBS -m abe
#PBS -l walltime=23:00:00
#PBS -j eo
#PBS -q queue_name

cd $PBS_O_WORKDIR
echo 'This job started on: ' `date`

module load python27-extras
mpirun -np 48 python codename.py

I submit the job with the command qsub jobname.pbs.

I am confused as to why the code runs perfectly fine on the head node but runs into this problem when I submit it as a job to run across many processors in a queue. I presume that I may need to change the PBS script. I would be really thankful if someone could suggest what I should do to run such an MPI script as a job on a queue in an HPC cluster.

Recommended Answer

I didn't need to change my code. Below is the PBS script that worked. =)

Apparently, I needed to call the appropriate mpirun in the job script, so that when the code runs on the cluster nodes it uses the same mpirun that was being used on the head node.

This is the line that made the difference: /opt/intel/impi/4.1.1.036/intel64/bin/mpirun
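One way to check for this kind of mismatch before editing the script (assuming python and mpi4py are available in the job environment, and an MPI-3-capable build for Get_library_version):

```shell
# Which mpirun does the job environment pick up by default?
which mpirun
mpirun --version
# Which MPI library was mpi4py compiled against?
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```

If the mpirun on PATH belongs to a different MPI implementation than the one mpi4py was built against, processes can hang or fail to communicate, which matches the symptoms described above.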

Here is the job script that worked:

#!/bin/sh

#PBS -l nodes=3:ppn=16
#PBS -N phy
#PBS -m abe
#PBS -l walltime=23:00:00
#PBS -j eo
#PBS -q queue_name

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16   # threads per process, if any OpenMP code is involved
export I_MPI_PIN=off        # disable Intel MPI process pinning
echo 'This job started on: ' `date`

# Full path ensures the job uses the same Intel MPI mpirun as the head node
/opt/intel/impi/4.1.1.036/intel64/bin/mpirun -np 48 python codename.py

