This article looks at how to determine at which point a Python script step exceeds its memory in SLURM; it should be a useful reference for anyone facing the same problem.

Problem description

I have a Python script that I am running on a SLURM cluster for multiple input files:

#!/bin/bash

#SBATCH -p standard
#SBATCH -A overall 
#SBATCH --time=12:00:00
#SBATCH --output=normalize_%A.out
#SBATCH --error=normalize_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=240000

HDF5_DIR=...
OUTPUT_DIR=...
NORM_SCRIPT=...

norm_func () {
  local file=$1
  echo "$file"
  python "$NORM_SCRIPT" -data "$file" -path "$OUTPUT_DIR"
}

# Doing normalization in parallel
for file in "$HDF5_DIR"/*; do norm_func "$file" & done
wait

The Python script just loads a dataset (scRNAseq), normalizes it and saves the result as a .csv file. Some of the major lines of code in it are:

# rawcounts, split_code, cell_ids, gene_symbols and file_name are prepared in
# lines omitted from this excerpt; gmn is the module providing TMM normalization.
import csv

import h5py
import numpy as np

f = h5py.File(path_to_file, 'r')
rawcounts = np.array(rawcounts)

unique_code = np.unique(split_code)
for code in unique_code:
    mask = np.equal(split_code, code)
    curr_counts = rawcounts[:, mask]

    # Actual TMM normalization
    mtx_norm = gmn.tmm_normalization(curr_counts)

    # Writing the results into a .csv file
    csv_path = path_to_save + "/" + file_name + "_" + str(code) + ".csv"
    with open(csv_path, 'w', encoding='utf8') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        writer.writerow(["", cell_ids])
        for idx, row in enumerate(mtx_norm):
            writer.writerow([gene_symbols[idx], row])

I keep getting a step memory exceeded error for datasets that are above 10 GB and I am not sure why. How can I change my .slurm script or Python code to reduce its memory usage? How can I actually identify what causes the memory problem, and is there a particular way of debugging memory in this case? Any suggestions would be greatly appreciated.

Recommended answer

You can get more fine-grained information by using srun to start the Python scripts:

srun python "$NORM_SCRIPT" -data "$file" -path "$OUTPUT_DIR"

Slurm will then create one 'step' per instance of your Python script, and report information (errors, return codes, memory used, etc.) for each step independently in the accounting, which you can query with the sacct command.
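
For example, once the job has finished, a query along the following lines (the job ID is a placeholder) lists the peak resident memory and exit status of every step:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,State,ExitCode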

If configured by the administrators, use the --profile option to get a timeline of the memory usage of each step.
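
As a sketch, assuming the acct_gather_profile HDF5 plugin is enabled on your cluster, you would add a directive such as

#SBATCH --profile=task

to the submission script, and after the job completes merge the collected per-node profiles with

sh5util -j <jobid> -o profile.h5

The resulting HDF5 file contains, among other things, periodic samples of the memory used by each task, so you can see when a step approaches its limit.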

In your Python script you can use the memory_profiler module to get feedback on the memory usage of your script.
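
A minimal sketch of how this could look inside the normalization script; the function name normalize_file and its arguments are hypothetical placeholders, and the package is published on PyPI as memory_profiler:

from memory_profiler import profile

@profile  # prints a line-by-line memory report when the function executes
def normalize_file(path_to_file, path_to_save):
    # the existing loading, TMM normalization and CSV-writing code goes here
    ...

Running the script as usual then reports, for each decorated function, the memory increment attributed to every line, which points directly at the statement where the usage blows up.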

