本文介绍了 pandas :快速计算具有特定值的列之和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个pandas数据框,我需要计算落入某个窗口内的一列值的总和.因此,例如,如果我有一个500的窗口,而我的初始值为1000,那么我想对所有介于499和999之间以及1001和1501之间的值求和.

I have a pandas dataframe and I need to calculate the sum of a column of values that fall within a certain window. So for instance, if I have a window of 500, and my initial value is 1000, I want to sum all values that are between 499 and 999, and also between 1001 and 1501.

用一些数据更容易解释:

This is easier to explain with some data:

    chrom   pos end AFR EUR pi
0   1   10177   10177   0.4909  0.4056  0.495988
1   1   10352   10352   0.4788  0.4264  0.496369
2   1   10617   10617   0.9894  0.9940  0.017083
3   1   11008   11008   0.1346  0.0885  0.203142
4   1   11012   11012   0.1346  0.0885  0.203142
5   1   13110   13110   0.0053  0.0567  0.053532
6   1   13116   13116   0.0295  0.1869  0.176091
7   1   13118   13118   0.0295  0.1869  0.176091
8   1   13273   13273   0.0204  0.1471  0.139066
9   1   13550   13550   0.0008  0.0080  0.007795
10  1   14464   14464   0.0144  0.1859  0.161422
11  1   14599   14599   0.1210  0.1610  0.238427
12  1   14604   14604   0.1210  0.1610  0.238427
13  1   14930   14930   0.4811  0.5209  0.500209
14  1   14933   14933   0.0015  0.0507  0.044505
15  1   15211   15211   0.5371  0.7316  0.470848
16  1   15585   15585   0.0008  0.0020  0.002635
17  1   15644   15644   0.0008  0.0080  0.007795
18  1   15777   15777   0.0159  0.0149  0.030470
19  1   15820   15820   0.4849  0.2714  0.477153
20  1   15903   15903   0.0431  0.4652  0.349452
21  1   16071   16071   0.0091  0.0010  0.011142
22  1   16142   16142   0.0053  0.0020  0.007721
23  1   16949   16949   0.0227  0.0159  0.038759
24  1   18643   18643   0.0023  0.0080  0.009485
25  1   18849   18849   0.8411  0.9911  0.170532
26  2   30923   30923   0.6687  0.9364  0.338400
27  2   20286   46286   0.0053  0.0010  0.006863
28  2   21698   46698   0.0015  0.0010  0.002566
29  2   42159   47159   0.0083  0.0696  0.067187

所以我需要基于前两列进行子集化.例如,如果我的窗口= 500,我的chrom = 1,我的pos = 15500,我将需要对我的df进行子集化,以仅包含chrom = 1且15000> pos< 16000.

So I need to subset based on the first two columns. For example, if my window = 500, my chrom = 1 and my pos = 15500, I will need to subset my df to include only those rows that have chrom = 1 and 15000 > pos < 16000.

然后我想对这部分数据的AFR列求和.

I would then like to sum the AFR column of this subset of data.

这是我做的功能:

#vdf is my main dataframe, 
#polyChrom is the chromosome to subset by,     
#polyPos is the position to subset by. 
#Distance is how far the window should be from the polyPos. 
#windowSize is the size of the window itself
#E.g. if distance=20000 and windowSize= 500, we are looking at a window
#that is (polyPos-20000)-500 to (polyPos-20000) and a window that is 
#(polyPos+20000) to (polyPos+20000)+500.
def mafWindow(vdf, polyChrom, polyPos, distance, windowSize):

#If start position becomes less than 0, set it to 0
    if(polyPos - distance < 0):
        start1 = 0
        end1 = windowSize
    else:
        start1 = polyPos - distance
        end1 = start1 + windowSize

    end2 = polyPos + distance
    start2 = end2 - windowSize

#subset df        
    df = vdf.loc[(vdf['chrom'] == polyChrom) & ((vdf['pos'] <= end1) & (vdf['pos'] >= start1))|
                 ((vdf['pos'] <= end2) & (vdf['pos'] >= start2))].copy()

    return(df.AFR.sum())

整个方法可用于对数据框进行子集设置,当我的数据框包含约55k行时,该方法非常慢.有更快,更有效的方法吗?

This whole method works on subsetting the dataframe and is very slow when my dataframe contains ~55k rows. Is there a quicker and more efficient way of doing this?

推荐答案

诀窍是将其放到numpy数组中.熊猫索引编制和切片速度很慢.

The trick is to drop down to numpy arrays. Pandas indexing and slicing is slow.

import pandas as pd

df = pd.DataFrame([[1, 10177, 0.5], [1, 10178, 0.2], [1, 20178, 0.1],
                   [2, 10180, 0.3], [1, 10180, 0.4]], columns=['chrom', 'pos', 'AFR'])

chrom = df['chrom'].values
pos = df['pos'].values
afr = df['AFR'].values

def filter_sum(chrom_arr, pos_arr, afr_arr, chrom_val, pos_start, pos_end):
    return sum(k for i, j, k in zip(chrom_arr, pos_arr, afr_arr) \
               if pos_start < j < pos_end and i == chrom_val)

filter_sum(chrom, pos, afr, 1, 10150, 10200)

# 1.1

这篇关于 pandas :快速计算具有特定值的列之和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-28 03:50