python - How to get the rows of a 2D NumPy array where one column's value is the maximum within groups defined by another column?

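This is a very common SQL query:

Select the rows that have the maximum value in column X, grouped by group_id.

The result is, for each group_id, one row (the first) whose column X value is the maximum within that group.

I have a 2D NumPy array with many columns, but let's simplify it to (ID, X, Y):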

import numpy as np
rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])

I want to get:

[[1 22 1236]
 [2 23 1111]]

I was able to do it with a cumbersome loop, for example:

row_grouped_with_max = []

# Assumes rows are already sorted by ID (column 0)
max_row = rows[0]
for row in rows[1:]:
    if row[0] != max_row[0]:
        # A new group starts: store the best row of the previous group
        row_grouped_with_max.append(max_row)
        max_row = row
    elif row[1] > max_row[1]:
        # Strictly greater, so the first row wins on ties
        max_row = row
row_grouped_with_max.append(max_row)

How can I do this in a clean NumPy way?

Best answer

It may not be very clean, but here is a vectorized approach:

# Sort "rows" by ID; a stable sort keeps the original order of rows within each ID
sorted_rows = rows[np.argsort(rows[:,0], kind='stable')]

# Get count of elements for each ID
_,count = np.unique(sorted_rows[:,0],return_counts=True)

# Form mask to fill elements from X-column
N1 = count.max()
N2 = len(count)
mask = np.arange(N1) < count[:,None]

# Form a 2D matrix of X values, one row per unique ID, padded with -inf
ID_2Darray = np.empty((N2,N1))
ID_2Darray.fill(-np.inf)
ID_2Darray[mask] = sorted_rows[:,1]

# Get ID based max indices
grp_max_idx = np.argmax(ID_2Darray,axis=1) + np.append([0],count.cumsum()[:-1])

# Finally, get the "maxed"-X rows
out = sorted_rows[grp_max_idx]
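
To make the mask and padding step easier to follow, here is a small trace of the same idea on the six-row array from the question. It is only an illustrative sketch: the names `padded` and `starts` are mine, not part of the answer's code, and the rows are assumed to be already sorted by ID so the initial sort is skipped.

```python
import numpy as np

# The six-row example from the question, already sorted by ID
rows = np.array([[1, 22, 1236],
                 [1, 11, 1563],
                 [2, 13, 1234],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [2, 23, 1250]])

# Number of rows per ID: [2, 4]
_, count = np.unique(rows[:, 0], return_counts=True)

# mask[i, j] is True while j is a valid position inside group i:
# [[ True,  True, False, False],
#  [ True,  True,  True,  True]]
mask = np.arange(count.max()) < count[:, None]

# One padded row of X values per group; -inf can never win the argmax:
# [[22., 11., -inf, -inf],
#  [13., 10., 23., 23.]]
padded = np.full(mask.shape, -np.inf)
padded[mask] = rows[:, 1]

# argmax returns the FIRST maximum in each padded row; adding the offset
# at which each group starts turns it into an index into "rows"
starts = np.append([0], count.cumsum()[:-1])          # [0, 2]
grp_max_idx = np.argmax(padded, axis=1) + starts      # [0, 4]

out = rows[grp_max_idx]
# out is [[1, 22, 1236], [2, 23, 1111]], the result the question asks for
```

Because `np.argmax` returns the first occurrence of the maximum, ties such as the two rows with X = 23 resolve to the earlier row.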

Sample input and output, using a larger unsorted array:

In [101]: rows
Out[101]:
array([[   2,   13, 1234],
       [   1,   22, 1236],
       [   2,   23, 1250],
       [   6,   12, 1345],
       [   4,   10,  290],
       [   2,   10, 1224],
       [   2,   23, 1111],
       [   4,   45,   99],
       [   1,   11, 1563],
       [   4,   23,   89]])

In [102]: out
Out[102]:
array([[   1,   22, 1236],
       [   2,   23, 1250],
       [   4,   45,   99],
       [   6,   12, 1345]])
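
For comparison, a shorter variant is possible with `np.lexsort`. This is a different technique from the padded-matrix approach above, not the answer's method. It is a sketch under stated assumptions: the helper name `first_max_row_per_group` and its parameters are hypothetical, and the X column is assumed to hold signed numeric values so that negating it is safe.

```python
import numpy as np

def first_max_row_per_group(rows, id_col=0, x_col=1):
    """Return, for each ID, the first row whose X value is the group maximum.

    Hypothetical helper; assumes X is a signed numeric column so -X is safe.
    """
    # lexsort treats the LAST key as the primary key: sort by ID, then by
    # X descending. The sort is stable, so rows tied on (ID, X) keep their
    # original relative order.
    order = np.lexsort((-rows[:, x_col], rows[:, id_col]))
    sorted_rows = rows[order]
    # return_index gives the first occurrence of each ID in the sorted array,
    # which is now the row with the largest X in that group
    _, first_idx = np.unique(sorted_rows[:, id_col], return_index=True)
    return sorted_rows[first_idx]

rows = np.array([[2, 13, 1234],
                 [1, 22, 1236],
                 [2, 23, 1250],
                 [6, 12, 1345],
                 [4, 10,  290],
                 [2, 10, 1224],
                 [2, 23, 1111],
                 [4, 45,   99],
                 [1, 11, 1563],
                 [4, 23,   89]])

out = first_max_row_per_group(rows)
# out is [[1, 22, 1236], [2, 23, 1250], [4, 45, 99], [6, 12, 1345]]
```

This trades the padded matrix, whose size is the number of groups times the largest group, for a single sort over all rows, which may matter when group sizes are very uneven.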

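Regarding "python - How to get the rows of a 2D NumPy array where one column's value is the maximum within groups defined by another column?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31399193/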