python - Counting missing instances in subgroups

import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})

+-------+----------+-----+
| Group | Subgroup | Obs |
+-------+----------+-----+
| A     | Blue     |   1 |
| A     | Blue     |   2 |
| A     | Blue     |   4 |
| A     | Red      |   1 |
| A     | Red      |   2 |
| A     | Red      |   3 |
| A     | Red      |   4 |
| B     | Blue     |   1 |
| B     | Blue     |   2 |
| B     | Blue     |   3 |
| B     | Blue     |   6 |
| B     | Red      |   1 |
| B     | Red      |   2 |
| B     | Red      |   3 |
+-------+----------+-----+

The Observations ('Obs') are supposed to be numbered without gaps, but you can see we have 'missed' Blue 3 in group A and Blue 4 and 5 in group B. The desired outcome is a percentage of all 'missed' Observations ('Obs') per group, so in the example:

+-------+--------------------+--------+--------+
| Group | Total Observations | Missed |   %    |
+-------+--------------------+--------+--------+
| A     |                  8 |      1 | 12.5%  |
| B     |                  9 |      2 | 22.22% |
+-------+--------------------+--------+--------+

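I tried both for loops and groupby, for example:

groups = df.groupby(['Group','Subgroup']).sum()
print(groups.head())

but I can't seem to get it to work either way.
From another answer (big shoutout to @Lie Ryan) I found a function for finding missing elements, but I don't quite understand how to apply it here yet: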

from itertools import chain, islice

def window(seq, n=2):
    """Return a sliding window (of width n) over data from the iterable.

    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...
    """
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def missing_elements(L):
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)

Can anyone point me in the right direction?

Best answer

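Simple enough: you need groupby here.
1. Using groupby + diff, calculate how many observations are missing per Group and Subgroup.
2. Group the result by Group, and compute the size and sum of the values calculated in the previous step.
3. A couple more straightforward steps (calculating the percentage) give you the expected output.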

f = [   # declare an aggfunc list in advance, we'll need it later
      ('Total Observations', 'size'),
      ('Missed', 'sum')
]

g = df.groupby(['Group', 'Subgroup'])\
      .Obs.diff()\
      .sub(1)\
      .groupby(df.Group)\
      .agg(f)

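# 'size' counted only the rows that are present; add the missed ones
# to get the number of observations that should exist
g['Total Observations'] += g['Missed']
g['%'] = g['Missed'] / g['Total Observations'] * 100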

g

       Total Observations  Missed          %
Group
A                     8.0     1.0  12.500000
B                     9.0     2.0  22.222222
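
To make the chain above easier to follow, here is a minimal sketch that splits it into named intermediate steps. The names gaps, summary and Rows are illustrative only and not part of the original answer. It assumes the rows are already sorted by Obs within each subgroup, and it only counts gaps between consecutive observations, so a subgroup whose numbering starts above 1 would not register a missed observation at the start.

```python
import pandas as pd

df = pd.DataFrame({
    'Group': ['A'] * 7 + ['B'] * 7,
    'Subgroup': ['Blue'] * 3 + ['Red'] * 4 + ['Blue'] * 4 + ['Red'] * 3,
    'Obs': [1, 2, 4, 1, 2, 3, 4, 1, 2, 3, 6, 1, 2, 3],
})

# Step 1: gap between consecutive observations within each (Group, Subgroup).
# The first row of every subgroup has no predecessor, so its value is NaN.
gaps = df.groupby(['Group', 'Subgroup'])['Obs'].diff().sub(1)

# Step 2: per Group, 'size' counts every row (NaN included), while
# 'sum' skips NaN and therefore adds up only the real gaps.
summary = gaps.groupby(df['Group']).agg(['size', 'sum'])
summary.columns = ['Rows', 'Missed']  # illustrative column names

# Step 3: expected observations = rows present + rows missed.
summary['Total Observations'] = summary['Rows'] + summary['Missed']
summary['%'] = summary['Missed'] / summary['Total Observations'] * 100

print(summary[['Total Observations', 'Missed', '%']])
```

If the data may arrive unsorted, calling df.sort_values(['Group', 'Subgroup', 'Obs']) before step 1 should keep the diff meaningful.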