python - Is it safe to sort a DataFrame before grouping it in PySpark?

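Given a PySpark DataFrame df with columns 'ProductId', 'Date' and 'Price', how safe is it to sort by 'Date' and assume that func.first('Price') will always retrieve the Price corresponding to the minimum date?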

In other words, will
df.orderBy('ProductId', 'Date').groupBy('ProductId').agg(func.first('Price'))
return, for each product, the first price paid in time, without the grouping disturbing the order set by orderBy?

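Best answer

I am not sure whether the order is guaranteed to survive groupBy(). However, here is an alternative way to do what you want.

Use pyspark.sql.Window to partition and order the DataFrame as needed. Then use pyspark.sql.DataFrame.distinct() to drop the duplicate entries.

For example:

Create dummy data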

data = [
    (123, '2017-07-01', 50),
    (123, '2017-01-01', 100),
    (345, '2018-01-01', 20),
    (123, '2017-03-01', 25),
    (345, '2018-02-01', 33)
]

# sqlCtx is an existing SQLContext; with a SparkSession, call spark.createDataFrame instead
df = sqlCtx.createDataFrame(data, ['ProductId', 'Date', 'Price'])
df.show()
#+---------+----------+-----+
#|ProductId|      Date|Price|
#+---------+----------+-----+
#|      123|2017-07-01|   50|
#|      123|2017-01-01|  100|
#|      345|2018-01-01|   20|
#|      123|2017-03-01|   25|
#|      345|2018-02-01|   33|
#+---------+----------+-----+

Use a Window

Use Window.partitionBy('ProductId').orderBy('Date'):

import pyspark.sql.functions as f
from pyspark.sql import Window

df.select(
    'ProductId',
    f.first('Price').over(Window.partitionBy('ProductId').orderBy('Date')).alias('Price')
).distinct().show()
#+---------+-----+
#|ProductId|Price|
#+---------+-----+
#|      123|  100|
#|      345|   20|
#+---------+-----+

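Edit

I found this scala post, in which the accepted answer says that the order is preserved, although there is a discussion in the comments that contradicts this.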

A similar question on Stack Overflow: https://stackoverflow.com/questions/48950500/