This article discusses the question: after "groupByKey", does Spark keep all the elements of an RDD[K,V] for a single key in a single partition, even if the data for that key is very large?

Problem Description

Consider that I have a pair RDD with, say, 10 partitions, but the keys are not evenly distributed: the data in 9 of the partitions belongs to a single key, say a, while the remaining keys, say b and c, appear only in the last partition. This is represented by a figure in the original question (not reproduced here).
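Since the figure is not reproduced here, a minimal sketch of such a skewed pair RDD may help to fix ideas; the names (sc, skewed) and the record counts are illustrative assumptions, not from the original question:

```scala
// Illustrative only: builds a 10-partition pair RDD where the first 9
// partitions contain nothing but key "a" and the last partition holds
// the remaining keys "b" and "c".
import org.apache.spark.{SparkConf, SparkContext}

object SkewedPairRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("skewed-pair-rdd").setMaster("local[*]"))

    val skewed = sc.parallelize(1 to 1000000, numSlices = 10).map { i =>
      if (i <= 900000) ("a", i)        // records landing in partitions 0..8
      else if (i % 2 == 0) ("b", i)    // last partition
      else ("c", i)                    // last partition
    }

    // Count records per (partition, key) to make the skew visible.
    skewed
      .mapPartitionsWithIndex((idx, it) => it.map(kv => ((idx, kv._1), 1L)))
      .reduceByKey(_ + _)
      .collect()
      .sorted
      .foreach(println)

    sc.stop()
  }
}
```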

Now if I do a groupByKey on this RDD, my understanding is that all data for the same key will eventually end up in a single partition; in other words, data for one key will never be spread across multiple partitions. Please correct me if I am wrong.
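As a quick check on the sketched RDD above (names are assumed from that sketch), one can list which partition each key ends up in after groupByKey:

```scala
// After groupByKey every key appears in exactly one output partition,
// because the shuffle sends all records of a key to the same reducer.
val grouped = skewed.groupByKey()

grouped
  .mapPartitionsWithIndex((idx, it) => it.map { case (key, _) => (key, idx) })
  .collect()
  .foreach { case (key, partition) => println(s"key $key is in partition $partition") }
```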

If that is the case, then the partition for key a may grow to a size that does not fit in a worker's RAM. What will Spark do in that case? My assumption is that it will spill the data to the worker's disk. Is that correct, or how does Spark handle such a situation?

Recommended Answer

Yes, it does. This is the whole point of the shuffle.

The size of a particular partition is not the biggest issue here. Partitions are represented by lazy Iterators and can easily hold data that exceeds the amount of available memory. The main problem is the non-lazy local data structure generated in the process of grouping.

All values for a particular key are stored in memory as a CompactBuffer, so a single large group can result in an OOM. Even if each record individually fits in memory, you may still encounter serious GC issues.
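A short sketch, continuing the assumed skewed RDD from above, of why that buffering hurts with a skewed key: the Iterable handed back per key is already fully materialized, so the group for the dominant key holds all of its values at once:

```scala
// All values of a key sit in one CompactBuffer before your code runs;
// for the dominant key "a" that is ~900000 values in a single buffer.
skewed.groupByKey()
  .mapValues(_.size)
  .collect()
  .foreach { case (key, count) => println(s"$key -> $count values buffered together") }
```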

In general:

  • It is safe, although not optimal performance-wise, to repartition data where the amount of data assigned to a partition exceeds the amount of available memory.
  • It is not safe to use PairRDDFunctions.groupByKey in the same situation, as contrasted in the sketch below.
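As a concrete contrast, continuing the assumed names from the earlier sketch, the two operations the list refers to might look like this:

```scala
// Safe, though not optimal: records are only redistributed and remain
// behind lazy iterators; nothing per-key is materialized.
val rebalanced = skewed.repartition(100)

// Not safe under heavy skew: every value for key "a" is collected into
// a single in-memory CompactBuffer on one executor.
val groupedByKey = skewed.groupByKey()
```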

Note: You shouldn't extrapolate this to different implementations of groupByKey, though. In particular, both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.
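For reference, a hedged sketch of the Dataset-side grouping the note refers to, assuming a SparkSession named spark and the skewed RDD from the earlier sketch; here the values of a group are exposed to user code as an Iterator rather than as a pre-built in-memory collection:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataset-groupbykey")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Turn the (key, value) pairs into a Dataset and group by the key.
val ds = spark.createDataset(skewed)              // Dataset[(String, Int)]

val counted = ds
  .groupByKey { case (key, _) => key }
  .mapGroups { (key, values) => (key, values.size) }  // values: Iterator[(String, Int)]

counted.show()
```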

