问题描述
我是 Spark 的新手.我有一个很大的元素数据集[RDD],我想把它分成两个大小完全相同的分区,保持元素的顺序.我尝试使用 RangePartitioner
之类的
I am new to Spark. I have a large dataset of elements[RDD] and I want to divide it into two exactly equal sized partitions maintaining order of elements. I tried using RangePartitioner
like
var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))
这不会给出令人满意的结果,因为它划分了粗略但不完全相等的元素保持顺序.例如,如果有 64 个元素,我们使用Rangepartitioner
,则分为31个元素和33个元素.
This doesn't give a satisfactory result because it divides roughly but not exactly equal sized maintaining order of elements.For example if there are 64 elements, we useRangepartitioner
, then it divides into 31 elements and 33 elements.
我需要一个分区器,以便我在一半中获得前 32 个元素,而另一半包含第二组 32 个元素.您能否通过建议如何使用自定义分区器来帮助我,以便我获得相同大小的两半,保持元素的顺序?
I need a partitioner such that I get exactly first 32 elements in one half and other half contains second set of 32 elements.Could you please help me by suggesting how to use a customized partitioner such that I get equally sized two halves, maintaining the order of elements?
推荐答案
Partitioner
的工作是为分区分配一个键.您需要先了解密钥分布,或查看所有密钥,才能制作这样的分区器.这就是 Spark 不为您提供的原因.
Partitioner
s work by assigning a key to a partition. You would need prior knowledge of the key distribution, or look at all keys, to make such a partitioner. This is why Spark does not provide you with one.
通常您不需要这样的分区器.事实上,我无法想出一个需要相同大小分区的用例.如果元素数是奇数怎么办?
In general you do not need such a partitioner. In fact I cannot come up with a use case where I would need equal-size partitions. What if the number of elements is odd?
无论如何,假设您有一个由顺序 Int
键控的 RDD,并且您知道总共有多少个.然后你可以像这样编写一个自定义的 Partitioner
:
Anyway, let us say you have an RDD keyed by sequential Int
s, and you know how many in total. Then you could write a custom Partitioner
like this:
class ExactPartitioner[V](
partitions: Int,
elements: Int)
extends Partitioner {
def getPartition(key: Any): Int = {
val k = key.asInstanceOf[Int]
// `k` is assumed to go continuously from 0 to elements-1.
return k * partitions / elements
}
}
这篇关于如何为大小相等的分区的 Spark RDD 定义自定义分区器,其中每个分区具有相同数量的元素?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!