How do I define a custom partitioner for a Spark RDD with equally sized partitions, where each partition has the same number of elements?

Problem description

I am new to Spark. I have a large dataset of elements [RDD] and I want to divide it into two partitions of exactly equal size while maintaining the order of elements. I tried using RangePartitioner like

var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))

This doesn't give a satisfactory result because it divides the elements roughly, but not into exactly equal sizes, while maintaining their order. For example, with 64 elements, RangePartitioner splits them into partitions of 31 and 33 elements.

I need a partitioner such that I get exactly the first 32 elements in one half, while the other half contains the second set of 32 elements. Could you please help me by suggesting how to use a custom partitioner so that I get two equally sized halves, maintaining the order of elements?

Recommended answer

A Partitioner works by assigning each key to a partition. To build such a partitioner you would need prior knowledge of the key distribution, or you would have to look at all the keys first. This is why Spark does not provide one.

In general you do not need such a partitioner. In fact, I cannot come up with a use case that requires exactly equal-sized partitions. What if the number of elements is odd?

Anyway, let us say you have an RDD keyed by sequential Ints, and you know how many elements there are in total. Then you could write a custom Partitioner like this:

import org.apache.spark.Partitioner

class ExactPartitioner[V](
    partitions: Int,
    elements: Int)
  extends Partitioner {

  // Partitioner is abstract; both numPartitions and getPartition must be implemented.
  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    // `k` is assumed to run continuously from 0 to elements - 1, so this
    // maps the first elements/partitions keys to partition 0, and so on.
    k * partitions / elements
  }
}
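As a quick sanity check of the arithmetic in getPartition, the snippet below (plain Scala, no Spark runtime needed) reproduces the formula for the asker's case of 64 elements and 2 partitions and counts how many keys land in each partition:

```scala
// Reproduces the `k * partitions / elements` formula from ExactPartitioner
// for 64 sequential keys and 2 partitions.
val partitions = 2
val elements = 64

val assignment = (0 until elements).map(k => k * partitions / elements)

// Count how many keys fall into each partition.
val counts = assignment.groupBy(identity).map { case (p, ks) => p -> ks.size }

// Keys 0..31 go to partition 0 and keys 32..63 to partition 1,
// so both partitions hold exactly 32 elements, in order.
println(counts)
```

To apply the partitioner to an unkeyed RDD, you would first key each element by its position, with something like `rdd.zipWithIndex.map { case (v, i) => (i.toInt, v) }.partitionBy(new ExactPartitioner(2, count))` (hypothetical usage; `count` is the total element count you must already know, and the `.toInt` cast is needed because zipWithIndex produces Long indices).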
