This post addresses the question: why doesn't the partition argument of SparkContext.textFile take effect?

Problem description

scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // I have 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21

scala> p.partitions.size
res33: Int = 729

I was expecting 8 to be printed, but I see 729 tasks in the Spark UI.

After calling repartition() as suggested by @zero323:

scala> val p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8
scala> p1.count

I still see 729 tasks in the Spark UI even though the spark-shell prints 8.

Recommended answer

If you take a look at the signature

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] 

you'll see that the argument you use is called minPartitions, and that pretty much describes its function: it is a lower bound on the number of partitions, not an exact count. In some cases even that hint is ignored, but that is a different matter. The input format used behind the scenes still decides how the splits are computed.
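To make this concrete, here is a minimal sketch of the split-size formula in Hadoop's old-API org.apache.hadoop.mapred.FileInputFormat, which backs textFile. The 24 GB file size and 32 MB block size below are illustrative assumptions, not values taken from the question:

// Sketch of FileInputFormat.computeSplitSize: the minPartitions hint only
// feeds into goalSize, and the block size can cap each split far below it.
object SplitMath {
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  def main(args: Array[String]): Unit = {
    val totalSize = 24L * 1024 * 1024 * 1024 // assumed ~24 GB input file
    val numSplits = 8                        // the minPartitions hint
    val goalSize  = totalSize / numSplits    // ideal split: ~3 GB
    val minSize   = 1L                       // default mapred.min.split.size
    val blockSize = 32L * 1024 * 1024        // assumed 32 MB local FS block

    val splitSize = computeSplitSize(goalSize, minSize, blockSize)
    println(s"splitSize = $splitSize")               // 32 MB, not ~3 GB
    println(s"splits    = ${totalSize / splitSize}") // 768, far more than 8
  }
}

With numbers in this ballpark the read produces hundreds of splits, which is how a request for 8 partitions can still end up as 729.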

In this particular case you could probably use mapred.min.split.size to increase the split size (this works during the load), or simply repartition after loading (this takes effect after the data is loaded), but in general there should be no need for that.
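As a sketch of both options, assuming the same spark-shell session and local file as in the question (the 256 MB value is an arbitrary example):

// Option 1: raise the minimum split size before loading, so fewer, larger
// splits are created during the read itself.
sc.hadoopConfiguration.setLong("mapred.min.split.size", 256L * 1024 * 1024)
val loaded = sc.textFile("file:///c:/_home/so-posts.xml")
println(loaded.partitions.size) // fewer partitions, driven by the 256 MB floor

// Option 2: load with the defaults, then repartition. This adds a shuffle;
// the stage that reads the file still runs with the original split count,
// which is why the Spark UI keeps showing 729 tasks for that first stage.
val reshaped = sc.textFile("file:///c:/_home/so-posts.xml").repartition(8)
println(reshaped.partitions.size) // 8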
