This article covers how to change the HDFS block size in PySpark. The question and recommended answer below may be a useful reference for anyone facing the same problem.

Problem Description

I use PySpark to write a Parquet file. I would like to change the HDFS block size of that file. I set the block size like this, but it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do that?

Recommended Answer

Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)

# Set the HDFS block size on the underlying Hadoop configuration
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # output saved with a 128 MB block size



In Scala:

sc.hadoopConfiguration.set("dfs.block.size", "128m")
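Since the original question is about Parquet output, below is a minimal sketch of the same idea applied to a DataFrame write. The SparkSession setup, DataFrame contents, and output path are illustrative assumptions, not part of the original answer; the block-size setting itself is the one shown above.

from pyspark.sql import SparkSession

# Minimal sketch (assumed setup): apply the same block-size setting
# before writing a Parquet file.
spark = (SparkSession.builder
         .master("yarn")
         .appName("parquet-block-size")   # hypothetical app name
         .getOrCreate())

# The Hadoop configuration is shared via the underlying SparkContext.
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

df = spark.createDataFrame([("Hello",), ("world",), ("!",)], ["word"])
df.write.parquet("hdfs:///output/path/words")  # Parquet files should get a 128 MB block size

If the setting needs to be in place before the job starts, passing it at submit time, for example --conf spark.hadoop.dfs.block.size=134217728, should have the same effect, since Spark copies spark.hadoop.* properties into the Hadoop configuration.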

That concludes this article on how to change the HDFS block size in PySpark. We hope the recommended answer above is helpful.
