How to change the HDFS block size in pyspark?
Problem description
I use pySpark to write a parquet file. I would like to change the HDFS block size of that file. I set the block size like this, but it doesn't work:
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
Does this have to be set before starting the pySpark job? If so, how do I do it?
Answer
Try setting it through sc._jsc.hadoopConfiguration() with the SparkContext:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
# "dfs.block.size" is the deprecated key name; on Hadoop 2+ the
# preferred key is "dfs.blocksize"
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # saving output with a 128 MB block size
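One pitfall worth noting: older Hadoop versions only accept a plain byte count for dfs.block.size, and a suffixed value such as "128m" may be silently ignored, so passing the exact number of bytes is the safer choice. A minimal helper (hypothetical, not part of Hadoop or PySpark) that converts a suffixed size string into bytes could look like this:

```python
def hdfs_size_to_bytes(size):
    """Convert a size string such as "128m" into a byte count.

    Hypothetical helper: useful because some Hadoop versions expect
    dfs.block.size as a plain number of bytes.
    """
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    size = size.strip().lower()
    if size and size[-1] in units:
        return int(size[:-1]) * units[size[-1]]
    return int(size)


print(hdfs_size_to_bytes("128m"))  # 134217728
```

With this, the configuration call becomes sc._jsc.hadoopConfiguration().set("dfs.block.size", str(hdfs_size_to_bytes("128m"))), which works regardless of whether the cluster's Hadoop version understands size suffixes.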
In Scala:
sc.hadoopConfiguration.set("dfs.block.size", "128m")
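As for setting it before the job starts: one option (a sketch, assuming a Spark 2.x+ deployment with the standard spark-submit CLI; my_job.py is a hypothetical script name) is to pass the value as a spark.hadoop.* property, which Spark copies into the job's Hadoop Configuration at startup:

```shell
# Any property prefixed with "spark.hadoop." is forwarded into the
# Hadoop Configuration of the job; 134217728 bytes = 128 MB.
spark-submit \
  --master yarn \
  --conf spark.hadoop.dfs.blocksize=134217728 \
  my_job.py
```

This avoids mutating the configuration through the private _jsc handle inside the driver code.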