This article covers how to change the HDFS block size in PySpark. The question and recommended answer below may be a useful reference for anyone facing the same problem.

Problem Description

I use PySpark to write a Parquet file. I would like to change the HDFS block size of that file. I set the block size like this, but it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do that?

Recommended Answer

Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)

# Set the HDFS block size on the underlying Hadoop configuration
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # output saved with a 128 MB block size



In Scala:

sc.hadoopConfiguration.set("dfs.block.size", "128m")
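Since the original question is about Parquet output, below is a minimal sketch of the same idea applied to a DataFrame write. The SparkSession setup, DataFrame contents, and output path are illustrative assumptions, not part of the original answer; the block-size setting itself is the one shown above.

from pyspark.sql import SparkSession

# Minimal sketch (assumed setup): apply the same block-size setting
# before writing a Parquet file.
spark = (SparkSession.builder
         .master("yarn")
         .appName("parquet-block-size")   # hypothetical app name
         .getOrCreate())

# The Hadoop configuration is shared via the underlying SparkContext.
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

df = spark.createDataFrame([("Hello",), ("world",), ("!",)], ["word"])
df.write.parquet("hdfs:///output/path/words")  # Parquet files should get a 128 MB block size

If the setting needs to be in place before the job starts, passing it at submit time, for example --conf spark.hadoop.dfs.block.size=134217728, should have the same effect, since Spark copies spark.hadoop.* properties into the Hadoop configuration.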

That concludes this article on how to change the HDFS block size in PySpark. We hope the recommended answer above is helpful.
