Behavior of the parameter "mapred.min.split.size" in HDFS


Problem description

Does the parameter "mapred.min.split.size" change the size of the blocks in which a file was previously written? Suppose that, when starting my job, I pass "mapred.min.split.size" with a value of 134217728 (128 MB). Which of the following correctly describes what happens?

1 - Each map task processes the equivalent of 2 HDFS blocks (assuming each block is 64 MB);

2 - My input file (already stored in HDFS) is re-divided so that it occupies 128 MB blocks in HDFS.

Solution

The split size is calculated by the formula:

max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

In your case, with sizes in MB, it will be:

split size = max(128, min(Long.MAX_VALUE (default), 64)) = 128

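So, for the two statements in the question:

1 - Each map task processes 2 HDFS blocks (assuming each block is 64 MB): True

2 - My input file (already stored in HDFS) is re-divided so that it occupies 128 MB blocks in HDFS: False

The parameter only affects how the input is divided among map tasks; the blocks already written to HDFS are not changed. Note that making the minimum split size greater than the block size does increase the split size, but at the cost of data locality: a split that spans two blocks may have to read one of them from a remote node.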
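The split-size formula quoted in this answer can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Hadoop's actual code: the names `compute_split_size`, `estimate_num_splits`, and `LONG_MAX` are hypothetical, and the 1 GB file size is an assumed example value.

```python
# Sketch of the split-size formula from the answer; not Hadoop's implementation.
MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # mirrors Java's Long.MAX_VALUE, the default max in the answer


def compute_split_size(min_split_size, max_split_size, block_size):
    """Return max(min_split_size, min(max_split_size, block_size)), in bytes."""
    return max(min_split_size, min(max_split_size, block_size))


def estimate_num_splits(file_size, split_size):
    """Hypothetical helper: ceiling division of file size by split size.

    This ignores any slack the framework may allow on the last split,
    so treat the result as an estimate.
    """
    return -(-file_size // split_size)


# The case from the question: min split size 128 MB, block size 64 MB.
split_size = compute_split_size(128 * MB, LONG_MAX, 64 * MB)

print("split size (MB):", split_size // MB)
print("HDFS blocks per split:", split_size // (64 * MB))
print("estimated splits for a 1 GB file:", estimate_num_splits(1024 * MB, split_size))
```

Nothing in this calculation touches the block size itself, which matches the answer: the 64 MB blocks stay as they were written, and only the amount of data handed to each map task changes.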
