This article looks at the question "Why are blocks in HDFS so large?". The answer below should be a useful reference if you are facing the same question.

Problem Description

Can somebody explain this calculation and give a lucid explanation?


Solution

A block is stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (the seek time) plus the time to read its contents without doing any further seeks, i.e. sizeOfTheBlock / transferRate = transferTime.

If we keep the ratio seekTime / transferTime small (close to 0.01 in the text), it means we are reading data from the disk almost as fast as the physical limit imposed by the disk, with minimal time spent looking for information.
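As a concrete illustration of that ratio, here is a small Python sketch. The 10 ms seek time, 100 MB/s transfer rate, and 128 MB block size are figures assumed for this example only; they are not taken from the answer above.

```python
# Illustrative figures (assumptions for this sketch, not from the answer above).
seek_time_s = 0.010          # average seek time: 10 ms
transfer_rate_bps = 100e6    # sustained transfer rate: 100 MB/s
block_size_bytes = 128e6     # a common HDFS block size: 128 MB

# Time to read one block = seek once + stream the whole block.
transfer_time_s = block_size_bytes / transfer_rate_bps
total_time_s = seek_time_s + transfer_time_s

ratio = seek_time_s / transfer_time_s
print(f"transfer time: {transfer_time_s:.2f} s, seek/transfer ratio: {ratio:.3f}")
# -> transfer time: 1.28 s, seek/transfer ratio: 0.008  (close to the 0.01 target)

# Conversely, the block size needed so that seeking costs only ~1% of the read:
target_ratio = 0.01
min_block_bytes = seek_time_s * transfer_rate_bps / target_ratio
print(f"block size for a {target_ratio:.0%} seek overhead: {min_block_bytes / 1e6:.0f} MB")
# -> block size for a 1% seek overhead: 100 MB
```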

This is important since in MapReduce jobs we are typically traversing (reading) the whole data set (represented by an HDFS file, a folder, or a set of folders) and doing logic on it. Since we have to spend the full transferTime anyway to get all the data off the disk, let's try to minimise the time spent doing seeks and read in big chunks, hence the large size of the data blocks.
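To see why this matters for a full scan, the sketch below (again using assumed figures: a 10 ms seek and a 100 MB/s transfer rate) compares the total seek overhead of reading a 1 GB file split into 4 KB blocks versus 128 MB blocks.

```python
# Assumed figures for illustration only.
seek_time_s = 0.010         # 10 ms per seek
transfer_rate_bps = 100e6   # 100 MB/s
dataset_bytes = 1e9         # scan a 1 GB file end to end

def scan_time(block_bytes):
    """Time to read the whole dataset: one seek per block plus pure transfer time."""
    num_blocks = dataset_bytes / block_bytes
    return num_blocks * seek_time_s + dataset_bytes / transfer_rate_bps

for block in (4_000, 128_000_000):   # 4 KB vs 128 MB blocks
    print(f"{block:>11,d} B blocks -> {scan_time(block):7.1f} s")
# ->       4,000 B blocks ->  2510.0 s  (250,000 seeks dominate)
# -> 128,000,000 B blocks ->    10.1 s  (essentially pure transfer time)
```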

In more traditional disk-access software we typically do not read the whole data set every time, so we would rather spend more time doing plenty of seeks on smaller blocks than lose time transferring lots of data we will not need.


That concludes this article on "Why are blocks in HDFS so large?". We hope the answer above is helpful to you.
