Hive/Impala performance with string partition keys vs. integer partition keys

Question

Are numeric columns recommended as partition keys? Is there any performance difference between select queries on numeric partition columns and on string partition columns?

Recommended answer

No, there is no such recommendation. Consider this: a partition in Hive is represented as a folder with a name like 'key=value' (or just 'value'), and either way it is a string folder name. So the partition key value is stored as a string and is cast to the column type during read/write. It is not packed inside the data files and is not compressed.
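To make the 'key=value' point concrete, here is a minimal Python sketch of how a partition value could be recovered from a folder name and then cast. It is only an illustration, not Hive's or Impala's actual code: the path, the function names and the schema mapping are all hypothetical, and real Hive also escapes special characters in folder names, which this sketch ignores.

```python
from typing import Callable, Dict


def parse_partition_path(path: str) -> Dict[str, str]:
    """Collect 'key=value' folder names from a path; every value is a string."""
    spec: Dict[str, str] = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            spec[key] = value
    return spec


def cast_partition_values(
    spec: Dict[str, str], schema: Dict[str, Callable[[str], object]]
) -> Dict[str, object]:
    """Cast each string value to its declared type (keys not in schema stay str)."""
    return {key: schema.get(key, str)(value) for key, value in spec.items()}


# Hypothetical table location with an integer and a string partition key.
path = "/warehouse/sales/year=2023/country=US/part-00000"
raw = parse_partition_path(path)
typed = cast_partition_values(raw, {"year": int, "country": str})

print(raw)    # values as they exist on disk: strings
print(typed)  # values after casting to the declared types
```

Whichever type the partition column is declared with, what sits on disk is the same string in the folder name, and the cast happens once per partition rather than once per row.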

Due to the distributed/parallel nature of map-reduce and Impala, you will not notice a difference in query processing performance. Also, all data is serialized to be passed between processing stages, then deserialized again and cast to some type, and this can happen many times for the same query.

Distributed processing and serializing/deserializing data create a lot of overhead. Practically, only the size of the data matters: the smaller the table (the size of its files), the faster it works. But you will not improve performance by restricting types.

Big string values used as partition keys can affect metadata DB performance, and the number of partitions being processed can also affect performance. Again, the same applies: only the size of the data matters here, not the types.

1, 0 can be better than 'Yes', 'No' just because of size. And compression and parallelism can make this difference negligible in many cases.
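A rough way to see the size argument is to compare the two encodings before and after compression. The sketch below uses Python's zlib on a made-up column of 100,000 random flags. It models values stored inside a data file, not partition folder names (which, as noted above, are not compressed), and the data and the choice of codec are assumptions for illustration only.

```python
import random
import zlib

# Made-up column of boolean flags; seeded so the run is repeatable.
random.seed(0)
flags = [random.random() < 0.5 for _ in range(100_000)]

# The same column encoded two ways, one value per line.
as_ints = "\n".join("1" if f else "0" for f in flags).encode("utf-8")
as_words = "\n".join("Yes" if f else "No" for f in flags).encode("utf-8")

# Size difference before and after compression.
raw_gap = len(as_words) - len(as_ints)
compressed_gap = abs(len(zlib.compress(as_words)) - len(zlib.compress(as_ints)))

print("raw bytes:       ", len(as_ints), "vs", len(as_words))
print("raw gap:         ", raw_gap)
print("compressed gap:  ", compressed_gap)
```

The uncompressed 'Yes'/'No' column is clearly larger, while the gap between the two compressed columns is much smaller in absolute terms. The exact numbers depend on the data and on the codec.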


09-11 08:18