Hive SQL coding style: intermediate tables?

Question

Should I be creating and dropping intermediate tables in Hive?

I can write something like (much simplified):

    drop table if exists tmp1;
    create table tmp1 as
    select a, b, c
    from input1
    where a > 1 and b < 3;

    drop table if exists tmp2;
    create table tmp2 as
    select x, y, z
    from input2
    where x < 6;

    drop table if exists output;
    create table output as
    select x, a, count(*) as count
    from tmp1 join tmp2 on tmp1.c = tmp2.z
    group by tmp1.b;

    drop table tmp1;
    drop table tmp2;

or I can roll everything into one statement:

    drop table if exists output;
    create table output as
    select x, a, count(*) as count
    from (select a, b, c from input1 where a > 1 and b < 3) t1
    join (select x, y, z from input2 where x < 6) t2
    on t1.c = t2.z
    group by t1.b;

Obviously, if I reuse the intermediate tables more than once, it makes perfect sense to create them. However, when they are used just once, I have a choice.

I tried both, and the second one is 6% faster as measured by wall time but 4% slower as measured by the MapReduce Total cumulative CPU time log output. This difference is probably within the random margin of error (caused by other processes, etc.).

However, is it possible that combining queries could result in a dramatic speedup?

Another question: are intermediate tables that are used just once a normal occurrence in Hive code, or should they be avoided when possible?

Solution

There is one significant difference. Running the one big query gives the optimizer more freedom. One of the most important optimizations in such cases is parallelism, controlled by hive.exec.parallel. When it is set to true, Hive executes independent stages in parallel.

In your case, imagine that t1 and t2 in the second query do more complex work, such as a group by. With that setting enabled, t1 and t2 in the second query can execute simultaneously, whereas in the first script the statements run serially, one after another.
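The scenario described in the answer can be sketched roughly as follows. This is only an illustration: the table and column names come from the question, while the aggregations inside t1 and t2 are hypothetical and are added so that each subquery needs its own stage. Whether the two stages actually overlap depends on the plan Hive generates and on the capacity available on the cluster.

```sql
-- Sketch only: the group by inside t1 and t2 is a hypothetical addition,
-- used here to give each subquery an independent aggregation stage.

-- Allow Hive to run stages that do not depend on each other at the same time.
set hive.exec.parallel=true;
-- Optional: cap on how many stages may run concurrently (Hive's default is 8).
set hive.exec.parallel.thread.number=8;

-- Print the plan without running the query. In the STAGE DEPENDENCIES
-- section, stages that do not depend on one another are candidates for
-- parallel execution.
explain
select t1.c, t1.n1, t2.n2
from (select c, count(*) as n1 from input1 where a > 1 and b < 3 group by c) t1
join (select z, count(*) as n2 from input2 where x < 6 group by z) t2
  on t1.c = t2.z;
```

Comparing wall time with hive.exec.parallel set to false and then to true on the same query is a simple way to check whether the setting makes a difference for a given workload.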
10-24 18:57