我们正在测试Apache Impala,并注意到将GROUP BY和LIKE一起使用非常慢-单独的查询工作更快。这是两个示例:
# 1.37s 1.08s 1.35s
SELECT * FROM hive.default.pcopy1B where
(lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%')
or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%')
or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%')
or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%')
limit 100;
# 156.64s 155.63s
select "by", type, ranking, count(*) from pcopy where
(lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%')
or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%')
or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%')
or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%')
group by "by", type, ranking
order by 4 desc limit 10;
有人可以解释为什么会发生此问题,以及是否有任何解决方法吗?
最佳答案
2个查询之间有一个基本区别。
第一个查询
要点:
WHERE
子句的100行,它将被标记为已完成,并且将返回100条记录。 第二查询
要点:
GROUP BY
子句生成结果。 WHERE
子句GROUP BY
子句。 ORDER BY
子句。 因此,您提供的查询可能看起来相似,但完全不同,并且可以共同解决不同的目的。
我希望这能帮到您。