performance - 使用Group By和Like进行Impala查询的性能降低

我们正在测试Apache Impala，并注意到将GROUP BY和LIKE一起使用非常慢-单独的查询工作更快。这是两个示例:

# 1.37s 1.08s 1.35s

SELECT * FROM hive.default.pcopy1B where
     (lower("by")  like '%part%' and lower("by")  like '%and%' and lower("by")  like '%the%')
  or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%')
  or (lower(url)   like '%part%' and lower(url)   like '%and%' and lower(url)   like '%the%')
  or (lower(text)  like '%part%' and lower(text)  like '%and%' and lower(text)  like '%the%')
limit 100;

# 156.64s 155.63s

select "by", type, ranking, count(*) from pcopy where
     (lower("by")  like '%part%' and lower("by")  like '%and%' and lower("by")  like '%the%')
  or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%')
  or (lower(url)   like '%part%' and lower(url)   like '%and%' and lower(url)   like '%the%')
  or (lower(text)  like '%part%' and lower(text)  like '%and%' and lower(text)  like '%the%')
group by "by", type, ranking
order by 4 desc limit 10;

有人可以解释为什么会发生此问题，以及是否有任何解决方法吗？

最佳答案

2个查询之间有一个基本区别。

第一个查询

要点:

仅选择100行。

一旦进程获得满足所提供的WHERE子句的100行，它将被标记为已完成，并且将返回100条记录。

只有1个映射器步骤。映射器的数量取决于您的数据大小。

第二查询

要点:

仅选择10行。

即使只选择了10行，该过程仍需要扫描完整数据，以便基于GROUP BY子句生成结果。

应该有3个mapper-reducer步骤。每个步骤中映射程序减少程序的数量将取决于数据大小。

1st MP将读取数据并应用WHERE子句

第二个MR将用于GROUP BY子句。

3rd MR将用于ORDER BY子句。

因此，您提供的查询可能看起来相似，但完全不同，并且可以共同解决不同的目的。

我希望这能帮到您。