This article describes how to handle secondary filtering and grouping in Spark SQL. It should be a useful reference for anyone trying to solve a similar problem; follow along below.

Problem description

Problem: I have a data set A {field1, field2, field3, ...}, and I would like to first group A by, say, field1, then within each of the resulting groups run a bunch of subqueries; for example, count the number of rows that have field2 == true, or count the number of distinct field3 values that have field4 == "some_value" and field5 == false, etc.

Some alternatives I can think of: I could write a custom user-defined aggregate function that takes a function computing the filter condition, but that way I have to create an instance of it for every query condition. I've also looked at the countDistinct function, which can express some of these operations, but I can't figure out how to use it to implement the filter-distinct-count semantics.

In Pig, I can do this:

FOREACH (GROUP A by field1) {
        field_a = FILTER A by field2 == TRUE;
        field_b = FILTER A by field4 == 'some_value' AND field5 == FALSE;
        field_c = DISTINCT field_b.field3;

        GENERATE  FLATTEN(group),
                  COUNT(field_a) as fa,
                  COUNT(field_b) as fb,
                  COUNT(field_c) as fc;
}

Is there a way to do this in Spark SQL?

Recommended answer

Excluding the distinct count, this can be solved with a simple sum over a condition:

// Assumes the SQL implicits are in scope for $ and toDF, e.g.
// import sqlContext.implicits._ (Spark 1.x) or import spark.implicits._ (Spark 2.x+).
import org.apache.spark.sql.functions.{count, sum}

val df = sc.parallelize(Seq(
  (1L, true, "x", "foo", true), (1L, true, "y", "bar", false), 
  (1L, true, "z", "foo", true), (2L, false, "y", "bar", false), 
  (2L, true, "x", "foo", false)
)).toDF("field1", "field2", "field3", "field4", "field5")

val left = df.groupBy($"field1").agg(
  sum($"field2".cast("int")).alias("fa"),
  sum(($"field4" === "foo" && ! $"field5").cast("int")).alias("fb")
)
left.show

// +------+---+---+
// |field1| fa| fb|
// +------+---+---+
// |     1|  3|  0|
// |     2|  1|  1|
// +------+---+---+
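Since the question asks specifically about Spark SQL, here is a minimal sketch of the same conditional aggregation expressed as a raw SQL query. It assumes the DataFrame is registered as a temporary table named A (the table name and the sqlContext handle are illustrative):

// Register the DataFrame so it can be queried with SQL
// (on Spark 2.x+ use df.createOrReplaceTempView("A") and spark.sql instead).
df.registerTempTable("A")

val leftSql = sqlContext.sql("""
  SELECT field1,
         SUM(CAST(field2 AS INT)) AS fa,
         SUM(CASE WHEN field4 = 'foo' AND NOT field5 THEN 1 ELSE 0 END) AS fb
  FROM A
  GROUP BY field1
""")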

Unfortunately, the distinct count is much more tricky. The GROUP BY clause in Spark SQL doesn't physically group the data, not to mention that finding distinct elements is quite expensive. Probably the best thing you can do is to compute the distinct counts separately and simply join the results:

val right = df.where($"field4" === "foo" && ! $"field5")
  .select($"field1".alias("field1_"), $"field3")
  .distinct
  .groupBy($"field1_")
  .agg(count("*").alias("fc"))

val joined = left
  .join(right, $"field1" === $"field1_", "leftouter")
  .na.fill(0)
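For the sample data above, only the (2, true, "x", "foo", false) row satisfies field4 === "foo" && !field5, so the distinct count fc should be 0 for field1 = 1 and 1 for field1 = 2. A small usage sketch that drops the helper join key (the output below is hand-derived and the row order is not guaranteed):

joined.drop("field1_").show

// Expected, approximately:
// +------+---+---+---+
// |field1| fa| fb| fc|
// +------+---+---+---+
// |     1|  3|  0|  0|
// |     2|  1|  1|  1|
// +------+---+---+---+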

Using a UDAF to count distinct values per condition is definitely an option, but an efficient implementation will be rather tricky. Converting from the internal representation is rather expensive, and implementing a fast UDAF with collection storage is not cheap either. If you can accept an approximate solution, you can use a bloom filter there.
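The answer suggests a bloom filter; a related built-in approximation is the HyperLogLog-based approxCountDistinct (approx_count_distinct on newer Spark). The following is a sketch only, swapping HyperLogLog in for the bloom-filter idea; whether the approximation is acceptable depends on the use case:

import org.apache.spark.sql.functions.{approxCountDistinct, when}

// Map rows that fail the condition to null; aggregate functions ignore nulls,
// so this approximates "count distinct field3 where field4 = 'foo' and not field5" per group.
val approxFc = df.groupBy($"field1").agg(
  approxCountDistinct(when($"field4" === "foo" && !$"field5", $"field3")).alias("fc_approx")
)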

This concludes the article on secondary filtering and grouping in Spark SQL. We hope the recommended answer is helpful, and thank you for your support!
