After spending some time reviewing the code of Apache Spark, I found that the dropDuplicates operator is equivalent to groupBy followed by the first aggregate function. From the Scaladoc:

```
first(columnName: String, ignoreNulls: Boolean): Column
Aggregate function: returns the first value of a column in a group.
```

```scala
import org.apache.spark.sql.functions.first

val firsts = dups.groupBy("value").agg(first("value") as "value")
```

The logical plan:

```
scala> println(firsts.queryExecution.logical.numberedTreeString)
00 'Aggregate [value#64L], [value#64L, first('value, false) AS value#139]
01 +- SerializeFromObject [input[0, bigint, false] AS value#64L]
02    +- MapElements <function1>, class java.lang.Long, [StructField(value,LongType,true)], obj#63: bigint
03       +- DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, cast(id#58L as bigint), true), obj#62: java.lang.Long
04          +- Range (0, 9, step=1, splits=Some(8))
```

And the physical plan:

```
scala> firsts.explain
== Physical Plan ==
*HashAggregate(keys=[value#64L], functions=[first(value#64L, false)])
+- Exchange hashpartitioning(value#64L, 200)
   +- *HashAggregate(keys=[value#64L], functions=[partial_first(value#64L, false)])
      +- *SerializeFromObject [input[0, bigint, false] AS value#64L]
         +- *MapElements <function1>, obj#63: bigint
            +- *DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#58L, true), obj#62: java.lang.Long
               +- *Range (0, 9, step=1, splits=8)
```

I also think that the dropDuplicates operator may be more performant.

That concludes this look at which row the dropDuplicates operator uses; I hope the answer is helpful.
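The semantics the answer describes, keeping the first row encountered for each key, can be sketched in plain Scala without Spark. This is only an illustration of the single-partition behavior; the `Row` case class and the example data are hypothetical, and in a distributed Spark job the row that happens to be "first" per key is not deterministic unless the data is ordered.

```scala
// Hypothetical row type standing in for a two-column Dataset.
case class Row(key: Long, payload: String)

// Sketch of dropDuplicates-on-key / groupBy + first semantics:
// keep the first occurrence of each key, in encounter order.
def dropDupsByKey(rows: Seq[Row]): Seq[Row] =
  rows.foldLeft((Vector.empty[Row], Set.empty[Long])) {
    case ((kept, seen), r) =>
      if (seen(r.key)) (kept, seen)        // key already seen: drop the row
      else (kept :+ r, seen + r.key)       // first occurrence: keep it
  }._1

val rows = Seq(Row(1L, "a"), Row(2L, "b"), Row(1L, "c"))
println(dropDupsByKey(rows))  // Vector(Row(1,a), Row(2,b))
```

Note that `Row(1L, "c")` is discarded because key 1 was already seen, which mirrors why `first("value")` on each group reproduces the deduplicated result.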