This article explains how to transform a DataFrame before a join operation in Spark; it may serve as a useful reference for anyone hitting the same problem.

Problem Description

The following code is used to extract ranks from the products column. The ranks are the second numbers in each pair [...]. For example, given [[222,66],[333,55]], the ranks are 66 and 55 for the products with PK 222 and 333, respectively. But the code runs very slowly in Spark 2.2 when df_products is around 800 MB:

import org.apache.spark.sql.functions.{coalesce, explode, lit}

df_products.createOrReplaceTempView("df_products")

val result = df.as("df2")
  .join(
    spark.sql("SELECT * FROM df_products")
      .select($"product_PK", explode($"products").as("products"))
      .withColumnRenamed("product_PK", "product_PK_temp")
      .as("df1"),
    $"df2.product_PK" === $"df1.product_PK_temp" &&
      $"df2.rec_product_PK" === $"df1.products.product_PK",
    "left")
  .drop($"df1.product_PK_temp")
  .select($"product_PK", $"rec_product_PK",
    coalesce($"df1.products.col2", lit(0.0)).as("rank_product"))

This is a small sample of df_products and df:

df_products =

+----------+--------------------+
|product_PK|            products|
+----------+--------------------+
|       111|[[222,66],[333,55...|
|       222|[[333,24],[444,77...|
...
+----------+--------------------+

df =

+----------+-----------------+                 
|product_PK|   rec_product_PK|
+----------+-----------------+
|       111|              222|
|       222|              888|
+----------+-----------------+

The code above works well when the arrays in each row of products contain a small number of elements. But when there are many elements in the arrays of each row [[..],[..],...], the code seems to get stuck and makes no progress.

How can I optimize the code? Any help is greatly appreciated.

Is it possible, for example, to transform df_products into the following DataFrame before joining?

df_products =

+----------+--------------------+------+
|product_PK|      rec_product_PK|  rank|
+----------+--------------------+------+
|       111|                 222|    66|
|       111|                 333|    55|
|       222|                 333|    24|
|       222|                 444|    77|
...
+----------+--------------------+------+

Recommended Answer

As per my answer here, you can transform df_products using something like this:

import org.apache.spark.sql.functions.explode

// Explode each element of the products array into its own row,
// then flatten the struct fields into top-level columns.
val df1 = df_products.withColumn("array_elem", explode(df_products("products")))
val df2 = df1.select("product_PK", "array_elem.*")
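
For completeness, here is a minimal sketch of how the flattened rows could be renamed into the (product_PK, rec_product_PK, rank) shape proposed in the question and then joined back to df. It assumes, as the question's code does, that each struct carries the fields product_PK and col2; the names flat, product_PK_temp, rec_product_PK_temp and result are illustrative.

import org.apache.spark.sql.functions.{coalesce, lit}

// Flatten into the target shape; `col2` is the rank field from the question's code.
val flat = df1.select(
  df1("product_PK").as("product_PK_temp"),                 // key of the source product
  df1("array_elem.product_PK").as("rec_product_PK_temp"),  // key of the recommended product
  df1("array_elem.col2").as("rank"))                       // the rank itself

// Left-join on both keys and default missing ranks to 0.0, as in the question.
val result = df
  .join(flat,
    df("product_PK") === flat("product_PK_temp") &&
      df("rec_product_PK") === flat("rec_product_PK_temp"),
    "left")
  .select(df("product_PK"), df("rec_product_PK"),
    coalesce(flat("rank"), lit(0.0)).as("rank_product"))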

This assumes products is an array of structs. If products is an array of arrays, you can use the following instead:

// For an array of arrays, each array_elem is itself an array,
// and the rank is its second element (index 1).
val df2 = df1.withColumn("rank", df1("array_elem").getItem(1))
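
If both values are needed from the inner array, a minimal sketch under the same assumption, i.e. that each array_elem is a two-element array [rec_product_PK, rank] (column names illustrative):

// getItem(0) is assumed to be the recommended product key,
// getItem(1) the rank, matching the question's sample data.
val flatArrays = df1
  .withColumn("rec_product_PK", df1("array_elem").getItem(0))
  .withColumn("rank", df1("array_elem").getItem(1))
  .drop("array_elem")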
