本文介绍了如何使用selectExpr在Spark数据帧中强制转换结构数组?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何在spark数据帧中强制转换结构数组?

How to cast an array of struct in a spark dataframe ?

让我通过一个例子来说明我要做什么.我们将从创建一个数据框开始,该数据框包含行和嵌套行的数组.我的整数尚未在数据框中强制转换,并且已创建为字符串:

Let me explain what I am trying to do via an example.We'll start by creating a dataframe Which contains an array of rows and nested rows. My Integers are not casted yet in the dataframe, and they're created as strings :

import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rows1 = Seq(
  Row("1", Row("a", "b"), "8.00", Seq(Row("1","2"), Row("12","22"))),
  Row("2", Row("c", "d"), "9.00", Seq(Row("3","4"), Row("33","44")))
)

val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)

val schema1 = StructType(
  Seq(
    StructField("id", StringType, true),
    StructField("s1", StructType(
      Seq(
        StructField("x", StringType, true),
        StructField("y", StringType, true)
      )
    ), true),
    StructField("d", StringType, true),
    StructField("s2", ArrayType(StructType(
      Seq(
        StructField("u", StringType, true),
        StructField("v", StringType, true)
      )
    )), true)
  )
)

val df1 = spark.createDataFrame(rows1Rdd, schema1)

这是创建的数据框的架构:

Here's the schema of the created dataframe :

       df1.printSchema
       root
       |-- id: string (nullable = true)
       |-- s1: struct (nullable = true)
       |    |-- x: string (nullable = true)
       |    |-- y: string (nullable = true)
       |-- d: string (nullable = true)
       |-- s2: array (nullable = true)
       |    |-- element: struct (containsNull = true)
       |    |    |-- u: string (nullable = true)
       |    |    |-- v: string (nullable = true)

我要做的是将所有可以为整数的字符串都转换为整数.我尝试执行以下操作,但没有成功:

What I want to do is to cast all the strings which can be an integer, to an integer. I tried to do the following but it didn't work:

df1.selectExpr("CAST (id AS INTEGER) as id",
  "STRUCT (s1.x, s1.y) AS s1",
  "CAST (d AS DECIMAL) as d",
  "Array (Struct(CAST (s2.u AS INTEGER), CAST (s2.v AS INTEGER))) as s2").show()

我遇到以下异常:

cannot resolve 'CAST(`s2`.`u` AS INT)' due to data type mismatch: cannot cast array<string> to int; line 1 pos 14;

任何人都有正确的查询来将所有值转换为INTEGER吗?我将不胜感激.

Anyone has the right query to cast all the values to INTEGER ? I'll be grateful.

非常感谢

推荐答案

您应匹配完整结构:

val result = df1.selectExpr(
  "CAST(id AS integer) id",
  "s1",
  "CAST(d AS decimal) d",
  "CAST(s2 AS array<struct<u:integer,v:integer>>) s2"
)

应该为您提供以下架构:

which should give you following schema:

result.printSchema
root
 |-- id: integer (nullable = true)
 |-- s1: struct (nullable = true)
 |    |-- x: string (nullable = true)
 |    |-- y: string (nullable = true)
 |-- d: decimal(10,0) (nullable = true)
 |-- s2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- u: integer (nullable = true)
 |    |    |-- v: integer (nullable = true)

和数据:

result.show
+---+-----+---+----------------+
| id|   s1|  d|              s2|
+---+-----+---+----------------+
|  1|[a,b]|  8|[[1,2], [12,22]]|
|  2|[c,d]|  9|[[3,4], [33,44]]|
+---+-----+---+----------------+

这篇关于如何使用selectExpr在Spark数据帧中强制转换结构数组?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-22 05:14