Question
I would like to create a JSON from a Spark v.1.6 (using Scala) dataframe. I know that there is the simple solution of doing df.toJSON.
However, my problem looks a bit different. Consider for instance a dataframe with the following columns:
| A | B      | C1 | C2 | C3    |
--------------------------------
| 1 | test   | ab | 22 | TRUE  |
| 2 | mytest | gh | 17 | FALSE |
I would like to end up with a dataframe that has the following columns:
| A | B      | C                                        |
----------------------------------------------------------
| 1 | test   | { "c1" : "ab", "c2" : 22, "c3" : TRUE }  |
| 2 | mytest | { "c1" : "gh", "c2" : 17, "c3" : FALSE } |
where C is a JSON containing C1, C2, C3. Unfortunately, at compile time I do not know what the dataframe looks like (except for the columns A and B, which are always "fixed").
As for the reason why I need this: I am using Protobuf to send the results around. Unfortunately, my dataframe sometimes has more columns than expected, and I would still like to send those via Protobuf, but I do not want to specify all columns in the definition.
How can I achieve this?
Answer
Spark 2.1 should have native support for this use case (see #15354).
import org.apache.spark.sql.functions.{struct, to_json}

// the $"..." column syntax also requires: import spark.implicits._
df.select(to_json(struct($"c1", $"c2", $"c3")))
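Since the extra columns are not known at compile time, the struct can also be built from whatever columns the dataframe actually has. A minimal sketch, assuming Spark 2.1+ and that the fixed columns are named exactly A and B:

import org.apache.spark.sql.functions.{col, struct, to_json}

// everything except the fixed columns goes into the JSON struct
val dynamicCols = df.columns.filterNot(Set("A", "B")).map(col)

df.select(col("A"), col("B"), to_json(struct(dynamicCols: _*)).alias("C"))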
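On Spark 1.6 itself, where to_json is not yet available, one possible workaround is a UDF that receives the dynamic columns packed into a struct (Spark passes a struct to a UDF as a Row) and serializes it with json4s, which ships with Spark. This is only a sketch under those assumptions; toJValue and rowToJson are hypothetical helpers, and only the primitive types appearing in the example are handled:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}
import org.json4s.JsonAST._
import org.json4s.jackson.JsonMethods.{compact, render}

// hypothetical helper: map the primitive values from the example to json4s values
def toJValue(v: Any): JValue = v match {
  case null       => JNull
  case b: Boolean => JBool(b)
  case i: Int     => JInt(i)
  case l: Long    => JInt(l)
  case d: Double  => JDouble(d)
  case s: String  => JString(s)
  case other      => JString(other.toString) // fallback: stringify anything else
}

val names = df.columns.filterNot(Set("A", "B")).toList

// the struct's fields arrive in the Row in the same order as `names`
val rowToJson = udf((r: Row) =>
  compact(render(JObject(names.zip(r.toSeq.map(toJValue))))))

df.select(col("A"), col("B"), rowToJson(struct(names.map(col): _*)).alias("C"))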