Spark + EMRFS/S3 - Is it possible to read client-side encrypted data and write it back using server-side encryption?

Problem description

I have a use case in Spark where I have to read data from an S3 bucket that uses client-side encryption, process it, and write it back using only server-side encryption. I'm wondering if there's a way to do this in Spark?

Currently, I have these options set:

spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3.enableServerSideEncryption=true
spark.hadoop.fs.s3.serverSideEncryption.kms.keyId=<kms id here>
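
For reference, a minimal sketch of setting the same options programmatically when building the SparkSession (the application name is a placeholder):

import org.apache.spark.sql.SparkSession

// equivalent to the options above; the spark.hadoop prefix forwards
// each key into the underlying Hadoop configuration
val spark = SparkSession.builder()
  .appName("cse-read-sse-write")
  .config("spark.hadoop.fs.s3.cse.enabled", "true")
  .config("spark.hadoop.fs.s3.enableServerSideEncryption", "true")
  .config("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", "<kms id here>")
  .getOrCreate()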

But obviously, it ends up using both CSE and SSE while writing the data. So I'm wondering if it's possible to somehow set spark.hadoop.fs.s3.cse.enabled to true only while reading and then set it to false, or whether there is another alternative.

Any help is appreciated.

Recommended answer

Using programmatic configuration to define multiple S3 filesystems:

spark.hadoop.fs.s3.cse.enabled=true
spark.hadoop.fs.s3sse.impl=foo.bar.S3SseFilesystem
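
Equivalently, a sketch of registering the same mapping at runtime (assuming an existing SparkSession named spark):

// the spark.hadoop prefix is dropped when setting keys on the
// Hadoop configuration directly
spark.sparkContext.hadoopConfiguration.set("fs.s3sse.impl", "foo.bar.S3SseFilesystem")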

and then add a custom implementation for s3sse:

package foo.bar

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.s3a.S3AFileSystem

// Filesystem bound to the s3sse:// scheme. It starts from a fresh Hadoop
// configuration, so the client-side-encryption setting of the main job
// does not carry over, and it enables server-side encryption only.
class S3SseFilesystem extends S3AFileSystem {
  override def initialize(name: URI, originalConf: Configuration): Unit = {
    // deliberately ignore originalConf so fs.s3.cse.enabled=true is dropped
    val conf = new Configuration()
    // NOTE: no spark.hadoop prefix here; these are plain Hadoop keys
    conf.set("fs.s3.enableServerSideEncryption", "true")
    conf.set("fs.s3.serverSideEncryption.kms.keyId", "<kms id here>")
    super.initialize(name, conf)
  }
}

After this, the custom file system can be used with Spark's read method:

spark.read.json("s3sse://bucket/prefix")
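
A hypothetical end-to-end flow under this setup (bucket and prefix names are placeholders): the default s3 scheme decrypts via CSE on read, while the custom s3sse scheme writes with SSE only.

// read client-side encrypted input through the default s3 filesystem
val df = spark.read.json("s3://bucket/input-prefix")

// write back through the custom scheme, which enables SSE only
df.write.json("s3sse://bucket/output-prefix")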
