This article covers how to rename S3 files (rather than HDFS files) in Spark Scala. It should be a useful reference for anyone facing the same problem; read on to learn how it's done.

Problem description

I have approximately 1 million text files stored in S3, and I want to rename all of them based on their folder names.

How can I do that in spark-scala?

I am looking for some sample code.

I am using Zeppelin to run my Spark script.

Following the suggestion in the answer, I tried the code below:

import org.apache.hadoop.fs._

val src = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN")
val dest = new Path("s3://trfsmallfffile/FinancialLineItem/MAIN/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs = Path.getFileSystem(conf)
fs.rename(src, dest)

But it fails with the error below:

<console>:110: error: value getFileSystem is not a member of object org.apache.hadoop.fs.Path
       val fs = Path.getFileSystem(conf)

Recommended answer

You can use the normal HDFS APIs, something like (typed in, not tested):

val src = new Path("s3a://bucket/data/src")
val dest = new Path("s3a://bucket/data/dest")
val conf = sc.hadoopConfiguration   // assuming sc = spark context
val fs = src.getFileSystem(conf)
fs.rename(src, dest)
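
The answer renames a single path; to address the original question (renaming every file according to its folder name), here is a minimal sketch. It assumes the files sit directly under one-level sub-folders of a root such as s3a://bucket/data/<folderName>/<fileName>, and that each file should simply get its folder name as a prefix. The bucket, layout and naming rule are assumptions for illustration, not taken from the question.

import org.apache.hadoop.fs.{FileSystem, Path}

// Assumed layout: s3a://bucket/data/<folderName>/<fileName>
// Assumed rule:   rename each file to <folderName>_<fileName> within the same folder.
val root = new Path("s3a://bucket/data")      // hypothetical root path
val conf = sc.hadoopConfiguration             // assuming sc = spark context, as in the answer
val fs   = root.getFileSystem(conf)

// Walk the immediate sub-folders, then the files inside each one, sequentially on the driver.
fs.listStatus(root).filter(_.isDirectory).foreach { dir =>
  val folderName = dir.getPath.getName
  fs.listStatus(dir.getPath).filter(_.isFile).foreach { file =>
    val src  = file.getPath
    val dest = new Path(dir.getPath, s"${folderName}_${src.getName}")
    // rename() can return false instead of throwing, so check the result.
    if (!fs.rename(src, dest)) println(s"rename failed: $src -> $dest")
  }
}

Keeping the loop sequential on the driver is deliberate: as explained below, each rename is really a COPY + DELETE against S3, and parallelising it mostly invites throttling.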

The way the S3A client fakes a rename is a copy + delete of every file, so the time it takes is proportional to the number of files and the amount of data. And S3 throttles you: if you try to do this in parallel, it will potentially slow you down. Don't be surprised if it takes "a while".

You also get billed per COPY call, at $0.005 per 1,000 calls, so with roughly a million files the attempt will cost about $5 (1,000,000 / 1,000 × $0.005). Test on a small directory until you are sure everything is working.
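
One hedged way to follow that advice, reusing the assumed layout from the sketch above: do a dry run against a single small folder and only print the planned renames before issuing any real COPY calls.

import org.apache.hadoop.fs.{FileSystem, Path}

// Dry run on one small, hypothetical test folder: print the planned renames only.
val testDir    = new Path("s3a://bucket/data/some-small-folder")   // assumed test prefix
val conf       = sc.hadoopConfiguration
val fs         = testDir.getFileSystem(conf)
val folderName = testDir.getName

fs.listStatus(testDir).filter(_.isFile).take(10).foreach { file =>
  val src  = file.getPath
  val dest = new Path(testDir, s"${folderName}_${src.getName}")
  println(s"would rename: $src -> $dest")   // listing only, so no COPY calls are billed yet
}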

That concludes this article on how to rename S3 files (rather than HDFS files) in Spark Scala. I hope the recommended answer is helpful, and thank you for your continued support!
