List all files available in a Spark cluster stored on Hadoop HDFS using Scala or Python?


Question

What is the most efficient way to list the names of all files that are locally available in Spark? I'm using the Scala API, but Python would also be fine.

Recommended answer

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.{ListBuffer, Stack}

// `sc` is the SparkContext (predefined in spark-shell).
val fs = FileSystem.get(sc.hadoopConfiguration)
val dirs = Stack[String]()             // directories still to visit
val files = ListBuffer.empty[String]   // file paths found so far
dirs.push("/user/username/")

// Iterative depth-first traversal: pop a directory, list its entries,
// push subdirectories back onto the stack and record plain files.
while (dirs.nonEmpty) {
  val status = fs.listStatus(new Path(dirs.pop()))
  status.foreach { x =>
    if (x.isDirectory) dirs.push(x.getPath.toString)
    else files += x.getPath.toString
  }
}

files.foreach(println)
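
As a possible alternative to the manual stack, Hadoop's FileSystem also has a listFiles(path, recursive) method that walks subdirectories itself and returns only files. The sketch below wraps it in a helper; the name listFilesRecursively is hypothetical and not part of the original answer, and "/user/username/" is the same placeholder path used above. Whether it is faster than the loop above depends on the file system implementation, so treat it as an option to measure rather than a guaranteed improvement.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

// Hypothetical helper (not from the original answer): returns the paths of
// all files under `root`, including files in nested subdirectories.
def listFilesRecursively(fs: FileSystem, root: Path): List[String] = {
  val files = ListBuffer.empty[String]
  // With recursive = true, listFiles descends into subdirectories and
  // yields files only (directories themselves are not returned).
  val it = fs.listFiles(root, true)
  while (it.hasNext) {
    files += it.next().getPath.toString
  }
  files.toList
}

// Usage in spark-shell, where `sc` is already defined:
// val fs = FileSystem.get(sc.hadoopConfiguration)
// listFilesRecursively(fs, new Path("/user/username/")).foreach(println)
```

Because the helper takes the FileSystem as a parameter, it can also be tried against the local file system (FileSystem.getLocal) without a running cluster.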
