This article describes an approach to transferring files from a remote node to HDFS using Flume. It should be a useful reference for anyone facing the same problem; follow along below.

Problem Description

I have a bunch of binary files compressed into *.gz format. They are generated on a remote node and must be transferred to HDFS located on one of the datacenter's servers.

I'm exploring the option of sending the files with Flume; I looked into doing this with a Spooling Directory configuration, but apparently this only works when the file directory is located locally on the same HDFS node.

Any suggestions how to tackle this problem?

Solution

There is no out-of-the-box solution for such a case, but you could try these workarounds:

  1. You could create your own source implementation for this purpose (by using the Flume SDK). For example, this project seems to be able to connect to a remote directory over SSH and use it as a source.
  2. You could create a custom scheduled script that periodically copies remote files into a local spool directory, then use that directory as a Spooling Directory source for the Flume agent (see the first sketch after this list).
  3. You could write another script that reads your remote data and writes it to its standard output, and use that script with an Exec Source (see the second sketch after this list).
  4. You could locate your Flume agent on the machine where the data resides (see Can Spool Dir of flume be in remote machine?); the third sketch after this list outlines such a setup.
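
For the second workaround, the Flume side is just an ordinary Spooling Directory agent on the Hadoop node; the scheduled copy itself could be, for example, an rsync job driven by cron, as long as files only appear in the spool directory once they are fully written (the spooling source expects them to be complete and unchanging). The configuration below is only a minimal sketch: the agent name, directories, and namenode address are placeholder assumptions, and binary *.gz payloads may need additional settings (such as a suitable deserializer) on top of this.

    # agent1.properties -- spooldir-to-HDFS sketch for workaround 2 (names and paths are illustrative)
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = snk1

    # Spooling Directory source: picks up completed files from a LOCAL directory
    agent1.sources.src1.type = spooldir
    agent1.sources.src1.spoolDir = /var/spool/flume/incoming
    agent1.sources.src1.channels = ch1

    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 1000

    # HDFS sink: writes the events into the cluster
    agent1.sinks.snk1.type = hdfs
    agent1.sinks.snk1.channel = ch1
    agent1.sinks.snk1.hdfs.path = hdfs://namenode:8020/flume/incoming
    agent1.sinks.snk1.hdfs.fileType = DataStream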
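
For the third workaround, an Exec Source runs a command and turns its output into events. The sketch below assumes a hypothetical host, user, and file path, and the ssh command is only an example. Note that an Exec Source offers no delivery guarantees if the command dies, and it reads output line by line by default, which makes raw binary *.gz data awkward; this is the weakest of the options.

    # agent2.properties -- Exec Source sketch for workaround 3 (command, host, and paths are illustrative)
    agent2.sources = src1
    agent2.channels = ch1
    agent2.sinks = snk1

    # Exec Source: runs the command and emits its stdout as events
    agent2.sources.src1.type = exec
    agent2.sources.src1.command = ssh user@remote-node cat /data/export/latest.gz
    agent2.sources.src1.channels = ch1

    agent2.channels.ch1.type = memory
    agent2.channels.ch1.capacity = 1000

    agent2.sinks.snk1.type = hdfs
    agent2.sinks.snk1.channel = ch1
    agent2.sinks.snk1.hdfs.path = hdfs://namenode:8020/flume/exec-ingest
    agent2.sinks.snk1.hdfs.fileType = DataStream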
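
For the fourth workaround, the usual pattern is a two-tier setup: an agent on the remote machine watches the local directory and forwards events over Avro to a collector agent on the Hadoop side, which writes to HDFS. A minimal sketch, with placeholder hostnames, ports, and paths:

    # remote-agent.properties -- runs on the machine where the *.gz files are produced
    remote.sources = src1
    remote.channels = ch1
    remote.sinks = avroOut

    remote.sources.src1.type = spooldir
    remote.sources.src1.spoolDir = /data/export/outgoing
    remote.sources.src1.channels = ch1

    # file channel survives agent restarts better than a memory channel
    remote.channels.ch1.type = file
    remote.channels.ch1.checkpointDir = /var/flume/checkpoint
    remote.channels.ch1.dataDirs = /var/flume/data

    # Avro sink: ships events to the collector on the Hadoop side
    remote.sinks.avroOut.type = avro
    remote.sinks.avroOut.channel = ch1
    remote.sinks.avroOut.hostname = collector-host
    remote.sinks.avroOut.port = 4141

    # collector-agent.properties -- runs on (or next to) the Hadoop cluster
    collector.sources = avroIn
    collector.channels = ch1
    collector.sinks = snk1

    # Avro source: receives events sent by the remote agent
    collector.sources.avroIn.type = avro
    collector.sources.avroIn.bind = 0.0.0.0
    collector.sources.avroIn.port = 4141
    collector.sources.avroIn.channels = ch1

    collector.channels.ch1.type = memory
    collector.channels.ch1.capacity = 1000

    collector.sinks.snk1.type = hdfs
    collector.sinks.snk1.channel = ch1
    collector.sinks.snk1.hdfs.path = hdfs://namenode:8020/flume/remote-ingest
    collector.sinks.snk1.hdfs.fileType = DataStream

Each agent would then be started with something like flume-ng agent --conf conf --conf-file <file> --name <agent name>.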

That concludes this article on transferring files from a remote node to HDFS with Flume. We hope the suggested answers are helpful; thank you for your support!
