How to force STORE (overwrite) to HDFS in Pig?

Problem description

When developing Pig scripts that use the STORE command, I have to delete the output directory before every run; otherwise the script stops with:

2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists

So I'm looking for an in-Pig way to remove the directory automatically, one that doesn't choke if the directory doesn't exist at call time.

In the Pig Latin Reference I found the shell command invoker fs. Unfortunately, the Pig script breaks whenever anything produces an error, so I can't use

fs -rmr foo/bar

(i.e. remove recursively), since it breaks if the directory doesn't exist. For a moment I thought I might use

fs -test -e foo/bar

which is only a test and shouldn't break, or so I thought. However, Pig again interprets the return code of test on a non-existent directory as a failure and breaks.

There is a JIRA ticket for the Pig project addressing my problem and suggesting an optional parameter OVERWRITE or FORCE_WRITE for the STORE command. Anyway, I'm using Pig 0.8.1 out of necessity and there is no such parameter.

Solution

At last I found a solution on grokbase. Since finding it took so long, I will reproduce it here and add to it.

Suppose you want to store your output using the statement

STORE Relation INTO 'foo/bar';

Then, in order to delete the directory, you can call at the start of the script

rmf foo/bar

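No ";" or quotation marks are required, since rmf is a Grunt shell command rather than a Pig Latin statement. Unlike fs -rmr, it does not fail if the directory does not exist.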

I cannot reproduce it now, but at some point I got an error message (something about missing files), and I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration; placing it after the SET, REGISTER and %default statements should be fine.

Example:

SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';
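
A possible variant of the example above, as a sketch only: if the output path appears in more than one place, Pig's parameter substitution can keep the rmf call and the STORE statement in sync. The parameter name out is hypothetical and not part of the original answer; the rest mirrors the example.

```pig
-- 'out' is a hypothetical parameter name chosen for illustration
%default out 'foo/bar'
-- remove the previous output; rmf does not complain if the path is missing
rmf $out
Rel = LOAD 'something.tsv';
-- the same parameter is substituted into the STORE path
STORE Rel INTO '$out';
```

With this layout the path can also be overridden at launch time, for example with pig -param out=some/other/dir, without editing the script.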
