Problem description
When developing Pig scripts that use the STORE command, I have to delete the output directory before every run, or the script stops and reports:
2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists
So I'm searching for an in-Pig solution to automatically remove the directory, one that also doesn't choke if the directory doesn't exist at call time.
In the Pig Latin Reference I found the shell command invoker fs. Unfortunately, the Pig script breaks whenever any invoked command produces an error, so I can't use
fs -rmr foo/bar
(i. e. remove recursively) since it breaks if the directory doesn't exist. For a moment I thought I may use
fs -test -e foo/bar
which is a test and shouldn't break, or so I thought. However, Pig again interprets test's return code on a non-existent directory as a failure code and breaks.
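As background (my own illustration, not part of the original answer): Pig aborts whenever an invoked command returns a non-zero exit status. A stand-in demo with local shell commands shows the exit codes Pig is reacting to; the /tmp path is a hypothetical placeholder:

```shell
# 'hadoop fs -test -e' follows the POSIX 'test -e' convention: exit status 0
# when the path exists, non-zero when it does not. Pig aborts on any
# non-zero status, which is why both fs invocations above break the script.
# Local stand-ins (test, rm) behave the same way with respect to exit codes.
rc=0
test -e /tmp/__pig_demo_missing__ || rc=$?
echo "test -e on a missing path exits with: $rc"

# Outside Pig, the usual shell idiom is to mask the delete's failure:
rm -r /tmp/__pig_demo_missing__ 2>/dev/null || true
echo "masked delete exits with: $?"
```

Pig offers no such masking for fs commands, which is the whole problem.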
There is a JIRA ticket for the Pig project addressing my problem and suggesting an optional OVERWRITE or FORCE_WRITE parameter for the STORE command. However, I'm using Pig 0.8.1 out of necessity, and there is no such parameter.
At last I found a solution on grokbase. Since finding it took too long, I will reproduce it here and add to it.
Suppose you want to store your output using the statement
STORE Relation INTO 'foo/bar';
Then, in order to delete the directory, you can call at the start of the script
rmf foo/bar
No ";" or quotation marks are required, since it is a shell command.
I cannot reproduce it now, but at some point in time I got an error message (something about missing files) where I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration; after SETs, REGISTERs and %defaults should be fine.
Example:
SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';
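One refinement (my own sketch, not from the original answer): since Pig performs parameter substitution textually before the script runs, it should also apply to the rmf line, so the delete target and the STORE target can share one %default and never drift apart. The $out name here is a hypothetical choice:

```pig
-- Hypothetical $out parameter: single source of truth for the output path
%default out 'foo/bar'
-- Substitution happens before the command runs, so this deletes foo/bar
rmf $out
Rel = LOAD 'something.tsv';
STORE Rel INTO '$out';
```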
This concludes this article on how to force STORE (overwrite) to HDFS in Pig. I hope the answer above is helpful.