Writing one file per group in Pig Latin


Problem description

The Problem: I have numerous files that contain Apache web server log entries. The entries are not in date-time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date and time, then write them to files named for the day and hour of the entries they contain.
Setup: Once I have imported my files, I use a regular expression to extract the date field, then truncate it to the hour. This produces a set with the original record in one field and the date truncated to the hour in another. From here I group on the date-hour field.
First Attempt: My first thought was to use the STORE command while iterating through my groups with a FOREACH, and I quickly found out that Pig does not allow that.
Second Attempt: My second try was to use the MultiStorage() method in the piggybank, which worked great until I looked at the output files. The problem is that MultiStorage wants to write all fields to the file, including the field I used to group on. What I really want is just the original record written to the file.
The Question: So... am I using Pig for something it is not intended for, or is there a better way to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.
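The truncation step described in the setup can be sketched in plain Java, for example as the logic a small custom UDF might wrap (Pig's built-in REGEX_EXTRACT can likely do the same job inside the script). This is only an illustration: the class name `LogHourKey`, the method `hourKey`, and the assumed timestamp layout `[10/Oct/2000:13:55:36 -0700]` are not from the original post.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name and the assumed log layout are illustrative only.
final class LogHourKey {
    // Matches a timestamp such as [10/Oct/2000:13:55:36 -0700] and captures
    // everything up to and including the hour, e.g. 10/Oct/2000:13
    private static final Pattern TIMESTAMP = Pattern.compile(
        "\\[(\\d{2}/[A-Za-z]{3}/\\d{4}:\\d{2}):\\d{2}:\\d{2} [+-]\\d{4}\\]");

    private LogHourKey() {}

    // Returns the timestamp truncated to the hour, or empty if the line has no timestamp.
    static Optional<String> hourKey(String logLine) {
        Matcher m = TIMESTAMP.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Calling `LogHourKey.hourKey(line)` on each record gives the date-hour value to group on. If that value is also meant to become a file or directory name, the slashes and colon would probably need to be replaced first.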
Solution
Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100% of the way there. I usually find it worth it, since writing a small store function is a lot less Java than a whole MapReduce program.
Your second attempt is really close to what I would do. You should either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then, modify the putNext method to strip out the group value, but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rewrite the entire tuple. Or, if all you have is the original string, just pull that out and output that wrapped in a Tuple.
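The "rewrite the entire tuple" step could look roughly like the sketch below. To keep it runnable without Pig on the classpath, a `List<Object>` stands in for Pig's `Tuple`; the class `TupleFields`, the method `withoutField`, and the parameter name `splitFieldIndex` are assumptions made for illustration. In a real store function the copied fields would presumably be wrapped in a new `Tuple` (for example through `TupleFactory`) inside the modified `putNext` before being written.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in sketch: a List<Object> plays the role of a Pig Tuple so the example
// runs without Pig. All names here are hypothetical.
final class TupleFields {
    private TupleFields() {}

    // Builds a new field list that omits the field at splitFieldIndex.
    // The input is left untouched, mirroring the need to build a fresh tuple
    // rather than delete a field in place.
    static List<Object> withoutField(List<Object> fields, int splitFieldIndex) {
        if (splitFieldIndex < 0 || splitFieldIndex >= fields.size()) {
            throw new IndexOutOfBoundsException("no field at index " + splitFieldIndex);
        }
        List<Object> kept = new ArrayList<>(fields.size() - 1);
        for (int i = 0; i < fields.size(); i++) {
            if (i != splitFieldIndex) {
                kept.add(fields.get(i));
            }
        }
        return kept;
    }
}
```

The group value would still be read from the original fields to pick the output file; only the copy without that field gets written.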
Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
