Pig not loading data into HCatalog table - HortonWorks Sandbox

Problem description

I am running a Pig script in the HortonWorks virtual machine with the goal of extracting certain parts of my XML dataset, and loading those parts into columns in an HCatalog table. On my local machine, I run my Pig script on the XML file and get an output file with all the extracted parts. However, for some reason when I run this same script in the HortonWorks VM the script appears to run successfully but the HCatalog table is still empty.

Here is my local script:

REGISTER piggybank.jar

items = LOAD 'data1.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS  (row:chararray);

data = FOREACH items GENERATE
REGEX_EXTRACT(row, 'Id="([^"]*)"', 1) AS  id:int,
REGEX_EXTRACT(row, 'CreationDate="([^"]*)"', 1) AS  creationdate:chararray,
REGEX_EXTRACT(row, 'Score="([^"]*)"', 1) AS  score:int,
REGEX_EXTRACT(row, 'Title="([^"]*)"', 1) AS  title:chararray;


STORE data INTO '/tmp/postsETLResults' USING PigStorage();

The one I'm using in HortonWorks:

REGISTER piggybank.jar

items = LOAD 'data1.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS  (row:chararray);

data = FOREACH items GENERATE
REGEX_EXTRACT(row, 'Id="([^"]*)"', 1) AS  id:int,
REGEX_EXTRACT(row, 'CreationDate="([^"]*)"', 1) AS  creationdate:chararray,
REGEX_EXTRACT(row, 'Score="([^"]*)"', 1) AS  score:int,
REGEX_EXTRACT(row, 'Title="([^"]*)"', 1) AS  title:chararray;


STORE data into 'posts_table_1' USING org.apache.hcatalog.pig.HCatStorer();


validate = LOAD 'default.posts_table_1' USING org.apache.hcatalog.pig.HCatLoader();

Sample XML row (from the StackOverflow public dataset):

<row Id="149115" PostTypeId="2" ParentId="149078" CreationDate="2008-09-29T15:16:23.870" Score="1" Body="&lt;p&gt;I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.&lt;/p&gt;&#xA;" OwnerDisplayName="user16324" LastActivityDate="2008-09-29T15:16:23.870" CommentCount="1" />

I created the HCatalog table manually, and all the correct fields exist and are of the correct type.

The strange thing is that if I run DUMP data in Pig, I get no output. If I run ILLUSTRATE data, I see pieces of my data in the log, followed by large blank areas, followed by more data, and so on.

What am I missing here? I'd really like to take this messy XML file and get a neat table in HCatalog. Again, I get the results I'm looking for when running the local script on my machine, but when I run the second version designed for storing the output into the posts_table_1 HCatalog table, I get a success message but an empty table.

Alternatively, if I can just get the output on my local machine as a comma-delimited file, I can use that file and have HCatalog automatically load the data in the Hue interface. As of now, the output is space-delimited which is problematic in Hue because the titles of posts contain spaces.
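For the comma-delimited alternative, one option is to convert the tab-delimited output after the fact. The sketch below is only an illustration in Python, not part of the original scripts: the function name tsv_to_csv and the sample title are hypothetical. It relies on the standard csv module so that titles containing commas are quoted. Pig's PigStorage also accepts a delimiter argument, for example PigStorage(','), but that writes fields as-is without quoting.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-delimited text to comma-delimited text.

    Fields that contain a comma are wrapped in double quotes by the
    csv module, so titles with commas stay in a single column.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in tsv_text.splitlines():
        # Each output row from the Pig script is assumed to be one line
        # with columns separated by a single tab character.
        writer.writerow(line.split("\t"))
    return out.getvalue()


# Hypothetical row in the shape id, creationdate, score, title.
sample = "149115\t2008-09-29T15:16:23.870\t1\tA title, with a comma\n"
print(tsv_to_csv(sample), end="")
```

The same function could be applied to the contents of a part file under /tmp/postsETLResults before uploading the result through Hue.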

Thanks in advance! This has me stumped.

Recommended answer

I found the issue. I had created the HCatalog table manually with all of the default options, including the field delimiter, which defaults to ^A (\001). My output had columns separated by tabs (\t), so when the table received the data, it found no ^A delimiter and stored an empty dataset. I recreated the table with \t as the field delimiter and everything worked fine.
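The delimiter mismatch can be illustrated with a small sketch. This is a simplified Python model, not Hive's or HCatalog's actual parsing code: the helper split_row and the sample row are hypothetical. It only shows why a tab-delimited line yields a single field when split on ^A. In Hive DDL the field delimiter is typically set with a clause such as ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'.

```python
CTRL_A = "\x01"  # ^A, the default field delimiter for delimited text tables
TAB = "\t"       # the delimiter actually present in the data


def split_row(line, delimiter, num_columns=4):
    """Split a line into exactly num_columns fields.

    Missing trailing columns are padded with None, as a rough stand-in
    for the NULLs a delimited-text reader reports for absent fields.
    """
    fields = line.split(delimiter)
    fields += [None] * (num_columns - len(fields))
    return fields[:num_columns]


# Hypothetical tab-delimited row: id, creationdate, score, title.
row = "149115\t2008-09-29T15:16:23.870\t1\tSome title"

# Split on ^A: the whole line lands in the first column, the rest are None.
print(split_row(row, CTRL_A))

# Split on tab: all four columns are recovered.
print(split_row(row, TAB))
```

With the table's delimiter matched to the data, each value lands in its own column, which is consistent with the behavior described in the answer.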

05-19 09:39