Pig not loading data into HCatalog table - HortonWorks Sandbox

Problem description

I am running a Pig script in the HortonWorks virtual machine with the goal of extracting certain parts of my XML dataset, and loading those parts into columns in an HCatalog table. On my local machine, I run my Pig script on the XML file and get an output file with all the extracted parts. However, for some reason when I run this same script in the HortonWorks VM the script appears to run successfully but the HCatalog table is still empty.

Here is my local script:

REGISTER piggybank.jar

items = LOAD 'data1.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS  (row:chararray);

data = FOREACH items GENERATE
REGEX_EXTRACT(row, 'Id="([^"]*)"', 1) AS  id:int,
REGEX_EXTRACT(row, 'CreationDate="([^"]*)"', 1) AS  creationdate:chararray,
REGEX_EXTRACT(row, 'Score="([^"]*)"', 1) AS  score:int,
REGEX_EXTRACT(row, 'Title="([^"]*)"', 1) AS  title:chararray;


STORE data INTO '/tmp/postsETLResults' USING PigStorage();

The one I'm using in HortonWorks:

REGISTER piggybank.jar

items = LOAD 'data1.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS  (row:chararray);

data = FOREACH items GENERATE
REGEX_EXTRACT(row, 'Id="([^"]*)"', 1) AS  id:int,
REGEX_EXTRACT(row, 'CreationDate="([^"]*)"', 1) AS  creationdate:chararray,
REGEX_EXTRACT(row, 'Score="([^"]*)"', 1) AS  score:int,
REGEX_EXTRACT(row, 'Title="([^"]*)"', 1) AS  title:chararray;


STORE data into 'posts_table_1' USING org.apache.hcatalog.pig.HCatStorer();


validate = LOAD 'default.posts_table_1' USING org.apache.hcatalog.pig.HCatLoader();

Sample XML row (from the StackOverflow public dataset):

<row Id="149115" PostTypeId="2" ParentId="149078" CreationDate="2008-09-29T15:16:23.870" Score="1" Body="&lt;p&gt;I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.&lt;/p&gt;&#xA;" OwnerDisplayName="user16324" LastActivityDate="2008-09-29T15:16:23.870" CommentCount="1" />

I created the HCatalog table manually, and all the correct fields exist and are of the correct type.

The strange thing is that if I run DUMP data in Pig, I get no output. If I run ILLUSTRATE data, I see pieces of my data in the log, followed by large blank areas, followed by more data, and so on.

What am I missing here? I'd really like to take this messy XML file and get a neat table in HCatalog. Again, I get the results I'm looking for when running the local script on my machine, but when I run the second version designed for storing the output into the posts_table_1 HCatalog table, I get a success message but an empty table.

Alternatively, if I can just get the output on my local machine as a comma-delimited file, I can use that file and have HCatalog automatically load the data in the Hue interface. As of now, the output is space-delimited which is problematic in Hue because the titles of posts contain spaces.

Thanks in advance! This has me stumped.

Recommended answer

I found the issue. I had created the HCatalog table manually with all of the default options, including the field delimiter, which defaults to ^A (\001). My output had columns separated by tabs (\t), so when the table received the data, it found no ^A delimiter and stored an empty dataset. I recreated the table with \t as the delimiter and everything worked fine.
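To make the delimiter mismatch concrete, here is a minimal Python sketch of how a delimited-text reader splits one output row. It is only a rough model, not Hive's or HCatalog's actual parsing code; the helper name split_row and the sample row (built from the Id, CreationDate and Score of the XML row above, with an empty Title) are assumptions for illustration.

```python
# Rough model of a delimited-text reader; NOT Hive's actual SerDe.
CTRL_A = "\x01"  # ^A, the default field delimiter for Hive text tables
TAB = "\t"       # the delimiter actually present in the Pig output


def split_row(line, delimiter, n_columns):
    """Split a line on `delimiter` and pad missing columns with None."""
    fields = line.rstrip("\n").split(delimiter)
    fields += [None] * (n_columns - len(fields))
    return fields[:n_columns]


# Hypothetical output row: id, creationdate, score, title (title is empty).
row = "149115\t2008-09-29T15:16:23.870\t1\t"

wrong = split_row(row, CTRL_A, 4)  # table expects ^A, data uses tabs
right = split_row(row, TAB, 4)     # table recreated with \t as delimiter

print(wrong)
print(right)
```

With ^A as the delimiter the whole line lands in the first column and the remaining columns are missing, which is consistent with the table not showing the expected data. With \t the four columns line up with id, creationdate, score and title.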
