本文介绍了在Hive中获取空值创建&用REGEX加载查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个Log文件,我需要用REGEX存储数据。我试着下面的查询,但加载所有NULL值。我已经使用检查了REGEX,其数据工作正常。

  CREATE EXTERNAL TABLE IF NOT EXISTS avl(imei STRING,packet STRING)
ROW FORMAT SERDE'org.apache。 hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
input.regex=(IMEI \\\\\\\\\\\\\\\ \\'\\\\\\\\\\\\\\\\\\\\\\\\'\\\\\\\\\\\'
保存为TEXTFILE;

LOAD DATA INPATH'hdfs:/ user / user1 / data'OVERWRITE INTO TABLE avl;

请在这里改正。



日志:

  [INFO_ | 01/31 07:19:29] IMEI 356307043180842 
[INFO_ | 01/31 07:19:33] PacketLength = 372
[INFO_ | 01/31 07:19:33] Recv HEXString:

谢谢。

解决方案

使用你当前的表格定义,没有正则表达式可以做你想要的。原因是你的file_format设置为 TEXTFILE ,它按行分割输入文件( \r \ n ,或 \r\\\
),然后数据才会到达SerDe。



然后将每行分别传递给 RegexSerDe ,与您的正则表达式匹配,并且任何不匹配都返回 NULL 。因此,多行正则表达式不能使用 STORED AS TEXTFILE 。这也是为什么你收到了所有 NULL 行:因为没有一行输入匹配你的整个正则表达式。



这里的一个解决方案可能是pre - 处理数据,使每个记录只在输入文件的一行上,但这不是你要求的。



在Hive中执行此操作的方式是使用不同的file_format:

$ p $ 存储为INPUTFORMAT'org.apache.hadoop.mapred.TextInputFormat'

TextInputFormat从当前配置读取名为 textinputformat.record.delimiter 的配置变量。如果您使用的是 TextInputFormat ,则此变量将告诉Hadoop和Hive其中一条记录结束并且下一条记录开始。

因此,设置此值到像 EOR 这样的东西就意味着输入文件被分割为 EOR ,而不是按行。然后,由分割生成的每个块然后作为整体块传递给RegexSerDe,换行符&所有。

你可以在很多地方设置这个变量,但如果这只是这个(以及在会话后面)查询的分隔符,那么你可以做:

  SET textinputformat.record.delimiter = EOR; 

CREATE EXTERNAL TABLE ...
...
ROW格式SERDE'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
input.regex= ...
output.regex= ...

作为INPUTFORMAT存储'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION ...;

在您的特定情况下,我无法确定您可能用于 textinputformat.record.delimiter 而不是 EOF ,因为我们只有一个示例记录,并且我无法确定您尝试了哪个字段根据您的正则表达式捕获第二个。



如果您可以提供这两项(样本数据包含> 1条记录,以及您想要捕获的内容数据包,我可能会提供更多帮助。现在,您的正则表达式与您提供的示例数据不匹配 - 即使在上。

I have a Log file in which i need to store data with REGEX. I tried below query but loading all NULL values. I have checked REGEX with http://www.regexr.com/, its working fine for my data.

CREATE EXTERNAL TABLE IF NOT EXISTS avl(imei STRING,packet STRING)                        
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (                                             
"input.regex" = "(IMEI\\s\\d{15} (\\b(\\d{15})([A-Z0-9]+)) )",          
"output.format.string" = "%1$s %2$s"                              
)
STORED AS TEXTFILE;

LOAD DATA INPATH 'hdfs:/user/user1/data' OVERWRITE INTO TABLE avl;

Please correct me here.

Sample Log:

[INFO_|01/31 07:19:29]  IMEI 356307043180842 
[INFO_|01/31 07:19:33]  PacketLength = 372
[INFO_|01/31 07:19:33]  Recv HEXString : 0000000000000168080700000143E5FC86B6002F20BC400C93C6F000FF000E0600280007020101F001040914B34238DD180028CD6B7801C7000000690000000143E5FC633E002F20B3000C93A3B00105000D06002C0007020101F001040915E64238E618002CCD6B7801C7000000640000000143E5FC43FE002F20AA800C9381700109000F06002D0007020101F001040915BF4238D318002DCD6B7801C70000006C0000000143E5FC20D6002F20A1400C935BF00111000D0600270007020101F001040916394238B6180027CD6B7801C70000006D0000000143E5FBF5DE002F2098400C9336500118000B0600260007020101F0010409174D42384D180026CD6B7801C70000006E0000000143E5FBD2B6002F208F400C931140011C000D06002B0007020101F001040915624238C018002BCD6B7801C70000006F0000000143E5FBAF8E002F2085800C92EB10011E000D06002B0007020101F0010409154C4238A318002BCD6B7801C700000067000700005873

Thanks.

解决方案

With your current table definition, no regex will do what you're looking for. The reason is that your file_format is set to TEXTFILE, which splits up the input file by line (\r, \n, or \r\n), before the data ever gets to the SerDe.

Each line is then individually passed to RegexSerDe, matched against your regex, and any non-matches return NULL. For this reason, multiline regexes will not work using STORED AS TEXTFILE. This is also why you received all NULL rows: Because no single line of the input matched your entire regex.

One solution here might be pre-processing the data such that each record is only on one line in the input file, but that's not what you're asking for.

The way to do this in Hive is to use a different file_format:

STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'

TextInputFormat reads from the current configuration a configuration variable named textinputformat.record.delimiter. If you're using TextInputFormat, this variable tells Hadoop and Hive where one record ends and the next one begins.

Consequently, setting this value to something like EOR would mean that the input files are split on EOR, rather than by line. Each chunk generated by the split would then get passed to RegexSerDe as a whole chunk, newlines & all.

You can set this variable in a number of places, but if this is the delimiter for only this (and subsequent within the session) queries, then you can do:

SET textinputformat.record.delimiter=EOR;

CREATE EXTERNAL TABLE ...
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
   "input.regex" = ...
   "output.regex" = ...
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
          OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION ...;

In your specific scenario, I can't tell what you might use for textinputformat.record.delimiter instead of EOF, since we were only given one example record, and I can't tell which field you're trying to capture second based on your regex.

If you can provide these two items (sample data with >1 records, and what you're trying to capture for packet), I might be able to help out more. As it stands now, your regex does not match the sample data you provided -- not even on the site you linked.

这篇关于在Hive中获取空值创建&用REGEX加载查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

11-02 18:39