This article covers how to deal with a Parquet-backed Hive table whose array column is not queryable in Impala. It may be a useful reference if you run into the same problem.

Problem Description


Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.

I notice that Impala, as of CDH 5.5, now supports complex data types. Since it's also possible to run Hive UDFs in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!

As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:

123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
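For reference, the sample input file can be recreated like this (using the `raw_data.csv` file name from the LOAD DATA step below):

```shell
# Write the two sample rows to raw_data.csv, exactly as shown above:
# column 1 is the ID, column 2 is a pipe-delimited array of strings.
cat > raw_data.csv <<'EOF'
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
EOF
```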

A Hive table was created:

CREATE TABLE `id_member_of`(
  `id` INT,
  `member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

The raw data was loaded into the Hive table:

LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;

A Parquet version of the table was created:

CREATE TABLE `id_member_of_parquet` (
 `id` STRING,
 `member_of` ARRAY<STRING>)
STORED AS PARQUET;

The data from the CSV-backed table was inserted into the Parquet table:

INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;

And the Parquet table is now queryable in Hive:

hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]

Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:

[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id  |
+-----+
| 123 |
| 234 |
+-----+
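As far as I understand, the column hasn't actually disappeared: Impala simply excludes complex-type columns when it expands `SELECT *`. A `DESCRIBE` should still show it in the schema (hypothetical session against the same table):

```sql
-- The array column is still part of the table schema; Impala just
-- skips complex-type columns during star expansion.
-- The output should list member_of with type array<string>.
DESCRIBE id_member_of_parquet;
```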

Question: What happened to the array column? Can you see what I'm doing wrong?

Solution

It turned out to be really simple: we can access the array by adding it to the FROM clause with dot notation, e.g.

Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id  | item  |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
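For readability, the same query can be written with explicit table aliases; `item` is the pseudocolumn Impala exposes for the elements of a scalar `ARRAY` (a sketch against the same table):

```sql
-- Join each row of the base table with the elements of its own
-- member_of array; a.item refers to the ITEM pseudocolumn of the array.
SELECT t.id, a.item
FROM id_member_of_parquet t, t.member_of a;
```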

That concludes this article on Parquet-backed Hive tables whose array columns are not queryable in Impala. I hope the answer above is helpful.
