本文介绍了Impala 时间戳与 Hive 不匹配 - 时区问题?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 HDFS 中有一些事件日志数据,其原始格式如下所示:

2015-11-05 19:36:25.764 INFO [...等...]

一个外部表指向这个 HDFS 位置:

创建外部表`log_stage`(`event_time` 时间戳,[...])行格式分隔以 '	' 结尾的字段以 '
' 结尾的行存储为输入格式'org.apache.hadoop.mapred.TextInputFormat'输出格式'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

为了性能,我们想在 Impala 中查询它.通过执行 Hive 查询将 log_stage 数据插入到 Hive/Impala Parquet 支持的表中:INSERT INTO TABLE log SELECT * FROM log_stage.这是 Parquet 表的 DDL:

创建表`log`(`event_time` 时间戳,[...])行格式SERDE'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'存储为输入格式'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'输出格式'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

问题:在Impala中查询时,时间戳提前7小时:

蜂巢时间:2015-11-05 19:36:25.764黑斑羚时间:2015-11-06 02:36:25.764>as.POSIXct("2015-11-06 02:36:25") - as.POSIXct("2015-11-05 19:36:25")时差7小时

注意:服务器的时区(来自/etc/sysconfig/clock)都设置为America/Denver",目前比UTC晚7小时.

Impala 似乎正在处理已经在 UTC 时间的事件,错误地假设它们是在美国/丹佛时间,然后又增加了 7 个小时.

您知道如何同步时间以便 Impala 表与 Hive 表匹配吗?

解决方案

Hive 以不同的方式将时间戳写入 Parquet.您可以使用 impalad 标志 -convert_legacy_hive_parquet_utc_timestamps 告诉 Impala 在读取时进行转换.有关详细信息,请参阅TIMESTAMP 文档.>

这篇博文问题的简要描述:

Hive 将时间戳值存储为 Parquet 格式时,会将本地时间转换为 UTC 时间,当读取数据时,它会转换回本地时间.然而,另一方面,Impala 在读取时间戳字段时不进行任何转换,因此返回的是 UTC 时间而不是本地时间.

impalad 标志告诉 Impala 在读取 Hive 生成​​的 Parquet 中的时间戳时进行转换.它确实会产生一些小成本,因此如果这对您来说是个问题,您应该考虑使用 Impala 编写时间戳(尽管它可能很小).

I have some eventlog data in HDFS that, in its raw format, looks like this:

2015-11-05 19:36:25.764 INFO    [...etc...]

An external table points to this HDFS location:

CREATE EXTERNAL TABLE `log_stage`(
  `event_time` timestamp,
  [...])
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '	'
  LINES TERMINATED BY '
'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

For performance, we'd like to query this in Impala. The log_stage data is inserted into a Hive/Impala Parquet-backed table by executing a Hive query: INSERT INTO TABLE log SELECT * FROM log_stage. Here's the DDL for the Parquet table:

CREATE TABLE `log`(
  `event_time` timestamp,
  [...])
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

The problem: when queried in Impala, the timestamps are 7 hours ahead:

Hive time:   2015-11-05 19:36:25.764
Impala time: 2015-11-06 02:36:25.764

> as.POSIXct("2015-11-06 02:36:25") - as.POSIXct("2015-11-05 19:36:25")
Time difference of 7 hours

Note: The timezone of the servers (from /etc/sysconfig/clock) are all set to "America/Denver", which is currently 7 hours behind UTC.

It seems that Impala is taking events that are already in UTC, incorrectly assuming they're in America/Denver time, and adding another 7 hours.

Do you know how to sync the times so that the Impala table matches the Hive table?

解决方案

Hive writes timestamps to Parquet differently. You can use the impalad flag -convert_legacy_hive_parquet_utc_timestamps to tell Impala to do the conversion on read. See the TIMESTAMP documentation for more details.

This blog post has a brief description of the issue:

The impalad flag tells Impala to do the conversion when reading timestamps in Parquet produced by Hive. It does incur some small cost, so you should consider writing your timestamps with Impala if that is an issue for you (though it likely is minimal).

这篇关于Impala 时间戳与 Hive 不匹配 - 时区问题?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

06-11 16:20