This article describes how to deal with wrong hour values in a datetime field when writing from Spark SQL to a Hive table; it should be a useful reference for anyone hitting the same problem.

Problem Description

I face this problem: when I load data into a timestamp field in Hive with spark.sql, the hours are strangely changed to 21:00:00!

Let me explain:

I have a CSV file that I read with spark.sql. I read the file, convert it to a dataframe, and store it in a Hive table. One of the fields in this file is a date in the format "3/10/2017". The field in Hive that I want to load it into has the Timestamp type (the reason I use this data type instead of Date is that I want to query the table with Impala, and Impala only has a Timestamp data type, so simply changing the column type to Date is not an option).

As you can see from the documentation, the Hive Timestamp data type uses the "YYYY-MM-DD HH:MM:SS" format, so before writing the dataframe to the Hive table I convert the date values to that format.

Here is my code in Python:

from datetime import datetime
from pyspark.sql.functions import udf

df = spark.read.csv("hdfs:/user/../MyFile.csv", header=True)

# Use a user-defined function to convert the date format from
# "d/m/Y" (e.g. "16/10/2017") to Hive's "YYYY-MM-DD HH:MM:SS",
# padding the missing time component with midnight.
def DateConvert(x):
    x_augm = str(x) + " 00:00:00"
    datetime_object = datetime.strptime(x_augm, '%d/%m/%Y %H:%M:%S')
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S')

DateConvert_udf = udf(DateConvert)

df = df.withColumn("Trans_Date", DateConvert_udf("Trans_Date"))
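
If you are on Spark 2.2 or later, the same conversion can also be done without a UDF using the built-in to_timestamp and date_format functions, which usually perform better than a Python UDF. A minimal sketch, assuming the same Trans_Date column as above:

from pyspark.sql.functions import to_timestamp, date_format

# Parse "d/M/yyyy" dates (e.g. "16/10/2017") and render them in
# Hive's "YYYY-MM-DD HH:MM:SS" layout, padded to midnight.
df = df.withColumn(
    "Trans_Date",
    date_format(to_timestamp("Trans_Date", "d/M/yyyy"), "yyyy-MM-dd HH:mm:ss"))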

This properly formats the timestamp. When I run

df.select("Trans_Date").show(10, False)

I get:

+-------------------+
|Trans_Date         |
+-------------------+
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
+-------------------+

Then I load the data into Hive with Spark SQL like this:

df.createOrReplaceTempView('tempTable')
spark.sql("insert into table db.table select * from tempTable")

My problem is that when I go to Hive, my Timestamp field has values like:

2017-10-16 21:00:00
2017-10-16 21:00:00

This is very strange!

Thanks in advance for any suggestions.

Recommended Answer

This is a common problem when saving data into Hive tables with the TIMESTAMP data type.

When you save data into a Hive table, TIMESTAMP values represent the local time zone of the host where the data was written.
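
One way to control this from the Spark side is to pin the SQL session time zone, so that writes are not shifted by the host's local zone. A minimal sketch, assuming Spark 2.2+ where the spark.sql.session.timeZone setting is available (whether it removes the shift end to end also depends on how Hive and Impala read the stored values):

# Make Spark SQL interpret and write timestamps as UTC,
# independent of the JVM/host default time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")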

Here, 2017-10-16 00:00:00 (UTC by default) got converted to 2017-10-16 21:00:00, the local time zone of the Hive host.

To avoid undesired results from unexpected time zone issues, Impala stores and interprets timestamps relative to UTC, both when writing to and reading from data files.

You can refer to the documentation below for the necessary configuration settings: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_timestamp.html#timestamp
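
As a sanity check after adjusting the configuration, you can read the table back and confirm the hours no longer shift. A hypothetical check from the Spark side, assuming the db.table name used above:

# Re-read the stored values; with consistent time zone settings the
# hours should stay at 00:00:00 instead of drifting to 21:00:00.
spark.sql("SELECT Trans_Date FROM db.table LIMIT 5").show(truncate=False)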

