This article describes how to deal with wrong hour values in a datetime field when writing from Spark SQL to a Hive table; it should be a useful reference for anyone hitting the same problem.

Problem Description

I face this problem: when I load data into a timestamp field in Hive with spark.sql, the hours are strangely changed to 21:00:00!

Let me explain:

I have a CSV file that I read with spark.sql. I read the file, convert it to a dataframe, and store it in a Hive table. One of the fields in this file is a date in the format "3/10/2017". The field in Hive that I want to load it into has the Timestamp type (the reason I use this data type instead of Date is that I want to query the table with Impala, and Impala only has a Timestamp data type, so simply changing the column type to Date is not an option).

As you can see from the documentation, the Hive Timestamp data type uses the "YYYY-MM-DD HH:MM:SS" format, so before writing the dataframe to the Hive table I convert the date values to that format.

Here is my code in Python:

from datetime import datetime
from pyspark.sql.functions import udf

df = spark.read.csv("hdfs:/user/../MyFile.csv", header=True)

# Use a user-defined function to convert the date format from
# "d/m/Y" (e.g. "16/10/2017") to Hive's "YYYY-MM-DD HH:MM:SS",
# padding the missing time component with midnight.
def DateConvert(x):
    x_augm = str(x) + " 00:00:00"
    datetime_object = datetime.strptime(x_augm, '%d/%m/%Y %H:%M:%S')
    return datetime_object.strftime('%Y-%m-%d %H:%M:%S')

DateConvert_udf = udf(DateConvert)

df = df.withColumn("Trans_Date", DateConvert_udf("Trans_Date"))
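
If you are on Spark 2.2 or later, the same conversion can also be done without a UDF using the built-in to_timestamp and date_format functions, which usually perform better than a Python UDF. A minimal sketch, assuming the same Trans_Date column as above:

from pyspark.sql.functions import to_timestamp, date_format

# Parse "d/M/yyyy" dates (e.g. "16/10/2017") and render them in
# Hive's "YYYY-MM-DD HH:MM:SS" layout, padded to midnight.
df = df.withColumn(
    "Trans_Date",
    date_format(to_timestamp("Trans_Date", "d/M/yyyy"), "yyyy-MM-dd HH:mm:ss"))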

This properly formats the timestamp. When I run

df.select("Trans_Date").show(10, False)

I get:

+-------------------+
|Trans_Date         |
+-------------------+
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
|2017-10-16 00:00:00|
+-------------------+

Then I load the data into Hive with Spark SQL like this:

df.createOrReplaceTempView('tempTable')
spark.sql("insert into table db.table select * from tempTable")

My problem is that when I go to Hive, my Timestamp field has values like:

2017-10-16 21:00:00
2017-10-16 21:00:00

This is very strange!

Thanks in advance for any suggestions.

Recommended Answer

This is a common problem when saving data into Hive tables with the TIMESTAMP data type.

When you save data into a Hive table, TIMESTAMP values represent the local time zone of the host where the data was written.
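
One way to control this from the Spark side is to pin the SQL session time zone, so that writes are not shifted by the host's local zone. A minimal sketch, assuming Spark 2.2+ where the spark.sql.session.timeZone setting is available (whether it removes the shift end to end also depends on how Hive and Impala read the stored values):

# Make Spark SQL interpret and write timestamps as UTC,
# independent of the JVM/host default time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")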

Here, 2017-10-16 00:00:00 (UTC by default) got converted to 2017-10-16 21:00:00, the local time zone of the Hive host.

To avoid undesired results from unexpected time zone issues, Impala stores and interprets timestamps relative to UTC, both when writing to and reading from data files.

You can refer to the documentation below for the necessary configuration settings: https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_timestamp.html#timestamp
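
As a sanity check after adjusting the configuration, you can read the table back and confirm the hours no longer shift. A hypothetical check from the Spark side, assuming the db.table name used above:

# Re-read the stored values; with consistent time zone settings the
# hours should stay at 00:00:00 instead of drifting to 21:00:00.
spark.sql("SELECT Trans_Date FROM db.table LIMIT 5").show(truncate=False)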

