Problem Description
I wrote a DataFrame to HDFS with pySpark using this command:
from pyspark.sql.functions import col

df.repartition(col("year"))\
    .write.option("maxRecordsPerFile", 1000000)\
    .parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')
Looking at HDFS, I can see that the files are properly placed there. However, when I try to read the table with Hive or Impala, the table cannot be found.
What's going wrong here? Am I missing something?
Interestingly, df.write.format('parquet').saveAsTable("tablename") works fine.
Recommended Answer
This is the expected behaviour from Spark:
- df...etc.parquet("") writes the data to the HDFS location and won't create any table in Hive.
- but df..saveAsTable("") creates the table in Hive and writes the data to it, as sketched below.
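For illustration, here is a minimal sketch of the two write paths. The SparkSession setup, source path, and table name are hypothetical; enableHiveSupport() is assumed so that saveAsTable can reach the Hive metastore:

from pyspark.sql import SparkSession

# Hypothetical session; enableHiveSupport() lets saveAsTable register
# tables in the Hive metastore.
spark = SparkSession.builder \
    .appName("parquet-vs-saveAsTable") \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.read.parquet('/path/source')  # hypothetical source data

# 1) File write only: parquet files land on HDFS, but nothing is
#    registered in the metastore, so Hive/Impala see no table.
df.write.parquet('/path/tablename', mode='overwrite', partitionBy=["year"])

# 2) Table write: creates the table in the metastore and writes the
#    data to it, so Hive/Impala can find and query it.
df.write.format('parquet').partitionBy("year").mode('overwrite').saveAsTable("tablename")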
That's the reason why you are not able to find the table in Hive after performing df...parquet("").
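If you want to keep the plain .parquet() write, a common workaround (a sketch only; the column definitions below are placeholders for your actual schema) is to declare an external table over the files yourself and then refresh the partition metadata:

# Register the existing parquet files as an external Hive table.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
        id BIGINT,       -- placeholder column
        value STRING     -- placeholder column
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET
    LOCATION '/path/tablename'
""")

# Let Hive discover the year=... partition directories already on HDFS.
spark.sql("MSCK REPAIR TABLE tablename")

On the Impala side, run INVALIDATE METADATA tablename (or REFRESH tablename) afterwards so Impala picks up the new table and files.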