This article looks at why HIVE or Impala cannot find a table after Parquet files are written with DataFrame.write.parquet in pySpark, and explains the behaviour behind it.

Problem Description

I wrote a DataFrame with pySpark into HDFS with this command:

df.repartition(col("year"))\
.write.option("maxRecordsPerFile", 1000000)\
.parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')

When taking a look into HDFS I can see that the files are properly placed there. However, when I try to read the table with HIVE or Impala, the table cannot be found.

What's going wrong here, am I missing something?

Interestingly, df.write.format('parquet').saveAsTable("tablename") works fine.

Recommended Answer

This is the expected behaviour from Spark:

  • df ... etc.parquet(") 将数据写入 HDFS 位置,并且不会在Hive中创建任何表.

  • df...etc.parquet("") writes the data to HDFS location and won't create any table in Hive.

df..saveAsTable(") 创建并将数据写入其中.

but df..saveAsTable("") creates the table in hive and writes data to it.

That's the reason why you are not able to find the table in Hive after performing df...parquet("").
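As a rough sketch of the two paths, assuming a Hive-enabled SparkSession and reusing the path /path/tablename and table name tablename from the question (the column list in the DDL is hypothetical and must match your DataFrame schema): either let Spark register the table via saveAsTable, or keep the plain .parquet() write and declare an external table over the same location, then add its partitions to the metastore.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes df already exists; the SparkSession must be built with Hive support,
# otherwise saveAsTable only registers the table in Spark's in-memory catalog.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: let Spark create and register the Hive table itself.
df.repartition(col("year")) \
    .write.mode("overwrite") \
    .partitionBy("year") \
    .option("maxRecordsPerFile", 1000000) \
    .format("parquet") \
    .saveAsTable("tablename")

# Option 2 (sketch): keep the plain .parquet() write from the question and
# declare an external table over the same HDFS location afterwards.
# The (id, value) columns are placeholders -- use your real schema.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS tablename (id BIGINT, value STRING)
    PARTITIONED BY (year INT)
    STORED AS PARQUET
    LOCATION '/path/tablename'
""")
spark.sql("MSCK REPAIR TABLE tablename")  # discover the year=... partition directories

For Impala specifically, a newly created table or newly added partitions typically only become visible after running INVALIDATE METADATA tablename (or REFRESH tablename) in impala-shell, since Impala caches the metastore contents.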
