Problem Description
I wrote a DataFrame to HDFS with pySpark using this command:
from pyspark.sql.functions import col

df.repartition(col("year"))\
    .write.option("maxRecordsPerFile", 1000000)\
    .parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')
Looking at HDFS, I can see that the files are properly placed there. However, when I try to read the table with Hive or Impala, the table cannot be found.
What's going wrong here? Am I missing something?
Interestingly, df.write.format('parquet').saveAsTable("tablename") works fine.
Recommended Answer
This is the expected behaviour from Spark:
- df...etc.parquet("") writes the data to the HDFS location and won't create any table in Hive.
- but df..saveAsTable("") creates the table in Hive and writes the data to it, as sketched below.
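For illustration, here is a minimal sketch of the two write paths. The SparkSession setup, source path, and table name are hypothetical; enableHiveSupport() is assumed so that saveAsTable can reach the Hive metastore:

from pyspark.sql import SparkSession

# Hypothetical session; enableHiveSupport() lets saveAsTable register
# tables in the Hive metastore.
spark = SparkSession.builder \
    .appName("parquet-vs-saveAsTable") \
    .enableHiveSupport() \
    .getOrCreate()

df = spark.read.parquet('/path/source')  # hypothetical source data

# 1) File write only: parquet files land on HDFS, but nothing is
#    registered in the metastore, so Hive/Impala see no table.
df.write.parquet('/path/tablename', mode='overwrite', partitionBy=["year"])

# 2) Table write: creates the table in the metastore and writes the
#    data to it, so Hive/Impala can find and query it.
df.write.format('parquet').partitionBy("year").mode('overwrite').saveAsTable("tablename")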
That's the reason why you are not able to find the table in Hive after performing df...parquet("").
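If you want to keep the plain .parquet() write, a common workaround (a sketch only; the column definitions below are placeholders for your actual schema) is to declare an external table over the files yourself and then refresh the partition metadata:

# Register the existing parquet files as an external Hive table.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
        id BIGINT,       -- placeholder column
        value STRING     -- placeholder column
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET
    LOCATION '/path/tablename'
""")

# Let Hive discover the year=... partition directories already on HDFS.
spark.sql("MSCK REPAIR TABLE tablename")

On the Impala side, run INVALIDATE METADATA tablename (or REFRESH tablename) afterwards so Impala picks up the new table and files.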