Migrating Hive tables to Google BigQuery

Problem description

I am trying to design a data pipeline to migrate my Hive tables into BigQuery. Hive runs on an on-premises Hadoop cluster. My current design is very simple, just a shell script:

for each table source_hive_table {

  • INSERT OVERWRITE TABLE target_avro_hive_table SELECT * FROM source_hive_table;
  • Move the resulting Avro files into Google Cloud Storage using distcp
  • Create the first BQ table: bq load --source_format=AVRO your_dataset.something something.avro
  • Handle any casting issues in BigQuery itself, by selecting from the table just written and casting manually

}
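
The first three steps of the loop above could be sketched as a shell function along the following lines. This is only an illustration under stated assumptions: the bucket name, the warehouse path, the `_avro` table suffix, and the `DRY_RUN` switch are hypothetical and not part of the original design, and the Avro-backed Hive table is assumed to exist already. Copying to a `gs://` path with distcp also assumes the Cloud Storage connector is installed on the cluster.

```shell
#!/bin/sh
# Sketch of the per-table migration steps. All names below are placeholders.
set -eu

GCS_BUCKET="${GCS_BUCKET:-gs://my-migration-bucket}"      # hypothetical bucket
BQ_DATASET="${BQ_DATASET:-your_dataset}"
AVRO_WAREHOUSE="${AVRO_WAREHOUSE:-/user/hive/warehouse}"  # assumed HDFS location of the Avro tables
DRY_RUN="${DRY_RUN:-1}"                                   # 1 = only print the commands

# Print a command, and execute it unless DRY_RUN=1.
run() {
  printf '%s\n' "$*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Migrate one Hive table: rewrite as Avro, copy to GCS, load into BigQuery.
migrate_table() {
  src="$1"
  avro="${src}_avro"   # assumed naming convention for the Avro-backed table

  # 1. Rewrite the source table into the Avro-backed Hive table.
  run hive -e "INSERT OVERWRITE TABLE ${avro} SELECT * FROM ${src};"

  # 2. Copy the Avro files from HDFS to Cloud Storage.
  run hadoop distcp "${AVRO_WAREHOUSE}/${avro}" "${GCS_BUCKET}/${avro}"

  # 3. Load the copied Avro files into a BigQuery table.
  run bq load --source_format=AVRO "${BQ_DATASET}.${src}" "${GCS_BUCKET}/${avro}/*"
}
```

With the default `DRY_RUN=1` the function only prints the three commands, which makes it easy to review them before running anything. A loop such as `for t in orders customers; do migrate_table "$t"; done` (table names invented for illustration) would then cover each table in turn.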

Do you think this makes sense? Is there a better way, perhaps using Spark? I am not happy with the way I handle the casting, and I would like to avoid creating the BigQuery table twice.

Solution

Yes, your migration logic makes sense.

I personally prefer to do the CAST for specific types directly in the initial Hive query that generates your Avro data. For instance, the "decimal" type in Hive maps to the Avro type {"type":"bytes","logicalType":"decimal","precision":10,"scale":2}.

BQ will then take only the primitive type (here "bytes") and ignore the logicalType. That is why I find it easier to cast directly in Hive (here to "double"). The same problem occurs with the Hive date type.
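
As a rough sketch of that idea, the export statement can be generated with the casts already in place, so that no second BigQuery table is needed. The helper names, the `name:type` argument format, and the `_avro` table suffix are hypothetical. Casting decimal to DOUBLE follows the answer above; casting date to STRING is an assumption of this sketch. Note that a DOUBLE cast can lose precision for large or exact decimal values, and that the Avro-backed target table would need matching column types.

```shell
# Map one Hive column to the expression used in the export SELECT.
# decimal -> DOUBLE follows the answer; date -> STRING is an assumption.
cast_expr() {
  col="$1"
  type="$2"
  case "$type" in
    decimal*) printf 'CAST(%s AS DOUBLE) AS %s' "$col" "$col" ;;
    date)     printf 'CAST(%s AS STRING) AS %s' "$col" "$col" ;;
    *)        printf '%s' "$col" ;;
  esac
}

# Build the INSERT OVERWRITE statement from a table name and "name:type" pairs.
build_export_query() {
  src="$1"
  shift
  select_list=""
  for pair in "$@"; do
    item="$(cast_expr "${pair%%:*}" "${pair#*:}")"
    # Append with a comma separator, except for the first column.
    select_list="${select_list:+$select_list, }$item"
  done
  printf 'INSERT OVERWRITE TABLE %s_avro SELECT %s FROM %s;\n' "$src" "$select_list" "$src"
}
```

For example, `build_export_query sales id:int "price:decimal(10,2)" sold_on:date` (an invented schema) prints a statement that selects `id` unchanged and wraps `price` and `sold_on` in casts. The result can be passed to `hive -e` in place of the plain `SELECT *`.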
