Problem description
Here are the steps to the current process:
- Flafka writes logs to a 'landing zone' on HDFS.
- A job, scheduled by Oozie, copies complete files from the landing zone to a staging area.
- The staging data is 'schema-ified' by a Hive table that uses the staging area as its location.
- Records from the staging table are added to a permanent Hive table, e.g.
insert into permanent_table select * from staging_table
- The data from the Hive table is made available in Impala by executing
refresh permanent_table
in Impala.
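Step 3 ("schema-ified" by a Hive table that uses the staging area as its location) can be sketched as a Hive external-table DDL. This is only an illustration: the column list and HDFS path below are placeholders, not details from the question.

```sql
-- Hedged sketch of step 3: an external Hive table whose LOCATION is the
-- staging directory, so staged files become queryable without being moved.
-- The column and path are hypothetical placeholders.
CREATE EXTERNAL TABLE staging_table (
  log_line STRING        -- placeholder schema
)
LOCATION '/user/etl/staging';
```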
I look at the process I've built and it "smells" bad: there are too many intermediate steps that impair the flow of data.
About 20 months ago, I saw a demo where data was being streamed from an Amazon Kinesis pipe and was queryable, in near real-time, by Impala. I don't suppose they did something quite so ugly/convoluted. Is there a more efficient way to stream data from Kafka to Impala (possibly a Kafka consumer that can serialize to Parquet)?
I imagine that "streaming data to low-latency SQL" must be a fairly common use case, and so I'm interested to know how other people have solved this problem.
Solution
If you need to dump your Kafka data as-is to HDFS, the best option is to use Kafka Connect and the Confluent HDFS connector.
You can dump the data to Parquet files on HDFS, which you can then load in Impala. I think you'll want to use the TimeBasedPartitioner partitioner to roll a new Parquet file every X milliseconds (tuned via the partition.duration.ms configuration parameter).
Adding something like this to your Kafka Connect configuration might do the trick:
# Don't flush less than 1000 messages to HDFS
flush.size=1000
# Dump to Parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
# One file every hour. If you change this, remember to change the
# filename format to reflect the new rollover interval
partition.duration.ms=3600000
# Filename format
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
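To complete the picture on the Impala side, you can point an external table at the connector's output directory. This is a hedged sketch only: the table name, topic path, and sample file are placeholders. Impala's CREATE TABLE LIKE PARQUET can infer the column list from an existing Parquet file written by the connector.

```sql
-- Hedged sketch: expose the connector's Parquet output to Impala.
-- Table name, topic path, and the sample file are placeholders.
CREATE EXTERNAL TABLE kafka_events
  LIKE PARQUET '/topics/my_topic/year=2024/month=01/day=01/hour=00/minute=00/part-0.parquet'
  STORED AS PARQUET
  LOCATION '/topics/my_topic';

-- Note: with the time-based path.format above, output lands in dated
-- subdirectories, so in practice you'd define matching partition columns
-- and ADD PARTITION as new directories appear.

-- After the connector rolls new files, make them visible to Impala:
REFRESH kafka_events;
```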