This article looks at the question "Can't insert 5k/sec records into Impala?" and walks through a recommended answer that may help anyone facing the same problem.

Problem Description

I am exploring Impala for a POC, but I am not seeing any significant performance. I can't insert 5000 records/sec; at most I was able to insert a mere 200/sec. This is really slow by any database's standards.

I tried two different methods, but both are slow:


  1. Using Cloudera

First, I installed Cloudera on my system and added the latest CDH 6.2 cluster. I created a Java client to insert data using the ImpalaJDBC41 driver. I am able to insert records, but the speed is terrible. I tried tuning Impala by increasing the Impala Daemon Limit and my system RAM, but it didn't help. Finally, I thought there was something wrong with my installation, so I switched to another method.

  2. Using Cloudera VM

Cloudera also ships a ready-made VM for test purposes. I tried it to see if it gives better performance, but there was no big improvement. I still can't insert data at 5k/sec.

I don't know where I need to improve. I have pasted my code below in case any improvements can be made.

What is the ideal Impala configuration to achieve a speed of 5k - 10k records/sec? This is still well below what Impala is capable of.

import java.sql.*;

private static Connection connectViaDS() throws Exception {
    // Load the Cloudera Impala JDBC41 driver and open a connection.
    Class.forName("com.cloudera.impala.jdbc41.Driver");
    return DriverManager.getConnection(CONNECTION_URL);
}

private static void writeInABatchWithCompiledQuery(int records) {
    int protocol_no = 233,s_port=20,d_port=34,packet=46,volume=58,duration=39,pps=76,
            bps=65,bpp=89,i_vol=465,e_vol=345,i_pkt=5,e_pkt=54,s_i_ix=654,d_i_ix=444,_time=1000,flow=989;

    String s_city = "Mumbai",s_country = "India", s_latt = "12.165.34c", s_long = "39.56.32d",
            s_host="motadata",d_latt="29.25.43c",d_long="49.15.26c",d_city="Damouli",d_country="Nepal";

    long e_date= 1275822966, e_time= 1370517366;

    PreparedStatement preparedStatement;

    int total = 1000 * 1000; // insert one million rows in total
    int counter = 0;

    Connection connection = null;
    try {
        connection = connectViaDS();

        preparedStatement = connection.prepareStatement(sqlCompiledQuery);

        // e_date and e_time look like epoch seconds; Timestamp expects milliseconds.
        Timestamp ed = new Timestamp(e_date * 1000L);
        Timestamp et = new Timestamp(e_time * 1000L);

        while(counter <total) {
            for (int index = 1; index <= 5000; index++) {
                counter++;

                preparedStatement.setString(1, "s_ip" + String.valueOf(index));
                preparedStatement.setString(2, "d_ip" + String.valueOf(index));
                preparedStatement.setInt(3, protocol_no + index);
                preparedStatement.setInt(4, s_port + index);
                preparedStatement.setInt(5, d_port + index);
                preparedStatement.setInt(6, packet + index);
                preparedStatement.setInt(7, volume + index);
                preparedStatement.setInt(8, duration + index);
                preparedStatement.setInt(9, pps + index);
                preparedStatement.setInt(10, bps + index);
                preparedStatement.setInt(11, bpp + index);
                preparedStatement.setString(12, s_latt + String.valueOf(index));
                preparedStatement.setString(13, s_long + String.valueOf(index));
                preparedStatement.setString(14, s_city + String.valueOf(index));
                preparedStatement.setString(15, s_country + String.valueOf(index));
                preparedStatement.setString(16, d_latt + String.valueOf(index));
                preparedStatement.setString(17, d_long + String.valueOf(index));
                preparedStatement.setString(18, d_city + String.valueOf(index));
                preparedStatement.setString(19, d_country + String.valueOf(index));
                preparedStatement.setInt(20, i_vol + index);
                preparedStatement.setInt(21, e_vol + index);
                preparedStatement.setInt(22, i_pkt + index);
                preparedStatement.setInt(23, e_pkt + index);
                preparedStatement.setInt(24, s_i_ix + index);
                preparedStatement.setInt(25, d_i_ix + index);
                preparedStatement.setString(26, s_host + String.valueOf(index));
                preparedStatement.setTimestamp(27, ed);
                preparedStatement.setTimestamp(28, et);
                preparedStatement.setInt(29, _time);
                preparedStatement.setInt(30, flow + index);
                preparedStatement.addBatch();
            }
            preparedStatement.executeBatch();
            preparedStatement.clearBatch();
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            connection.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

Data is updating at a snail's pace. I tried increasing the batch size, but that decreased the speed. I don't know if my code is wrong or if I need to tune Impala for better performance. Please guide.
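
One way to tell whether the batching code or the cluster is the bottleneck is to time a few batch sizes explicitly instead of eyeballing the overall run. Below is a minimal sketch of such a probe; it reuses sqlCompiledQuery from the code above, and insertBatch is a hypothetical helper standing in for the parameter-binding loop (it is not part of the original code):

// Hypothetical throughput probe. insertBatch(ps, n) is assumed to bind n
// rows exactly as the for-loop in writeInABatchWithCompiledQuery does.
private static void measureThroughput(Connection connection) throws SQLException {
    for (int batchSize : new int[]{100, 500, 1000, 5000}) {
        try (PreparedStatement ps = connection.prepareStatement(sqlCompiledQuery)) {
            long start = System.nanoTime();
            insertBatch(ps, batchSize); // bind batchSize rows
            ps.executeBatch();
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("batch=%d -> %.0f rows/sec%n", batchSize, batchSize / seconds);
        }
    }
}

If rows/sec stays in the low hundreds regardless of batch size, the overhead is per-statement or cluster-side rather than in this loop.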

I am using a VM for testing; here are the other details:

System:

OS - Ubuntu 16
RAM - 12 GB
Cloudera - CDH 6.2
Impala Daemon Limit - 2 GB
Impala Daemon Java heap size - 500 MB
HDFS NameNode Java heap size - 500 MB

Please let me know if more details are required.

Recommended Answer

You can't benchmark on a VM with 12 GB. Look at Impala's hardware requirements (linked below) and you'll see that 128 GB of memory is the minimum.

128 GB or more is recommended, ideally 256 GB or more. If the intermediate results during query processing on a particular node exceed the amount of memory available to Impala on that node, the query writes temporary work data to disk, which can lead to long query times. Note that because the work is parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join tables that are much larger than the memory available on an individual node.
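
If you want to see how close a statement runs to its memory ceiling, Impala also exposes a per-query MEM_LIMIT option that can be set over the same JDBC connection before the insert runs. A minimal sketch, assuming the connection obtained from connectViaDS() above; the 1g value is illustrative only:

// Set a per-session query memory cap before running the insert.
// MEM_LIMIT is a standard Impala query option; the value is illustrative.
try (Statement stmt = connection.createStatement()) {
    stmt.execute("SET MEM_LIMIT=1g");
}

The query profile in the Impala daemon's web UI (port 25000 by default) then shows whether operators spilled to disk.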

Also, the VM is meant for familiarizing yourself with the toolset; it is not powerful enough to be even a development environment.

  • Impala Requirements: Hardware Requirements
  • Tuning Impala for Performance

This concludes "Can't insert 5k/sec records into Impala?". We hope the recommended answer is helpful.
