Improving MySQL performance for a one-off query on a large dataset

Question

I previously asked a question on how to analyse large datasets (how can I analyse 13GB of data). One promising response was to add the data into a MySQL database using natural keys and thereby make use of INNODB's clustered indexing.

I've added the data to the database with a schema that looks like this:

TorrentsPerPeer
+----------+------------------+------+-----+---------+-------+
| Field    | Type             | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| ip       | int(10) unsigned | NO   | PRI | NULL    |       |
| infohash | varchar(40)      | NO   | PRI | NULL    |       |
+----------+------------------+------+-----+---------+-------+

The two fields together form the primary key.

This table represents known instances of peers downloading torrents. I'd like to be able to report how many torrents each peer has. I'm going to draw a histogram of how often each torrent count occurs (e.g. 20 peers have 2 torrents, 40 peers have 3, ...).

I've written the following query:

SELECT `count`, COUNT(`ip`) 
    FROM (SELECT `ip`, COUNT(`infohash`) AS `count`
              FROM TorrentsPerPeer
              GROUP BY `ip`) AS `counts`
    GROUP BY `count`;
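
To make the intended result concrete, here is a minimal sketch of the same two-level aggregation run on a handful of rows. It is only an illustration: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL/InnoDB, the sample rows and the helper name torrent_histogram are hypothetical, and the aliases n_torrents and n_peers replace the `count` alias purely for readability. It says nothing about how the query performs on the real 79-million-row table.

```python
import sqlite3


def torrent_histogram(rows):
    """Return [(torrents_per_peer, number_of_peers), ...] for (ip, infohash) rows.

    Hypothetical helper: SQLite in memory stands in for the MySQL table.
    """
    conn = sqlite3.connect(":memory:")
    # Same shape as TorrentsPerPeer: both columns together form the primary key.
    conn.execute(
        "CREATE TABLE TorrentsPerPeer ("
        " ip INTEGER NOT NULL,"
        " infohash VARCHAR(40) NOT NULL,"
        " PRIMARY KEY (ip, infohash))"
    )
    conn.executemany("INSERT INTO TorrentsPerPeer VALUES (?, ?)", rows)
    # Inner query: torrents per peer. Outer query: peers per torrent count.
    result = conn.execute(
        "SELECT n_torrents, COUNT(ip) AS n_peers"
        " FROM (SELECT ip, COUNT(infohash) AS n_torrents"
        "       FROM TorrentsPerPeer GROUP BY ip) AS counts"
        " GROUP BY n_torrents ORDER BY n_torrents"
    ).fetchall()
    conn.close()
    return result


# Made-up sample: peers 1 and 2 have 2 torrents each, peer 3 has 3, peer 4 has 1.
sample_rows = [
    (1, "aaa"), (1, "bbb"),
    (2, "aaa"), (2, "ccc"),
    (3, "aaa"), (3, "bbb"), (3, "ccc"),
    (4, "ddd"),
]
print(torrent_histogram(sample_rows))  # [(1, 1), (2, 2), (3, 1)]
```

Each output pair is one histogram bar: the torrent count, then the number of peers with that count.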

Here is the EXPLAIN for the sub-select:

+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| id | select_type | table          | type  | possible_keys | key     | key_length | ref    | rows     | Extra       |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| 1  | SIMPLE      | TorrentPerPeer | index | [Null]        | PRIMARY | 126        | [Null] | 79262772 | Using index |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+

I can't seem to run an EXPLAIN for the full query because it takes far too long. This bug suggests that's because the subquery is executed first.

This query is currently running (and has been for an hour). top is reporting that mysqld is only using ~5% of the available CPU whilst its RSIZE is steadily increasing. My assumption here is that the server is building temporary tables in RAM that it's using to complete the query.

My question, then, is: how can I improve the performance of this query? Should I change the query somehow? I've been altering the server settings in the my.cnf file to increase the INNODB buffer pool size. Should I change any other values?

If it matters, the table is 79'262'772 rows deep and takes up ~8GB of disk space. I'm not expecting this to be an easy query; maybe 'patience' is the only reasonable answer.

EDIT: Just to add that the query has finished and it took 105 minutes. That's not unbearable, I'm just hoping for some improvements.
