Problem Description
I am now doing some data analysis tests, and in the first, really simple one, I have got very strange results.
The idea is the following: from an internet access log (a collection with one document per access, 90 million documents for the tests), I want to get the number of accesses by domain (what would be a GROUP BY in MySQL), and get the 10 most accessed domains.
The script I have made in JavaScript is really simple:
/* Counts each domain url */
m = function () {
    emit(this.domain, 1);
}

r = function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i]; // sum the emitted values, not the array indices
    }
    return total;
}
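One pitfall worth checking outside of MongoDB: in JavaScript, `for...in` over an array iterates the *indices*, so `total += Number(i)` sums 0, 1, 2, … rather than the emitted 1s. A plain-JavaScript sanity check of the intended reduce logic (the `reduceFn`/`buggyReduce` names and the sample input are just for illustration):

```javascript
// Sanity check of the reduce logic in plain JavaScript (no MongoDB needed).
function reduceFn(key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i]; // sum the emitted values themselves
    }
    return total;
}

// Four emits of 1 for one domain should reduce to a count of 4.
console.log(reduceFn("example.com", [1, 1, 1, 1])); // 4

// The for...in variant sums the indices instead: 0 + 1 + 2 + 3 = 6.
function buggyReduce(key, values) {
    var total = 0;
    for (var i in values) {
        total += Number(i);
    }
    return total;
}
console.log(buggyReduce("example.com", [1, 1, 1, 1])); // 6
```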
/* Store of visits per domain statistics on NonFTP_Access_log_domain_visits collection */
res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace : "NonFTP_Access_log_domain_visits" } } );
db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1});
db.NonFTP_Access_log_domain_visits.find({}).sort({ "value":-1 }).limit(10).forEach(printjson);
The equivalent in MySQL is:
drop table if exists NonFTP_Access_log_domain_visits;
create table NonFTP_Access_log_domain_visits (
`domain` varchar(255) NOT NULL,
`value` int unsigned not null,
PRIMARY KEY (`domain`),
KEY `value_index` (`value`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
insert into NonFTP_Access_log_domain_visits
    select domain, count(*) as value from NonFTP_Access_log group by domain;
select * from NonFTP_Access_log_domain_visits order by value desc limit 10;
Well, MongoDB takes 30 hours to get the results and MySQL 20 minutes! After reading a little I have arrived at the conclusion that for data analysis we will have to use Hadoop, as MongoDB is really slow. The answers to questions like this one say that:
- MongoDB uses only one thread
- JavaScript is just too slow
What am I doing wrong? Are these results normal? Should I use Hadoop?
We are making this test on the following environment:
- Operating System: Suse Linux Enterprise Server 10 (Virtual Server on Xen)
- RAM: 10 GB
- Cores: 32 (AMD Opteron Processor 6128)
I've actually answered a very similar question before. The limitations of Map Reduce in MongoDB have been outlined previously - as you mentioned, it is single threaded, the data has to be converted to and from JavaScript (SpiderMonkey), etc.
That is why there are other options:
- The MongoDB Hadoop Connector (officially supported)
- The Aggregation Framework (Requires 2.1+)
As of this writing the 2.2.0 stable release is not yet out, but it is up to RC2, so the release should be imminent. I would recommend giving the Aggregation Framework a try as a more meaningful comparison for this type of testing.
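For reference, the whole group-by is only a few pipeline stages in the Aggregation Framework. A sketch, assuming the 2.2+ `aggregate` shell helper (collection and field names are taken from the question):

```javascript
// Hypothetical sketch: the same domain count as an aggregation pipeline
// (MongoDB 2.2+).
var pipeline = [
    { $group: { _id: "$domain", value: { $sum: 1 } } }, // count per domain
    { $sort:  { value: -1 } },                          // most visited first
    { $limit: 10 }                                      // top 10 domains
];

// In the mongo shell:
//   db.NonFTP_Access_log.aggregate(pipeline);
```

Unlike mapReduce, the pipeline runs in native code rather than the JavaScript engine, which is exactly the overhead being discussed here.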