This article looks at why MapReduce on MongoDB can be really slow (30 hours versus 20 minutes in MySQL for an equivalent aggregation) and what the alternatives are; it should be a useful reference for anyone hitting the same problem.

Problem Description

    I am doing some data analysis tests now, and in the first, really simple one I have got very strange results.

    The idea is the following: from an internet access log (a collection with one document per access; 90 million documents for the tests), I want to get the number of accesses per domain (what would be a GROUP BY in MySQL), and get the 10 most accessed domains.

    The script I have made in JavaScript is really simple:

    /* Map: emit a count of 1 for each access, keyed by domain */
    m = function () {
        emit(this.domain, 1);
    }

    /* Reduce: sum the emitted counts for one domain */
    r = function (key, values) {
        var total = 0;
        for (var i in values) {
            total += Number(values[i]); // sum the values themselves, not the array indices
        }
        return total;
    }
    
    /* Store of visits per domain statistics on NonFTP_Access_log_domain_visits collection */
    res = db.NonFTP_Access_log.mapReduce(m, r, { out: { replace : "NonFTP_Access_log_domain_visits" } } );
    db.NonFTP_Access_log_domain_visits.ensureIndex({ "value": 1});
    db.NonFTP_Access_log_domain_visits.find({}).sort({ "value":-1 }).limit(10).forEach(printjson);
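
    For reference, mapReduce with { out: { replace: ... } } writes one document per key into the output collection, with the key in "_id" and the reduced result in "value"; that is why the index and the sort above are on "value". An output document would look something like this (hypothetical domain and count):

    { "_id" : "example.com", "value" : 12345 }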
    

    The equivalent in MySQL is:

    drop table if exists NonFTP_Access_log_domain_visits;

    -- CREATE TABLE ... SELECT builds and populates the summary table in one statement
    create table NonFTP_Access_log_domain_visits (
        `domain` varchar(255) NOT NULL,
        `value` int unsigned NOT NULL,
        PRIMARY KEY (`domain`),
        KEY `value_index` (`value`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    select domain, count(*) as value from NonFTP_Access_log group by domain;

    select * from NonFTP_Access_log_domain_visits order by value desc limit 10;
    

    Well, MongoDB takes 30 hours to get the results and MySQL 20 minutes! After reading a little I have arrived at the conclusion that for data analysis we will have to use Hadoop, as MongoDB is really slow. The answers to questions like this one say that:

    • MongoDB uses only one thread
    • Javascript is just too slow

    What am I doing wrong? Are these results normal? Should I use Hadoop?

    We are running this test in the following environment:

    • Operating System: Suse Linux Enterprise Server 10 (Virtual Server on Xen)
    • RAM: 10 GB
    • Cores: 32 (AMD Opteron Processor 6128)

Solution

    I've actually answered this very similar question before. The limitations of Map Reduce in MongoDB have been outlined previously: as you mentioned, it is single threaded, the data has to be converted to JavaScript (SpiderMonkey) and back, etc.

    That is why there are other options:

    1. The MongoDB Hadoop Connector (officially supported)
    2. The Aggregation Framework (requires 2.1+)

    As of this writing, the 2.2.0 stable release was not yet out, but it was up to RC2, so the release should be imminent. I would recommend giving the Aggregation Framework a shot, as it would make for a more meaningful comparison for this type of testing.
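
    As a rough illustration, here is a minimal sketch of the same GROUP BY-and-top-10 query in the aggregation framework (collection and field names are taken from the question; this is an untested sketch, not a benchmarked result):

    /* Group accesses by domain, then keep the 10 largest counts */
    db.NonFTP_Access_log.aggregate([
        { $group: { _id: "$domain", value: { $sum: 1 } } },
        { $sort: { value: -1 } },
        { $limit: 10 }
    ]);

    Unlike mapReduce, the whole pipeline runs natively in the server without the round trip through the JavaScript engine, which is exactly the overhead described above.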
