本文介绍了为什么通过ORM 5-8x加载SQLAlchemy对象的速度比通过原始MySQLdb游标的行慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我注意到SQLAlchemy正在缓慢地获取(和ORMing)某些数据,而使用裸露的SQL来获取数据的速度却相当快.首先,我创建了一个拥有一百万条记录的数据库:

I noticed that SQLAlchemy was slow fetching (and ORMing) some data, which was rather fast to fetch using bare bone SQL. First off, I created a database with a million records:

mysql> use foo
mysql> describe Foo;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id    | int(11) | NO   | PRI | NULL    |       |
| A     | int(11) | NO   |     | NULL    |       |
| B     | int(11) | NO   |     | NULL    |       |
| C     | int(11) | NO   |     | NULL    |       |
+-------+---------+------+-----+---------+-------+
mysql> SELECT COUNT(*) FROM Foo;
+----------+
| COUNT(*) |
+----------+
|  1000000 |
+----------+
mysql> 

作为粗略的测试,查询所有Foo大约需要2秒钟:

As a crude test, querying all Foo's takes approximately 2 seconds:

herbert@dev0 ~ $ date; echo 'use foo; select * from Foo;' | mysql -uroot -pxxx > /dev/null; date
zo apr 20 18:48:49 CEST 2014
zo apr 20 18:48:51 CEST 2014

如果我使用MySQLdb在python中执行此操作,则大约需要3秒钟,包括构造Foo对象:

If I do this in python using MySQLdb this takes a approximately 3 seconds, including the construction of Foo objects:

herbert@dev0 ~ $ python BareORM.py 
query execution time:  0:00:02.198986
total time:  0:00:03.403084

哪个是输出:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb
import sys
import time
import datetime

class Foo:
    def __init__(self, a, b, c):
        self.a=a; self.b=b; self.c=c;

try:
    start = datetime.datetime.now()
    con = MySQLdb.connect('localhost', 'root', 'xxx', 'foo')
    cur = con.cursor();

    cur.execute("""SELECT * FROM Foo LIMIT 1000000""")
    print "query execution time: ", datetime.datetime.now()-start
    foos = [];
    for elem in cur:
        foos.append(Foo(elem[1], elem[2], elem[3]))
    con.commit()

except MySQLdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)

finally:
    if con: con.close()
    print "total time: ",  datetime.datetime.now()-start

但是,使用SQLAlchemy减少样板代码,大约需要25秒才能完成相同的工作:

However, using SQLAlchemy to reduce boilerplate code, it needed approximately 25 seconds to do the same job:

herbert@dev0 ~ $ python AlchemyORM.py 
total time:  0:00:24.649279

使用此代码:

import sqlalchemy
import datetime
import MySQLdb

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship, backref

Base = declarative_base()

class Foo(Base):
    __tablename__ = 'Foo'
    id = Column(Integer, primary_key=True)
    A  = Column(Integer(unsigned=False), nullable=False)
    B  = Column(Integer(unsigned=False), nullable=False)
    C  = Column(Integer(unsigned=False), nullable=False)

engine  = create_engine('mysql+mysqldb://root:xxx@localhost/foo')
Session = sessionmaker(bind=engine)
session = Session()
start = datetime.datetime.now()
foos  = session.query(Foo).limit(1000000).all()
print "total time: ", datetime.datetime.now()-start

假设SQLAlchemy应该做大致相同的事情,为什么SQLAlchemy的运行速度比裸SQL解决方案慢10倍?我可以以某种方式加快速度吗?

这是一个更复杂的查询的最小工作示例,该查询使用急切的加载连接多个表.我当时考虑只对单个表进行简单查询,然后使用字典创建id-> object映射并整理一对N关系.但是在这样做之前,我要确保SQLAlchemy无法执行得更好,因为从软件设计的角度来看,编写自己的ORM是一个坏主意.恕我直言,放慢2倍的速度是可以接受的(也许).

This is a minimal working example of a more complicated query, which joins several tables using eager loading. I was considering just doing simple queries on a single table, and then using dictionaries to create id->object maps and collate one-to-N relations. But before doing so, I want to be sure that SQLAlchemy is unable to perform better, because writing your own ORM is a bad idea from a software design point of view. Imho, a 2x slowdown would be acceptable (maybe).

如果您了解其他(更快)的python-SQL ORM或类似BigTable的解决方案(已经是ORM),请随时将其作为注释提及.

If you know about other (faster) python-SQL ORM's, or maybe BigTable-alike solutions (that already are the ORM), feel free to mention them as a comment.

编辑:还对Peewee进行了尝试,这导致了大约15秒的时间.

Also tried this with Peewee, which resulted in ~15 s.

from peewee import *
import datetime;
database = MySQLDatabase("foo", host="localhost", port=3306, user="root", passwd="xxx")

class Foo(Model):
        id = IntegerField()
        A  = IntegerField()
        B  = IntegerField()
        C  = IntegerField()

        class Meta:
                db_table = 'Foo'
                database = database

start = datetime.datetime.now()
foos = Foo.select()
cnt=0;
for i in foos: cnt=cnt+1
print "total time: ", datetime.datetime.now() - start

作为对Matthias的回应,我试图在Java中使用Hibernate进行相同的操作,结果大约是8到10秒,不是完全快,但是比25秒快很多.代码从一些类开始,以一些配置结束:

As a response to Matthias I tried to do the same thing in Java with Hibernate, the result is approximately 8 to 10 seconds, not exactly fast, but a lot faster than 25 seconds. The code, starting with some classes and ending with some configuration:

package herbert.hibernateorm;

import java.util.List;

import org.hibernate.Session; 
import org.hibernate.Transaction;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class App {
   public static void main(String[] args) throws Exception {
      SessionFactory factory = new Configuration().configure().buildSessionFactory();
      Session session = factory.openSession();
      Transaction tx = session.beginTransaction();
      long start = System.currentTimeMillis();
      List foos = session.createQuery("FROM Foo").list(); 
      System.out.println(foos.size());
      System.out.printf("total time: %d\n", System.currentTimeMillis() - start);
      session.close();
   }
}
package herbert.hibernateorm;

public class Foo {
    private int id, a, b, c;
    public Foo() {}
    public Foo(int A, int B, int C) { this.a=A; this.b=B; this.c=C; }

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public int getA() { return a; }
    public void setA(int a) { this.a = a; }
    public int getB() { return b; }
    public void setB(int b) { this.b = b; }
    public int getC() { return c; }
    public void setC(int c) { this.c = c; }
}

配置(分别为hibernate.cfg.xml和hibernate.hbm.xml)

The configuration (hibernate.cfg.xml and hibernate.hbm.xml respectively)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-configuration PUBLIC "-//Hibernate/Hibernate Configuration DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
  <session-factory>
    <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
    <property name="hibernate.connection.driver_class">com.mysql.jdbc.Driver</property>
    <property name="hibernate.connection.url">jdbc:mysql://localhost:3306/foo?zeroDateTimeBehavior=convertToNull</property>
    <property name="hibernate.connection.username">root</property>
    <property name="hibernate.connection.password">xxx</property>
    <mapping resource="hibernate.hbm.xml"/>
  </session-factory>
</hibernate-configuration>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
    <class name="herbert.hibernateorm.Foo" table="Foo" catalog="foo">
        <id name="id" type="int">
            <column name="id" />
            <generator class="assigned" />
        </id>
        <property name="a" type="int">
            <column name="A" not-null="true" />
        </property>
        <property name="b" type="int">
            <column name="B" not-null="true" />
        </property>
        <property name="c" type="int">
            <column name="C" not-null="true" />
        </property>
    </class>
</hibernate-mapping>

最后是pom文件,可在maven中运行所有文件:

And finally the pom file to run it all in maven:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>herbert</groupId>
    <artifactId>hibernateORM</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>hibernateORM</name>
    <url>http://maven.apache.org</url>
    <repositories>
        <repository>
            <id>unknown-jars-temp-repo</id>
            <name>A temporary repository created by NetBeans for libraries and jars it could not identify. Please replace the dependencies in this repository with correct ones and delete this repository.</name>
            <url>file:${project.basedir}/lib</url>
        </repository>
    </repositories>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.21</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-core</artifactId>
            <version>4.0.1.Final</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-entitymanager</artifactId>
            <version>4.0.1.Final</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate.common</groupId>
            <artifactId>hibernate-commons-annotations</artifactId>
            <version>4.0.1.Final</version>
        </dependency>   
        <dependency>
            <groupId>nz.ac.waikato.cms.weka</groupId>
            <artifactId>weka-dev</artifactId>
            <version>3.7.10</version>
        </dependency>
        <dependency>
            <groupId>commons-configuration</groupId>
            <artifactId>commons-configuration</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>commons-net</groupId>
            <artifactId>commons-net</artifactId>
            <version>3.1</version>
            <classifier>examples</classifier>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.2.2</version>
        </dependency>
        <dependency>
            <groupId>maven</groupId>
            <artifactId>maven-jetty-plugin</artifactId>
            <version>1.1</version>
            <type>plugin</type>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
                <groupId>com.kenai.nbpwr</groupId>
                <artifactId>org-slf4j-jdk14</artifactId>
                <version>1.6.1-201106101300</version>
                <type>nbm</type>
        </dependency>

    </dependencies>
</project>

推荐答案

这是您的MySQL脚本的SQLAlchemy版本,其执行时间为4秒,而MySQLdb则为3秒:

Here is the SQLAlchemy version of your MySQL script that performs in four seconds, compared to three for MySQLdb:

from sqlalchemy import Integer, Column, create_engine, MetaData, Table
import datetime

metadata = MetaData()

foo = Table(
    'foo', metadata,
    Column('id', Integer, primary_key=True),
    Column('a', Integer(), nullable=False),
    Column('b', Integer(), nullable=False),
    Column('c', Integer(), nullable=False),
)


class Foo(object):
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)
start = datetime.datetime.now()

with engine.connect() as conn:
    foos = [
        Foo(row['a'], row['b'], row['c'])
        for row in
        conn.execute(foo.select().limit(1000000)).fetchall()
    ]


print "total time: ", datetime.datetime.now() - start

运行时:

total time:  0:00:04.706010

这是一个使用ORM完全加载对象行的脚本;通过避免使用yield per一次创建所有1M对象的固定列表,使用SQLAlchemy master可以在 13秒中运行(对于rel 0.9,则为18秒):

Here is a script that uses the ORM to load object rows fully; by avoiding the creation of a fixed list with all 1M objects at once using yield per, this runs in 13 seconds with SQLAlchemy master (18 seconds with rel 0.9):

import time
from sqlalchemy import Integer, Column, create_engine, Table
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Foo(Base):
    __table__ = Table(
        'foo', Base.metadata,
        Column('id', Integer, primary_key=True),
        Column('a', Integer(), nullable=False),
        Column('b', Integer(), nullable=False),
        Column('c', Integer(), nullable=False),
    )


engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)

sess = Session(engine)

now = time.time()

# avoid using all() so that we don't have the overhead of building
# a large list of full objects in memory
for obj in sess.query(Foo).yield_per(100).limit(1000000):
    pass

print("Total time: %d" % (time.time() - now))

然后我们可以拆分这两种方法之间的差异,并使用ORM仅加载单个列:

We can then split the difference between these two approaches, and load just individual columns with the ORM:

for obj in sess.query(Foo.id, Foo.a, Foo.b, Foo.c).yield_per(100).limit(1000000):
    pass

以上内容将在 4秒内再次运行.

The above again runs in 4 seconds.

SQLAlchemy Core的比较是与原始MySQLdb游标的更合适的比较.如果您使用ORM但查询单个列,则在最新版本中大约需要四秒钟.

The comparison of SQLAlchemy Core is the more apt comparison to a raw MySQLdb cursor. If you use the ORM but query for individual columns, it's about four seconds in most recent versions.

在ORM级别,速度问题是因为在Python中创建对象的速度很慢,并且SQLAlchemy ORM在获取这些对象时会对其进行大量记账,这对于实现其使用合同是必不可少的,包括工作单元,身份图,渴望加载,收藏等.

At the ORM level, the speed issues are because creating objects in Python is slow, and the SQLAlchemy ORM applies a large amount of bookkeeping to these objects as it fetches them, which is necessary in order for it to fulfill its usage contract, including unit of work, identity map, eager loading, collections, etc.

要显着加快查询速度,请获取单个列而不是完整对象.请参阅以下技术 http://docs.sqlalchemy.org/en /latest/faq/performance.html#result-fetching-slowness-orm 对此进行了描述.

To speed up the query dramatically, fetch individual columns instead of full objects. See the techniques at http://docs.sqlalchemy.org/en/latest/faq/performance.html#result-fetching-slowness-orm which describe this.

为了与PeeWee进行比较,PW是一个简单得多的系统,功能较少,其中包括它对身份映射没有任何作用.即使使用PeeWee,ORM尽可能简单,仍然需要 15秒,这证明与原始MySQLdb提取相比, cPython确实非常慢在直线C中.

For your comparison to PeeWee, PW is a much simpler system with a lot less features, including that it doesn't do anything with identity maps. Even with PeeWee, about as simple of an ORM as is feasible, it still takes 15 seconds, which is evidence that cPython is really really slow compared to the raw MySQLdb fetch which is in straight C.

与Java相比,Java VM比cPython更快. Hibernate非常复杂,但是由于JIT的原因,Java VM的运行速度非常快,甚至所有复杂性都以更快的速度运行.如果要将Python与Java进行比较,请使用Pypy.

For comparison to Java, the Java VM is way way way faster than cPython. Hibernate is ridiculously complicated, yet the Java VM is extremely fast due to the JIT and even all that complexity ends up running faster. If you want to compare Python to Java, use Pypy.

这篇关于为什么通过ORM 5-8x加载SQLAlchemy对象的速度比通过原始MySQLdb游标的行慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-31 12:30