1. What is Scrapy

Scrapy is a crawler framework built on top of Twisted; you only need to implement a few custom modules to get a working crawler.

2. Advantages of Scrapy

Without Scrapy, writing a crawler by hand means using urllib or Requests to send requests and then building everything else yourself: classes for HTTP headers, multithreading, proxy handling, deduplication, data storage, and exception handling.
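As a rough illustration (a minimal sketch, not from the original post), this is the kind of boilerplate you end up writing yourself with Requests, all of which Scrapy's middlewares and pipelines take care of:

import requests

seen_urls = set()  # hand-rolled deduplication

def fetch(url):
    """Fetch one page, handling headers, proxy, dedup and errors by hand."""
    if url in seen_urls:
        return None
    seen_urls.add(url)
    headers = {'User-Agent': 'Mozilla/5.0'}        # hand-built HTTP headers
    proxies = {'http': 'http://183.207.95.27:80'}  # hand-managed proxy
    try:
        resp = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        resp.raise_for_status()
        return resp.text                           # storage would also be hand-written
    except requests.RequestException:              # hand-rolled exception handling
        return None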

3. Scrapy Architecture

[Figure: Scrapy architecture diagram]

Scrapy Engine: the engine. It handles the signals, messages, and data passing between the Scheduler, Pipeline, Spiders, and Downloader.

Scheduler: the scheduler. In essence it is a queue: it accepts Requests sent over by the Scrapy Engine, queues them up, and hands requests from the queue back to the engine whenever the engine asks for more.

Downloader: the downloader. It accepts Requests from the Scrapy Engine, sends them and downloads the data, generates Responses, and returns them to the engine, which then passes the Responses on to the Spiders.

Spiders: the spiders. This is where the crawling logic lives, written with regular expressions, BeautifulSoup, XPath, and so on. If a Response contains a follow-up request, such as a "next page" link, the Spider hands the URL back to the Scrapy Engine, which passes it to the Scheduler for queuing.

Pipeline: the item pipeline. This is where the deduplication and storage classes live; it is responsible for post-processing the scraped data, such as filtering and persisting it.

Downloader Middlewares: downloader middleware. Custom extension components; this is where we plug in proxy handling and HTTP headers.

Spider Middlewares: spider middleware. Components that can process the Requests sent out by the Spiders and the Responses they receive.
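To make the flow between these components concrete, here is a minimal spider sketch (illustrative only; the URL and selectors are placeholders): the Spider yields Requests that travel through the Engine to the Scheduler and Downloader, and yields items that the Engine hands to the Pipeline.

import scrapy


class FlowDemoSpider(scrapy.Spider):
    name = 'flow_demo'
    start_urls = ['http://example.com/list']  # placeholder URL

    def parse(self, response):
        # the Downloader fetched this Response and the Engine handed it to the Spider
        for title in response.xpath('//h2/text()').getall():
            yield {'title': title}  # items go Engine -> Item Pipeline
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            # new Requests go Engine -> Scheduler -> (later) Downloader
            yield response.follow(next_page, callback=self.parse)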

4. A Scrapy Example

4.1 Crawling Douban Movie Top 250

There are plenty of tutorials online on how to set up a Scrapy project, so the setup is not repeated here; a typical sequence of commands is sketched below.
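For reference (not taken from the original post), the project skeleton is usually created like this; the spider name and allowed domain in the last command are my own assumptions:

scrapy startproject ScrapyTest
cd ScrapyTest
scrapy genspider douban_movie movie.douban.com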

Custom proxy middleware. The example below uses a local list of proxy IPs; for a large volume of crawler requests you would need a third-party proxy service. The crawler's source IP can be disguised by routing requests through one of the proxies below:

import random


class specified_proxy(object):
    def process_request(self, request, spider):
        # pick a random proxy IP for this request
        PROXIES = ['http://183.207.95.27:80', 'http://111.6.100.99:80', 'http://122.72.99.103:80',
                   'http://106.46.132.2:80', 'http://112.16.4.99:81', 'http://123.58.166.113:9000',
                   'http://118.178.124.33:3128', 'http://116.62.11.138:3128', 'http://121.42.176.133:3128',
                   'http://111.13.2.131:80', 'http://111.13.7.117:80', 'http://121.248.112.20:3128',
                   'http://112.5.56.108:3128', 'http://42.51.26.79:3128', 'http://183.232.65.201:3128',
                   'http://118.190.14.150:3128', 'http://123.57.221.41:3128', 'http://183.232.65.203:3128',
                   'http://166.111.77.32:3128', 'http://42.202.130.246:3128', 'http://122.228.25.97:8101',
                   'http://61.136.163.245:3128', 'http://121.40.23.227:3128', 'http://123.96.6.216:808',
                   'http://59.61.72.202:8080', 'http://114.141.166.242:80', 'http://61.136.163.246:3128',
                   'http://60.31.239.166:3128', 'http://114.55.31.115:3128', 'http://202.85.213.220:3128']
        random_proxy = random.choice(PROXIES)  # pick a single proxy string
        request.meta['proxy'] = random_proxy

Custom user-agent middleware, so that the target server sees requests that look like they come from a real operating system and browser rather than from a bot:

class specified_useragent(object):
    def process_request(self, request, spider):
        # pick a random User-Agent string for this request
        USER_AGENT_LIST = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
            "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
            "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
            "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
        ]
        agent = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = agent

Once the custom middlewares are written, register them in settings.py:

# lower values run first (higher priority)
DOWNLOADER_MIDDLEWARES = {
    'ScrapyTest.middlewares.specified_proxy': 543,
    'ScrapyTest.middlewares.specified_useragent': 544,
}

Define the item fields in items.py:

import scrapy


class ScrapytestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # movie ranking number
    serial_number = scrapy.Field()
    # movie title
    movie_name = scrapy.Field()
    # movie introduction
    introduce = scrapy.Field()
    # rating
    star = scrapy.Field()
    # number of reviews
    evaluate = scrapy.Field()
    # movie description
    describe = scrapy.Field()
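The spider code itself is not shown in the post. As a rough sketch (the spider name, start URL, and XPath selectors below are my assumptions about the Top 250 page, not taken from the post), the parse callback fills a ScrapytestItem for each movie and follows the "next page" link:

import scrapy

from ScrapyTest.items import ScrapytestItem


class DoubanMovieSpider(scrapy.Spider):
    name = 'douban_movie'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # each movie entry sits in a div with class "item"
        for movie in response.xpath("//div[@class='item']"):
            item = ScrapytestItem()
            item['serial_number'] = movie.xpath(".//em/text()").get()
            item['movie_name'] = movie.xpath(".//span[@class='title']/text()").get()
            item['introduce'] = movie.xpath(".//div[@class='bd']/p/text()").get()
            item['star'] = movie.xpath(".//span[@class='rating_num']/text()").get()
            item['evaluate'] = movie.xpath(".//div[@class='star']/span[4]/text()").get()
            item['describe'] = movie.xpath(".//span[@class='inq']/text()").get()
            yield item
        # follow the "next page" link until all 250 movies are crawled
        next_page = response.xpath("//span[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)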

Configure data storage in pipelines.py and connect to MongoDB:

import pymongo

from ScrapyTest.settings import monodb_host, monodb_port, monodb_db_name, monodb_tb_name


class ScrapytestPipeline(object):
    def __init__(self):
        # read the MongoDB connection info from settings.py
        host = monodb_host
        port = monodb_port
        dbname = monodb_db_name
        sheetname = monodb_tb_name
        client = pymongo.MongoClient(host=host, port=port)
        mydb = client[dbname]
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        # store every scraped item as a document in MongoDB
        data = dict(item)
        self.post.insert_one(data)
        return item
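For this pipeline to actually run, it also has to be enabled in settings.py (the priority value 300 is just a typical choice):

ITEM_PIPELINES = {
    'ScrapyTest.pipelines.ScrapytestPipeline': 300,
}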

MongoDB connection settings in settings.py:

monodb_host = "127.0.0.1"
monodb_port = 27017
monodb_db_name = "scrapy_test"
monodb_tb_name = "douban_movie"
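The main entry point itself is not shown in the post; a common pattern (the spider name douban_movie is my assumption) is a main.py that calls Scrapy's command-line API:

from scrapy import cmdline

# equivalent to running "scrapy crawl douban_movie" from the project root
cmdline.execute("scrapy crawl douban_movie".split())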

The output after running main:

[Figure: console output after running main]

The inserted data can be seen in the MongoDB database:

use scrapy_test;
show collections;
db.douban_movie.find().pretty()

[Figure: the inserted documents shown in the MongoDB shell]

4.2 Source Code

https://github.com/cjy513203427/ScrapyTest
