This section demonstrates a complete, hands-on Python crawler project through the example of building a crawler for the Yunqi Shuyuan (云起书院) site.
01  Creating the Yunqi Shuyuan Crawler
Before writing any code, the Yunqi Shuyuan site has to be analyzed against the project requirements. The goal is to extract each novel's name, author, category, status, update time, word count, clicks, popularity, and recommendation counts. Start from the site's library page (http://yunqi.qq.com/bk), shown in Figure 15-4.
■ Figure 15-4  Book list
The book list contains each novel's name, author, category, status, update time, word count, and so on; scrolling to the bottom of the page reveals the pagination buttons.
Clicking through to any novel opens its detail page, whose work-information block contains the clicks, popularity, and recommendation figures, as shown in Figure 15-5.
■ Figure 15-5  Novel detail page
1. Defining the Items
After creating the project, the first task is to define the Items, i.e. the structured data to be extracted. Two Items are defined: one holds a novel's basic information, the other holds its popularity data (clicks, popularity, and so on). The code is as follows:
import scrapy


class YunqiBookListItem(scrapy.Item):
    # novel id
    novelId = scrapy.Field()
    # novel name
    novelName = scrapy.Field()
    # novel link
    novelLink = scrapy.Field()
    # novel author
    novelAuthor = scrapy.Field()
    # novel category
    novelType = scrapy.Field()
    # novel status
    novelStatus = scrapy.Field()
    # novel update time
    novelUpdateTime = scrapy.Field()
    # novel word count
    novelWords = scrapy.Field()
    # novel cover
    novelImageUrl = scrapy.Field()


class YunqiBookDetailItem(scrapy.Item):
    # novel id
    novelId = scrapy.Field()
    # novel tags
    novelLabel = scrapy.Field()
    # total clicks
    novelAllClick = scrapy.Field()
    # monthly clicks
    novelMonthClick = scrapy.Field()
    # weekly clicks
    novelWeekClick = scrapy.Field()
    # total popularity
    novelAllPopular = scrapy.Field()
    # monthly popularity
    novelMonthPopular = scrapy.Field()
    # weekly popularity
    novelWeekPopular = scrapy.Field()
    # comment count
    novelCommentNum = scrapy.Field()
    # total recommendations
    novelAllComm = scrapy.Field()
    # monthly recommendations
    novelMonthComm = scrapy.Field()
    # weekly recommendations
    novelWeekComm = scrapy.Field()
2. Writing the Spider Module
Next comes page parsing, handled by two methods: parse_book_list() parses the book list shown in Figure 15-4 and extracts the basic novel information, while parse_book_detail() parses the clicks and popularity data on the detail page shown in Figure 15-5. Pagination links are extracted by the rule defined in rules; they follow the pattern "/bk/so2/n30p\d+". The complete code of YunqiQqComSpider is:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request

from yunqiCrawl.items import YunqiBookListItem, YunqiBookDetailItem


class YunqiQqComSpider(CrawlSpider):
    name = 'yunqi.qq.com'
    allowed_domains = ['yunqi.qq.com']
    start_urls = ['http://yunqi.qq.com/bk/so2/n30p1']

    rules = (
        Rule(LinkExtractor(allow=r'/bk/so2/n30p\d+'),
             callback='parse_book_list', follow=True),
    )

    def parse_book_list(self, response):
        books = response.xpath(".//div[@class='book']")
        for book in books:
            novelImageUrl = book.xpath("./a/img/@src").extract_first()
            novelId = book.xpath("./div[@class='book_info']/h3/a/@id").extract_first()
            novelName = book.xpath("./div[@class='book_info']/h3/a/text()").extract_first()
            novelLink = book.xpath("./div[@class='book_info']/h3/a/@href").extract_first()
            novelInfos = book.xpath("./div[@class='book_info']/dl/dd[@class='w_auth']")
            if len(novelInfos) > 4:
                novelAuthor = novelInfos[0].xpath('./a/text()').extract_first()
                novelType = novelInfos[1].xpath('./a/text()').extract_first()
                novelStatus = novelInfos[2].xpath('./text()').extract_first()
                novelUpdateTime = novelInfos[3].xpath('./text()').extract_first()
                novelWords = novelInfos[4].xpath('./text()').extract_first()
            else:
                novelAuthor = ''
                novelType = ''
                novelStatus = ''
                novelUpdateTime = ''
                novelWords = 0
            bookListItem = YunqiBookListItem(novelId=novelId, novelName=novelName,
                                             novelLink=novelLink, novelAuthor=novelAuthor,
                                             novelType=novelType, novelStatus=novelStatus,
                                             novelUpdateTime=novelUpdateTime, novelWords=novelWords,
                                             novelImageUrl=novelImageUrl)
            yield bookListItem

            request = scrapy.Request(url=novelLink, callback=self.parse_book_detail)
            request.meta['novelId'] = novelId
            yield request

    def parse_book_detail(self, response):
        # from scrapy.shell import inspect_response
        # inspect_response(response, self)
        novelId = response.meta['novelId']
        novelLabel = response.xpath("//div[@class='tags']/text()").extract_first()

        novelAllClick = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[1]/text()").extract_first()
        novelAllPopular = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[2]/text()").extract_first()
        novelAllComm = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[3]/text()").extract_first()

        novelMonthClick = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[1]/text()").extract_first()
        novelMonthPopular = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[2]/text()").extract_first()
        novelMonthComm = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[3]/text()").extract_first()

        novelWeekClick = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[1]/text()").extract_first()
        novelWeekPopular = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[2]/text()").extract_first()
        novelWeekComm = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[3]/text()").extract_first()

        novelCommentNum = response.xpath(".//*[@id='novelInfo_commentCount']/text()").extract_first()

        bookDetailItem = YunqiBookDetailItem(novelId=novelId, novelLabel=novelLabel,
                                             novelAllClick=novelAllClick, novelAllPopular=novelAllPopular,
                                             novelAllComm=novelAllComm, novelMonthClick=novelMonthClick,
                                             novelMonthPopular=novelMonthPopular, novelMonthComm=novelMonthComm,
                                             novelWeekClick=novelWeekClick, novelWeekPopular=novelWeekPopular,
                                             novelWeekComm=novelWeekComm, novelCommentNum=novelCommentNum)
        yield bookDetailItem
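At this point the spider can already be run from inside the yunqiCrawl project with Scrapy's standard command line. The feed-export line below (sample.json) is just an assumed quick way to inspect the scraped items before the pipeline from the next step exists:

scrapy crawl yunqi.qq.com
# or dump the items to a file for a quick sanity check:
scrapy crawl yunqi.qq.com -o sample.json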
3. Writing the Pipeline
With the spider module finished, the next step is the pipeline, which stores the Items in MongoDB. The two kinds of Item go into two separate collections, and MongoDB is deployed as a replica-set cluster. The implementation is:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import re

import pymongo

from yunqiCrawl.items import YunqiBookListItem


class YunqicrawlPipeline(object):

    def __init__(self, mongo_uri, mongo_db, replicaset):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.replicaset = replicaset

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'yunqi'),
            replicaset=crawler.settings.get('REPLICASET')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri, replicaset=self.replicaset)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if isinstance(item, YunqiBookListItem):
            self._process_booklist_item(item)
        else:
            self._process_bookeDetail_item(item)
        return item

    def _process_booklist_item(self, item):
        '''
        Store the basic novel information.
        :param item:
        :return:
        '''
        self.db.bookInfo.insert(dict(item))

    def _process_bookeDetail_item(self, item):
        '''
        Store the novel popularity data.
        :param item:
        :return:
        '''
        # The raw values need cleaning, e.g. "总字数:10120" -- extract only the digits
        pattern = re.compile(r'\d+')

        # strip spaces and newlines
        item['novelLabel'] = item['novelLabel'].strip().replace('\n', '')

        match = pattern.search(item['novelAllClick'])
        item['novelAllClick'] = match.group() if match else item['novelAllClick']

        match = pattern.search(item['novelMonthClick'])
        item['novelMonthClick'] = match.group() if match else item['novelMonthClick']

        match = pattern.search(item['novelWeekClick'])
        item['novelWeekClick'] = match.group() if match else item['novelWeekClick']

        match = pattern.search(item['novelAllPopular'])
        item['novelAllPopular'] = match.group() if match else item['novelAllPopular']

        match = pattern.search(item['novelMonthPopular'])
        item['novelMonthPopular'] = match.group() if match else item['novelMonthPopular']

        match = pattern.search(item['novelWeekPopular'])
        item['novelWeekPopular'] = match.group() if match else item['novelWeekPopular']

        match = pattern.search(item['novelAllComm'])
        item['novelAllComm'] = match.group() if match else item['novelAllComm']

        match = pattern.search(item['novelMonthComm'])
        item['novelMonthComm'] = match.group() if match else item['novelMonthComm']

        match = pattern.search(item['novelWeekComm'])
        item['novelWeekComm'] = match.group() if match else item['novelWeekComm']

        self.db.bookhot.insert(dict(item))
Finally, activate the pipeline by adding the following to settings.py:

ITEM_PIPELINES = {
    'yunqiCrawl.pipelines.YunqicrawlPipeline': 300,
}
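The pipeline's from_crawler() reads MONGO_URI, MONGO_DATABASE, and REPLICASET from the settings, but their values are not shown above. A minimal sketch of these settings, assuming a local three-node replica set named yunqiSet (adjust the hosts and replica-set name to your own cluster), might look like this:

# Hypothetical MongoDB settings read by YunqicrawlPipeline.from_crawler()
MONGO_URI = 'mongodb://localhost:27017,localhost:27018,localhost:27019'
MONGO_DATABASE = 'yunqi'
REPLICASET = 'yunqiSet'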
4. Countering Anti-Crawler Mechanisms
To avoid being detected by anti-crawler mechanisms, the project mainly relies on faking a random User-Agent, auto-throttling, and disabling cookies.
1) Faking a random User-Agent. This is implemented with a downloader middleware:
# coding:utf-8
import random

'''
This class is used to produce a random User-Agent.
'''


class RandomUserAgent(object):

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # returns an instance of this class (cls == RandomUserAgent)
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # the body of this method was truncated in the original text;
        # presumably it assigns a randomly chosen User-Agent to each request:
        request.headers.setdefault('User-Agent', random.choice(self.agents))

Set the USER_AGENTS value in settings.py:
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
]
Then enable the middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'yunqiCrawl.middlewares.RandomUserAgent.RandomUserAgent': 410,
}
2) Configuring auto-throttling. The relevant settings are:

DOWNLOAD_DELAY = 3
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
3) Disabling cookies:

COOKIES_ENABLED = False
If the crawler is still detected after taking these measures, you can write an HTTP proxy downloader middleware to rotate IP addresses, as sketched below.
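The original text does not include such a middleware; the following is only a minimal sketch, assuming a hypothetical PROXIES list in settings.py. Scrapy routes a request through a proxy whenever request.meta['proxy'] is set.

# coding:utf-8
import random


class RandomHttpProxy(object):
    '''Hypothetical downloader middleware that picks a random proxy per request.'''

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXIES is an assumed setting, e.g. ['http://1.2.3.4:8080', ...]
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        if self.proxies:
            # tell Scrapy to send this request through a randomly chosen proxy
            request.meta['proxy'] = random.choice(self.proxies)

It would be enabled in DOWNLOADER_MIDDLEWARES in the same way as RandomUserAgent above.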
5. Deduplication Optimization
Finally, configure scrapy_redis in settings.py:
# use the scrapy_redis scheduler
SCHEDULER = "yunqiCrawl.scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = 'yunqiCrawl.scrapy_redis.queue.SpiderPriorityQueue'
# keep the scrapy-redis queues in Redis, which allows pausing and later resuming the crawl
SCHEDULER_PERSIST = True

# use the scrapy_redis deduplication filter
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# DUPEFILTER_CLASS = "yunqiCrawl.bloomFilterOnRedis.bloomRedisFilter.bloomRedisFilter"
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
With these steps, a distributed crawler is up and running; to use a Redis instance on a remote server, simply change the host and port.
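For example (the address below is a placeholder, not a value from the original text):

REDIS_HOST = '192.168.1.100'   # hypothetical remote Redis server
REDIS_PORT = 6379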
Deduplication deserves a closer look: how does scrapy_redis implement RFPDupeFilter? The key code is:
def request_seen(self, request):
    fp = request_fingerprint(request)
    added = self.server.sadd(self.key, fp)
    return not added
scrapy_redis deduplicates by adding each generated fingerprint to a Redis set. To see how the fingerprint is produced, step into the request_fingerprint() method:
def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple([h.lower() for h in sorted(include_headers)])
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(request.method)
        fp.update(canonicalize_url(request.url))
        fp.update(request.body or '')
        if include_headers:
            for hdr in include_headers:
                fp.update(hdr)
                for v in request.headers.getlist(hdr):
                    fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]
As the code shows, it still relies on Scrapy's built-in fingerprinting; only the place where the fingerprints are stored has changed.
A recommended open-source project is https://github.com/qiyeboy/Scrapy_Redis_Bloomfilter, which adds Bloom filter support on top of scrapy-redis. It is used as follows:
git clone https://github.com/qiyeboy/Scrapy_Redis_Bloomfilter
Clone the source to your machine and copy the scrapy_redis folder under BloomfilterOnRedis_Demo into your Scrapy project, at the same level as settings.py. Taking the yunqiCrawl project as an example, add the following fields to settings.py:
  • FILTER_URL = None
  • FILTER_HOST = 'localhost'
  • FILTER_PORT = 6379
  • FILTER_DB = 0
  • SCHEDULER_QUEUE_CLASS = 'yunqiCrawl.scrapy_redis.queue.SpiderPriorityQueue'
Replace the official SCHEDULER used earlier with the one from the local folder:
SCHEDULER = 'yunqiCrawl.scrapy_redis.scheduler.Scheduler'
Finally, delete the line DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter".
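For intuition about what the Bloom filter buys you: instead of storing every fingerprint in a Redis set, it sets k bit positions in a Redis bitmap per fingerprint and treats a fingerprint as "seen" only when all k bits are already set, trading a small false-positive rate for far less memory. The sketch below is only an illustration of that idea, not the project's actual code; the key name, bitmap size, and hash seeds are made up:

import hashlib

import redis


class SimpleRedisBloomFilter(object):
    '''Illustrative Bloom filter over a Redis bitmap (not the project's real implementation).'''

    def __init__(self, server, key='bloomfilter', bit_size=1 << 25, seeds=(5, 7, 11, 13, 31)):
        self.server = server          # redis.StrictRedis instance
        self.key = key                # name of the Redis bitmap key
        self.bit_size = bit_size      # number of bits in the bitmap
        self.seeds = seeds            # one seed per hash function

    def _offsets(self, value):
        # derive one bit offset per seed from the value
        for seed in self.seeds:
            digest = hashlib.sha1(('%d:%s' % (seed, value)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.bit_size

    def seen(self, fp):
        # the fingerprint counts as "seen" only if every derived bit is already set
        return all(self.server.getbit(self.key, off) for off in self._offsets(fp))

    def add(self, fp):
        for off in self._offsets(fp):
            self.server.setbit(self.key, off, 1)


if __name__ == '__main__':
    bf = SimpleRedisBloomFilter(redis.StrictRedis(host='127.0.0.1', port=6379))
    if not bf.seen('some-request-fingerprint'):
        bf.add('some-request-fingerprint')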
02  Source Code Download
Follow the WeChat public account and reply with the keyword "云起书院" to receive the complete source code.
03  Reference Book
《Python网络爬虫案例实战》 (Python Web Crawler Case Studies in Action), by 李晓东, ISBN 978-7-302-56228-3.
