Python Web Crawling in Practice │ Building a Yunqi Academy Crawler (with Code)
This walkthrough demonstrates a complete Python crawling project by building a crawler for the Yunqi Academy (云起书院) novel site.
01 Creating the Yunqi Academy Crawler
Before writing any code, analyze the Yunqi Academy site against the project requirements. The goal is to extract each novel's name, author, category, status, update time, word count, clicks, popularity, and recommendation counts. Start from the site's book catalogue at http://yunqi.qq.com/bk, shown in Figure 15-4.
■ Figure 15-4 Book list
The book list contains every novel's name, author, category, status, update time, and word count. Scrolling to the bottom of the page reveals the pagination buttons.
Clicking through to any novel opens its detail page; the work-information panel there shows the clicks, popularity, recommendations, and similar data, as shown in Figure 15-5.
■ Figure 15-5 Novel detail page
1. Defining the Items
After creating the project, the first step is to define the Items, i.e. the structured data to be extracted. Two Items are defined: one holds a novel's basic information, the other holds its popularity data (clicks, popularity, and so on). The code is as follows:
import scrapy

class YunqiBookListItem(scrapy.Item):
    # novel id
    novelId = scrapy.Field()
    # novel name
    novelName = scrapy.Field()
    # novel link
    novelLink = scrapy.Field()
    # novel author
    novelAuthor = scrapy.Field()
    # novel category
    novelType = scrapy.Field()
    # novel status
    novelStatus = scrapy.Field()
    # novel update time
    novelUpdateTime = scrapy.Field()
    # novel word count
    novelWords = scrapy.Field()
    # novel cover image
    novelImageUrl = scrapy.Field()

class YunqiBookDetailItem(scrapy.Item):
    # novel id
    novelId = scrapy.Field()
    # novel tags
    novelLabel = scrapy.Field()
    # total clicks
    novelAllClick = scrapy.Field()
    # monthly clicks
    novelMonthClick = scrapy.Field()
    # weekly clicks
    novelWeekClick = scrapy.Field()
    # total popularity
    novelAllPopular = scrapy.Field()
    # monthly popularity
    novelMonthPopular = scrapy.Field()
    # weekly popularity
    novelWeekPopular = scrapy.Field()
    # comment count
    novelCommentNum = scrapy.Field()
    # total recommendations
    novelAllComm = scrapy.Field()
    # monthly recommendations
    novelMonthComm = scrapy.Field()
    # weekly recommendations
    novelWeekComm = scrapy.Field()
2. Writing the Spider Module
Next comes the page parsing, handled by two methods. parse_book_list() parses the book list shown in Figure 15-4 and extracts each novel's basic information; parse_book_detail() parses the clicks, popularity, and related data shown in Figure 15-5. Pagination links are extracted by a rule defined in rules; they follow the pattern "/bk/so2/n30p\d+". The complete code of YunqiQqComSpider is:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from yunqiCrawl.items import YunqiBookListItem, YunqiBookDetailItem
from scrapy.http import Request

class YunqiQqComSpider(CrawlSpider):
    name = 'yunqi.qq.com'
    allowed_domains = ['yunqi.qq.com']
    start_urls = ['http://yunqi.qq.com/bk/so2/n30p1']
    rules = (
        Rule(LinkExtractor(allow=r'/bk/so2/n30p\d+'), callback='parse_book_list', follow=True),
    )

    def parse_book_list(self, response):
        books = response.xpath(".//div[@class='book']")
        for book in books:
            novelImageUrl = book.xpath("./a/img/@src").extract_first()
            novelId = book.xpath("./div[@class='book_info']/h3/a/@id").extract_first()
            novelName = book.xpath("./div[@class='book_info']/h3/a/text()").extract_first()
            novelLink = book.xpath("./div[@class='book_info']/h3/a/@href").extract_first()
            novelInfos = book.xpath("./div[@class='book_info']/dl/dd[@class='w_auth']")
            if len(novelInfos) > 4:
                novelAuthor = novelInfos[0].xpath('./a/text()').extract_first()
                novelType = novelInfos[1].xpath('./a/text()').extract_first()
                novelStatus = novelInfos[2].xpath('./text()').extract_first()
                novelUpdateTime = novelInfos[3].xpath('./text()').extract_first()
                novelWords = novelInfos[4].xpath('./text()').extract_first()
            else:
                novelAuthor = ''
                novelType = ''
                novelStatus = ''
                novelUpdateTime = ''
                novelWords = 0
            bookListItem = YunqiBookListItem(novelId=novelId, novelName=novelName,
                                             novelLink=novelLink, novelAuthor=novelAuthor,
                                             novelType=novelType, novelStatus=novelStatus,
                                             novelUpdateTime=novelUpdateTime, novelWords=novelWords,
                                             novelImageUrl=novelImageUrl)
            yield bookListItem
            request = scrapy.Request(url=novelLink, callback=self.parse_book_detail)
            request.meta['novelId'] = novelId
            yield request

    def parse_book_detail(self, response):
        # from scrapy.shell import inspect_response
        # inspect_response(response, self)
        novelId = response.meta['novelId']
        novelLabel = response.xpath("//div[@class='tags']/text()").extract_first()
        novelAllClick = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[1]/text()").extract_first()
        novelAllPopular = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[2]/text()").extract_first()
        novelAllComm = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[3]/text()").extract_first()
        novelMonthClick = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[1]/text()").extract_first()
        novelMonthPopular = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[2]/text()").extract_first()
        novelMonthComm = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[3]/text()").extract_first()
        novelWeekClick = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[1]/text()").extract_first()
        novelWeekPopular = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[2]/text()").extract_first()
        novelWeekComm = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[3]/text()").extract_first()
        novelCommentNum = response.xpath(".//*[@id='novelInfo_commentCount']/text()").extract_first()
        bookDetailItem = YunqiBookDetailItem(novelId=novelId, novelLabel=novelLabel,
                                             novelAllClick=novelAllClick, novelAllPopular=novelAllPopular,
                                             novelAllComm=novelAllComm, novelMonthClick=novelMonthClick,
                                             novelMonthPopular=novelMonthPopular, novelMonthComm=novelMonthComm,
                                             novelWeekClick=novelWeekClick, novelWeekPopular=novelWeekPopular,
                                             novelWeekComm=novelWeekComm, novelCommentNum=novelCommentNum)
        yield bookDetailItem
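With the items and spider in place, the crawl can be started from the project root using the standard Scrapy command and the spider's name attribute: scrapy crawl yunqi.qq.com.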
3. Writing the Pipeline
With the spider module complete, the next step is the pipeline, which stores the Items in MongoDB. The two Item types go into two separate collections, and the connection is made against a MongoDB replica-set cluster. The implementation is:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import re
import pymongo
from yunqiCrawl.items import YunqiBookListItem

class YunqicrawlPipeline(object):

    def __init__(self, mongo_uri, mongo_db, replicaset):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.replicaset = replicaset

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'yunqi'),
            replicaset=crawler.settings.get('REPLICASET')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri, replicaset=self.replicaset)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if isinstance(item, YunqiBookListItem):
            self._process_booklist_item(item)
        else:
            self._process_bookdetail_item(item)
        return item

    def _process_booklist_item(self, item):
        '''
        Store a novel's basic information.
        :param item:
        :return:
        '''
        self.db.bookInfo.insert_one(dict(item))

    def _process_bookdetail_item(self, item):
        '''
        Store a novel's popularity data.
        :param item:
        :return:
        '''
        # The raw values need cleaning, e.g. "总字数:10120" -- keep only the digits
        pattern = re.compile(r'\d+')
        # Strip whitespace and newlines from the tags
        item['novelLabel'] = item['novelLabel'].strip().replace('\n', '')
        match = pattern.search(item['novelAllClick'])
        item['novelAllClick'] = match.group() if match else item['novelAllClick']
        match = pattern.search(item['novelMonthClick'])
        item['novelMonthClick'] = match.group() if match else item['novelMonthClick']
        match = pattern.search(item['novelWeekClick'])
        item['novelWeekClick'] = match.group() if match else item['novelWeekClick']
        match = pattern.search(item['novelAllPopular'])
        item['novelAllPopular'] = match.group() if match else item['novelAllPopular']
        match = pattern.search(item['novelMonthPopular'])
        item['novelMonthPopular'] = match.group() if match else item['novelMonthPopular']
        match = pattern.search(item['novelWeekPopular'])
        item['novelWeekPopular'] = match.group() if match else item['novelWeekPopular']
        match = pattern.search(item['novelAllComm'])
        item['novelAllComm'] = match.group() if match else item['novelAllComm']
        match = pattern.search(item['novelMonthComm'])
        item['novelMonthComm'] = match.group() if match else item['novelMonthComm']
        match = pattern.search(item['novelWeekComm'])
        item['novelWeekComm'] = match.group() if match else item['novelWeekComm']
        self.db.bookhot.insert_one(dict(item))
Finally, add the following to settings.py to activate the pipeline:
ITEM_PIPELINES = {
    'yunqiCrawl.pipelines.YunqicrawlPipeline': 300,
}
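The pipeline's from_crawler() also reads its connection parameters from the settings. A minimal sketch of the corresponding entries follows; the values are placeholders and must be adapted to your own MongoDB deployment:
# Placeholder MongoDB connection settings read by YunqicrawlPipeline.from_crawler()
MONGO_URI = 'mongodb://localhost:27017'   # connection string of the MongoDB cluster
MONGO_DATABASE = 'yunqi'                  # database holding the bookInfo and bookhot collections
REPLICASET = 'yunqi_rs'                   # replica-set name; use None for a standalone server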
4. Countering Anti-Crawler Mechanisms
To avoid detection by the site's anti-crawler mechanisms, the project mainly relies on randomized User-Agents, automatic throttling, and disabled cookies.
1) Randomized User-Agent. A downloader middleware is used to attach a random User-Agent to every request. The implementation is:
# coding:utf-8
import random

'''
This class produces a random User-Agent for every request.
'''
class RandomUserAgent(object):

    def __init__(self, agents):
        self.agents = agents

    @classmethod
    def from_crawler(cls, crawler):
        # Returns an instance of this class (cls == RandomUserAgent)
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Pick a random User-Agent from the configured list
        request.headers.setdefault('User-Agent', random.choice(self.agents))
Set the USER_AGENTS value in settings.py:
USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
]
Then enable the middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'yunqiCrawl.middlewares.RandomUserAgent.RandomUserAgent': 410,
}
2) Automatic throttling. The relevant settings are:
DOWNLOAD_DELAY = 3
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
3) Disabling cookies. The setting is:
COOKIES_ENABLED = False
If the crawler still gets detected after these measures, an HTTP proxy middleware can be added to rotate IP addresses.
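The article does not include such a middleware; the sketch below shows one possible implementation, assuming a PROXIES list of proxy URLs is defined in settings.py (the setting name and addresses are placeholders):
# coding:utf-8
import random

class RandomHttpProxy(object):
    '''
    Illustrative middleware: route each request through a randomly chosen HTTP proxy.
    Assumes settings.py defines something like
    PROXIES = ['http://123.45.67.89:8080', 'http://98.76.54.32:3128']
    '''
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        if self.proxies:
            # request.meta['proxy'] is how Scrapy's HttpProxyMiddleware is told which proxy to use
            request.meta['proxy'] = random.choice(self.proxies)
Like RandomUserAgent, it would be registered in DOWNLOADER_MIDDLEWARES.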
5. Deduplication Optimization
Finally, configure scrapy_redis in settings.py:
# Use the scrapy_redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
# Persist the scrapy_redis queues in Redis, so a crawl can be paused and resumed
SCHEDULER_PERSIST = True
# Use the scrapy_redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Alternative: a Bloom-filter-based duplicate filter (discussed below)
# DUPEFILTER_CLASS = "yunqiCrawl.bloomFilterOnRedis.bloomRedisFilter.bloomRedisFilter"
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
With these settings, a distributed crawler is up and running; to use a Redis instance on a remote server, simply change the IP and port.
Now a word about deduplication optimization. Let's look at how scrapy_redis implements RFPDupeFilter. The key code is:
def request_seen(self, request):
    fp = request_fingerprint(request)
    added = self.server.sadd(self.key, fp)
    return not added
scrapy_redis deduplicates by adding each generated fingerprint to a Redis set. To see how the fingerprint is produced, step into the request_fingerprint() method:
def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple([h.lower() for h in sorted(include_headers)])
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(request.method)
        fp.update(canonicalize_url(request.url))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                fp.update(hdr)
                for v in request.headers.getlist(hdr):
                    fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]
As the code shows, this is still Scrapy's built-in fingerprinting; only the storage location of the fingerprints has changed (a Redis set instead of an in-memory set).
A recommended open-source project is https://github.com/qiyeboy/Scrapy_Redis_Bloomfilter, which adds Bloom filter support on top of scrapy_redis. It is used as follows:
git clone https://github.com/qiyeboy/Scrapy_Redis_Bloomfilter
Clone the repository, then copy the scrapy_redis folder under BloomfilterOnRedis_Demo into your Scrapy project, next to settings.py. Taking the yunqiCrawl project as an example, add the following entries to settings.py:
- FILTER_URL = None
- FILTER_HOST = 'localhost'
- FILTER_PORT = 6379
- FILTER_DB = 0
- SCHEDULER_QUEUE_CLASS = 'yunqiCrawl.scrapy_redis.queue.SpiderPriorityQueue'
Replace the official SCHEDULER used earlier with the one from the local folder:
SCHEDULER = 'yunqiCrawl.scrapy_redis.scheduler.Scheduler'
Finally, delete the line DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter".
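For intuition about what the Bloom filter replaces: instead of SADD-ing every fingerprint into a Redis set, a Bloom filter maps each fingerprint to a few bit positions in a Redis bitmap. The following is only a simplified illustration, not code from the Scrapy_Redis_Bloomfilter project; all names here are made up:
import hashlib
import redis

class SimpleRedisBloomFilter(object):
    '''
    Illustrative Redis-backed Bloom filter: each fingerprint is mapped to
    hash_count bit positions inside a single Redis bitmap.
    '''
    def __init__(self, server, key='bloomfilter', bit_size=1 << 25, hash_count=6):
        self.server = server          # a redis.StrictRedis instance
        self.key = key                # Redis key holding the bitmap
        self.bit_size = bit_size      # total number of bits in the filter
        self.hash_count = hash_count  # number of hash functions

    def _offsets(self, value):
        # Derive hash_count bit positions from the fingerprint string
        for seed in range(self.hash_count):
            digest = hashlib.md5((str(seed) + value).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.bit_size

    def seen(self, value):
        # True only if every corresponding bit is already set (may give false positives)
        return all(self.server.getbit(self.key, off) for off in self._offsets(value))

    def add(self, value):
        for off in self._offsets(value):
            self.server.setbit(self.key, off, 1)

# Example usage with a local Redis:
# bf = SimpleRedisBloomFilter(redis.StrictRedis(host='127.0.0.1', port=6379))
# if not bf.seen(fp):
#     bf.add(fp)
A dupefilter built this way calls seen()/add() inside request_seen(), accepting a small false-positive rate in exchange for far lower memory use once the crawl reaches tens of millions of URLs.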
02 Source Code Download
Follow the WeChat official account and reply with the keyword "云起书院" to receive the complete source code.
03 Reference Book
《Python网络爬虫案例实战》 (Python Web Crawling in Practice: Case Studies)
ISBN: 978-7-302-56228-3
Author: Li Xiaodong
Price: 89 RMB
Editor's notes:
- Uses case projects as the main thread for teaching the knowledge and skills needed in Python crawler development;
- Highly practical, with projects that become increasingly engineering-oriented as the book progresses;
- Provides more than 80 examples to help readers grasp the concepts, principles, and algorithms.