# Scrapy Learning: Item Loader

Refactor Scrapy spiders with Item Loader to make the code more elegant.
## Introduction

- Item Loader is a way to make spider code more elegant
- It is optional, not a required module
### item-loader

- Every value extracted by an ItemLoader is a `list`
- It is typically used together with `scrapy.Field`
- Usage of the `Selector` parameter: https://blog.csdn.net/zhaohaibo_/article/details/105418792
### ItemLoader

- `add_css`: extract a value with a CSS selector
- `add_xpath`: extract a value with an XPath expression
- `add_value`: add a value to the `item` directly
- `load_item`: populate the `item` and return it as a dict-like object
## Quick start

This example comes from the official documentation.
```python
from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()
```
## When the data comes from a Selector rather than the Response

When iterating over sub-selectors, pass `selector=` instead of `response=` (here `DangdangItem` and `DangdangItemLoader` are assumed to be defined elsewhere in the project):
```python
def parse_item(self, response):
    for r in response.css(".bang_list li"):
        loader = DangdangItemLoader(DangdangItem(), selector=r)
        loader.add_css("publisher", ".publisher_info a::text")
        item = loader.load_item()
        yield item
```
## Usage

An example from real-world development:
```python
import scrapy
from scrapy.loader import ItemLoader
from spider_knlib.items import JuziDetailItem

class BookSpider(scrapy.Spider):
    name = 'book'
    handle_httpstatus_list = [400]
    start_urls = [
        'https://www.163.com/dy/article/HK9GL8E005198ETO.html',
    ]
    custom_settings = {
        'CONCURRENT_REQUESTS': 100,
    }

    def parse(self, response, **kwargs):
        item_l = ItemLoader(item=JuziDetailItem(), response=response)
        item_l.add_css('title', '.post_main .post_title::text')
        item_l.add_css('published_at', '.post_main .post_info', re=r'(\d{4}-\d{2}.*:\d{2}:\d{2})')
        self.logger.info(f'titles: {item_l.load_item()}')
```
## scrapy.Field

- `input_processor`
  - appends the suffixes `-title1`/`-title2` to `title`, composed pipeline-style
- `output_processor`
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
from itemloaders.processors import MapCompose

class SpiderKnlibItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class JuziDetailItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(
            lambda x: x + '-title1',
            lambda x: x + '-title2',
        )
    )
    published_at = scrapy.Field()
```
## Using input_processor for regex extraction

Use an `input_processor` to perform the regex extraction inside the item definition:
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import re
import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join

def get_published_at(val):
    date_re = r'(\d{4}-\d{2}.*:\d{2}:\d{2})'
    res = re.findall(date_re, val)
    return res[0] if res else None  # MapCompose drops None values

class JuziDetailItem(scrapy.Item):
    title = scrapy.Field(output_processor=TakeFirst())
    published_at = scrapy.Field(
        input_processor=MapCompose(
            get_published_at
        ),
        output_processor=TakeFirst()
    )
```
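The `get_published_at` helper is plain Python, so it can be sanity-checked on its own; the sample input string below is made up for illustration:

```python
import re

def get_published_at(val):
    # extract a timestamp like "2022-11-08 10:00:00" from arbitrary text
    date_re = r'(\d{4}-\d{2}.*:\d{2}:\d{2})'
    res = re.findall(date_re, val)
    return res[0] if res else None

print(get_published_at('posted 2022-11-08 10:00:00 by editor'))
# → 2022-11-08 10:00:00
```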
## Applying a processor to every scrapy.Field

Sometimes we need a setting that applies to every `scrapy.Field`, e.g. we want every field to come out as a single value rather than as a list.
```python
# Older versions may emit this warning:
# scrapy.loader.processors.TakeFirst is deprecated, instantiate itemloaders.processors.TakeFirst instead.
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader

class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()
```
## Combining ItemLoader + scrapy.Field
```python
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

class MyItemLoader(ItemLoader):
    desc_in = MapCompose(
        lambda x: ' '.join(x.split()),
        lambda x: x.upper()
    )
    desc_out = Join()
```
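To see what `desc_in`/`desc_out` do to the collected values, here is a minimal pure-Python sketch of the `MapCompose` and `Join` semantics (simplified: no loader context, no flattening of iterable results, `None` results dropped):

```python
def map_compose(*functions):
    # simplified MapCompose: apply each function to every value in turn,
    # dropping None results
    def processor(values):
        for fn in functions:
            values = [r for r in (fn(v) for v in values) if r is not None]
        return values
    return processor

def join(separator=' '):
    # simplified Join: concatenate the collected values with a separator
    return lambda values: separator.join(values)

desc_in = map_compose(lambda x: ' '.join(x.split()), lambda x: x.upper())
desc_out = join()

collected = desc_in(['  hello   world ', 'scrapy  rocks'])
print(desc_out(collected))  # → HELLO WORLD SCRAPY ROCKS
```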
## Input and Output processors

The Item Loader provides one input processor and one output processor per item field. The input processor runs as soon as data arrives via `add_xpath()`, `add_css()`, or `add_value()`, and the processed result is kept inside the ItemLoader instance. Once all data has been collected, `load_item()` runs the output processors, populates the item, and returns it. Consider the following example:
```python
l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1)  # (1)
l.add_xpath('name', xpath2)  # (2)
l.add_css('name', css)       # (3)
l.add_value('name', 'test')  # (4)
return l.load_item()         # (5)
```
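Conceptually, steps (1)-(4) each run the field's input processor and append the results to an internal list, and step (5) runs the output processor once over everything collected. A minimal sketch of this accumulate-then-output flow (the names here are illustrative, not the real ItemLoader API):

```python
class MiniLoader:
    # toy model of ItemLoader's collect-then-output behavior
    def __init__(self, input_processor=None, output_processor=None):
        self._values = {}
        self._in = input_processor or (lambda vs: vs)
        self._out = output_processor or (lambda vs: vs)

    def add_value(self, field, *values):
        # the input processor runs immediately, on every add_* call
        self._values.setdefault(field, []).extend(self._in(list(values)))

    def load_item(self):
        # the output processor runs once, over the full collected list
        return {f: self._out(vs) for f, vs in self._values.items()}

take_first = lambda vs: next(v for v in vs if v not in (None, ''))
loader = MiniLoader(output_processor=take_first)
loader.add_value('name', 'from-xpath1')   # like step (1)
loader.add_value('name', 'from-css')      # like step (3)
print(loader.load_item())  # → {'name': 'from-xpath1'}
```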
## Declaring Item Loaders
```python
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    name_in = MapCompose(str.title)
    name_out = Join()

    price_in = MapCompose(str.strip)

    # ...
```
## Declaring Input and Output processors
```python
import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
```
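`filter_price` returns `None` for non-numeric values, and `MapCompose` drops `None` results, so bad prices never reach the item. A quick pure-Python check of that behavior, using a simplified stand-in for `MapCompose`:

```python
def filter_price(value):
    # keep only purely numeric price strings; None is dropped downstream
    if value.isdigit():
        return value

def map_compose(*functions):
    # simplified MapCompose: apply functions in turn, dropping None results
    def processor(values):
        for fn in functions:
            values = [r for r in (fn(v) for v in values) if r is not None]
        return values
    return processor

price_in = map_compose(str.strip, filter_price)
print(price_in([' 100 ', 'no price', '250']))  # → ['100', '250']
```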
## Custom output_processor

Just implement the `__call__` method:
```python
class TakeFirst:
    """
    Returns the first non-null/non-empty value from the values received,
    so it's typically used as an output processor to single-valued fields.
    It doesn't receive any ``__init__`` method arguments, nor does it accept Loader contexts.
    Example:
    >>> from itemloaders.processors import TakeFirst
    >>> proc = TakeFirst()
    >>> proc(['', 'one', 'two', 'three'])
    'one'
    """

    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value
```
## How `__call__` works

`__call__()` is a special instance method of Python classes. It effectively overloads the `()` operator, so a class instance can be called like an ordinary function, i.e. as `obj()`.
```python
class CLanguage:
    # define the __call__ method
    def __call__(self, name, add):
        print("__call__() invoked:", name, add)

clangs = CLanguage()
clangs("C语言中文网", "http://c.biancheng.net")
```
## Debugging a processor
```python
from itemloaders.processors import Join

proc = Join(',')
proc(['one', 'two', 'three'])
# output: 'one,two,three'
```
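The same interactive approach works for `TakeFirst`; since the implementation shown earlier is plain Python, it can even be exercised without Scrapy installed:

```python
class TakeFirst:
    # same logic as the implementation shown earlier:
    # return the first non-null, non-empty value
    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value

proc = TakeFirst()
print(proc(['', None, 'one', 'two']))  # → one
```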
## Cheatsheet

| Usage | Code | Notes |
|---|---|---|
| add_css | `loader.add_css('stock', 'p#stock')` | regular CSS extraction |
| add_css + re | `loader.add_css('published_at', '.post_main .post_info', re=r'(\d{4}-\d{2}.*:\d{2}:\d{2})')` | combine with a regex to extract further |
## References
- https://www.shangyang.me/2017/07/23/scrapy-learning-7-item-loaders/
- https://docs.scrapy.org/en/latest/topics/loaders.html
- https://stackoverflow.com/questions/37245846/why-are-my-input-output-processors-in-scrapy-not-working
- https://www.youtube.com/watch?v=vYKW0MNODVU&list=PL961LrMWXCe4Ih-83j_MGUwh51E__FQVV&index=34