Scrapy Notes: Item Loader

Refactoring Scrapy spiders with Item Loader to make the code more elegant
Updated: 2022-10-23 04:19:43

Introduction

ItemLoader

  • add_css: extract values with a CSS selector
  • add_xpath: extract values with an XPath expression
  • add_value: add a literal value to the item directly
  • load_item: populate the item with the collected values and return it
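The collection semantics behind these methods can be modeled in a few lines of plain Python. This is a simplified, dependency-free sketch, not Scrapy's actual implementation: every `add_*` call appends extracted values to a per-field list, and `load_item` hands back what was accumulated.

```python
# Simplified model of ItemLoader's value collection (illustration only,
# NOT the Scrapy implementation): add_value appends to a per-field list;
# load_item returns the collected data.
class MiniLoader:
    def __init__(self):
        self._values = {}

    def add_value(self, field, value):
        # Each call appends; repeated calls on the same field accumulate.
        self._values.setdefault(field, []).append(value)

    def load_item(self):
        return dict(self._values)

loader = MiniLoader()
loader.add_value('name', 'product A')
loader.add_value('name', 'product B')   # second value for the same field
loader.add_value('price', '9.99')
item = loader.load_item()
# item == {'name': ['product A', 'product B'], 'price': ['9.99']}
```

This is why, without an output processor, fields come out of a real ItemLoader as lists rather than single values.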

Quick Start

This example comes from the official documentation.

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()

Sourcing data from a Selector rather than a Response

When iterating over a list of nodes, pass each node in as selector= so the loader's queries are scoped to that node:

def parse_item(self, response):
    for r in response.css(".bang_list li"):
        loader = DangdangItemLoader(DangdangItem(), selector=r)
        loader.add_css("publisher", ".publisher_info a::text")
        item = loader.load_item()
        yield item

Usage

A case from real-world development

import scrapy
from scrapy.loader import ItemLoader
from spider_knlib.items import JuziDetailItem


class BookSpider(scrapy.Spider):
    name = 'book'
    handle_httpstatus_list = [400]
    start_urls = [
        'https://www.163.com/dy/article/HK9GL8E005198ETO.html',
    ]

    custom_settings = {
        'CONCURRENT_REQUESTS': 100,
    }

    def parse(self, response, **kwargs):
        item_l = ItemLoader(item=JuziDetailItem(), response=response)
        item_l.add_css('title', '.post_main .post_title::text')
        item_l.add_css('published_at', '.post_main .post_info', re=r'(\d{4}-\d{2}.*:\d{2}:\d{2})')
        self.logger.info(f'titles: {item_l.load_item()}')

scrapy.Field

  • input_processor
    • e.g. append the suffixes -title1/-title2 to title, with the processors composed pipeline-style
  • output_processor
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy
from itemloaders.processors import MapCompose


class SpiderKnlibItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class JuziDetailItem(scrapy.Item):
    title = scrapy.Field(
        input_processor=MapCompose(
            lambda x: x + '-title1',
            lambda x: x + '-title2',
        )
    )
    published_at = scrapy.Field()
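MapCompose applies each function in turn to every value, so with the two lambdas above a title 'foo' becomes 'foo-title1-title2'. A minimal stdlib-only reimplementation (an illustration of the documented behavior, not the itemloaders source) makes the composition concrete:

```python
# Minimal sketch of MapCompose semantics (not the itemloaders implementation):
# each function is applied to every value in order; None results are dropped.
def map_compose(*functions):
    def processor(values):
        for func in functions:
            next_values = []
            for v in values:
                result = func(v)
                if result is not None:   # MapCompose discards None
                    next_values.append(result)
            values = next_values
        return values
    return processor

proc = map_compose(
    lambda x: x + '-title1',
    lambda x: x + '-title2',
)
print(proc(['foo']))  # ['foo-title1-title2']
```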

Using input_processor to perform the regex extraction

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import re
import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join


def get_published_at(val):
    date_re = r'(\d{4}-\d{2}.*:\d{2}:\d{2})'
    res = re.findall(date_re, val)
    return res[0] if res else None  # avoid IndexError when nothing matches


class JuziDetailItem(scrapy.Item):
    title = scrapy.Field(output_processor=TakeFirst())
    published_at = scrapy.Field(
        input_processor=MapCompose(
            get_published_at
        ),
        output_processor=TakeFirst()
    )

Applying a setting to every scrapy.Field

Sometimes we need a setting on every scrapy.Field, for example: we want each field to come out as a single value rather than as a list.

# Older versions may emit this deprecation warning:
scrapy.loader.processors.TakeFirst is deprecated, instantiate itemloaders.processors.TakeFirst instead.
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader


class MyItemLoader(ItemLoader):
    default_output_processor = TakeFirst()

Combining ItemLoader + scrapy.Field

import scrapy
from itemloaders.processors import Join, MapCompose
from scrapy.loader import ItemLoader


class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()


class MyItemLoader(ItemLoader):
    desc_in = MapCompose(
        lambda x: ' '.join(x.split()),
        lambda x: x.upper()
    )

    desc_out = Join()
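The two-stage effect of desc_in/desc_out can be traced with plain Python. This is a hand-rolled sketch of what MapCompose plus Join would do for this field, not the library code:

```python
# Sketch of the desc field's processing pipeline, stdlib only.
def desc_in(values):
    # Input stage: collapse runs of whitespace, then uppercase, per value.
    values = [' '.join(x.split()) for x in values]
    return [x.upper() for x in values]

def desc_out(values):
    # Output stage: Join() with its default ' ' separator.
    return ' '.join(values)

collected = desc_in(['  a   short\n description ', 'second   part'])
print(collected)            # ['A SHORT DESCRIPTION', 'SECOND PART']
print(desc_out(collected))  # A SHORT DESCRIPTION SECOND PART
```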

Input and Output processors

Item Loader provides one input processor and one output processor for each item field. The input processor runs as soon as data is received through add_xpath(), add_css() or add_value(), and its results are collected and kept inside the ItemLoader instance. Once data collection is finished, load_item() populates and returns the Item: the values collected for each field are passed through that field's output processor, and its result is what gets assigned to the item. Consider the following example:

l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1)
l.add_xpath('name', xpath2) # (2)
l.add_css('name', css) # (3)
l.add_value('name', 'test') # (4)
return l.load_item() # (5)
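This flow can be modeled without Scrapy. The sketch below is a simplified illustration of the semantics only (real extraction is replaced by add_value, since there is no page here): each add_* call (1)-(4) runs the input processor immediately and appends its result, and load_item (5) runs the output processor over everything collected.

```python
# Dependency-free model of the input/output processor flow (illustration only).
class MiniLoader:
    def __init__(self, input_processors=None, output_processors=None):
        self._in = input_processors or {}
        self._out = output_processors or {}
        self._collected = {}

    def add_value(self, field, value):
        # Steps (1)-(4): the input processor runs on each value as it arrives,
        # and the result is appended to the field's collected list.
        proc = self._in.get(field, lambda v: v)
        self._collected.setdefault(field, []).append(proc(value))

    def load_item(self):
        # Step (5): the output processor receives everything collected so far.
        item = {}
        for field, values in self._collected.items():
            proc = self._out.get(field, lambda v: v)
            item[field] = proc(values)
        return item

l = MiniLoader(
    input_processors={'name': str.strip},
    output_processors={'name': ', '.join},
)
l.add_value('name', '  Color TV ')   # input processor strips whitespace
l.add_value('name', 'DVD player')
print(l.load_item())  # {'name': 'Color TV, DVD player'}
```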

Declaring Item Loaders

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):

    default_output_processor = TakeFirst()

    name_in = MapCompose(str.title)
    name_out = Join()

    price_in = MapCompose(str.strip)

    # ...

Declaring Input and Output processors

import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )

Customizing the output_processor

Just implement the __call__ method logic.

class TakeFirst:
    """
    Returns the first non-null/non-empty value from the values received,
    so it's typically used as an output processor to single-valued fields.
    It doesn't receive any ``__init__`` method arguments, nor does it accept Loader contexts.

    Example:

    >>> from itemloaders.processors import TakeFirst
    >>> proc = TakeFirst()
    >>> proc(['', 'one', 'two', 'three'])
    'one'
    """

    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value

How __call__ works

A quick aside on a special instance method of Python classes: __call__(). It is analogous to overloading the () operator on the class, so that an instance can be invoked like an ordinary function, in the form instance().

http://c.biancheng.net/view/2380.html

class CLanguage:
    # define the __call__ method
    def __call__(self, name, add):
        print("__call__() invoked:", name, add)


clangs = CLanguage()
clangs("C语言中文网", "http://c.biancheng.net")

Debugging processors

from itemloaders.processors import Join
proc = Join(',')
proc(['one', 'two', 'three'])

# output 'one,two,three'

cheatsheet

  • add_css: item_l.add_css('title', '.post_main .post_title::text') (regular CSS extraction)
  • add_css + re: item_l.add_css('published_at', '.post_main .post_info', re=r'(\d{4}-\d{2}.*:\d{2}:\d{2})') (pair with a regex to extract further)

References