Learning Scrapy: Developing a Middleware

How to develop a middleware
Updated: 2024-03-09 09:44:36

Write the configuration first

First, you need to create a list of User-Agent strings. This list can live in the settings.py file:

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    # add more user agent strings as needed...
]

Create the middleware

Next, create a middleware named RandomUserAgentMiddleware. By default this goes in middlewares.py:

import random

class RandomUserAgentMiddleware:
    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware, handing us the
        # crawler so we can read USER_AGENT_LIST from settings.py.
        return cls(
            user_agent_list=crawler.settings.get("USER_AGENT_LIST")
        )

    def process_request(self, request, spider):
        # Called for every outgoing request: pick a random User-Agent.
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
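To see what process_request actually does to a request, here is a minimal self-contained sketch that exercises the middleware outside of Scrapy. The FakeCrawler and FakeRequest classes are hypothetical stand-ins for illustration only, not Scrapy APIs; the middleware itself is reproduced from above so the snippet runs on its own.

```python
import random

# The middleware from above, reproduced so this sketch is self-contained.
class RandomUserAgentMiddleware:
    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(user_agent_list=crawler.settings.get("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agent_list)

# Hypothetical stand-ins for Scrapy's crawler and request objects.
class FakeCrawler:
    def __init__(self, settings):
        self.settings = settings  # a plain dict works: only .get() is used

class FakeRequest:
    def __init__(self, url):
        self.url = url
        self.headers = {}

ua_list = ["UA-1", "UA-2", "UA-3"]
mw = RandomUserAgentMiddleware.from_crawler(FakeCrawler({"USER_AGENT_LIST": ua_list}))

req = FakeRequest("https://example.com")
mw.process_request(req, spider=None)
# The request now carries one of the configured User-Agent strings.
assert req.headers['User-Agent'] in ua_list
```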

Usage

Finally, register the new middleware in the DOWNLOADER_MIDDLEWARES setting in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}
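One caveat worth knowing: Scrapy ships its own UserAgentMiddleware, enabled by default at order 400, which sets User-Agent from the USER_AGENT setting. To avoid the two middlewares competing for the same header, a common approach is to disable the built-in one by mapping it to None:

```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    # Disable Scrapy's built-in user-agent middleware so it cannot
    # overwrite the header our middleware just set.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```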

Refining the parameters

  • Add a USER_AGENT_VERBOSE setting to make debugging easier
  • Use **kwargs instead of the original parameter list, so new options can be added without changing the constructor

class RandomUserAgentMiddleware:
    def __init__(self, **kwargs):
        # kwargs is a plain dict of setting name -> value, so the
        # constructor signature never changes when options are added.
        self.settings = kwargs

    @classmethod
    def from_crawler(cls, crawler):
        # Settings is a Mapping, so it can be unpacked into kwargs.
        return cls(**crawler.settings)

    def process_request(self, request, spider):
        user_agent = random.choice(self.settings.get("USER_AGENT_LIST"))
        request.headers['User-Agent'] = user_agent
        # self.settings is now a plain dict, not a Settings object,
        # so use get() rather than Settings helpers like getbool().
        if self.settings.get("USER_AGENT_VERBOSE"):
            spider.logger.info(f'Request: {request.url} using User-Agent: {user_agent}')
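The kwargs-based version can be checked the same way, this time also capturing the verbose log line. FakeSpider, FakeLogger, and FakeRequest are hypothetical stand-ins for illustration, not Scrapy classes; the middleware is reproduced here in a self-contained form that uses plain dict lookups in place of Scrapy's Settings helpers.

```python
import random

class RandomUserAgentMiddleware:
    def __init__(self, **kwargs):
        self.settings = kwargs  # plain dict of setting name -> value

    def process_request(self, request, spider):
        user_agent = random.choice(self.settings.get("USER_AGENT_LIST"))
        request.headers['User-Agent'] = user_agent
        if self.settings.get("USER_AGENT_VERBOSE"):
            spider.logger.info(f'Request: {request.url} using User-Agent: {user_agent}')

# Hypothetical stand-ins, for illustration only.
class FakeLogger:
    def __init__(self):
        self.messages = []
    def info(self, msg):
        self.messages.append(msg)

class FakeSpider:
    def __init__(self):
        self.logger = FakeLogger()

class FakeRequest:
    def __init__(self, url):
        self.url = url
        self.headers = {}

mw = RandomUserAgentMiddleware(
    USER_AGENT_LIST=["UA-1", "UA-2"],
    USER_AGENT_VERBOSE=True,
)
spider = FakeSpider()
req = FakeRequest("https://example.com")
mw.process_request(req, spider)

assert req.headers['User-Agent'] in ["UA-1", "UA-2"]
assert len(spider.logger.messages) == 1  # verbose mode logged one line
```

Because USER_AGENT_VERBOSE is read with a plain get(), any falsy or missing value silently disables logging, which keeps the setting optional.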
