Learning Scrapy: Developing a Middleware
How to develop a middleware
Write the configuration first
First, you need to create a User-Agent list. This list can live in the settings.py file:
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    # add more user agent strings as needed...
]
Create the middleware
Next, create a middleware named RandomUserAgentMiddleware. By default, Scrapy projects keep middleware classes in middlewares.py.
import random


class RandomUserAgentMiddleware:
    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user_agent_list=crawler.settings.get("USER_AGENT_LIST")
        )

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
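The rotation logic itself can be exercised outside a running crawl with a simple stand-in object. A minimal sketch (the `StubRequest` class below is a hypothetical test double, not part of Scrapy):

```python
import random

USER_AGENT_LIST = [
    "agent-a",
    "agent-b",
    "agent-c",
]

class StubRequest:
    """Minimal stand-in for scrapy.Request: only url and a headers dict."""
    def __init__(self, url):
        self.url = url
        self.headers = {}

class RandomUserAgentMiddleware:
    def __init__(self, user_agent_list):
        self.user_agent_list = user_agent_list

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agent_list)

mw = RandomUserAgentMiddleware(USER_AGENT_LIST)
req = StubRequest("https://example.com")
mw.process_request(req, spider=None)
print(req.headers['User-Agent'])  # one entry from USER_AGENT_LIST
```

Each call to process_request overwrites the header, so every request gets a freshly chosen User-Agent.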
Usage
Finally, register the newly created middleware in the DOWNLOADER_MIDDLEWARES setting in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}
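Note that Scrapy ships its own UserAgentMiddleware (priority 500 in DOWNLOADER_MIDDLEWARES_BASE), which only sets a default User-Agent when none is present. If you prefer to be explicit that only the custom middleware sets the header, you can disable the built-in one by mapping it to None:

```python
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    # Disable Scrapy's built-in User-Agent handling
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```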
Refining the parameters
- Add a USER_AGENT_VERBOSE setting to make debugging easier
- Pass the settings as kwargs instead of an explicit parameter list, making the middleware easier to extend
import random


class RandomUserAgentMiddleware:
    def __init__(self, **kwargs):
        self.settings = kwargs

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings is dict-like, so ** expands it into kwargs
        return cls(**crawler.settings)

    def process_request(self, request, spider):
        user_agent = random.choice(self.settings.get("USER_AGENT_LIST"))
        request.headers['User-Agent'] = user_agent
        # kwargs is a plain dict here, so use get() rather than Settings.getbool()
        if self.settings.get("USER_AGENT_VERBOSE"):
            spider.logger.info(f'Request: {request.url} using User-Agent: {user_agent}')
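Because process_request now reads everything from self.settings, the verbose path can be checked with plain test doubles. A sketch under that assumption (the Stub* and ListLogger classes are hypothetical stand-ins, not Scrapy APIs):

```python
import random

class RandomUserAgentMiddleware:
    """kwargs-based variant; self.settings is a plain dict here."""
    def __init__(self, **kwargs):
        self.settings = kwargs

    def process_request(self, request, spider):
        user_agent = random.choice(self.settings.get("USER_AGENT_LIST"))
        request.headers['User-Agent'] = user_agent
        if self.settings.get("USER_AGENT_VERBOSE"):
            spider.logger.info(f'Request: {request.url} using User-Agent: {user_agent}')

# --- test doubles, not part of Scrapy ---
class ListLogger:
    def __init__(self):
        self.messages = []
    def info(self, msg):
        self.messages.append(msg)

class StubSpider:
    def __init__(self):
        self.logger = ListLogger()

class StubRequest:
    def __init__(self, url):
        self.url = url
        self.headers = {}

mw = RandomUserAgentMiddleware(
    USER_AGENT_LIST=["agent-a", "agent-b"],
    USER_AGENT_VERBOSE=True,
)
spider = StubSpider()
req = StubRequest("https://example.com")
mw.process_request(req, spider)
print(spider.logger.messages[0])
```

With USER_AGENT_VERBOSE set to a falsy value (or omitted), the middleware stays silent and only sets the header.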
References
- https://blog.csdn.net/qq_41456723/article/details/107804728
- https://github.com/alo7i/spider-zhishiq
- https://www.zenrows.com/blog/scrapy-user-agent#get-random-ua-at-scale
- https://docs.scrapy.org/en/latest/_modules/scrapy/downloadermiddlewares/useragent.html
- https://www.cnblogs.com/Neeo/articles/11525001.html