Scrapy notes: starting Scrapy from a program with CrawlerProcess and CrawlerRunner
Start Scrapy from code rather than only with scrapy crawl <spider-name>; this is commonly used for debugging or for launching multiple chunks.
Background
So far we have always started spiders with the scrapy crawl <spider-name> command, but sometimes we need to start a spider from within a program (for example, exposing an API endpoint where an external request tells us which spider to run, and the program launches the corresponding spider). This post covers several ways to start a spider from code.
Method 1: launch with subprocess
- Convenient and easy to understand
- Problem: very resource-hungry
```python
import subprocess

# run from the project root so scrapy.cfg can be found
subprocess.run(['scrapy', 'crawl', 'ithome'])
```
Because every scrapy crawl invocation spins up a brand-new Scrapy engine, starting several spiders this way eats a lot of resources, so it is not really recommended (then why mention it at all?).
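If you do go this route anyway, subprocess.Popen lets you launch several crawls without blocking and wait for them afterwards; each one is still a full Scrapy process, which is exactly where the resource cost comes from. A minimal sketch (the spider names are placeholders):

```python
import subprocess

# each Popen call spawns its own Python interpreter and Scrapy engine,
# which is where the resource cost comes from
spiders = ['ithome', 'another_spider']  # placeholder spider names
procs = [subprocess.Popen(['scrapy', 'crawl', name]) for name in spiders]
for proc in procs:
    proc.wait()  # block until every crawl has exited
```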
Method 2: CrawlerProcess
We can start spiders with the scrapy.crawler.CrawlerProcess class; the scrapy crawl command itself uses this class under the hood.
```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# get_project_settings() reads the project's settings.py;
# these settings have to be handed to the Scrapy engine before the crawl starts
process = CrawlerProcess(get_project_settings())
process.crawl('ithome')
process.start()
```
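One thing worth knowing for debugging (one of the scenarios mentioned at the top): process.crawl() also accepts a Spider subclass instead of the registered name, and any extra keyword arguments are forwarded to the spider. A minimal sketch, assuming the spider class lives at a hypothetical path myproject.spiders.ithome:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from myproject.spiders.ithome import IthomeSpider  # hypothetical import path

process = CrawlerProcess(get_project_settings())
# pass the class directly; keyword arguments are forwarded to the spider
process.crawl(IthomeSpider, tag='debug-run')
process.start()
```

Running such a file directly makes it easy to attach a debugger or set breakpoints inside the spider.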
Run the chunks one at a time: the next chunk starts only after the previous one has finished. process.crawl() returns a Twisted deferred, so we can chain the next chunk onto it instead of registering everything up front:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from uniprot_spider.spiders.ncbi_protein_down_fasta import NcbiProteinDownFastaSpider

settings = get_project_settings()
process = CrawlerProcess(settings)
count = len(NcbiProteinDownFastaSpider.get_chunks())

def crawl_chunk(_, index=0):
    # each chunk starts only after the previous chunk's deferred has fired
    if index < count:
        d = process.crawl('ncbi_protein_down_fasta', index=index)
        d.addCallback(crawl_chunk, index + 1)
        return d

crawl_chunk(None)
process.start()
```
Register all the chunks first and then start them together, so they run concurrently:
- Memory usage is very high; when other processes are already consuming resources, the crawls may fail to start at all
- It sometimes raises the error: reactor already installed (I have not found the cause of this error yet)
```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from uniprot_spider.spiders.ncbi_protein_down_fasta import NcbiProteinDownFastaSpider

settings = get_project_settings()
process = CrawlerProcess(settings)
count = len(NcbiProteinDownFastaSpider.get_chunks())

for i in range(0, count):
    process.crawl('ncbi_protein_down_fasta', index=i)
process.start()
```
Method 3: CrawlerRunner
If your program already uses Twisted for other asynchronous tasks, the official docs recommend starting spiders with scrapy.crawler.CrawlerRunner instead, so they can share the same Twisted reactor as the rest of the program.
```python
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# get_project_settings() reads the project's settings.py;
# these settings have to be handed to the Scrapy engine before the crawl starts
runner = CrawlerRunner(get_project_settings())
d = runner.crawl('myspider')
d.addBoth(lambda _: reactor.stop())
reactor.run()
```
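Unlike CrawlerProcess, CrawlerRunner does not set up logging for you, so the official docs call scrapy.utils.log.configure_logging() first. The same docs also show how to run spiders sequentially by chaining deferreds with inlineCallbacks; a sketch of that pattern applied to the chunked spider from this project (spider name and get_chunks() as defined below):

```python
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from uniprot_spider.spiders.ncbi_protein_down_fasta import NcbiProteinDownFastaSpider

configure_logging()  # CrawlerRunner does not configure logging by itself
runner = CrawlerRunner(get_project_settings())
count = len(NcbiProteinDownFastaSpider.get_chunks())

@defer.inlineCallbacks
def crawl():
    # each yield waits for the current chunk to finish before starting the next
    for i in range(count):
        yield runner.crawl('ncbi_protein_down_fasta', index=i)
    reactor.stop()

crawl()
reactor.run()
```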
Let's look at the download spider used in the examples above.
```python
import scrapy
import jsw_nx as nx
from scrapy.exceptions import CloseSpider
from uniprot_spider.models.ncbi_protein import NcbiProtein
from uniprot_spider.items import NcbiProteinDownItem


class NcbiProteinDownFastaSpider(scrapy.Spider):
    name = 'ncbi_protein_down_fasta'
    allowed_domains = ['www.baidu.com', 'ncbi.nlm.nih.gov']
    handle_httpstatus_list = [400]
    custom_settings = {
        'CONCURRENT_REQUESTS': 100,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'
    }

    @classmethod
    def get_chunks(cls):
        # split the not-yet-crawled records into chunks of 50000
        chunks = list(NcbiProtein.where({"is_crawled": False}).chunk(50000))
        return chunks

    @property
    def records(self):
        # the chunk this spider instance works on, selected by the `index` spider argument
        chunks = self.get_chunks()
        return chunks[int(self.index)]

    def start_requests(self):
        records = self.records
        total = len(records)
        for index, entity in enumerate(records):
            id = entity.id
            self.logger.info(f'Current progress: {index} / {total}')
            yield scrapy.Request(url=f'https://www.baidu.com?id={id}', callback=self.parse, meta={'entity': entity})

    def parse(self, response):
        try:
            entity = response.meta['entity']
            fasta_url = entity.fasta_url
            genpept_url = entity.genpept_url
            item = NcbiProteinDownItem()
            item['file_urls'] = [fasta_url, genpept_url]
            item['protein_id'] = entity.protein_id
            item['id'] = entity.id
            yield item
        except Exception as e:
            self.logger.error(e)
            raise CloseSpider('Crawl failed')
```
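A note on how the pieces fit together: the index value passed to process.crawl() / runner.crawl() becomes a spider attribute (Scrapy forwards keyword arguments to the spider), which is what the records property reads back, and the file_urls field suggests the actual downloads are handled by a files pipeline configured elsewhere in the project (an assumption; that configuration is not shown here). To debug a single chunk from code, a minimal sketch (chunk 0 assumed):

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from uniprot_spider.spiders.ncbi_protein_down_fasta import NcbiProteinDownFastaSpider

process = CrawlerProcess(get_project_settings())
# index is forwarded to the spider and read back via self.index in `records`
process.crawl(NcbiProteinDownFastaSpider, index=0)
process.start()
```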
Task file: tasks/ncbi_protein_down_fasta.py
```python
from scrapy.utils.project import get_project_settings
from uniprot_spider.spiders.ncbi_protein_down_fasta import NcbiProteinDownFastaSpider
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

settings = get_project_settings()
runner = CrawlerRunner(settings)
count = len(NcbiProteinDownFastaSpider.get_chunks())

# register one crawl per chunk, then stop the reactor once all of them have finished
for i in range(0, count):
    runner.crawl('ncbi_protein_down_fasta', index=i)

d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
```
Process supervision: pm2
Run the command below; the ecosystem.config.js configuration file is shown underneath.

```bash
pm2 start ecosystem.config.js --only "ncbi_down"
```
```js
module.exports = {
  apps: [
    {
      name: 'ncbi_down',
      interpreter: '/root/.cache/pypoetry/virtualenvs/uniprot-spider-A_j0jmEh-py3.10/bin/python',
      namespace: 'uniprot',
      script: './tasks/ncbi_protein_down_fasta.py',
      ignore_watch: ['node_modules', 'logs', 'tmp', '*.pyc']
    }
  ]
};
```
Running in multiple processes
```python
import multiprocessing
import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]


def crawl(tasks):
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, index=0):
        # within one worker process the tasks are chained, i.e. they run one after another
        if index < len(tasks):
            deferred = process.crawl('test_spider', task=tasks[index])
            deferred.addCallback(run_spider, index + 1)
            return deferred

    run_spider(None)
    process.start()


def main():
    # each worker process gets its own reactor and its own share of the tasks
    processes = 2
    with multiprocessing.Pool(processes) as pool:
        pool.map(crawl, np.array_split(tasks, processes))


if __name__ == '__main__':
    main()
```
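The multiprocessing split is what allows several Twisted reactors to run at the same time, since a reactor cannot be restarted within a single process. Applied to the chunked NCBI spider from above, a sketch might look like this (the number of worker processes is an arbitrary choice):

```python
import multiprocessing
import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from uniprot_spider.spiders.ncbi_protein_down_fasta import NcbiProteinDownFastaSpider


def crawl(indexes):
    # one CrawlerProcess (and one reactor) per worker process
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, pos=0):
        # the chunks assigned to this worker run sequentially
        if pos < len(indexes):
            d = process.crawl('ncbi_protein_down_fasta', index=int(indexes[pos]))
            d.addCallback(run_spider, pos + 1)
            return d

    run_spider(None)
    process.start()


def main():
    count = len(NcbiProteinDownFastaSpider.get_chunks())
    workers = 4  # assumption: tune this to the machine's memory
    with multiprocessing.Pool(workers) as pool:
        pool.map(crawl, np.array_split(np.arange(count), workers))


if __name__ == '__main__':
    main()
```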
References
- https://ithelp.ithome.com.tw/articles/10228616
- https://docs.scrapy.org/en/latest/topics/practices.html
- https://www.anycodings.com/1questions/1073754/running-multiple-spiders-in-scrapy-for-1-website-in-parallel
- https://stackoverflow.com/questions/61194207/how-can-i-make-selenium-run-in-parallel-with-scrapy