Troubleshooting a case of high CPU usage
Solving the problem with an nginx configuration change
Background
Google's dashboard reported a high number of failed crawls over the past week. Checking the server showed heavy CPU usage, above 80% at peak.
Finding the problem
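One way to spot which clients generate the load is to count requests per User-Agent in the access log. The sketch below assumes nginx's default `combined` log format; the function name `top_user_agents` and the log path in the usage note are illustrative, not taken from the original setup.

```shell
#!/usr/bin/env bash
# Count requests per User-Agent in an nginx access log.
# Assumes the default "combined" log format, in which the User-Agent is the
# sixth field when a line is split on double quotes.
top_user_agents() {
    local log_file="$1"      # path to the access log (assumed, adjust as needed)
    local limit="${2:-20}"   # number of rows to print, 20 by default
    awk -F'"' '{ print $6 }' "$log_file" \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "$limit"
}
```

For example, `top_user_agents /var/log/nginx/access.log 30` would list the 30 most frequent User-Agent strings with their request counts, which makes aggressive crawlers easy to pick out.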
Blocking the problematic crawlers
For the common case, create a ban-spider.conf file:
# Set $blocked_ua to 1 when the User-Agent matches a listed crawler or tool.
# "~*" makes the regex case-insensitive, so case variants in the list are redundant.
map $http_user_agent $blocked_ua {
    default 0;
    ~*(MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|SeznamBot|Curl|HttpClient|Crawler|crawler|Nimbostratus-Bot|MRA58N|LMY47V|python-requests|ChatGLM-Spider|Amazonbot|Web-Crawler|GPTBot) 1;
}
Use the variable defined in ban-spider.conf:
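# ban-spider.conf has to be included at the http level, because map is only valid there.
upstream docify-rails {
    server 127.0.0.1:3002;
}
# NGINX Server Instance
server {
    listen 0.0.0.0:80;
    listen 443 ssl;
    # ....
    # Reject requests whose User-Agent matched the map in ban-spider.conf.
    if ($blocked_ua) {
        return 403;
    }
    # Answer any request whose URI contains ".php" with 410 Gone.
    if ($request_uri ~* \.php) {
        return 410;
    }
}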
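The new rules only take effect once nginx has reloaded its configuration. A minimal sketch, assuming the `nginx` binary is on the PATH and the shell has enough privileges (it may need `sudo`); the function name `reload_nginx_if_valid` is made up for illustration.

```shell
#!/usr/bin/env bash
# Reload nginx only when the configuration passes the syntax check.
reload_nginx_if_valid() {
    # "nginx -t" parses the whole configuration and reports errors;
    # "nginx -s reload" asks the running master process to re-read it.
    # The reload is skipped when the check fails.
    nginx -t && nginx -s reload
}
```

Chaining the two commands with `&&` means a typo in ban-spider.conf never reaches the reload step.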
Testing the crawl
curl -I -A 'Baiduspider' www.test.com
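To compare several User-Agent strings quickly, the same request can be wrapped so that only the HTTP status code is printed. This is a sketch: `ua_status` is a hypothetical helper name, and www.test.com is the placeholder host used above.

```shell
#!/usr/bin/env bash
# Print only the HTTP status code returned for a given User-Agent.
ua_status() {
    local user_agent="$1"
    local url="$2"
    # -s: no progress output, -I: send a HEAD request,
    # -o /dev/null: discard the response headers,
    # -w '%{http_code}': print just the status code.
    curl -s -I -o /dev/null -A "$user_agent" -w '%{http_code}' "$url"
}

# Usage (not run automatically):
#   ua_status 'AhrefsBot'   www.test.com   # listed in the map, so 403 is expected once the rule is active
#   ua_status 'Baiduspider' www.test.com   # not in the list, so this rule should not reject it
```

Checking one blocked and one allowed User-Agent side by side helps confirm that the map rejects only the intended clients.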