ncbi: ipg 爬虫相关问题记录

关于数据提取过程中的 ncbi_session_id 过期问题处理

RestClient 配置

@id = 558016069
@format = json

### get nsbi_sid
POST https://www.ncbi.nlm.nih.gov/sviewer/ipg/ipg.cgi
user-agent: Mozilla/5.0 zgrab/0.x

db=ipg&solr=1&id={{id}}

### 557992326 - NOT
POST https://www.ncbi.nlm.nih.gov/sviewer/ipg/ipg.cgi
content-type: application/x-www-form-urlencoded
cookie: ncbi_sid=CE8BBFD72B412CA1_2139SID; NCBI-PHID=CE8BBFD72B3E9E1100000000085B02D1.m_13;
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36

query=omitHeader%3Dfalse%26wt%3D{{ format }}%26indent%3Dtrue%26rows%3D50%26start%3D0%26q%3Drecordset%253A{{ id }}%2520NOT%2520product_i%253A1%26sort%3Dpriority_i%2520asc

解释

请求1会发送id的相关信息过去，产生一个 ncbi_session_id
后面再用这个id，就可以顺利取到相关数据

采集列表

以 “CRISPR-associated“ 关键词，采集这个关键词列表的数据

### GET qky
GET https://www.ncbi.nlm.nih.gov/ipg?term=CRISPR-associated

### simple
# <input name="EntrezSystem2.PEntrez.DbConnector.LastQueryKey" sid="1" type="hidden" value="1">
POST https://www.ncbi.nlm.nih.gov/ipg
cookie: ncbi_sid=CE8A78F52B64F241_1325SID;

term=CRISPR-associated+&EntrezSystem2.PEntrez.Ipg.Ipg_ResultsPanel.Ipg_DisplayBar.PageSize=20&EntrezSystem2.PEntrez.Ipg.Ipg_ResultsPanel.Entrez_Pager.CurrPage=1&EntrezSystem2.PEntrez.DbConnector.LastQueryKey=1&EntrezSystem2.PEntrez.DbConnector.Cmd=PageChanged

原理

先发送”CRISPR-associated“ 的 GET 请求
1. 生成临时的 session(会有一段时间有效)
2. 同时返回2个关相的KEY
  1. ncbi_sid=CE8A78F52B64F241_1325SID
  2. EntrezSystem2.PEntrez.DbConnector.LastQueryKey
根据第1步生成的2个Key，再发送第2步POST请求，分页获取数据

term:CRISPR-associated+
EntrezSystem2.PEntrez.Ipg.Ipg_ResultsPanel.Ipg_DisplayBar.PageSize:200 # 每页取
EntrezSystem2.PEntrez.Ipg.Ipg_ResultsPanel.Entrez_Pager.CurrPage:1 # 当前页
EntrezSystem2.PEntrez.DbConnector.LastQueryKey:1 # 需要动态取， 随 session_id 动态取
EntrezSystem2.PEntrez.DbConnector.Cmd:PageChanged # 除第1页之外的都得带，所以，通用情况下，都带上这一个，否则分页不生效

Python 的实现

import requests
import jsw_nx as nx

from bs4 import BeautifulSoup

# ========== step1: reset ncbi_session_id/last_query_key =======
keyword = 'CRISPR-associated'

url = f'https://www.ncbi.nlm.nih.gov/ipg?term={keyword}'
res1 = requests.get(url)
soup1 = BeautifulSoup(res1.text, 'html.parser')
last_query_key_el = soup1.select_one(
    '[name="EntrezSystem2.PEntrez.DbConnector.LastQueryKey"]')

# string: 1/2/3/4
last_query_key = last_query_key_el.attrs['value']
# string: CE8C42592B6531E1_1815SID
ncbi_session_id = res1.headers.get('NCBI-SID')

# ======== step2: get pagination data =====

post_data = {
    'term': 'CRISPR-associated',
    'EntrezSystem2.PEntrez.Ipg.Ipg_ResultsPanel.Ipg_DisplayBar.PageSize': 20,
    'EntrezSystem2.PEntrez.Ipg.Ipg_ResultsPanel.Entrez_Pager.CurrPage': 1,
    'EntrezSystem2.PEntrez.DbConnector.LastQueryKey': last_query_key,
    'EntrezSystem2.PEntrez.DbConnector.Cmd': 'PageChanged'
}

res2 = requests.post("https://www.ncbi.nlm.nih.gov/ipg", data=post_data, headers={
    'cookie': f'ncbi_sid={ncbi_session_id}',
})
soup2 = BeautifulSoup(res2.text, 'html.parser')

print(
    soup2.select_one('title'), 
    soup2.select_one('[name="ncbi_pageno"]'), 
    soup2.select_one('[name="ncbi_pagesize"]')
)

利用 jsw-bio

安装
采集 ids

pip install jsw-bio -U

按分页采集ids

import jsw_bio as bio

inst1 = bio.NcbiSearchIpg(term="crispr")
inst2 = bio.NcbiSearchProtein(term="crispr")
ids1 = inst1.get(1, size=20)
ids2 = inst2.get(1, size=20)

print(ids1, ids2)

RestClient 配置

解释

采集列表

原理

Python 的实现

利用 jsw-bio

按分页采集ids

相关链接