Learning Python: Biopython

A toolkit that is well suited to biology work
Updated: 2022-11-10 10:45:01

Installation

pip install biopython

Some useful links

Common features

Feature / Code
Get the list of databases
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
handle = Entrez.einfo()
result = handle.read()


print(result)
Fetch a record by protein_id
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.efetch(db="protein", id="WP_190432046.1",
                       rettype="gb", retmode="text")
print(handle.read())
Get IDs by keyword (term)
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.esearch(db="protein", retmax=10,
                        term="cas12a", idtype="acc")
record = Entrez.read(handle)

print(record)
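Entrez.read returns a dictionary-like object. For esearch, the two keys used most often are "Count" (the total number of hits, delivered as a string) and "IdList" (the IDs actually returned, at most retmax of them). A minimal sketch of pulling both out; summarize_esearch is a hypothetical helper name, not part of Biopython, and the stand-in values are placeholders rather than real NCBI output:

```python
def summarize_esearch(record):
    """Return (total_hits, ids) from a parsed esearch result.

    `record` is the dict-like object returned by Entrez.read(handle).
    "Count" arrives as a string, so it is converted to int here.
    """
    total_hits = int(record["Count"])
    ids = [str(item) for item in record["IdList"]]
    return total_hits, ids


# Stand-in for a parsed result (placeholder values, not real NCBI output)
example = {"Count": "28", "IdList": ["ID_A", "ID_B"]}
total, ids = summarize_esearch(example)
print(f"{len(ids)} of {total} hits returned")
```

With a live query, pass the result of Entrez.read(handle) in place of the stand-in dictionary. If the number of returned IDs is smaller than the total, raise retmax or page through the results.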
Fetch detailed information by ID
from Bio import SeqIO
from Bio import Entrez
Entrez.email = 'A.N.Other@example.com'
handle = Entrez.efetch(db="protein", id='WP_190432046.1',
                       rettype="gb", retmode="text")
records = SeqIO.parse(handle, "genbank")

for record in records:
    print(dir(record))
    print(record.id)
    # print(record.seq)
    # print(record.format)
    print(record.name)
    print(record.annotations)
    print(record.features)
Fetch detailed information by PMID
from Bio import Entrez, Medline
Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are

handle = Entrez.efetch(db="pubmed", id='10021369', retmode="text", rettype="medline")
records = Medline.parse(handle)

record = next(records)

print(record['PMID'])
print(record['TI'])
print(record['AB'])
print(record['AU'])
print(record['DP'])
print(record['JT'])
print(record['MH'])
print(record['SO'])
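A Medline record only contains the fields that are present for that article, so indexing with record['AB'] raises KeyError when, for example, there is no abstract. Since the record behaves like a dictionary, .get with a default avoids that. A small sketch; summarize_medline is a hypothetical helper name and the stand-in record uses placeholder values:

```python
def summarize_medline(record):
    """Build a flat summary from a Medline record, tolerating missing fields."""
    return {
        "pmid": record.get("PMID", ""),
        "title": record.get("TI", ""),
        "abstract": record.get("AB", ""),            # absent when there is no abstract
        "authors": "; ".join(record.get("AU", [])),  # AU is a list of author names
        "published": record.get("DP", ""),
        "journal": record.get("JT", ""),
        "mesh_terms": list(record.get("MH", [])),
    }


# Stand-in record with no abstract and no MeSH terms (placeholder values)
example = {"PMID": "10021369", "TI": "Example title", "AU": ["Author A", "Author B"]}
summary = summarize_medline(example)
print(summary["authors"])
print(repr(summary["abstract"]))
```

With real data, pass each record yielded by Medline.parse(handle) to the helper.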

Common fields

from Bio import SeqIO
from Bio import Entrez
Entrez.email = 'A.N.Other@example.com'
handle = Entrez.efetch(db="protein", id='7EU9_A',
                       rettype="gb", retmode="text")
records = SeqIO.parse(handle, "genbank")
record = next(records)
# ['annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'lower', 'name', 'reverse_complement', 'seq', 'translate', 'upper']
print(record.id)
print(record.name)
print(record.seq)
print(record.format('fasta'))
Field / Example
id: 7EU9_A
name: 7EU9_A
seq
XSNKEKNASETRKAYTTKXIPRSHDRXKLLGNFXDYLXDGTPIFFELWNQFGGGIDRDIISGTANKDKISDDLLLAVNWFKVXPINSKPQGVS
PSNLANLFQQYSGSEPDIQAQEYFASNFDTEKHQWKDXRVEYERLLAELQLSRSDXHHDLKLXYKEKCIGLSLSTAHYITSVXFGTGAKNNRQ
TKHQFYSKVIQLLEESTQINSVEQLASIILKAGDCDSYRKLRIRCSRKGATPSILKIVQDYELGTNHDDEVNVPSLIANLKEKLGRFEYECEW
KCXEKIKAFLASKVGPYYLGSYSAXLENALSPIKGXTTKNCKFVLKQIDAKNDIKYENEPFGKIVEGFFDSPYFESDTNVKWVLHPHHIGESN
IKTLWEDLNAIHSKYEEDIASLSEDKKEKRIKVYQGDVCQTINTYCEEVGKEAKTPLVQLLRYLYSRKDDIAVDKIIDGITFLSKKHKVEKQK
INPVIQKYPSFNFGNNSKLLGKIISPKDKLKHNLKCNRNQVDNYIWIEIKVLNTKTXRWEKHHYALSSTRFLEEVYYPATSENPPDALAARFR
TKTNGYEGKPALSAEQIEQIRSAPVGLRKVKKRQXRLEAARQQNLLPRYTWGKDFNINICKRGNNFEVTLATKVKKKKEKNYKVVLGYAANIV
RKNTYAAIEAHANGDGVIDYNDLPVKPIESGFVTVESQVRDKSYDQLSYNGVKLLYCKPHVESRRSFLEKYRNGTXKDNRGNNIQIDFXKDFE
AIADDETSLYYFNXKYCKLLQSSIRNHSSQAKEYREEIFELLRDGKLSVLKLSSLSNLSFVXFKVAKSLIGTYFGHLLKKPKNSKSDVKAPPI
TDEDKQKADPEXFALRLALEEKRLNKVKSKKEVIANKIVAKALELRDKYGPVLIKGENISDTTKKGKKSSTNSFLXDWLARGVANKVKEXVXX
HQGLEFVEVNPNFTSHQDPFVHKNPENTFRARYSRCTPSELTEKNRKEILSFLSDKPSKRPTNAYYNEGAXAFLATYGLKKNDVLGVSLEKFK
QIXANILHQRSEDQLLFPSRGGXFYLATYKLDADATSVNWNGKQFWVCNADLVAAYNVGLVDIQKDFKKKLEHHHHHH
fasta
>7EU9_A Chain A, Cas12i1 D647A mutant
XSNKEKNASETRKAYTTKXIPRSHDRXKLLGNFXDYLXDGTPIFFELWNQFGGGIDRDII
SGTANKDKISDDLLLAVNWFKVXPINSKPQGVSPSNLANLFQQYSGSEPDIQAQEYFASN
FDTEKHQWKDXRVEYERLLAELQLSRSDXHHDLKLXYKEKCIGLSLSTAHYITSVXFGTG
AKNNRQTKHQFYSKVIQLLEESTQINSVEQLASIILKAGDCDSYRKLRIRCSRKGATPSI
LKIVQDYELGTNHDDEVNVPSLIANLKEKLGRFEYECEWKCXEKIKAFLASKVGPYYLGS
YSAXLENALSPIKGXTTKNCKFVLKQIDAKNDIKYENEPFGKIVEGFFDSPYFESDTNVK
WVLHPHHIGESNIKTLWEDLNAIHSKYEEDIASLSEDKKEKRIKVYQGDVCQTINTYCEE
VGKEAKTPLVQLLRYLYSRKDDIAVDKIIDGITFLSKKHKVEKQKINPVIQKYPSFNFGN
NSKLLGKIISPKDKLKHNLKCNRNQVDNYIWIEIKVLNTKTXRWEKHHYALSSTRFLEEV
YYPATSENPPDALAARFRTKTNGYEGKPALSAEQIEQIRSAPVGLRKVKKRQXRLEAARQ
QNLLPRYTWGKDFNINICKRGNNFEVTLATKVKKKKEKNYKVVLGYAANIVRKNTYAAIE
AHANGDGVIDYNDLPVKPIESGFVTVESQVRDKSYDQLSYNGVKLLYCKPHVESRRSFLE
KYRNGTXKDNRGNNIQIDFXKDFEAIADDETSLYYFNXKYCKLLQSSIRNHSSQAKEYRE
EIFELLRDGKLSVLKLSSLSNLSFVXFKVAKSLIGTYFGHLLKKPKNSKSDVKAPPITDE
DKQKADPEXFALRLALEEKRLNKVKSKKEVIANKIVAKALELRDKYGPVLIKGENISDTT
KKGKKSSTNSFLXDWLARGVANKVKEXVXXHQGLEFVEVNPNFTSHQDPFVHKNPENTFR
ARYSRCTPSELTEKNRKEILSFLSDKPSKRPTNAYYNEGAXAFLATYGLKKNDVLGVSLE
KFKQIXANILHQRSEDQLLFPSRGGXFYLATYKLDADATSVNWNGKQFWVCNADLVAAYN
VGLVDIQKDFKKKLEHHHHHH
annotations
annotations = {
    'topology': 'linear', 
    'data_file_division': 'BCT', 
    'date': '01-JUL-2021', 
    'accessions': ['7EU9_A'], 
    'db_source': 'pdb: molecule 7EU9, chain A, release Jun 23, 2021; ...', 
    'keywords': [''], 
    'source': 'Lachnospiraceae bacterium ND2006', 
    'organism': 'Lachnospiraceae bacterium ND2006', 
    'taxonomy': ['Bacteria', 'Firmicutes', 'Clostridia', 'Eubacteriales', 'Lachnospiraceae'], 
    'references': [Reference(title='Mechanistic insights into ...], 
    'comment': 'Crystal structure of the selenometh...', 
    'molecule_type': 'protein'
}
features
type: SecStr
location: [168:176]
qualifiers:
    Key: note, Value: ['helix 7']
    Key: sec_str_type, Value: ['helix']

type: NonStdRes
location: [175:176]
qualifiers:
    Key: non_std_residue, Value: ['MSE']

type: Region
location: [180:275]
qualifiers:
    Key: note, Value: ['NCBI Domains']
    Key: region_name, Value: ['Domain 3']
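
Each entry in record.features carries a type, a location with start and end positions, and a qualifiers dictionary, as in the listing above. A minimal sketch of flattening them into rows, optionally filtered by type. feature_rows is a hypothetical helper name; the stand-in objects below only mimic the attributes used here, with values copied from the Region example above:

```python
from types import SimpleNamespace


def feature_rows(record, feature_type=None):
    """Return (type, start, end, qualifiers) tuples for a record's features.

    If `feature_type` is given (e.g. "Region"), only matching features are kept.
    """
    rows = []
    for feature in record.features:
        if feature_type is not None and feature.type != feature_type:
            continue
        start = int(feature.location.start)
        end = int(feature.location.end)
        rows.append((feature.type, start, end, dict(feature.qualifiers)))
    return rows


# Stand-in objects that mimic a parsed record (not real Biopython classes)
region = SimpleNamespace(
    type="Region",
    location=SimpleNamespace(start=180, end=275),
    qualifiers={"note": ["NCBI Domains"], "region_name": ["Domain 3"]},
)
helix = SimpleNamespace(
    type="SecStr",
    location=SimpleNamespace(start=168, end=176),
    qualifiers={"note": ["helix 7"], "sec_str_type": ["helix"]},
)
demo_record = SimpleNamespace(features=[helix, region])

for row in feature_rows(demo_record, feature_type="Region"):
    print(row)
```

With real data, pass the record obtained from SeqIO.parse(handle, "genbank") instead of the stand-in.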

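A more convenient approach to collecting data

  • Use the code below to get the id_list with esearch (cas12: 28 records / cas15: 20 records)
  • Store the IDs in a database; each one can then be fetched via a link
  • With an ID in hand, fetch the GenBank record with efetch and process it into the data you need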
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
handle = Entrez.esearch(db="protein", term="cas12", retmax=1000000)
record = Entrez.read(handle)

print(record["IdList"], len(record["IdList"]))
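
The "store the IDs in a database" step can be sketched with the standard-library sqlite3 module. The table name, column names, and the store_ids helper are all assumptions made for illustration; the source does not say which database or schema is used:

```python
import sqlite3


def store_ids(conn, term, ids):
    """Insert IDs for a search term, skipping duplicates; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protein_ids ("
        "id TEXT PRIMARY KEY, term TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO protein_ids (id, term) VALUES (?, ?)",
        [(str(item), term) for item in ids],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM protein_ids").fetchone()[0]


# In-memory database for the demo; use a file path to persist the IDs
conn = sqlite3.connect(":memory:")
stored = store_ids(conn, "cas12", ["2222680928", "2222680927"])
print(stored)
```

In practice the ids argument would be record["IdList"] from the esearch call. Using the ID as the primary key makes repeated runs safe, since already-stored IDs are ignored.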

Get the GenBank data

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
handle = Entrez.efetch(db="protein", id="EU490707.1", rettype="gb")
print(handle.read())

Fetch data using the IDs from the first step

from Bio import Entrez

Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
# ['2222680928', '2222680927', '2222680926', '2222680924']
handle = Entrez.efetch(db="protein", id="2222680928",
                       rettype="gb", retmode="text")
print(handle.read())
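
When there are many IDs, fetching them one at a time means one request per record. efetch also accepts several IDs joined by commas, so the list can be split into batches. A sketch under stated assumptions: chunked and fetch_genbank_batches are hypothetical helper names, and the batch size of 100 is an arbitrary choice, not an NCBI recommendation:

```python
def chunked(ids, size=100):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_genbank_batches(ids, email, size=100):
    """Yield GenBank text for each batch of IDs (needs biopython and network)."""
    # Imported here so that chunked() can be used without Biopython installed
    from Bio import Entrez

    Entrez.email = email  # Always tell NCBI who you are
    for batch in chunked(ids, size):
        handle = Entrez.efetch(db="protein", id=",".join(batch),
                               rettype="gb", retmode="text")
        try:
            yield handle.read()
        finally:
            handle.close()


# Offline demo of the batching step, using the IDs listed above
id_list = ["2222680928", "2222680927", "2222680926", "2222680924"]
print(list(chunked(id_list, size=3)))
```

Each string yielded by fetch_genbank_batches holds several concatenated GenBank records, which SeqIO.parse can read from an io.StringIO wrapper.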

Get the FASTA data

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
handle = Entrez.efetch(db="protein", id="EU490707.1", rettype="fasta")
print(handle.read())

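I wrote a small tool that makes it easy to get accession IDs (accids)

  • Only suitable when the number of IDs is small
  • For larger result sets, it writes the IDs to a file (already implemented)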
import jsw_bio as bio
bio.ncbi_download_accids(term='cas12', filename="./test.list")

# ['VEJ66715.1', 'SUY72866.1', 'SUY81473.1', ...

How batch retrieval of accession IDs works

  • ncbi_sid=6A060DC2258FD013_4558SID
  • The query key for the search term: [name="EntrezSystem2.PEntrez.Protein.Sequence_ResultsPanel.Sequence_DisplayBar.QueryKey"] 
  • Both values above come from the same page
### get by keywords
GET https://www.ncbi.nlm.nih.gov/protein/?term=cas15

# [name="EntrezSystem2.PEntrez.Protein.Sequence_ResultsPanel.Sequence_DisplayBar.QueryKey"]


### curl cas12 --- 28
curl 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=protein&report=accnlist&query_key=2' \
  -H 'cookie: ncbi_sid=6A060DC2258FD013_4558SID' \
  --compressed


### curl cas15 --- 28
curl 'https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?db=protein&report=accnlist&query_key=1' \
  -H 'cookie: ncbi_sid=CE8B1AB3299F0631_2296SID' \
  --compressed


#### test
GET https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?id=EU490707.1&db=protein&report=genpept&retmode=text
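
The curl calls above can be reproduced from Python with the standard library alone. This is only a sketch of how the request is assembled: build_accnlist_request is a hypothetical helper name, the endpoint and parameters are copied from the curl commands, and the session ID and query key must both come from the same search results page, as noted above. The request is built but not sent here:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl commands above
SVIEWER_URL = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi"


def build_accnlist_request(query_key, ncbi_sid):
    """Build (but do not send) the request for the accession list of a search."""
    params = {"db": "protein", "report": "accnlist", "query_key": query_key}
    url = f"{SVIEWER_URL}?{urlencode(params)}"
    # The session cookie ties the query key to the search that produced it
    return Request(url, headers={"Cookie": f"ncbi_sid={ncbi_sid}"})


request = build_accnlist_request(2, "6A060DC2258FD013_4558SID")
print(request.full_url)
```

To send it, pass the request to urllib.request.urlopen and split the decoded response body on newlines to get one accession per line.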

References