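Learning Python: NLTK, a classic toolkit for English natural language processing

NLTK stands for "Natural Language Toolkit". It is a classic Python library built specifically for natural language processing.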

Install / upgrade

pip install nltk -U

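First use

A data package must be downloaded before first use. The examples here rely on the Punkt sentence tokenizer data ("punkt"). Recent NLTK releases may ask for "punkt_tab" instead; if a LookupError names that resource, download it under that name.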

import nltk
nltk.download("punkt")

# The same download as a one-liner from the shell:
python -c "import nltk; nltk.download('punkt')"

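Sentence splitting with sent_tokenize

sent_tokenize splits a block of English text into a list of sentences, which makes later processing easier.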

import nltk
# nltk.download("punkt")


str1 = """
This class of CRISPR enzymes recognize a 5' T-rich protospacer adjacent motif (PAM, TTN for this specific enzyme), 
unlike Cas9 enzymes which recognize 3' G-rich PAMs, thus this enzyme increases the possibilites for genome editing (PubMed:26422227). 
The simplicity of the Cas12a-crRNA directed DNA endonuclease activity has been used to target and modify 
DNA sequences in rice and tobacco (PubMed:27905529).
"""

sentences = nltk.sent_tokenize(str1)
# print(sentences)

for index, line in enumerate(sentences):
    print(index, line)

Application to annotation tasks

import nltk
# nltk.download("punkt")
# nltk.download('averaged_perceptron_tagger')  # needed by pos_tag

document = 'Today the Netherlands celebrates King\'s Day. To honor this tradition, the Dutch embassy in San Francisco invited me to'
sentences = nltk.sent_tokenize(document)

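# Collect (token, tag) pairs from every sentence
data = []
for sent in sentences:
    data.extend(nltk.pos_tag(nltk.word_tokenize(sent)))

# Print only the tokens tagged as proper nouns (NNP)
for word, tag in data:
    if tag == 'NNP':
        print((word, tag))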

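Basic idea

Use `sent_tokenize` to split a paragraph into sentences.
Use `word_tokenize` and `pos_tag` to split each sentence into tokens and get the part-of-speech tag of each token.
Our target tag is NNP (proper noun). One workable approach is to extract candidates from a large amount of data first, then pick the target NNP words by hand.
We can keep a white_list and a black_list, prefer words found in the white_list, and record their positions.
stopwords (stop words): https://www.cnblogs.com/rumenz/articles/13701922.html
Getting token positions: https://www.nltk.org/howto/tokenize.html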
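
The white_list / black_list lookup with recorded positions can be sketched as follows. This is a minimal, dependency-free illustration, not the author's implementation: the regex tokenizer is only a stand-in for NLTK's tokenizers (the howto page linked above covers `span_tokenize`, which serves the same purpose), and `WHITE_LIST`, `BLACK_LIST`, and `find_terms` are hypothetical names.

```python
import re

# Hypothetical term lists, for illustration only
WHITE_LIST = {"CRISPR", "Cas9", "PAM"}
BLACK_LIST = {"PubMed"}


def find_terms(text, white_list=WHITE_LIST, black_list=BLACK_LIST):
    """Return (term, start, end) for each white-listed token that is not black-listed."""
    hits = []
    # \w+ is a crude tokenizer; each match carries its character offsets in text
    for m in re.finditer(r"\w+", text):
        token = m.group()
        if token in white_list and token not in black_list:
            hits.append((token, m.start(), m.end()))
    return hits


text = "Cas9 enzymes recognize 3' G-rich PAMs, unlike this class of CRISPR enzymes."
hits = find_terms(text)
for term, start, end in hits:
    print(term, start, end)
```

Exact matching means "PAMs" is not picked up by the entry "PAM"; whether plural forms should match is a design choice left open here.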

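Ideas for improvement

Find biology word lists and a pre-trained model.
Compute the cosine similarity between two words and apply a threshold, e.g. new_model.wv.similarity('university','school') > 0.3 (new_model is presumably a trained word-embedding model; it is not defined in these notes).
Collect every NNP word, even those outside our white_list, and combine the list with the similarity score to surface more candidate words. Any word whose tag starts with NN can qualify.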
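
The similarity threshold rests on cosine similarity between word vectors. The sketch below computes it by hand on made-up 3-dimensional vectors so the arithmetic is visible. The `vectors` table and the `similar_enough` helper are hypothetical; in practice the vectors would come from a trained embedding model such as the `new_model` mentioned above.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, NOT real embeddings
vectors = {
    "university": [0.9, 0.1, 0.3],
    "school": [0.8, 0.2, 0.4],
    "tobacco": [-0.1, 0.9, -0.2],
}


def similar_enough(w1, w2, threshold=0.3):
    """True when the two words' vectors are more similar than the threshold."""
    return cosine_similarity(vectors[w1], vectors[w2]) > threshold


print(similar_enough("university", "school"))
print(similar_enough("university", "tobacco"))
```

The 0.3 cut-off is taken from the note above; a suitable value depends on the model and the vocabulary.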

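Part-of-speech tag reference

CC   Coordinating conjunction                  NNS   Noun, plural             UH   Interjection
CD   Cardinal number                           NNP   Proper noun, singular    VB   Verb, base form
DT   Determiner                                NNPS  Proper noun, plural      VBD  Verb, past tense
EX   Existential "there"                       PDT   Predeterminer            VBG  Verb, gerund or present participle
FW   Foreign word                              POS   Possessive ending        VBN  Verb, past participle
IN   Preposition or subordinating conjunction  PRP   Personal pronoun         VBP  Verb, non-3rd person singular present
JJ   Adjective                                 PRP$  Possessive pronoun       VBZ  Verb, 3rd person singular present
JJR  Adjective, comparative                    RB    Adverb                   WDT  Wh-determiner
JJS  Adjective, superlative                    RBR   Adverb, comparative      WP   Wh-pronoun
LS   List item marker                          RBS   Adverb, superlative      WP$  Possessive wh-pronoun
MD   Modal                                     RP    Particle                 WRB  Wh-adverb
NN   Noun, singular or mass                    SYM   Symbol                   TO   "to"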

Basic implementation

import nltk
# nltk.download("punkt")
# nltk.download('averaged_perceptron_tagger')
 
 
str1 = """
This class of CRISPR enzymes recognize a 5' T-rich protospacer adjacent motif (PAM, TTN for this specific enzyme),
unlike Cas9 enzymes which recognize 3' G-rich PAMs, thus this enzyme increases the possibilites for genome editing (PubMed:26422227).
The simplicity of the Cas12a-crRNA directed DNA endonuclease activity has been used to target and modify
DNA sequences in rice and tobacco (PubMed:27905529).
"""
 
sentences = nltk.sent_tokenize(str1)
 
# get the first item:
sentence = sentences[0]
 
# tokenize
tokens = nltk.word_tokenize(sentence)
 
# get pos_tag
tagged = nltk.pos_tag(tokens)
print(tagged)
 
# [('This', 'DT'), ('class', 'NN'), ('of', 'IN'), ('CRISPR', 'NNP'), ('enzymes', 'NNS'), ('recognize', 'VBP'), ('a', 'DT'), ('5', 'CD'), ("'", 'POS'), ('T-rich', 'JJ'), ('protospacer', 'NN'), ('adjacent', 'JJ'), ('motif', 'NN'), ('(', '('), ('PAM', 'NNP'), (',', ','), ('TTN', 'NNP'), ('for', 'IN'), ('this', 'DT'), ('specific', 'JJ'), ('enzyme', 'NN'), (')', ')'), (',', ','), ('unlike','IN'), ('Cas9', 'NNP'), ('enzymes', 'NNS'), ('which', 'WDT'), ('recognize', 'VBP'), ('3', 'CD'), ("'", 'POS'), ('G-rich', 'JJ'), ('PAMs', 'NNP'), (',', ','), ('thus', 'RB'), ('this', 'DT'), ('enzyme', 'NN'), ('increases', 'VBZ'), ('the', 'DT'), ('possibilites', 'NNS'), ('for', 'IN'), ('genome', 'NN'), ('editing', 'NN'), ('(', '('), ('PubMed:26422227', 'NNP'), (')', ')'), ('.', '.')]
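
To go from tagged output like this to a list of candidate terms for manual review, one option is to count the tokens whose tag starts with a given prefix. The sketch below works on a short excerpt of the output above and needs no NLTK data; `count_by_tag` is a hypothetical helper name.

```python
from collections import Counter

# Excerpt of the tagged output shown above
tagged = [
    ('This', 'DT'), ('class', 'NN'), ('of', 'IN'), ('CRISPR', 'NNP'),
    ('enzymes', 'NNS'), ('PAM', 'NNP'), ('TTN', 'NNP'), ('Cas9', 'NNP'),
    ('enzymes', 'NNS'), ('PAMs', 'NNP'), ('PubMed:26422227', 'NNP'),
]


def count_by_tag(tagged_tokens, wanted_prefix="NNP"):
    """Count the tokens whose tag starts with wanted_prefix."""
    return Counter(
        word for word, tag in tagged_tokens if tag.startswith(wanted_prefix)
    )


candidates = count_by_tag(tagged)
for word, n in candidates.most_common():
    print(word, n)
```

Passing `wanted_prefix="NN"` widens the net to every noun tag. Citation tokens such as `PubMed:26422227` also come back tagged NNP, which is the kind of entry a black_list would remove.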

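Reference implementation

Detecting the sentiment of a sentence with feature extraction

Prepare the data
Extract features
Train the model
Use the model to make predictions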

from nltk.classify import NaiveBayesClassifier


def word_feats(word):
    # Treat the whole word as one boolean feature. Iterating over the
    # string instead would create one feature per character.
    return {word: True}


positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']

positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
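# Split on whitespace and strip trailing punctuation so "movie," matches "movie"
words = [w.strip('.,!?') for w in sentence.split()]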
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos) / len(words)))
print('Negative: ' + str(float(neg) / len(words)))