Learning Python: NLTK, a classic toolkit for English natural language processing
NLTK stands for "Natural Language Toolkit"; it is a classic Python library dedicated to natural language processing.
Installation / upgrade
pip install nltk -U
First use
The required data packages must be downloaded on first use.
import nltk
nltk.download("punkt")
# one-liner equivalent of the download above
python -c "import nltk; nltk.download('punkt')"
Sentence splitting: sent_tokenize
Split a given English passage into a list of sentences, which makes downstream processing easier.
import nltk
# nltk.download("punkt")
str1 = """
This class of CRISPR enzymes recognize a 5' T-rich protospacer adjacent motif (PAM, TTN for this specific enzyme),
unlike Cas9 enzymes which recognize 3' G-rich PAMs, thus this enzyme increases the possibilites for genome editing (PubMed:26422227).
The simplicity of the Cas12a-crRNA directed DNA endonuclease activity has been used to target and modify
DNA sequences in rice and tobacco (PubMed:27905529).
"""
sentences = nltk.sent_tokenize(str1)
# print(sentences)
for index, line in enumerate(sentences):
    print(index, line)
Application to annotation scenarios
import nltk
# nltk.download('averaged_perceptron_tagger')  # required by pos_tag
document = 'Today the Netherlands celebrates King\'s Day. To honor this tradition, the Dutch embassy in San Francisco invited me to'
sentences = nltk.sent_tokenize(document)
data = []
for sent in sentences:
    data = data + nltk.pos_tag(nltk.word_tokenize(sent))
for word in data:
    if 'NNP' == word[1]:
        print(word)
Basic approach
- Use `sent_tokenize` to split a paragraph into sentences.
- Use `word_tokenize` / `pos_tag` to split each sentence into tokens and obtain their part-of-speech tags.
- Our target tag is NNP, proper noun (we can first extract candidates from a large corpus, then manually pick out our target NNP words).
- Maintain a white_list / black_list, prefer words on the white_list, and record their positions.
- stopwords: https://www.cnblogs.com/rumenz/articles/13701922.html
- Getting token positions: https://www.nltk.org/howto/tokenize.html
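The last two bullets can be sketched together: `span_tokenize` yields character offsets for each token, and a stopword set filters out function words. A minimal sketch, assuming NLTK is installed; the inline stopword set stands in for NLTK's stopwords corpus so the snippet runs without a data download.

```python
from nltk.tokenize import WhitespaceTokenizer

text = "CRISPR enzymes recognize a T-rich protospacer adjacent motif"

# a tiny inline stopword set; in practice use the real corpus:
#   nltk.download("stopwords"); set(stopwords.words("english"))
stop = {"a", "the", "of", "for", "which"}

# span_tokenize yields (start, end) character offsets for every token,
# so each kept word comes with its position in the original text
spans = list(WhitespaceTokenizer().span_tokenize(text))
keywords = [(text[s:e], s, e) for s, e in spans
            if text[s:e].lower() not in stop]
print(keywords)
```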
Ideas for improvement
- Find a biology vocabulary and a pre-trained model.
- Compute the cosine similarity between two words, e.g. `new_model.wv.similarity('university', 'school') > 0.3`.
- Extract all NNP words, even those not on our white_list; combined with similarity scores, more candidate words can be found — any tag starting with NN will do.
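For reference, the similarity score that `wv.similarity` reports is the cosine of the angle between the two word vectors, i.e. their normalized dot product. A pure-Python sketch of the formula; the 3-dimensional "word vectors" below are made up for illustration (real models such as gensim Word2Vec use hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy vectors, illustrative only
vec_university = [0.9, 0.1, 0.3]
vec_school = [0.8, 0.2, 0.4]

print(cosine_similarity(vec_university, vec_school))
```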
POS tag reference

| Tag | Meaning | Tag | Meaning | Tag | Meaning |
| --- | --- | --- | --- | --- | --- |
| CC | coordinating conjunction | NNS | noun, plural | UH | interjection |
| CD | cardinal number | NNP | proper noun, singular | VB | verb, base form |
| DT | determiner | NNPS | proper noun, plural | VBD | verb, past tense |
| EX | existential "there" | PDT | predeterminer | VBG | gerund or present participle |
| FW | foreign word | POS | possessive ending | VBN | verb, past participle |
| IN | preposition or subordinating conjunction | PRP | personal pronoun | VBP | verb, non-3rd person singular present |
| JJ | adjective | PRP$ | possessive pronoun | VBZ | verb, 3rd person singular present |
| JJR | adjective, comparative | RB | adverb | WDT | wh-determiner |
| JJS | adjective, superlative | RBR | adverb, comparative | WP | wh-pronoun |
| LS | list item marker | RBS | adverb, superlative | WP$ | possessive wh-pronoun |
| MD | modal | RP | particle | WRB | wh-adverb |
| NN | noun, singular | SYM | symbol | TO | "to" |
Basic implementation
import nltk
# nltk.download("punkt")
# nltk.download('averaged_perceptron_tagger')
str1 = """
This class of CRISPR enzymes recognize a 5' T-rich protospacer adjacent motif (PAM, TTN for this specific enzyme),
unlike Cas9 enzymes which recognize 3' G-rich PAMs, thus this enzyme increases the possibilites for genome editing (PubMed:26422227).
The simplicity of the Cas12a-crRNA directed DNA endonuclease activity has been used to target and modify
DNA sequences in rice and tobacco (PubMed:27905529).
"""
sentences = nltk.sent_tokenize(str1)
# get the first item:
sentence = sentences[0]
# tokenize
tokens = nltk.word_tokenize(sentence)
# get pos_tag
tagged = nltk.pos_tag(tokens)
print(tagged)
# [('This', 'DT'), ('class', 'NN'), ('of', 'IN'), ('CRISPR', 'NNP'), ('enzymes', 'NNS'), ('recognize', 'VBP'), ('a', 'DT'), ('5', 'CD'), ("'", 'POS'), ('T-rich', 'JJ'), ('protospacer', 'NN'), ('adjacent', 'JJ'), ('motif', 'NN'), ('(', '('), ('PAM', 'NNP'), (',', ','), ('TTN', 'NNP'), ('for', 'IN'), ('this', 'DT'), ('specific', 'JJ'), ('enzyme', 'NN'), (')', ')'), (',', ','), ('unlike','IN'), ('Cas9', 'NNP'), ('enzymes', 'NNS'), ('which', 'WDT'), ('recognize', 'VBP'), ('3', 'CD'), ("'", 'POS'), ('G-rich', 'JJ'), ('PAMs', 'NNP'), (',', ','), ('thus', 'RB'), ('this', 'DT'), ('enzyme', 'NN'), ('increases', 'VBZ'), ('the', 'DT'), ('possibilites', 'NNS'), ('for', 'IN'), ('genome', 'NN'), ('editing', 'NN'), ('(', '('), ('PubMed:26422227', 'NNP'), (')', ')'), ('.', '.')]
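Building on the tagged output above, target words can then be pulled out by tag. This sketch filters the proper-noun entries from a hard-coded subset of that result (hard-coded so it runs without the tagger data):

```python
# a few entries from the pos_tag output shown above
tagged = [('This', 'DT'), ('class', 'NN'), ('CRISPR', 'NNP'),
          ('PAM', 'NNP'), ('TTN', 'NNP'), ('Cas9', 'NNP'),
          ('genome', 'NN'), ('editing', 'NN')]

# keep proper nouns (NNP and NNPS both start with "NNP")
proper_nouns = [word for word, tag in tagged if tag.startswith('NNP')]
print(proper_nouns)  # ['CRISPR', 'PAM', 'TTN', 'Cas9']
```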
References
- https://www.jianshu.com/p/c273e926d734
- https://www.cnblogs.com/rumenz/articles/13701922.html
- https://www.nltk.org/howto/gensim.html
- https://www.nltk.org/howto/tag.html
- https://www.nltk.org/howto/tokenize.html
- https://blog.csdn.net/github_39655029/article/details/82893018
Detecting sentence sentiment via feature extraction
- Prepare the data
- Extract features
- Train a model
- Use the model for prediction
from nltk.classify import NaiveBayesClassifier
def word_feats(words):
    return dict([(word, True) for word in words])
positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']
# wrap each word in a list so the features are whole words, not single characters
positive_features = [(word_feats([pos]), 'pos') for pos in positive_vocab]
negative_features = [(word_feats([neg]), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats([neu]), 'neu') for neu in neutral_vocab]
train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)
# Predict
neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats([word]))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
print('Positive: ' + str(float(pos) / len(words)))
print('Negative: ' + str(float(neg) / len(words)))
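An alternative to the word-by-word vote above is to classify the whole sentence at once, building a single feature dict from all its words. A minimal sketch with the same NaiveBayesClassifier, using a trimmed-down training vocabulary for brevity:

```python
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

# a trimmed-down training set in the same shape as above
train = ([(word_feats([w]), 'pos') for w in ['awesome', 'good', 'great']] +
         [(word_feats([w]), 'neg') for w in ['bad', 'terrible', 'hate']])
clf = NaiveBayesClassifier.train(train)

# one feature dict for the whole sentence instead of a per-word vote;
# feature names never seen in training are simply ignored by the classifier
feats = word_feats("awesome movie , i liked it".split())
print(clf.classify(feats))
```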