1. NLP with Python

This post contains my notes on Natural Language Processing with Python from [Sanjaya’s Blog]; it does not translate all of the original content.

Jupyter notebook link

NLP is about getting computers to understand human language. It is widely used in information retrieval (search engines), text classification, natural language generation, and more.

A typical NLP application pipeline looks like this:

(figure: typical NLP application pipeline)

1. Preprocessing

Tokenization

Preprocessing usually starts by extracting tokens. The simplest approach is to split the text on whitespace:

text = "This warning shouldn't be taken lightly."
print(text.split(sep=' '))
====================================================
['This', 'warning', "shouldn't", 'be', 'taken', 'lightly.']

Use a regex to remove punctuation characters:

import regex as re
clean_text = re.sub(r"\p{P}+", "", text)
print(clean_text.split())
=================================
['This', 'warning', 'shouldnt', 'be', 'taken', 'lightly']  # the apostrophe in "shouldn't" and the final period are gone

This is more concise than str.translate(str.maketrans('', '', string.punctuation)) (a quick comparison sketch is below).

\p{P}

  • \p{P} matches any punctuation character
  • \p{Z} matches any separator (whitespace) character
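
For comparison, a quick sketch of the standard-library approach mentioned above (it only strips ASCII punctuation):

import string

print(text.translate(str.maketrans('', '', string.punctuation)).split())
=================================
['This', 'warning', 'shouldnt', 'be', 'taken', 'lightly']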

You can also use spaCy:

# install spaCy and download the English language model
pip install spacy
python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)  # note: the output below is from the original post, where the example text ends with ":) #python."
print([token.text for token in doc])
===============================================
['This', 'warning', 'should', "n't", 'be', 'taken', 'lightly', ':)', '#', 'python', '.']
Stop Word Removal

Words like a, an, and the appear very frequently but carry little meaning; they add computation and can hurt downstream results, so they are removed.

print([(token.text, token.is_stop) for token in doc])

A more common pattern is to filter the token list against a stop-word set:

text_rev_stops = [word for word in tokens if word not in stopwords]
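
A minimal runnable sketch using spaCy's built-in stop-word set (the variable names here are just for illustration):

stopwords = nlp.Defaults.stop_words            # spaCy's built-in stop-word set
tokens = [token.text for token in doc]
text_rev_stops = [word for word in tokens if word.lower() not in stopwords]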

Stemming

For example, the stem of cats is cat, and the stem of meeting is meet.
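
spaCy does not ship a stemmer, so here is a quick sketch with NLTK's PorterStemmer (assumes nltk is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ['cats', 'meeting']])
=================================
['cat', 'meet']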

Lemmatisation
print ([(token.text, token.lemma_) for token in nlp("we are meeting tomorrow")])
print ([(token.text, token.lemma_) for token in nlp("i am going to a meeting")])
================================================================================
[('we', '-PRON-'), ('are', 'be'), ('meeting', 'meet'), ('tomorrow', 'tomorrow')]
[('i', 'i'), ('am', 'be'), ('going', 'go'), ('to', 'to'), ('a', 'a'), ('meeting', 'meeting')]
POS Tagging
print ([(token.text, token.pos_) for token in doc])
========================================================
[('This', 'DET'), ('warning', 'NOUN'), ('should', 'VERB'), ("n't", 'ADV'), ('be', 'VERB'), ('taken', 'VERB'), ('lightly', 'ADV'), (':)', 'NOUN'), ('#', 'NOUN'), ('python', 'NOUN'), ('.', 'PUNCT')]
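
If a tag abbreviation is unclear, spaCy can expand it:

print(spacy.explain('DET'))   # 'determiner'
print(spacy.explain('ADV'))   # 'adverb'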
Summary of Part 1

Libraries such as spaCy, NLTK, gensim, and TextBlob can be used for preprocessing, preparing the text for later training and inference.

2. Feature Extraction

This part uses scikit-learn and spaCy.

Binary Encoding

This is not one-hot encoding; each text becomes a vector as long as the vocabulary, with a 1 at the position of every word that appears in the text.

The example texts for this part:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"
]

First, build the vocabulary:

vocab = sorted(set(word for sentence in texts for word in sentence.split()))
print(len(vocab), vocab)
==================================================================================
12 ['and', 'black', 'blue', 'car', 'crow', 'i', 'in', 'my', 'reflection', 'see', 'the', 'window']

Then vectorize an input text according to which vocabulary words appear in it.

import numpy as np

def binary_transform(text):
    output = np.zeros(len(vocab))
    words = set(text.split())
    # set position i to 1 if the i-th vocabulary word appears in the text
    for i, v in enumerate(vocab):
        output[i] = v in words
    return output

print(binary_transform("i saw crow"))
=====================================================
[0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]

You can also use CountVectorizer from scikit-learn directly (note that its default token_pattern drops single-character tokens, so 'i' will not appear in its vocabulary):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True)
vec.fit(texts)
print([w for w in sorted(vec.vocabulary_.keys())])
===========================================================
['and', 'black', 'blue', 'car', 'crow', 'in', 'my', 'reflection', 'see', 'the', 'window']
import pandas as pd

pd.DataFrame(vec.transform(texts).toarray(), columns=sorted(vec.vocabulary_.keys()))

Output:

(output: DataFrame of binary term vectors, one row per text)

Counting

Counting is similar to binary encoding, but instead of only recording whether a word appears, it counts how many times it appears.

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(binary=False)  # binary=False is the default, so this argument can be omitted
vec.fit(texts)

import pandas as pd
pd.DataFrame(vec.transform(texts).toarray(), columns=sorted(vec.vocabulary_.keys()))

(output: DataFrame of word counts, one row per text)

Compared with the binary encoding, the output now records how many times each word occurs, and these counts essentially act as weights. In practice, however, words like a, an, and have would end up with the largest weights, which is exactly the limitation of this approach for applications such as search engines; the next method addresses it.

TF-IDF

TF-IDF stands for term frequency-inverse document frequency.

$$\mathrm{idf}(t, D) = \log\frac{|D|}{1 + |\{d \in D : t \in d\}|}$$

The more common a term is, the smaller its IDF, approaching 0; the +1 in the denominator avoids division by zero.

The full score is usually written as:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

where:

  • $t$ is the term
  • $d$ is a document, and $\mathrm{tf}(t, d)$ is how often $t$ occurs in $d$
  • $D$ is the corpus, i.e. the full set of documents
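
As a concrete check (scikit-learn's TfidfVectorizer, used below, applies a smoothed variant by default): it computes $\mathrm{idf}(t) = \ln\frac{1+|D|}{1+\mathrm{df}(t)} + 1$ and then L2-normalizes each row. For the three example texts above, "blue" appears in one document, so $\mathrm{idf} = \ln\frac{4}{2} + 1 \approx 1.69$, while "window" appears in all three, so $\mathrm{idf} = \ln\frac{4}{4} + 1 = 1$; rare terms therefore receive larger weights than ubiquitous ones.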

Usage:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(texts)

import pandas as pd
pd.DataFrame(vec.transform(texts).toarray(), columns=sorted(vec.vocabulary_.keys()))

(output: DataFrame of TF-IDF weights, one row per text)

This part covered different ways of converting text into numeric features that can then be fed to machine learning models.

3. Text Clustering

Clustering algorithms include KMeans, DBSCAN, spectral clustering, hierarchical clustering, and so on.

  • KMeans can assign new, unseen data points to the learned clusters
  • DBSCAN cannot be applied directly to new, unseen data (see the sketch below)
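
A minimal sketch of that difference on toy data (not from the original post): KMeans exposes predict() for new points, while DBSCAN only labels the data it was fitted on.

from sklearn.cluster import KMeans, DBSCAN
import numpy as np

X_train = np.array([[0.0], [0.1], [5.0], [5.1]])
X_new = np.array([[0.05], [4.9]])

km = KMeans(n_clusters=2, random_state=0).fit(X_train)
print(km.predict(X_new))   # assigns the unseen points to the nearest cluster

db = DBSCAN(eps=0.5, min_samples=2).fit(X_train)
print(db.labels_)          # labels for the training data only; DBSCAN has no predict()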

Using jange, which appears to be a package written by the original author.

from jange import ops, stream, vis

ds = stream.from_csv(
    "https://raw.githubusercontent.com/jangedoo/jange/master/dataset/bbc.csv",
    columns='news',
    context_column="type"
)

# Extract clusters
result_collector = {}
clusters_ds = ds.apply(
    ops.text.clean.pos_filter("NOUN", keep_matching_tokens=True),
    ops.text.encode.tfidf(max_features=5000, name="tfidf"),
    ops.cluster.minibatch_kmeans(n_clusters=5),
    result_collector=result_collector,
)
# Get features extracted by tfidf and reduce the dimensions
features_ds = result_collector[clusters_ds.applied_ops.find_by_name("tfidf")]
reduced_features = features_ds.apply(ops.dim.pca(n_dim=2))  # reduce the features to 2 dimensions with PCA

# Visualization
vis.cluster.visualize(reduced_features, clusters_ds)

This raises an error for me: ValueError: not enough values to unpack (expected 2, got 0).

Here is the result from the original post:

(figure: jange cluster visualization from the original post)

Next, use scikit-learn to analyze the BBC dataset: 2,225 documents in five categories (business, entertainment, politics, sport, and tech).

import pandas as pd
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.datasets import load_files

random_state = 0
data_dir = r"/content/bbc"  # path to the extracted BBC dataset (one subfolder of text files per category)
data = load_files(data_dir, encoding='utf-8', decode_error='replace', random_state=random_state)
df = pd.DataFrame(list(zip(data['data'], data['target'])), columns=['text', 'label'])
df.sample(10)

Feature extraction

# extract TF-IDF features
vec = TfidfVectorizer(stop_words='english')
vec.fit(df.text.values)
features = vec.transform(df.text.values)

Training

cls = MiniBatchKMeans(n_clusters=5, random_state=random_state)
cls.fit(features)
cls.predict(features)
cls.labels_

Visualization

pca = PCA(n_components=2, random_state=random_state)
reduced_features = pca.fit_transform(features.toarray())  # reduce the features to 2 dimensions
reduced_cluster_centers = pca.transform(cls.cluster_centers_)  # project the cluster centers into the same 2-D space
plt.figure(figsize=(20, 16))
plt.scatter(reduced_features[:,0], reduced_features[:, 1], c=cls.predict(features))
plt.scatter(reduced_cluster_centers[:,0], reduced_cluster_centers[:,1], marker='x', s=150, c='r')
plt.show()

(figure: PCA projection of the clusters and their centers)

Evaluation

from sklearn.metrics import homogeneity_score  # score ranges from 0 to 1; needs ground-truth labels

homogeneity_score(df.label, cls.predict(features))
=========================================
0.5433462110559382
from sklearn.metrics import silhouette_score  # score ranges from -1 to 1; no ground-truth labels needed
silhouette_score(features, labels=cls.predict(features))
=====================================================================
0.009927737289334684
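
Since the silhouette score needs no ground-truth labels, it can also be used to compare different cluster counts. A minimal sketch (not from the original post), reusing features and random_state from above:

# try a few values of n_clusters and compare their silhouette scores
for k in (3, 5, 7, 10):
    model = MiniBatchKMeans(n_clusters=k, random_state=random_state).fit(features)
    print(k, silhouette_score(features, labels=model.labels_))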

4. Topic Modeling

0: game match player win
1: government minister election

Of the two word lists above, topic 0 is about sport and topic 1 is about politics.

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  # load the language model without the parser and NER components
random_state = 0
data_dir = r"/content/bbc"
data = load_files(data_dir, encoding='utf-8', decode_error='replace', random_state=random_state)
df = pd.DataFrame(list(zip(data['data'], data['target'])), columns=['text', 'label'])

def only_nouns(texts):
    output = []
    for doc in nlp.pipe(texts):
        # keep only noun lemmas, since nouns carry most of the topical signal
        noun_text = ' '.join(token.lemma_ for token in doc if token.pos_ == 'NOUN')
        output.append(noun_text)
    return output

df['text'] = only_nouns(df['text'])
df.head()
=======================================================
                                                text  label
0  boss bag award executive business magazine tit...      0
1  copy bumper sale fi shooter game copy sale com...      4
2  msp climate warning climate change control dec...      2
3  pavey success view week race track bronze inju...      3
4  tory rethink association candidate election ag...      2


Training

n_topics = 5

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vec = TfidfVectorizer(max_features=5000, stop_words='english', max_df=0.85, min_df=2)
features = vec.fit_transform(df.text)

from sklearn.decomposition import NMF
cls = NMF(n_components=n_topics, random_state=random_state)
cls.fit(features)
===========================================================
NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
n_components=5, random_state=0, shuffle=False, solver='cd', tol=0.0001,
verbose=0)

cls.components_ will be an array of shape [n_topics, n_features], here [5, 5000]; you can verify this with `cls.components_.shape`.

# list of vocabulary terms, indexed the same way as the feature columns
features = vec.get_feature_names()

# number of top words to show per topic
n_top_words = 15

for i, topic_vec in enumerate(cls.components_):
    print(i, end=' ')
    # topic_vec.argsort() returns word indices ordered from lowest to highest score;
    # the slice [-1:-n_top_words:-1] walks back from the highest score (it yields n_top_words - 1 words)
    for fid in topic_vec.argsort()[-1:-n_top_words:-1]:
        print(features[fid], end=' ')
    print()

=============================================================
0 growth sale economy year company market share rate price firm profit oil analyst month
1 film award actor star actress director nomination movie year comedy role festival prize category
2 game player match team injury club time win season coach goal victory title champion
3 election party government tax minister leader people campaign chancellor plan issue voter country taxis
4 phone people music technology service user broadband software computer tv network device video site

Prediction

new_articles = [
    "Playstation network was down so many people were angry",
    "Germany scored 7 goals against Brazil in worldcup semi-finals"
]
cls.transform(vec.transform(new_articles)).argsort(axis=1)[:, -1]
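
To make the prediction easier to read, a small follow-up sketch (not from the original post) prints each article's predicted topic index together with that topic's top words, reusing cls, vec, and the features name list from above:

topic_ids = cls.transform(vec.transform(new_articles)).argsort(axis=1)[:, -1]
for article, topic_id in zip(new_articles, topic_ids):
    top_words = [features[fid] for fid in cls.components_[topic_id].argsort()[-5:][::-1]]
    print(topic_id, top_words, '<-', article)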

5. Nearest Neighbor Search

Data preprocessing

from sklearn.datasets import fetch_20newsgroups

bunch = fetch_20newsgroups(remove='headers')
print(type(bunch), bunch.keys())
===========================================================
<class 'sklearn.utils.Bunch'> dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

Take a look at the data in the bunch:

bunch.data[0]
====================================================================
'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n'

Extract features:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(max_features=10000)
features = vec.fit_transform(bunch.data)
print(features.shape)
================================================================
(11314, 10000)

Training

from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(n_neighbors=10, metric='cosine')
knn.fit(features)
============================================================
NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
metric_params=None, n_jobs=None, n_neighbors=10, p=2,
radius=1.0)

Results (note that a training document's nearest neighbor is itself, at distance 0):

knn.kneighbors(features[0:1], return_distance=False)
===================================================
array([[ 0, 958, 8013, 8266, 659, 5553, 3819, 2554, 6055, 7993]])



knn.kneighbors(features[0:1], return_distance=True)
===========================================
(array([[0. , 0.35119023, 0.62822688, 0.64738668, 0.66613124,
0.67267273, 0.68149664, 0.68833514, 0.70024449, 0.70169709]]),
array([[ 0, 958, 8013, 8266, 659, 5553, 3819, 2554, 6055, 7993]]))
input_texts = ["any recommendations for good ftp sites?", "i need to clean my car"]
input_features = vec.transform(input_texts)

D, N = knn.kneighbors(input_features, n_neighbors=2, return_distance=True)

for input_text, distances, neighbors in zip(input_texts, D, N):
    print("Input text = ", input_text[:200], "\n")
    for dist, neighbor_idx in zip(distances, neighbors):
        print("Distance = ", dist, "Neighbor idx = ", neighbor_idx)
        print(bunch.data[neighbor_idx][:200])
        print("-" * 200)
    print("=" * 200)
    print()



==========================================================================
Input text = any recommendations for good ftp sites?

Distance = 0.5870334253639387 Neighbor idx = 89
I would like to experiment with the INTEL 8051 family. Does anyone out
there know of any good FTP sites that might have compiliers, assemblers,
etc.?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Distance = 0.6566334116701875 Neighbor idx = 7665
Hi!

I am looking for ftp sites (where there are freewares or sharewares)
for Mac. It will help a lot if there are driver source codes in those
ftp sites. Any information is appreciated.

Thanks in
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
========================================================================================================================================================================================================

Input text = i need to clean my car

Distance = 0.6592186982514803 Neighbor idx = 8013
In article <49422@fibercom.COM> rrg@rtp.fibercom.com (Rhonda Gaines) writes:
>
>I'm planning on purchasing a new car and will be trading in my '90
>Mazda MX-6 DX. I've still got 2 more years to pay o
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Distance = 0.692693967282819 Neighbor idx = 7993
I bought a car with a defunct engine, to use for parts
for my old but still running version of the same car.

The car I bought has good tires.

Is there anything in particular that I should do to
stor
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
========================================================================================================================================================================================================