1. Using LIME to Explain an LSTM Model

This post is part of my notes on the Practical NLP material.

1. Data Preprocessing

Download the data and extract it to get the dataset, read the train and test paths, fetch the pos and neg folders, read the text contents from those folders together with the rating encoded in each txt file name, and merge everything into one DataFrame. The extracted directory tree looks like this (TensorFlow version used: 1.15.2):

```
aclImdb
├── imdbEr.txt
├── imdb.vocab
├── README
├── test
│   ├── labeledBow.feat
│   ├── neg
│   ├── pos
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg
    ├── pos
    ├── unsup
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt
```

To build the train and test datasets:

  1. First read a pos-level folder: each file like 2021_3.txt holds a review's text, and the file name encodes its rating (a tiny regex demo follows this list)

    `tf.gfile.GFile(path, 'r')` makes reading the files much faster

    • create a data = {} dict
    • define the dict's key-value structure (sentence, sentiment)
    • fill in the values file by file
    • build a pandas DataFrame from the dict
  2. Use the function above to load the pos and neg data inside the train and test folders

    • read pos and neg separately
    • add a positive/negative polarity label
    • concatenate into a single DataFrame
  3. Download the data by URL and extract it, then use the functions above to build the train and test DataFrames

    • `tf.keras.utils.get_file(fname="aclImdb.tar.gz", origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", extract=True)`
    • return `train_df, test_df`
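Before the full loader, here is a tiny standalone demo of the file-name regex from step 1 (names follow an `<id>_<rating>.txt` pattern):

```python
import re
# group(1) captures the rating part of the file name
print(re.match(r'\d+_(\d+)\.txt', '2021_3.txt').group(1))  # -> '3'
```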
```python
import os
import re

import pandas as pd
import tensorflow as tf
# pip install tensorflow==1.15.2 (tf.gfile is the TF1 API)

def load_dir_data(dir):
    """
    Read every review file in a directory and return a DataFrame with a
    `sentence` column (the file contents, read quickly via tf.gfile) and
    a `sentiment` column (the rating encoded in the file name).
    """
    data = {}
    data['sentence'] = []
    data['sentiment'] = []
    for file_path in os.listdir(dir):
        with tf.gfile.GFile(os.path.join(dir, file_path), 'r') as f:
            data['sentence'].append(f.read())
        # file names look like 2021_3.txt; group(1) is the rating, here 3
        data['sentiment'].append(re.match(r'\d+_(\d+)\.txt', file_path).group(1))
    return pd.DataFrame.from_dict(data)


def load_dataset(dir):
    """
    Load the pos and neg folders with load_dir_data(), label each row with
    the polarity its folder implies, then return the concatenated,
    shuffled DataFrame with a reset index.
    """
    pos_df = load_dir_data(os.path.join(dir, 'pos'))
    neg_df = load_dir_data(os.path.join(dir, 'neg'))
    pos_df['polarity'] = 1
    neg_df['polarity'] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

def download_load_datasets():
    """
    Download the dataset by URL and extract it, then build the train and
    test DataFrames with load_dataset().
    """
    dataset = tf.keras.utils.get_file(
        fname="aclImdb.tar.gz",
        origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
        extract=True)

    train_df = load_dataset(os.path.join(os.path.dirname(dataset), "aclImdb", "train"))
    test_df = load_dataset(os.path.join(os.path.dirname(dataset), "aclImdb", "test"))
    return train_df, test_df

train, test = download_load_datasets()
train.sample(5)
```
===========================================================================
       sentence                                            sentiment  polarity
16710  Ironically for a play unavailable on film or v...  7          1
20589  On the back burner for years (so it was report...  1          0
10690  I will stat of with the plot Alice, having sur...  4          0
5756   Envy stars some of the best. Jack Black, Ben S...  1          0
4702   Another rape of History<br /><br />This movie ...  3          0
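As a quick sanity check (my own addition; the counts come from the dataset's documentation), the shuffled training set should be balanced across the two polarities:

```python
# 25,000 labeled training reviews, 12,500 per polarity
print(train.shape)                       # (25000, 3)
print(train['polarity'].value_counts())  # 1: 12500, 0: 12500
```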

2. Building and Training the Model

Build a custom sklearn pipeline:

  • implement two classes, TextsToSequences() and Padder()
  • use make_pipeline() to chain the fit_on_texts() → texts_to_sequences() → pad_sequences() processing flow
```python
import warnings
warnings.filterwarnings('ignore')

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.base import BaseEstimator, TransformerMixin  # TransformerMixin lives in sklearn.base

# maximum sequence length
max_seq_len = 1000
max_num_words = 20000
embedding_dim = 100
valid_split = 0.2

vocab_size = 20000
maxlen = 1000

train_texts = train['sentence'].values
train_labels = train['polarity'].values
test_texts = test['sentence'].values

label_index = {'pos': 1, 'neg': 0}
test_labels = test['polarity'].values


class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
    """
    Inheriting from BaseEstimator and TransformerMixin lets us plug custom
    fit/transform steps into an sklearn pipeline; BaseEstimator also gives
    us parameter handling for the **kwargs we pass through.
    """
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def fit(self, texts, y=None):
        # Tokenizer.fit_on_texts() builds the vocabulary;
        # returning self enables chained fit().transform() calls
        self.fit_on_texts(texts)
        return self

    def transform(self, texts, y=None):
        # Tokenizer.texts_to_sequences() maps each text to a list of word
        # indices; wrap in np.array so sklearn receives an array, not a list
        return np.array(self.texts_to_sequences(texts))

seq = TextsToSequences(num_words=vocab_size)

class Padder(BaseEstimator, TransformerMixin):
    """
    Pad/truncate variable-length sequences to a common length:
    sequences longer than maxlen keep only their last maxlen tokens,
    while shorter ones are zero-padded at the front.
    """
    def __init__(self, maxlen=500):
        self.maxlen = maxlen
        self.max_index = None

    def fit(self, x, y=None):
        self.max_index = pad_sequences(x, maxlen=self.maxlen).max()
        return self

    def transform(self, x, y=None):
        x = pad_sequences(x, maxlen=self.maxlen)
        # zero out any word index not seen during fit
        x[x > self.max_index] = 0
        return x

padder = Padder(maxlen)
```
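A quick sanity check of the two transformers on a toy corpus (my own example, and it relies on the older TF1-era numpy these notes use, where ragged `np.array` calls still work):

```python
toy_texts = ["a good movie", "not a good movie at all"]
toy_seq = TextsToSequences(num_words=10).fit(toy_texts).transform(toy_texts)
toy_pad = Padder(maxlen=4).fit(toy_seq).transform(toy_seq)
print(toy_pad)  # shape (2, 4); the short text is zero-padded at the front,
                # the long one keeps only its last 4 tokens
```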

Train the LSTM model

```python
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.pipeline import make_pipeline

batch_size = 32
max_features = vocab_size + 1

def lstm_model(max_features):
    lstm = Sequential()
    # use the max_features argument (vocab_size + 1) so every index fits
    lstm.add(Embedding(max_features, 128))
    lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
    lstm.add(Dense(1, activation='sigmoid'))
    lstm.compile(loss='binary_crossentropy',
                 optimizer='adam',
                 metrics=['accuracy'])
    return lstm

# KerasClassifier forwards max_features to lstm_model() at build time
sklearn_lstm = KerasClassifier(build_fn=lstm_model, epochs=2,
                               batch_size=batch_size,
                               max_features=max_features,
                               verbose=1)

pipeline = make_pipeline(seq, padder, sklearn_lstm)
pipeline.fit(train_texts, train_labels)
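```

One reason the KerasClassifier wrapper matters: LIME will later need a predict_proba that returns one column per class, and the sklearn wrapper expands the single sigmoid output p into the two-column [1 - p, p] form. A quick shape check (my own snippet, assuming the fitted pipeline above):

```python
# one [P(negative), P(positive)] row per input text -> shape (2, 2)
print(pipeline.predict_proba(test_texts[:2]).shape)
```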

Prediction accuracy: the result looks decent, with a test accuracy of 83.74%. But which features actually drive the model, and are its predictions reasonable? That is what LIME is for.

```python
from sklearn import metrics

y_preds = pipeline.predict(test_texts)
print("Test accuracy: {:.2f}%".format(100 * metrics.accuracy_score(test_labels, y_preds)))
```
====================================================================================
Test accuracy: 83.74%
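Accuracy alone hides where the errors fall; an optional extra check with sklearn (my addition) breaks the same predictions down per class:

```python
# per-class precision, recall and F1 for the same predictions
print(metrics.classification_report(test_labels, y_preds,
                                    target_names=['negative', 'positive']))
```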

3. Explaining the Model with LIME

The LIME workflow is:

  1. pick a sample instance
  2. create the matching explainer
  3. have the explainer approximate, around that instance, the weight each feature carries for the trained model (see the sketch after this list)
  4. plot those weights to see how each feature pushes the prediction, positive or negative, and judge whether that is reasonable
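To make step 3 concrete, here is a toy sketch of the idea behind LIME for text. This is my own illustration, not the lime library's implementation: randomly mask words, query the black-box model on the perturbed texts, and fit a linear surrogate whose coefficients approximate each word's local weight.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_sketch(text, predict_proba, n_samples=500, seed=0):
    """Toy local surrogate, for intuition only."""
    rng = np.random.RandomState(seed)
    words = text.split()
    # 1. draw random binary masks: 1 keeps a word, 0 drops it
    masks = rng.randint(0, 2, size=(n_samples, len(words)))
    masks[0, :] = 1  # keep the unperturbed text as the first sample
    perturbed = [' '.join(w for w, keep in zip(words, row) if keep)
                 for row in masks]
    # 2. query the model for P(positive) on every perturbed text
    probs = predict_proba(perturbed)[:, 1]
    # 3. fit a linear surrogate; coefficients approximate local word weights
    #    (real LIME also weights samples by distance to the original text)
    surrogate = Ridge(alpha=1.0).fit(masks, probs)
    return sorted(zip(words, surrogate.coef_), key=lambda p: -abs(p[1]))

# e.g. lime_sketch(test_texts[11], pipeline.predict_proba)[:10]
```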

The trained LSTM predicts sample 11 ("This was an excellent movie...") as a positive review, consistent with its label.

```python
# pick one test instance
idx = 11
text_sample = test_texts[idx]
class_names = ['negative', 'positive']

print('Sample {}: last 1000 words (the part the model sees)'.format(idx))
print(' '.join(text_sample.split()[-1000:]))
print("P(positive) = ", pipeline.predict_proba([text_sample])[0, 1])
print("True class: %s" % class_names[test_labels[idx]])
```
=======================================================================
Sample 11: last 1000 words (the part the model sees)
This was an excellent movie - fast-paced, well-written and had an intriguing plot. The special effects were innovative, especially in the opening scene. The training segment got a bit silly but overall it was a tense movie.
1/1 [==============================] - 0s 117ms/step
P(positive) =  0.96135324
True class: positive

So which features drive the prediction? Individual words. From the weight plot, the most influential are words like excellent and tense, with weights above 0.1, while the negative words all stay below 0.05, so the model calls the review positive. That looks reasonable.

```python
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline
from collections import OrderedDict
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=class_names)
explanation = explainer.explain_instance(text_sample,
                                         pipeline.predict_proba, num_features=10)

# turn the explanation into an ordered word -> weight mapping
weights = OrderedDict(explanation.as_list())
# then into a two-column DataFrame for plotting
lime_weights = pd.DataFrame({'words': list(weights.keys()),
                             'weights': list(weights.values())})

plt.figure(figsize=(10, 8), dpi=120)
sns.barplot(x='words', y='weights', data=lime_weights)
plt.xticks(rotation=45)
plt.title('Sample {} features weights given by LIME'.format(idx))
```

(Figure: bar plot of the LIME feature weights for sample 11)
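As an alternative to the manual seaborn plot, lime can render the explanation itself when running in a notebook:

```python
# inline HTML view with the weighted words highlighted in the original text
explanation.show_in_notebook(text=True)
```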