Explaining an LSTM Model with lime. This post is part of my notes on Practical NLP.
1. Data preprocessing

Download the dataset and extract it, read the train and test paths, get the pos and neg folders, read each file's text content together with the rating encoded in its txt filename, and merge everything into one DataFrame. The directory layout after extraction looks like this (TensorFlow version used: 1.15.2):
```
aclImdb
├── imdbEr.txt
├── imdb.vocab
├── README
├── test
│   ├── labeledBow.feat
│   ├── neg
│   ├── pos
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg
    ├── pos
    ├── unsup
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt
```
To build the train and test DataFrames:

- First read a rating-level folder such as pos, where each file is named like 2021_3.txt: collect the text content and the rating level encoded in the filename.
- Reading files with `tf.gfile.GFile(path, 'r')` is much faster.
- Create a `data = {}` dictionary, define its key-value structure, and append the values for each file.
- Build a pandas DataFrame from the dictionary.
- Using that function, load the pos and neg folders inside both train and test: read pos and neg separately, add a positive/negative polarity label, and concatenate the two DataFrames.
- Finally, download the data by URL, extract it, and build the train and test sets with the functions above. `tf.keras.utils.get_file` caches the archive (under `~/.keras/datasets` by default) and extracts it alongside the download, which is why the code below locates the data via `os.path.dirname(dataset)`:
```python
tf.keras.utils.get_file(
    fname="aclImdb.tar.gz",
    origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    extract=True)
```
```python
import tensorflow as tf
import os, re
import pandas as pd
from tensorflow import keras

def load_dir_data(dir):
    """
    Read every review file in a folder and return a DataFrame with a
    sentence column (text, read quickly via tf.gfile) and a sentiment
    column (the rating level encoded in the filename).
    """
    data = {}
    data['sentence'] = []
    data['sentiment'] = []
    for file_path in os.listdir(dir):
        with tf.gfile.GFile(os.path.join(dir, file_path), 'r') as f:
            data['sentence'].append(f.read())
            data['sentiment'].append(re.match(r'\d+_(\d+)\.txt', file_path).group(1))
    return pd.DataFrame.from_dict(data)

def load_dataset(dir):
    """
    Load the pos and neg folders with load_dir_data(), label each
    DataFrame with the polarity its folder represents, then return
    the concatenated, shuffled, re-indexed result.
    """
    pos_df = load_dir_data(os.path.join(dir, 'pos'))
    neg_df = load_dir_data(os.path.join(dir, 'neg'))
    pos_df['polarity'] = 1
    neg_df['polarity'] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

def download_load_datasets(force_download=True):
    """
    Download the dataset and extract it, then process it with
    load_dataset() and return the train and test DataFrames.
    """
    dataset = tf.keras.utils.get_file(
        fname="aclImdb.tar.gz",
        origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
        extract=True)
    train_df = load_dataset(os.path.join(os.path.dirname(dataset), "aclImdb", "train"))
    test_df = load_dataset(os.path.join(os.path.dirname(dataset), "aclImdb", "test"))
    return train_df, test_df

train, test = download_load_datasets()
train.sample(5)
```

```
                                                sentence sentiment  polarity
16710  Ironically for a play unavailable on film or v...         7         1
20589  On the back burner for years (so it was report...         1         0
10690  I will stat of with the plot Alice, having sur...         4         0
5756   Envy stars some of the best. Jack Black, Ben S...         1         0
4702   Another rape of History<br /><br />This movie ...         3         0
```
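As a quick sanity check (my addition, not in the original notes), the label balance of the shuffled DataFrames can be verified directly with pandas; the IMDB train and test splits each contain 12,500 positive and 12,500 negative reviews:

```python
# Sanity check (not in the original notes): both splits should be
# balanced, with 12500 reviews per polarity.
print(train['polarity'].value_counts())
print(test['polarity'].value_counts())
```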
2. Model building and training

Build a custom sklearn pipeline: implement two classes, TextsToSequences() and Padder(), then use make_pipeline() to chain fit_on_texts(), texts_to_sequences() and pad_sequences() into a single processing flow.
```python
import warnings
warnings.filterwarnings('ignore')

import os, sys
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPool1D
from keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM
from keras.models import Model, Sequential
from keras.initializers import Constant

max_seq_len = 1000
max_num_words = 20000
embedding_dim = 100
valid_split = 0.2
vocab_size = 20000
maxlen = 1000

train_texts = train['sentence'].values
train_labels = train['polarity'].values
test_texts = test['sentence'].values
label_index = {'pos': 1, 'neg': 0}
test_labels = test['polarity'].values

# Note: TransformerMixin lives in sklearn.base, not sklearn.pipeline.
from sklearn.base import BaseEstimator, TransformerMixin

class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
    """
    Inheriting from BaseEstimator and TransformerMixin lets us define
    custom fit()/transform() methods for use in a pipeline, and
    BaseEstimator also takes care of passing through **kwargs.
    """
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def fit(self, texts, y=None):
        self.fit_on_texts(texts)
        return self

    def transform(self, texts, y=None):
        return np.array(self.texts_to_sequences(texts))

seq = TextsToSequences(num_words=vocab_size)

class Padder(BaseEstimator, TransformerMixin):
    """
    Pad/truncate variable-length sequences to a common length:
    sequences longer than maxlen keep only their tail, and
    sequences shorter than maxlen are zero-padded.
    """
    def __init__(self, maxlen=500):
        self.maxlen = maxlen
        self.max_index = None

    def fit(self, x, y=None):
        self.max_index = pad_sequences(x, maxlen=self.maxlen).max()
        return self

    def transform(self, x, y=None):
        x = pad_sequences(x, maxlen=self.maxlen)
        # Zero out any index not seen during fit so the Embedding layer
        # never receives an out-of-range index.
        x[x > self.max_index] = 0
        return x

padder = Padder(maxlen)
```
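To make the two custom steps concrete, here is a minimal sketch (my addition, not from the original notes) of what they produce on a tiny toy corpus; the actual word indices depend on word frequencies in the corpus:

```python
# Toy demo (my addition): tokenize three short texts, then pad to length 4.
# pad_sequences pads/truncates from the front by default, so tails are kept.
toy_texts = ["a good movie", "a bad movie", "fun good movie"]
toy_seq = TextsToSequences(num_words=50)
toy_padder = Padder(maxlen=4)
toy_x = toy_padder.fit_transform(toy_seq.fit_transform(toy_texts))
print(toy_x.shape)  # (3, 4): each row is zero-padded on the left
```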
Train the LSTM model:
```python
from keras.models import Sequential
from keras.layers import Dense, Embedding, Bidirectional, LSTM
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.pipeline import make_pipeline

batch_size = 64
max_features = vocab_size + 1

def lstm_model(max_features):
    lstm = Sequential()
    lstm.add(Embedding(max_num_words, 128))
    lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
    lstm.add(Dense(1, activation='sigmoid'))
    lstm.compile(loss='binary_crossentropy',
                 optimizer='adam',
                 metrics=['accuracy'])
    return lstm

# KerasClassifier forwards max_features to lstm_model as a build parameter.
sklearn_lstm = KerasClassifier(build_fn=lstm_model, epochs=2, batch_size=32,
                               max_features=max_features, verbose=1)

pipeline = make_pipeline(seq, padder, sklearn_lstm)
pipeline.fit(train_texts, train_labels)
```
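Before explaining anything, it is worth checking (my addition) that the fitted pipeline maps raw strings straight to class probabilities, since that is exactly the interface lime's explainer will call later:

```python
# Sanity check (my addition): Pipeline.predict_proba delegates to the
# KerasClassifier, which returns [P(negative), P(positive)] per text.
print(pipeline.predict_proba(["What a wonderful, moving film."]))
```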
Prediction accuracy: the result looks decent, with a test accuracy of 83.74%. But which features actually drive the model, and are its predictions reasonable? This is where lime comes in.
```python
y_preds = pipeline.predict(test_texts)

from sklearn import metrics
print("Test accuracy: {:.2f}%".format(100 * metrics.accuracy_score(y_preds, test_labels)))
```

```
Test accuracy: 83.74%
```
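A single accuracy figure hides where the errors occur; a confusion matrix (my addition, standard sklearn API) shows how the mistakes split across the two classes:

```python
# Rows are true classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, y_preds))
```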
3. Explaining the model with lime

The lime workflow is:

- Pick a sample instance.
- Create the corresponding explainer.
- Use the explainer to fit a local approximation of the trained model around that instance and obtain feature weights.
- Plot the weights to see how each feature pushes the prediction (positively or negatively) and judge whether the model is behaving reasonably.
The trained LSTM model predicts sample 11 ("This was an excellent movie...") as a positive review, which also agrees with its label.
```python
idx = 11
text_sample = test_texts[idx]
class_names = ['negative', 'positive']

print('Sample {}: last 1000 words (the part the model uses)'.format(idx))
print(' '.join(text_sample.split()[-1000:]))
print("P(positive) = ", pipeline.predict_proba([text_sample])[0, 1])
print("True class: %s" % class_names[test_labels[idx]])
```

```
Sample 11: last 1000 words (the part the model uses)
This was an excellent movie - fast-paced, well-written and had an intriguing plot. The special effects were innovative, especially in the opening scene. The training segment got a bit silly but overall it was a tense movie.
1/1 [==============================] - 0s 117ms/step
P(positive) =  0.96135324
True class: positive
```
So which features drive the model's prediction? Individual words. From the weight plot, the most important ones are words like "excellent" and "tense", with weights above 0.1, while the negative words all stay below 0.05, hence the positive prediction. That looks quite reasonable.
```python
import matplotlib.pyplot as plt
import seaborn as sns
from collections import OrderedDict
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=class_names)
explanation = explainer.explain_instance(text_sample, pipeline.predict_proba,
                                         num_features=10)

weights = OrderedDict(explanation.as_list())
lime_weights = pd.DataFrame({'words': list(weights.keys()),
                             'weights': list(weights.values())})

plt.figure(figsize=(10, 8), dpi=120)
sns.barplot(x='words', y='weights', data=lime_weights)
plt.xticks(rotation=45)
plt.title('Sample {} features weights given by LIME'.format(idx))
```
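If you are working in a Jupyter notebook, lime can also render the same explanation inline (standard lime API; this note is my addition), highlighting each weighted word inside the review itself:

```python
# Alternative view: lime's built-in notebook rendering of the explanation.
explanation.show_in_notebook(text=True)
```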