from IPython.display import Image
from IPython.display import SVG
Abstract¶
In this article we use skip-thought vectors [1], a state-of-the-art sentence encoder model, and evaluate its performance on the textual entailment task.
Introduction¶
Textual entailment [2] is an important task in natural language processing.
It is a directional relation between two text fragments: one is called the Text (T), the other the Hypothesis (H).
There are three different relations between T and H:
- T entails H
- T contradicts H
- neutral relation
Related works¶
For related work on textual entailment, see the SNLI paper [3].
Skip-Thought Vectors¶
Skip-Thoughts uses an encoder-decoder architecture like the one in [4]. It has one encoder for the current sentence and two decoders, one for the previous sentence and one for the next. The encoder and the decoders each use a single GRU [4] layer.
The following picture shows the encoder equations in detail:
Image('./encoder_equations.png')
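As a rough illustration of what a single-layer GRU encoder computes, here is a minimal NumPy sketch. It assumes the standard GRU form (reset gate, update gate, candidate state) described in [1]. The parameter names, the tiny dimensions, and the random weights are hypothetical and are not taken from the pre-trained model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step; `params` is a hypothetical dict of weight matrices."""
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)        # reset gate
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)        # update gate
    h_bar = np.tanh(params["W"] @ x_t + params["U"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_bar                            # new hidden state

def encode(word_vectors, params, hidden_size):
    """Run the GRU over a sentence; the last hidden state is the sentence vector."""
    h = np.zeros(hidden_size)
    for x_t in word_vectors:
        h = gru_step(x_t, h, params)
    return h

# Toy dimensions for illustration only (not the sizes of the real model).
rng = np.random.default_rng(0)
embed_size, hidden_size = 4, 3
params = {name: rng.normal(scale=0.1, size=(hidden_size, embed_size))
          for name in ("W_r", "W_z", "W")}
params.update({name: rng.normal(scale=0.1, size=(hidden_size, hidden_size))
               for name in ("U_r", "U_z", "U")})

sentence = rng.normal(size=(5, embed_size))  # five random "word embeddings"
h_final = encode(sentence, params, hidden_size)
```

The final hidden state plays the role of the encoder output $h_i$ for the sentence.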
The decoder equations follow, conditioned on the encoder output $h_i$:
Image('./decoder_equations.png')
The published skip-thoughts code differs slightly from the paper [1]; see Issue 8 for the reason.
The following picture visualizes the Skip-Thoughts model as implemented in the published training code:
Image('./skip-thoughts_model.png')
After training, the encoder converts the word embeddings of a sentence into a sentence vector.
Textual entailment on skip-thoughts¶
Our main work is evaluating the Skip-Thoughts model on the RTE-1, RTE-2, and RTE-3 datasets and on the SNLI dataset.
We evaluated four methods, plus one that combines features, for training classifiers that predict whether T entails H.
Encoding Text and Hypothesis as sentence vectors¶
We use the pre-trained uni-skip and bi-skip models from skip-thoughts to encode T and H as two vectors, $v_t$ and $v_h$.
Each sentence vector has 4800 dimensions.
Then, following the semantic relatedness experiment in the paper [1], we concatenate $|v_t - v_h|$ and the component-wise product $v_t \cdot v_h$ as features (9600 dimensions).
We then use scikit-learn's LogisticRegressionCV to find the best accuracy on the test data.
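The feature construction can be sketched as follows. This is a minimal illustration with random stand-in vectors, not output from the real encoder, and the helper name `pair_features` is hypothetical. Since the result has 9600 dimensions, $v_t \cdot v_h$ is taken to be a component-wise product rather than a scalar dot product.

```python
import numpy as np

def pair_features(v_t, v_h):
    """Concatenate |v_t - v_h| with the component-wise product v_t * v_h."""
    v_t = np.asarray(v_t, dtype=float)
    v_h = np.asarray(v_h, dtype=float)
    return np.concatenate([np.abs(v_t - v_h), v_t * v_h])

# Random stand-ins for two 4800-dimensional sentence vectors.
rng = np.random.default_rng(0)
v_t = rng.normal(size=4800)
v_h = rng.normal(size=4800)

features = pair_features(v_t, v_h)
print(features.shape)  # (9600,)
```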
Using word2vec cosine similarity between words¶
For each word in the Hypothesis, we take its maximum cosine similarity with any word in the Text, then average these maxima. We do the same for each word in the Text against the words in the Hypothesis. This gives two features: hypothesis_mean_cosine and text_mean_cosine.
The following code computes these features for a DataFrame holding an RTE dataset:
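import numpy as np


def handle(df, word2vec):
    """Return an (n_pairs, 2) array: [text_mean_cosine, hypothesis_mean_cosine]."""
    # word2vec: a loaded word2vec model (e.g. gensim KeyedVectors) that
    # supports `word in model` and `model.similarity(w1, w2)`
    data_cosines = np.empty((len(df), 2))
    # enumerate gives a positional row number even if df has a non-default index
    for pos, (_, row) in enumerate(df.iterrows()):
        text = row.text.split()
        hypothesis = row.hypothesis.split()
        # sims[i, j] = cosine similarity of Text word i and Hypothesis word j
        sims = np.zeros((len(text), len(hypothesis)))
        for i, w1 in enumerate(text):
            for j, w2 in enumerate(hypothesis):
                # out-of-vocabulary words keep a similarity of 0
                if w1 in word2vec and w2 in word2vec:
                    sims[i, j] = word2vec.similarity(w1, w2)
        # best Hypothesis match for each Text word, averaged
        text_mean_cosine = np.mean(np.max(sims, axis=1))
        # best Text match for each Hypothesis word, averaged
        hypothesis_mean_cosine = np.mean(np.max(sims, axis=0))
        data_cosines[pos, 0] = text_mean_cosine
        data_cosines[pos, 1] = hypothesis_mean_cosine
    return data_cosines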
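To make the two features concrete, here is a small worked example on toy 2-dimensional word vectors. It is a vectorized sketch of the same mean-of-max computation. The function name `mean_max_cosines` and the vectors are illustrative assumptions, not part of the original code.

```python
import numpy as np

def mean_max_cosines(text_vecs, hyp_vecs):
    """Return (text_mean_cosine, hypothesis_mean_cosine) for two sets of word vectors."""
    t = np.asarray(text_vecs, dtype=float)
    h = np.asarray(hyp_vecs, dtype=float)
    # Normalize rows so that a dot product equals the cosine similarity.
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    h = h / np.linalg.norm(h, axis=1, keepdims=True)
    sims = t @ h.T  # sims[i, j] = cosine(Text word i, Hypothesis word j)
    return sims.max(axis=1).mean(), sims.max(axis=0).mean()

# The Text has two orthogonal "words"; the Hypothesis has one word equal to the first.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
hyp_vecs = [[1.0, 0.0]]
text_mean, hyp_mean = mean_max_cosines(text_vecs, hyp_vecs)
print(text_mean, hyp_mean)  # 0.5 1.0
```

The single Hypothesis word is fully matched by the Text (1.0), while only one of the two Text words is matched by the Hypothesis (0.5). This asymmetry is why the two directions are kept as separate features.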
We train a logistic regression classifier on these two features, using scikit-learn's LogisticRegressionCV to find the best hyperparameter C.
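A minimal sketch of this step on synthetic data is shown below. The two feature columns are random stand-ins for the cosine features, and the values `Cs=10`, `cv=5`, and the 150/50 split are assumptions chosen for illustration, not the settings used in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
# Synthetic stand-in for the two cosine features: entailed pairs score slightly higher.
features = rng.normal(loc=0.5 + 0.2 * labels[:, None], scale=0.1, size=(n, 2))

# Cross-validate over a grid of 10 values of C on the training portion.
clf = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
clf.fit(features[:150], labels[:150])

test_accuracy = clf.score(features[150:], labels[150:])
best_C = float(np.ravel(clf.C_)[0])  # the selected regularization strength
```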
Concatenating sentence vectors and cosine similarity features¶
We concatenate the preprocessed Text and Hypothesis vector features with the cosine similarity features (9602 dimensions in total), and again use scikit-learn's LogisticRegressionCV.
Using decoder output conditioned on encoder output¶
We use the output of the pre-trained decoder for the Hypothesis, conditioned on the pre-trained encoder output for the Text, as features, and likewise the decoder output for the Text conditioned on the encoder output for the Hypothesis. We use both uni-skip and bi-skip.
The following network architecture converts the Text and Hypothesis to vectors:
Image('./te_decoder_on_encoder.png')
We use the following network architecture to train the classifier that predicts whether or not T entails H:
Image('./rte_decoder_mlp.png')
Changing the objective of the Skip-Thoughts model to transfer-learn a classifier¶
We use the following neural network to train a textual entailment classifier directly:
Image('./logisticregression_decoder_on_encoder.png')
We try three approaches to transfer learning with the above network:
- use the pre-trained encoder and f_decoder weights
- use only the pre-trained encoder weights
- use no pre-trained weights
We apply dropout (0.5) to the encoder and decoder outputs, and also test approach 1 without dropout.
The logistic layer is initialized from a uniform distribution over [-0.1, 0.1].
Across the three approaches, the first achieves the best test accuracy.
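The dropout and initialization settings above can be illustrated with a short NumPy sketch. It assumes inverted dropout (kept units are rescaled by 1/(1-p)) and a zero-initialized bias, neither of which is stated in the text. The layer sizes and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_logistic_layer(n_inputs, n_classes, low=-0.1, high=0.1):
    """Weights drawn uniformly from [low, high]; the zero bias is an assumption."""
    W = rng.uniform(low, high, size=(n_inputs, n_classes))
    b = np.zeros(n_classes)
    return W, b

def dropout(x, p=0.5, train=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest by 1/(1-p)."""
    if not train:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

W, b = init_logistic_layer(n_inputs=8, n_classes=2)
hidden = np.ones((4, 8))          # stand-in for encoder/decoder outputs
dropped = dropout(hidden, p=0.5)  # training-time behaviour
logits = dropped @ W + b
```

At test time `dropout(..., train=False)` returns its input unchanged, so no rescaling is needed at inference.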
Experiments¶
import pandas as pd
We evaluate our approaches on the RTE-1, RTE-2, and RTE-3 datasets and on the SNLI dataset.
Because of memory limits, we use only one third of the SNLI training data. We merge contradiction and neutral into a single noentailment class, which makes the task a 2-class problem, and we use a neural network with no hidden layers to perform logistic regression.
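The label conversion and subsampling could look roughly like the sketch below. The helper names are hypothetical, and random sampling is an assumption: the text does not say which third of the training data was used. Examples whose gold label is neither of the three classes (SNLI marks pairs without annotator consensus with "-") are skipped.

```python
import random

# Map the three SNLI labels onto the 2-class problem.
LABEL_MAP = {
    "entailment": "entailment",
    "contradiction": "noentailment",
    "neutral": "noentailment",
}

def to_two_class(examples):
    """Convert (text, hypothesis, label) triples to 2-class labels, skipping unknown labels."""
    return [(t, h, LABEL_MAP[label]) for t, h, label in examples if label in LABEL_MAP]

def subsample(examples, fraction=1 / 3, seed=0):
    """Keep a random fraction of the examples (the sampling scheme is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(examples, int(len(examples) * fraction))

toy = [
    ("A man sleeps.", "A person rests.", "entailment"),
    ("A man sleeps.", "A man runs.", "contradiction"),
    ("A man sleeps.", "He is tired.", "neutral"),
    ("A dog barks.", "An animal makes noise.", "entailment"),
    ("A dog barks.", "A cat meows.", "-"),
    ("A dog barks.", "The dog is silent.", "contradiction"),
    ("A dog barks.", "The dog is hungry.", "neutral"),
]
two_class = to_two_class(toy)
train_subset = subsample(two_class)
```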
We use the RTE-1, RTE-2, and RTE-3 datasets shipped with nltk-data, and load them with from nltk.corpus import rte.
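A sketch of loading the pairs and flattening them into plain records is given below. It assumes the `rte` package of nltk-data is installed and relies on the `text`, `hyp`, and `value` attributes of NLTK's RTE pair objects. The helper names and the record layout are hypothetical. The NLTK import is done lazily so that the conversion helper can be tried without the corpus.

```python
from types import SimpleNamespace

def load_rte_pairs(fileids):
    """Load RTE pairs from NLTK, e.g. fileids=["rte1_dev.xml"] (needs nltk-data's 'rte')."""
    from nltk.corpus import rte  # lazy import: only needed when the corpus is read
    return rte.pairs(fileids)

def pairs_to_records(pairs):
    """Flatten pair objects into dicts with text, hypothesis, and a 0/1 entailment label."""
    return [
        {"text": p.text, "hypothesis": p.hyp, "label": int(p.value)}
        for p in pairs
    ]

# Stand-in objects with the same attributes, so the helper runs without the corpus.
fake_pairs = [
    SimpleNamespace(text="A dog runs in the park.", hyp="An animal is moving.", value=1),
    SimpleNamespace(text="A dog runs in the park.", hyp="The dog is asleep.", value=0),
]
records = pairs_to_records(fake_pairs)
```

With the corpus installed, `pairs_to_records(load_rte_pairs(["rte1_dev.xml"]))` should give records in the same shape, which can then be passed to `pandas.DataFrame`.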
Test accuracy of the first three methods on the RTE and SNLI datasets¶
logistic_results = pd.read_excel('./logistic_results.xls')
logistic_results
We also train an MLP classifier on the SNLI dataset, the same one used in the method that uses decoder output conditioned on encoder output.
The best test accuracy is 81.55%.
The following figure shows the learning curve:
Image('./encoded_snli_mlp.png')
Learning curves for using decoder output conditioned on encoder output¶
Because of time and memory limits, we apply this method only to the RTE datasets.
The following three figures show the learning curves of the neural network, which is implemented with Lasagne.
Image('./decoded_rte1_mlp.png')
Image('./decoded_rte2_mlp.png')
Image('./decoded_rte3_mlp.png')
Learning curves for changing the objective of the Skip-Thoughts model to transfer-learn a classifier¶
For the same reason, we evaluate this method only on the RTE datasets.
The following three figures show the no-dropout version of this method, using both the pre-trained encoder and decoder¶
Image('./rte1_train.png')
Image('./rte2_train.png')
Image('./rte3_train.png')
The following nine figures all use the dropout version of this method¶
Pre-trained encoder and decoder:¶
Image('./epoch20_rte1.png')
Image('./epoch20_rte2.png')
Image('./epoch20_rte3.png')
No pre-trained encoder or decoder:¶
Image('./no_pre-train_epoch20_rte1.png')
Image('./no_pre-train_epoch20_rte2.png')
Image('./no_pre-train_epoch20_rte3.png')
Pre-trained encoder only:¶
Image('./pre-train_encoder_epoch20_rte1.png')
Image('./pre-train_encoder_epoch20_rte2.png')
Image('./pre-train_encoder_epoch20_rte3.png')
Limitation¶
The RTE datasets have too few training samples.
(TODO)
Future works¶
We plan to evaluate these methods on Chinese textual entailment.
(TODO)
Conclusion¶
A pre-trained Skip-Thoughts model slightly improves performance on the textual entailment task, and simple word2vec cosine similarity features also give good results when training data is scarce.
(TODO)
References¶
- Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-Thought Vectors.
- Textual entailment. Wikipedia. https://en.wikipedia.org/wiki/Textual_entailment
- Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.