1 Introduction
Hypernymy refers to the general-specific relationship between two lexical terms. Examples include biological taxonomies (e.g., vertebrate-mammal, mammal-pangolin), professions (e.g., composer-Lennon) and colors (e.g., color-green), among many others. The general term is called the hypernym and the specific one the hyponym. This relationship is crucial in language understanding and generation.
In natural language processing, hypernymy can be useful for several tasks such as question answering [7], textual entailment [4] and image detection [18].
A well-known, hand-crafted resource is WordNet [20]. It is a large lexical database whose lexical relations include hypernymy. It was originally created for English; other languages were added later through scheme transfer and translation of the English version.
Creating and maintaining the original English version required considerable human effort. The versions derived for other languages suffer from incompleteness and, in some cases, from inadequacies inherited from the transferred scheme. Furthermore, different applications require extending the hypernymy relationship to particular instances such as celebrities, song names, video games, and so on. In this context, it is not surprising that automatic hypernymy detection has been an active NLP research area in recent decades.
Supervised hypernymy detection has mainly been addressed using two sources of corpus-based information. On the one hand, the so-called path-based approaches use the sequence of terms that connects joint occurrences of related pairs (e.g., vertebrates such as mammals) [11, 22].
These methods can have low recall because both terms must be found in the same context. On the other hand, distributional approaches use the context information of each term independently. Several distributional approaches rely on word embeddings [2].
Word embeddings are a popular NLP technique that has been used on various tasks with successful results. The technique consists of assigning vector representations to words using their contexts in a large text corpus. These methods condense distributional information about words or characters, based on the idea that a word can be characterized by its use [8]. Many methods to learn word embeddings have been proposed recently [19, 23, 17, 13].
HypeNET [26] combines both approaches, distributional and path-based. It is a neural network model that takes as input the word embeddings of two candidate terms and a path-based representation. The latter is built from the LSTM [12] representations of the paths that connect the terms in the corpus. The authors showed that this approach improves on the best distributional one on a dataset created for this purpose. This dataset comes in two partitions that differ in the vocabulary overlap among train, validation and test: a random split and a lexically disjoint split that accounts for the possibility of lexical memorization [17].
In this work we propose to build an order embedding as a method for hypernymy detection. An order embedding is a mapping between two partially ordered sets that preserves and reflects the order; it is injective but not necessarily surjective. Using the order embedding learning strategy presented by Vendrov et al. [29], we find that it is possible to partially order word embeddings, and we apply this to hypernymy detection. For the order embedding itself we use an artificial neural network that consumes pretrained word embeddings and outputs non-negative vectors. The network is trained contrastively using positive and negative instances of the relationship.
Compound terms like "grizzly bear" are common in language. To deal with them we consider two alternatives: a character based embedding of an underscored version of the terms, and a representation using the embedding of each word and convolutional layers as the input layers of the neural network. We try different feed forward networks for the mapping. We perform our experiments on a publicly available dataset [26].
We show that this simple approach outperforms the best distributional and path-based approaches under the same conditions of data usage.
Our use of the order embeddings in conjunction with a neural network exhibits the shape of a Siamese Network [21] with an asymmetric distance measure. Influenced by the use of Siamese Networks in one-shot learning, we study the behavior of the presented model by training it on different slices of the training data.
2 Related Work
Hypernymy detection in NLP can be framed as a supervised or an unsupervised learning task, depending on the available information. Supervised approaches rely on pairs annotated with whether or not they belong to the relationship. In contrast, unsupervised approaches do not use annotated instances; they rely solely on the distributional inclusion hypothesis [33] or on entropy-based measures [25].
Supervised approaches have mainly used two types of information: paths and context distributions (or word embeddings). Path-based approaches use the paths of words that connect pairs holding the hypernymy relationship. Hand-crafted paths were used as patterns for hypernymy extraction [11]. For example, the path "is a type of" would match cases like "tuna is a type of fish", allowing one to conclude that "tuna" is a hyponym of "fish".
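As an illustration of the pattern-based idea, the following minimal sketch matches the single hand-crafted path mentioned above with a regular expression. The pattern, the function name `extract_pairs` and the restriction to single-word terms are assumptions made for this example; they are not taken from [11].

```python
import re

# Hypothetical pattern for illustration: "<hyponym> is a type of <hypernym>".
# It only covers single-word terms; real systems use many patterns and
# handle full noun phrases.
TYPE_OF = re.compile(r"\b(\w+) is a type of (\w+)\b", re.IGNORECASE)


def extract_pairs(text):
    """Return the (hyponym, hypernym) pairs matched by the pattern, lowercased."""
    return [(m.group(1).lower(), m.group(2).lower()) for m in TYPE_OF.finditer(text)]


pairs = extract_pairs("Tuna is a type of fish. A pangolin is a type of mammal.")
print(pairs)
```

A pair is only found when both terms occur in the same sentence, which is the recall limitation of path-based approaches.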
In addition, paths in the syntactic dependency tree of joint occurrences in a corpus are useful for hypernymy detection [27]. In later work, path patterns were generalized using part-of-speech tags and ontology types [22]. The main disadvantage of path-based approaches is that both candidates must occur simultaneously in the same context.
Distributional approaches use the contexts of the words in a corpus to represent each term. Many methods apply supervised classification after combining the pair of term representations with a binary vector operator. Operators such as vector concatenation [2] and difference [24, 9, 32] were considered. Vylomova et al. studied the behavior of vector difference on a wider set of lexical relations and remarked on the importance of negative training data to improve the results [31]. Ustalov et al. performed hypernym extraction based on projection learning [28]. Instead of classifying the pair of representations, they learned a mapping that projects hyponym embeddings to their respective hypernyms, also remarking on the importance of negative sampling.
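The two binary vector operators mentioned above can be sketched as follows. The function names and the toy vectors are hypothetical, and the direction of the difference (hypernym minus hyponym) is an assumption of this example, since conventions vary across the cited works.

```python
def concat_features(hypo_vec, hyper_vec):
    """Concatenate the two term vectors into one feature vector."""
    return list(hypo_vec) + list(hyper_vec)


def diff_features(hypo_vec, hyper_vec):
    """Element-wise difference, here taken as hypernym minus hyponym (assumed direction)."""
    return [h - x for x, h in zip(hypo_vec, hyper_vec)]


# Toy two-dimensional vectors, not real embeddings.
hyponym = [1.0, 2.0]
hypernym = [0.5, 3.0]
print(concat_features(hyponym, hypernym))
print(diff_features(hyponym, hypernym))
```

Either feature vector would then be passed to a supervised classifier.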
Shwartz et al. combined path-based and distributional information to improve hypernymy detection [26]. They concatenated the embeddings of the two terms to be classified with a representation of all paths between the terms in a dependency-parsed corpus. This representation was built by averaging the LSTM representation of each path. Additionally, they introduced a dataset for lexical entailment on which they tested their model.
LEAR (Lexical Entailment Attract-Repel) [30] gives state-of-the-art performance on hypernymy detection by specializing word embeddings with WordNet constraints. The direction of the asymmetric relation is encoded in the resulting vector norms, while cosine distance jointly enforces the semantic similarity of synonyms. The resulting vectors are specialized simultaneously for lexical relatedness and entailment.
3 Model
In mathematics, an order embedding is a monotone function from one partially ordered set into another that also reflects the order. In this work we apply the order embedding presented by Vendrov et al. [29] to hypernymy detection, using for the mapping a neural network that takes pretrained word embeddings as input. The neural network is trained on supervised examples via back-propagation. We show that the learned transformation can unravel the embeddings to detect the hypernymy relationship.
The supervised data consist of hypernymy instances (positive examples) and unrelated pairs (negative examples). We consider two alternatives for compound terms: feed-forward networks that consume a character-based embedding of the whole term, and convolutional neural networks that build the compound term representation from the embedding of each word.
Vendrov et al. already provided an application to hypernymy detection. They showed the representational power of the introduced order embeddings by comparing their results to the transitive closure on a randomly split dataset of WordNet constraints. Their application is built solely from WordNet constraints. Our work differs in that we aim to study the capability of pretrained word embeddings for hypernymy detection through an order embedding realized as a neural network trained with supervision.
In particular, we are interested in the capability of the model to predict the relationship between two terms that have not been seen during training. Note that in their application example the model cannot predict an adequate answer for unseen pairs, since it has no information about either of the two terms. In that sense, our work is more similar to their application to textual entailment, where they consider a GRU [6] sentence representation that uses word embeddings as input and performs transfer learning.
We observe that the presented use of order embeddings in conjunction with a neural network exhibits the shape of a Siamese Network [21]. Siamese networks come from the area of computer vision and were introduced for signature verification [3]. That work maps pairs of images with the same convolutional neural network so that signatures of the same person are taken to equal output vectors, and uses vector distance as the measure of equality. Later, Siamese Networks were used in one-shot learning [16], showing strong capabilities to discriminate image features in scenarios where a single example of each class is observed at training time. In NLP, Siamese Networks were considered for sentence similarity using LSTM-based representations of the input sentences [21].
3.1 Order Embeddings
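An order embedding is a function $f : X \to Y$ between two partially ordered sets $(X, \preceq_X)$ and $(Y, \preceq_Y)$ such that, for all $u, v \in X$,
$$u \preceq_X v \iff f(u) \preceq_Y f(v) \tag{1}$$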
Note that an order embedding is necessarily an injective function but it may not be surjective, differentiating it from an order isomorphism. In fact, an order embedding provides a way to embed one partially ordered space into another preserving the order structure.
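For a finite set, the defining property of an order embedding can be checked exhaustively. The sketch below is only an illustration: the helper `is_order_embedding` and the toy posets (divisibility on a few integers, mapped to their sets of divisors ordered by inclusion) are hypothetical examples, unrelated to the hypernymy data.

```python
from itertools import product


def is_order_embedding(elements, leq_x, f, leq_y):
    """Check that u <=_X v holds exactly when f(u) <=_Y f(v), for every pair."""
    return all(leq_x(u, v) == leq_y(f(u), f(v)) for u, v in product(elements, repeat=2))


def divides(u, v):
    """Partial order on positive integers: u divides v."""
    return v % u == 0


def divisors(n):
    """Map an integer to the set of its divisors."""
    return frozenset(d for d in range(1, n + 1) if n % d == 0)


def subset(a, b):
    """Partial order on sets: inclusion."""
    return a <= b


def at_most(a, b):
    """Usual total order on integers."""
    return a <= b


numbers = [1, 2, 3, 4, 6, 12]
embeds = is_order_embedding(numbers, divides, divisors, subset)
only_monotone = is_order_embedding(numbers, divides, lambda n: n, at_most)
print(embeds, only_monotone)
```

The identity map into the integers with their usual order is monotone but not an order embedding, because 2 is at most 3 although 2 does not divide 3.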
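Vendrov et al. present a way to embed an arbitrary set with an application-dependent hierarchical structure into $\mathbb{R}^N_+$ endowed with the reversed product order,
$$x \preceq y \iff \bigwedge_{i=1}^{N} x_i \geq y_i \tag{2}$$
where $x, y \in \mathbb{R}^N_+$ and $N$ is the dimension of the target space.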
3.1.1 Loss Function
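The partial order relation (2) is a hard constraint. Vendrov et al. [29] relax it with a penalty that measures the degree to which a pair of points violates the order:
$$E(x, y) = \left\| \max(0, y - x) \right\|^2 \tag{3}$$
where the maximum is taken element-wise.
Additionally, $E(x, y) = 0$ if and only if $x_i \geq y_i$ for every coordinate $i$, that is, if and only if $x \preceq y$.
Note that the penalty is asymmetric: in general $E(x, y) \neq E(y, x)$. The mapping $f$ is trained with the max-margin loss
$$\sum_{(u, v) \in P} E(f(u), f(v)) + \sum_{(u', v') \in N} \max\{0, \alpha - E(f(u'), f(v'))\} \tag{4}$$
where $P$ and $N$ are the sets of positive and negative pairs, respectively, and $\alpha$ is a margin.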
3.2 Mapping and Compound Terms
As mentioned above, we use neural networks as order embeddings that map word embeddings, partially ordered by the hypernymy relation, into a non-negative vector space ordered by the reversed product order.
We use equation (4) as the loss function to train the neural network. We expect that, if word embeddings contain the information needed to distinguish hypernymy, the learned transformation can reveal it.
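A minimal sketch of the order-violation penalty and the contrastive max-margin loss of Vendrov et al. [29], written for vectors that have already been mapped to the non-negative output space. The function names, the toy vectors and the margin value of 1.0 are assumptions of this example, and pairs are assumed to be ordered as (hyponym, hypernym).

```python
def order_violation(x, y):
    """Penalty E(x, y) = ||max(0, y - x)||^2; it is zero iff x_i >= y_i for all i."""
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


def contrastive_loss(positives, negatives, margin=1.0):
    """Sum of penalties on positive pairs plus hinge terms on negative pairs.

    The margin value is an assumption of this sketch.
    """
    pos = sum(order_violation(u, v) for u, v in positives)
    neg = sum(max(0.0, margin - order_violation(u, v)) for u, v in negatives)
    return pos + neg


# Toy output vectors: under the reversed product order the more general term
# lies closer to the origin.
mammal = [2.0, 3.0]
vertebrate = [1.0, 1.0]

print(order_violation(mammal, vertebrate))   # order satisfied, no penalty
print(order_violation(vertebrate, mammal))   # reversed pair is penalized
print(contrastive_loss([(mammal, vertebrate)], [(vertebrate, mammal)]))
```

Because the penalty is asymmetric, swapping the arguments changes its value, which is what lets the model encode the direction of the relation.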
So, lets consider
The dataset contains compound terms, that is, terms made up of several words (e.g. "contemporary art"). We considered two variants for handling compound terms: 1) One-dimensional convolutional layers as input layers, which build the representation of the compound term from the word embedding of each of its parts. 2) A FastText-generated representation of the term with spaces replaced by underscores. This representation is then the input of a feed-forward network.
We performed most of our experiments with the second approach, since it gives better results and, because it has no out-of-vocabulary words, allows the complete dataset to be used. Figure 1 shows a diagram of the model.
The activation function of the output layer determines the range of the output values. Hence, for this model we were limited to output activation functions with non-negative values (such as the sigmoid function and ReLU). We considered three variants of feed-forward networks according to their activation functions: (1) a ReLU network; (2) SELU-ReLU, a network with a ReLU output layer and SELU [15] functions on its hidden layers; and (3) tanh-sigmoid, a network with a sigmoid function on its output layer and tanh on its hidden layers.
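The following sketch shows the forward pass of the SELU-ReLU variant in plain Python and illustrates why the ReLU output layer guarantees non-negative outputs. The layer sizes, the random initialization and all function names are assumptions of this example; the weights are untrained.

```python
import math
import random

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def relu(x):
    return max(0.0, x)


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b), one output per weight row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Random Gaussian weights and zero biases (toy initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def selu_relu_network(vector, layers):
    """SELU on the hidden layers, ReLU on the output layer."""
    hidden = vector
    for weights, biases in layers[:-1]:
        hidden = dense(hidden, weights, biases, selu)
    weights, biases = layers[-1]
    return dense(hidden, weights, biases, relu)


rng = random.Random(0)
layers = [init_layer(8, 6, rng), init_layer(6, 4, rng)]  # toy sizes
term_vector = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for a word embedding
output = selu_relu_network(term_vector, layers)
print(output)
```

In the Siamese setting the same network is applied to both terms of a pair, and the two outputs are then compared with the order-violation penalty.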
4 Experiments
In this section we describe the experiments conducted, the resources used and the model parameters. First we describe the supervised dataset and the word embeddings, followed by the model structure and parameters. We conclude the section with comments on the results and an error analysis.
4.1 Datasets and Word Embeddings
We use the dataset introduced by Shwartz et al. [26] to perform our experiments. The dataset consists of related and unrelated pairs of terms. It was created using distant supervision from a variety of knowledge resources such as WordNet [20] and DBPedia [1], among others.
The dataset provides two variants: a lexical and a random split. Each variant consists of a division into train, validation and test sets. The lexical split is a partitioning without lexical intersection between any of the three parts. That is, if a term occurs in one subset, it will not occur in any pair of the other two subsets. The other split was performed randomly. The dataset sizes are detailed in Table 1.
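The lexical-split property can be stated as a simple check: the vocabularies of the three subsets must be pairwise disjoint. The sketch below assumes that each subset is a list of (hyponym, hypernym) tuples with labels omitted; the function names and the toy pairs are hypothetical.

```python
def vocabulary(pairs):
    """Set of all terms that occur in any pair of a subset."""
    return {term for pair in pairs for term in pair}


def is_lexical_split(train, valid, test):
    """True if no term is shared between any two of the three subsets."""
    vocabs = [vocabulary(train), vocabulary(valid), vocabulary(test)]
    return all(
        vocabs[i].isdisjoint(vocabs[j])
        for i in range(3)
        for j in range(i + 1, 3)
    )


train = [("mammal", "vertebrate")]
valid = [("green", "color")]
disjoint_test = [("tuna", "fish")]
leaky_test = [("pangolin", "mammal")]  # shares "mammal" with train

print(is_lexical_split(train, valid, disjoint_test))
print(is_lexical_split(train, valid, leaky_test))
```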
Split | Set | Positive | Negative | Total |
Random | Train | 9,942 | 39,533 | 49,475 |
Random | Valid | 681 | 2,853 | 3,534 |
Random | Test | 3,512 | 14,158 | 17,670 |
Lexical | Train | 4,067 | 16,268 | 20,335 |
Lexical | Valid | 270 | 1,080 | 1,350 |
Lexical | Test | 1,322 | 5,288 | 6,610 |
For the word embeddings we consider the publicly available English GloVe 6B vectors and FastText [13] vectors trained on English Wikipedia with default parameters.
The supervised data contain out-of-vocabulary (OOV) terms with respect to the GloVe 6B
embeddings, and we discarded the affected tuples for training and evaluation. The
amount of discarded examples varies between
The embeddings built with FastText have no out-of-vocabulary terms, as they are character based.
4.2 Mapping Details
We tried several hyperparameter configurations for the neural networks. We
considered ReLU, tanh, sigmoid and SELU layers. For the first three we use
dropout between them and for the latter we use alpha dropout. We consider a
We observed improved behavior of the model with SELUs in the first layers and
ReLU units in the output layer. For the convolutional approach, we considered
one and two convolutional layers next to the input, with a convolution size of
two word vectors,
We train our models using Adam [14] with a learning rate of
We considered the model output classified as positive if
4.3 Results
Figure 2 shows the evolution of the accuracy of the SELU-ReLU model on the training and validation sets of the lexical split. Accuracy progresses jointly on both sets, suggesting that the model is able to distinguish the hypernymy relation between terms that have not been seen during training.
We evaluate our models using precision, recall and F measure, and present the results in Table 2. For comparison we include the results of the best distributional model and of the integrated HypeNET model reported by Shwartz et al. [26]. The reported results are the best, according to the validation set, of three runs of a three-layer network with 600, 400 and 200 units in the input, hidden and output layers, respectively.
Model | Random P | Random R | Random F | Lexical P | Lexical R | Lexical F |
Best Distributional [26] | 0.901 | 0.637 | 0.746 | 0.754 | 0.551 | 0.637 |
HypeNET Integrated [26] | 0.913 | 0.890 | 0.901 | 0.809 | 0.617 | 0.700 |
Siamese ReLU | 0.936 | 0.876 | 0.905 | 0.958 | 0.615 | 0.749 |
Siamese SELU-ReLU | 0.932 | 0.845 | 0.887 | 0.740 | 0.872 | 0.801 |
Siamese tanh-sigmoid | 0.967 | 0.836 | 0.897 | 0.788 | 0.756 | 0.771 |
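For reference, the evaluation measures can be computed as in the sketch below, which assumes that the F measure is the balanced harmonic mean of precision and recall. The function names are hypothetical; the two checks at the end use precision and recall values taken from the table above.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Balanced F measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Lexical split, precision and recall as reported in the table.
selu_relu_f = round(f_measure(0.740, 0.872), 3)
relu_f = round(f_measure(0.958, 0.615), 3)
print(selu_relu_f, relu_f)
```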
4.4 Results on Reduced Data
Inspired by the success of Siamese Networks in one-shot learning, we study the performance of the model when the available training data is restricted. Figure 3 shows the F measure obtained with the SELU-ReLU model trained on gradually increasing portions of the training data. Note that the results increase rapidly over the first 20% of the training data in both the random and the lexical split.
We include the results detailing precision and recall in Table 3. Note that the model first achieves high coverage and then seems to refine its results.
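The reduced-data setting of this section can be sketched as below. How the slices were actually drawn is not specified here, so the nested prefixes of a single shuffled copy, the fixed seed and the function name `training_slices` are all assumptions of this example.

```python
import random


def training_slices(examples, fractions, seed=0):
    """Return nested training subsets, one per fraction of the shuffled data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return {
        frac: shuffled[:max(1, round(frac * len(shuffled)))]
        for frac in fractions
    }


slices = training_slices(range(100), [0.1, 0.2, 1.0])
print({frac: len(subset) for frac, subset in slices.items()})
```

Using nested prefixes means every larger slice contains the smaller ones, so differences between runs reflect only the added examples.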
4.5 Results Analysis
In this section we include some analysis of the results obtained with the SELU-ReLU model on the lexical split dataset.
In Table 4 we include the confusion
matrix. Note that
A sample of pairs for which the model fails to predict correctly can be found in Table 5. We note that many of the positive pairs that the model fails to predict involve ambiguous terms. In the samples presented, note for example that stubbs refers to the surname of William Stubbs but can be confused with the artist George Stubbs, among others. The same holds for sting, which can be confused with the singer.
Hyponym | Hypernym |
building | structure |
contentment | happiness |
diver | swimmer |
cosmos | flower |
moment | present |
stubbs | historian |
sting | pain |
Regarding the pairs that the model wrongly detects as related, we observe that
the term novel is involved in
In Table 6 we show a sample of pairs that seem to be incorrectly labelled as negative and that the model predicts as positive. Note that these examples mainly correspond to occupations (like writers or actors) and creations (like novels). However, there are also other pairs that seem to be incorrectly labelled and that do not correspond to particular instances such as occupations or creations.
Hyponym | Hypernym |
voltaire | writer |
bill mantlo | writer |
fledgling | novel |
summerland | novel |
sathyaraj | actor |
ferdinand de saussure | linguist |
menecrates | sculptor |
nicomedes | mathematician |
kofax | software |
encarta | encyclopedia |
abode | place |
beretta | weapon |
For example, abode as hyponym of place and beretta as hyponym of weapon are predicted as related, while in the data they appear as negative examples.
Finally, Table 7 presents a sample of pairs that have been incorrectly predicted as related. The difference between these samples and those presented in Table 6 is that these do not seem to be incorrectly labelled. However, note that in the case of stump as hyponym of tree, although a stump is not a variety of tree, it is a tree that has fallen or been cut down. And in the case of rice and sake there is a composed-of relation, since sake is made from rice.
5 Conclusion
We present a distributional model for supervised hypernym detection. The model relies on learning an order embedding from word embeddings into a non negative vector space. For the order embedding itself we consider feed forward artificial neural networks and we explore different model configurations.
We show that this approach gives competitive results on a publicly available dataset in comparison to the best distributional and path-based approaches reported on the same data. We also study the performance of the model when the available training data is restricted. We found that the model gives relatively good results using less than 10% of the training data. This suggests that the model tends to learn the order from data even when relatively few examples are provided.