1 Introduction
A knowledge base is a useful resource for knowledge management; well-known examples include WordNet11 and Freebase1. Knowledge bases consist of a large number of knowledge facts in the form of triplets (left-entity, relationship, right-entity), each stating that the relationship holds between the left entity and the right entity. Knowledge bases are important for human reasoning, question answering, query expansion, and other AI tasks. However, they usually suffer from incompleteness, because knowledge keeps growing and new entities are missing from them.
Much work has been done on knowledge base completion4,2,12,3. However, most of these models only predict how likely additional facts (triplets) are to hold using the knowledge already in the KB; they cannot add new entities. In this paper, we propose a new framework for extending a knowledge base that can add new entities to it by connecting free text and the knowledge base.
Our contributions in this paper are the following:
We propose a new perspective on extending a knowledge base by adding new entities from free text;
We present a framework that extends a knowledge base with a DNN, word embeddings and entity latent representations;
Empirical results demonstrate that our models perform well.
In the rest of the paper, we first review related work in Section 2 and then introduce our framework for extending knowledge bases in Section 3. In Section 4, we report experiments on real data sets. We conclude and sketch future work directions in Section 5.
2 Related Work
We briefly introduce some of the related work in this section.
2.1 Word Embedding
Word representation (distributed representation, word embedding) was first proposed by Hinton7. A word representation is a vector learned by a language model from a large unlabeled free-text corpus. The idea of using a neural network to train a language model was first presented by Xu15. Word representations learned by neural network language models can be used for many NLP tasks such as POS tagging, chunking, named entity recognition, and semantic and syntactic similarity. Many language models have subsequently been proposed5. Mikolov et al. (2013) proposed two state-of-the-art models (CBOW and Skip-gram)1 that capture better semantic and syntactic word similarity8,9,10.
2.2 Embedding-based Model
Many energy-based embedding models7,8,9 have recently been proposed, focusing on increasing expressivity but at a higher computational cost. Bordes et al.10 proposed a simpler model, TransE, whose drawback is that it can only model linear triplets. Wang et al. proposed TransH, an extension of TransE, which faces the same issue. Several further models have recently been proposed for this purpose16.
3 A Framework for Extending Knowledge Bases with a Multimodal Deep Neural Network, Word Embedding and Entity Latent Representation
In this section, we introduce a framework for extending a knowledge base, shown in Fig. 1. First, we present the word embedding approach and a latent model used to learn entity latent representations. Then we present a multimodal deep neural network. Finally, we show the implementation for extending the knowledge base.
3.1 Word Representations Learned by a Language Model from Unstructured Data (Free Text) and Encoding of Structured Data (Knowledge Base)
Word representations are word vectors that can be used as features for other models. In this work, we choose two typical word representations. One is the approach of (Mikolov et al., 2013), known as Word2vec, which uses two main model architectures, CBOW and skip-gram, for learning. The resulting word vectors have interesting properties and capture many linguistic regularities. To take the two famous examples, the vector operation vector('China') - vector('Beijing') + vector('Tokyo') results in a vector that is very close to vector('Japan'), and vector('king') - vector('man') + vector('woman') is close to vector('queen'). The Word2vec release also offers more than 1.4M pre-trained entity vectors named according to Freebase. The other representation is proposed in6, known as SENNA. We directly use the resulting word vectors from it as the pre-trained word embeddings for WordNet; how to train them is not the focus of this paper.
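The analogy behavior described above can be reproduced with publicly released pre-trained vectors. The minimal sketch below assumes the Word2vec Google News vectors are available locally (the file name is a placeholder, not part of our framework) and uses the gensim library to run the king/queen analogy.

```python
# Minimal sketch: querying pre-trained Word2vec vectors for the analogy
# vector('king') - vector('man') + vector('woman') ~= vector('queen').
# The file name below is a placeholder for whatever pre-trained vectors are used.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
```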
In order to connect the latent knowledge in the knowledge base with the word representations learned by a language model from free text, we propose to encode the entities and relationships of the triplets into a latent embedding space. This allows us to build a model, such as a deep neural network, that can compute the plausibility of additional new triplets for extending the knowledge base. As introduced in Section 2, several models have recently been proposed for this purpose. In this work, we follow the Pairwise-interaction Differentiated Embedding model (PIDE)16, which is reported to perform well.
Next, we briefly review the Pairwise-interaction Differentiated Embedding model. Given a training set H of triplets (subject s, predicate p, object o), the model learns latent representations of entities and predicates such that a scoring function g(s, p, o) assigns higher values to plausible triplets than to corrupted ones; the exact formulation and training objective are given in16.
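Since the exact PIDE scoring function is given in16 rather than here, the sketch below is only an illustration of how such embedding models are typically trained: it uses a generic TransE-style score g(s, p, o) = -||e_s + r_p - e_o|| with a margin ranking loss over corrupted triplets, as a stand-in for (not a reproduction of) the PIDE objective.

```python
import torch
import torch.nn as nn

# Illustrative stand-in: a TransE-style score trained with a margin ranking
# loss; the actual PIDE formulation and objective are those of reference 16.
class TripletScorer(nn.Module):
    def __init__(self, n_entities, n_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)   # entity latent vectors
        self.rel = nn.Embedding(n_relations, dim)  # predicate latent vectors

    def score(self, s, p, o):
        # g(s, p, o) = -||e_s + r_p - e_o||_2 ; higher means more plausible
        return -(self.ent(s) + self.rel(p) - self.ent(o)).norm(p=2, dim=-1)

def margin_loss(model, pos, neg, margin=1.0):
    # pos, neg: (s, p, o) index tensors for observed and corrupted triplets
    g_pos = model.score(*pos)
    g_neg = model.score(*neg)
    return torch.clamp(margin - g_pos + g_neg, min=0).mean()
```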
3.2 Learning a Multimodal DNN Connecting Word Embedding and Entity Latent Representation
As introduced above, we can learn two types of latent representations for the same word (entity)2. The word embedding of the entity is learned from free text by word2vec. We also learn another latent representation of the word (entity) from structured data (the knowledge base). In theory, both latent representations of the same entity should capture its semantic information, but the two vectors differ because they derive from two different latent semantic spaces: one from the free-text corpus and the other from the structured knowledge base. We therefore look for a pipeline that can connect the two spaces, or transform one representation into the other. In this paper, we propose to use a deep neural network model to connect word embeddings and entity latent representations. We use a large number of pre-learned pairs (word embedding, entity latent representation) to train the deep neural network (DNN) model, which builds a pipeline across the two latent semantic spaces. If an additional new entity exists in the free text but not in the knowledge base, i.e., its word embedding is known but it has no knowledge base representation, we can obtain its latent representation by feeding its word embedding into the DNN model.
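As a minimal sketch of this pipeline, the feed-forward mapper below takes a word embedding (free-text space) as input and predicts an entity latent representation (knowledge base space) as output. The layer sizes follow the WordNet ANN setting reported in Section 4 (50, 500, 200, 100), and the training procedure (MSE loss with Adam) is a simplification; it is closer to the shallow ANN baseline than to the deep autoencoder actually used.

```python
import torch
import torch.nn as nn

# Sketch of the pipeline between the two latent spaces: maps a 50-d word
# embedding to a 100-d entity representation. Layer sizes follow the WordNet
# setting in Section 4; MSE + Adam is a simplification of the training used.
mapper = nn.Sequential(
    nn.Linear(50, 500), nn.Sigmoid(),
    nn.Linear(500, 200), nn.Sigmoid(),
    nn.Linear(200, 100),
)

def train_mapper(word_vecs, entity_vecs, epochs=500, lr=1e-4):
    # word_vecs: (N, 50) word embeddings from free text
    # entity_vecs: (N, 100) entity latent vectors pre-learned from the KB
    opt = torch.optim.Adam(mapper.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(mapper(word_vecs), entity_vecs)
        loss.backward()
        opt.step()
    return mapper
```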
3.3 Implementation for Extending Knowledge Base
Next, we present the workflow of the implementation and show how to extend the knowledge base from the beginning. As shown in Fig. 1, a new entity arrives at the testing phase. We look up its word embedding pre-learned from free text and feed this word embedding into the multimodal DNN. Through the model, we obtain an entity representation, i.e., a latent semantic vector in the knowledge base space. Finally, we compute the scoring function for all possible triplets containing the new entity. A candidate triplet is accepted if its score exceeds the threshold. In this way we extend the knowledge base.
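The following is a hedged sketch of that testing-phase workflow: `mapper`, `scorer`, `entities`, `relations` and `threshold` are placeholders for the trained DNN, the triplet scoring function g(s, p, o), the existing KB vectors and the validation-tuned threshold, respectively.

```python
def extend_kb(new_word_vec, mapper, scorer, entities, relations, threshold):
    """Sketch of the Fig. 1 workflow. new_word_vec: pre-learned word embedding
    of an entity absent from the KB; mapper: trained DNN pipeline; scorer:
    g(s, p, o) over latent vectors; threshold: tuned on the validation set.
    All names here are illustrative placeholders."""
    new_ent = mapper(new_word_vec)  # latent vector in the KB space
    accepted = []
    for r_id, r_vec in relations.items():
        for e_id, e_vec in entities.items():
            # Try the new entity on the left; the right-hand case is symmetric.
            if scorer(new_ent, r_vec, e_vec) > threshold:
                accepted.append(("NEW_ENTITY", r_id, e_id))
    return accepted
```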
4 Experiments
Our proposed framework is evaluated on data sets extracted from WordNet (Miller 1995) and Freebase (Bollacker et al. 2008) for extending knowledge bases. In this section, we introduce the data sets, evaluation metrics and baselines for the experiments, and then present the experimental results on extending knowledge bases.
4.1 Data Set
WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. Examples of triplets are (_payment_NN_1, _hyponym, _recompense_NN_1) or (_flint_NN_3, _part_of, _wolverine_state_NN_1).3 Here we do not distinguish the same entity with different part-of-speech tags, because its word representation learned by the language model from free text is unique. We create a data set from WordNet and the word representations (trained for about two months over Wikipedia) provided by SENNA4. This data set includes 47,176 triplets with 10,556 entities and 18 relationships, which were randomly split into three parts (Train, Valid, Test). The training set includes 35,016 triplets with 8,541 entities and 18 relationships, and is further split into three parts for PIDE model learning in the KB. The validation set includes 4,283 triplets with 723 extra new entities that do not appear in the training set. LEFT-NEW, RIGHT-NEW and BOTH-NEW indicate that the left entity, the right entity, or both entities of a triplet are new entities. The test set includes 7,877 triplets with 1,292 extra new entities. This data set is denoted WN-WR10K in the rest of this section.
Freebase is a large collaborative knowledge base of general facts, currently including around 1.2 billion triplets and more than 80 million entities. We created the data set from FB15K3, extracted from Freebase, and the Freebase entity vectors5 (word representations) trained on 100B words from various news articles. FB15K is a subset of Freebase including 592,213 triplets with 14,951 entities and 1,345 relationships. We remove the entities that do not have a word representation from FB15K, which results in 471,648 triplets with 13,868 entities and 1,271 relationships. We then process it in the same way as the WordNet data. This data set is denoted FB-WR14K in the rest of this section; its statistics are shown in Table 1.
4.2 Evaluation Metrics
In the experiments, we use the ranking criteria of (Bordes et al. 2011) for evaluation. First, for each test triplet, we remove the subject entity and replace it in turn by each entity of the dictionary. The function values g(s, p, o) of these negative triplets are computed by the corresponding model and sorted in descending order, which gives the exact rank of the correct entity among the candidates. Similarly, we repeat the whole procedure while removing the object entity instead of the subject entity of the test triplet. We have three kinds of test triplets: LEFT-NEW, RIGHT-NEW and BOTH-NEW. We therefore use two evaluation metrics for comparison on the three data sets: the mean of the predicted ranks and the proportion of correct entities ranked in the top 10 (Hits@10(%)). These are reported as LR: Left Rank6; LH10: Left Hits@10(%); RR: Right Rank; RH10: Right Hits@10(%); MR: Mean Rank; MH10: Mean Hits@10(%).
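For concreteness, the small sketch below computes mean rank and Hits@10 from a list of ranks, assuming each test triplet has already been scored against all candidate replacement entities as described above.

```python
import numpy as np

def rank_metrics(ranks):
    """ranks: 1-based rank of the correct entity for each test triplet,
    obtained by sorting g(s, p, o) over all candidate replacements."""
    ranks = np.asarray(ranks)
    mean_rank = ranks.mean()                       # MR (or LR / RR per side)
    hits_at_10 = 100.0 * (ranks <= 10).mean()      # Hits@10 in percent
    return mean_rank, hits_at_10

# Left and right ranks are computed separately (LR/RR, LH10/RH10);
# MR and MH10 average over both prediction directions.
```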
4.3 Baseline and Configuration
In these experiments, we choose two baseline models for comparison. The first is the Un-transform model, in which we directly use the word embedding as the entity latent representation, without any model as a pipeline connecting the two latent semantic spaces. The second is a shallow neural network transform model, an artificial neural network (ANN). We use 4 layers for the ANN model and train it with the back-propagation (BP) learning algorithm.
4.4 Learning the Entity Representation with the PIDE Model
For learning the latent representations of entities and predicate relationships in TRAIN-TN with the PIDE model, we selected the best parameters: {
4.5 Extending Knowledge Bases
Using the two data sets WN-WR10K and FB-WR14K, we test the different models on extending the knowledge bases. First, we directly use the word embeddings as the entity representations in the knowledge base for extension (denoted Un-transform). For FB-WR14K, we first use PCA7 to reduce the 1K dimensions to 50 dimensions for consistency with the dimension of the PIDE model. For the ANN model, the WordNet configuration is 4 layers (50, 500, 200, 100) with a learning rate of 0.0001. We choose the sigmoid function as the activation function. We train it for 500 epochs, and we concatenate the semantic and syntactic representations pre-trained by the PIDE model as the output targets of the training data (dimension = 100). On FB-WR14K, we also choose 4 layers (1000, 500, 200, 100); the other parameters are the same. For the DNN model, we choose a deep autoencoder model; the layer setting is (50, 500, 1000, 200, 100) on WN-WR10K and (1000, 2000, 1000, 500, 100) on FB-WR14K. We use the typical contrastive divergence learning algorithm for optimization.
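The PCA step for the FB-WR14K Un-transform baseline can be realized, for instance, with scikit-learn; the sketch below assumes the 1,000-dimensional Freebase word vectors are stacked in a matrix (random data keeps the example self-contained).

```python
import numpy as np
from sklearn.decomposition import PCA

# freebase_vecs_1000d stands for the (N, 1000) matrix of Freebase word
# vectors; random data is used here only to keep the sketch runnable.
freebase_vecs_1000d = np.random.randn(13868, 1000)

# Reduce to 50 dimensions so the Un-transform baseline matches the
# dimensionality of the PIDE entity embeddings.
pca = PCA(n_components=50)
freebase_vecs_50d = pca.fit_transform(freebase_vecs_1000d)  # (N, 50)
```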
We can observe from Tables 3 and 4 that the Un-transform method performs worst. This indicates that the two latent semantic spaces, the word embedding space and the entity latent space, are not the same: even for the same word or entity, the semantic expressions differ between the two spaces, and we cannot directly exchange one for the other without an additional model. The results improve when we use the ANN as the pipeline connecting the two latent spaces. We can also observe from the tables that the DNN performs best. The ANN is a typical shallow neural network, which is hard to optimize and may not be effective; the DNN is a deep learning model that can be trained effectively with contrastive divergence.
5 Conclusion and Future work
Extending knowledge bases is a problem of great importance. In this paper, we propose a framework to extend a knowledge base with a multimodal deep neural network, word embeddings and entity latent representations. Experiments demonstrate its good performance. In future work, we will explore how to integrate the models and further improve performance.