1 Introduction
Today, thanks to the global pervasiveness of the Web, we have access to large quantities of open and collaborative sources of textual data. Nonetheless, all this information, in the form of language units and related linguistic attributes, requires a proper method for representation, querying and analysis that takes into account the attributes of each unit within its local and global context. With these requirements in mind, researchers have used language networks1 for a long time24 to model complex linguistic data. Indeed, a complex network lets us examine the information from both local and global perspectives. However, it is only recently, with the growth of computational power, that we have been able to exploit linguistic networks at a larger scale. In short, in order to extract useful knowledge from a linguistic network we need a representation that can combine diverse kinds of language attributes (and the relations among them), as well as facilitate the application of graph analytic algorithms.
The characteristics of a linguistic network vary according to the needs of the natural language processing (NLP) task we are trying to solve. Still, we can identify two general aspects: the type of network used to hold the information and the algorithms applied to it to extract new insights. In this context, the work presented in this article has two goals: (1) to review recent linguistic network models used to solve semantic NLP tasks, and (2) to propose a novel linguistic network that addresses some of the structural limitations of the works studied.
Accordingly, we first provide a simplified and organized state of the art of linguistic networks in the domain of Word Sense Disambiguation and Induction (WSD and WSI). As we will see, relatively few approaches go beyond using classic lexical co-occurrence information as a source to discriminate contexts and different senses of a word. It is our intuition that by leveraging different types of linguistic relations we can obtain more pertinent results on a given semantic task. To this end, we propose a network model that is able to hold diverse kinds of language information and allows for a simple manipulation of the data contained in it. Using this schema, we perform a proof of concept to illustrate the advantages of using such structures for the word sense disambiguation and word sense induction tasks. It is not our goal to compete against the best systems for WSD and WSI described in the literature, which most of the time require parameter tuning. Our aim is rather to show that linguistic networks enable us to encode more fine-grained language information, which we can leverage to address NLP tasks better than with basic lexical co-occurrence information.
We organize the paper as follows. In Section 2 we introduce basic concepts, and in Section 3 we review network-based approaches to semantic-similarity tasks, specifically from a graph-centric view. In Section 4 we propose a linguistic network based on hypergraphs. Next, we show the potential utility of such a network in Section 5. Finally, we present our conclusion and future research in Section 6.
2 Background
Below we will delineate the preliminary concepts used throughout the rest of our paper. We introduce the concept of linguistic network (or language network) as well as the semantic tasks we are interested in.
Linguistic Network We define a Linguistic Network (LN) as a model of human language in terms of a graph structure. Usually, textual entities (e.g., letters, words, phrases) are linked together by means of grammatical or semantic relations5. A network structure allows us to study the characteristics of said relations in order to extract useful knowledge from them.
In this work we focus on two aspects of a linguistic network: the type of LN, with regard to its contents, and the graph algorithms applied to the network to solve a given NLP task. We concentrate on tasks related to word semantics. Among these, two highly popular ones are word sense disambiguation and word sense induction.
Word Sense Disambiguation (WSD) Given a target word tw, a context ct, and a set mn containing possible meanings for tw, the goal of WSD is to determine which meaning in mn corresponds to tw according to the context ct. This task is usually solved by leveraging a dictionary or thesaurus that establishes semantic links between word senses. This type of resource is also known as a Lexical Knowledge Base (LKB)2. An LKB can be defined as an ontology that relates words according to their semantic relations. Two quintessential examples of an LKB are the Wordnet semantic dictionary17 and BabelNet22.
Word Sense Induction (WSI) The methods employed to solve WSD are generally unsupervised, that is, they do not require an annotated corpus to infer the appropriate sense for a given word. Nonetheless, a certain level of supervision can be distinguished in these approaches. Indeed, LKBs are, most of the time, built using human supervision. In order to circumvent this constraint, researchers have devised fully unsupervised techniques to automatically find the senses mn of a word tw by leveraging a background corpus. Once the senses have been induced, these approaches perform WSD. This task is named Word Sense Induction (WSI).
3 State of the Art
According to their objectives, we can consider two types of contributions in the linguistic-network literature5: on the one hand, there are approaches that investigate the nature of language via a graph representation, and on the other hand, there are those that propose a practical solution to a given NLP problem. In that regard we can cite the following survey papers18,19,1,16.
This article focuses on the latter type of approach. Moreover, we pay particular attention to two aspects of a given network-based technique: (1) the characteristics of the linguistic data within the network, and (2) the algorithms used to extract knowledge from it.
Having introduced the notion of an LN and the tasks of interest, we now turn to our literature review. As defined above, we characterize an LN by two aspects: the type of language network and the nature of the algorithms applied to it.
3.1 Types of Linguistic Networks
In the following paragraphs we introduce the general categories of LNs according to their type of content and relations. We will introduce these categories as well as the approaches that make use of them.
In 16, four types of LNs are defined: co-occurrence networks, dependency networks, semantic networks and similarity networks. Meanwhile, from a deeper linguistic point of view, 5 defines broader categories, each having several sub-types. The main difference (in our context) between both definitions lies in the separation of categories. In 5, syntactic-dependency and co-occurrence networks are conflated into the same category: word co-occurrence networks. Similarly, semantic and similarity networks are joined together and placed inside a broader category of lexical networks. The third family defined concerns phonological networks, which are out of the scope of this paper. In this work we will explore four categories of linguistic networks: semantic, lexical co-occurrence, syntactic co-occurrence and heterogeneous networks. The following sections explain what each kind of network represents, mention works that employ it, and list the main methodological differences from one approach to another.
Semantic Networks A Semantic Network (SN) relates words, or concepts, according to their meaning. The classical example of an SN is the renowned knowledge base Wordnet. This network, which also serves as an ontology, contains sets of synonyms (called synsets) as vertices and semantic relations as edges. Typical semantic relationships include synonym-antonym, hypernym-hyponym and holonym-meronym. However, other semantic similarities can be defined. The edges are usually not weighted, although in some cases certain graph similarity measures may be used.
Word sense disambiguation is indeed a task usually solved using semantic networks, especially Wordnet (and, to a lesser extent, BabelNet)15,25,26,21,3. Given an input text with a set of ambiguous target words to process, these approaches follow a two-step algorithm:
1. Link target words (usually nouns, without stop-words and functional words) with their corresponding senses (or synsets in the case of Wordnet-like dictionaries) and extract their vertices and edges into a new, smaller SN.
2. Apply a node ranking technique, usually a random walk based method, and select, for each ambiguous word in the input text, its top ranking synset node as the correct sense.
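The two steps can be illustrated with a minimal sketch. It assumes that step 1 has already produced a small sense graph; the sense identifiers, the edges and the choice of plain PageRank as the random walk method are hypothetical and serve only as an illustration, not as the procedure of any cited system.

```python
# Toy sense graph (hypothetical): nodes are candidate senses,
# undirected edges are semantic relations kept after step 1.
EDGES = [
    ("bank#finance", "money#currency"),
    ("bank#finance", "deposit#payment"),
    ("bank#finance", "loan#credit"),
    ("bank#finance", "account#banking"),
    ("deposit#payment", "money#currency"),
    ("deposit#payment", "account#banking"),
    ("bank#river", "deposit#sediment"),
]

# Candidate senses of each ambiguous target word (hypothetical).
CANDIDATES = {
    "bank": ["bank#finance", "bank#river"],
    "deposit": ["deposit#payment", "deposit#sediment"],
}


def build_adjacency(edges):
    """Build a symmetric adjacency map from undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pagerank(adj, damping=0.85, iterations=50):
    """Plain PageRank by power iteration (every node has a neighbor)."""
    nodes = sorted(adj)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = sum(rank[m] / len(adj[m]) for m in adj[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def disambiguate(candidates, rank):
    """Step 2: keep the top ranking candidate sense of each target word."""
    return {
        word: max(senses, key=lambda s: rank.get(s, 0.0))
        for word, senses in candidates.items()
    }


ADJ = build_adjacency(EDGES)
RANK = pagerank(ADJ)
SENSES = disambiguate(CANDIDATES, RANK)
print(SENSES)
```

In this toy graph the financial senses are better connected than the river senses, so they receive the higher rank for both target words.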
Lexical Co-occurrence Networks Most co-occurrence based intuitions in NLP have their origin in the distributional hypothesis8. The idea is summarized by the well-known phrase “a word is characterized by the company it keeps”7. That is to say, words with similar neighbor words (or contexts) tend to be semantically similar.
This intuition has been exploited extensively in NLP. One of the most effective ways of representing word co-occurrences is by means of a graph structure. Indeed, this kind of graph is the backbone of a Lexical Co-occurrence Network (LCN). In these structures, nodes represent words and edges indicate co-occurrence between them, i.e., two words appear together in the same context. A context can vary from a couple of words (before or after a given word) to a full document, although it is usually defined at sentence level. The weight of an edge represents the strength of a link and is generally a frequency-based metric that takes into account the number of occurrences of each word independently and together.
To solve a task in a completely unsupervised way, researchers generally use this kind of network instead of LKBs. It is then natural that word sense induction approaches leverage lexical co-occurrence networks, and in turn the distributional hypothesis, to automatically discover senses for a given target word. That is why WSI methods27,12,20 are tightly related to LCNs. The cited works use an LCN as described before, while other works such as 21,23 represent the co-occurrences by means of a hypergraph schema. In short, a hypergraph is a generalization of a graph where an edge (called a hyperedge) can link more than two vertices, and thus it is able to provide a more complete description of the interactions between several nodes6.
WSI systems generally perform four steps. Given an input text with a set of target words and their contexts (target words must have several instances throughout the document to cluster them), the steps are the following:
1. Build an LCN, assigning tokens as nodes and establishing edges between them if they co-occur in a given context (usually if they both appear in the same sentence).
2. Determine the weights for each edge according to a frequency metric.
3. Apply a graph clustering algorithm. Each cluster found will represent a sense of the polysemous word.
4. Match target word instances with the clusters found by leveraging each target word context. Specifically, assign a cluster (a sense) to each instance by looking at the tokens in the context.
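Steps 1 and 2 can be sketched as follows. The sketch assumes sentence-level contexts and uses the Dice coefficient as the frequency metric; Dice is only one possible choice, picked here for illustration, and the toy sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_lcn(sentences):
    """Return {(word_a, word_b): weight} for words sharing a sentence.

    The weight is the Dice coefficient 2 * f(a, b) / (f(a) + f(b)), where
    f counts the sentences containing a word (or a pair of words).
    """
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))  # count each word once per sentence
        word_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return {
        (a, b): 2.0 * n / (word_freq[a] + word_freq[b])
        for (a, b), n in pair_freq.items()
    }


# Hypothetical, already tokenized sentences.
SENTENCES = [
    ["bank", "money", "loan"],
    ["bank", "river", "water"],
    ["money", "loan"],
]

WEIGHTS = build_lcn(SENTENCES)
for pair, weight in sorted(WEIGHTS.items()):
    print(pair, round(weight, 3))
```

Edge keys are alphabetically ordered pairs, so each undirected edge is stored once.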
Syntactic Co-occurrence Networks A Syntactic Co-occurrence Network (SCN) is very similar to an LCN in the sense that both exploit the distributional hypothesis. Nonetheless, SCNs go further by leveraging syntactic information extracted from the text. There are two main types of syntactic information, both represented as tree structures: constituency-based parse trees and dependency-based parse trees. Briefly, the former structure splits a phrase into several sub-phrases, which gives us a glimpse of the role of each word inside a phrase. The latter describes the relationships existing between words in the phrase. SCNs employ, most of the time, dependency trees to create a graph that relates words according to their syntactic relations. In 10, a graph is built using syntactic dependencies and then used to perform WSI with an approach very similar to that of systems using LCNs. We note that SCNs are scarcely used in WSD or WSI systems, and therefore they are an interesting research avenue to explore.
4 Heterogeneous Linguistic Network: Our Proposal
In Section 3 we mentioned two disadvantages of the language networks covered there, namely the lack of syntactic information and the homogeneous nature of the networks. In this section we propose a language network that, at this point of our research, addresses both of these concerns. Building upon previous linguistic representations12,13,23, our model is based on the use of a hypergraph. Hypergraphs have been employed in the literature to model complex systems. Their single most important difference, being able to relate more than two vertices at the same time, allows for a better characterization of interactions within a set of individual elements (in our case, words)9.
Indeed, our hypergraph model integrates four types of relations between tokens in a single linguistic structure: sentence co-occurrence, part-of-speech tags, word constituent data and dependency relations. We group words together according to these features.
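Formally, a hypergraph is a generalization of a graph defined as a tuple (V, E), where V is a set of nodes and E is a set of hyperedges, each hyperedge being a subset of V.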
In our case, the set of tokens in the corpus is the set of nodes V, and the set of hyperedges E represents the relations between nodes according to different linguistic aspects. Each hyperedge may be one of three types: noun phrase3 constituents (CONST), dependency relations (DEP), or sentence context (SEN). We consider that a token v belongs to a hyperedge of type CONST or SEN if the token appears in the same noun phrase or in the same sentence, respectively. A token v belongs to a hyperedge of type DEP if it is the dependent of a certain dependency relation coupled with its corresponding head (or governor). The hypergraph can be represented as an n x m incidence matrix H with entries
We illustrate our hypergraph incidence matrix with the following example phrase: The report contains copies of the minutes of these meetings. We tokenize the phrase, keeping all the words, and we lemmatize and parse it to obtain both constituency and dependency trees.
The constituency tree of the example phrase is shown in Figure 1. The sentence, as well as each noun phrase (NP) node, is identified by a number. We can observe that this phrase is composed of five noun phrases (NP) and one verb phrase. Meanwhile, some of the NPs are formed by other kinds of phrases, depending on the grammar production rule used to build each one of them. As is usual in this kind of structure, there is a one-to-one relation between the tokens in the sentence and the leaves of the tree.
The dependencies of the example phrase are shown in Table 1. They indicate the syntactic relation between the governor of a phrase and a dependent. In each relation, the head appears first, followed by the dependent word.
root(root, contains) | det(minutes, the) |
det(report, The) | nmod(copies, minutes) |
nsubj(contains, report) | case(meetings, of) |
dobj(contains, copies) | det(meetings, these) |
case(minutes, of) | nmod(minutes, meetings) |
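As an illustration, the relations of Table 1 can be grouped into DEP hyperedges by pairing each relation with its head, so that every dependent joins the hyperedge named after that pair. This is a minimal sketch: the `relation_head` label format follows the dobj_contains naming used in the text, lemmatization is omitted, and the function name is hypothetical.

```python
from collections import defaultdict

# (relation, head, dependent) triples copied from Table 1.
DEPENDENCIES = [
    ("root", "root", "contains"),
    ("det", "report", "The"),
    ("nsubj", "contains", "report"),
    ("dobj", "contains", "copies"),
    ("case", "minutes", "of"),
    ("det", "minutes", "the"),
    ("nmod", "copies", "minutes"),
    ("case", "meetings", "of"),
    ("det", "meetings", "these"),
    ("nmod", "minutes", "meetings"),
]


def dependency_hyperedges(triples):
    """Group dependents under a hyperedge labelled 'relation_head'."""
    edges = defaultdict(set)
    for relation, head, dependent in triples:
        edges[f"{relation}_{head}"].add(dependent)
    return dict(edges)


DEP_EDGES = dependency_hyperedges(DEPENDENCIES)
print(DEP_EDGES["dobj_contains"])
```

With a single sentence every hyperedge holds one token; over a whole corpus, a hyperedge such as dobj_contains would collect every word observed as a direct object of contains.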
From both of these types of information we can build a hypergraph representation as stated before. The incidence matrix is illustrated in Table 2. For brevity, we only show nouns, the first three noun phrases, and the nominal subject (nsubj) and direct object (dobj) dependency relations. Looking at the table, we can infer that the word copies appears in two hyperedges of type CONST. The first is NP2, which is built from an NP and two prepositional phrases (PP). The second is NP3, which consists of a plural noun (NNS). Regarding the syntactic dependency hyperedges, the word copies appears in the dobj_contains column, which indicates that copies is the direct object of the verb contains. Finally, we can see that copies appears in the same sentence S1 as the other three nouns (report, minutes and meetings).
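The description above can be turned into a small incidence matrix. The sketch below is an assumption-laden reconstruction: the membership of NP1 and NP2 is inferred from the example sentence rather than copied from Table 2, and entries are taken to be binary (1 if the token belongs to the hyperedge, 0 otherwise), which may differ from the weighting actually used.

```python
# Nouns of the example sentence (rows of the matrix).
NOUNS = ["report", "copies", "minutes", "meetings"]

# Hyperedges (columns). NP memberships are assumed from the sentence
# "The report contains copies of the minutes of these meetings."
HYPEREDGES = {
    "NP1": {"report"},
    "NP2": {"copies", "minutes", "meetings"},
    "NP3": {"copies"},
    "nsubj_contains": {"report"},
    "dobj_contains": {"copies"},
    "S1": {"report", "copies", "minutes", "meetings"},
}


def incidence_matrix(nodes, hyperedges):
    """Return (column labels, rows) of a binary node-by-hyperedge matrix."""
    labels = list(hyperedges)
    rows = [
        [1 if node in hyperedges[label] else 0 for label in labels]
        for node in nodes
    ]
    return labels, rows


LABELS, MATRIX = incidence_matrix(NOUNS, HYPEREDGES)
print(LABELS)
for noun, row in zip(NOUNS, MATRIX):
    print(noun, row)
```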
5 Proof of Concept: Word Sense Induction and Disambiguation
In this section we carry out a proof of concept experiment to verify the potential of our proposed network model. We use the task of word sense induction and disambiguation as an application context for our procedure. As stated before, we do not aim to create a system able to beat the reviewed WSD or WSI techniques. Instead, our goal is to show that by using other kinds of language information we can improve on the results obtained with classic lexical co-occurrence, and thus emphasize the utility of using diverse linguistic information, in our case through a language hypergraph structure.
5.1 Methodology
The task is the following: we are given a document d with several target words tw and multiple paragraph instances for each tw. We consider each of these paragraphs as the context ct of a target word tw. The goal is to first automatically determine a set of senses for a given tw (WSI), and then assign one meaning to each of its instances (WSD).
As described before, WSI (including WSD) is usually solved following four steps: (1) creating a linguistic network, (2) determining the level of similarity between nodes within the network, (3) clustering nodes together, thus creating individual senses, and (4) assigning a cluster (sense) to each instance of a target word in the input document.
In our process, we follow an approach similar to those used in 27,12. In short, these methods build a lexical co-occurrence network from a background corpus and then exploit the real-world characteristics of said networks by theorizing that certain important nodes (called hubs) play a significant role among the words contained in the network and therefore may represent, coupled with their neighbors, a sense for a given target word.
In our approach, we generate a network for each tw, and the high-degree nodes found inside this network ideally represent a sense of tw. As presented in the previous sections, we use a hypergraph structure, similar to the one used in 12.
Creation of the linguistic network In the previous sections we worked with the English Wikipedia as background corpus to build and model our proposed linguistic network. Given the large size of Wikipedia, and in order to iterate faster on our experiments, we decided to switch to a corpus of more manageable size. We use the Open American National Corpus (OANC)11 as background document collection to build a hypergraph network
At each step, that is, for each 𝑡𝑤 in the input document, we extract a subgraph
Computing similarity between nodes In order to computationally treat
We compute the Jaccard index between each node
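As a minimal sketch of this computation, the Jaccard index of two nodes can be obtained from the sets of hyperedges each node belongs to. Which sets are actually compared is an assumption here, and the memberships below are hypothetical toy data reusing the labels of the running example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(memberships):
    """Similarity of every pair of nodes, keyed by an unordered pair."""
    return {
        frozenset((u, v)): jaccard(memberships[u], memberships[v])
        for u, v in combinations(sorted(memberships), 2)
    }


# Hypothetical hyperedge memberships of three nodes.
MEMBERSHIPS = {
    "copies": {"NP2", "NP3", "dobj_contains", "S1"},
    "minutes": {"NP2", "S1"},
    "report": {"NP1", "nsubj_contains", "S1"},
}

SIMILARITY = pairwise_jaccard(MEMBERSHIPS)
print(SIMILARITY[frozenset(("copies", "minutes"))])
```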
Clustering words together Once the incidence matrix
The former (line 9) is the minimum degree a node must have. It is determined automatically: a node is taken into account only if its degree is above the 85th percentile of all the calculated degrees. This value was chosen experimentally. The latter (from line 11 to 17) sets a minimum limit to the average of the Jaccard similarities between each pair of neighbors of node
where
If node n satisfies both thresholds
The process is repeated until no more nodes satisfy both boundaries. When the process is complete, we obtain a set of senses
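The hub selection loop can be sketched roughly as follows. Several details are assumptions made for illustration only: candidates are visited by decreasing degree, the degree threshold is computed once with a linearly interpolated percentile, and an accepted hub is removed together with its neighbors. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def percentile(values, q):
    """q-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = q / 100 * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def average_similarity(neighbors, sim):
    """Mean similarity over all pairs of neighbors (0.0 without pairs)."""
    pairs = list(combinations(sorted(neighbors), 2))
    if not pairs:
        return 0.0
    return sum(sim.get(frozenset(p), 0.0) for p in pairs) / len(pairs)


def induce_senses(edges, sim, th2, degree_percentile=85):
    """Repeatedly pick hubs that satisfy both thresholds as senses."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    th1 = percentile([len(nb) for nb in adj.values()], degree_percentile)
    senses = []
    while True:
        candidates = sorted(adj, key=lambda n: (-len(adj[n]), n))
        hub = next(
            (n for n in candidates
             if len(adj[n]) > th1 and average_similarity(adj[n], sim) >= th2),
            None,
        )
        if hub is None:
            break
        members = {hub} | adj[hub]
        senses.append(members)
        for m in members:          # remove the sense from the network
            adj.pop(m, None)
        for nb in adj.values():
            nb -= members
    return senses


EDGES = [("money", "cash"), ("money", "credit"), ("money", "loan"),
         ("river", "shore"), ("river", "stream"), ("river", "water")]
SIM = {frozenset(p): 0.5 for p in combinations(["cash", "credit", "loan"], 2)}
SIM.update({frozenset(p): 0.4
            for p in combinations(["shore", "stream", "water"], 2)})

SENSES = induce_senses(EDGES, SIM, th2=0.3)
print(SENSES)
```

Raising th2 to 0.45 in this toy example keeps only the money hub, since the neighbors of river have an average similarity of about 0.4.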
Sense assignment The assignment of a sense consists in looking at each tw instance represented by a context ct and simply determining which sense s in
5.2 Experiments and Results
The objective of this proof of concept is to show the advantages of using syntactic co-occurrence information compared to simple lexical co-occurrence. To this end, we solve the word sense induction and disambiguation tasks using the method described in the previous subsection. We create two independent systems: LEX, which uses lexical co-occurrence hyperedges, and DEP, which employs syntactic dependency hyperedges.
As evaluation dataset, we employ the data provided for Task 02 of Semeval-20072, which evaluated word sense induction systems. The data consists of 100 target words4 (65 verbs and 35 nouns), each target word having a set of paragraph contexts where it appears. From the available performance assessment techniques, supervised and unsupervised, we are interested in the unsupervised evaluation, which is rated using the F-score produced by an evaluation script. We also modify the script to obtain the precision and recall measures in order to build a precision-recall curve.
Each type of language information has its own characteristics. The sub-network formed by sentence hyperedges tends to have a much smaller number of nodes (words) than those of the dependency type. This makes sense, as sentences usually contain a few words, whereas a dependency hyperedge may incorporate upwards of hundreds of words that are related to a word by the same dependency relation. These characteristics affect the similarity between vertices and thus led us to set the threshold values (th1 and th2) for LEX and for DEP as a function of the percentiles of the node degree and similarity value distributions, respectively.
This leaves only one threshold left,
The F-score of both systems, and the average number of clusters (senses) produced, are shown in Figure 2. In our experiment, the dependency-based model DEP performed better than LEX, which uses classic lexical co-occurrence. We include the result of the UOY system as a similar-method benchmark. In UOY, two background corpora are also used to build a linguistic network of lexical co-occurrences. One of the corpora is the same as the one we used to evaluate. This allows their system to induce the exact senses used in each target word instance. While this is a practical idea, we, by using a large, multiple-domain corpus, are able to induce word senses that may not even be used in the Semeval dataset. Concerning the thresholds, we use percentiles to automatically adapt to the characteristics of the hyperedges, as the lexical and dependency co-occurrence hyperedges behave differently within the linguistic network.
In Figure 3 we observe that, even with different threshold values, we generally achieve better recall and precision by using syntactic dependencies. It must be noted that this particular Semeval task was dominated by the most frequent sense baseline, with an F-score of 80.7, which assigns an average of one sense per target word. Our solutions assign an average of 1.257 and 1.200 senses for LEX and DEP respectively. Verb analysis and comparisons with other datasets and other systems will be available in the final version.
Based on our proof of concept experiment, we find that using syntactic dependencies to disambiguate word senses can improve the results when compared with regular lexical co-occurrence approaches.
6 Conclusion and Future Work
In this paper we analyzed the state of the art of linguistic network-based approaches to semantic similarity tasks from a graph-centric point of view. We reviewed the techniques in terms of their graph characteristics, from their structure to the algorithms employed. Among the literature covered, certain unexplored research paths were identified, namely the lack of syntactic data in the networks employed and, as a consequence, a homogeneous network nature that only allows for relations of a single type.
We addressed these limitations by proposing a hypergraph linguistic model that is able to hold heterogeneous language information. We believe that this structure allows the integration of multiple kinds of information and has vast potential in terms of the algorithms it can be used with. Our model was tested in a word sense induction proof of concept experiment, which yielded encouraging results. Again, we note that the approach proposed to solve word sense disambiguation and induction is a proof of concept and, as encouraging as the results are, we still need to improve the system in order to compete with the best solutions in the state of the art.
As future work, we are currently extending our algorithm to properly combine the different types of information within our model. We would like to test other kinds of graph induction (instead of transforming the hypergraph into a bipartite graph) or, even better, use the incidence matrix of the hypergraph to calculate custom similarity metrics. In this same context, we believe that a deep analysis of the semantic meaning of different types of similarities (and their magnitudes) between words is needed to better determine which metric to use in a specific context. Finally, we also plan to address other NLP domains with our hypergraph model, notably information extraction problems.