1 Introduction
In recent years, vast and increasing numbers of electronic texts have been posted on the Internet, so we need ways of automatically extracting useful information from them. Doen et al. [4] proposed a method of extracting relationship information by identifying specific keywords in text and creating a network from them, and constructed such a network based on the keyword “earthquake.” They found that the network included nodes for unrelated concepts and proposed a method of automatically deleting them.
Although they constructed a network of concepts, their aim was not to construct a dictionary, such as WordNet (word knowledge), but instead to construct a network that could be used to generate ideas and understand concepts. For example, their constructed concept network about earthquake includes a relation between an earthquake and a nuclear power plant.
The construction of concept networks is also useful for summarizing many documents [2, 10, 11]. A concept network makes it easier to grasp information about a concept that is spread across a large number of related documents.
However, their network did not include any node relationship information, making the relationships difficult to understand. In this study, we therefore propose a method of extracting character strings from newspaper articles and assigning them to links in word networks to express node relationships. By attaching to each link a character string that indicates the relevant relationship, we can obtain more detailed information from the resulting word network. The aim of this study is to make word networks more useful by making the node relationships easier to understand. Our experiments were conducted on Japanese text.
There are many studies related to relation extraction [1, 3, 5, 6, 8, 12, 13, 14]. These studies extract categories of relations (such as part-of, entity-place, and person-company) and the word pairs that stand in those relations. In contrast, our study extracts expressions that explain the relationship between a word pair in some detail.
The main contributions of our study are as follows:
— In order to deal with the issue that word networks do not include node relationship information and hence that these relationships are difficult to understand, we attach character strings indicating the node relationships to the links in the word network.
— Attaching character strings to the links makes the relationships between words easier to understand.
— We evaluate the output of the proposed method using the mean reciprocal rank (MRR), Top-1 accuracy rate, and Top-5 accuracy rate. Here, we found that they were about 0.7, 0.6, and 0.9, respectively, based on considering character strings with unnecessary or missing information to be correct.
— We consider two main methods of extracting character strings. The first uses commas and periods (i.e., punctuation marks dividing clauses) as delimiters, while the other uses periods only (punctuation marks dividing sentences). After comparing the experimental results for both methods, we found mixed results: the method using both commas and periods produced better results when only strings that correctly expressed the word relationships were considered to be correct, but the method using periods only performed better if we also allowed strings with additional information. In addition, if we allowed strings with both additional and missing information, both methods performed very similarly.
2 Network Construction
2.1 Overview
Here, we construct word networks using Doen et al.’s method [4], based on a large newspaper article dataset (Figure 1). The procedure is as follows:
1. Select a keyword that expresses the main concept around which the network is to be constructed.
2. Extract all articles from the newspaper dataset that include the keyword.
3. Apply morphological analysis to the resulting articles and extract words related to the keyword.
4. Create nodes for the five words that are most closely related to the keyword and connect them to the keyword node. Here, we use a related word extraction method based on term frequency-inverse document frequency (TF-IDF) to identify these five words.
5. Select each new word added to the network in turn as the new keyword and repeat from Step 2 to expand the network.
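The iterative procedure above can be sketched as a breadth-first expansion. This is a minimal sketch, not the authors' implementation; `extract_related` is a hypothetical helper standing in for Steps 2-4 (article extraction, morphological analysis, and TF-IDF-based related word extraction):

```python
from collections import deque

def build_network(initial_keyword, corpus, extract_related, max_depth=4):
    # `extract_related(keyword, corpus)` is a hypothetical helper that
    # returns the five words most closely related to `keyword`, each
    # paired with its TF-IDF score (Steps 2-4 of the procedure).
    edges = {}                                # (keyword, word) -> score
    queue = deque([(initial_keyword, 0)])
    seen = {initial_keyword}
    while queue:
        keyword, depth = queue.popleft()
        if depth >= max_depth:                # limit network depth
            continue
        for word, score in extract_related(keyword, corpus):
            edges[(keyword, word)] = score    # TF-IDF score as edge weight
            if word not in seen:              # Step 5: new words become keywords
                seen.add(word)
                queue.append((word, depth + 1))
    return edges
```

The experimental networks in Section 4 consist of four levels, which corresponds to stopping the expansion at a fixed depth as above.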
In this work, we also remove unrelated words from the network by adding extra procedures to Steps 4 and 5, following the method proposed by Doen et al. The method is explained in detail in the following sections.
2.2 Extracting Candidate Nodes
Denote the initial keyword by w. The candidate nodes are the words extracted, via morphological analysis, from the articles that include w (Step 3 in Section 2.1).
2.3 Selecting Nodes using TF-IDF-based Related Word Extraction
From the candidate nodes, we select the ones to actually add to the network via TF-IDF-based related word extraction. Specifically, we score the words using a TF-IDF based method and identify the five highest-scoring ones as being most closely related to the keyword. Then, we add the highest-scoring candidate words to the network as nodes, and use the TF-IDF scores as edge weights.
The TF-IDF based related word extraction method produces scores indicating the importance of particular words (candidate nodes) in the extracted articles, which are calculated as follows:
Here, N denotes the total number of articles in the dataset, tf(w) denotes the frequency of a candidate word w in the extracted articles, and df(w) denotes the number of articles that contain w. Equation 1 combines the term frequency tf(w) with the inverse document frequency log(N/df(w)), so words that appear often in the extracted articles but rarely in the dataset as a whole receive high scores.
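A standard TF-IDF scoring of candidate words can be sketched as follows. This is an illustrative sketch only; the exact form of the paper's Equation 1 may differ:

```python
import math
from collections import Counter

def tfidf_scores(extracted_articles, all_articles):
    # `extracted_articles`: tokenized articles containing the keyword.
    # `all_articles`: all tokenized articles (for document frequency).
    tf = Counter(w for article in extracted_articles for w in article)
    n_docs = len(all_articles)
    df = Counter()
    for article in all_articles:
        df.update(set(article))               # count each word once per article
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

def top_related(extracted_articles, all_articles, k=5):
    # The k highest-scoring words become new nodes in the network.
    scores = tfidf_scores(extracted_articles, all_articles)
    return sorted(scores.items(), key=lambda x: -x[1])[:k]
```

A word appearing in every article gets an IDF of log(1) = 0 and is thus never selected, which filters out uninformative high-frequency words.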
2.4 Expanding the Network
First, we extract five words based on the keyword w and add them to the network as nodes. Each of these words is then selected in turn as a new keyword, and the extraction is repeated to expand the network.
2.5 Deleting Unrelated Words
Next, we use the topic-restricted extraction method proposed by Doen et al. to delete unrelated words from the network. This involves one change to the method given in Section 2.1. Specifically, while extracting articles by repeating Step 5, we only extract those that include both the initial and current keywords. Since this means we are focusing on articles that include the initial keyword, we are likely to obtain words related to it, and unlikely to extract unrelated words (Figure 3).
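The topic restriction amounts to a simple filter on the article set. A minimal sketch, treating articles as plain strings (a real system would match morphologically analyzed tokens):

```python
def topic_restricted_articles(articles, initial_keyword, current_keyword):
    # Keep only articles that contain BOTH the initial and the current
    # keyword, so expansion from the current keyword stays on the
    # original topic and unrelated words are unlikely to be extracted.
    return [a for a in articles
            if initial_keyword in a and current_keyword in a]
```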
3 Proposed Method: Selecting Character Strings to Attach to Links
In order to make the node relationships in the word network easier to understand, we attach a character string indicating the relationship between the corresponding words to each link. For example, we might take the newspaper article dataset and the input word pair “universe” and “exploration” (Figure 6 in Section 4.1), and output the character string to attach to the link between the words. Figure 4 shows an example of assigning a string to a link in a word network. We select the character string as follows:
1. From the dataset, extract a character string (character string A) from between the two words in a particular document.
2. Extract a character string (character string B) that includes string A and is delimited by periods and commas. Table 1 gives two examples of extracting character strings A and B.
3. From string B, extract the highest-priority character string (character string C). High-priority strings are defined as those that either occur frequently or are short. This procedure (extracting string C) is repeated for all possible character strings A.
4. Select the highest-priority character string from among the strings C extracted in Step 3.
5. Attach the selected character string to the link.
Word pair | Character string A | Character string B | Original character string |
girisha (Greece), kokusai (government bonds) | no (’s) | chugoku wa zaisei saiken ni torikumu girisha no kokusai wo kounyu shi (China purchases Greece’s government bonds to assist fiscal rebuilding) | chugoku wa zaisei saiken ni torikumu girisha no kokusai wo kounyu shi, yuuro bouei ni kyouryoku suru shisei wo shimesu nado oushuu eno eikyou ryoku wo kakudai shiteiru (China purchases Greek government bonds to assist fiscal rebuilding, and indicates a desire to cooperate over the euro, including expanding influence in Europe.) |
toyota (Toyota), suiso (hydrogen) | jidousha wa (automobiles) | toyota jidousha wa suiso de ugoku nenryou denchi sha wo 2014 nendo ni kokunai de hatsubai to happyou (Toyota announces the sale of hydrogen-powered fuel cell vehicles in Japan in 2014.) | toyota jidousha wa suiso de ugoku nenryou denchi sha wo 2014 nendo ni kokunai de hatsubai to happyou. shihan wa seika hatsu to naru mitooshi (Toyota automobiles announces the sale of hydrogen-powered fuel cell vehicles in Japan in 2014. This market is expected to be the world’s first) |
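The selection procedure can be sketched as follows. This is a simplified sketch that collapses Steps 1-5: it splits each document into delimiter-bounded spans (string B), keeps those containing both words (and hence the string A between them), and picks the highest-priority span. Function names are illustrative, not from the paper:

```python
import re
from collections import Counter

def select_link_string(docs, w1, w2, delimiters=r"[、。,.]", priority=None):
    # Collect candidate strings B: delimiter-bounded spans that contain
    # both words of the pair.
    candidates = Counter()
    for doc in docs:
        for span in re.split(delimiters, doc):
            if w1 in span and w2 in span:
                candidates[span.strip()] += 1
    if not candidates:
        return None                            # no suitable string found
    if priority is None:
        # Default: frequency-based priority (in the style of Equation 2).
        priority = lambda s: candidates[s]
    return max(candidates, key=priority)
```

The `delimiters` pattern covers both Japanese and ASCII commas and periods; passing a period-only pattern yields the second extraction method compared in Section 4.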
The character string priorities are determined by one of the following three equations. Equation 2 prioritizes frequently-occurring strings, while Equation 3 prioritizes short strings and Equation 4 focuses on the ratio between frequency and length:
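The three priority schemes can be sketched as scoring functions (higher score = higher priority). The exact forms of Equations 2-4 are not reproduced here; these are plausible instantiations of "frequent," "short," and "frequency per unit length":

```python
def priority_frequency(s, freq):
    # Equation 2 (sketched): frequently-occurring strings rank higher.
    return freq[s]

def priority_short(s, freq):
    # Equation 3 (sketched): shorter strings rank higher.
    # `freq` is unused; kept for a uniform signature.
    return 1.0 / len(s)

def priority_ratio(s, freq):
    # Equation 4 (sketched): frequency relative to string length.
    return freq[s] / len(s)
```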
4 Experiments
4.1 Methods
In these experiments, we constructed word networks for the theme keywords “Toyota,” “universe,” and “Greece.” These networks consisted of 191, 228, and 99 word pairs, respectively. To build the networks for “Toyota” and “universe,” we used 102,547 articles from the Mainichi Shimbun (all from 2014). To build the network for “Greece,” we used 92,807 articles from the Mainichi Shimbun (all from 2010). Figures 5–7 give the networks for “Toyota,” “universe,” and “Greece,” respectively; all three networks consist of four levels.
4.2 Evaluation based on Human Judgment
Next, we evaluated whether or not the character strings given to the network links were appropriate. For this, we used 20 randomly chosen word pairs from each network (“Toyota,” “universe,” and “Greece”), for a total of 60 word pairs. In addition, ten randomly chosen newspaper articles including each word pair were used as reference data.
For each pair, a human participant evaluated the top five highest-priority strings produced by each of the three priority equations (Section 3), using a four-step scale and consulting the reference newspaper articles. Hereafter, we refer to the methods embodied by Equations 2, 3, and 4 as “high-frequency,” “short,” and “ratio,” respectively. Tables 2–5 show the evaluation criteria and examples representing each of the four possible grades.
Criterion | The output appropriately indicates the relationship between the two words. |
Example | wakata koichi uchu hikoushi: ISS sencho (Koichi Wakata astronaut (universe flying pilot): Captain of the ISS) |
Criterion | The output appropriately indicates the relationship between the two words, but includes additional information. |
Example | nihonjin hatsu no sencho wo tsutometa wakata kouichi uchu hikoushi (50) wa 14 nichi gozen 7 ji 58 hun (Wakata Koichi astronaut (universe flying pilot) (50), the first Japanese captain, at 7:58 am on the 14th) |
Criterion | The output appropriately indicates the relationship between the two words, but it is lacking information needed to make the relationship easier to understand. |
Example | nihonjin uchu hikoushi no sencho ga tanjo shiteiru (The Japanese astronaut (universe flying pilot) captain appears) |
Criterion | The output does not appropriately indicate the relationship between the two words. |
Example | kounin no sencho to natta bei koukuu uchu kyoku (NASA) no sutibun swanson hikoushi wa kouichi no riidaashippu wa subarashikatta to tatae (Steven Swanson astronaut (universe flying pilot) of the National Aeronautics and Space Administration (NASA), who was the following captain, said “Koichi’s leadership was wonderful”) |
The example in Table 2 was evaluated as “Good” because the character string essentially says “astronaut Wakata Koichi has become the ISS captain,” which was judged to appropriately indicate the relationship between “flying” and “captain.” Similarly, the example in Table 3 was evaluated as “OK1” because, although the character string appropriately indicates the relationship between the two words, it also includes the additional phrase “at 7:58 am on the 14th.”
Next, the example in Table 4 was evaluated as “OK2” because, while the character string does indicate the relationship between the two words, it is missing information such as the person’s name and thus does not make the relationship as easy to understand as possible. Finally, the example in Table 5 was evaluated as “Bad” because the character string differed significantly from the correct information, as determined from the reference data, so it was judged as not appropriately indicating the relationship.
4.3 Evaluation Using the Mean Reciprocal Rank (MRR)
After conducting the manual evaluation described in Section 4.2, we evaluated the results using the MRR, based on the highest-ranked correct answer among the top five highest-priority results for each target. We calculated the MRR as follows:
MRR = (1/N) * sum_{i=1..N} (1/rank_i),
where N is the number of targets and rank_i is the rank of the highest-ranked correct answer among the top five results for the i-th target (1/rank_i is taken to be 0 when none of the top five results is correct).
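The MRR computation can be sketched as follows, with `is_correct` standing in for the human judgment from Section 4.2:

```python
def mean_reciprocal_rank(ranked_outputs, is_correct):
    # For each target, take the reciprocal of the rank of the
    # highest-ranked correct answer among its outputs (contributing 0
    # if none is correct), then average over all targets.
    total = 0.0
    for outputs in ranked_outputs:
        for rank, out in enumerate(outputs, start=1):
            if is_correct(out):
                total += 1.0 / rank
                break
    return total / len(ranked_outputs)
```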
4.4 Evaluation Using the Top-N Accuracy Rate
Next, we evaluated the results based on the Top-N accuracy rate, for N = 1 and N = 5.
The Top-N accuracy rate is the proportion of targets for which at least one correct answer appears among the top N highest-priority outputs.
As for the MRR evaluation, we considered three correctness criteria: only “Good” answers are correct; “Good” and “OK1” answers are correct; and “Good,” “OK1,” and “OK2” answers are correct.
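The Top-N accuracy rate described above can be sketched as follows, again with `is_correct` standing in for the human judgment under a chosen correctness criterion:

```python
def top_n_accuracy(ranked_outputs, is_correct, n):
    # Fraction of targets with at least one correct answer among the
    # top n outputs (n = 1 and n = 5 in the experiments).
    hits = sum(1 for outputs in ranked_outputs
               if any(is_correct(o) for o in outputs[:n]))
    return hits / len(ranked_outputs)
```

Swapping in a looser `is_correct` (e.g. one that also accepts "OK1" and "OK2" answers) yields the alternative correctness criteria used in the evaluation.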
4.5 Evaluation of the First String Extraction Method (Periods and Commas)
In this section, we evaluate the string extraction method based on using periods and commas as delimiters. First, we randomly selected 20 word pairs each from the “Toyota,” “universe,” and “Greece” networks, for a total of 60 word pairs. For each word pair, we used the proposed method to extract the five highest-priority strings, then evaluated them in terms of their MRR, Top-1 accuracy rate, and Top-5 accuracy rate.
The experiments were carried out as described in Section 4.1, while the evaluations were carried out as described in Sections 4.2, 4.3, and 4.4. Table 6 shows the MRR evaluation results, while Table 7 shows the Top-1 accuracy rates and Table 8 shows the Top-5 accuracy rates. In addition, Tables 9–11 show some example outputs. “High-frequency,” “Short,” and “Ratio” correspond to the use of Equations 2, 3, and 4, respectively.
4.6 Evaluation of the Second String Extraction Method (Periods Only)
In this section, we evaluate the string extraction method based on using just periods as delimiters. The experimental conditions were as before (Section 4.1), and the evaluations were carried out as described in Sections 4.2, 4.3, and 4.4. Here, we used the same 60 word pairs for evaluation as in Section 4.5. Table 12 shows the MRR evaluation results, while Table 13 shows the Top-1 accuracy rates and Table 14 shows the Top-5 accuracy rates. Tables 15–17 show some example outputs.
Priority | Output | Evaluation |
High-frequency | jidousha buhin ohte takata sei ea baggu no rikouru (kaishu/mushoushuri) mondai de (in a problem leading to the recall (collection / free repair) of airbags made by Takata, a major auto parts company) | Good |
Short | takata sei ea baggu: 474 man dai shuuri wo (Airbags by Takata: 4.74 million units repaired) | OK2 |
Ratio | jidousha buhin ohte takata sei ea baggu no rikouru (kaishu/mushoushuri) mondai de (in a problem leading to the recall (collection / free repair) of airbags made by Takata, a major auto parts company) | Good |
Priority | Output | Evaluation |
High-frequency | shou wakusei tansaki “hayabusa” wo noseta shuryoku roketto H2A26 gouki wo uchiageta. (They launched an H2A26 main rocket with the “Hayabusa 2” asteroid explorer.) | Good |
Short | H2A roketto de uchiagerareru shou wakusei tansaki “hayabusa 2” (Asteroid explorer ”Hayabusa 2” launched using an H2A rocket) | Good |
Ratio | shou wakusei tansaki “hayabusa” wo noseta shuryoku roketto H2A26 gouki wo uchiageta. (They launched an H2A26 main rocket with the “Hayabusa 2” asteroid explorer.) | Good |
Priority | Output | Evaluation |
High-frequency | EU wa girisha shien ni yotte yuuro bouei no ketsui wo shimeshi (EU shows a determined euro defense by supporting Greece) | OK1 |
Short | EU shien kankyou totonou (environment of complete EU support) | Bad |
Ratio | EU: girisha shien (EU: Greece support) | Good |
5 Discussion
5.1 Assigning Strings to Links
Assigning character strings to the links made it possible to identify word relationships that would otherwise have been difficult to understand. However, a suitable string could not always be extracted: when the string between the two words contained a punctuation mark, the proposed method discarded it, so no candidate string was available for that link. This issue was particularly prevalent for word pairs that appeared in only a small number of articles. In future work, we plan to improve our approach in this respect.
5.2 Priority Equation
Next, we investigated the equations used to determine the priorities of character strings in the method using both commas and periods. We considered three priority equations: one emphasized the frequency with which the string appeared, another emphasized short character strings, and the third based the priority on the ratio between appearance frequency and length.
First, we found that the equation that focused on short character strings had the lowest performance according to all evaluation methods and criteria. We believe this was because its emphasis on short strings reduced the amount of information available to indicate the word relationships.
Next, we examine the equations emphasizing appearance frequency and the frequency/length ratio. These both produced character strings that appropriately represented the word relationships, and we found that they both yielded nearly equivalent performance when we considered output strings with extra or missing information to be correct. However, the frequency-based equation performed slightly better when only strings that properly indicated the word relationships were considered correct, and also when we allowed answers with additional information.
In addition, although the Top-1 accuracy rates for the two equations were equal, the Top-5 accuracy rate for the frequency-based equation was higher than that for the ratio-based equation. This indicates that the frequency-based equation is more likely to produce correct answers among the top five outputs. We therefore believe that emphasizing occurrence frequency makes it easier to acquire character strings that appropriately indicate the word relationships.
5.3 Use of Commas and Periods
We used two methods of extracting character strings: one that uses only periods to divide the strings, and another that uses both commas and periods. These produced very similar evaluation results when we considered answers with additional or missing information to be correct. However, when we focused purely on answers that appropriately indicated the word relationships, the method using both commas and periods performed better. Conversely, when we allowed answers including additional information, the period-only method performed better. We believe this is because the strings produced using only periods as delimiters were longer and thus often included extra information.
5.4 Discussion Comparing our Proposed Method and Other Methods
Murata et al. [9] proposed a method that uses the string between two words in a Japanese sentence as the expression of their relationship. For example, the definition sentence for the entry word “snowy moonlit night” is “a moonlit night with the presence of snow.” This shows that “snowy moonlit night” consists of two terms, “snow” and “moonlit night,” whose relationship is expressed by the phrase “with the presence of.” They extracted relationships such as “that is in,” “that has,” and “that was made from.”
In contrast, our study extracts a substring containing both words as the representation of their relationship. Such a substring covers a wider span than the string between the two words and can therefore express the relationship more clearly.
In the example sentence of Figure 4, “Koichi Wakata astronaut (universe flying pilot): Captain of the ISS, communicate with the prime minister,” the string between the words “flying” and “captain” is only “pilot):”, which is inadequate as an expression of their relationship. In contrast, our method extracts “Koichi Wakata astronaut (universe flying pilot): Captain of the ISS,” which carries more information and shows the relationship in an easy-to-understand manner. Our method is thus superior to the method based on strings between two words.
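The contrast between the two extraction styles can be illustrated with a small sketch. Neither function is from the cited papers; they are minimal stand-ins for the two ideas:

```python
import re

def string_between(sentence, w1, w2):
    # Baseline style (Murata et al. [9], sketched): the string strictly
    # between the two words.
    m = re.search(re.escape(w1) + r"(.*?)" + re.escape(w2), sentence)
    return m.group(1) if m else None

def substring_containing(sentence, w1, w2, delimiters=r"[、。,.]"):
    # Our style (sketched): the delimiter-bounded span that contains
    # both words, which carries more surrounding context.
    for span in re.split(delimiters, sentence):
        if w1 in span and w2 in span:
            return span.strip()
    return None
```

On a sentence like the Figure 4 example, the first function returns only the short fragment between the words, while the second returns the whole clause containing both.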
We also conducted experiments using the method based on strings between two words, with the frequency-based priority corresponding to Equation 2. The results, shown in Table 18, were lower than those of our method, confirming that our method is more effective than using the string between two words.
Priority | Output | Evaluation |
High-frequency | jidousha buhin ohte takata sei ea baggu no rikouru (kaishu/mushoushuri) mondai de, bei kain eenerugii shougyou iinkai wa mikka, jouin ni tsuzuite kouchoukai wo hiraita (The House Energy Commerce Committee held a public hearing following the Senate on the 3rd, due to the recall (collection / free repair) of airbags manufactured by Takata, a major auto parts company) | OK1 |
Short | takata sei ea baggu: 474 man dai shuuri wo (Airbags by Takata: 4.74 million units repaired) | OK2 |
Ratio | bei unyushou no douro koutsuu anzen kyoku wa 18 nichi, kekkan ga mitsukatta jidousha buhin oute takata sei ea baggu no rikouru (kaishuu/mushou shuuri) no taishou chiiki wo zenbei ni kakudai suruyou honda nado jidousha meekaa ni shiji shita to happyou shita (The Road Traffic Safety Authority of the US Department of Transportation announced on the 18th that it had instructed automobile manufacturers such as Honda to expand the target area for the recall (collection / free repair) of Takata airbags, a major automobile part found to be defective, to the whole country) | OK1 |
Priority | Output | Evaluation |
High-frequency | mitsubishi juu kougyou to uchuu koukuu kenkyuu kaihatsu kikou (JAXA) wa mikka gogo, shouwakusei tansaki “hayabusa 2” wo noseta shuryoku roketto H2A26 gouki wo uchiageta (Mitsubishi Heavy Industries and the Japan Aerospace Exploration Agency (JAXA) launched an H2A26 main rocket with the “Hayabusa 2” asteroid explorer on the afternoon of the 3rd) | OK1 |
Short | kongetsu 12 gatsu ni H2A roketto de uchiagerare, shouwakusei 1999JU3 ni touchaku suru no wa 18 nen 6 gatsu (It was launched by an H2A rocket this December, arriving at asteroid 1999JU3 in June 2018) | Bad |
Ratio | mitsubishi juu kougyou to uchuu koukuu kenkyuu kaihatsu kikou (JAXA) wa mikka gogo, shouwakusei tansaki “hayabusa 2” wo noseta shuryoku roketto H2A26 gouki wo uchiageta (Mitsubishi Heavy Industries and the Japan Aerospace Exploration Agency (JAXA) launched an H2A26 main rocket with the “Hayabusa 2” asteroid explorer on the afternoon of the 3rd) | OK1 |
Priority | Output | Evaluation |
High-frequency | EU wa girisha shien ni yotte yuuro bouei no ketsui wo shimeshi, kiki ga hoka no yuuro kameikoku ni tobihi suru jitai no kaihi wo mezasu (EU shows a determined euro defense by supporting Greece, aiming to avoid a situation where the crisis extends to other euro member countries) | OK1 |
Short | EU shien kankyou totonou (environment of complete EU support) | Bad |
Ratio | EU: girisha shien goui (EU: Greece support agreement) | Good |
6 Conclusions
In recent years, vast and increasing numbers of electronic texts have been posted on the Internet, so we need to find ways of automatically extracting useful information from them. Doen et al. [4] proposed a method of extracting relationship information based on identifying specific keywords in the text and creating a network based on them, then constructed such a network based on the keyword “earthquake.” They found that the network included nodes for unrelated concepts and proposed a method of automatically deleting them. However, their network did not include any node relationship information, making it difficult to understand these relationships.
In this study, we therefore proposed a method of extracting character strings from newspaper articles that express node relationships and assigning these strings to links in a word network. Adding character strings to links enables us to indicate the corresponding relationships.
Then, we evaluated the results produced by the proposed method in terms of the MRR, Top-1 accuracy rate, and Top-5 accuracy rate. When we considered answers with additional or missing information to be correct, the MRR was about 0.7, the Top-1 accuracy rate was about 0.6, and the Top-5 accuracy rate was about 0.9.
We also conducted experiments using two different methods of extracting character strings, one that uses commas and periods as delimiters and another that only uses periods. When we compared the performance of these two methods, we found that when we focused on answers that appropriately indicated the word relationships, the method based on both commas and periods performed better. However, when we allowed answers that included additional information, the method based on periods only was better.
Finally, when we allowed answers with additional or missing information, both methods performed very similarly.