1 Introduction
Information has become an essential resource, and its volume keeps growing in the many languages spoken around the world. The most spoken languages by number of native speakers include Chinese, Spanish, English, Arabic, Hindi, Portuguese, Bengali, and Russian, among others [1]. Automatic Text Summarization (ATS) methods offer a way to access the information that is generated every day. ATS aims to extract the most relevant information from a document [2].
Most state-of-the-art methods are based on Automatic Extractive Text Summarization (AETS) because it is easy to implement and gives competitive results. AETS methods extract the parts of a text (sentences, key phrases, or paragraphs) considered most important in the original; therefore, they do not require complex, sophisticated methods.
AETS has 60 years of research; its study started in the ’50s with Luhn’s work in 1958 [3]. Luhn was the first to perform AETS.
Subsequently, research on AETS continued with [4-14] and others. Up to the year 2000, AETS research focused on English because the resources (corpora and standard evaluation measures) were available for that language. However, other widely spoken languages are growing rapidly according to [1]; for example, Spanish is the second most spoken language in the world and the third most used on the internet.
The problem is that, for Spanish, there is no standard corpus with human-generated summaries and no evaluation methods that correlate highly with human judgments; therefore, there are no state-of-the-art methods for Spanish AETS (SAETS).
Consequently, to be able to update the SAETS, it is necessary to know how the study of this task has progressed for the English language over the 60 years of research.
Up to 2000, all research focused on English and was carried out without a standard corpus or evaluation measure, so methods could not be compared. In 2001, the Document Understanding Conferences (DUC) were created to advance summarization for English and to enable researchers to participate on a large scale.
Several DUC corpora were created between 2001 and 2007. DUC01 and DUC02 focus on automatic text summarization for single and multiple documents; DUC03 to DUC07 focus on multiple documents with different tasks.
As a continuation of DUC, the Text Analysis Conference (TAC) was organized in 2008 as a series of evaluation workshops created to improve system evaluation. The TAC corpora focused on summarization were created in 2008, 2009, 2010, 2011, and 2014; their main area of study is multi-document summarization focused on the end user.
In 2011, the MultiLing task was created to evaluate language-independent summarization algorithms on different languages. Several MultiLing corpora were created in 2011, 2013, 2015, and 2017 for multilingual automatic text summarization. Although the MultiLing task works with different languages, the original texts are collected in English and translated into the other languages, so there is no real corpus for each language.
Based on the number of papers indexed in Google Scholar, it is possible to approximate the number of studies that use the standard corpora DUC (250 papers), TAC (100 papers), and MultiLing (30 papers). However, despite the efforts made to create standard corpora for AETS, the corpus most commonly used to test methods and systems has been DUC02, and it is still in use [15-19]. DUC02 was built with specific features (news domain, labeling, model summaries, specific length, and the baseline:first heuristic as a reference measure) that make it robust and usable.
Another essential factor for AETS is the evaluation method. Initially, summaries were evaluated manually, that is, by humans. However, these manual processes were costly and time-consuming. Subsequently, automatic evaluation methods were developed to reduce those costs.
Summary evaluation methods are classified into two categories: intrinsic and extrinsic [20]. Intrinsic methods use a reference text, usually a summary created by a human (gold standard), although another text or the original document itself can also be used [21]. Extrinsic methods determine the effect of the summary on other tasks (e.g., relevance evaluation) [22].
Currently, the most used intrinsic evaluation method is ROUGE. ROUGE compares the summary to be evaluated (candidate summary) with a summary created by a human (model summary or reference summary) [23]. Because ROUGE uses the human-made summary as a reference, the evaluation reflects the criteria that the human used to generate the summary.
To make a more objective evaluation of the AETS, other intrinsic methods are proposed: ROUGE-C and Jensen Shannon divergence (JS).
These two evaluators, unlike ROUGE, use the original document as the reference text instead of the summary made by the human, which allows them to evaluate the performance of the methods concerning the entire content of the document.
For the English language, ROUGE-C and JS evaluation methods have not been used to assess state-of-the-art methods.
Since the creation of the standard DUC corpora and of automatic evaluation methods, it has been possible to track the progress of AETS. In addition, different heuristics have been calculated, among them baseline:random, baseline:first, topline, and concordance. These heuristics have served as a reference for the evaluation of AETS.
Baseline:random consists of randomly choosing the sentences that will constitute the summary [24]. When a method or system generates a summary, it is expected to be of better quality than a random one. Baseline:first consists of taking the first n sentences to make up the summary [25]. For state-of-the-art methods and systems, the goal is to outperform this heuristic. For news in particular, it scores very high, since this type of text contains the most important information at the beginning of the document.
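The two baseline heuristics are simple enough to be sketched in a few lines. The following Python sketch is illustrative only: the function names and the `seed` parameter are assumptions introduced here, and the input is assumed to be a list of already segmented sentences.

```python
import random

def baseline_first(sentences, n):
    """baseline:first, keep the first n sentences of the document."""
    return sentences[:n]

def baseline_random(sentences, n, seed=None):
    """baseline:random, pick n sentences uniformly at random.

    The chosen sentences are returned in document order; `seed` is a
    hypothetical parameter added only to make runs reproducible.
    """
    rng = random.Random(seed)
    k = min(n, len(sentences))
    chosen = sorted(rng.sample(range(len(sentences)), k))
    return [sentences[i] for i in chosen]
```

In practice the summary length is usually bounded in words rather than sentences, so a real experiment would keep adding sentences until the word limit is reached.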
Topline consists of obtaining the best combination of sentences among all possible combinations. It indicates the maximum result that can be reached when evaluating generated summaries on a standard corpus [26-27].
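As an illustration of the topline idea, the sketch below exhaustively scores every combination of k sentences against a reference summary and keeps the best one. It is a minimal sketch under stated assumptions: the fixed summary size `k`, the whitespace tokenization, and the unigram-recall scoring function are choices made here for brevity, not the procedure of [26-27].

```python
from itertools import combinations

def unigram_recall(candidate, reference):
    """Fraction of reference tokens that also appear in the candidate."""
    ref = reference.lower().split()
    cand = set(candidate.lower().split())
    if not ref:
        return 0.0
    return sum(1 for w in ref if w in cand) / len(ref)

def topline(sentences, reference, k, score=unigram_recall):
    """Return the best-scoring combination of k sentences and its score."""
    best_score, best_combo = -1.0, ()
    for combo in combinations(range(len(sentences)), k):
        candidate = " ".join(sentences[i] for i in combo)
        s = score(candidate, reference)
        if s > best_score:
            best_score, best_combo = s, combo
    return [sentences[i] for i in best_combo], best_score
```

The number of combinations grows quickly with document length, so an exhaustive search like this one is only practical for short documents and short summaries.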
Concordance consists of measuring the agreement between summaries made by humans [28]; it is therefore only an informative heuristic and not a reference for the evaluation of methods and systems.
The heuristics serve as a reference to know the performance of state-of-the-art methods and systems. For the Spanish language, these heuristics have not been calculated due to the lack of resources.
As mentioned, most of the research on AETS is done for the English language. However, the methods performed and tested in the English language are not exclusive to this language.
Many state-of-the-art methods claim to be language-independent [29-32], and others, although they do not claim language independence, work with (extractive) structures that allow them to be applied to different languages [17,33-35]. The best-performing methods are those based on graphs [36] and those based on genetic algorithms [17, 33-35].
In addition to state-of-the-art methods, systems for AETS are also currently available. AETS systems are implementations available to the public, and in some cases their use requires payment.
For the Spanish language, few efforts have been made in AETS research. In 2001, Acero et al. [37] presented work on the automatic generation of personalized summaries using their own corpus, built with news from the newspaper ABC.
Villatoro et al. [38] took a corpus created for the information extraction task and adapted it to automatic multi-document text summarization for Spanish [39]. Other studies on SAETS include [20, 37-38, 40-43].
However, despite the research carried out for SAETS, the current progress is not known: the use of custom or adapted corpora, together with the lack of a standard corpus, does not allow a comparison between methods.
Currently, it is known which the best state-of-the-art methods and systems are for English Automatic Extractive Text Summarization (EAETS). Then, if they are tested in a standard corpus in Spanish and their performance is measured with different evaluation methods, the research in Spanish can be updated 60 years after the beginning of the task of AETS.
In addition, one can calculate the heuristics that are considered reference for comparison for the methods and systems of AETS.
This paper presents an update of SAETS to motivate research in the Spanish language. The results obtained from the evaluations with ROUGE, ROUGE-C, and JS are presented for state-of-the-art methods and systems of AETS with a standard corpus in Spanish. Also, the results obtained for the heuristics are presented (baseline:random, baseline:first, topline and concordance).
2 State of the Art Summarization Methods
In the AETS task, state-of-the-art methods use different techniques such as genetic algorithms, neural networks, and graphs, among others. This article takes up two of the most used techniques for AETS to describe how they work and to test them.
2.1 Use of Graphs
The work of [34] has been one of the most cited and reused in new research, so this article tests its performance on the SAETS task.
2.1.1 TextRank
This method consists of a graph-based ranking algorithm. According to Mihalcea [36], it constructs a graph to represent the text, in which the nodes are words or other text units interconnected by arcs that represent meaningful relationships. For the task of sentence extraction, the objective is to rank whole sentences from most to least important.
Therefore, a node is added to the graph for each sentence in the text. To establish the connections between sentences, a similarity relation is defined, where the relationship between two sentences can be seen as a process of "recommendation": a sentence that addresses certain concepts in the text gives the reader a "recommendation" to refer to other sentences that address the same concepts. Therefore, a link can be established between any two sentences that share common content.
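The sentence graph and its ranking can be sketched as follows. This is a minimal sketch, not the implementation evaluated in this paper: the similarity (word overlap normalized by the logarithms of the sentence lengths), the damping factor d = 0.85, the fixed number of iterations, and the whitespace tokenization follow the commonly cited TextRank formulation and are assumptions here.

```python
import math

def similarity(s1, s2):
    """Word overlap normalized by the log lengths of both sentences."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    denom = math.log(len(w1)) + math.log(len(w2))
    if denom == 0:  # both sentences consist of a single word
        return 0.0
    return len(set(w1) & set(w2)) / denom

def textrank(sentences, d=0.85, iterations=50):
    """Score sentences with a weighted PageRank over the similarity graph."""
    n = len(sentences)
    # One node per sentence; one weighted, undirected edge per similar pair.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = similarity(sentences[i], sentences[j])
    out = [sum(row) for row in w]  # total edge weight attached to each node
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - d) + d * sum(w[j][i] / out[j] * scores[j]
                              for j in range(n) if w[j][i] > 0)
            for i in range(n)
        ]
    return scores
```

The summary is then formed by taking the highest-scoring sentences until the length limit is reached; a sentence that shares no words with any other keeps the minimum score of 1 - d.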
2.2 Use of Genetic Algorithms
Techniques based on genetic algorithms have worked very well for AETS, and the state-of-the-art methods based on them have obtained acceptable results (surpassing the baseline:first heuristic for English). In this article, some of the state-of-the-art methods based on genetic algorithms are tested with a standard corpus in Spanish.
2.2.1 GA-Bag of Words
The method proposed by [35] uses a genetic algorithm based on the bag-of-words text model. The fitness function combines two main features, described below:
– The first sentences are more important. The first sentences of a text are considered candidates for the summary. For a text with 𝑛 sentences, if sentence 𝑖 is selected for the summary (that is, the chromosome gene |𝐶𝑖| = 1), then its relevance is defined as 𝑡(𝑖 − 𝑥) + 𝑥, where 𝑥 = 1 + (𝑛 − 1)/2 and 𝑡 is the slope to be discovered. To normalize the sentence position measure (𝛿), the relevance of the first 𝑘 sentences is calculated, where 𝑘 is the number of selected sentences:
– The summary should cover different ideas without being repetitive while still containing important words, which is evaluated with precision-recall measures. When generating a summary (S), the maximum-words threshold (m) is considered; consequently, the number of retrieved units is always limited by this threshold. Therefore, the ideal summary must contain, on the one hand, the most relevant words of the original text (T) and, on the other hand, must be expressive, that is, not redundant. The relevance of a word w is represented by its frequency in the original text (frequency(w, T)), and expressivity is captured by considering only the distinct words of the summary ({word ∈ S}).
In this sense, the best summary would contain the most frequent words of the original text, each of them different. To obtain a normalized measure, the sum of the frequencies of the distinct words in the summary is divided by the sum of the frequencies of the most frequent words in the original text:
Therefore, the fitness function was calculated as: 𝑓𝑖𝑡𝑛𝑒𝑠𝑠 = 𝛽 × 𝛿.
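The normalized word-frequency measure described above can be illustrated with a short sketch. It assumes that the denominator sums the frequencies of the m most frequent words of the text, where m is the maximum-words threshold; that reading, the function name `coverage`, and the pre-tokenized inputs are assumptions made here, not details confirmed by [35].

```python
from collections import Counter

def coverage(summary_words, text_words, m):
    """Frequency mass of the distinct summary words, normalized by the
    frequency mass of the m most frequent words of the text (assumed)."""
    freq = Counter(text_words)
    # Each distinct summary word counts once, weighted by its text frequency.
    numerator = sum(freq[w] for w in set(summary_words))
    denominator = sum(count for _, count in freq.most_common(m))
    return numerator / denominator if denominator else 0.0
```

Because repeated words in the summary add nothing to the numerator, a redundant summary wastes part of its word budget and scores lower than one that covers more of the frequent vocabulary.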
2.2.2 GA-Multilanguage
The GA-Multilanguage method proposed by [44] has been applied to ATS for different languages. The method is based on a genetic algorithm that uses n-grams with 𝑛 = 1, 2, 3, 4, and 5 as a text model.
For the fitness function, two of the most used features in the state of the art [35] are considered: term frequency (see Eq. 2) and sentence position. For the sentence position feature, Eq. 3 is used, which was obtained through symbolic regression following the work of [18]:
where 𝑁 is the number of sentences in a text. Therefore, the fitness function is calculated as:
2.2.3 GA-4feature
In [17], a method is presented that optimizes the combination of four features, using a genetic algorithm at each step: similarity with the title (δ), sentence position (β) based on [35] (see Eq. 1), sentence length (γ), and coverage (α) based on [35] (see Eq. 2). The similarity with the title weights each sentence according to its similarity with the document title, since the title contains relevant words that can be taken as unsupervised keywords.
Several similarity measures have been proposed, among them Cosine, Euclidean, Dice, Jaccard, and more recently Soft Cosine [45]. However, these measures usually depend on the term selection and term weighting steps. Specifically, [33] uses the classical cosine similarity as term weighting and 1-grams (words) as term selection, as described in Eq. 5:
where
For sentence length, Eq. 10 is used [31]. The fitness function used is presented in Eq. 7:
2.2.4 MA-SingleDocSum
The method MA-SingleDocSum proposed by Mendoza [33] is based on a memetic algorithm, focused on the generation of summaries for a single document. In addition to using genetic operators for the generation of summaries, local search is used.
The features considered for the fitness function are sentence position, relation of the sentences with the title, sentence length, cohesion, and coverage (the thematic content of the text).
For sentence position, the scheme proposed by [46] was used, where a standard calculation is applied to the position based on Eq. 8:
where 𝑞𝑖 indicates the position of the sentence 𝑆𝑖 in the document and 𝑃 is the result of the calculation for all sentences of the summary.
Calculation of the relation of the sentences with the title begins with the representation through the vector space model, and the cosine similarity measure [47] is used, as shown in Eq. 9:
where 𝑠𝑖𝑚𝑐𝑜𝑠(𝑆𝑖, 𝑡) is the cosine similarity of sentence 𝑆𝑖 with title 𝑡, 𝑂 is the number of sentences in the summary, 𝑅𝑇𝑆 is the average similarity of the sentences in the summary 𝑆 with the title, 𝑚𝑎𝑥∀𝑆𝑢𝑚𝑚𝑎𝑟𝑦𝑅𝑇 is the average of the maximum values obtained from the similarities of all sentences in the document with the title (i.e., the average of the top 𝑂 similarities of all sentences with the title), and 𝑅𝑇𝐹𝑆 is the similarity factor of the sentences of the summary 𝑆 with the title. 𝑅𝑇𝐹 is close to one (1) when the sentences in the summary are closely related to the document title, and close to zero (0) when they are very different from it.
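The title-relation factor defined above can be sketched as the average title similarity of the summary sentences divided by the average of the top 𝑂 title similarities in the document. The sketch below is illustrative: it assumes raw term counts as vector weights and whitespace tokenization, and the function names are hypothetical rather than taken from [33].

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def title_relation_factor(summary_idx, sentences, title):
    """RTF: average title similarity of the summary sentences, divided by
    the average of the top-O title similarities over the whole document."""
    o = len(summary_idx)
    if o == 0:
        return 0.0
    sims = [cosine(s, title) for s in sentences]
    rts = sum(sims[i] for i in summary_idx) / o
    max_rt = sum(sorted(sims, reverse=True)[:o]) / o
    return rts / max_rt if max_rt else 0.0
```

With this normalization, a summary made of the 𝑂 sentences most similar to the title obtains a factor of 1, and a summary sharing no words with the title obtains 0.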
For sentence length, a sentence that is not too short is considered to deserve a good score for this feature. Based on this premise, Eq. (10) shows the calculation of length for the sentences of a summary (L):
where 𝑇𝐿(𝑆𝑖) is the length of sentence 𝑆𝑖 (measured in characters), μ(l) is the average length of the sentences of the summary, and std(l) is the standard deviation of the lengths of the sentences of the summary.
For the calculation of cohesion, the cosine similarity measure of one sentence to another is used, see Eq. 11:
where 𝐶𝑜𝐻 corresponds to the cohesion of a summary, 𝐶𝑆 is the average similarity of all sentences in the summary 𝑆, 𝑠𝑖𝑚𝑐𝑜𝑠(𝑆𝑖𝑆𝑗) is the cosine similarity between sentences 𝑆𝑖 and 𝑆𝑗, 𝑁𝑆 is the number of nonzero similarity relationships in the summary, 𝑂 is the number of sentences in the summary, 𝑀 corresponds to the maximum similarity of the sentences in the document and 𝑁 is the number of sentences in the document.
Thus, 𝐶𝑜𝐻 tends to 0 when the summary sentences are very different from each other, and tends to 1 when they are very similar to each other.
Coverage is defined as the similarity between the sentences that make up a summary and the full document. Therefore, each of the sentences, and consequently the document, is represented through the vector space model and weighted by its relative frequency according to Eq. 12:
where 𝐷 is the vector of weights of the terms in the document, and 𝑆𝑖 and 𝑆𝑗 are the vectors of weights of the terms in the sentences 𝑖 and 𝑗, respectively, belonging to the summary.
The weights found for the objective function are: 𝛼 = 0.35, 𝛽 = 0.35, 𝛾 = 0.29, 𝛿 = 0.005, 𝜌 = 0.005; which correspond to the features of Position (P), Relationship to the title (RT), Length (L), Cohesion (CoH) and Coverage (Cov), respectively.
To assess the quality of a summary represented by a solution 𝑋𝑘, an objective function is required, which is maximized according to Eq. 13:
The fitness function was calculated as:
3 Summarization System
In this paper, commercial tools are also described and tested so that they can be compared with the state-of-the-art methods, giving a complete update of SAETS.
3.1 Open Text Summarization
The Open Text Summarizer (OTS) is an open source tool for summarizing texts. OTS reads a text and decides which sentences are important and which are not. It creates a short summary or highlights the main ideas in the text. OTS is both a library and a command line tool. Word processors such as AbiWord and KWord can link to the library and summarize documents, while the command line tool can summarize text on the console. The program outputs the summarized text as plain text or HTML.
3.2 Text Compactor
Text Compactor is a free online summarization tool created to help struggling readers process large amounts of information. The web app calculates the frequency of each word in the passage. Then, a score is calculated for each sentence based on the frequency counts of the words it contains. The most important sentence is deemed to be the one with the highest score.
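The frequency-based scoring described above can be illustrated with a short sketch. It is not the implementation of Text Compactor, whose details (for example, stop-word handling or length normalization) are not documented here; the function name and the whitespace tokenization are assumptions.

```python
from collections import Counter

def frequency_rank(sentences):
    """Rank sentence indices by the summed passage frequency of their words."""
    freq = Counter(w for s in sentences for w in s.lower().split())
    scores = [sum(freq[w] for w in s.lower().split()) for s in sentences]
    # Highest score first; ties keep document order because sorted() is stable.
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

A plain sum favors long sentences, which is one reason frequency-based summarizers often filter stop words or normalize by sentence length.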
3.3 Copernic Summarizer
The system was developed exclusively for ATS. According to [48], Copernic Summarizer uses the following methods:
– A common statistical model (S-Model), which can be applied to multiple languages to a certain degree, approximates the topic-specific vocabulary. It includes Bayesian estimates and rule systems derived from the analysis of thousands of documents.
– Knowledge-intensive processes (K-Process), which consider how human beings summarize texts, through the following steps: language detection, sentence boundary detection, concept extraction, document segmentation, and sentence selection.
3.4 Microsoft Office Word (MOW)
This tool offers ATS only in Microsoft Office Word 2003 and Microsoft Office Word 2007. The summary created by Microsoft Office Word is the result of a keyword analysis, in which keywords are selected by assigning a score to each word. The most frequent words in the document receive the highest scores and are considered important. The sentences that contain these words are included in the summary. The tool offers several ways to view summaries.
3.5 Summarizing
Summarizing is an online tool for EATS articles. Its stages are based on detecting the main ideas of the text and obtaining a description of those ideas that reflects the author's writing style, to finally rewrite the text as a summary. The Summarizing tool can be set to generate summaries of 100, 150, 200, and 300 words.
4 Evaluation
In this section, the three evaluation methods used for the AETS task are presented. ROUGE is the most widely used method for evaluating summaries; it uses one or several gold standard summaries (summaries made by humans) to perform its evaluation. ROUGE-C and JS divergence, in contrast, evaluate summaries with respect to the original document. Although these methods follow different evaluation approaches, state-of-the-art methods evaluated with ROUGE, ROUGE-C, and JS divergence must use a standard corpus in order to be compared with other methods.
4.1 ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was proposed by Lin and Hovy [49-50]. ROUGE compares the summaries generated by a system with human-generated (gold standard) summaries. For the comparison, it uses n-gram statistics.
ROUGE includes the following automatic assessment measures.
– ROUGE-N (n-grams co-occurrence). It expresses the coverage or recall of n-grams between a candidate summary and a set of reference summaries. It is calculated as follows:
where 𝑛 is the length of the n-gram and 𝐶𝑜𝑢𝑛𝑡𝑚𝑎𝑡𝑐ℎ(𝑔𝑟𝑎𝑚𝑛) is the maximal number of n-grams that co-occur in the candidate summary and the set of reference summaries.
– ROUGE-S (noncontiguous bigram co-occurrence): a noncontiguous bigram is any pair of words in sentence order, allowing for arbitrary gaps between them. The co-occurrence statistic measures the coverage of noncontiguous bigrams between the candidate summary and the set of reference summaries. Lin [49] shows that this kind of measure can be applied to assess the quality of automatically generated summaries, as it reaches a 95% correlation with human judgments.
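The ROUGE-N measure in the list above is an n-gram recall: the number of reference n-grams that also occur in the candidate (with clipped counts), divided by the total number of n-grams in the references. The sketch below illustrates that computation; it assumes lowercased whitespace tokenization and omits the stemming and stop-word options of the official ROUGE toolkit, so its scores are not expected to match the official tool exactly.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=1):
    """N-gram recall of a candidate summary against reference summaries."""
    cand = ngrams(candidate.lower().split(), n)
    match = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total += sum(ref_counts.values())
        # A reference n-gram is matched at most as often as the candidate has it.
        match += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return match / total if total else 0.0
```

For example, `rouge_n("the cat sat on the mat", ["the cat was on the mat"], n=2)` compares the five bigrams of the reference with those of the candidate.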
4.2 ROUGE-C
ROUGE-C is presented as a tool to evaluate summaries without reference summaries made by humans [51]. ROUGE-C replaces the reference summaries with the source document as well as query-focused information (if any); it therefore enables a fully manual-independent way of evaluating multi-document summarization.
In ROUGE-C, the measures for a summary of a document are defined in the same way as those used by ROUGE. For example, ROUGE-C-N is defined as shown in Eq. (16):
where 𝑛 stands for the length of the n-gram and 𝐶𝑜𝑢𝑛𝑡𝑚𝑎𝑡𝑐ℎ(𝑔𝑟𝑎𝑚𝑛) is the maximum number of n-grams co-occurring in a peer summary and the source document. ROUGE-C-N is the proportion of overlapping n-grams in the total n-grams of the source document. ROUGE-C-N is a precision-related measure because the denominator of the equation is computed on the test side.
4.3 Jensen-Shannon Divergence (JS)
Jensen-Shannon divergence [52] is a method that evaluates the content of a summary that does not require models made by humans (gold standard). It assumes that the distribution of the words in the source document and the generated summary should be similar to each other.
The Jensen-Shannon divergence is a measure that compares two probability distributions of words: the text of the original document, (𝑃), and the evaluated summary text, (𝑄). Low divergence from the input document(s) by the produced summary is taken as a signal of a good summary. Given two probability distributions over words: (𝑃 and 𝑄), Jensen-Shannon divergence is defined as:
The measure can be applied to the distribution of units in system summaries 𝑃 and reference summaries 𝑄. The value obtained may be used as a score for the system summary.
The JS divergence formula given in Eq. 17 is implemented here with the following specification (see Eq. 18) for the probability distribution of words 𝑤:
where 𝑃 is the probability distribution of words 𝑤 in text 𝑇 and 𝑄 is the probability distribution of words 𝑤 in summary 𝑆; 𝑁 is the number of words in text and summary, 𝑁 = 𝑁𝑇 + 𝑁𝑆; and 𝐵 = 1.5|𝑉|, where 𝑉 is the size of the vocabulary of the documents.
Both the smoothed (JS-SMT) and unsmoothed (JS-WSMT) versions of the divergence are used as features.
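As an illustration of the unsmoothed variant, the sketch below computes the standard Jensen-Shannon divergence between the word distributions of a source text and a summary, that is, the average of the Kullback-Leibler divergences of each distribution from their mean. It is a minimal sketch: the whitespace tokenization, the base-2 logarithm, and the function names are assumptions, and the smoothing specification of Eq. 18 is not reproduced.

```python
import math
from collections import Counter

def distribution(tokens, vocab):
    """Maximum-likelihood word probabilities over a shared vocabulary."""
    counts, total = Counter(tokens), len(tokens)
    return {w: counts[w] / total for w in vocab}

def kl(p, a):
    """Kullback-Leibler divergence KL(p || a) in bits; zero terms are skipped."""
    return sum(p[w] * math.log2(p[w] / a[w]) for w in p if p[w] > 0)

def js_divergence(text, summary):
    """Unsmoothed Jensen-Shannon divergence between text and summary words."""
    t, s = text.lower().split(), summary.lower().split()
    if not t or not s:
        raise ValueError("text and summary must both contain words")
    vocab = set(t) | set(s)
    p, q = distribution(t, vocab), distribution(s, vocab)
    a = {w: 0.5 * (p[w] + q[w]) for w in vocab}  # mean distribution
    return 0.5 * kl(p, a) + 0.5 * kl(q, a)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared words).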
5 Experiment and Results
This section shows the experiments carried out on the best methods and systems for AETS, tested on a standard corpus in Spanish. First, the corpus used is described. Second, the results of the different heuristics, state-of-the-art methods, and systems are presented using the evaluation methods ROUGE, ROUGE-C, and JS, together with the results of the concordance between the summaries made by humans. Third, the ranking matrix is calculated for the methods and systems for SAETS.
5.1 Corpus
The standard corpus used for the experiments is called “Textos en Español para Resúmenes” (TER). TER is a corpus composed of Mexican Spanish news obtained from the newspaper “Crónica”.
The construction of the corpus is divided into two stages: the first covers the selection, cleaning, and tagging of news, and the second covers the selection of experts and the construction and tagging of summaries.
In the first stage, 20 news items were randomly selected from each of the following categories: Academy, Wellness, City, Culture, Sports, Entertainment, States, World, National, Business, Opinion, and Society, giving a total of 240 news items. The texts were cleaned of tags and images by extracting only the title, the category, the date, and the main text of the news. Subsequently, the texts were normalized through tagging.
The tagging of the text helps mainly to know where a sentence starts and ends. In this way, its use is facilitated, and it is guaranteed that the methods that use it will use the same separation of sentences. The tags used are shown in Table 1.
Tags | Description |
<DOC></DOC> | Tag indicating the start and end of the document |
<DOCNO> </DOCNO> | Tag indicating the name of the document |
<FILEID></FILEID> | Tag indicating a unique number of the document |
<TITLE></TITLE> | Tag indicating the title of the document |
<CATEGORY></ CATEGORY > | Tag indicating the category to which the document belongs |
<DATE></DATE> | Tag indicating the date of issue of the document |
<TEXT></TEXT> | Tag indicating what is the text of the news |
<s><\s> | Tag indicating the beginning and end of a sentence |
In the second stage, for the creation of human-made summaries (gold standard), a group of people of Mexican nationality with at least a university education was selected.
Each person was given the text separated into sentences, with the number of words of each sentence, so that they only had to read the text and select the sentences they considered important. From the chosen sentences, they were asked to create a summary of at least 100 words. Thus, for each document, two people made an extractive summary of more than 100 words. The summaries were also tagged to facilitate their use. Next, the tags used for the summaries are described.
The result is a corpus of 240 news items in Mexican Spanish with two human-made summaries for each news item. It is worth mentioning that the corpus was built considering the main features of DUC02. Table 3 presents a summary of how the TER corpus is constituted.
Tags | Description |
<SUM></SUM> | Tag indicating the beginning and end of the summary made by the human |
CATEGORY | Tag indicating the category to which the news belongs |
TYPE | Tag indicating the type of summary, in this case it is per document |
SIZE | Tag indicating the minimum number of words that the summary should have |
DOCREF | Tag that shows the name of the base document for the generation of the extractive summary |
SELECTOR | Tag with the unique key of the human that created the summary |
SUMMARIZER | Tag that indicates which of the two generated summaries it is: A (first) or B (second) |
Newspaper | Category | Documents | Words | Average of words | Sentences | Average sentences |
Crónica | Academy | 20 | 10966 | 548 | 382 | 19 |
 | Wellness | 20 | 11801 | 590 | 405 | 20 |
 | City | 20 | 7568 | 378 | 219 | 11 |
 | Culture | 20 | 8631 | 432 | 297 | 15 |
 | Sports | 20 | 9519 | 476 | 363 | 18 |
 | Entertainment | 20 | 8869 | 443 | 311 | 16 |
 | States | 20 | 7471 | 374 | 185 | 9 |
 | World | 20 | 7108 | 355 | 247 | 12 |
 | National | 20 | 7533 | 377 | 186 | 9 |
 | Business | 20 | 7523 | 376 | 229 | 11 |
 | Opinion | 20 | 12716 | 636 | 443 | 22 |
 | Society | 20 | 6507 | 325 | 228 | 11 |
Total | | 240 | 106212 | | 3495 | |
Average | | | | 442 | | 15 |
5.2 Concordance
The results of the concordance heuristic for the corpus TER are shown for three evaluation methods: ROUGE (see Table 4), ROUGE-C (see Table 5), and JS (see Table 6).
For ROUGE, the concordance heuristic shows a level of agreement between the experts of 66%. This indicates that the two experts chose more than half of their sentences in common.
For ROUGE-C and JS, the concordance heuristic was applied by first evaluating the summary of human 1 with respect to the source text, and then the summary of human 2 with respect to the source text. This respects the main feature of ROUGE-C and JS, which is to evaluate with respect to the original document. Finally, the average of the two experts' summaries was computed.
The results using JS show a higher concordance between the summaries of humans, while in ROUGE-C, the agreement is lower.
5.3 Experimental Results
We present the results of the heuristics, state-of-the-art methods, and systems evaluated with ROUGE (see Table 7), ROUGE-C (see Table 8), and JS (see Table 9).
Method \ System | ROUGE - 1 | ROUGE – 2 | ROUGE - SU4 |
Topline | 0.8344 | 0.7664 | 0.7649 |
GA-Multilanguage | 0.7274 | 0.6289 | 0.6378 |
Baseline:first | 0.7626 | 0.6229 | 0.6326 |
GA-4feature | 0.7131 | 0.6072 | 0.6180 |
GA-Bag of words | 0.6989 | 0.5852 | 0.5972 |
MA-SingleDocSum | 0.6883 | 0.5706 | 0.5842 |
OTS | 0.6761 | 0.5562 | 0.5698 |
Text Compactor | 0.6749 | 0.5537 | 0.5678 |
TextRank | 0.6606 | 0.5390 | 0.5532 |
Copernic | 0.6187 | 0.4711 | 0.4898 |
MOW2007 | 0.6178 | 0.4691 | 0.4854 |
MOW 2003 | 0.6160 | 0.4649 | 0.4819 |
Summarizing | 0.5775 | 0.4098 | 0.4290 |
Baseline:random | 0.4969 | 0.2933 | 0.3201 |
Method \ System | ROUGE - C1 | ROUGE - C2 | ROUGE - CL | ROUGE – CSU4 |
GA-4feature | 0.5041 | 0.4968 | 0.5041 | 0.4864 |
MA-SingleDocSum | 0.5044 | 0.4945 | 0.5044 | 0.4803 |
TextRank | 0.4402 | 0.4290 | 0.4402 | 0.4128 |
GA-Multilanguage | 0.3915 | 0.3867 | 0.3915 | 0.3793 |
MOW2007 | 0.3688 | 0.3567 | 0.3654 | 0.3395 |
MOW 2003 | 0.3559 | 0.3438 | 0.3527 | 0.3266 |
GA-Bag of words | 0.3477 | 0.3411 | 0.3477 | 0.3309 |
OTS | 0.3509 | 0.3413 | 0.3490 | 0.3272 |
Text Compactor | 0.3406 | 0.3315 | 0.3389 | 0.3177 |
Summarizing | 0.2754 | 0.2636 | 0.2726 | 0.2466 |
Baseline:first | 0.2791 | 0.2756 | 0.2764 | 0.2699 |
Copernic | 0.2971 | 0.2852 | 0.2952 | 0.2733 |
Baseline:random | 0.2538 | 0.2322 | 0.2475 | 0.2147 |
Method \ System | JS-SMT | JS-WSMT |
GA-4feature | 0.8524 | 0.8436 |
MA-SingleDocSum | 0.8452 | 0.8362 |
TextRank | 0.8223 | 0.8120 |
GA-Multilanguage | 0.7950 | 0.7812 |
MOW2007 | 0.7920 | 0.7773 |
MOW 2003 | 0.7858 | 0.7702 |
GA-Bag of words | 0.7796 | 0.7634 |
OTS | 0.7745 | 0.7592 |
Text Compactor | 0.7690 | 0.7526 |
Summarizing | 0.7343 | 0.7124 |
Baseline:first | 0.7321 | 0.7107 |
Copernic | 0.7250 | 0.7061 |
Baseline:random | 0.7105 | 0.6884 |
According to the results presented in Table 7 for ROUGE, all methods and systems outperform the baseline:random heuristic. However, only one method outperforms baseline:first. The baseline:first heuristic for the TER corpus is very high because of how the news items were written (the most important information appears at the beginning), as well as how the humans selected the sentences to produce the model summaries. Methods and systems that are evaluated using the model summaries as a reference aim to outperform the baseline:first heuristic. The maximum result that can be reached when evaluating generated summaries on the TER corpus (topline) is shown in the first row of results in Table 7.
The results of the evaluations of methods and systems of SAETS with ROUGE-C and JS are very similar with respect to the position of the methods and systems in the ranking. For ROUGE-C and JS, the baseline:first heuristic does not have much relevance because the evaluation reference is the complete document. According to the results presented in Table 8 for ROUGE-C (R-C) and Table 9 for JS, all methods and systems outperform the baseline:random heuristic and only one system does not overcome the baseline:first heuristic.
Despite the differences between the evaluation methods presented, the state-of-the-art methods largely keep the same relative order in their results.
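The agreement between two of these rankings can be checked with a rank correlation. The following is a minimal sketch, not part of the original evaluation: it takes the ROUGE-C1 column of Table 8 and the JS-SMT column of Table 9, ranks the entries from highest to lowest score, and computes Spearman's rho with the no-ties formula. The function names `positions` and `spearman_rho` are illustrative, and the sketch assumes that a higher score means a better position under both measures, as in the tables.

```python
# Scores copied from Table 8 (ROUGE-C1) and Table 9 (JS-SMT).
rouge_c1 = {
    "GA-4feature": 0.5041, "MA-SingleDocSum": 0.5044, "TextRank": 0.4402,
    "GA-Multilanguage": 0.3915, "MOW 2007": 0.3688, "MOW 2003": 0.3559,
    "GA-Bag of words": 0.3477, "OTS": 0.3509, "Text Compactor": 0.3406,
    "Summarizing": 0.2754, "Baseline:first": 0.2791, "Copernic": 0.2971,
    "Baseline:random": 0.2538,
}
js_smt = {
    "GA-4feature": 0.8524, "MA-SingleDocSum": 0.8452, "TextRank": 0.8223,
    "GA-Multilanguage": 0.7950, "MOW 2007": 0.7920, "MOW 2003": 0.7858,
    "GA-Bag of words": 0.7796, "OTS": 0.7745, "Text Compactor": 0.7690,
    "Summarizing": 0.7343, "Baseline:first": 0.7321, "Copernic": 0.7250,
    "Baseline:random": 0.7105,
}


def positions(scores):
    """Map each name to its 1-based position, highest score first.

    Assumes there are no tied scores, which holds for both columns above.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two score tables over the same names."""
    pos_a, pos_b = positions(scores_a), positions(scores_b)
    n = len(pos_a)
    squared_diffs = sum((pos_a[name] - pos_b[name]) ** 2 for name in pos_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


rho = spearman_rho(rouge_c1, js_smt)
print(f"Spearman rho between ROUGE-C1 and JS-SMT rankings: {rho:.3f}")
```

A value of rho close to 1 means that the two measures order the methods and systems almost identically. The same function can be applied to any other pair of columns from Tables 8 and 9.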
5.3.1 The Ranking Results of the State-of-the-Art Methods and Systems
The main objective of this paper is to update the state of the methods and systems for SAETS. In addition, based on the results obtained with the evaluation methods, a rank matrix is generated to compare the positions that the methods and systems currently occupy.
Because three evaluation methods were used (ROUGE, ROUGE-C, and JS) and each computes its scores differently (see Tables 4-6), it is not possible to determine directly which of the methods or systems is the best. Therefore, a unified ranking of the methods and systems is proposed that considers the position each method and system occupies under each evaluation measure. Table 10 shows the position of each method and system with respect to the results obtained by each measure. The resulting rank was calculated as proposed in [54] (see Eq. 19):
$$R = \sum_{r=1}^{n} \frac{(n - r + 1)\, R_r}{n} \qquad (19)$$
where $n$ is the number of methods and systems involved in the comparison, and $R_r$ is the number of times that the method or system occupies the $r$-th position.
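The rank computation can be sketched in a few lines. The example below assumes that Eq. 19 is the weighted sum R = Σ (n − r + 1)·R_r / n taken over r = 1..n, which is consistent with the definitions of n and R_r given above. With n = 11 this form reproduces the R column of Table 10 when the value is truncated to one decimal. The function name `rank_score` is illustrative, and the position counts are copied from Table 10.

```python
from typing import Sequence


def rank_score(position_counts: Sequence[int]) -> float:
    """Weighted rank score, assuming R = sum_{r=1}^{n} (n - r + 1) * R_r / n.

    position_counts[r - 1] is R_r, the number of evaluation measures under
    which the method or system occupies the r-th position; n is taken to be
    the number of positions, that is, the number of compared methods and
    systems.
    """
    n = len(position_counts)
    weighted = sum(
        (n - r + 1) * count
        for r, count in enumerate(position_counts, start=1)
    )
    return weighted / n


# Position counts copied from Table 10 (n = 11 methods and systems).
table_10_counts = {
    "GA-4feature": [4, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "MA-SingleDocSum": [2, 4, 0, 3, 0, 0, 0, 0, 0, 0, 0],
    "Copernic": [0, 0, 0, 0, 0, 0, 0, 3, 0, 4, 2],
    "Summarizing": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 7],
}

for name, counts in table_10_counts.items():
    print(f"{name}: {rank_score(counts):.3f}")
```

A method that occupies the first position under every measure receives the largest possible weight for each of its counts, so a higher R indicates a better overall position.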
Method/system | R(1) | R(2) | R(3) | R(4) | R(5) | R(6) | R(7) | R(8) | R(9) | R(10) | R(11) | R |
GA-4feature | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.5 |
MA-SingleDocSum | 2 | 4 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.8 |
GA-Multilanguage | 3 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7.3 |
TextRank | 0 | 0 | 6 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6.2 |
GA-Bag of words | 0 | 0 | 3 | 0 | 0 | 1 | 4 | 1 | 0 | 0 | 0 | 5.1 |
MOW 2007 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 3 | 0 | 0 | 4.6 |
OTS | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 4 | 0 | 0 | 0 | 4.2 |
MOW 2003 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 1 | 0 | 3 | 0 | 3.6 |
Text Compactor | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 6 | 0 | 0 | 3.2 |
Copernic | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 4 | 2 | 2.0 |
Summarizing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 7 | 1.0 |
According to the results shown in Table 10, the best state-of-the-art method is GA-4feature, and the Summarizing system obtains the lowest result.
6 Conclusions and Future Work
Automatic extractive text summarization has been under research for 60 years.
However, the progress made in Spanish Automatic Extractive Text Summarization had not been documented before this paper. In this paper, we evaluated the best state-of-the-art AETS methods and systems on a standard corpus in Spanish.
The evaluation was carried out with ROUGE, ROUGE-C, and JS evaluation measures.
The results obtained with ROUGE show that the state-of-the-art methods and systems still face a challenge: the baseline:first heuristic scores very high, and only one method manages to exceed it.
For ROUGE-C and JS, all the state-of-the-art methods and three of the four tested systems outperform the baseline:first heuristic.
All state-of-the-art methods and systems outperform the baseline:random heuristic.
For English, the methods MA-SingleDocSum, GA-Multilanguage, GA-Bag of words, and GA-4feature outperform the baseline:first heuristic.
However, for Spanish, only GA-Multilanguage exceeds it for the ROUGE measures (ROUGE-1, ROUGE-2, and ROUGE-SU4). There is no evidence of evaluations of state-of-the-art methods in English with ROUGE-C and JS. Therefore, the results show that the conclusions obtained for English do not carry over to Spanish.
The degree of progress for Spanish was ascertained using the ranking of the state-of-the-art methods and systems for the AETS shown in Table 10.
The results shown in this paper open the opportunity for new research that uses the TER corpus to test other state-of-the-art methods such as [55-62], among others. In addition, the parameters of the methods and systems tested in this paper could be tuned to obtain better results in SAETS.