1 Introduction
An ontology of a domain D is a specification of a conceptualization of D. An ontology typically consists of a list of concepts important for domain D, a list of attributes describing the concepts, a list of taxonomic relationships among these concepts and a list of non-taxonomic semantic relationships among these concepts [5]. Ontologies are the fundamental form of knowledge representation in the Semantic Web. The vast majority of currently used ontologies have been built entirely by hand, even those in the OBO Foundry. This manual development process represents a major knowledge acquisition bottleneck. One consequence has been a series of ongoing efforts, largely led by members of the Natural Language Processing (NLP) and Text Mining communities, to automate or semi-automate the ontology construction process.
In summary, Ontology Learning is motivated by the high manual cost of ontology construction, the continuous change in science and knowledge in general, the very large and exponentially growing amount of existing text, and the extensive need for a variety of ontology-type resources such as vocabularies, taxonomies and formal taxonomies [5].
Challenges in Ontology Learning for Semantic Web technologies are caused by data heterogeneity and uncertainty on the Web. Applications relying on reasoning need consistent ontologies, which must be explicitly supported by ontology learning. Knowledge extraction from growing amounts of Web data requires scalable ontology learning.
Data quality is enforced by ontology evaluation, which checks formal correctness, completeness and consistency; human involvement increases the quality of learned ontologies. Ontologies are mainly applied on the Semantic Web, where Web pages are annotated with ontologies, or user queries for Web pages are analyzed at the knowledge level and answered by inference on ontological knowledge. Other important applications include knowledge representation and knowledge management systems, intelligent query-answering systems, and information retrieval and extraction.
In Section 2 the fundamental types of Ontology Learning are listed along with a brief explanation. The ontology learning tasks are detailed in Section 3. Measures and methods of evaluation are addressed in Section 4. Section 5 presents seven systems whose general purpose is to learn semantic knowledge from texts for later use in ontology development.
2 Fundamental Types of Ontology Learning
Ontologies can be “learned” automatically. Ontology Learning defines a set of methods and techniques for the fundamental development of new ontologies, and for the extension or adaptation of already existing ontologies, in a (semi-)automatic way from various resources. Four fundamental types have been identified, and they are briefly described in the following paragraphs.
2.1 Ontology Learning from Text
This type includes automatic or semi-automatic generation of lightweight ontologies by means of text mining and information extraction.
2.2 Linked Data Mining
This type detects meaningful patterns in RDF graphs via statistical schema induction or statistical relational learning.
3 Ontology Learning Tasks
Ontology learning is used to (semi-)automatically extract whole ontologies from natural language text [4, 6, 7]. The process is usually split into the following tasks, not all of which are necessarily applied in every ontology learning system.
3.1 Ontology Schema Extraction
Schema extraction is the extraction of ontology schemata from heterogeneous documents with the help of human experts. A corpus of texts has to be identified, collected and preprocessed in advance. Some works on schema extraction target the main (root) classes in the process of ontology creation.
The OURAL project (Ontologies for the Use of digital learning Resources and semantic Annotations on Line) is proposed in [20]. OURAL integrates areas such as educational sciences, informatics and cognitive psychology in order to create new services for e-learning. As a result, classes can be obtained by applying Natural Language Processing (NLP) techniques to specific learning situations described in natural language. In [18], the authors also analyze the educational domain; however, since they applied it to the Chinese language, a preprocessing step was carried out in order to analyze characteristics of the language such as coupling, relevance and consensus.
Investigations such as [30] present methods for extracting classes in a semi-automatic way. A database of verbs, diathesis alternations and syntactic-semantic schemes of Spanish (ADESSE) [42] is presented. ADESSE stores approximately 160,000 clauses retrieved from a corpus, which are used for the extraction of semantic patterns that lead to the determination of the ontology classes. This methodology was applied in the educational area and replicated in a financial environment [30]. The extraction of classes was complemented with the opinion of domain experts. A method for concept extraction using pattern extraction, linguistic calculations and weight calculations, with NLP techniques such as morphological tagging, is proposed in [30].
3.2 Ontology Creation
This task consists of the design of an ontology from scratch by a team of experts supported by Machine Learning techniques. Experts suggest well-suited relations among concepts. Finally, the reasoner evaluates the consistency of the designed ontology from its hierarchy construction. Most ontologies contain subsumption hierarchies of classes; however, it may also be desirable to extract different types of axioms from the text, including disjointness and equivalence. According to the ontological process, the next step is the creation of the ontology. Table 1 shows the works for both automatic and manual creation of ontologies, with a column specifying the domain addressed in each investigation. The Artequakt project, proposed in [1], is a system that generates biographies of painters using tools such as WordNet.
| Building Method | Authors | Domain |
|---|---|---|
| Automatic | [1] | Biographies of painters |
| | [43] | Technical and medical texts |
| | [25] | Unstructured document (FIFA) |
| | [30] | ADESSE |
| | [11] | EOLSS collection |
| Manual | [44] | Basic news |
| | [50] | English learning material |
| | [47] | Online courses handbooks |
| | [9] | Science history |
| | [10] | Online education |
| | [13] [3] | Software Engineering courses |
| | [2] | ETN |
| | [38] | Intelligence levels |
| | [41] | Level K12 books |
| | [23] [22] | E-learning |
Other investigations, such as [25], present a mechanism for ontology construction based on episode extraction in a domain of unstructured documents. Since the projects are carried out for the Chinese language, the main focus is to study the characteristics of the language prior to ontology construction. The projects are tested with FIFA news and evaluated with retrieval metrics such as precision and recall.
In addition to the previously mentioned research, other works related to ontology creation were analyzed. In [43], an environment is developed for the incremental extraction of knowledge from natural language text. The authors propose a hybrid methodology that uses POS tagging and WordNet for the extraction of key elements. The final process is semi-automatic and requires prior training on the domain corpus; in addition, taxonomic relationships and semantic concepts are extracted.
A methodology for obtaining information for the automatic construction of ontologies from Spanish text is proposed in [30], mainly for knowledge extraction from the Web. The methodology is based on three sequential stages: search for concepts, extraction of relationships and construction of ontologies.
Finally, in [11], a similar investigation is carried out, but using a method that does not analyze the syntactic structure of the language and instead studies its level of semantic depth (allowing multilingual scenarios); it also uses anaphora resolution techniques, grouping and lexico-syntactic pattern extraction.
The investigations on manual ontology construction focus on their domain and on evaluation, considering the approach used for the proposal (mainly pedagogical research). An ontology for event recognition using 20 news articles drawn from a set of articles is presented in [44]; it describes academic life at the basic level. Pattern extraction and evaluation are performed with IR metrics, reporting accuracy above 90%.
Ontologies for classroom learning are presented in [18] and [23]. The first proposes an ontology for the interaction between the student and the teacher in the English language, while in the second the ontology involves the use of the Internet for the improvement of learning. An ontology is proposed for each entity participating in the teaching-learning process. The evaluation was performed manually by domain experts.
Other investigations, such as [47], [10], [13] and more recently [14], focus on online education, creating ontologies manually based on the resources available to students online. They mainly use XML to perform the tests, which are evaluated manually. The work in [3] presents a domain ontology over use-case diagrams created for online environments, specifically for a Software Engineering course; this work is also evaluated by domain experts.
Other authors [41] focus on autonomous online learning, proposing an ontology based on the Internet of Things; more than online learning, they focus on learning inside the classroom with the help of technology, taking as reference the types of intelligence of the students. In [2], the process of ontology creation from information on courses offered at the higher level is described, where the student can choose the courses to take according to his or her academic background. The structure and hierarchy of the classes are built manually.
The research in [9] proposes a Web application to help users examine a conceptual space and explore temporal relations between scientific events. The ontology was formulated using a small number of general predicates (semantic and physical) and a detailed analysis of the rules and relationships that compose it. In [38], an ontology on the lifestyle of people with noncommunicable diseases is presented using Semantic Web tools. The main focus of that research is the use of techniques for ontology integration.
3.3 Extraction of Ontology Instances
This step consists of the extraction of ontology instances from semi-structured or unstructured data to populate already existing ontology schemata with individuals. Technologies from Information Retrieval and Data Mining are applied. For ontology population, investigations have carried out this activity in both automatic and semi-automatic ways. Table 2 shows some of the works and the domain in which they performed their experiments.
A method to populate ontologies using text fragments retrieved from Google is proposed in [19]. It is based on hand-made patterns for classes and relationships. These patterns are submitted as Google queries, and the new instances are then analyzed before being used. In other research [36], two methodologies are proposed that address the automatic instantiation of ontologies from a point of view combining traditional linguistic analysis with technologies for the extraction of textual knowledge. The analysis is based on contextual distance and on the knowledge gain based on semantic roles.
Domain-independent population with an unsupervised automatic model is proposed in [46], using tourism texts extracted from Wikipedia.
Tourism texts are also used in [15], in addition to a legal corpus, to propose a generic process for automatic population. The authors use extraction of grammatical categories, named entities and a morphological tagger, and they evaluate by querying and with WordNet.
The authors of [35] also work on automatic population, in the domain of PhD thesis protocol academic profiles. They use curricular records and summaries of scientific publications in Spanish. The assessment is made against a set of class individuals, relationships between individuals and property values, which were identified and represented by experts in the scientific-academic domain.
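The idea of hand-made patterns for instance extraction can be illustrated with a minimal sketch. It is not the method of [19]: the single "such as" pattern, the function name `extract_instances` and the sample sentence are assumptions made for illustration, and the pattern is applied to a local text fragment rather than to search engine results.

```python
import re

def extract_instances(class_label, text):
    """Return candidate instances of `class_label` found via a 'such as' pattern."""
    # Hypothetical hand-written pattern: "<class>(s) such as A, B and C".
    pattern = re.compile(
        rf"{re.escape(class_label)}s?\s+such\s+as\s+([^.;]+)", re.IGNORECASE
    )
    # Keep only the leading run of capitalized words of each enumerated item.
    proper_name = re.compile(r"(?:[A-Z][\w-]*\s?)+")
    candidates = []
    for match in pattern.finditer(text):
        # Split the enumeration on commas and on the words "and" / "or".
        for part in re.split(r",|\band\b|\bor\b", match.group(1)):
            name = proper_name.match(part.strip())
            if name:
                candidates.append(name.group(0).strip())
    return candidates

snippet = ("Painters such as Rembrandt, Vermeer and Van Gogh "
           "were active in the Netherlands.")
instances = extract_instances("painter", snippet)
print(instances)
```

In a real population pipeline the candidates would still have to be validated before being added to the ontology, as the text above notes.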
Other authors handle semi-automatic population, taking as reference experts in the domain under analysis and NLP techniques. In [44], a system for the semi-automatic population of ontologies with instances from unstructured text is described. It applies supervised learning using tools such as [39], which presents a weakly supervised approach to ontology population that relies on manual analysis and generates a set of weighted features. The evaluation is performed with IR metrics (precision and recall).
A Web question answering system that combines multiple knowledge bases is presented in [14]. It uses an NLP parser that transforms queries into SPARQL.
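As a reminder of how the IR metrics mentioned above apply to ontology population, the following sketch computes precision and recall of a set of extracted instances against a gold standard. The function name and the sample sets are illustrative assumptions and do not come from any of the cited works.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted instances against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted instances that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard instances that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical system output and expert-built reference set.
extracted = {"Rembrandt", "Vermeer", "Amsterdam"}
gold = {"Rembrandt", "Vermeer", "Van Gogh", "Monet"}
precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```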
4 Evaluation
Ontology evaluation is based on measures and methods to examine a set of criteria. Ontology evaluation approaches basically differ in how many of these criteria they target and in their main motivation for evaluating the taxonomy. Six basic methods for ontology evaluation have been identified: metric, natural language, clean, lexicon, corpus and task based. In the next sections a brief explanation of each one is provided, along with the techniques related to them [37, 34].
4.1 Metric Based Evaluation
It presents a set of processes that the user is expected to carry out in order to obtain stability measures of existing ontologies. The following features can be used as metrics for evaluation: the ontology's content and language, the methodology followed to develop the ontology, software environments, and costs.
4.2 Clean Based Evaluation
The main focus of this method is to help the users to clean the taxonomies. It provides structural and fundamental insight into the model by considering aspects such as rigidity, unity, identity and dependence. The main application of this evaluation method is to clean the upper level of the WordNet taxonomy [48].
4.3 Lexicon Based Evaluation
This evaluation method is applied to the results of automatic ontology mining techniques that aim to create ontologies, not to populate ontologies with instances. The evaluation focuses on the scope of the vocabulary, the wellness of the taxonomy and the adequacy of the non-taxonomic relations. The overall quality of an ontology is determined not only by the quality of the artifact itself, but also by the quality of its evaluation method. Providing an analysis of the set-up and conditions under which an evaluation of an ontology takes place can only be beneficial to the entire domain of ontology engineering.
4.4 Corpus Based Evaluation
Corpus-based approaches, also known as data-driven approaches, are used to evaluate how far an ontology sufficiently covers a given domain. One basic approach is to perform an automated term extraction on the corpus and simply count the number of concepts that overlap between the ontology and the corpus. Another approach is to use a vector space representation of the concepts in both the corpus and the ontology under evaluation in order to measure the fit between them. The authors of [40] evaluate the quality of their taxonomy, constructed from a large text corpus, by comparing it with six topic-specific gold standard taxonomies. These six reference taxonomies are generated from Wikipedia using their proposed GraBTax algorithm.
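The two basic corpus-based measures described above can be sketched as follows. This is a simplified illustration under stated assumptions: concept labels are single lowercase words, term extraction is reduced to plain tokenization, and the vector space is built from raw term frequencies. The function names and the toy corpus are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Very rough term extraction: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def concept_coverage(concepts, corpus_text):
    """Fraction of ontology concept labels that also occur as corpus terms."""
    corpus_terms = set(tokenize(corpus_text))
    labels = {c.lower() for c in concepts}
    if not labels:
        return 0.0
    return len(labels & corpus_terms) / len(labels)

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy corpus and concept list, invented for this example.
corpus = "The student enrolls in a course. The teacher grades the student."
concepts = ["student", "teacher", "course", "exam"]

coverage = concept_coverage(concepts, corpus)
fit = cosine_similarity(Counter(tokenize(corpus)), Counter(concepts))
print(f"coverage={coverage:.2f} fit={fit:.2f}")
```

A realistic evaluation would handle multi-word terms, remove stop words and weight terms (for example with TF-IDF) before comparing the two representations.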
4.5 Task-based Evaluation
Task-based approaches try to measure how far an ontology helps improve the results of a certain task. This type of evaluation considers that a given ontology is intended for a particular task and evaluates it only according to its performance in this task, regardless of all structural characteristics. Adapting an existing task-based evaluation, the approach in [33] explains how crowdsourcing, involving application users, can efficiently help improve an application ontology throughout the ontology lifecycle. A real case experiment on an application ontology designed for the semantic annotation of Geo-business user data illustrates the proposal. The next section presents the techniques used by seven prominent ontology learning systems and the evaluation of these techniques.
5 Ontologies Learning Systems
For each system, an overview is provided in terms of its developers, the motivation behind the system and its application domains, together with the techniques the system employs for the corresponding tasks.
ASIUM [17, 16] is a semi-automated ontology learning system. The aim of this approach is to learn semantic knowledge from texts and use that knowledge for portability from one domain to another. ASIUM uses linguistics and statistics-based techniques to perform its ontology learning tasks: preprocessing texts and discovering subcategorization frames, extracting terms and forming concepts, and constructing the hierarchy.
Text-to-Onto [8, 26, 27] is a semi-automated system that is part of an ontology management infrastructure called KAON, a comprehensive tool suite for ontology creation and management. Text-to-Onto uses linguistics and statistics-based techniques to perform its ontology learning tasks: preprocessing texts and extracting terms, forming concepts, constructing the hierarchy, and discovering and labeling non-taxonomic relations.
TextStorm/Clouds [31] is a semi-automated ontology learning system that is part of an idea sharing and generation system called Dr. Divago [32]. The aim of this approach is to build and refine a domain ontology for use in Dr. Divago, which searches resources in a multi-domain environment in order to generate musical pieces or drawings. TextStorm/Clouds uses logic and linguistics-based techniques to perform its ontology learning tasks: preprocessing texts and extracting terms, constructing the hierarchy, discovering and labeling non-taxonomic relations, and extracting axioms.
SYNDIKATE [21] is a stand-alone automated ontology learning system. SYNDIKATE uses only linguistics-based techniques to perform its ontology learning tasks: extracting terms, forming concepts, constructing the hierarchy, and discovering and labeling non-taxonomic relations.
OntoLearn [28, 29, 45] is part of a project for developing an interoperable infrastructure for small and medium enterprises in the tourism sector under the Federated European Tourism Information System (FETISH). OntoLearn uses linguistics and statistics-based techniques to perform its ontology learning tasks: preprocessing texts and extracting terms, forming concepts, and constructing hierarchies.
CRCTOL [24], which stands for concept-relation-concept tuple-based ontology learning, is a system for constructing ontologies from domain-specific documents. CRCTOL uses linguistics and statistics-based techniques to perform its ontology learning tasks: preprocessing texts, extracting terms and forming concepts, constructing the hierarchy, and discovering non-taxonomic relations.
The more recent OntoGain system [12], from the Technical University of Crete, is designed for the unsupervised acquisition of ontologies from unstructured text. OntoGain has been tested against Text2Onto, the successor of Text-to-Onto, in two different domains, namely the medical and computer science domains. OntoGain uses linguistics and statistics-based techniques to perform its ontology learning tasks: preprocessing texts, extracting terms and forming concepts, constructing the hierarchy, and discovering non-taxonomic relations.
There are several key issues that will likely define the research directions in this area in the near future [37], namely: (1) the issue of noise, authority, and validity in Web data for ontology learning; (2) the integration of social data into the learning process to incorporate consensus into ontology building; (3) the design of new techniques for exploiting the structural richness of collaboratively maintained Web data; (4) the representation of ontological entities as language-independent constructs; (5) the applicability of existing techniques for learning ontologies for different writing systems (e.g., alphabetic, logographic); (6) the efficiency and robustness of existing techniques for Web-scale ontology learning; (7) the increasing role of ontology mapping as more ontologies become available; and (8) the extensibility of existing lightweight ontologies to formal ones.
6 Conclusion and Future Work
Ontology learning (OL) is an active field of research that aims to facilitate the task of ontology engineering. Another important role of OL is to allow ontologies to be kept up to date more effectively. Ontology evaluation remains an important open problem, and several novel approaches have been proposed. Diverse systems and tools are under development in the OL area. No single method is efficient by itself; instead, a combination of methods chosen according to the application problem is recommended. Open research areas related to ontology learning include Web-scale learning, open heterogeneous data repositories, social networks, formal languages and cross-language learning, among others.