1 Introduction
On the web today, there are multiple digital library [7] sites that support the collection, preservation, management and retrieval of electronic documents in different formats. An open access repository [29] can be seen as a domain digital library that comprises documents of two main types: thematic documents related to a specific knowledge area, and institutional documents associated with the cultural and scientific production of a university or research facility.
In Mexico, CONACyT (National Council of Science and Technology) created a public national repository to store, manage, preserve and disseminate scientific, cultural and technological knowledge derived from the research, academic products and technological deployments of Mexican institutions. Following CONACyT's initiative and sponsorship, the Universidad de las Americas Puebla (UDLAP) launched POHUA, its own digital institutional repository, to compile the knowledge generated by its academic community over the years (bachelor and postgraduate studies) and contained in theses, dissertations, scientific articles and magazines, among others.
The creation of a domain-based ontology that represents the main aspects of POHUA started in parallel, with the aim of constructing a richer semantic representation of an open-access repository. The ontology helps to model the main actors, elements and interactions typically present in a university and therefore helps to build this kind of knowledge base accurately according to the institution's needs. Additionally, one of the main advantages of this kind of modeling is that the representation can be applied to other universities as a standard for a specific archive. Finally, this paper details the methodology followed to create the classes, relationships, instances and rules/inferences using a well-known ontology modeling tool (Protégé).
The remainder of this paper is structured as follows. Section 2 presents existing digital repositories associated with the modeling of institutional environments. Section 3 provides details on the design and implementation of the ontology created for POHUA. Section 4 discusses the relevance of the ontology. Finally, Section 5 presents the implications and conclusions derived from this work so far.
2 Related Work
Digital repositories have a rich background related to information management on distinct topics and domains [31,55,60]. These archives have become an increasingly complex landscape for public and paid articles around the web [2,5,43].
Among the most popular engines for finding information related to digital repositories, citation/integration databases [23,38,42] have emerged as a suitable option because they provide metadata associated with articles, specific metrics (such as citations or number of readers) and access to full papers. Examples of such websites are listed in Table 1.
Citation engine | A | B | C | Subject area |
---|---|---|---|---|
ACM [4] | | | | Computer science |
AMiner [45] | | | | Networking information |
CiteSeerX [46] | | | | Scientific documents |
EBSCO [20] | | | | Different subjects |
Google Scholar [3,47] | | | | |
IEEE Xplore [24] | | | | Engineering |
LA referencia [32] | | | | Different subjects |
Microsoft Academic [48] | | | | |
ResearchGate [34] | | | | |
ProQuest [52] | | | | |
Scopus [58] | | | | Scientific documents |
Springer [62] | | | | Science and business |
A: Free or restricted access ()
In the case of digital archives that are widely used without many citation/integration features, institutional repositories [37] have been gaining ground in recent years [50,63,81] thanks to their efficient storage, management and browsing of different types of documents related to academic or research topics. Mexican institutional repositories have grown in parallel, but these efforts have not been enough to create widely open sites that can handle different types of documents [1,21]. Considering the above, Table 2 compares the international and Mexican repositories created so far to publish scientific and academic information.
International Repositories

Repository | Language | Thesis | Articles | Books | Others | Basic search | Medium search | Advanced search | DSpace software |
---|---|---|---|---|---|---|---|---|---|
B-Columbia [6] | English | | | | | | | | |
Caltech [9] | | | | | | | | | |
Cambridge [10] | | | | | | | | | |
Columbia [11,13] | | | | | | | | | |
Dialnet [18] | Spanish | | | | | | | | |
Harvard [22] | English | | | | | | | | |
MIT [35,40] | | | | | | | | | |
Oxford [8,49] | | | | | | | | | |
PolyU [51] | Chinese and English | | | | | | | | |
RABCI [54] | Portuguese | | | | | | | | |
Scielo [59] | Spanish and English | | | | | | | | |
UPM [80] | Portuguese | | | | | | | | |
UNESP [79] | Spanish | | | | | | | | |
Yale [82] | English | | | | | | | | |

Mexican Repositories

Repository | Language | Thesis | Articles | Books | Others | Basic search | Medium search | Advanced search | DSpace software |
---|---|---|---|---|---|---|---|---|---|
COLPOS [12] | Spanish and English | | | | | | | | |
CUDI [17] | | | | | | | | | |
INSP [25] | | | | | | | | | |
IPN [26] | | | | | | | | | |
ITESM [27] | | | | | | | | | |
ITESO [28] | | | | | | | | | |
UASLP [68] | | | | | | | | | |
Redalyc [56] | | | | | | | | | |
Remeri [57] | | | | | | | | | |
UACJ [64] | | | | | | | | | |
UAEH [65] | | | | | | | | | |
UAM [66] | | | | | | | | | |
UANL [67] | | | | | | | | | |
UGuadalajara [77] | | | | | | | | | |
UNAM [78] | | | | | | | | | |
UDLAP POHUA Proposed ontology | | | | | | | | | |
From Tables 1 and 2 it can be observed that there are multiple options for finding knowledge associated with the scientific and cultural production of different topics, domains and even languages (most of them in English). The citation/integration engines provide tools for mining valuable metadata related to articles, books and even magazines, but in most cases these tools do not support full-text access to documents. In contrast, institutional repositories have been contributing to the management, curation and dissemination of full documents (in most cases) linked to academic or research institutions around the world.
Most institutional repositories manage a wide range of documents, depending on the production rate of their researchers, faculty members or students. Examples of documents included in these repositories are theses, dissertations, articles, books and essays. One of the major advantages of these repositories is the compilation of all the institutional knowledge created so far, which facilitates access to and dissemination of information, while the major disadvantage is the difficulty of keeping track of all the documents produced over the years.
In Mexico, institutional repositories have only in recent years increased their visibility as a viable option for storing information. Most of these repositories have their own policies to produce, save and manage digital documents, which makes interaction among distinct archives difficult.
Additionally, the use of specific metadata to describe each type of document complicates the automatic use and analysis of documents. In this sense, CONACyT has made several efforts to integrate these repositories [15], using the same interoperability standards, procedures and types of documents, into a single national repository that links each academic and scientific document produced in each institution around the country by means of open access policies. Finally, it is important to note that, independently of the type of repository analyzed, most of them are based on the DSpace platform [19,33,83], which provides an easy-to-use platform focused on the long-term storage, access and preservation of digital content through the interaction of users and communities. Most repositories also rely on the tools provided by DSpace for searching documents using classical techniques [36], but few of them apply semantic approaches to implement an advanced search that can expand queries based on the content and meaning of terms [44,61].
3 Ontology Modeling and Implementation
In this section, key aspects associated with the creation of the POHUA ontology using Protégé are provided. In particular, the section discusses the description of classes found in UDLAP's scientific and cultural production, the relationships and restrictions among these classes, and the creation of instances that show the semantic expressiveness and functionalities of the ontology. It is important to remark that the ontology implemented is based on a previously created one called Onto4AIR [39], which models the basic functionality of a repository in a university context according to the general and technical requirements of CONACyT Call 2016 [14].
3.1 Protégé: Ontology Modeling Tool
In order to create a formal and explicit description of an institutional repository (ontology creation), the open source tool called Protégé [41] was used. This tool provides an easy-to-use system for creating domain models and knowledge-based applications [16,30]. Among the different features offered by this tool, some of the most relevant for the creation of an ontology are the following:
— A friendly, easy-to-use IDE that allows the implementation of different ontological specifications.
— Integration of different standard languages for creating ontologies, such as RDF or OWL.
— Implementation of different interfaces for adding classes, relationships, restrictions and instances to create domain-specific models.
— Usage of different visualization tools, such as Ontograph, that allow users to interact easily with an ontological model depending on the application needs.
— Employment of multiple reasoners, such as Pellet or Fact, for inferring knowledge based on previously created classes, relationships and restrictions, and for supporting automatic validation of logical consistency.
— Utilization of distinct ontological query languages, such as SPARQL, for extracting or inferring semantic information on specific classes or instances.
From these functionalities, the advantages of Protégé can be observed: it is a tool for creating domain-specific models that not only store information about a topic but also enable users to discover and infer knowledge based on semantic aspects of the information.
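The idea behind querying an ontology can be illustrated without Protégé itself. The following minimal Python sketch is not SPARQL and does not use any Protégé API; it only stores a few subject-predicate-object triples and matches a pattern with wildcards, which is the basic notion behind a triple pattern in a semantic query. The instance and relationship names reuse examples given later in this section, while the `TRIPLES` store and the `match` helper are hypothetical.

```python
# Toy triple store: (subject, predicate, object). The names reuse examples
# from this section; the store and the match() helper are hypothetical.
TRIPLES = [
    ("Person1", "type", "Student"),
    ("Article/Paper1", "type", "ScientificPublication"),
    ("Person1", "AuthorOf", "Article/Paper1"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "What did Person1 author?" expressed as a pattern with an open object.
authored = match(TRIPLES, s="Person1", p="AuthorOf")
print(authored)
```

A real query language adds variables shared across several patterns, filters and inference, but each pattern is resolved in essentially this way.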
3.2 Classes or Concepts Definition
In order to create an ontology using Protégé, the first step performed was the extraction and definition of classes that represent concrete concepts associated with the digital repositories domain.
In this sense, after analyzing the scientific and cultural production of the university, Table 3 presents the sources selected for obtaining relevant information, considering the importance and impact of the publications.
Source | A | B | C | Type |
Thesis [71] | Postgraduate dissertations | |||
Entorno [76] | Magazines | |||
Contexto [75] | ||||
Editorial UDLAP [73] | Books | |||
Articles | Published documents | |||
Datasets [69,70,74] | N/A | Collection of information | ||
Other sources | A | B | C | Type |
CVU único [72] | Professor’s profile | |||
Scopus [58] | Articles metadata | |||
LA referencia [32] |
A: Free or restricted access ()
From Table 3 it can be observed that UDLAP's contributions are mainly associated with the distribution of information in four major fields: the creation of graduate and postgraduate theses, the dissemination of multipurpose magazines/books, the creation of different data sources (datasets) that comprise scientific and cultural information, and the publication of scientific papers. Additionally, the use of other sources to enrich the information related to publications and authors plays a major role in the correct use of the information.
Considering the distinct elements available in UDLAP's sources of information, the following main features were implemented for the creation of classes that accurately represent entities in the ontology repository:
— Create a class in Protégé for each representative (and indivisible) element in the data sources or frameworks used. For example, classes were constructed for the elements collection and community from the DSpace software, and for the elements file and institution, which model specific aspects of UDLAP's documents.
— Assign a distinct name to each class created in the ontology. This, in turn, helps to distinguish one element from the others and avoids ambiguity.
— Permit the use of class names (or abbreviations) in languages other than Spanish, considering that much of the terminology used in DSpace and in cultural and scientific document production is based on English. As examples, consider the terms DOI (Digital Object Identifier) and ISBN (International Standard Book Number).
— Generate a concrete description of the classes using Protégé's Comment attribute, which adds a semantic description that helps users to understand and manage ontology elements.
— Organize representative classes in a hierarchy/taxonomy where components that share similar information can be understood, accessed and manipulated efficiently. It is important to remark that Protégé implements a root class called Thing from which other classes inherit their main characteristics.
— Add parallel classes that share the same meaning using Protégé's Equivalent To attribute, which helps to disambiguate classes that do not share the same name but play a similar role in the ontology. For example, the classes Student, SchoolBoy and Disciple have different names but the same role and actions.
— Append special restrictions for classes that are mutually exclusive using Protégé's Disjoint attribute. As an example, consider the classes Women and Men, where no individual can be an instance of both classes.
According to the main features described above, some representative examples of the hierarchy created using Protégé are presented in Figure 1, taking into account that some classes group together others that represent specific entities in the ontology domain.
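The guidelines above can be sketched in a few lines of plain Python. This is only an illustration of the concepts (hierarchy under Thing, comments, equivalent and disjoint classes), not the Protégé or OWL representation used for POHUA. The class names Student, SchoolBoy, Disciple, Women and Men come from the text; the grouping class `Person` and all helper names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntClass:
    name: str
    parent: Optional["OntClass"] = None      # position in the taxonomy
    comment: str = ""                        # human-readable description
    equivalent_to: set = field(default_factory=set)
    disjoint_with: set = field(default_factory=set)

    def ancestors(self):
        """Names of all superclasses, from the nearest up to the root."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


thing = OntClass("Thing", comment="Root class")
person = OntClass("Person", thing, "Hypothetical grouping class")
CLASSES = {c.name: c for c in [
    thing, person,
    OntClass("Student", person), OntClass("SchoolBoy", person),
    OntClass("Disciple", person),
    OntClass("Women", person), OntClass("Men", person),
]}


def declare_equivalent(*names):
    """Record that all the named classes share the same meaning."""
    for n in names:
        CLASSES[n].equivalent_to |= set(names) - {n}


def declare_disjoint(a, b):
    """Record that no individual may belong to both classes."""
    CLASSES[a].disjoint_with.add(b)
    CLASSES[b].disjoint_with.add(a)


def violates_disjointness(types):
    """True if a set of class names contains two disjoint classes."""
    return any(other in types
               for t in types for other in CLASSES[t].disjoint_with)


declare_equivalent("Student", "SchoolBoy", "Disciple")
declare_disjoint("Women", "Men")
print(CLASSES["Student"].ancestors())
print(violates_disjointness({"Women", "Men"}))
```

Storing equivalence and disjointness as sets of names keeps the sketch short; a reasoner would derive the consequences of these declarations automatically.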
3.3 Properties Definition
The second step performed for the creation of the ontology was the definition of data properties that characterize the previously created classes by adding information related to the nature of the entities. These properties are a valuable asset for representing the special qualities that classes exhibit in the context of an institutional repository, which makes them relevant for the understanding and use of the proposed ontology.
As in Section 3.2, the different sources listed in Table 3 were analyzed and several relevant data properties were identified. Based on this analysis, the following main features were specified for creating properties in the ontology.
— Create several data properties in Protégé for each class previously implemented, keeping in mind that these properties must take concrete values, either discrete or continuous. As examples of data properties, consider Author or Title, which characterize information related to the journal publication class.
— Assign a distinct name to each data property created in the ontology to distinguish it from the others and avoid ambiguity.
— Generate a description of each data property using Protégé's Comment attribute, which adds a semantic description of the property's purpose.
— Group together representative data properties in a hierarchy/taxonomy where components that share similar information can be understood, accessed and manipulated easily. As in the case of classes, Protégé implements a root property called TopDataProperty from which other data properties inherit their main characteristics.
— Add a scope and a value type to each data property using Protégé's Domain and Range attributes, where the first indicates the classes to which a specific data property is attached and the second indicates the type of values that the property can take (integer, float, string, etc.). For example, the property Title is used for several classes, such as Thesis or Book, and has a string value. On the other hand, the data property EmbargoEndingDate, which has a numeric/date nature, is assigned as an exclusive property of published journals.
— Apply a constraint that prevents a data property from holding multiple values for the same instance, using Protégé's Functional attribute.
Keeping in mind the main features presented, some representative examples of the hierarchy implemented using Protégé are presented in Figure 2, which shows the value or range (→) of some of the properties created as well as the scope or domain of the classes they belong to.
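To make the roles of Domain, Range and Functional concrete, the following Python sketch checks a value assertion against those three constraints. It is a simplified illustration under stated assumptions and not the way Protégé or an OWL reasoner enforces them: the properties Title and EmbargoEndingDate come from the text, whereas the class name `JournalPublication`, the instance names and the `assert_value` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataProperty:
    name: str
    domain: frozenset        # class names the property applies to
    range_type: type         # Python type standing in for the value type
    functional: bool = True  # at most one value per instance


TITLE = DataProperty("Title", frozenset({"Thesis", "Book"}), str)
# "JournalPublication" is an assumed name for the published-journal class.
EMBARGO = DataProperty("EmbargoEndingDate",
                       frozenset({"JournalPublication"}), date)


def assert_value(store, prop, instance, instance_class, value):
    """Attach a value to an instance after checking the constraints."""
    if instance_class not in prop.domain:
        raise ValueError(f"{prop.name} does not apply to {instance_class}")
    if not isinstance(value, prop.range_type):
        raise TypeError(f"{prop.name} expects {prop.range_type.__name__}")
    values = store.setdefault((instance, prop.name), [])
    if prop.functional and values and value not in values:
        raise ValueError(f"{prop.name} already has a value for {instance}")
    if value not in values:
        values.append(value)


store = {}
assert_value(store, TITLE, "Thesis1", "Thesis", "A sample thesis title")
assert_value(store, EMBARGO, "Journal1", "JournalPublication",
             date(2020, 1, 1))
print(store)
```

Note that this sketch rejects violations eagerly, while an OWL reasoner treats Domain and Range as axioms from which it infers class membership or detects inconsistency.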
3.4 Relationships Definition
The third step in the creation of the ontology is the definition of relationships among instances in order to group and infer more complex information related to the elements of an institutional repository. One of the major differences between data properties and relationships is that data properties describe the characteristics of classes, while relationships model the actions/associations among classes and their instances.
After analyzing the classes and their corresponding data properties, the following guidelines were used for the creation of the relationships of the proposed ontology.
— Create multiple relationships in Protégé for each possible interaction between two instances in the context of an institutional repository. As an example, consider the relationship ContentBy over instances of the classes File and InstitutionalRepository, where the goal is to capture the notion that an institutional repository stores multiple files.
— Assign a distinct name to each relationship created in the ontology to distinguish the different kinds of relations that instances have among them.
— Generate a description of each relationship using Protégé's Comment attribute to add a semantic description of the connection between instances.
— Group together representative relationships in a hierarchy/taxonomy. As in the case of classes and data properties, Protégé implements a root property called TopObjectProperty from which other relationships inherit their main characteristics.
— Add a scope and a value type to each relationship using Protégé's Domain and Range attributes, where the first specifies the origin classes and the second the destination classes (as in a mathematical function). As an example, consider the relationship AuthorOf, where the classes Academic and Student are the origin classes and the class InformationResource is the destination class. This relationship can be read as follows: an academic or a student is the author of an information resource.
— Implement inverse relationships that exchange the scope and value of previously created ones. Take as an example the relationship WrittenBy, whose inverse is AuthorOf.
— Add Protégé's Functional, Symmetric and Transitive attributes so that new information can be deduced from the connections that relationships form with one another.
Figure 3 shows a subset of the relationships defined at Protégé's root level, showing the name of each relationship and the names of the classes it relates.
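As a minimal illustration of Domain, Range and inverse relationships, the Python sketch below records an AuthorOf assertion and derives the corresponding WrittenBy fact. The relationship and class names are taken from the text; the `ObjectProperty` structure and the `add_assertion` helper are hypothetical, and subclass reasoning (for instance, treating a scientific publication as an information resource) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectProperty:
    name: str
    domain_classes: frozenset   # origin classes
    range_classes: frozenset    # destination classes
    inverse: Optional[str] = None


AUTHOR_OF = ObjectProperty(
    "AuthorOf",
    frozenset({"Academic", "Student"}),
    frozenset({"InformationResource"}),
    inverse="WrittenBy",
)


def add_assertion(facts, prop, subj, subj_class, obj, obj_class):
    """Store subj --prop--> obj and, if declared, the inverse fact."""
    if subj_class not in prop.domain_classes:
        raise ValueError(f"{subj_class} is outside the domain of {prop.name}")
    if obj_class not in prop.range_classes:
        raise ValueError(f"{obj_class} is outside the range of {prop.name}")
    facts.add((subj, prop.name, obj))
    if prop.inverse is not None:
        facts.add((obj, prop.inverse, subj))


facts = set()
add_assertion(facts, AUTHOR_OF,
              "Person1", "Student",
              "Article/Paper1", "InformationResource")
print(sorted(facts))
```

Declaring the inverse once means that either direction of the relationship can be queried even though only one was asserted.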
3.5 Instances Creation
The final step related to the creation of an ontology is the implementation of instances or representative examples that demonstrate how an institutional repository works using all the semantic expressiveness of ontology formal languages. In this sense, different actions were followed for creating instances that represent elements in the repository.
— Create multiple instances of the elements involved in the daily operation of an institutional repository. As an example, consider the creation of specific instances of students, professors, articles and books that emulate the potential users and documents typically found in a repository.
— Associate each instance with its previously created class using Protégé's Types attribute. Take as an example the instance Article/Paper1, which is associated with the ScientificPublication class, or the instance Person1, which is related to the Student class.
— Assign values to the different data properties of the instances created using Protégé's Data property assertion attribute. For example, consider the instance UDLAPRepository, which has a number of properties available such as RepositoryName, Description and NumberOfFiles.
— Append distinct relationships to the instances created using Protégé's Object property assertion attribute. Take as an example the instance Person1, who is the author of Article/Paper1 by means of the relationship AuthorOf.
— Apply one of the reasoners provided by Protégé to infer new knowledge associated with each instance, considering the class type, the properties used and the relationships implemented. As an example, consider the instance Person1, for which it is inferred that it also belongs to the classes Student and Disciple.
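The kind of inference described in the last item can be approximated with a small fixed-point computation. The sketch below is a simplified stand-in for what a reasoner such as Pellet does, assuming only two kinds of axioms: equivalent classes and a single-parent subclass hierarchy. Student, SchoolBoy and Disciple come from Section 3.2; the `Person` superclass and the `infer_types` function are assumptions introduced for the example.

```python
# Axioms assumed for the sketch. "Person" is a hypothetical superclass.
SUBCLASS_OF = {"Student": "Person", "Person": "Thing"}
EQUIVALENT_GROUPS = [{"Student", "SchoolBoy", "Disciple"}]


def infer_types(asserted):
    """Expand asserted class names with equivalent classes and superclasses."""
    inferred = set(asserted)
    frontier = list(asserted)
    while frontier:
        name = frontier.pop()
        candidates = set()
        for group in EQUIVALENT_GROUPS:
            if name in group:
                candidates |= group          # equivalent classes
        if name in SUBCLASS_OF:
            candidates.add(SUBCLASS_OF[name])  # direct superclass
        for new_name in candidates - inferred:
            inferred.add(new_name)
            frontier.append(new_name)
    return inferred


# Person1 is asserted to be a Student only; the rest is derived.
person1_types = infer_types({"Student"})
print(sorted(person1_types))
```

The loop stops when no new class name can be added, so the result does not depend on the order in which the axioms are applied.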
3.6 Ontology Main Features
Table 4 shows a summary of the main characteristics implemented in the ontology created for POHUA.
Metric | Number of elements |
---|---|
Axioms created | 1140 |
Logical axioms created | 667 |
Class count | 116 |
Data property count | 98 |
Relationships count | 16 |
Instances count | 12 |
Annotations (description) count | 431 |
Class assertions | 16 |
Data property assertions | 23 |
Relationship assertions | 19 |
From the above table, it can be observed that multiple axioms were created in the ontology, which is indicative of the variety of definitions, assertions and rules implemented. Additionally, the number of classes, properties and relationships added shows the diversity of people, documents and interconnections needed to model the functionality of an institutional repository accurately.
Finally, the number of assertions obtained validates the correct construction of the ontology, considering the consistency of the elements implemented and how well they interact with the other definitions made.
4 Evaluation and Comparative Analysis
In this section, two main aspects associated with the accessibility and importance of the proposed ontology are presented and discussed. The first is related to the creation of an evaluation tool to measure the impact of the ontology when it is tested by users who have some knowledge of digital repositories. The second is associated with the analysis and comparison of two ontologies: the proposed ontology for POHUA and the one used as inspiration or baseline, Onto4AIR.
4.1 Ontology Evaluation
One of the major challenges related to the analysis of ontologies is the technical evaluation of the main taxonomic components (classes, properties, relationships and instances) to measure the correctness and simplicity of the knowledge representation of specific domains [53]. For this reason, an evaluation tool was proposed in this paper to assess the acceptance and correct understanding of the semantic features created in the ontology among distinct potential users in the context of an institutional repository.
The users who tested the ontology using the evaluation tool belong to two major groups: expert users, who have experience in the creation of institutional repositories along with some knowledge of ontology creation, and non-expert users, who only have some background related to ontology terminology.
Considering the ontology main features (see Section 3.6), ten non-expert users and three expert users were selected to check different aspects related to the consistency of the ontology, taking into account six major categories: correct use of language and terminology, creation of relevant classes, use of representative properties, implementation of meaningful relationships, creation of ideal instances and discovery of new knowledge.
In this sense, Table 5 presents the overall results obtained for each group after analyzing the proposed ontology with the evaluation tool, using a scale from one (worst evaluated) to five (best evaluated).
From Table 5, the following main aspects concerning the evaluation of the ontology can be observed:
— The expert and non-expert users surveyed about the importance of the ontology agree that the representation model helps to understand the overall importance of each element used in the construction of an institutional repository.
— For both types of users, it can also be observed that the taxonomic structure of the ontology was easy to follow, which is indicative of the organized nature of the information stored.
— The expert users consider that classes, properties, relationships and instances are well organized and facilitate the understanding of the ontology. On the other hand, the non-expert users had some difficulty understanding part of the terminology associated with these key elements, which suggests that some terms need concrete descriptions to improve readability.
— Regarding the information inferred from the ontology, both kinds of users consider that more information could be obtained if more rules or restrictions were included in the ontology.
— Finally, for both kinds of users the ontology tool (Protégé) was a little difficult to follow, but the structure of the ontology facilitated the overall understanding of the information, highlighting the proposed structure as a viable option for modeling an institutional repository in the context of a university.
4.2 Ontology Comparison
The following aspects highlight the main differences between the ontology created for POHUA and the ontology called Onto4AIR [39], which was used as a baseline for implementing the main aspects of an institutional repository:
— The first major difference between the semantic models is that the POHUA ontology implements specific classes, data properties, relationships and instances that exemplify the scientific and cultural production of the UDLAP community.
— The second difference concerns the implementation of specific restrictions for the POHUA ontology (such as the type of elements or their scope) according to the documents and users considered in the UDLAP repository.
— The third difference is associated with the annotation of each element created in the POHUA ontology to add semantic information to each element in the repository.
— The fourth difference deals with the addition of specific instances (types of documents and users) that show the full functionality of the POHUA repository, considering the specifics/requirements of UDLAP's academic and cultural production.
— The final major difference is the definition of a specific taxonomy for POHUA that can be adapted according to the university's usage of information, whereas Onto4AIR establishes a number of flexible elements in the ontology to create a model depending on the specific requirements of a university or institute.
5 Conclusions and Future Work
In this paper, the steps performed to create an institutional repository ontology have been presented. For each step, the theoretical and practical implications have been discussed, pointing out hints and examples that illustrate the implementation of an ontology in a knowledge representation tool (Protégé). Considering the implications so far, the contributions and the proposed future work associated with the deployment of a domain ontology are listed below.
This work has the following contributions:
— Review of the current state-of-the-art trends related to the construction of digital repositories in Mexico and the rest of the world, considering the use of citation/integration engines.
— Analysis of the university's scientific and cultural production to determine the best way of implementing a digital repository ontology.
— Extraction of actors and documents associated with the university's data sources to create classes that exemplify relevant entities present in an institutional repository.
— Implementation of data properties that help to characterize and understand the nature of classes.
— Creation of meaningful relationships among classes to emulate the interaction of the main entities in an institutional repository.
— Generation of suitable instances that illustrate the functionality of an institutional repository at UDLAP.
— Usage of a practical tool for evaluating the applicability of an ontology to the deployment of a digital repository.
— Analysis of the best practices for creating an open access repository associated with a university that can be applied to other institutions with a similar context.
— Creation of a general template of the key elements and interactions related to an institutional repository, which in turn can be used to better understand POHUA and therefore the best way of changing or updating specific elements.
We would like to mention as future work:
— Creation of new classes, data properties and relationships that help to improve the ontology, considering new documents produced in the university.
— Generation of new instances that cover all the participants involved in the operation of an institutional repository.
— Implementation of a query-based system built on the semantic analysis of the ontology to discover or infer insightful knowledge related to the institutional repository.
— Creation of distinct modules that allow the DSpace software and the proposed ontology to interoperate, adding semantic information related to the documents stored in the repository.