Introduction
Many geospatial data sources can now generate data almost instantaneously. Imagery from aerial or satellite platforms and the popularization of Unmanned Aerial Vehicles (UAVs), or ‘drones’, have made it possible to generate geospatial datasets at a scale that is difficult to manage, a trend some authors have called ‘big data’ (Crampton et al., 2013). ‘Terabytes are quite typical today’, said Traxler and Hesina (2017). Another important source is crowdsourced data, generated by volunteers almost daily (Neis and Zielstra, 2014). This data overload brings new challenges for the official spatial data suppliers, the National Mapping and Cadastral Agencies (NMCAs). Traditionally, these institutions create and manage authoritative datasets in a standardized way. Today, however, many data ‘producers’ represent the same phenomena, geospatial features, following their own rules. This new scenario may lead users to question the quality of the available datasets.
In these cases, little or no information about the quality of a spatial dataset is available, so we believe that a web service capable of assessing the quality of a test dataset against a reference dataset would be valuable. A data quality validation service is an appealing topic in the geospatial research agenda and has been developed in recent projects (Kruse, 2014).
A recent trend in the geomatics industry is to automate most of the production chain, as seen in recent projects, e.g. the ‘mapping as a service’ in Ordnance Survey Ireland (Coumans, 2016) and the use of UAVs in cadastral mapping (Ramadhani et al., 2016). It is fair to assume that data quality evaluation will also follow this trend.
The state of the art in automated quality control of spatial data has advanced recently. Donaubauer et al. (2008) proposed a web service able to generate quality information about the assessed data. The work relied on well-defined standards: the Web Processing Service (WPS) interface (Schut, 2007), an open specification from the Open Geospatial Consortium (OGC), to run the quality control, and ISO 19115 (ISO, 2003) to report quality by means of metadata elements. Despite the simplicity of the quality procedure, just an overlay of previously tagged data with some quality elements, this study appears to be the first attempt at an automatic evaluation service in the literature. Another study also indicated that quality evaluation can be executed through a WPS (Mobasheri, 2013). More recently, Meek et al. (2016) presented a solution for quality evaluation of crowdsourced data using service chaining and WPS.
A successful line of research on automating positional accuracy evaluation emerged at the Universidad de Jaén with Ruiz-Lendínez (2012). The author proposed a solution for automatic positional accuracy assessment of polygonal features using a matching approach. His thesis presented encouraging results and is the starting point for the current research.
The free availability of some geospatial information to end users has raised questions about the cost of maintenance for NMCAs (Carpenter and Snell, 2013). However, as more data become available to users, it becomes more necessary to evaluate their quality in order to determine whether these data fit the users’ requirements. This may be an opportunity for an authoritative data supplier to play the role of data ‘validator’, providing standardized and useful quality reports about the data users want. Another possibility is the rise of quality certification for geospatial data, as pointed out by Ariza-López (2013).
Research question, hypothesis, and goals
Our main research question arises from the need for an on-line evaluation service: how far can we automate the evaluation of geospatial data quality over a web environment?
The current state of the art in automated quality control of spatial data includes the following recent findings:
It is possible to generate quality information about a spatial dataset using the WPS interface (Donaubauer et al., 2008), also confirmed by a later work (Mobasheri, 2013);
Studies inside the ESDIN project described semi-automatic data quality evaluation services (Beare et al., 2010);
Ruiz-Lendínez (2012) proposed and demonstrated a feasible solution for automation of the positional accuracy evaluation using a matching approach;
Ariza-López (2013) argued that spatial data quality has been the subject of recent and continuous development of international standards, notably ISO 19157:2013 (ISO, 2013);
Fan et al. (2014) demonstrated that it is possible to evaluate various quality elements (completeness, positional accuracy, thematic accuracy, and shape accuracy) for building footprints using a test dataset against a reference one, where the first step was the matching between datasets; and
Brovelli et al. (2017) presented a new procedure to perform comparisons between crowdsourced and authoritative road datasets with a significant degree of automation.
Taking these facts as working assumptions, we can formulate our working hypothesis H1: a fully automatic evaluation procedure that assesses a test dataset against a reference dataset over time, without any human intervention, is possible. We believe a data quality validation service will bring gains for data producers and data consumers. Data producers may benefit from a standardized dataset against which to evaluate their own contracted products. Data consumers may obtain a quality report for the spatial data using a standard protocol (WPS).
Considering this hypothesis, our main goal is to develop a web service able to evaluate the quality of geospatial datasets using the standardized interface of WPS in a fully automatic way.
Institutional relevance and publications
This study was developed in the research group GIIC (Grupo de Investigación en Ingeniería Cartográfica) at the Universidad de Jaén, Spain. This research group (TEP-164) has produced relevant studies in the geospatial data quality area, among which we can cite: Ariza López and Atkinson Gordo (2008), Ariza-López et al. (2011), Ariza-López and Mozas-Calvache (2012), Ariza-López and Rodríguez-Avi (2014), Ruiz-Lendínez et al. (2016) and Gil de la Vega et al. (2016).
The research project was supported by the Brazilian Army's Department of Science and Technology (DCT), which sponsored this project on behalf of the Geographic Service (DSG). DSG is in charge of geospatial information in the Brazilian Army. According to Brazilian law (Brasil, 1967), DSG is responsible for generating and maintaining the technical standards for the national land mapping.
This research project has generated the following publications to date: Xavier et al. (2014), Ariza-López et al. (2015), Xavier et al. (2015a, 2015b, 2015c), Xavier et al. (2016a, 2016b), Ariza-López et al. (2017) and Xavier et al. (2017).
This paper summarizes the whole study that can be found in Xavier (2017). The remainder of this document is structured as follows. The next section presents our proposal and the material used to test the approach. The following section briefly describes the experiments executed to validate our proposal, and discusses the obtained results. The last section brings some conclusions.
Method and material
In order to reach our main goal we propose a framework for automatic geospatial data quality evaluation. This framework comprises the architecture of a solution for quality assessment through web services. This solution is presented in the following section. The next section then presents the material used in the experiments to validate this approach.
Quality control service
We propose a three-tier architecture for a web services platform focused on quality control of geospatial data (see Figure 1), which we call the quality control service. From a bottom-up point of view, the first tier, Data Access, is used by external evaluation methods to manage reference data: retrieval and matching. The second tier, named Evaluation, implements the different quality evaluation procedures available in the service. The last tier, named WPS, handles client requests using the standardized interface of OGC WPS. This architecture was first discussed in Ariza-López et al. (2015).
The Data Access tier manages the relation between test and reference data. Since direct external evaluation procedures depend on reference data for comparison, this tier provides the correspondences between datasets so that they can be compared. Reference data can be supplied in two ways: (1) remote: the caller of the service provides the reference data; or (2) local: the service itself holds its own reference dataset. The Data Access tier manages access to local reference data and also provides a Matching module that matches assessed and reference data (local or remote). Depending on what the external method requests, this matching can be at the feature level or at the internal level, i.e., considering the vertices of a geometry.
Feature matching is a requirement of direct external evaluation methods in geospatial data quality assessment. In the proposed architecture the Matching module finds the correspondences between the two datasets (reference and test). These correspondences can be at the feature level (among objects) or at the internal level (among parts of objects, e.g. vertices). There is a plethora of approaches to matching at the feature level, as discussed in Xavier et al. (2016a), so we decided to investigate which ones would be adequate for our service. To this end we opened three working fronts: (1) developing similarity measures; (2) preparing a matching testbed; and (3) applying some matching methods over this testbed under the control of design of experiments. Regarding internal matching, few matching methods operate at this level, as can be seen in Xavier et al. (2016a). In this study, we propose a new method for matching geospatial data at the internal level based on the shape context descriptor of Belongie et al. (2002).
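To make the descriptor behind the internal matching more concrete, the following minimal sketch computes a generic shape context, i.e. a log-polar histogram of where the other vertices lie relative to one vertex, and a chi-square cost between two such histograms. It illustrates the general descriptor only and is not the internal matching method developed in this study. The function names and the bin settings (5 radial bins, 12 angular bins, radii between 1/8 and 2 times the mean distance) are assumptions chosen for illustration.

```python
import math


def shape_context(points, index, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of all other points relative to points[index].

    Assumes at least two distinct points. Bin settings are illustrative.
    """
    px, py = points[index]
    rel = [(x - px, y - py) for i, (x, y) in enumerate(points) if i != index]
    dists = [math.hypot(dx, dy) for dx, dy in rel]
    mean_d = sum(dists) / len(dists)  # normalising by it gives scale invariance
    hist = [[0] * n_theta for _ in range(n_r)]
    log_span = math.log(r_max / r_min)
    for (dx, dy), d in zip(rel, dists):
        r = d / mean_d
        if r == 0 or r > r_max:
            continue  # coincident or far-away points are ignored
        # radial bin on a log scale; radii below r_min fall into the first bin
        r_bin = int(n_r * math.log(max(r, r_min) / r_min) / log_span)
        r_bin = min(r_bin, n_r - 1)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = int(angle / (2 * math.pi) * n_theta) % n_theta
        hist[r_bin][t_bin] += 1
    return hist


def chi2_cost(h1, h2):
    """Chi-square distance between two normalised histograms (0 = identical)."""
    a = [v for row in h1 for v in row]
    b = [v for row in h2 for v in row]
    sa, sb = sum(a) or 1, sum(b) or 1
    cost = 0.0
    for x, y in zip(a, b):
        x, y = x / sa, y / sb
        if x + y > 0:
            cost += (x - y) ** 2 / (x + y)
    return 0.5 * cost
```

Vertices of a test polygon and a reference polygon whose histograms have a low cost are candidates for corresponding parts. Because the radii are normalised by the mean distance, the descriptor does not change when a geometry is translated or uniformly scaled.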
The Evaluation tier contains the implementations of the evaluation methods, notably direct internal and direct external. Direct external methods require an external reference, which is managed in the Data Access tier. The Evaluation tier also contains the Report module, which is responsible for generating the quality report in different forms: a human-readable report, or an XML report in ISO format, current (ISO, 2016) or legacy (ISO, 2007). This tier is the kernel of the architecture for quality assessment of geospatial data using web services. In this study we adopt the Brazilian standard for geospatial data quality, named CQDG (DCT, 2016). Since this standard provides quality evaluation procedures for all geospatial data products in Brazil, it plays the role of quality model in this research project. In this part of the architecture we developed the internal and external quality evaluation procedures described in the CQDG standard for vector geospatial datasets.
In the proposed architecture, the WPS tier is the point of contact with clients. This tier handles requests and responses using the WPS interface. Quality evaluation procedures often involve complex tasks and people from different organizations or departments. To deal with this situation we adopt two design principles: interoperability and simplicity. The interoperability principle indicates that the WPS tier should follow the WPS specification and schemas in order to allow a standardised way of communication. The simplicity principle leads us to avoid unnecessary complications in the processing itself, so the processing ‘part’ should be as straightforward as possible. The WPS tier should manage all communication issues, validation procedures, and client-server tasks.
The proposed architecture is intended to be general for automatic quality assessment, and should be applicable independently of datasets or software platform.
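As an illustration of how a client might address such a tier, the sketch below builds a WPS 1.0.0 Execute request in key-value-pair (GET) form. The endpoint URL, the process identifier and the input names are hypothetical and do not come from the DQEval implementation. The sketch also assumes that input values contain no ';' or '=' characters, which would need additional encoding.

```python
from urllib.parse import urlencode


def build_execute_url(endpoint, identifier, inputs):
    """Build a WPS 1.0.0 Execute URL using key-value-pair encoding.

    `inputs` maps input names to simple literal values; inputs are joined
    as name=value pairs separated by semicolons.
    """
    data_inputs = ";".join(f"{name}={value}" for name, value in inputs.items())
    query = urlencode({
        "service": "WPS",
        "version": "1.0.0",
        "request": "Execute",
        "identifier": identifier,
        "DataInputs": data_inputs,
    })
    return f"{endpoint}?{query}"


# Hypothetical endpoint, process name and inputs, for illustration only.
url = build_execute_url(
    "http://example.org/wps",
    "PositionalAccuracy",
    {"testData": "test_layer", "threshold": 10},
)
print(url)
```

Keeping the request construction this thin reflects the simplicity principle: the client only names a process and its inputs, while validation and communication details stay on the server side.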
Material
In this research project we use R as the statistical computing tool. R is a language and environment for statistical computing and graphics (R Core Team, 2014). Other relevant materials are the geospatial data used to test the quality control service and the software developed to implement the concepts proposed in this study.
We adopted geospatial datasets built from mapping data produced by official Spanish mapping agencies for Andalucía, southern Spain. This area was chosen because the Universidad de Jaén is located there, and because freely available data cover this area. We used 1:25,000 data from the Base Topográfica Nacional 1:25,000 (BTN25), the national mapping provided by the Instituto Geográfico Nacional of Spain (IGN, 2015). We used 1:10,000 data from the Base Cartográfica de Andalucía 1:10,000 (BCA10), the regional mapping provided by the Instituto de Estadística y Cartografía de Andalucía (ICEA, 2015). We selected different landscapes: coast and mountain, rural and urban. The following 1:25,000 map sheets were used to define the study area: 0896-3, 0896-4, 1003-4, 0999-1, 0999-2, 0999-3, and 0999-4.
All software developed in this research project is based on the TerraLib library. TerraLib is an open-source GIS library developed by the Brazilian National Institute for Space Research (INPE) (Câmara et al., 2008), available at the TerraLib repository (DPI, 2013). Inside TerraLib there is a subproject named TerraOGC, a framework for Web-GIS development that contains modules for many OGC specifications, such as WMS, WFS, WCS, and GML. For this research project the existing WPS module was improved in order to accommodate the design principles described for the WPS tier. A data quality processing module (DQEval), which contains most of the code related to this project, was created as part of the WPS process. It can be found on-line at its repository (DPI, 2017).
Results and discussion
This section presents the experiments executed in order to validate the proposed framework for geospatial data quality evaluation through web services. The trials are designed to assess the proposed framework using both real and synthetic data.
The first experiment deals with the creation of the feature matching testbed. This testbed is composed of four groups of datasets: (1) initial datasets: original mapping data; (2) morphology modified: synthetic datasets created with emphasis on some specific morphology class for lines or areas; (3) systematic disturbance: synthetic datasets created from affine transformations; and (4) random disturbance: synthetic datasets created under the influence of randomly generated displacement vector fields. We believe that this testbed is a valuable tool to be shared with other researchers in the GIScience area, so we have submitted it to a public repository of scientific data (Xavier et al., 2017). The datasets generated in this experiment were used in the following experiments.
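The two synthetic disturbance groups can be sketched as follows, under simplifying assumptions. The systematic case applies one affine transformation to every vertex. The random case is reduced here to independent Gaussian noise per vertex, which is simpler than a displacement vector field and is not the procedure used to build the testbed. The function names, parameters and values are hypothetical.

```python
import math
import random


def affine(points, a, b, c, d, tx, ty):
    """Apply x' = a*x + b*y + tx, y' = c*x + d*y + ty to every vertex."""
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


def rotate_and_shift(points, angle_rad, tx, ty):
    """Systematic disturbance: rotation about the origin plus translation."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    return affine(points, cos_a, -sin_a, sin_a, cos_a, tx, ty)


def random_disturbance(points, sigma, seed=None):
    """Random disturbance: independent Gaussian displacement per vertex.

    A simplification; a smooth displacement field would correlate the
    displacements of neighbouring vertices.
    """
    rng = random.Random(seed)
    return [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in points]


building = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0), (0.0, 5.0)]
shifted = rotate_and_shift(building, math.radians(1.0), 2.0, -1.0)
noisy = random_disturbance(building, sigma=0.5, seed=42)
```

Because the disturbance parameters are known, the true correspondence between each original and disturbed feature is known as well, which is what allows a testbed of this kind to score matching methods.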
The second experiment used the concepts of design of experiments (DOE) (Montgomery and Runger, 2003) to compare a set of feature matching methods over the matching testbed developed in the previous experiment. This experiment is divided according to the geometric primitive: point, line, and area, in this order. Each type of geometry has its own trials, or configurations. The designed experiment for feature matching is composed of 20 trials: points (P1-P5), lines (L1-L6), and areas (A1-A9). Based on the results of these trials, it was possible to select the matching methods with the most suitable results for our quality control service. Figure 2 shows an overview of this DOE with the factors considered for each geometry and the respective number of treatments.
After testing the influence of the factors on the matching procedures, we were able to draw some recommendations from the results of the DOE for feature matching. For data at scales close to 1:25,000 and 1:10,000, where it is fair to suppose that there is no significant positional difference between the datasets, we can recommend the following geometric matching methods:
Point matching: Euclidean distance, closer criterion, 10 m threshold;
Line matching: SMHD measure (Tong et al., 2014), closer criterion, 10 m threshold, combined with partial orientation 0.4 rad; and
Area matching: overlap area measure, closer criterion, and 10 m² threshold.
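The point matching recommendation can be illustrated with a minimal sketch. It assumes that the ‘closer criterion’ means choosing the nearest reference candidate, and it uses the 10 m threshold given above. The function name and the sample coordinates are hypothetical, and the brute-force search is for clarity only; a spatial index would be preferable for real datasets.

```python
import math


def match_points(test_pts, ref_pts, threshold=10.0):
    """Pair each test point with its nearest reference point.

    Returns {test_index: ref_index}. A pair is kept only when the
    Euclidean distance does not exceed `threshold` (in map units,
    assumed to be metres). Several test points may share one reference
    point; resolving such cases is left out of this sketch.
    """
    matches = {}
    for i, (tx, ty) in enumerate(test_pts):
        best_j, best_d = None, math.inf
        for j, (rx, ry) in enumerate(ref_pts):
            d = math.hypot(tx - rx, ty - ry)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d <= threshold:
            matches[i] = best_j
    return matches


test_pts = [(0.0, 0.0), (100.0, 100.0), (5.0, 5.0)]
ref_pts = [(3.0, 4.0), (50.0, 50.0), (6.0, 5.0)]
pairs = match_points(test_pts, ref_pts)
```

In this example the second test point has no reference point within 10 m and therefore stays unmatched.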
The third experiment focused on testing the new internal matching method developed in this research project. In this study we adopt the quality model of the Brazilian standard (DCT, 2016), which describes a positional quality procedure based on points. Therefore, we developed this internal matching method in order to increase the number of points available for the positional quality assessment, since it allows area and line features to be used in the quality process. The results indicated that the current implementation of the new internal matching method has two notable gains when compared with other equivalent methods (Fan et al., 2014; Ruiz-Lendínez et al., 2016): (1) it is able to deal with many-to-many area pairs and find their corresponding parts; and (2) it can work with polygon holes, which can increase the number of corresponding parts. However, the results for line parts did not reach a performance acceptable for quality assessment.
In the fourth experiment we tested the validity of the quality evaluation procedures developed in the Evaluation tier of the quality control service using the datasets generated in the first experiment. The experiments were divided according to the quality element under consideration: topological consistency, completeness, and positional quality. The results revealed that the topological consistency procedures worked as specified in the standard, in a fully automatic way. Regarding the completeness element, we verified the performance of automatic completeness for all geometric primitives (point, line, area). The results revealed that the automatic implementation worked satisfactorily. However, we identified that the performance of the selected matching method influences the performance of the automatic quality evaluation. In the positional accuracy trial we verified the performance of the automatic implementation of the planimetry procedure in 11 regions: nine point regions and two area regions. The results revealed that the automatic positional accuracy procedure performed similarly to the manual procedure, with the quality category preserved in all considered regions.
Finally, in the last experiment we checked whether the WPS tier is capable of playing the role of an interoperability layer between clients and the automatic quality evaluation procedures. In this phase we also checked the quality reports that can be generated. The results highlighted some aspects of the applicability of WPS as a service interface for quality evaluation. We can point out: (1) WPS permits multiple inputs and outputs; (2) WPS is ready for service chaining; and (3) process extension is relatively easy. We can therefore conclude that the WPS interface is a feasible platform on which to implement the quality control service.
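The dependence of completeness on matching can be made explicit with a minimal sketch. It assumes that matching yields a mapping from test feature indices to reference feature indices, and that commission and omission are expressed as the number of excess or missing items relative to the number of reference items. The procedure in the CQDG standard may define these rates differently, so the function and its output keys are illustrative only.

```python
def completeness(n_test, n_ref, matches):
    """Derive commission and omission from a test-to-reference matching.

    `matches` maps test feature indices to reference feature indices.
    Unmatched test features count as excess (commission); unmatched
    reference features count as missing (omission). Rates are relative
    to the number of reference features (an assumption of this sketch).
    """
    if n_ref <= 0:
        raise ValueError("the reference dataset must not be empty")
    excess = n_test - len(set(matches))
    missing = n_ref - len(set(matches.values()))
    return {
        "excess": excess,
        "missing": missing,
        "commission_rate": excess / n_ref,
        "omission_rate": missing / n_ref,
    }


# Three test features, four reference features, two matched pairs.
report = completeness(n_test=3, n_ref=4, matches={0: 0, 2: 2})
```

Any correspondence that the matching step misses appears here as both an excess and a missing item, which shows how matching errors propagate directly into the completeness result.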
Conclusions
The geomatics industry is experiencing a data overload that is raising new challenges for the authoritative data producers, or NMCAs. Today it is possible to find diverse datasets representing the same geographic extent from many producers: volunteers (e.g. OpenStreetMap), commercial mapping companies, and official mapping agencies (at distinct levels). Each one creates its datasets following its own acquisition rules (and sources), which leads us to the question: ‘which one fits my purposes?’, a fitness for use issue (Servigne et al., 2006).
In this context, this study seeks to provide a solution able to answer the key question: ‘what is the quality of this dataset?’ Where geospatial data are created almost automatically (e.g. Coumans, 2016), we also need a quality evaluation tool capable of responding at the speed at which the data are created. Therefore the main goal of this study is to develop a standardized web service with the capability to assess the quality of geospatial datasets in a fully automatic way. In order to reach this goal we developed a framework for automatic evaluation of geospatial data quality. We then tested each part of our solution for the quality control service in the experimental phase.
The results confirmed that the main goal was reached: we have a quality control service that automatically assesses the quality of geospatial datasets with results comparable to the manual procedure. These results corroborated our main hypothesis: a fully automatic procedure, running over a web environment, is able to assess the quality of geospatial datasets without any human intervention.
The contributions of this study are manifold. We presented fully automatic procedures to evaluate topological consistency, commission, omission, and positional accuracy. To the best of the authors’ knowledge, this is the first implementation of the last three quality elements over a WPS. Regarding geospatial data matching, we presented a design of experiments to test matching methods at the feature level, as comprehensive as possible, and we developed a new method at the internal level for areal features. The matching testbed was released to the research community as an effort to provide a homogeneous framework to test new methods.
The proposed solution has its limitations: the capacity of automatic quality assessment for external methods (those that require an external dataset) is directly related to the performance of the matching methods (at all levels). Consequently, the performance of the automatic quality control service depends on the performance of the matching methods used.
In a classic cartography textbook, Robinson et al. (1995) argued that ‘one of the most difficult tasks for cartographers is to indicate to map readers the quality of data used’. In the web era, we hope that cartographers may delegate this ‘painful’ task to machines.