INTRODUCTION
The growing need for meteorological data has focused demand on increasing its availability in time and space (WMO, 2011). This contrasts with the fact that there are fewer gauge stations, with shorter records or lower quality, both worldwide (Shelton, 2008) and in Argentina and the Cuyo region (Boninsegna & Villalba, 2014; Santos, 2014; Lui, 2019). While the optimal scenario is to base studies on instrumental time series, limited spatial coverage and the lack of records of acceptable length (≥ 30 years, according to the World Meteorological Organization; WMO, 2017) make the use of alternative data sources increasingly frequent.
Integrated Climate Databases, or ICB (NOAA, 2019), are meteorological datasets arranged on a grid of given X and Y dimensions, generally with long temporal lengths. For each pixel of the predetermined dimensions, the information (Z-axis) is stored as a time series. Some, like UDEL (University of Delaware), are a compendium of data from a large number of stations, based on the Global Historical Climatology Network (GHCN), resulting in a monthly climatology of more than 100 years of temporal length and with predominantly terrestrial coverage. ICB interpolations are carried out using digital elevation models, assisted interpolation, traditional interpolations, and climatologically assisted interpolations (Matsuura & Willmott, 2018). One of their main advantages is that the coverage extends to areas with no measurement stations, so that in many cases they represent the only information source in these areas. Since interpolations carry a margin of error with respect to the original nature of a given event, choosing the one with the best fit is essential to increase the level of certainty about it.
Several studies (Takido et al., 2016; Wong et al., 2017; Nashwan et al., 2019; Qiu et al., 2019) have aimed to identify the accuracy of these databases and then apply them as inputs for climate analysis, including regions of central-eastern Argentina (Rusticucci et al., 2014; Casado & Picone, 2018). The measures used to assess the fit of the ICB are mainly based on the magnitudes of error (between these and the instrumental data) and on structure or correlation indicators, such as Pearson’s r coefficient (Rusticucci et al., 2014; Ferrelli et al., 2016). It should be noted that the study area contains large regions without meteorological measurements (Fig. 1), which makes it necessary to rely on interpolations from neighboring stations or on ICB, with the usual uncertainty that accompanies them. In this sense, the aim was to evaluate the similarity of three gridded databases of monthly temperature with respect to ten meteorological stations located in central-western Argentina. An unconventional similarity index, frequently used in medical settings (Zhao et al., 2019), was used to assess the statistical fit.
METHODOLOGY
Data used
As data input, the ICB of the Climatic Research Unit - CRU v. 4.04 (UIACRU et al., 2020), the University of Delaware - UDEL v. 5.01 (Matsuura & Willmott, 2018), and the Global Historical Climatology Network - GHCN v.2 (NCARS, 2019) were freely downloaded from the data repositories of the Climatic Research Unit - CRU (University of East Anglia, 2019), the Earth System Research Laboratory - ESRL (NOAA, 2019), and the National Center for Atmospheric Research Staff (NCARS, 2019), respectively. These databases have a spatial resolution of 0.5º x 0.5º and a temporal length (selected from the total) of 22 years (1993 - 2014); the period was chosen according to the weather data available in the region. The designed domain extends from 30º S to 35º S and from 71º W to 66º W. The construction of these ICB is based on data from meteorological stations (Fig. 1, Tab. 1).
Stations used by (ICB: CRU, UDEL, GHCN) / for (applied method, validation):
Station | Agency | Code | CRU | UDEL | GHCN | Applied method | Validation |
San Juan Km. 47,3 | SMN | 1208 | X | ||||
San Juan Km. 101 | SMN | 1211 | X | ||||
Uspallata | SMN | 1491 | X | X | |||
Tupungato Pta. de Vacas | SMN | 1420 | X | ||||
Tunuyán Valle de Uco | SMN | 1419 | X | ||||
Mendoza Guido | SMN | 1413 | X | ||||
San Juan Aero | SIPHN | 87311 | X | X | X | X | |
Mendoza Aero | SIPHN | 87418 | X | X | X | X | |
San Martín (Mza) | SIPHN | 87416 | X | X | X | ||
San Luis Aero | SIPHN | 87436 | X | X | X | X | |
San Rafael Aero | SIPHN | 87509 | X | X | X | X | |
Ñacuñán | IADIZA | - | X | ||||
San Juan EEA INTA | INTA | NH0445 | X |
Source: own elaboration.
Study area
The study area is located in the central-western part of Argentina, specifically in the Cuyo region, near the border with Chile (Fig. 1). The gridded zone covers 3.5º of latitude and 3.5º of longitude. With each pixel or cell having dimensions of 0.5º x 0.5º, or 3,098.01 km², the total area of the gridded zone is 151,802.5 km², distributed among the provinces of Mendoza, San Juan, San Luis, and La Rioja. The 13 meteorological stations (Fig. 1, Tab. 1) distributed within the area belong to national agencies: the National Meteorological Service (SMN, 2019), the National Institute of Agricultural Technology (INTA, 2019), the National Secretariat of Infrastructure and Water Policy (SIPHN, 2019), and the Argentine Institute for Arid Zone Research (IADIZA, 2019).
Each database is stored in its respective web portal in NetCDF (Network Common Data Form) format, where each pixel, besides its X and Y dimensions, has temporal information (Z dimension) at a monthly scale (264 months). Regarding Table 1, it should be noted that only the stations managed by the SIPHN were used as a base for the construction of the gridded products, except for the Uspallata station (only used by CRU).
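The area figures above can be checked with a few lines of arithmetic. The sketch below assumes that the gridded zone is a square of 3.5º per side divided into 0.5º cells, and it takes the per-cell area quoted in the text as given; the variable names are illustrative.

```python
# Check of the grid arithmetic quoted in the text.
side_deg = 3.5            # extent of the gridded zone per side (degrees)
cell_deg = 0.5            # cell size (degrees)
cell_area_km2 = 3098.01   # area of one cell, as quoted in the text

cells_per_side = round(side_deg / cell_deg)    # cells along each side
n_cells = cells_per_side ** 2                  # total number of cells
total_area_km2 = n_cells * cell_area_km2       # total gridded area

print(n_cells, round(total_area_km2, 1))
```

Under these assumptions the zone holds 7 x 7 cells, which is consistent with the total area reported.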
Extraction of data from ICB
By overlaying the weather stations on the NetCDF databases in a map (all with a common period, 1993 - 2014), each station was associated with the pixel in which it is located. By extracting the time series of each associated pixel, gridded time series (Y) were obtained whose months are in the same chronological position as the instrumental ones (X).
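The extraction step can be illustrated with a minimal sketch. It is not the workflow used in the study: a real NetCDF file would be read with a dedicated library, whereas here the grid is a plain nested list `cube[t][row][col]`. The south-west origin of the grid (35º S, 71º W) and the function names are assumptions made for illustration; the station coordinates are those reported for Ñacuñán in the validation section.

```python
import math

def cell_index(lat, lon, south, west, cell=0.5):
    """Return (row, col) of the cell containing (lat, lon).

    Rows count northward from `south`, columns eastward from `west`.
    """
    row = math.floor((lat - south) / cell)
    col = math.floor((lon - west) / cell)
    return row, col

def pixel_series(cube, row, col):
    """Extract the time series of one cell from cube[t][row][col]."""
    return [layer[row][col] for layer in cube]

# Assumed grid origin (south-west corner), in decimal degrees.
SOUTH, WEST = -35.0, -71.0

# Ñacuñán: 34º02'42'' S, 67º56'06'' W, converted to decimal degrees.
lat = -(34 + 2 / 60 + 42 / 3600)
lon = -(67 + 56 / 60 + 6 / 3600)
row, col = cell_index(lat, lon, SOUTH, WEST)

# Tiny made-up cube with two time steps on a 2 x 2 grid.
demo_cube = [[[1.0, 2.0], [3.0, 4.0]],
             [[5.0, 6.0], [7.0, 8.0]]]
demo_series = pixel_series(demo_cube, 1, 0)
```

The extracted list keeps the chronological order of the layers, so it can be paired month by month with the instrumental series.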
Application of similarity indices
The contrast between each gridded time series (CRU, UDEL, and GHCN) and the instrumental data was made by applying the Modified Structural Similarity Index [mSSIM] (Wang & Bovik, 2002; Mo et al., 2014). mSSIM uses the mean, variance, and structure (Eq. 1) to evaluate how similar or dissimilar the contrasted time series are (concordance). The similarity ranges between 0 and 1, being 0 for completely dissimilar series and 1 for identical series (Fig. 2).
The interpretation of mSSIM is based on comparing the similarity between two data vectors, X and Y. Since their closeness or distance depends on the product of the measures mentioned above, the similarity is stronger the closer the index is to 1 (with subsequent verification of statistical significance). Since the similarity is a function of the magnitude of the vectors tested, the use of the index is not recommended for variables at different scales of measurement (e.g. temperature and precipitation) (Wang & Bovik, 2002).
Whereas several authors (e.g. Rusticucci et al., 2014; Bustos et al., 2016; Rivera et al., 2018) used Pearson’s correlation coefficient (r) as an indicator of fit between grids and instrumental series, the mSSIM index is presented here as a robust measure to evaluate similarity by structure and closeness between two data vectors X and Y (Mo et al., 2013; Mo et al., 2014). This index, which is frequently used in medical imaging comparison studies (Zhao et al., 2019), also has advantages over other similarity measures, including Pearson’s r coefficient (Mo et al., 2013). Therefore, both methods (mSSIM and r) were applied for comparison purposes; their equations (1 and 2, respectively) are detailed below.
Where: x and y represent the contrasted data vectors.
The justification for using mSSIM (Eq. 1), as opposed to the Pearson correlation coefficient r (Eq. 2), lies in the fact that the latter only exposes the structure or trend relationships between the contrasted data series. It is insensitive to proportional differences in the measurements and is therefore considered an inaccurate measure of similarity (Mo et al., 2013). It should be noted that Pearson’s r is contained in the mSSIM equation (as the structure parameter).
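The contrast between the two measures can be sketched numerically. The code below does not reproduce the mSSIM of Mo et al. (2014), whose exact form is not given in this text; as a stand-in it uses the three-factor index of Wang & Bovik (2002), i.e. the product of a mean term, a variance term, and Pearson's r, which ranges from -1 to 1. The function names and the sample values are illustrative assumptions.

```python
from statistics import fmean, pstdev

def pearson_r(x, y):
    """Pearson correlation coefficient (population form)."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def quality_index(x, y):
    """Wang & Bovik (2002) index: mean term * variance term * r.

    Used here only as a stand-in for mSSIM (assumption).
    Undefined when a series is constant or both means are zero.
    """
    mx, my = fmean(x), fmean(y)
    sx, sy = pstdev(x), pstdev(y)
    mean_term = 2 * mx * my / (mx ** 2 + my ** 2)
    var_term = 2 * sx * sy / (sx ** 2 + sy ** 2)
    return mean_term * var_term * pearson_r(x, y)

# Made-up series: the "gridded" one has the same shape, shifted by 10 ºC.
observed = [10.0, 15.0, 20.0, 25.0]
gridded = [v + 10.0 for v in observed]

r_value = pearson_r(observed, gridded)        # stays at 1: same structure
q_value = quality_index(observed, gridded)    # drops below 1: means differ
```

With a constant offset the correlation remains at 1 while the three-factor index falls, which is the behavior the paragraph above attributes to mSSIM.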
Determination of ICB with higher adjustment
Subsequently, with ten mSSIM values per database (one for each weather station in Tab. 1 versus its associated gridded series from CRU, UDEL, and GHCN), 30 results were obtained. With the databases ranked in terms of accuracy (i.e., 1st, 2nd, and 3rd place), the objective was to find the one with the best position (highest mSSIM) relative to the others.
Although the previous step determines which ICB has the greatest similarity with the measured data, gridded values are very unlikely to reach a similarity of 1, so the remaining differences can be adjusted to narrow the gap between the two series.
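The ranking step can be sketched as follows. The station names and mSSIM values in the dictionary are made up for illustration and are not results from the study; the helper name `rank_positions` is likewise hypothetical.

```python
# Illustrative (made-up) mSSIM values per station; not results from the study.
scores = {
    "station_a": {"CRU": 0.93, "UDEL": 0.95, "GHCN": 0.92},
    "station_b": {"CRU": 0.94, "UDEL": 0.93, "GHCN": 0.91},
    "station_c": {"CRU": 0.90, "UDEL": 0.96, "GHCN": 0.92},
}

def rank_positions(scores):
    """Collect the place (1st, 2nd, 3rd) of each database at each station."""
    tally = {}
    for per_db in scores.values():
        ordered = sorted(per_db, key=per_db.get, reverse=True)
        for place, db in enumerate(ordered, start=1):
            tally.setdefault(db, []).append(place)
    return tally

tally = rank_positions(scores)
mean_rank = {db: sum(places) / len(places) for db, places in tally.items()}
best = min(mean_rank, key=mean_rank.get)   # database with the best average place
```

Averaging the places is one possible way to summarize the positions; averaging the mSSIM values themselves per database is an equally simple alternative.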
Adjustment proposal
Once the grid that had presented the greatest similarity with the instrumental data was chosen, the aforementioned adjustment was carried out, which was executed by:
Where: C_i: ICB value calculated for month i; M_i: gridded value for month i; I_i: instrumental record for month i; X_K: average difference between I_i and M_i for every month K.
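Since the adjustment equations are not reproduced in this text, the sketch below rests on an assumed form: X_K is taken as the mean of (I_i - M_i) over all records falling in calendar month K, and the adjusted value as C_i = M_i + X_K. Both the sign convention and the function names are assumptions; the sample values are made up.

```python
def monthly_offsets(instrumental, gridded, months):
    """X_K: mean of (I_i - M_i) over all records in calendar month K."""
    diffs = {}
    for i_val, m_val, k in zip(instrumental, gridded, months):
        diffs.setdefault(k, []).append(i_val - m_val)
    return {k: sum(d) / len(d) for k, d in diffs.items()}

def adjust(gridded, months, offsets):
    """C_i = M_i + X_K (assumed form of the adjustment)."""
    return [m_val + offsets[k] for m_val, k in zip(gridded, months)]

# Two years of made-up January (1) and July (7) values, in ºC.
months = [1, 7, 1, 7]
inst = [24.0, 8.0, 26.0, 6.0]    # I_i
grid = [21.0, 6.0, 22.0, 5.0]    # M_i

offsets = monthly_offsets(inst, grid, months)   # X_K per calendar month
adjusted = adjust(grid, months, offsets)        # C_i
```

Under this form the adjusted series matches the instrumental one on average for each calendar month, while year-to-year departures remain.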
Prediction model for cells without information
To find a prediction model based on the adjusted or modified data (Eq. 5), the equation of the best-fit line between the gridded and modified data series for each station was calculated. The contrasting time series of each grid point were subjected to a cluster analysis, in order to define the different groupings in space. These clusters were considered as the influence area of the considered weather stations.
Where: y: modified grid value; a and b: intercept and slope parameters, respectively; x: original grid value.
In this sense, the equation of the line found for each weather station (gridded versus modified data) was applied to the pixels located within its influence area (grouped by cluster analysis). In the linear regression equation of type y = a + bx, x represents each month of the time series of each pixel, from which the values of y (corrected by the regression equation) are obtained. Thus, pixels that had no weather station measurements to be contrasted with could also be adjusted.
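The regression step can be sketched with an ordinary least-squares fit of y = a + bx, followed by its application to a neighboring cell of the same cluster. The function names and all values are illustrative assumptions; the made-up "modified" series lies exactly on y = 3 + 0.9x so that the fit is easy to verify.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b of y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def apply_line(a, b, series):
    """Apply y = a + b*x to every monthly value of a cell."""
    return [a + b * v for v in series]

# Station cell: original grid values (x) and modified values (y), made up.
original = [5.0, 10.0, 15.0, 20.0]
modified = [7.5, 12.0, 16.5, 21.0]
a, b = fit_line(original, modified)

# Cell without a station, assumed to belong to the same cluster.
neighbour = [4.0, 12.0]
corrected = apply_line(a, b, neighbour)
```

The same pair (a, b) is reused for every cell in the station's influence area, which is what allows cells without measurements to be corrected.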
Accuracy and validation
Finally, arithmetic averages were calculated for the instrumental, gridded, and regression-calculated time series (only for those pixels that contain weather stations). As each pixel then had a single value, three layers were generated in raster format. Using map algebra, the raster of instrumental values was subtracted from the chosen grid and then from the calculated or modified one. In this way, the level of precision reached by the modified grid was determined.
To evaluate the level of adjustment of pixels without data from meteorological stations, the following stations were used: Ñacuñán (IADIZA, 2019), Uspallata (SIPHN, 2019), and San Juan km 101 - code 1211 (SMN, 2019), located at 34º02’42’’ S, 67º56’06’’ W; 32º35’42’’ S, 69º20’24’’ W; and 31º15’11’’ S, 69º10’37’’ W, respectively (Fig. 1, Tab. 1). Of these stations, only Uspallata was used by CRU as a base for the construction of its grid.
The periods chosen to evaluate the level of pixel adjustment correspond to the lengths of continuous records available from these stations (Tab. 2); that is, 21.4 years for Uspallata (08/1993 - 12/2014), five years for San Juan km 101 (01/2010 - 12/2014), and six years for Ñacuñán (01/2009 - 12/2014).
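The averaging and map-algebra step can be sketched with nested lists standing in for raster layers. The tiny 1 x 2 grids and their values are made up, and the helper names are illustrative.

```python
def cell_means(cube):
    """cube[t][row][col] -> 2-D list with the temporal mean of each cell."""
    n_t = len(cube)
    rows, cols = len(cube[0]), len(cube[0][0])
    return [[sum(cube[t][r][c] for t in range(n_t)) / n_t
             for c in range(cols)]
            for r in range(rows)]

def subtract(a, b):
    """Cell-by-cell difference a - b between two rasters of equal shape."""
    return [[va - vb for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]

# Two time steps on a 1 x 2 grid (made-up values, ºC).
instrumental_cube = [[[10.0, 20.0]], [[12.0, 22.0]]]
gridded_cube = [[[8.0, 19.0]], [[10.0, 21.0]]]

inst_mean = cell_means(instrumental_cube)
grid_mean = cell_means(gridded_cube)
difference = subtract(grid_mean, inst_mean)   # chosen grid minus instrumental
```

Negative cells in `difference` mark places where the grid is colder than the measurements, positive cells where it is warmer.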
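The record lengths quoted above follow from counting the months in each period, both end months included. The short check below uses only the periods stated in the text; the function name is illustrative.

```python
def n_months(start, end):
    """Number of months from start to end, both given as (month, year)."""
    (m0, y0), (m1, y1) = start, end
    return (y1 - y0) * 12 + (m1 - m0) + 1

# Validation periods as stated in the text (MM/YYYY).
periods = {
    "Uspallata": ((8, 1993), (12, 2014)),
    "San Juan km 101": ((1, 2010), (12, 2014)),
    "Ñacuñán": ((1, 2009), (12, 2014)),
}

months = {name: n_months(*span) for name, span in periods.items()}
years = {name: round(m / 12, 1) for name, m in months.items()}
```

Uspallata's 257 months correspond to the 21.4 years reported, and the other two stations to 60 and 72 months.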
RESULTS
The mSSIM was applied for the ten gauge stations (Tab. 1) against the three ICB, resulting in 30 values, whose averages were ranked according to the results for each station. Ordered data indicated the greatest similarity as follows (Tab. 2): UDEL with 0.9491, CRU with 0.9314, and GHCN with 0.9211 (all with significant values by Student’s t-test, with α=0.05). Although the differences between these values were less than 3 %, UDEL was chosen to continue with the adjustment methodology.
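The statement that the differences are below 3 % can be verified from the three averages quoted above. The check uses only those numbers; expressing the spread relative to the best value is one possible reading of the percentage, stated here as an assumption.

```python
# Average mSSIM per database, as quoted in the text.
mssim = {"UDEL": 0.9491, "CRU": 0.9314, "GHCN": 0.9211}

best = max(mssim, key=mssim.get)
spread = max(mssim.values()) - min(mssim.values())   # absolute spread
spread_pct = 100 * spread / mssim[best]              # relative to the best value

print(best, round(spread, 4), round(spread_pct, 2))
```

Both the absolute spread (about 0.028) and the relative one stay under the 3 % mentioned.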
Since the ICB rely on almost the same base weather stations, the similarity values were close to each other; nevertheless, although the differences between mSSIM values were less than 3 %, UDEL presented the highest similarity. The differences in Pearson’s r, in turn, were less than 0.2 %, with UDEL and GHCN sharing the highest value, so that r did not identify the ICB with the best fit (Fig. 2).
As shown in Figure 2, Pearson’s r does not vary even though in Figure 2B there are differences in magnitude of up to 10 ºC. Meanwhile, mSSIM is presented as a more sensitive and robust index, incorporating structural, magnitude, and variation measures. Although both data series (instrumental and gridded) are similar in structure, they are not the same, and the Pearson r coefficient does not expose these differences. Therefore, if both time series follow the same pattern but differ in magnitude, Pearson’s r will not detect it, unlike mSSIM, which makes it explicit (Fig. 2).
After applying the fitting procedure (Eq. 3 and 4), the differences between UDEL and the contrasted instrumental series decreased, thus yielding new modified UDEL values (Fig. 3). Subsequently, the linear regression model found between X (UDEL) and Y (modified UDEL) allows the temperature to be estimated for those UDEL cells that have no associated measurements from meteorological stations.
Although it is observed that the regression model overlaps with the modified UDEL curve (Fig. 3), the aim is to apply it to cells where no weather stations exist. Therefore, a cluster analysis (Fig. 4) was applied to the UDEL cells to find the respective clusters and to apply the regression models to the regions found previously.
Once the time series were averaged for each cell, they were subjected to cluster analysis (Ward method with Euclidean distance), resulting in five clusters (optimum defined by the Elbow and Silhouette methods). However, since cluster I is divided into two sections (Fig. 4A) due to spatial discontinuity, its northern part was defined as a sixth cluster (Fig. 4B).
One way to compare the cluster formation above (based on UDEL data) with the relief, and the relationship between them, is to observe the natural conditions in the region. Figure 5 presents the mean UDEL (without adjustment) temperature contours (ºC), the altitudes (SRTM), and the location of the temperature stations.
Source of data used: Farr et al. (2007); Matsuura & Willmott (2018).
Source of map: own elaboration.
Comparing Figures 4 and 5, the spatial distribution of the clusters is coherent with the spatial configuration of temperature and relief. The lowest temperatures are found in the west, in accordance with the highest altitudes of the Andes Mountains.
After the clusters were defined, the equations generated by the linear regression for each weather station were applied. Thus, for example, the equation of the Mendoza Guido station was applied to the six cells of the corresponding region IV. Therefore, in the equation of the form y = a + bx, y represents the corrected monthly value and x the original UDEL data.
Since there are cases such as clusters I and V (Fig. 4) that contain more than one weather station, the linear regression model of each station was applied to the whole cluster, and the adjusted results were then averaged over the number of instrumental stations. Since each cell of the original and adjusted UDEL grids holds three-dimensional information (Z as temporal axis), the time series of each cell of these two grids were averaged to form two raster images (Fig. 6).
Although both images (Fig. 6) share the same measurement scale, it is evident that the adjusted UDEL grid (Fig. 6B) shows higher values than the unmodified one (Fig. 6A). However, since only measured data can verify the reliability of the estimates, the information from the weather stations was rasterized (one average per station) with the same cell size (0.5º x 0.5º). Using map algebra, these cells were subtracted from their averaged counterparts of unmodified and modified UDEL, to evaluate where the largest differences are found (Fig. 7).
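For clusters with more than one station, the averaging described above can be sketched as follows. The regression parameters and the cell values are made up, and the function name is illustrative.

```python
def corrected_multi(series, lines):
    """Apply each station's line y = a + b*x and average the results.

    `lines` is a list of (a, b) pairs, one per station in the cluster.
    """
    per_station = [[a + b * v for v in series] for a, b in lines]
    return [sum(vals) / len(vals) for vals in zip(*per_station)]

# Made-up cell series and two made-up station regressions in one cluster.
cell = [10.0, 20.0]
station_lines = [(1.0, 1.0), (3.0, 1.0)]

averaged = corrected_multi(cell, station_lines)
```

Each monthly value of the cell thus receives the mean of the corrections proposed by the stations of its cluster.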
In the figure above, it is observed how the largest differences (over 2 ºC and up to 10 ºC) correspond to the original UDEL grid (Fig. 7A), while after modification or adjustment of UDEL (Fig. 7B), the differences for the considered pixels are less than 1 ºC.
Finally, as part of the validation for cells without instrumental information, these cells were contrasted with the Uspallata, San Juan km 101, and Ñacuñán weather stations, which were not included in the previous analyses and were not used as a basis for the construction of UDEL (Tab. 1).
The time series of the selected stations were contrasted with the original and modified (by regression) UDEL time series of the cells in which they are located. That is, X is the observed series and Y the gridded series, with and without modification, over the same time period. The contrast and the mSSIM similarity values between both series can be seen in Figure 8 and Table 3, respectively.
Weather station | Time length and period (MM/YYYY) | I vs. UDEL unmodified | I vs. UDEL modified |
Uspallata | 21.4 years - (08/1993 - 12/2014) | 0.8304 | 0.9678 |
San Juan km 101 | 5 years - (01/2010 - 12/2014) | 0.9440 | 0.9834 |
Ñacuñán | 6 years - (01/2009 - 12/2014) | 0.9899 | 0.9871 |
Source: own elaboration.
The visual fit level between the curves of the three meteorological stations considered, and the UDEL curves (modified and unmodified) is shown in Figure 8. Although only the 2014 records were represented, Table 3 shows the mSSIM values according to the different lengths of the time series available at these stations.
As shown in Table 3, in two of the three cases considered (Uspallata and San Juan km 101), the modified UDEL was closer to the instrumental series than the unadjusted curve. This is most pronounced at the Uspallata station and is consistent with the lower general precision of UDEL in mountainous areas. The slight difference between the unmodified and modified values for Ñacuñán (0.0028) may be due to the application of the proposed modification in flat areas (such as the surroundings of Ñacuñán). This can also be seen in Figure 7A: towards the west of the gridded zone, there are cells with temperature differences greater than 7.5 ºC with respect to the instrumental values, whereas towards the central and eastern zone of the region the differences are lower (Ñacuñán station), presumably because it is a site with lower and more regular topography.
In Table 3, the numerical difference in similarity shows that the precision of the modified UDEL is greater than that of its original pair, except at the Ñacuñán station, where the original database shows a slightly higher value than the modified one. The largest difference, for Uspallata, stands out (for the reasons stated in the previous paragraph). This evaluation reaffirms the potential of UDEL for application to cells without measurements and the advantages of using mSSIM as a similarity index.
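The gains and the single loss discussed above can be read directly from Table 3. The check below uses only the values in that table; the variable names are illustrative.

```python
# (unmodified, modified) mSSIM values from Table 3.
table3 = {
    "Uspallata": (0.8304, 0.9678),
    "San Juan km 101": (0.9440, 0.9834),
    "Ñacuñán": (0.9899, 0.9871),
}

# Change in similarity after the modification (modified minus unmodified).
gain = {name: round(mod - unmod, 4) for name, (unmod, mod) in table3.items()}
improved = [name for name, g in gain.items() if g > 0]
```

Uspallata shows by far the largest gain, while Ñacuñán is the only station where the modified series scores slightly lower.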
DISCUSSIONS AND CONCLUSIONS
In selecting the best-fitting ICB, mSSIM shows advantages over Pearson’s r coefficient as a similarity index. This made it possible to establish that UDEL data had higher similarity than CRU and GHCN data. It should be noted that, although the differences were less than 3 % and the significance level was 5 %, UDEL was chosen because it was closer to the records than the other ICB.
Although several investigations (Rusticucci et al., 2014; Bustos et al., 2016; Casado & Picone, 2018; Rivera et al., 2018; Qiu et al., 2019) base ICB validation on Pearson’s r, the mSSIM index proved to be an indicator with greater precision in terms of similarity measurement. The mSSIM may be considered a modified correlation coefficient, with potential applications in meteorology and climatology (Mo et al., 2013).
In this way, mSSIM proved to be more effective in recognizing which ICB had higher similarity with instrumental data, unlike Pearson’s r, which did not clearly establish differences. Although one of the objectives of ICB is to provide a source of climate data alternative to the instrumental records, the differences with the latter may not be exposed by measures such as Pearson’s r. While such data sources may under- or over-estimate the measurements, they generally maintain the same structural pattern because they originate from instrumental data.
Once the UDEL database was chosen, the adjustment was applied through a proposed modification to improve the accuracy. By averaging the time series of each cell, and grouping them into six clusters, the regression equation found for their respective weather stations (modified UDEL versus original UDEL) was applied. In this way, cells without measurement stations could be corrected using the stations within the cluster where they were located as a reference.
It should be noted that, unlike the modification proposed here, some authors (Rusticucci et al., 2014; Ferrelli et al., 2016) sought to validate and modify reanalysis databases by applying linear regression equations (instrumental data versus unmodified model data). They applied the equations found on the basis of closeness indicators such as Pearson’s r. Thus, the regressions were applied without taking similarity into account, which ends up underestimating or overestimating the existing instrumental data (depending on the gaps between them and the model used).
Regarding the validations, they indicated differences of less than 1 ºC between the modified UDEL and the instrumental data, compared to differences of more than 2 ºC between the original UDEL and the records measured in the pixels located west and east of the region (mountainous area). For those cells without instrumental stations, the tests indicated an increase in the mSSIM values, bringing the modified UDEL curve closer to the measured data and therefore increasing its reliability.
It is important to emphasize that the original UDEL has greater inconsistencies in mountainous areas, where it is advisable to apply the proposed modification to reduce the differences in magnitude with respect to the measured data. However, since in regions with more regular topography the differences between UDEL and its modified pair are not large, depending on the site it is possible to use this ICB without applying modifications.
Although the ICB originate from instrumental data, it is advisable to subject them to a similarity contrast with those data. The aim is to be able to apply the ICB to sites far from weather stations, so it is necessary to modify it and thus increase its similarity with the measured data. In this way, cells without instrumental measurements (as in the center-east of the study area) could have temperature data series with greater reliability, thus reducing the uncertainty caused by spatial measurement gaps (such as the vast semi-desert and uninhabited plains of northeast Mendoza).
Finally, the role of mSSIM as a similarity index is remarkable, both to differentiate between the ICB considered and, later on, between the observed and modified UDEL values. Although it is advisable to use data from measurements, their lack in certain areas (such as the northeast of the studied region) makes it necessary to resort to alternative sources of information. In these cases, these databases represent a valuable source of climate information, as well as an alternative to the weaknesses of the conventional meteorological network.
Replication of the applied methodology is recommended for plains regions; it is less accurate for mountainous zones and coastlines (the latter due to the limitations of some ICB themselves). Future works could use validated ICB as an important data source for non-measured regions.