Daily streamflow simulation based on the improved machine learning method

Kan, Guangyuan; He, Xiaoyan; Ding, Liuqian; Li, Jiren; Hong, Yang; Ren, Minglei; Lei, Tianjie; Liang, Ke; Zuo, Depeng; Huang, Pengnian

doi:10.24850/j-tyca-2017-02-05

Tecnología y ciencias del agua

On-line version ISSN 2007-2422

Tecnol. cienc. agua vol.8 no.2 Jiutepec mar./abr. 2017

https://doi.org/10.24850/j-tyca-2017-02-05

Technical articles

Daily streamflow simulation based on the improved machine learning method

Simulación de caudales diarios mediante el método de aprendizaje automático mejorado

Guangyuan Kan¹²^*

Xiaoyan He¹

Liuqian Ding¹

Jiren Li¹

Yang Hong²³

Minglei Ren¹^*

Tianjie Lei¹^*

Ke Liang⁴

Depeng Zuo⁵

Pengnian Huang⁶

^¹ China Institute of Water Resources and Hydropower Research, China.

^² Tsinghua University, China.

^³ University of Oklahoma, USA.

^⁴ Hohai University, China.

^⁵ Beijing Normal University, China.

^⁶ Nanjing University of Information Sciences & Technology, China.

Abstract:

Daily streamflow simulation has usually been implemented with conceptual or distributed hydrological models. Nowadays, hydrological data are abundant and can be easily obtained from automatic measuring systems. Machine learning has therefore become an effective and popular tool that is well suited to the streamflow simulation task. In this paper, we propose an improved machine learning method, referred to as the PKEK model and based on the previously proposed NU-PEK model, for generating daily streamflow simulation results with better accuracy and stability. A comparison between the PKEK model and the NU-PEK model indicates that the improved model has better accuracy and stability and is promising for daily streamflow simulation tasks.

Keywords: Machine learning; daily streamflow simulation; hydrological model; flood forecasting; global optimization

Resumen:

La simulación de caudales diarios se ha implementado por lo general mediante modelos hidrológicos distribuidos o conceptuales. En la actualidad, los datos hidrológicos, que pueden obtenerse con facilidad de sistemas automáticos de medición, son más que suficientes. Por lo tanto, el aprendizaje automático (machine learning) se ha convertido en una herramienta eficaz y popular, muy adecuada para la tarea de simulación de caudales. En este trabajo se propone un método de aprendizaje automático mejorado denominado modelo PKEK, basado en el modelo NU-PEK, previamente propuesto para generar resultados de simulación de flujo diario más precisos y estables. Los resultados de la comparación entre el modelo PKEK y el modelo NU-PEK indican que el modelo mejorado ofrece mayor exactitud y estabilidad, y tiene un excelente potencial de aplicación en la simulación de caudales diarios.

Palabras clave: aprendizaje automático; simulación de caudales diarios; modelo hidrológico; inundación; pronósticos de inundación; optimización global

Introduction

With the development of automatic hydrological measuring technology, hydrological data have become increasingly abundant. An effective way to make full use of these large hydrological data sets is to adopt machine learning methods. The most popular machine learning methods in the field of hydrological simulation are the artificial neural network (ANN) and the K-nearest neighbor (KNN) method (Li et al., 2014; Chen et al., 2017; Dong et al., 2015; Kan et al., 2016a, 2016b, 2016c, 2016d, 2016e; Lei et al., 2016; Li et al., 2016; Zuo et al., 2016). In previous work, we proposed an effective and efficient machine learning based streamflow simulation model, the NU-PEK model, which couples the ANN and KNN methods. It has been successfully applied to event-based hourly streamflow simulation. However, when applied to daily streamflow tasks, its performance degrades significantly.

To overcome the poor performance of the NU-PEK model in daily streamflow simulation, we propose an improved machine learning based streamflow simulation model, named PKEK. The PKEK model is composed of a partial mutual information (PMI) based input variable selection (IVS) module, a K-means based input vector clustering module, an ensemble artificial neural network (ENN) based output estimation module, and a KNN based output error estimation module. The PKEK model and the previously proposed NU-PEK model were applied to the Chengcun catchment in China to compare their performance and stability. Simulation results indicate that the improved model has better accuracy and stability and is promising for daily streamflow simulation.

Watershed, hydrological and meteorological data utilized in this research

The daily streamflow simulation is carried out in the Chengcun catchment. The Chengcun catchment lies in the Qiantang River basin, Anhui province, China. It is located in the subtropical monsoon region and is a typical humid catchment. Rainfall mainly falls in the period from April to June. There are ten rainfall gauges located in this area. Observed daily rainfall, evaporation, and average discharge data from 1986 to 1994 were used as the calibration data, while data from 1995 to 1999 were used as the validation data. The watershed map and the hydrological and meteorological characteristics of the Chengcun catchment are shown in figure 1 and table 1.

Figure 1 Watershed map for the Chengcun catchment.

Table 1 The hydrological and meteorological characteristics of the Chengcun catchment.

Methodology

K-means clustering algorithm

K-means clustering is a well-known and widely used partitioning and clustering method (MATLAB, 2012; Grigorios & Aristidis, 2014; Kapageridis, 2015). It is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster. The K-means method is usually calibrated, or trained, by an iterative method that minimizes the sum of distances from each object to its cluster centroid over all clusters.
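The objective minimized by the iterative procedure can be written compactly. The form below is a sketch that assumes squared Euclidean distance (the text says only "sum of distances"); the symbols $x_i$ for an observation, $C_k$ for the k-th cluster, and $\mu_k$ for its centroid are notation introduced here for illustration.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

Each iteration reassigns every observation to its nearest centroid and then recomputes the centroids, so $J$ never increases from one iteration to the next.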

PEK model

The PEK model is a hybrid data-driven model (Kan et al., 2015a, 2015b) composed of an ensemble artificial neural network (ENN) and the K-nearest neighbor (KNN) algorithm. The PEK approximator serves as a general-purpose function approximator and can be applied to simulate the mapping relationship of a multi-input single-output (MISO) system. It was first proposed by Kan et al. (2015a, 2015b), and its detailed principle can be found in those papers.

PKEK model

The PKEK model is an improved version of the previously proposed PEK model. It is proposed for the purpose of improving the non-linear simulation capability of the PEK model. It combines the PMI-based input variable selection (IVS) module, the K-means based input vector clustering module, and the ENN and KNN simulation module. The PKEK model can be seen as a hybrid approximator composed of a K-means clustering module and multiple PEK modules. Its simulation and calibration method is as follows. The input variables are selected by the PMI-based separate IVS scheme to generate the selected input vectors. Each selected input vector is then fed into the K-means clustering algorithm to determine which cluster it belongs to, and the PEK module corresponding to that cluster is chosen to calculate the output. In the PEK module, the output is estimated by the ENN, and the output error is estimated by KNN regression. The final simulated output is the sum of the estimated output and the estimated output error. The calibration of the PKEK approximator is almost the same as that of the PEK approximator. The difference is that the K-means algorithm needs to be calibrated by an iterative method (Grigorios & Aristidis, 2014; Kapageridis, 2015).
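The routing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base estimator is a hypothetical callable standing in for the ENN, the KNN error estimate is assumed to be an unweighted mean over the nearest calibration samples, and the names `nearest_centroid`, `ResidualCorrectedModule`, and `pkek_predict` are invented for this example.

```python
import math


def nearest_centroid(x, centroids):
    """Return the index of the centroid closest to input vector x."""
    return min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))


class ResidualCorrectedModule:
    """Stand-in for one PEK module: base estimate plus a KNN estimate of its error."""

    def __init__(self, base, train_x, train_y, k):
        self.base = base          # hypothetical callable replacing the ENN
        self.train_x = train_x
        # Errors of the base estimator on the calibration samples.
        self.errors = [y - base(x) for x, y in zip(train_x, train_y)]
        self.k = k

    def predict(self, x):
        # Find the k calibration inputs closest to x.
        order = sorted(range(len(self.train_x)),
                       key=lambda i: math.dist(self.train_x[i], x))
        nearest = order[: self.k]
        # Assumed form of the KNN regression: unweighted mean of neighbor errors.
        err = sum(self.errors[i] for i in nearest) / len(nearest)
        # Final output = estimated output + estimated output error.
        return self.base(x) + err


def pkek_predict(x, centroids, modules):
    """Route x to the module of its cluster and return that module's output."""
    return modules[nearest_centroid(x, centroids)].predict(x)


# Toy data: two clusters, each with a deliberately biased base estimator.
centroids = [(0.0,), (10.0,)]
module_0 = ResidualCorrectedModule(lambda x: 2.0 * x[0],
                                   [(0.0,), (1.0,), (2.0,)], [1.0, 3.0, 5.0], k=2)
module_1 = ResidualCorrectedModule(lambda x: x[0],
                                   [(9.0,), (10.0,), (11.0,)], [7.0, 8.0, 9.0], k=2)
modules = [module_0, module_1]
print(pkek_predict((1.5,), centroids, modules))
print(pkek_predict((10.5,), centroids, modules))
```

In the toy data each base estimator has a constant bias, so the KNN term recovers it exactly; with real data the correction is only as good as the similarity between the query and its neighbors.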

Non-updating rainfall-runoff (RR) simulation of the PKEK model

The non-updating modeling approach of the PKEK model is similar to that of the previously proposed NU-PEK model. The difference is the addition of the K-means clustering algorithm. The modeling approach of the PKEK model is as follows:

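$$\mathbf{Q}_{SIM}^{C}(t) = \left[ Q_{SIM}(t-1),\ Q_{SIM}(t-2),\ \ldots,\ Q_{SIM}(t-n_Q) \right] \tag{1}$$

$$\mathbf{SWCR}^{C}(t) = \left[ SWCR_{1}(t),\ SWCR_{2}(t),\ \ldots,\ SWCR_{n_P}(t) \right] \tag{2}$$

$$\mathbf{X}(t) = \left[ IVS_{Q_{SIM}}\left(\mathbf{Q}_{SIM}^{C}(t)\right),\ IVS_{SWCR}\left(\mathbf{SWCR}^{C}(t)\right) \right] \tag{3}$$

$$Q_{EST}(t) = F_{EBPNN}\left(\mathbf{X}(t);\ F_{K\text{-}means}(\mathbf{X}(t))\right) \tag{4}$$

$$E_{EST}(t) = F_{KNN}\left(\mathbf{X}(t);\ F_{K\text{-}means}(\mathbf{X}(t))\right) \tag{5}$$

$$Q_{SIM}(t) = Q_{EST}(t) + E_{EST}(t) \tag{6}$$

where $\mathbf{Q}_{SIM}^{C}(t)$ denotes the candidate simulated antecedent discharge (SAD) input vector; $\mathbf{SWCR}^{C}(t)$ denotes the candidate sliding window cumulative rainfall (SWCR) input vector; $Q_{SIM}(t-i)$ denotes the SAD, $i = 1, 2, \ldots, n_Q$; $SWCR_{i}(t)$ denotes the SWCR with sliding window width $i$, $i = 1, 2, \ldots, n_P$; $n_Q$ and $n_P$ denote the order of the SAD and the SWCR, respectively; $IVS_{Q_{SIM}}$ denotes the PMI-based IVS for the candidate SAD input vector; $IVS_{SWCR}$ denotes the PMI-based IVS for the candidate SWCR input vector; $\mathbf{X}(t)$ denotes the selected input vector; $Q_{EST}(t)$ denotes the estimated output discharge associated with $\mathbf{X}(t)$; $E_{EST}(t)$ denotes the estimated output discharge error associated with $\mathbf{X}(t)$; $Q_{SIM}(t)$ denotes the simulated output discharge, which is the final prediction of the output discharge; $F_{EBPNN}$ denotes the EBPNN discharge estimation; $F_{KNN}$ denotes the KNN discharge error estimation; and $F_{K\text{-}means}$ denotes the K-means clustering algorithm, whose cluster label selects the PEK module used in equations (4) and (5). The non-updating simulation procedure of the PKEK model for each flood event is almost the same as that of the NU-PEK model. The difference is that, after the PMI-based separate IVS, the K-means clustering algorithm is executed to classify the input vector and feed it to the corresponding PEK module, which calculates the forecasted discharge value.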
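A short Python sketch may help make the candidate inputs and the non-updating loop concrete. Several details are assumptions rather than statements from the paper: the SWCR of width i is taken to be the rainfall total over the i days ending on the current day, the IVS step is omitted so every candidate is passed to the model, and `model` is a hypothetical callable standing in for the calibrated PKEK mapping.

```python
def swcr_candidates(rain, t, n_p):
    """Cumulative rainfall over windows of width 1..n_p ending on day t (assumed definition)."""
    return [sum(rain[max(0, t - i + 1): t + 1]) for i in range(1, n_p + 1)]


def sad_candidates(q_sim, t, n_q):
    """Previously simulated discharges for days t-1, ..., t-n_q (requires t >= n_q)."""
    return [q_sim[t - i] for i in range(1, n_q + 1)]


def simulate(rain, q_init, model, n_p, n_q):
    """Non-updating simulation: simulated, not observed, discharges are fed back as inputs."""
    if len(q_init) < n_q:
        raise ValueError("need at least n_q warm-up discharge values")
    q_sim = list(q_init)
    for t in range(len(q_init), len(rain)):
        x = sad_candidates(q_sim, t, n_q) + swcr_candidates(rain, t, n_p)
        q_sim.append(model(x))
    return q_sim


# Toy run with a hypothetical linear model: half of yesterday's simulated
# discharge plus one tenth of today's rainfall.
n_p, n_q = 2, 1
toy_model = lambda x: 0.5 * x[0] + 0.1 * x[n_q]
result = simulate([0.0, 10.0, 0.0, 0.0], [1.0], toy_model, n_p, n_q)
print(result)
```

Because observed discharge never enters the loop after the warm-up values, any simulation error propagates through the SAD inputs, which is what distinguishes non-updating simulation from updating forecasts.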

Calibration of the PKEK model

The calibration of the PKEK model is almost the same as that of the NU-PEK model. The difference is the additional calibration of the K-means clustering method. The K-means algorithm is calibrated by the iterative method, and this calibration is repeated 500 times with different initial parameters to reduce the risk of converging to a local minimum. The optimal number of classes is determined by the Silhouette value. In this paper, approximately 70% of the flood events are used to create the calibration data, and the remaining events are used to create the validation data.
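The repeated calibration can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the MATLAB routine used in the paper: initial centroids are drawn at random from the data, distances are squared Euclidean, and the function names `kmeans_once` and `kmeans_restarts` are invented here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of the iterative method from randomly chosen initial centroids."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return cost, centroids, labels


def kmeans_restarts(points, k, n_restarts=500, seed=0):
    """Repeat the calibration from different starts and keep the lowest-cost run."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[0])


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cost, centroids, labels = kmeans_restarts(points, k=2, n_restarts=20)
print(labels, round(cost, 4))
```

Keeping the run with the smallest within-cluster cost is one common way to use restarts; it lowers the chance of reporting a poor local minimum but does not guarantee the global one.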

Results and discussion

Model structure and parameter calibration

Input variable selection

Considering that we are studying daily simulation, we set the orders of the SWCR and SAD (i.e. n_P and n_Q) to 10 and assume that 10 is large enough. This means that rainfall becomes runoff with a lag of at most 10 days. After the IVS, we found that the lag times of most selected SADs are less than 10. We therefore confirm that an order of 10 is large enough to describe the mapping relationship between the rainfall and runoff data of the Chengcun catchment.

K-means clustering

We use the Silhouette value to determine the optimal number of clusters. The Silhouette is a method for interpreting and validating consistency within clusters of data. The technique provides a succinct graphical representation of how well each object lies within its cluster (Rousseeuw, 1987). The classification results show that the Silhouette value reaches its maximum when the input samples are grouped into 3 clusters. Therefore, we divide the input samples into 3 clusters.
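For reference, the mean Silhouette value of a given clustering can be computed as in the sketch below. It follows the usual definition s = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to any other cluster. Euclidean distance, the zero score for singleton clusters, and the function name `silhouette_mean` are assumptions of this example, which requires at least two clusters.

```python
import math


def silhouette_mean(points, labels):
    """Mean silhouette value over all points, for a labeling with two or more clusters."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Mean distance from p to the members of each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            d = [math.dist(p, q)
                 for j, (q, other) in enumerate(zip(points, labels))
                 if other == c and j != i]
            mean_dist[c] = sum(d) / len(d) if d else None
        a = mean_dist[lab]
        if a is None:
            continue  # singleton cluster: its point contributes a score of 0
        b = min(v for c, v in mean_dist.items() if c != lab)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
good = silhouette_mean(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_mean(points, [0, 1, 0, 1, 0, 1])
print(good > bad)
```

To choose the number of clusters, one would cluster the samples for each candidate K and keep the K with the largest mean value.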

The PEK module

The PEK modules are calibrated by the NSGA-II, Levenberg-Marquardt (LM), and cross-validation methods. The algorithm parameters of NSGA-II are set as follows: population size = 90, total number of generations = 1000, crossover probability = 0.85, and mutation probability = 0.15. The parameters of the LM method are set as suggested by the MATLAB software. For the early stopping strategy, approximately 3/4 of the calibration data are used as the training set, and the remaining data are used as the testing set. The maximum number of testing failures is set to 5. The lower and upper bounds of K for the KNN algorithm are set to 1 and 300, and K is optimized by the leave-one-out cross-validation method. Because we divide the input samples into 3 clusters, we construct one PEK module per cluster for the PKEK model. The objective functions used for the NSGA-II algorithm are the mean squared error (MSE), the mean squared value of the network parameters (MS), and the number of hidden layer neurons. The calibration results indicate that all the Pareto fronts are evenly distributed, which means that the optimization results are reasonable.
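The leave-one-out selection of K can be sketched as follows. This is an illustrative brute-force version, assuming Euclidean distance and an unweighted mean over the neighbors; in the PKEK setting the targets would be the output errors of the ENN, while here `ys` is any numeric target. The names `knn_predict` and `loo_best_k` are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Unweighted mean of the targets of the k training points nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def loo_best_k(xs, ys, k_min=1, k_max=300):
    """Return the K in [k_min, k_max] with the lowest leave-one-out mean squared error."""
    best_k, best_mse = None, math.inf
    # K cannot exceed the number of samples left after holding one out.
    for k in range(k_min, min(k_max, len(xs) - 1) + 1):
        sq_err = 0.0
        for i in range(len(xs)):
            rest_x = xs[:i] + xs[i + 1:]
            rest_y = ys[:i] + ys[i + 1:]
            sq_err += (knn_predict(rest_x, rest_y, xs[i], k) - ys[i]) ** 2
        mse = sq_err / len(xs)
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k, best_mse


# Toy data on a straight line, so the two flanking neighbors average to the truth.
xs = [(float(i),) for i in range(20)]
ys = [2.0 * i for i in range(20)]
best_k, best_mse = loo_best_k(xs, ys)
print(best_k, best_mse)
```

The brute-force loop costs on the order of K_max times n squared distance evaluations, which is acceptable for a sketch but would normally be replaced by a precomputed distance matrix.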

Model performance comparison

Scatter plots comparison

We use scatter plots of the observed and simulated discharges, together with the regression R values, to inspect the overall performance of the models. The scatter plots and the regression R values are shown in figure 2.

Figure 2 The scatter plots of the observed and simulated discharges and the regression R values: (A) PKEK model, (B) NU-PEK model.

As shown in figure 2, for the calibration period the PKEK model obtains the better result (R = 0.9634) and the NU-PEK model the worse result (R = 0.9539). The data points of the PKEK model lie close to the 45-degree line, and their distribution is relatively uniform across small, medium, and large discharge values. These results indicate that the PKEK model generates the better result and the more stable performance in the calibration period. For the validation period, the PKEK model again obtains the better result (R = 0.9296), and the NU-PEK model the worse result (R = 0.8912). As shown in figure 2, the data points of the PKEK model are distributed evenly around the 45-degree line. This indicates that the simulation results of the PKEK model are not biased toward larger or smaller values and that the model is very stable. In contrast, the simulated discharges of the NU-PEK model are larger than the observed values, and this overestimation becomes more obvious for large discharge values.

After analyzing the accuracy in the calibration and validation periods, we also compare the accuracy degradation, measured as the ratio of the validation R value to the calibration R value (a ratio closer to 1 means less degradation). The PKEK model obtains the better ratio (0.9296/0.9634 = 0.9649), and the NU-PEK model obtains the worse ratio (0.8912/0.9539 = 0.9343). This comparison shows that the PKEK model not only outperforms the original NU-PEK model in both periods but also loses less accuracy when moving from calibration to validation.
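The two ratios quoted above can be reproduced directly from the reported R values. The snippet below is only a worked check of that arithmetic; the function name `retention_ratio` is introduced here for illustration.

```python
def retention_ratio(r_validation, r_calibration):
    """Ratio of validation R to calibration R; closer to 1 means less accuracy loss."""
    return r_validation / r_calibration


pkek = retention_ratio(0.9296, 0.9634)
nu_pek = retention_ratio(0.8912, 0.9539)
print(f"PKEK: {pkek:.4f}, NU-PEK: {nu_pek:.4f}")
```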

The analysis of the two models shows that the PKEK model generates better overall results than the original NU-PEK model. As shown in figure 2, for peak values the PKEK model is much better than the NU-PEK model. The comparison indicates that the PKEK model outperforms the NU-PEK model in the Chengcun catchment and supports its satisfactory accuracy and stability.

Simulated hydrographs comparison

Figures 3 and 4 show the observed and simulated hydrographs in the calibration and validation periods for the PKEK and NU-PEK models, respectively. Due to space limitations, we show only two hydrographs for each model: year 1986 for calibration and year 1995 for validation. As shown in these figures, a further advantage of the PKEK model is that its simulated hydrographs are much smoother and more consistent with the observed hydrographs. The improvement in hydrograph shape is owed to the inclusion of the SADs and is attributed to the autoregressive characteristic of the ANN models. The ANN models place a higher weight on the latest simulated discharge values used as inputs. As a result, the model follows the general trend prescribed by the simulated discharge.

Figure 3 Observed and simulated hydrographs of the PKEK model: (a) Calibration; (b) Validation.

Figure 4 Observed and simulated hydrographs of the NU-PEK model: (a) Calibration; (b) Validation.

Conclusions

An improved hybrid data-driven model, named PKEK, is proposed in this paper to overcome the disadvantages of the previously proposed NU-PEK model in daily RR simulation. To compare their performance, the PKEK and NU-PEK models are applied to daily RR simulation in the Chengcun catchment. The PKEK model simulates the runoff hydrograph considerably better than the NU-PEK model. From the application and analysis of the PKEK model, we can conclude that:

The PKEK model has all the advantages of the previously proposed NU-PEK model, including good non-updating simulation accuracy in calibration and validation, automatic calibration with little dependence on human experience, a good compromise between the sufficiency and parsimony of the input data, and a good compromise between simulation accuracy and topology complexity. Compared with the original NU-PEK model, the PKEK model performs much better for daily RR simulation. The improvement in goodness-of-fit to the observed discharge is owed to the addition of the K-means clustering method combined with the use of multiple PEK modules. With the clustering method and multiple PEK modules, the PKEK model simulates the characteristics of the RR relationship more precisely, and its forecasting capability and stability improve. Furthermore, the hydrographs simulated by the PKEK model are much smoother and more consistent with the observed hydrographs. This smoothness is attributed to the autoregressive nature of the ANN, which allocates a higher weight to the antecedent discharge inputs.

Acknowledgments

This research was funded by the IWHR Research & Development Support Program (JZ0145B052016), China Postdoctoral Science Foundation on Grant (Grant NO. 2016M600096, 2016M591214), Major International (Regional) Joint Research Project - China’s Water and Food Security under Extreme Climate Change Impact: Risk Assessment and Resilience (G0305, 7141101024), International Project (71461010701), Study of distributed flood risk forecast model and technology based on multi-source data integration and hydro meteorological coupling system (2013CB036406), China National Flash Flood Disaster Prevention and Control Project (126301001000150068), Natural Science Foundation of China (41601569), Specific Research of China Institute of Water Resources and Hydropower Research (Grant Nos. Fangji 1240), and the Third Sub-Project: Flood Forecasting, Controlling and Flood Prevention Aided Software Development - Flood Control Early Warning Communication System and Flood Forecasting, Controlling and Flood Prevention Aided Software Development for Poyang Lake Area of Jiangxi Province (0628-136006104242, JZ0205A432013, SLXMB200902). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. Guangyuan Kan, Minglei Ren, and Tianjie Lei are the corresponding authors. Guangyuan Kan and Ke Liang contributed equally to this work. The author(s) declare(s) that there is no conflict of interest regarding the publication of this paper.

References

Chen, S., Kan, G., Liang, K., Li, J., Hong, Y., Zuo, D., Lei, T., Xu, W., Zhang, M., Shi, W., & Chen, X. (2017). Air quality analysis and forecast for environment and public health protection: a case study in Beijing, China. Transylvanian Review, 24(12), 3575-3591.

Dong, J., Zheng, C., Kan, G., Wen, J., Zhao, M., & Yu, J. (2015). Applying the ensemble artificial neural network-based hybrid data-driven model to daily total load forecasting. Neural Computing & Applications, 26(3), 603-611.

Grigorios, T., & Aristidis, L. (2014). The MinMax k-means clustering algorithm. Pattern Recognition, 47(7), 2505-2516.

Kan, G., Yao, C., Li, Q., Li, Z., Yu, Z., Liu, Z., Ding, L., He, X., & Liang, K. (2015a). Improving event-based rainfall-runoff simulation using an ensemble artificial neural network based hybrid data-driven model. Stochastic Environmental Research and Risk Assessment. DOI: 10.1007/s00477-015-1040-6.

Kan, G., Li, J., Zhang, X., Ding, L., He, X., Liang, K., Jiang, X., Ren, M., Li, H., Wang, F., Zhang, Z., & Hu, Y. (2015b). A new hybrid data-driven model for event-based rainfall-runoff simulation. Neural Computing & Applications, 27(2). DOI: 10.1007/s00521-016-2200-4.

Kan, G., He, X., Li, J., Ding, L., Zhang, D., Lei, T., Hong, Y., Liang, K., Zuo, D., Bao, Z., & Zhang, M. (2016a). A novel hybrid data-driven model for multi-input single-output system simulation. Neural Computing & Applications. DOI: 10.1007/s00521-016-2534-y.

Kan, G., Liang, K., Li, J., Ding, L., He, X., Hu, Y., & Mark, A. (2016b). Accelerating the SCE-UA global optimization method based on multi-core CPU and many-core GPU. Advances in Meteorology, 8483728, 10 pages. Retrieved from http://dx.doi.org/10.1155/2016/8483728.

Kan, G., Lei, T., Liang, K., Li, J., Ding, L., He, X., Yu, H., Zhang, D., Zuo, D., Bao, Z., Mark, A., Hu, Y., & Zhang, M. (2016c). A multi-core CPU and many-core GPU based fast parallel shuffled complex evolution global optimization approach. IEEE Transactions on Parallel and Distributed Systems. DOI: 10.1109/TPDS.2016.2575822.

Kan, G., Zhang, M., Liang, K., Wang, H., Jiang, Y., Li, J., Ding, L., He, X., Hong, Y., Zuo, D., Bao, Z., & Li, C. (2016d). Improving water quantity simulation & forecasting to solve the energy-water-food nexus issue by using heterogeneous computing accelerated global optimization method. Applied Energy. Retrieved from http://dx.doi.org/10.1016/j.apenergy.2016.08.017.

Kan, G., He, X., Ding, L., Li, J., Lei, T., Liang, K., & Hong, Y. (2016e). An improved hybrid data-driven model and its application in daily rainfall-runoff simulation. IOP Conference Series: Earth and Environmental Science, 46(2016), 012029 (6th Digital Earth Summit). DOI: 10.1088/1755-1315/46/1/012029.

Kapageridis, I. K. (2015). Variable lag variography using K-means clustering. Computers & Geosciences, SI, B, 85, 49-63.

Lei, T., Pang, Z., Wang, X., Li, L., Fu, J., Kan, G., Zhang, X., Ding, L., Li, J., Huang, S., & Shao, C. (2016). Drought and carbon cycling of grassland ecosystems under global change: A review. Water, 8, 460. DOI: 10.3390/w8100460.

Li, C., Cheng, X., Li, N., Du, X., Yu, Q., & Kan, G. (2016). A framework for flood risk analysis and benefit assessment of flood control measures in urban areas. International Journal of Environmental Research and Public Health, 13, 787. DOI: 10.3390/ijerph13080787.

Li, Z., Kan, G., Yao, C., Liu, Z., Li, Q., & Yu, S. (2014). An improved neural network model and its application in hydrological simulation. Journal of Hydrologic Engineering, 19(10), 04014019-1-04014019-17.

MATLAB (2012). MATLAB documentation - user’s manual, USA.

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53-65.

Zuo, D., Cai, S., Xu, Z., Li, F., Sun, W., Yang, X., Kan, G., & Liu, P. (2016). Spatiotemporal patterns of drought at various time scales in Shandong Province of Eastern China. Theoretical and Applied Climatology. DOI: 10.1007/s00704-016-1969-5.

Received: May 14, 2016; Accepted: September 22, 2016

^* Author´s institutional address. Ph.D. Guangyuan Kan,^# State Key Laboratory of Simulation and Regulation of Water Cycle in River Basin, China Institute of Water Resources and Hydropower Research, Research Center on Flood & Drought Disaster Reduction of the Ministry of Water Resources, Beijing 100038, P. R. China, Tsinghua University, Department of Hydraulic Engineering, State Key Laboratory of Hydroscience and Engineering, Beijing, 100084, P. R. China, No. 3 Yuyuantan South Road, Haidian district, Beijing, P. R. China, Telephone: +86 13718606623, kanguangyuan@126.com. Prof. Xiaoyan He, Prof. Liuqian Ding, Prof. Jiren Li, Ph.D. Minglei Ren, Ph.D. Tianjie Lei, State Key Laboratory of Simulation and Regulation of Water Cycle in River Basin, China Institute of Water Resources and Hydropower Research, Research Center on Flood & Drought Disaster Reduction of the Ministry of Water Resources, Beijing 100038, P. R. China, No. 3 Yuyuantan South Road, Haidian district, Beijing, China, 68781991, 68781956, 68781593, 68781798, Telephone: +86 15120098723, hexy@iwhr.com; dinglq@iwhr.com; ljrrsc@163.com; renml@iwhr.com; leitj@iwhr.com. Prof. Yang Hong, Tsinghua University, Department of Hydraulic Engineering State Key, Laboratory of Hydroscience and Engineering, Beijing, 100084, P. R. China, University of Oklahoma Department of Civil Engineering and Environmental Science, Norman, OK, United States, Tsinghua University, Haidian district, Beijing,P. R. China, 68787394, hongyang@tsinghua.edu.cn. M.S. Ke Liang,^# Hohai University, College of Hydrology and Water Resources, Nanjing 210098, P. R. China, No. 1 Xikang Road, Gulou District, Nanjing, Jiangsu Province, P. R. China, Telephone: +86 15295500581, liangkepapers@126.com. Ph.D. Depeng Zuo, Beijing Normal University, College of Water Sciences, Beijing 100875, P. R. China, No. 19 Xinjiekouwai Street, Haidian District, Beijing, P. R. China, 58801136, dpzuo@bnu.edu.cn. Ph.D. 
Pengnian Huang, Nanjing University of Information Sciences & Technology, College of Hydrometeorology, Nanjing, 210044, P. R. China, No. 219 Liuning Road, Nanjing, Jiangsu Province, China, 15951990472, 002752@nuist.edu.cn. ^# Authors Guangyuan Kan and Ke Liang contributed equally to this work.

This is an open-access article distributed under the terms of the Creative Commons Attribution License