1 Introduction
Floods are one of the most frequent types of natural disaster and occur when an overflow of water submerges land that is usually dry. Floods are often caused by heavy rainfall, rapid snowmelt, or, in coastal areas, a storm surge from a tropical cyclone or tsunami [16].
Floods can cause widespread devastation, resulting in loss of life and damage to personal property and critical public health infrastructure. Between 1998 and 2017, floods affected more than 2 billion people worldwide [1].
People who live in floodplains or non-resistant buildings, or who lack warning systems and awareness of flooding hazards [15], are most vulnerable to floods. Floods are also increasing in frequency and intensity as the climate changes. Flood prediction using Machine Learning (ML) algorithms is effective because such algorithms can utilize data from various sources and apply classification and regression to separate it into flood and non-flood classes [14].
ML methods have the potential to improve accuracy as well as reduce computation time and model development cost [2]. Cloud-based applications for the prediction and management of natural or man-made catastrophes such as earthquakes, cyclones, and floods have also employed random forest regression to provide improved accuracy based on rainfall, temperature, cloud, wind speed, and pressure, among other factors [28, 29].
In this work, both classification and clustering algorithms are used to predict floods. The most contributing features are also selected using a feature selection algorithm. The results obtained with and without feature selection are compared in terms of the time taken to build the model and the prediction accuracy.
1.1 Motivation
Previous studies suggest that personal flood experience is a major motivator for mitigation behaviour [9]. People who had not been harmed by a flood greatly underestimated its detrimental effects [12]. Based on these findings, it is possible to conclude that risk communication should not focus exclusively on technical factors.
According to recent research, social trust in people who manage a hazard is highly related to judgements about the hazard's risk and benefits [10]. When a person lacks knowledge about a hazard, social trust in the authorities in charge of handling the hazard determines perceived risks and benefits.
It is always preferable to anticipate an emergency situation and take steps to minimise loss. In this study, we attempted to anticipate floods by studying the month-by-month rainfall index of a specific area over a period of 118 years [13].
1.2 Contributions
The contributions of this work are listed below.
– To predict floods from the monthly rainfall index using supervised ML techniques, without and with optimized attributes.
– To predict floods from the monthly rainfall index using unsupervised ML techniques, without and with optimized attributes.
– To identify the best technique through classes-to-clusters evaluation, comparing performance in terms of classification accuracy and the time taken to build the model.
The rest of the paper is organised as follows. Related work is described in Section 1.3, the adopted methodology is presented in Section 2, the dataset considered for the experiment and the environmental setup are described in Section 3, and the results are analysed in Section 4. Finally, Section 5 concludes the paper and outlines future scope.
1.3 Related Work
Chen et al. [30] used ML models such as Gradient Boosting Decision Trees (GBDT), eXtreme Gradient Boosting (XGBoost), and Convolutional Neural Network (CNN) for flood risk assessment, selecting twelve indices and using 2000 sample points for model training and testing before optimising the models through hyperparameter tuning.
The GBDT model achieved the highest accuracy, 96.83%, although twelve indices alone are insufficient for flood risk assessment. Jenifer et al. applied Otsu's thresholding technique to Sentinel-1 SAR imagery of the Alappuzha region [31]. The raw SAR images were preprocessed using the Sentinel-1 toolset of the SNAP software.
The Otsu thresholding approach was then used to compute the threshold value that demarcates the water pixels in the SAR images, in order to estimate the flooding in the region. The Area Under Curve (AUC) obtained by the authors was 0.83, indicating that the classifier performs well.
Ravansalar et al. [32] proposed a hybrid Wavelet Linear Genetic Programming (WLGP) model, which combines a discrete wavelet transform (DWT) with Linear Genetic Programming (LGP), to estimate monthly streamflow at two gauging stations.
The authors divided the original flow time series into sub-series based on wavelet coefficients. LGP was then applied to the sub-series to forecast streamflow one month in advance.
The authors used the Nash coefficient to measure efficiency, which was 0.877 and 0.817 for the Pataveh and Shahmokhtar stations, respectively. Jigaw et al. [33] combined the statistical method Regional Flood Frequency Analysis (RFFA) with Support Vector Regression (SVR).
Hydrometric data from Environment Canada's Hydrometric Database (HYDAT) were predicted using RFFA-SVR with several kernel functions (linear, polynomial, and multilayer perceptron kernels); the radial basis kernel function outperformed the other kernel functions in terms of the Nash-Sutcliffe coefficient, with a coefficient of determination of about 0.7.
Lohani et al. [34] proposed a threshold subtractive clustering based Takagi Sugeno (TSC-T–S) fuzzy inference system which computes two cluster centers based on the hydrological situation, i.e. one for frequent events and another for rare events.
The authors also proposed a new evaluation measure, Peak Percent Threshold Statistics (PPTS), to evaluate the ability of a forecasting model. TSC-T–S was compared with the Self Organizing Map (SOM) and the subtractive clustering based Takagi Sugeno fuzzy model (SC-T–S fuzzy model) and gave accurate forecasts.
Damle et al. [35] presented Time Series Data Mining (TSDM), a method for characterising and predicting occurrences in complicated, nonperiodic, and chaotic time series that blends chaos theory and data mining.
Earthquakes, floods, and rainfall are examples of chaotic nonlinear systems, in which the interactions between variables in a system are dynamic and disproportionate, yet totally deterministic. Mosavi et al. [36] presented the most promising long-term and short-term flood prediction approaches.
The important developments in enhancing the quality of flood prediction models were also addressed. The most effective tactics for improving ML algorithms were identified by the authors and include hybridization, data decomposition, algorithm ensemble, and model optimisation.
According to the authors, this survey can be used as a guideline for hydrologists and climate scientists in selecting the best ML method for the prediction task.
2 Methodology
In this work, we considered a month-wise rainfall dataset covering 118 years for a particular region and applied PSO search for attribute selection. Different classification and clustering algorithms were then applied to the selected attributes and the results were compared.
The steps involved in the proposed methodology are described as follows.
a) Pre-Processing
Pre-processing is one of the most important phases in building an ML model. Before data is passed to a model, it needs to be processed so that performance can be enhanced [17].
The enhancement can be in terms of accuracy, processing time, or any other parameter. In this work, redundant, irrelevant, and minimally contributing data were removed from the dataset to reduce the model building time [18].
b) Feature Selection
Feature selection is a process by which features whose contribution is close to zero are eliminated from the dataset [6]. This helps to reduce the time taken to build the model. Here, attribute selection was done with a subset evaluator using a pool size of one and a single thread.
In this work, PSO Search [5] was used for feature selection. PSO is a population-based optimization technique that can be applied in many research areas. Kennedy and Eberhart [8] proposed this technique, inspired by fish schooling and the flocking behaviour of birds.
A bird in the search space, called a particle, is a candidate solution of the problem. A group of particles, called a swarm, tries to find the optimal position by moving through the search space. Each particle x_i has a velocity v_i and is represented as in Eqn. 1:
\[ x_i = (x_{i1}, x_{i2}, \ldots, x_{in}), \qquad v_i = (v_{i1}, v_{i2}, \ldots, v_{in}) \tag{1} \]
where i is the particle number and n is the problem dimension, i.e. the number of unknown variables present in the problem.
PSO starts from a group of randomly initialised particles [11]. In each iteration, two best values are tracked for each particle: pbest, the best position that the particle itself has found so far, and gbest, the best position found so far by the whole swarm.
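The position and velocity of each particle can be updated using Eqn. 2, written here in the standard inertia-weight form of PSO:
\[ v_i^{t+1} = w\,v_i^{t} + c_1 r_1 \,(pbest_i - x_i^{t}) + c_2 r_2 \,(gbest - x_i^{t}), \qquad x_i^{t+1} = x_i^{t} + v_i^{t+1} \tag{2} \]
where w is the inertia weight, c_1 and c_2 are the acceleration coefficients, r_1 and r_2 are random numbers drawn uniformly from [0, 1], and t is the iteration index.
In Eqn. 2, the velocity is commonly limited to a maximum value v_max. This value is usually initialized as a function of the range of the problem. PSO can be used as a feature selection algorithm.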
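As a rough illustration of how PSO can drive feature selection, the sketch below implements a small binary PSO in plain Python. It is not the WEKA PSO Search used in this work: the sigmoid-based bit update, the toy fitness function `toy_fitness`, the index set `USEFUL`, and all parameter values (`w`, `c1`, `c2`, `v_max`, swarm size, iteration count) are assumptions chosen only for illustration.

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=10, n_iter=30,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Maximise fitness(mask) over 0/1 feature masks with a binary PSO."""
    rng = random.Random(seed)
    # Random initial positions (feature masks) and velocities.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[rng.uniform(-1, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull towards pbest + pull towards gbest.
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))  # clamp to [-v_max, v_max]
                # Binary position update: sigmoid(velocity) is the probability of a 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Hypothetical fitness: reward three arbitrarily chosen "useful" features
# and charge a small cost for every selected feature.
USEFUL = {5, 6, 8}

def toy_fitness(mask):
    hits = sum(1 for d in USEFUL if mask[d])
    return hits - 0.1 * sum(mask)

best_mask, best_fit = binary_pso(toy_fitness, n_features=12)
selected = [d for d, bit in enumerate(best_mask) if bit]
print("selected feature indices:", selected, "fitness:", best_fit)
```

In a real setting the fitness would typically be the cross-validated accuracy of a classifier trained on the masked attributes, which is far more expensive to evaluate than this toy function.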
K-means: Unlike supervised learning, K-means clustering does not need labelled data [7, 19]. K-means divides objects into clusters such that objects within a cluster share commonalities and are distinct from the objects in other clusters [20]. The term 'K' refers to the number of clusters that will be produced. Methods exist for determining the optimum value of K for a given set of data [21].
Manhattan distance approach: The Manhattan distance is the simple sum of the horizontal and vertical components, i.e. the distance between two points measured along axes at right angles to each other [22]. For two points p = (p_1, ..., p_n) and q = (q_1, ..., q_n), the distance is calculated using the equation presented in Eqn. 3:
\[ d(p, q) = \sum_{k=1}^{n} |p_k - q_k| \tag{3} \]
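The Manhattan distance and its use inside a K-means-style loop can be sketched as follows. This is a minimal illustration, not the WEKA implementation used in the experiments: the function names, the toy `data` points, and the seed are assumptions, and the centroid update uses the component-wise median because that is the point minimising the summed Manhattan distance within a cluster.

```python
import random
from statistics import median

def manhattan(p, q):
    """Sum of absolute coordinate differences between two equal-length points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kmeans_manhattan(points, k=2, n_iter=100, seed=0):
    """K-means-style clustering under the Manhattan distance."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the Manhattan distance.
        new_labels = [min(range(k), key=lambda c: manhattan(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: component-wise median of each non-empty cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [median(col) for col in zip(*members)]
    return labels, centroids

# Hypothetical two-feature points forming two well-separated groups.
data = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (8.0, 9.0), (8.5, 9.5), (9.0, 8.8)]
labels, centroids = kmeans_manhattan(data, k=2)
print(labels, centroids)
```

Which group receives cluster id 0 depends on the random initial centroids, so only the grouping itself, not the numbering, is meaningful.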
Density Based Clustering: In density-based clustering, a cluster is a set of data objects spread over a contiguous region of the data space with a high density of objects [24, 25].
Density-based clusters are separated from one another by contiguous regions of low object density [26]. Data items in low-density areas are often regarded as noise or outliers.
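The density notion described above can be illustrated with a compact DBSCAN-style sketch. DBSCAN is used here only as a well-known example of density-based clustering; it is not necessarily the algorithm behind the density-based clusterer used in the experiments, and the parameters `eps` and `min_pts` as well as the toy points are assumptions.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def dbscan(points, eps, min_pts, dist=manhattan):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # not dense enough to start a cluster
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue                 # already assigned to a cluster
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:     # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical points: two dense groups and one isolated point.
data = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3),
        (5.0, 5.0), (5.1, 5.2), (4.8, 5.1),
        (9.0, 1.0)]
labels = dbscan(data, eps=1.0, min_pts=3)
print(labels)
```

The isolated point lies in a low-density area and is therefore reported as noise rather than being forced into one of the two clusters.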
3 Dataset Details and Environmental Setup
The dataset considered in this experiment was downloaded from an open-source data repository named Kaggle [3]. It contains 118 years of rainfall index data, collected month-wise from the year 1900 to 2018.
By observing the pattern of the rainfall index, we tried to predict floods. It is a labelled dataset with two labels, i.e., “yes” for occurrence of a flood and “no” for non-occurrence. The dataset details as well as the details of the environment setup are presented in Table 1.
Attribute | Values
Dataset considered | Monthly Rainfall Index and Flood Probability [3] (data of 118 years)
Source | Kaggle open-source data repository
Dataset last accessed | 8th April 2023
Experimentation environment | WEKA version 3.9 [4]
Attribute selection algorithm | PSO Search
Clustering algorithm | K-means clustering (Manhattan distance approach), Density Based Clustering
4 Result and Analysis
In this experiment, classification techniques such as J48 and RF were applied to the considered dataset both without and with feature selection. The performance was evaluated in terms of accuracy and model building time.
Clustering techniques such as K-means and density-based clustering were also applied to the same data after removing the label field. This experiment was likewise conducted both with and without feature selection.
a) Classification without Feature Selection algorithm
The classification algorithms were evaluated with 10-fold cross-validation. In this work, we applied the J48 and RF classification algorithms to the considered dataset.
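The idea behind 10-fold cross-validation can be sketched in a few lines. This is a simplified illustration under stated assumptions: the folds are contiguous and unshuffled (tools such as WEKA also randomise and stratify the data), and the majority-class `fit_majority`/`predict_majority` pair is a hypothetical stand-in for a real classifier such as J48 or RF.

```python
from collections import Counter

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k roughly equal, contiguous folds."""
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0)
                  for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(fit, predict, X, y, k=10):
    """Pooled accuracy over k folds; every sample is tested exactly once."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(1 for i in test_idx if predict(model, X[i]) == y[i])
    return correct / len(X)

# Hypothetical baseline classifier: always predict the majority training class.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Toy data: 20 samples with made-up labels.
X = list(range(20))
y = ["yes"] * 12 + ["no"] * 8
acc = cross_val_accuracy(fit_majority, predict_majority, X, y, k=5)
print(f"pooled accuracy: {acc:.2f}")
```

With 118 instances and k = 10, this splitting yields eight folds of 12 instances and two folds of 11, so each model is trained on roughly 90% of the data.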
When the J48 algorithm was applied, the pruned tree had 11 leaves and a size of 21.
The total time taken to build the model was 0.07 seconds, and 70.34% of the data were correctly classified. The confusion matrix of the J48 algorithm without feature selection is given in Table 2.
During the classification, all the evaluation parameters were observed and are listed in Table 3.
TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area | Class
0.750 | 0.345 | 0.692 | 0.750 | 0.720 | 0.407 | 0.702 | 0.659 | Flood
0.655 | 0.250 | 0.717 | 0.655 | 0.685 | 0.407 | 0.702 | 0.657 | No Flood
0.703 | 0.298 | 0.704 | 0.703 | 0.703 | 0.407 | 0.702 | 0.658 | Weighted Avg.
After the classification, the kappa statistic was 0.4058, the mean absolute error was 0.3104, and the root mean squared error was 0.5312. The confusion matrix for the same is given in Table 2.
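The relationship between a confusion matrix and the reported accuracy, kappa statistic, and MCC can be checked with a short calculation. The four counts below are not quoted from the paper's confusion matrix; they are an assumption, inferred from the per-class TP rates and precisions in Table 3 together with the 118 instances.

```python
import math

# Counts inferred (an assumption) from the per-class rates in Table 3.
# Rows are actual classes, columns are predicted classes.
tp, fn = 45, 15   # actual Flood:    predicted Flood, predicted No Flood
fp, tn = 20, 38   # actual No Flood: predicted Flood, predicted No Flood
n = tp + fn + fp + tn

# Accuracy: share of instances on the diagonal.
accuracy = (tp + tn) / n

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
kappa = (accuracy - p_chance) / (1 - p_chance)

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"accuracy={accuracy:.4f} kappa={kappa:.4f} mcc={mcc:.3f}")
```

Under these assumed counts, the computed values agree with the reported accuracy (70.34%), kappa statistic (0.4058), and MCC (0.407) for J48 without feature selection.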
The RF classifier [27] was applied to the same dataset with 100 iterations, and 78.81% of the data were correctly classified.
The confusion matrix for the RF algorithm without feature selection is given in Table 4.
The total time taken to build the model was 0.32 seconds, and the detailed accuracy of the classification is given in Table 5.
TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area | Class
0.800 | 0.224 | 0.787 | 0.800 | 0.793 | 0.576 | 0.902 | 0.904 | Flood
0.776 | 0.200 | 0.789 | 0.776 | 0.783 | 0.576 | 0.902 | 0.901 | No Flood
0.788 | 0.212 | 0.788 | 0.788 | 0.788 | 0.576 | 0.902 | 0.902 | Weighted Avg.
In this classification, the kappa statistic, mean absolute error, and root mean squared error were 0.576, 0.3375, and 0.3782 respectively.
b) Classification with Feature Selection Algorithm
In this case, the PSO feature selection algorithm was first applied to the dataset to select the most contributory features. The feature selection found that JUN, JULY, and SEPTEMBER are the months whose rainfall index contributes most to flood prediction.
After the feature selection, the J48 classifier was applied to the selected features. The total time taken to build the model was 0.07 seconds.
The size of the tree created by the J48 model is 11, whereas the number of leaves is 11. Out of a total of 118 instances, 78 were correctly classified, an accuracy of 66.10%.
The values of the kappa statistic, mean absolute error, root mean squared error, and relative absolute error were 0.32, 0.35, 0.49, and 70.20% respectively.
The confusion matrix obtained with the J48 algorithm is given in Table 6. The values of TP rate, FP rate, Precision, Recall, F-Measure, MCC, ROC Area, and the class are listed in Table 7.
We also applied the RF classifier with 10-fold cross-validation. The total time taken to build the model was 0.16 seconds. 76.27% of the data were classified correctly, whereas 23.73% were classified incorrectly.
TP rate | FP rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area | Class
0.683 | 0.362 | 0.661 | 0.683 | 0.672 | 0.322 | 0.726 | 0.663 | YES
0.638 | 0.317 | 0.661 | 0.638 | 0.649 | 0.322 | 0.726 | 0.725 | NO
For the RF classifier, the kappa statistic, mean absolute error, root mean squared error, and relative absolute error were 0.53, 0.30, 0.43, and 63.82% respectively. The confusion matrix is given in Table 8 and the values of the other evaluation parameters are presented in Table 9.
TP rate | FP rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area | Class
0.767 | 0.241 | 0.767 | 0.767 | 0.767 | 0.525 | 0.816 | 0.820 | YES
0.759 | 0.233 | 0.759 | 0.759 | 0.759 | 0.525 | 0.816 | 0.814 | NO
c) Clustering without Feature Selection Algorithm
The K-means clustering algorithm was applied to the considered dataset without feature selection. Two clusters were formed, i.e. 1 for yes and 0 for no. Missing values were globally replaced with the mean/mode. The Manhattan distance (ManhattanDistance) was used during cluster creation.
The total time taken to build the model was 0.09 seconds, and the algorithm went through 11 iterations. Out of 118 instances, 73 were clustered as no flood and 45 as flood.
A total of 27.97% of the data were clustered incorrectly and the rest were clustered correctly. The cluster instances are presented in Table 10 and the classes to clusters matrix is given in Table 11.
Density-based clustering was also applied to the same dataset without any feature selection. The prior probability was 0.6 for cluster 0 and 0.4 for cluster 1.
The total time taken to build the model was 0.5 seconds. In this model, 27.11% of instances were clustered incorrectly. The cluster instances are given in Table 12 and the classes to clusters matrix is presented in Table 13.
d) Clustering with Feature Selection Algorithm
For better performance, we applied feature selection using PSO before K-means clustering; out of 12 attributes, 4 were selected. During model creation, the algorithm went through iterations. Out of 118 instances, 43 were assigned to cluster 0 and 75 to cluster 1.
The total time taken to build the model was 0.02 seconds, and 29.66% of instances were clustered incorrectly.
The cluster instances are presented in Table 14 and the classes to clusters matrix is given in Table 15.
Density based clustering: The PSO algorithm was applied for feature selection and then density-based clustering was applied to the considered dataset.
In this model, out of 12 attributes, 4 were selected as the most contributing attributes. The algorithm went through 5 iterations while building the model. Missing values were globally replaced with the mean/mode.
According to the cluster centroids, 43 of the 118 instances belong to cluster 0 and 75 belong to cluster 1. Cluster 0 has a prior probability of 0.4 and cluster 1 has a prior probability of 0.6. The total time taken to build the model was 0.02 seconds, and 31.36% of instances were clustered incorrectly.
The clustered instances of density-based clustering with feature selection are given in Table 16 and the classes to clusters matrix of density-based clustering with feature selection is presented in Table 17.
Classes to cluster evaluation: The clustering accuracy performance was evaluated against the real classification. The accuracy percentage in both with and without feature selection is shown in Figure 2.
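A classes-to-clusters evaluation of this kind can be sketched as follows. The sketch assumes a one-to-one mapping between clusters and classes chosen to minimise the number of errors, with at most as many clusters as classes; the function name and the toy labels are hypothetical and are not taken from the experiment.

```python
from itertools import permutations

def classes_to_clusters_error(true_labels, cluster_ids):
    """Return (error_rate, mapping) for the best one-to-one cluster-to-class map.

    Every assignment of distinct classes to clusters is tried, which is fine
    for the two-cluster case but grows factorially with the number of clusters.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best_correct, best_map = -1, {}
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(1 for t, c in zip(true_labels, cluster_ids)
                      if mapping[c] == t)
        if correct > best_correct:
            best_correct, best_map = correct, mapping
    return 1 - best_correct / len(true_labels), best_map

# Toy example: six instances with known labels and their cluster assignments.
true_labels = ["yes", "yes", "no", "no", "yes", "no"]
cluster_ids = [1, 1, 0, 0, 0, 0]
error_rate, mapping = classes_to_clusters_error(true_labels, cluster_ids)
print(f"incorrectly clustered: {error_rate:.2%}, mapping: {mapping}")
```

The class labels are used only after clustering, to score the result, which is why the label field can be removed from the data before the clustering step.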
Model building time comparison: The model building times of the classification and clustering techniques with and without the feature selection algorithm are compared in Figure 3.
From the graph, it is observed that the model building time decreases with feature selection for RF (0.32 to 0.16 seconds), K-means (0.09 to 0.02 seconds), and density-based clustering (0.5 to 0.02 seconds), while for J48 it is unchanged at 0.07 seconds.
5 Conclusion and Future Directions
In this work, we considered a dataset containing the rainfall details of a particular region. It contains month-wise data for 118 years. The classification techniques J48 and RF and the clustering techniques K-means and density-based clustering were applied to the considered dataset. After this, the months contributing minimally to flood prediction were discarded using the PSO feature selection method.
The same classification and clustering techniques were then applied again and the results of the two cases were compared. Accuracy was higher for both clustering and classification when no features were discarded, but the model building time was reduced when the classification and clustering techniques were applied to the selected features.
In future, a model of this type could be built and deployed in the cloud as a service that is called on demand, so that alerts can be raised and the situation can be managed properly.