1 Introduction
A wildfire is an exceptional or extraordinary free-burning vegetation fire that may be started maliciously, accidentally, or by natural causes, and that can significantly affect the global carbon cycle by releasing large amounts of CO2 into the atmosphere.
It has profound economic effects on people, communities, and countries, produces smoke that is harmful to health, devastates wildlife, and negatively impacts bodies of water [26].
The three main categories of remote sensing systems for wildfire monitoring and detection are ground-based systems, manned aerial vehicle-based systems, and satellite-based systems. However, each presents technological and practical problems: ground-based systems have limited surveillance ranges; satellite-based systems suffer from inflexible route planning, potentially low spatial resolution, and delayed information transmission; and manned aerial vehicle-based systems are expensive and potentially dangerous due to hazardous environments and human error.
Unmanned aerial vehicles (UAVs) provide a mobile and low-cost solution using computer vision-based remote sensing systems that can perform long-time, monotonous, and repetitive tasks [31]. Drones, in particular, represent an excellent opportunity due to their easy deployment.
However, the ability to implement these fire detection systems based on deep learning (DL) is limited by the maximum payload of the drone and the high power consumption of DL models. In this paper, we posit that a convolutional neural network (CNN) can be implemented on a hardware accelerator, embedded as part of a smart camera and installed on a drone, for the detection of wildfires.
A review of the literature on hardware implementations of various artificial intelligence (AI) algorithms was published by Talib et al. [22], covering 169 research reports published between 2009 and 2019 that focus on hardware accelerators implemented with application-specific integrated circuits (ASICs), FPGAs, or GPUs. They found that most implementations were FPGA-based, focusing mainly on the acceleration of CNNs for object detection, with GPU-based implementations in second place.
Due to the diversity of applications, AI models such as CNNs must meet various performance requirements for drones and autonomous vehicles, with low latency, low weight overhead, long battery autonomy, and low power consumption being the most pressing. The complexity of the tasks that CNNs must perform continues to increase as models evolve.
As a result, deeper networks are designed in exchange for higher computational and memory demands. In this context, the reconfiguration capabilities of FPGAs enable the creation of CNN hardware implementations that are high-performance, low-power, and configurable to fit system demands [27].
A smart camera is an embedded system for computer vision applications that has attracted great interest in various application domains, as it offers image capture and image processing capabilities in a compact system [20].
This paper describes the methodology, implementation, design cycle, and experimental protocol of porting a modified U-Net model onto a Xilinx Ultra96-V2 FPGA for the wildfire semantic segmentation task of the proposed smart camera system. The rest of the paper is organized as follows: Section 2 discusses recent works applying computer vision models for wildfire segmentation, highlighting their strengths and limitations; the second part of the section discusses related works regarding smart camera implementations in order to better contextualize our work.
Section 3 details our contribution, discussing in detail the proposed model, the dataset used for evaluating our models, and the design flow followed for optimizing the model and testing it in the target embedded FPGA board.
Section 4 discusses the results of our optimization process and provides a quantitative and qualitative comparison between full precision and the BNN model. Finally, Section 5 concludes the paper and discusses future areas of research.
2 State of the Art
2.1 Segmentation Models for Wildfire Detection and Characterization
Detecting a wildfire by categorizing each pixel in an infrared image is a semantic segmentation problem; for this task, AI models such as fully convolutional networks and the U-Net model proposed by Ronneberger et al. in 2015 [19] have been used, as they allow precise segmentation with few training images.
For the specific task of fire segmentation, artificial intelligence models have already been implemented to solve this problem with visible images of fire [2], the fusion of visible and infrared images of fire [8], and visible images of fire and smoke [18]. Akhloufi et al. [2] proposed Deep-Fire, a semantic segmentation model based on the U-Net architecture.
The authors trained and evaluated their model using the Corsican Fire Database [25]. With an F1 score ranging from 64.2% to 99% on the test set, Akhloufi et al. claimed successful results using the Dice similarity coefficient as the loss function for the model.
Ciprián-Sánchez et al. [8] evaluated thirty-six different DL combinations of architectures (the U-Net-based Akhloufi architecture [2], the FusionNet-based Choi architecture [6], and the VGG16-based Frizzi architecture [9]), losses (Dice [16], Focal Tversky [1], and Unified Focal [30]), and image types (the visible and near-infrared (NIR) images of the Corsican Fire Database [25], as well as fused visible-NIR images produced by the methods of Li et al. [15] and Ciprián-Sánchez et al. [7]).
After evaluating these models, the best-performing combination was Akhloufi + Dice + visible, with an F1 score (also known as the Dice coefficient) of 0.9323.
Although these works highlight the potential of AI in this domain, many of these algorithms cannot operate in real time: they suffer from high inference times and require substantial computing resources, which impedes their use on drone missions. We therefore posit that new paradigms are needed for their successful deployment, particularly in terms of inference time (FPS) and power consumption.
2.2 Smart Camera Implementations for Computer Vision
Smart cameras are devices that process, analyze, and extract data from the images they capture. Different video processing algorithms are used for the extraction.
Smart cameras have been employed in a variety of applications, including human gesture recognition [29], surveillance [4], smart traffic signal optimization [23], and fire detection based on conventional image processing methods [10]. In contrast, we propose a DL implementation capable of precise segmentation that can serve as a first step in wildfire characterization and risk assessment systems.
FPGAs are excellent choices for building smart cameras because they offer significant processing capabilities at low power consumption, which makes them good candidates for edge tasks: they enable efficient hardware accelerators capable of high throughput [27] while maintaining a high degree of flexibility and reconfigurability.
The disadvantage of FPGAs is that developers need to be skilled in hardware design to accomplish these goals. The design process frequently takes longer with FPGAs than with CPU and GPU systems.
To address such issues, FPGA vendors and other academic and industrial tool developers have introduced several computer-aided design (CAD) tools for training and optimizing DL models and mapping such models into the reconfigurable fabric.
Convolutional neural networks provide high-accuracy results for computer vision tasks, and their applications can benefit from implementation on edge devices such as FPGAs. Still, for applications such as smart cameras, limiting hardware resource usage and power consumption is of the utmost importance.
Models that would otherwise require a large number of computational resources, large storage for their parameters, and energy-hungry hardware such as GPUs [28, 27] must therefore be compressed through model optimization techniques such as pruning and quantization [3] in order to be implemented on devices such as FPGAs while achieving high inference speed.
3 Proposed Method
The BNN for the segmentation of wildfire images was implemented using the Xilinx Vitis AI tool, in which each operation of the model is mapped to a hardware-accelerated micro-instruction, so that a sequence of micro-instructions represents the whole DL model while a scheduler manages the hardware calls and data flow. This enables the customization of the hardware accelerator while taking the resources of the FPGA into account.
In the particular context of our application, Vitis AI is indeed the best choice as we target a small FPGA device (Xilinx Ultra96-V2) for deep embedded image processing.
3.1 General Overview of the Optimization Approach
Fig. 1 depicts the Vitis AI PyTorch flow followed in this paper. The design process begins by training a segmentation model with NIR images from the Corsican Fire Database [25] and their corresponding ground truths for fire region segmentation.
Subsequently, a pruning process is performed in the PyTorch framework to reduce the number of filters in the convolutional layers. Both the original and the pruned models are then saved as .pt files.
The next module changes the numerical representation of the DL model by performing an 8-bit quantization with the Vitis AI quantizer module, producing a quantized model.
Finally, the quantized model is compiled with the Vitis AI compiler, producing an xmodel file containing all the instructions needed by the deep learning processor unit (DPU) to execute the model. After the model has been compiled, it can be loaded onto the target FPGA board and tested.
In our work, this model is a U-Net model modified to accommodate the needs of our application. The rest of this section will detail the implementation of such an optimized segmentation model.
3.2 Dataset: Corsican Fire Database
In this paper, we employ the NIR images from the Corsican Fire Database, first introduced by Toulouse et al. [25].
For fire region segmentation tasks, this dataset includes 640 pairs of visible and NIR fire images along with the matching ground truths created manually by experts in the field.
A representative NIR image from the Corsican Fire Database is shown in the top left corner of Fig. 2, along with its corresponding ground truth.
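As an illustration of how such image/ground-truth pairs can be consumed during training, the following is a minimal PyTorch Dataset sketch; the directory layout, file names, and the CorsicanFireDataset class are hypothetical and not part of the original database distribution.

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class CorsicanFireDataset(Dataset):
    """Hypothetical loader: pairs each NIR image with its binary fire mask."""
    def __init__(self, nir_dir, mask_dir, size=(240, 320)):
        self.nir_dir, self.mask_dir = nir_dir, mask_dir
        self.names = sorted(os.listdir(nir_dir))
        self.size = size  # (height, width); the paper uses 240 x 320
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        nir = Image.open(os.path.join(self.nir_dir, name)).resize(self.size[::-1])
        mask = Image.open(os.path.join(self.mask_dir, name)).resize(self.size[::-1])
        x = self.to_tensor(nir)                     # NIR image, float in [0, 1]
        y = (self.to_tensor(mask) > 0.5).long()[0]  # binary ground truth, shape (H, W)
        return x, y
```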
3.3 Segmentation Model Training
The proposed architecture for this paper is a modified version of the U-Net model [19], in which the number of filters in the deepest layers is reduced to shorten training and inference times.
Furthermore, we add batch normalization layers [12] after every convolutional layer. The final architecture is shown in Fig. 3; the numbers in black are the number of filters before pruning, and the numbers in blue are the number of filters after pruning.
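A minimal sketch of the building block implied by this description is given below (PyTorch); the exact filter counts of Fig. 3 are not reproduced here, so the widths shown are placeholders.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by batch normalization and ReLU,
    as used in the encoder/decoder stages of the modified U-Net."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Example encoder stage with an illustrative (reduced) number of filters:
encoder_stage_1 = conv_block(1, 16)  # single-channel NIR input, 16 filters
```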
As depicted in Fig. 2, every image was resized to a width of 320 and a height of 240 pixels for training, and the dataset was divided into 80% for training and 20% for testing.
The model was trained with a learning rate of 0.0001 for 350 epochs with a batch size of 5, using the cross-entropy loss and the Adam optimizer.
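A condensed training-loop sketch matching these hyperparameters is shown below; the ModifiedUNet class and the dataset paths are hypothetical, and CorsicanFireDataset refers to the loader sketched in Sect. 3.2.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, random_split

# 80/20 train/test split, as described in the paper (paths are illustrative).
dataset = CorsicanFireDataset("nir/", "masks/")
n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=5, shuffle=True)

model = ModifiedUNet(in_channels=1, num_classes=2)  # hypothetical class name
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(350):
    for images, masks in train_loader:
        optimizer.zero_grad()
        logits = model(images)            # (N, 2, H, W) per-pixel class scores
        loss = criterion(logits, masks)   # masks: (N, H, W) with values {0, 1}
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "unet_full_precision.pt")
```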
3.4 Optimization
Pruning. The pruning method (contained in the binarization block of Fig. 2) employed in the present paper is based on the work of [14]: as shown in Fig. 4, when a filter is pruned, its output feature map is removed, together with the kernels in the next convolutional layer that operate on that feature map.
Fig. 5 briefly explains the pruning process used for this paper, with which it was possible to reduce the number of filters in each convolutional layer by approximately 90%.
In Fig. 3, we can see the final architecture of the model, the numbers in blue being the number of filters after the pruning process.
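A simplified sketch of this filter-pruning step is given below, assuming filters are ranked by their L1 norm as in [14]; the helper only illustrates how a filter and the dependent kernels of the following convolution are removed, and omits the batch normalization layers and the full iterative pipeline used in the paper.

```python
import torch
import torch.nn as nn

def prune_conv_pair(conv, next_conv, keep_ratio=0.1):
    """Keep the `keep_ratio` fraction of filters with the largest L1 norm in `conv`,
    and drop the corresponding input kernels in `next_conv` (cf. Fig. 4)."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))     # one score per filter
    keep = torch.argsort(l1, descending=True)[:n_keep]

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       conv.stride, conv.padding, bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()

    pruned_next = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                            next_conv.stride, next_conv.padding,
                            bias=next_conv.bias is not None)
    pruned_next.weight.data = next_conv.weight.data[:, keep].clone()
    if next_conv.bias is not None:
        pruned_next.bias.data = next_conv.bias.data.clone()
    return pruned, pruned_next
```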
Quantization. The model was quantized using the Vitis AI quantizer module, resulting in a CNN model with all its values represented with only 8 bits; that is, the floating-point checkpoint model is converted into a fixed-point integer checkpoint.
After confirming that there was no significant degradation in the model's performance, the quantized model was compiled with the Vitis AI compiler, which creates an xmodel file with all the instructions required by the DPU to execute the model.
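The following sketch outlines post-training quantization and compilation with the Vitis AI PyTorch quantizer (pytorch_nndct) and the Vitis AI compiler; the checkpoint names, the calibration loader, the exported file name, and the DPU architecture file are illustrative, and the exact API may vary between Vitis AI releases.

```python
import torch
from pytorch_nndct.apis import torch_quantizer

# Load the pruned, full-precision model (hypothetical class/checkpoint names).
model = ModifiedUNet(in_channels=1, num_classes=2)
model.load_state_dict(torch.load("unet_pruned.pt"))
model.eval()
dummy_input = torch.randn(1, 1, 240, 320)

# 1) Calibration: collect activation statistics on a small calibration set.
quantizer = torch_quantizer("calib", model, (dummy_input,))
quant_model = quantizer.quant_model
with torch.no_grad():
    for images, _ in calib_loader:   # calib_loader: small DataLoader of NIR images
        quant_model(images)
quantizer.export_quant_config()

# 2) Export: re-run in "test" mode and dump the quantized model for compilation.
quantizer = torch_quantizer("test", model, (dummy_input,))
quantizer.quant_model(dummy_input)
quantizer.export_xmodel(deploy_check=False)   # written to ./quantize_result/

# 3) Compile for the target DPU (shell step, run from the Vitis AI environment):
#    vai_c_xir -x quantize_result/ModifiedUNet_int.xmodel -a arch.json \
#              -o compiled_model -n unet_fire
```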
3.5 Proposed FPGA-Based Smart Camera System
Fig. 6 shows the system implementation for the smart camera solution of wildfire detection. The processing system (PS) controls every step of the application’s life cycle, including retrieving images from the camera, feeding them to the programmable logic (PL) section of the SoC (hardware accelerator implementing the proposed model), and processing the segmented image. An IR camera is attached to the Ultra96 board using a USB port in the SoC.
The PS block (an ARM processor) pre-processes the input image before feeding it to the PL section, which runs the binarized U-Net model mapped into the reconfigurable fabric; the segmented image is then passed back to the PS block for feature extraction.
If a complete IIoT solution is implemented, these features may be displayed on a TFT screen or sent to the cloud via a communication protocol (e.g., LoRa).
These capabilities are not yet implemented here and are left for future work. For clarity, the AXI interconnect used for all communication between the PS, the PL, and the peripherals is not illustrated in the figure.
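On the board, the PS side of this pipeline can be sketched with OpenCV for image capture and the Vitis AI runtime (VART) for dispatching inference to the DPU; the model path, preprocessing, and output handling below are illustrative, and depending on the Vitis AI version the input may need to be scaled to int8 using the tensor's fix point.

```python
import cv2
import numpy as np
import vart
import xir

# Load the compiled model and create a runner for the DPU subgraph.
graph = xir.Graph.deserialize("unet_fire.xmodel")
dpu_subgraph = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
                if s.has_attr("device") and s.get_attr("device") == "DPU"][0]
runner = vart.Runner.create_runner(dpu_subgraph, "run")

in_dims = tuple(runner.get_input_tensors()[0].dims)   # e.g. (1, 240, 320, 1)
out_dims = tuple(runner.get_output_tensors()[0].dims)

cam = cv2.VideoCapture(0)                             # USB IR camera on the PS
ret, frame = cam.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
resized = cv2.resize(gray, (320, 240)).astype(np.float32) / 255.0

input_data = np.zeros(in_dims, dtype=np.float32)
input_data[0, ..., 0] = resized
output_data = np.zeros(out_dims, dtype=np.float32)

job_id = runner.execute_async([input_data], [output_data])  # inference on the PL
runner.wait(job_id)

fire_mask = np.argmax(output_data[0], axis=-1)        # per-pixel class on the PS
```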
In our experiments, the overall performance of the model with single-threaded execution was not satisfactory, as we obtained a throughput of only 15.77 FPS even after pruning and quantizing the model. Therefore, we explored a multi-threaded approach supported by the Ultra96-V2 board.
The use of this functionality enabled us to attain higher performance. The main limitation of the single-threaded approach is the bottleneck introduced by the DPU when performing inference on the FPGA, as each call introduces significant latency.
The multi-threaded implementation mitigates this problem by using queues to exchange information among the different threads. In Fig. 7, we provide a flow chart comparing both software implementations.
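As a minimal sketch of this scheme, assuming preprocess, run_dpu, and postprocess wrap the steps shown earlier and cam/runner are the camera and DPU runner objects created above:

```python
import threading
import queue

frames = queue.Queue(maxsize=8)     # preprocessed frames waiting for the DPU
results = queue.Queue(maxsize=8)    # segmentation maps waiting for post-processing

def capture_loop(cam):
    while True:
        ret, frame = cam.read()
        if ret:
            frames.put(preprocess(frame))           # blocks when the queue is full

def dpu_loop(runner):
    while True:
        results.put(run_dpu(runner, frames.get()))  # DPU inference on the PL

def postprocess_loop():
    while True:
        postprocess(results.get())                  # feature extraction on the PS (ARM)

# One thread per stage; depending on the DPU configuration, several inference
# workers (each with its own runner) can be launched to hide per-call latency.
for target, args in [(capture_loop, (cam,)), (dpu_loop, (runner,)), (postprocess_loop, ())]:
    threading.Thread(target=target, args=args, daemon=True).start()
```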
4 Results and Discussion
In this section, we discuss the results obtained from implementing the U-Net model for segmenting images of the Corsican Fire Database, comparing the original full-precision model and the optimized model running on the FPGA platform.
We also compare our results with previous works from the state of the art based on a number of metrics used in the literature, which are described in the next subsection.
After this, quantitative and qualitative results will be provided, based on these metrics, followed by a discussion of the obtained results.
4.1 Comparison Metrics
4.1.1 Matthews Correlation Coefficient
First proposed by Matthews [17], it measures the correlation of the true classes with their predicted labels [5]. The MCC represents the geometric mean of the regression coefficient and its dual, and is defined as follows [24]:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positive, true negative, false positive, and false negative pixels, respectively.
4.1.2 F1 score
Also known as the Dice coefficient or overlap index [21], the F1 score is the harmonic mean of the precision and the recall:
$$F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
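For reference, both metrics can be computed directly from the per-pixel confusion counts of a binary segmentation; a small NumPy sketch follows.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """MCC and F1 (Dice) for binary masks `pred` and `gt` of identical shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)

    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    return mcc, f1
```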
4.1.3 Hafiane Quality Index
Proposed by Hafiane et al. [11] for fire segmentation evaluation, it measures the overlap between the ground truth and the segmentation results while also penalizing over- and under-segmentation [11]. The authors first define a matching index between each reference region of the ground truth and the segmented region with which it overlaps most, and then combine this index with a term that penalizes over- and under-segmentation to obtain the final quality index; the complete formulation is given in [11].
4.2 Quantitative Results
Table 1 shows the results obtained by the final implementation of the optimized model on the FPGA in terms of MCC, F1 score, and Hafiane index. It can be observed that the pruned model presents a slight drop in performance (3% in MCC), whereas the FPGA model presents a slightly larger drop (of about 5% in both MCC and F1 score) across all metrics. This slight degradation is expected given the heavy optimization the model undergoes when passing from 64-bit to 8-bit data representation.
Model | MCC | F1 score | Hafiane |
Proposed Model Original (Validation) | 0.964 | 0.964 | 0.946 |
Proposed Model Original (Test) | 0.933 | 0.934 | 0.902 |
Proposed Model Pruned (Validation) | 0.964 | 0.965 | 0.941 |
Proposed Model Pruned (Test) | 0.924 | 0.926 | 0.877 |
Proposed Model FPGA (Validation) | 0.932 | 0.933 | 0.899 |
Proposed Model FPGA (Test) | 0.912 | 0.915 | 0.870 |
However, the gain in throughput (and thus inference time) is significant: the full-precision model runs at 8 FPS on a GPU, consuming a large amount of power, whereas our model attains up to 33.64 FPS on the selected FPGA when running in multi-threaded mode (15.77 FPS in single-threaded mode), for a fraction of the power consumption. Table 2 provides a comparison with other models in the literature.
Model | MCC | F1 score | Hafiane |
Akhloufi + Dice + NIR | 0.910 | 0.915 | 0.890 |
Akhloufi + Focal Tversky + NIR | 0.914 | 0.916 | 0.889 |
Akhloufi + Mixed focal + NIR | 0.828 | 0.843 | 0.802 |
Proposed Model Original (Test) | 0.933 | 0.934 | 0.902 |
Proposed Model FPGA (Test) | 0.912 | 0.915 | 0.870 |
A recent and thorough comparison of the state of the art carried out by Ciprián-Sánchez et al. [8] evaluated different architectures, image types, and loss functions on the Corsican Fire Database. Here, we compare the best models from this study (the Akhloufi architecture [2] with various losses) using the same metrics (i.e., MCC, F1 score, and Hafiane index).
From the table, it can be observed that the original model outperforms this previous work by about 2% (0.933 MCC), whereas the FPGA-implemented model attains a performance similar to that of the best Akhloufi configuration (0.912 vs. 0.910 MCC) while using a much smaller footprint.
4.3 Qualitative Results
Table 3 provides a qualitative comparison of the models listed in Table 1.
It shows the original images of the Corsican Fire Database and the segmentation results using the original model before the optimization process, after the pruning method, and finally, the final model used in the FPGA.
It can be observed that, for the 3 examples provided, both the pruned model and the FPGA implementation yield practically the same results as the full-precision model, albeit at a much higher frame rate (33 FPS vs. the 8 FPS of the U-Net running on a V100 GPU).
Such results can be used within the smart camera for higher-level image processing tasks in real time, such as fire spread prediction, by using the processing system (ARM processor) of the Ultra96-V2 platform.
5 Conclusions
In the present paper, we implement and analyze the performance of a smart camera system based on an FPGA accelerator.
A modified version of the U-Net architecture was used, to which optimization methods such as pruning and quantization were applied, effectively reducing the inference time while maintaining good results in the wildfire segmentation task. The frame rate obtained in the segmentation task was 33.63 FPS.
We believe there is still potential to improve inference speed through other strategies, such as converting CNN models into spiking neural networks (SNNs), which has been shown to reduce inference times by reducing the number of operations performed [13].
Finally, given the results obtained, we believe that computationally heavy tasks can benefit from FPGA-implemented accelerators for real-time applications such as wildfire surveillance using drones.