1 Introduction
COVID-19, the disease caused by the coronavirus SARS-CoV-2, is an acute and potentially fatal illness first identified in December 2019 in Wuhan, Hubei province, China. The virus spread worldwide with great speed [1], and the outbreak was declared a pandemic on March 11, 2020 [2]. As of October 31, 2020, 45,428,731 cases had been confirmed worldwide, with 1,185,721 deaths [3].
COVID-19 spreads through droplets of secretion released from the mouth and nose of an infected individual [4] and is transmitted by direct or indirect contact (through contaminated objects and surfaces) with the mucous membranes of the mouth, nose, or eyes. Symptoms may include dry cough, fever, headache, fatigue, shortness of breath, and loss of taste or smell. Symptoms usually appear 2 to 14 days after infection [5]. Early diagnosis is important because it is one of the most effective ways to stop the progression of the disease [6].
Studies have shown that the virus mainly attacks the lungs, which can lead to infection and lung disease [7]. For this reason, diagnosis based on a patient's chest computed tomography (CT) scan is highly relevant.
The main finding in a CT image of a COVID-19 patient is the presence of ground-glass opacity (GGO) [8,9]. Experts have identified three main types of anomalies in lung CT images related to COVID-19: ground-glass opacification, consolidation and pleural effusion [10,11].
Manual inspection is the main technique for deciding whether a patient is infected. However, the work is exhausting and there are not enough medical staff to do it. An automatic segmentation system is therefore needed to identify and delimit the boundary of the region of interest in the lung [12].
Deep Learning (DL), a subfield of Machine Learning, is a tool commonly used in research areas such as speech recognition, computer vision, natural language processing, and image processing [13]. The main advantage of DL methods is that they do not require experts to perform feature extraction; it is done automatically and implicitly by multiple flexible linear and non-linear processing units in a deep architecture.
In recent years, Deep Learning has been a useful tool for classifying medical images [14]. Among its techniques, the convolutional neural network (CNN) [15] stands out: a neural network inspired by the connectivity of the animal visual cortex. A CNN is a multi-layer neural network that applies convolution operations to the pixels of an image with minimal preprocessing. This technique extracts the relevant features from image sets and detects them regardless of their position.
Today, available computing power has made it possible to apply deep learning to a wide range of applications in the medical field, such as deciding whether a tumor is present in a radiograph [16] or detecting cardiovascular risk. For semantic segmentation, accuracy has improved steadily with models such as Fully Convolutional Network (FCN) [17], U-net [18], Fast RCNN [19] and Mask RCNN [18], among others.
Many models detect COVID-19 cases from chest X-ray images [–22], yielding a prediction value of 90% [23]. However, this kind of model cannot provide a quantitative analysis of infection severity because it only classifies between COVID-19 and regular pneumonia.
2 Related Work
2.1 Mask R-CNN
Mask R-CNN is a framework focused on instance segmentation. This task combines elements of object detection (classifying individual objects and localizing every instance with a bounding box) and semantic segmentation (classifying every pixel into a set of categories).
Figure 1 shows a representation of the Mask R-CNN framework, which contains two main phases. The first one consists of a Faster R-CNN architecture [19]. It has three elements: the backbone, the region proposal network (RPN) and the object detector [18]. The backbone uses a CNN architecture to extract image features and generate feature maps.
The RPN uses these maps to create proposed bounding boxes (anchors), dispersed over each feature map, for the object detection task. These anchors are classified into two classes: positive anchors (foreground), which are located in regions that contain features of the objects to be detected, and negative anchors (background), which are located outside these objects.
The positive anchors are used to perform a task called region of interest (ROI) alignment; they are centered on the located object and mark the ROIs for the next part. Object detection is the last part and classifies the object inside each ROI. The second phase consists of a new branch that performs instance segmentation on every detected object in the image. This new branch is a fully convolutional network that predicts a mask for each ROI [18].
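As a rough illustration of how anchors can be split into positive and negative sets, the sketch below labels an anchor by its best Intersection over Union (IoU) with the ground-truth boxes. The function names and the 0.7 / 0.3 thresholds are assumptions taken from common Faster R-CNN practice, not values reported in this work, and the rule is simplified (it ignores the usual extra step that promotes the best-matching anchor of each object).

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label one anchor as foreground, background, or ignored.

    pos_thr and neg_thr are assumed values, not taken from the paper.
    """
    best = max((box_iou(anchor, g) for g in gt_boxes), default=0.0)
    if best >= pos_thr:
        return "positive"   # foreground anchor
    if best < neg_thr:
        return "negative"   # background anchor
    return "ignored"        # ambiguous overlap, not used for training


gt = [(10, 10, 50, 50)]
labels = [
    label_anchor((12, 12, 50, 50), gt),      # large overlap
    label_anchor((100, 100, 140, 140), gt),  # no overlap
    label_anchor((30, 10, 70, 50), gt),      # partial overlap
]
```

Only the anchors labeled positive would be passed on as ROIs in this simplified view.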
2.2 Unet
Unet is one of the most popular models for image segmentation in the medical field. It was developed to analyze different types of images visually, and it is based on an encoder-decoder neural network architecture. There are two main parts: contractive and expansive. The contractive part is built from several convolution layers with filters of size 3 x 3 and strides in both directions, each followed by a ReLU layer.
This part is important because it extracts the essential features of the input; its result is a feature vector of a particular dimension. The second part recovers information from the contractive part by copying and cropping. The feature vector is then processed by convolutions to generate an output segmentation map. The main component of this architecture is the linking operation between the first and second parts.
In this way the network receives accurate information from the first part, so it can generate an accurate segmentation mask [18].
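The copy-and-crop link between the contractive and expansive parts can be sketched in a few lines. This is a minimal illustration, assuming feature maps are stored as lists of 2-D channels and that the encoder map is at least as large as the decoder map; the helper names `center_crop` and `copy_and_crop` are hypothetical.

```python
def center_crop(channel, out_h, out_w):
    """Crop the central out_h x out_w window from one 2-D channel."""
    h, w = len(channel), len(channel[0])
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return [row[left:left + out_w] for row in channel[top:top + out_h]]


def copy_and_crop(encoder_maps, decoder_maps):
    """Concatenate cropped encoder channels with decoder channels.

    Both arguments are lists of 2-D channels (lists of rows).
    """
    out_h, out_w = len(decoder_maps[0]), len(decoder_maps[0][0])
    cropped = [center_crop(c, out_h, out_w) for c in encoder_maps]
    return cropped + decoder_maps  # concatenation along the channel axis


# Two 6x6 encoder channels and three 4x4 decoder channels.
encoder = [[[10 * r + c for c in range(6)] for r in range(6)] for _ in range(2)]
decoder = [[[0] * 4 for _ in range(4)] for _ in range(3)]
merged = copy_and_crop(encoder, decoder)
```

The merged tensor carries both the fine spatial detail from the encoder and the upsampled context from the decoder, which is what the subsequent convolutions operate on.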
2.3 SegNet
SegNet is a deep fully convolutional neural network architecture for semantic segmentation [24]. It was originally designed for road and interior scene segmentation tasks. This requires the network to converge on an unbalanced dataset because the pixels of the road, sky, and buildings dominate. It consists of an encoder network, a corresponding decoder network, and a final pixel classification layer.
The encoder network is almost the same as the 13 convolutional layers of the VGG16 network [25]. The task of the decoder network is to map low-resolution encoder feature maps to full-input-resolution feature maps for pixel classification. The main feature of SegNet is the way the decoder upsamples its lower-resolution input feature maps: the decoder network uses the pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling.
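To make the index-based upsampling concrete, the sketch below pools a single 2-D feature map while remembering where each maximum came from, then places the pooled values back at those positions. It is a simplified, single-channel illustration in plain Python; the function names are hypothetical and a real SegNet decoder follows this step with trainable convolutions.

```python
def max_pool_with_indices(x, k=2):
    """k x k max pooling that also returns the position of each maximum."""
    h, w = len(x), len(x[0])
    pooled, indices = [], []
    for i in range(0, h - k + 1, k):
        prow, irow = [], []
        for j in range(0, w - k + 1, k):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(k) for dj in range(k)]
            value, pos = max(window, key=lambda t: t[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices


def max_unpool(pooled, indices, out_h, out_w):
    """Place each pooled value at its remembered position, zeros elsewhere."""
    out = [[0] * out_w for _ in range(out_h)]
    for prow, irow in zip(pooled, indices):
        for value, (r, c) in zip(prow, irow):
            out[r][c] = value
    return out


x = [[1, 2, 0, 0],
     [3, 4, 0, 5],
     [0, 0, 9, 8],
     [7, 0, 6, 0]]
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx, 4, 4)
```

The restored map is sparse: only the positions that won the max-pooling step are non-zero, which preserves boundary locations without learning any upsampling weights.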
2.4 Dense V-Net
Dense V-Net is a fully convolutional neural network that has performed well on the organ segmentation task. It can establish a voxel-to-voxel mapping between the input and output images [26].
It consists of three layers of dense feature stacks whose outputs are concatenated after a convolution in the skip connection and bilinear upsampling [27]. There are 723 feature maps that are computed using a convolution step.
It then continues with a cascade of convolutions and dense feature stacks to generate activation maps at three output resolutions. A convolution unit is applied at each output resolution to reduce the number of features. At the end it generates the segmentation logits.
Dense V-Net differs from V-Net [28] in several respects: the downsampling subnetwork is a sequence of three dense feature stacks connected by downsampling strided convolutions; each skip connection is a single convolution applied to the output of the corresponding dense feature stack; and the upsampling network comprises bilinear upsampling to the final segmentation resolution.
2.5 MaskLab
MaskLab is an instance segmentation model [29] that refines object detection with direction and semantic features and is based on Faster R-CNN [19]. This model produces three outputs: box detection, semantic segmentation logits for pixel classification, and direction prediction logits to predict the direction of each pixel around its instance center.
Because MaskLab is built on the Faster R-CNN object detector, the predicted boxes provide a precise location for each object instance. Within each region of interest, MaskLab performs foreground and background segmentation by combining semantic and direction prediction.
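One way to picture the direction target is to quantize the angle from a pixel towards its instance center into a fixed number of bins. The sketch below is only an illustration under stated assumptions: the function name, the use of 8 bins, and the (row, column) coordinate convention are hypothetical choices, not details given in this text.

```python
import math


def direction_bin(pixel, center, num_bins=8):
    """Quantize the direction from `pixel` towards `center` into a bin.

    Both points are (row, col). Bin 0 starts at angle 0 (pointing along
    increasing column) and bins advance with increasing atan2 angle.
    num_bins=8 is an assumed value.
    """
    dy = center[0] - pixel[0]
    dx = center[1] - pixel[1]
    angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
    bin_width = 2 * math.pi / num_bins
    return int(angle // bin_width) % num_bins


center = (5, 5)
bins = [
    direction_bin((4, 0), center),   # center lies mostly along +col
    direction_bin((0, 4), center),   # center lies mostly along +row
    direction_bin((6, 10), center),  # center lies mostly along -col
]
```

Pixels of the same instance that lie on opposite sides of its center fall into different bins, which is the cue that helps separate neighbouring instances of the same semantic class.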
2.6 MiniCovid-Unet
Ground-glass opacities are important features of COVID-19 infection regions in CT scans. However, these image characteristics cannot be extracted efficiently by conventional CNNs, where the original images are taken as input and learning begins from pixel-level features. Hence, to capture more regional features of infections, we use different filters to highlight the region of interest.
As shown in Figure 6, the proposed COVID-19 segmentation model uses a Unet-like structure as its backbone. There are two basic sections: contractive and expansive. We used the Leaky ReLU activation function in all blocks of layers because it is faster and reduces the complexity of the network. Every convolution layer has 32 filters for images of 512 x 512 pixels.
We use fewer convolution layers because adding more layers improved the results only slightly while increasing the training time and the computing resources needed. The proposed model performs well on computers with limited resources and is small enough to be used on a mobile device.
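For reference, the Leaky ReLU activation is a one-line function: positive inputs pass through unchanged and negative inputs are scaled by a small slope instead of being set to zero. The sketch below is a plain-Python illustration; the slope `alpha=0.01` is an assumed default, since the value used in the model is not stated here.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x >= 0, small linear slope for x < 0.

    alpha is an assumed value, not one reported in the paper.
    """
    return x if x >= 0 else alpha * x


def leaky_relu_map(feature_map, alpha=0.01):
    """Apply Leaky ReLU element-wise to a 2-D feature map (list of rows)."""
    return [[leaky_relu(v, alpha) for v in row] for row in feature_map]


activated = leaky_relu_map([[2.0, -2.0], [0.0, -0.5]], alpha=0.1)
```

Because negative inputs keep a non-zero gradient, units are less likely to become permanently inactive than with a plain ReLU.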
3 Materials and Methods
The images in the dataset are computed tomography (CT) scans from the Italian Society of Medical and Interventional Radiology [30]. The dataset contains one hundred single-slice CT scans in PNG format with dimensions of 512 x 512 pixels. It also includes masks that show the regions labeled by medical experts [31].
The original dataset contains three kinds of lesions related to COVID-19: ground-glass opacities, consolidation and pleural effusion (Figure 7). However, we only try to identify whether an image shows a lesion in the lung and where it is located. All images are of people who had been infected with COVID-19.
Mask R-CNN was trained on a total of 72 lung CT images and their lung segmentation mask labels; the original image size was kept and no data augmentation was used for training. The validation set used 18 images and their masks. Training ran for 16 iterations with 500 steps per iteration. The learning rate was 0.001. We set aside 10 images to visualize the performance of the model after it had been trained and validated on the training and validation sets.
For this experiment the backbone CNN architecture was ResNet50 because of the limited capacity of the graphics card [32]. The experiment used pre-trained COCO weights [18,33]. The total number of parameters for Mask R-CNN is 44,662,942.
The classes are imbalanced, because the task is to segment only the COVID-19 infected region. With this configuration we have two classes: COVID-19 region and non-COVID-19 region. There are many more pixels from healthy regions (2.4482e+7) than from infected regions (2.119975e+6), which gives an imbalance ratio of about 11.5 to 1. For this reason we chose evaluation metrics suited to the segmentation task.
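The imbalance figures can be checked with a few lines of arithmetic. The sketch below uses only the two pixel counts quoted in the text; the variable names are illustrative.

```python
# Pixel counts as reported in the text.
healthy_pixels = 2.4482e7
infected_pixels = 2.119975e6

# Ratio of majority (healthy) to minority (infected) pixels.
imbalance_ratio = healthy_pixels / infected_pixels

# Share of all labeled pixels that belong to the infected class.
infected_fraction = infected_pixels / (healthy_pixels + infected_pixels)

# Pixel accuracy of a trivial model that predicts "healthy" everywhere.
trivial_accuracy = 1.0 - infected_fraction
```

With these counts the ratio is roughly 11.5 and the infected class covers about 8% of the pixels, so a model that never predicts infection would still reach about 92% pixel accuracy. This is why overlap-based metrics are more informative than plain accuracy here.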
3.1 Implementation Details
The Jupyter notebook interactive development environment was used to build and visualize the model and results. Python 3.6 was used as the programming language, and the experiments were run on a personal computer with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (8 logical cores). An NVIDIA GeForce GTX 1050 Ti (GPU 0) with CUDA Toolkit 10.0 and cuDNN 7.4.1 was used to reduce the training time.
Because of the small GPU, training was configured to use one image per step and a small backbone (ResNet50) was required. On average the full execution of this model took 57 minutes.
3.2 Evaluation Metrics
In order to evaluate the performance of the models, we used the following classification and segmentation measures: precision, recall, Dice coefficient and mean Intersection over Union (mIoU). These metrics are also used in the medical field, and are defined below.
Precision is the ratio of pixels correctly predicted as COVID-19 to the total number of pixels predicted as COVID-19:
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of true positives (i.e., pixels correctly labeled as COVID-19) and FP is the number of false positives (i.e., pixels wrongly labeled as COVID-19).
Recall is the ratio of pixels correctly predicted as COVID-19 to the total number of actual COVID-19 pixels:
$$\text{Recall} = \frac{TP}{TP + FN}$$
where FN is the number of false negatives (i.e., COVID-19 pixels wrongly labeled as non-COVID-19).
However, these two measures are not frequently used as evaluation metrics because of their sensitivity to segment size; in other words, they penalize errors in small segments more than in large ones [28, 34, 35].
The Dice coefficient or Dice score (DSC) is an overlap metric for image segmentation:
$$\text{DSC} = \frac{2\,|A \cap B|}{|A| + |B|}$$
where A and B refer to the predicted and ground truth masks, respectively.
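The metrics above can be computed directly from a predicted binary mask and its ground truth. The following is a minimal plain-Python sketch, assuming masks are lists of rows of 0/1 values; the function names are hypothetical, Dice is written in its equivalent count form 2TP / (2TP + FP + FN), and the IoU line uses the standard definition TP / (TP + FP + FN), which is an assumption since the text does not define it. Returning 0.0 for empty denominators is also a convention chosen for this sketch.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def segmentation_metrics(pred, truth):
    """Precision, recall, Dice and IoU for the positive (COVID-19) class."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    return {
        "precision": _safe_div(tp, tp + fp),
        "recall": _safe_div(tp, tp + fn),
        "dice": _safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": _safe_div(tp, tp + fp + fn),
    }


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
metrics = segmentation_metrics(pred, truth)
```

In this toy example there are two true positives, one false positive and one false negative, so precision, recall and Dice all equal 2/3 while IoU is 0.5.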
4 Results and Discussion
All models used in this work predict a probability for every pixel, so a threshold must be set to decide whether a pixel belongs to the COVID-19 segment or to the healthy part. We selected a threshold value of 0.9 as the best for this task.
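The thresholding step amounts to an element-wise comparison. The sketch below assumes the probability map is a list of rows and that a pixel exactly at the threshold counts as infected; the text does not say whether the comparison is strict, so that detail and the function name are assumptions.

```python
def binarize(prob_map, threshold=0.9):
    """Turn per-pixel probabilities into a binary COVID-19 mask.

    Pixels with probability >= threshold are labeled 1 (infected),
    all others 0 (healthy).
    """
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


probs = [[0.95, 0.89, 0.10],
         [0.90, 0.50, 0.99]]
mask = binarize(probs)
```

A high threshold such as 0.9 keeps only confident predictions, which tends to favour precision over recall.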
We used five-fold cross-validation to evaluate the segmentation performance of the models on the COVID-19 dataset. First, we set aside 10 images to test each model after training. The remaining 90 images formed the new dataset to which five-fold cross-validation was applied.
We divided the new dataset into 5 parts; one part was selected as the validation set and the other four were used as the training set. When training had finished, the loss and metrics were calculated. We repeated the experiment until every part had been used as the validation set, and then averaged the metrics to obtain the performance evaluation value of the model.
Figures 8 and 9 show the loss during the training and validation phases. At the beginning of the training phase, the difference between the models is noticeable, but over time they all converge. The models only detect where the lesion is; they are not asked to predict the class of the lesion.
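The splitting procedure can be sketched as follows. This is a minimal standard-library illustration, not the code used in the experiments: the function name, the shuffling seed and the use of integer image identifiers are assumptions.

```python
import random


def k_fold_splits(image_ids, k=5, seed=0):
    """Yield (train_ids, val_ids) pairs for k-fold cross-validation."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)       # fixed seed for repeatability
    folds = [ids[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val


# 100 images in total: 10 held out for testing, 90 for cross-validation.
all_ids = list(range(100))
test_ids, cv_ids = all_ids[:10], all_ids[10:]
splits = list(k_fold_splits(cv_ids))
fold_sizes = [(len(train), len(val)) for train, val in splits]
```

With 90 images and five folds, every round trains on 72 images and validates on 18, and each image is used for validation exactly once. The per-fold metrics are then averaged to give the final score.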
Table 1 shows the metrics used to evaluate the performance of the models. The Dice metric compares the predicted segmentation pixels with their corresponding ground truth. Dense V-Net is the model with the best performance in terms of precision. On the other hand, the proposed model achieves the best performance with respect to the Dice and recall metrics.
Method | Dice | Precision | Recall |
Mask R-CNN | 0.7801 | 0.7857 | 0.7333 |
Unet | 0.8202 | 0.619 | 0.8667 |
SegNet | 0.8001 | 0.7667 | 0.8333 |
Dense V-Net | 0.7905 | 0.8667 | 0.8467 |
MaskLab | 0.7885 | 0.8001 | 0.8402 |
Proposed | 0.8301 | 0.8254 | 0.8684 |
All models were able to separate the foreground from the background; however, they were unable to identify the class of the lesion. Scores were higher during the training phase than during the validation phase, as can be seen in Figures 10 and 11.
4.1 Inference
Figure 12 illustrates the segmentation results of lung infections from an example of lung CT slices taken from the test set using different segmentation networks.
The original image is on the left side (a) and the expert-labeled mask is on the right side (b). All models located the correct position of the COVID-19 related lesion in the image, but did not recover its exact shape.
Unet misses small true infected areas. Mask RCNN determines the infected region better than Unet; however, some tissues close to infections are segmented incorrectly. Segnet and Dense V-Net perform well in segmenting medium-sized infection regions, but often overestimate them by labeling normal tissue as infection.
MaskLab cannot provide full segmentation of some regions. In contrast, the proposed MiniCovid-Unet outperforms the previous methods in the recognition and segmentation of small and medium infections.
The shape of the infected area was complex and could be located anywhere within the image, and the contrast between the infected and healthy parts was low.
In addition, the original Mask R-CNN model was pre-trained on a large collection of images of people and different types of objects, which could explain its lower score compared with MiniCovid-Unet.
Furthermore, the other models were unable to retrieve the exact shape of the COVID-19 lesion, as can be seen in Table 1.
5 Conclusion and Future Work
In this paper, we proposed the MiniCovid-Unet network, which has a novel structure for segmenting COVID-19 infection regions in lung CT slices. We also presented other models applied to detect and segment lesions related to COVID-19.
The models were selected because they are simple to implement for a custom dataset of images. However, a GPU is necessary in order to train them in a reasonable time.
All models were able to identify the regions where lesions were found, but had difficulties in correctly segmenting the shape of the lesion. Figure 12 shows that a healthy lung could be differentiated from a diseased one, and even completely healthy lungs could be detected. However, the results for the segmentation task were poor.
Although the models can identify the lesion, they do not indicate its type. We used a small dataset available for the segmentation task; nevertheless, the results obtained so far with MiniCovid-Unet show that deep learning is a viable aid for the objective diagnosis of COVID-19 using CT images of the lung.
As future work, we want to obtain more images to train the framework. We also hope to perform the segmentation taking into account the three classes present in the dataset.
We also plan to compare the model against others such as U-Net++ [36], which are frameworks focused on COVID-19 medical images.