Introduction
Left ventricle (LV) segmentation in ultrasound images is a crucial step for estimating the ejection fraction and assessing heart condition. Ultrasound offers real-time imaging at moderate cost and with no ionizing radiation. However, image quality is degraded by speckle noise and by acoustic shadows from adjacent structures. For these reasons, several automatic and semi-automatic methods have been developed to segment the LV in ultrasound images.
Some of the main approaches to LV segmentation have been active contours, such as geodesic models [1] and level sets [2]; deformable templates, such as shape models [3]; registration-based methods [4]; and supervised learning models, such as database-guided segmentation [5] and hybrid active appearance models [6]. International challenges have contributed to the publication of large image databases that enable the training and testing of supervised learning models [7]. Taking advantage of these public databases, machine learning models have shown high performance in 2D and 3D LV segmentation in echocardiography. An example of a machine learning method for LV segmentation is reported in [8], where a structured random forest was developed for automatic segmentation of the myocardium and the LV on an echocardiographic data set of 250 patients, including healthy athletes and cardiac patients. The random forest showed improved segmentation results compared with an active appearance model.
Statistical shape models (SSMs) and active shape models (ASMs) [9] have been extensively used, with continuous improvements. An enhanced ASM, as reported in [10], incorporates an adaptive strategy to construct appearance models of each landmark point, optimizing the number of principal components (PCs) using mean squared eigenvalue error (MSEE). Preprocessing with a Nakagami filter further improved segmentation results on good quality images from the CAMUS dataset. However, details on the ASM's initialization method were not provided in the study.
Recently, deep learning methods have been applied with good results to LV segmentation. A review of the application of convolutional neural networks (CNNs) to LV segmentation in ultrasound and MRI is presented in [11]. Fully convolutional networks [12] and the U-Net architecture [13], [14], [15] have been successfully applied to the classification of the pixels corresponding to the LV. Different SSMs, such as morphological models, snake models, and ASMs, have been combined with CNNs to create hybrid architectures in different studies. In [16], a hybrid method combining a fast region-based CNN and an ASM is reported for LV segmentation in ultrasound images: adaptive anisotropic diffusion filtering is applied to all images, the fast CNN detects the bounding box around the LV to initialize the ASM, and the ASM finds the final boundary of the LV. New CNN architectures have also been developed for the adjustment of a shape contour. In [17], an approach for left ventricle segmentation is reported in which a CNN first detects three landmarks (the apex and the starting and end points of the endocardium); the resulting triangle (start-apex-end point) is then used to initialize a deep snake [18], which is adjusted to the endocardium using circular convolution, with good results on the HMC-QU echocardiography data set. CNNs based on autoencoders, such as U-Net [19], have also shown good results when performing segmentation with reduced training sets; however, because these networks rely on semantic segmentation, they can produce regions (blobs) of misclassified pixels that are sometimes far from the organ to be segmented (see Figure 1).
In this work, we report a modified U-Net-based architecture that incorporates expert shape knowledge of the LV through a point distribution model (PDM) [9]. This approach diverges from the conventional pixel-by-pixel classification method discussed in [19]: our model instead generates statistically valid shapes for the left ventricle (LV). This alternative segmentation strategy avoids the production of blobs caused by pixel classification errors and contributes to more reliable LV segmentation. The training error of the proposed CNN was calculated as the RMSE between the expert pose and shape values and the CNN output at each epoch. During the test stage, our U-Net model estimates the shape and pose parameters corresponding to a non-training ultrasound image of the LV. The following sections present the details of the PDM of the LV, the U-Net architecture developed, and its training parameters. Section III reports the tests and results on the CAMUS [20] and EchoNet-Dynamic [21] databases. Section IV presents the discussion and conclusions.
Materials and methods
This section presents the proposed methodology. As shown in Figure 2, the methodology starts with the training of a U-Net convolutional network, which is fed with a set of echocardiography images (Figures 2a-2b). The target is to estimate the translation, rotation, and scale parameters, as well as a deformation vector (Figure 2c), which are used to adjust the mean shape obtained from the PDM and thus segment the left ventricle in the systole and diastole phases (Figure 2d). Finally, the proposed methodology is validated using the Dice coefficient and the Hausdorff distance. The details of each stage are explained in the following subsections.
Point distribution models
Point distribution models (PDMs) have been widely used for modeling complex structures, such as the organs of the human body [9], [22] and specifically in this paper the left ventricle. The model is based on a set of points known as “landmarks”, which must be a fixed number and should correspond to the same position along the contour of each example in the training set, as shown in Figure 3.
A shape vector is constructed by concatenating the [x, y] coordinates of the landmarks of each training shape. Subsequently, with this set of landmarks, the mean shape (x̄) is computed, and principal component analysis (PCA) yields the main modes of deformation, so that any shape x in the set can be approximated as:

x = x̄ + Pb (1)

Where:
P = principal eigenvector matrix.
b = deformation parameter vector.
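The PDM construction described above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code; the function names and the 90 % variance threshold (stated later in the paper) are assumptions.

```python
import numpy as np

def build_pdm(shapes, var_to_keep=0.90):
    """Build a point distribution model from aligned training shapes.

    shapes: array of shape (n_examples, 2 * n_landmarks), each row the
    concatenated [x, y] coordinates of one training contour.
    Returns the mean shape, the principal eigenvector matrix P, and the
    retained eigenvalues.
    """
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending order
    order = np.argsort(eigvals)[::-1]           # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # keep enough modes to explain ~90 % of the variance
    cum = np.cumsum(eigvals) / eigvals.sum()
    t = int(np.searchsorted(cum, var_to_keep)) + 1
    P = eigvecs[:, :t]
    return mean_shape, P, eigvals[:t]

def generate_shape(mean_shape, P, b):
    """Equation 1: x = mean_shape + P b."""
    return mean_shape + P @ b
```

With b = 0, `generate_shape` returns the mean shape; varying each component of b within a few standard deviations of its eigenvalue produces plausible LV contours.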
Shape and pose parameters.
In this work we constructed a PDM of the left ventricle with the purpose of modeling the deformation of each shape in the training set. The deformation of the PDM for each left ventricle shape can be calculated by solving Equation 1 for b, which yields Equation 2:

b = Pᵀ(x − x̄) (2)

where b contains the shape parameters of each training shape. The pose parameters (rotation, translation in X, translation in Y, and scale) were obtained as follows: for each contour in the training set, the rotation and scale were computed as suggested in [9], while the translations in X and Y were calculated as the mean coordinate value along each axis of each training example.
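The shape and translation parameters of a training contour can be recovered as below. This is a minimal sketch assuming P has orthonormal columns (so the projection is the least-squares solution of Equation 1); the rotation and scale alignment of [9] is not reproduced here.

```python
import numpy as np

def shape_parameters(x, mean_shape, P):
    """Solve Equation 1 for b: b = P^T (x - mean_shape).
    Valid because the columns of P are orthonormal eigenvectors."""
    return P.T @ (x - mean_shape)

def translation_parameters(x):
    """Translation in X and Y as the mean coordinate on each axis.
    x holds concatenated [x, y] landmark coordinates."""
    pts = x.reshape(-1, 2)
    tx, ty = pts.mean(axis=0)
    return tx, ty
```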
U-Net with statistical shape restrictions
We adapted a convolutional neural network architecture that uses the features extracted by the convolutional filters to predict the shape and pose parameters of the PDM of the left ventricle. In this way, all segmentations are valid shapes restricted by the training of the network, thus avoiding the blobs shown in Figure 1. The ultimate goal of this CNN is to predict the rotation, translation, and scale values and the deformation vector b corresponding to a given ultrasound image of the left ventricle.
Network architecture
To provide a better understanding of the proposed U-Net architecture (see Figure 4), it will be divided into 4 blocks:
Input block: An image input layer of size 256x256x1, of the form (width, height, channels).
Encoder - Decoder block: The encoder is formed by convolutional filters of size 3x3 and its goal is to extract relevant features from the image; the encoder path also reduces the spatial resolution of the extracted feature maps by applying max-pooling operations. The decoder path, in turn, upsamples the feature maps, restoring the spatial resolution of the input, while also performing convolutional operations. By employing the skip connections from the encoder, the decoder layers enhance their ability to detect and fine-tune the features within the image, as explained in [19].
Fully connected block: It consists of a flatten layer followed by a set of fully connected layers. The purpose of this block is to link the features extracted by the encoder-decoder block with the shape and pose parameters related to the input image.
Output block: Finally, at the end of the fully connected block, a regression layer predicts the desired parameters: rotation, translation, scale, and the b-vector.
Network Training
Using the pose and shape parameters described above, an input vector was constructed for each training image:

Vi = [Ii, θi, xi, yi, si, bi]
Where:
Ii = the i-th image of the training set.
θi = the i-th rotation value.
xi = the i-th x translation value.
yi = the i-th y translation value.
si = the i-th scale value.
bi = the i-th deformation vector.
The vector Vi then becomes the input of the proposed CNN. The intention of training a U-Net architecture is to take full advantage of its ability to extract features throughout the encoder stage; the idea is then to use these features to predict the pose and shape values of the left ventricle, assuming that each LV image corresponds to a single contour and that contours are never identical between patients. The network loss during the training stage is calculated as the RMSE between the shape and pose values predicted by the network (the output of the regression layer) and the ground truth contained in each training vector Vi. Through an iterative process called stochastic gradient descent (SGD), the network adjusts its weights according to the loss, trying to decrease it in each iteration; the lower the loss, the better the prediction of the pose and shape values.
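The construction of the training target and the RMSE loss can be sketched as follows. The exact ordering of the parameters inside the target vector is an assumption for illustration; the paper only lists its components.

```python
import numpy as np

def make_training_vector(image, theta, tx, ty, s, b):
    """Pack one training pair: the image (network input) and the
    regression target [theta, tx, ty, s, b_1 ... b_t].
    The component ordering here is hypothetical."""
    target = np.concatenate(([theta, tx, ty, s], np.asarray(b, float)))
    return image, target

def rmse(pred, target):
    """Training loss: root mean squared error between the network
    output and the ground-truth pose/shape vector."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```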
Predicted shape reconstruction
Once the network has been trained, it is possible to predict the values of θ, Tx, Ty, s and b following the diagram in Figure 2. These values allow the LV contour corresponding to the input image to be reconstructed: first, b is applied through Equation 1 to deform the mean shape, and then the pose parameters are applied as:

X = s R(θ)(x̄ + Pb) + T

Where:
θ = the rotation predicted by the CNN (R(θ) is the corresponding rotation matrix).
s = the scale predicted by the CNN.
T = [Tx, Ty] = the translation values predicted by the CNN.
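A minimal reconstruction sketch, assuming a standard 2D rotation matrix and row-vector landmark coordinates (the paper does not give these conventions explicitly):

```python
import numpy as np

def reconstruct_contour(mean_shape, P, b, theta, s, t_xy):
    """Rebuild the LV contour from the CNN outputs: deform the mean
    shape with b (Equation 1), then apply rotation theta, scale s and
    translation t_xy."""
    pts = (mean_shape + P @ b).reshape(-1, 2)   # landmarks as (n, 2)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return s * pts @ R.T + np.asarray(t_xy, float)
```

With zero deformation and an identity pose (theta = 0, s = 1, t = [0, 0]), the function simply returns the mean-shape landmarks.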
Validation
To validate the proposed method, we used two measures:
Dice Coefficient: This measure quantifies the overlap between the segmentation mask of the expert and the segmentation mask of our method. A larger overlap corresponds to a higher Dice coefficient, which ranges from 0 to 1; therefore, the larger the coefficient, the better the segmentation.
Hausdorff Distance: It measures the distance between two sets of points; in this case, between the contour marked by the expert and the contour reconstructed from the network outputs. A smaller Hausdorff distance indicates greater similarity between the two contours.
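Both measures can be implemented directly; this is a plain NumPy sketch (libraries such as SciPy also provide a `directed_hausdorff` routine):

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|), computed on boolean masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff_distance(pts_a, pts_b):
    """Symmetric Hausdorff distance between two (n, 2) point sets:
    the largest distance from a point in one set to its nearest
    neighbor in the other set."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```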
Results and discussion
This section presents how the training and test datasets were created, the CNN training parameters, and the segmentation results obtained by the proposed CNN. Data extraction and image preprocessing were performed in Python 3.7, while the development, training and testing of the proposed network were done in MATLAB 2022b.
Training dataset
A set of 800 images (400 systole and 400 diastole) was randomly selected from the CAMUS database [20]; this set includes images of good, regular, and poor quality. Data augmentation modifying rotation, translation, scale, brightness, and contrast was then applied to reach a total of 4800 training images; an example of these images is shown in Figure 5. With data augmentation we attained greater diversity in the training set and, consequently, enhanced the generalization achieved by the network.
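A simplified augmentation pass might look like the following. The jitter ranges are purely illustrative (the paper does not state them), rotation and scale are omitted for brevity, and the wrap-around behavior of `np.roll` stands in for proper border handling:

```python
import numpy as np

def augment(image, rng):
    """One random augmentation pass over a grayscale image:
    contrast and brightness jitter plus a small random translation.
    Ranges are hypothetical; np.roll wraps pixels at the borders."""
    out = image.astype(np.float32)
    out = out * rng.uniform(0.9, 1.1) + rng.uniform(-10, 10)  # contrast, brightness
    dx, dy = rng.integers(-10, 11, size=2)                    # translation in pixels
    out = np.roll(out, (dy, dx), axis=(0, 1))
    return np.clip(out, 0, 255)
```

Note that when the geometry of an image is changed (translation, rotation, scale), the corresponding ground-truth pose parameters must be updated consistently.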
Test dataset
The test set used to evaluate our method was divided into two parts. The first consists of 98 images from the CAMUS database, comprising 49 systole images and 49 diastole images (hold-out set). The second consists of non-training images from the EchoNet-Dynamic database [21], extracted from .AVI videos at the end of the systolic and diastolic cycles; this extraction was performed by Cervantes-Guzmán in [24]. These images were subsequently annotated by an expert and the segmentation masks were obtained. The total number of systole images is 207, while for diastole it is 211. It is worth noting that the images from the EchoNet-Dynamic database have a resolution of 112x112 pixels, so they were scaled to 256x256, the resolution at which the proposed network is trained.
CNN training parameters
The training parameters for the CNN were defined as follows:
Training Images = 4560 images.
Validation images = 240 images.
Number of epochs = 20.
Learning rate = 0.0001.
Batch size = 32.
Encoder depth = 3.
Start filters = 64.
A batch size of 32 images is generally manageable on most modern GPUs, ensuring efficient use of hardware resources without causing memory issues. Additionally, this batch size introduces a moderate amount of noise in the gradient updates, providing some regularization without being too small. Empirical tests showed an acceptable training time, with the model converging effectively without overfitting. The root mean squared error (RMSE) was used as the loss function since the final layer of the network is a linear regression layer; the RMSE penalizes larger errors more heavily, leading to better weight adjustments in the network. The average RMSE in the training stage was 9.86.
This architecture was implemented on a PC with 32 GB of RAM, GPU NVIDIA Tesla K40c and a NVIDIA Tesla T4 working in parallel into an Ubuntu server environment.
PDM training parameters
These parameters were selected as follows. The number of landmarks (64) provides a point density that can represent the LV contour with good quality, as depicted in Figure 6. The explained variance, as shown in [9], is typically about 90 %, characterizing the most significant shape information provided by the PCA. Finally, the training set was the same as the one used for the CNN training.
Segmentation results
In this section, we present the results obtained by our method on the test dataset. Figures 7 and 8 show examples of segmentation results on the CAMUS database during systole and diastole, while Figures 9 and 10 illustrate the segmentation for both phases on images from the EchoNet-Dynamic database. Table 1 displays the average Dice coefficients for both test sets. Table 2 compares our method against others described in [20] and [22]; the average Dice value in that table is the average obtained by the CNN on the CAMUS dataset only. Furthermore, we categorized the obtained segmentations by their Dice coefficient into three categories: good, regular, and bad; the thresholds used for this categorization are shown in Table 3. Figures 11a and 11b show the median for each category for systole images from the CAMUS and EchoNet-Dynamic databases, while Figures 11c and 11d display the medians for diastole images in the same databases.
| Metric | CAMUS systole | CAMUS diastole | EchoNet-Dynamic systole | EchoNet-Dynamic diastole |
|---|---|---|---|---|
| Hausdorff distance (px) | 25.92 | 37.55 | 28.95 | 31.20 |
| Dice coefficient | 0.71 | 0.66 | 0.64 | 0.65 |
Results analysis
In this section, we discuss the results presented above. The first point to address is the performance of our U-Net, both on the CAMUS database it was trained on and on the EchoNet-Dynamic database, which contains completely new images for the network to segment. As observed in Figures 11a-11d, the performance is very similar on both, which suggests that data augmentation prevented significant overfitting. The performance of the "good" category is competitive in both cases (CAMUS and EchoNet-Dynamic) with respect to the results presented in [20]. On the other hand, in Figures 7 to 10 it is evident that all segmentations produced by our method show shape characteristics very similar to those of the left ventricle (LV): they exhibit smooth shapes located on the correct side of the image, with scale and rotation close to the expert annotation. Even in the case of poor segmentations (see Figure 12), the obtained contours preserve the shape qualities of the left ventricle. This is the effect of the statistical shape constraint arising from training the network with parameters derived from a PDM of the LV. It represents an advantage over convolutional networks based on semantic segmentation, which, when they fail, produce classification errors like the one shown in Figure 1. Such misclassifications are challenging to correct, since contour extraction yields noisy shapes that are sometimes located far from the LV, a situation in which our method proves more robust. Another factor that impacts our method is image quality: ultrasound already poses a significant challenge due to speckle noise, and the images in both databases are not of the highest quality.
This affects the segmentations obtained by our method and makes it difficult to find the optimal features corresponding to the pose and shape values during network training, which explains the low average Dice in the tests performed. However, unlike semantic segmentation, these segmentations can be corrected by another method that performs LV contour fine-tuning.
Conclusions
In this paper we presented an alternative use of CNNs: predicting LV shape and pose parameters, taking advantage of convolutional layers to extract features from the ultrasound image, which results in segmentations with statistically valid shapes because the network is trained with the shape parameters of a PDM. Although the overall segmentation accuracy is currently not high, when high quality images are selected the accuracy of our method compares favorably with previously published work. Our method avoids the appearance of blobs, and because the segmentation is always a statistically valid shape, it can serve as an initialization mechanism for another method to refine the result, starting from a smooth shape rather than from a noisy one that may lie far from the ventricle, as can happen when extracting the contour from a semantic segmentation CNN. In conclusion, this method explores the possibility of generating segmentations with statistical shape restrictions using the power of CNNs, and it can also be used as an automatic initialization method to later fine-tune the predicted LV segmentation.
Author contributions
E.G.G. conceptualized the project, participated in the data curation, carried out formal analyses, contributed to the investigation, the development of the methodology and the use of specialized software, validated results and wrote the manuscript in its different stages of development. F.T.R. carried out formal analyses, participated in the development of the methodology and in all stages of the writing of the manuscript. J.L.P.G. carried out formal analyses and investigation, and participated in the development of the methodology and the writing, reviewing and editing of the manuscript. B.E.R. conceptualized and oversaw the project, carried out formal analyses, obtained resources for the project, and participated in the writing, reviewing and editing of the manuscript. F.A.C. conceptualized and oversaw the project, carried out formal analyses, obtained resources for the project, and participated in the writing, reviewing and editing of the manuscript. All authors reviewed and approved the final version of the manuscript.