1 Introduction
China reported a COVID-19 mortality rate of 2% at the outset, on January 23, 2020; by July 9, 2021, the rate had risen to 4.7%. While this value remains relatively low compared with the mortality rates of other diseases, the disease is difficult to detect because of a 5-day delay before symptoms appear and the presence of asymptomatic cases, which has hindered efforts to isolate infected individuals and contain the virus. This situation led to a significant increase in cases, which ultimately collapsed healthcare systems in many countries [1].
In the Americas, Mexico in particular has experienced a high COVID-19 mortality rate. This is attributed to the fact that only individuals suspected of having the disease are tested, which leaves moderate and asymptomatic cases out of the statistics.
However, this elevated mortality rate underscores the critical importance of a swift response within the healthcare system, as it can mean the difference between life and death.
Different countries have implemented varying public health measures, leading to distinct outcomes, including differences in mortality rates and the emergence of new virus variants, which have affected younger populations.
For example, in the USA only 2% of children were affected initially, but this figure rose to 24% before widespread vaccination.
In Mexico, measures such as early isolation and quarantine, the implementation of an epidemiological risk traffic light system adopted by different states, social distancing, temperature checks in crowded establishments, and daily reporting to raise awareness among the population have been put in place on a voluntary basis.
Nevertheless, the country has experienced two waves and is on the brink of a third, with infection rates 60% higher than in the initial wave, driven by the α, β, γ, and δ variants of the virus.
The goal of this study is to design a predictive model for the risk of COVID-19-related patient mortality using Machine Learning algorithms.
These algorithms will be applied to data from over seven million COVID-19 patients, sourced from the website of the General Directorate of Epidemiology of the Secretary of Health of the Government of the Mexican Republic.
Machine Learning is a subset of Artificial Intelligence that encompasses supervised learning, used for prediction and classification. It seeks to uncover dependencies or structures between input and output variables. As the volume of available data for learning increases, algorithms adaptively enhance their performance [2, 3].
2 Related Works
In [4], the authors achieved a predictive accuracy of 94.99% for COVID-19 infection outcomes using logistic regression, decision trees, support vector machines, Naive Bayes, and artificial neural networks. They conducted their analysis with data from 263,007 patients in Mexico.
The authors of [3] estimated the main conditions that increase the risk of death in patients associated with COVID-19 in Mexico by applying logistic regression to the records of 1,048,575 patients, achieving an accuracy of 87%.
In [5], with a database containing more than 2,670,000 confirmed COVID-19 cases from 146 countries, the authors applied various Machine Learning Algorithms to predict the risk of death. The results demonstrated an accuracy of 89.98%.
They utilized 57 features categorized into symptoms, pre-existing conditions, and demographic information.
In [7], machine learning algorithms were applied to electronic health records from a US hospital, involving 966 patients, to predict the number of days patients would remain hospitalized.
3 Methodology
The proposed model is depicted in Figure 1, and each of its components will be elucidated in the subsequent paragraphs.
3.1 Data Extraction
Data from the website of the General Directorate of Epidemiology of the Secretary of Health of the Government of the Mexican Republic [8] was extracted. This dataset, covering the period from March 2020 to May 31, 2021, includes records for 7,042,816 patients associated with COVID-19.
It encompasses 40 characteristics described in a file available on the same page, referred to as a data dictionary. This data dictionary contains the keys necessary for understanding the database, and we have organized it in Table 1.
Hospital | Geography | Dates | Patient | Diseases | Results | Services |
USMER | Nationality | Update date | Gender | Pneumonia | Laboratory sample | Hospitalized |
Sector | Birth entity | Date of admission | Age | Diabetes | Laboratory result | Intubated |
Entity of the medical unit | Residence entity | Date of symptoms | Pregnancy | COPD | Sample antigen | ICU |
 | Municipality Residence | Date of death | Register ID | Asthma | Result antigen | |
 | Migrant | | Indigenous speaking language | Immunosuppression | Final classification | |
 | Nationality country | | Indigenous | Hypertension | | |
 | Country of origin | | | Another complication | | |
 | | | | Cardiovascular | | |
 | | | | Obesity | | |
 | | | | Chronic kidney disease | | |
 | | | | Smoking | | |
 | | | | Another case | | |
3.2 Data Exploration and Visualization
From March 2020 to May 31, 2021, 7,042,816 individuals aged zero to 120 years were attended for COVID-19.
Among all patients associated with COVID-19, the average age is 40 years and the most frequent age is 28 years. Among those who died, by contrast, the average age is 63 and the most frequent age is 65. Most of the patients were women, although most of those who died were men.
Figure 2 shows this distribution. According to the records, 293,913 individuals associated with COVID-19 died, representing approximately 4.2% of the 7,042,816 patients.
We emphasize that cardiovascular problems, hypertension, diabetes, and obesity prevailed among the deceased. This is not very different from the conditions most common among all patients associated with COVID-19, namely hypertension, obesity, diabetes, and smoking. According to the records, as shown in Figure 3, 69% of the deceased suffered from pneumonia.
Patients with severe COVID-19 required hospitalization, intensive care, and intubation. However, graph B in the same figure shows that most of them died without having been hospitalized, intubated, or admitted to the intensive care unit.
Of all the patients associated with COVID-19 who were intubated, 76% died, and of those who were admitted to the Intensive Care Unit, 48% died.
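The summary statistics in this section (average age, most frequent age, and the share of deaths) are simple to reproduce. The following is a minimal sketch using only Python's standard library; the function name `summarize_ages` and the toy ages are illustrative assumptions, while the two patient counts are the ones reported above.

```python
import statistics


def summarize_ages(ages):
    """Return the mean and the most frequent age of a group of patients."""
    return statistics.mean(ages), statistics.mode(ages)


# Toy ages for illustration only; they are NOT taken from the dataset.
toy_ages = [28, 28, 35, 40, 47, 62]
mean_age, modal_age = summarize_ages(toy_ages)

# Share of deaths among all attended patients, from the counts in the text.
total_patients = 7_042_816
total_deaths = 293_913
death_share_pct = 100 * total_deaths / total_patients

print(f"mean age: {mean_age}, most frequent age: {modal_age}")
print(f"deaths: {death_share_pct:.2f}% of attended patients")
```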
3.3 Data Cleaning, Feature Selection, and Feature Transformation
Although there were no missing data in the database, records labeled 97, 98, and 99 were found, indicating 'not applicable', 'ignored', or 'not specified', respectively. These labels are due to the state of emergency and the need to transfer patients directly to the ventilation and intensive care areas.
Therefore, these patients were not considered in the study.
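The exclusion rule above can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' code: the column names, the toy rows, and the use of 1 and 2 as yes/no codes are hypothetical stand-ins for the real data dictionary.

```python
# Codes described in the text as 'not applicable', 'ignored', 'not specified'.
NON_INFORMATIVE = {97, 98, 99}


def is_complete(record, columns):
    """Return True when none of the given columns holds a 97/98/99 code."""
    return all(record[col] not in NON_INFORMATIVE for col in columns)


def drop_incomplete(records, columns):
    """Keep only the records whose checked columns are all informative."""
    return [r for r in records if is_complete(r, columns)]


# Hypothetical column names and toy rows (1 = yes, 2 = no is an assumption).
coded_columns = ["pneumonia", "diabetes", "obesity"]
records = [
    {"age": 63, "pneumonia": 1, "diabetes": 2, "obesity": 2},
    {"age": 40, "pneumonia": 99, "diabetes": 2, "obesity": 1},
    {"age": 98, "pneumonia": 2, "diabetes": 98, "obesity": 2},
    {"age": 97, "pneumonia": 1, "diabetes": 1, "obesity": 2},
]
clean = drop_incomplete(records, coded_columns)
print(len(clean), "of", len(records), "records kept")
```

Note that the check is applied only to the coded columns: `age` is left out on purpose, because 97 to 99 are valid ages and would otherwise be mistaken for the special codes.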
Twelve characteristics were chosen:
– Age,
– Gender,
– Pneumonia,
– Diabetes,
– COPD,
– Asthma,
– Immunosuppression,
– Hypertension,
– Cardiovascular,
– Obesity,
– Chronic kidney disease,
– Smoking.
All variables are binary, except 'age', which was discretized using a threshold based on data characteristics. In this case, discretization was done according to the distribution of deceased patients, as shown in Figure 2 a).
Mean and standard deviation were used as criteria for this discretization. Out of the 7,042,816 COVID-19-associated patients, after cleaning the database, 6,711,412 individuals remained, of whom 290,285 died.
Therefore, an equal number of surviving patients, specifically 290,285, was randomly selected to balance the dataset, resulting in a total of 580,570 records. The feature matrix X thus has dimensions 580,570 × 12, and the output vector Y has dimensions 580,570 × 1.
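The discretization and balancing steps described above can be sketched as follows. The text does not state the exact threshold rule, so the rule used here (one standard deviation below the mean age of the deceased) is purely an assumption, as are the function names and the toy data; only the count 290,285 comes from the text.

```python
import random
import statistics


def age_threshold(deceased_ages):
    """ASSUMED rule: mean minus one standard deviation of the deceased ages."""
    return statistics.mean(deceased_ages) - statistics.pstdev(deceased_ages)


def discretize_age(age, threshold):
    """Map an age to a binary feature: 1 at or above the threshold, else 0."""
    return 1 if age >= threshold else 0


def balance_by_undersampling(rows, labels, seed=0):
    """Keep every deceased row (label 1) and an equal-sized random sample of
    survivors (label 0). Assumes survivors are at least as numerous."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    keep = pos + rng.sample(neg, len(pos))
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data for illustration only.
ages = [25, 30, 41, 52, 58, 63, 65, 70, 77, 84]
died = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]
threshold = age_threshold([a for a, d in zip(ages, died) if d == 1])
X = [[discretize_age(a, threshold)] for a in ages]
X_bal, y_bal = balance_by_undersampling(X, died)

# Size of the balanced dataset reported in the text: two classes of 290,285.
balanced_size = 2 * 290_285
print(len(y_bal), sum(y_bal), balanced_size)
```

Random undersampling discards most survivor records; the fixed `seed` only makes the toy run repeatable and is not a detail given in the text.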
The dataset was divided into two groups: a training set, from which the algorithm learns the properties of the data, and a test set, with which we validate the method.
The split was made so that the training set retains 75% of the data and the remaining 25% constitutes the test set; reserving a percentage of the data is important for verifying the operation of the model.
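The 75%/25% split can be illustrated with a short standard-library sketch. The function name, the fixed seed, and the choice to round the test size up are assumptions made for the example; the text does not specify how the split was implemented.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


# 580,570 is the size of the balanced dataset reported in the text.
train_idx, test_idx = split_indices(580_570)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Shuffling before splitting matters here because the balanced dataset may be ordered by class; without it, one of the two sets could end up with a single class.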
4 Results and Discussion
After selecting the training and test datasets, four machine learning algorithms were employed: Logistic Regression, Naive Bayes, Decision Trees, and Random Forests.
Python was the programming language used for this task, together with Scikit-learn, a machine learning library for Python, which was employed to carry out the data mining and analysis tasks. The results presented in Table 2 show an accuracy of 87% for every algorithm except Naive Bayes, which reaches 84%.
Machine Learning Method | Accuracy |
Logistic Regression | 0.87 |
Naïve Bayes | 0.84 |
Decision Tree | 0.87 |
Random Forest | 0.87 |
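A comparison of this kind can be sketched with Scikit-learn as follows. This is an illustration under assumptions, not the authors' script: the data are synthetic, the Naive Bayes variant (`BernoulliNB`) and all hyperparameters are guesses, and the accuracies it prints have no relation to Table 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 12 binary features (NOT the real dataset).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 12))
# Hypothetical rule: the outcome depends noisily on the first two columns.
y = ((X[:, 0] + X[:, 1] + rng.random(2000)) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": BernoulliNB(),  # assumed variant for binary features
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out test set.
    scores[name] = model.score(X_test, y_test)

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```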
The confusion matrices are presented in Figure 4. To evaluate the performance of the proposed methods, precision, recall, and F1-score were calculated from the confusion matrices; the results are presented in Table 3.
Algorithm | Precision | Recall | F1-score |
Logistic Regression | 0.901 | 0.851 | 0.875 |
Naïve Bayes | 0.910 | 0.800 | 0.850 |
Decision Tree | 0.876 | 0.872 | 0.874 |
Random Forest | 0.876 | 0.874 | 0.875 |
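For reference, the three metrics follow directly from the confusion-matrix counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of the two. The sketch below uses hypothetical counts (the actual confusion matrices are in Figure 4) and then recomputes F1 from the precision and recall values listed in Table 3 as a consistency check; the function names are illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_from(precision, recall)


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

# Consistency check using precision and recall as listed in Table 3.
table3 = {
    "Logistic Regression": (0.901, 0.851),
    "Decision Tree": (0.876, 0.872),
    "Random Forest": (0.876, 0.874),
}
recomputed = {name: round(f1_from(*pr), 3) for name, pr in table3.items()}
print(recomputed)
```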
The proposed model was then tested again, this time on all the patients associated with COVID-19 recorded in June 2021, a total of 402,116, obtaining an accuracy of 91%, as shown in Table 4.
Machine Learning Method | Accuracy |
Logistic Regression | 0.91 |
Naïve Bayes | 0.89 |
Decision Tree | 0.91 |
Random Forest | 0.91 |
The proposed model shows good classification performance, as evidenced by the ROC curve in Figure 5 and the obtained AUC-ROC value of 0.92, which demonstrates its discriminative ability.
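The AUC-ROC can be read as the probability that a randomly chosen deceased patient receives a higher predicted risk than a randomly chosen survivor. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions, and the pairwise loop is meant for small examples rather than for a dataset of this size.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted risks, for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```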
To identify unnecessary variables, an ablation study was conducted by removing one of the 12 features at a time and rerunning the algorithm. As reported in Table 5, accuracy does not undergo significant changes.
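The leave-one-feature-out procedure can be sketched as a short loop. Here `toy_evaluate` is a hypothetical placeholder for "train the model on these features and return its test accuracy"; its numbers are invented and do not reproduce Table 5.

```python
FEATURES = [
    "Age", "Gender", "Pneumonia", "Diabetes", "COPD", "Asthma",
    "Immunosuppression", "Hypertension", "Cardiovascular", "Obesity",
    "Chronic kidney disease", "Smoking",
]


def ablation_study(features, evaluate):
    """Score the full feature set, then each set with one feature removed.

    Returns the baseline score and, per removed feature, the change in score.
    """
    baseline = evaluate(features)
    deltas = {}
    for removed in features:
        reduced = [f for f in features if f != removed]
        deltas[removed] = evaluate(reduced) - baseline
    return baseline, deltas


def toy_evaluate(features):
    """Hypothetical scorer: NOT a real model, just a stand-in for the demo."""
    return 0.5 + 0.03 * len(features)


baseline, deltas = ablation_study(FEATURES, toy_evaluate)
for name, delta in deltas.items():
    print(f"without {name}: {delta:+.3f}")
```

A change close to zero for a given feature suggests that the remaining features carry the same information, which is how the "no significant change" result above would show up in such a loop.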
5 Conclusions
The model achieves 87% accuracy in the first test and 91% accuracy in the second, which uses June data, improving upon the proposal from [5]. This can aid in designing strategies and public policies to combat the spread of COVID-19 and reduce mortality. This machine learning proposal is valuable for organizing and planning hospital triage strategies. It also contributes to public health policies aimed at reducing the high risk of contagion in individuals who have one or more concurrent conditions such as diabetes, hypertension, COPD, obesity, asthma, smoking, cardiovascular issues, and immunosuppression, all of which have become critical factors in the mortality of patients associated with COVID-19.