1. Introduction
Design problems are internal structures of source code that challenge design principles or rules (Oizumi et al., 2016; Sharma & Spinellis 2018; Suryanarayana et al., 2014). Typically, they arise when the source code of applications undergoes constant changes, for example, to accommodate new features or even evolve existing ones. These changes can lead to improper changes characterized by the scattering and tangling of software concerns, violating design principles such as the single-responsibility principle. The violation of rules and design principles gives rise to symptoms of poor design (Palomba et al., 2017) due to the presence of code anomalies, also popularly known as bad smells (Oizumi et al., 2016).
Previous empirical studies (Oizumi et al., 2016; Palomba et al., 2018b; Palomba et al., 2017) point out that bad smells indicate design problems. Couplers and bloaters are examples of bad smells (Suryanarayana et al., 2014), which represent excessive coupling, and large proportions of source code elements, respectively. As a result, maintenance tasks become error prone. Documenting architectural decisions through UML models (Júnior et al., 2021; Petre, 2014) could help identifying such design problems. In practice, UML models end up not being created or updated, making it difficult to identify and resolve code anomalies. As a result, developers tend to rely on tacit knowledge to prevent further violations of design principles.
In this context, predicting design problems can play an essential role, especially identifying defective architectural modules in advance. The greater the forecast of the appearance of design problems, the greater the anticipation capacity to address such problems. An alternative would be to generate predictions of design problems, for example, based on size, complexity, coupling, and cohesion as the source code is modified on-demand. Today, literature lacks studies that provide an overview of the state of the art on the subject of predicting source code design problems. Having a literature mapping would help software developers to identify potential techniques as well as researchers to explore gaps in the state of the art. The absence of this mapping leads developers, for example, to adopt techniques considering the opinion of experts.
This article, therefore, proposes a systematic mapping of the literature to fill this gap. We classify the current literature and pinpoint trends and challenges worth investigating in this research field. Our mapping study follows well-known practical guidelines (Petersen et al., 2015; Petersen et al., 2008). Moreover, it is based on the authors’ experience in carrying out previous mapping studies (Bischoff et al., 2021; Carbonera et al., 2020; Menzen et al., 2021; Vieira & Farias, 2020). In total, 35 primary studies were selected, analyzed, and categorized after applying a careful filtering process to a sample of 894 candidate studies to answer six research questions.
The main results show that (1) 54.29% (19/35) of the primary studies use machine learning techniques, and (2) most of the primary studies explored Bloaters (62.86%, 22/35), architectural problems (54.29%, 19/35), and couplers (42.86%, 15/35). Our findings also indicate that predicting design problems is still in an incipient field of research, showing plenty of room for future works. Lacking techniques that can proactively anticipate the appearance of design problems, allowing for an early refactoring action to preserve the internal quality indicators of source code. Lastly, this article reports worth investigating challenges regarding the prediction of design models that can proactively influence the source code’s quality. Rather than exhausting this topic, our article seeks to serve as a starting point for future investigations.
We outline the main concepts discussed throughout the article, and compare our article with similar ones, highlighting their differences and commonalities (Section 2). In addition, we present the study protocol (Section 3) and introduce the procedures adopted to filter potentially relevant studies (Section 4). After that, we present the collected results (Section 5) and outline additional discussions and future directions (Section 6). Lastly, we discuss some limitations and threats to validity (Section 7) and draw some conclusions (Section 8).
2. Background and related work
2.1. Design problems
Design problems are indicators of situations that improperly affect software quality attributes such as understandability, testability, extensibility, reusability, and maintainability in general. Improving maintainability is one of the cornerstones of making software evolution easier (Alkharabsheh et al., 2019). A design problem does not produce compile-time or run-time errors, but it does negatively affect the system quality attributes, such as understandability, testability, extensibility, reusability, or maintainability in general (Garcia, 2011; Yamashita & Counsell, 2013). Design problems are indicators of refactoring opportunity (Bavota et al., 2015) or even signs that can lead to software design failure (Budgen et al., 2008).
According to Fowler and Kent (Fowler, 2018) , code smells can be defined as a potential indication of problems in the source code of information systems. Thus, it is important to define new approaches to spot them whenever possible. Some systematic literature reviews (SLRs) have been conducted in the area of bad smells (Al-Shaaby et al., 2020; de Paulo Sobrinho et al., 2018; Fernandes et al., 2016; Lacerda et al., 2020; Rasool & Arshad, 2015; Sabir et al., 2019; Santos et al., 2018). The code smell detection process has motivated many researchers to propose different methods to deal with the occurrence of code smells in systems. Nowadays, machine learning techniques are utilized to address code smell issues with promising results (Al-Shaaby et al., 2020; Alkharabsheh et al., 2018; Arcelli Fontana et al., 2016; Di Nucci et al., 2018). A machine learning classifier needs first to be trained using a set of code smell examples to generate a model. The generated models are then used to identify or detect code smells in unseen or new instances (Al-Shaaby et al., 2020).
However, the volume of research in the domain has multiplied over more than 17 years of activity in the field. This has led to the need for critical integration and evaluation of the available research in design smell detection. In our opinion, understanding this area is important because, nowadays, a considerable number of software projects have large dimensions, so manual design smell detection is not a realistic task. Problems are latent in code; detection usually occurs very late, and then, the solutions are very complex. As a consequence, the software quality is negatively affected and technical debt increases, so redoing the software becomes the most realistic option. We believe, from a comparative perspective, that while refactoring has been extensively adopted by the software industry. Design smell detection is far from that reality (Alkharabsheh et al., 2019).
2.2. Prediction of design problems
The prediction of design problems is an attempt to anticipate design problems based on metrics using specific techniques. Some techniques analyze software project histories, trying to guess the behavior of the software in the future with respect to its coding (Kaur & Mittal, 2017; Kim et al., 2008; Palomba et al., 2018a; Palomba et al., 2018; Rani & Chhabra, 2017). Predicting the location can increase the chances of identifying possible anomalies in the source code, in addition to assisting in testing activities, which aim to identify possible problems with the quality of the software. There are several benefits of prediction, such as support and planning for software testing, identification of code snippets where improvements are needed, reduction of code defects, and mainly planning of future efforts within the software project, software developers in prioritizing their work on the most severe smells (Alkharabsheh et al., 2018).
2.3. Related work
Current studies pay close attention to empirical studies’ elaboration to produce evidence-based knowledge and use purely statistical methods to predict the quality of the source code. However, little has been done to create a systematic map of studies published in recent decades. There are studies in the literature that also point to problems that impact the final quality of the software, (Alenezi et al., 2016; Boehm et al., 2019; Chen et al., 2018; Ibarra & Muñoz, 2018; Martínez-Fernández et al., 2019; Wong et al., 2018), these studies mainly deal with the final quality of the software. The team involved in the software design must have version control of the artifacts and the software itself, a prerequisite for its modeling and construction to identify possible failures that may occur.
Moreover, works are developed in the area of Software Analytics (Abbes et al., 2011; Yamashita et al., 2007), tooling supports, and good development practices for improved source code quality. Barbosa et al. (2020) report that software design quality degradation can be avoided, reduced, or accelerated depending on the developers’ communication dynamics and the specific roles performed within the software project. Understanding the role of communication dynamics and the content involved in the discussion is important to avoid software design anomalies. Social metrics can also be indicators of design decay when analyzing the two aspects together.
It is appointed in (Uchôa et al., 2021; Uchôa et al., 2020) that existing studies tend to analyze the degradation of the software design and consider unique events like introducing a single design problem or simply analyzing the rate of degradation. However, understanding how design degradation evolves between reviews is still challenging. The impact of modern code review on the evolution of project degradation is analyzed. Since the code review also aims to improve the quality of design and evaluations can be expected to gradually reduce over time various symptoms of software degradation. To this end, they have investigated retrospectively 14.971 code reviews of seven software systems belonging to two large open source communities. Design degradation characteristics were analyzed in revisions and within reviews. How design discussions tend to impact design degradation.
The paper (Alkharabsheh et al., 2019) presents a systematic mapping, focusing on all types of design smell (bad smell, anti-patterns, disharmonies, using design smell as a unifying term) and, on the other hand, on the detection activity and some other related activities, such as specification, correcting (refactoring), and prioritization.
Providing an overview and discussing the use of machine learning approaches in the field of bad smell work (Azeem et al., 2019) presents a systematic literature review (SLR) on machine learning techniques for smell detection code. It is considered articles published between 2000 and 2017. Starting from an initial set of 2.456 articles, only 15 of them took machine learning approaches. These studies address four different perspectives: (i) considered code smells, (ii) configuring machine learning approaches, (iii) design of evaluation strategies, and (iv) a meta-analysis of the performance achieved by the models hitherto proposed.
The analyzes performed show that Class of God and long method and functional decomposition and spaghetti code have been widely considered in the literature. Decision trees and support vector machines are more machine learning algorithms commonly used to code smell detection. Models based on large set of variables had a great performance. JRip and random forest are the classifiers more effective in terms of performance. The analysis also reveals the existence of several open questions and challenges that the research community should focus on in the future. Providing an overview and discussing the use of machine learning approaches in the field of design problems.
This paper presents a systematic mapping that differs from the previously mentioned studies by focusing on the one hand, on all types of code smell and design smell (bad smell, anti-patterns, disharmonies, among others, using design smell and code smell as a unifying term) and, on the other hand, on the detection activity. Consequently, the main goal of our systematic mapping study is to collect and organize the knowledge on design smell detection in general (approaches, tools, techniques, datasets, and quality factors) and not just tools, as is the case of some related work.
3. Planning
This section introduces an outline of the research protocol used to carry out our mapping study. Section 3.1 presents the central objective and research questions of our study. Section 3.2 defines the search strategy for selecting works already published. Section 3.3 explains how the search string was defined and the electronic databases were chosen. Section 4.4 describes the exclusion and inclusion criteria used to filter works.
Figure 1 introduces the planning followed to run our study. This protocol addresses the steps and guidelines for conducting systematic mapping studies in software engineering (Kitchenham & Charters, 2007; Kitchenham et al., 2011; Petersen et al., 2008; Petersen et al., 2015). Additionally, our protocol was also inspired on our previous mapping studies (Carbonera et al., 2020; Bischoff et al., 2021; Menzen et al., 2021; Vieira & Farias, 2020)
3.1. Objective and research questions
The main objective of this mapping study is to introduce a broad and careful view of previous works considering the theme of predicting design problems. In particular, this work aims to provide an overview of the current literature by filtering (Section 4) and classify articles (Section 5), pointing out gaps, open challenges and research opportunities that are worth exploring by the scientific community (Section 6).
We formulated six research questions, described in Table 1, to address this objective. These questions seek to explore some pivotal issues, including the types of design problems most commonly investigated, the prediction aspects considered to predict design problems, the prediction techniques used to anticipate the future occurrence of design problems, the main contribution reported in each selected study, the research methods applied to run the research and the research venue chosen to publish the articles.
Research Question | Motivation | Variable |
RQ1: What are the design problems explored by prediction techniques? | Reveal the most common explored types of design problems. | Types of design problems |
RQ2: What aspects are considered for predicting design problems? | Understand the different aspects considered for predicting software design problems. | Prediction aspect |
RQ3: Which techniques have been used to predict design problems? | Pinpoint is the most used prediction technique. | Prediction technique |
RQ4: What is the main contribution of the primary studies? | Identify the explored contributions. | Main contribution |
RQ5: What are the empirical methods used to evaluate the prediction of design problems? | Identify the research methods used to evaluate the prediction. | Research method |
RQ6: Where have the studies been published? | Reveal the target venues used to report the results. | Research venue |
3.2. Search strategy
Based on the research questions, we determined a set of key terms to support searching for potentially relevant articles. Key terms were defined based on well-known empirical guidelines (Al-Qudah et al., 2015; Petersen et al., 2008; Wohlin et al., 2012). Table 2 shows the main terms and alternative terms. The selection of these terms considered their relevance to the research field, as well as strategies used and validated in previous works (Bischoff et al., 2021; Luz & Farias, 2020; Menzen et al., 2021).
3.3. Elaboration of the search string
Steps to define the search strings. The main steps followed to define the search string were: (1) Identify candidate keywords, reading studies (e.g., (Fowler, 2018; Oizumi et al., 2016; Palomba et al., 2017; Sharma & Spinellis, 2018; Suryanarayana et al., 2014) chosen by relevance and convenience to pinpoint the main terms; (2) Identify closely related words and alternative terms or synonyms related to the candidate keywords; (3) Check through an initial search if the terms are in articles widely known in the field - an interactive and incremental process run by the authors; and (4) Join the alternative terms using logical operator “OR”, and the main terms using logical operators “AND”. Previous works (Petersen et al., 2015; Wohlin et al., 2012) were also considered to formulate the search string. The combinations that produced the most significant results are shown as follows:
(design problem OR bad smell OR code smell OR code anomaly or design smell) AND (prediction OR forecast OR foresee OR anticipate) AND (maintenance OR evolution OR development OR review OR refactoring)
Electronic databases. After determining our search string, the next step was to identify the electronic databases and retrieve potentially relevant studies. Table 3 details the electronic databases used to search for studies for preparing the systematic mapping. These electronic databases were selected for three reasons. First, these databases have a large and representative number of articles published related to the research topic explored in our mapping. Second, they have been used extensively in systematic mapping studies, pointing out their usefulness and effectiveness. Third, previous studies (El Koutbi et al., 2016; Kitchenham, 2012; Kitchenham et al., 2011; Kuutila et al., 2020; Petersen et al., 2015), have demonstrated the effectiveness of such electronic databases used to perform literature reviews.
Source | Electronic Address |
---|---|
ACM DL | https://dl.acm.org |
IEEE | https://ieeexplore.ieee.org |
Science Direct | https://www.sciencedirect.com |
Scopus | https://www.scopus.com |
Elsevier | https://www.elsevier.com |
Google Scholar | https://scholar.google.com |
3.4. Exclusion and inclusion criteria
We apply a set of inclusion and exclusion criteria to the returned articles after applying the search string to the digital libraries. These criteria establish rules to make the filtering as objective and auditable as possible, avoiding biases usually found in manual tasks performed by humans. The inclusion criteria (CI) considered were:
CI1: Published articles, journals, in an event or periodical that deals with the evaluation of prediction techniques, or source code design problems, whether general-purpose or not.
CI2: Studies published between 2005 to 2020;
CI3: Studies published in Portuguese or English. We agree that the most relevant research in the computing area is published in English-language vehicles, although there are exceptions; and
CI4: Studies that contain key terms: Prediction, Bad Smell, Software, or Design.
The exclusion criteria (CE) considered were:
CE1: The title, summary or even its content without relation to the search string.
CE2: Short studies (up to 4 pages) written in another language, other than Portuguese or English;
CE3: Duplicate studies.
CE4: Abstract did not address any aspect of the research questions;
CE5: Older versions of published studies prior to 2005.
CE6: Studies that are narrowly related to Software Engineering, Software Development and/or contrary to research questions; and
CE7: The full text did not address issues considering the prediction of design models;
3.5. Data extraction
The data extraction procedures consist of a careful reading of each selected work and storing the extracted data in an online Google Spreadsheet. This spreadsheet served as a basis for the collection and synchronization of data extraction actions by the authors. Each primary study was carefully read, and its data extracted to answer the research questions formulated. The extraction process was iterative and incremental, aiming that the authors could collect and audit the data. This made it possible to align the way data was collected and to detect any incorrect collection procedures.
In particular, the articles were classified according to the type of study performed (Petersen et al., 2015): (1) Evaluation study: a specific problem is defined, proposing a solution and conducting empirical analysis, to point out the advantages and disadvantages; (2) Philosophical studies: a taxonomy or conceptual framework is proposed as a way to outline a research area; (3) Experience article: An experience report on the theme of prediction of design problems. Typically, these studies explain what and how something was done in practice; (4) Opinion article: Someone’s personal opinion about predicting design problems. The report does not have a clear methodology, nor related work, focusing on the opinion itself; (5) Solution proposal: A proposed solution for a given problem is presented. The evaluation sticks to the execution of examples or the elaboration of prototypes, rarely to the execution of robust empirical studies; and (6) Validation search: Studies that typically perform experimental studies to evaluate solutions, approaches, techniques or processes that have not yet been used in real-world settings.
4. Study filtering
With the search string and the exclusion and inclusion criteria defined, the next step is to define the article search strategy. The filtering process was made up of five steps performed sequentially. The focus was on selecting a sample of representative studies from a sample of potentially relevant ones. Figure 2 illustrates the results collected from the execution of each step. Each step is described as follows:
Step 1: Initial search. It gathers the initial results obtained after applying the search string in the electronic databases (Table 3). In total, 894 candidate studies were recovered.
Step 2: Exclusion criteria. Three exclusion criteria (EC1, EC2, and EC3) were applied to remove impurities. Some studies were withdrawn due to the absence of any semantic relationship to its title, abstract, or even content, considering the theme investigated in this research (that is, out of scope). In addition, studies that were not written in English or Portuguese were also discarded. In total, 180 studies (20.13%) continued in the next stage, while 714 works were discarded. Calls for conference articles, special issues of journals, patent specifications, research reports, and no peer-reviewed material were examples of discarded materials.
Step 3: Filter by similarity. This step also discarded the studies that were selected by the search string, however, their content was not closely related to the research questions, or they had no close relationship with the study area, e.g., software development and prediction of design problems. The CE5 and CE6 were applied. For that, 38.33% (69 out of 180) of the studies were filtered.
Step 4: Filter by abstract. Exclusion criteria (CE4 and CE7) were applied to remove the studies considering their abstract, and after their full text. In total, 39 studies were removed, leaving 30 studies (43.47%) for the next step.
Step 5: Addition by snowballing. Some studies may not have been located, although the search engines used are widely qualified. To mitigate this threat, studies have been added using the snowballing method (both backward and forward) (Jalali & Wohlin et al., 2012; Wohlin, 2014). After selecting the studies in step 04, a manual analysis of the references and citations of the hitherto filtered studies was performed. Five studies were incorporated. The snowball method in this paper is run three times to add new selected articles.
The search was performed in the first two months of 2021. Finally, 35 studies were filtered as the most representative, hereinafter called primary studies (Table 4). The number of citations shown in Table 4 was found in Google Scholar and retrieved in January 2021.
ID | Title | Year | #Cit | #Ref |
---|---|---|---|---|
A1 | Identifying Architectural Problems through Prioritization of Code Smells (Vidal et al., 2016a) | 2016 | 17 | 25 |
A2 | Are SonarQube Rules Inducing Bugs? (Lenarduzzi et al., 2020) | 2020 | 02 | 32 |
A3 | Do Code Smells Impact the Effort of Different Maintenance Programming Activities? (Soh et al., 2016) | 2016 | 28 | 40 |
A4 | Code smells detection 2.0: Crowdsmelling and visualization (dos Reis et al., 2017) | 2017 | 3 | 50 |
A5 | Code-Smell Detection as a Bilevel Problem (Sahin et al., 2014) | 2014 | 56 | 73 |
A6 | LDFR: Learning deep feature representation for software defect prediction (Xu et al., 2019) | 2019 | 0 | 118 |
A7 | Static Code Analysis of IEC 61131-3 Programs: Comprehensive Tool Support and Experiences from Large-Scale Industrial Application (Prähofer et al., 2016) | 2017 | 20 | 25 |
A8 | BDTEX: A GQM-based Bayesian approach for the detection of antipatterns (Khomh et al., 2011) | 2011 | 106 | 33 |
A9 | Schedule of Bad Smell Detection and Resolution: A New Way to Save Effort (Liu et al., 2011) | 2012 | 106 | 54 |
A10 | Improving Design Smell Detection for Adoption in Industry (Alkharabsheh et al., 2018) | 2018 | 02 | 28 |
A11 | Detecting Code Smells using Deep Learning (Das et al., 2019) | 2019 | 0 | 27 |
A12 | Deviance from perfection is a better criterion than closeness to evil when identifying risky code (Kessentini et al., 2010) | 2010 | 73 | 28 |
A13 | Evolution of legacy system comprehensibility through automated refactoring (Griffith et al., 2011) | 2011 | 13 | 30 |
A14 | Automatically classifying source code using tree-based approaches (Phan et al., 2018) | 2016 | 8 | 42 |
A15 | Detecting Android Smells Using Multi-Objective Genetic Programming (Kessentini & Ouni, 2017) | 2017 | 13 | 36 |
A16 | A hierarchical method for detecting codeclone (Devi & Punithavalli, 2011) | 2011 | 1 | 20 |
A17 | Are automatically-detected code anomalies relevant to architectural modularity?: an exploratory analysis of evolving systems (Macia et al., 2012) | 2012 | 93 | 49 |
A18 | On the Relation between External Software Quality and Static Code Analysis (Plosch et al., 2008) | 2008 | 14 | 19 |
A19 | Do code smells reflect important maintainability aspects? (Yamashita & Moonen, 2012) | 2012 | 165 | 40 |
A20 | Adaptive Detection of Design Flaws (Kreimer, 2005) | 2005 | 50 | 47 |
A21 | Detecting code smells using machine learning techniques: Are we there yet? (Di Nucci et al., 2018) | 2018 | 43 | 90 |
A22 | An empirical study to improve software security through the application of code refactoring (Mumtaz et al., 2018) | 2018 | 12 | 86 |
A23 | Automatically identifying code features for software defect prediction: Using AST N-grams (Shippey et al., 2019) | 2019 | 8 | 81 |
A24 | Code smell severity classification using machine learning techniques () | 2017 | 34 | 42 |
A25 | Change Prediction through Coding Rules Violations (Tollin et al., 2017) | 2017 | 6 | 12 |
A26 | Visual Indicator Component Software to Show Component Design Quality and Characteristic (Irwanto, 2010) | 2010 | 2 | 11 |
A27 | Iterative software fault prediction with a hybrid approach (Erturk & Sezer, 2016) | 2016 | 28 | 64 |
A28 | Using (Bio)Metrics to Predict Code Quality Online (Müller & Fritz, 2016) | 2016 | 32 | 77 |
A29 | Less is more: Minimizing code reorganization using XTREE (Krishna et al., 2017) | 2017 | 15 | 68 |
A30 | Bad-smell prediction from software design model using machine learning techniques (Maneerat & Muenchaisri, 2011) | 2011 | 40 | 14 |
A31 | On the criteria for prioritizing code anomalies to identify architectural problems (Vidal et al., 2016b) | 2016 | 8 | 9 |
A32 | Software Defect Prediction via Convolutional Neural Network (Li et al., 2017) | 2017 | 84 | 50 |
A33 | A Hybrid Approach To Detect Code Smells using Deep Learning (Hadj-Kacem & Bouassida, 2018) | 2018 | 5 | 37 |
A34 | Predicting Design Impactful Changes in Modern Code Review: A Large-Scale Empirical Study (Uchôa et al., 2021) | 2021 | 0 | 76 |
A35 | JSpIRIT: A Flexible Tool for the Analysis of Code (Vidal et al., 2015) | 2015 | 47 | 20 |
Legend:
#Cit: Number of citations
#Ref: Nuber of references
5. Results
This section presents the results obtained after classifying the primary studies (Table 4) to answer the formulated research questions (Table 1).
5.1. Objective and research questions
Table 5 presents the design problems investigated by the primary studies. The main feature is that the majority of the primary studies explored Bloaters (62.86%, 22/35), Architectural Problems (54.29%, 19/35), and Couplers (42.86%, 15/35). The classification in Table 5 is based on the work of (Mantyla et al., 2003) categorized all of Fowler’s code smells except for Incomplete Library Class and Comments smells into five categories: Bloaters, Object Orientation Abusers, Change Preventers, Dispensables, Encapsulators, and Couplers. The study outlines the existence of several correlations among smells belonging to the same category. Moha and Guéhéneuc (2007) propose a taxonomy of smells and describe some correlations among design smells, such as Blob and (many) Data Class, or Blob and (Large Class and Low Cohesion). The categories of code smells we considered are based on the classification proposed in (Mäntylä & Lassenius, 2006), where the smells are classified according to some of the common concepts shared by the smells within one category.
Classification | Amount | Percentage | List of primary studies |
---|---|---|---|
Bloaters | 22/35 | 62.86% | [A4], [A5], [A8], [A9], [A11-15], [A17-18], [A20-22], [A24-25], [A28-30], [A32-33], [A35] |
Architectural | 19/35 | 54.29% | [A1], [A3-7], [A9], [A11-20], [A31], [A33], |
Problems | [A20-22], [A24-25], [A28-30], [A32-33], [A35] | ||
Couplers | 15/35 | 42.86% | [A3], [A5-6], [A9], [A15], [A17-18], [A20-22], [A29-30], [A32-33], [A34] |
Dispensables | 12/35 | 34.29% | [A5], [A16], [A19], [A21-23], [A29-30], [A32-33], [A34-35] |
Object-Orientation Abusers | 6/35 | 17.14% | [A1], [A23], [A26-27], [A30], [A35] |
Change Preventers | 4/35 | 11.43% | [A5], [A19], [A22], [A29] |
Technical Debt | 1/35 | 2.86% | [A2] |
There are two interesting findings when comparing this result with studies already published. First, there may be a relationship between the most frequently explored design problems with the diffuseness of design smells. Previous empirical studies (Palomba et al., 2018b; Sjøberg et al., 2012) revealed a relationship between the diffusion of bad smells and the size and complexity of the source code. Palomba et al. (2018b) point out that the smelly diffuseness is associated with the size and complexity of the source code.
This smelly diffuseness typically addresses bloater smells, including Long Method, Large Class, Primitive Obsession, Long Parameter List, among others. Typically, these smells appear gradually throughout the source code, as the source code undergoes frequent evolution or maintenance tasks, remaining in the absence of a refactoring effort to eradicate them. Sjøberg et al. (2012) reveal that the size of classes often impacts maintainability more than the presence of bad smells. The result highlights a higher frequency of studies concerned with predicting the appearance of Bloaters. This concern makes sense when empirical findings have already revealed their harmful effect on maintainability.
The second finding would be that unlike exploring specific design problems, the primary studies explored more than one. On average, the primary studies investigated at least two design problems. Previous empirical findings already point out that the presence of multiple code smells in classes tends to increase the change- and fault-proneness (Khomh et al., 2012; Palomba et al., 2018b), and design problems can arise from this clustering of code smells (Oizumi et al., 2016). In this sense, exploring more than one design problem makes sense and would be supported by the findings already reported in the literature.
5.2. RQ2: What aspects are considered for predicting design problems?
Table 6 presents the commonly used aspects for predicting design problems in the primary studies. Understanding the considered aspects of the source code is essential to pinpoint which features are relevant, for example, to anticipate design problems.
Classification | Amount | Percentage | List of primary studies |
---|---|---|---|
Code complexity | 27/35 | 77.14% | [A2], [A5], [A7-9], [A11-15], [A17], [A19-30], [A32-33], [A34-35] |
Size | 27/35 | 77.14% | [A2], [A5], [A7-9], [A12-25], [A27-30], A32-33], [A34-35] |
Inheritance | 21/35 | 60% | [A5], [A7-9], [A11-13], [A15], [A17-18], [A21-24], [A26-27], [A29-30], [A32-33], [A35] |
Coupling | 18/35 | 51.43% | [A1], [A3], [A5], [A15-18], [A20-24], [A26-27], [A29-30], [A32], [A35] |
Cohesion | 14/35 | 40% | [A11-13], [A15], [A18-22], [A24], [A27], [A29], [A32], [A35] |
Agglomerations | 3/35 | 8.57% | [A1], [A31] |
Others | 11/35 | 31.43% | [A4-8], [A10], [A12-14], [A28], [A34] |
The collected results indicate a tendency to use structural properties that can be calculated by metrics, including Code complexity (77.14%, 27/35), Size (77.14%, 27/35), Inheritance (60%, 21/35), Coupling (51.43%, 18/35), Cohesion (40%, 14/35), and Agglomerations (8.57%, 3/35) in a smaller amount. These results corroborate with previous studies, revealing that such structural properties can be predictors of design problems and bugs (Moha et al., 2009; Nagappan et al., 2006; Palomba et al., 2017; Yamashita et al., 2007). Palomba et al. (2017) use structural properties of source code to propose a smell-aware bug prediction model, including code complexity, coupling, cohesion, lines of code, coupling dispersion, among others. Zimmermann et al. (2007) indicate a positive correlation between code complexity and bugs. Nagappan et al. (2006) also examined the use of metrics to predict buggy components across 5 Microsoft projects.
Moreover, the primary studies explored source code design problems computed from pure object-oriented code, e.g., pure Java code. However, real-world applications rarely have pure object-oriented code. Typically, software systems are built from the composition of pure object-oriented code together with numerous annotations of frameworks and architectural styles, such as @RestController and @PostMapping from Spring Platform - i.e., annotations for REST web controller and mapping HTTP requests onto specific handler methods, respectively. Thus, computing and predicting design problems from semantically enriched code would require understanding the meaning of annotations. For example, predicting design problems in source code with Spring Boot platform annotations would require dealing with semantic and structural aspects - a challenging and ever-present problem, since semantic information related to annotations is rarely formally specified.
5.3. RQ3: Which techniques have been used to predict design problems?
Table 7 introduces the collected data related to the techniques used to predict bad smell problems investigated by the selected studies. The main feature is that most primary studies used Machine Learning Techniques (54.29%, 19/35). The overall ranking accuracy of Machine Learning (ML) models is used to measure the performance of different methods and approaches. On the other hand, ML algorithms can be divided into 3 categories: supervised learning, unsupervised learning, and reinforcement learning (Chinnamgari, 2019; Dharmadhikari et al., 2011).
Classification | Amount | Percentage | List of primary studies |
---|---|---|---|
Machine Learning | 19/35 | 54.29% | [A2-4], [A6], [A8], [A11-12], [A14], [A20-21], [A23-25], [A27-28], [A30], [A32-33], [A34] |
Decision tree | 7/35 | 20% | [A2], [A6], [A14], [A23-25], [A34] |
Random forest | 7/35 | 20% | [A2], [A6], [A21], [A24-25], [A30], [A35] |
Rules/Heuristics | 6/35 | 17.14% | [A5], [A8], [A11-12], [A14-15] |
Prioritization Criteria | 6/35 | 17.14% | [A1-2], [A6], [A12], [A24], [A31] |
Linear regression | 4/35 | 11.43% | [A2], [A24], [A11] |
Logistic regression | 3/35 | 8.57% | [A2-3], [A24], [A30] |
Bagging | 2/35 | 5.71% | [A2], [A8] |
Others | 7/35 | 20% | [A1], [A5-7], [A10], [A34] |
Among the selected works, we highlight the use of algorithms with a supervised learning classification and regression method. Decision Tree (20%, 7/35) and Random Forest (20%, 7/35), followed by Rules/Heuristics (17.14%, 6/35) and Prioritization Criteria (17.14%, 6/35) and Linear Regression (11.43%, 4/35) and Logistic Regression (8.57%, 3/35) and Bagging (5.71%, 2/35) and Others (20%, 7/35). Note that primary studies generally explored Machine Learning and used algorithms for training problem prediction models. It is essential to highlight and compare this result with the published study that notes that further studies are needed to consider the use of cluster learning, multi-classing and resource selection techniques for code smells detection (Al-Shaaby et al., 2020).
In an attempt to anticipate the location of defects in an application through the use of specific techniques, the primary studies explored more than one bad smell according to the classification in Table 5, evaluating the results present in Table 7, we can say that apprenticeship is proposed to improve the performance of software problem classifiers, combining different classifiers and methods in defect prediction.
The results indicate that the use of Learning Techniques is part of an in-depth analysis of the performance index of software bug prediction models. Therefore, future efforts will be dedicated to analyzing the contribution of information related to the detection of bad smell in the context of models. Prediction of local learning bug. Finally, the future research agenda includes the definition of new factors that influence the performance of forecasting models (Palomba et al., 2017).
The vast majority of selected primary studies use the practice prioritizing bad smells. Several automated approaches are proposed to generate rules that can detect bad smells in static software codes. A rule is a combination of quality metrics and their threshold values to detect a specific type of Bad Smells. The use of static code analysis tools coupled with machine learning is used to compare the power of prediction of failure propensity for software quality violations, applying several models for comparing the predictive power of bad smells or possible violations software quality related to the pre-defined metrics in each selected primary study.
5.4. RQ4: What is the main contribution of primary studies?
The collected results indicate a tendency to use techniques machine learning in Table 7, used the practice of prioritizing bad smells as described in Table 5, the use of analysis combined with machine learning is used to compare the power of prediction of failures of software quality violations according to the results present in Table 6 where the main contributions of the selected primary studies are classified in Process (34.28% 12/35), Method (31.42% 11/35), Model (28.57% 10/35), Metric (2.85% 1/35) and Tool (2.85% 1/35). We can affirm that the results are directly linked to the results present in the previous Section 5.3, in Table 8, since a classification of contribution adopted in (Petersen et al., 2008; Petersen et al., 2015), its great majority by the selected primary studies are related to the links that propose a solution to a given problem, whether it is a new solution or a significant reference from previous studies (Petersen et al., 2015), highlight small examples are typically used to demonstrate the potential benefits and the applicability of the proposed solution.
Classification | Amount | Percentage | List of primary studies |
---|---|---|---|
Process | 12/35 | 34.28% | [A9], [A17-18], [A23], |
[A25], [A28], [A29], [A31-35] | |||
Method | 11/35 | 31.42% | [A4], [A10], [A12], [A15-16], |
[A19-20], [A22], [A24], [A27], [A30] | |||
Model | 10/35 | 28.57% | [A2-3], [A5-6], [A8], [A11], |
[A13], [A14], [A21], [A26] | |||
Metric | 1/35 | 2.85% | [A1] |
Tool | 1/35 | 2.85% | [A7] |
5.5. RQ5: What research methods were used?
Table 9 shows the relation between the primary studies selected and six empirical methods described in (Petersen et al., 2008; Wieringa et al., 2006). Most studies (48.57%, 17/35) focused on proposing new solutions. This result indicates that the primary studies were chiefly concerned with bridging research gaps by proposing techniques to deal with design models. The primary studies predominantly sought to propose a new solution, instead of significantly extending an existing technique. The potential benefits and applicability of these solutions have been demonstrated through small examples or initial empirical studies supported by discussions and implications. Robust and practical studies that brought evidence about the effectiveness of the solutions have not been identified. Case studies in the industry considering context variables have not been reported. This may be indicative of an area still maturing and expanding
Classification | Amount | Percentage | List of primary studies |
---|---|---|---|
Solution Proposal | 17/35 | 48.57% | [A4-9], [A12], [A14-16], [A20], [A23], [A26-27], [A29], [A32], [A35] |
Evaluation | 9/35 | 25.71% | [A1-3], [A18-19], [A21-22], [A25], [A34] |
Validation | 9/35 | 25.71% | [A10-11], [A13], [A17], [A24], [A30-31], [A33] |
Some studies (25%, 9/35) were classified as validation research, which proposed some new techniques, but have not yet been implemented in practice, being evaluated through empirical studies in laboratories. Müller and Fritz [A28] show through an empirical study that biometrics can be used to predict quality concerns of parts of the code while a developer is working on it.
The results indicate that little has been done to discuss the problems identified with prediction techniques. Most studies make only notes for identifying anomalies, security, and vulnerability issues as examples. Finally, the lack of a massive amount of empirical studies may indicate that the evaluation of prediction techniques may be based mainly on experts’ reflection, not on empirical evidence.
5.6. RQ6: Where have the studies been published?
This section investigates when and where primary studies were published to accurately pinpoint trends in publication. Figure 3 presents the primary studies chronologically, organizes them by type of publication and shows the number of studies published per year.
Number and venue of publications. The blue dashed line in Figure 3 counts the number of articles published per year. The results indicate that 62.86% (22/35) of the primary studies were published in conferences, while 34.29% (12/35) in journal, showing a predominance of publications in venues that encourage synchronous discussion by researchers. Based on the premise that articles published in journals are more robust, this may indicate a new or maturing area of research. The publications were more concentrated from 2016 to 2019. Such research on the prediction of design problems may have gained momentum for two reasons: (1) the maturation of the research area itself. that brought well-established concepts about catalogs of code anomalies and refactorings, as well as empirical knowledge about how certain code or social characteristics impact the incidence of design problems; and (2) machine learning techniques are being widely explored to solve practical software development problems.
Trends. Although there is not yet a consistent upward trend, the number of published studies has been growing. After the first publication in 2005, four and seven articles were published in 2011 and 2017, respectively, representing the tops reached over the years. This growth is accompanied by strong fluctuations, alternating with periods with a maximum of two published articles (2005 to 2009 and 2013 to 2015) to nine or more published articles (2010 to 2012 and 2016 to 2019). In addition, 2017 stood out with a greater number of articles produced than other years. Articles published in premier conferences and journals, such as SANER, ICSE, ASE, MSR, ICSM, JSS, IST, TOSEM, IEEE TSE, show that robust research has already been carried out. Although many studies have been published, there are still challenges worth exploring, which are discussed in the following section.
6. Discussion and Future Directions
This section discusses the collected data to explore the main points of RQ5 and RQ6. In particular, we seek to reveal where the selected studies are being published over the years (Figure 3). The classification of the selected studies is based on the year of publication, publication type (workshops, conferences, newspapers, and magazines), and the number of studies published per year. Figure 4 presents the obtained data based on the 37 selected studies, showing quantitatively the results presented in RQ6.
6.1. Distriution of primary studies
Figure 4 introduces a bubble chart that organizes the primary studies in three dimensions (d1, d2, d3), where d1 represents the main contributions (RQ4), d2 is the adopted research method (RQ5), and d3 is the explored design problems (RQ1). Each bubble has values assigned to d1, d2, and d3. This bubble chart helps grasp relations among the main contributions (RQ4), the research methods (RQ5), and the design problems (RQ1). It shows how primary studies have made a triangulation between RQ1, RQ4, and RQ5.
It is observed that prediction techniques for source code design, even though it is a recent research area, have many studies published since 2014 and continue to grow. This result shows that this area of research has been very active in recent years. After identifying the type of publication of these studies, it is revealed that the researchers who contributed most to the subject made their publications in recent years at conferences, represented by a total of 51.35% of the selected studies.
The results did not present statistical qualifiers and were not compared with other results studied since research on this topic has not been developed by other researchers previously. Some recommendations for future research would be: increase the breadth of analysis of selected studies, refine research on fundamental software quality issues; conduct systematic review literature to examine best practices related to source code analysis approaches, technologies or tools, and comparative analysis information. Also, this work may be the first step towards an ambitious agenda on how to advance the current literature on techniques for predicting source code design problems.
6.2. Future challenges
(1) Good quality management in software projects. It is important to identify the main software quality guides most used in the current market. Amid a tougher economic climate, organizations turned to ML for automation and efficiency gains. This allows many of them to massively expand their operations while reallocating their human capital. The search for quality to meet customer needs is no longer a differential competitive, but an obligation for any business to survive in the market. The increase in quality in a company generates positive effects on the company’s processes, management, customer service, and strategic planning. Therefore, it is imperative to know which quality tools or techniques of ML will provide an effective and clear improvement in software projects.
(2) How to quantify software metrics and their quality. Learning analysis appears as a possibility to address this challenge, recognize the difficulties in generating quantitative security, vulnerability, and design metrics. It will be possible to quantify and predict the impacts on the final quality of the software. It is not easy to quantify the maintainability of software. This measure’s primary metric is the time spent on maintenance, considering the time of recognition of the problem, analysis of the problem, specification of changes, modification, tests, and the total time.
(3) How to extract critical features for knowledge discovery. Machine learning is essential for predicting source-code problems. Training machine learning models demand well-designed data sets. The construction of a data set is challenging due to the various sources and lack of structured data. Moreover, source code only may not be sufficient to obtain good results, in artificial intelligence projects, especially Machine Learning, a large amount of data is needed, which will participate in the training of the algorithm.
So, a lot of the work on an ML project is finding the perfect data set for your needs. However, it is not always possible to find an option according to its ambition. Therefore, another challenge is how to consider the developers’ experience in the training process of the machine learning models. Current literature fails to deal with these challenges, leading to great research opportunities.
7. Threats to Validity
The validity of the results achieved in the systematic mapping depends on some factors present in its structure. The main threats to the validation of this study and the factors used to mitigate them are presented and analyzed:
Selection and quality of primary studies: To guarantee an impartial and comprehensive systematic mapping process and the quality of studies considered relevant, research questions, inclusion criteria, and exclusion criteria were defined by a group of researchers.
The researchers and responsibilities: To review the process of carrying out systematic mapping, conducted by the master student, and to clarify his doubts, while he performed the data extraction process. In this way, studies with a broad overview were obtained.
The number of studies selected: To obtain a wide range of results and necessary data, the search was carried out in six repositories of widely known scientific studies (IEEE Explorer, ACM Digital Library, Scopus, Science Direct, Scopus, and Google Scholar).
Possibility of a relevant study to be ignored: Although it is plausible that possible relevant studies were ignored in the survey of primary studies, we opted only to read the abstract, title, and keywords in the application of the criteria inclusion and exclusion. However, in step 05, manual search procedures were performed using snowballing techniques to find possible relevant studies in the references of the studies selected in the previous step. An interesting method to expand the possibilities of returning articles relevant to the review research topic is the snowball method. This one method consists of searching the references of articles included in the work to identify works that are of interest to the research. This method can be used, for example, at the end of the automatic search where a set of articles is already included in the review. Thus, from this set, this technique can be used to find more relevant studies.
8. Conclusions and Future Work
This article sought to grasp and classify the current literature and pinpoint trends and challenges worth investigating in the field of design problems. A systematic mapping study was designed and run based on well-established practical guidelines. In total, six research questions were formulated and answered after carefully analyzing 35 primary studies. This study fills a current gap in the literature considering the variables explored in the research questions, as well as serving as a basis for researchers and students to develop future work on the subject. Our results indicated that a majority of the primary studies explored Bloater bad smells, used code complexity and size as predictors, applied machine learning techniques to generate predictions, and presented a prediction proposal without an extensive empirical assessment.
Finally, we hope that the collected data and insights presented throughout this study can encourage researchers and practitioners to explore upcoming challenges regarding the prediction of design problems. Moreover, this work can be seen as the first step towards a more robust agenda on how to advance the current literature on the prediction of design problems.