1 Introduction
Nowadays, the Internet is growing considerably in terms of the number of users, websites and pages. Statistics in [1] show a dramatic evolution of the Internet from 1995 to the present day (from 16 million users to 5,168 million users in 2021). This increase is mainly due to the democratization of the web and the low cost of publishing and consulting web content. Consequently, the web has become the primary source of social, e-commerce and marketing content, comprising opinions and advertisements in different domains. In recent years, there has been increasing interest from businesspersons, companies and users in analyzing and exploiting social media content. Indeed, the possibilities offered by such content are numerous; for example, in marketing, knowing users’ opinions and attitudes is very useful for both customers and companies. Another relevant case concerns the prediction of economic and commercial indicators, such as incomes and prices, from online textual data. Such knowledge could help companies remain competitive and continue performing well in the market.
However, the rapid growth of this huge amount of web data poses a challenge regarding its effective exploitation with reasonable time and effort. Fortunately, text mining offers automatic predictive techniques to deal with such challenging tasks, as it provides solutions for analyzing data and predicting the required outcomes by computer programs. Because the most typical form of online information is written text, text mining has a very high commercial potential. Indeed, a study showed that 80% of a company's data is in textual form, such as emails and reports [2].
Despite the utility of employing predictive text mining techniques on social media content for business purposes, there has been limited work targeting this area. The situation in Arabic is even worse because, to the best of our knowledge, no research effort has been devoted to the application of text mining methods to the task of prediction in business and commercial domains. Indeed, the state of the art of Arabic text mining confirms that most research efforts have focused mainly on thematic text categorization [3-8], sentiment analysis [9-13], authorship attribution [14-17] and mining the holy Quran [18-20], while other papers were interested in web page clustering and annotation [21] and information extraction [22].
Recently, from a business perspective, many companies specialized in marketing have built websites providing online advertisements for several commercial activities. The domain of used cars and equipment is among the most interesting business sectors in Algeria because of its continuous growth and expansion in recent years. In fact, the decision of the Algerian government to ban the import of new cars has revolutionized the used car and equipment markets. For example, the well-known Algerian website Oued Kniss, specialized in posting advertisements for real estate, used equipment and cars, is the most visited Algerian website in Algeria, and was ranked fourth among all websites (including international ones) visited in this country in 2021 [23]. This website was worth 40 billion Algerian dinars in 2014, with more than 800,000 advertisements [24].
In contrast to new-car advertisements, in which attributes are categorical and represent the car's components and options, used cars and equipment are described by unstructured textual data that is relevant for accurate price estimation. Indeed, some textual features included in the car description are very valuable and should therefore be taken into consideration in price prediction. For example, the state of a given car (e.g., nearly new or good), as well as the state of its components such as the engine (new, repaired or revised), significantly affects the price of a used car. Tables 1 and 2 illustrate two examples of car and equipment advertisements (the unit U equals 10,000 Algerian dinars). In these tables, we can see some relevant textual features (in bold) that could affect the price of used cars and equipment.
The goal of this research work is to explore the capabilities of Arabic text mining techniques, coupled with common predictive machine learning algorithms, to estimate the prices of used goods such as cars and equipment. In particular, we study the influence of the text preprocessing task on price valuation. We also investigate the impact of some data mining techniques, such as feature selection, on prediction results. In addition, we examine and compare the performance of several predictive algorithms, namely K-nearest neighbors, regression-based algorithms and neural networks, in price forecasting for used cars and equipment. Finally, to evaluate the contribution of using text mining and integrating textual data into the prediction model, we compare the proposed methods with the same algorithms employing only structured variables for price prediction.
We believe that proposing such solutions for predicting accurate prices will be very helpful for both buyers and sellers. Indeed, providing a precise price estimation tool allows people to make the right decision when selling or buying used goods by avoiding overestimating or underestimating the real price [25].
The rest of the paper is organized as follows. In the next section, we give an overview of studies related to the task of prediction in the commercial context using text mining methods. Section 3 explains the process of gathering and preprocessing textual data and describes the proposed solution for price valuation. After this, we present the results obtained, along with their interpretation, in Section 4. Finally, Section 5 provides conclusions and possible improvements of the presented work.
2 Related Work
The aim of text mining is the analysis of large amounts of textual data and the detection of linguistic usage patterns to find useful information [26]. This research area uses natural language processing and data mining techniques to extract useful knowledge from texts.
In text mining, supervised methods are among the most popular techniques for mining such valuable information. In these methods, predictive models are built and trained on annotated examples, and then evaluated, with the aim of predicting the required results.
These forecasting techniques fall into two categories: classification and regression [27]. The distinction depends on the required type of prediction: in the classification task, an example is assigned to one of several predefined classes, while regression models estimate a continuous numerical value for a given instance [28].
In recent years, relevant research papers investigating predictive text mining models for marketing and business purposes have been proposed, owing to their importance for both customers and companies [25, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]. For example, in the sentiment analysis context, studies have attempted to analyze the impact of consumers’ sentiments on a company’s economic outputs [29], and to estimate revenue from opinions in social media [30]. The effect of news headlines [31] and sentiments [32] on stock price estimation was also investigated. Moreover, some articles studied the relation between purchase intention and product price [33, 34].
Other interesting studies exploited text mining approaches for the task of price prediction in markets. For example, the authors of [35] presented a prediction method for crude oil prices and indicated that their approach outperforms other predictive methods. In [36], the researchers described a forecasting system based on text mining, called NewsCATS (News Categorization and Trading System), to predict trends in stock prices. Another advertising activity in which price estimation is crucial is the real estate area. In this perspective, relevant papers aimed to estimate the price of real estate classifieds [25, 37] and to predict the end prices of online auctions [38] using text mining methods.
Regarding Arabic text mining, research studies in the marketing and economic domain that employ text mining techniques for price prediction are very rare. The present work aims to investigate this avenue of research.
3 Proposed Approach
3.1 Data Collection and Preparation
In this section, we explain the process of creating our datasets. For the used cars domain, we collected Arabic advertisements from the leading online advertising website in Algeria (ouedkniss.com). The posts were gathered between 10 August 2018 and 17 October 2018. We selected only texts in which web users employed standard or comprehensible Arabic in their online advertisements. Duplicate documents containing identical descriptions of second-hand cars were removed. Additionally, advertisements using Romanized Arabic or other languages such as French were discarded from the corpus. Unrealistic advertisements with exaggerated asking prices were also rejected, as they could negatively affect the performance of the prediction model.
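For illustration, the selection rules above can be approximated by a small routine such as the one below. This is only a sketch: the Arabic-script ratio, its 0.5 threshold and the price bounds are our own illustrative assumptions, not the exact criteria used in the study.

```python
import re

ARABIC_CHARS = re.compile(r'[\u0600-\u06FF]')

def keep_advertisement(text, price, min_price, max_price, seen_descriptions):
    """Apply the selection rules sketched above (thresholds are illustrative)."""
    # Keep only texts written predominantly in Arabic script
    arabic_ratio = len(ARABIC_CHARS.findall(text)) / max(len(text), 1)
    if arabic_ratio < 0.5:
        return False
    # Remove duplicate descriptions
    if text in seen_descriptions:
        return False
    # Reject unrealistic asking prices
    if not (min_price <= price <= max_price):
        return False
    seen_descriptions.add(text)
    return True
```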
The obtained dataset contains 400 documents about the 20 most common car models in Algeria. Each document in the collection comprises two parts: the first part includes structured variables describing the used car, such as year of manufacture and mileage in km, while the second part comprises textual data describing the car's state. Prices in this dataset range from 800,000 AD to 4,000,000 AD (AD stands for Algerian dinar). Table 3 shows the list of structured features in the used car dataset.
| Feature name  | Type    | Meaning                                    |
|---------------|---------|--------------------------------------------|
| Model         | Text    | Model of the car, example: Chevrolet OPTRA |
| Mileage in km | Integer | Distance travelled by the car              |
| Year          | Integer | Year of manufacture                        |
| Price         | Integer | The required price of the car              |
To compile the second dataset, related to used construction equipment, we gathered examples from the same website (Ouedkniss). The same methodology used to select documents for the first dataset of used cars was applied to create this second collection. In the second dataset, each document describes heavy equipment lots, such as trucks and buses.
The obtained dataset comprises 482 texts, and prices in this collection range between 1,000,000 AD and 28,000,000 AD.
Concerning data preparation, we performed the usual text preprocessing steps, including tokenization, removal of non-Arabic letters, and normalization of Alif, Taa and Yaa. In addition, stop words (such as the Arabic equivalents of "on" and "in") and words shorter than two letters were removed from the corpus.
In the preprocessing phase, stemming is an optional task that reduces word forms to a single representation (stem, base or root). In Arabic, the most widely known stemming methods are root stemming [39] and light stemming [40]. In our work, we applied light stemming.
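As an illustration of these preprocessing steps, the sketch below implements tokenization, a simplified Alif/Taa/Yaa normalization, stop-word and short-word removal, and a very rough light stemmer. The stop-word list and affix lists are deliberately small and only approximate the light-stemming method of [40].

```python
import re

# Illustrative (incomplete) Arabic stop-word list; the real list is larger
STOP_WORDS = {"في", "على", "من", "الى"}

def normalize(token):
    """Normalize Alif, Taa Marbuta and Yaa variants to a single form."""
    token = re.sub(r'[أإآ]', 'ا', token)
    return token.replace('ة', 'ه').replace('ى', 'ي')

def light_stem(token):
    """Very rough light stemming: strip a few common prefixes/suffixes."""
    for prefix in ('ال', 'و', 'ب', 'ل'):
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in ('ات', 'ين', 'ون', 'ها', 'ة'):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token

def preprocess(text):
    # Keep Arabic letters only, then tokenize on whitespace
    text = re.sub(r'[^\u0600-\u06FF\s]', ' ', text)
    tokens = [normalize(t) for t in text.split()]
    # Remove stop words and tokens shorter than two letters
    tokens = [t for t in tokens if len(t) >= 2 and t not in STOP_WORDS]
    return [light_stem(t) for t in tokens]
```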
Regarding feature types, we used word n-grams [41]. We recall that a word n-gram is a contiguous sequence of n words from a given sample of text. An n-gram of size 1 is called a unigram, two successive words form a bigram (or digram), and an n-gram of size 3 is called a trigram.
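To make the notion concrete, here is a minimal sketch of word n-gram extraction (the toy English token list is used only for readability):

```python
def word_ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "engine", "is", "new"]
print(word_ngrams(tokens, 1))  # unigrams: ['the', 'engine', 'is', 'new']
print(word_ngrams(tokens, 2))  # bigrams:  ['the engine', 'engine is', 'is new']
```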
The next step of the preprocessing task consists of assigning to each feature (word) a weight representing its relevance in the text. There are several weighting schemes, such as Boolean weighting, Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF.IDF), which is a combination of TF and Inverse Document Frequency (IDF). TF is the number of times a word occurs in a text, while IDF is based on the ratio of the total number of documents to the number of documents containing the word (typically its logarithm). TF.IDF is a popular weighting scheme used in text mining applications such as information retrieval and text classification, as it reflects the relevance of a feature (word) in a given corpus.
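As a minimal sketch of this weighting step, the snippet below uses scikit-learn's TfidfVectorizer as a stand-in for the weighting operator actually used in our pipeline; the sample documents are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "engine new state very good",        # invented sample descriptions
    "engine repaired scratches on body",
    "nearly new no accident",
]

# Unigrams and bigrams weighted by TF.IDF
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the n-gram vocabulary
print(X.shape)                             # (3 documents, vocabulary size)
```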
Finally, we applied feature selection to the textual features in order to optimize the number of words in the dataset. We retained the most relevant features based on the correlation between description words and the output, i.e., the price. The optimal number of features was determined empirically through experiments.
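A minimal sketch of this correlation-based selection is given below, assuming a dense feature matrix X and a price vector y; it keeps the k features whose absolute Pearson correlation with the price is highest, where k is the empirically chosen size.

```python
import numpy as np

def select_by_correlation(X, y, k):
    """Keep the k columns of X most correlated (in absolute value) with y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    correlations = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        column = X[:, j]
        if column.std() > 0:                       # avoid division by zero
            correlations[j] = abs(np.corrcoef(column, y)[0, 1])
    selected = np.argsort(correlations)[::-1][:k]  # indices of the top-k features
    return X[:, selected], selected
```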
3.2 Used Algorithms
Regarding the price prediction model, and for comparison purposes, we applied four forecasting algorithms, namely linear regression (LinReg), pace regression (PaceReg), K-nearest neighbors (KNN) and neural networks (NeuNet), to estimate the price of used cars and equipment. To determine the importance of integrating the textual information describing the used cars into price valuation, we built two prediction models. The first, base model includes only categorical and structured data, such as year of manufacture and mileage, while the second model combines the first model (categorical features) with the textual data describing the state of the used car. Figure 1 shows the different steps of the proposed solution for integrating textual features to accurately predict the price of used cars and equipment.
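The following sketch illustrates the two configurations, with scikit-learn estimators standing in for the operators actually used; the feature matrices and parameters are placeholders, and pace regression has no direct scikit-learn counterpart, so it is only noted in a comment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

# Placeholder feature matrices: structured variables (year, mileage, ...)
# and the selected textual (TF.IDF) features, plus the target prices.
X_struct = np.random.rand(400, 2)
X_text = np.random.rand(400, 10)
y = np.random.rand(400)

models = {
    "LinReg": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=10),
    "NeuNet": MLPRegressor(max_iter=1000),
    # Pace regression (PaceReg) is available, e.g., in Weka, not in scikit-learn
}

X_base = X_struct                       # model 1: structured features only
X_full = np.hstack([X_struct, X_text])  # model 2: structured + textual features
```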
3.3 Evaluation
To assess the performance of our predictive techniques, we calculated the Root Mean Squared Error (RMSE), a well-known measure for evaluating prediction models. In addition, 10-fold cross-validation was employed to average the performance results in the experiments.
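For reference, given $n$ test instances with true prices $y_i$ and predicted prices $\hat{y}_i$, RMSE is computed as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$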
Cross-validation is a popular technique for evaluating and comparing machine learning algorithms. At each iteration, the data are split into two parts: one part is used to learn and build the prediction model, while the other part is used for testing. The performance results over the 10 folds are then combined and averaged [42].
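A minimal sketch of this evaluation protocol is shown below; scikit-learn stands in for the validation operator actually used, and X and y denote any of the feature matrices and the price vector prepared above.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def cross_validated_rmse(model, X, y, n_splits=10, seed=0):
    """Return the mean and standard deviation of RMSE over the folds."""
    rmses = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=seed).split(X):
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], predictions)))
    return np.mean(rmses), np.std(rmses)

# Example (hypothetical) usage with the base configuration sketched earlier:
# mean_rmse, std_rmse = cross_validated_rmse(LinearRegression(), X_base, y)
```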
4 Experimental Results
The role of the experimental study is to evaluate and compare the performance of the proposed models for price forecasting. To perform the experiments, we used the RapidMiner software, which includes the different operators required for text preprocessing, prediction and evaluation.
Regarding the first text collection (cars), we considered two scenarios. In the first, textual data is not considered and only structured features are used, while in the second, we integrated the unstructured textual data representing the used car description into the prediction process.
Table 4 reports the prediction performance, measured by RMSE, of the four prediction algorithms without including textual information (i.e., the car description) in the model. As this table shows, the best result is obtained with the neural network algorithm (NeuNet) (30.539 ± 10.770).
| Algorithm  | RMSE            |
|------------|-----------------|
| LinReg     | 32.791 ± 11.870 |
| KNN (N=1)  | 73.631 ± 12.666 |
| KNN (N=5)  | 59.165 ± 12.021 |
| KNN (N=10) | 56.012 ± 10.760 |
| PaceReg    | 31.937 ± 11.091 |
| NeuNet     | 30.539 ± 10.770 |
In addition, KNN is the worst prediction algorithm; it provides its best results when the number of neighbors equals 10. We therefore keep this value for KNN in the next experiments.
In the second step, we integrated the unstructured features, i.e., the car description, into the prediction model. For this, we tested different text representation schemes in the experiments: binary, TF and TF*IDF. We found that TF*IDF is the best for all prediction algorithms; hence, we report results only for this weighting scheme, for unigrams and bigrams of words. In addition, we performed pruning, which eliminates words that are highly correlated with each other, in order to optimize the final list of relevant features. Table 5 presents the performance results of the four algorithms when employing unigrams and bigrams. The best results for each algorithm are highlighted in bold.
| Algorithm  | Feature size | RMSE (unigrams)     | RMSE (bigrams)      |
|------------|--------------|---------------------|---------------------|
| LinReg     | 10           | 27.839 ± 9.457      | **27.585 ± 8.874**  |
| LinReg     | 50           | 30.770 ± 8.798      | 31.414 ± 12.502     |
| LinReg     | 100          | 30.476 ± 9.782      | 32.062 ± 13.055     |
| KNN (N=10) | 10           | 56.106 ± 10.664     | 55.897 ± 10.887     |
| KNN (N=10) | 50           | 55.911 ± 10.881     | **55.786 ± 10.728** |
| KNN (N=10) | 100          | 55.907 ± 10.877     | 55.831 ± 10.791     |
| PaceReg    | 10           | **28.565 ± 8.567**  | 28.731 ± 8.370      |
| PaceReg    | 50           | 29.587 ± 8.996      | 31.228 ± 11.970     |
| PaceReg    | 100          | 32.230 ± 10.123     | 31.680 ± 9.079      |
| NeuNet     | 10           | 29.136 ± 9.358      | **28.526 ± 9.516**  |
| NeuNet     | 50           | 31.477 ± 9.223      | 29.081 ± 8.949      |
| NeuNet     | 100          | 32.397 ± 6.622      | 31.000 ± 8.243      |
The results in Table 5 show that integrating the textual information describing the used cars improves price prediction for all tested algorithms compared with the results of Table 4. We can also observe that the best algorithm for predicting car prices is linear regression (LinReg).
Moreover, using bigrams slightly enhances the performance of three of the algorithms, and a feature size of 10 yields the lowest RMSE values. The algorithm that gained most from the integration of unstructured textual data is LinReg, whose RMSE was reduced from 32.791 ± 11.870 to 27.585 ± 8.874. This result agrees with the findings of [25], which confirmed that linear regression outperformed neural networks.
To further analyze the contribution of integrating textual data for price estimation, Table 6 lists some relevant words that positively or negatively affect the prices of used cars.
We observe from Table 6 that some opinion words (meaning "new" and "nice" in English) have a positive weight, increasing the price of the car, while some words related to the car description (meaning "accident" and "scratches" in English) impact its price negatively. We also see that some words (e.g., those meaning "clean" in English) share the same meaning. Therefore, we believe that integrating a semantic approach that maps synonyms to a common concept could improve price prediction for used cars.
We continued our experiments with the second text collection, concerning used equipment. The same preprocessing tasks were performed as for the first data collection of used cars. This data collection comprises only textual information (no categorical features).
The experimental outcomes are presented in Table 7, where the best results for each algorithm are highlighted in bold. The first remark is that the obtained results are modest compared with those of the first dataset of used cars. These results are in line with the conclusions of the study in [38], whose authors showed that their regression models did not provide good results for predicting the end prices of auction items. Our dataset is similar to the auction dataset in [38], as it contains several heterogeneous items for sale.
| Algorithm  | Feature size | RMSE (unigrams)       | RMSE (bigrams)        |
|------------|--------------|-----------------------|-----------------------|
| LinReg     | 10           | 457.488 ± 238.081     | 518.347 ± 231.582     |
| LinReg     | 50           | 355.422 ± 262.844     | 428.134 ± 260.230     |
| LinReg     | 100          | **318.524 ± 274.326** | 345.747 ± 278.843     |
| KNN (N=10) | 10           | 466.195 ± 241.201     | 489.468 ± 243.697     |
| KNN (N=10) | 50           | 398.128 ± 263.475     | 452.539 ± 256.848     |
| KNN (N=10) | 100          | 415.493 ± 263.154     | **395.590 ± 269.428** |
| PaceReg    | 10           | 459.948 ± 244.777     | 511.549 ± 233.568     |
| PaceReg    | 50           | 362.932 ± 263.316     | 429.844 ± 255.480     |
| PaceReg    | 100          | **340.889 ± 268.593** | 366.443 ± 271.005     |
| NeuNet     | 10           | 524.814 ± 231.400     | 543.280 ± 235.249     |
| NeuNet     | 50           | **368.699 ± 257.011** | 430.134 ± 251.625     |
| NeuNet     | 100          | 443.707 ± 377.139     | 438.771 ± 321.049     |
According to Table 7, linear regression (LinReg) again proves superior to the other algorithms in terms of RMSE, while KNN remains the worst. In addition, the optimal number of features is 100 for three of the algorithms. Regarding feature types, using bigrams of words does not improve price prediction on this dataset.
Finally, after carrying out our experiments on the two data collections, we can draw some conclusions. Firstly, regarding the employed algorithms, the best ones for price prediction using text mining and textual data are the regression-based models, linear and pace regression, as they provide the lowest RMSE values, while KNN is the worst model for price forecasting. In addition, the neural network algorithm was applied with its default parameters; we believe that optimizing these parameters could improve its prediction results. Secondly, integrating the textual data describing the car state significantly improves price prediction results on the car dataset. Moreover, employing bigrams of words enhances price estimation on this corpus.
Concerning the datasets used in the experiments, the performance results on the second dataset, which contains used equipment, are lower than on the car dataset. This is due to three factors. Firstly, each document in the equipment collection comprises a description of many different equipment lots, so the content of a lot is heterogeneous. Secondly, the description of the listed used items is not detailed enough, in contrast to the car dataset, in which the description of the used car is provided in detail. Thirdly, the price range in the equipment dataset is much larger than in the car corpus, which obviously increases the price prediction error.
5 Conclusion and Future Work
In this paper, we studied the contribution of Arabic text mining to enhancing price prediction for used cars and equipment. The idea behind this work is that the textual information describing the state of a used good is very relevant for precisely estimating its price. In particular, we used both types of features related to second-hand cars and equipment, namely structured variables and textual data, to make predictions.
To this end, we compiled two datasets related to used cars and equipment. In addition, we employed and tested different predictive models, namely K-nearest neighbors, regression-based algorithms and neural networks, and compared their performance.
Experimental results showed that using text mining and fusing textual data into the prediction process improves price estimation. The results also showed that linear regression is the most suitable model for the price prediction task. As future work, we intend to apply the proposed solution to other domains. We also believe that deep learning techniques, information extraction and semantic approaches could further improve prediction results.