1 Introduction

The real estate market plays an important role in people’s lives, from individuals and families, to small businesses and large corporations. The process of purchasing or renting a property, whether for residential or commercial purposes, mainly depends on the economic and financial planning of a family or a company. Additionally, it is strongly related to the macroeconomics and the financial stability of much larger groups of people such as countries. Any sign of inconsistency or fluctuation in the real estate market can provoke apprehension in the state, trigger an economic recession or, ultimately, even lead to financial crises through housing bubble bursts. The potential risks are well known to the concerned parties and more importantly to governments that monitor the market on a regular basis. Banks have also invested greatly in real estate in order to obtain accurate house pricing estimates for mortgages and housing loans. These organizations often need to estimate the value of a given property for auctions or damage control when clients are unable to pay their debts. Besides states and organizations, property owners and investors should have the right to access valuable insights about the value of their properties too. This knowledge can increase the efficiency of managing assets or even help make profitable property investments.

Property estimations are performed by human experts like real estate brokers and engineers. This estimation process considers properties’ features and amenities, as well as external factors such as bus station density or distances to city centers. These are combined with other metrics, like the House Price Index [1], which tracks the changes in property prices, to arrive at a price estimate. During this process, there is no way of quantifying the accuracy of prediction nor the importance of each component that was included in the task. Therefore, the absence of confidence increases the risk of the forthcoming decision, which can end up being financially harmful.

In the contemporary world, the real estate market is represented mostly through different web-based services. In each country, there are numerous websites with vast amounts of properties available for renting or buying. These data have been utilized in the past for different analyses, ranging from creating models capable of predicting house prices based on their features to estimating prices over time in order to understand their seasonality. There has been a lot of research on this topic over the years, with big real estate datasets containing hundreds of properties being used to train machine learning models with the ultimate goal of providing meaningful price estimates. These datasets contain basic property features that are specific to the building itself, such as location, size, floor level and heating type, to name a few. Moreover, they can incorporate other features related to the surrounding area of the property, such as road network accessibility and distances from basic points of interest. All these features contribute to the urban profile of a neighborhood, which can directly or indirectly affect prices to a great extent. The importance of these features and their correlation to the price estimates have been validated in previous research [2–5].

Environmental factors have not been taken into consideration in the literature as much as they should have, despite their obvious role when selecting a property. The two most popular ones are the air quality index and noise pollution. The first indicates the level of cleanliness of the air, which influences the overall health of the population in a given area [6–8]. The second is related to the actual noise caused by road traffic, crowds, aviation and other factors such as the presence of night clubs or manufacturing establishments. The influence of noise pollution on the health of citizens living in an urban environment is well established. There are numerous cases in the research literature underlining the negative aspects of noise [9–12].

The Environmental Noise Directive (END) (Footnote 1) is the primary law in the European Union (EU) dealing with noise pollution. One of its main goals is to inform the public about environmental noise and its effects on people’s health. Moreover, it requires EU countries to provide noise maps and noise management plans on a regular basis.

Although noise pollution plays a major role in the nature of a neighborhood, research on its impact on house prices remains largely underexplored. To some extent, this is to be expected given the practical challenges of gathering environmental data, such as expensive measuring and monitoring tools, specialized software, and on-site orchestration of distributed sensors. In Greece, these studies are conducted by large corporations or state departments that subsequently hold the data for internal use. Of course, there are some crowdsourced initiatives [13] that aim to collect noise data, but for small countries like Greece, these are usually inadequate.

The impact of the real estate market on a country, in addition to the innovations that can emerge through research in the field, highlights the potential profit of such work. Being able to generate valuable environmental features of an urban area and, then, use those in the housing price prediction problem can help individuals, small and medium-sized businesses, all the way to large corporations, banks and government experts make profitable decisions. Aside from profitability, it can shed light on the various factors that influence prices. Knowing if and how the environment affects housing prices can assist urban planners to design more functional and efficient cities.

In the first part of this paper, we extract environmental data, and more specifically noise pollution, from published scientific studies. We focus on studies performed by the Hellenic Ministry of Environment and Energy (Footnote 2) for the urban area of Thessaloniki, Greece. The end results were published by the government as heat maps demonstrating the spatial distribution of noise across the city. However, none of the core noise measurements were made public, making any future use or contribution to the field difficult. We have managed to overcome this limitation by meticulously re-creating the sense of noise in a general-purpose and easy-to-use dataset.

In the second part of this work, we highlight the importance of noise in predicting house prices. To verify this, we have used the property database of Openhouse (Footnote 3), which is a real estate platform operating in major cities of Greece and, mainly, in the area of Thessaloniki. Regarding the machine learning models, we choose to use ensemble methods that have proved to work well in the research literature. The property and the noise data are used to create multiple models with distinct configurations, exploring different aspects of the same problem.

The main contributions of this work are:

  1. A new general-purpose sense-of-noise dataset, as well as a new housing price dataset containing noise information for the area of Thessaloniki (Footnote 4).

  2. An extensive experimental evaluation of the contribution of noise in the property price estimation process via ensemble models such as XGBoost [14] and light gradient boosting [15] models.

2 Related work

This section presents relevant research in the field of housing price prediction from a data perspective. It is important to discuss key relevant work in order to better understand the current state of the area, as well as to position this paper properly within the literature. We begin by outlining the most recent and best-performing solutions proposed for housing price estimates, considering basic property features. Then, we showcase approaches that incorporate various environmental features, with a specific focus on noise pollution. In both cases, we aim to investigate how the various features, especially environmental noise, affect prices.

Baldominos [3] studies the housing price prediction problem in the Salamanca district of Madrid. With a collection of 2266 properties from popular online sites containing the fundamental characteristics, they test the correlation between the features and the price and find that size is the most important one. They use these data to construct various regression models of different specifications, such as support vector machines, multi-layer perceptrons and ensembles of regression trees, all trying to predict prices given the features. The final results showcase the superiority of the ensemble trees over the other models. Imran [16] follows another approach for the capital of Pakistan, Islamabad. Alongside the basic property characteristics, they gather some features related to the surrounding area of a property. For instance, they attempt to include neighborhood-related information through binary values (yes/no) indicating the existence of core amenities and services like hospitals, schools and entertainment. Although their experiments encapsulate many features, the results show that, besides the total size, the numbers of bedrooms and bathrooms also strongly influence the price, with support vector machines being the best-performing model.

Truong [5] focuses on the Beijing area by using the “Housing Price in Beijing” dataset which contains more than 300,000 properties. Each property, apart from its standard attributes, has various spatial information like distance from the city center and subway accessibility. The exploratory analysis demonstrates direct correlation between the location and the property price, since each district has a different price range. Initially, random forest [17], XGBoost and lightweight gradient boosting models were used for training. Then, the authors combine these to build a stacked generalization model [18] by placing random forest and lightweight gradient boosting at the first level and XGBoost at the second one. This architecture outperforms any of the individual ones in terms of accuracy, with a much higher computational cost. Similarly, Xue [19] accumulates property data and urban details like bus and metro stations and routes, traffic and road network information for the city of Xi’an, China. The urban data are preprocessed and new meaningful indices are introduced. The property features and the new indices are utilized by ensemble models to highlight the fact that size is, again, the most influential factor in the matter of predicting prices. Additionally, they illustrate the importance of the neighborhood of a property, because the next most important group of features is related to the spatial indices. Along the same lines, Kang [20] engineers relevant features from more generic urban characteristics like human mobility patterns and socioeconomic data. They experiment with a gradient boosting ensemble in order to analyze features’ significance, where they come to the conclusion that some spatial features can play a more decisive role when it comes to predicting prices. For example, the prices of properties located near university campuses are mainly affected by the distance to the campus rather than their total size.

Environmental conditions can also act on prices. Chiarazzo [21] gathers property and air pollution data for the city of Taranto in Italy, which is marked as a high environmental risk area due to its heavy industry. With feature selection and an artificial neural network, they put the correlation of each feature to the test through a one-by-one elimination process. Interestingly, they state that sulfur dioxide concentration, one of the five major air pollutants, is the most determinant with respect to price, ranking higher than other characteristics such as floor level and distance to the city center. Shanghai is another industrialized city, where Zou [22] evaluates the air pollution phenomenon in connection with property prices to further quantify their relation. A total of 27,608 properties in conjunction with air pollutants are used as training data in a gradient boosting model, which attributes to the pollutants a contribution of 1.6%. By no means can this percentage be considered negligible, since a reduction of 1 μg/m³ in nitrogen dioxide increases the price by roughly 278 Yuan per square meter.

Regarding noise pollution, there is much less research available attempting to correlate house prices to noise levels. In general, noise pollution is measured in decibels, where higher values suggest noisier environments. Blanco [23] uses hedonic models to analyze the connection between prices and noise levels in three different areas in the United Kingdom. They suggest that, when evaluating properties with similar amenities, the presence or absence of noise affects people’s choices. In particular, the way noise impinges on prices differs depending on the area: in some areas there is a positive correlation and in others a negative one. Brandt [24] investigates the same hypothesis in the city of Hamburg, Germany, by combining multiple sources such as road, air and rail traffic noise pollution, again with hedonic models. They highlight the non-linear relationship between noise and price by stating that the price decrease is significantly smaller in areas with low levels of noise, as opposed to high noise level areas where the decrease is more pronounced. Contrary to Brandt’s work, Szczepanska [25] studies the noise effect on two rather dissimilar locations, with reference to noise, in the city of Olsztyn, Poland. They indicate the existence of a linear correlation between prices and noise pollution, which underlines the notion that location can influence the noise-price connection to a great extent.

Tsao and Lu [26] collect property data from the Ministry of the Interior of Taiwan for the city of Taoyuan and enhance them with five years of noise pollution data from the international airport of Taoyuan. The authors investigate, with hedonic models, the way aviation noise impacts the real estate market of the city, due to heavy air traffic at lower altitudes. The models indicate that as the number of flights over an area increases, which translates to noisier conditions, the prices of the corresponding properties decrease noticeably. Moreover, they measure the rate of price decline in certain decibel ranges and conclude that for roughly 65 dB of noise due to air traffic the decrease in price can reach 2356 USD, whereas for more polluted areas the decline reaches 3622 USD. Similarly, Morano [27] studies the area of Bari, Italy, in order to link noise pollution to house prices, with a total of 200 properties and noise information from the Strategic Noise Map of Bari, as well as residents’ perceptual views on the quality of an area with regard to noise. To measure the effect of noise, they employ a variation of a data-driven technique known as Evolutionary Polynomial Regression, or EPR [28], referred to as EPR-MOGA [29], which utilizes genetic algorithms. The final results outline the negative correlation between prices and noise levels, where highly polluted areas lead to cheaper housing.

The studies mentioned previously span across different cities, countries or, even, cultures. Even though cross-cultural validation [30] is out of the scope of the current paper, we think it’s important to mention it since it can fuel future work around this topic.

The related work indicates that the forefront of housing price prediction has been dominated by machine learning approaches, demonstrating their effectiveness in capturing intricate relationships within diverse property features. However, in the realm of incorporating noise pollution as a crucial determinant, prevailing methodologies have largely relied on conventional hedonic regression models. In this study, we endeavor to utilize machine learning models, with a specific emphasis on noise pollution as a pivotal predictor of housing prices. Moreover, we leverage modern explainability techniques, which have demonstrated efficacy in prior research [31], to untangle the complex dynamics between noise levels and their impact on the real estate market. Through these efforts, we aim to provide a comprehensive and innovative perspective on the interplay between environmental factors and property valuation. These two focal points represent the primary distinctions between the current work and its counterparts in the related literature.

3 Noise data reconstruction

As previously stated, noise data are difficult to obtain because they require specialized equipment for precise measurements, as well as urban environmental specialists capable of completing a task of this complexity. These data must include geographical references in the form of a coordinate system, mapping points or blocks on a map to certain noise values in decibels. This process is usually done with Geographic Information System (GIS) software tools that try to model noise pollution [32, 33].

As far as we know, there are no such data openly available for the urban area of Thessaloniki, Greece. However, there are official studies of noise pollution for Thessaloniki orchestrated by the Hellenic Ministry of Environment and Energy (Footnote 5). The studies were conducted in 2015 for three major municipalities of the urban area of Thessaloniki, namely Thessaloniki, Neapoli and Kalamaria, with specialized equipment capable of measuring ground sound levels caused mainly, but not only, by factors like vehicles (local transportation), crowds and nightlife, while additionally calculating aviation sound produced by airplanes landing at or taking off from the nearby airport. These noise sources are considered to be the primary causes of noise pollution in urban environments [34]. The duration of the studies was set to 46 consecutive days, capturing noise pollution at least once every hour or, in some cases, every 15 minutes.

The final results were illustrated on a heatmap, where discrete colors represent different noise ranges of 5 decibel intervals. For each municipality, the results are segmented into daytime and nighttime noise and, in both cases, the data accumulate the sound sources by taking into account both traffic and aviation disturbances. Additionally, for Kalamaria there is a separate heatmap representing only the aviation noise.

Even though the data were gathered in 2015, they can still be relevant today for the city of Thessaloniki for two reasons. The first is that published studies indicate that noise pollution in Thessaloniki has remained the same over the years [35]. The second is that noise outliers, such as noise coming from construction sites or extreme weather conditions, were excluded from the official heatmaps, rendering the dataset more accurate and relatively timeless in terms of the actual noise.

3.1 Idea and approach

The aforementioned studies did not make public the core measurement data that were used to create the provided heatmaps. To overcome this problem, we had to reconstruct these data with a small error. It is important to state that heatmaps used discrete colors mapped to specific small ranges of decibels as shown in Tables 1(a) and 1(b) (note that the ranges and the colors between the two tables are different). This means that each color represents the entire range without changing its tone. The ultimate goal is to be able to create the exact same maps by utilizing the reconstructed data. More specifically, the new dataset will contain the noise, in decibels, of a point given its latitude and longitude coordinates.

Table 1 The original mapping between the noise ranges and the corresponding colors (in RGB) for the area of Thessaloniki/Neapoli (left) and Kalamaria (right). Each noise range within an area is mapped to a different color, while the color mappings between the two areas differ

It is evident that an approximation of noise levels can be inferred from the colors on the maps. However, spatial information is insufficient to precisely map each pixel to its corresponding place on a geographic map. To address this, we employ a technique known as georeferencing [36] in QGIS (Footnote 6). This technique performs spatial interpolation by aligning the heatmaps of the noise studies with an actual map, thereby enhancing the heatmap with spatial characteristics (no upsampling was performed). Subsequently, we associate every pixel in the image with a noise value in decibels based on its color, using color mapping [37]. Since there are transitioning effects, as demonstrated in Fig. 1, we compute the difference between the colors in the heatmaps and the predefined color ranges [38]. When this difference is sufficiently small, we can assign noise values based on the corresponding color. To calculate the difference, taking into account human perception of colors, state-of-the-art solutions propose the \(\Delta E^{*}\) method, which is based on the LAB color format [39].
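To make the pixel-to-map step more concrete, the sketch below shows how a georeferenced raster is commonly described by a six-parameter affine transform (the convention used by GDAL-style geotransforms) that maps a pixel to longitude and latitude. It is only an illustration: the class, the function name and the coefficient values are hypothetical and are not the ones obtained for the Thessaloniki heatmaps.

```python
from typing import NamedTuple


class GeoTransform(NamedTuple):
    """Six-parameter affine transform of a georeferenced raster."""
    x_origin: float      # longitude of the top-left corner of the image
    pixel_width: float   # degrees of longitude per pixel column
    row_rotation: float  # 0.0 for a north-up image
    y_origin: float      # latitude of the top-left corner of the image
    col_rotation: float  # 0.0 for a north-up image
    pixel_height: float  # degrees of latitude per pixel row (negative: rows go south)


def pixel_to_lonlat(gt, col, row):
    """Return (longitude, latitude) of the centre of pixel (col, row)."""
    c, r = col + 0.5, row + 0.5  # shift from the pixel corner to its centre
    lon = gt.x_origin + c * gt.pixel_width + r * gt.row_rotation
    lat = gt.y_origin + c * gt.col_rotation + r * gt.pixel_height
    return lon, lat


# Hypothetical north-up transform, for illustration only.
EXAMPLE_GT = GeoTransform(22.88, 1e-5, 0.0, 40.68, 0.0, -1e-5)
```

With such a transform, every pixel of the heatmap image can be given a coordinate pair before its color is inspected.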

Figure 1 Color transition effects at the border of decibel ranges (60, 65] and (65, 70] in Thessaloniki’s original noise heatmaps

We used the most recent version of the \(\Delta E^{*}\) method, called CIEDE2000 [40]. The color comparison result is a number, where 0 means a complete match and, as the number increases, the difference between the colors increases too. To use this method, one must carefully select the threshold above which two colors are considered non-matching. After running some experiments, we set the threshold to 20. As a consequence of the discreteness of colors on the heatmaps, the threshold is not that crucial in our scenario, because the main goal is to differentiate between the predefined ranges. However, it is important to state that lower thresholds (stricter comparisons) were discarding valid noise locations, while higher thresholds (less strict comparisons) were introducing noise locations in places where there were none. Essentially, when the color of a pixel matches a color range, this pixel is assigned the corresponding noise value. Since our intention is to correlate housing prices with human-perceived noise, we choose to represent each noise range by its arithmetic mean (midpoint). So, if a pixel color matches the color of the 50-55 range, it will receive 52.5 as its noise value. The final result is sufficient to describe the noise perception of an area if we take into account the “3 dB rule” in the field of acoustics [41]. The rule states that an increase of 3 decibels doubles the sound energy and, thus, it is accepted as the smallest difference that can be easily heard by most people. For instance, the average human will rarely notice a transition from 50 to 51 decibels or from 60 to 61.
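The matching rule described above can be sketched as follows. The block converts sRGB pixels to CIELAB, computes the 2000 revision of the \(\Delta E^{*}\) formula (CIEDE2000) and assigns the midpoint of the matched range when the difference does not exceed the threshold of 20. The palette entries are placeholders rather than the colors of Table 1, and choosing the nearest palette color before thresholding is our assumption about how ties between ranges would be resolved.

```python
import math

# Hypothetical palette: RGB colour -> (low, high) decibel range.
# The real values are those of Table 1; these are placeholders.
PALETTE = {
    (255, 0, 0): (65.0, 70.0),
    (0, 0, 255): (50.0, 55.0),
}
THRESHOLD = 20.0  # maximum colour difference still counted as a match


def rgb_to_lab(rgb):
    """Convert an 8-bit sRGB colour to CIELAB (D65 white point)."""
    def linear(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(c) for c in rgb)
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883

    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29

    return 116 * f(y) - 16, 500 * (f(x) - f(y)), 200 * (f(y) - f(z))


def delta_e_2000(lab1, lab2):
    """CIEDE2000 colour difference between two CIELAB colours."""
    (l1, a1, b1), (l2, a2, b2) = lab1, lab2
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2
    g = 0.5 * (1 - math.sqrt(c_bar ** 7 / (c_bar ** 7 + 25.0 ** 7)))
    a1p, a2p = (1 + g) * a1, (1 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360
    h2p = math.degrees(math.atan2(b2, a2p)) % 360

    # Differences in lightness, chroma and hue.
    dl, dc = l2 - l1, c2p - c1p
    dh = h2p - h1p
    if c1p * c2p == 0:
        dh = 0.0
    elif dh > 180:
        dh -= 360
    elif dh < -180:
        dh += 360
    dh_big = 2 * math.sqrt(c1p * c2p) * math.sin(math.radians(dh / 2))

    # Mean lightness, chroma and hue.
    l_bar, c_bar_p = (l1 + l2) / 2, (c1p + c2p) / 2
    h_sum = h1p + h2p
    if c1p * c2p == 0:
        h_bar = h_sum
    elif abs(h1p - h2p) <= 180:
        h_bar = h_sum / 2
    elif h_sum < 360:
        h_bar = (h_sum + 360) / 2
    else:
        h_bar = (h_sum - 360) / 2

    # Weighting functions and the rotation term.
    t = (1 - 0.17 * math.cos(math.radians(h_bar - 30))
         + 0.24 * math.cos(math.radians(2 * h_bar))
         + 0.32 * math.cos(math.radians(3 * h_bar + 6))
         - 0.20 * math.cos(math.radians(4 * h_bar - 63)))
    s_l = 1 + 0.015 * (l_bar - 50) ** 2 / math.sqrt(20 + (l_bar - 50) ** 2)
    s_c = 1 + 0.045 * c_bar_p
    s_h = 1 + 0.015 * c_bar_p * t
    r_c = 2 * math.sqrt(c_bar_p ** 7 / (c_bar_p ** 7 + 25.0 ** 7))
    d_theta = 30 * math.exp(-(((h_bar - 275) / 25) ** 2))
    r_t = -math.sin(math.radians(2 * d_theta)) * r_c

    x_l, x_c, x_h = dl / s_l, dc / s_c, dh_big / s_h
    return math.sqrt(x_l ** 2 + x_c ** 2 + x_h ** 2 + r_t * x_c * x_h)


def noise_for_pixel(rgb):
    """Return the midpoint of the best-matching range, or None if no match."""
    lab = rgb_to_lab(rgb)
    best = min(PALETTE, key=lambda ref: delta_e_2000(lab, rgb_to_lab(ref)))
    if delta_e_2000(lab, rgb_to_lab(best)) > THRESHOLD:
        return None  # e.g. a transition pixel between two ranges
    low, high = PALETTE[best]
    return (low + high) / 2
```

Pixels for which the function returns None correspond to the unmatched locations that are discarded from the dataset.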

It is clear that not all pixels are important due to the transitioning effects we mentioned earlier. For example, in cases where two ranges of radically different colors are adjacent on the map, the transitioning effect will add some pixels in between that probably will not match any color range. Additionally, there are cases where the initial studies could not accurately receive measurements, like the inside of buildings and at the sea. These pixels are not matched to any of the available noise ranges and, hence, are dropped to declutter the data. Table 2 gives the structure of the final dataset where latitude and longitude are expressed in WGS84 (World Geodetic System 1984) [42], also known as EPSG:4326.

Table 2 Final dataset structure. The feature names, data types and value ranges of the new noise dataset

These datasets can be used to create heatmaps that resemble the initial ones. Even though most parts of the images were removed in the process, the remaining locations are still great in number. This can be verified by considering the dataset size in terms of number of rows in the second column of Table 3. Plotting that many points on a single map is exceedingly difficult due to memory constraints. At the same time, the datasets hold spatial information that is far too dense, making them hard to work with. The dataset contains spatial information for the city of Thessaloniki (\(\mathit{latitude}\in [40.56989, 40.678946]\) and \(\mathit{longitude}\in [22.880402, 23.014126]\); Footnote 7), supporting an accuracy of at least 5 decimal places in terms of latitude and longitude. Taking this into consideration, Lambert’s formula [43] translates this accuracy into an actual distance of 1.11 meters, meaning that each pixel has a spatial coverage of about 1.23 m². This level of detail is unnecessary for the purposes of this work. To reduce the density of information to more practical levels, we utilize tessellation. Through this method, the map is segmented into separate same-sized square tiles. We chose to tessellate the map by keeping only the first four decimal places of the coordinates. Thus, the accuracy decreases to a resolution of about 10 meters, which is more manageable and adequate for our case. We group the points based on this rule and aggregate their noise using the arithmetic mean to create a representative indicator for the noise level of the given tile. This technique alters the shape of the dataset as shown in the third column of Table 3 and allows us to plot the results on a map.
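A minimal sketch of this tessellation step is given below. It assumes that "keeping four decimals" is implemented by rounding (truncation would be an equally plausible reading), and the sample points and their noise values are made up for illustration.

```python
from collections import defaultdict
from statistics import fmean


def tessellate(points, decimals=4):
    """Group (lat, lon, noise_db) points into tiles and average their noise.

    A tile is identified by its coordinates rounded to `decimals` places.
    """
    tiles = defaultdict(list)
    for lat, lon, noise_db in points:
        tiles[(round(lat, decimals), round(lon, decimals))].append(noise_db)
    return {tile: fmean(values) for tile, values in tiles.items()}


# Toy points inside the bounding box of the dataset (noise values are made up).
sample = [
    (40.56991, 22.88041, 52.5),
    (40.56993, 22.88044, 57.5),
    (40.57012, 22.88041, 62.5),
]
tiles = tessellate(sample)
```

The first two sample points fall into the same tile and are averaged, while the third one forms a tile of its own.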

Table 3 The size of the dataset in regards to the number of rows underlining the reduction in size after tessellation

4 Implementation and experimentation

4.1 Property data

Investigating the correlation and influence of noise in housing prices requires a real-world housing prices dataset. For the purposes of this paper, we have utilized Openhouse (Footnote 8). Openhouse is a real estate platform operating in major cities of Greece. It contains high-quality information for a wide range of properties, considering multiple aspects of them. Since Openhouse is a data-oriented platform that pays close attention to its service, it has provided the properties of Thessaloniki so that we could experiment with the noise data reconstructed in the previous section. The data refer to residential properties offered for sale that were listed on the platform in October 2022. Each property has the features mentioned in Table 4.

Table 4 The property related features (from Openhouse), their data type, the proportion of missing values, as well as how the imputation was done in each case

The majority of the features are self-explanatory with the exception of ‘SubTypeId’ and ‘DoorFrameTypeId’. ‘SubTypeId’ refers to the structural subtype of the residential property, taking values like ‘apartment’ and ‘studio’ among others. ‘DoorFrameTypeId’ corresponds to the type of door frames a property has, such as ‘synthetics’ and ‘aluminum’ to name a few. We have performed an exploratory data analysis on the given dataset to locate potential outliers and verify the overall integrity. Outlier detection was done with the interquartile range (IQR) method. Using IQR on the ‘NumberOfRooms’ feature led to an upper limit of 7 rooms, which decreased the dataset size by no more than 1%. Similarly, for the ‘Size’ feature, the upper limit was 300 m², which reduced the size by almost 8%. Additionally, price outliers were removed by enforcing a price range between 10,000 and 500,000 euros. Eventually, the filtered set consists of 2014 properties. The missing values were filled according to Table 4, where different aggregations were used depending on the data type. It must be noted that although the ‘DoorFrameTypeId’ and ‘BasicHeatingTypeId’ features are missing approximately 30% of their values, they are considered of significant importance in the housing price prediction process based on the domain knowledge provided by Openhouse. Therefore, we decided to fill these too, and check their influence in practice. As for the encoding of features, ‘EnergyEfficiencyId’ and ‘FloorLevelId’ were encoded using incremental indices because they are ordinal categorical features. The other categorical features are nominal, so one-hot and binary encoding [44] were used and compared. One-hot encoding achieved better results and was thus used in the following experiments.
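The outlier filtering described above can be sketched as follows. The multiplier of 1.5 is the conventional choice for IQR fences and is an assumption here, since the text does not state it; the ‘Price’ key is likewise a hypothetical column name, and the quartile interpolation is the default of Python's statistics module, which may differ slightly from the one actually used.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, feature, k=1.5):
    """Keep only the rows whose `feature` value lies within the IQR fences."""
    lower, upper = iqr_bounds([row[feature] for row in rows], k)
    return [row for row in rows if lower <= row[feature] <= upper]


def clip_price(rows, low=10_000, high=500_000):
    """Keep only the rows whose price falls inside the accepted range."""
    return [row for row in rows if low <= row["Price"] <= high]
```

Applying `drop_outliers` to ‘NumberOfRooms’ and ‘Size’ and then `clip_price` reproduces the order of the filtering steps, although not necessarily the exact limits reported above.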

4.2 Experiments

To investigate the correlation between housing prices and noise, we utilize tree-based models that perform well in similar cases [5, 20, 22]. In particular, we use decision trees, random forest, XGBoost and light gradient boosting models. To verify the impact of noise, we employ standard interpretability methods like feature importance, partial dependence [45, 46] and permutation importance [47] plots. To shed even more light on interpretability, we employ more advanced techniques such as local interpretable model-agnostic explanations (LIME) [48] and Shapley additive explanations (SHAP) [49]. The hyperparameter tuning for each model was accomplished with Bayesian optimization [50], which outperformed grid search, and 5-fold cross-validation. As the evaluation metric, we have used the mean absolute error.
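The evaluation protocol, 5-fold cross-validation scored with the mean absolute error, can be illustrated with the dependency-free sketch below. The `fit` and `predict` callables and the mean-price stand-in model are hypothetical and only serve to keep the example self-contained; they are not the tree-based models of the experiments. Shuffling and the Bayesian search over hyperparameters are deliberately left out.

```python
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted prices."""
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, test_idx
        start += size


def cross_val_mae(features, prices, fit, predict, k=5):
    """Mean MAE over k folds for a model given as fit/predict callables."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(prices), k):
        model = fit([features[i] for i in train_idx], [prices[i] for i in train_idx])
        preds = predict(model, [features[i] for i in test_idx])
        scores.append(mean_absolute_error([prices[i] for i in test_idx], preds))
    return fmean(scores)


# Stand-in model: always predicts the mean training price.
def fit_mean(train_features, train_prices):
    return fmean(train_prices)


def predict_mean(model, test_features):
    return [model] * len(test_features)
```

In a tuning loop, `cross_val_mae` would be the objective that the optimizer tries to minimize for each candidate set of hyperparameters.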

The experiments were structured along three different axes. The first one corresponds to the procedure followed to assign the appropriate noise value to each property of the dataset. We choose to average the noise within a certain radius around each property, where the radius is manually set to 50 and 100 meters. The reasoning behind the selected distances is based on the inverse square law of noise modeling [51], which dictates that for each doubling of the distance from the source of noise, the intensity of the noise decreases by roughly 6 dB. For example, a typical car (700-1300 cm³) has an average noise level of 82 dB [34]. If a person is exposed to such noise at a distance of 1 meter, then at 50 meters the noise level attenuates to 48 dB and at 100 meters to 42 dB. For reference, the noise level during a normal conversation is approximately 55 dB [52]. Consequently, selecting a radius larger than 100 meters will capture noise that is most likely imperceptible to humans. For the sake of completeness, we should mention that the inverse square law holds true in open fields. In urban environments, noise does not exactly follow the inverse square law [53, 54], but the law is still a good approximation.
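The two ingredients of this axis, free-field attenuation and radius-based averaging, can be sketched as below. The attenuation follows the inverse square law, \(L(d) = L(d_0) - 20\log_{10}(d/d_0)\), which reproduces the 48 dB and 42 dB figures quoted above. Distances are computed here with the haversine formula on a spherical Earth, which is a simplification we substitute for illustration; the function names and the tile layout are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def attenuated_level(level_db, distance_m, reference_m=1.0):
    """Free-field level at `distance_m` given the level at `reference_m`."""
    return level_db - 20.0 * math.log10(distance_m / reference_m)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lam = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def mean_noise_within(lat, lon, tiles, radius_m):
    """Average the noise of all (lat, lon, noise_db) tiles within the radius."""
    nearby = [db for t_lat, t_lon, db in tiles
              if haversine_m(lat, lon, t_lat, t_lon) <= radius_m]
    return sum(nearby) / len(nearby) if nearby else None
```

Calling `mean_noise_within` once with a radius of 50 and once with 100 yields the two candidate noise values of a property; a result of None signals that no noise tile lies within the radius.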

The second one refers to the main noise characteristics we can use when assigning a noise value to a property. These characteristics are the following:

  • One feature for the average day noise and one for the average night noise (I)

  • One feature which averages both day and night noise (II)

  • One feature for the average day noise (III)

  • One feature for the average night noise (IV)

  • No features for noise in the baseline model (-)

The baseline model uses the same features as in cases I, II, III, and IV, without accounting for noise pollution characteristics. This configuration enables a more direct comparison of how noise features may affect housing prices.
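The five configurations listed above can be expressed as a small helper that returns the noise columns added to the property features. The feature names are hypothetical, and configuration II is assumed to be the plain mean of the day and night values.

```python
def noise_features(day_db, night_db, config):
    """Return the noise features of one property for a given configuration."""
    if config == "I":
        return {"noise_day": day_db, "noise_night": night_db}
    if config == "II":
        return {"noise_avg": (day_db + night_db) / 2}
    if config == "III":
        return {"noise_day": day_db}
    if config == "IV":
        return {"noise_night": night_db}
    if config == "-":
        return {}  # baseline: property features only
    raise ValueError(f"unknown configuration: {config!r}")
```

Merging the returned dictionary into a property's feature dictionary gives the model input for that configuration, with the baseline leaving the property features untouched.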

The third and last experimental component is the area where we examine the effect of noise in pricing. The presence of noise can be translated differently depending on the urban attributes of each part of a city [30, 55]. Good examples that demonstrate this behavior are city centers, where the noise levels are usually increased compared to other places in the same city as a consequence of the high road and pedestrian traffic. In turn, the traffic is caused by the commercial nature of the center since most of the provided services and amenities are located there. In these areas, properties with high noise pollution may command higher prices compared to properties with lower levels of environmental noise. However, this pattern does not hold true for other parts of the city. For instance, in the suburbs, where there are mostly residential properties of families, the absence of noise is generally considered to be a positive factor that can raise the prices. Taking these into consideration, we focus on three different areas of Thessaloniki with contrasting urban features: the city center (A), Triandria, Toumpa and Harilaou areas (B) and Kalamaria area (C), as they are depicted in Fig. 2.

Figure 2 The selected areas of Thessaloniki drawn on the map: the city center (A), Triandria, Toumpa and Harilaou areas (B) and Kalamaria area (C)

According to the Hellenic Statistical Authority (ELSTAT) (footnote 9), these areas have approximately the same population (around 90,000 to 100,000). However, area C has a relatively lower population density than the other areas, highlighting its suburb-like characteristics. As for the actual properties, the number of properties in each of the three areas is, again, of the same order. More specifically, there are 481, 358 and 472 properties in areas A, B and C respectively. From a price perspective, which is the target variable of the models, OpenHouse data suggest that area C is more expensive, with an average of 2370€ per m2, while areas A and B are close to 2134€ per m2 and 2136€ per m2 respectively. This underlines that area C is considered more valuable and desirable than the other two. Across all areas, the average sale price is 202,412.14€ with a standard deviation of 113,179.89€.

Regarding public transport, the Urban Transport Organization of Thessaloniki (OASTH) is the only operator in the area (footnote 10). The latest OpenStreetMap data show that areas A and C have similar access to public transport, with about 187 and 170 bus stops respectively, while area B has significantly lower access with only 110 bus stops.

Another reason why we chose these areas is their difference in terms of price-noise correlation. This is illustrated in Fig. 9, where the correlation between price per m2 and noise is plotted for the entire area of interest as well as for each individual area. While it is challenging to discern any significant correlation across the entire area of Thessaloniki, focusing on areas A and C reveals a subtle contrast in trends. This observation prompted a more thorough examination of these regions. Area B follows the pattern depicted in Fig. 9a and was selected as a representative subset of the entire area. The strength of the correlation can be quantified by calculating the R-Squared values between noise levels and the price per square meter, as shown in Table 5. Although the modest R-Squared values indicate a limited correlation, the divergence in tendencies between areas A and C motivated further exploration. Furthermore, to give a deeper understanding of the data distribution within each area, we have compiled statistical tables for the fundamental features, available in Appendix D in Tables 8, 9, 10, 11, 12 and 13.
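For a single predictor, the R-Squared value of a least-squares line equals the squared Pearson correlation, and the sign of the fitted slope indicates the direction of the trend. The sketch below shows this calculation; it assumes that the R-Squared values were obtained from a simple linear fit of price per m2 on noise, which is our reading of the text, and the numbers in the usage example are made up.

```python
def linear_fit_summary(x, y):
    """Return (slope, r_squared) of the least-squares line y ~ a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must not be constant")
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, r_squared


if __name__ == "__main__":
    # Hypothetical noise levels (dB) and prices (EUR per m2), for illustration only.
    noise_db = [52.0, 58.0, 61.0, 66.0, 70.0]
    price_m2 = [2400.0, 2250.0, 2300.0, 2100.0, 2150.0]
    slope, r2 = linear_fit_summary(noise_db, price_m2)
    print(f"slope = {slope:.2f} EUR per m2 per dB, R^2 = {r2:.3f}")
```

A low R-Squared combined with slopes of opposite sign in two areas would correspond to the situation described above for areas A and C.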

Table 5 The R-Squared (\(R^{2}\)) values between the noise in decibels and the price per m2 for each area, showcasing that areas are influenced differently by noise pollution

5 Results and discussion

The first two subsections of this section present and discuss the results of the noise reconstruction process explained previously for the two municipalities of Thessaloniki/Neapoli and Kalamaria. The experimental results are then presented and discussed, first with general comments and then with specific ones for each of the three areas selected in the previous section.

5.1 Noise reconstruction results for Thessaloniki and Neapoli

The results of the noise reconstruction process for the areas of Thessaloniki and Neapoli are showcased in Figs. 3 and 4. Figure 3 shows the average daily noise, ranging from 40 dB to almost 85 dB, for both the original and the reconstructed versions. The noisiest parts are the main roads and the intersections that can accommodate large numbers of vehicles. The two most distinguishable examples are the East and West entrances of the city, where the noise can reach 80 dB. It is also visible that the noise spreads almost equally around these highly polluted spots, which increases the noise pollution of the surrounding area. Besides the road network, another apparently noisy part of the city is the port, which is very large and highly active during both daytime and nighttime. Furthermore, the correlation between road size, which leads to high traffic, and noise pollution can be validated in urban areas with narrow streets. A useful example is the area of “Upper Town” marked in Fig. 3, one of the oldest parts of the city, where the roads are extremely narrow due to the increased elevation and the rough terrain. Besides restricting the number of vehicles that can pass simultaneously, this makes access difficult and unappealing to drivers. This is one reason why it is one of the quietest places in Thessaloniki. Figure 4 shows the average nightly noise in the same area: although the noisiest and quietest places remain the same, the noise pollution levels are much lower.

Figure 3 Average daily noise in Thessaloniki/Neapoli (including traffic, crowd and aviation noise). On the right hand side there is the original heatmap, while on the left hand side there is the reconstructed heatmap. Both heatmaps use the same coloring palettes to depict the different noise ranges

Figure 4 Average nightly noise in Thessaloniki/Neapoli (including traffic, crowd and aviation noise). On the right hand side there is the original heatmap, while on the left hand side there is the reconstructed heatmap. Both heatmaps use the same coloring palettes to depict the different noise ranges

5.2 Noise reconstruction results for Kalamaria

As in the previous subsection, Fig. 5 shows the average daily noise for the area of Kalamaria. Once again, the noisiest places are the main roads, while the quietest are those surrounded by low traffic streets. The yellow color indicates regions with the maximum noise levels, such as the core intersections. Contrary to the heatmaps of the other two municipalities, where the noise was almost entirely driven by the road network, in Kalamaria there are certain zones with little or no road network that are very noisy. This is caused by air traffic, since the airline routes pass over the vicinity at relatively low altitude and the turbines generate noise that can reach over 100 dB [56]. This effect is more recognizable at night (see Fig. 6). Even though the road network has minimal traffic at night, some areas are noisier than others. The noise generated by airplanes is shown in Figs. 7 and 8. These figures are zoomed in slightly to improve readability and to distinguish the street layout. The aviation data can be of great interest both in research and in industry, so in this paper we provide a separate dataset for the aviation noise.

5.3 Experimental results

The experimental results are organized into two groups based on the noise radius that was used. For each group, all four models were trained on the three areas of Thessaloniki for all four noise characteristics described in the previous section. Due to the area segmentation, the number of properties per experiment has declined, raising a concern about the sufficiency of the training set. To make sure the data were sufficient to draw valid conclusions, we plotted the learning curves of each model and verified that the curves reach a plateau. Essentially, we trained the models with an increasing number of samples while keeping a hold-out set fixed. For area A, the training curve converged after incorporating approximately 300 properties, while for areas B and C it converged after 230 and 250 properties respectively.
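The learning-curve check described above can be sketched generically as follows. This is an illustrative outline rather than the code used in the experiments: `fit` stands for any training routine that returns a prediction function, and the plateau criterion (relative change below `rel_tol` for all later steps) is an assumption of ours.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def learning_curve(fit, X_train, y_train, X_hold, y_hold, sizes):
    """Hold-out MAE after training on the first n samples, for each n in sizes.

    fit(X, y) must return a callable predict(x); the hold-out set stays fixed.
    """
    curve = []
    for n in sizes:
        predict = fit(X_train[:n], y_train[:n])
        curve.append((n, mae(y_hold, [predict(x) for x in X_hold])))
    return curve


def plateau_size(curve, rel_tol=0.02):
    """Smallest training size after which every further step changes the
    hold-out error by at most rel_tol (relative); None if there is no plateau."""
    for i in range(len(curve) - 1):
        later = curve[i:]
        steps = zip(later, later[1:])
        if all(abs(a[1] - b[1]) <= rel_tol * max(a[1], 1e-12) for a, b in steps):
            return curve[i][0]
    return None
```

With a real model, `fit` would wrap the training call of the chosen library, and `sizes` would be an increasing sequence such as 50, 100, 150 and so on up to the number of available properties.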

Because of the large number of experimental combinations arising from the axes mentioned previously, we omit the results for every noise characteristic and keep only the one that performs best. However, for the sake of completeness, we have included the detailed results in Tables 14 and 15 in Appendix E. We should point out that when the noise radius changes, a property can end up without a noise value, especially when the radius decreases. In such cases, these properties are removed from the dataset, which is why there seem to be inconsistencies in the results when switching from one radius to another, even without incorporating the noise data.

The results of Table 6, where the radius is set to 100 meters, indicate a clear dominance of the XGBoost model in terms of both mean absolute error (MAE) and mean absolute percentage error (MAPE) when compared to the baseline model. The performance gain varies by area, as do the noise characteristics that are used. More precisely, in area A there is no significant improvement, while in the other two areas noise improves both scores substantially. The LGBM model benefits from noise only in area C. The random forest and decision tree models are unable to make use of noise, with the exception of area A, where both improve. When the radius is set to 50 meters in Table 7, we observe the same pattern, and the hierarchy between the models remains the same. The main differences are that the LGBM model achieves better results than XGBoost in area C and that the decision tree improves considerably with the use of noise in area B. Regarding the best performing models, even though setting the radius to 50 meters can reduce the MAE in areas A and C, the MAPE does not change noticeably. Furthermore, decreasing the radius worsens the results in area B, so the radius switch does not necessarily enhance the overall performance of the model. The hyperparameter configuration can be found in Appendix F.

Table 6 Results for radius set to 100 m using 5-fold cross-validation for areas A, B and C. The results are presented in terms of the mean absolute error (MAE) and mean absolute percentage error (MAPE). Bold text marks the best score across all models for a given area. The dagger symbol indicates that noise pollution was included in the experiment. The “Noise” column refers to the different noise characteristics: one feature for the average day noise and one for the average night noise (I), one feature which averages both day and night noise (II), one feature for the average day noise (III), one feature for the average night noise (IV) and no features for noise in the baseline model (-)
Table 7 Results for radius set to 50 m using 5-fold cross-validation for areas A, B and C. The results are presented in terms of the mean absolute error (MAE) and mean absolute percentage error (MAPE). Bold text marks the best score across all models for a given area. The dagger symbol indicates that noise pollution was included in the experiment. The “Noise” column refers to the different noise characteristics: one feature for the average day noise and one for the average night noise (I), one feature which averages both day and night noise (II), one feature for the average day noise (III), one feature for the average night noise (IV) and no features for noise in the baseline model (-)
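For reference, the two metrics and the 5-fold evaluation named in the captions of Tables 6 and 7 can be sketched as below. This is a minimal illustration under our own assumptions: folds are formed by shuffling with a fixed seed, `fit` is a placeholder for any training routine that returns a prediction function, and `relative_reduction` is a hypothetical helper for expressing a gain over the baseline as a percentage.

```python
import random
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target (here euros)."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be non-zero)."""
    return 100.0 * mean([abs((t - p) / t) for t, p in zip(y_true, y_pred)])


def k_fold_indices(n, k=5, seed=0):
    """Split range(n) into k disjoint folds after a seeded shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(fit, X, y, k=5, seed=0):
    """Average MAE and MAPE over k folds; fit(X, y) returns predict(x)."""
    maes, mapes = [], []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_true = [y[i] for i in test_idx]
        y_pred = [predict(X[i]) for i in test_idx]
        maes.append(mae(y_true, y_pred))
        mapes.append(mape(y_true, y_pred))
    return mean(maes), mean(mapes)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

Running `cross_validate` once with the baseline features and once with the noise features, on the same folds, gives the pair of scores that `relative_reduction` compares.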

To measure the extent to which noise increases model performance and to investigate the correlation between noise and price through interpretability evaluation methods, we plot, for the best performing model of each area, the feature importance, permutation importance and partial dependence plots together with LIME and SHAP plots. We must mention that in the feature importance plots the measure of importance in XGBoost refers to the average gain across all splits in which a feature is used, while in LGBM it refers to the number of times a feature is used to split the data.
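Permutation importance, one of the interpretability methods listed above, is model-agnostic: it measures how much the error grows when the values of one feature are shuffled across properties. The sketch below is a simplified illustration using MAE as the score; the function signature is ours and does not correspond to any particular library.

```python
import random
from statistics import mean


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when each feature column of X is shuffled.

    predict: callable mapping one feature row to a prediction.
    X: list of feature rows; y: list of targets.
    Returns one importance value per feature column.
    """
    rng = random.Random(seed)

    def score(rows):
        return mean([abs(t - predict(r)) for r, t in zip(rows, y)])

    baseline = score(X)
    importances = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [list(row[:j]) + [v] + list(row[j + 1:])
                        for row, v in zip(X, column)]
            deltas.append(score(shuffled) - baseline)
        importances.append(mean(deltas))
    return importances
```

A feature that the model does not rely on yields an importance close to zero, whereas a feature such as noise would show a positive value if shuffling it degrades the predictions.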

5.3.1 Area A

In the central area of Thessaloniki, XGBoost outperforms all other tested models in terms of MAE when the radius is set to 50 meters. However, the improvement compared to the baseline model is marginal, approximately 3%. In this model, both average day and night noise are used as features in the training. In the feature importance plot of Fig. 10, the average day noise is ranked almost as high as the construction date, while the night noise is located a couple of ranks below. In the same plot, ‘SubTypeId_4’, which denotes properties classified as studios, is marked as the most important feature. The partial dependence plot in Fig. 11a shows that property prices increase as the noise increases, which confirms the initial claim that noise is valued positively in city centers, most probably because of their commercial character. This can be verified by the LIME weights in Fig. 12, where high noise values correspond to larger weights. The SHAP values in the beeswarm plot of Fig. 13 highlight this relationship too, since the left-hand side is mostly colored blue (low values), while the right-hand side is mostly red (high values). Lastly, in Fig. 11b the night noise does not appear to act on prices in the quieter areas. However, as we progress to noisier parts, night noise has a negative impact on pricing. This is not surprising, because during night time the commercial factor is less important. The final predictions in this area demonstrate that one of the main factors that increases MAE is the property size. As the size increases, the model performance slightly decreases.
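The partial dependence curves referred to above can be understood as averaged predictions: the feature of interest is forced to a fixed value for every property, the model predicts, and the predictions are averaged. The following is a minimal model-agnostic sketch; the toy model in the usage example is made up for illustration and is not one of the trained models.

```python
from statistics import mean


def partial_dependence(predict, X, feature, grid):
    """Average prediction over X with column `feature` set to each grid value.

    predict: callable mapping one feature row to a prediction.
    Returns a list of (grid_value, average_prediction) pairs.
    """
    curve = []
    for value in grid:
        modified = [list(row[:feature]) + [value] + list(row[feature + 1:])
                    for row in X]
        curve.append((value, mean([predict(r) for r in modified])))
    return curve


if __name__ == "__main__":
    # Hypothetical model: price rises with day noise (column 0), falls with column 1.
    toy_model = lambda r: 1000 + 10 * r[0] - 5 * r[1]
    rows = [[50, 1], [60, 3]]
    for db, price in partial_dependence(toy_model, rows, feature=0, grid=[40, 60, 80]):
        print(f"{db} dB -> {price}")
```

An upward-sloping curve for day noise corresponds to the behavior reported for area A, while a downward slope corresponds to a negative effect of noise on price.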

5.3.2 Area B

Once again, XGBoost with a noise radius of 100 meters demonstrates the best results for the Triandria, Toumpa, and Harilaou areas, reducing the MAE by 14.7% compared to the baseline. Additionally, in this area, the model performs considerably better in terms of absolute MAE values than in the other areas. One plausible reason is the lower standard deviation of price per square meter, as shown in Appendix D. Another contributing factor might be the different mix of property types in each area. For instance, in area A, 58% of properties are apartments, while 34% are studios. In contrast, in area B, 76% of properties are apartments, and only 12% are studios. As for the factors adversely affecting the model’s performance, the predictions indicate that energy efficiency plays a major role. As the property’s energy efficiency increases, the performance tends to decrease.

It should be noted that area B is the only area where setting the radius to 100 meters leads to better results than setting it to 50 meters. Even though the reasoning behind this remains unknown, some factors might be responsible. One of them is the density of the housing units in each area. In high density areas, choosing a larger radius might improve the model, since there are more neighboring houses within a certain radius. Also, the various topographical and geographical features can affect how sound propagates. For instance, area B is the only one located far from the coastline, while A and C are both seaside areas.

As in the previous area, this model utilizes both average day and night noise values. The night noise has similar importance to features such as the location and the heating type, as depicted in Fig. 14. In the same figure, the permutation importance plot shows that the overall noise affects the accuracy of the model to some degree. Even though, at first, day and night noise do not seem to influence prices, after a certain threshold in decibels they do have a negative effect on prices, which contrasts with the results of area A. One possible reason why noise does not cause price changes in the initial decibel ranges is that some parts of area B are close to the city center and, hence, noise is not directly considered a bad attribute. Once more, Figs. 15, 16 and 17 reinforce the previous findings about the generally negative correlation between noise and price.

5.3.3 Area C

In the Kalamaria area, the LGBM model achieves the best scores when trained with a noise radius of 50 meters and only the average day noise. Compared to the baseline model, there is an approximate performance gain of 2880 euros, representing a reduction of more than 9% in terms of MAE. Additionally, for MAPE, the improvement is marginally over 2%. Regarding factors that negatively influence the model’s performance, the predictions indicate a significant role of the construction date. Specifically, the model tends to have higher prediction errors for older properties.

As for the features, the noise is ranked almost as high as ‘Size’ with regard to importance in Fig. 18. This area is located far from the center and, as a consequence, the noise appears to influence price negatively at most noise ranges. In Fig. 19a, the price declines almost linearly as we move to noisier parts of the area, while the LIME weights in Fig. 19b indicate the preference of the model to assign higher prices to properties with relatively low surrounding noise. Once more, the SHAP values of Fig. 20 confirm these observations: high average day noise values cause price drops and low average noise values raise prices. Concerning the noise characteristic used, one plausible reason why the model incorporates only the day noise is that, contrary to the previous areas, Kalamaria also includes aviation noise. As can be seen in the corresponding heatmaps, aviation noise during the night increases the overall noise, which to some extent narrows the gap between day and night noise. This means that the two noise features are more correlated and, thus, one of them can potentially be redundant.

6 Conclusion

The main goal of this paper was to investigate how urban noise impacts residential property prices in the area of Thessaloniki. Currently, there is no publicly available spatial data regarding noise for the area of interest. Therefore, the first part of this work attempts to create a general purpose dataset that indicates the level of noise at given coordinates, taking advantage of official and public studies conducted by the Hellenic Ministry of Environment and Energy.

This new dataset is combined with the properties of the OpenHouse platform to train tree-based machine learning models in order to verify the importance of noise in housing price estimates. The assumption that noise might be interpreted differently depending on the location of the property led us to focus the experiments on three separate regions of Thessaloniki with dissimilar characteristics. The XGBoost and LGBM models attain the best results, which confirm, first, that noise does influence prices and, second, that it can affect some locations positively and others negatively. More specifically, property prices in the city center, as well as in locations in its vicinity, do increase as noise increases, which is probably a consequence of the overall commercial character of the area. In contrast, properties located far from the center are impacted negatively by noise. This makes sense considering that decentralized areas, such as suburbs, consist mainly of family homes, where quietness is more appreciated.

While this study provides valuable insights into how noise influences property prices, it is important to acknowledge its limitations. To begin with, the noise data used to train the models were sourced from the Hellenic Ministry of Environment and Energy and, then, reconstructed into a general purpose dataset. While we took measures to ensure data integrity, there may still be inherent limitations in the accuracy and completeness of the dataset. Furthermore, although the method employed to calculate noise pollution for each property is generally accepted, factoring in the surrounding buildings and accounting for elevation differences may lead to further refinement of the results. Lastly, despite the current sample size of housing properties appearing adequate for training, augmenting the dataset with additional samples could potentially enhance the model’s robustness.

As previously demonstrated, noise is a unique factor that can impact prices differently even within the same city. With this in mind, we strongly encourage the community to delve deeper into this subject, exploring various models, property types, and features. The newly reconstructed noise dataset can play a pivotal role in this endeavor, as its value extends far beyond real estate applications. Its versatility makes it an invaluable resource for a wide array of commercial projects and research pursuits, reaching even beyond fields directly associated with real estate.