1 Introduction

The planning of transport infrastructures is the first step that any administration must take to make them a reality. In most cases, more than 10 years may elapse between the tendering of the first planning studies and the commissioning of these public goods.

A technically sound understanding of the most relevant characteristics of an infrastructure in the planning phase can be vital for its capital expenditure and future sustainability.

In the case of roads and railway lines, the layout solution will always alternate ‘surface’ sections with ‘tunnel’ and ‘viaduct’ sections. It seems reasonable to think that, among several layout alternatives of the same typology offering citizens similar levels of service once in operation, the one that minimises the construction of singular works (tunnels and viaducts) will also be the one with the highest financial viability.

The planning, design and construction of linear infrastructures throughout the world follows very similar procedures. In each of the above-mentioned milestones, the topography of the terrain strongly conditions the type of layout solution adopted for each section of road or railway.

In this way, a linear infrastructure in operation can always be discretised into a succession of rectangular trapeziums. The geographical coordinates of two points of a route separated by a few metres, together with their elevations, allow the configuration of a trapezium to which a qualification label can be added: ‘surface’, ‘tunnel’ or ‘viaduct’. It is at this point that two possibilities open up:

  • The construction of databases with real information, targeted if desired by geographical areas and/or type of infrastructure.

  • The training of algorithms based on supervised learning techniques through classification methods.

The computational power currently offered by various machine-learning-as-a-service (MLaaS) platforms makes it possible to solve, very efficiently, everyday problems that were unmanageable a decade ago.

In 2013, Elon Musk, through Hyperloop Alpha [1], modernised a concept that humankind had been contemplating for more than 200 years. The idea of transporting people or freight on board pressurised capsules, moving inside tubes in a quasi-vacuum, is a natural evolution of the railway: it seeks to place the most advanced magnetic-levitation trains in a very low-pressure atmospheric environment. Since then, a global movement has emerged to drive this innovative project, linking the best-trained human capital, the most capable financial capital, and the boldest governments and institutions.

Such an infrastructure could present up to four different layout solutions (see Fig. 1).

Fig. 1
figure 1

Layout typologies in a Hyperloop infrastructure

In this context, it is worth studying the predictive potential that certain geometric patterns could have in the development of new infrastructures. Unmasking their existence could be put to the benefit of countries’ infrastructure planning services, providing taxpayers with greater efficiency in the management of public funds linked to this type of investment project.

Specifically, this research addresses the following questions: Is it possible for a machine learning algorithm to determine patterns that define the layout characteristics of an existing transport network? Is it possible to infer the layout characteristics of similar lines or transport networks not yet built anywhere in the world? At a time when public administrations have not yet endorsed the massive production of Hyperloop network studies, this work aims to provide developers of this type of project with innovative planning methodologies.

This article first offers a review of the scientific literature on the layout of linear infrastructures and on machine learning algorithms. Secondly, the article describes the proposed methodology. Subsequently, the structure and values of the different datasets are declared, linked to an existing high-speed rail network and to a plausible European Hyperloop network of more than 12,000 km in length. With the declared data, the described methodology is applied, its soundness is checked, and the results obtained are offered in tabulated form. The article concludes with a discussion of these results and the formulation of several conclusions.

2 Literature Review

This section reviews the scientific literature linked to the two key concepts on which this research article is based, namely, the layout of linear infrastructures and the technique of machine learning.

2.1 About Layout of Linear Infrastructures in Civil Engineering

There is an ancient legend that tells that the best route for a road between two points in a mountainous area is that defined by a loose donkey with freedom of movement heading to its desired destination.

This ‘liberal’ vision of the geometric design of a road gave way, in the eighteenth century, to more interventionist positions on the part of the public authorities, which materialised through the approval of technical regulations of various kinds. In this context, it can be pointed out that in Spain in 1761 a Royal Decree was published with the aim of creating ‘straight and solid roads’.

This regulation, emanating from the public sector, was based on the previous scientific production of military engineers of that time such as the French Henri Gaultier and Bernard Forest de Belidor, the Spanish Miguel Sánchez Taramas, or the English John Müller.

In line with European tradition, the American Association of State Highway and Transportation Officials (AASHTO) emerged in the United States in 1914, with the aim of shaping legislation and coordinating the various state policies on transport infrastructure in general. However, it was not until 1937 that a Special Committee on Administrative Design Policies within AASHTO [2] began to issue publications on geometric road design, as a common basis for the improvement of existing roads and the building of new ones.

In the second half of the twentieth century, emulating AASHTO, some countries began to issue their own technical regulations for the design of their land communication routes. Thus, in 1972 the National Association of Australian State Road Authorities published the ‘Guide Policy for Geometric Design of Freeways and Expressways’ [3, 4]. In 1973, the Indian Roads Congress published the ‘Recommendations about the Alignment Survey and Geometric Design of Hill Roads’ [3, 5]. As a final example, in 1976, the Mexican Ministry of Public Works published the first edition of its ‘Manual de Proyecto Geométrico de Carreteras’ [3, 6].

Regulations on the geometric layout of roads have historically been followed by technical regulations on the layout of linear railway works. In this way, a technical heritage has been consolidated globally and successfully transferred to different software packages for the layout of linear works, different versions of which have been appearing on the market since the beginning of the 1980s (ISTRAM®, CLIP®, CIVIL 3D® and ROADENG®, among others).

It was in 2014 when the Spanish company Actisa Ltd. began marketing the software TADIL® [7], introducing it to the market as ‘the first software for infrastructure design through artificial intelligence’. This software, available for both road and rail, allows a process of self-design of linear works from boundary conditions such as the start and end points of the infrastructure, the applicable geometric regulations, and environmental, climatic and geotechnical restrictions, among others.

Even though each of the above-mentioned software packages offers results rich in detail, obtaining the prior information required for their use still consumes considerable technical and economic resources.

2.2 Examples of the Application of Machine Learning in Civil Engineering

Machine learning is an increasingly common tool for solving problems related to civil engineering. Over the last decade, experts from all over the world in this field of knowledge have relied on machine learning techniques to try to answer many different questions. Different types of regression models have been used to identify significant impact factors associated with station ridership at different periods of the day [8], to predict rail rolling contact fatigue at an early stage [9] and to investigate the significant risk factors related to the car traffic environment, driver characteristics and vehicle types [10, 11]. Neural networks (including deep convolutional neural networks) have been used to predict the diameter of columns as a preliminary step in the design of jet grouting applications [12], to build urban accident prediction models [10] and to detect rail surface defects [13, 14]. Random forest models have been applied to predict highway crash likelihood with traffic data collected by discrete loop detectors and web-crawled weather data [15]. Different techniques associated with deep learning have been used to establish accurate prediction models for renewable energy resources [10, 16] and to inspect the rail surface with 3D laser cameras [13, 17].

In addition to the above references, which are more focused on technical aspects, machine learning has also been used for the study and prediction of construction phase costs of civil engineering projects [18]. Examples include the development of a cost estimation model for residential building [18, 19] and the prediction of costs associated with construction projects through back-propagation neural networks [18, 20].

2.3 Contributions with Respect to the State of the Art

As shown in the previous section, machine learning techniques are widely used to address different problems in civil engineering, but there is currently no similar approach to the objective set out in this article: to characterise the layouts of new linear infrastructures planned over a territory, first discovering and then using the patterns that emerge from the analysis of massive geometric data linked to other, similar infrastructures already in service.

Over time, the increasing availability of geographic information system (GIS) and building information modeling (BIM) data linked to linear infrastructures will make it possible to generate datasets with more instances than those used in this research. Given this changing situation, another novelty with respect to previous approaches will be to include in the methodological process the use of MLaaS platforms, which provide elastic and easily adaptable solutions.

As will be exemplified below, MLaaS tools enable the fast identification of changes in patterns and the simple re-training of predictive models that readjust to new realities. The low cost in economic and temporal terms, and the scalability and computational power offered by these platforms, make it possible, in a very short time, to test a wide variety of algorithms of different types with a proven record on classification problems. In this way, changes in the available data may or may not be associated with changes in the patterns and, in turn, changes in patterns may or may not imply changes in the typology and characteristics of the machine learning model to be used (see Fig. 2). The proposed methodology will facilitate the identification of these changes when they occur, and the design of algorithms that adapt to the new datasets, without affecting the quality of the predictions already offered.

Fig. 2
figure 2

Graphical exemplification of the re-training and re-testing of predictive models.

3 Research Methodology

The type of layout of a linear work is not unrelated to the amount of capital consumed in its execution. For the same connection on a Hyperloop line, a layout with 95% of its length on the surface is not the same as one with 85%. The cost of building elevated, underground or underwater sections determines the final cost of this type of work.

To undertake any work using machine learning techniques, it is first necessary to have a dataset with historical, already observed information. These data form the training basis for a machine learning algorithm, which at a later stage makes it possible to predict the layout linked to a specific section of a hypothetical line.

In this research, the original dataset contains quantitative and qualitative information linked to 14 line sections of the Spanish high-speed rail network. The 1835.84 km analysed, divided into segments of 33.33 m on average, generate a database of 55,080 records that constitute the observable reality. The choice of this type of railway infrastructure as a working element is justified by its similarity to a not-yet-built Hyperloop network. From the layout of a new transport network, which can be located anywhere in the world, its quantitative information can always be extracted. After a training process, the best-qualified algorithm is the one used to obtain the qualitative characteristics of the new network (length in ‘surface’, ‘underground’, ‘elevated’ or ‘underwater’ mode).

The practical application of this research has consisted in the evaluation of a hypothetical Hyperloop transport network located within the European Union. Our network is 12,067.29 km long (363,385 records or segments) and consists of 28 lines connecting all European conurbations with more than two million inhabitants.
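As an arithmetic check, the average segment length implied by the figures above can be recomputed directly (a minimal sketch using only the totals stated in the text):

```python
# Sanity check: average segment length implied by the stated totals.
SPANISH_KM, SPANISH_SEGMENTS = 1835.84, 55_080          # historical dataset
HYPERLOOP_KM, HYPERLOOP_SEGMENTS = 12_067.29, 363_385   # European network

avg_es = SPANISH_KM * 1000 / SPANISH_SEGMENTS     # metres per segment
avg_eu = HYPERLOOP_KM * 1000 / HYPERLOOP_SEGMENTS

print(round(avg_es, 2))  # ≈ 33.33 m, matching the stated average
print(round(avg_eu, 2))  # ≈ 33.21 m, essentially the same spacing
```

Both networks are thus discretised at practically the same resolution, which is what allows the algorithm trained on one to be applied to the other.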

The proposed methodology revolves around the geometric figure of the rectangular trapezium and its proven predictive potential. The geographical coordinates of the points of origin and end of each section of the route of a linear transport infrastructure, together with its elevation, provide the minimum information necessary for the elevation representation of its topographic profile. The figure below (see Fig. 3) shows an approach to this idea.

Fig. 3
figure 3

Plan layout of a linear infrastructure, its elevation, and the topographic profile of the terrain

In this context, each trapezium in the chain deduced from a line can be attributed various geometric magnitudes. Some of these magnitudes are related to the trapezium itself, such as the unevenness and the slope. Other magnitudes take into consideration the trapeziums before and after it, besides the trapezium itself. This link with the adjacent sections is made because of the rigidity of civil engineering linear works layouts, where the boundary conditions imposed on the radii of curvature in plan, or on the slopes in elevation, determine the final layout solution. Thus, in a given segment, the unevenness of the ground is as relevant as that of the sections in front of and behind it. With this methodology, a set of values linked to each section emerges, which takes into consideration the average unevenness of the terrain in the vicinity of the trapezium under analysis. As this characteristic has to be quantified for each proposed segment, two moving averages with predictive interest are calculated: one that takes into account the segment in question and the 99 segments ahead of it, and another that considers the same reference segment and the 99 segments behind it. At this point it is interesting to note that the technique that uses moving averages to forecast the evolution of a stock market value is also valid for forecasting the evolution of the layout of a linear work. By analogy between stock market curves and topographic profiles of the terrain, financial techniques converge here with civil engineering techniques.

Once the dataset has been constructed from real data of the Spanish network, together with the proposed fields suitably calculated, a machine learning algorithm is trained using two tools specific to these techniques: BigML® [21] and KNIME® [22]. In this context, it should be noted that BigML® offers a higher training power than KNIME®, which is why the algorithm generated by BigML® will serve as the reference, while the algorithm generated by KNIME® will serve as verification.

At a later stage, after the training and testing process, once the BigML® algorithm that offers the best evaluation metrics has been chosen, a first contrast test will be performed. This test consists of extracting the geometric and layout data linked to two project alternatives for a high-speed rail corridor between Santiago de Chile and Valparaiso. Since 2018, a Chinese-Chilean consortium has been promoting the development of such a connection in the southern country. In this context, a Spanish engineering company (Actisa, Ltd.) has been offering an open document [23] on its website since 2020, specifying the layout details of the two corridors (North and South) between the two cities.

The last phase of the research methodology consists of applying the algorithm selected as optimal through BigML® and KNIME® to a Hyperloop network in Europe. For the lines of the network that cross the Alps or run almost entirely through the Netherlands, the relative amount of each layout type in the solution can be intuited a priori.

Figure 4 shows a summary of the steps taken to implement this research methodology.

Fig. 4
figure 4

Steps of the methodology

4 Data

This section refers specifically to the process of obtaining the data necessary to train algorithms and apply them in specific situations.

4.1 Historical Dataset

The National Centre for Geographic Information in Spain [24] provides public access to the basic topographic data of its rail transport network. On 31 December 2018, the aforementioned railway network was 25,468 km long, divided into various types (funicular railway, rack railway, underground, tram, light rail and conventional train).

Looking at the part of the network classified as conventional train (24,863 km), one finds that only 4049 km have international (UIC standard) gauge and, of these, 2827 km were in use on the date indicated. In spite of the above, the process of analysing the open information did not allow all the evaluated kilometres of network to be made eligible, mainly due to formal defects in the data. Therefore, the final length considered is 1836 km (see Fig. 5). With this length as a basis, the preparation of a dataset with historical information is possible.

Fig. 5
figure 5

Baseline data for the construction of the historical dataset and picture of the 14 lines used

Dividing the above-referenced 1836 km into segments of approximately 33.33 m, a dataset of 55,080 instances is generated, each one with 14 features. The first feature has an instrumental purpose and identifies each trapezium with a correlative number within the line it is part of. The following four features, considered primary, correspond to the longitude and latitude of the initial and final points of each segment, and are obtained through GIS software. The next two features, derived from the previous four primary ones, represent the value of the ground elevation in metres. The following six features, which are the result of subsequent calculations, refer to magnitudes in metres of horizontal length, differences in vertical elevation and percentages of slope. The last feature, qualified as the target, refers to the engineering solution with which the layout of the line was resolved in the historical past (‘surface’, ‘underground’ or ‘elevated’).
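For illustration, one instance with this 14-feature structure can be sketched as a simple record (the field names are ours, not those of the published dataset, and the numeric values are invented):

```python
from collections import namedtuple

# One instance of the dataset: 1 identifier, 4 primary coordinates,
# 2 ground elevations, 6 calculated magnitudes and 1 target label.
Segment = namedtuple("Segment", [
    "seg_id",                 # correlative number within its line
    "long_a", "lat_a",        # initial point A (longitude, latitude)
    "long_b", "lat_b",        # final point B (longitude, latitude)
    "alt_a", "alt_b",         # ground elevation at A and B (m)
    "hd", "uss", "tsu", "s",  # horizontal length, unevenness, 3-segment unevenness, slope
    "maf", "mab",             # moving averages at the front / at the back
    "layout",                 # target: surface / underground / elevated
])

# Invented example values for a single ~33 m segment.
example = Segment(1, -3.6900, 40.4100, -3.6896, 40.4102,
                  601.2, 602.4, 33.3, 1.2, 2.5, 0.036, -4.1, 3.8, "surface")
```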

Figure 6 shows an example of the topology linked to a line belonging to a transport network, together with the notation of the defining elements of each rectangular trapezium considered in each instance.

Fig. 6
figure 6

Notation of defining elements of each rectangular trapezium considered in each instance

The six calculated features, linked to each instance of the datasets, are obtained according to the mathematical equalities, equations and boundary conditions set out below.

Equation (1) describes the geographical equality that exists between point A of a trapezium and point B of the preceding trapezium.

$$\left({Long}_{{a}_{N}},{Lat}_{{a}_{N}},{Alt}_{{a}_{N}}\right)\equiv \left({Long}_{{b}_{N-1}},{Lat}_{{b}_{N-1}},{Alt}_{{b}_{N-1}}\right)$$
(1)

In the same way, Eq. (2) describes the geographical equality that exists between point B of a trapezium and point A of the posterior trapezium.

$$\left({Long}_{{b}_{N}},{Lat}_{{b}_{N}},{Alt}_{{b}_{N}}\right)\equiv \left({Long}_{{a}_{N+1}},{Lat}_{{a}_{N+1}},{Alt}_{{a}_{N+1}}\right)$$
(2)

In both equalities, Long is the geographical Longitude, Lat is the geographical Latitude and Alt the Altitude. Equation (3) expresses as a result the difference in level in meters, in absolute terms, between points A and B of the same trapezium.

$${USS}_{N}=\left|{Alt}_{{b}_{N}}-{Alt}_{{a}_{N}}\right|$$
(3)

where USS means ‘Unevenness Single Segment’.

Equation (4) expresses as a result the difference in level in meters, in absolute terms, between point A of the anterior trapezium and point B of the posterior trapezium. This result is incorporated into the dataset as a specific feature of the central trapezium (N).

$${TSU}_{N}=\left|{Alt}_{{b}_{N+1}}-{Alt}_{{a}_{N-1}}\right|$$
(4)

where TSU means ‘Three Segment Unevenness’.

Equation (5) expresses as a result the distance in metres, obtained through the Haversine formula, between the ground projections of points A and B of the same trapezium.

$${HD}_{N}=6,371,000\times {\mathrm{cos}}^{-1}\left[\mathrm{cos}\left(rad\left({90}^{\circ}-{Lat}_{{a}_{N}}\right)\right)\times \mathrm{cos}\left(rad\left({90}^{\circ}-{Lat}_{{b}_{N}}\right)\right)+\mathrm{sin}\left(rad\left({90}^{\circ}-{Lat}_{{a}_{N}}\right)\right)\times \mathrm{sin}\left(rad\left({90}^{\circ}-{Lat}_{{b}_{N}}\right)\right)\times \mathrm{cos}\left(rad\left({Long}_{{a}_{N}}-{Long}_{{b}_{N}}\right)\right)\right]$$
(5)

where HD means ‘Haversine Distance’.
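Equation (5) can be implemented directly. The sketch below assumes Earth's mean radius expressed in metres (so that the result is commensurate with the elevation differences used elsewhere), and clamps the arccosine argument to guard against floating-point error:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)

def hd(lat_a, long_a, lat_b, long_b):
    """Ground distance between points A and B per Eq. (5) (spherical model)."""
    ca, cb = math.radians(90 - lat_a), math.radians(90 - lat_b)  # colatitudes
    arg = (math.cos(ca) * math.cos(cb)
           + math.sin(ca) * math.sin(cb)
           * math.cos(math.radians(long_a - long_b)))
    # Clamp to [-1, 1]: rounding can push the argument just outside acos's domain.
    return EARTH_RADIUS_M * math.acos(max(-1.0, min(1.0, arg)))

print(round(hd(40.0, -3.0, 41.0, -3.0)))  # one degree of latitude ≈ 111,195 m
```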

Equation (6) expresses as a result the slope, in absolute value, of the line joining points A and B of the same trapezium (a ratio that can be read as a percentage when multiplied by 100).

$${S}_{N}={USS}_{N}/{HD}_{N}$$
(6)

where S means ‘Slope’.

Equation (7) offers as a solution the difference in metres between the average altitude of a section and the moving average of the altitudes of up to 100 consecutive trapeziums in front of the reference trapezium in the same line. As expressed, this equation for the construction of the dataset is applied line by line. In this context, the last 100 trapeziums of each line must take into consideration the boundary conditions expressed below.

$${MAF}_{N}=\left(\frac{{Alt}_{{a}_{N}}+{Alt}_{{b}_{N}}}{2}\right)-\frac{1}{100-{K}_{N}}\sum \nolimits_{i=N}^{N+99-{K}_{N}}\left(\frac{{Alt}_{{a}_{i}}+{Alt}_{{b}_{i}}}{2}\right)$$
(7)
$${\text{where}}\left\{\begin{array}{l}1\le N\le {N}_{MAX}\\ {K}_{N}=0,\ {\text{provided that}}\ 1\le N\le {N}_{MAX}-99 \\ {K}_{N}=99-\left({N}_{MAX}-N\right),\ {\text{otherwise}}\end{array}\right.$$

and MAF means ‘Moving Average at the Front’.

Equation (8) offers as a solution the difference in metres between the average altitude of a section and the moving average of the altitudes of up to 100 consecutive trapeziums behind the reference trapezium in the same line (Moving Average at the Back, MAB). As expressed, this equation is applied line by line. In this context, for the first 100 trapeziums of each line, the boundary conditions established below must be taken into consideration.

$${MAB}_{N}=\left(\frac{{Alt}_{{a}_{N}}+{Alt}_{{b}_{N}}}{2}\right)-\frac{1}{\mathrm{min}\left({K}_{N},100\right)}\sum \nolimits_{i=1-N+{K}_{N}}^{N}\left(\frac{{Alt}_{{a}_{i}}+{Alt}_{{b}_{i}}}{2}\right)$$
(8)
$${\text{where}} \left\{\begin{array}{l}1\le N\le {N}_{MAX}\\ {K}_{N}=N,\ {\text{provided that}}\ 1\le N\le 100\\ {K}_{N}=2\times N-100,\ {\text{provided that}}\ 100\le N\le {N}_{MAX}\end{array}\right. ,$$

and MAB means ‘Moving Average at the Back’.
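Under these boundary conditions, MAF and MAB reduce to windowed means over up to 100 segments in front of or behind the reference segment. A minimal sketch (here `mids` is an assumed list holding the mean altitude (Alt_a + Alt_b)/2 of each segment of one line, and `n` follows the 1-based numbering of the equations):

```python
def maf(mids, n):
    """MAF: mid-altitude of segment n minus the mean of the window
    covering segment n and up to 99 segments in front of it."""
    window = mids[n - 1:min(n + 99, len(mids))]
    return mids[n - 1] - sum(window) / len(window)

def mab(mids, n):
    """MAB: mid-altitude of segment n minus the mean of the window
    covering segment n and up to 99 segments behind it."""
    window = mids[max(0, n - 100):n]
    return mids[n - 1] - sum(window) / len(window)

profile = [float(i) for i in range(1, 201)]  # toy, steadily climbing line
print(maf(profile, 1))    # 1 - mean(1..100)    = -49.5 (terrain rises ahead)
print(mab(profile, 200))  # 200 - mean(101..200) = 49.5 (terrain falls behind)
```

Near a line's ends the windows simply shrink, which is exactly what the K_N boundary conditions express.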

USS, HD and S all refer to intrinsic properties of a single trapezium. The TSU calculation, on the other hand, attributes to the central trapezium a characteristic derived from the existence of three consecutive trapeziums, linking in some way the layout solution of the central section with the layout solutions adopted for the anterior and posterior sections. This captures the trend in the transition from one type of layout to another. MAF and MAB operate in the same way as TSU, linking to each section under analysis the layout trend of the 99 sections in front of or behind it.

The data mining executed with the initial source of geographic information and the calculation of features with the help of the exposed mathematical formulation allow the creation of a historical dataset (Online Resource 1) in which the target feature (‘surface’, ‘underground’ or ‘elevated’) is known (see Fig. 7).

Fig. 7
figure 7

Aspect of the features and instances of the historical dataset

The relevant numerical information from this dataset is shown in Table 1.

Table 1 Statistical metrics for historical dataset

4.2 Algorithm Training

Once the historical dataset has been obtained, it is time to train the machine learning algorithm. The work process is proposed in two phases. In Phase I, algorithms of different typologies will be trained and evaluated. At the end of Phase I, the optimal algorithm for this research will be selected. Phase II will consist of the validation of the optimal algorithm through a contrast test, and the application of the algorithm to a plausible European Hyperloop network.

4.2.1 Phase I: Training and Selection of the Optimal Algorithm

The flow chart for Phase I (Training and Selection of the Optimal Algorithm) is shown in Fig. 8 (a). The training and selection processes are mainly carried out in MLaaS tools, which offer powerful computational and evaluation capabilities. This research uses two different MLaaS tools: BigML®, and in particular its OptiML® functionality, for the training and evaluation of the different algorithms, and KNIME® to test the conclusions obtained in BigML®, as exemplified in Fig. 8 (b).

Fig. 8
figure 8

Work process—Phase I

To start the training process, OptiML® asks the user for an error metric with which to perform its evaluations. In this case, the chosen metric is the area under the Receiver Operating Characteristic curve, ROC (AUC).

From that moment on, OptiML® works in two steps. In the first step, the ‘Bayesian parameter search’, the algorithm performs a series of initial random partitions of the historical dataset. Eighty percent of the data from each partition (training set) will be used to train models of very different typologies (decision trees, random forest, deep neural networks, and logistic regressions), applying Bayesian parameter optimisation techniques. The evaluation of each algorithm or model is performed with its corresponding test set, e.g. if an algorithm has been generated with a training set A, it is evaluated with its test set A. The result of the first step is a set of 389 models of different typologies with promising ROC (AUC) results.

In the second step, OptiML® performs ‘Monte Carlo cross-validation’. From N new partitions, OptiML® re-trains and re-evaluates the 389 referenced models, associating to each of them several ROC (AUC) results. With the average of these results, OptiML® provides the user with a ranking of the 101 best algorithms.
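The logic of this ‘Monte Carlo cross-validation’ step, repeated random train/test partitions with the metric averaged across rounds, can be sketched in plain Python. OptiML®'s internals are proprietary, so a trivial majority-class baseline stands in here for the re-trained models:

```python
import random
from collections import Counter

def monte_carlo_cv(labels, n_rounds=10, test_frac=0.2, seed=0):
    """Average test accuracy of a majority-class baseline over
    n_rounds independent random train/test partitions."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    scores = []
    for _ in range(n_rounds):
        rng.shuffle(indices)                       # fresh random partition
        cut = int(len(indices) * (1 - test_frac))  # e.g. 80% train, 20% test
        train, test = indices[:cut], indices[cut:]
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        scores.append(sum(labels[i] == majority for i in test) / len(test))
    return sum(scores) / len(scores)

toy = ["surface"] * 90 + ["underground"] * 7 + ["elevated"] * 3
print(monte_carlo_cv(toy))  # close to 0.9 on this imbalanced toy sample
```

Averaging over many random partitions is what makes the resulting ranking robust to any single lucky or unlucky split.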

Up to this point, the process has been fully automatic and has required only a few hours to complete. After completion, OptiML® provides the user with an interactive tool to analyse and compare the results of the 101 models using the most common evaluation metrics, not only ROC (AUC). Using this tool, the optimal model is selected for the case presented here.

The 10 best trained models offer very similar ROC (AUC) values (between 0.980 and 0.978). For this reason, another evaluation metric is drawn upon to determine the optimal algorithm: the average precision. The historical dataset is strongly biased towards ‘surface’ results, so the model must be precise in discerning whether a segment should be classified as ‘underground’ or ‘elevated’. Among the 10 best models according to ROC (AUC), the one with the highest average precision value (91.7%) is number 9, a random forest composed of 95 decision trees. This algorithm is the one chosen as optimal for this research.
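How per-class precision discriminates between such closely ranked models can be illustrated with a short computation (illustrative labels only; BigML®'s exact averaging convention is not reproduced here):

```python
def per_class_precision(y_true, y_pred):
    """Precision TP / (TP + FP) for each class that appears in the data."""
    classes = sorted(set(y_true) | set(y_pred))
    result = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        result[c] = tp / (tp + fp) if tp + fp else 0.0
    return result

# Invented predictions for six segments of an imbalanced sample.
y_true = ["SUR", "SUR", "SUR", "UDG", "UDG", "ELE"]
y_pred = ["SUR", "SUR", "UDG", "UDG", "UDG", "ELE"]
prec = per_class_precision(y_true, y_pred)
macro = sum(prec.values()) / len(prec)  # unweighted average precision
print(prec, round(macro, 3))
```

Because every class weighs equally in the unweighted average, a model that over-predicts the dominant ‘surface’ class is penalised even when its overall accuracy looks high.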

The following table shows a comparison, based on average precision, between the algorithm chosen as optimal for this research and the best model of each trained typology (see Table 2):

Table 2 Comparison of metrics between algorithms

To finalise Phase I, the optimal algorithm is replicated in another MLaaS platform: KNIME®. A random forest of 95 decision trees is built, modulated with the Bayesian parameters obtained by OptiML®. The similarity between the resulting evaluation metrics and those obtained in BigML® validates the choice of algorithm.

Details of the comparison between the optimal model and its replication in KNIME® are provided in section 5 of this article.

4.2.2 Phase II: Validation and Application of the Optimal Algorithm

Once the optimal algorithm has been determined and validated, the work process moves into Phase II: Validation and Application of the Optimal Algorithm. The optimal algorithm will be verified by using it in a contrast test and then applying it to a plausible European Hyperloop network. The work process of Phase II is shown in Fig. 9. Details of the results of this phase are provided in section 5 of this article.

Fig. 9
figure 9

Work process—Phase II

4.3 Evaluable Datasets

The evaluable datasets should have the same structure as that expressed for the historical dataset (see Fig. 10).

Fig. 10
figure 10

Structure of the evaluable dataset linked to the Chilean corridors and European Hyperloop network

In the case of the preliminary contrast test linked to the high-speed railway line between Santiago de Chile and Valparaiso, the BigML® algorithm evaluates a dataset of 2778 segments in the North Corridor and 3281 segments in the South Corridor. Table 3 shows relevant numerical information from this dataset.

Table 3 Statistical metrics for contrast dataset

With respect to the Hyperloop network of the 28 lines proposed for Europe, the algorithms trained in BigML® and KNIME® obtain the layout characteristics of a dataset of 363,385 instances or segments (Online Resource 2). Table 4 shows relevant numerical information from this dataset.

Table 4 Statistical metrics for European Hyperloop Network dataset

5 Results

5.1 Metrics and Qualification of the Trained Algorithms

Table 5 shows the results of the main error metrics for the predictive models, obtained from the work done in BigML® and KNIME®. The ROC (AUC) values point to a high technical reliability of the algorithms in the prediction process.

Table 5 Results of the training process

In Table 5, the acronym ‘SUR’ refers to ‘surface’, ‘UDG’ to ‘underground’, and ‘ELE’ to ‘elevated’.

5.2 Contrast Test

As already mentioned, there is self-design software on the market called TADIL®, whose owner published an application example [23] on its website in 2020. Specifically, the website offers a PDF file that documents very clearly the layout solutions linked to two high-speed rail corridor alternatives between Santiago de Chile and Valparaiso.

Within the scope of the proposed research, this example allows one to verify the reliability of the optimal algorithm, trained with geographical data from Spain (Northern Hemisphere), when applied to very distant routes (Southern Hemisphere) (see Table 6).

Table 6 Contrast test results

Because the PDF document linked to TADIL® provides only a limited number of reference points, the replicas of the lines drawn with GIS software and later evaluated with BigML® do not coincide with the originals 100% (Online Resource 3). In spite of this, the results offered by the algorithm on the rectified lineFootnote 3 agree closely with the results offered by TADIL®. In both cases, the layout of the new infrastructure presents similar orders of magnitude, confirming that a machine learning algorithm developed with the proposed methodology can be an alternative to other types of developments.

5.3 Characterisation of the European Hyperloop Network

Both the algorithm derived from BigML® and the one derived from KNIME® have allowed the characterisation of a 12,000-km proposal of this new transport network (see Table 5). As explained in section 4.2 of this research, the model trained in BigML® has been endorsed using the ‘Monte Carlo cross-validation’ method; that is, it has given optimal and consistent results for a significant number of random partitions of the data. The model trained with KNIME® has only been tested on a single partition (80–20%), so its results, while useful for validating the model-building algorithms of the OptiML® functionality, cannot be considered with the same level of confidence as the BigML® results.
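Monte Carlo cross-validation simply repeats the random train/test partition many times and checks that the scores stay consistent. It can be sketched with scikit-learn's `ShuffleSplit` as below; the data, split count and model are illustrative and do not reproduce BigML®'s internal procedure.

```python
# Sketch: Monte Carlo cross-validation — repeated random 80/20 partitions.
# Synthetic stand-in data; split count and model are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 3))
y = np.where(X[:, 0] > 0, "SUR", "UDG")  # toy two-class labels

# 10 independent random 80/20 splits; one accuracy score per split
mc_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=95, random_state=0), X, y, cv=mc_cv
)
print(f"mean accuracy {scores.mean():.3f} ± {scores.std():.3f}")
```

A model "endorsed" in this sense is one whose scores remain high and stable across all the random partitions, which is exactly the stronger guarantee the single 80–20% KNIME® split cannot provide.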

With this point understood, the constructive solution adopted is the one offered by the optimal algorithm trained in BigML®. The solution predicted by the KNIME® algorithm is used in this section to assess the consistency of the optimal algorithm's predictions.

Under these circumstances, a clear agreement can be observed between the layout solutions calculated by both algorithms (see Fig. 11):

Fig. 11
figure 11

Comparative results of the algorithms

The numerical details obtained by each algorithm for the amount of ‘surface’, ‘underground’ and ‘elevated’ layout, for the entire network and for each of its 28 lines, can be found in Table 7.

Table 7 Characterisation results
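The roll-up from segment-level predictions to per-line layout percentages, as reported in Table 7, can be sketched in plain Python: sum segment lengths per line and per predicted label, then normalise. The segment records below are invented for illustration.

```python
# Sketch: aggregate segment-level predictions into per-line layout shares,
# as in Table 7. The segment records below are invented for illustration.
from collections import defaultdict

# (line id, segment length in km, predicted label)
segments = [
    ("07", 12.4, "SUR"), ("07", 3.1, "UDG"), ("07", 1.5, "ELE"),
    ("19", 8.0, "SUR"), ("19", 2.0, "ELE"),
]

totals = defaultdict(float)                       # km per line
by_label = defaultdict(lambda: defaultdict(float))  # km per line and label
for line, length, label in segments:
    totals[line] += length
    by_label[line][label] += length

# Convert absolute kilometres into percentages per line
shares = {line: {lab: 100 * km / totals[line] for lab, km in labs.items()}
          for line, labs in by_label.items()}
for line, labs in sorted(shares.items()):
    print(line, {lab: round(pct, 2) for lab, pct in labs.items()})
```

Each line's percentages sum to 100 by construction, so the per-line rows of a table such as Table 7 can be checked for internal consistency the same way.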

5.4 Aspect of the Network and Final Results for Optimal Algorithm

The attached figure (see Fig. 12) shows a potential Hyperloop transport network in Europe.

Fig. 12
figure 12

Hyperloop network proposed for Europe (28 lines)

This network would directly serve all European population centres of more than 2 million people through 12,067.29 km of route. With its 28 lines and 28 stations, it would directly connect one-third of Europe's GDP and one-fourth of its workforce. In addition, almost one-fifth of the goods moved annually in the Union could benefit from the existence of this new mode of transport.

The algorithm designed in BigML® and selected as optimal could not learn segments with an ‘underwater’ layout (‘UDW’) during its training process, because this type of layout does not exist in the analysed railway network. Therefore, the sections of lines ‘07’, ‘19’ and ‘20’ that clearly run over maritime areas have been classified as ‘UDW’ (see Table 8).

Table 8 Final results

6 Discussion

This research proposes the application of dynamic machine learning techniques. Current MLaaS platforms allow a kind of machine learning within machine learning to be applied: the best algorithm for a given dataset is a specific one, and it may change to another if the dataset itself changes in a relevant way (e.g. more observations).
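This "machine learning within machine learning" idea can be sketched as an automated model-selection loop that is simply re-run whenever the dataset changes, so the model declared "optimal" may itself change. The candidate models, data and scoring below are illustrative; they do not reproduce OptiML®'s internals.

```python
# Sketch: dynamic model selection — re-run over candidate algorithms each time
# the dataset changes. Candidates and data are illustrative, NOT OptiML®.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

def select_best(X, y):
    """Score each candidate by Monte Carlo CV accuracy; return the winner."""
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=95, random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    scores = {name: cross_val_score(m, X, y, cv=cv).mean()
              for name, m in candidates.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # non-linear boundary (XOR-like)
best, scores = select_best(X, y)
print(best, scores)
```

Re-invoking `select_best` after appending new observations is all that "dynamic" means here: the winning candidate is a property of the current dataset, not a fixed choice.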

It is a sign of high reliability that the algorithm selected as optimal (a random forest of 95 decision trees) offers a solution with a 99.37% surface layout for the Katowice–Warsaw line, situated in one of the most extensive plains in Europe.

It is likewise a sign of reliability that the optimal algorithm offers a solution with a 98.73% surface layout for the Geraardsbergen–Amsterdam line.

It is equally a sign of reliability that the optimal algorithm offers, for the 124.90-km stretch from the vicinity of Bussoleno (Italy) to the vicinity of Le Pont de Beauvoisin (France), a solution with a 67.89% ‘underground’ layout and an 8.01% ‘elevated’ layout. It should be noted that this particular section represents the passage of Hyperloop through the Alps, the most mountainous area in Europe, and is part of line 28, which would link Vercelli to Paris.

Training algorithms on databases built from high-speed rail networks makes it possible to characterise high-speed rail lines, or (by similarity) Hyperloop lines, not yet built.

Everything seems to indicate that the same methodology applied in this research, on databases linked to linear high-speed railway infrastructures, could be applied to databases built on other subjects: regional or suburban railway networks, motorways, conventional roads, oil pipelines, or high-voltage electricity networks, for example.

7 Conclusion

It is shown that the rectangular trapezium as a unit of characterisation of a linear infrastructure, enriched with only a few topographical features, holds enormous predictive potential. Its use, combined with supervised machine learning methodologies, yields excellent results in predicting the layout solution for unknown segment observations. In the case presented, a dataset elaborated from public and open data, provided with a few calculated features, has been able to train two predictive algorithms with very good to excellent evaluation metrics.
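The trapezium characterisation described above can be sketched as a small feature-construction function: two consecutive route points yield the horizontal length (the trapezium base), the elevation difference between its parallel sides, and the resulting gradient. The feature names are illustrative, assuming projected coordinates and elevations in metres; the paper's exact feature set is defined in its methodology.

```python
# Sketch: one route segment as a rectangular trapezium built from two
# consecutive points. Feature names are illustrative; coordinates are assumed
# projected (metres), elevations in metres.
import math

def trapezium_features(x1, y1, z1, x2, y2, z2):
    """Two endpoints -> trapezium features for one segment."""
    base = math.hypot(x2 - x1, y2 - y1)   # horizontal length (trapezium base)
    dz = z2 - z1                          # elevation difference (parallel sides)
    slope = dz / base if base else 0.0    # gradient along the segment
    return {"length_m": base, "delta_z_m": dz, "slope": slope}

print(trapezium_features(0.0, 0.0, 100.0, 300.0, 400.0, 112.0))
# → {'length_m': 500.0, 'delta_z_m': 12.0, 'slope': 0.024}
```

A labelled dataset for supervised learning is then just many such feature rows, each tagged ‘surface’, ‘underground’ or ‘elevated’ from an existing line.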

The current offer of MLaaS platforms and open source software provides the average user with enormous storage and computing capacity at low or zero cost, in addition to numerous functionalities and evaluation and validation tools. This allows the methodology outlined here to be easily accessed and quickly implemented. The algorithm trained in BigML® and validated with KNIME® has been able to assign layout characteristics to a plausible Hyperloop network of 12,067.29 km in a few seconds.

The robustness of the algorithms obtained, the low cost involved in obtaining them, and their rapid application to new routes yet to be constructed (easily outlined and characterised with GIS tools) make this methodology a more than interesting alternative when preparing and enriching planning studies, preliminary projects or audits of construction projects. The testing and discrimination of many different layout alternatives, in a reliable, quick and inexpensive way, will make it possible to focus resources only on those alternatives that are really suitable for the intended objectives.

Although, as demonstrated, the main topographical and geometrical characteristics of a section have been more than sufficient to build solid and consistent predictive models, it should not be forgotten that supervised learning is a living method that can always be improved. In this sense, it is likely that, as far as this methodology is concerned, improvement will come from enriching the datasets with other features that are already perceived, a priori, as having predictive potential. These could include the geological characteristics of the terrain, where the first obstacle to overcome will be obtaining information that is not currently available as open data. Other features to be explored will be ‘subjective’ ones, specific to each type of linear work.

Incorporating the specific conditioning factors for each type of linear work that lead to one constructive solution or another will undoubtedly enrich the algorithms and give them greater precision. Thus, the use of the algorithms in real projects, and collaboration with professionals in the sector, will be the best way to identify these subjective features and relate them to the cornerstone of this methodology: the rectangular trapezium.