1 Research Background

As China’s economy and urbanization level have continued to grow in recent years, urban traffic congestion and air pollution caused by the rising number of private cars on roadways have become major issues that impede healthy and sustainable urban development. The public transit-oriented development (TOD) [1] model can effectively address such problems. In TOD areas, an above- and underground three-dimensional transportation network, combined with a variety of convenient and efficient transportation modes, allows residents to reach most of their destinations and eases their dependence on private cars, thus reducing urban traffic congestion [2].

Connecting neighboring public transportation stations via convenient and efficient modes, such as cycling or walking, improves land use efficiency and shortens the residents’ travel time. These affordable modes of travel may also open up more public resources, such as employment and educational opportunities. However, while TOD modes bring many benefits to society, they also have some negative effects, such as reduced urban livability due to an increased population density and higher social crime rates [3]. There are also some cases that suggest that TOD modes are not entirely suitable for urban development [4]. The aforementioned problems have raised questions regarding the impact of TOD modes in terms of the travel behavior of residents and the quality of life in the area.

Researchers have proposed several TOD station division methods in an attempt to gain a better understanding of the significance of TOD modes, yet there is no definite conclusion on which method is the most effective [5]. Furthermore, resident self-selection behavior has a significant effect on the study of resident travel behavior in TOD areas [6,7,8]. Based on this, and considering Beijing, China, as an example, this study first uses cluster analysis to classify rail transit stations and subsequently estimates the economic background of travelers based on the rental and sale prices of surrounding houses in the residential area, as well as their spending capabilities on living and entertainment. Further, we apply the propensity score matching method to address problems caused by the self-selection behavior of residents. Finally, based on the matched results, conclusions are drawn after comparing and analyzing the travel behavior of residents in TOD and non-TOD areas. The findings of this paper will aid the evaluation of the influence of TOD modes on traffic composition around rail transit stations and the travel behavior of residents.

2 TOD Station Classification

Owing to the diversity and macroscopic nature of the evaluation objectives, it is rather difficult to classify rail transit stations using a single indicator. Therefore, a comprehensive set of indicators for the classification of rail transit stations must be introduced and established.

The impact area of each station was determined first. Using the “walking radius” method, it was defined as the circular area within which passengers can walk to the adjacent rail transit station in a reasonable amount of time [9, 10]. The average walking time for able-bodied individuals is generally 6 min at a speed in the range of 80–85 m/min, which corresponds to a walking distance of roughly 480–510 m. The impact area of the station was therefore defined as a circular area with a radius of 500 m around the station.

Drawing on recent research achievements in this field both domestically and abroad, and on an understanding of the status and future objectives of TOD mode development, this study selected ten widely used and representative evaluation indexes under three primary indexes, namely accessibility, diversity, and connectivity [11,12,13]. The value of each index was calculated from the land use data and POI (point of interest) data within the impact area of the station. The primary and secondary indexes used in this study are listed in Table 1.

Table 1 TOD station evaluation indexes

For convenience of analysis, each of the three primary index values of a transit station was obtained by normalizing the corresponding secondary index values and summing them. The three primary indexes were then used to classify the stations via cluster analysis. Using the K-means method, 232 stations of the Beijing Subway system were clustered into two groups, with the clustering optimized through multiple trials. The clustering results are shown in Table 2.
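The normalization and clustering steps described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: min-max normalization, equal-weight summation, the toy index values, and the helper names `min_max`, `primary_index`, and `kmeans` are all assumptions, since the text does not state which normalization or software was used.

```python
import random

def min_max(column):
    """Scale raw secondary-index values to [0, 1] (assumed normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]

def primary_index(secondary_columns):
    """Sum the normalized secondary indexes station by station."""
    normalized = [min_max(col) for col in secondary_columns]
    return [sum(vals) for vals in zip(*normalized)]

def kmeans(points, k=2, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move again
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Hypothetical (accessibility, diversity, connectivity) values for six stations.
stations = [
    (0.2, 0.3, 0.1), (0.3, 0.2, 0.2), (0.1, 0.1, 0.3),  # low index values
    (1.8, 2.1, 1.9), (2.0, 1.7, 2.2), (2.2, 2.0, 1.8),  # high index values
]
labels = kmeans(stations, k=2)
```

Rerunning `kmeans` with several `seed` values and keeping the most compact partition is one simple way to mimic the "multiple trials" mentioned in the text.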

Table 2 Clustering results for Beijing Subway stations

In this study, category 1 stations were regarded as non-TOD stations because of their relatively low index values and their lack of diversity, accessibility, and connectivity. In contrast, category 2 stations were regarded as TOD stations because of their greater diversity, accessibility, and connectivity. The scatter plot in Fig. 1 describes all types of transit stations; stations with similar index characteristics are clustered into the same category, indicating that the above index values can be used to distinguish different types of stations.

Fig. 1 Scatter plot of the clustering results for categories 1 and 2

3 Study on the Self-Selection Behavior of Residents

3.1 Hypothesis Test

Generally, the income level and occupation of residents have an impact on their travel behavior because people tend to choose their residential, working, and entertainment areas according to their economic attributes [14, 15]. People with higher spending capacity are more likely to live in areas where TOD is denser, resulting in relatively shorter travel distances. In this case, self-selection behavior will affect the study of the relationship between TOD and the travel behavior of rail transit passengers. Therefore, it is essential to verify the self-selection behavior of residents caused by their economic characteristics.

Resident travel survey data containing vehicle, household, and occupational information were used to perform a linear regression in SPSS. The average travel distance of an individual was the dependent variable, and the remaining information served as the independent variables. The independent variables fell into five categories, namely gender, monthly salary (individual after-tax income), occupation, car ownership, and e-bike ownership, and multi-level variables such as monthly salary and occupation were further subdivided. These variables were transformed into dummy variables before conducting the regression. The description of the processed dummy variables is shown in Table 3.

Table 3 Description of the model’s dummy variables

The linear regression model is given in Eq. 1, and the results are shown in Table 4.

$$ y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{{\text{p}}} x_{{\text{p}}} $$
(1)

where y is the individual average travel distance, β0 is the constant term, and xp and βp denote the p-th independent variable and its regression coefficient, respectively.
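The dummy coding and the regression in Eq. 1 can be illustrated with a small self-contained sketch. It is not a reproduction of the SPSS analysis: the salary bands, the car ownership flags, the travel distances, and the helper names `one_hot`, `solve`, and `ols` are all hypothetical, and the coefficients are obtained from the normal equations of ordinary least squares.

```python
def one_hot(values, categories):
    """Dummy-code a categorical variable; the first category is the baseline."""
    return [[1.0 if v == c else 0.0 for c in categories[1:]] for v in values]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares estimate of (beta_0, ..., beta_p) for Eq. 1."""
    X = [[1.0] + list(r) for r in rows]  # prepend the constant term
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical survey records: salary band, car ownership, travel distance (km).
salary = ["low", "mid", "high", "low", "mid", "high"]
car = [0, 0, 0, 1, 1, 1]
distance = [5.0, 7.0, 9.0, 8.0, 10.0, 12.0]

rows = [d + [float(c)] for d, c in zip(one_hot(salary, ["low", "mid", "high"]), car)]
beta = ols(rows, distance)  # order: constant, salary=mid, salary=high, car
```

Because the toy distances were generated exactly from a linear rule, the fitted coefficients recover that rule; real survey data would also need standard errors and significance tests, which this sketch omits.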

Table 4 Regression analysis results of resident travel survey data

The conclusion drawn from the above regression analysis results is that the resident travel distance is significantly influenced by monthly salary, occupation, and car ownership but not by age or gender. These findings also indicate that the economic attributes of residents have a significant impact on travel distance. Therefore, matching the economic attributes of residents is a reasonable way to eliminate the effect of resident self-selection behavior.

3.2 Determination of the Economic Attributes of Travelers

In this study, travelers’ residential, working, and entertainment areas were identified based on the relevant information from the Beijing Subway AFC (automated fare collection) data, including the station name, the exact time of passengers entering and leaving the station, and the frequency of using the station.

In general, the living and consumption costs of residents can be used as valid indicators of their economic status [16]. The rental and sale prices of houses in residential areas were used to calculate living costs, with the relevant data being obtained from housing rental information websites. Consumption costs were measured using commodity prices from stores in the residential and entertainment areas. These stores were divided into three categories, namely, shopping, catering, and entertainment, and the relevant data were obtained from websites such as dianping.com. In total, the economic attributes of travelers were represented by 16 living and consumption cost variables. The mean and variance were also calculated.

3.3 Propensity Score Matching

The resident self-selection behavior was controlled using a propensity score matching approach [17]. Propensity score matching refers to pairing individuals with the same or similar background characteristics in the experimental and control groups to balance the distribution of confounding factors between the groups [18]. The principle behind this method is that certain events cause a fraction of the subjects in the sample to change while having little or no effect on the other subjects. In this study, two samples were observed: those affected by the event and those not affected. The difference between the two samples before and after the event is the “net effect” of the event on the subjects [19].

First, the covariates affecting whether travelers live in TOD areas were selected. The parameter values of the covariates were estimated using a logistic model, which gives the probability of a resident living in a TOD area based on their economic characteristics; this probability was used as the propensity score. The traditional logistic model mentioned above is expressed as follows:

$$ p(X_{i} ) = p\left( {D_{i} = 1|X_{i} } \right) = \frac{{\exp (\beta X_{i} )}}{{1 + \exp (\beta X_{i} )}} $$
(2)

where p(Xi) represents the propensity score of individual i, a continuous probability value ranging from 0 to 1. Di is a binary variable, with Di=1 and Di=0 representing residents living in TOD and non-TOD areas, respectively. Xi denotes the control variables reflecting individual economic attributes, including the living and consumption cost variables mentioned in Section 3.2, and β denotes the corresponding coefficients.
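As a rough illustration of Eq. 2, the sketch below fits a logistic model by gradient ascent and evaluates the resulting propensity scores. The two standardized covariates (standing in for the 16 living and consumption cost variables), the toy data, the added intercept, the learning rate, and the function names are all assumptions made for the example; the text does not specify how the model was estimated.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, d, lr=0.1, epochs=5000):
    """Estimate beta in Eq. 2 by gradient ascent on the log-likelihood.

    X: covariate rows; d: 1 for TOD residents, 0 for non-TOD residents.
    An intercept column is added here (an assumption of this sketch).
    """
    rows = [[1.0] + list(x) for x in X]
    beta = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for row, di in zip(rows, d):
            err = di - sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity(beta, x):
    """p(X_i) = P(D_i = 1 | X_i) for one covariate row x."""
    return sigmoid(sum(b * v for b, v in zip(beta, [1.0] + list(x))))

# Hypothetical standardized (housing cost, consumption cost) per traveler.
X = [(-1.2, -0.8), (-0.5, -1.0), (0.4, 0.6), (-0.2, 0.5),
     (0.1, -0.3), (0.9, 0.2), (1.3, 1.1), (0.7, -0.4)]
d = [0, 0, 0, 0, 1, 1, 1, 1]

beta = fit_logistic(X, d)
scores = [propensity(beta, x) for x in X]
```

In practice a statistics package would normally be used for this step, since it also reports standard errors and convergence diagnostics.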

Next, the nearest neighbor matching method, one of the most commonly used methods for propensity score matching, was used to find, for each individual in the TOD group, the individual in the non-TOD group with the closest propensity score. In other words, each individual in the experimental group is paired with the control group individual whose propensity score differs least from its own. The subjects in the experimental group are first randomly sorted. Then, starting with the first subject, a match (the subject with the closest propensity score) is sought in the control group, and the process continues until every subject in the experimental group has a match in the control group.

Suppose Pi and Pj denote the propensity scores of subjects i and j, and I1 and I0 denote the experimental and control groups, respectively. A control group subject j (j∈I0) is included in the neighbor relationship C(Pi) as the match for the experimental group subject i (i∈I1) when the absolute difference between their propensity scores is the smallest among all candidate pairs for i. This is expressed in Eq. 3.

$$ C(P_{i} ) = \mathop {\min }\limits_{j} ||P_{i} - P_{j} ||,\quad j \in I_{0} $$
(3)
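The matching rule in Eq. 3 can be sketched in a few lines. The propensity scores below are made up, and the `replace` flag is an assumption of this sketch: the text does not say whether a control subject may be matched more than once, so both variants are offered.

```python
import random

def nearest_neighbor_match(treated, control, replace=True, seed=0):
    """Map each treated index i to the control index j minimizing |P_i - P_j|.

    treated, control: lists of propensity scores for the two groups.
    replace=False removes a control subject from the pool once it is matched.
    """
    order = list(range(len(treated)))
    random.Random(seed).shuffle(order)  # random processing order, as in the text
    available = set(range(len(control)))
    matches = {}
    for i in order:
        if not available:
            break  # no control subjects left to match
        j = min(sorted(available), key=lambda c: abs(treated[i] - control[c]))
        matches[i] = j
        if not replace:
            available.discard(j)
    return matches

# Hypothetical propensity scores.
tod_scores = [0.62, 0.35, 0.80]
non_tod_scores = [0.30, 0.58, 0.77, 0.10]

matches = nearest_neighbor_match(tod_scores, non_tod_scores)
unique_matches = nearest_neighbor_match(tod_scores, non_tod_scores, replace=False)
```

The random ordering only affects the result when `replace=False`, because an early subject can then take a control that a later subject would also have preferred.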

Finally, the difference in travel behavior characteristics between TOD and non-TOD residents is calculated to indicate the impact of TOD modes on residents’ travel behavior. Equation 4 describes the difference between the experimental and control groups by introducing the average treatment effect (ATE).

$$ {\text{ATE}} = E(\Delta |p(x),D = 1) = E(y_{1} |p(x),D = 1) - E(y_{2} |p(x),D = 0) $$
(4)

where p(x) denotes the propensity score, D is the binary variable in Eq. 2, and y1 and y2 are the outcome variables (travel behavior characteristics) of the experimental and control groups, respectively. The second term must be estimated from the matched control group because the outcome that TOD residents would have had in a non-TOD area is counterfactual and unobservable. The propensity score is defined as the conditional probability (shown in Eq. 5) that a traveler with given background variables (i.e., socioeconomic attribute variables) lives in a TOD area.

$$ p(x) = Pr(D = 1|x) = E(D|x) $$
(5)
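Once matched pairs are available, the sample analogue of Eq. 4 is the mean outcome difference over those pairs. The sketch below assumes a `matches` mapping from treated to control indices and uses made-up travel distances; the function name `matched_effect` is hypothetical.

```python
def matched_effect(treated_outcomes, control_outcomes, matches):
    """Mean difference in outcome between each treated subject and its match."""
    diffs = [treated_outcomes[i] - control_outcomes[j] for i, j in matches.items()]
    return sum(diffs) / len(diffs)

# Hypothetical travel distances (km) and matched pairs (TOD index -> non-TOD index).
tod_distance = [8.0, 10.0, 6.0]
non_tod_distance = [11.0, 9.0, 7.5, 14.0]
matches = {0: 1, 1: 0, 2: 2}

effect = matched_effect(tod_distance, non_tod_distance, matches)
```

A negative value here would mean that TOD residents travel shorter distances than their matched non-TOD counterparts.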

The propensity score matching results are presented in Table 5.

Table 5 Propensity score matching results

Table 5 shows that travel behavior differs significantly between travelers in TOD and non-TOD areas both before and after the propensity score matching method was used to control the self-selection behavior of residents. The differences in both travel distance and trip frequency appear to decrease after matching. The difference in travel distance after matching is 68.9% of that before matching, implying that the remaining 31.1% (100% − 68.9%) is primarily attributable to resident self-selection behavior. Regarding trip frequency, TOD has very little impact both before and after matching.

4 Travel Behavior Analysis

4.1 Travel Distance Analysis

Travel distance is a continuous dependent variable influenced by multiple factors. Therefore, a multiple linear regression model based on the propensity score matched results, with travel distance as the dependent variable and information such as the built environment of subway stations as the independent variables, was selected to analyze travelers in the two areas.

Multiple linear regression models, which are widely used in solving practical problems, are generally defined according to Eq. 6.

$$ Y = \beta X + \varepsilon $$
(6)

where Y=(y1,y2,y3,…,yn)T represents the individual travel distances; X=(1,x1,x2,x3,…,xp)T is the set of independent variables, including the built environment variables around the station and individual economic attributes, which are listed in Tables 6 and 7; n is the number of individuals; and p is the number of independent variables. β=(β0,β1,β2,…,βp)T represents the parameters to be estimated, and ε=(ε1,ε2,ε3,…,εn)T represents the random errors.
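Models of the form in Eq. 6 are commonly estimated by ordinary least squares, although the text does not state the estimator used. As a sketch under that assumption, let the n observations be stacked row-wise into an n × (p+1) design matrix, written here as $\mathbf{X}$ (this symbol and the hat notation are introduced only for illustration). The least-squares estimate is then

$$ \hat{\beta } = \left( {\mathbf{X}^{T} \mathbf{X}} \right)^{ - 1} \mathbf{X}^{T} Y $$

provided that $\mathbf{X}^{T} \mathbf{X}$ is invertible, i.e., that no independent variable is an exact linear combination of the others.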

Table 6 Analysis results of travel distance in TOD areas
Table 7 Analysis results of travel distance in non-TOD areas

The travel distance analysis results for TOD and non-TOD areas are presented in Tables 6 and 7, respectively.

The above results show that there are significant differences in the number of connecting bus routes, density of non-motorized lanes, density of motorized lanes, and land use entropy between the two areas. Resident travel distance is mainly affected by the number of connecting bus routes and the density of non-motorized lanes in TOD areas, whereas it is more affected by the density of motorized lanes in non-TOD areas. This suggests that residents living in TOD areas prefer to travel by public transportation or non-motorized vehicles, whereas residents living in non-TOD areas prefer to drive their cars.

Furthermore, land use entropy can also be used to measure the number of land function categories as well as the distribution uniformity of each land use type in a given area [20]. The effect of land use entropy on travelers in TOD areas is significantly higher than that in non-TOD areas, indicating that land use functions and facilities in TOD areas are relatively well-developed, which could be one of the key reasons why TOD areas are more appealing to residents.
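To make the entropy measure concrete, the sketch below uses one common formulation, the Shannon entropy of land-use area shares normalized by the logarithm of the number of categories. This formulation, the function name, and the toy areas are assumptions; the exact definition adopted in [20] may differ.

```python
import math

def land_use_entropy(areas):
    """Normalized Shannon entropy of land-use shares.

    areas: area (or floor space) per land-use category within the impact area.
    Returns 0 for a single use and 1 for a perfectly even mix.
    """
    k = len(areas)
    total = sum(areas)
    if k < 2 or total <= 0:
        return 0.0
    shares = [a / total for a in areas if a > 0]  # zero shares contribute nothing
    return -sum(p * math.log(p) for p in shares) / math.log(k)

# Hypothetical areas (ha) for residential, commercial, and office use.
single_use = land_use_entropy([30.0, 0.0, 0.0])
even_mix = land_use_entropy([10.0, 10.0, 10.0])
partial_mix = land_use_entropy([20.0, 8.0, 2.0])
```

Under this formulation, a higher value reflects both more land function categories in use and a more even distribution among them, which matches the interpretation given in the text.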

4.2 Trip Frequency Analysis

Poisson regression is a regression model for count-valued dependent variables and is often adopted to analyze the factors influencing the number of occurrences of an event per unit time. A Poisson regression model was fitted for travelers in TOD and non-TOD areas, using trip frequency as the dependent variable and the built environment around the station as the independent variables.

Assume that the number of occurrences of an event in a fixed time span is y, which follows a Poisson distribution with expectation μ. The probability mass function of the distribution is as follows.

$$ \Pr \left( {Y = y|\mu } \right) = \frac{{e^{ - \mu } \mu^{y} }}{y!},\quad y = 0,1,2,3, \ldots $$
(7)

where Y is the random variable for the number of occurrences of the event. For individual i, the number of occurrences yi is assumed to follow a Poisson distribution with expectation μi, whose probability mass function is given by Eq. 8.

$$ \Pr \left( {Y_{i} = y_{i} |\mu_{i} } \right) = \frac{{e^{{ - \mu_{i} }} \mu_{i}^{{y_{i} }} }}{{y_{i} !}},\quad y_{i} = 0,1,2,3, \ldots $$
(8)

where the expectation value μi can be estimated as follows.

$$ \ln \mu_{i} = X_{i}^{\prime } \beta^{\prime } = \sum\limits_{j = 1}^{k} {\beta_{j} x_{ji} } $$
(9)

where xji denotes the explanatory variables listed in Table 8, and βj denotes the corresponding coefficient, which represents the effect of each explanatory variable on the explained variable.
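Equations 7 to 9 can be tied together in a short sketch that evaluates the Poisson probability, applies the log link, and fits the coefficients by gradient ascent on the log-likelihood. The toy trip counts, the single standardized covariate, the constant column in each row, the learning rate, and the function names are all assumptions for illustration; the text does not describe the estimation routine.

```python
import math

def poisson_pmf(y, mu):
    """Pr(Y = y | mu) from Eq. 7, evaluated in log space for stability."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def expected_count(beta, x):
    """mu_i = exp(sum_j beta_j * x_ji), the log link of Eq. 9."""
    return math.exp(sum(b * v for b, v in zip(beta, x)))

def fit_poisson(X, y, lr=0.05, epochs=5000):
    """Maximize the Poisson log-likelihood by gradient ascent (illustrative)."""
    beta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for row, yi in zip(X, y):
            resid = yi - expected_count(beta, row)  # observed minus fitted mean
            for j, v in enumerate(row):
                grad[j] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical data: each row is (constant, standardized number of bus routes).
X = [[1.0, -1.0], [1.0, -0.5], [1.0, 0.0], [1.0, 0.5], [1.0, 1.0]]
trips = [1, 2, 2, 4, 6]

beta = fit_poisson(X, trips)
fitted = [expected_count(beta, row) for row in X]
```

A positive coefficient on the covariate means the expected trip frequency is multiplied by exp(βj) for each one-unit increase in that variable, which is how the coefficients in a table of Poisson regression results are usually read.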

Table 8 Analysis results of trip frequency in TOD areas

The trip frequency analysis results for TOD areas are presented in Table 8.

According to these analysis results, trip frequency was not associated with any of the factors for travelers living in non-TOD areas; thus, the corresponding results are not shown. In contrast, as travelers in TOD areas mostly travel for commuting, shopping, entertainment, and other purposes, the number of jobs in the area and the number of connecting bus routes have a significant impact on the trip frequency.

5 Conclusions

Researchers in related fields have been studying the impact of TOD modes on resident travel behavior ever since the concept of TOD was first proposed and adopted. In this study, the impact area of TOD stations was first determined, and then indicators were selected from three primary TOD evaluation indicators, namely, accessibility, diversity, and connectivity. These indicators were used to create a TOD index evaluation system, after which the stations were classified via cluster analysis. To gain a better understanding of the difference in travel behavior between residents living in TOD and non-TOD areas, the propensity score matching method was applied to attenuate the impact of self-selection behavior, which also affects travel behavior.

The results show that residents in TOD and non-TOD areas have different travel behavior characteristics. Travelers in TOD areas live in areas with relatively well-developed land functions and facilities, and they prefer to take public transportation or non-motorized vehicles, whereas travelers in non-TOD areas are more inclined to use private cars. In TOD areas, the number of jobs and connecting bus routes have a significant impact on trip frequency, whereas in non-TOD areas, built environment information has no discernible impact on trip frequency.

In summary, this study contributes to the research and analysis of the impact of TOD modes on the travel behavior of residents and provides guidance to urban planners for the implementation of TOD modes. Future work can focus on the following two directions. First, the connotation of TOD constantly changes over time, so it must be studied in the context of the current TOD development situation in China. Second, owing to the lack of personal data regarding Beijing residents, this paper cannot prove the accuracy of the economic attributes of travelers inferred from their choices of residential and entertainment areas. Accordingly, relevant data can be obtained via questionnaires or cell phone signaling data in future studies.