1 Introduction

Critical urban infrastructure systems, such as transportation, power, water supply, waste, and emergency response systems, face an increasing number of threats and disruptions from both natural and human-made disasters (Henry and Ramirez-Marquez 2012; Bruyelle et al. 2014; Tang, Zhao, et al. 2023), resulting in potential economic losses and ever-growing costs for maintenance and rebuilding (Mao and Li 2018; Huck and Monstadt 2019; Liu and Song 2020; Serdar et al. 2022). The concept of “resilience” has been regarded as a promising means of mitigating such negative impacts in future planning and development (Francis and Bekera 2014; Gheorghe et al. 2018; Tang, Wei, et al. 2023). To build and manage resilience in urban systems adequately, it is fundamental to quantify this compound system attribute appropriately and accurately (Cai et al. 2018).

Owing to the essential fuzziness and multidimensionality of resilience, quantifying this concept has been an intractable issue in the field of disaster risk reduction and safety science (Tang et al. 2020). To this end, scholars from different disciplines have proposed a large number of models and metrics to quantify this fuzzy system property, such as the well-known resilience triangle, and this trend has become increasingly pronounced.

Resilience quantification methods and tools can be classified into different categories, such as (1) performance-based methods, including system state- or phase-based models (Ouyang and Dueñas-Osorio 2012; Panteli et al. 2017; Cai et al. 2018; Najarian and Lim 2019; Cai et al. 2021; Patriarca et al. 2021; Yin et al. 2022; Jiang et al. 2023); (2) indicator-based assessment frameworks (Arreguín-Sánchez 2014; Ghaffarpour et al. 2018; Martišauskas et al. 2018; Wang et al. 2019); and (3) qualitative methods, such as scorecard assessments based on expert knowledge (Cutter et al. 2008; Pettit et al. 2010; Shirali et al. 2013), among others. Performance-based metrics are a widely applied group of tools that quantify resilience by analyzing time-series changes in a system’s functionality level. Many of these metrics are claimed to be generic methods for resilience quantification that can be applied to different infrastructure systems directly or tailored with minimal changes. Owing to their simplicity and generality, metrics from this group have been widely applied in many urban systems to date.

For users, the rapidly growing number of quantification metrics certainly enriches the toolbox available for measuring and benchmarking resilience for better management purposes, and the generic character of these performance-based metrics makes them easy to apply. However, such variety also creates several obstacles in hands-on applications. For example, do these metrics measure the same concept of resilience, given that they are all devised for measuring system resilience? If not, which metric should be chosen under what data and constraint conditions? Only a small proportion of scholars have attempted to answer the first question. Clédel et al. (2020) showed that many metrics tend to evaluate resilience similarly; however, metrics can also be heterogeneous and do not capture the same resilience when different definitions of resilience are considered. Nishimi et al. (2021) compared psychological resilience metrics using correlation and regression methods and found that resilience metrics were weakly to moderately correlated. Notably, scholars have not yet reached a consensus on the above questions, and a comprehensive comparative analysis focusing on general performance-based metrics is still missing. More importantly, effective guidance on the use of these metrics is also scarce. Hence, the research questions and objectives tackled in this study are as follows.

RQ1:

Do these popular performance-based metrics measure the same resilience? That is, as all of them are designed for measuring resilience and most are claimed to be “generic” resilience quantification methods, do they capture the same essence and measure the same thing? (Objective: to compare, from a retrospective perspective, the similarities and differences of the selected metrics using a combined quantitative-qualitative approach.)

RQ2:

Which metric should be used under what circumstance? That is, if these popular performance-based metrics essentially measure different concepts of resilience, which one should be used based on different application contexts? (Objective: developing an effective guideline for users based on the findings from addressing the first research question.)

To address these two questions, a metric comparison study was conducted in which simple, yet effective, quantitative methods were combined with an in-depth qualitative diagnosis. Based on their prominent reputation and applicability, 12 popular and recently proposed performance-based system resilience metrics were selected. The associations and differences among the selected metrics were examined using time-series system performance data collected from the civil aviation system in China’s mainland during the first two waves of COVID-19, from January 2020 to March 2021. To examine the temporal scale sensitivity of these metrics and their associations, all metrics were compared across two temporal scales: the daily scale and the weekly scale. The advantages and disadvantages of each metric were comparatively discussed, and a brief user guideline was proposed based on the findings.

The main contributions of this study can be summarized as follows:

  • To the best of our knowledge, this study is one of the first to compare performance-based resilience metrics for critical infrastructure systems via a combined quantitative and qualitative approach.

  • This study provides a guideline for practitioners that outlines “when to choose which” when measuring, building, and managing resilience in infrastructure systems, helping users identify the merits and disadvantages of mainstream tools in the resilience quantification toolbox.

  • This study offers an introspective analysis of diverse tools and lays an important foundation for an improved understanding of performance-based resilience in urban systems, which provides insightful evidence for expanding how critical infrastructure resilience is measured, perceived, and managed.

2 The Selected Metrics

Several parameters common to the selected metrics are listed below; other metric-specific parameters are explained where they first appear.

\({t}_{0}\): The time of disruption occurrence
\({t}_{d}\): The time of stable performance immediately after the disaster
\({t}_{pr}\): The time to start the recovery
\({t}_{r}\): The end time of recovery
\({P}_{0}\): The performance before the disaster occurred
\({P}_{d}\): The stable performance immediately after the disaster
\({P}_{pr}\): The performance at the beginning of recovery
\({P}_{r}\): The steady performance after recovery
\(P\left(t\right)\): The actual level of performance
\(TP\left(t\right)\): The target performance

After reviewing the mainstream resilience metrics in the literature from 2003 to 2022, 12 well-established resilience metrics were selected for comparison based on both metric characteristics (that is, algorithm applicability) and popularity in the literature. A summary of the characteristics of the selected performance-based metrics is provided in Table 1. Essential explanations of these 12 metrics are provided below. This article does not intend to give a full review of each metric and model; further details can be found in the corresponding cited references.

Table 1 Summary and comparisons of the basic characteristics of the selected metrics

\(R1\) refers to the “resilience-triangle metric” proposed by Bruneau et al. (2003), which quantifies the total loss of system resilience as the area between 100% functionality and the actual time-dependent performance. It is a widely applied metric that has inspired many “area-based” resilience metrics (for example, \(R2\)) (Eq. 1):

$$R1={\int }_{{t}_{0}}^{{t}_{r}}\left[100-P\left(t\right)\right]dt$$
(1)

\(R2\), developed by Ouyang and Dueñas-Osorio (2012), is closely related to \(R1\) but differs in how resilience is represented: instead of depicting resilience loss, \(R2\) provides a straightforward measurement of resilience capacity (Eq. 2):

$$R2=\frac{{\int }_{{t}_{0}}^{{t}_{r}}P\left(t\right)dt}{{\int }_{{t}_{0}}^{{t}_{r}}TP\left(t\right)dt}$$
(2)
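
For illustration, the following minimal Python sketch computes R1 and R2 from a sampled performance curve by trapezoidal integration; the curve values are hypothetical, and a constant 100% target performance is assumed when no target curve is supplied.

```python
import numpy as np

def _area(y, x):
    """Trapezoidal area under y(x)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def r1_resilience_loss(t, p):
    """R1 (Eq. 1): area between 100% functionality and the actual performance P(t)."""
    return _area(100.0 - p, t)

def r2_resilience_ratio(t, p, tp=None):
    """R2 (Eq. 2): area under P(t) divided by the area under the target TP(t).
    A constant 100% target is assumed when no target curve is given."""
    if tp is None:
        tp = np.full_like(p, 100.0)
    return _area(p, t) / _area(tp, t)

# Hypothetical one-cycle performance curve (% functionality), sampled daily
t = np.arange(11.0)
p = np.array([100, 80, 55, 40, 40, 50, 65, 80, 90, 97, 100], dtype=float)
print(r1_resilience_loss(t, p))   # resilience loss in percent-days (larger = worse)
print(r2_resilience_ratio(t, p))  # value in (0, 1]; 1 means no loss
```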

\(R3\) proposed by Francis and Bekera (2014) depicts resilience from different performance components, namely, rapidity, robustness, recovery, and time to recovery (Eqs. 3 and 4):

$$R3={S}_{p}\times \frac{{P}_{r}}{{P}_{0}}\times \frac{{P}_{d}}{{P}_{0}}$$
(3)
$${S}_{p}=\left\{\begin{array}{ll}\left({t}_{\delta }/{t}_{r}^{*}\right)\exp\left[-\alpha \left({t}_{r}-{t}_{r}^{*}\right)\right], & {t}_{r}\ge {t}_{r}^{*}\\ {t}_{\delta }/{t}_{r}^{*}, & \mathrm{otherwise}\end{array}\right.$$
(4)

where \({S}_{p}\) is the recovery speed factor, \({t}_{\delta }\) is the maximum acceptable time before recovery ensues, \({t}_{r}^{*}\) is the time when the initial operation is performed to stabilize the system to an intermediate state, \({t}_{r}\) is the time when the system is restored to the final state, and \(\alpha \) is the decay factor. Because its strong inherent assumptions are difficult to satisfy in practice, \({S}_{p}\) is set to 1 in this study.

\(R4\) proposed by Zobel and Khansa (2014) is a conceptual extension of the resilience triangle (Eq. 5):

$$R4=1-\sum_{i}\left({X}_{i}+{X}_{i}^{\prime}\right){T}_{i}/\left(2{T}^{*}\right)$$
(5)

where \({X}_{i}\) is the percentage of initial performance loss, \({X}_{i}^{\prime}\) is the performance loss remaining in the system immediately before event \(\left(i+1\right)\), \({T}_{i}\) is the recovery time, and \({T}^{*}\) is a predetermined time horizon that allows multiple cases to be compared on the same relative scale.

\(R5\), proposed by Henry and Ramirez-Marquez (2012), measures the recovery (robustness) of the system via the difference between the after-recovery (pre-event) performance and the worst performance (Eq. 6):

$$R5=\frac{{P}_{r}-{P}_{d}}{{P}_{0}-{P}_{d}}$$
(6)
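
Because R3 (with \({S}_{p}\) fixed to 1, as assumed in this study) and R5 require only the turning-point performances, they can be computed directly from \({P}_{0}\), \({P}_{d}\), and \({P}_{r}\). The following sketch illustrates this with hypothetical values.

```python
def r3_francis_bekera(p0, pd, pr, sp=1.0):
    """R3 (Eq. 3) with the speed factor S_p fixed to 1, as assumed in this study."""
    return sp * (pr / p0) * (pd / p0)

def r5_henry_ramirez(p0, pd, pr):
    """R5 (Eq. 6): recovered share of the lost performance."""
    return (pr - pd) / (p0 - pd)

# Hypothetical turning points (% functionality): pre-event, worst post-event, post-recovery
p0, pd, pr = 100.0, 40.0, 95.0
print(r3_francis_bekera(p0, pd, pr))  # 0.38
print(r5_henry_ramirez(p0, pd, pr))   # approximately 0.917
```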

Toroghi and Thomas (2020) proposed \(R6\), which is the weighted average of \({R}_{a}\) (robustness), \({R}_{b}\) (redundancy), \({R}_{c}\) (resourcefulness), \({R}_{d}\) (rapidity), and \({R}_{e}\) (readjust capacity). The end-users of the system are divided into three categories (priority, emergency, and routine) and are comprehensively considered through the above five properties according to Eq. 7:

$$R6=Weighted\,Avg.\left[{R}_{a},{R}_{b},{R}_{c},{R}_{d},{R}_{e}\right]$$
(7)

\(R7\) was proposed by Ayyub (2014), where a failure event occurs at time \({T}_{0}\), with a failure duration of \({\Delta T}_{f}\); after a continuous recovery time (\({\Delta T}_{r}\)), the system returns to its final steady state, as expressed in Eqs. 8–10:

$$R7=\frac{{T}_{0}+F\Delta {T}_{f}+R\Delta {T}_{r}}{{T}_{0}+\Delta {T}_{f}+\Delta {T}_{r}}$$
(8)
$$F=\frac{{\int }_{{t}_{0}}^{{t}_{d}}P(t)dt}{{\int }_{{t}_{0}}^{{t}_{d}}TP(t)dt}$$
(9)
$$R=\frac{{\int }_{{t}_{d}}^{{t}_{r}}P(t)dt}{{\int }_{{t}_{d}}^{{t}_{r}}TP(t)dt}$$
(10)

\(R8\) developed by Chen and Miller-Hooks (2012) measures the resilience of multi-modal transportation systems, which is defined as the share of expected post-disaster demands that the system meets under a given condition (Eq. 11):

$$R8=E\left(\sum_{w\in W}{d}_{w}/\sum_{w\in W}{D}_{w}\right)=\frac{1}{\sum_{w\in W}{D}_{w}}E\left(\sum_{w\in W}{d}_{w}\right)$$
(11)

where \({d}_{w}\) and \({D}_{w}\) are the maximum demand levels at the post- and pre-disaster stages, respectively. In this study, \(R8\) was computed from the initial performance before disruption and the stable performance after recovery.

\(R9\) proposed by Arjomandi-Nezhad et al. (2020) measures rapidity by dividing the recovery performance by the average time of performance loss (Eq. 12):

$$R9=\left({P}_{0}-{P}_{d}\right)/\left[\frac{{\int }_{{P}_{d}}^{{P}_{0}}tdP}{\left({P}_{0}-{P}_{d}\right)}\right]$$
(12)

which can be rewritten as \({R9}^{*}\), as shown in Eq. 13:

$${R9}^{*}=\frac{{\left({P}_{0}-{P}_{d}\right)}^{2}}{\left({t}_{r}-{t}_{d}\right)\times {P}_{0}-{\int }_{{t}_{pr}}^{{t}_{r}}P\left(t\right)dt-{\int }_{{t}_{d}}^{{t}_{pr}}P\left(t\right)dt}=\frac{{\left({P}_{0}-{P}_{d}\right)}^{2}}{\left({t}_{r}-{t}_{d}\right)\times {P}_{0}-{\int }_{{t}_{d}}^{{t}_{r}}P\left(t\right)dt}$$
(13)

\(R10\) was proposed by Nan and Sansavini (2017) (Eq. 14):

$$R10=\frac{{P}_{d}\times {RAPI}_{rec}\times RA}{{RAPI}_{dis}\times TAPL}$$
(14)

where \({RAPI}_{dis}\) (\({RAPI}_{rec}\)) is the rapidity in the disruptive (recovery) phase, defined as the average slope of the system performance curve during that phase; \(RA\) is the ratio of recovered performance to lost performance; and \(TAPL\) is the time-averaged performance loss of the system, computed by dividing the area enclosed between the initial and actual performance curves by the elapsed time.
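
A minimal sketch of how R10 can be assembled from these components is given below; using simple average slopes for \(RAPI\) and integrating \(TAPL\) from \({t}_{0}\) to \({t}_{r}\) are assumptions of the sketch rather than prescriptions of the original metric, and the example curve is hypothetical.

```python
import numpy as np

def r10_nan_sansavini(t, p, t0, td, tpr, tr):
    """Sketch of R10 (Eq. 14) assembled from RAPI_dis, RAPI_rec, RA, and TAPL."""
    p0, pd, pr = np.interp([t0, td, tr], t, p)
    rapi_dis = (p0 - pd) / (td - t0)      # average slope of the disruptive phase
    rapi_rec = (pr - pd) / (tr - tpr)     # average slope of the recovery phase
    ra = (pr - pd) / (p0 - pd)            # recovered share of the lost performance
    mask = (t >= t0) & (t <= tr)
    loss = p0 - p[mask]                   # performance loss relative to the initial level
    area = np.sum((loss[1:] + loss[:-1]) * np.diff(t[mask])) / 2.0
    tapl = area / (tr - t0)               # time-averaged performance loss
    return pd * rapi_rec * ra / (rapi_dis * tapl)

t = np.arange(11.0)
p = np.array([100, 80, 55, 40, 40, 50, 65, 80, 90, 97, 100], dtype=float)
print(r10_nan_sansavini(t, p, t0=0.0, td=3.0, tpr=4.0, tr=10.0))
```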

Tierney and Bruneau (2007) showed that resilience can be measured by the performance of a system after a disaster and the time taken to return to the initial level of performance. Accordingly, \(R11\) was proposed by Attoh-Okine et al. (2009). Much like the resilience triangle, \(R11\) measures the time-dependent performance capacity, but expresses it as the average percentage functionality per unit of time (Eq. 15):

$$R11=\frac{{\int }_{{t}_{0}}^{{t}_{r}}P\left(t\right)dt}{100\left({t}_{r}-{t}_{0}\right)}$$
(15)

\(R12\) proposed by Patriarca et al. (2021) is calculated by the following Boolean relationship among the absorptive, adaptive, and restorative capacities of the system (Eq. 16):

$$R12=Ab\vee \left(Ad\wedge Rec\right)=Ab+Ad\times Rec-Ab\times Ad\times Rec$$
(16)

The absorptive capacity of the system is calculated according to Eq. 17, where \({C}_{Ab}\) is the aging factor:

$$Ab=\left(\frac{{P}_{d}}{{P}_{0}}\right)\times {C}_{Ab}$$
(17)

The adaptive capacity is characterized by the ratio of the time interval spent in a stable post-disruption state (\({t}_{pr}-{t}_{d}\)) to the duration of a full cycle (\({t}_{r}-{t}_{0}\)) (Eq. 18):

$$Ad=1-\left(\frac{{t}_{pr}-{t}_{d}}{{t}_{r}-{t}_{0}}\right)$$
(18)

The recovery capacity is calculated according to Eq. 19, where \({C}_{T}\) is the recovery duration factor:

$$Rec=\frac{\arctan\left[\frac{{P}_{r}-{P}_{pr}}{\left(\frac{{t}_{r}-{t}_{pr}}{{t}_{r}-{t}_{0}}\right)}\right]}{90}\times {C}_{T}$$
(19)
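
The sub-capacities and their Boolean combination can be computed directly from the turning points, as sketched below; setting the aging factor \({C}_{Ab}\) and the recovery duration factor \({C}_{T}\) to 1, and evaluating the arctangent in degrees to match the division by 90 in Eq. 19, are assumptions of this illustration.

```python
import math

def r12_patriarca(p0, pd, ppr, pr, t0, td, tpr, tr, c_ab=1.0, c_t=1.0):
    """Sketch of R12 (Eqs. 16-19) from absorptive, adaptive, and restorative capacities."""
    ab = (pd / p0) * c_ab                              # absorptive capacity, Eq. 17
    ad = 1.0 - (tpr - td) / (tr - t0)                  # adaptive capacity, Eq. 18
    slope = (pr - ppr) / ((tr - tpr) / (tr - t0))      # normalized recovery slope
    rec = math.degrees(math.atan(slope)) / 90.0 * c_t  # restorative capacity, Eq. 19
    return ab + ad * rec - ab * ad * rec               # Boolean combination, Eq. 16

# Hypothetical turning points and times for one disruption-recovery cycle
print(r12_patriarca(p0=100, pd=40, ppr=45, pr=95, t0=0, td=3, tpr=5, tr=10))
```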

3 Data

In this section, the details of the data and the process of data cleaning and treatment are introduced. To evaluate the performance of the selected metrics, China’s civil aviation system under the impact of the COVID-19 pandemic was used as the case study. Data on daily new diagnosed cases and flight on-time execution rates were collected for 427 days, from 20 January 2020 to 21 March 2021 (Fig. 1). Daily new diagnosed cases in China’s mainland were obtained from the daily reports of the National Health Commission of the People’s Republic of China (2021). The flight execution rates were collected from flight stewards and the Operation and Management Center of the Civil Aviation Administration of China (2021). The execution-rate data were used to characterize the performance of China’s civil aviation system.

Fig. 1 Civil aviation execution rate under COVID-19

As shown in Fig. 1, COVID-19 had a significant impact on China’s aviation industry. Following the national alert regarding human-to-human transmission of COVID-19 on 20 January 2020, the civil aviation execution rate in China plummeted from 93.6% to 20.2% within 24 days. Only a few weeks later, with the acceleration of full-population screening and rapid treatment responses, the epidemic was gradually brought under control, and the flight on-time execution rate gradually recovered. However, recurrent outbreaks, together with the associated travel restrictions and epidemic control policies, caused repeated fluctuations of varying magnitude in aviation system performance until early March 2021.

In preliminary data processing, moving average, exponential, and Savitzky-Golay smoothing methods were considered for reducing possible noise in the data. Savitzky-Golay smoothing yielded the best outcome, retaining the extrema and time-scale information while suppressing noise. In addition, to test the temporal sensitivity of the metrics, the whole observation period was divided into multiple cycles at two temporal scales: the daily scale (smoothing span = 1 day) and the weekly scale (smoothing span = 7 days). The performance between two local peaks was defined as an effective resilience cycle, comprising disruption and recovery phases. In total, there were 59 and 134 cycles at the weekly and daily scales, respectively. These performance curves fluctuated to various degrees, covering full recovery, insufficient recovery, and adapted recovery, and thus span the range of system performance scenarios needed to test the selected metrics.
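
A minimal sketch of this preprocessing pipeline is given below; the synthetic series, window lengths, and polynomial order are illustrative assumptions, and the peak-to-peak segmentation mirrors the cycle definition described above.

```python
import numpy as np
from scipy.signal import savgol_filter, find_peaks

# Synthetic stand-in for the 427-day execution-rate series (%); purely illustrative
rng = np.random.default_rng(0)
days = np.arange(427)
rate = 60 + 30 * np.cos(days / 25) + rng.normal(0, 3, days.size)

# Savitzky-Golay smoothing at two temporal scales (window lengths are assumptions)
daily = savgol_filter(rate, window_length=5, polyorder=2)
weekly = savgol_filter(rate, window_length=21, polyorder=2)

# Each stretch between two successive local peaks is one resilience cycle
peaks, _ = find_peaks(weekly)
cycles = list(zip(peaks[:-1], peaks[1:]))
print(len(cycles), "cycles identified")
```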

4 Results

In this section, the similarities and differences of the metrics are compared using a combination of quantitative and qualitative approaches. The 12 selected metrics were implemented using the collected data, and correlation analysis and statistical hypothesis testing (that is, the independent-samples Kruskal-Wallis test) were applied to examine the associations and differences between metrics. Finally, the similarities and differences of the metrics are analyzed qualitatively.

4.1 Quantitative Analyses

In this section, we present the results from the quantitative analyses and compare the metrics based on the quantification outcomes.

4.1.1 Quantification Outcomes

Figure 2 shows the quantified resilience values of the 12 selected metrics across the two temporal scales. Because the value ranges of the metrics differed markedly, the metrics were grouped by value range in the subplots. The results show that, at both temporal scales, R2, R3, R4, R7, R8, and R12 demonstrated similar trends, indicating that these metrics quantify resilience in a similar way. In contrast, R1, R5, R9, and R10 produced extreme values while remaining at a low level across most cycles, and, interestingly, R5 and R10 exhibited roughly opposite trends. Among all metrics, R11 demonstrated a unique trend that was considered incomparable with the other metrics, implying that this metric has a distinct underlying mechanism for formulating resilience. In addition, R1, R9, and R10 returned negative values for some cycles, seemingly contradicting the definition of resilience.

Fig. 2 Quantification outcomes for all metrics across daily and weekly cycles

4.1.2 Association Analysis

A correlation analysis, an approach shown to be reliable in many studies (Nishimi et al. 2021; Vishnu et al. 2023), was conducted to test whether the visual trends observed above could quantitatively confirm the relationships between metric pairs. Before calculating the correlation coefficients, the normality of the obtained resilience indices at both temporal scales was tested using the Kolmogorov-Smirnov test. Because most of the calculated metric outcomes were not normally distributed, Spearman’s correlation coefficient was applied.
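
The procedure can be sketched as follows; the per-cycle metric table is filled with random placeholder values purely so that the snippet runs end to end, and standardizing each column before the Kolmogorov-Smirnov test is an implementation choice of this sketch.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Placeholder table: one row per resilience cycle, one column per metric (R1-R12)
rng = np.random.default_rng(1)
resilience = pd.DataFrame(rng.random((59, 12)), columns=[f"R{i}" for i in range(1, 13)])

# Kolmogorov-Smirnov normality check for each metric (against a standard normal)
for name, col in resilience.items():
    _, p = stats.kstest((col - col.mean()) / col.std(ddof=0), "norm")
    print(f"{name}: KS p-value = {p:.3f}")

# Spearman rank correlation matrix for all metric pairs
corr = resilience.corr(method="spearman")
print(corr.round(2))
```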

Figure 3 shows the correlation matrix for the pairwise analysis of all metrics at the weekly scale. From the results, 23 metric pairs demonstrated a very strong positive correlation, five a strong positive correlation, two a moderate positive correlation, three a weak positive correlation, and five a very weak positive or no correlation. In contrast, 11 metric pairs showed a very strong negative correlation, four a strong negative correlation, one a moderate negative correlation, seven a weak negative correlation, and five a very weak negative or no correlation. Table 2 summarizes the correlation results and lists the metric pairs in each category. At the weekly scale, the ratio between positive and negative correlations was roughly 33:23 (excluding the very weak or no correlation pairs), indicating a higher rate of positive correlations among the tested metrics in terms of resilience outcomes.

Fig. 3 Correlation matrix for all metrics on a weekly scale

Table 2 Summary of correlation results and metric pairs

The daily-scale results showed a similar overall pattern of correlation pairs, with an even larger share of positive correlations: the ratio between positive and negative correlations was 34:16, indicating that more metrics shared similar time-series trends in their resilience quantification. One possible explanation for this difference between the weekly and daily cases is that the daily scale contains more resilience cycles, so the correlation tests at the daily scale are based on a larger sample than those at the weekly scale.

4.1.3 Significance of Differences

Because the correlation analyses only indicated whether metric pairs showed similar trends in their time-series quantification results, it remained necessary to explore whether the quantified resilience indices differed significantly from one series to another. Accordingly, the statistical differences between metric pairs were tested using the independent-samples Kruskal-Wallis test, a nonparametric test for non-normally distributed samples that has been shown to be reliable in other studies (Sherwani et al. 2021). Again taking the weekly scale as the example, the differences across all 12 metrics were found to be statistically significant (p < 0.05). Furthermore, a pairwise diagnosis was performed for more detailed comparisons. Figure 4 overlays the results of the correlation and significance tests to facilitate cross-comparison.
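
A sketch of the pairwise testing step is given below; as before, the per-cycle metric table contains random placeholder values so the snippet runs end to end.

```python
from itertools import combinations
import numpy as np
import pandas as pd
from scipy import stats

# Placeholder weekly-scale indices: one row per cycle, one column per metric
rng = np.random.default_rng(2)
resilience = pd.DataFrame(rng.random((59, 12)), columns=[f"R{i}" for i in range(1, 13)])

alpha = 0.05
significant, not_significant = [], []
for a, b in combinations(resilience.columns, 2):
    _, p = stats.kruskal(resilience[a], resilience[b])
    (significant if p < alpha else not_significant).append((a, b))

print(f"{len(significant)} pairs differ significantly, "
      f"{len(not_significant)} do not (out of 66)")
```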

Fig. 4 Correlation matrix and statistical tests on pairwise differences for all metrics across the weekly scale. The colors represent the correlation coefficients, the “+” sign indicates that the difference in quantified outcomes between a pair of metrics is statistically significant, and the “-” sign indicates that the difference in quantified outcomes between a pair of metrics is not statistically significant.

Multiple pairs, such as R1–R2 and R1–R3, demonstrated significant differences in terms of the resilience they quantified (in total, 35 pairs of metrics fit this category); whereas pairs including R1–R6 and R1–R9 showed no significant differences (N = 31 pairs). On the daily scale, the total number of these two categories was slightly different (N = 44 and 22 pairs with and without significant differences, respectively).

4.1.4 Metrics Performance Against Changes in Temporal Scales

In response to the research question, we reasoned that if two metrics have a very strong positive correlation and no significant difference in their quantification outcomes, they are likely highly similar metrics that essentially measure the same resilience under similar definitions. In contrast, if two metrics produce strong negative correlations and significant differences in their quantified indices, they are essentially two distinct metrics based on different, or even unrelated, interpretations of “resilience.”

Accordingly, the results can be readily understood by mapping all metric pairs into their respective quadrants (Fig. 5); by overlaying the weekly- and daily-scale results, stable pairs of metrics, whose performance is less sensitive to changes in the temporal scale of the system’s time-series functionality, can be identified. Of the 66 metric pairs in total, only 12 pairs showed strong positive correlations with no significant quantification differences, namely R2–R3, R2–R4, R2–R7, R2–R12, R3–R4, R3–R5, R3–R7, R3–R12, R4–R7, R4–R12, R5–R8, and R7–R12. This implies that these pairs might measure the same concept of resilience in terms of the quantitative analyses. On the other hand, R1–R2, R1–R3, R1–R4, and R1–R7 are based on distinct interpretations of “resilience” in terms of their quantification outcomes.
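
The quadrant mapping can be sketched as a simple filter that keeps only pairs with a strong positive Spearman correlation and a non-significant Kruskal-Wallis difference; the correlation threshold below is an illustrative assumption, and the weekly and daily metric tables are hypothetical inputs.

```python
from itertools import combinations
from scipy import stats

def stable_similar_pairs(table, rho_min=0.8, alpha=0.05):
    """Return metric pairs with a strong positive correlation and no statistically
    significant difference, i.e., candidates for measuring the same resilience."""
    pairs = []
    for a, b in combinations(table.columns, 2):
        rho, _ = stats.spearmanr(table[a], table[b])
        _, p = stats.kruskal(table[a], table[b])
        if rho >= rho_min and p >= alpha:
            pairs.append((a, b))
    return pairs

# Stable pairs are those appearing at both temporal scales (cf. Fig. 5c), e.g.:
# stable = set(stable_similar_pairs(weekly_table)) & set(stable_similar_pairs(daily_table))
```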

Fig. 5 Correlation matrix and statistical tests on pairwise differences for all metrics. a Results at the weekly scale; b Results at the daily scale; c The overlapping metric pairs between a and b, which are the stable pairs against changes in the temporal scale. Note: The total number of metric pairs is 66. The numbers in the figure indicate the R1 to R12 metrics.

Even though many of the later resilience metrics were inspired by or even based on the classic “resilience-triangle” metric (that is, R1, the most classic metric, proposed by Bruneau et al. (2003)), R1 showed quite different quantification outcomes from many of the others. A possible explanation is that R1 quantifies the performance loss under the performance curve owing to disruptive events, which differs greatly from most subsequently proposed metrics. Newer metrics place less emphasis on overall performance loss as an indicator of system resilience and instead focus on individual resilience-building capacities at each system stage, suggesting that many recently developed metrics have diverged from this traditional interpretation of resilience. This can also be roughly confirmed by the evolution of the resilience concept and the “capacities to measure” information in Table 1.

4.2 Qualitative Analyses

The quantitative results show that most metric pairs are not strongly positively correlated and differ significantly, which means that most metrics measure different resilience. To unravel the details of how the concept is interpreted and to better understand the resilience metrics, we further discuss the results qualitatively in terms of (1) the definitions, (2) the components, and (3) the expressions of the selected metrics.

First, metrics, functioning as interpretations of the definition of “resilience,” often measure one or several capacities based on different resilience definitions, with absorptive, adaptive, and restorative capacities being the most common. Through the literature review, we found that among the selected metrics, absorptive capacity is often measured by robustness (for example, R3, R5, R12), while restorative capacity is usually expressed by the level of system performance recovery (for example, R5, R8, R10, R12) or by recovery rapidity (for example, R3, R6, R10). No consensus has yet been reached on adaptive capacity in practical applications, and two points remain controversial: (1) there is no common agreement on the stage at which adaptive capacity is reflected; and (2) the components that measure adaptive capacity are not uniformly defined. Different opinions on these points have been documented in the literature. For instance, Patriarca et al. (2021) defined adaptive capacity as the time interval during which a system reaches a new stable post-failure state and maintains it until recovery actions take place; measurements were accordingly derived from the stable disrupted stage and the full cycle (R12). Nan and Sansavini (2017) characterized adaptive capacity in the recovery stage, using the speed and performance loss during this stage to jointly measure adaptive and restorative capacity (R10). Francis and Bekera (2014) expressed the adaptive capacity of a system as the proportion of the original system performance retained after reaching a new stable performance level (R3).

Based on the aforementioned discourses, it is clear that scholars hold different opinions regarding how exactly resilience should be defined and thus interpreted in quantification. This partly explains why different schools of thought emerged in the field. Indeed, this high diversity seems natural, since the concept itself is fuzzy and multidimensional, and scholars see it differently in different systems. However, it also carries a caveat: the more comprehensive the interpretation of resilience, the more likely the corresponding metric is to encounter compatibility issues in other application cases. This becomes even clearer when we look more closely at the metric components and their quantification mechanisms.

Second, the components that measure resilience capacities are highly diverse, including robustness, recovery, rapidity, recovery time, resourcefulness, redundancy, performance loss, performance capacity, and more. The first four items are the most common components in metrics. Among them, the characterization of robustness and recovery has reached a certain consensus. Robustness is measured by how much functionality drops after the event hits, which represents the performance loss in the destruction phase (R3, R4, R5, R6, R12). Recovery is characterized by post-recovery performance after the disruptions, which indicates the degree of performance recovery in the recovery phase (R3, R5, R6, R9). In contrast, there is no consensus on the measurement of rapidity. Some scholars use recovery time (R3, R4, R6), while others use the slope of the recovery curve (R10) or the ratio of recovered performance to the average outage time (R9). Very few metrics measure the resourcefulness and redundancy of the system. Performance loss and performance capacity are characterized by the integral area, which is usually used as a comprehensive measure of resilience. Overall, the use of different components reflects scholars’ different thinking and emphasis on resilience, and different combinations of components usually form essentially different metrics.

Finally, in terms of metric expression, it is apparent that there is no unified model for formula construction. First, differences exist in the construction of metrics assessing the same component. For example, regarding the measurement of robustness, R3 and R12 employ the ratio of the lowest to the initial performance value, while R4 and R5 use the difference between them; R6, in turn, not only combines the ratio and the difference but also introduces an exponential function. Second, regarding the combination of components, some scholars have used weighted sums (R6), while others have used products or ratios of components (R3, R5, R10, R11) or a combination of sub-models and sub-metrics (R7, R12). The lack of a unified standard formula has led to significant differences in metric expression forms, which can in turn lead to different quantification results. All in all, these metrics, even though they claim to measure “the system resilience” in a generic way, essentially measure different things, which is understandable because metrics reflect how the concept is interpreted and defined; different interpretations of resilience naturally lead to metrics that measure differently. Because most of the metrics cannot be used interchangeably, it is necessary to carefully select which metric to use when measuring resilience under specific data and system constraints.

Despite the above evidence of differences, we can also observe that some metrics have similar components and expressions and thus produce highly similar quantification results and patterns. Within a certain context, we argue, these metrics can be regarded as assessing the same resilience. For instance, the association analyses showed that 12 metric pairs were highly similar across the weekly and daily scales, and this similarity remained stable against changes in the temporal scale of system performance.

5 The Proposed User Guideline

Most of the selected metrics are designed to be generic and are theoretically applicable to many other infrastructure systems. In practice, however, some metrics are only applicable to specific systems and data types owing to their respective strengths and limitations. For instance, some metrics are suitable for continuous time series that capture the full system performance curve (R1, R2, R4, R7, R9, R10, and R11) because they track area-based performance losses over the complete curve, whereas other metrics only require information on the critical turning points of the performance curve (R3, R5, R6, R8, and R12), so the system performance record does not need to be continuous. Based on the merits and disadvantages of each metric, we propose a user guide for the selected performance-based metrics (Fig. 6).

Fig. 6 A brief guideline for metric users

For metric users, it is noteworthy that for a fuzzy system concept such as resilience, metric construction can be highly flexible and deeply context-dependent, resulting in stark differences in quantification and benchmarking that mirror the diversity of its definitions. Extra attention should also be paid to the following limitations of these metrics in future metric development.

  • First, some metrics are based on strong assumptions. If such assumptions cannot be relaxed in a wider context, their applicability to other systems might significantly deteriorate. Taking R3 as an example, this metric assumes that a set of initial operations is performed to stabilize the system to an intermediate state. However, such an intermediate state might not be achievable in all systems under different disruption scenarios.

  • Second, the post-disaster performance level might be higher than that of the pre-disaster performance (that is, adaptive growth after disasters), which should be fully considered in the metric development to avoid negative scores.

  • Third, some metrics were developed in the form of products or ratios, which can produce extreme values and thus introduce outliers into comparative analyses; appropriate treatment, such as normalization and outlier handling, might be needed for those metrics.

In this sense, from the perspective of better understanding the concept of resilience, it is still worthwhile to explore a more multidimensional interpretation of resilience in urban safety studies, even though increasingly complicated quantification metrics carry a higher risk of incompatibility (Najarian and Lim 2019; Yarveisy et al. 2020; Raoufi and Vahidinasab 2021; Xu and Chopra 2022). We expect this high diversity of metrics to remain the mainstream trend in resilience quantification studies for the foreseeable future, as the fundamental issues of resilience “of what,” “for whom,” and “by how” are dynamic and change from case to case (Cutter 2016; Meerow and Newell 2019).

6 Conclusion

Effectively quantifying resilience and understanding this fuzzy concept from multidimensional perspectives would help to provide alternative and promising solutions to urban risk management and system safety issues. The rapid emergence of quantification methods offers stakeholders a wider range of available tools, but it also raises two intractable questions: (1) do these popular performance-based metrics measure the same resilience? and (2) which metric should be chosen under what circumstance? Although comparisons of different resilience metrics have been documented in the literature (Clédel et al. 2020; Nishimi et al. 2021; Vishnu et al. 2023), a combined quantitative-qualitative comparison of performance-based resilience metrics remains to be explored.

In this study, 12 performance-based resilience metrics, including the classic resilience triangle, were tested using a combined quantitative-qualitative approach. We first verified their effectiveness and then performed pairwise association analyses through simple, yet effective, analytical methods, using data from China’s civil aviation system during the COVID-19 pandemic. Finally, their differences and similarities were assessed through an in-depth analysis of their definitions of resilience, basic components, and expression forms. The main conclusions of this study are two-fold.

On the one hand, most of the metrics essentially measured different resilience, despite several stable metric pairs showing similar quantification results. Largely because of their different conceptual definitions, scholars have focused on various resilience capacities. Most scholars measure one or several of the system’s absorptive, adaptive, and restorative capacities, while few have examined more closely the system’s abilities to prepare for and resist disasters. Even when scholars attempt to measure the same resilience capacity, the components they select differ. For example, R3 and R10 both measure the absorptive, adaptive, and restorative capacities of systems, but R3 represents resilience via robustness, recovery, and time to recovery, whereas R10 characterizes it with robustness, recovery, rapidity, and performance loss per unit time. Thus, although some metrics evaluate resilience in a similar manner, metrics based on different definitions, components, and expressions often assess distinct facets of “resilience.”

On the other hand, based on the analysis of the characteristics, advantages, and disadvantages of the 12 metrics, a user guide for their appropriate uses in different systems and with different data types has been proposed. The controversy and limitations of current metric construction, which have profound implications for future metric construction and optimization, are discussed in this article.

This study sheds new light on how we measure, think about, and make decisions related to critical infrastructure resilience. Most importantly, it provides an effective diagnosis of popular and recent quantification metrics and a better understanding of the question of “how to choose.” We also acknowledge that this study has three limitations. First, some of the latest metrics that emerged during the preparation of this article may not have been taken into account; future work will expand the spectrum of selected metrics and models. Second, the proposed user guideline will need to be continuously refined and updated as the field evolves. Third, owing to dataset availability, only one application case was used for implementing the selected resilience quantification metrics; more application cases could be explored in future studies.