1 Introduction

At the beginning of 2021, the CPC Central Committee and the State Council issued the Outline of the National Comprehensive Three-Dimensional Transportation Network Planning, which lays out the blueprint for the national comprehensive three-dimensional transportation network and positions rail transit as its backbone.

In recent years, rail transit failures have occurred frequently. For example, a sudden power supply failure on the section from Century Park to Zhangjiang High Tech of Shanghai Metro Line 2 paralyzed the section for up to 5 h and triggered a Class I large-passenger-flow warning; an escalator failure on Beijing Metro Line 4 caused passengers to fall and set off a stampede that left one person dead, two seriously injured, and 26 slightly injured; and when a train of Guangzhou Metro Line 8 stopped in a tunnel because of sparks in a carriage, frightened passengers evacuated into the tunnel on their own. In-depth analysis of these accident cases shows that operation risk is transmitted in a chain from the initial hidden danger to the final accident, so exploring the risk chain and diagnosing and breaking it in time is critically important. When a failure or accident occurs during operation, the transmission process exhibits a chain structure, and the safety of metro operation has become a difficult problem that must be faced.

The urban rail transit operation risk chain refers to an ordered sequence of hidden dangers that arise during operation, are not detected and controlled in time, and are then transmitted successively, producing a chain transmission effect that ultimately leads to accidents. Rail transit hazard sources are markedly transmissible: for example, a power failure of the overhead contact system at an elevated station will disrupt the normal operation plan; if the operation plan changes and the passenger flow organization of the station changes abruptly, the station may develop major hidden dangers such as sudden large passenger flow and platform congestion, forming a risk transmission chain, as shown in Fig. 1.

Fig. 1
figure 1

Risk chain transmission development demonstration

Different risk chains are intertwined and superimposed to form a risk chain network, and the impact of accidents will expand in series, resulting in unpredictable serious disasters. Therefore, the importance of hazard control and governance of rail transit operation under super large-scale network is self-evident.

This study presents a novel approach that integrates data mining and information fusion technologies to analyze the safety status of metro operation based on accumulated data. A systematic multi-source, high-dimensional, and heterogeneous collection and analysis experiment is designed to mine and integrate safety-related data, reveal the correlation mechanisms behind hazard sources, and form risk chain groups. By identifying key factors in the risk chain, this study provides a methodological framework for breaking the chain in a timely manner and establishing special emergency and response plans. It also serves as a theoretical and decision-making basis for improving metro operation safety and service level under a super-network mode, enhancing social image, and formulating and revising relevant industrial standards and specifications.

2 Literature Review

At present, academia offers several accident-based hazard identification methods, such as grey relational analysis, the analytic hierarchy process, probability statistics, and subjective experience. Because of their pervasive subjectivity, these methods cannot scientifically determine the key hazard sources.

Various studies have investigated aspects of hazard identification. For example, He [1] studied a hierarchical optimization method for hazard identification and evaluation in metro operation, which strengthened the quantification of the index system and reduced the error caused by subjectivity. Some scholars searched for hazard sources with the fault tree method: Yang [2] applied it to identify the hazard sources of train operation and determine the risk factors, which has a certain application value. However, the fault tree method supports only simple causal reasoning and is not suitable for analyzing complex systems. To overcome this defect, Yang et al. [3] introduced the Bayesian network method into hazard identification for railway passenger stations and incorporated fuzzy uncertainty theory as a powerful supplement to and improvement of the safety analysis of complex systems. Other scholars tried to remedy the shortcomings of the above algorithms with rough set theory, whose basic idea is to screen the key factors from among several attributes and thereby identify the hazard sources. Wang et al. [4] used rough set theory to identify construction risk factors; on this basis, Jia et al. [5] combined rough sets with a genetic algorithm to study the key risk factors of identified hazard sources, which proved more objective and accurate than other methods. With the development of big data mining technology, Ding [6] studied data mining algorithms in depth, modeled the rail transit dispatch logs of a mega city, mined the main hazard sources of rail transit, and developed an intelligent hazard identification system based on a data warehouse. Zhou et al. [7] created a subway construction safety risk early warning index system based on historical data. Zhang et al. [8] determined the causes of construction accidents from 571 collected construction accident investigation reports combined with grey relational analysis. Hei [9] used text mining to analyze the hidden dangers of subway construction and visualized the redundant text data, providing strong support for hidden danger investigation. Xu et al. [10] added information entropy term weighting to keyword importance evaluation to offset the impact of accident reports of different lengths on text mining results. Other scholars have studied the comprehensive application and design of machine learning (ML) algorithms in construction safety: Bugalia et al. [11] used an extensive experimentation strategy consisting of input data processing, n-gram modeling, and sensitivity analysis, which contributed greatly to data classification, but their ML classifiers struggled to distinguish between “unsafe act (UA)” and “unsafe condition (UC),” so they made improvements based on the research object.

The concept of risk chain transmission comes from the fields of project management, trade finance, media, public opinion, and culture; there is little research in the field of rail transit safety management. Chapman [12] first put forward the idea of the risk chain, after which risk management theory was applied in engineering. Kangari et al. [13] studied and verified the risk chain from the perspective of the supply chain and proposed that a risk chain is a series of chain effects arising from a logical relationship, forming interrelated risks. Building on previous risk chain studies, Liu [14] analyzed derailment data from the FRA rail equipment accident database for 2001–2010, tabulating frequency of occurrence by cause and number of trains derailed; the statistical analyses of accident causes showed that broken rails or welds were the leading derailment cause on main, yard, and side tracks, providing guidance for mining the causes of rail traffic accidents. Liu [15] reported, based on the transmission relationships among supply chain risks, that risks are mutually coupled and transmitted. Ma [16] divided the risk chain into four types: cause-and-effect chain, combination chain, replication chain, and migration chain. Wang [17] constructed an unsafe event data analysis model with regular expressions and pattern matching technology, established a matching model of external environment risk factors for high-speed railway derailments, and applied both models to the occurrence of unsafe events. Moradi [18] noted that a challenging problem in the risk and reliability analysis of complex engineering systems (CES) is performing and updating system-wide risk and reliability assessments with sufficiently high frequency. Li et al. [19] studied risk element transmission theory, laying a solid theoretical basis for risk identification. Cao et al. [20] used Monte Carlo simulation to model the interaction of risk chains and constructed a risk chain evaluation model. In the engineering field, there is abundant research on the chain transmission of faults and accidents, reflected in accident cause analysis, causal factors, and their relationships, but few studies can inform the construction of a rail transit operation risk chain. Li et al. [21] proposed a knowledge reasoning method based on ontology modeling of accident chain scenarios to analyze correlations among hidden dangers and clarify the causal mechanism of accidents. Qiu et al. [22] handled unstructured accident cases in a structured way and proposed a grid operation method to mine possible accident cause chains for targeted prevention and emergency disposal. Fu [23] developed a general modeling and analysis procedure for risk interactions based on association rule mining and weighted network theory.

To sum up, many domestic and international scholars have conducted extensive research on hazard identification algorithms, risk classification, and risk control in the engineering field, but there are few research achievements on identifying hazard sources in metro operation. In railway transportation and mining engineering, the sporadic hazard source identification algorithms concentrate mainly on fault tree analysis and association rule algorithms; these methods can only identify the cause combination of an accident and cannot uncover the transmission chain between hidden dangers. Other studies focus mainly on accident causation models of hazard sources, and there is almost no research on the chain coupling mechanism and risk chain among the causal factors of transportation failures (accidents). Research on the risk chain based on major hazard sources therefore has important practical significance. Big data and data mining algorithms have been widely used in China, but mainly for commercial purposes or bank loan risk assessment, with less focus on rail transit safety management. Metro operation safety data are complex in structure, heterogeneous, multi-source, and extremely difficult to collect. It is necessary to set up acquisition equipment, build data transmission networks, design high-speed algorithms with parallel computing capabilities, and explore the construction of data mining algorithm groups to mine and identify rail transit operation hazards. Current metro operation safety management and control lack pertinence, making it difficult to handle accidents and faults quickly. Metro operation is a complex, large-scale engineering system involving many hazard sources, each of which requires an accurate emergency or disposal plan together with accurate triggering mechanisms and conditions.

As an emerging technology of the big data era, text mining combined with risk analysis is the general trend. Each directed edge in the risk chain carries a certain probability, which depends on the likely development direction of the event and indicates the direction of risk occurrence and transfer. This has great research value for accident prediction and decision support.

3 Problem Description

3.1 The Basic Problem

With the expansion of the operation scale of urban rail transit, super-large-scale network operation has become a trend: operation complexity has increased, cascade effects among lines are obvious, and potential safety hazards in the operation process have continuously increased. The operation process mainly generates data on line passenger flow (covering stations, platforms, elevators, and transfer channels), dispatch logs, equipment operation status, account records, and rail damage, mostly stored in text form and partially unstructured. It is particularly important to analyze these data and explore the key hazard sources leading to accidents and the relationships among them.

In this paper, we explore how to deeply mine the risk transmission mechanism behind massive rail transit hidden danger data and support the identification and chain-based management and control of operation hazard sources, an unavoidable problem that metro operation enterprises must solve. For the operation risk management of urban rail transit, many domestic and foreign scholars have done considerable basic research on hazard identification algorithms, risk classification, and risk control in the engineering field, and have preliminarily explored hazard source mining and risk chain transmission mechanisms in the transportation field. Building on an analysis of the existing research, this study comprehensively applies data mining, machine learning, and natural language processing (NLP) to study the key risk sources of rail transit operation in depth and to reveal the risk transmission among them.

3.2 Research Technology Route Structure

This paper proposes a method for mining metro operation hazard sources and constructing their risk chain based on text mining and causality extraction. The method identifies urban rail transit operation hazard sources from unstructured text, takes the hazard sources as the core, and explores the law of risk transmission from the perspective of chain risk transmission to build a causal risk chain. By identifying the key cause points and controlling them, the development direction of an accident can be changed, providing strong support for the early warning and prediction of urban rail transit operation safety. The research plan is shown in Fig. 2.

Fig. 2
figure 2

Research technology route structure

4 Hazard Source Mining Based on Dispatch Logs and Accident Reports

Data from each department of rail transit operation have both semi-structured and unstructured characteristics, which makes data collection, standardization and informatization difficult. We used a collection of multi-source, high-dimensional and heterogeneous data, mainly including passenger flow videos, public works maintenance data, dispatch logs, station accounts (paper version), delay data of more than 15 minutes, and automated fare collection (AFC) passenger flow data. Mining rail transit operation safety hazards involves a data acquisition scheme, data flow algorithms, data semantic definition and standardization, in-depth study of rail transit operation data structure, conceptual description and definition of data, and establishing an abstract structure that can describe fault data and a semantic description framework that specifically expresses internal associations. Finally, a data warehouse that can be used for rail transit operation safety data mining is constructed.

4.1 The collection of operation safety data

  • (1) Dispatch log

The Metro enterprise has stored a large amount of train fault information and rescue information in their daily operation. With the continuous accumulation of operation, a large number of dispatch logs have been formed, which has high guiding value for operation practice, as shown in Table 1.

  • (2) Accident reports

Table 1 Operational scheduling failure log record form (partial)

In order to make up for the possible limitations and shortcomings of the data in the operation dispatch logs, 110 urban rail transit operation accident reports at home and abroad are collected and stored in a folder as a text file (txt) for data mining supplement.

There are three ways to obtain accident reports:

  i. Search for and collect reports related to urban rail transit operation accidents through relevant safety management websites;

  ii. Collect the accident data analyzed in relevant references and obtain the key information and related materials of each accident through network queries;

  iii. Search for reports related to urban rail transit operation accidents on the official websites of metro operation units.

The collected data cannot fully cover rail transit operation accidents, so representative data were selected for mining and analysis; some accident reports are shown in Table 2.

Table 2 Accident report (partial)

4.2 Text Data Preprocessing

  • (1) Construction of word segmentation lexicon

For Chinese word segmentation, professional terms that are missing from the segmentation lexicon will be split apart, degrading the segmentation result. The third-party Jieba package is loaded for word segmentation in the Python environment. To ensure segmentation quality, a professional rail transit lexicon and a stop-word lexicon are established. Once the user-defined lexicon and stop-word lexicon are configured, the specific word segmentation process is as shown in Fig. 3.

  • (i) Custom lexicon configuration

Fig. 3
figure 3

Word segmentation flowchart

The Jieba package can recognize common words, but its segmentation of proper nouns is not ideal. For example, “door fault” is often split into “door” and “fault,” yet “door fault,” as a proper noun for urban rail transit operation faults, should be extracted as a whole rather than divided further. In addition, some colloquial expressions are difficult to recognize. Therefore, before word segmentation, a user-defined lexicon is built: 804 urban rail transit hazard-related words and 406 urban rail transit professional terms together form the user-defined thesaurus.
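To see why compounds such as “door fault” must be lexicon entries, consider the following simplified, stdlib-only sketch of dictionary-based segmentation using forward maximum matching. Jieba's real algorithm is more sophisticated (a prefix dictionary plus an HMM for unknown words), and in practice the custom lexicon is simply loaded with `jieba.load_userdict(...)`; this toy function only illustrates the principle.

```python
# Forward maximum matching: a simplified stand-in for dictionary-based
# segmentation, showing why "车门故障" ("door fault") must appear in the
# lexicon as one entry to survive segmentation intact.
def fmm_segment(text, lexicon, max_len=4):
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrink until a lexicon hit
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:  # single chars are the fallback
                words.append(cand)
                i += n
                break
    return words

base = {"车门", "故障"}
print(fmm_segment("车门故障", base))                 # ['车门', '故障']
print(fmm_segment("车门故障", base | {"车门故障"}))  # ['车门故障']
```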

  • (ii) Stop-word lexicon configuration

Stop-words interfere with text segmentation: they occur in large numbers and obscure the truly meaningful feature words. Removing stop-words and retaining highly correlated feature words improves the efficiency of text mining. The common stop-words involved are as follows:

  (i) Common adverbs, conjunctions, and exclamations, such as “of,” “make,” and “after,” which occur with high frequency but carry no real meaning;

  (ii) Words in the selected corpus (the failure logs and accident report texts of urban rail transit operation) that appear frequently but have no practical significance, such as “urban rail transit,” “safety,” and the names of metro stations; these are also merged into the stop-words.

The stop-word lists of the Machine Intelligence Laboratory of Sichuan University and of the Harbin Institute of Technology were also incorporated, yielding a final stop-word lexicon of 3050 words.
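The filtering step itself is straightforward; a minimal sketch follows, with illustrative English stand-ins for the tokens and stop-words (the actual lexicon contains 3050 Chinese entries):

```python
from collections import Counter

# Drop high-frequency but meaningless tokens after segmentation, then
# count the remaining feature words for the later statistics.
def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

stopwords = {"of", "after", "safety", "urban rail transit"}
tokens = ["door fault", "of", "urban rail transit", "train delay",
          "door fault", "safety"]
filtered = remove_stopwords(tokens, stopwords)
print(Counter(filtered))  # Counter({'door fault': 2, 'train delay': 1})
```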

  • (2) Word segmentation result

The configured word segmentation system is used to process the rail transit operation dispatch log and accident report text. Some word segmentation results are shown in Table 3, resulting in 20,876 words in total.

Table 3 Word segmentation result (part)

The first round of word segmentation yields too many original feature items and a great deal of useless information, which significantly affects the statistical analysis of the keywords of operation hazard sources.

Previous research [20] shows that when strongly discriminative words account for less than 1/10 of the mined document set, the original data set is too large and eigenvalue selection must be performed on the raw segmentation results. As Table 3 shows, words such as “traffic dispatching,” “fault,” and “station” cannot serve as semantic expressions of factors affecting the operation safety of urban rail transit. On this basis, eigenvalues must be selected and words unrelated to metro operation hazard sources screened out, which is an essential step for further text mining.

4.3 Selection of Key Eigenvalues

Text feature representation methods mainly include square root function, Boolean function, and term frequency-inverse document frequency (TF-IDF) function. The TF-IDF function is the most widely used vector space model, which can take into account the change of word frequency and the distinction of text semantic features. A TF-IDF algorithm is selected to screen the eigenvalues.

The TF-IDF algorithm combines TF and IDF. TF (term frequency) is the ratio of the number of occurrences of a word in a text to the total number of words in that text after stop-word removal; the higher the frequency, the greater the word's weight in the text. IDF (inverse document frequency) captures the idea that if almost all texts in the corpus contain word w, then w contributes little to distinguishing them, whereas if only a few texts contain w, then w can express the theme of those texts. By combining the two, TF-IDF filters out common words that are meaningless to the topic and retains the content that reflects it. The TF-IDF calculation is shown in Eq. (1):

$$\left\{ \begin{gathered} {\text{TF-IDF}}_{i,j} = tf_{i,j} \times idf_{i} \hfill \\ idf_{i} = \log \frac{\left| D \right|}{{DF_{i} }} \hfill \\ \end{gathered} \right.$$
(1)

in which \(\left| D \right|\) refers to the total number of documents; \({\text{DF}}_{i}\) refers to the number of documents containing the eigenvalue \(t_{i}\); and \(tf_{i,j}\) indicates the frequency of occurrence of eigenvalue \(t_{i}\) in class-j documents.
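Eq. (1) can be sketched with the standard library alone. The +1 on \(DF_{i}\) below is one common smoothing choice; the text mentions that smoothing is applied to keep \(DF_{i}\) from being 0 but does not specify the exact scheme, so this detail is an assumption.

```python
import math
from collections import Counter

# Minimal sketch of Eq. (1). Documents are pre-segmented token lists;
# df[t] is DF_i, and the +1 is an assumed add-one smoothing.
def tf_idf(docs):
    n_docs = len(docs)
    df = Counter()                      # DF_i: documents containing term i
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf, total = Counter(doc), len(doc)
        scores.append({t: (c / total) * math.log(n_docs / (df[t] + 1))
                       for t, c in tf.items()})
    return scores

scores = tf_idf([["psd", "fault"], ["psd", "delay"], ["catenary", "fault"]])
```

Note that with this smoothing a term appearing in every document gets a slightly negative score, which still ranks it below discriminative terms, consistent with the filtering purpose described above.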

After Jieba word segmentation, 20,876 words are obtained, many of which have no significant effect. The TF-IDF algorithm is then used to compute the text eigenvalues and delete words with low TF-IDF values. Smoothing is first applied to Eq. (1) to prevent \(DF_{i}\) from being 0, and each word's TF-IDF value is computed, as shown in Tables 4 and 5:

Table 4 High TF-IDF characteristic value
Table 5 Low TF-IDF characteristic value

Calculation shows that words with higher TF-IDF values better represent the hazard sources of rail transit operation. The TF-IDF values of the eigenvalues are sorted, and the top 5% are selected as key feature words. Because of the large data volume, the mining process cannot be completed in a single pass; the relevant steps, including updating the segmentation thesaurus and selecting keywords, must be iterated to improve the accuracy of key eigenvalue selection.

According to the above process, through three rounds of mining analysis, the final mining statistical results of 76 key feature words are obtained for subsequent cluster analysis of key hazard sources.

After obtaining the key feature words, Tableau is used for visualization, and the final statistical results are displayed as a word cloud diagram, which makes the key subject words more obvious and intuitive. The word cloud diagram is shown in Fig. 4.

Fig. 4
figure 4

Key feature value word cloud

The causes and severities of faults differ across dispatch fault logs and accident reports, but word cloud visualization reveals the common, universal key eigenvalues in these text data; they greatly affect the operation safety of urban rail transit and should be analyzed as key risk factors.

In Fig. 4, the 75 key features obtained from text mining according to certain rules are arranged from the inside outward by text size and color depth: the higher the weight of a key feature item, the larger its font, the darker its color, and the more central its position. As can be seen from Fig. 4, the top five risk factors are large passenger flow, overhead contact system power failure, platform screen door (PSD) failure, foreign matter intrusion, and train delay, which is consistent with the actual situation.

4.4 Identification of Key Hazard Sources

Analysis of the 75 hazard source keywords identified by text mining shows that the same keyword has multiple expressions; further classification and summarization yield the urban rail transit operation hazard sources shown in Table 6.

Table 6 Key hazards in urban rail transit operations

5 Risk Chain Construction Method Based on Key Hazard Sources

5.1 Parameter Construction of Risk Chain Edge

5.1.1 Determining the Direction of the Risk Chain

There is no clear definition of risk chain in academic circles. Risk chains usually spread in a chain-like manner, which is the sum of a series of explicit and implicit risks [24, 25]. Because of the particularity and complexity of urban rail transit operation, the internal risk relationship of the system is complex and has the characteristics of obvious chain transmission. Usually, the occurrence of accidents is caused by the continuous diffusion after the occurrence of the original risk event. The risk chain expresses the correlation between risk elements and the risk transmission path, provides strong support for the control of key risk points, and is conducive to preventing the source of risk and reducing the possibility of risk transmission.

Only some key hazard sources are obtained through the calculation in Section 4.4. Faced with massive operation data, obtaining the disaster risk chain of key hidden dangers requires information reduction and extraction over the whole life cycle of each key hidden danger. The flow chart shown in Fig. 5 is designed to construct the direction of the risk chain.

Fig. 5
figure 5

The flow chart of the direction construction of the side of the risk chain

The specific operation steps are designed as follows:

Step 1. Build risk pairs into the data set E[n] = {[R1, R2], …, [Ri, Rj], …}, which contains n items; for example, E[1] = [R1, R2] indicates that risk R2 is caused by risk R1, so R1 and R2 constitute one risk pair. The construction method of the risk data set is described in detail in Section 5.2;

Step 2. Execute the risk chain construction function get_RiskChain();

Step 3. Obtain a total of n causal data sets E[n] = [R1, R2], …, [Ri, Rj];

Step 4. Set the hazard source S0, which is one of the events R1, …, Rj; then search all events related to S0 and build the risk chain it causes;

Step 5. Traverse all n data sets and search for every data set related to S0;

Step 6. When a data set containing S0 is found, the other member R of the causal pair is determined to be a new hazard source S1;

Step 7. Take S1 as the search object and traverse the n-item data set again to find the next risk point related to S1, until all risk points with causality are found;

Step 8. Save and output all S; the risk chain rooted at S0 is now complete;

Step 9. End the risk chain construction function, return the constructed risk chain, and output it.
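The steps above amount to a breadth-first traversal of the causal pair set. A minimal sketch of get_RiskChain() under that reading (function and variable names are illustrative, not from the paper's code):

```python
from collections import deque

# Breadth-first traversal of the causal pair set E (Steps 1-9),
# starting from hazard source s0; returns the directed edges of
# the risk chain rooted at s0.
def get_risk_chain(pairs, s0):
    chain, visited, queue = [], {s0}, deque([s0])
    while queue:
        s = queue.popleft()
        for cause, effect in pairs:            # traverse all n pairs
            if cause == s and effect not in visited:
                chain.append((cause, effect))  # Step 6: effect is a new source
                visited.add(effect)
                queue.append(effect)           # Step 7: search from it next
    return chain

E = [("R1", "R2"), ("R2", "R3"), ("R4", "R5"), ("R1", "R4")]
print(get_risk_chain(E, "R1"))
# [('R1', 'R2'), ('R1', 'R4'), ('R2', 'R3'), ('R4', 'R5')]
```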

5.1.2 Determination of Risk Chain Edge Transfer Probability

The main purpose of constructing the risk chain based on key hazard sources is to reveal the laws and modes of risk development. To measure the possibility of causal risk transmission, the risk transfer probability is marked on each edge of the risk chain. The transfer probability is the possibility that risk Rj occurs given that risk Ri has occurred, that is, the probability that risk Ri causes risk Rj, as shown in Eq. (2):

$$P(R_{j} \mid R_{i} ) = \frac{{{\text{count}}(R_{i} ,R_{j} )}}{{\sum\nolimits_{k} {{\text{count}}(R_{i} ,R_{k} )} }}$$
(2)

in which \({\text{count}}(R_{i} ,R_{j} )\) indicates the co-occurrence frequency of \(R_{i}\) followed by \(R_{j}\), and \(\sum\nolimits_{k} {{\text{count}}(R_{i} ,R_{k} )}\) indicates the total count over all risk events that may occur after \(R_{i}\).
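Eq. (2), together with the pruning step described below, can be sketched as follows (the threshold default reflects the 0.48 value used in this paper):

```python
from collections import Counter

# Estimate edge transfer probabilities from observed risk pairs per
# Eq. (2) and drop edges whose probability falls below the threshold.
def transfer_probs(pairs, threshold=0.48):
    pair_count = Counter(pairs)
    out_count = Counter(src for src, _ in pairs)  # sum_k count(Ri, Rk)
    return {(src, dst): c / out_count[src]
            for (src, dst), c in pair_count.items()
            if c / out_count[src] >= threshold}

E = [("R1", "R2")] * 6 + [("R1", "R3")] * 3 + [("R1", "R4")]
print(transfer_probs(E))  # {('R1', 'R2'): 0.6} - the R3 and R4 edges are pruned
```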

Because some risk tuples in the extracted causality pairs are not strongly related to the occurrence of hazard sources, these events should not be linked to the causality risk chain and will become isolated points [26].

Experimental calculation shows that the edge transfer probabilities of 70% of the data are greater than 0.48, while the other 30% fall below this value. Moreover, the data are strongly dispersed: most of the remaining risk tuples have an edge transfer probability below 0.1, indicating low correlation with the hazard sources, so they do not form edges of the risk chain. The pruning threshold is therefore set to 0.48, and risk event pairs of low importance are deleted to improve the accuracy of risk chain construction. Finally, the risk chain based on key hazard sources is obtained as shown in Fig. 6, where R* represents the key hazard source; R1, R2, …, R8 represent the associated risk factors leading to R*; the arrows represent causal relationships between risk factors; and the dotted lines indicate that the edge transfer probability between a risk pair is below the pruning threshold.

Fig. 6
figure 6

Construction of a causal risk chain based on key hazards

5.2 The Construction of Risk Data Set

Extracting causal relationships between risk points is the basic work of risk chain construction. The collected operation accident reports and dispatch fault logs are fused into a data warehouse for mining, risk events with causality are extracted, and risk pairs are formed to construct the risk chain candidate set. When extracting causality from the text, some sentences are found to contain explicit causal connectives or cues, such as “because,” “because of,” “so,” “cause,” and “therefore”; this is called explicit causality. Text that expresses causality without causal cues accounts for about 2/3 of the data and is called implicit causality. This paper extracts the causal relationships of risk events in the text to form risk pair data sets; the construction process is shown in Fig. 7.

Fig. 7
figure 7

Causality extraction flowchart

5.2.1 Explicit Causality Extraction Based on Pattern Matching

Explicit causality is the most common form in text and the main source of causality extraction. Extracting explicit causality from metro operation safety accident texts mainly targets the event tuples in causal clauses and sentences, expressing the causality between events from input explicit causal sentences. The extraction requires building causality matching rules and templates to match the cause clause and the result clause in a sentence, then splitting each clause further and expressing the results in a structured form as event pairs. The explicit causality extraction framework is shown in Fig. 8.

Fig. 8
figure 8

An explicit causality risk pair extraction framework for urban rail transit operation accidents

Sentences with explicit causality have obvious causal cues. The causal semantics are formally expressed through language patterns. Firstly, several specific causal patterns are determined, and then the sentences in the text are matched with those associated with the same semantics one by one. The advantage of this method is that it is relatively accurate and consistent with people's habits.

The text contains a large number of causal cue words that characterize causal sentence patterns. Sentences with explicit causality can be divided into terminal type, middle type and supporting type. Based on these three syntactic patterns and the features of the data text, this paper summarizes five categories of causal cue words suitable for causality extraction in the rail transit domain. Table 7 shows the syntactic pattern matched by each causal cue word type, where Pi denotes the i-th syntactic pattern.

Table 7 Causal syntactic patterns and causal cue word correspondence table

The causality in the text is abstracted into five specific syntax patterns, and the corresponding extraction matching rules are designed as follows:

$$\begin{aligned} &{\text{(i) Rule 1: if }}w_{1} \in cue1{\text{ and }}w_{i} \in sign,{\text{ then }}s_{i} \in P_{1} ,\\ &\qquad {\text{cause}}_{{P_{1} }} = \{ w_{2} ,\ldots,w_{i} \} {\text{ and effect}}_{{P_{1} }} = \{ w_{i + 1} ,\ldots,w_{n} \} \\ &{\text{(ii) Rule 2: if }}w_{i} \in cue2{\text{ and }}w_{1} \notin cue4,{\text{ then }}s_{i} \in P_{2} ,\\ &\qquad {\text{cause}}_{{P_{2} }} = \{ w_{1} ,\ldots,w_{i - 1} \} {\text{ and effect}}_{{P_{2} }} = \{ w_{i + 1} ,\ldots,w_{n} \} \\ &{\text{(iii) Rule 3: if }}w_{i} \in cue3{\text{ and }}w_{1} \notin cue6,{\text{ then }}s_{i} \in P_{3} ,\\ &\qquad {\text{effect}}_{{P_{3} }} = \{ w_{1} ,\ldots,w_{i - 1} \} {\text{ and cause}}_{{P_{3} }} = \{ w_{i + 1} ,\ldots,w_{n} \} \\ &{\text{(iv) Rule 4: if }}w_{1} \in cue4{\text{ and }}w_{i} \in cue5,{\text{ then }}s_{i} \in P_{4} ,\\ &\qquad {\text{cause}}_{{P_{4} }} = \{ w_{2} ,\ldots,w_{i - 1} \} {\text{ and effect}}_{{P_{4} }} = \{ w_{i + 1} ,\ldots,w_{n} \} \\ &{\text{(v) Rule 5: if }}w_{1} \in cue6{\text{ and }}w_{i} \in cue7,{\text{ then }}s_{i} \in P_{5} ,\\ &\qquad {\text{effect}}_{{P_{5} }} = \{ w_{2} ,\ldots,w_{i - 1} \} {\text{ and cause}}_{{P_{5} }} = \{ w_{i + 1} ,\ldots,w_{n} \} \end{aligned}$$

in which sign denotes a punctuation mark in the sentence, cause denotes the cause clause, and effect denotes the result clause; \(s_{i}\) refers to the i-th sentence in the text, and \(\{ w_{m} ,\ldots,w_{n} \}\) denotes the text content between the m-th word and the n-th word of the sentence.
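The five rules above can be sketched as simple token-level matchers. Below is a minimal Python sketch of Rule 1 and Rule 2; the cue sets are hypothetical English stand-ins, since the paper's actual cue-word tables (cue1–cue7, Table 7) are domain-specific and in Chinese:

```python
# Hypothetical cue sets standing in for the Chinese cue-word tables.
CUE1 = {"because", "since"}       # leading cause cues (pattern P1)
CUE2 = {"so", "therefore"}        # middle result cues (pattern P2)
CUE4 = {"as"}                     # cues excluded by Rule 2 (assumed)
SIGN = {",", ";"}                 # clause-separating punctuation

def match_rule1(words):
    """Rule 1: w1 in cue1 and w_i in sign -> s in P1,
    cause = w2..w_i, effect = w_{i+1}..w_n (1-based indices)."""
    if words and words[0] in CUE1:
        for i in range(2, len(words)):        # 1-based position of w_i
            if words[i - 1] in SIGN:
                return {"pattern": "P1",
                        "cause": words[1:i],  # w2..w_i (per the paper's rule)
                        "effect": words[i:]}  # w_{i+1}..w_n
    return None

def match_rule2(words):
    """Rule 2: w_i in cue2 and w1 not in cue4 -> s in P2,
    cause = w1..w_{i-1}, effect = w_{i+1}..w_n."""
    if words and words[0] not in CUE4:
        for i in range(1, len(words) + 1):
            if words[i - 1] in CUE2:
                return {"pattern": "P2",
                        "cause": words[:i - 1],
                        "effect": words[i:]}
    return None
```

A matched sentence such as “because of heavy rain, the train stopped” yields the cause span “of heavy rain” and the effect span “the train stopped” under Rule 1.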

The cause and result clauses extracted from the text are expressed as tuples, forming causal risk event tuple pairs in preparation for sorting out the subsequent risk transmission relations and constructing the risk chain. Urban rail transit operation accidents often involve several hazard sources, and representing them as a risk chain better describes the sequence of risks. The trigger word in a sentence is usually the predicate, that is, the event trigger word, which clearly expresses the triggering of the event. The words before and after the event trigger word are the event arguments [27], which describe the object, participants, time and place of the event.

The pattern matching method is used to extract the cause and result clause, and the extraction results are shown in Table 8.

Table 8 Examples of reason and result clauses under pattern matching

The extracted cause and result clauses need to be expressed in a structured form to build a causality diagram. Therefore, the event triplet [24] is introduced to express cause and result events. The most common event representation is the event triplet e = (S, P, O), expressed as the text event [subject, predicate, object]. This representation is often used in ontology description structures in the safety field. P, S and O represent, respectively, the action, the role, and the object on which the role performs the action. In the causality extraction process, each event contains at least one event trigger word (i.e., P).

Event tuple extraction usually uses the semantic role labeling of the Language Technology Platform (LTP) [25] to annotate each cause and result sentence and check whether an (S, P, O) triple exists. If it does, the event tuple is extracted directly. For cause and result sentences that cannot be directly annotated with an event triple, a dependency parsing framework is used; the extraction steps based on dependency syntax are as follows:

Step1: Use a lexicon to store the syntactic dependency child nodes of each word in the sentence, with the storage location encoding the relation between the word and the corresponding child nodes; then store the parent node and extract the dependency relation;

Step2: For words not covered by the generated dependency structure, record the part of speech and the dependency relation between the word and its parent node;

Step3: Cyclically extract words in verb-object and postpositive-attribute relations;

Step4: Among the words extracted as subject and object, find the words with related dependency structures and delete the irrelevant words;

Step5: Extract the identified subject or object as an event tuple.
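Assuming the dependency arcs have already been produced (e.g., by the LTP parser), the core of these steps can be sketched as follows; the input format and example are illustrative, while the relation labels follow the LTP convention:

```python
def extract_event_triple(words, arcs):
    """Extract an (S, P, O) event triple from a dependency parse.
    `arcs` is a list of (head_index, relation) per word, 0-based,
    with -1 for the root's head. Relation labels follow the LTP
    convention: SBV = subject-verb, VOB = verb-object, HED = root."""
    subj = pred = obj = None
    for i, (head, rel) in enumerate(arcs):
        if rel == "SBV":            # word i is the subject of its head verb
            subj, pred = words[i], words[head]
        elif rel == "VOB":          # word i is the object of its head verb
            if pred is None:
                pred = words[head]
            obj = words[i]
    return (subj, pred, obj) if pred else None
```

For an illustrative parse of “catenary caused outage,” with “catenary” as subject of “caused” and “outage” as its object, the function returns the triple (“catenary”, “caused”, “outage”).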

5.2.2 Implicit Causality Extraction Based on Machine Learning

There are still many causal relationships without causal cues in the text, and the text of urban rail transit operation safety accidents is no exception; these are called implicit causality. For example: “As soon as the train of urban rail transit Line 10 bound for Jinsong failed, it stopped at Suzhou Street station for repair. The whole line was affected, resulting in passenger detention and overcrowding. Therefore, Haidian Huang Zhuang station was forced to take current-limiting measures.” Although the causality is not stated explicitly, a causal chain can be summarized: train fault → line affected → congestion → current-limiting measures.

In contrast to explicit causality, extracting implicit causality requires analyzing the context to judge whether a causal relation exists, because there is no causal cue word. Accordingly, implicit causality extraction first extracts the events in a sentence, then filters them into candidate risk event pairs, and finally completes the extraction with a deep learning framework. A bidirectional LSTM (long short-term memory) network based on the self-attention mechanism is used; its input is the sentence set annotated with event trigger words. Implicit causality extraction is treated as a classification problem: based on the classification model, the qualified causal event pairs are selected from the candidate event pairs. This is a machine learning method built on a statistical model. The process of extracting implicit causality between event tuples is shown in Fig. 9:

Fig. 9
figure 9

Event tuple implicit causality extraction process

For the extraction of implicit causality, we should first extract the event tuple, select the verb as the event trigger word through part of speech filtering, then extract the event triplet, and finally get the candidate event pair. The generation steps of candidate event pairs are as follows:

Step1: Filter word classes of the text sentence to be mined, and keep the verb as the trigger word of the event;

Step2: Using the dependency parsing method, keep the subject part and object part corresponding to the trigger word as the subject and object of the event tuple;

Step3: Screen out the components related to the subject and object of the event trigger words in the dependency parsing results;

Step4: Express the event tuple in a structured form: {subject and subject related components, trigger word and trigger word related components, object and object related components}, which can also be expressed in the form of similar event triples:\(E = [s + s_{0} ,t + t_{0} ,o + o_{0} ];\)

Step5: Generate candidate causal pairs: suppose sentence s contains m events, \(s = \{ E_{1} ,E_{2} ,\ldots,E_{m} \} ,\) and pair all events in the sentence into event pairs \(\langle E_{i} ,E_{j} ,{\text{direction}}\rangle ,\) where direction indicates the type of causal relationship, \({\text{direction}} \in \{ 1, - 1,0\} ,\;1 \le i \ne j \le m\): direction = 1 indicates that Ei is the cause and Ej the effect; direction = -1 indicates the opposite; and direction = 0 means there is no causal relationship between Ei and Ej. The number of event pairs that can be generated from each sentence is \(\frac{m(m - 1)}{2},\) where m is the number of events in the sentence. For instance, from “Through video monitoring, the staff found that there was water seepage in the ceiling of the substation. The station management immediately closed the substation below Chongwenmen station, and the normal operation of some sections of Line 5 was forced to be interrupted,” we obtain E = {e1: water seepage in the substation ceiling; e2: close the substation under Chongwenmen station; e3: normal operation forced to be interrupted}; the events in E are then combined in pairs to form candidate causal pairs, such as \(\langle e_{2} ,e_{3} ,1\rangle .\)
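The pairing step can be sketched directly with the standard library; the helper name is illustrative:

```python
from itertools import combinations

def candidate_pairs(events):
    """<Ei, Ej, direction>: direction starts at 0 (no causality assumed)
    and is later set by the classifier to 1 (Ei -> Ej) or -1 (Ej -> Ei)."""
    return [(ei, ej, 0) for ei, ej in combinations(events, 2)]

# Events from the Chongwenmen substation example in the text
events = ["water seepage in substation ceiling",
          "close the substation under Chongwenmen station",
          "normal operation forced to be interrupted"]
pairs = candidate_pairs(events)
# m = 3 events yield m(m-1)/2 = 3 candidate pairs
```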

With the development of computer technology, attention mechanism [28, 29] is widely used to deal with natural language analysis tasks. Attention mechanism combines the internal experience of biological observation with the external feeling and can be applied to the extraction of sparse data features. By calculating the probability distribution of attention, we can not only highlight the key inputs that have a great impact on the final output results, but also analyze the relationship between model input and output.

To improve the experimental results, the location identifier of the event trigger word is added to the self-attention mechanism model, and a bidirectional LSTM model based on the self-attention mechanism is established to increase the accuracy of implicit causality extraction. The location identifier of the event trigger word is denoted pi, and the model is denoted Self-Attention BiLSTM+PI. On this basis, the implicit causality extraction model includes the following layers:

  1. (1)

    Word embedding layer: transforms the text into a vector representation. For a sentence sequence s containing m words, the trained sequence vector of the sentence is \(s = \{ X_{1} ,X_{2} ,\ldots,X_{m} \} .\) The word vector of the i-th word is \(X_{i} = [W_{i}^{{{\text{word}}}} ,P_{i}^{{{\text{position}}}} ],\) where \(W_{i}^{{{\text{word}}}}\) is the text word vector and \(P_{i}^{{{\text{position}}}}\) is the position identification (PI), \(P_{i}^{{{\text{position}}}} = [d_{i} ,v_{i} ],\) in which \(d_{i}\) is the vector formed by the distances between the trigger word and the other words in the text, and \(v_{i} \in \{ 0,1\}\): \(v_{i} = 1\) marks a trigger word and \(v_{i} = 0\) marks any other word.

  1. (2)

    BiLSTM layer: consists of a forward LSTM network and a reverse LSTM network, which process the sentence in forward and reverse order, respectively, so that contextual information on both sides can be used. The layer is composed of many neural network modules that are combined to transmit and share information; in addition, a memory cell stores historical information to alleviate the long-term dependency problem. The LSTM has a “gate” structure, including a tanh layer and a sigmoid layer, which filter useful information and pass it to the next step, as shown in Eqs. (3) and (4):

    $$h_{i} = [\overrightarrow{{{\text{LSTM}}}} \oplus \overleftarrow{{{\text{LSTM}}}} ]$$
    (3)
    $$H = [h_{1} ,h_{2} ,...,h_{m} ]$$
    (4)
  1. (3)

    Self-attention layer: a linear combination of the m LSTM hidden vectors in H is used to represent the variable-length sentence, which is encoded into a fixed vector or matrix. The self-attention mechanism assigns different importance to different words, as shown in Eqs. (5) and (6):

    $$A = soft\max (w_{1} \tanh (w_{2} H^{T} ))$$
    (5)
    $$M = HA$$
    (6)

in which \(H \in \mathbb{R}^{m \times 2u}\) is the matrix of all hidden states, \(w_{1}\) is the weight matrix, \(w_{2}\) is the parameter matrix, and M is the matrix-level sentence representation obtained from Eqs. (5) and (6).

  1. (4)

    Output layer: a softmax classifier assigns each candidate causal event pair <ei, ej> obtained from sentence s to a relationship category and outputs the classification label:

    $$h^{*} = \tanh (M)$$
    (7)
    $$\hat{p}(y\left| s \right.) = soft\max (w^{(s)} h^{*} + b^{(s)} )$$
    (8)
    $$\hat{y} = \arg \, \max \hat{p}(y\left| s \right.)$$
    (9)

The direction finally output by the model of Eqs. (7)–(9) gives the relationship category of each event tuple pair; pairs whose label is not 0 are kept, yielding the risk event tuple pairs with a causal relationship.
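A minimal NumPy sketch of the position identifier and of the self-attention computation in Eqs. (5) and (6) is given below. Dimensions and weights are toy, randomly initialized assumptions; note that the product in Eq. (6) is taken here as A·H so that the matrix shapes conform, as in structured self-attention:

```python
import numpy as np

def position_identifiers(tokens, trigger_idx):
    # PI feature of the word-embedding layer: d_i is the signed distance
    # from each word to the trigger word, v_i flags the trigger itself.
    return [(i - trigger_idx, 1 if i == trigger_idx else 0)
            for i in range(len(tokens))]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions (assumed): m tokens, hidden size u per LSTM direction,
# attention dimensions d and r.
rng = np.random.default_rng(0)
m, u, d, r = 6, 4, 8, 2
H = rng.standard_normal((m, 2 * u))    # BiLSTM hidden states h_1..h_m
w2 = rng.standard_normal((d, 2 * u))   # parameter matrix
w1 = rng.standard_normal((r, d))       # weight matrix
A = softmax(w1 @ np.tanh(w2 @ H.T))    # Eq. (5): attention weights, (r, m)
M = A @ H                              # Eq. (6): sentence matrix, (r, 2u)
```

Each row of A is a probability distribution over the m words, so the rows sum to one and M is a fixed-size representation regardless of sentence length.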

5.3 Risk Pair Extraction and Analysis

This section uses the metro operation safety accident investigation reports and operation dispatch fault logs, comprising 110 accident reports and 30,027 dispatch fault log records. After configuring the word segmentation lexicon and performing the word segmentation preprocessing described in Section 4, the explicit causal sentences in the text data were identified with the help of the LTP 4.0 platform and the “Jieba,” “Chon” and “re” packages with regular expressions. From the studied accident report and dispatch log texts, 95,601 sentences were collected, including 21,988 sentences with causal cues and 73,612 sentences without, accounting for 23% and 77%, respectively.

5.3.1 Explicit Causality Risk Pair Extraction and Results Analysis

Because the raw data consist primarily of operational dispatch fault logs and accident reports, the sentences matched by the causal patterns P1–P5 may contain only causal keywords without explicit causal information about hazards. To extract valid matching risk events more accurately and remove invalid data, this paper defines matching sentences that contain hazards and related information as valid matching sentences. Specifically, the hazard information extracted in Section 4.4 is looped through the matching sentences using SQL statements; if a hazard is present, the sentence is identified as a valid matching sentence. Partial matching cases are shown in Table 9.

Table 9 Results of identification of valid matching sentences (partial)

Valid matching risk event pairs refer to risk event pairs that can clearly represent the causal relationship after the effective matching sentences are extracted. Table 10 shows partial results of effective matching risk event pairs extraction. The reason clause is stored in the “reason” column, the result clause is stored in the “result” column, and the subject and object of the effective risk event pair are stored in the “cause” and “effect” columns, respectively.

Table 10 Explicit causality of accident report extraction results (partial)

The effectiveness of the proposed pattern matching-based explicit causality extraction method is evaluated using precision as the metric, calculated as shown in Eq. (10):

$$E_{{{\text{Accuracy}}}} = \frac{{E_{{{\text{match}}}} }}{{{\text{Total}}_{{{\text{matching}}}} }} \times 100\%$$
(10)

in which:

\(E_{{{\text{Accuracy}}}}\): the precision of causality extraction;

\(E_{{{\text{match}}}}\): the number of valid risk event pairs;

\({\text{Total}}_{{{\text{matching}}}}\): the total number of matched sentences.

The extraction results for explicit causality are shown in Table 11.

Table 11 Analysis on the extraction effect of explicit causality

5.3.2 Implicit Causality Risk Pair Extraction and Results Analysis

To validate the effectiveness and accuracy of the proposed Self-Attention BiLSTM+PI method for extracting implicit causal relationships, experiments were conducted on both the publicly available SemEval_Task8 data set and the urban rail transit operation dispatch fault log and accident report data set. First, the 73,612 sentences without causal cues in the corpus studied in this paper were divided into training and testing sets in a 7:3 ratio. Additionally, 1331 annotated causal sentences from the public SemEval_Task8 data set were selected as contrast data and divided in the same proportion. The Self-Attention BiLSTM+PI method was then run on the training set to obtain the causal relationship matrix of risk events; partial causal relationship matrices are shown in Table 12.

Table 12 Matrix of causal relationships between risk events (partial)

Then, the risk event pairs were extracted from the testing set and matched with the corresponding directions in the causal relationship matrix using the softmax classifier. Finally, the event pairs with non-zero directions were selected to obtain the causally related risk event pairs in the testing set. Part of the risk event pairs extracted by implicit causality are shown in Table 13, where e1 and e2 are the two risk event tuples extracted from the original sentence in the sentence column. The direction column gives the relationship between the two tuples: a direction of 1 means e1 is the cause event of e2, and a direction of -1 means e1 is the result event of e2.

Table 13 Risk event pairs extracted from implicit causality

The experiments were run on an Intel(R) Core(TM) i5-1035G1 CPU @ 1.00 GHz (1.19 GHz); the software environment was the Windows 10 operating system with Anaconda 3.0. The extraction effect is evaluated using the standard indicators of the relation extraction field: precision (P), recall (R) and their harmonic mean F, calculated by Eqs. (11)–(13):

$$P = \frac{a}{b}$$
(11)
$$R = \frac{a}{c}$$
(12)
$$F = \frac{2PR}{{P + R}}$$
(13)

in which the precision P is the ratio of the number of correctly extracted causal relationships (a) to the number of all extracted relationships (b), the recall R is the ratio of the number of correctly extracted causal relationships to the total number of such relationships in the test sample (c), and the F value is the comprehensive evaluation index of the overall experimental effect. The experimental results are shown in Table 14.

Table 14 Analysis of the effect of implicit causality
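The three indicators of Eqs. (11)–(13) can be computed with a small helper; the counts below are illustrative, not the paper's results:

```python
def prf(a, b, c):
    """Eqs. (11)-(13): precision P = a/b, recall R = a/c, F = 2PR/(P+R),
    where a = correctly extracted causal pairs, b = all extracted pairs,
    c = causal pairs actually present in the test sample."""
    p, r = a / b, a / c
    return p, r, 2 * p * r / (p + r)

# e.g. 80 correct pairs out of 100 extracted, against 160 gold pairs
p, r, f = prf(80, 100, 160)
```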

As can be seen from Table 14, in public data set SemEval_2010_Task8, the F value of the Self-Attention BiLSTM method is 4.64% higher than that of the BiLSTM method, while the F value after adding the location identifier is 15.27% and 10.63% higher than that of BiLSTM and Self-Attention BiLSTM, respectively, which proves that the extraction effect has been significantly improved after adding the self-attention mechanism.

Generally, a risk event R can directly or indirectly trigger other risk events: when the state of R changes, other risk events may occur, so there is an obvious correlation between these risk events. In studying the transmission process of the causal risk chain, the key to controlling the chain is therefore the source event, and protective measures should be taken in time to reduce the transmission possibility of the whole risk chain.

5.3.3 Analysis and Discussion on the Robustness of Input Data Characteristics for Hazard Sources

The robustness of natural language processing must be considered: if a model is not robust, serious problems will arise in practical applications. Therefore, the robustness of the NLP methods proposed in this section is discussed and analyzed below.

In processing the text of the rail transit operation dispatch logs, the structure, carrier form and processing flow of the original data all affect the data set. For text data, replacing even a single character or position may change the semantics of the original text or produce text that violates the grammatical structure, so the input no longer fits the established model and its robustness suffers [30]. To improve robustness, we use a robust loss function realized by constraining the training log data samples: data within a given distance of a sample are assigned to the same class as the original sample, as shown in Fig. 10. This counters the classification errors caused by small disturbances and thus improves the robustness of the model, yielding a more accurate classification curve.

Fig. 10
figure 10

Data division under interference situation
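The distance-based constraint described above, in which points within a given distance inherit the class of the original sample, can be illustrated with a hypothetical augmentation helper; this is a sketch of the idea, not the paper's implementation:

```python
import random

def augment_with_neighbors(sample, label, radius, k=5, seed=0):
    """Consistency augmentation: k perturbed copies within `radius`
    of `sample` inherit its label, so that small disturbances to the
    input cannot flip the class during training."""
    rnd = random.Random(seed)
    out = [(sample, label)]
    for _ in range(k):
        noisy = [x + rnd.uniform(-radius, radius) for x in sample]
        out.append((noisy, label))
    return out

aug = augment_with_neighbors([0.2, 0.7], label=1, radius=0.05)
```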

In addition, regularization constraints on the model parameters can further improve robustness. Adding regularization improves the generalization of the model; however, the higher the feature dimension of the model, the worse its generalization performance and, correspondingly, its robustness.

5.4 Visualization and Analysis of Risk Chain

Some key hazard sources have already been mined in Section 4. To obtain the disaster risk chain of a key hazard source, the causal relationships must be extracted from the data set with the risk pair mining method constructed in Section 5.3 to express the risk causes; the key hazard sources in the risk data are then screened, their full life cycle information is extracted, and the risk cause sequence and risk transmission mechanism are revealed through visualization.

There are 16 key hazard sources calculated in Section 4.4. Taking the first five hazard sources in Table 4 as an example, according to the risk chain construction method described in 5.2, the risk chains obtained are shown in Figs. 11, 12, 13, 14 and 15.

Fig. 11
figure 11

The risk chain of catenary failure

Fig. 12
figure 12

The risk chain of PSD failure

Fig. 13
figure 13

Fire hazard risk chain

Fig. 14
figure 14

The risk chain of turnout failure

Fig. 15
figure 15

Traction failure of risk chain

Taking the risk chain with “catenary failure” as the key hazard source as an example, catenary failures have many causes: in windy weather, plastic bags, kites and other debris entangle the catenary, which in traditional hazard disposal schemes is usually attributed simply to a weather node. As another example, an excessive conductor height difference in the design causes frequent pantograph arcing, which produces sparks on the catenary and thus a catenary fault. As for the disposal scheme, the traditional method simply attributes the risk chain to an equipment problem, without detailing which equipment failed and why, let alone the detailed process of risk transmission.

In the traditional handling of metro accidents, operating enterprises usually trace the causes of an accident after it has occurred and, after a period of analysis, write a report to rectify those causes and control the hidden dangers. This kind of treatment amounts to “treating the head when the head aches and the foot when the foot hurts”; it lacks a genuinely holistic view and cannot accurately control and manage the hazard sources of rail transit operation. From the risk chain obtained in this paper, the subsequent events affected by a key hazard source can also be identified, for example the follow-on events caused by a catenary failure: failure to leave the depot, train outage, train braking failure, etc. The risk chain reflects the general evolution mechanism and transmission mode between risk events. Guided by the risk chain, the direction of risk transmission can be changed by changing the state in which a risk occurs, providing precise guidance for emergency disposal and for the management and control of accidents.

By contrast, the advantage of handling hazard sources from the perspective of refined hazard identification proposed in this paper is that the algorithm can accurately identify the hazard sources that cause accidents and mine the causative chains from past multi-dimensional, massive accident data to form accident chains. The resulting visual risk chain allows metro operators to quickly locate the chain-breaking point and control the development of accidents at the fastest speed and the lowest cost.

6 Conclusions and Further Studies

Based on the theory of event causality extraction in natural language processing, this paper proposes a text mining framework for hazard identification that can visualize the whole life cycle information of key hazard sources. Taking more than 100,000 metro operation dispatch logs and 110 operation safety accident reports collected in a city from 2019 to 2020 as the data set, the main results are as follows:

  1. (i)

    A framework for hazard source text extraction is proposed, and a professional domain lexicon is constructed. Combining Chinese word segmentation technology with the TF-IDF algorithm, the hazard sources of urban rail transit operation safety accidents are mined: 75 key feature items are obtained, from which 15 key hazard sources are further mined.

  2. (ii)

    A combined method based on pattern matching and machine learning is used to extract causality from the text data, with an extraction precision of 87.53%. In terms of F value on the text data, the Self-Attention BiLSTM+PI method outperforms Self-Attention BiLSTM by about 10%.

  3. (iii)

    A risk chain construction method based on key hazard sources is proposed to extract their full life cycle information. It realizes the visual display of the risk chain and helps find the “chain-breaking” points quickly, improving accident management and control capability.

The concept of the risk chain is introduced into the research on metro operation risk management and control, providing new ideas for risk management and enabling the transformation of metro operation safety from “passive aftermath” to “active defense.” This is of far-reaching significance for identifying the key hazard sources of metro operation and their chain propagation mechanism, and it can provide a decision-making basis for the scientific formulation of relevant industry standards and specifications.

Through the research in this paper, a systematic multi-source, high-dimensional, heterogeneous rail transit operation and production data center can be established, integrating the production data of operation, public works and maintenance to form an intelligent hazard identification data warehouse and a risk chain library, changing the current situation in which the utilization rate of rail transit production data is low while the failure rate is high. From a new perspective, the accident-cause transmission mode of hazard sources, namely the risk chain, is clarified, and the key nodes in the chain are identified to provide decision and theoretical support for “chain breaking.” A group of data mining methods and application directions was formed for the massive operation data of urban rail transit, with metro operation safety as the core and operation demand as the orientation. A set of methods applicable to the intelligent identification of hazard sources in rail transit operation was developed, standardizing the safety management of rail transit, realizing the refined management of hazard sources, and providing a methodological basis for formulating national, local and industrial standards and specifications related to rail transit operation safety.