Introduction

An oil well is drilled in phases (Bourgoyne et al. 1986; Mitchell and Miska 2016), resulting in a telescopic geometric structure in which concentric casing columns of successively smaller diameter are installed at greater depths at the end of each phase. Once drilling is complete, a production column is installed, through which the fluids produced from or injected into the well are conducted; the casings and the production column thus constitute the main components of the well structure. Because of this telescopic structure, several annuli in the well are isolated, especially in subsea wells (Cahill 2011), making it difficult to obtain information for decision-making during the well's productive life.

Despite the careful dimensioning of well structures, unforeseen conditions can occur during operation (Brechan et al. 2019), such as valve maneuvers that over-pressurize casings, or corrosion issues (Luibimova et al. 2021; Shaker and Jaafar 2022). Such situations can compromise the entire well structure, resulting in high financial losses: the construction of a subsea oil well can cost hundreds of millions of dollars, and the cost of recovery, depending on the damage, is unpredictable. Furthermore, such events can have immeasurable environmental impacts (Shaker and Sadeqb 2022), especially if oil spills into the sea, where it would be difficult to contain. Campos et al. (2019) reported a case where a well with integrity loss and leakage into the sea required permanent abandonment using a new interception well, causing harm to several ecosystems and eroding the image of the oil company. Unfortunately, cases like this are not uncommon, as Yuan et al. (2013) documented casing failure in over 60% of wells in a given field during production.

Nowadays, oil and gas fields are maturing and creating new threats, which has urged operating companies and industry researchers to focus intensively on well integrity. A Well Integrity Management System (WIMS) establishes standardized criteria to guarantee that the integrity of every well is preserved throughout its lifespan, that each well functions properly in a healthy condition, and that it can operate consistently as expected (Yakoot et al. 2021). In 2016, the importance of methodologies and models for predicting well behavior and managing well integrity became increasingly apparent with the resolution published by the Agência Nacional do Petróleo, Gás Natural e Biocombustíveis (ANP) (2016). This resolution instituted the Operational Safety Regime for Well Integrity of Oil and Natural Gas and approved the Technical Regulation of the Brazilian Well Integrity Management System, which established essential requirements and minimum standards of operational safety and environmental preservation to be met, in oil and natural gas wells in Brazil, by companies holding exploration and production rights under contracts with the ANP. Machine learning and deep learning models have been studied in many petroleum engineering and geosciences segments, such as petroleum exploration, reservoir characterization, oil well drilling, production, and well stimulation (Tariq et al. 2021), and they still have the potential to contribute to the analysis of whether a constructed well is operating within established safety limits and to the anticipation of operational problems.

Advancements in data management have empowered oil companies to build extensive databases, consolidating diverse data types from various sources, including machine sensors, economic metrics, human resources, production metrics, fluid data, and well mechanical conditions. These databases serve as a foundation for robust algorithms aimed at averting failures in the oil industry, such as pipeline leak detection (Liu et al. 2019), production hydrate detection (Marins and Netto 2018), inflow detection during drilling (Tang et al. 2019), and artificial lift failure (Liu et al. 2011).

While some researchers (Anjos et al. 2020; Aranha et al. 2022) have applied digital twin technology to monitor well integrity in real-time, capturing thermo-mechanical behavior and loads on tubular and well components, this approach does not encompass all potential subsurface or control system-related phenomena. For instance, it may not address wear, column holes, equipment malfunctions, or spurious valve closures, areas where machine learning could proactively anticipate issues and prevent operational losses. Improved anomaly prediction in offshore wells is crucial because rig interventions are costly; early model-driven detection can prevent them, minimizing production loss and safeguarding well integrity. Some studies have explored machine learning's potential in this domain:

  • Vargas et al. (2017) investigated K-Nearest Neighbors (K-NN) and Floating Window algorithms to detect issues like DHSV closure, production loss, line obstructions, and control valve closure. Real and simulated data demonstrated these methods’ feasibility in real plant applications.

  • Marins et al. (2021) introduced Condition-Based Monitoring (CBM) using Random Forest and Bayesian models to classify eight operational problems. Achieving over 94% accuracy, their models offered minimal detection delays, allowing operators ample time for damage mitigation.

  • Turan and Jaschke (2021) evaluated various machine learning algorithms, with Decision Trees standing out, effectively classifying failures except for spurious DHSV closure.

  • Machado et al. (2022) proposed an anomaly detection model with an LSTM autoencoder and Support Vector Machines (SVM) classifier, improving performance for faults with slow dynamics, particularly by enhancing detection times.

Unsupervised machine learning offers promise in scenarios with scarce labeled data, like oil well production (Elavarasan et al. 2018), by uncovering hidden patterns from unlabeled data. This study focused on using unsupervised machine learning algorithms for anomaly detection in multivariate oil production time series data, particularly targeting spurious DHSV closures. The model leveraged the Scikit-learn package's Novelty and Outlier Detection tools, exploring the hyperparameters of the local outlier factor (LOF) estimator. The developed model was tested on subsea wells (internal database) located in the Santos Basin, showcasing its potential for enhancing operational safety and efficiency.

Methodology

The work evaluated the Novelty and Outlier Detection algorithm of the Scikit-learn framework, investigating the impact of:

  (i) the hyperparameters of the Local Outlier Factor (LOF) estimator (Breunig et al. 2000);

  (ii) the temporal windows for training and testing, that is, how many time steps feed the predictor and how many are estimated/tested. The evaluation used data from the public database (Vargas et al. 2019), which contains classified datasets for the spurious DHSV closure problem, before the method was applied to anomaly detection in offshore wells.

Based on this, the temporal window formation routine and the training and testing process implemented for the unsupervised algorithm at each time step are shown schematically in Fig. 1.

Fig. 1

Schematic view of the temporal window routine

The implemented methodology followed the eight steps depicted in Fig. 2 and described as follows:

  1. For the public database (Vargas et al. 2019), the data in .csv files were read to define the size of the dataset. For the case studies (internal database), the sensor data were read from a database for a defined period (size) after being acquired by the real-time monitoring system;

  2. For the public database, pre-processing was performed on the classifier labels (class), unifying the transient and steady-state failure states into a single anomalous state, so as to give the operator as much time as possible to intervene and minimize losses. For this reason, all experiments used only the normal and fault classes instead of a separate transient state. The proposed method was a one-class classifier: a single classifier discriminates only between normal and faulty status. Therefore, all fault, transient, and not-a-number (NaN) values in the class column were combined into a single class, which was classified against the normal-operation class. There were no NaN values in the variable monitoring data;

  3. Normalization was applied using StandardScaler (Scikit-learn 2022);

  4. The training and test sets were defined for the time step, based on the acquisition timestamp;

  5. The training set defined at each time step was fed to the Novelty and Outlier Detection algorithm with previously defined hyperparameters, generating a model;

  6. The test window was then evaluated with this model, producing a response (cp index) that varies from 1 to −1 for each test sample; an average cp was calculated over an analysis window from the results provided by the model;

  7. The average cp of this test window was then compared with a threshold value (cp threshold) to assign the normal or faulty state; in this case, the threshold used was 0 (zero): if the average cp was positive or zero, the model classified the state as normal; if it was negative, the model classified the state as faulty;

  8. If the test set was evaluated as normal, a time step defined by the parameterized interval value was taken and the process repeated until the end of the dataset. If the evaluation indicated failure, a time step was still taken, but the model was not retrained; only a new test set was evaluated.

Fig. 2

Flowchart containing the steps of the proposed methodology
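The windowed train/test loop of steps 3–8 can be sketched with Scikit-learn's LocalOutlierFactor in novelty mode. This is a minimal illustration, not the authors' implementation: the function name and the simplified retraining logic are assumptions, while the window sizes (240 training / 60 test samples), the hyperparameters (n_neighbors = 32, leaf_size = 30), and the cp threshold of 0 follow the text.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

def sliding_window_detector(X, train_size=240, test_size=60,
                            n_neighbors=32, leaf_size=30, cp_threshold=0.0):
    """Slide over X (n_samples, n_features): retrain LOF on the latest
    training window while the state is normal; freeze the model and keep
    testing once a faulty window is found (steps 3-8 of the methodology)."""
    labels = np.zeros(len(X), dtype=int)   # 0 = normal, 1 = fault
    model, scaler, start = None, None, 0
    while start + train_size + test_size <= len(X):
        train = X[start:start + train_size]
        test = X[start + train_size:start + train_size + test_size]
        if model is None:                  # (re)train only in the normal state
            scaler = StandardScaler().fit(train)
            model = LocalOutlierFactor(n_neighbors=n_neighbors,
                                       leaf_size=leaf_size,
                                       novelty=True).fit(scaler.transform(train))
        cp = model.predict(scaler.transform(test))   # +1 inlier, -1 outlier
        if cp.mean() < cp_threshold:                 # average cp below threshold
            labels[start + train_size:start + train_size + test_size] = 1
        else:
            model = None                   # normal: retrain on the next window
        start += test_size
    return labels
```

On a synthetic series whose distribution shifts abruptly, the routine flags all test windows after the shift while leaving the normal stretch unflagged, mirroring the behavior described for the cp index.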

Novelty and outlier detection algorithm

The Local Outlier Factor (LOF) algorithm (Breunig et al. 2000) is an unsupervised anomaly detection method that calculates the local density deviation of a given data point with respect to its neighbors. Samples that have a substantially lower density than their neighbors are considered outliers. The number of neighbors considered (parameter n_neighbors) is normally set (1) greater than the minimum number of samples a cluster must contain, so that other samples can be local outliers relative to this cluster, and (2) less than the maximum number of samples close to samples that could potentially be local outliers. In practice, such information is usually not available, and n_neighbors must simply be set to a reasonable value. The key difference between the LOF algorithm and classical notions of outliers is that being outlying is not treated as a binary property.

For any positive integer k, the k-distance of object p, denoted k-distance(p), is defined as the distance d(p, o) between p and an object o ∈ D such that:

(i) for at least k objects o′ ∈ D\{p} it holds that d(p, o′) ≤ d(p, o), and

(ii) for at most k − 1 objects o′ ∈ D\{p} it holds that d(p, o′) < d(p, o).

Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance, i.e., N_k-distance(p)(p) = {q ∈ D\{p} | d(p, q) ≤ k-distance(p)}. These objects q are called the k-nearest neighbors of p. Let k be a natural number. The reachability distance of object p with respect to object o is defined as reach-dist_k(p, o) = max{k-distance(o), d(p, o)}.

Figure 3 illustrates the idea of reachability distance with k = 4. Intuitively, if object p is far away from o (e.g. p2 in the figure), then the reachability distance between the two is simply their actual distance. However, if they are “sufficiently” close (e.g., p1 in the figure), the actual distance is replaced by the k-distance of o. The reason is that in so doing, the statistical fluctuations of d(p,o) for all the p’s close to o can be significantly reduced. The strength of this smoothing effect can be controlled by the parameter k. The higher the value of k, the more similar the reachability distances for objects within the same neighborhood.

Fig. 3

Reach-dist(p1, o) and reach-dist(p2, o), for k = 4 (Breunig et al. 2000)
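As a small numeric sketch of these two definitions (the helper names k_distance and reach_dist are ours, not from the paper), consider five clustered points around an object o at the origin plus one distant point, with k = 4:

```python
import numpy as np

def k_distance(D, p, k):
    """k-distance(p): distance from D[p] to its k-th nearest other object."""
    d = np.delete(np.linalg.norm(D - D[p], axis=1), p)  # exclude p itself
    return np.sort(d)[k - 1]

def reach_dist(D, p, o, k):
    """reach-dist_k(p, o) = max(k-distance(o), d(p, o))."""
    return max(k_distance(D, o, k), float(np.linalg.norm(D[p] - D[o])))

# Four corner points and a center point cluster around o = D[0] = [0, 0];
# D[5] is a far-away point.
D = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [10, 10]], dtype=float)
```

Here k_distance(D, 0, 4) is √2, so for the nearby point D[4] the reachability distance is smoothed up to √2 (the k-distance of o dominates), while for the far point D[5] the actual Euclidean distance √200 ≈ 14.14 is used, exactly the two cases p1 and p2 in Fig. 3.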

In a typical density-based clustering algorithm, two parameters define the notion of density: (i) a parameter MinPts specifying a minimum number of objects and (ii) a parameter specifying a volume. These two parameters determine a density threshold for the clustering algorithm to operate: objects or regions are connected if their neighborhood densities exceed the given threshold. To detect density-based outliers, however, the densities of different sets of objects must be compared, which means that the density of a set of objects has to be determined dynamically. Therefore, MinPts is kept as the only parameter, and the values reach-dist_MinPts(p, o), for o ∈ N_MinPts(p), are used as a measure of the volume to determine the density in the neighborhood of an object p.

The local reachability density (lrd) of p is defined in Eq. (1) as follows:

$${\text{lrd}}_{\text{MinPts}} (p) = 1\Bigg/\left( \frac{\sum_{o \in N_{\text{MinPts}} (p)} {\text{reach-dist}}_{\text{MinPts}} (p,o)}{\left| N_{\text{MinPts}} (p) \right|} \right)$$
(1)

Intuitively, the local reachability density of an object p is the inverse of the average reachability distance based on the MinPts-nearest neighbors of p. Note that the local density can be ∞ if all the reachability distances in the summation are 0 (zero). This may occur for an object p if there are at least MinPts objects, different from p, but sharing the same spatial coordinates, i.e., if there are at least MinPts duplicates of p in the dataset.

The local outlier factor (LOF) of p is defined in Eq. (2) as follows:

$${\text{LOF}}_{\text{MinPts}} (p) = \frac{\sum_{o \in N_{\text{MinPts}} (p)} \frac{{\text{lrd}}_{\text{MinPts}} (o)}{{\text{lrd}}_{\text{MinPts}} (p)}}{\left| N_{\text{MinPts}} (p) \right|}$$
(2)

The outlier factor of object p captures the degree to which p can be called an outlier. It is the average of the ratios of the local reachability densities of p's MinPts-nearest neighbors to that of p. It is easy to see that the lower p's local reachability density is, and the higher the local reachability densities of p's MinPts-nearest neighbors are, the higher the LOF value of p.
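Eqs. (1) and (2) can be implemented directly. The sketch below (our own helper names, not the paper's code) computes lrd and LOF for any point of a small dataset; on a regular grid with one isolated point, interior points get LOF near 1 while the isolated point gets a LOF well above 1:

```python
import numpy as np

def neighborhood(D, p, k):
    """N_MinPts(p): indices of objects within the k-distance of p
    (ties included), together with the k-distance itself."""
    d = np.linalg.norm(D - D[p], axis=1)
    d[p] = np.inf                      # exclude p from its own neighborhood
    kdist = np.sort(d)[k - 1]
    return np.where(d <= kdist)[0], kdist

def lrd(D, p, k):
    """Local reachability density of p, Eq. (1)."""
    N, _ = neighborhood(D, p, k)
    reach = [max(neighborhood(D, o, k)[1], float(np.linalg.norm(D[p] - D[o])))
             for o in N]               # reach-dist_MinPts(p, o)
    return 1.0 / (sum(reach) / len(N))

def lof(D, p, k):
    """Local outlier factor of p, Eq. (2): mean lrd ratio over neighbors."""
    N, _ = neighborhood(D, p, k)
    return sum(lrd(D, o, k) for o in N) / (len(N) * lrd(D, p, k))
```

Note that the neighborhood keeps all tied objects at exactly the k-distance, as in the definition of N_k-distance(p)(p), so |N_MinPts(p)| can exceed MinPts.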

Metrics for performance evaluation

Classification metrics are widely used as benchmarks to evaluate the performance of classification algorithms. Therefore, a set of such metrics was used to compare and evaluate the adjusted unsupervised machine learning algorithm.

Accuracy (ACC) considers all normal and fault samples. ACC can be calculated in Eq. (3) as follows:

$$\mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{{\mathrm{n}}_{\mathrm{total}}}$$
(3)

where TP is the number of true positives, TN is the number of true negatives, and ntotal is the total number of samples. Precision (PR) is the proportion of true positives among all samples predicted as positive. PR can be calculated in Eq. (4) as follows:

$$\mathrm{PR}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(4)

where FP is the number of false positives.

Recall (REC) indicates the proportion of anomalous data that are correctly detected from all anomalies. Typically, in industrial applications, REC is a prominent metric as false negatives lead to much more harmful results than false alarms or false positives. The REC can be calculated in Eq. (5):

$$\mathrm{REC}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(5)

where FN is the number of false negatives.

Specificity (SP) estimates the ability of the algorithm to predict true negatives over false positives and can be calculated in Eq. (6):

$$\mathrm{SP}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}$$
(6)

The F1 Score is the harmonic mean of REC and PR, and can be estimated in Eq. (7):

$${\text{F1}} = \frac{2\left( {\text{PR}} \cdot {\text{REC}} \right)}{{\text{PR}} + {\text{REC}}}$$
(7)

For cases with unbalanced data, Accuracy (ACC) is not an appropriate measure; in such cases, it is essential to use balanced accuracy (ACCb), which can be estimated in Eq. (8):

$$\mathrm{ACCb}=\frac{\mathrm{REC}+\mathrm{SP}}{2}$$
(8)

The use of F1 as a metric also mitigates the effect of the unbalanced database, since the amount of positive data is much smaller than the amount of negative data in the database used. Accuracy alone has little sensitivity in temporal classification tasks.

AUC-ROC relates the False Positive Rate (FPR) to the True Positive Rate (TPR), where FPR is equivalent to 1 − SP; values close to 1 indicate classifiers with excellent performance. In addition to classification metrics, in time series analysis the correct identification of a failure is as important as its rapid detection. Detection time (TD), measured in seconds, is defined as the difference between the time at which the calculated parameter (cp) fell below the defined threshold (cp threshold), indicating the failure, and the time of the failure classification existing in the database.
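Eqs. (3)–(8) reduce to simple arithmetic on the confusion-matrix counts. The helper below (illustrative naming, not from the paper) makes explicit how ACC and ACCb diverge on unbalanced data:

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (3)-(8): accuracy, precision, recall, specificity, F1 and
    balanced accuracy from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # Eq. (3)
    pr = tp / (tp + fp)                     # Eq. (4)
    rec = tp / (tp + fn)                    # Eq. (5)
    sp = tn / (tn + fp)                     # Eq. (6)
    f1 = 2 * pr * rec / (pr + rec)          # Eq. (7)
    accb = (rec + sp) / 2                   # Eq. (8)
    return {"ACC": acc, "PR": pr, "REC": rec, "SP": sp, "F1": f1, "ACCb": accb}
```

For example, with 900 normal and 100 faulty samples (TP = 90, FN = 10, TN = 800, FP = 100), ACC is a flattering 0.89 while PR ≈ 0.47 and F1 ≈ 0.62 expose the weaker performance on the minority (fault) class.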

Case studies

Public database and data processing

The developed model was first evaluated on data from well and platform sensors in operation from a public database (Vargas et al. 2019), which contains a classification of problems that occurred in wells and production lines and includes real data for spurious DHSV closure. Spurious closures often occur without any indication at the surface, such as a pressure drop in the hydraulic actuator. Prompt automatic identification of a spurious closure of this valve can allow it to be reopened through corrective operating procedures, avoiding production losses and additional costs.

Figure 4 shows a schematic view of an offshore well connected to the production platform and the position of the TPT sensors (pressure and temperature sensor on the Christmas tree), the PDG (Permanent Downhole Gauge), and the Downhole Safety Valve (DHSV) in an offshore well.

Fig. 4

Schematic view of the offshore well connected to the production platform (Vargas et al. 2019)

The main sensors/attributes to be used were as follows: (i) pressure and temperature sensors in the TPT of the Wet Christmas Tree; (ii) pressure and temperature sensors in the PDG; and (iii) surface pressure in the well production/injection line. Table 1 presents the attributes available in the database, descriptions, and units. The available data are in the frequency of one second.

Table 1 Available attributes in the database with descriptions and units

When building the database, Vargas et al. (2019) classified each event instance in addition to failure periods. The transient and steady states of failure were unified as a single anomalous state, that is, a binary classification (normal and failure) was performed. There are no not-a-number (NaN) values in the variable monitoring data. However, there are a few data points without labels (NaN values in the class column). For these samples, data located between the transient and steady-state periods of failure were considered failure, and the rest normal. Table 2 presents a summary of the datasets used.

Table 2 Summary of data statistics

Model parameterization

Different combinations of parameters were implemented and tested during the execution of the algorithms. The intention of this study was not to exhaustively test all possible parameterizations, as testing every combination of existing parameters would be laborious and computationally costly. However, some hyperparameter adjustments were performed to obtain the best possible results given these restrictions. The hyperparameters n_neighbors and leaf_size were optimized through a grid search procedure. In the current study, n_neighbors = 32 and leaf_size = 30 were defined in accordance with the database used. All algorithms were implemented in the Python 3.9 programming language. Table 3 presents the hyperparameters used in the Local Outlier Factor estimator for the public datasets and for the cases studied, after the grid search procedure.

Table 3 Hyperparameters used for Local Outlier Factor estimator

For the first time step of the public database, an interval of 240 s (4 min) was defined for the training window, composed of 240 samples of each attribute, and the next 60 s was defined as the test set. At each iteration, the model response (cp index), which varies from 1 to −1, is calculated for each point of the test set; a window is then created to analyze the latest results provided by the model, which are compared against a defined threshold, in this case a cp threshold of 0 (zero). Sensitivity studies were also carried out, varying this threshold by 5% (cp threshold + 0.05), 10% (cp threshold + 0.10), and 20% (cp threshold + 0.20). If the failure threshold has not been reached in the test set, a time step of 60 s (the size of the test set) is taken, and the training set is reassigned, now using the last 3 min, as shown in Fig. 2. If the failure threshold has been reached, the model is not retrained and is only tested again on the next time step and test set.

For the real use cases coming from the monitoring system, the number of samples of each attribute was kept the same, with 240 samples in the training set. However, the data arrive at a much lower frequency than in the public database: one sample every 30 s. The training period therefore represents 2 h (120 min), and the test period of 60 samples represents the next 30 min.
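These window durations follow directly from sample count times sampling interval; a trivial helper (illustrative only) makes the conversion explicit:

```python
def window_duration_min(n_samples, sample_interval_s):
    """Wall-clock length, in minutes, of a window of n_samples acquired
    every sample_interval_s seconds."""
    return n_samples * sample_interval_s / 60
```

With the 1 s public data, 240 training samples span 4 min; with the 30 s field data, the same 240 samples span 120 min (2 h) and the 60-sample test window spans 30 min.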

Real dataset–Brazil’s pre-salt offshore wells

After the evaluation of the developed model on the public database with the hyperparameters and other parameterization described previously, the model was deployed on production/injection subsea wells in the Santos Basin, eastern Brazil, all equipped with multiplexed Wet Christmas Trees. A web-based prototype was developed to provide a user-friendly experience for monitoring multiple wells simultaneously. The proposed methodology was integrated into an in-house real-time monitoring system of offshore wells, gathering data from sensors of subsea production/injection wells. An advantage of this system lies in its comprehensive identification of anomalies related to downhole safety valves across hundreds of subsea wells, where the anomalies would be difficult for human operators to track. In subsea well scenarios, certain challenges have been observed, including sporadic gaps in sensor data during real-time monitoring and a lower acquisition frequency than that of the publicly available data. Consequently, these factors extend the analysis period compared with the previously used public database. Some use cases, compared with human post-classification, are presented in the following sequence.

The model was applied in several wells, and four case studies are highlighted to illustrate the efficacy of the developed model in monitoring situations and the anomalous events detected:

Well #1 is an oil production well in the pre-salt area of the Santos Basin, in a water depth of 1955 m, drilled in five phases with a directional trajectory and a final depth of 5821 m, equipped with Intelligent Completion Valves and producing from two different zones between 5376 m and 5738 m.

Well #2 is an oil production well in the pre-salt area of the Santos Basin, in an ultradeep-water scenario (2029 m), drilled in four phases with a directional trajectory and a final depth of 5928 m, equipped with Intelligent Completion Valves and producing from three different zones between 5489 m and 5928 m.

Well #3 is an oil production well in the pre-salt area of the Santos Basin, in an ultradeep-water scenario (2041 m), drilled in five phases with a directional trajectory and a final depth of 5870 m, equipped with Intelligent Completion Valves and producing from three different zones between 5432 m and 5774 m.

Well #4 is a water injection well in the pre-salt area of the Santos Basin, in a water depth of 2022 m, drilled in four phases with a vertical trajectory and a final depth of 5923 m, equipped with Intelligent Completion Valves and injecting into two different zones between 5407 m and 5854 m.

Results and discussions

Evaluation of the proposed model using a public database

The first results presented are the evaluation of the proposed model on the public database (Vargas et al. 2019). The analysis and metrics for each of the fifteen id datasets, varying the threshold (cp threshold) over 0, 0.05, 0.1, and 0.2, are shown in Tables 4, 5, 6, and 7, respectively. The tables show the metrics obtained by applying the unsupervised Novelty and Outlier Detection model of Scikit-learn (2022) with the hyperparameters shown in Table 3.

Table 4 Results obtained by the unsupervised model (cp threshold = 0)
Table 5 Results obtained by the unsupervised model (cp threshold = 0.05)
Table 6 Results obtained by the unsupervised model (cp threshold = 0.10)
Table 7 Results obtained by the unsupervised model (cp threshold = 0.20)

Figure 5 presents an example result for id dataset 1, with a threshold (cp threshold) equal to 0, showing a classification generated by the model (classification) that is very close to the classification contained in the database (class), considering the transient and steady-state failure intervals as the failure state. In Fig. 5, the x-axis presents time and the y-axis presents the variables P-PDG, P-TPT, T-TPT, and P-MON-CKP, normalized using StandardScaler (Scikit-learn 2022), together with the model classification (classification) and the fault classification (class).

Fig. 5

Example of data analysis between model classification (classification) and fault classification (class) in the database for Id dataset 1 (WELL-00003_20141122214325.csv)

The best result was obtained for a threshold (cp threshold) equal to 0, as can be seen from the balanced accuracy (ACCb), F1-SCORE, and other metrics. The average balanced accuracy for the 15 datasets tested was 0.991, with a standard deviation of 0.014. The model's performance worsened with positive variations of the threshold (cp threshold). Figure 6 shows the variation of the mean of each metric as a function of the value defined as the failure threshold (cp threshold).

Fig. 6

Results obtained for metrics ACC, PR, REC, SP, F1-SCORE, and ACCb (mean values) as a function of threshold variation (cp threshold)

Comparison with other methods

The results of the present study were compared with those of previous studies in the literature using Random Forest (Marins et al. 2021), Decision Tree (Turan and Jaschke 2021), and LSTM (Machado et al. 2022). This comparison on the same public database (Vargas et al. 2019) is important for analyzing the advances achieved by the different methods applied to the detection of the same type of event in subsea wells.

Table 8 presents the summarized results for the four classifiers applied to spurious DHSV closures: RF denotes the Random Forest results of Marins et al. (2021), DT the Decision Tree work of Turan and Jaschke (2021), and LSTM the work of Machado et al. (2022). Finally, LOF denotes the present study; comparisons are shown for the ACC, F1-SCORE, and ACCb metrics.

Table 8 Comparison of metrics between literature and the proposed methods for identifying spurious closure events of DHSV

The proposed model, using the Novelty and Outlier Detection algorithm from the Scikit-learn package (2022) with a Local Outlier Factor (LOF) estimator, obtained results similar to the LSTM model in the accuracy metric (ACC), while the F1-SCORE of the proposed method (LOF) showed better performance. The weaker results for identifying spurious closure events using DT (Turan and Jaschke 2021) were attributed to the lower number of instances.

Application in Brazil’s pre-salt offshore wells

The four cases studied, compared with human post-classification (class), are presented in the following sequence. The analyzed data from well #1 are presented in Fig. 7, showing data from the PDG pressure (P-PDG), TPT pressure (P-TPT), and TPT temperature (T-TPT) sensors and the production choke position at topside (CHK, %), along with the results of the analyses and classifications for a specific time window on 02/25/2021. The model (blue curve) identified the event properly at 2:32 PM; the anomalous event was then corrected by the operator with valve cycling performed from 2:54 PM to 5:24 PM, and the human post-classification (class) identified the event at 2:54 PM.

Fig. 7

Results for well #1 for the anomalous event detected–PDG, TPT sensors, and production choke position at Topside, comparison between model (classification) and fault classification (class)

The following example refers to well #2, in which the developed system identified an anomalous event at 5:57 PM on 04/21/22. In the field, the problem was only detected some time later and corrected by the operator with valve cycling. Figure 8 presents the data from the PDG pressure (P-PDG), TPT pressure (P-TPT), and TPT temperature (T-TPT) sensors, the production choke position at topside (CHK, %), and the classifications over the period from 04/21/22 to 04/23/22.

Fig. 8

Results for well #2 for the anomalous event detected–PDG, TPT sensors, and production choke position at topside, comparison between model (classification) and fault classification (class)

Figure 9 presents data and results for the first anomalous event at well #3, on 06/04/2022. The model identified the anomaly at 4:14 PM (classification curve) and correctly classified it as a spurious closure of the DHSV. The operator then proceeded with the valve cycling and reopening procedure, resuming production in a safe condition. After the well was reopened at 11:50 PM, the system resumed its analysis.

Fig. 9

Results for well #3 for the first anomalous event detected–PDG, TPT sensors, and production choke position at topside, comparison between model (classification) and fault classification (class)

Some time later, a second anomalous event at well #3, on 06/06/2022, was identified and is presented in Fig. 10. The model identified the anomaly at 6:29 PM (classification curve) and correctly classified it as a spurious closure of the DHSV. The operator again proceeded with the valve cycling and reopening procedure, resuming production in a safe condition. After the well was reopened on 06/07/2022 at 6:50 AM, the system resumed its analysis.

Fig. 10

Results for well #3 for the second anomalous event detected–PDG, TPT sensors, and production choke position at topside, comparison between model (classification) and fault classification (class)

Figure 11 presents data and results for well #4, on 04/25/2022. The model identified the anomaly at 10:44 AM (classification curve) and correctly classified it as a spurious closure of the DHSV. In the field, the problem was only detected some time later.

Fig. 11

Results for well #4 for anomalous event detected–PDG, TPT sensors, and production choke position at topside, comparison between model (classification) and fault classification (class)

Table 9 presents the Accuracy, F1-SCORE, and balanced accuracy (ACCb) values calculated for the cases studied, compared with the human post-classification (class).

Table 9 Comparison of metrics for cases studied in this dataset

The proposed system performs well on the task of anomaly detection, with detection rates exceeding 90% in the real field production scenarios studied. Changes in well and reservoir conditions over time impose concept and data drift on the time series, which are challenging factors for machine learning models.

Conclusions

This study focused on using an unsupervised machine learning algorithm for anomaly detection in multivariate oil production time series data, particularly targeting spurious subsurface safety valve closures. The main findings are as follows:

  1. Highly accurate anomaly detection: The proposed model demonstrates exceptional performance, with a balanced accuracy (ACCb) of 0.9910 and an F1-SCORE of 0.9969 when predicting spurious DHSV closure events, addressing the challenge of unbalanced databases.

  2. Threshold optimization: The model achieved its best results with the threshold (cp threshold) set at 0, as evidenced by superior ACCb and F1-SCORE metrics.

  3. Innovation in anomaly detection: A Novelty and Outlier Detection algorithm using the Local Outlier Factor (LOF) estimator from the Scikit-learn package was introduced, showing accuracy competitive with LSTM models and a superior F1-SCORE.

  4. Real-world application: The proposed LOF-based method exhibits excellent accuracy, with detection rates above 90% in offshore well production scenarios. Early anomaly identification minimizes non-productive time (NPT) and production losses.

  5. Robustness to concept and data drift: The model maintained high detection rates in the face of changing well and reservoir conditions, highlighting its resilience to concept and data drift in time series data, a critical factor in real-world applications.

  6. Field testing: The model's applicability was validated in ultradeep-water subsea wells in the pre-salt area of the Santos Basin, demonstrating its potential to enhance anomaly prediction in offshore wells, ultimately reducing costly equipment malfunctions and safeguarding well integrity.

  7. Industry impact: This research addresses pressing concerns in the oil and gas industry by enabling timely anomaly detection and corrective action, leading to improved operational safety and substantial cost reduction.

In summary, the research introduces an innovative approach to anomaly detection in offshore well production scenarios, offering high accuracy, robustness, and real-world applicability.