1 Introduction

The concern about pollution and climate change is rising every year with media and public opinion is increasingly concerned about global sustainability. Certainly, one reason of concern is the annual emission of a huge amount of carbon dioxide (CO\(_2\)), largely derived from anthropogenic activities like transportation, heavy industries, and electricity generation from fossil combustibles. For these reasons, at the international level, several countries and supranational organizations are devising strategies to decrease the consumption of hydrocarbons, a notable example being the Paris Agreement in 2016, whose main goal is to reduce the annual CO\(_2\) emission levels by at least 55% by 2030 (compared to their values reached in 1990). The relevance of the problem is highlighted by the fact that international agreements on global CO\(_2\) emissions reduction involve countries only on a voluntary basis. In other words, there is currently an absence of commitment at an international level to pollution control. This shows how it is difficult to reach a global, enforceable agreement on the reduction of CO\(_2\) emission levels (although some theoretical models about the effectiveness of possible commitment policies have been developed in the literature: see, e.g., El Ouardighi et al. [8] and El Ouardighi et al. [9]). The problem is also highly relevant from an economic point of view. In this respect, a key statistic describing climate change impacts of CO\(_2\) emissions is the so-called social cost of CO\(_2\) (see, e.g., Kikstra et al. [18]), i.e., the projected cost to society of releasing an additional ton of CO\(_2\).

Nevertheless, to the best of the authors’ knowledge, the potential inaccuracy/missingness of CO\(_2\) emissions data in certain countries was overlooked, at least at the country level. Thereby, taking the hint from the past successful applications of machine learning to environmental sciences (see, e.g., Hsieh [16]), in this article, we contribute to the topic by studying CO\(_2\) emissions by applying a Supervised Machine Learning (SML) method to a country-sector level database spanning several years in the recent past. Specifically, we employ a method called Matrix Completion (MC, see Hastie et al. [15]), which was recently popularized, among others, by the 2021 Nobel-prize winner in Economics, Guido W. Imbens [1]. Its core idea consists in the formulation of a suitable optimization problem modeling SML [23], namely the minimization of a proper trade-off between the approximation error over a set of observed elements of a matrix (training set) and a proxy of the rank of the reconstructed matrix, e.g., its nuclear norm (i.e., the summation of all its singular values). A strong advantage of MC with respect to other methods resides in its flexibility, which permits it to be adopted, with appropriate adaptations, in various fields of research. Classical applications of various forms of MC (see, e.g., Candès and Recht [3]) arise, e.g., in collaborative filtering, system identification, and recovery of sensor maps. Two recent successful examples of MC application are represented by the works Metulini et al. [22], where MC is applied for the reconstruction of World Input–Output Database (WIOD) subtables, and Gnecco et al. [14], where MC is used to define a novel index of economic complexity, based on the different degree of predictability of the elements of the Revealed Comparative Advantage (RCA) matrix which are associated with each country. In the environmental context, an application of MC to climate prediction is made by Ghafarianzadeh and Monteleoni [10]. A recent use of MC for the prediction of CO\(_2\) emission levels is made by Huang et al. [17], but is limited to 11 urban areas in China. Such an application is justified therein by the presence of locally missing data on CO\(_2\) emission levels.

Given the framework above, in this work we propose a specific adjustment (i.e., a proper pre-processing of the available dataset) that may improve MC performance for its specific use with data associated with annual CO\(_2\) emission levels on a country and economic sector basis. More in details, we apply MC to country-specific subsets of the available pre-processed database of CO\(_2\) emissions, combining its prediction with the ones of suitably-constructed baselines. The statistical significance of the results of the one-sided Wilcoxon matched-pairs signed-rank tests performed highlights that the combination of MC with the baselines has typically a better performance than the baselines themselves. The present article represents a thorough extension of the short conference article by the same authors (Biancalani et al. [2]), in which MC was compared only with a simple baseline (the sector-specific average over all the available years in the training set) and no MC/baseline combination was performed.

The article is structured as follows. Section 2 provides a description of the database used for the successive analysis. Section 3 details the adopted methodology of analysis, which is based on a regularized matrix completion optimization problem. Section 4 reports its results. Finally, Sect. 5 concludes the work, pointing out possible extensions of the analysis.

2 Database description

The database used for our analysis is described by Corsatea et al. [4], and is freely downloadable from the following hyperlink: https://joint-research-centre.ec.europa.eu/document/download/b572c87b-a2fb-4ab6-af38-ff0451273e9e_en?filename=co2em56.zip. It refers to annual CO\(_2\) emission levels from 56 economic sectors and from households, for 12 energy commodities. It covers 29 European countries and 13 other major countries in the world, in the period 2000–2016 (one observation for each country, sector, and year). Namely, the 42 countries which are covered in the database are the following ones: AUS (Australia), AUT (Austria), BEL (Belgium), BGR (Bulgaria), BRA (Brazil), CAN (Canada), CHE (Chile), CHN (China), CYP (Cyprus), CZE (Czech Republic), DEU (Germany), DNK (Denmark), ESP (Spain), EST (Estonia), FIN (Finland), FRA (France), GBR (Great Britain), GRC (Greece), HRV (Croatia), HUN (Hungary), IDN (Indonesia), IND (India), IRL (Ireland), ITA (Italy), JPN (Japan), KOR (South Korea), LVA (Latvia), LTU (Lithuania), LUX (Luxembourg), MEX (Mexico), MLT (Malta), NOR (Norway), POL (Poland), PRT (Portugal), ROU (Romania), RUS (Russia), SVK (Slovakia), SVN (Slovenia), SWE (Sweden), TUR (Turkey), TWN (Taiwan), and USA (United States of America).

In our analysis, for each country, data associated with the last three rows of the corresponding matrix of CO\(_2\) yearly emission levels are removed. They correspond, respectively, to the emissions associated with 2 specific sectors (coded, respectively, as T: “Activities of households as employers; undifferentiated goods- and services-producing activities of households for own use”, and U: “Activities of extraterritorial organizations and bodies”), for which the annual CO\(_2\) emission levels reported in the database are typically 0 or nearly equal to 0; to the emissions associated with final consumption expenditure by households (coded as \(FC\_HH\)). Indeed, the interest of the present analysis is in the emissions related to production activities. Concluding, a total of 54 sectors is considered to perform our analysis. The resulting yearly CO\(_2\) emission levels matrices (one for each country considered) have 54 rows and 17 columns.

3 Methodology

In the article, we exploit the following formulation of the Matrix Completion (MC) optimization problem, which was studied theoretically by Mazumder et al. [21]:

$$\begin{aligned} \underset{{\hat{\textbf{M}}}\in \mathbb {R}^{m \times n}}{\textrm{minimize}} \left( \frac{1}{2} \sum _{(i,j) \in \Omega ^\textrm{tr}} \left( {\hat{M}}_{i,j}-M_{i,j}\right) ^2 + \lambda \Vert {\hat{\textbf{M}}}\Vert _*\right) , \end{aligned}$$
(1)

where \(\Omega ^\textrm{tr}\) (which, using a machine-learning expression, can be called training set) is a subset of pairs of indices (ij) corresponding to positions of known elements of a matrix \(\textbf{M} \in {\mathbb {R}}^{m \times n}\), \({\hat{\textbf{M}}}\) is the completed matrix (to be optimized by solving the optimization problem above), \(\lambda \ge 0\) is a regularization parameter, and \(\Vert \mathbf{\hat{M}}\Vert _*\) is the nuclear norm of the matrix \(\mathbf{\hat{M}}\). The problem has a similar structure as the well-known Least Absolute Shrinkage and Selection Operator (LASSO) optimization problem (see, e.g., Kim and Bou [19], for its short presentation). The regularization parameter \(\lambda\) controls the trade-off between fitting the known elements of the matrix \(\textbf{M}\) and achieving a small nuclear norm of its reconstruction \(\mathbf{\hat{M}}\). In this article, the optimization problem (1) is solved numerically by applying an iterative algorithm called Soft Impute, developed by Mazumder et al. [21] – to which we refer for a convergence analysis – and reported in the following as Algorithm 1 (see also the supplementary material of Gnecco et al. [14] for a short discussion of implementation issues about this algorithm). The following notation is used. For a matrix \(\textbf{Y} \in {\mathbb {R}}^{m \times n}\), \(\textbf{P}_{\Omega ^\textrm{tr}}(\textbf{Y})\) represents the projection of \(\textbf{Y}\) onto \(\Omega ^\textrm{tr}\), \(\textbf{P}_{\Omega ^\textrm{tr}}^{\perp }(\textbf{Y})\) denotes the projection of \(\textbf{Y}\) onto the complement of \(\Omega ^\textrm{tr}\), whereas \(\textbf{S}_\lambda (\textbf{Y}):= \textbf{U} \varvec{\Sigma }_\lambda \textbf{V}^\top ,\) being \(\textbf{Y}=\textbf{U} \varvec{\Sigma } \textbf{V}^\top\) (with \(\varvec{\Sigma }=\textrm{diag} [\sigma _1,\ldots ,\sigma _R]\)) the singular value decomposition of \(\textbf{Y}\), and \(\varvec{\Sigma }_\lambda :=\textrm{diag} [(\sigma _1-\lambda )_+,\ldots ,(\sigma _R-\lambda )_+]\), with \(t_+:=\max (t,0)\).

figure a

At each iteration, the Soft Impute algorithm exploits its current solution (called \(\mathbf{\hat{M}}^\textrm{old}\)) to the MC optimization problem (1) with the aim of estimating the unobserved portion of the matrix \(\textbf{M}\). This estimate is used to generate a completed matrix \(\mathbf{\hat{M}}\) which, differently from \(\mathbf{\hat{M}}^\textrm{old}\), coincides by construction with \(\textbf{M}\) on the training set. Finally, a new estimate (called \(\mathbf{\hat{M}}^\textrm{new}\)) is obtained by computing the singular value decomposition of \(\mathbf{\hat{M}}\), reducing by \(\lambda\) its singular values larger than \(\lambda\), and zeroing all the other singular values. The matrix \(\mathbf{\hat{M}}^\textrm{new}\) replaces \(\mathbf{\hat{M}}^\textrm{old}\) at the next iteration. The algorithm is initialized with \(\mathbf{\hat{M}}^\textrm{old}=\textbf{0}\).

In our application, the tolerance parameter of the Soft Impute algorithm (which refers to the minimum allowable relative change \(\frac{\Vert \mathbf{\hat{M}}^\textrm{new}-\mathbf{\hat{M}}^\textrm{old}\Vert _F^2}{\Vert \mathbf{\hat{M}}^\textrm{old}\Vert _F^2}\) of the square of the Frobenius norm of the reconstructed matrix, and forms the termination criterion used by the algorithm) is selected as \(tol=10^{-10}\). Additionally, when convergence is not achieved, in order to reduce the computational effort, the algorithm terminates after \(N^\textrm{it}=10^5\) iterations. This is motivated by the fact that, in our application, MC has to be performed several times, for different matrices \(\textbf{M}\) (one for each country in the database), several training sets \(\Omega ^\textrm{tr}\), and various choices of \(\lambda\).

Since MC typically achieves better performance when the elements of the matrix to which it is applied have similar orders of magnitude (see, e.g., its successful application considered by Gnecco et al. [12], where the matrix elements are percentages between \(0\%\) and \(100\%\)), for every country, the original matrix of annual CO\(_2\) emissions is pre-processed by dividing every row by the \(l_1\) norm of that row restricted to the training set (i.e., by the summation of the absolute values of its elements restricted to the training set), then multiplying it by the fraction of observed elements in that row (this pre-processing step is not performed in case a row contains only zeros in the associated training set). The resulting (country-specific) matrix is denoted as \(\textbf{M}^\mathrm{pre\text {-}processed}\). It is worth observing that such a pre-processing exploits only the elements of the training set (i.e., the elements of the validation and test sets, which are described later in this section, are not involved in such a step). In other words, this pre-processing aims at making similar the orders of magnitude of the elements belonging to different rows of the pre-processed matrix, and at the same time it does not use any information that one may want to predict later using MC, since this would be unfair.

Then, we consider the three following methods:

  1. 1.

    In the first case, MC is applied directly to the pre-processed matrix, i.e., one takes \(\textbf{M}=\textbf{M}^\mathrm{pre\text {-}processed}\) in Eq. (1). The output \(\mathbf{\hat{M}}\) of Algorithm 1 is then taken as an estimate of the pre-processed matrix, i.e., one takes \(\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}=\mathbf{\hat{M}}\). In the following, this method is denoted simply as “MC”.

  2. 2.

    In the second case, first a suitable baseline estimate is generated, which is collected in a matrix \(\mathbf{\hat{M}}^\textrm{baseline}\). Then, MC is applied to the difference between the pre-processed matrix and the baseline estimate matrix, i.e., one takes \(\textbf{M}=\textbf{M}^\textrm{residual}:=\textbf{M}^\mathrm{pre\text {-}processed}-\mathbf{\hat{M}}^\textrm{baseline}\) in Eq. (1). The output \(\mathbf{\hat{M}}\) of Algorithm 1 is then taken as an estimate of the residual of the pre-processed matrix, i.e., one takes \(\mathbf{\hat{M}}^\textrm{residual}=\mathbf{\hat{M}}\), or equivalently \(\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}=\mathbf{\hat{M}}^\textrm{baseline}+\mathbf{\hat{M}}\). In the following, this method is denoted as “MC/baseline”.

  3. 3.

    In the third case, only the baseline estimate is used, hence one gets \(\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}=\mathbf{\hat{M}}^\textrm{baseline}\). In the following, this method is denoted as “baseline”.

In both cases 1 and 2, an additional post-processing step is included, thresholding to 0 any negative element (when present) of the matrix \(\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}\). This step is motivated by the fact that the original matrix of annual CO\(_2\) emission levels is non-negative (likewise its pre-processed version). Such a step is not needed for case 3, assuming that the baseline estimates are non-negative (as it occurs for the choices of the baselines detailed in the following). In both cases 1 and 2, for every \(\lambda\), the resulting completed and thresholded matrix is denoted as \(\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}_{\lambda }\).

In the following, three different baseline estimate matrices are considered, denoted as \(\mathbf{\hat{M}}^{\textrm{baseline}_1}\), \(\mathbf{\hat{M}}^{\textrm{baseline}_2}\), and \(\mathbf{\hat{M}}^{\textrm{baseline}_3}\). They are defined, respectively, as follows:

  • each element of \(\mathbf{\hat{M}}^{\textrm{baseline}_1}\) is generated as the sector-specific (i.e., row-specific) simple moving average (Chiulli [5]) of \(\textbf{M}^\mathrm{pre\text {-}processed}\) over the previous 5 years in the training set (respectively, for the first 5 years, the value itself on each element of the training set);

  • each element of \(\mathbf{\hat{M}}^{\textrm{baseline}_2}\) is generated as the year-specific (i.e., column-specific) average of \(\textbf{M}^\mathrm{pre\text {-}processed}\) over the training set (by construction, \(\mathbf{\hat{M}}^{\textrm{baseline}_2}\) is constant on each column);

  • \(\mathbf{\hat{M}}^{\textrm{baseline}_3}\) is the mean of \(\mathbf{\hat{M}}^{\textrm{baseline}_1}\) and \(\mathbf{\hat{M}}^{\textrm{baseline}_2}\), i.e., \(\mathbf{\hat{M}}^{\textrm{baseline}_3}=\frac{\mathbf{\hat{M}}^{\textrm{baseline}_1}+\mathbf{\hat{M}}^{\textrm{baseline}_2}}{2}\).

The choice of the first baseline is motivated by the fact that a preliminary descriptive analysis highlighted that, for every country, annual CO\(_2\) emission levels of every sector change typically quite smoothly from one year to the successive one. The choice of the second baseline is motivated by the fact that some yearly changes across sectors can be still observed, especially between the years 2008 and 2009. However, annual CO\(_2\) emission levels among sectors are still quite heterogeneous at the cross-sectional level. For this reason, the prediction capability of the second baseline is expected to be smaller than that of the first baseline. An intermediate case is represented by the third baseline, which combines the first two baselines.

The reason for which one expects that applying MC on the residual of a baseline estimate matrix improves the performance of MC when the latter is used alone – at least when the baseline has good prediction capability – is that, in the MC optimization problem (1), the non-negative regularization term \(\lambda \Vert \mathbf{\hat{M}}\Vert _*\) refers to the whole completed matrix \(\mathbf{\hat{M}}\). Hence, for large \(\lambda\), the elements of \(\mathbf{\hat{M}}\) tend to be shrunk towards 0, in a similar way as it occurs in the case of the LASSO optimization problem. In particular, this can cause a negative bias in the estimates when \(\textbf{M}\) is a matrix with non-negative elements (moreover, biasedness can be ascribed also to the fact that the Soft Impute algorithm is initialized with a \(\textbf{0}\) matrix, and terminates after \(N^\textrm{it}\) iterations). The combination of MC with a suitable baseline estimate matrix is expected to make such a bias less negative (and possibly improve the MC performance), because in this case the reconstructed matrix is decomposed into the sum of two matrices, and the regularization term acts not on the whole reconstructed matrix, but only on one of these two matrices. Finally, the combination of MC with a baseline is expected to have better generalization capability than the baseline itself, since in that case, for \(\lambda =0\), the optimal solution to the MC optimization problem (1) does not alter the baseline estimate outside the training set, whereas improvements are likely to obtained for other values of \(\lambda\).

In the present MC application to (country-specific) matrices associated with CO\(_2\) emissions, the union of the validation and test sets refers to positions of matrix elements which are artificially obscured (but are still available as a ground truth), whereas the training set refers to the positions of all the remaining elements of the matrix considered. Specifically, for every country, MC is applied 20 times, every time generating the training set as follows:

  • \(50\%\) of randomly selected rows (sectors) are observed over the whole time period;

  • the remaining \(50\%\) rows are observed over all the years, except for the last three years in the database (namely, 2014, 2015, and 2016).

Then, to avoid overfitting, the regularization parameter \(\lambda\) is chosen according to the following validation method. First, the set of positions of unobserved elements of the matrix \(\textbf{M}\) is partitioned randomly into a validation set \(\Omega ^\textrm{val}\) (which contains about \(25\%\) of the positions of the unobserved elements) and a test set \(\Omega ^\textrm{test}\) (which refers to the positions of the remaining elements). In order to ease the comparison of the MC results when considering different countries, the random choices of the training, validation, and test sets are the same for every country, in each of the 20 repetitions of the MC application (nevertheless, distinct repetitions turn out to have different realizations of the training, validation, and test sets). It is worth noting that, by construction, the training, validation, and test sets do not overlap.

Finally, the optimization problem (1) is solved for several choices \(\lambda _k\) for \(\lambda\), exponentially distributed as \(\lambda _k=2^{k/2-25}\), for \(k=1,\ldots ,100\). For every \(\lambda _k\), the Root Mean Square Error (RMSE) of matrix reconstruction on the validation set is computed as

$$\begin{aligned} RMSE_{\lambda _k}^\textrm{val}:=\sqrt{\frac{1}{|\Omega ^\textrm{val}|}\sum _{(i,j) \in \Omega ^\textrm{val}} \left( \hat{M}^\mathrm{pre\text {-}processed}_{\lambda _k,i,j}-M^\mathrm{pre\text {-}processed}_{i,j} \right) ^2}, \end{aligned}$$
(2)

then the choice \(\lambda _{k^\circ }\) that minimizes \(RMSE_{\lambda _k}^\textrm{val}\) for \(k=1,\ldots ,100\) is obtained. For every \(\lambda _k\), the RMSEs of matrix reconstruction on the training and test sets (\(RMSE_{\lambda _k}^\textrm{tr}\) and \(RMSE_{\lambda _k}^\textrm{test}\)) are defined similarly, as

$$\begin{aligned} RMSE_{\lambda _k}^\textrm{tr}:=\sqrt{\frac{1}{|\Omega ^\textrm{tr}|}\sum _{(i,j) \in \Omega ^\textrm{tr}} \left( \hat{M}^\mathrm{pre\text {-}processed}_{\lambda _k,i,j}-M^\mathrm{pre\text {-}processed}_{i,j} \right) ^2}, \end{aligned}$$
(3)

and

$$\begin{aligned} RMSE_{\lambda _k}^\textrm{test}:=\sqrt{\frac{1}{|\Omega ^\textrm{test}|}\sum _{(i,j) \in \Omega ^\textrm{test}} \left( \hat{M}^\mathrm{pre\text {-}processed}_{\lambda _k,i,j}-M^\mathrm{pre\text {-}processed}_{i,j} \right) ^2}. \end{aligned}$$
(4)

In the following section, focus is given to their values obtained for \(\lambda =\lambda _{k^\circ }\).

To conclude, it is worth discussing at least shortly some computational aspects. Achieving an optimal selection for the \(\lambda\) parameter in the MC optimization problem (1) can be very expensive from a computational point of view, and several methods were proposed in the specialized literature to accelerate the Soft Impute algorithm, e.g., by approximating its singular value thresholding phase, or by introducing a warm-start phase, which initializes \(\lambda\) near its optimal value: see, e.g., Yao and Kwok [24] and de Araújo et al. [6]. Constrained variations of the optimization problem (1) were also proposed (see, e.g., Duarte et al. [7] for some examples). However, for our analysis, the selection of an optimal \(\lambda\) (one selection for each country/repetition pair) is not particularly time-consuming, due to the small size of the matrices considered (indeed, each matrix is made of 54 rows and 17 columns). Nevertheless, for bigger matrices, observing some stability of the optimal \(\lambda\) parameter with respect to changing repetition or changing the country under investigation can help speeding up substantially the application of MC. A final remark has to do with how to make the results of the analysis reproducible, since the choices of the training, validation, and test sets are random. This aim can be easily fulfilled by using deterministic algorithms for sampling (see, e.g., Gnecco et al. [11]). In the specific case, pseudo-random number generators can be employed to generate such sets in a deterministic way. The results reported in the next section refer, indeed, to the case of pseudo-random number generation.

4 Results

For the reasons outlined in Sect. 3, for each choice of the baseline, the following outcomes are expected, when comparing the three methods (“MC/baseline”, “MC”, “baseline”):

  1. 1.

    “MC/baseline” is expected to have a larger prediction performance than “baseline” alone;

  2. 2.

    “MC/baseline” is expected to have a larger prediction performance than “MC” alone, when “baseline” has good prediction performance;

  3. 3.

    “MC” alone is expected to have a larger prediction performance than “baseline” alone, when “baseline” has little prediction performance.

In the following, results that confirm statistically these expectations are provided.

To begin with the presentation of the outcomes of the analysis, Figs. 1 and 2 illustrate the results obtained in one repetition of the analysis, for a representative country (Spain), taken as case study (similar results are achieved for other repetitions, as detailed in the following). In particular, Fig. 1 shows, for each of the three baselines: the RMSEs on the training, validation and test sets achieved by the combined MC/baseline method described in Sect. 3 (solid colored lines); the RMSEs on the training, validation and test sets achieved by the MC method alone (dash-dotted colored lines); their comparison with the respective RMSEs produced by the baseline alone (dashed horizontal colored lines); the location of the minimum RMSE on the validation set, for the combined MC/baseline method (solid vertical black line) and for the MC method (dash-dotted vertical black line). Moreover, for illustrative purposes, Fig. 2 provides, just for the case of the combined MC/baseline method and the first baseline, a colored visualization of: the elements of the original matrix (to be reconstructed); the positions of its missing elements associated with the specific repetition; the reconstructed matrix obtained in correspondence of the optimal choice of the regularization parameter; the respective element-wise absolute value of the reconstruction error. It is worth observing that, for baseline\(_1\), data related to the first five years are reconstructed exactly by the combined MC/baseline method, since the corresponding columns in the training set used by MC are made exclusively by zeroes, due to the removal of the specific baseline from the matrix to be reconstructed.

Then, Table 1 reports, for all the 20 repetitions and the associated test sets, the RMSE on the test set for the combined MC/baseline method (in correspondence of the optimal choice of the regularization parameter \(\lambda\)) for the representative country and each baseline; the RMSE on the same test set for the MC method (in correspondence of the optimal choice of the regularization parameter \(\lambda\)); the RMSE on the same test set obtained by each baseline.

  1. 1.

    In the case of the representative country, for each baseline, the combined MC/baseline method shows a statistically significant better performance than the baseline alone. Indeed, the application of a one-sided Wilcoxon matched-pairs signed-rank test (used with an analogous goal as in the work Gnecco and Nutarelli [13]) rejects the null hypothesis that the difference \(\Delta ^\textrm{baseline}_\mathrm{MC/baseline}:=RMSE^\textrm{test} \mathrm{(baseline)}-RMSE_{\lambda _{k^\circ }}^\textrm{test} \mathrm{(MC/baseline)}\) has a symmetric distribution around its median and that this median is smaller than or equal to 0 (for each baseline: p-value = \(4.7846 \cdot 10^{-5}\), significance level \(\alpha = 0.05\)). More generally, the same null hypothesis is rejected for: 36 among the 42 countries in the case of baseline\(_1\); 42 among the 42 countries in the case of baseline\(_2\); 41 among the 42 countries in the case of baseline\(_3\).

  2. 2.

    In the case of the representative country, the same statistical test as above is used to test the null hypothesis that the difference \(\Delta ^\textrm{MC}_\mathrm{MC/baseline}: = RMSE_{\lambda _{k^\circ }}^\textrm{test} \mathrm{(MC)}-RMSE_{\lambda _{k^\circ }}^\textrm{test} \mathrm{(MC/baseline)}\) has a symmetric distribution around its median and that this median is smaller than or equal to 0 (the optimal regularization parameter \(\lambda _{k^\circ }\) may be different for the two methods). For baseline\(_1\), the null hypothesis is rejected (p-value = \(4.7846 \cdot 10^{-5}\), significance level \(\alpha = 0.05\)). For baseline\(_2\), the null hypothesis is not rejected (p-value = 0.9868, significance level \(\alpha = 0.05\)). For baseline\(_3\), the null hypothesis is rejected (p-value = \(1.1787 \cdot 10^{-4}\), significance level \(\alpha = 0.05\)). More generally, the same null hypothesis is rejected for: 41 among the 42 countries in the case of baseline\(_1\); 10 among the 42 countries in the case of baseline\(_2\); 39 among the 42 countries in the case of baseline\(_3\).

  3. 3.

    In the case of the representative country, the same statistical test as above is used to test the null hypothesis that the difference \(\Delta ^\textrm{baseline}_\textrm{MC}: = RMSE^\textrm{test} \mathrm{(baseline)}-RMSE_{\lambda _{k^\circ }}^\textrm{test} \mathrm{(MC)}\) has a symmetric distribution around its median and that this median is smaller than or equal to 0. For baseline\(_1\), the null hypothesis is not rejected (p-value = 0.9997, significance level \(\alpha = 0.05\)). For baseline\(_2\), the null hypothesis is rejected (p-value = \(4.7846 \cdot 10^{-5}\), significance level \(\alpha = 0.05\)). For baseline\(_3\), the null hypothesis is rejected (p-value = 0.0191, significance level \(\alpha = 0.05\)). More generally, the same null hypothesis is rejected for: 0 among the 42 countries in the case of baseline\(_1\); 40 among the 42 countries in the case of baseline\(_2\); 18 among the 42 countries in the case of baseline\(_3\).

It is also worth remarking that, in the case of the results reported in Fig. 1 for the case of the representative country, the choice of the optimal regularization parameter \(\lambda _{k^\circ }\) turns out to depend negligibly on the absence/presence/choice of the baseline combined with MC. Similarly, a small dependence of the choice of \(\lambda _{k^\circ }\) is observed with respect to the repetition/the choice of the country analyzed, in the sense that they turn out to be always smaller than 1 (a detailed analysis is not shown, due to space limitations).

Additionally, empirical and standard deviations of the quantities \(\Delta ^\textrm{baseline}_\mathrm{MC/baseline}\), \(\Delta ^\textrm{MC}_\mathrm{MC/baseline}\), and \(\Delta ^\textrm{baseline}_\textrm{MC}\) are reported in Table 2, for the three baselines and for all the countries considered in the analysis, confirming the findings above. Concluding, the numerical results obtained confirm statistically the expectations reported at the beginning of this section.

It is also worth remarking that, as expected according to Sect. 3, for all the countries the average estimate on the test set increased in almost all the 20 repetitions of the analysis, when moving from the original MC method to each combined MC/baseline method. In other words, there was almost always an increase in the bias (estimated) on the test set, i.e., of the following quantity:

$$\begin{aligned} \mathrm{estimated\,\textrm{bias}}_{\lambda _{k^\circ }}^\textrm{test}: = \frac{1}{|\Omega ^\textrm{test}|}\sum _{(i,j) \in \Omega ^\textrm{test}} \left( \hat{M}^\mathrm{pre\text {-}processed}_{\lambda _{k^\circ },i,j}-M^\mathrm{pre\text {-}processed}_{i,j} \right) . \end{aligned}$$
(5)

As expected, for the original MC method, this estimated bias was almost always negative. Again, additional details are not reported, due to space limitations.

Fig. 1
figure 1

Results of the application of one repetition of the combined MC/baseline method, of MC alone, and of the baseline alone, in the case of a representative country (Spain). a Case of baseline\(_1\). b Case of baseline\(_2\). c Case of baseline\(_3\)

Fig. 2
figure 2

Colored visualization of: pre-processed elements of the annual \(\textrm{CO}_2\) emission levels matrix related to a representative country (Spain); positions of the missing elements (in both the validation and test sets) in one repetition of the combined MC/baseline\(_1\) method; reconstructed matrix obtained in correspondence of the optimal choice of the regularization parameter; respective element-wise absolute value of the reconstruction error. The x-axis refers to the year (the year 2000 being associated with the column number 1, the year 2016 with the column number 17), whereas the y-axis refers to the sector. In the last subfigure, errors smaller than \(2^{-10}\) have been replaced by \(2^{-10}\), for a better visual representation

Table 1 RMSEs on the test set for the various methods considered in the article, in the case of a representative country (Spain)
Table 2 Empirical means and standard deviations (over the 20 repetitions) of the quantities \(\Delta ^\textrm{baseline}_\mathrm{MC/baseline}:=RMSE^\textrm{test} \mathrm{(baseline)}-RMSE_{\lambda _{k^\circ }}^\textrm{test}\, \mathrm{(MC/baseline)}\), \(\Delta ^\textrm{MC}_\mathrm{MC/baseline}:=RMSE^\textrm{test}\, \mathrm{(MC)}-RMSE_{\lambda _{k^\circ }}^\textrm{test}\, \mathrm{(MC/baseline)}\), \(\Delta ^\textrm{baseline}_\textrm{MC}:=RMSE^\textrm{test} \mathrm{(baseline)}-RMSE_{\lambda _{k^\circ }}^\textrm{test}\, \mathrm{(MC)}\) for the three baselines and for all the countries considered in the analysis

5 Conclusions

In the work, matrix completion has been combined and compared with suitable baselines for the reconstruction of artificially missing elements of matrices representing annual CO\(_2\) emissions at the sector level, for each country considered in the analysis. Nevertheless, for specific countries (different from the ones in the database: e.g., selected developing countries), the corresponding matrices may have really missing elements, which could be effectively reconstructed by matrix completion. In this application, the only difference with respect to the current analysis is that it would be not possible to evaluate the error on every element of the test set (being a ground truth unavailable for really missing elements).

A second potential extension of our analysis is represented by the construction of counterfactuals (e.g., as investigated by Kumar and Liang [20], which refers to a different application of matrix completion), useful to predict policy effects on annual CO\(_2\) emissions of specific sectors in selected countries. In practice, this would entail obscuring matrix entries affected by a policy (e.g., any national countermeasure aimed to reduce pollution levels related to the economic activity of specific sectors), to predict their corresponding values in the absence of the policy (counterfactual values). This application would require avoiding getting negatively biased estimates of the unobserved matrix elements. Indeed, overestimates may be actually needed to construct valid counterfactuals.

The analysis made in the present work could be also extended as follows. As an additional step, matrix completion could be applied to matrices obtained by combining the information coming from different countries, or by merging the information available on annual trade flows and annual CO\(_2\) emission. Indeed, a significant feature of the database used for our analysis is that its adopted classification of sectors matches the one of the World Input–Output Database (WIOD) tables (https://www.rug.nl/ggdc/valuechain/wiod/wiod-2016-release), whose elements represent annual trade flows from input country-specific sectors to output country-specific industries/final consumption sectors. Finally, other baselines could be also considered for the combination/comparison with matrix completion.