Prediction of annual CO2 emissions at the country and sector levels, based on a matrix completion optimization problem

Biancalani, Francesco; Gnecco, Giorgio; Metulini, Rodolfo; Riccaboni, Massimo

doi:10.1007/s11590-023-02052-2

Prediction of annual CO₂ emissions at the country and sector levels, based on a matrix completion optimization problem

Original Paper
Open access
Published: 26 September 2023

(2023)
Cite this article

Download PDF

You have full access to this open access article

Optimization Letters Aims and scope Submit manuscript

Prediction of annual CO₂ emissions at the country and sector levels, based on a matrix completion optimization problem

Download PDF

Francesco Biancalani¹,
Giorgio Gnecco ORCID: orcid.org/0000-0002-5427-4328¹,
Rodolfo Metulini² &
…
Massimo Riccaboni¹

594 Accesses
1 Citation
Explore all metrics

Abstract

In the recent past, annual CO$_2$ emissions at the international level were examined from various perspectives, motivated by rising concerns about pollution and climate change. Nevertheless, to the best of the authors’ knowledge, the problem of dealing with the potential inaccuracy/missingness of such data at the country and economic sector levels has been overlooked. Thereby, in this article we apply a supervised machine learning technique called Matrix Completion (MC) to predict, for each country in the available database, annual CO$_2$ emissions data at the sector level, based on past data related to all the sectors, and more recent data related to a subset of sectors. The core idea of MC consists in the formulation of a suitable optimization problem, namely the minimization of a proper trade-off between the approximation error over a set of observed elements of a matrix (training set) and a proxy of the rank of the reconstructed matrix, e.g., its nuclear norm. In the article, we apply MC to the imputation of (artificially) missing elements of country-specific matrices whose elements come from annual CO$_2$ emission levels related to different sectors, after proper pre-processing at the sector level. Results highlight typically a better performance of the combination of MC with suitably-constructed baseline estimates with respect to the baselines alone. Potential applications of our analysis arise in the prediction of currently missing elements of matrices of annual CO$_2$ emission levels and in the construction of counterfactuals, useful to estimate the effects of policy changes able to influence the annual CO$_2$ emission levels of specific sectors in selected countries.

Strategies to achieve a carbon neutral society: a review

Article Open access 08 April 2022

Does a Carbon Tax Reduce CO2 Emissions? Evidence from British Columbia

Article Open access 13 April 2022

Air pollution prediction with machine learning: a case study of Indian cities

Article 15 May 2022

1 Introduction

The concern about pollution and climate change is rising every year with media and public opinion is increasingly concerned about global sustainability. Certainly, one reason of concern is the annual emission of a huge amount of carbon dioxide (CO$_2$), largely derived from anthropogenic activities like transportation, heavy industries, and electricity generation from fossil combustibles. For these reasons, at the international level, several countries and supranational organizations are devising strategies to decrease the consumption of hydrocarbons, a notable example being the Paris Agreement in 2016, whose main goal is to reduce the annual CO$_2$ emission levels by at least 55% by 2030 (compared to their values reached in 1990). The relevance of the problem is highlighted by the fact that international agreements on global CO$_2$ emissions reduction involve countries only on a voluntary basis. In other words, there is currently an absence of commitment at an international level to pollution control. This shows how it is difficult to reach a global, enforceable agreement on the reduction of CO$_2$ emission levels (although some theoretical models about the effectiveness of possible commitment policies have been developed in the literature: see, e.g., El Ouardighi et al. [8] and El Ouardighi et al. [9]). The problem is also highly relevant from an economic point of view. In this respect, a key statistic describing climate change impacts of CO$_2$ emissions is the so-called social cost of CO$_2$ (see, e.g., Kikstra et al. [18]), i.e., the projected cost to society of releasing an additional ton of CO$_2$.

Nevertheless, to the best of the authors’ knowledge, the potential inaccuracy/missingness of CO$_2$ emissions data in certain countries was overlooked, at least at the country level. Thereby, taking the hint from the past successful applications of machine learning to environmental sciences (see, e.g., Hsieh [16]), in this article, we contribute to the topic by studying CO$_2$ emissions by applying a Supervised Machine Learning (SML) method to a country-sector level database spanning several years in the recent past. Specifically, we employ a method called Matrix Completion (MC, see Hastie et al. [15]), which was recently popularized, among others, by the 2021 Nobel-prize winner in Economics, Guido W. Imbens [1]. Its core idea consists in the formulation of a suitable optimization problem modeling SML [23], namely the minimization of a proper trade-off between the approximation error over a set of observed elements of a matrix (training set) and a proxy of the rank of the reconstructed matrix, e.g., its nuclear norm (i.e., the summation of all its singular values). A strong advantage of MC with respect to other methods resides in its flexibility, which permits it to be adopted, with appropriate adaptations, in various fields of research. Classical applications of various forms of MC (see, e.g., Candès and Recht [3]) arise, e.g., in collaborative filtering, system identification, and recovery of sensor maps. Two recent successful examples of MC application are represented by the works Metulini et al. [22], where MC is applied for the reconstruction of World Input–Output Database (WIOD) subtables, and Gnecco et al. [14], where MC is used to define a novel index of economic complexity, based on the different degree of predictability of the elements of the Revealed Comparative Advantage (RCA) matrix which are associated with each country. In the environmental context, an application of MC to climate prediction is made by Ghafarianzadeh and Monteleoni [10]. A recent use of MC for the prediction of CO$_2$ emission levels is made by Huang et al. [17], but is limited to 11 urban areas in China. Such an application is justified therein by the presence of locally missing data on CO$_2$ emission levels.

Given the framework above, in this work we propose a specific adjustment (i.e., a proper pre-processing of the available dataset) that may improve MC performance for its specific use with data associated with annual CO$_2$ emission levels on a country and economic sector basis. More in details, we apply MC to country-specific subsets of the available pre-processed database of CO$_2$ emissions, combining its prediction with the ones of suitably-constructed baselines. The statistical significance of the results of the one-sided Wilcoxon matched-pairs signed-rank tests performed highlights that the combination of MC with the baselines has typically a better performance than the baselines themselves. The present article represents a thorough extension of the short conference article by the same authors (Biancalani et al. [2]), in which MC was compared only with a simple baseline (the sector-specific average over all the available years in the training set) and no MC/baseline combination was performed.

The article is structured as follows. Section 2 provides a description of the database used for the successive analysis. Section 3 details the adopted methodology of analysis, which is based on a regularized matrix completion optimization problem. Section 4 reports its results. Finally, Sect. 5 concludes the work, pointing out possible extensions of the analysis.

2 Database description

The database used for our analysis is described by Corsatea et al. [4], and is freely downloadable from the following hyperlink: https://joint-research-centre.ec.europa.eu/document/download/b572c87b-a2fb-4ab6-af38-ff0451273e9e_en?filename=co2em56.zip. It refers to annual CO$_2$ emission levels from 56 economic sectors and from households, for 12 energy commodities. It covers 29 European countries and 13 other major countries in the world, in the period 2000–2016 (one observation for each country, sector, and year). Namely, the 42 countries which are covered in the database are the following ones: AUS (Australia), AUT (Austria), BEL (Belgium), BGR (Bulgaria), BRA (Brazil), CAN (Canada), CHE (Chile), CHN (China), CYP (Cyprus), CZE (Czech Republic), DEU (Germany), DNK (Denmark), ESP (Spain), EST (Estonia), FIN (Finland), FRA (France), GBR (Great Britain), GRC (Greece), HRV (Croatia), HUN (Hungary), IDN (Indonesia), IND (India), IRL (Ireland), ITA (Italy), JPN (Japan), KOR (South Korea), LVA (Latvia), LTU (Lithuania), LUX (Luxembourg), MEX (Mexico), MLT (Malta), NOR (Norway), POL (Poland), PRT (Portugal), ROU (Romania), RUS (Russia), SVK (Slovakia), SVN (Slovenia), SWE (Sweden), TUR (Turkey), TWN (Taiwan), and USA (United States of America).

In our analysis, for each country, data associated with the last three rows of the corresponding matrix of CO$_2$ yearly emission levels are removed. They correspond, respectively, to the emissions associated with 2 specific sectors (coded, respectively, as T: “Activities of households as employers; undifferentiated goods- and services-producing activities of households for own use”, and U: “Activities of extraterritorial organizations and bodies”), for which the annual CO$_2$ emission levels reported in the database are typically 0 or nearly equal to 0; to the emissions associated with final consumption expenditure by households (coded as $FC\_HH$). Indeed, the interest of the present analysis is in the emissions related to production activities. Concluding, a total of 54 sectors is considered to perform our analysis. The resulting yearly CO$_2$ emission levels matrices (one for each country considered) have 54 rows and 17 columns.

3 Methodology

In the article, we exploit the following formulation of the Matrix Completion (MC) optimization problem, which was studied theoretically by Mazumder et al. [21]:

$$\begin{aligned} \underset{{\hat{\textbf{M}}}\in \mathbb {R}^{m \times n}}{\textrm{minimize}} \left( \frac{1}{2} \sum _{(i,j) \in \Omega ^\textrm{tr}} \left( {\hat{M}}_{i,j}-M_{i,j}\right) ^2 + \lambda \Vert {\hat{\textbf{M}}}\Vert _*\right) , \end{aligned}$$

(1)

where $\Omega ^\textrm{tr}$ (which, using a machine-learning expression, can be called training set) is a subset of pairs of indices (i, j) corresponding to positions of known elements of a matrix $\textbf{M} \in {\mathbb {R}}^{m \times n}$, ${\hat{\textbf{M}}}$ is the completed matrix (to be optimized by solving the optimization problem above), $\lambda \ge 0$ is a regularization parameter, and $\Vert \mathbf{\hat{M}}\Vert _*$ is the nuclear norm of the matrix $\mathbf{\hat{M}}$. The problem has a similar structure as the well-known Least Absolute Shrinkage and Selection Operator (LASSO) optimization problem (see, e.g., Kim and Bou [19], for its short presentation). The regularization parameter $\lambda$ controls the trade-off between fitting the known elements of the matrix $\textbf{M}$ and achieving a small nuclear norm of its reconstruction $\mathbf{\hat{M}}$. In this article, the optimization problem (1) is solved numerically by applying an iterative algorithm called Soft Impute, developed by Mazumder et al. [21] – to which we refer for a convergence analysis – and reported in the following as Algorithm 1 (see also the supplementary material of Gnecco et al. [14] for a short discussion of implementation issues about this algorithm). The following notation is used. For a matrix $\textbf{Y} \in {\mathbb {R}}^{m \times n}$, $\textbf{P}_{\Omega ^\textrm{tr}}(\textbf{Y})$ represents the projection of $\textbf{Y}$ onto $\Omega ^\textrm{tr}$, $\textbf{P}_{\Omega ^\textrm{tr}}^{\perp }(\textbf{Y})$ denotes the projection of $\textbf{Y}$ onto the complement of $\Omega ^\textrm{tr}$, whereas $\textbf{S}_\lambda (\textbf{Y}):= \textbf{U} \varvec{\Sigma }_\lambda \textbf{V}^\top ,$ being $\textbf{Y}=\textbf{U} \varvec{\Sigma } \textbf{V}^\top$ (with $\varvec{\Sigma }=\textrm{diag} [\sigma _1,\ldots ,\sigma _R]$) the singular value decomposition of $\textbf{Y}$, and $\varvec{\Sigma }_\lambda :=\textrm{diag} [(\sigma _1-\lambda )_+,\ldots ,(\sigma _R-\lambda )_+]$, with $t_+:=\max (t,0)$.

At each iteration, the Soft Impute algorithm exploits its current solution (called $\mathbf{\hat{M}}^\textrm{old}$) to the MC optimization problem (1) with the aim of estimating the unobserved portion of the matrix $\textbf{M}$. This estimate is used to generate a completed matrix $\mathbf{\hat{M}}$ which, differently from $\mathbf{\hat{M}}^\textrm{old}$, coincides by construction with $\textbf{M}$ on the training set. Finally, a new estimate (called $\mathbf{\hat{M}}^\textrm{new}$) is obtained by computing the singular value decomposition of $\mathbf{\hat{M}}$, reducing by $\lambda$ its singular values larger than $\lambda$, and zeroing all the other singular values. The matrix $\mathbf{\hat{M}}^\textrm{new}$ replaces $\mathbf{\hat{M}}^\textrm{old}$ at the next iteration. The algorithm is initialized with $\mathbf{\hat{M}}^\textrm{old}=\textbf{0}$.

In our application, the tolerance parameter of the Soft Impute algorithm (which refers to the minimum allowable relative change $\frac{\Vert \mathbf{\hat{M}}^\textrm{new}-\mathbf{\hat{M}}^\textrm{old}\Vert _F^2}{\Vert \mathbf{\hat{M}}^\textrm{old}\Vert _F^2}$ of the square of the Frobenius norm of the reconstructed matrix, and forms the termination criterion used by the algorithm) is selected as $tol=10^{-10}$. Additionally, when convergence is not achieved, in order to reduce the computational effort, the algorithm terminates after $N^\textrm{it}=10^5$ iterations. This is motivated by the fact that, in our application, MC has to be performed several times, for different matrices $\textbf{M}$ (one for each country in the database), several training sets $\Omega ^\textrm{tr}$, and various choices of $\lambda$.

Since MC typically achieves better performance when the elements of the matrix to which it is applied have similar orders of magnitude (see, e.g., its successful application considered by Gnecco et al. [12], where the matrix elements are percentages between $0\%$ and $100\%$), for every country, the original matrix of annual CO$_2$ emissions is pre-processed by dividing every row by the $l_1$ norm of that row restricted to the training set (i.e., by the summation of the absolute values of its elements restricted to the training set), then multiplying it by the fraction of observed elements in that row (this pre-processing step is not performed in case a row contains only zeros in the associated training set). The resulting (country-specific) matrix is denoted as $\textbf{M}^\mathrm{pre\text {-}processed}$. It is worth observing that such a pre-processing exploits only the elements of the training set (i.e., the elements of the validation and test sets, which are described later in this section, are not involved in such a step). In other words, this pre-processing aims at making similar the orders of magnitude of the elements belonging to different rows of the pre-processed matrix, and at the same time it does not use any information that one may want to predict later using MC, since this would be unfair.

Then, we consider the three following methods:

1.
In the first case, MC is applied directly to the pre-processed matrix, i.e., one takes $\textbf{M}=\textbf{M}^\mathrm{pre\text {-}processed}$ in Eq. (1). The output $\mathbf{\hat{M}}$ of Algorithm 1 is then taken as an estimate of the pre-processed matrix, i.e., one takes $\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}=\mathbf{\hat{M}}$. In the following, this method is denoted simply as “MC”.
2.
In the second case, first a suitable baseline estimate is generated, which is collected in a matrix $\mathbf{\hat{M}}^\textrm{baseline}$. Then, MC is applied to the difference between the pre-processed matrix and the baseline estimate matrix, i.e., one takes $\textbf{M}=\textbf{M}^\textrm{residual}:=\textbf{M}^\mathrm{pre\text {-}processed}-\mathbf{\hat{M}}^\textrm{baseline}$ in Eq. (1). The output $\mathbf{\hat{M}}$ of Algorithm 1 is then taken as an estimate of the residual of the pre-processed matrix, i.e., one takes $\mathbf{\hat{M}}^\textrm{residual}=\mathbf{\hat{M}}$, or equivalently $\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}=\mathbf{\hat{M}}^\textrm{baseline}+\mathbf{\hat{M}}$. In the following, this method is denoted as “MC/baseline”.
3.
In the third case, only the baseline estimate is used, hence one gets $\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}=\mathbf{\hat{M}}^\textrm{baseline}$. In the following, this method is denoted as “baseline”.

In both cases 1 and 2, an additional post-processing step is included, thresholding to 0 any negative element (when present) of the matrix $\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}$. This step is motivated by the fact that the original matrix of annual CO$_2$ emission levels is non-negative (likewise its pre-processed version). Such a step is not needed for case 3, assuming that the baseline estimates are non-negative (as it occurs for the choices of the baselines detailed in the following). In both cases 1 and 2, for every $\lambda$, the resulting completed and thresholded matrix is denoted as $\mathbf{\hat{M}}^\mathrm{pre\text {-}processed}_{\lambda }$.

In the following, three different baseline estimate matrices are considered, denoted as $\mathbf{\hat{M}}^{\textrm{baseline}_1}$, $\mathbf{\hat{M}}^{\textrm{baseline}_2}$, and $\mathbf{\hat{M}}^{\textrm{baseline}_3}$. They are defined, respectively, as follows:

each element of $\mathbf{\hat{M}}^{\textrm{baseline}_1}$ is generated as the sector-specific (i.e., row-specific) simple moving average (Chiulli [5]) of $\textbf{M}^\mathrm{pre\text {-}processed}$ over the previous 5 years in the training set (respectively, for the first 5 years, the value itself on each element of the training set);
each element of $\mathbf{\hat{M}}^{\textrm{baseline}_2}$ is generated as the year-specific (i.e., column-specific) average of $\textbf{M}^\mathrm{pre\text {-}processed}$ over the training set (by construction, $\mathbf{\hat{M}}^{\textrm{baseline}_2}$ is constant on each column);
$\mathbf{\hat{M}}^{\textrm{baseline}_3}$ is the mean of $\mathbf{\hat{M}}^{\textrm{baseline}_1}$ and $\mathbf{\hat{M}}^{\textrm{baseline}_2}$, i.e., $\mathbf{\hat{M}}^{\textrm{baseline}_3}=\frac{\mathbf{\hat{M}}^{\textrm{baseline}_1}+\mathbf{\hat{M}}^{\textrm{baseline}_2}}{2}$.

The choice of the first baseline is motivated by the fact that a preliminary descriptive analysis highlighted that, for every country, annual CO$_2$ emission levels of every sector change typically quite smoothly from one year to the successive one. The choice of the second baseline is motivated by the fact that some yearly changes across sectors can be still observed, especially between the years 2008 and 2009. However, annual CO$_2$ emission levels among sectors are still quite heterogeneous at the cross-sectional level. For this reason, the prediction capability of the second baseline is expected to be smaller than that of the first baseline. An intermediate case is represented by the third baseline, which combines the first two baselines.

The reason for which one expects that applying MC on the residual of a baseline estimate matrix improves the performance of MC when the latter is used alone – at least when the baseline has good prediction capability – is that, in the MC optimization problem (1), the non-negative regularization term $\lambda \Vert \mathbf{\hat{M}}\Vert _*$ refers to the whole completed matrix $\mathbf{\hat{M}}$. Hence, for large $\lambda$, the elements of $\mathbf{\hat{M}}$ tend to be shrunk towards 0, in a similar way as it occurs in the case of the LASSO optimization problem. In particular, this can cause a negative bias in the estimates when $\textbf{M}$ is a matrix with non-negative elements (moreover, biasedness can be ascribed also to the fact that the Soft Impute algorithm is initialized with a $\textbf{0}$ matrix, and terminates after $N^\textrm{it}$ iterations). The combination of MC with a suitable baseline estimate matrix is expected to make such a bias less negative (and possibly improve the MC performance), because in this case the reconstructed matrix is decomposed into the sum of two matrices, and the regularization term acts not on the whole reconstructed matrix, but only on one of these two matrices. Finally, the combination of MC with a baseline is expected to have better generalization capability than the baseline itself, since in that case, for $\lambda =0$, the optimal solution to the MC optimization problem (1) does not alter the baseline estimate outside the training set, whereas improvements are likely to obtained for other values of $\lambda$.

In the present MC application to (country-specific) matrices associated with CO$_2$ emissions, the union of the validation and test sets refers to positions of matrix elements which are artificially obscured (but are still available as a ground truth), whereas the training set refers to the positions of all the remaining elements of the matrix considered. Specifically, for every country, MC is applied 20 times, every time generating the training set as follows:

$50\%$ of randomly selected rows (sectors) are observed over the whole time period;
the remaining $50\%$ rows are observed over all the years, except for the last three years in the database (namely, 2014, 2015, and 2016).

Then, to avoid overfitting, the regularization parameter $\lambda$ is chosen according to the following validation method. First, the set of positions of unobserved elements of the matrix $\textbf{M}$ is partitioned randomly into a validation set $\Omega ^\textrm{val}$ (which contains about $25\%$ of the positions of the unobserved elements) and a test set $\Omega ^\textrm{test}$ (which refers to the positions of the remaining elements). In order to ease the comparison of the MC results when considering different countries, the random choices of the training, validation, and test sets are the same for every country, in each of the 20 repetitions of the MC application (nevertheless, distinct repetitions turn out to have different realizations of the training, validation, and test sets). It is worth noting that, by construction, the training, validation, and test sets do not overlap.

Finally, the optimization problem (1) is solved for several choices $\lambda _k$ for $\lambda$, exponentially distributed as $\lambda _k=2^{k/2-25}$, for $k=1,\ldots ,100$. For every $\lambda _k$, the Root Mean Square Error (RMSE) of matrix reconstruction on the validation set is computed as

$$\begin{aligned} RMSE_{\lambda _k}^\textrm{val}:=\sqrt{\frac{1}{|\Omega ^\textrm{val}|}\sum _{(i,j) \in \Omega ^\textrm{val}} \left( \hat{M}^\mathrm{pre\text {-}processed}_{\lambda _k,i,j}-M^\mathrm{pre\text {-}processed}_{i,j} \right) ^2}, \end{aligned}$$

(2)

then the choice $\lambda _{k^\circ }$ that minimizes $RMSE_{\lambda _k}^\textrm{val}$ for $k=1,\ldots ,100$ is obtained. For every $\lambda _k$, the RMSEs of matrix reconstruction on the training and test sets ($RMSE_{\lambda _k}^\textrm{tr}$ and $RMSE_{\lambda _k}^\textrm{test}$) are defined similarly, as

$$\begin{aligned} RMSE_{\lambda _k}^\textrm{tr}:=\sqrt{\frac{1}{|\Omega ^\textrm{tr}|}\sum _{(i,j) \in \Omega ^\textrm{tr}} \left( \hat{M}^\mathrm{pre\text {-}processed}_{\lambda _k,i,j}-M^\mathrm{pre\text {-}processed}_{i,j} \right) ^2}, \end{aligned}$$

(3)

and

$$\begin{aligned} RMSE_{\lambda _k}^\textrm{test}:=\sqrt{\frac{1}{|\Omega ^\textrm{test}|}\sum _{(i,j) \in \Omega ^\textrm{test}} \left( \hat{M}^\mathrm{pre\text {-}processed}_{\lambda _k,i,j}-M^\mathrm{pre\text {-}processed}_{i,j} \right) ^2}. \end{aligned}$$

(4)

In the following section, focus is given to their values obtained for $\lambda =\lambda _{k^\circ }$.

To conclude, it is worth discussing at least shortly some computational aspects. Achieving an optimal selection for the $\lambda$ parameter in the MC optimization problem (1) can be very expensive from a computational point of view, and several methods were proposed in the specialized literature to accelerate the Soft Impute algorithm, e.g., by approximating its singular value thresholding phase, or by introducing a warm-start phase, which initializes $\lambda$ near its optimal value: see, e.g., Yao and Kwok [24] and de Araújo et al. [6]. Constrained variations of the optimization problem (1) were also proposed (see, e.g., Duarte et al. [7] for some examples). However, for our analysis, the selection of an optimal $\lambda$ (one selection for each country/repetition pair) is not particularly time-consuming, due to the small size of the matrices considered (indeed, each matrix is made of 54 rows and 17 columns). Nevertheless, for bigger matrices, observing some stability of the optimal $\lambda$ parameter with respect to changing repetition or changing the country under investigation can help speeding up substantially the application of MC. A final remark has to do with how to make the results of the analysis reproducible, since the choices of the training, validation, and test sets are random. This aim can be easily fulfilled by using deterministic algorithms for sampling (see, e.g., Gnecco et al. [11]). In the specific case, pseudo-random number generators can be employed to generate such sets in a deterministic way. The results reported in the next section refer, indeed, to the case of pseudo-random number generation.

4 Results

For the reasons outlined in Sect. 3, for each choice of the baseline, the following outcomes are expected, when comparing the three methods (“MC/baseline”, “MC”, “baseline”):

1.
“MC/baseline” is expected to have a larger prediction performance than “baseline” alone;
2.
“MC/baseline” is expected to have a larger prediction performance than “MC” alone, when “baseline” has good prediction performance;
3.
“MC” alone is expected to have a larger prediction performance than “baseline” alone, when “baseline” has little prediction performance.

In the following, results that confirm statistically these expectations are provided.

To begin with the presentation of the outcomes of the analysis, Figs. 1 and 2 illustrate the results obtained in one repetition of the analysis, for a representative country (Spain), taken as case study (similar results are achieved for other repetitions, as detailed in the following). In particular, Fig. 1 shows, for each of the three baselines: the RMSEs on the training, validation and test sets achieved by the combined MC/baseline method described in Sect. 3 (solid colored lines); the RMSEs on the training, validation and test sets achieved by the MC method alone (dash-dotted colored lines); their comparison with the respective RMSEs produced by the baseline alone (dashed horizontal colored lines); the location of the minimum RMSE on the validation set, for the combined MC/baseline method (solid vertical black line) and for the MC method (dash-dotted vertical black line). Moreover, for illustrative purposes, Fig. 2 provides, just for the case of the combined MC/baseline method and the first baseline, a colored visualization of: the elements of the original matrix (to be reconstructed); the positions of its missing elements associated with the specific repetition; the reconstructed matrix obtained in correspondence of the optimal choice of the regularization parameter; the respective element-wise absolute value of the reconstruction error. It is worth observing that, for baseline$_1$, data related to the first five years are reconstructed exactly by the combined MC/baseline method, since the corresponding columns in the training set used by MC are made exclusively by zeroes, due to the removal of the specific baseline from the matrix to be reconstructed.

Then, Table 1 reports, for all the 20 repetitions and the associated test sets, the RMSE on the test set for the combined MC/baseline method (in correspondence of the optimal choice of the regularization parameter $\lambda$) for the representative country and each baseline; the RMSE on the same test set for the MC method (in correspondence of the optimal choice of the regularization parameter $\lambda$); the RMSE on the same test set obtained by each baseline.

1.
In the case of the representative country, for each baseline, the combined MC/baseline method shows a statistically significant better performance than the baseline alone. Indeed, the application of a one-sided Wilcoxon matched-pairs signed-rank test (used with an analogous goal as in the work Gnecco and Nutarelli [13]) rejects the null hypothesis that the difference $\Delta ^\textrm{baseline}_\mathrm{MC/baseline}:=RMSE^\textrm{test} \mathrm{(baseline)}-RMSE_{\lambda _{k^\circ }}^\textrm{test} \mathrm{(MC/baseline)}$ has a symmetric distribution around its median and that this median is smaller than or equal to 0 (for each baseline: p-value = $4.7846 \cdot 10^{-5}$, significance level $\alpha = 0.05$). More generally, the same null hypothesis is rejected for: 36 among the 42 countries in the case of baseline$_1$; 42 among the 42 countries in the case of baseline$_2$; 41 among the 42 countries in the case of baseline$_3$.
2.
In the case of the representative country, the same statistical test as above is used to test the null hypothesis that the difference $\Delta ^\textrm{MC}_\mathrm{MC/baseline}: = RMSE_{\lambda _{k^\circ }}^\textrm{test} \mathrm{(MC)}-RMSE_{\lambda _{k^\circ }}^\textrm{test} \mathrm{(MC/baseline)}$ has a symmetric distribution around its median and that this median is smaller than or equal to 0 (the optimal regularization parameter $\lambda _{k^\circ }$ may be different for the two methods). For baseline$_1$, the null hypothesis is rejected (p-value = $4.7846 \cdot 10^{-5}$, significance level $\alpha = 0.05$). For baseline$_2$, the null hypothesis is not rejected (p-value = 0.9868, significance level $\alpha = 0.05$). For baseline$_3$, the null hypothesis is rejected (p-value = $1.1787 \cdot 10^{-4}$, significance level $\alpha = 0.05$). More generally, the same null hypothesis is rejected for: 41 among the 42 countries in the case of baseline$_1$; 10 among the 42 countries in the case of baseline$_2$; 39 among the 42 countries in the case of baseline$_3$.
3.
In the case of the representative country, the same statistical test as above is used to test the null hypothesis that the difference $\Delta ^\textrm{baseline}_\textrm{MC}: = RMSE^\textrm{test} \mathrm{(baseline)}-RMSE_{\lambda _{k^\circ }}^\textrm{test} \mathrm{(MC)}$ has a symmetric distribution around its median and that this median is smaller than or equal to 0. For baseline$_1$, the null hypothesis is not rejected (p-value = 0.9997, significance level $\alpha = 0.05$). For baseline$_2$, the null hypothesis is rejected (p-value = $4.7846 \cdot 10^{-5}$, significance level $\alpha = 0.05$). For baseline$_3$, the null hypothesis is rejected (p-value = 0.0191, significance level $\alpha = 0.05$). More generally, the same null hypothesis is rejected for: 0 among the 42 countries in the case of baseline$_1$; 40 among the 42 countries in the case of baseline$_2$; 18 among the 42 countries in the case of baseline$_3$.

It is also worth remarking that, in the case of the results reported in Fig. 1 for the case of the representative country, the choice of the optimal regularization parameter $\lambda _{k^\circ }$ turns out to depend negligibly on the absence/presence/choice of the baseline combined with MC. Similarly, a small dependence of the choice of $\lambda _{k^\circ }$ is observed with respect to the repetition/the choice of the country analyzed, in the sense that they turn out to be always smaller than 1 (a detailed analysis is not shown, due to space limitations).

Additionally, empirical and standard deviations of the quantities $\Delta ^\textrm{baseline}_\mathrm{MC/baseline}$, $\Delta ^\textrm{MC}_\mathrm{MC/baseline}$, and $\Delta ^\textrm{baseline}_\textrm{MC}$ are reported in Table 2, for the three baselines and for all the countries considered in the analysis, confirming the findings above. Concluding, the numerical results obtained confirm statistically the expectations reported at the beginning of this section.

It is also worth remarking that, as expected according to Sect. 3, for all the countries the average estimate on the test set increased in almost all the 20 repetitions of the analysis, when moving from the original MC method to each combined MC/baseline method. In other words, there was almost always an increase in the bias (estimated) on the test set, i.e., of the following quantity:

$$\begin{aligned} \mathrm{estimated\,\textrm{bias}}_{\lambda _{k^\circ }}^\textrm{test}: = \frac{1}{|\Omega ^\textrm{test}|}\sum _{(i,j) \in \Omega ^\textrm{test}} \left( \hat{M}^\mathrm{pre\text {-}processed}_{\lambda _{k^\circ },i,j}-M^\mathrm{pre\text {-}processed}_{i,j} \right) . \end{aligned}$$

(5)

As expected, for the original MC method, this estimated bias was almost always negative. Again, additional details are not reported, due to space limitations.

Table 1 RMSEs on the test set for the various methods considered in the article, in the case of a representative country (Spain)

Full size table

Table 2 Empirical means and standard deviations (over the 20 repetitions) of the quantities $\Delta ^\textrm{baseline}_\mathrm{MC/baseline}:=RMSE^\textrm{test} \mathrm{(baseline)}-RMSE_{\lambda _{k^\circ }}^\textrm{test}\, \mathrm{(MC/baseline)}$, $\Delta ^\textrm{MC}_\mathrm{MC/baseline}:=RMSE^\textrm{test}\, \mathrm{(MC)}-RMSE_{\lambda _{k^\circ }}^\textrm{test}\, \mathrm{(MC/baseline)}$, $\Delta ^\textrm{baseline}_\textrm{MC}:=RMSE^\textrm{test} \mathrm{(baseline)}-RMSE_{\lambda _{k^\circ }}^\textrm{test}\, \mathrm{(MC)}$ for the three baselines and for all the countries considered in the analysis

Full size table

5 Conclusions

In the work, matrix completion has been combined and compared with suitable baselines for the reconstruction of artificially missing elements of matrices representing annual CO$_2$ emissions at the sector level, for each country considered in the analysis. Nevertheless, for specific countries (different from the ones in the database: e.g., selected developing countries), the corresponding matrices may have really missing elements, which could be effectively reconstructed by matrix completion. In this application, the only difference with respect to the current analysis is that it would be not possible to evaluate the error on every element of the test set (being a ground truth unavailable for really missing elements).

A second potential extension of our analysis is represented by the construction of counterfactuals (e.g., as investigated by Kumar and Liang [20], which refers to a different application of matrix completion), useful to predict policy effects on annual CO$_2$ emissions of specific sectors in selected countries. In practice, this would entail obscuring matrix entries affected by a policy (e.g., any national countermeasure aimed to reduce pollution levels related to the economic activity of specific sectors), to predict their corresponding values in the absence of the policy (counterfactual values). This application would require avoiding getting negatively biased estimates of the unobserved matrix elements. Indeed, overestimates may be actually needed to construct valid counterfactuals.

The analysis made in the present work could be also extended as follows. As an additional step, matrix completion could be applied to matrices obtained by combining the information coming from different countries, or by merging the information available on annual trade flows and annual CO$_2$ emission. Indeed, a significant feature of the database used for our analysis is that its adopted classification of sectors matches the one of the World Input–Output Database (WIOD) tables (https://www.rug.nl/ggdc/valuechain/wiod/wiod-2016-release), whose elements represent annual trade flows from input country-specific sectors to output country-specific industries/final consumption sectors. Finally, other baselines could be also considered for the combination/comparison with matrix completion.

Data availability

The datasets analyzed in the current study are available for free at https://joint-research-centre.ec.europa.eu/document/download/b572c87b-a2fb-4ab6-af38-ff0451273e9e_en?filename=co2em56.zip. This hyperlink was accessed in 2023. The original 2019 version of the database, available at https://joint-research-centre.ec.europa.eu/system/files/2019-09/co2em56.zip, contained also data related to the Netherlands (NLD), which were removed in the updated version of the database.

References

Athey, S., Imbens, G.W.: Machine learning methods that economists should know about. Ann. Rev. Econ. 11, 685–725 (2019)
Article Google Scholar
Biancalani, F., Gnecco, G., Metulini, R., Riccaboni, M.: Matrix Completion for the Prediction of Yearly Country and Industry-Level CO$_2$ Emissions. In Proceedings of the $8^{\rm th}$ International Conference on machine Learning, Optimization & Data science (LOD 2022), Lecture Notes in Computer Science, vol. 13810, pp. 14-19, (2023)
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717–772 (2009)
Article MathSciNet MATH Google Scholar
Corsatea, T.D., Lindner, S., Arto, I., Román, M.V., Rueda-Cantuche, J.M., Velázquez Afonso, A., Amores, A.F., Neuwahl, F.: World Input-Output Database Environmental Accounts. Update 2000-2016. Publications Office of the European Union, Luxembourg, (2019) https://doi.org/10.2791/947252
Chiulli, R.M.: Quantitative Analysis: an Introduction. CRC Press, (2020)
de Araújo, T., Gonçalves, D.S., Torezzan, C.: A two-phase rank-based algorithm for low-rank matrix completion. Opt. Lett. (2022). https://doi.org/10.2791/947252
Article MATH Google Scholar
Duarte, L.T., Mussio, A.P., Torezzan, C.: Dealing with missing information in data envelopment analysis by means of low-rank matrix completion. Ann. Oper. Res. 286, 719–732 (2020)
Article MathSciNet MATH Google Scholar
El Ouardighi, F., Kogan, K., Gnecco, G., Sanguineti, M.: Commitment-based equilibrium environmental strategies and time-dependent absorption efficiency. Group Decis. Negot. 27, 235–249 (2018)
Article Google Scholar
El Ouardighi, F., Kogan, K., Gnecco, G., Sanguineti, M.: Transboundary pollution control and environmental absorption efficiency management. Ann. Oper. Res. 287, 653–681 (2020)
Article MathSciNet MATH Google Scholar
Ghafarianzadeh, M., Monteleoni, C.: Climate Prediction Via Matrix Completion. In Proceedings of the 17th AAAI Conference on Late-Breaking Developments in the Field of Artificial Intelligence (AAAIWS’13-17), pp. 35-37, (2013)
Gnecco, G., Sanguineti, M., Gaggero, M.: Suboptimal solutions to team optimization problems with stochastic information structure. SIAM J. Optim. 22, 212–243 (2012)
Article MathSciNet MATH Google Scholar
Gnecco, G., Landi, S., Riccaboni, M.: Can machines learn creativity needs? an approach based on matrix completion. Ital. Econ. J. (2022). https://doi.org/10.1007/s40797-022-00200-8
Article Google Scholar
Gnecco, G., Nutarelli, F.: On the trade-off between number of examples and precision of supervision in machine learning problems. Opt. Lett. 15, 1711–1733 (2019)
Article MathSciNet MATH Google Scholar
Gnecco, G., Nutarelli, F., Riccaboni, M.: A machine learning approach to economic complexity based on matrix completion. Sci. Rep. (2022). https://doi.org/10.1038/s41598-022-13206-0
Article Google Scholar
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: the Lasso and its Generalizations. CRC Press, (2015)
Hsieh, W.W.: Machine Learning Methods in the Environmental Sciences: Neural Networks and Kernels. Cambridge University Press, (2009)
Huang, W., Wei, D., Wang, C., Lin, C.: Matrix completion-based prediction analysis in carbon emissions. Int. J. Embedded Syst. 14, 143–148 (2021)
Article Google Scholar
Kikstra, J.S., Waidelich, P., Rising, J., Yumashev, D., Hope, C., Brierley, C.M.: The social cost of carbon dioxide under climate-economy feedbacks and temperature variability. Environ. Res. Lett. 16(9), 094037 (2021)
Article Google Scholar
Kim, S.H., Boukouvala, F.: Machine learning-based surrogate modeling for data-driven optimization: a comparison of subset selection for regression techniques. Optim. Lett. 14, 989–1010 (2020)
Article MathSciNet MATH Google Scholar
Kumar, A., Liang, C.-H.: Credit constraints and GDP growth: evidence from a natural experiment. Econ. Lett. 181, 190–194 (2019)
Article MATH Google Scholar
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
MathSciNet MATH Google Scholar
Metulini, R., Gnecco, G., Biancalani, F., Riccaboni, M.: Hierarchical clustering and matrix completion for the reconstruction of world input-output tables. AStA Ad. Stat. Anal. (2022). https://doi.org/10.1007/s10182-022-00448-6
Article Google Scholar
Sra, S., Nowozin, S., Wright, S.J.: (Editors), Optimization for Machine Learning. MIT Press, (2012)
Yao, Q., Kwok, J.T.: Accelerated Inexact Soft-Impute for Fast Large-Scale Matrix Completion. In Proceedings of the $24^{\rm th}$ Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), pp. 4002-4008, (2015)

Download references

Acknowledgements

The authors acknowledge partial support from the “Dipartimento di Eccellenza 2023–2027” project at IMT - School for Advanced Studies, Lucca. F. Biancalani, G. Gnecco, and R. Metulini are members of GNAMPA (Gruppo Nazionale per l’Analisi Matematica, la Probabilità e le loro Applicazioni) at INdAM (Istituto Nazionale di Alta Matematica).

Funding

Open access funding provided by Scuola IMT Alti Studi Lucca within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

IMT - School for Advanced Studies, Piazza San Francesco, 19, 55100, Lucca, Italy
Francesco Biancalani, Giorgio Gnecco & Massimo Riccaboni
University of Bergamo, Via Caniana 2, 24127, Bergamo, Italy
Rodolfo Metulini

Authors

Francesco Biancalani
View author publications
You can also search for this author in PubMed Google Scholar
Giorgio Gnecco
View author publications
You can also search for this author in PubMed Google Scholar
Rodolfo Metulini
View author publications
You can also search for this author in PubMed Google Scholar
Massimo Riccaboni
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Giorgio Gnecco.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Biancalani, F., Gnecco, G., Metulini, R. et al. Prediction of annual CO₂ emissions at the country and sector levels, based on a matrix completion optimization problem. Optim Lett (2023). https://doi.org/10.1007/s11590-023-02052-2

Download citation

Received: 21 August 2022
Accepted: 27 July 2023
Published: 26 September 2023
DOI: https://doi.org/10.1007/s11590-023-02052-2

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Prediction of annual CO₂ emissions at the country and sector levels, based on a matrix completion optimization problem

Abstract

Similar content being viewed by others

Strategies to achieve a carbon neutral society: a review

Does a Carbon Tax Reduce CO2 Emissions? Evidence from British Columbia

Air pollution prediction with machine learning: a case study of Indian cities

1 Introduction

2 Database description

3 Methodology

4 Results

5 Conclusions

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Prediction of annual CO2 emissions at the country and sector levels, based on a matrix completion optimization problem

Abstract

Similar content being viewed by others

Strategies to achieve a carbon neutral society: a review

Does a Carbon Tax Reduce CO2 Emissions? Evidence from British Columbia

Air pollution prediction with machine learning: a case study of Indian cities

1 Introduction

2 Database description

3 Methodology

4 Results

5 Conclusions

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation

Prediction of annual CO₂ emissions at the country and sector levels, based on a matrix completion optimization problem