
Classifying rock types by geostatistics and random forests in tandem


Published 17 April 2024. © 2024 The Author(s). Published by IOP Publishing Ltd.
Citation: Parag Jyoti Dutta and Xavier Emery 2024 Mach. Learn.: Sci. Technol. 5 025013. DOI: 10.1088/2632-2153/ad3c0f


Abstract

Rock type classification is crucial for evaluating mineral resources in ore deposits and for rock mechanics. Mineral deposits are formed in a variety of rock bodies and rock types. However, rock type identification in drill core samples is often complicated by overprinting and weathering processes. The usual approach to classifying rock types from drill core data relies on whole-rock geochemical assays as features, and there are few studies on rock type classification from a limited number of metal grades and dry bulk density as features. The novelty in our approach is the introduction of two sets of feature variables (proxies) at sampled data points, generated by geostatistical leave-one-out cross-validation and by kriging that removes the short-scale spatial variation of the measured features. We applied our proposal to a dataset from a porphyry Cu–Au deposit in Mongolia. The model performances on a testing data subset indicate that, when the training dataset is not large, the performance of the classifier (a random forest) substantially improves by incorporating the proxy features as a complement to the original measured features. At each training data point, these proxy features convey information on the underlying spatial correlation structure, scales of variation, sampling design, and values of the features observed at neighboring points, and demonstrate the benefits of combining geostatistics with machine learning.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

An economic mineral deposit, termed an 'ore' deposit, is an unusually high concentration of metallic or non-metallic minerals in a part of the Earth's crust that can be extracted economically or strategically [1]. The identification of the rock types (lithology), the ore-bearing geologic structures, and the geological properties of those rocks is fundamental in defining ore deposits. Exploration geologists are initially concerned with the lithology of the country rocks, and subsequently move toward the identification of the host rocks (which host the ore body) and wall rocks (which surround the host rocks). Geological properties that characterize the different rock types in an ore deposit include the basic physical attributes that can be observed with the naked eye, and attributes pertaining to geochemistry, petrography, mineralogy, alteration, material science and rock mechanics, which can be measured only in the laboratory. The classification of rock types is helpful for many modeling and engineering decisions in downstream stages of a mining project, such as: (1) generating geologic maps and 3D geological models, (2) estimating the probability of occurrence of economic minerals in a particular rock type, (3) estimating the cost of blasting and extracting the raw material to be mined, (4) adjusting drilling parameters for a given rock type to maximize core recovery, (5) estimating recoverable resources, and (6) estimating the rock strength, among others.

In practice, the visual determination of the rock type on the basis of its mineralogy (when distinct), color, grain size and texture, structure, contacts with other rock types, or other visual attributes, is the regular method of identifying and classifying a lithology. However, two rocks may look very much the same physically, but be geochemically different. When the visual procedure fails to exactly determine or confirm the lithology class, chemical assays of the rock samples reveal their exact type. However, chemical assays of rocks involve service costs that have to be planned judiciously.

Porphyry Cu deposits, including porphyry Cu–Au–Mo and porphyry Cu–Au deposits, and Cu–(Mo) breccia pipes are the world's largest source of copper. These are generally large-tonnage, low-grade metal deposits. Ore occurs in bulk-mineable mineralization that is spatially, temporally and genetically associated with intrusive plutons of the granite family, with the metals transported by hydrothermal fluids [2]. The intrusive plutons, host or wall rocks, and other country rocks are commonly subjected to hydrothermal alteration overprinting, resulting in a loss of the original physical signature of the lithology types and thereby posing a difficulty in their visual identification and classification. Alteration caused by weathering and by meteoric waters acidified after percolating through iron sulfides (pyrite, FeS2) in the deposit further adds to the problem of labelling the exact rock type. This is the first dimension of the problem addressed here.

Where chemical assay of the rock sample is the available solution to the identification of the lithology, there lies another problem, related to the cost-benefit paradigm of the mining and metals industry. Why would a mining company invest money in assaying a population of rocks sampled from a lithologic domain if a few initial sample assays indicate a poor metal grade for the domain? This economic logic leads to the problem of missing values of some or all assay variables pertaining to the rock sample at a data point in underground 3D space. This is the second dimension of the problem.

A more pessimistic corollary of the same logic adds a third and more acute dimension: complete absence of information. Why would the company invest at all in additional drill holes for extracting those cylindrical rock core samples from 'that' low-grade lithologic domain? A core loss or poor core recovery during drilling is an additional risk. The complete absence of information leads to 'no data' patches of underground zones among zones of 'data-rich' and 'data-poor' irregularities.

There are two perspectives on handling missing data: avoiding the rows in which some or all variables have missing values, versus filling the missing values by statistically calculated imputations. The former approach is naive, conservative, faithful to the missing values, but implies discarding a data point location along with grade values that are costly to acquire. The latter approach appears reasonable for most investigators but is considered as inappropriate practice by mineral resource reporting systems belonging to CRIRSCO [3]. It is pertinent to note that data imputation in the blank cells of the assay data table means imputing values of the components of a spatial vector at different data point locations, which implies changing the original spatial data configuration from 'partially heterotopic' (i.e. with under-sampled variables) to 'isotopic' (with equally-sampled variables).

One typically characterizes mineralization by sampling less than 0.001% of the deposit. Therefore, the scale of uncertainty combined with high levels of geological variability creates unique risks for the profitable exploitation of mineral resources. The ultimate problem is that, given a sparse population of rock core data at scattered locations, we are required to predict the spatial distribution of the different rock types and metal grades comprising the 3D subsurface of the deposit [4–6]. The traditional approach of classifying the rock type of drill core samples relies on whole-rock geochemical assays, comprising concentrations, and their transformed values, of a series of major oxides, alkaline and alkaline-earth elements, trace elements, and other elements, amounting to as many as 55–60 feature variables. While several machine learning case studies exist on rock type classification from whole-rock geochemistry (e.g. [7, 8]), there are perhaps just a few on rock type classification from a limited number of feature variables consisting essentially of assays of metal grades. This is primarily due to the general unavailability of proprietary drill hole datasets of ore deposits in the public domain for scientific research.

The prime novelty of our study is the ingenious introduction of two complementary sets of feature variables ('proxies') at sampled data points, by means of geostatistical techniques: (1) leave-one-out cross-validation (LOOCV), and (2) filtered kriging for removing the short-scale spatial variation ('noise') of laboratory-measured values. Relying on the different sets of feature variables (the original measured set and the two sets of proxies), we tested a machine learning technique, namely Random Forests (RF) [9], in tandem with geostatistics for classifying a total of 13 rock types occurring within the ore deposit.

Our principal scientific questions in the current research investigation are:

  • 1.  
    Is it possible to predict (classify) the different rock types with reliable accuracy from a given set of assayed grade values at sampled data locations? Our intuition is that this could be possible with a machine learning classifier, provided the classification algorithm can recognize a decisive pattern of numerical features (metal grades) associated with each individual rock type. The input 'features' are assay values of geochemical variables and bulk density (BD) measured in a drill hole dataset in compliance with industry quality assurance and quality control standards. We will assess the accuracy of predictions with standard model performance metrics for classification.
  • 2.  
    Could the different rock types be predicted with more accuracy if we add complementary sets of numerical features (proxy grade values generated by geostatistical techniques) to the original assayed grade values? In this case, our intuition is that more accurate prediction is plausible because a more distinctive pattern of features is likely to be generated for each individual rock type (thus rendering the classifier more efficient at recognizing the pattern). The underlying merit of the complementary sets of grade values is that they are spatially correlated with the original assayed grade values.
  • 3.  
    What could be the prediction performances of the classifier if we were to substitute the assayed grade values at the sampled data locations and the missing values by the complementary sets of features (proxy grade values) generated by geostatistical techniques?

While trying to answer the above questions, we will analyze two cases, depending on whether the feature variables are equally sampled or not; in the latter case, the missing data will be filled with the median value per lithology class for each corresponding input feature variable to apply the traditional classification approaches.

The paper is outlined as follows: section 2 provides a geological description of the application case study under consideration. The dataset of this case study is introduced in section 3, together with the proposed tools and methodology. Section 4 presents and discusses the results of the classification problem. Section 5 concludes the paper, while detailed statistics on the classification results are provided in Appendices.

2. Geologic setting of the deposit

Porphyry deposits are regionally associated with convergent plate boundaries, along island arcs and continental margins where subduction of oceanic crust takes place beneath continental crust. The mineralization process is characterized by intrusions of large volumes of granitic magma accompanied by the availability of metal-rich hydrothermal fluids, with the sites of ore localization controlled by structural features, rock types, and hydrothermal alteration styles. Copper and gold are typically the main economic minerals in these deposits, with molybdenum and silver being common additional commodities. A good summary of porphyry deposits, their types of metal associations, host rocks, and alteration can be found in [10]. The details of the ore-forming process can be found in [2, 11].

The Central Asian Orogenic Belt extends for 5000 km from the Urals in the west to the Pacific coast in the east. Within this belt, a chain of porphyry mineralization occurs in Uzbekistan, Kazakhstan, Russia, China, and Mongolia, ranging in age from the late Neoproterozoic to the Jurassic. The most prolific period of ore formation, which produced the largest deposits, was the Late Devonian to Early Carboniferous [12].

In Mongolia, porphyry mineralization occurs in two places: Erdenet and Oyu Tolgoi. Erdenet is located in Central Mongolia, whereas Oyu Tolgoi is located within the South Gobi Desert, approximately 650 km due south of the capital, Ulaanbaatar. The latter constitutes a cluster of seven porphyry Cu–Au–Mo–(+Ag) deposits, the largest high-grade group of Paleozoic porphyry deposits known in the world. The rock sequence surrounding this cluster of deposits is predominantly composed of basaltic to intermediate volcanic rocks of Devonian age, overlain by layered dacitic pyroclastic and sedimentary rocks intruded by basaltic and dolerite sills. The southern deposits (Southwest, Heruga North, and Heruga) are characterized by high Au (g/t)/ Cu (%) ratios (0.8-3:1), and hosted by biotite-magnetite-altered augite basaltic rocks. In contrast, the Hugo Dummett (North and South) deposits are characterized by lower Au (g/t)/ Cu (%) ratios (0.1-1:1), and hosted mainly by volcanic and plutonic rocks with extensive sericitic and advanced argillic hydrothermal alteration. The detailed geology and mineral exploration history in the Oyu Tolgoi district can be found in [13, 14].

The deposit concerned in this study is the Hugo Dummett South deposit (figure 1). It is a polymetallic porphyry with economic grades of Cu, Au, Mo and Ag and potentially deleterious grades of As, F, S and Fe. It is hosted by an east-dipping basalt rock sequence intruded by porphyritic quartz-monzodiorite plutons. The deposit is both disrupted and bounded by fault displacements in four orientations. The mineralization occurs in an anastomosing network of stockwork veinlets of sulfides ± quartz, with Cu grades typically greater than 2%. There are potentially multiple overlapping mineralizing phases. Hydrothermal alteration exhibits a strong lithologic control, with a number of assemblages including advanced argillic, muscovite/sericite, and intermediate argillic styles with minor K-silicate alteration. The stockwork is mainly localized within the Late Devonian quartz-monzodiorite intrusive; however, it also extends into the basaltic wall rock units.

Figure 1. Simplified geological map of the Hugo Dummett South porphyry deposit considered in this study and its surrounding area.

3. Materials and methods

3.1. Data description

In the present study, we have considered an exploration dataset comprising geological, geochemical, and survey information from 360 drill holes representing 178 km of drilling. The dataset includes separate spreadsheets: assay values of thirteen chemical elements (Cu, Au, Ag, Mo, As, Fe, S, C, F, Pb, Zn, Cd, Hg), dry BD values, lithology classes, alteration types, mineral percentages, and survey data in the form of collar and down-the-hole survey files. In the original assay table, core sample lengths vary from 0.1 m to 1908.8 m (excluding the core loss intervals). Sample lengths in the range of 1 m – 3 m comprise 98.4% of the total samples. As per standard practice, samples were optimally composited to a mean length of 5.0 m without any left-overs, in lieu of fixed-length compositing. Twelve large samples were decomposed into smaller composites, given that samples with lengths greater than 5.0 m have negligible grade values for all thirteen assay variables. Basic statistics on the target and feature variables are provided in tables 1 and 2, respectively. An interpretive lithological model (3D) with the trajectories of drill holes is shown in figure 2.

Figure 2. Interpretive lithological model (3D) used for mineral resource estimation with the drill hole intercepts superimposed. Coordinate values are omitted for confidentiality reasons.

Table 1. Statistics on target variable (lithology) in composited data table.

Serial number | Category | Code | Age | Number of records
1 | Carboniferous andesite dyke | AND | Carboniferous | 1941
2 | Basalt dykes | BAD | Carboniferous | 4683
3 | Biotite granodiorite dykes | BIGD | Late Devonian and older | 7394
4 | Cretaceous clay | CLAY | Cretaceous | 1352
5 | Faults | FLT | Late Devonian and older | 1810
6 | Hydrothermal breccia | HBX | Late Devonian and older | 576
7 | Hanging wall sequence | HWS | Late Devonian and older | 16 028
8 | Ignimbrite | IGN | Late Devonian and older | 9181
9 | Carboniferous intrusive | INTR | Carboniferous | 160
10 | Quaternary cover | QCO | Quaternary | 13
11 | Quartz monzodiorite | QMD | Late Devonian and older | 14 864
12 | Carboniferous rhyolite dykes | RHY | Carboniferous | 2500
13 | Augite basalt | VA | Late Devonian and older | 6821
Total | | | | 67 323

Table 2. Statistics on assay variables in composited data table (67 323 composites).

Variable | Number of records | Number of missing values | Minimum | Maximum | Mean | Standard deviation
Cu (%) | 37 465 | 29 858 | 0.00 | 16.68 | 0.6049 | 0.7540
Au (ppm) | 37 295 | 30 028 | 0.00 | 13.61 | 0.0858 | 0.2264
Ag (ppm) | 16 904 | 50 419 | 0.00 | 32.17 | 1.2772 | 1.7289
Mo (ppm) | 36 915 | 30 408 | 0.13 | 3007 | 44.32 | 65.08
As (ppm) | 36 010 | 31 313 | 0.50 | 8416 | 134.09 | 301.87
Fe (%) | 8299 | 59 024 | 0.26 | 16 | 4.3505 | 1.7195
S (%) | 8320 | 59 003 | 0.00 | 18 | 1.9589 | 2.3614
C (%) | 1362 | 65 961 | 0.01 | 14 | 0.4378 | 0.7892
F (ppm) | 3994 | 63 329 | 12.8 | 37 000 | 1849.4 | 1907.7
Pb (ppm) | 8299 | 59 024 | 1.91 | 3612 | 58.83 | 105.45
Zn (ppm) | 8299 | 59 024 | 1.00 | 6936 | 223.42 | 360.63
Cd (ppm) | 8299 | 59 024 | 0.00 | 30.43 | 0.6019 | 1.3849
Hg (ppm) | 950 | 66 373 | 0.00 | 1.28 | 0.0953 | 0.1603
BD (t/m3) | 10 831 | 56 492 | 2.30 | 2.89 | 2.7542 | 0.0433

3.2. Methods

3.2.1. Decision trees

A decision tree is a set of questions organized in a hierarchical manner and represented graphically as a tree, with a data structure made of a collection of nodes and edges, and devoid of loops. Nodes are divided into internal (or split) nodes and terminal (or leaf) nodes. Apart from the root, all nodes have exactly one incoming edge. For a given input object, a decision tree predicts an unknown property of the object by asking successive questions about its known properties. Depending on the answer to a binary test, the data are sent to the right child node or to the left child node. This process is repeated until the data point reaches a leaf node. The leaf nodes contain a classifier or regressor that associates an output (a class label or a continuous value) with the input, and the decision is then made based on the output at the leaf node. The more questions asked (the more branches traversed), the higher the confidence in the response.

3.2.2. RFs

RFs [9] is an ensemble learning method that combines multiple decision trees to make predictions. It operates by constructing a collection (or forest) of decision trees, where each tree is trained on a subset of the training data. Randomness is introduced at two levels:

  • 1.  
    Bagging or bootstrap aggregating involves sampling the training data with replacement to create multiple bootstrap samples. Each decision tree is trained on a different bootstrap sample. This duplicates some samples and leaves out others. On average, approximately 63.2% of the cases are included in each training subset, whereas the remaining 'out-of-bag' (OOB) samples (approximately 36.8%) are used for validation.
  • 2.  
    Random feature selection: The second form of randomness involves selecting a random subset of features at each split node during tree construction. The number of variables in this subset is predefined and consistent across the forest. This feature subsampling helps to decorrelate the trees and to reduce overfitting. At every node, the randomly selected variables are then ranked by their ability to produce a split threshold that maximizes the homogeneity of child nodes relative to the parent node.

The class assigned by RF to each sample is determined by a majority vote over the prediction outputs of all classification trees. The reader is referred to [9, 15, 16] for the theory, mathematical rigor and consistency proofs of the algorithm, parameter selection, resampling mechanisms, variable importance measures, and the theoretical and methodological progress made since its initial development. Although RF has been extremely successful as a generic classification and regression method, it is routinely used for classification problems (e.g. [17–19]).
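
As a concrete illustration of these mechanics, the following minimal R sketch (not the authors' modeling script) fits an RF classifier with the randomForest package on R's built-in iris data; ntree is the number of bootstrapped trees, mtry the number of features tried at each split, and the printed object reports the OOB error estimate and the majority-vote confusion matrix.

# Minimal RF sketch in R (randomForest package) on the built-in iris data, shown
# only to illustrate bagging, mtry and the OOB error estimate; it is not the
# workflow applied to the (proprietary) drill hole dataset.
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,        # number of bootstrapped trees
                   mtry  = 2,          # features tried at each split node
                   importance = TRUE)  # store variable importance measures
print(rf)                              # OOB error rate and confusion matrix
head(predict(rf, iris))                # class labels obtained by majority vote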

3.2.3. Kriging

Kriging is a geostatistical predictor that applies to a regionalized variable, i.e. a quantitative variable indexed by spatial coordinates. It relies on the modeling of the spatial correlation structure of this variable, via mathematical tools such as the (auto)covariance function or the variogram that quantify the similarity or the dissimilarity between paired observations of the variable, as a function of their spatial separation. Unlike machine learning techniques (e.g. decision trees or RFs) that are used to predict one variable (target) from another (feature), kriging predicts the same variable as the input, but at different spatial locations, with the unbiased prediction leading to the smallest mean-square error, i.e. the error with zero mean and the smallest variance. The reader is referred to [20] for details on the representation of regionalized variables by spatial random fields, inference and modeling of covariance functions or variograms, as well as on the mathematical background and the most common variants of kriging.
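
For reference, and consistent with the description above, the ordinary kriging predictor of a regionalized variable $Z$ at a location $\mathbf{x}_0$ is a weighted average of the $n$ neighboring observations:

$$Z^{*}(\mathbf{x}_0) = \sum_{i=1}^{n} \lambda_i \, Z(\mathbf{x}_i), \qquad \sum_{i=1}^{n} \lambda_i = 1,$$

where the weights $\lambda_i$ minimize the variance of the prediction error $Z^{*}(\mathbf{x}_0) - Z(\mathbf{x}_0)$ under the unbiasedness constraint, and are obtained by solving a linear system written in terms of the modeled covariance function or variogram [20].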

In the following section, two kriging-based approaches will be used to complement the original feature variables inputted in machine learning algorithms. The motivation is to account not only for the feature variables observed at a sampling location when predicting a target variable at this particular location, but also for the information brought by the feature variables observed at surrounding locations. These are:

  • 1.  
    LOOCV. The value of a feature variable at a given location is predicted by kriging, using as input the values of the same feature variable at surrounding locations and a modeled covariance function or variogram. The cross-validated prediction therefore includes the information of the neighboring data, but excludes the information of the data itself.
  • 2.  
    Kriging with filtering. This variant applies when the modeled covariance function or variogram of the feature variable is decomposed into two components; typically, the first component is the nugget effect variance, which stands for the discontinuity of the covariance function or variogram at the origin, while the second component is the continuous part of the covariance function or variogram. In such a situation, the feature variable can also be interpreted as the sum of two components, associated with the components of its covariance function or variogram. In particular, the nugget effect component represents the short-scale variability and noise associated with measurement errors [21], while the continuous component represents the spatial variations at larger scales. Kriging with filtering [22, chapter 15] allows predicting the continuous component of a feature variable (viewed as the 'error-free' variable) at a given location, using as input the values of the observed variable (including the nugget effect component) at this location and surrounding locations, as well as the modeled covariance function or variogram of the variable. Unlike LOOCV, here the prediction not only includes the information of the neighboring data, but also the information of the data itself. A compact formulation of this decomposition is given after this list.
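
The decomposition underlying the second approach can be written compactly. Assuming the feature variable $Z$ is the sum of a spatially continuous signal $Y$ and a noise term $\varepsilon$ of variance $\sigma^2_{\varepsilon}$ (the nugget effect), one has

$$Z(\mathbf{x}) = Y(\mathbf{x}) + \varepsilon(\mathbf{x}), \qquad \gamma_Z(\mathbf{h}) = \gamma_Y(\mathbf{h}) + \sigma^2_{\varepsilon}\, \mathbb{1}_{\{\mathbf{h} \neq \mathbf{0}\}}.$$

Kriging with filtering predicts the signal $Y(\mathbf{x}_0)$ rather than $Z(\mathbf{x}_0)$: the predictor keeps the linear form $\sum_i \lambda_i Z(\mathbf{x}_i)$, but the weights are computed with the continuous component of the covariance function or variogram on the right-hand side of the kriging system, so that the nugget variance is filtered out. LOOCV, in contrast, uses the full covariance function or variogram but removes the observation at $\mathbf{x}_0$ from the kriging neighborhood.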

3.3. Proposed methodology

The proposed methodology is illustrated in figure 3 and detailed in the next subsections. The objectives of this methodology are the following:

  • 1.  
    To test whether it is possible to predict different classes of intrusive lithology ('target variable') with assay variables and BD as input 'feature variables'. In doing so, our emphasis is not to test multiple machine learning algorithms to evaluate their individual predictive powers, but primarily to find out whether a limited number of assayed metal grades exhibits extractable patterns of variation that can be correlated with different rock types via supervised learning.
  • 2.  
    To compare classification performances by the traditional approach versus our ingenious approach of creating proxy variables as input features, as explained below. In the case where feature variables are not equally sampled, the traditional approach relies on imputing missing values. Although we had several statistical options of imputation, we opted for a geochemically sound procedure, i.e. the missing values were filled with the median value per rock type for each corresponding measured variable, to apply the traditional classification approach. The principal objective was to test whether the proxy variables created by us are: (a) indispensable, or (b) somewhat useful, or (c) useless, for the purpose of the said classification.

Figure 3. Schematic workflow of the proposed methodology, combining geostatistical analyses (yellow panel) and machine learning (green panels) for enhanced classification in a spatial context.

3.3.1. Geostatistical modeling of feature variables

The first step of the methodology is the calculation of experimental variograms to identify the spatial correlation structure of the feature variables (chemical element assays and BD), which encompasses their spatial variability at both short and large scales and can depend on the spatial direction. To avoid over-simplifications and to get a realistic geostatistical model, we considered two directions for variogram calculation: the horizontal plane (omni-horizontal calculations) and the vertical direction (table 3).

Table 3. Parameters for experimental variograms.

Parameter | Horizontal direction | Vertical direction
Azimuth (°) | 0 | 0
Azimuth tolerance (°) | 90 | 90
Dip (°) | 0 | 90
Dip tolerance (°) | 20 | 20
Lag (m) | 15.0 | 10.0
Number of lags | 20 | 20
Lag tolerance (m) | 7.5 | 5.0

The fitting of theoretical variogram models considers nested structures of nugget, spherical and exponential types. The fit is performed with a least-squares procedure (table 4 and figure 4).

Figure 4. Examples of variograms for chemical assays (Cu, Ag, As, Fe and S) and bulk density (BD). Crosses represent experimental variograms and solid lines represent fitted models, along the horizontal (black) and vertical (blue) directions.

Table 4. Parameters for theoretical variogram models.

Feature variable | Structure | Horizontal range (m) | Vertical range (m) | Partial sill
Au | nugget | 0 | 0 | 0.0049
 | spherical | 20 | 150 | 0.0184
 | spherical | 200 | 200 000 | 0.0138
Cu | exponential | 20 | 150 | 0.3383
 | exponential | 200 | 250 | 0.1793
 | exponential | 200 | 20 000 | 0.0429
Ag | nugget | 0 | 0 | 0.2035
 | exponential | 10 | 20 | 0.3861
 | spherical | 40 | 130 | 1.4489
As | exponential | 10 | 20 | 19 505.4
 | spherical | 25 | 160 | 84 222.4
C | exponential | 70 | 10 | 0.1232
 | exponential | 150 | 20 | 0.1075
 | spherical | 100 000 | 200 | 0.1435
F | nugget | 0 | 0 | 51 230.1
 | exponential | 150 | 70 | 715 967.7
Fe | nugget | 0 | 0 | 0.5656
 | exponential | 40 | 40 | 1.5007
 | spherical | 100 | 100 | 0.7857
 | spherical | 100 | 100 000 | 0.0238
Mo | nugget | 0 | 0 | 1982.3
 | exponential | 150 | 10 | 1250.7
 | exponential | 250 | 150 | 982.2
S | nugget | 0 | 0 | 0.8027
 | exponential | 70 | 70 | 0.2095
 | exponential | 200 | 70 | 3.6043
Pb | exponential | 40 | 40 | 4539.1
 | spherical | 100 | 100 | 2627.6
 | spherical | 100 | 100 000 | 5384.0
Zn | exponential | 50 | 20 | 37 666.9
 | exponential | 130 | 130 | 83 085.7
Cd | exponential | 50 | 20 | 1.0198
 | exponential | 130 | 130 | 0.6007
Hg | nugget | 0 | 0 | 0.0152
 | exponential | 40 | 20 | 0.0107
 | spherical | 100 000 | 100 | 0.0086
BD | exponential | 20 | 250 | 0.00083
 | exponential | 20 | 20 000 | 0.00013
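
As an illustration of this variogram-modeling step, the sketch below computes the two directional experimental variograms of table 3 and fits a nested model of the kind listed in table 4. It assumes the gstat and sp R packages and a hypothetical composited data frame assays with coordinate columns X, Y, Z and a Cu column; the paper does not state which geostatistical software was actually used, and the parameters of table 4 were obtained with a separate least-squares fitting routine.

# Hedged sketch (gstat and sp assumed; 'assays' is a hypothetical data.frame of composites)
library(sp)
library(gstat)
coordinates(assays) <- ~ X + Y + Z      # promote to a 3D SpatialPointsDataFrame

# omni-horizontal variogram: azimuth 0 with 90 deg tolerance, dip 0 with 20 deg
# tolerance, 20 lags of 15 m (table 3)
vg_h <- variogram(Cu ~ 1, assays, alpha = 0, tol.hor = 90, beta = 0, tol.ver = 20,
                  width = 15, cutoff = 20 * 15)

# vertical variogram: dip 90 deg with 20 deg tolerance, 20 lags of 10 m (table 3)
vg_v <- variogram(Cu ~ 1, assays, alpha = 0, beta = 90, tol.ver = 20,
                  width = 10, cutoff = 20 * 10)

# nested model: two exponential structures with starting values loosely taken from
# the horizontal entries of table 4 for Cu; anisotropic (horizontal vs vertical)
# ranges would be specified through vgm's 'anis' argument
init  <- vgm(0.34, "Exp", 20, add.to = vgm(0.18, "Exp", 200))
fit_h <- fit.variogram(vg_h, init)      # weighted least-squares fit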

3.3.2. Geostatistical prediction by LOOCV and kriging with filtering

For both types of prediction, the feature variables were kriged separately. In each case, ordinary kriging was used, with a moving neighborhood implementation. The shape of the neighborhood was ellipsoidal, and its anisotropy was chosen as a function of the anisotropy of the variogram model (e.g. the neighborhood was more elongated in case of a strong anisotropy, or closer to a sphere in case of isotropy) (table 5). To avoid getting predictions that are locally very high, a top-cut value was applied to feature variables with long-tailed histograms. Finally, to avoid inconsistent results, negative predictions were post-processed and set to zero.

Table 5. Parameters for kriging.

Feature variable | Horizontal search radius (m) | Vertical search radius (m) | Optimal number of data | Top-cut value
Au | 1500 | 2500 | 40 | 2.0 ppm
Cu | 1500 | 2500 | 40 | 8.0 pct
Ag | 1500 | 2500 | 40 | 20 ppm
As | 1500 | 2500 | 40 | 5000 ppm
C | 8000 | 4000 | 40 | 10 pct
F | 2500 | 1500 | 40 | 10 000 ppm
Fe | 8000 | 7000 | 40 | 
Mo | 2500 | 1500 | 40 | 500 ppm
S | 2500 | 1500 | 40 | 
Pb | 8000 | 10 000 | 40 | 1500 ppm
Zn | 1500 | 1200 | 40 | 4000 ppm
Cd | 1500 | 1200 | 40 | 15 ppm
Hg | 8000 | 4000 | 40 | 
BD | 1500 | 2500 | 40 | 
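
A hedged sketch of this prediction step is given below, again assuming gstat and the hypothetical assays object of the previous sketch (the authors do not state their kriging software). It applies the top-cut of table 5 to the input values, obtains the LOOCV proxy with krige.cv, and approximates the filtered kriging proxy by declaring the nugget as a measurement-error component, which gstat does not reproduce when predicting back at the data locations; negative predictions are then reset to zero.

# Hedged sketch of the two proxies for one variable (Ag), assuming gstat/sp and the
# hypothetical 'assays' object of the previous sketch; model sills and ranges loosely
# follow the horizontal entries of table 4 for Ag.
library(sp)
library(gstat)

assays$Ag <- pmin(assays$Ag, 20)    # top-cut for Ag (20 ppm, table 5) before kriging

# fitted model with a nugget ("Nug") component
mod_Ag <- vgm(0.20, "Nug", 0,
              add.to = vgm(0.39, "Exp", 10, add.to = vgm(1.45, "Sph", 40)))

# (1) LOOCV proxy: by default krige.cv leaves each observation out in turn
# (only the optimal number of neighbors of table 5 is mimicked here; the
# anisotropic search ellipsoid is omitted for simplicity)
cv <- krige.cv(Ag ~ 1, assays, model = mod_Ag, nmax = 40)
Ag_loocv <- pmax(cv$var1.pred, 0)   # negative predictions reset to zero

# (2) Filtered-kriging proxy: declare the nugget as measurement error ("Err") so
# that it is filtered when predicting at the data locations themselves
mod_Ag_err <- vgm(0.20, "Err", 0,
                  add.to = vgm(0.39, "Exp", 10, add.to = vgm(1.45, "Sph", 40)))
fk <- krige(Ag ~ 1, assays, newdata = assays, model = mod_Ag_err, nmax = 40)
Ag_filtered <- pmax(fk$var1.pred, 0)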

3.3.3. Feature engineering

To compare the model performances obtained with the traditional machine learning approaches versus our approach, we created an extended dataset comprising three sets of feature variables. The first set consists of the assay variables of the original dataset, i.e. Cu, Au, Ag, Mo, As, Fe, S, C, F, Pb, Zn, Cd, and Hg, plus BD. The second set consists of leave-one-out cross-validated ('LOOCV') values, and the third set consists of filtered kriging values (obtained after removing the nugget variance from the variogram) for each of the above feature variables. Only the first set contains missing values. In the third set, if the variable has no nugget, then the filtered value is equal to the original value; if the value of a variable is missing, then the filtered value is its kriging prediction from the surrounding values and matches the LOOCV value.
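
A minimal sketch of how this extended table can be assembled is shown below, with hypothetical data frames assay_df, loocv_df and filtered_df holding the 14 measured variables and their two proxies, and a hypothetical factor lith holding the lithology codes; the suffixes 1, 2 and 3 mirror the naming used in appendix B (e.g. Cu1, Cu2, Cu3).

# Hedged sketch: bind the three feature sets into one extended table
vars <- c("Cu", "Au", "Ag", "Mo", "As", "Fe", "S", "C", "F",
          "Pb", "Zn", "Cd", "Hg", "BD")
extended <- data.frame(
  setNames(assay_df[vars],    paste0(vars, "1")),  # measured assays (may contain NAs)
  setNames(loocv_df[vars],    paste0(vars, "2")),  # LOOCV proxies
  setNames(filtered_df[vars], paste0(vars, "3")),  # filtered kriging proxies
  LITHO = lith                                     # target variable (lithology codes)
)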

3.3.4. RF prediction (classification) of lithology classes

The following steps were performed sequentially.

Classification based on the dataset with imputed values (Models M1, M2 and M3)

  • 1.  
    The missing values of the first set of feature variables: assayed grade variables, viz., Cu, Au, Ag, Mo, As, Fe, S, C, F, Pb, Zn, Cd, and Hg, plus BD, were filled with the median value corresponding to each lithology class.
  • 2.  
    Our first task was to test the prediction of the lithology types with the first set of feature variables (14). The next was to test the same with the first, second, and third sets combined ($14 \times 3 = 42$ feature variables), and the last was to test the same with only the second and third sets combined ($14 \times 2 = 28$ feature variables). Statistics on the target variable (lithology) in the composited dataset are given in table 1. The dataset has 67 323 observation points, including imputed values for the first set of feature variables (14). We split the dataset into training and testing subsets in a 70:30 ratio. Since lithology has been sampled in an unbalanced way, the split was performed through stratified random sampling, with prior probabilities chosen to match the total sample frequencies (table 6; a brief sketch of this split-and-train step follows the table). The training rows are used for model training and tuning, and the testing rows for model evaluation. Optimizing RF involves tuning parameters controlling the structure of each individual tree, the structure and size of the forest, as well as its randomness. The methodology and software that we adopted for tuning the hyperparameters of the RF classification models are described later in this section.

Table 6. Multi-class response information in the training and testing subsets.

Lithology class | Prior | Training count | Training % | Testing count | Testing %
AND | 0.02 | 1359 | 2.02 | 582 | 0.87
BAD | 0.05 | 3279 | 4.87 | 1404 | 2.09
BIGD | 0.08 | 5176 | 7.69 | 2218 | 3.30
CLAY | 0.01 | 947 | 1.41 | 405 | 0.60
FLT | 0.02 | 1267 | 1.88 | 543 | 0.81
HBX | 0.01 | 404 | 0.60 | 172 | 0.26
HWS | 0.17 | 11 220 | 16.67 | 4808 | 7.14
IGN | 0.09 | 6427 | 9.55 | 2754 | 4.09
INTR | 0 | 112 | 0.17 | 48 | 0.07
QCO | 0 | 10 | 0.02 | 3 | 0.00
QMD | 0.15 | 10 405 | 15.46 | 4459 | 6.62
RHY | 0.03 | 1750 | 2.60 | 750 | 1.12
VA | 0.07 | 4775 | 7.09 | 2046 | 3.04
All | 0.70 | 47 131 | 70.01 | 20 192 | 29.99
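
The stratified 70:30 split and the subsequent RF fit referenced above can be sketched as follows, assuming the caret and randomForest packages and the hypothetical extended table of the previous sketch; the hyperparameter values shown are those reported for model M2 in table 10, which were obtained by the tuning procedure of section 3.3.5.

# Hedged sketch of the stratified 70:30 split and RF training
library(caret)
library(randomForest)
set.seed(123)

idx   <- createDataPartition(extended$LITHO, p = 0.70, list = FALSE)  # stratified split
train <- extended[idx, ]
test  <- extended[-idx, ]

rf_m2 <- randomForest(LITHO ~ ., data = train,
                      ntree = 400, mtry = 24, nodesize = 1,  # values of table 10 (M2)
                      importance = TRUE)
pred  <- predict(rf_m2, newdata = test)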

Classification based on the dataset without missing values (Models M4, M5 and M6)

  • 3.  
    Next, we repeat the exercise (prediction of lithology types with the RF classifier), but with all missing data (i.e. imputed values) removed. A reduction in the number of data points from 67 323 to 899, and in the number of lithology classes from 13 to 12 (table 7), followed from the fact that the missing values of the different grade variables in the original dataset were not uniformly distributed among the data points. Dry BD (t m−3), a physical variable, was discarded because of missing values at a large proportion of these 899 data points. Hence, this time, the sets of feature variables consisted only of chemical grade variables: the first set consisted of 13 assayed chemical grade variables, the second set of 13 cross-validated (LOOCV) chemical grade variables, and the third set of 13 kriged chemical grade variables with nugget effect filtering.
  • 4.  
    Due to the similarity in age and mode of formation, we merged the AND, RHY, and INTR lithology classes into a single class: INTR. Hence, the total number of lithology classes was reduced to 10 for this set of classification models. The final statistics of the lithology variable used for the prediction models are shown in table 8. For this set of models, we split the 899-point dataset into training/testing subsets in a ratio 70:30. Table 9 shows the number and percentages of lithology class records in the training and testing subsets.

Table 7. Multi-class response information (before merging) in the training and testing subsets.

Serial number | Category | Code | Number of records
1 | Carboniferous andesite dyke | AND | 17
2 | Basalt dykes | BAD | 40
3 | Biotite granodiorite dykes | BIGD | 34
4 | Cretaceous clay | CLAY | 14
5 | Faults | FLT | 107
6 | Hydrothermal breccia | HBX | 20
7 | Hanging wall sequence | HWS | 74
8 | Ignimbrite | IGN | 132
9 | Carboniferous intrusive | INTR | 1
10 | Quartz monzodiorite | QMD | 374
11 | Carboniferous rhyolite dykes | RHY | 66
12 | Augite basalt | VA | 20
Total | | | 899

Table 8. Multi-class response information (after merging) in the training and testing subsets.

Serial number | Category | Code | Number of records
1 | Basalt dykes | BAD | 40
2 | Biotite granodiorite dykes | BIGD | 34
3 | Cretaceous clay | CLAY | 14
4 | Faults | FLT | 107
5 | Hydrothermal breccia | HBX | 20
6 | Hanging wall sequence | HWS | 74
7 | Ignimbrite | IGN | 132
8 | Carboniferous intrusive | INTR | 84
9 | Quartz monzodiorite | QMD | 374
10 | Augite basalt | VA | 20
Total | | | 899

Table 9. Multi-class response information of table 8 separated into training and testing subsets.

Lithology class | Training count | Training % | Testing count | Testing %
BAD | 28 | 3.11 | 12 | 1.33
BIGD | 24 | 2.67 | 10 | 1.11
CLAY | 10 | 1.11 | 4 | 0.44
FLT | 75 | 8.34 | 32 | 3.56
HBX | 14 | 1.56 | 6 | 0.67
HWS | 52 | 5.78 | 22 | 2.45
IGN | 93 | 10.34 | 39 | 4.34
INTR | 59 | 6.56 | 25 | 2.78
QMD | 262 | 29.14 | 112 | 12.46
VA | 14 | 1.56 | 6 | 0.67
All | 631 | 70.19 | 268 | 29.81

Following the model runs for the above cases, we analyzed the model performances from each confusion matrix, deriving the accuracy rates, kappa statistics, and other statistics including the sensitivity, specificity, precision, recall, prevalence, detection rate, and balanced accuracy. We also investigated the relative feature importance according to statistical criteria. Finally, we tuned the models, ran them on the testing subsets for each of the three sets of feature variables, and obtained the final sets of results, which are reported in the Results section and in the appendices.
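
The statistics just listed (accuracy, kappa, sensitivity, specificity, precision, recall, prevalence, detection rate and balanced accuracy) are, for instance, those returned by caret's confusionMatrix function; a hedged sketch using the hypothetical objects of the previous sketch is given below.

# Hedged sketch: overall and per-class performance statistics on the testing subset
library(caret)
cm <- confusionMatrix(data = pred, reference = test$LITHO)
cm$overall[c("Accuracy", "Kappa")]  # overall accuracy and kappa statistic
cm$byClass                          # sensitivity, specificity, precision, recall,
                                    # prevalence, detection rate, balanced accuracy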

The randomForest R package version 4.7-1.1 [23], which implements Breiman and Cutler's RFs for classification and regression based on the original Fortran code [9], was the fundamental package used to obtain the RF models. We largely followed [24] for the parameter tuning strategies for the RF classification. The specific hyperparameter values were determined using the tuneRanger R package version 0.14.1 [25], which tunes the RF classifier using the OOB observations.

3.3.5. Hyperparameter tuning

We considered the following hyperparameters for tuning: (a) the number of candidate feature variables considered at each split (denoted as mtry), (b) the number of observations drawn for each tree (i.e. the sample size, denoted by n), and (c) the minimum number of observations in a leaf node (nodesize). An optimal compromise between low correlation and reasonable strength of the trees can be reached by tuning the parameters mtry, n, and nodesize [9]. Prior to tuning these three hyperparameters, we also estimated the hyperparameter ntree (the number of trees to grow) by finding the optimal number of trees with the R package of [23]. Resampling during model tuning was performed by 5-fold cross-validation. We used the tuneRanger R package to choose the optimal hyperparameter values for our models. A default nodesize value of 1 was used for all the models. The default splitting rule, minimizing the Gini impurity, was used. Research on tuning hyperparameters for variable importance is still in its infancy [24]. Therefore, we report (see the appendices) both measures of variable importance: the Gini variable importance (mean decrease Gini) as well as the permutation variable importance (mean decrease accuracy).
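
A hedged sketch of this tuning step, assuming the mlr and tuneRanger packages and the hypothetical training table used above (the exact calls and measures used by the authors are not reported), is given below.

# Hedged sketch of OOB-based hyperparameter tuning with tuneRanger
library(mlr)
library(tuneRanger)

task <- makeClassifTask(data = train, target = "LITHO")
res  <- tuneRanger(task,
                   num.trees = 400,   # ntree estimated beforehand with randomForest
                   tune.parameters = c("mtry", "sample.fraction", "min.node.size"),
                   iters = 70)
res$recommended.pars                  # tuned mtry, sample fraction and node size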

4. Results and discussion

4.1. Performances of classification models

The optimally tuned model was selected as the candidate model, among those with different hyperparameter values, with the highest accuracy. The accuracy rate was computed from predictions made with the optimized model on the training dataset. The final hyperparameter values used for models M1, M2, M3, M4, M5 and M6 and their prediction performances are indicated below (tables 10 and 11). Additional results are tabulated in appendix A. See [26] for technical details and definitions of the metrics.

Table 10. Prediction performance for models based on dataset with imputed values.

 | Model M1 | Model M2 | Model M3
Predictors | 14 (Assay: 14) | 42 (Assay: 14 + LOOCV: 14 + Filtered-K: 14) | 28 (LOOCV: 14 + Filtered-K: 14)
Target: lithology classes | 13 | 13 | 13
mtry | 12 | 24 | 2
ntree | 250 | 400 | 500
splitrule | gini | gini | gini
Minimum node size | 1 | 1 | 1
Training set | 47 131 samples | 47 131 samples | 47 131 samples
Training accuracy | 99.61% | 99.64% | 99.64%
Testing set | 20 192 samples | 20 192 samples | 20 192 samples
Testing accuracy | 99.61% | 99.65% | 82.02%
Kappa | 0.9954 | 0.9958 | 0.7854

Table 11. Prediction performance for models based on dataset without missing values.

 | Model M4 | Model M5 | Model M6
Predictors | 13 (Assay: 13) | 39 (Assay: 13 + LOOCV: 13 + Filtered-K: 13) | 26 (LOOCV: 13 + Filtered-K: 13)
Target: lithology classes | 10 | 10 | 10
mtry | 7 | 20 | 2
ntree | 250 | 200 | 450
splitrule | gini | gini | gini
Minimum node size | 1 | 1 | 1
Training set | 631 samples | 631 samples | 631 samples
Training accuracy | 80.19% | 100% | 100%
Testing set | 268 samples | 268 samples | 268 samples
Testing accuracy | 61.19% | 67.54% | 67.16%
Kappa | 0.4869 | 0.5669 | 0.5611

4.2. Discussion

Need for spatial machine learning

RF is a machine learning tool that has proved to be useful for regression and classification analyses. However, RF was not developed for modeling spatial stochastic processes, as it has no inherent means of taking into account the spatial dependency (the spatial autocorrelation structure) intrinsic to such processes. There have nevertheless been attempts at spatial prediction (regression) using alternate models or even integrating RF with geostatistical algorithms. Notable examples are spatial linear mixed-models [27], RF-GLS [28], RF-Residual Kriging [29, 30], spatial RF [31] and geographical RF [32]. The spatial structure intrinsic to mineralization can extend to nested structures, with additional difficulties of modeling spatial drifts and attendant estimation complexities [20, 22], in which case the above-cited modeling frameworks unfortunately fall short in theory. In particular, when grades of core samples are the input data used to predict average grades of unit blocks with volumes different from that of the core samples, the problem of 'change of support' [20] is invoked, which is a unique contribution of geostatistics in the realm of stochastic processes.

The scientific problem in this investigation does not necessitate merging the algorithms of RF with those of geostatistics. At first glance, the problem is simply one of classifying the likely rock type associated with a given spatial pattern of grade variations measured on drill cores. However, there are additional dimensions to the problem, as cited in section 1, making it more complex. The traditional method for missing values in a dataset is to impute them with realistic values. The question then arises: which value is realistic and which is not? This is where expert subject knowledge comes in. Notwithstanding this, we approached the missing data problem with the simplest technique, that is, imputing the missing values with the median grade per lithology type. On the other hand, we have the option of incorporating proxy variables whose values at each data point are generated by geostatistics. Hence, the task was to evaluate which technique, traditional versus geostatistical, is superior.

Incorporation of spatial information in classification by geostatistics

The large amount of data and feature variables used in the first exercise, which considers the dataset with imputed values, leads to accuracy rates close to 100% (table 10). However, this performance significantly decreases when reducing the number of data and feature variables to the subset of data with no missing values (table 11), in which case the traditional RF classifier (model M4) applied to the testing dataset reaches an accuracy of only 61.2% and a kappa statistic below 0.49. In this case, the prediction performances of model M5, for which the accuracy rises to 67.5% and the kappa statistic to 0.56, show the importance of incorporating proxy variables obtained by a spatial processing of the input information:

  • The LOOCV predictors convey information on the expected value of the feature variables at an observation point, based on the spatial correlation structure of these variables, on their sampling design and on their values observed at neighboring points. Because they ignore the values at the target observation point, these LOOCV predictors can be seen as a 'summary' of the neighboring information at the point under consideration.
  • The filtered kriging predictors provide another piece of complementary information, consisting of the expected values of the feature variables if there were no short-scale variability or measurement errors. If one decomposes the feature variables into a 'signal' (associated with the continuous component of the covariance function or variogram models) and a 'noise' (associated with the nugget effect), kriging with filtering provides a predictor of the former, i.e. the noise-free feature variables. Like the LOOCV, this predictor accounts for the spatial correlation structure of the feature variables, their sampling design and their values observed at neighboring points, but also for the value observed at the target point, which is not 'wasted' as in LOOCV, and for the basic nested structures used in the spatial correlation model, which are associated with different scales of variations [22]. The rationale for using the filtered kriging predictors is that the relationship between the lithology (output) and the feature variables (input) is better explained by considering the large-scale variations of the inputs that are not contaminated by short-scale variations. This is analogous to the denoising of images with convolutional neural networks [33], except that it is applicable even when the observations are not arranged in a regular sampling design. Note that, if the correlation model does not have a nugget effect, then the filtered kriging predictor matches the original variable at any point where the latter is observed.

Another significant advantage of using the LOOCV and filtered kriging predictors in the classification problem is the handling of missing data, insofar as they provide a 'clever' alternative to the imputation of missing values, based on the spatial correlation structure and neighboring information, instead of (or in complement to) a simple median value by lithological class. The use of geostatistical proxies can be decisive in the presence of highly heterotopic datasets, for which discarding the missing data would imply a considerable loss of information. Note that LOOCV and filtered kriging give the same predicted value for a feature variable at a point with a missing observation.

Associations between lithology and chemical assays

Minerals and metal enrichments in porphyry ore deposits are formed by the precipitation of sulfides from a hydrothermal system accompanying magma intrusion through pre-existing host rocks. Several interrelated phenomena get activated during such ore-forming processes, which depend on the geochemical, petrophysical (e.g. porosity, permeability) and thermodynamical (pH, redox potential, thermal conductivity, etc) rock properties, and on the environment of magma generation and its chemical characteristics.

Some chemical elements are more likely to be found in specific rock types than in others in the Earth's crust [34]. However, these usual metal–rock associations based on geochemical affinity may change once a mineralization event takes place. As in our case, elements such as Cu, Zn, and Au (both chalcophile and siderophile) have become hosted both in an acidic intrusive igneous rock such as quartz-monzodiorite (QMD) and in the basic igneous wall rock, Augite Basalt (VA), after the hydrothermal mineralization event.

Accordingly, an explicit quantitative assessment of the lithology-grade relationship is a very arduous task. To top it all, grade assays are affected by short-scale geological variability and measurement errors, which renders their relationship with the lithology 'noisier' and more difficult to display.

Our work suggests that machine learning classification can successfully model this complex lithology-grade relationship and predict the lithology from the grade assays, and even better when the spatial information is taken into consideration, in particular when different scales of spatial variations are identified and the short-scale variability of the grade assays is filtered out.

Variable importance in the classification models

Bootstrap aggregating (bagging) is a general-purpose procedure for reducing the variance of a statistical learning method. When we perform bagging involving many trees, it is no longer possible to represent the resulting statistical learning method using a single tree, and it is no longer clear which variables are most important to the procedure. Thus, bagging improves prediction accuracy at the expense of interpretability. However, we can obtain an overall summary of the importance of each predictor using the Gini index. This index is a measure of node purity—a small value indicates that a node contains predominantly observations from a single class. We add up the total amount that the Gini index is decreased by splits over a given predictor, averaged over all bagged trees.

The RF algorithm estimates the importance of a variable by looking at how much prediction error increases when (OOB) data for that variable is permuted while all others are left unchanged. The necessary calculations are carried out tree by tree as the RF is constructed [35]. There are actually four different measures of variable importance implemented in the original code. However, we report only two measures of variable importance in appendix B. The first is the mean decrease of accuracy in predictions on the OOB samples when a given variable is permuted. The second is the mean decrease Gini, a measure of the total decrease in node impurity that results from splits over that variable, averaged over all trees.
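
With the randomForest package, both measures can be extracted directly from a fitted forest, as in the hedged snippet below (using the hypothetical rf_m2 object fitted earlier with importance = TRUE).

# Hedged snippet: the two variable importance measures reported in appendix B
importance(rf_m2, type = 1)   # mean decrease in accuracy (permutation importance)
importance(rf_m2, type = 2)   # mean decrease in Gini (node impurity)
varImpPlot(rf_m2)             # graphical summary of both measures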

In the results, models M2 and M5 comprise the three sets of predictors: (1) the original assayed grades (denoted with index 1 in the appendices, e.g. Cu1, Au1, Ag1), (2) the LOOCV predictors (denoted with index 2), and (3) the filtered kriging predictors (denoted with index 3). Model M2 is based on the data table in which the original assayed grades are imputed with the median value per lithology at the missing data points. Model M5 is based on the data table from which the data points with missing values have been completely removed. We can see in appendix B that, in model M2, the top 10 most important predictors belong exclusively to the first set, whereas, in model M5, the top 15 most important predictors (barring F from the first set) belong wholly to the second and third sets of predictors.

Despite this fact, the lower performance of model M3 in comparison with models M1 and M2, and the marginally lower performance of model M6 in comparison with model M5, show that, even if they are the most important, the LOOCV and filtered kriging predictors are a complement to, not a substitute for, the original assayed predictors.

5. Concluding remarks

This work fits into the context of mineral resources modeling and deals with the use of machine learning to predict lithological classes from a set of feature variables (geochemical elements and dry BD), through a case study of a polymetallic porphyry deposit located in Mongolia.

Classification has been performed with the RF algorithm. Our main finding is that, when the training set is not large, the predictive performance of RF can be substantially improved by incorporating new feature variables corresponding to proxy variables calculated by the geostatistical technique of kriging. Specifically, two proxies have been introduced, consisting of (1) the LOOCV predictors, which summarize, at each data location, the information of its neighboring data, and (2) the filtered kriging predictors, which provide a denoised version of the feature variables and carry information on their continuous components, free of measurement errors. The proxy variables calculated at a data point account for the spatial data correlation structure, scales of variations, sampling design, and values of the features observed at neighboring points. In the second classification exercise (with 631 data in the training set and 13 feature variables), the accuracy rate on the testing dataset rose from 61.2% without the proxy variables to 67.5% with these variables.

Another advantage of using geostatistical proxy variables is the possibility of handling missing data in a cleverer way than in traditional approaches, where missing values are imputed with median values per lithology class. In the presence of a highly heterotopic dataset, where data imputation becomes adventurous, one could think of using model M3, with no need to discard any data (as in exercise n°2) or to impute any missing value (as in models M1 and M2 in exercise n°1), just by substituting the under-sampled variables with the geostatistical proxy variables. This also opens the door to the possibility of predicting the lithology class at any unsampled location, as the proxy variables can be calculated at any location of interest, even if no measurement is available (note that, in such a situation, the LOOCV and the kriging with filtering lead to the same values, because the nugget effect is automatically filtered when making the kriging prediction at an unsampled location).

Even better, as a perspective for future work, one could think of simulating the feature variables at any (sampled or unsampled) location, via geostatistical approaches [20], instead of using LOOCV or kriging with filtering. By having multiple sets of simulated variables, it becomes possible to think of a probabilistic classification at the target location [36, 37].

Acknowledgments

This work was partially funded by the National Agency for Research and Development of Chile through grants ANID PIA AFB230001 (X Emery) and ANID Fondecyt 1210050 (X Emery). The authors acknowledge the constructive comments of two anonymous reviewers, as well as the sponsorship of Rio Tinto for the drill hole dataset used in this study as part of the Mineral Resource Estimation Conference 2023, organized by the Australasian Institute of Mining and Metallurgy (AusIMM).

Data availability statement

The data cannot be made publicly available upon publication because they are owned by a third party and the terms of use prevent public distribution. The data that support the findings of this study are available upon reasonable request from the authors.

Appendix A: Performance statistics for models M1 to M6

Table A1. Performance statistics for model M1.

Class | Sensitivity | Specificity | Precision | Recall | Prevalence | Detection rate | Balanced accuracy
AND | 0.9931 | 0.9998 | 0.9948 | 0.9931 | 0.0288 | 0.0286 | 0.9965
BAD | 0.9915 | 0.9995 | 0.9929 | 0.9915 | 0.0695 | 0.0689 | 0.9955
BIGD | 0.9982 | 0.9997 | 0.9973 | 0.9982 | 0.1098 | 0.1096 | 0.9989
CLAY | 0.9852 | 1 | 1 | 0.9852 | 0.0201 | 0.0198 | 0.9926
FLT | 0.9816 | 0.9991 | 0.9691 | 0.9816 | 0.0269 | 0.0264 | 0.9904
HBX | 0.9884 | 0.9999 | 0.9827 | 0.9884 | 0.0085 | 0.0084 | 0.9941
HWS | 0.9977 | 0.9995 | 0.9985 | 0.9977 | 0.2381 | 0.2376 | 0.9986
IGN | 0.9975 | 0.9997 | 0.9982 | 0.9975 | 0.1364 | 0.136 | 0.9986
INTR | 1 | 1 | 1 | 1 | 0.0024 | 0.0024 | 1
QCO | 1 | 1 | 1 | 1 | 0.0001 | 0.0001 | 1
QMD | 0.9962 | 0.9987 | 0.9955 | 0.9962 | 0.2208 | 0.22 | 0.9975
RHY | 0.996 | 0.9997 | 0.992 | 0.996 | 0.0371 | 0.037 | 0.9978
VA | 0.999 | 0.9999 | 0.9995 | 0.999 | 0.1013 | 0.1012 | 0.9995

Table A2. Performance statistics for model M2.

Class | Sensitivity | Specificity | Precision | Recall | Prevalence | Detection rate | Balanced accuracy
AND | 0.9966 | 0.9999 | 0.9966 | 0.9966 | 0.0288 | 0.0287 | 0.9982
BAD | 0.9915 | 0.9994 | 0.9922 | 0.9915 | 0.0695 | 0.0689 | 0.9954
BIGD | 0.9982 | 0.9994 | 0.9955 | 0.9982 | 0.1098 | 0.1096 | 0.9988
CLAY | 0.9852 | 0.9999 | 0.9975 | 0.9852 | 0.0201 | 0.0198 | 0.9926
FLT | 0.9779 | 0.9995 | 0.9833 | 0.9779 | 0.0269 | 0.0263 | 0.9887
HBX | 1 | 1 | 0.9942 | 1 | 0.0085 | 0.0085 | 1
HWS | 0.9979 | 0.9995 | 0.9985 | 0.9979 | 0.2381 | 0.2376 | 0.9987
IGN | 0.9971 | 0.9999 | 0.9993 | 0.9971 | 0.1364 | 0.136 | 0.9985
INTR | 1 | 1 | 1 | 1 | 0.0024 | 0.0024 | 1
QCO | 1 | 1 | 1 | 1 | 0.0001 | 0.0001 | 1
QMD | 0.9975 | 0.9987 | 0.9953 | 0.9975 | 0.2208 | 0.2203 | 0.9981
RHY | 0.9947 | 0.9996 | 0.9907 | 0.9947 | 0.0371 | 0.0369 | 0.9972
VA | 0.999 | 1 | 1 | 0.999 | 0.1013 | 0.1012 | 0.9995

Table A3. Performance statistics for model M3.

Class | Sensitivity | Specificity | Precision | Recall | Prevalence | Detection rate | Balanced accuracy
AND | 0.567 | 0.9933 | 0.7143 | 0.567 | 0.0288 | 0.0163 | 0.7801
BAD | 0.3996 | 0.9735 | 0.5297 | 0.3996 | 0.0695 | 0.0278 | 0.6865
BIGD | 0.7687 | 0.9759 | 0.7971 | 0.7687 | 0.1098 | 0.0844 | 0.8723
CLAY | 0.8691 | 0.9974 | 0.8734 | 0.8691 | 0.0201 | 0.0174 | 0.9333
FLT | 0.326 | 0.9925 | 0.5463 | 0.326 | 0.0269 | 0.0088 | 0.6592
HBX | 0.6221 | 0.9992 | 0.8629 | 0.6221 | 0.0085 | 0.0053 | 0.8106
HWS | 0.9459 | 0.9721 | 0.9138 | 0.9459 | 0.2381 | 0.2252 | 0.959
IGN | 0.8943 | 0.9763 | 0.8564 | 0.8943 | 0.1364 | 0.122 | 0.9353
INTR | 0.4583 | 0.9996 | 0.7097 | 0.4583 | 0.0024 | 0.0011 | 0.7289
QCO | 0.6667 | 1 | 0.6667 | 0.6667 | 0.0001 | $9.90\times 10^{-5}$ | 0.8333
QMD | 0.8998 | 0.9333 | 0.7927 | 0.8998 | 0.2208 | 0.1987 | 0.9165
RHY | 0.6093 | 0.9913 | 0.7289 | 0.6093 | 0.0371 | 0.0226 | 0.8003
VA | 0.892 | 0.9845 | 0.8666 | 0.892 | 0.1013 | 0.0904 | 0.9382

Table A4. Performance statistics for model M4.

Class | Sensitivity | Specificity | Precision | Recall | Prevalence | Detection rate | Balanced accuracy
BAD | 0 | 0.9766 | 0 | 0 | 0.0448 | 0 | 0.4883
BIGD | 0.2 | 0.9806 | 0.2857 | 0.2 | 0.0373 | 0.0075 | 0.5903
CLAY | 0 | 0.9848 | 0 | 0 | 0.0149 | 0 | 0.4924
FLT | 0.0938 | 0.9364 | 0.1667 | 0.0938 | 0.1194 | 0.0112 | 0.5151
HBX | 0.6667 | 0.9885 | 0.5714 | 0.6667 | 0.0224 | 0.0149 | 0.8276
HWS | 0.5 | 0.9593 | 0.5238 | 0.5 | 0.0821 | 0.041 | 0.7297
IGN | 0.7949 | 0.9345 | 0.6739 | 0.7949 | 0.1455 | 0.1157 | 0.8647
INTR | 0.68 | 0.9424 | 0.5484 | 0.68 | 0.0933 | 0.0634 | 0.8112
QMD | 0.8214 | 0.8077 | 0.7541 | 0.8214 | 0.4179 | 0.3433 | 0.8146
VA | 0.6667 | 0.9924 | 0.6667 | 0.6667 | 0.0224 | 0.0149 | 0.8295

Table A5. Performance statistics for model M5.

Class | Sensitivity | Specificity | Precision | Recall | Prevalence | Detection rate | Balanced accuracy
BAD | 0.3333 | 0.9844 | 0.5 | 0.3333 | 0.0448 | 0.0149 | 0.6589
BIGD | 0.2 | 0.9845 | 0.3333 | 0.2 | 0.0373 | 0.0075 | 0.5922
CLAY | 0 | 0.9924 | 0 | 0 | 0.0149 | 0 | 0.4962
FLT | 0.0938 | 0.9322 | 0.1579 | 0.0938 | 0.1194 | 0.0112 | 0.513
HBX | 0.6667 | 1 | 1 | 0.6667 | 0.0224 | 0.0149 | 0.8333
HWS | 0.6364 | 0.9675 | 0.6364 | 0.6364 | 0.0821 | 0.0522 | 0.8019
IGN | 0.8205 | 0.952 | 0.7442 | 0.8205 | 0.1455 | 0.1194 | 0.8862
INTR | 0.76 | 0.9506 | 0.6129 | 0.76 | 0.0933 | 0.0709 | 0.8553
QMD | 0.8839 | 0.8205 | 0.7795 | 0.8839 | 0.4179 | 0.3694 | 0.8522
VA | 0.6667 | 0.9924 | 0.6667 | 0.6667 | 0.0224 | 0.0149 | 0.8295

Table A6. Performance statistics for model M6.

Class | Sensitivity | Specificity | Precision | Recall | Prevalence | Detection rate | Balanced accuracy
BAD | 0.3333 | 0.9922 | 0.6667 | 0.3333 | 0.0448 | 0.0149 | 0.6628
BIGD | 0.2 | 0.9845 | 0.3333 | 0.2 | 0.0373 | 0.0075 | 0.5922
CLAY | 0 | 0.9924 | 0 | 0 | 0.0149 | 0 | 0.4962
FLT | 0.0625 | 0.9407 | 0.125 | 0.0625 | 0.1194 | 0.0075 | 0.5016
HBX | 0.6667 | 1 | 1 | 0.6667 | 0.0224 | 0.0149 | 0.8333
HWS | 0.6364 | 0.9593 | 0.5833 | 0.6364 | 0.0821 | 0.0522 | 0.7979
IGN | 0.8205 | 0.9476 | 0.7273 | 0.8205 | 0.1455 | 0.1194 | 0.8841
INTR | 0.72 | 0.9424 | 0.5625 | 0.72 | 0.0933 | 0.0672 | 0.8312
QMD | 0.8929 | 0.8205 | 0.7812 | 0.8929 | 0.4179 | 0.3731 | 0.8567
VA | 0.6667 | 0.9924 | 0.6667 | 0.6667 | 0.0224 | 0.0149 | 0.8295

Appendix B: Variable importance tables for models M1 to M6

Assay predictors are denoted with index 1, LOOCV (leave-one-out cross-validation) predictors with index 2, and kriging-filtered predictors with index 3.
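
The importance measures reported in tables B1–B6 are the standard random-forest outputs: the per-class and overall mean decrease in accuracy when each predictor is permuted, and the mean decrease in Gini impurity. As an illustration only (it does not reproduce the computations of this study), the following minimal Python/scikit-learn sketch shows the two analogous measures on placeholder data standing in for the assay and proxy features; note that scikit-learn reports an overall permutation importance rather than the per-class breakdown tabulated below:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # placeholder for assay, LOOCV and kriging-filtered predictors
y = rng.integers(0, 3, size=200)   # placeholder rock-type labels

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Analogue of "mean decrease Gini": total impurity decrease attributed to each predictor.
gini_importance = rf.feature_importances_

# Analogue of "mean decrease accuracy": drop in accuracy when each predictor is permuted.
perm = permutation_importance(rf, X, y, scoring="accuracy", n_repeats=10, random_state=0)
accuracy_importance = perm.importances_mean
```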

Table B1. Variable importance table (sorted by permutation variable importance) for model M1.

Predictor | AND | BAD | BIGD | CLAY | FLT | HBX | HWS | IGN | INTR | QCO | QMD | RHY | VA | Mean decrease accuracy | Mean decrease Gini
C10.65290.46940.66380.59290.33570.73810.25060.3070.5380.21750.14870.64850.26280.33814057.936
Hg10.56570.3370.34380.55830.5840.75560.01990.43880.51720.16250.34940.53790.45380.31963372.613
F10.16870.30510.18740.36480.15790.16030.58250.12960.45920.49120.13550.12340.42610.29494074.832
Pb10.09320.04680.04620.08720.04730.06550.01560.06720.03960.210.0050.04840.06110.03671440.238
S10.08480.04250.03510.20530.04980.17430.02470.02350.0450.20250.01120.03650.06840.03631576.449
Fe10.0730.03920.05260.11270.01430.0110.00260.01450.05890.20620.0050.06690.06880.0267761.4488
Zn10.03970.03230.01380.0170.01810.01310.00730.02790.01740.14250.00510.01290.08710.0219550.1273
Cd10.03670.01280.01150.01760.01270.0610.00260.01970.01650.10880.0020.00790.0490.0135383.8004
BD10.01310.00890.0020.01190.04620.08530.00110.00620.55390.220.01330.04330.00060.0105265.7439
Ag10.00740.00660.01990.04830.00750.02490.00110.00750.02270.15380.00350.00440.02170.0088396.448
Mo10.00380.00410.00070.01670.00380.01120.00280.00110.01810.13250.0010.00130.00130.002387.3287
As10.0030.00170.00190.00830.00220.00540.00030.00160.02420.09120.00080.00070.0010.001439.9058
Cu10.00250.00060.00020.0010.00120.00360.00190.00020.01230.03250.00220.0008 $ 2.30\times 10^{-5} $ 0.001241.3564
Au10.00070.00040.00030.00340.00030.00130.00010.00040.01540.01380.00080.0005 $ 6.93\times 10^{-5} $ 0.000515.5744

Table B2. Variable importance table (sorted by permutation variable importance) for model M2.

Predictor | AND | BAD | BIGD | CLAY | FLT | HBX | HWS | IGN | INTR | QCO | QMD | RHY | VA | Mean decrease accuracy | Mean decrease Gini
C10.69330.54630.70880.64670.37260.80020.23710.36580.57940.14780.21420.71590.28840.37644570.177
Hg10.60830.36330.36380.57520.67080.61150.01590.53950.5290.05560.45550.53730.46770.36423719.936
F10.20980.32990.18340.37880.16750.15980.60240.14990.44560.42890.17110.15280.52080.32394422.471
Pb10.09940.050.04240.09790.05170.04730.02610.04730.03510.15780.00730.0460.05090.0361290.594
S10.07740.04240.0370.18740.0430.07340.03230.01440.02890.10330.01140.03630.05570.0341226.98
Fe10.06510.03380.04430.08550.01310.0030.00730.00250.04410.12670.00540.060.06050.023557.4544
Zn10.02490.01410.00450.00650.00550.00410.00320.02590.00250.05440.00170.0050.07410.0148329.9298
BD10.01310.00820.0010.03120.06560.05250.00070.00490.55280.13560.01180.04290.00050.0103251.4482
Cd10.02220.00780.00880.01570.00460.02970.00190.01290.00450.04560.00210.00420.01860.0076215.9698
Ag10.00560.00610.00870.05230.00880.01330.00140.00350.0080.05110.00180.00440.0130.0057230.4467
C30.00030.00020.00020.00020.004−0.00460.00020.00220.00560.02560.00430.0013 $6.81\times 10^{-5} $ 0.001532.2555
As10.00350.00130.0010.00320.0020.00130.00020.00060.01840.04330.00030.00040.00010.000720.0095
Mo10.0020.00060.00070.00980.00260.00080.0006 $3.10\times 10^{-5}$ 0.00650.03110.00040.0003 $3.83\times 10^{-5}$ 0.000721.5275
F30.00040.00030.00150.00080.00130.0007 $1.96\times 10^{-5} $ 0.00040.002300.00150.0009 $1.73\times 10^{-5} $ 0.00079.9824
Fe30.00070.0002 $9.37\times 10^{-5} $ 0.00030.00050.00080.001 $ 6.05\times 10^{-5} $ 0.00050.00440.00070.00130.00150.00078.8438
F20.0001 $5.89\times 10^{-5} $ 0.00060.00030.00080.0002 $7.19\times 10^{-5} $ 0.00040.00120.00330.00160.0009 $2.34\times 10^{-5} $ 0.00069.5585
C20.00030.0002 $6.30\times 10^{-5}$ 0.00060.00150.00030.00010.00030.00250.03330.00130.0005 $8.55 \times 10^{-6}$ 0.00059.8441
Fe20.00040.00020.00020.0004 $4.58\times 10^{-5} $ $7.48\times 10^{-5} $ 0.0010.00010.0020.00440.00030.00090.00020.00047.8021
BD30.00020.00030.00010.00090.00120.0025 $5.49\times 10^{-5} $ 0.00050.00140.00220.00070.0006 $3.47 \times 10^{-6} $ 0.00047.545
Cu10.00080.0001 $4.65\times 10^{-5} $ 0.00160.00040.00070.0001 $6.44 \times 10^{-6} $ 0.00230.01890.0010.0002 $-5.80 \times 10^{-6} $ 0.00033.5924
Cu20.0002 $4.79\times 10^{-5} $ $5.46\times 10^{-5} $ 0.00030.00030.00080.00020.00010.00090.00440.00060.0005 $6.12\times 10^{-6} $ 0.00034.0761
S20.00040.0001 $2.45\times 10^{-5} $ 0.00110.00060.00140.00010.00010.0025−0.00110.00060.0002 $9.39\times 10^{-5} $ 0.00036.0914
Hg20.00030.0004 $6.61\times 10^{-5} $ 0.0061 $-4.66\times 10^{-5} $ −0.0012 $3.76\times 10^{-5} $ $7.47\times 10^{-5} $ 0.00060.01220.00040.0002 $ 8.93 \times 10^{-6}$ 0.00037.5523
BD20.00050.0003 $8.69\times 10^{-5} $ 0.00060.00090.002 $ 2.48\times 10^{-5} $ 0.0004−0.00020.00560.00040.0004 $9.15\times 10^{-6} $ 0.00037.178
Hg30.00040.00070.00010.00320.0002 $-5.09\times 10^{-5} $ $1.74\times 10^{-5} $ $9.92\times 10^{-5} $ 0.00190.01220.00050.0003 $-5.92\times 10^{-6} $ 0.00038.4062
Au10.0003 $6.31\times 10^{-5} $ $5.12\times 10^{-6} $ 0.0036 $3.50\times 10^{-5} $ 0.0004 $7.06\times 10^{-5}$ 0.00010.00430.00220.0002 $7.20\times 10^{-5}$ $ 1.79\times 10^{-5} $ 0.00023.4839
Au20.0003 $1.61\times 10^{-5} $ $5.98\times 10^{-5} $ 0.00040.00030.00110.00020.00030.00570.00440.00040.0003 $2.35\times 10^{-5} $ 0.00026.9862
Ag2 $6.51\times 10^{-5} $ 0.00030.00010.00040.00050.00050.00030.00020.00590.00330.00020.0005 $2.91\times 10^{-6} $ 0.00025.823
Mo20.0003 $ 9.98\times 10^{-5}$ $ 2.98\times 10^{-5} $ 0.0005 $ 4.13\times 10^{-5}$ 0.00040.0001 $ 4.42\times 10^{-5}$ 0.00210.00670.00040.0003 $1.17\times 10^{-5} $ 0.00025.2598
Zn20.00040.0002 $1.09\times 10^{-5}$ 0.00020.0002 $-5.84\times 10^{-5} $ $8.01\times 10^{-5} $ 0.00050.00060.00440.0004 $7.40\times 10^{-5} $ $7.38\times 10^{-5}$ 0.00027.3934
Cd20.0005 $7.42\times 10^{-5} $ 0.00020.0003 $3.46\times 10^{-6} $ 0.0004 $4.86\times 10^{-5} $ 0.00020.001−0.00440.00030.0001 $4.11\times 10^{-5}$ 0.00025.8706
Au3 $-4.53\times 10^{-5} $ $1.47\times 10^{-5} $ $9.65\times 10^{-5} $ 0.00070.00020.00090.00010.00030.010.01330.00040.0003 $2.66\times 10^{-5}$ 0.00026.4571
Ag3 $5.64\times 10^{-5} $ 0.0001 $7.89\times 10^{-5} $ 0.00030.0003 $1.84\times 10^{-5} $ 0.00020.00020.00390.00330.00030.0006 $-2.77\times 10^{-6} $ 0.00025.4331
Cu30.0006 $9.98\times 10^{-5} $ $4.09\times 10^{-5} $ 0.00020.00020.00070.0002 $4.80\times 10^{-6} $ 0.00140.010.00020.0003 $-3.00\times 10^{-6} $ 0.00023.8091
As3 $7.27\times 10^{-5} $ 0.00020.00030.00060.00040.0016 $2.49\times 10^{-5} $ 0.00020.00090.01110.0002 $ 3.34\times 10^{-5} $ 0.00050.00025.1825
Mo30.0002 $7.81\times 10^{-5} $ $5.17\times 10^{-5}$ 0.00060.00040.00060.0003 $4.65\times 10^{-5}$ 0.0020.01220.00020.0001 $-8.69\times 10^{-6} $ 0.00025.6312
S30.00040.0002 $4.10\times 10^{-5}$ 0.0011 $4.15\times 10^{-5}$ 0.0011 $7.94\times 10^{-5}$ $6.56\times 10^{-5}$ $-9.66\times 10^{-5}$ 0.00110.00070.0002 $2.90\times 10^{-5}$ 0.00025.54
Zn30.0006 $5.13\times 10^{-5}$ $2.47\times 10^{-5}$ $8.58\times 10^{-5}$ 0.0002 $9.33\times 10^{-5}$ $5.22\times 10^{-5}$ 0.00040.00070.00110.0002 $7.35\times 10^{-5}$ 0.00020.00025.2306
As2 $4.26\times 10^{-5}$ 0.00010.00010.00040.00020.0023 $4.00\times 10^{-5}$ $6.56\times 10^{-5}$ 0.00170.00560.0001 $5.60\times 10^{-5}$ $2.83\times 10^{-6}$ 0.00015.0496
Pb20.00020.0001 $4.06\times 10^{-5}$ 0.00030.00050.0004 $8.27\times 10^{-5}$ 0.00010.0017−0.00110.00010.0001 $9.04\times 10^{-5}$ 0.00014.338
Cd30.0007 $1.30\times 10^{-5}$ $2.71\times 10^{-5}$ 0.00010.00010.0003 $3.40\times 10^{-5}$ 0.00020.0025−0.00220.0002 $5.52\times 10^{-5}$ 0.00010.00014.1277
Pb30.0004 $9.80\times 10^{-5}$ $5.37\times 10^{-5}$ 0.00020.0005 $5.26\times 10^{-5}$ $1.90\times 10^{-5}$ 0.00010.00070.0022 $5.65\times 10^{-5}$ 0.0003 $1.80\times 10^{-5}$ $9.45\times 10^{-5}$ 3.6644

Table B3. Variable importance table (sorted by permutation variable importance) for model M3.

Predictor | AND | BAD | BIGD | CLAY | FLT | HBX | HWS | IGN | INTR | QCO | QMD | RHY | VA | Mean decrease accuracy | Mean decrease Gini
S20.06340.03750.07350.09950.03290.08140.10370.09190.03010.0430.09020.07080.07040.083696.024
Cu20.06140.03290.06750.06360.0320.04750.11260.06160.04760.0190.07450.06130.14630.0829792.518
F20.07570.0440.07440.10560.040.06350.1020.07890.05460.060.07150.08140.10280.0815708.979
S30.06310.03750.07350.09570.03230.07740.09780.09130.04110.0390.08930.07140.06730.081697.111
Hg30.06210.03680.09220.17540.03450.06670.08620.06660.03790.0220.06720.0620.12360.0788626.431
F30.07330.0440.07040.10770.03960.0640.09960.07760.05530.050.06630.07820.09940.0787701.754
Hg20.0610.03580.09010.17820.03290.06370.08430.06310.0330.040.06470.05920.11930.0765623.761
Cu30.05830.02830.06450.05530.02760.04140.10590.05370.03150.0190.06280.05740.13550.0753711.657
Mo20.0650.03670.07390.11370.03020.0490.12490.06740.02430.0060.05130.0480.0680.075817.567
Pb20.05440.03330.10440.06010.02960.0540.08950.07790.03190.0370.0420.06660.07540.0692568.836
Mo30.05910.03310.06710.09520.02660.04260.10890.06020.0228−0.0150.04890.04540.06020.0671764.693
Au20.05370.03570.06760.0510.03080.06470.06840.06210.02990.0120.08570.05280.05460.0651664.616
C20.04950.03120.07820.13990.02920.06030.07380.06350.03330.0420.06580.06610.04860.0645634.443
C30.0480.03070.07870.14040.03050.06040.07380.06240.03290.0790.0640.06560.04880.064628.27
Pb30.04840.03040.09890.05060.02580.04980.08120.07270.02510.0310.03710.05930.06940.0632534.848
Au30.05290.03340.06390.04960.03130.06090.06750.05790.030.0180.08150.05130.04990.0622651.456
BD20.05080.03080.07560.03450.02750.05370.0550.06250.03730.0270.0460.06350.08050.0562570.635
Fe20.04480.02990.05920.04530.02370.04730.07120.0570.03670.0120.0430.05870.04430.0528505.262
BD30.04770.02950.06690.03280.02410.04960.04910.05660.04380.0180.0410.0570.08020.0513534.843
Fe30.04360.02850.05480.04140.02210.04230.06710.05350.03570.0110.03890.05740.04230.0494494.123
Ag20.04360.02950.05830.04010.01990.03440.05620.04010.03060.0130.03670.04420.0720.0473550.961
As20.03990.02530.05140.03510.01990.03960.06420.05740.03660.0150.03580.04090.03720.0465574.738
Cd20.04010.02270.04110.03750.02090.04030.07090.05210.03530.0240.0280.04040.05370.0461447.007
As30.03860.02310.04870.03370.01840.04360.07190.04750.03630.0150.03210.04230.03380.0454487.261
Zn20.04060.02320.04110.04570.01750.03760.06570.05020.02990.0270.03120.03680.04860.0447520.732
Zn30.040.02220.03980.04490.01670.03620.06810.04990.02680.0150.02710.03880.04570.0439515.322
Ag30.03990.02660.05490.03620.01860.03090.05180.03630.03040.0130.03220.04250.07070.0437533.928
Cd30.03930.02340.03950.03550.02120.03710.06540.05110.02440.0270.02440.03770.05580.0437432.017

Table B4. Variable importance table (sorted by permutation variable importance) for model M4.

Predictor | BAD | BIGD | CLAY | FLT | HBX | HWS | IGN | INTR | QMD | VA | Mean decrease accuracy | Mean decrease Gini
F10.07230.0486−0.001−0.0084−0.02790.18620.0380.07790.0630.1180.059917.8613
Ag10.05830.0632−0.0038−0.01120.00950.01780.04710.03650.07980.0760.0515.8915
Zn10.040.054−0.0029−0.00410.02930.10330.10440.03720.04250.10310.049917.8492
Cu10.03350.0394−0.0038−0.01410.06090.13880.01580.0740.06160.09080.049615.1981
Au10.05260.0649−0.00430.00180.06350.06450.04050.02590.0620.04260.046114.7698
S10.07570.08920−0.00090.05280.05940.04970.08290.03370.02990.041915.6537
Mo10.05950.0417−0.0081−0.0070.02110.0870.03650.04540.04930.04070.041515.839
Fe10.03290.05880−0.00450.00890.06430.0220.10380.03920.15930.040216.6727
C10.07220.05190.0038−0.00690.02390.13680.02570.01340.03210.00630.034112.3887
Cd10.03670.0716−0.0043−0.0040.00970.01610.03430.01330.0410.04240.029313.524
Pb10.080.0116−0.0014−0.0018−0.00240.02250.03420.03080.02310.0130.022913.1225
Hg10.01730.025200.00010.01−0.00170.02630.01330.0178−0.01660.01379.3773
As10.01090.0246−0.0014−0.008−0.00820.01220.00910.01850.0196−0.02380.012211.1179

Table B5. Variable importance table (sorted by permutation variable importance) for model M5.

Predictor | BAD | BIGD | CLAY | FLT | HBX | HWS | IGN | INTR | QMD | VA | Mean decrease accuracy | Mean decrease Gini
Au20.060.0557−0.0025−0.00150.04810.14250.06630.01550.07840.05420.060812.8895
Au30.10270.04930.00250.00310.04270.09360.03710.00960.05670.04560.046210.8739
F20.03720.03450.0075−0.00620.02080.09250.01480.02830.03040.06030.02797.8726
F30.040.027500.00020.00170.12640.01680.03310.02190.05540.02797.1694
Ag20.00230.00910.00250.00390.0190.01380.02030.00650.04850.040.02737.125
Ag30.01720.0443−0.005−0.0034−0.0070.00790.0290.01450.04250.02250.02596.0664
C20.05750.03470.0050.0012−0.01670.10520.03520.00690.01620.01250.02419.1127
S20.01860.06550.005−0.00120.07680.01570.02910.02640.02390.00210.02239.943
Fe30.01020.00960.0025−0.005−0.00690.020.01010.06710.02310.06830.01998.2342
Pb20.02330.01890.0025−0.0082−0.01040.01560.03520.020.02610.00420.01986.6939
F10.02890.03170−0.0097−0.0150.09440.00810.01370.01490.01330.0184.9281
Fe20.00430.0032−0.0050.0080.0033−0.00020.00420.03940.02690.05380.0186.8288
S30.02020.0410.0025−0.00090.03420.0340.02030.02280.01580.00960.01797.3436
Zn20.00660.02930.0017−0.00070.01170.03280.03420.0150.01250.020.01626.567
Cu20.01010.0022−0.00250.00480.01850.06330.00960.01530.01750.02420.0165.212
S10.02430.0348−0.0125−0.00860.02130.03080.01640.04020.01150.00560.01515.8781
Fe10.01910.00920−0.0085−0.01250.02910.00310.03980.01620.05750.01436.4831
Zn10.01170.00770−0.00090.00250.0450.01890.02590.010.030.01354.5939
Zn3−0.00130.02140−0.0060.00170.03530.02820.01150.00770.01170.01145.1166
Ag1−0.00230.0008−0.005−0.0026−0.01040.00190.01250.00580.02010.03420.01113.7639
C30.03690.0330.0025−0.003400.04660.0130.01430.00470.00420.01083.7902
Cd20.01760.0103−0.00250.0040.00080.00950.01160.01530.01160.03390.01025.9184
C10.02660.0240−0.00080.00330.05440.00530.01010.00520.00250.00993.5855
Mo10.00580.005600.00110.010.01950.00810.01160.00790.00130.00863.4879
Mo30.005−0.0032−0.0133−0.00380.00330.02210.02170.00580.00830.00330.00854.5287
Mo20.00910.0053−0.0025−0.00250.02040.00770.0126−0.00010.00820.01790.00684.9142
Au10.00380.00650.00750.00390.01580.00530.01010.0020.008−0.00040.00643.5055
Cu10.00240.001300.00150.01290.00690.00130.0330.00440.00290.00612.9869
Cu30.001−0.00430−0.00290.00020.00950.00230.02610.007−0.00540.00593.2155
Cd30.00030.003300.0009−0.0050.01080.00920.00720.0060.00290.00522.6198
Pb30.00890.0073−0.00420.0002−0.00670.00650.01390.00150.00320.00580.00492.9691
Pb10.0090.00410 $9.09\times 10^{-5}$ −0.00130.0010.00840.00710.00450.0050.00422.6577
Cd1−0.00140.0064−0.0050.0034−0.00460.0010.00550.00450.0056−0.00180.00412.4785
As10.0005−0.00540.0025−0.0009−0.00130.01030.00110.00230.0036−0.00070.00241.9875
Hg2−0.0046−0.00040.00170.0166−0.00670.00450.00240.003−0.00080.0090.00234.0571
As20.00950.02080−0.0066−0.010.00340.0045−0.00250.00310.00370.0023.4918
Hg10.000700−0.0008−0.0021−0.00320.00510.0010.0023−0.00170.00151.5937
Hg30.00250.003300.00880.0078−0.0065−0.00020.00270.0013−0.00420.00153.1068
As3−0.00330.000300.001300.00390.0026−0.00060.0001−0.00330.00071.9225

Table B6. Variable importance table (sorted by permutation variable importance) for model M6.

Predictor | BAD | BIGD | CLAY | FLT | HBX | HWS | IGN | INTR | QMD | VA | Mean decrease accuracy | Mean decrease Gini
Au20.060.05220.00640.00630.05180.1110.06760.02010.07720.06940.060613.036
Au30.0870.04380.00870.00630.05450.08770.04720.02250.05880.05110.04912.009
F30.0520.0381 $-1\times 10^{-4}$ −0.003−0.01030.1780.02870.05240.03640.06830.042710.4413
F20.0580.04380.0029−0.0030.01750.15150.02810.03950.03480.08510.040410.7746
Ag20.0210.02370.01050.00130.01030.00840.03010.00920.05790.03770.03298.3464
Ag30.0140.0320.0006−0.0070.03320.00720.03520.01310.05330.0350.03158.0507
S20.0230.07220.01020.00380.08320.03340.03890.0240.03410.0070.030511.0913
S30.0350.05180.0026 $-2\times 10^{-4}$ 0.06180.04550.03740.03460.02640.00730.028310.326
Fe30.0250.00930.00330.00220.00410.02770.01320.08030.02690.10140.02710.9636
Cu20.0260.01160.00120.00240.03580.08280.00740.02770.02940.04640.0267.8726
C20.0460.05120.00780.0013−0.00450.11220.03350.01070.01750.00380.0269.7221
Zn30.0140.0096−0.004−0.0020.01020.08040.04860.03590.01470.05370.02448.4346
Fe20.0130.01190.0040.00920.00740.00450.01170.05610.03230.05370.02419.5132
Zn20.0160.03030.003−0.0020.01070.04630.05410.02340.01510.04270.02268.585
Pb20.0270.0084−0.004−0.0060.00160.0160.04260.01480.02660.00440.02078.0985
C30.0430.0416−0.003−0.0040.00320.0940.01290.01730.00980.00590.01776.1296
Cu30.0090.0033−0.0010.00170.02620.02510.00720.05770.01540.03190.01656.1409
Mo30.0130.013−0.007−0.0050.0080.01760.02780.01870.01410.00950.01376.228
Cd20.0140.0041 $-4\times 10^{-4}$ 0.00280.00820.01960.02470.01680.01060.04060.01316.9025
Mo20.010.0104−0.002−0.0060.02130.00810.02320.00790.01630.01150.01226.0524
Pb30.0210.0042−0.004−0.002−0.00630.010.01760.01330.0075−0.00260.00844.7196
Cd30.0020.0102−0.003 $-7\times 10^{-4}$ 0.00320.01040.01410.0070.010.00830.00824.3019
Hg2−00.00310.00240.01170.00280.00150.00720.00730.0058−0.00460.00555.1636
As20.0020.02130.0028−0.003−0.00280.01550.00650.00120.0054−0.00660.00484.6214
Hg30.0020.00140.00190.00610.0017−0.0020.00440.00580.0067−0.00270.00464.5215
As3 $7\times 10^{-4}$ 0.00230.0002−0.004−0.0040.00630.0030.00470.0046−0.00490.00283.3472