1 Introduction

In interpersonal communication, facial expressions, body movements, and language constitute the emotional expression system. Among them, facial expressions are the most important channel of communication between people, and they are divided into macro-expressions (MaEs) and micro-expressions (MEs). MEs are spontaneous expressions: even when people try to hide their inner emotions, their facial muscles move uncontrollably and produce MEs.

An ME is a natural physiological stress reaction that is not under conscious control, and it is therefore difficult to fake or suppress. MEs reflect people's true emotions and have practical value in medical treatment, polygraph detection, and criminal investigation. The MaE recognition task has been widely studied owing to the large number of available MaE data sets, and the proposed algorithms reach accuracy scores above 90% on the public data sets. Compared with MaE recognition, ME recognition is much more difficult, for the following reasons. Firstly, MEs are short in duration (only 1/25 to 1/2 s long [1]), while MaEs last much longer (usually 1/2 to 4 s). Secondly, the intensity of an ME is low, and only a few muscles in the facial region are involved. Furthermore, the changes in these muscles are subtle and hard to capture.

MEs can be classified into six typical emotions: happiness, sadness, fear, anger, disgust, and surprise [2]. However, owing to the small sample size problem, the ME data sets are instead split into three categories: positive, negative, and surprise. Even if the image under study is cropped and resized, the dimensionality of the ME feature remains large; for example, the three-dimensional optical flow feature proposed for STSTNet [8] has 2352 dimensions. Due to the sparse nature of ME data, only a small part of the optical flow of the facial region is useful for classification.

When applying existing feature selection algorithms, the challenges for the ME recognition task are as follows: (1) ME data sets are characterized by class imbalance and high-dimensional features, together with the small sample size problem. Consequently, traditional feature selection methods cannot handle these problems well and are prone to overfitting. (2) The validation scheme for ME recognition is leave-one-subject-out (LOSO), and the fitness value of each individual is in most cases evaluated by fivefold or tenfold cross-validation. Consequently, feature selection algorithms based on evolutionary strategies require a huge amount of computational resources to search for optimal feature subsets. (3) In optical flow images, adjacent pixels describe the motion of the same facial muscle or reflect similar information, so they are highly correlated in most cases. It is therefore necessary to filter such redundant features effectively to accelerate the search algorithm. Without limiting the sizes of the initial feature subsets, the search space for heuristic algorithms is very large, since there are 28 × 28 × 3 = 2352 features in the ME recognition problem.

In this paper, we conduct the feature selection on facial regions by using genetic programming (GP) to extract the facial regions that are related to emotional categories. We hypothesize that by focusing on specific facial regions, the accuracy of ME recognition can be improved.

Evolutionary algorithms are biologically inspired optimization algorithms, with genetic programming (GP) and genetic algorithms (GA) as typical representatives. GA has been used with great success to select compact feature or component subsets in bioinformatics [3]. Besides, it has been used to remove features extracted by deep neural networks [4]. In [5], Zhang et al. encoded feature subsets and the matched classifiers as chromosomes, so as to optimize an evolutionary Error-Correcting Output Codes ensemble algorithm.

Memetic algorithms (MAs) are population-based meta-heuristic search approaches that combine population-based global methods with local optimization methods. MAs have been used for feature selection in different research fields. [6] presented a multi-objective feature selection algorithm fusing GA with a local search strategy, aiming to reduce the dimensionality of features and improve classifier performance. [7] used filter methods as the local search heuristic to tune the GA population. [8] integrated GA with a stochastic local search strategy to simultaneously select the optimal feature subsets and the optimal parameters of an SVM.

A memetic automaton is a software entity that autonomously acquires the capability and intelligence of continuous improvement through embedded memes. A meme is a building block of information that is learned independently or interactively. Memetic representations can be divided into symbolic, connectionist, and other categories. To date, symbolic representations include trees, schemata, graphs, and logic. In [9], a memetic crossover operator was introduced to allow individuals to imitate the observed success of other individuals. [10] proposed a memetic GP framework to improve the performance of GP on complicated symbolic regression problems; it generated features with a GP variant named linear imperative programming with differential evolution (LDEP), and then used a gradient-based nonlinear least-squares algorithm to locally finetune the feature weights. [11] extended Multidimensional Multiclass Genetic Programming with Multidimensional Populations (M3GP) to symbolic regression, combining two local search methods with a GP global search. [12] designed three mutation-based operators to inject local search into multi-objective GP.

Memetic algorithms have been applied to feature selection. For example, [13] designed an unsupervised feature selection method based on the memetic algorithm tailored to the characteristics of hyperspectral images, with information retention and redundancy reduction as optimization objectives. [14] used a memetic algorithm to process the feature vectors extracted from macro-expression images, which significantly reduced the number of features and improved the recognition accuracy. [7] used texture and shape features to describe handwritten text images, and used a hybrid feature selection method based on a memetic algorithm to reduce the dimensionality of the input feature vectors. [15] designed a feature selection method based on the memetic algorithm that used proximity graphs to evaluate the quality of a feature subset. However, these methods treated the feature selection problem as one over a one-dimensional feature vector, ignoring the spatial information of the image features.

This paper proposes a tree-type feature subset constructor to learn relationships among various types of MEs, with each subtree representing memes learned from the data. On the one hand, the tree structure has lower space complexity in representing sparse feature subsets of high-dimensional data. On the other hand, compared with the use of mathematical functions in symbolic regression, we use a more adaptive function set to make memetic propagation more efficient.

To address the high-dimensional image feature selection problem, this paper uses GP to search for optimal feature subsets through an evolutionary process. The tree structure of GP preserves the feature selection process, and the final result is presented in a form that suits human interpretation. Our GP automatically picks up effective features by removing redundant and irrelevant ones, so as to achieve high classification performance.

Our GP handles the challenges of feature selection in the ME recognition task by decomposing the three-class ME data into three binary classification problems. The combination of the F1-score and the cross-entropy is set as the fitness function, leading to a better balance of performance across the different classes. For the high-dimensional feature selection problem, our algorithm limits the feature subset size through a sliding window: the candidate features are split into a set of 2 × 2 × 1 units, so that there are 588 candidates. In this way, the search space is cut down. In addition, the depth of the tree structure is limited to 10 to bound the number of features included in each individual. This setting also drives the algorithm to keep only relevant features and effectively reduces the computational cost.

In conclusion, this study addresses the feature selection mechanism in the field of ME recognition, and the main contributions of our work include:

1. We propose an ME recognition algorithm based on GP that performs feature selection on facial features.

2. Feature selection is performed through a sliding window, which reduces the difficulty of the feature selection task.

3. We use GP as a tree-type feature subset constructor, with each subtree representing a meme learned from the data. The tree structure has lower space complexity in representing sparse feature subsets, and the adaptive function set makes memetic propagation more efficient.

2 Literature review

2.1 The ME recognition

A typical ME recognition system can be divided into three stages: image preprocessing, feature extraction, and emotion classification. First, in the image preprocessing stage, image sequences are filtered to select the frames that contain MEs, and redundant and invalid frames are removed. There are two types of frame selection methods: sequence-based and apex-based. Sequence-based methods use the entire image sequence or a segment of the video to extract features [16,17,18,19]. Apex-based approaches consider the motion intensity of the given ME and use only the onset frames with neutral expressions and the apex frames with the highest expression intensity for feature extraction [20,21,22]. Then, the face area in the image must be detected and cropped, the face image normalized, and its size reduced as much as possible while keeping the facial features intact.

In the feature extraction stage, ME features are mainly divided into appearance-based and geometry-based features. Among the appearance-based expression features, local binary patterns (LBP) have been widely used. [23] divided the image into several regions, extracted LBP features from each, and concatenated them into a feature vector describing the dynamic features of the face. An LBP describes textural features with a histogram of binary codes, each obtained by comparing a given pixel with its neighboring pixels. The LBP operator is simple yet powerful, and it is widely used as the baseline method for ME recognition [24]. LBP is robust to monotonic intensity changes, such as illumination and contrast variations, but it is sensitive to scaling, rotation, translation, non-rigid deformation, and small pixel value fluctuations. Handcrafted approaches use the extracted features as inputs for the classifier to perform ME recognition. [19] combined local binary patterns on three orthogonal planes (LBP-TOP), a histogram of oriented gradients, and a histogram of image gradient orientations to produce a feature vector for recognition purposes. In [25], based on LBP-TOP, the extended local binary patterns on three orthogonal planes (ELBPTOP) approach was proposed to explore local binary information in the radial and angular directions of ME video sequences.
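To make the operator concrete, the following is a minimal sketch of the basic 8-neighbor LBP code; the function name and the clockwise neighbor ordering are illustrative choices, not taken from the cited works:

```python
import numpy as np

def lbp_code(img, y, x):
    """Basic 8-neighbour LBP code of the pixel at (y, x): each neighbour
    whose intensity is >= the centre pixel contributes one bit."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    c = img[y, x]
    return sum(int(img[y + dy, x + dx] >= c) << i
               for i, (dy, dx) in enumerate(offsets))
```

A histogram of these codes over an image region then yields the LBP texture descriptor described above.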

In geometry-based feature extraction methods, optical flow describes the motion of an image based on the luminance variations in its pixels, which enables the capture of weak facial movements. Optical flow-based features include Optical Strain Feature (OSF) [26], Optical Strain Weight (OSW) [27], Fuzzy Histogram of Oriented Optical Flows (FHOOF) [28], Main Directional Mean Optical Flow (MDMO) [17], and Bi-Weighted Oriented Optical Flow (Bi-WOOF) [22].

Recently, a series of ME recognition algorithms based on convolutional neural networks (CNNs) has been proposed; they use the hierarchical structure of CNNs to automatically extract important features. Since the training process for neural networks requires a large number of samples, [29] pre-trained ResNet10 on a MaE data set and exploited the relationship between MaEs and MEs for transfer learning. The shallow triple-stream three-dimensional CNN (STSTNet) [8] extracts features from optical strains and horizontal and vertical optical flows, and deploys a neural network to perform classification. [30] used CapsuleNet to directly recognize MEs in apex frames. In our previous work, [20] analyzed several spatio-temporal features and investigated two generative adversarial networks (GANs) for the ME recognition system.

2.2 Genetic programming

Owing to its powerful representation capability, GP provides a flexible tree structure to implement feature selection and classification simultaneously [31]. For example, each GP individual can represent a decision tree tackling a binary classification problem, and the GP evolves an ensemble of binary classifiers to handle the multiclass problem through the One VS. All (OVA) scheme [32]. On the other hand, GP can also perform implicit feature selection and feature construction while training classifiers [31, 33,34,35]. GP has been applied to feature learning in image classification to learn high-level features [36, 37]. In [37], different image descriptors were applied to different regions of an image to extract high-level features. In [38], a three-level tree representation based on GP was defined to perform feature selection, feature construction, and binary classification. In [39], a multi-objective optimization approach, EvoFS, was proposed to select optimal feature subsets with the objectives of minimizing feature subset size and test error, and maximizing the mutual information between each feature and the target. [40] proposed an improved version of the Salp Swarm Algorithm (ISSA) for the feature selection task in wrapper mode; ISSA used Opposition-Based Learning in the initialization phase and a Local Search Algorithm to improve exploitation. [41] proposed a Grey Wolf Optimizer integrated with a Two-phase Mutation (TMGWO) for feature selection, which rectified the problem of being trapped in local optima and the lack of diversity.

However, research on GP in the image classification field has mainly focused on the combination of image descriptors, and feature selection has not been well studied yet. Compared to generating image descriptors, pixel-based image feature selection is more difficult because of the larger search space. In [30], masks were used to represent feature subsets, and set operators served as the function set for the search of optimal combinations. Besides, GP has been used in a series of studies on high-dimensional feature selection [42,43,44]. For example, [42] used a feature-construction embedded approach to investigate the potential of GP for the high-dimensional gene selection problem.

3 Methodology

The framework of the proposed algorithm is shown in Fig. 1. It includes image preprocessing, optical flow feature extraction, GP-based feature selection, and ME classification. The optical flow algorithm proposed in [8] is used for data preprocessing; the resulting optical flow features are then split by sliding windows. The original multiclass classification problem is decomposed into multiple binary classification problems using the One vs. Rest (OVR) scheme, and each binary problem is solved individually by the feature selection algorithm to obtain the corresponding optimal feature subset. Finally, these feature subsets are used to train three classifiers, and the final results are obtained by the majority voting strategy. Each step of this framework is described below.

Fig. 1 Flow diagram of the proposed ME recognition framework

3.1 Optical flow feature extraction

Three data sets are used in the experiments: CASME II [24], SMIC [45] and SAMM [46]. Since the ME data sets were all captured with high-speed cameras, they contain many redundant images depicting the same poses. The preprocessing step eliminates these redundant frames by keeping only the apex frames. An apex frame is the frame with the highest ME intensity in a video sequence. The apex frames of CASME II and SAMM were labeled by experts, while SMIC does not provide apex frame indices. An automatic apex frame spotting framework, the D&C-RoIs method [47], is used to locate the apex frames in SMIC; it uses the LBP operator to compute the differences between frames.
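As an illustration of this idea, the sketch below scores each frame by the distance between its uniform LBP histogram and that of the onset frame, and picks the maximum. This is a simplified, whole-face approximation of D&C-RoIs [47], which applies the same comparison to facial regions of interest; the function name and parameter values are assumptions.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def spot_apex(frames, P=8, R=1):
    """Return the index of the frame whose LBP histogram differs most
    from the onset (first) frame of a grayscale ME sequence."""
    def lbp_hist(img):
        lbp = local_binary_pattern(img, P, R, method="uniform")
        hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2),
                               density=True)
        return hist

    onset_hist = lbp_hist(frames[0])
    diffs = [np.abs(lbp_hist(f) - onset_hist).sum() for f in frames]
    return int(np.argmax(diffs))
```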

The optical flow is the instantaneous rate of motion of a pixel projected onto a two-dimensional plane by a moving object in space. It captures the motion of objects across adjacent frames through pixel changes in the time domain. The motion field of an object in space can be approximated by an optical flow field: a two-dimensional vector field describing the luminance trend of each pixel in an image, where each vector carries the instantaneous motion rate of a pixel. In the proposed ME recognition framework, the optical flow field between the onset frame and the apex frame provides explicit information about the changes in facial motion. The optical flow field represents the direction and velocity of the motion of each pixel between the two frames, and it requires the following assumptions to be satisfied:

(a) Luminance invariance: the luminance of the surface of the moving object remains the same across frames.

(b) Time continuity: time does not cause dramatic changes in the object; the degree of object motion is small and changes gradually as time progresses.

(c) Spatial correlation: changes in the shape of the object can be ignored, and neighboring points exhibit the same motion vectors.

Let \(I\left( {x,y,t} \right)\) represent the luminance of a pixel point (x,y) at a certain frame t, and the basic constraint equation for optical flow is:

$$ \nabla I \cdot \vec{p} + I_{t} = 0 $$
(1)

where \(\nabla I = \left( {I_{x} ,I_{y} } \right)\) contains the partial derivatives of the pixel luminance along the x and y directions, representing the spatial gradient of the luminance. \(I_{t}\) is the partial derivative of the pixel luminance along the time direction t, representing the temporal gradient of the luminance. \(\vec{p}\) is the vector of horizontal and vertical motion of the optical flow:

$$ \vec{p} = \left[ {u = \frac{dx}{{dt}},v = \frac{dy}{{dt}}} \right]^{T} $$
(2)

To quantify the degree of motion of the object, the optical strain method is deployed to extract features. The optical strain measures the degree of facial skin deformation under non-rigid facial motion [34]. Assume that the motion of an object is represented by the two-dimensional vector \({\varvec{u}} = \left[ {u,v} \right]^{T}\). The motion deformation is expressed by:

$$ \varepsilon = \frac{1}{2}\left[ {\nabla {\varvec{u}} + \left( {\nabla {\varvec{u}}} \right)^{T} } \right] $$
(3)

It can also be written in matrix form:

$$ \varepsilon = \left[ {\begin{array}{*{20}l} {\varepsilon_{xx} = \frac{du}{{dx}}} \hfill & {\varepsilon_{xy} = \frac{1}{2}\left( {\frac{du}{{dy}} + \frac{dv}{{dx}}} \right)} \hfill \\ {\varepsilon_{yx} = \frac{1}{2}\left( {\frac{dv}{{dx}} + \frac{du}{{dy}}} \right)} \hfill & {\varepsilon_{yy} = \frac{dv}{{dy}}} \hfill \\ \end{array} } \right] $$
(4)

where (\(\varepsilon_{xx}\), \(\varepsilon_{yy}\)) are the normal strain components on the principal diagonal and (\(\varepsilon_{xy}\), \(\varepsilon_{yx}\)) are the off-diagonal shear strain components. The optical strain magnitude is obtained by:

$$ \left| \varepsilon \right| = \sqrt {\left( {\frac{du}{{dx}}} \right)^{2} + \left( {\frac{dv}{{dy}}} \right)^{2} + \frac{1}{2}\left( {\frac{du}{{dy}} + \frac{dv}{{dx}}} \right)^{2} } $$
(5)

The optical flow components and the optical strain are stacked together to serve as the input for the algorithm. Each video sequence yields a three-dimensional optical flow feature \({\Theta } = \left\{ {u,v,\varepsilon } \right\}\), where \(u\) is the horizontal component of the optical flow field, \(v\) is the vertical component, and \(\varepsilon\) is the optical strain feature.
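Putting formulas (1)-(5) together, the sketch below builds \({\Theta }\) for one onset/apex pair. OpenCV's Farneback estimator is used here as an illustrative stand-in for the optical flow method of [8], and the 28 × 28 target size follows the paper; the function name is an assumption.

```python
import cv2
import numpy as np

def extract_theta(onset_gray, apex_gray, size=28):
    """Build the 3D optical flow feature Theta = {u, v, eps} from an
    onset/apex pair of grayscale face crops."""
    flow = cv2.calcOpticalFlowFarneback(
        onset_gray, apex_gray, None, pyr_scale=0.5, levels=3,
        winsize=15, iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    u, v = flow[..., 0], flow[..., 1]

    # Optical strain magnitude, formula (5):
    # |eps| = sqrt((du/dx)^2 + (dv/dy)^2 + 0.5 * (du/dy + dv/dx)^2)
    du_dy, du_dx = np.gradient(u)   # np.gradient returns (rows, cols)
    dv_dy, dv_dx = np.gradient(v)
    eps = np.sqrt(du_dx**2 + dv_dy**2 + 0.5 * (du_dy + dv_dx)**2)

    theta = np.stack([u, v, eps], axis=-1).astype(np.float32)
    return cv2.resize(theta, (size, size))   # 28 x 28 x 3
```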

3.2 The feature selection algorithm based on GP

GP usually finds optimal feature subsets through symbolic regression, whose purpose is to find the mapping function between the independent and dependent variables; feature selection is then performed implicitly. Evolutionary algorithms usually adopt wrapper and embedded methods for feature selection. The framework of an evolutionary algorithm mainly includes problem encoding, individual recombination (crossover, mutation), fitness evaluation, and offspring selection. GP is often used for feature construction, and there are few studies on GP in the field of feature selection [31, 48].

In GP, individuals are encoded as flexible tree structures, which removes the fixed gene-length limitation of the traditional GA and yields a flexible search capability. However, GP-based methods are usually not applied to large-scale feature selection problems. When the feature dimensionality exceeds 100, the search space of traditional GP is too large to converge within a limited computational budget; when it reaches thousands of dimensions, using symbolic regression directly is impractical. In this paper, GP is designed to select relevant features from the high-dimensional ME data.

3.2.1 The design of individual

In this paper, the ME data sets are transformed into 3D optical flow data. To reduce the search space, a sliding window of size 2 × 2 × 1 is deployed to slice the 3D optical flow data into smaller pieces without overlapping. Since the size of the 3D optical flow is 28 × 28 × 3, the sliding window produces 14 × 14 × 3 image regions, which greatly reduces the difficulty of feature selection: the number of candidates is cut down from 2352 pixels to 588 segments. The size of the sliding window can be adjusted according to the characteristics of the image data.
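A minimal numpy sketch of this segmentation (the function name is illustrative):

```python
import numpy as np

def split_into_segments(theta):
    """Split a 28 x 28 x 3 feature tensor into non-overlapping
    2 x 2 x 1 windows: 14 * 14 * 3 = 588 segments of 4 pixels each."""
    h, w, c = theta.shape                 # (28, 28, 3)
    segs = theta.reshape(h // 2, 2, w // 2, 2, c)
    segs = segs.transpose(0, 2, 4, 1, 3)  # (14, 14, 3, 2, 2)
    return segs.reshape(-1, 4)            # (588, 4)
```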

Our GP uses these segments as the terminal set of the individuals, and uses the union operator, the only element of the function set, to merge segments into subtrees. All selected segments are flattened into a one-dimensional vector at the root node, which is then fed to the classifier. Figure 2 illustrates the individual structure, where ARG1, ARG2, ARG3, and ARG4 represent different segments. Each individual represents a subset of features, and the evolutionary process produces the optimal feature subsets in the final generation.

Fig. 2 An example individual and the corresponding subset
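This individual design can be sketched with the DEAP library as follows, assuming segments indexed 0-587: terminals are singleton sets of segment indices, the only primitive is set union, and compiling a tree yields the selected subset. Names such as `SUBSET` and `SEG{i}` are illustrative.

```python
import operator
from deap import base, creator, gp, tools

N_SEGMENTS = 588  # 14 x 14 x 3 sliding-window segments

pset = gp.PrimitiveSet("SUBSET", 0)
pset.addPrimitive(operator.or_, 2, name="union")  # set union on frozensets
for i in range(N_SEGMENTS):
    pset.addTerminal(frozenset([i]), name=f"SEG{i}")

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
# Ramped half-and-half initialisation with tree heights in [1, 3]
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=3)
toolbox.register("individual", tools.initIterate, creator.Individual,
                 toolbox.expr)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("compile", gp.compile, pset=pset)

def selected_segments(individual):
    """Evaluate the tree: a frozenset of the chosen segment indices."""
    return toolbox.compile(individual)
```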

3.2.2 Evolutionary process

In our algorithm, the population size is set to 50, and the first generation is randomly generated by the ramped half-and-half method. To limit the feature subset sizes in the first population, the heights of the generated individuals are restricted to [1, 3]. During the iterative process, crossover and mutation operations are performed on the individuals, and the individuals entering the next generation are selected by tournament selection.

Figures 3, 4 and 5 show examples of the reproduction operators used in our method. The crossover operator is one-point crossover, which randomly selects a node as the crossover point in each of the two individuals and then swaps the subtrees rooted at the crossover points to produce the offspring. The mutation operation includes the uniform mutation and shrink mutation operators. Uniform mutation first selects a random mutation point in the individual, then replaces the subtree rooted at that point with a new subtree generated by the Full method. Shrink mutation randomly chooses an internal node and replaces it with one of its arguments. After the crossover and mutation operations, the generated individuals are treated as offspring. The crossover probability is 100%, and the probabilities of shrink mutation and uniform mutation are both set to 10%.

Fig. 3 An example of the One-Point Crossover operator

Fig. 4 An example of the Uniform Mutation operator

Fig. 5 An example of the Shrink Mutation operator

To ensure the convergence speed of the population, the optimal individual is carried over from the current generation to the next during the iterative process; research has shown that evolutionary algorithms achieve better convergence through such elite retention strategies [49]. Our GP terminates at the 50th generation, and the historically optimal individual is returned as the output.
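A sketch of the generational loop under the settings stated above (population 50, 50 generations, 100% one-point crossover, 10% uniform and shrink mutation, elite retention). It continues the toolbox from the sketch in Sect. 3.2.1 and takes the fitness function of Sect. 3.2.3 as a parameter; the tournament size of 3 is an assumption, as the paper does not specify it.

```python
import random
from deap import gp, tools

toolbox.register("select", tools.selTournament, tournsize=3)
toolbox.register("mate", gp.cxOnePoint)
toolbox.register("expr_mut", gp.genFull, min_=1, max_=2)
toolbox.register("mut_uniform", gp.mutUniform,
                 expr=toolbox.expr_mut, pset=pset)
toolbox.register("mut_shrink", gp.mutShrink)

def evolve(evaluate, n_gen=50, pop_size=50):
    pop = toolbox.population(n=pop_size)
    hof = tools.HallOfFame(1)             # the historically best individual
    for ind in pop:
        ind.fitness.values = (evaluate(ind),)
    hof.update(pop)
    for _ in range(n_gen):
        offspring = [toolbox.clone(ind)
                     for ind in toolbox.select(pop, pop_size)]
        for i in range(0, pop_size - 1, 2):       # crossover prob. 100%
            offspring[i], offspring[i + 1] = toolbox.mate(
                offspring[i], offspring[i + 1])
        for i in range(pop_size):                 # 10% each mutation
            if random.random() < 0.1:
                offspring[i], = toolbox.mut_uniform(offspring[i])
            if random.random() < 0.1:
                offspring[i], = toolbox.mut_shrink(offspring[i])
        for ind in offspring:
            ind.fitness.values = (evaluate(ind),)
        hof.update(offspring)
        offspring[0] = toolbox.clone(hof[0])      # elite retention
        pop = offspring
    return hof[0]
```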

3.2.3 Fitness

In evolutionary algorithms, the fitness function plays an important role in guiding the direction of population evolution and assessing the problem-solving ability of each individual. Our fitness function consists of two measurements, the Macro-F1 score and the cross-entropy, as defined by formula (6). For each individual, fivefold cross-validation is used to evaluate the contribution of the selected feature subset on the training set. Individuals with greater fitness values perform better on the validation set, and their genes have a higher probability of being retained in the next generation.

$$ Fitness = Macro\;F1\;score - Cross\;Entropy $$
(6)

The Macro-F1 score is an evaluation metric for classification problems. Although accuracy is often used as an evaluation metric for such problems, it cannot reflect performance under class imbalance, when the prediction results are biased toward the majority class. To cope with the class imbalance problem, the Macro-F1 score is deployed to provide a balanced evaluation of the precision and recall values of the different classes. Suppose there are n classes. \({\text{TP}}_{{\text{i}}}\) (true positives) counts the samples of class i that the model correctly predicts as positive. \({\text{FN}}_{{\text{i}}}\) (false negatives) counts the samples that are actually positive for class i but predicted as negative. \({\text{FP}}_{{\text{i}}}\) (false positives) counts the samples predicted as positive for class i that are actually negative. The Macro-F1 score is calculated as follows:

$$ recall_{i} = \frac{{TP_{i} }}{{TP_{i} + FN_{i} }} $$
(7)
$$ precision_{i} = \frac{{TP_{i} }}{{TP_{i} + FP_{i} }} $$
(8)
$$ Macro{ }F1{ }score = \mathop \sum \limits_{i = 1}^{n} \frac{{2 \times precision_{i} \times recall_{i} }}{{n \times \left( {precision_{i} + recall_{i} } \right)}} $$
(9)

Cross-entropy measures the similarity between the predicted probability distribution and the true probability distribution, and it is most often used in classification problems. Assuming there are n classes, the cross-entropy is calculated as follows:

$$ Cross{ }Entropy = { } - \mathop \sum \limits_{i = 1}^{n} y_{i} \times \log \left( {\widehat{{y_{i} }}} \right) $$
(10)

where \(y_{i}\) denotes the ground-truth probability of class i, and \(\widehat{{y_{i} }}\) denotes the predicted probability. Both take values within [0, 1].
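A sketch of formula (6) under these definitions, assuming segments arranged as in Sect. 3.2.1 (shape `(n_samples, 588, 4)`) and using sklearn's `log_loss`, which averages the cross-entropy over the validation samples; the factory-function structure is an implementation choice, not from the paper.

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def make_evaluate(X_segments, y):
    """Fitness = macro F1 - cross entropy, estimated by fivefold CV on
    the training set of one binary (OVR) sub-problem."""
    def evaluate(individual):
        idx = sorted(toolbox.compile(individual))      # selected segments
        X = X_segments[:, idx, :].reshape(len(y), -1)  # flatten the subset
        f1s, ces = [], []
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        for tr, va in skf.split(X, y):
            clf = SVC(probability=True).fit(X[tr], y[tr])
            f1s.append(f1_score(y[va], clf.predict(X[va]),
                                average="macro"))
            ces.append(log_loss(y[va], clf.predict_proba(X[va]),
                                labels=clf.classes_))
        return np.mean(f1s) - np.mean(ces)
    return evaluate
```

For one OVR sub-problem, `evolve(make_evaluate(X_segments, y_binary))` then returns the best feature-subset tree found.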

3.3 Emotion classification

In the experiments, a Support Vector Machine (SVM) [50] is used as the base classifier for both fitness evaluation and expression classification. After obtaining the 3D optical flow data, the OVR method is used to decompose the multiclass ME recognition problem into three binary classification problems. For each binary problem, GP selects the feature subset related to the class of interest. The obtained optimal feature subsets are then fed into the classifiers, whose outputs form the final prediction according to the OVR strategy.
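A sketch of the training and fusion steps, where `subsets[c]` holds the segment indices selected by GP for class c. Resolving the three binary decisions by the largest SVM decision value is our reading of the voting fusion, not a detail stated in the paper.

```python
import numpy as np
from sklearn.svm import SVC

CLASSES = ["negative", "positive", "surprise"]

def train_ovr(X_segments, y, subsets):
    """One binary SVM per class, each on its own GP-selected subset."""
    models = {}
    for c in CLASSES:
        Xc = X_segments[:, subsets[c], :].reshape(len(y), -1)
        models[c] = SVC().fit(Xc, (y == c).astype(int))
    return models

def predict_ovr(models, X_segments, subsets):
    """Fuse the three one-vs-rest decisions into a final label."""
    n = len(X_segments)
    scores = np.column_stack([
        models[c].decision_function(
            X_segments[:, subsets[c], :].reshape(n, -1))
        for c in CLASSES])
    return np.array(CLASSES)[scores.argmax(axis=1)]
```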

To compare the performance of the proposed algorithm fairly, the LOSO cross-validation method is deployed. In each iteration, the expression data of one subject is retained as the test data, and the remaining subjects are treated as the training set. This step is repeated until every subject has been tested. Finally, the accuracy of the model is obtained by averaging the results over all the test data.
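Continuing the sketch above, the LOSO protocol maps directly onto sklearn's LeaveOneGroupOut, with the subject ID of each sample as the group label:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_evaluate(X_segments, y, subjects, subsets):
    """Hold out one subject per fold, train on the rest, pool predictions."""
    y_true, y_pred = [], []
    for tr, te in LeaveOneGroupOut().split(X_segments, y, groups=subjects):
        models = train_ovr(X_segments[tr], y[tr], subsets)
        y_pred.extend(predict_ovr(models, X_segments[te], subsets))
        y_true.extend(y[te])
    return np.array(y_true), np.array(y_pred)
```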

4 Experimental settings

4.1 Data set

Three ME data sets are used in this study: SMIC, SAMM, and CASME II. The details of these data sets are shown in Table 1. In CASME II, happiness is labeled as the positive class, and disgust and repression are labeled as the negative class. Emotions such as anger, disgust, contempt, fear, and sadness in SAMM are labeled as the negative class. After relabeling the ME classes, it is observed that the three data sets are class-imbalanced, as there are many more samples in the negative class than in the positive and surprise classes.

Table 1 Details of the three ME data sets

In an ME video sequence, the first and the last frames are called the onset frame and the offset frame respectively, and the frame with the highest facial expression intensity is the apex frame. The SMIC data set does not provide the locations of the apex frames, so they need to be detected using related techniques.

4.2 Performance metrics

The three ME data sets are combined to form the Full data set, which contains 248, 103 and 83 samples in the negative, positive and surprise classes respectively. The performance metrics deployed in this study include the conventional measures precision, recall, and accuracy. To avoid bias caused by the class imbalance problem, the F1 score, unweighted F1 score (UF1), and unweighted average recall (UAR) are also deployed. UAR is calculated in the same way as the Recall given in formula (12).

$$ Accuracy = { }\frac{{\mathop \sum \nolimits_{\alpha = 1}^{M} \mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } }}{{\mathop \sum \nolimits_{\alpha = 1}^{M} \mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } + \mathop \sum \nolimits_{\alpha = 1}^{M} \mathop \sum \nolimits_{\beta = 1}^{k} FP_{\alpha }^{\beta } }} $$
(11)
$$ Recall = \mathop \sum \limits_{\alpha = 1}^{M} \frac{{\mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } }}{{M \times \left( {\mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } + \mathop \sum \nolimits_{\beta = 1}^{k} FN_{\alpha }^{\beta } } \right)}} $$
(12)
$$ Precision = \mathop \sum \limits_{\alpha = 1}^{M} \frac{{\mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } }}{{M \times \left( {\mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } + \mathop \sum \nolimits_{\beta = 1}^{k} FP_{\alpha }^{\beta } } \right)}} $$
(13)
$$ F1{ }Score = \frac{2 \times Precision \times Recall}{{Precision + Recall}} $$
(14)
$$ Recall_{\alpha } = \frac{{\mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } }}{{\mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } + \mathop \sum \nolimits_{\beta = 1}^{k} FN_{\alpha }^{\beta } }} $$
(15)
$$ Precision_{\alpha } = \frac{{\mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } }}{{\mathop \sum \nolimits_{\beta = 1}^{k} TP_{\alpha }^{\beta } + \mathop \sum \nolimits_{\beta = 1}^{k} FP_{\alpha }^{\beta } }} $$
(16)
$$ UF1{ }Score = \frac{{\mathop \sum \nolimits_{\alpha = 1}^{M} \frac{{2 \times Precision_{\alpha } \times Recall_{\alpha } }}{{Precision_{\alpha } + Recall_{\alpha } }}}}{M} $$
(17)

Here M represents the number of classes and k represents the number of subjects. \(TP_{\alpha }^{\beta }\) counts the true positives of subject β in class α.
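With the pooled LOSO predictions, these metrics reduce to standard macro-averaged scores; a sketch with sklearn, where UF1 is the macro-averaged F1 and UAR the macro-averaged recall:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def report(y_true, y_pred):
    """Class-imbalance-aware summary used in the experiments."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "UF1": f1_score(y_true, y_pred, average="macro"),
        "UAR": recall_score(y_true, y_pred, average="macro"),
    }
```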

5 Results and discussion

This section demonstrates the performance of the proposed algorithm and compares it with other methods. The selected features are shown, and the feature selection process and its results are further analyzed. All experiments are carried out on macOS Mojave 10.14.6, with an Intel Core i7-9750H CPU and 16 GB of memory. In the following experiments, SVM is deployed as the base learner with default parameters.

5.1 Comparison with other algorithms

Table 2 lists the comparison between our proposed algorithm and other error-correcting output coding (ECOC) algorithms. All algorithms use the 3D optical flow data obtained in Sect. 3.1 as inputs. With the GP-based feature selection, our algorithm clearly outperforms the traditional OVO and OVR algorithms. The accuracy, F1 score, UF1 score, and UAR of our proposed algorithm on the Full data set are 0.7636, 0.7226, 0.7204, and 0.7063, respectively, all better than those of the comparison algorithms. Since our algorithm uses the OVR approach to decompose the multiclass classification problem, the effectiveness of using GP for feature selection is confirmed. Compared with the OVR algorithm, the algorithm proposed in this paper improves the accuracy by 24%, the F1 score by 27.8%, the UF1 score by 27.96%, and the UAR by 22.5%.

Table 2 Recognition results of different algorithms using optical flow features

Table 3 compares our algorithm with other recently proposed meta-heuristic feature selection algorithms. Our method achieves the best overall performance on the Full data set. In addition, our GP distinguishes the differences between class pairs more precisely and performs better on the minority classes of the imbalanced data.

Table 3 Recognition results of different feature selection algorithms

Table 4 shows the parameters, running time and size of the selected feature subset for each algorithm. From the table, ISSA requires the least running time, because it is the only algorithm that does not use the k-fold cross-validation scheme for fitness evaluation. Although the cross-validation scheme alleviates the overfitting problem, it greatly increases the running time. In addition, TMGWO mutates the features one by one in its two-phase mutation operation, which takes much more time during evaluation. Both EvoFS and TMGWO limit the size of the feature subsets, but the numbers of features they choose are quite different: EvoFS is a multi-objective genetic algorithm, while TMGWO directly sets the feature subset size and the misclassification rate as the fitness function. Our algorithm does not compress the feature subset size further because it does not include a penalty term to control the scale of the feature subset. This shows that multi-objective evolutionary algorithms have more potential for eliminating redundant and irrelevant features. In general, our GP achieves a good balance between running time, classification results and feature subset size.

Table 4 The parameters of k-fold cross-validation, running time (seconds) and the average number of selected features (#F) of different algorithms

Table 5 lists the classification results of our proposed algorithm and other ME recognition algorithms on the ME data sets. In the experiments, methods 1-4 are traditional feature extraction methods. Among them, LBP-TOP is an extension of LBP that can effectively capture features in the temporal domain, and it has been widely used as the baseline for expression recognition tasks. The remaining methods are classical deep learning models. Except for the STSTNet model, all of them use data augmentation to avoid overfitting; the utilized augmentation methods include horizontal flipping, randomly cropping the top and bottom by 5%, and pretraining on the ImageNet data set.

Table 5 Recognition results of different algorithms

The results in Table 5 show that our proposed algorithm outperforms the other algorithms. GP yields a relatively balanced improvement in the recognition of the three types of MEs. Our method outperforms the other algorithms on the Full data set, SMIC, and CASME II. It also performs well on SAMM, the most difficult data set in this recognition task. The performance of VGG16 is better than that of VGG19 even though VGG19 has three more layers. This indicates that the amount of ME data is insufficient and overfitting may occur in deeper networks, so the shallower network performs better.

Table 6 lists the confusion matrices of our GP approach on all data sets. For further comparison, Tables 7 and 8 provide the confusion matrices of the original OVR method and VGG16, respectively. Comparing the OVR method with our algorithm, the proposed algorithm achieves significant improvements in recognizing the positive and surprise classes, while the recognition of the negative class decreases slightly. This is caused by the uneven distribution of samples in the data set: there are 248 samples in the negative class, about two and three times the numbers of samples in the positive and surprise classes, respectively. The OVR method predicts most samples as negative, so its recall for the negative class is promoted at the cost of deteriorating recall for the other two classes.

Table 6 The confusion matrices for our GP on the Full, SMIC, CASME II, and SAMM data sets
Table 7 The confusion matrices for the OVR scheme on the Full, SMIC, CASME II, and SAMM data sets (measured in terms of the recognition rate %)
Table 8 The confusion matrices for VGG16 on the Full, SMIC, CASME II, and SAMM data sets

The confusion matrices show that GP performs best in recognizing the minority classes. A recall rate of 60.55% is achieved on the hardest positive class, and a recall rate of 63.86% on the surprise class with the smallest number of samples, both the best results among the comparison algorithms. After feature selection, the recall scores of the minority classes are greatly improved, while the precision of the majority class increases from 73.85% to 79.78%, mitigating the class imbalance problem. From the confusion matrix of VGG16, it can be seen that our approach performs better than VGG16 in most cases; only the recall of the positive class on SAMM is lower. Since the positive class of SAMM contains only 26 samples, the two methods correctly predict 10 and 11 samples respectively, a small difference.

5.2 Evolutionary process analysis

Figure 6 depicts the trends of the UF1 score, UAR, and accuracy obtained by our algorithm on the Full data set during the GP evolution process. At the beginning of evolution, all individuals are randomly generated, so the feature subsets are randomly selected, which leads to poor initial performance. Once the evolution process begins, the algorithm searches for valid feature subsets, leading to a rapid improvement in accuracy. Finally, after 25 iterations, the accuracy improves slowly. During the evolution process, the UF1 score and UAR fluctuate similarly to the accuracy. This illustrates that our algorithm can handle the class imbalance problem, with prediction results not biased towards the majority class, and validates the effectiveness of GP as a feature selection method for ME recognition.

Fig. 6 The evolutionary curve of our GP on the Full data set

Figures 7, 8 and 9 depict the curves of classification accuracy, UF1 score, and UAR over the 50 evolution generations. The use of the LOSO method to evaluate generalization performance, together with the aggressive evolution strategy, leads to fluctuations in the evolution curves. Nevertheless, after 40 iterations, the performance of our algorithm is better than those of the other algorithms. The CASME II and SAMM data sets are relatively easier for the classification task, and our algorithm reaches its best performance after about 20 generations. Classification is harder on the SMIC data set, where the best performance is achieved by the 50th generation. The size of the feature subset increases gradually during evolution: CASME II and SAMM require only a small number of features, while SMIC requires more features to yield good classification performance.

Fig. 7 The evolutionary curve of our GP on the SMIC data set

Fig. 8 The evolutionary curve of our GP on the CASME II data set

Fig. 9 The evolutionary curve of our GP on the SAMM data set

5.3 Analysis of feature selection results

The size of the original optical flow feature is 28 × 28 × 3; it is processed by the 2 × 2 × 1 sliding window to produce 14 × 14 × 3 facial regions. From the optimal individuals obtained on the Full data set, the selection frequency of the different facial regions is illustrated in Fig. 10. Our GP selects 20 facial regions on average, reducing the dimensionality of the optical flow data to 237 dimensions, only 10.09% of the original 2352. Despite this high degree of dimensionality reduction, our framework maintains excellent discriminative ability, as evidenced by its high overall performance. The results further demonstrate the effectiveness of the proposed feature selection algorithm.

Fig. 10 The frequency distribution of the features selected by our GP

Table 9 shows the average ratios of the number of selected features to the number of all features for the different data sets and classes. In the table, the first three rows represent the proportions of features selected for the different ME classes, and the fourth row represents the ratio of the integrated feature subset, combining the selection results of the three classes, to all features. It is obvious that the feature selection results for the three types of MEs are class-related, and only a few features are selected repeatedly across different types of MEs. For the Full data set, 0.57% of the features are selected repeatedly; on the SMIC, CASME II, and SAMM data sets, the percentages are 0.63%, 0.56%, and 0.51%, respectively. The results show that GP is able to extract features related to the given class, discovering the characteristics of different MEs.

Table 9 Feature selection results for different classes and different data sets (measured as ratios)

Figure 11 shows the average size of the feature subsets selected by the 50th-generation populations for different classes and data sets. The results show that our algorithm selects more features for the negative class. On average, each population selects 95 features and the best individual selects 83 features. This shows that our algorithm can control the size of the feature subsets well without adding a penalty function.

Fig. 11 The average feature subset size of the last population for different classes and different data sets

In addition, GP has particular advantages for image feature selection. If a traditional GA is used to select features for MEs, the image must be stretched into a one-dimensional vector, and the GA thereby loses the spatial information of the image. In contrast, our GP method uses a growth-based approach to continuously expand and modify the tree structure. We find that the subtree structures contain certain spatial information: a subtree may surround a certain facial region, such as the nose, eyes, or lips, which coincides with the action units proposed by experts. The mutation operation in GP is then equivalent to recombining these facial regions. The tree structure of GP also encodes the information for constructing a feature subset. In this way, our GP implements the memetic automaton through building blocks of information, whereas GA lacks this capability, since its linear individuals cannot represent the connections among different features.

Figure 12 shows the key features selected by our GP for different types of MEs. The feature subset is mapped onto the facial image using squares to indicate the selected facial regions. The red, green, and blue squares indicate that the algorithm selects the horizontal component of the optical flow field, the vertical component of the optical flow field, and the optical strain, respectively. The features selected in Fig. 12c are located around the eyes and eyebrows, on the sides of the lips, and under the chin. According to psychologists' analysis, when a person is surprised, the face is characterized by raised eyebrows, open eyes, an open mouth, and a drooping chin. This coincides with the selected features, as most of them are located on the sides of the eyes, nose, and mouth; their subtle changes reflect the MEs of the subjects. Figure 13 shows one of the best individuals obtained from the feature selection process, corresponding to Fig. 12a. It uses 43 nodes to represent a feature subset containing 19 blocks.

Fig. 12 The feature selection results

Fig. 13 An example of the best individual

5.4 Analysis of the impact of the sliding window size

Table 10 demonstrates the effect of the sliding window size on classifier performance. The shape of the sliding window is height \(\times\) width \(\times\) channel. In this paper, the width and height of the sliding window are set to 2 \(\times\) 2, so each sliding window contains 4 pixels; in this way, the search space of GP is reduced by grouping neighboring pixels. If the width and height of the sliding window were enlarged, the size of the terminal set would be reduced further, but more redundant features might be included. The number of channels of the sliding window is set to 1 because the different channels of the optical flow features encode different directions of motion information. A larger sliding window deteriorates the classification performance of the algorithm and increases the training time of the classifier.

Table 10 The results of different sliding windows

As shown in Table 10, the ME recognition performance decreases when the sliding window is set to 2 × 2 × 3, as features from all three channels are then selected simultaneously. Increasing the sliding window size degrades the classification performance of the algorithm.

6 Conclusion

In this paper, we design a novel GP-based feature selection algorithm for the ME recognition task. The sliding window generates the candidate feature segments that serve as the terminal set of the GP, so that our GP selects a set of facial regions based on the optical flow features. The fitness function combines the F1-score and the cross-entropy, which guides the GP to better handle the class imbalance problem. The proposed GP selects the optical flow features and handles the three-class ME recognition problem via the OVR decomposition scheme; the final solution fuses the three binary classifiers through the majority voting strategy. The experimental results demonstrate that the classification performance of our GP is higher than that of the other algorithms in most cases.

To the best of our knowledge, this is the first time that GP has been deployed to tackle the ME recognition problem. As the results confirm the potential of GP in this research field, we believe the topic is worth further effort. Future research directions include the design of more non-terminal operators for the GP framework and more effective class decomposition methods for imbalanced data sets. Furthermore, judging from the successful applications of other memetic algorithms [57,58,59], the elaborate design of new memetic algorithms would offer promising solutions in this field.