1 Introduction

Multimedia data, with its diverse forms of expression and distinctive aesthetic appeal, is widely used in fields such as business communication [1, 4, 17]. In the era of big data, manual assessment of the aesthetic appeal of multimedia data is not only inefficient but also costly, creating an urgent need for automated aesthetic evaluation methods [3, 5, 18, 25]. However, traditional aesthetic evaluation methods, particularly those focused on image aesthetics, typically consider only the overall perceptual experience of humans while neglecting individual preferences and subjective characteristics. The aesthetic quality of an image is largely subjective: different individuals may assign different scores to the same image depending on their personal preferences or emotions. Consequently, generic image aesthetic assessment approaches tend to overlook subjective factors such as individual feelings and personality, and therefore lack personalization. Furthermore, design experts have long recognized the close relationship between image aesthetics and brand image design, so combining image aesthetic evaluation with brand image design is a natural way to further strengthen brand image shaping. A deep learning-based automated approach that accounts for individual differences and subjective factors is therefore needed to evaluate the aesthetic appeal of multimedia data more comprehensively and accurately, and integrating such evaluation with brand image design promises better results.

Although deep learning models have achieved success in various fields [8, 10, 12], this success has come at the cost of increasingly large models that require substantial computational and data resources. In practical computer vision applications, however, especially edge applications such as mobile devices, resource constraints must be taken into account: limited computing power, strict real-time requirements, and insufficient data. Resource-constrained deep learning theories, methods, and applications therefore deserve adequate attention. Deep learning can be made more efficient for computer vision in several ways, such as reducing the required dataset size, the memory footprint, or the training and inference time. Most existing research focuses on improving the efficiency of deep learning methods in general; however, the resources available on real-world mobile devices vary significantly, and an approach that is effective under one resource budget may be entirely ineffective under another.

Therefore, this paper proposes Emo-AEN, which combines brand image design with attention mechanisms for image aesthetic evaluation and fully considers the subjective features of brand image design. The key focus of Emo-AEN lies in exploring the intrinsic relationship between brand image design and image aesthetics. Attention mechanisms provide a promising solution: they filter a small set of important information out of a large amount of data, enabling deep learning algorithms to analyze and concentrate on this crucial information. The self-attention mechanism, a variant of the attention mechanism, better captures the internal correlations of data or features and reduces reliance on external information. Inspired by this, we design a fusion feature extraction algorithm based on brand image and use self-attention to merge image aesthetics and design information; a model lightweighting strategy is further applied to ensure the efficiency and applicability of the algorithm, particularly in resource-constrained mobile device environments.

The main contributions of the study can be listed as follows:

  • By extracting the fused information of image aesthetics and brand image, we propose Emo-AEN, a brand image-based fusion approach for image aesthetic evaluation.

  • We introduce self-attention mechanisms to capture the intrinsic information in the fusion of image aesthetics and design elements.

  • Experimental results demonstrate the competitiveness of our method compared to existing approaches. Additionally, a comprehensive discussion on the relationship between brand image design and image aesthetics is provided.

The remainder of this paper is organized as follows. Section 2 reviews related work, and Section 3 elaborates on the framework and details of the proposed method. Section 4 presents the experimental results and compares them with baseline methods. Finally, Section 5 concludes the paper.

2 Related work

2.1 Image aesthetic assessment

Early image aesthetic assessment algorithms often relied on carefully selected features based on expert knowledge. Yan et al. [2] assessed image aesthetics by leveraging photographic expertise to construct high-level features such as color, contrast, exposure, and simplicity. Luo et al. [11] created a large-scale dataset for aesthetic visual analysis (AVA) comprising over 250,000 aesthetically tagged images. While these methods achieved success in image aesthetic assessment, they suffer from certain limitations: on the one hand, designing handcrafted features requires expertise in photography; on the other hand, simple features alone may not fully capture the nuances of image aesthetics, which limits the generalizability of these approaches. Subsequently, the progress of image aesthetic assessment has been propelled by the advent of Convolutional Neural Networks (CNNs). Kong et al. [13] proposed a ranking network that incorporates adaptive image attributes and content information for aesthetic regression ranking. Jin et al. [14] introduced a novel CNN model, ILGNet, which connects Inception modules with shallow and global features to classify images by aesthetic quality. Compared with manual feature design, deep learning methods are more straightforward and yield a more comprehensive representation of images, greatly improving the accuracy of image aesthetic assessment.

2.2 Image emotion

Research on image emotion classification holds great potential and research value. During the initial stages, the classification of image emotions heavily relied on manually crafted features.

Although manual feature extraction and classification techniques have advantages when only a few samples are available, they become increasingly limited as the amount of data grows. With the growing availability of Internet data and increases in computing power, deep learning has gradually become the mainstream technology for image classification. Compared with expert-designed handcrafted features, deep learning methods can automatically extract effective features.

2.3 Attention mechanism

The concept of the attention mechanism was originally introduced in the domain of visual image processing. Subsequently, similar extensions based on the attention mechanism have been adopted in various natural language processing tasks [16].

The attention mechanism allows a computer vision system to concentrate quickly and effectively on key areas, mirroring the selective focus of the human visual system. Its fundamental principle is to identify pertinent information while suppressing irrelevant information, and the outcome is commonly visualized as a probability map or represented as a probability feature vector. Attention mechanisms can be broadly categorized into three types: the channel attention model, the spatial attention model, and the mixed channel-spatial attention model. The attention mechanism can be used to weight information and to promote the end-to-end fusion of different information, such as multi-modal and multi-source information fusion.

Fig. 1 Framework of the proposed method

3 Proposed method

We propose Emo-AEN, a method based on the internal fusion of emotion and aesthetic features. The framework consists of five parts: the backbone network module, the internal fusion module, the self-attention module, the prediction module, and the pruned model. First, aesthetic and emotional features are extracted from images by the image aesthetics and emotion backbone networks. The features extracted by the two backbones are then fused internally, and the fused features are fed into the self-attention mechanism to mine the inherent relationship between them. Next, the features are passed to the prediction network to complete the aesthetic prediction task. Finally, the model undergoes fine-grained pruning to reduce redundancy.

3.1 Backbone network

The architecture of the network proposed in this paper is shown in Fig. 1, which relies on ResNet as the backbone network to extract features.

ResNet [20] has been widely used in various tasks, including image classification, object detection, semantic segmentation, and others. Owing to its excellent feature extraction performance, the pre-trained ResNet50 is used in the proposed method to separately extract the emotional and aesthetic features of an image. ResNet50 is composed of four basic modules, named layer_1, layer_2, layer_3, and layer_4. Each basic module is a stack of residual blocks, known as Bottlenecks, each of which consists of three convolutions (\(1\times 1\), \(3\times 3\), and \(1\times 1\)) and a skip connection.
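As an illustration only (not the authors' released implementation), the following PyTorch sketch shows how such a stage-wise ResNet50 backbone could be built with torchvision (version 0.13 or later assumed for the weights API); the tensor shapes in the comments correspond to a \(224\times 224\) input.

```python
# Minimal sketch: a pretrained ResNet50 backbone returning per-stage features.
import torch
import torchvision.models as models


class ResNet50Backbone(torch.nn.Module):
    """Returns the feature maps of layer_1-layer_4 for later fusion."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1,
                                        resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)   # (B,  256, 56, 56) for a 224x224 input
        f2 = self.layer2(f1)  # (B,  512, 28, 28)
        f3 = self.layer3(f2)  # (B, 1024, 14, 14)
        f4 = self.layer4(f3)  # (B, 2048,  7,  7)
        return f1, f2, f3, f4


# Two independent backbones, one for aesthetics and one for emotion.
aesthetic_net, emotion_net = ResNet50Backbone(), ResNet50Backbone()
```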

3.2 Internal fusion

To explore the correlation between emotion and image aesthetics, we adopt an internal fusion mechanism. The images are first passed through the image aesthetics and emotion backbone networks, and the features within each basic module are standardized to a uniform size. The features acquired at each layer are then fused internally by point-wise multiplication, which can be mathematically represented as:

$$\begin{aligned} a\cdot b=\sum _{i=1}^{n}a_{i}b_{i}=a_{1}b_{1}+a_{2}b_{2}+\cdots +a_{n}b_{n}, \end{aligned}$$
(1)

where the vectors a and b are:

$$\begin{aligned} a=[a_{1},a_{2},\ldots ,a_{n}], \end{aligned}$$
(2)
$$\begin{aligned} b=[b_{1},b_{2},\ldots ,b_{n}]. \end{aligned}$$
(3)
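For concreteness, the sketch below illustrates one plausible reading of this fusion step: the emotion feature map is resized to the spatial size of the aesthetic feature map and the two are combined by element-wise (Hadamard) multiplication, which applies Eq. (1) position by position. The tensor shapes and the bilinear resizing are assumptions made for illustration, not details fixed by the paper.

```python
# Illustrative sketch of the internal fusion step for feature maps of shape (B, C, H, W).
import torch
import torch.nn.functional as F


def internal_fusion(aes_feat: torch.Tensor, emo_feat: torch.Tensor) -> torch.Tensor:
    # Match spatial sizes before fusing (assumed bilinear resizing).
    if emo_feat.shape[-2:] != aes_feat.shape[-2:]:
        emo_feat = F.interpolate(emo_feat, size=aes_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
    return aes_feat * emo_feat  # element-wise product, one reading of Eq. (1)


aes = torch.randn(2, 256, 56, 56)   # e.g. layer_1 aesthetic features
emo = torch.randn(2, 256, 56, 56)   # e.g. layer_1 emotion features
fused = internal_fusion(aes, emo)   # (2, 256, 56, 56)
```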

3.3 Self-attention mechanism

Self-attention mechanisms are an important variant of attention mechanisms in deep learning. Their purpose is to decrease reliance on external information and make maximal use of the information inherent in the features themselves. Originally used in natural language processing to handle inputs consisting of multiple vectors of varying sizes with potential relationships between them, the self-attention mechanism has also found success in computer vision tasks such as object detection and semantic segmentation.

The Transformer is one of the most prominent examples of the self-attention mechanism. Its architecture is composed of two main components, an encoder and a decoder. Each encoder block contains a multi-head attention sublayer and a feedforward neural network sublayer, and the full encoder stacks N such blocks. During self-attention, the original feature map is projected into three separate branches: Q (Query), K (Key), and V (Value). First, the correlation weight matrix between Q and K is computed. Second, the weight matrix is normalized with a softmax operation. Finally, the weights are applied to V, which effectively incorporates global context into the modeling process. Self-attention can be regarded as a refinement of the attention mechanism rather than a departure from it, since it reduces the dependence on external information and is more effective at extracting internal correlations within data or features. The calculation is as follows:

$$\begin{aligned} Attention(Q,K,V) = softmax\left( \frac{QK^{T}}{\sqrt{d_k}}\right) V, \end{aligned}$$
(4)

where \(d_k\) represents the key’s dimension. The attention mechanism then focuses on different representation subspaces at different locations to establish a multi-head attention model. This model comprises multiple self-attention blocks and is computed as follows:

$$\begin{aligned} MultiHead(Q,K,V) = Concat(head_1,\cdots ,head_h)W^{o}, \end{aligned}$$
(5)

where

$$\begin{aligned} head_i=Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}), \end{aligned}$$
(6)

In Eq. 6, the weight matrices \(W_{i}^{Q}\in R^{d_{model}\times d_{k}}\), \(W_{i}^{K}\in R^{d_{model}\times d_{k}}\), \(W_{i}^{V}\in R^{d_{model}\times d_{v}}\), and \(W^{o}\in R^{hd_{v}\times d_{model}}\) are trainable parameters. Here, \(d_{v}\) denotes the dimension of the value, h is the number of parallel attention heads, and \(d_{model}\) is the dimension of the Q/K/V vectors in the multi-head attention model. As this study uses only the Transformer encoder, the decoder is not discussed further.
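The following is a didactic re-implementation of Eqs. (4)-(6) in PyTorch; the dimensions (\(d_{model}=512\), \(h=8\)) and the flattened \(7\times 7\) token layout are illustrative assumptions, and an optimized equivalent is available as torch.nn.MultiheadAttention.

```python
# Didactic multi-head self-attention matching Eqs. (4)-(6).
import math
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)  # stacks all W_i^Q
        self.w_k = nn.Linear(d_model, d_model)  # stacks all W_i^K
        self.w_v = nn.Linear(d_model, d_model)  # stacks all W_i^V
        self.w_o = nn.Linear(d_model, d_model)  # W^o

    def forward(self, x):                        # x: (B, N, d_model)
        B, N, _ = x.shape
        q, k, v = (proj(x).view(B, N, self.h, self.d_k).transpose(1, 2)
                   for proj in (self.w_q, self.w_k, self.w_v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)  # Eq. (4)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)  # Concat(head_1, ..., head_h)
        return self.w_o(out)                                 # Eq. (5)


tokens = torch.randn(2, 49, 512)      # e.g. a flattened 7x7 fused feature map
y = MultiHeadSelfAttention()(tokens)  # (2, 49, 512)
```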

3.4 Aesthetic prediction

Following the feature fusion process, the intrinsic connection between emotion and image aesthetics is explored through the self-attention mechanism. Subsequently, the features are transferred to the aesthetic prediction module, which comprises a global average pooling layer and a fully connected layer. Ultimately, the aesthetic score is determined.
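A minimal sketch of such a prediction head is shown below; the input channel count (2048, matching ResNet50's last stage) and the output dimension are assumptions made for illustration.

```python
# Sketch of the aesthetic prediction head: global average pooling + fully connected layer.
import torch
import torch.nn as nn


class AestheticHead(nn.Module):
    def __init__(self, in_channels: int = 2048, out_dim: int = 1):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.fc = nn.Linear(in_channels, out_dim)  # out_dim=1 for a score,
                                                   # or 10 for a score distribution

    def forward(self, feat):                       # feat: (B, C, H, W)
        return self.fc(self.pool(feat).flatten(1))


score = AestheticHead()(torch.randn(2, 2048, 7, 7))  # (2, 1)
```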

3.5 Model pruning

The obtained baseline model undergoes fine-grained pruning and retraining. Specifically, after fusing emotional and image aesthetic features, a complex feature representation is acquired. Pruning techniques are applied to simplify the feature extraction network by removing neurons that contribute minimally to the final prediction, thereby reducing the number of parameters and computational load.

The gradient of the loss function with respect to each neuron is first calculated to estimate its importance: the baseline model is run in a forward pass, and the gradient of the loss with respect to each neuron is then computed. A larger gradient indicates that the neuron has a greater influence on the model's predictions. Based on these importance scores, a threshold is chosen to determine which neurons to retain and which to prune. Threshold selection is a crucial step because it directly controls the extent of pruning and the resulting model performance; in this study, a percentile-based method is used to select an appropriate threshold. A lower threshold retains more neurons, while a higher threshold prunes more of them. Pruned neurons no longer affect the model's output.
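The sketch below illustrates this gradient-based, percentile-thresholded pruning in PyTorch. It operates at the weight level for simplicity, whereas the paper describes pruning at the neuron level; the pruning ratio and the use of gradient magnitude as the importance score are likewise our assumptions rather than the paper's exact recipe.

```python
# Sketch of gradient-based importance scoring with a percentile threshold.
import numpy as np
import torch


def prune_by_gradient(model: torch.nn.Module, loss: torch.Tensor, prune_ratio: float = 0.3):
    """Zero out the weights whose gradient magnitude falls below a percentile."""
    loss.backward()  # gradients from one forward pass of the baseline model
    masks = {}
    for name, param in model.named_parameters():
        if param.grad is None or param.dim() < 2:   # skip biases / BN parameters
            continue
        importance = param.grad.abs()               # larger gradient -> larger influence
        threshold = np.percentile(importance.detach().cpu().numpy(),
                                  prune_ratio * 100)  # percentile-based cut-off
        mask = (importance > threshold).float()
        param.data.mul_(mask)                       # pruned weights no longer affect the output
        masks[name] = mask                          # reused to keep them at zero during retraining
    return masks
```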

After pruning, the number of parameters in the model is reduced. To maintain the structural integrity of the model, parameter trimming and reconstruction can be performed. Specifically, after removing neurons, it is necessary to adjust the inputs and outputs of other neurons in the model accordingly. This entails reconnecting or adjusting connection weights to ensure that the flow of inputs and outputs in the model remains unaffected. Subsequently, the model’s structure is reorganized based on the layout of pruned neurons, including removing unnecessary layers, adjusting connections between layers, and even adding new layers to reconnect the pruned parts. Through parameter trimming and reconstruction, the structure of the model aligns with the pruned parameters while preserving the overall architecture of the original model. This ensures that the functionality of the model remains intact and can be used for inference or training tasks.

Table 1 Performance comparison with existing image aesthetic assessment methods on the AVA dataset

4 Experiments

In this section, we present a brief overview of several indicators used to measure the prediction performance during the experiment. Next, to evaluate the effectiveness of the proposed method, we compare it with existing methods using a relevant dataset. Additionally, we perform ablation experiments to assess the effectiveness of each component of the method, thereby demonstrating their utility.

Fig. 2 Some results of score prediction

4.1 Databases and indicators

We perform the experiments on the AVA dataset, a benchmark for image aesthetic assessment whose images are collected from the DPChallenge website. The dataset comprises over 255,000 images covering 963 challenging topics. Each image is assigned an aesthetic score label based on the average score given by the voters, and the number of voters and their individual scores can be used to derive the aesthetic score distribution. For the binary classification task, images with an average score above five are labeled "1", representing high aesthetic value, while images with an average score below five are labeled "0", representing low aesthetic value. Because some images are corrupted, a total of 252,423 images were used in the experiments. For a fair comparison with other models, the AVA dataset is partitioned as in prior studies, i.e., 80% of the dataset is allocated for training and the remaining 20% is reserved for testing.
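To make the labeling rule concrete, the sketch below (with made-up vote counts) shows how a per-image score histogram yields the three targets used in the experiments: the normalized distribution, the mean score, and the binary label.

```python
# Illustrative sketch of turning an AVA-style vote histogram into labels.
import numpy as np

votes = np.array([0, 1, 5, 20, 43, 75, 50, 12, 3, 1])  # counts for scores 1..10 (made-up)
scores = np.arange(1, 11)

distribution = votes / votes.sum()          # target for distribution prediction
mean_score = float(scores @ distribution)   # target for score regression
binary_label = int(mean_score > 5)          # 1 = high aesthetic value, 0 = low

print(mean_score, binary_label)
```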

Furthermore, emotional datasets were also utilized to ensure the fairness and feasibility of the experiment.

The Flickr and Instagram (FI) dataset contains approximately 23,000 emotion-labeled images covering eight emotions: anger, awe, disgust, amusement, excitement, fear, contentment, and sadness. The Emotion6 dataset consists of about 1,980 images collected by searching emotion keywords and synonyms on Flickr [21]; it covers six basic emotions and uses emotion distribution probabilities instead of a single emotion label. The Abstract dataset [22] comprises 279 abstract paintings that do not depict specific objects; around 230 participants categorized these paintings into eight emotional categories, and each painting was rated approximately 14 times. The Artphoto dataset draws on three sources: the International Affective Picture System (IAPS), a compilation of art photographs from photo sharing websites, and a collection of peer-rated abstract paintings. It comprises a total of 806 pictures covering eight discrete emotion categories: amusement, anger, awe, contentment, disgust, excitement, fear, and sadness [23].

To facilitate a fair comparison with existing work, this section uses several indicators to evaluate the performance of different methods: accuracy (ACC), Pearson linear correlation coefficient (PLCC), Spearman rank-order correlation coefficient (SROCC), and Earth Mover's Distance (EMD). ACC is the ratio of correctly predicted samples among all evaluated samples; a higher ACC indicates better classification performance. PLCC and SROCC measure the correlation between the subjective and predicted scores. EMD measures the model's ability to predict the aesthetic score distribution.
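A small sketch of these indicators on toy data is given below; PLCC and SROCC come directly from scipy, and the EMD shown is the cumulative-distribution form commonly used for aesthetic score distributions (assumed here with r = 1).

```python
# Sketch of the four evaluation indicators on toy predictions.
import numpy as np
from scipy import stats

y_true = np.array([5.6, 4.2, 6.8, 3.9, 5.1])   # subjective mean scores (made-up)
y_pred = np.array([5.4, 4.5, 6.5, 4.3, 5.0])   # predicted scores (made-up)

acc = np.mean((y_pred > 5) == (y_true > 5))    # binary classification accuracy
plcc = stats.pearsonr(y_true, y_pred)[0]       # linear correlation
srocc = stats.spearmanr(y_true, y_pred)[0]     # rank-order correlation


def emd(p, q):
    """Earth Mover's Distance between two score distributions (r = 1)."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).mean()
```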

Regarding network settings, this paper uses ResNet50 pretrained on ImageNet as the backbone network. ResNet50 consists of 50 layers and includes structures such as shortcuts, batch normalization, and pooling. Initializing training from a model already able to extract features from natural images yields better initial performance on the task. The AVA and FI datasets serve as the aesthetic and emotional datasets, respectively. Each image is first resized to \(224 \times 224 \times 3\) and then fed into the network. During training, the initial learning rate of the backbone network is set to 1e-4, and that of the other modules to 1e-5; the learning rate is reduced to 0.1 times its value every five epochs. The Adam optimizer is used throughout with a weight decay of 5e-4, and the mean squared error (MSE) loss is employed. The experiments are implemented in Python, and the deep learning computations are accelerated with an NVIDIA TITAN XP graphics card.
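The sketch below mirrors these optimization settings in PyTorch; the backbone and head objects are placeholders standing in for the actual Emo-AEN modules, and the number of epochs is illustrative.

```python
# Sketch of the reported optimization settings (placeholder modules).
import torch

backbone = torch.nn.Linear(8, 8)   # placeholder for the pretrained ResNet50 backbones
head = torch.nn.Linear(8, 1)       # placeholder for the fusion / prediction modules

optimizer = torch.optim.Adam(
    [{"params": backbone.parameters(), "lr": 1e-4},   # backbone learning rate
     {"params": head.parameters(), "lr": 1e-5}],      # other modules
    weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
criterion = torch.nn.MSELoss()                        # regression loss on the scores

for epoch in range(20):
    # for images, targets in train_loader:
    #     loss = criterion(model(images), targets); loss.backward(); optimizer.step()
    scheduler.step()  # reduces every learning rate to 0.1x after each 5 epochs
```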

4.2 Overall performance

There are three main tasks in image aesthetic assessment: binary classification, score regression, and distribution prediction. For binary classification, we assign "0" and "1" as image labels and use ACC to evaluate classification performance. For score regression, we use the average score as the image label and assess the consistency and monotonicity between subjective and predicted scores with SROCC and PLCC. To assess the model's capability in predicting distributions, we use the EMD value.

Table 1 presents a comparison of the prediction performance of the method outlined in this section with other advanced image aesthetic assessment methods on the AVA dataset. The table showcases the best results attained, which are indicated in bold.

Overall, the proposed method demonstrates strong competitive performance, achieving the best SROCC, PLCC, and EMD values. Although the MSE of our method is marginally inferior to that of HLA-GCN, its PLCC, SROCC, and EMD results are clearly better, indicating that the proposed method has stronger score regression and distribution prediction capabilities than HLA-GCN. Furthermore, compared with the handcrafted-feature baseline on AVA, all deep learning methods show a significant improvement in performance, which further confirms the strong learning and image representation capabilities of neural networks. In the score regression task, our method achieves the highest SROCC and PLCC values, and in the distribution prediction task it achieves the lowest EMD value, highlighting its strong performance in this area.

To assess the performance of our method in the score prediction task, we present several score prediction results in Fig. 2. In the caption of each image in Fig. 2, the upper number is the ground truth and the lower number is the predicted score. The network predictions are clearly close to the ground truth, highlighting the effectiveness of our method.

Table 2 The performance of cross database testing
Fig. 3 Matrix of the two-sample t-test

The performance of cross-dataset testing is presented in Table 2. The configurations compared are "W/O emotion", "Emotion6", "Abstract", "Artphoto", and the proposed method. The evaluated metrics are accuracy (ACC), Pearson linear correlation coefficient (PLCC), Spearman rank-order correlation coefficient (SROCC), mean squared error (MSE), and Earth Mover's Distance (EMD). The table clearly shows that the proposed method surpasses the other configurations on all metrics, achieving the highest ACC, SROCC, and PLCC and the lowest MSE and EMD.

Fig. 4 Matrix of the two-sample Wilcoxon rank-sum test

4.3 Cross database experiment

In addition, this section includes cross-dataset experiments conducted on other emotional datasets; Table 2 displays the results. Each test dataset is partitioned with 80% allocated for training and 20% reserved for testing. The ACC column reports the cross-dataset results obtained with the weights of the original training parameters, whereas the retraining ACC reports the results obtained after retraining the network on the alternative emotional datasets. Using the original network weights for cross-dataset evaluation leads to a notable drop in accuracy, particularly on the Abstract and Artphoto datasets; this drop stems from differences in the emotion categories and sizes of the datasets. After retraining with the relevant emotion datasets, however, accuracy improves significantly, with the Emotion6 dataset showing a 2% increase. The remaining drop on the Abstract dataset is attributed to its small size and the large difference in image types compared with the other emotional datasets, which highlights the sensitivity of deep networks to dataset size.

4.4 Statistical significance analysis

In addition, we conducted a statistical significance analysis on the SROCC using both the two-sample t-test and the two-sample Wilcoxon rank sum test. The results are shown in Figs. 3 and 4, where “1”, “-1”, and “0” denote the comparison of model performance. The comparison is conducted horizontally, with “1” signifying that the row model surpasses the column model, “0” indicating equal performance between the two models, and “-1” denoting inferior performance of the row model compared to the column model. The methods considered for comparison are those related to image aesthetic distribution prediction. All algorithms were tested on the AVA aesthetic dataset at a 95% confidence level. The comparison results indicate that the methods proposed in this section exhibit strong competitive advantages.
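The sketch below shows how one entry of such a comparison matrix could be computed with scipy; the per-model SROCC samples are made-up placeholders.

```python
# Sketch of a pairwise significance comparison on SROCC values at the 95% level.
import numpy as np
from scipy import stats

srocc_a = np.random.normal(0.74, 0.01, 30)   # per-split SROCC of model A (made-up)
srocc_b = np.random.normal(0.71, 0.01, 30)   # per-split SROCC of model B (made-up)


def compare(a, b, alpha=0.05, test=stats.ttest_ind):
    """Return 1 if a is significantly better, -1 if significantly worse, 0 otherwise."""
    _, p = test(a, b)
    if p >= alpha:
        return 0
    return 1 if a.mean() > b.mean() else -1


print(compare(srocc_a, srocc_b))                       # t-test entry of the matrix
print(compare(srocc_a, srocc_b, test=stats.ranksums))  # Wilcoxon rank-sum entry
```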

In conclusion, the proposed method demonstrates superior performance in score prediction tasks compared to existing methods. The cross dataset experiments further confirm the effectiveness of the proposed method, especially after retraining with relevant emotion datasets. The statistical significance analysis shows that the proposed method has a strong competitive advantage in image aesthetic distribution prediction.

5 Conclusion

This paper proposes Emo-AEN, an image aesthetic evaluation framework based on the internal fusion of brand image and aesthetic design. Inspired by the close connection between image design and aesthetic evaluation in design studies, Emo-AEN combines brand image design with image aesthetics, allowing the network to capture both the aesthetic and detailed design features of images. To exploit the relationship between image aesthetics and brand image, the study uses self-attention mechanisms to uncover the intrinsic connection between image aesthetics and brand design, and redundant, insignificant weights within the network are removed with model pruning techniques. Through these lightweight strategies, the integrated feature extraction algorithm not only processes image aesthetics and design information efficiently but also has a wider application scope, including operation on devices with limited computational power and storage resources. The algorithm is therefore well suited not only to high-performance computing environments but also to a variety of mobile and edge computing devices, enabling intelligent brand image optimization services for a broader range of users. Experimental results demonstrate that the proposed method performs well compared with existing approaches, and the ablation experiments confirm the vital roles of the self-attention mechanism and the fusion module in the network. Finally, the study shows that different images can convey distinct visual content, highlighting the complementary and integrated relationship between image aesthetics and brand image design.