1 Introduction

Multi-modal learning studies machine learning problems with diverse modal information such as vision, text, and sound. These modalities often come from different sensors and vary greatly in how they are formed and in their internal structures [1]. The heterogeneity of multi-modal data makes it challenging to learn the relevance and complementarity among modalities. Extensive research has been devoted to multi-modal learning, in which multi-modal fusion occupies an important position. Multi-modal fusion [2, 3] is the process of synthesizing information from two or more modalities for prediction. A single modality usually cannot carry all the effective information needed to produce precise predictions, whereas multi-modal fusion widens the coverage of the input information and increases the robustness of the prediction model. According to the stage at which fusion occurs, multi-modal fusion can be divided into pre-fusion, post-fusion, and hybrid fusion. Castellano et al. [4] adopted pre-fusion, integrating the features of all modalities before modeling and thus completing feature-level fusion. Ramirez et al. [5] employed post-fusion, which first models each modality separately and then synthesizes the model outputs or decisions into a final result, completing fusion at the decision level. Lan et al. [6] combined pre-fusion and post-fusion at the feature and decision levels. However, recent multi-modal learning methods perform fusion at these different levels without considering the correlations among modalities.

Transfer learning, the transfer of knowledge from a source domain to a target domain, can help reduce the information differences among modalities. Domain adaptation methods [7, 8], widely used in transfer learning, have achieved great success in recent years by alleviating the domain gap [9,10,11]. For this reason, domain adaptation has been applied to text classification [12, 13] and visual object recognition [14, 15]. Most existing domain adaptation methods assume that data from different domains are represented by features of a common type. In the real world, however, source-domain and target-domain samples usually do not share the same feature representation [16, 17], because domains are often heterogeneous. Moreover, previous multi-modal domain adaptation methods mainly align data distributions at the coarse-grained level, which may lead to unsatisfactory transfer performance. To enhance alignment at all levels, data distribution alignment at the fine-grained level should also be considered.

To solve these problems, we propose a multi-modal domain adaptation method based on parameter fusion and two-step alignment. The main contributions of this paper are summarized as follows:

  1. A multi-modal domain adaptation method based on parameter fusion and two-step alignment is proposed. It consists of three parts: parameter fusion, a two-step alignment operation, and multi-modal conditional weighting, which together address the difficulty of connecting and aligning modalities that arises when the distinct information of each modality is ignored, as in existing research.

  2. Parameter fusion is carried out to strengthen the bonds among modalities and improve the fusion effect.

  3. A two-step alignment strategy is designed to improve the alignment of probability distributions across modalities. In particular, a high-order moment measurement is used to overcome the neglect of fine-grained alignment in existing multi-modal domain adaptation methods.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the multi-modal domain adaptation method. Section 4 analyzes and compares the experimental results. Finally, Sect. 5 draws conclusions and discusses future work.

2 Related Work

In this section, multi-modal learning and domain adaptation are introduced, providing the theoretical background for the proposed method.

2.1 Multi-Modal Learning

Modalities are the natural forms in which information about things in the physical world is represented. Data from multiple sources are semantically correlated and sometimes provide complementary information to each other, resulting in better performance on multi-modal problems [18, 19]. Such systems integrate heterogeneous, disconnected data from disparate sensors, thereby helping to produce more reliable classifications [20]. For example, an emotion detector can gather information from electroencephalogram and eye-movement signals and combine these two data sources to classify a person's current emotion, completing a deep learning task.

Multi-modal learning has progressively developed into the main solution for multimedia parsing, and researchers worldwide have achieved outstanding results in this field. Among these efforts, multi-modal fusion, which aims to integrate information from multiple modalities into a consistent, common model output, is a fundamental topic. Combining multiple modalities yields more comprehensive features, enhances robustness, and ensures that the model can still work effectively when some modalities are missing. The academic community has made significant progress in multi-modal fusion. A remaining difficulty is that each modality can be affected by imbalanced sample distributions, so naively fused information may fail to express the appropriate characteristics. Jing et al. [21] designed an implicit conditional random field, assuming that data from different modalities share a latent structure. By learning this latent shared structure, connections between multi-modal data can be established while the relationship between the structure and the supervised category information is mined, which can be applied to classification tasks. In [22], a multi-modal event topic model is proposed to model social media documents, based mainly on the integration of text and vision; by learning the correlation between textual and visual features, visually representative topics can be distinguished from non-representative ones. A hashing algorithm proposed in [23] integrates multiple weakly supervised modalities into binary codes, enabling classification with kernel functions and SVMs. In [24], multi-layer linear fusion of GPS position information and antenna action information is used to detect system positioning errors. In addition, [25] integrates 2D and 3D data features to obtain complementary information and improve the final results.

2.2 Domain Adaptation

Domain adaptation aims to extend knowledge obtained from a source domain to a target domain, and it involves three main definitions: domain, task, and domain adaptation. A domain D contains two parts: a d-dimensional feature space X and a marginal probability distribution P(x), where \(x=\left\{{x}_{1},{x}_{2}, \dots ,{x}_{n}\right\} \subset X\) is a set of n samples, so a domain is represented as D = \(\left\{X,P(x)\right\}\). A task is composed of two parts: the label space y and the category prediction function \(f(\cdot )\), which predicts the label corresponding to a sample. From the probabilistic perspective, it can be expressed as the conditional probability distribution P(y|x), so a task can be represented as T = \(\{y,P\left(y|x\right)\}\). In transfer learning, it is usually assumed that the task to be solved, often image classification, is the same in the two domains. In heterogeneous transfer, however, the task can be, for example, image-to-text classification.

There are mainly three kinds of domain adaptation approaches. The first is sample-based domain adaptation: by constructing weight factors for samples, the source distribution is brought closer to the target distribution. For example, [26] and [27] estimate the distribution ratio between the source and target domains and use it as the sample weight. The second is feature-based domain adaptation, which facilitates the alignment of probability distributions by learning transferable features and mapping each domain into a common subspace. For example, in [28], a feature transformation is used to reduce the gap between the source and target domains; alternatively, the sample features of both domains can be mapped into a common feature space on which traditional machine learning methods are applied [29]. The third is model-based domain adaptation, which exploits the similarity between models to transfer the source model to the target domain by introducing knowledge of the target error into the source-domain error function. Deng et al. [30] proposed TransRKELM to update the initial model, improving the adaptability of the classifier and obtaining better recognition performance.

The combination of multi-modal learning and domain adaptation has been well studied. However, most existing methods focus only on domain-specific information when eliminating the heterogeneity of each modality and neglect the correlations among modalities. At the same time, these methods involve only coarse-grained information about the data distribution when performing domain alignment. We therefore introduce a multi-modal domain adaptation method based on parameter fusion and two-step alignment (PFTS). To learn bonds among modalities, we set up a parameter fusion mechanism that improves modality integration. After a first phase of adversarial domain adaptation, we further design a high-order moment measurement to refine the alignment with fine-grained information.

3 Multi-Modal Domain Adaptation Method

Figure 1 shows an overview of our proposed multi-modal domain adaptation approach based on parameter fusion and two-step alignment, named PFTS. PFTS consists of three main steps. First, to reduce the heterogeneity among features, a feature transformer maps the multi-modal data into a common subspace so that the feature dimensions of all modalities are consistent. Then, with parameter fusion improving modality integration, adversarial domain adaptation and similarity measurements are performed; this two-step alignment strategy aligns the modalities through adversarial domain adaptation, high-order moment measurement, and task similarity measurement. Finally, the multi-modal classifier, trained with the source-domain weighting mechanism, outputs the final decision.

Fig. 1 General framework of the proposed method

The feature transformer is composed of multiple two-layer neural networks, which map the samples of all modalities into a common subspace of the same dimension, thereby resolving cross-modal heterogeneity. The structure of the feature transformer is shown in Fig. 2.

Fig. 2 The structure of the feature transformer
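To make the structure concrete, the following is a minimal sketch of per-modality feature transformers, assuming fully connected layers; the hidden width and ReLU activation are our assumptions, while the 256-D common subspace and the modality dimensions follow Sect. 4.1.

```python
import tensorflow as tf

def make_feature_transformer(input_dim, subspace_dim=256, hidden_dim=512):
    """Two-layer network mapping one modality into the common subspace.

    The 256-D subspace follows Sect. 4.1; hidden_dim and the ReLU
    activation are illustrative assumptions.
    """
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_dim, activation="relu",
                              input_shape=(input_dim,)),  # first layer, l1
        tf.keras.layers.Dense(subspace_dim),              # second layer, l2
    ])

# One transformer per modality, e.g. two source modalities (S800, D4096)
# and a target modality (R2048) from the Office-31 tasks in Sect. 4.1.
F_s = [make_feature_transformer(800), make_feature_transformer(4096)]
F_t = make_feature_transformer(2048)
```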

3.1 Multi-Modal Parameter Fusion

In multi-modal learning, the different source and target domains represent different modalities. Because each domain has its own specific structure, the information extracted from it also differs. To establish correlations among the domains, complementary parameter fusion among modalities is necessary. Thus, we adopt identical structures for the second layers of the feature transformers and constrain their parameters. The objective function is as follows

$$ \ell_{p} = \sum\limits_{j = 1}^{J} {\left\| {F_{s,j}^{{l_{2} }} - F_{t}^{{l_{2} }} } \right\|}_{1} + \sum\limits_{j = 2}^{J} {\left\| {F_{s,j}^{{l_{2} }} - F_{s,j - 1}^{{l_{2} }} } \right\|}_{1} $$
(1)

where J is the number of source domains, \(l_1\) and \(l_2\) denote the first and second layers of the network, \({F}_{s,j}^{{l}_{2}}\) denotes the second-layer parameters for the jth source domain, \({F}_{t}^{{l}_{2}}\) denotes the second-layer parameters for the target domain, and \({F}_{s,j-1}^{{l}_{2}}\) denotes the second-layer parameters for the (j−1)th source domain.

By minimizing this loss, the differences between the parameters of different modalities are reduced and parameter consistency is achieved. The method can thus model the correlations between modalities, which is often more practical in complex situations.
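Continuing the sketch above, Eq. (1) can be computed over the second-layer kernels; restricting the constraint to kernel weights (omitting biases) is our assumption.

```python
def parameter_fusion_loss(F_s, F_t):
    """L1 parameter fusion loss of Eq. (1): ties each source second-layer
    kernel to the target's, and neighbouring source kernels to each other."""
    w_t = F_t.layers[-1].kernel                    # F_t^{l2}
    w_s = [m.layers[-1].kernel for m in F_s]       # F_{s,j}^{l2}, j = 1..J
    loss = tf.add_n([tf.reduce_sum(tf.abs(w - w_t)) for w in w_s])
    for j in range(1, len(w_s)):                   # consecutive source pairs
        loss += tf.reduce_sum(tf.abs(w_s[j] - w_s[j - 1]))
    return loss
```

Note that the identical second-layer structure is what makes the element-wise differences well defined: every second-layer kernel has the shape (hidden_dim, subspace_dim).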

3.2 Two-Step Alignment Strategy

3.2.1 Adversarial Domain Adaptation

This paper proposes a two-step alignment strategy, whose first step is adversarial domain adaptation. Recently, adversarial networks have been successfully applied to distribution alignment using two competing components, a domain discriminator and a feature transformer [31, 32]. The domain discriminator distinguishes source samples from target samples, while the feature transformer tries to fool it. Based on these ideas, we design the adversarial domain adaptation scheme: the domain discriminator minimizes, and the feature transformer maximizes, the competitive loss

$$ \ell_{d}^{g} = \frac{1}{n}\sum\limits_{j = 1}^{J} \sum\limits_{i = 1}^{n} L\left[ d\left( F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right) \right), H_{i,j}^{s} \right] + \frac{1}{z}\sum\limits_{i = 1}^{z} L\left[ d\left( F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right) \right), H_{i}^{t} \right] $$
(2)

where d is the domain discriminator, g is the feature transformer, n is the number of samples per source domain, \(L[\cdot ,\cdot ]\) is the squared loss, \({H}_{i,j}^{s}\) is the one-hot domain label of \({x}_{i,j}^{s}\), z is the number of target-domain samples, and \({H}_{i}^{t}\) is the one-hot domain label of \({x}_{i}^{t}\).
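As a sketch, the squared-loss term of Eq. (2) can be written as follows, with the per-domain averages of Eq. (2) folded into batch means; the feature and label tensors are assumed to be precomputed.

```python
import tensorflow as tf

def domain_discrimination_loss(d, feats_src, H_src, feats_tgt, H_tgt):
    """l_d^g of Eq. (2): squared loss between the discriminator's output on
    common-subspace features F^{l2}(F^{l1}(x)) and one-hot domain labels H."""
    l_src = tf.reduce_mean(tf.reduce_sum(tf.square(d(feats_src) - H_src), axis=1))
    l_tgt = tf.reduce_mean(tf.reduce_sum(tf.square(d(feats_tgt) - H_tgt), axis=1))
    return l_src + l_tgt
```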

Subsequently, a classification loss on the label classifier and feature transformer can be formulated as

$$\begin{aligned} \ell_{f}^{g} &= \frac{1}{n}\sum\limits_{j = 1}^{J} \sum\limits_{i = 1}^{n} L_{c} \left[ C\left( F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right) \right), y_{i,j}^{s} \right]\\ &\quad + \frac{1}{z}\sum\limits_{i = 1}^{z} L_{c} \left[ C\left( F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right) \right), y_{i}^{t} \right] + \tau \left( \left\| f \right\|_{2} + \left\| g \right\|_{2} \right)\end{aligned} $$
(3)

where f is the classifier, \({L}_{c}[\cdot ,\cdot ]\) is the cross-entropy loss, τ is a positive regularization parameter that prevents overfitting, and \(C(\cdot )\) is the output of the classifier. By optimizing \(C(\cdot )\), the classification loss is minimized and the feature transformer achieves better discriminability. Integrating (2) and (3), we obtain

$$ \min_{f,g} \max_{d} \left( \ell_{f}^{g} - \gamma \ell_{d}^{g} \right) $$
(4)

where γ is the trade-off parameter between the label classifier and the domain discriminator. Although this objective aligns the distributions of the source and target domains, only the marginal distributions are matched because label information is neglected. This deficiency is addressed by the multi-modal conditional weighting in Sect. 3.3.
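The alternating optimization of Eq. (4) can be sketched as follows, reusing make_feature_transformer from above. The classifier and discriminator widths (6 classes, 3 domains), the single mixed batch, and the squared-norm regularizer standing in for \(\|f\|_2 + \|g\|_2\) are assumptions; γ = 0.03, τ = 0.004 and the learning rates follow Sect. 4.1.

```python
g = make_feature_transformer(800)                     # feature transformer (one modality shown)
f = tf.keras.layers.Dense(6, activation="softmax")    # label classifier; 6 classes assumed
d = tf.keras.layers.Dense(3)                          # domain discriminator; 3 domains assumed
opt_fg = tf.keras.optimizers.Adam(0.004)              # learning rates from Sect. 4.1
opt_d = tf.keras.optimizers.Adam(0.001)

def train_step(x, y_onehot, h_onehot, gamma=0.03, tau=0.004):
    """One alternating update of min_{f,g} max_d (l_f^g - gamma * l_d^g)."""
    with tf.GradientTape(persistent=True) as tape:
        feats = g(x)
        l_d = tf.reduce_mean(tf.reduce_sum(tf.square(d(feats) - h_onehot), axis=1))
        l_f = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_onehot, f(feats)))
        # squared-norm regularizer, a stand-in for tau * (||f|| + ||g||)
        l_f += tau * tf.add_n([tf.reduce_sum(tf.square(v))
                               for v in f.trainable_variables + g.trainable_variables])
        obj = l_f - gamma * l_d                       # Eq. (4)
    vars_fg = g.trainable_variables + f.trainable_variables
    opt_fg.apply_gradients(zip(tape.gradient(obj, vars_fg), vars_fg))   # f, g minimise
    opt_d.apply_gradients(zip(tape.gradient(-obj, d.trainable_variables),
                              d.trainable_variables))                   # d maximises
    del tape
    return l_f, l_d
```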

3.2.2 Similarity Measurement

The second step optimizes the model by exploiting the task similarity and distribution similarity between domains on top of adversarial domain adaptation. In transfer learning, it is not only difficult to measure the similarity of sample distributions between domains; negative transfer may also occur when the correlation between tasks in different domains cannot be judged. It is therefore necessary to measure both the task similarity and the distribution similarity between the source and target domains. Taking image classification as an example, the task of each domain is to assign labels, so measuring the similarity between tasks amounts to measuring the similarity between label assignments. We thus introduce a task similarity measurement between domains to gauge the transferability of each source domain.

$$\begin{aligned} \ell_{Q} &= \frac{1}{n}\sum\limits_{j = 1}^{J - 1} \sum\limits_{i = 1}^{n} C\left[ F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right), y_{i,j}^{s} \right] - \frac{1}{n}\sum\limits_{j = 2}^{J} \sum\limits_{i = 1}^{n} C\left[ F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right), y_{i,j}^{s} \right]\\ &\quad - \frac{1}{z}\sum\limits_{i = 1}^{z} C\left[ F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right), y_{i}^{t} \right],\end{aligned} $$
(5)

At the same time, we introduce a high-order moment measurement into multi-modal fusion to improve the robustness of the model. Assume that the features of the source and target domains follow probability distributions p and q, respectively. The high-order moment measurement then quantifies the distance between domains, as given by (6)

$$ \begin{aligned} \ell_{s} &= \frac{1}{n}\sum\limits_{j = 1}^{J} \sum\limits_{i = 1}^{n} \frac{1}{\left| b - a \right|} \left\| E\left( F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right) \right) - E\left( F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right) \right) \right\|_{2}\\ &\quad + \sum\limits_{\zeta = 2}^{Z} \frac{1}{\left| b - a \right|^{\zeta}} \left\| M_{\zeta} \left( F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right) \right) - M_{\zeta} \left( F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right) \right) \right\|_{2} ,\end{aligned} $$
(6)

where E(·) is the expectation vector computed on the data features, Z is the highest moment order considered, [a, b] is an interval bounding the feature values, and the ζth-order central moment is

$$ M_{\zeta } \left( x \right) = \left( E\left( \prod\limits_{i = 1}^{N} \left( x_{i} - E\left( x_{i} \right) \right)^{\lambda_{i}} \right) \right)_{\lambda_{i} \ge 0,\ \sum\nolimits_{i = 1}^{N} \lambda_{i} = \zeta } . $$
(7)
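A sketch of Eqs. (6)–(7) for one source–target pair follows; restricting Eq. (7) to marginal (per-dimension) central moments, i.e. λ with a single non-zero entry, and bounding features to [0, 1] are simplifying assumptions.

```python
import tensorflow as tf

def high_order_moment_loss(feats_s, feats_t, max_order=4, a=0.0, b=1.0):
    """Sketch of Eq. (6): matches the means and per-dimension central
    moments of source and target common-subspace features up to
    `max_order` (four orders, as in the PFTS (1+2+3+4) variant of
    Sect. 4.3); [a, b] is the assumed feature range."""
    scale = tf.abs(b - a)
    mean_s = tf.reduce_mean(feats_s, axis=0)
    mean_t = tf.reduce_mean(feats_t, axis=0)
    loss = tf.norm(mean_s - mean_t) / scale          # first-order term
    for zeta in range(2, max_order + 1):             # higher-order terms
        m_s = tf.reduce_mean((feats_s - mean_s) ** zeta, axis=0)
        m_t = tf.reduce_mean((feats_t - mean_t) ** zeta, axis=0)
        loss += tf.norm(m_s - m_t) / scale ** zeta
    return loss
```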

3.3 Multi-Modal Conditional Weighting

Multiple modalities are not necessarily of equal importance. To account for this, we use a conditional weighting scheme to measure the importance of each source domain. The proposed scheme quantifies the contribution of each modality and, more importantly, also aligns the conditional distributions. The class-conditional MMD implemented with linear kernels has been verified as an effective method for measuring the difference in conditional distributions between modalities [32], and it can be used as a nonparametric estimate of the gap between two conditional distributions. Specifically, the sum of the distances between all class centers of the source and target domains models the difference in conditional distribution, so the degree of dissimilarity between the kth source domain and the target domain is given by (8)

$$ \delta_{k} = \frac{1}{C}\sum\limits_{c = 1}^{C} \left\| \frac{\sum\limits_{i = 1}^{n_{l}^{c}} F_{t} \left( x_{i,c}^{l} \right) + \sum\limits_{i = 1}^{n_{u}} \tilde{y}_{i,c}^{u} F_{t} \left( x_{i}^{u} \right)}{n_{l}^{c} + \sum\limits_{i = 1}^{n_{u}} \tilde{y}_{i,c}^{u}} - \frac{1}{n_{s,k}^{c}}\sum\limits_{i = 1}^{n_{s,k}^{c}} F_{s,k} \left( x_{i,c}^{s,k} \right) \right\|^{2} $$
(8)

where \({x}_{i,c}^{s,k}\) is the ith sample of class c in the kth source domain, \({x}_{i,c}^{l}\) is the ith labeled target sample of class c, \({x}_{i}^{u}\) is the ith unlabeled target sample, \({\widetilde{y}}_{i,c}^{u}\) is the predicted probability that \({x}_{i}^{u}\) belongs to class c, \({n}_{s,k}^{c}\) is the number of samples of class c in the kth source domain, \({n}_{u}\) is the number of unlabeled target samples, and \({n}_{l}^{c}\) is the number of labeled target samples of class c. \(\delta_{k}\) models the degree of dissimilarity between the source and target domains through the difference of their conditional distributions: the smaller \(\delta_{k}\) is, the smaller the difference in conditional distribution between the domains.
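A NumPy sketch of Eq. (8) follows; that the pseudo-label probabilities \(\tilde{y}^u\) come from the current classifier's soft outputs is our assumption.

```python
import numpy as np

def dissimilarity(feat_s_by_class, feat_l_by_class, feat_u, y_tilde):
    """delta_k of Eq. (8): mean squared distance between the per-class
    centers of the k-th source domain and the target domain.

    feat_s_by_class / feat_l_by_class: lists of arrays, one per class,
    holding source / labeled-target features; feat_u: unlabeled target
    features (n_u, dim); y_tilde: class probabilities (n_u, C)."""
    C = len(feat_s_by_class)
    delta = 0.0
    for c in range(C):
        # target class center: labeled samples plus probability-weighted
        # unlabeled samples
        num = feat_l_by_class[c].sum(axis=0) + (y_tilde[:, c:c + 1] * feat_u).sum(axis=0)
        den = len(feat_l_by_class[c]) + y_tilde[:, c].sum()
        center_t = num / den
        center_s = feat_s_by_class[c].mean(axis=0)   # source class center
        delta += np.sum((center_t - center_s) ** 2)
    return delta / C
```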

It is expected that the greater the dissimilarity between a source domain and the target domain, the smaller \({\omega }_{k}\) should be. A monotonically increasing function of the dissimilarities is therefore designed to compute the weight \({\omega }_{k}\). Because \({\omega }_{k}\) has a minimum value of 0.5, the scheme assigns a weight to every source domain while reducing the importance of dissimilar ones.

$$ \omega_{k} = \frac{1}{K - 1}\sum\limits_{\begin{subarray}{l} j = 1 \\ j \ne k \end{subarray} }^{K} {\frac{{\exp \left( {\delta_{j} } \right)}}{{1 + \exp \left( {\delta_{j} } \right)}}} $$
(9)

where \({\delta }_{j}\) is the dissimilarity of the jth source domain computed by (8), and K is the number of source domains.
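A sketch of Eq. (9); the example dissimilarity values are hypothetical.

```python
import numpy as np

def conditional_weights(deltas):
    """Eq. (9): omega_k averages sigmoid(delta_j) over the other source
    domains j != k, so a source with a large dissimilarity delta_k
    (which its own weight excludes) ends up relatively down-weighted."""
    deltas = np.asarray(deltas, dtype=float)
    sig = np.exp(deltas) / (1.0 + np.exp(deltas))   # in [0.5, 1) for delta >= 0
    K = len(deltas)
    return np.array([(sig.sum() - sig[k]) / (K - 1) for k in range(K)])

# e.g. conditional_weights([0.2, 1.5]) -> [0.818, 0.550]: the more
# dissimilar source gets the smaller weight, and both are at least 0.5
```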

4 Experiment

In this section, the multi-modal domain adaptation fusion method is evaluated on benchmark datasets. The benefits of the proposed model are examined through accuracy comparison, ablation experiments, parameter fusion analysis, high-order moment analysis, weight analysis, and parameter sensitivity analysis. Comparative experiments from multiple perspectives assess the performance of the model against several state-of-the-art methods.

4.1 Experimental Setting

In this paper, the experiments are run on a machine with an Intel i7 3.6 GHz CPU and 16 GB of memory. The experimental configurations are described in detail below.

Datasets Four real-world datasets are selected for the experiments: Multilingual Reuters Collection, Office-Home, Office-31, and ImageNet + NUS-WIDE. Table 1 describes each dataset in detail. For the Multilingual Reuters Collection, 100 articles were randomly selected as labeled data for each source domain, and 5 and 500 articles were chosen as the labeled and unlabeled target-domain data, respectively; the features of each article were extracted using a bag-of-words model with TF-IDF. For the Office-Home and Office-31 datasets, three kinds of features are extracted for each image: 800-D SURF (S800), 4096-D DeCAF6 (D4096), and 2048-D ResNet50 (R2048). To construct transfer tasks, three groups of transfer directions are built: D4096, R2048 → S800; S800, R2048 → D4096; and S800, D4096 → R2048. We use the label information of NUS-WIDE and the image data of ImageNet as the text domain (Te) and image domain (Im), and an image-noise domain (In) derived from the Im domain is also constructed.

Table 1 A brief introduction of each dataset

Experimental details We implement PFTS with the TensorFlow framework. During training, the Adam optimizer is used for the feature transformer, classifier, and domain discriminator, with learning rates of 0.004, 0.004, and 0.001, respectively. As for parameter settings, the common feature subspace dimension is set to 256, and the control parameter for the adversarial loss and the regularization parameter are set to 0.03 and 0.004, respectively. A detailed parameter sensitivity analysis is presented in Sect. 4.3.

Evaluation criteria As a classification task, this study needs to determine whether the category assigned to a target-domain sample is consistent with the ground truth. To evaluate the model objectively, we use accuracy as the evaluation index. Denoting a prediction consistent with the true label as Y and an inconsistent one as N yields four cases, as shown in Table 2. Based on these four cases, the evaluation indicator is defined as:

$$ {\text{ACC}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$
(10)
Table 2 Comparison of real and predicted values in the classification scenario
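As a sketch, Eq. (10) computed from illustrative (hypothetical) confusion counts:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (10): proportion of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=40, tn=45, fp=10, fn=5))  # 0.85
```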

4.2 Experimental Results

Baseline models To demonstrate the advantages of the proposed method, several state-of-the-art baselines are compared in the experiments. SVMt trains an SVM using labeled target-domain samples. DAMA [33] utilizes linear projections to map samples into a common subspace for classification. Sb-CDLS [34], Sb-SSAN [35], Sb-TNT [36] and Sb-STN [37] apply the CDLS, SSAN, TNT and STN methods, respectively, to each source–target pair in turn in multi-source heterogeneous transfer. SC-SSAN first projects samples from all source domains into a common subspace, merges them into a new source domain, and then performs SSAN. CWAN learns a feature transformer, a domain discriminator, and a classifier through weighted adversarial training to achieve cross-domain adaptation. JMEA [38] is a semi-supervised heterogeneous domain adaptation (SsHeDA) algorithm that can handle massive datasets. HCSA [39] jointly performs structure preservation, distribution alignment, and classification-space alignment together with feature representation learning by transferring both source-domain representations and model knowledge. CDSPP [40] maps sample features from the source and target domains into a common subspace by learning domain-specific projections, thereby maintaining class consistency and adequately aligning the data distributions.

In the experiments, we set up four comparative scenarios: “No transfer”, “Single-best”, “S-combine”, and advanced multi-modal domain adaptation algorithms. “No transfer” applies no modal alignment. “Single-best” decomposes the multi-modal domain adaptation problem into multiple single-source transfer scenarios, independently performs single-source domain adaptation on each, and reports the best result. “S-combine” maps all source-domain data into a common subspace and then applies single-source approaches to achieve transfer. These settings reveal whether multi-modal domain adaptation is worth exploring.

Tables 3, 4, 5 and 6 show the detailed test accuracies of the different methods on all datasets. PFTS obtains important improvements over the other methods on the different tasks; across the three datasets, its classification accuracy is up to 5.38% higher than CWAN's. From the tables, we can draw the following observations: (1) SVMt performs very poorly, demonstrating the importance of domain alignment when there is a large gap between the source and target domains. (2) Single-best achieves better accuracy than S-combine, which implies that distribution differences among source domains make it difficult to extract transferable features when the sources are forcibly aggregated. (3) DAMA performs worst among the multi-source methods because it uses shallow features. (4) The existing multi-modal domain adaptation methods outperform the other transfer methods, showing that they effectively improve model performance. (5) The test accuracy of D4096, R2048 → S800 is lower than that of S800, R2048 → D4096 and S800, D4096 → R2048, partly because transferring from deep to shallow features leads to excessive feature differences. (6) The proposed PFTS substantially outperforms all baselines on all transfer tasks.

Table 3 Test accuracies of Multilingual Reuters Collection dataset
Table 4 Test accuracies of Office-31 dataset
Table 5 Test accuracies of Office-Home dataset
Table 6 Test accuracies of Im, In → Te for ImageNet + NUS-WIDE dataset

The above results show that PFTS performs well across the different datasets compared with the traditional methods, indicating its superiority. In summary, this paper improves the alignment effect and reduces the domain gap through two-step alignment, and increases the fusion degree of the modalities through parameter fusion.

Feature visualization The visualization of the learned features is shown in Fig. 3, where xs_1 and xs_2 denote the features of the two source domains and xu denotes the unlabeled target-domain data. The source and target samples are well aligned in the common feature space, verifying that our algorithm effectively learns domain-invariant features and achieves better classification results.

Fig. 3 Visualization of the t-SNE embedding of A, D → W (S800, D4096 → R2048) for the Office-31 dataset

4.3 Analysis

Ablation study To analyze the effectiveness of multi-modal parameter fusion, high-order moment measurement, adversarial domain adaptation, and the conditional weighting scheme, ablation studies are conducted; the results on different datasets are shown in Figs. 4 and 5. Specifically, PFTS (AD) denotes the model with the adversarial domain adaptation strategy removed, PFTS (PF) and PFTS (HO) denote the models without parameter fusion and without high-order moment measurement, respectively, and PFTS (CD) removes the conditional weighting scheme.

Fig. 4 Test accuracies of the ablation study on the Office-31 dataset

Fig. 5 Test accuracies of the ablation study on the Multilingual Reuters Collection dataset (the colors have the same meanings as in Fig. 4)

As shown in Figs. 4 and 5, the proposed method achieves the best test accuracy on almost all tasks, and PFTS significantly outperforms its variants. PFTS (AD) is worse than PFTS, implying that adversarial domain adaptation improves domain alignment. PFTS (PF) and PFTS (HO) are worse than PFTS, indicating that parameter fusion and high-order moment measurement further improve fusion and cross-domain alignment. PFTS (CD) is worse than PFTS, confirming the usefulness of the conditional weighting scheme.

Parameter fusion To evaluate the significance of parameter fusion for performance, we test partial-fusion variants in Fig. 6. PFTS (ST) fuses the second-layer parameters only between each source domain and the target domain during training, and PFTS (SS) fuses the second-layer parameters only among the source domains.

Fig. 6 Test accuracies of parameter fusion on the Office-31 and Multilingual Reuters Collection datasets

Both PFTS (ST) and PFTS (SS) perform worse than PFTS, indicating that parameter fusion across all domains strengthens the connections between modalities and improves the final fusion performance, which confirms the rationality of the parameter fusion setting.

High-order moment measure evaluation To assess the effectiveness of the high-order moment measurement, we take four moment orders as an example and report results on the Office-31 and Multilingual Reuters Collection datasets in Fig. 7. PFTS (1), PFTS (1 + 2), PFTS (1 + 2 + 3), and PFTS (1 + 2 + 3 + 4) use only the moment measurements up to the corresponding order in PFTS.

Fig. 7 Test accuracies of the high-order moment measure evaluation on the Office-31 and Multilingual Reuters Collection datasets

The test accuracy is lowest when only the low-order moment measurement is used, and the classification accuracy increases as higher-order moments are added. This indicates that incorporating high-order moment measurement into multi-modal fusion effectively reduces the data differences between modalities and enhances the alignment effect.

Weight evaluation We also evaluate the validity of the weighting scheme. Figure 8 shows the weights of the different source domains and the differences in conditional distribution between the source and target domains, and Fig. 9 compares different weight settings. The results before and after adding weights are compared in the ablation experiment.

Fig. 8 Weight settings and conditional distribution distances of A, W → D on the Office-31 dataset

Fig. 9 Comparison of different weight settings on the Office-31 and Multilingual Reuters Collection datasets

As shown in Fig. 8, the different weights of the source domains reflect their different contributions, illustrating the effectiveness of the conditional weighting mechanism. At the same time, the conditional distribution distance between A and D is relatively small, and correspondingly the weight of A is relatively large. This indicates that a smaller divergence leads to a larger weight, consistent with our weighting principle.

We also compared the conditional weighting mechanism with other weighting mechanisms; the results are shown in Fig. 9. PFTS (M) uses the reciprocal of \(\delta\) in Eq. (8) as the weight. PFTS (Z) derives the weight from the reciprocal of the classification accuracy of each modality, i.e., it considers only the performance of each modality itself. The results show that the conditional weighting mechanism can distinguish the importance of different source domains according to the conditional distribution distance.

Parameter sensitivity analysis The sensitivity analysis of the parameters is shown in Fig. 10. The default parameter settings achieve the highest accuracy, indicating that the default configuration is reasonable. In addition, the default settings achieve good results on all tasks, demonstrating that our algorithm is stable and effective across different experimental settings.

Fig. 10 Parameter sensitivity analysis on F, G → S for the Multilingual Reuters Collection dataset

Across all experiments, it can be observed that traditional multi-modal transfer methods usually measure only low-order moments, whereas our two-step alignment based on parameter fusion achieves excellent alignment results. In addition, we tested the hyperparameters on datasets of different dimensions, and the chosen hyperparameters consistently delivered near-optimal accuracy, showing that the model is robust. In general, our approach has clear advantages in multi-modal data fusion domain adaptation scenarios.

Discussion Compared with other methods under the same settings, the proposed method achieves the best performance. We also investigated the effects of multi-modal parameter fusion, high-order moment measurement, adversarial domain adaptation, and the conditional weighting scheme on model performance, and found that each component effectively improves accuracy. While the effectiveness of the proposed method has been validated experimentally, several points can still be improved. First, this paper conducts fusion at the feature level; fusion at the model and decision levels can be considered in future work. Second, the paper strengthens the connections between modalities and improves model robustness through fusion, but whether this connection expands the complementarity of multi-modal information and achieves complementarity between modalities requires further study. In the future, we will extend the fusion level on the basis of feature fusion and study the information complementarity between modalities in different experimental scenarios.

5 Conclusions

In this work, we propose a new multi-modal domain adaptation method based on parameter fusion and two-step alignment. It establishes cross-modal associations through parameter fusion, realizes cross-modal alignment by means of adversarial domain adaptation and high-order moment measurement, and finally implements weighted transfer through the conditional weighting mechanism. Compared with typical multi-modal fusion methods, the proposed framework takes the data differences between modalities into consideration, which increases the connectivity between modalities and subsequently improves the alignment effect after fusion. Extensive comparative experiments and analyses have verified the effectiveness of the method. In the future, we will continue to explore multiple fusion methods and the data complementarity between modalities.