1 Introduction

Multi-modal learning studies machine learning problems with diverse modal information such as vision, text, and sound. These modalities often come from different sensors and vary greatly in how they are formed and in their internal structures [1]. The heterogeneity of multi-modal data makes it challenging to learn the relevance and complementarity among modalities. Extensive research has been devoted to multi-modal learning, in which multi-modal fusion occupies an important position. Multi-modal fusion [2, 3] is the process of synthesizing information from two or more modalities for prediction. A single modality usually cannot carry all the effective information needed to produce precise predictions, whereas multi-modal fusion widens the coverage of the input information and increases the robustness of the prediction model. According to the stage at which fusion occurs, multi-modal fusion can be divided into pre-fusion, post-fusion, and hybrid fusion. Castellano et al. [4] adopted pre-fusion, integrating the features of all modalities before modeling and thus completing feature-level fusion. Ramirez et al. [5] employed post-fusion, which first models each modality separately and then synthesizes the model outputs or decisions into a final result, completing fusion at the decision level. Lan et al. [6] combined pre-fusion and post-fusion at the feature and decision levels. However, recent multi-modal learning methods perform fusion at these different levels without considering the correlations among modalities.

Transfer learning, the transfer of knowledge from a source domain to a target domain, can help reduce the information differences among modalities. Domain adaptation methods [7, 8], widely used in transfer learning, have achieved great success in recent years by alleviating the domain gap [9,10,11]. For this reason, domain adaptation has been applied to text classification [12, 13] and visual object recognition [14, 15]. Most existing domain adaptation methods assume that data from different domains are represented by features of a common type. In the real world, however, source-domain and target-domain samples usually do not share the same feature representation [16, 17], because domains are often heterogeneous. Moreover, previous multi-modal domain adaptation methods mainly align data distributions at the coarse-grained level, which may lead to unsatisfactory transfer performance. To enhance alignment at all levels, data distribution alignment at the fine-grained level should also be considered.

To solve these problems, we propose a multi-modal domain adaptation method based on parameter fusion and two-step alignment. The main contributions of this paper are summarized as follows:

  1. A multi-modal domain adaptation method based on parameter fusion and two-step alignment is proposed. It consists of three parts: parameter fusion, a two-step alignment operation, and multi-modal conditional weighting, which together address the difficulty of connecting and aligning modalities that arises when the distinct information of each modality is ignored, as in existing research.

  2. Parameter fusion is carried out to strengthen the bonds among modalities and improve the fusion effect.

  3. A two-step alignment strategy is designed to improve the alignment of probability distributions across modalities. In particular, a high-order moment measurement is used to overcome the neglect of fine-grained alignment in existing multi-modal domain adaptation methods.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the multi-modal domain adaptation method. Section 4 analyzes and compares the experimental results. Finally, Sect. 5 draws conclusions and discusses future work.

2 Related Work

In this section, multi-modal learning and domain adaptation are introduced, providing the theoretical background for the proposed method.

2.1 Multi-Modal Learning

Modalities are the natural forms in which information about things in the physical world is represented. Data from multiple sources are semantically correlated and sometimes provide complementary information to each other, resulting in better performance on multi-modal problems [18, 19]. Such systems integrate heterogeneous, disconnected data from disparate sensors, thereby helping to produce more reliable classifications [20]. For example, an emotion detector can gather information from electroencephalogram and eye-movement signals and combine these two data sources to classify a person's current emotion, completing a deep learning task.

Multi-modal learning has progressively developed into the main solution for multimedia parsing, and researchers worldwide have achieved outstanding results in this field. Among these efforts, multi-modal fusion, which aims to integrate information from multiple modalities into a consistent, common model output, is a fundamental topic. Combining multiple modalities yields more comprehensive features, enhances robustness, and ensures that the model can still work effectively when some modalities are missing. The academic community has made significant progress in multi-modal fusion. A remaining difficulty is that each modality can be affected by imbalanced sample distributions, so naively fused information may fail to express the appropriate characteristics. Jing et al. [21] designed an implicit conditional random field, assuming that data from different modalities share a latent structure. By learning this latent shared structure, connections between multi-modal data can be established while the relationship between the structure and the supervised category information is mined, which can be applied to classification tasks. In [22], a multi-modal event topic model is proposed to model social media documents, based mainly on the integration of text and vision; by learning the correlation between textual and visual features, visually representative topics can be distinguished from non-representative ones. A hashing algorithm proposed in [23] integrates multiple weakly supervised modalities into binary codes, enabling classification with kernel functions and SVMs. In [24], multi-layer linear fusion of GPS position information and antenna action information is used to detect system positioning errors. In addition, [25] integrates 2D and 3D data features to obtain complementary information and improve the final results.

2.2 Domain Adaptation

Domain adaptation aims to extend knowledge obtained from a source domain to a target domain, and it involves three main definitions: domain, task, and domain adaptation. A domain D contains two parts: a d-dimensional feature space X and a marginal probability distribution P(x), where \(x=\left\{{x}_{1},{x}_{2}, \dots ,{x}_{n}\right\} \subset X\) is a set of n samples, so a domain is represented as D = \(\left\{X,P(x)\right\}\). A task is composed of two parts: the label space y and the category prediction function \(f(\cdot )\), which predicts the label corresponding to a sample. From the probabilistic perspective, it can be expressed as the conditional probability distribution P(y|x), so a task can be represented as T = \(\{y,P\left(y|x\right)\}\). In transfer learning, it is usually assumed that the task to be solved, often image classification, is the same in the two domains. In heterogeneous transfer, however, the task can be, for example, image-to-text classification.

There are mainly three kinds of domain adaptation approaches. The first is sample-based domain adaptation: by constructing weight factors for samples, the source distribution is brought closer to the target distribution. For example, [26] and [27] estimate the distribution ratio between the source and target domains and use it as the sample weight. The second is feature-based domain adaptation, which facilitates the alignment of probability distributions by learning transferable features and mapping each domain into a common subspace. For example, in [28], a feature transformation is used to reduce the gap between the source and target domains; alternatively, the sample features of both domains can be mapped into a common feature space on which traditional machine learning methods are applied [29]. The third is model-based domain adaptation, which exploits the similarity between models to transfer the source model to the target domain by introducing knowledge of the target error into the source-domain error function. Deng et al. [30] proposed TransRKELM to update the initial model, improving the adaptability of the classifier and obtaining better recognition performance.

The combination of multi-modal learning and domain adaptation has been well studied. However, most existing methods focus only on domain-specific information when eliminating the heterogeneity of each modality and neglect the correlations among modalities. At the same time, these methods involve only coarse-grained information about the data distribution when performing domain alignment. We therefore introduce a multi-modal domain adaptation method based on parameter fusion and two-step alignment (PFTS). To learn bonds among modalities, we set up a parameter fusion mechanism that improves modality integration. After a first phase of adversarial domain adaptation, we further design a high-order moment measurement to refine the alignment with fine-grained information.

3 Multi-Modal Domain Adaptation Method

Figure 1 shows an overview of our proposed multi-modal domain adaptation approach based on parameter fusion and two-step alignment, named PFTS. PFTS consists of three main steps. First, to reduce the heterogeneity among features, a feature transformer maps the multi-modal data into a common subspace so that the feature dimensions of all modalities are consistent. Then, with parameter fusion improving modality integration, adversarial domain adaptation and similarity measurements are performed; this two-step alignment strategy aligns the modalities through adversarial domain adaptation, high-order moment measurement, and task similarity measurement. Finally, the multi-modal classifier, trained with the source-domain weighting mechanism, outputs the final decision.

Fig. 1 General framework of the proposed method

The feature transformer is composed of multiple two-layer neural networks, which map the samples of all modalities into a common subspace of the same dimension, thereby resolving cross-modal heterogeneity. The structure of the feature transformer is shown in Fig. 2.

Fig. 2 The structure of the feature transformer
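To make the structure concrete, the following is a minimal sketch of per-modality feature transformers, assuming fully connected layers; the hidden width and ReLU activation are our assumptions, while the 256-D common subspace and the modality dimensions follow Sect. 4.1.

```python
import tensorflow as tf

def make_feature_transformer(input_dim, subspace_dim=256, hidden_dim=512):
    """Two-layer network mapping one modality into the common subspace.

    The 256-D subspace follows Sect. 4.1; hidden_dim and the ReLU
    activation are illustrative assumptions.
    """
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden_dim, activation="relu",
                              input_shape=(input_dim,)),  # first layer, l1
        tf.keras.layers.Dense(subspace_dim),              # second layer, l2
    ])

# One transformer per modality, e.g. two source modalities (S800, D4096)
# and a target modality (R2048) from the Office-31 tasks in Sect. 4.1.
F_s = [make_feature_transformer(800), make_feature_transformer(4096)]
F_t = make_feature_transformer(2048)
```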

3.1 Multi-Modal Parameter Fusion

In multi-modal learning, the different source and target domains represent different modalities. Because each domain has its own specific structure, the information extracted from it also differs. To establish correlations among the domains, complementary parameter fusion among modalities is necessary. Thus, we adopt identical structures for the second layers of the feature transformers and constrain their parameters. The objective function is as follows

$$ \ell_{p} = \sum\limits_{j = 1}^{J} {\left\| {F_{s,j}^{{l_{2} }} - F_{t}^{{l_{2} }} } \right\|}_{1} + \sum\limits_{j = 2}^{J} {\left\| {F_{s,j}^{{l_{2} }} - F_{s,j - 1}^{{l_{2} }} } \right\|}_{1} $$
(1)

where J is the number of source domains, \(l_1\) and \(l_2\) denote the first and second layers of the network, \({F}_{s,j}^{{l}_{2}}\) denotes the second-layer parameters for the jth source domain, \({F}_{t}^{{l}_{2}}\) denotes the second-layer parameters for the target domain, and \({F}_{s,j-1}^{{l}_{2}}\) denotes the second-layer parameters for the (j−1)th source domain.

By minimizing this loss, the differences between the parameters of different modalities are reduced and parameter consistency is achieved. The method can thus model the correlations between modalities, which is often more practical in complex situations.
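Continuing the sketch above, Eq. (1) can be computed over the second-layer kernels; restricting the constraint to kernel weights (omitting biases) is our assumption.

```python
def parameter_fusion_loss(F_s, F_t):
    """L1 parameter fusion loss of Eq. (1): ties each source second-layer
    kernel to the target's, and neighbouring source kernels to each other."""
    w_t = F_t.layers[-1].kernel                    # F_t^{l2}
    w_s = [m.layers[-1].kernel for m in F_s]       # F_{s,j}^{l2}, j = 1..J
    loss = tf.add_n([tf.reduce_sum(tf.abs(w - w_t)) for w in w_s])
    for j in range(1, len(w_s)):                   # consecutive source pairs
        loss += tf.reduce_sum(tf.abs(w_s[j] - w_s[j - 1]))
    return loss
```

Note that the identical second-layer structure is what makes the element-wise differences well defined: every second-layer kernel has the shape (hidden_dim, subspace_dim).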

3.2 Two-Step Alignment Strategy

3.2.1 Adversarial Domain Adaptation

This paper proposes a two-step alignment strategy, whose first step is adversarial domain adaptation. Recently, adversarial networks have been successfully applied to distribution alignment using two competing components, a domain discriminator and a feature transformer [31, 32]. The domain discriminator distinguishes source samples from target samples, while the feature transformer tries to fool it. Based on these ideas, we design the adversarial domain adaptation scheme: the domain discriminator minimizes, and the feature transformer maximizes, the competitive loss

$$ \ell_{d}^{g} = \frac{1}{n}\sum\limits_{j = 1}^{J} \sum\limits_{i = 1}^{n} L\left[ d\left( F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right) \right), H_{i,j}^{s} \right] + \frac{1}{z}\sum\limits_{i = 1}^{z} L\left[ d\left( F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right) \right), H_{i}^{t} \right] $$
(2)

where d is the domain discriminator, g is the feature transformer, n is the number of samples per source domain, \(L[\cdot ,\cdot ]\) is the squared loss, \({H}_{i,j}^{s}\) is the one-hot domain label of \({x}_{i,j}^{s}\), z is the number of target-domain samples, and \({H}_{i}^{t}\) is the one-hot domain label of \({x}_{i}^{t}\).
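As a sketch, the squared-loss term of Eq. (2) can be written as follows, with the per-domain averages of Eq. (2) folded into batch means; the feature and label tensors are assumed to be precomputed.

```python
import tensorflow as tf

def domain_discrimination_loss(d, feats_src, H_src, feats_tgt, H_tgt):
    """l_d^g of Eq. (2): squared loss between the discriminator's output on
    common-subspace features F^{l2}(F^{l1}(x)) and one-hot domain labels H."""
    l_src = tf.reduce_mean(tf.reduce_sum(tf.square(d(feats_src) - H_src), axis=1))
    l_tgt = tf.reduce_mean(tf.reduce_sum(tf.square(d(feats_tgt) - H_tgt), axis=1))
    return l_src + l_tgt
```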

Subsequently, a classification loss on the label classifier and feature transformer can be formulated as

$$\begin{aligned} \ell_{f}^{g} &= \frac{1}{n}\sum\limits_{j = 1}^{J} \sum\limits_{i = 1}^{n} L_{c} \left[ C\left( F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right) \right), y_{i,j}^{s} \right]\\ &\quad + \frac{1}{z}\sum\limits_{i = 1}^{z} L_{c} \left[ C\left( F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right) \right), y_{i}^{t} \right] + \tau \left( \left\| f \right\|_{2} + \left\| g \right\|_{2} \right)\end{aligned} $$
(3)

where f is the classifier, \({L}_{c}[\cdot ,\cdot ]\) is the cross-entropy loss, τ is a positive regularization parameter that prevents overfitting, and \(C(\cdot )\) is the output of the classifier. By optimizing \(C(\cdot )\), the classification loss is minimized and the feature transformer achieves better discriminability. Integrating (2) and (3), we obtain

$$ \min_{f,g} \max_{d} \left( \ell_{f}^{g} - \gamma \ell_{d}^{g} \right) $$
(4)

where γ is the trade-off parameter between the label classifier and the domain discriminator. Although this objective aligns the distributions of the source and target domains, only the marginal distributions are matched because label information is neglected. This deficiency is addressed by the multi-modal conditional weighting in Sect. 3.3.
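The alternating optimization of Eq. (4) can be sketched as follows, reusing make_feature_transformer from above. The classifier and discriminator widths (6 classes, 3 domains), the single mixed batch, and the squared-norm regularizer standing in for \(\|f\|_2 + \|g\|_2\) are assumptions; γ = 0.03, τ = 0.004 and the learning rates follow Sect. 4.1.

```python
g = make_feature_transformer(800)                     # feature transformer (one modality shown)
f = tf.keras.layers.Dense(6, activation="softmax")    # label classifier; 6 classes assumed
d = tf.keras.layers.Dense(3)                          # domain discriminator; 3 domains assumed
opt_fg = tf.keras.optimizers.Adam(0.004)              # learning rates from Sect. 4.1
opt_d = tf.keras.optimizers.Adam(0.001)

def train_step(x, y_onehot, h_onehot, gamma=0.03, tau=0.004):
    """One alternating update of min_{f,g} max_d (l_f^g - gamma * l_d^g)."""
    with tf.GradientTape(persistent=True) as tape:
        feats = g(x)
        l_d = tf.reduce_mean(tf.reduce_sum(tf.square(d(feats) - h_onehot), axis=1))
        l_f = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_onehot, f(feats)))
        # squared-norm regularizer, a stand-in for tau * (||f|| + ||g||)
        l_f += tau * tf.add_n([tf.reduce_sum(tf.square(v))
                               for v in f.trainable_variables + g.trainable_variables])
        obj = l_f - gamma * l_d                       # Eq. (4)
    vars_fg = g.trainable_variables + f.trainable_variables
    opt_fg.apply_gradients(zip(tape.gradient(obj, vars_fg), vars_fg))   # f, g minimise
    opt_d.apply_gradients(zip(tape.gradient(-obj, d.trainable_variables),
                              d.trainable_variables))                   # d maximises
    del tape
    return l_f, l_d
```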

3.2.2 Similarity Measurement

The second step optimizes the model by exploiting the task similarity and distribution similarity between domains on top of adversarial domain adaptation. In transfer learning, it is not only difficult to measure the similarity of sample distributions between domains; negative transfer may also occur when the correlation between tasks in different domains cannot be judged. It is therefore necessary to measure both the task similarity and the distribution similarity between the source and target domains. Taking image classification as an example, the task of each domain is to assign labels, so measuring the similarity between tasks amounts to measuring the similarity between label assignments. We thus introduce a task similarity measurement between domains to gauge the transferability of each source domain.

$$\begin{aligned} \ell_{Q} &= \frac{1}{n}\sum\limits_{j = 1}^{J - 1} \sum\limits_{i = 1}^{n} C\left[ F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right), y_{i,j}^{s} \right] - \frac{1}{n}\sum\limits_{j = 2}^{J} \sum\limits_{i = 1}^{n} C\left[ F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right), y_{i,j}^{s} \right]\\ &\quad - \frac{1}{z}\sum\limits_{i = 1}^{z} C\left[ F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right), y_{i}^{t} \right],\end{aligned} $$
(5)

At the same time, we introduce a high-order moment measurement into multi-modal fusion to improve the robustness of the model. Assume that the features of the source and target domains follow probability distributions p and q, respectively. The high-order moment measurement then quantifies the distance between domains, as given by (6)

$$ \begin{aligned} \ell_{s} &= \frac{1}{n}\sum\limits_{j = 1}^{J} \sum\limits_{i = 1}^{n} \frac{1}{\left| b - a \right|} \left\| E\left( F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right) \right) - E\left( F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right) \right) \right\|_{2}\\ &\quad + \sum\limits_{\zeta = 2}^{Z} \frac{1}{\left| b - a \right|^{\zeta}} \left\| M_{\zeta} \left( F_{s,j}^{l_{2}} \left( F_{s,j}^{l_{1}} \left( x_{i,j}^{s} \right) \right) \right) - M_{\zeta} \left( F_{t}^{l_{2}} \left( F_{t}^{l_{1}} \left( x_{i}^{t} \right) \right) \right) \right\|_{2} ,\end{aligned} $$
(6)

where E(·) is the expectation vector computed on the data features, Z is the highest moment order considered, [a, b] is an interval bounding the feature values, and the ζth-order central moment is

$$ M_{\zeta } \left( x \right) = \left( E\left( \prod\limits_{i = 1}^{N} \left( x_{i} - E\left( x_{i} \right) \right)^{\lambda_{i}} \right) \right)_{\lambda_{i} \ge 0,\ \sum\nolimits_{i = 1}^{N} \lambda_{i} = \zeta } . $$
(7)
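A sketch of Eqs. (6)–(7) for one source–target pair follows; restricting Eq. (7) to marginal (per-dimension) central moments, i.e. λ with a single non-zero entry, and bounding features to [0, 1] are simplifying assumptions.

```python
import tensorflow as tf

def high_order_moment_loss(feats_s, feats_t, max_order=4, a=0.0, b=1.0):
    """Sketch of Eq. (6): matches the means and per-dimension central
    moments of source and target common-subspace features up to
    `max_order` (four orders, as in the PFTS (1+2+3+4) variant of
    Sect. 4.3); [a, b] is the assumed feature range."""
    scale = tf.abs(b - a)
    mean_s = tf.reduce_mean(feats_s, axis=0)
    mean_t = tf.reduce_mean(feats_t, axis=0)
    loss = tf.norm(mean_s - mean_t) / scale          # first-order term
    for zeta in range(2, max_order + 1):             # higher-order terms
        m_s = tf.reduce_mean((feats_s - mean_s) ** zeta, axis=0)
        m_t = tf.reduce_mean((feats_t - mean_t) ** zeta, axis=0)
        loss += tf.norm(m_s - m_t) / scale ** zeta
    return loss
```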

3.3 Multi-Modal Conditional Weighting

Multiple modalities are not necessarily of equal importance. To account for this, we use a conditional weighting scheme to measure the importance of each source domain. The proposed scheme quantifies the contribution of each modality and, more importantly, also aligns the conditional distributions. The class-conditional MMD implemented with linear kernels has been verified as an effective method for measuring the difference in conditional distributions between modalities [32], and it can be used as a nonparametric estimate of the gap between two conditional distributions. Specifically, the sum of the distances between all class centers of the source and target domains models the difference in conditional distribution, so the degree of dissimilarity between the kth source domain and the target domain is given by (8)

$$ \delta_{k} = \frac{1}{C}\sum\limits_{c = 1}^{C} \left\| \frac{\sum\limits_{i = 1}^{n_{l}^{c}} F_{t} \left( x_{i,c}^{l} \right) + \sum\limits_{i = 1}^{n_{u}} \tilde{y}_{i,c}^{u} F_{t} \left( x_{i}^{u} \right)}{n_{l}^{c} + \sum\limits_{i = 1}^{n_{u}} \tilde{y}_{i,c}^{u}} - \frac{1}{n_{s,k}^{c}}\sum\limits_{i = 1}^{n_{s,k}^{c}} F_{s,k} \left( x_{i,c}^{s,k} \right) \right\|^{2} $$
(8)

where \({x}_{i,c}^{s,k}\) is the ith sample of class c in the kth source domain, \({x}_{i,c}^{l}\) is the ith labeled target sample of class c, \({x}_{i}^{u}\) is the ith unlabeled target sample, \({\widetilde{y}}_{i,c}^{u}\) is the predicted probability that \({x}_{i}^{u}\) belongs to class c, \({n}_{s,k}^{c}\) is the number of samples of class c in the kth source domain, \({n}_{u}\) is the number of unlabeled target samples, and \({n}_{l}^{c}\) is the number of labeled target samples of class c. \(\delta_{k}\) models the degree of dissimilarity between the source and target domains through the difference of their conditional distributions: the smaller \(\delta_{k}\) is, the smaller the difference in conditional distribution between the domains.
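A NumPy sketch of Eq. (8) follows; that the pseudo-label probabilities \(\tilde{y}^u\) come from the current classifier's soft outputs is our assumption.

```python
import numpy as np

def dissimilarity(feat_s_by_class, feat_l_by_class, feat_u, y_tilde):
    """delta_k of Eq. (8): mean squared distance between the per-class
    centers of the k-th source domain and the target domain.

    feat_s_by_class / feat_l_by_class: lists of arrays, one per class,
    holding source / labeled-target features; feat_u: unlabeled target
    features (n_u, dim); y_tilde: class probabilities (n_u, C)."""
    C = len(feat_s_by_class)
    delta = 0.0
    for c in range(C):
        # target class center: labeled samples plus probability-weighted
        # unlabeled samples
        num = feat_l_by_class[c].sum(axis=0) + (y_tilde[:, c:c + 1] * feat_u).sum(axis=0)
        den = len(feat_l_by_class[c]) + y_tilde[:, c].sum()
        center_t = num / den
        center_s = feat_s_by_class[c].mean(axis=0)   # source class center
        delta += np.sum((center_t - center_s) ** 2)
    return delta / C
```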

It is expected that the greater the dissimilarity between a source domain and the target domain, the smaller \({\omega }_{k}\) should be. A monotonically increasing function of the dissimilarities is therefore designed to compute the weight \({\omega }_{k}\). Because \({\omega }_{k}\) has a minimum value of 0.5, the scheme assigns a weight to every source domain while reducing the importance of dissimilar ones.

$$ \omega_{k} = \frac{1}{K - 1}\sum\limits_{\begin{subarray}{l} j = 1 \\ j \ne k \end{subarray} }^{K} {\frac{{\exp \left( {\delta_{j} } \right)}}{{1 + \exp \left( {\delta_{j} } \right)}}} $$
(9)

where \({\delta }_{j}\) is the dissimilarity of the jth source domain computed by (8), and K is the number of source domains.
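A sketch of Eq. (9); the example dissimilarity values are hypothetical.

```python
import numpy as np

def conditional_weights(deltas):
    """Eq. (9): omega_k averages sigmoid(delta_j) over the other source
    domains j != k, so a source with a large dissimilarity delta_k
    (which its own weight excludes) ends up relatively down-weighted."""
    deltas = np.asarray(deltas, dtype=float)
    sig = np.exp(deltas) / (1.0 + np.exp(deltas))   # in [0.5, 1) for delta >= 0
    K = len(deltas)
    return np.array([(sig.sum() - sig[k]) / (K - 1) for k in range(K)])

# e.g. conditional_weights([0.2, 1.5]) -> [0.818, 0.550]: the more
# dissimilar source gets the smaller weight, and both are at least 0.5
```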

4 Experiment

In this section, the multi-modal domain adaptation fusion method is evaluated on benchmark datasets. The benefits of the proposed model are examined through accuracy comparison, ablation experiments, parameter fusion analysis, high-order moment analysis, weight analysis, and parameter sensitivity analysis. Comparative experiments from multiple perspectives assess the performance of the model against several state-of-the-art methods.

4.1 Experimental Setting

In this paper, the experiments are run on a machine with an Intel i7 3.6 GHz CPU and 16 GB of memory. The experimental configurations are described in detail below.

Datasets Four real-world datasets are selected for the experiments: Multilingual Reuters Collection, Office-Home, Office-31, and ImageNet + NUS-WIDE. Table 1 describes each dataset in detail. For the Multilingual Reuters Collection, 100 articles were randomly selected as labeled data for each source domain, and 5 and 500 articles were chosen as the labeled and unlabeled target-domain data, respectively; the features of each article were extracted using a bag-of-words model with TF-IDF. For the Office-Home and Office-31 datasets, three kinds of features are extracted for each image: 800-D SURF (S800), 4096-D DeCAF6 (D4096), and 2048-D ResNet50 (R2048). To construct transfer tasks, three groups of transfer directions are built: D4096, R2048 → S800; S800, R2048 → D4096; and S800, D4096 → R2048. We use the label information of NUS-WIDE and the image data of ImageNet as the text domain (Te) and image domain (Im), and an image-noise domain (In) derived from the Im domain is also constructed.

Table 1 A brief introduction of each dataset

Experimental details We implement PFTS with the TensorFlow framework. During training, the Adam optimizer is used for the feature transformer, classifier, and domain discriminator, with learning rates of 0.004, 0.004, and 0.001, respectively. As for parameter settings, the common feature subspace dimension is set to 256, and the control parameter for the adversarial loss and the regularization parameter are set to 0.03 and 0.004, respectively. A detailed parameter sensitivity analysis is presented in Sect. 4.3.

Evaluation criteria As a classification task, this study needs to determine whether the category assigned to a target-domain sample is consistent with the ground truth. To evaluate the model objectively, we use accuracy as the evaluation index. Denoting a prediction consistent with the true label as Y and an inconsistent one as N yields four cases, as shown in Table 2. Based on these four cases, the evaluation indicator is defined as:

$$ {\text{ACC}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$
(10)
Table 2 Comparison of real and predicted values in the classification scenario
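As a sketch, Eq. (10) computed from illustrative (hypothetical) confusion counts:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (10): proportion of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=40, tn=45, fp=10, fn=5))  # 0.85
```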

4.2 Experimental Results

Baseline models To demonstrate the advantages of the proposed method, several state-of-the-art baselines are compared in the experiments. SVMt trains an SVM using labeled target-domain samples. DAMA [33] utilizes linear projections to map samples into a common subspace for classification. Sb-CDLS [34], Sb-SSAN [35], Sb-TNT [36] and Sb-STN [37] apply the CDLS, SSAN, TNT and STN methods, respectively, to each source–target pair in turn in multi-source heterogeneous transfer. SC-SSAN first projects samples from all source domains into a common subspace, merges them into a new source domain, and then performs SSAN. CWAN learns a feature transformer, a domain discriminator, and a classifier through weighted adversarial training to achieve cross-domain adaptation. JMEA [38] is a semi-supervised heterogeneous domain adaptation (SsHeDA) algorithm that can handle massive datasets. HCSA [39] jointly performs structure preservation, distribution alignment, and classification-space alignment together with feature representation learning by transferring both source-domain representations and model knowledge. CDSPP [40] maps sample features from the source and target domains into a common subspace by learning domain-specific projections, thereby maintaining class consistency and adequately aligning the data distributions.

In the experiments, we set up four comparative scenarios: “No transfer”, “Single-best”, “S-combine”, and advanced multi-modal domain adaptation algorithms. “No transfer” applies no modal alignment. “Single-best” decomposes the multi-modal domain adaptation problem into multiple single-source transfer scenarios, independently performs single-source domain adaptation on each, and reports the best result. “S-combine” maps all source-domain data into a common subspace and then applies single-source approaches to achieve transfer. These settings reveal whether multi-modal domain adaptation is worth exploring.

Tables 3, 4, 5 and 6 show the detailed test accuracies of the different methods on all datasets. PFTS obtains important improvements over the other methods on the different tasks; across the three datasets, its classification accuracy is up to 5.38% higher than CWAN's. From the tables, we can draw the following observations: (1) SVMt performs very poorly, demonstrating the importance of domain alignment when there is a large gap between the source and target domains. (2) Single-best achieves better accuracy than S-combine, which implies that distribution differences among source domains make it difficult to extract transferable features when the sources are forcibly aggregated. (3) DAMA performs worst among the multi-source methods because it uses shallow features. (4) The existing multi-modal domain adaptation methods outperform the other transfer methods, showing that they effectively improve model performance. (5) The test accuracy of D4096, R2048 → S800 is lower than that of S800, R2048 → D4096 and S800, D4096 → R2048, partly because transferring from deep to shallow features leads to excessive feature differences. (6) The proposed PFTS substantially outperforms all baselines on all transfer tasks.

Table 3 Test accuracies of Multilingual Reuters Collection dataset
Table 4 Test accuracies of Office-31 dataset
Table 5 Test accuracies of Office-Home dataset
Table 6 Test accuracies of Im, In → Te for ImageNet + NUS-WIDE dataset

The above results show that PFTS performs well across the different datasets compared with the traditional methods, indicating its superiority. In summary, this paper improves the alignment effect and reduces the domain gap through two-step alignment, and increases the fusion degree of the modalities through parameter fusion.

Feature visualization The visualization of the learned features is shown in Fig. 3, where xs_1 and xs_2 denote the features of the two source domains and xu denotes the unlabeled target-domain data. The source and target samples are well aligned in the common feature space, verifying that our algorithm effectively learns domain-invariant features and achieves better classification results.

Fig. 3 Visualization of the t-SNE embedding of A, D → W (S800, D4096 → R2048) for the Office-31 dataset

4.3 Analysis

Ablation study To analyze the effectiveness of multi-modal parameter fusion, high-order moment measurement, adversarial domain adaptation, and the conditional weighting scheme, ablation studies are conducted; the results on different datasets are shown in Figs. 4 and 5. Specifically, PFTS (AD) denotes the model with the adversarial domain adaptation strategy removed, PFTS (PF) and PFTS (HO) denote the models without parameter fusion and without high-order moment measurement, respectively, and PFTS (CD) removes the conditional weighting scheme.

Fig. 4 Test accuracies of the ablation study on the Office-31 dataset

Fig. 5 Test accuracies of the ablation study on the Multilingual Reuters Collection dataset (the colors have the same meanings as in Fig. 4)

As shown in Figs. 4 and 5, the proposed method achieves the best test accuracy on almost all tasks, and PFTS significantly outperforms its variants. PFTS (AD) is worse than PFTS, implying that adversarial domain adaptation improves domain alignment. PFTS (PF) and PFTS (HO) are worse than PFTS, indicating that parameter fusion and high-order moment measurement further improve fusion and cross-domain alignment. PFTS (CD) is worse than PFTS, confirming the usefulness of the conditional weighting scheme.

Parameter fusion To evaluate the significance of parameter fusion for performance, we test partial-fusion variants in Fig. 6. PFTS (ST) fuses the second-layer parameters only between each source domain and the target domain during training, and PFTS (SS) fuses the second-layer parameters only among the source domains.

Fig. 6 Test accuracies of parameter fusion on the Office-31 and Multilingual Reuters Collection datasets

Both PFTS (ST) and PFTS (SS) perform worse than PFTS, indicating that parameter fusion across all domains strengthens the connections between modalities and improves the final fusion performance, which confirms the rationality of the parameter fusion setting.

High-order moment measure evaluation To assess the effectiveness of the high-order moment measurement, we take four moment orders as an example and report results on the Office-31 and Multilingual Reuters Collection datasets in Fig. 7. PFTS (1), PFTS (1 + 2), PFTS (1 + 2 + 3), and PFTS (1 + 2 + 3 + 4) use only the moment measurements up to the corresponding order in PFTS.

Fig. 7 Test accuracies of the high-order moment measure evaluation on the Office-31 and Multilingual Reuters Collection datasets

The test accuracy is lowest when only the low-order moment measurement is used, and the classification accuracy increases as higher-order moments are added. This indicates that incorporating high-order moment measurement into multi-modal fusion effectively reduces the data differences between modalities and enhances the alignment effect.

Weight evaluation We also evaluate the validity of the weighting scheme. Figure 8 shows the weights of the different source domains and the differences in conditional distribution between the source and target domains, and Fig. 9 compares different weight settings. The results before and after adding weights are compared in the ablation experiment.

Fig. 8 Weight settings and conditional distribution distances of A, W → D on the Office-31 dataset

Fig. 9 Comparison of different weight settings on the Office-31 and Multilingual Reuters Collection datasets

As shown in Fig. 8, the different weights of the source domains reflect their different contributions, illustrating the effectiveness of the conditional weighting mechanism. At the same time, the conditional distribution distance between A and D is relatively small, and correspondingly the weight of A is relatively large. This indicates that a smaller divergence leads to a larger weight, consistent with our weighting principle.

We also compared the conditional weighting mechanism with other weighting mechanisms; the results are shown in Fig. 9. PFTS (M) uses the reciprocal of \(\delta\) in Eq. (8) as the weight. PFTS (Z) derives the weight from the reciprocal of the classification accuracy of each modality, i.e., it considers only the performance of each modality itself. The results show that the conditional weighting mechanism can distinguish the importance of different source domains according to the conditional distribution distance.

Parameter sensitivity analysis The sensitivity analysis of the parameters is shown in Fig. 10. The default parameter settings achieve the highest accuracy, indicating that the default configuration is reasonable. In addition, the default settings achieve good results on all tasks, demonstrating that our algorithm is stable and effective across different experimental settings.

Fig. 10 Parameter sensitivity analysis on F, G → S for the Multilingual Reuters Collection dataset

Across all experiments, it can be observed that traditional multi-modal transfer methods usually measure only low-order moments, whereas our two-step alignment based on parameter fusion achieves excellent alignment results. In addition, we tested the hyperparameters on datasets of different dimensions, and the chosen hyperparameters consistently delivered near-optimal accuracy, showing that the model is robust. In general, our approach has clear advantages in multi-modal data fusion domain adaptation scenarios.

Discussion Compared with other methods under the same settings, the proposed method achieves the best performance. We also investigated the effects of multi-modal parameter fusion, high-order moment measurement, adversarial domain adaptation, and the conditional weighting scheme on model performance, and found that each component effectively improves accuracy. While the effectiveness of the proposed method has been validated experimentally, several points can still be improved. First, this paper conducts fusion at the feature level; fusion at the model and decision levels can be considered in future work. Second, the paper strengthens the connections between modalities and improves model robustness through fusion, but whether this connection expands the complementarity of multi-modal information and achieves complementarity between modalities requires further study. In the future, we will extend the fusion level on the basis of feature fusion and study the information complementarity between modalities in different experimental scenarios.

5 Conclusions

In this work, we propose a new multi-modal domain adaptation method based on parameter fusion and two-step alignment. It establishes cross-modal associations through parameter fusion, realizes cross-modal alignment by means of adversarial domain adaptation and high-order moment measurement, and finally implements weighted transfer through the conditional weighting mechanism. Compared with typical multi-modal fusion methods, the proposed framework takes the data differences between modalities into consideration, which increases the connectivity between modalities and subsequently improves the alignment effect after fusion. Extensive comparative experiments and analyses have verified the effectiveness of the method. In the future, we will continue to explore multiple fusion methods and the data complementarity between modalities.