Learning sample-aware threshold for semi-supervised learning

Abstract

Pseudo-labeling methods are popular in semi-supervised learning (SSL). Their performance relies heavily on a proper threshold for generating hard labels for unlabeled data. To this end, most existing studies resort to a manually pre-specified function to adjust the threshold, which requires prior knowledge and suffers from scalability issues. In this paper, we propose a novel method named Meta-Threshold, which learns a dynamic confidence threshold for each unlabeled instance and requires no extra hyperparameters except a learning rate. Specifically, the instance-level confidence threshold is learned automatically by an extra network in a meta-learning manner. Treating the limited labeled data as meta-data, the overall training objective of the classifier network and the meta-net can be formulated as a nested optimization problem and solved by a bi-level optimization scheme. Furthermore, by replacing the indicator function used in pseudo-labeling with a surrogate function, we theoretically establish the convergence of our training procedure, discuss its training complexity, and propose a strategy to reduce its time cost. Extensive experiments and analyses demonstrate the effectiveness of our method on both typical and imbalanced SSL tasks.
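
As a rough, code-level illustration of the idea (every name and design choice below is an assumption for illustration, not the authors' implementation), the threshold-generating meta-net can be pictured as a small network that maps each unlabeled instance's predicted class distribution to a scalar confidence threshold in (0, 1):

import torch.nn as nn

class ThresholdNet(nn.Module):
    """Illustrative sample-aware threshold generator (architecture assumed, not from the paper)."""
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),          # one threshold in (0, 1) per instance
        )

    def forward(self, probs):      # probs: classifier softmax output, shape (B, C)
        return self.net(probs)     # shape (B, 1): per-instance confidence threshold

During training, an unlabeled sample contributes a pseudo-label loss only when its weak-augmentation confidence exceeds its own learned threshold, and the meta-net itself is updated through the bi-level scheme detailed in Appendix A.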

Availability of data and materials

Not applicable.

Funding

This research was supported by the Natural Science Foundation of China (No. 62106129, 62176139, 62106028), the Natural Science Foundation of Shandong Province (No. ZR2021QF053, ZR2021ZD15), the Chongqing Overseas Chinese Entrepreneurship and Innovation Support Program, and the CAAI-Huawei MindSpore Open Fund.

Author information

Contributions

Conceptualization: W-Q; Methodology: W-Q; Theoretical analysis: F-L; Writing-original draft preparation: W-Q, S-HL; Writing-review and editing: W-R, H-RD; Funding acquisition: S-HL, F-L, Y-YL.

Corresponding author

Correspondence to Qi Wei.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Code availability

Not applicable.

Additional information

Editors: Vu Nguyen, Dani Yogatama.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Theoretical proof of our method

A.1 Proofs of smoothness

Given a small meta dataset of n labeled samples \(\{(\textbf{x}_1^l, \textbf{y}_1^l),...,(\textbf{x}_n^l, \textbf{y}_n^l)\}\) and an unlabeled dataset \(\{\textbf{x}_1,...,\textbf{x}_{\mu n}\}\) of size \(\mu \times n\), and replacing the indicator function with the surrogate function, the meta loss is \(L_\textrm{meta}(\textbf{w}^*({\Theta })) = \frac{1}{n} \sum \nolimits _{i=1}^n H(\textbf{y}_i^l, f(\textbf{x}_i^l; \textbf{w}^*({\Theta })))\) and the training loss is

$$\begin{aligned} L_{train}(\textbf{w},\Theta ) = \frac{1}{n\mu } \sum \nolimits _{i=1}^{n\mu } \mathcal {S}_i(\textbf{w}, \Theta ) \cdot H(\hat{\textbf{y}}_i, f(\mathcal {A}^s(\textbf{x}_i); \textbf{w})), \end{aligned}$$
(A1)

where \(\mathcal {S}_i(\textbf{w}, \Theta ) = \mathcal {S}( \max (f(\mathcal {A}^w(\textbf{x}_i); \textbf{w})) - \mathcal {V}_i(\textbf{w}, \Theta ))\) is the surrogate replacing the indicator \(\mathbbm {1}(\max (f(\mathcal {A}^w(\textbf{x}_i); \textbf{w})) > \mathcal {V}_i(\textbf{w}, \Theta ))\).
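
For concreteness, here is a minimal PyTorch-style sketch of the surrogate-gated training loss in Eq. (A1); the sigmoid form of \(\mathcal {S}\), the temperature tau, and all names are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn.functional as F

def surrogate_unlabeled_loss(logits_weak, logits_strong, thresholds, tau=10.0):
    # logits_weak:   f(A^w(x_i); w), shape (B, C)
    # logits_strong: f(A^s(x_i); w), shape (B, C)
    # thresholds:    per-instance thresholds V_i(w, Theta), shape (B,)
    # tau:           temperature of the sigmoid surrogate S (assumed value)
    probs_weak = logits_weak.softmax(dim=-1)
    confidence, pseudo_labels = probs_weak.max(dim=-1)           # max f(A^w(x_i); w) and hat{y}_i
    gate = torch.sigmoid(tau * (confidence - thresholds))        # S_i(w, Theta): soft indicator
    ce = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")  # H(hat{y}_i, f(A^s(x_i); w))
    return (gate * ce).mean()                                    # L_train of Eq. (A1)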

Firstly, we recall the update equation of the parameters of TGN as follows:

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} - \psi \frac{1}{n} \sum \nolimits _{i=1}^{n} \nabla _{\Theta } H(\textbf{y}_i^l, f(\textbf{x}_i^l; {\hat{{{\textbf{w}}}}}^{(t)}(\Theta ))). \end{aligned}$$
(A2)

For conciseness, we write \(H(\textbf{y}_i^l, f(\textbf{x}_i^l; {\hat{{{\textbf{w}}}}}^{(t)}(\Theta )))\) as \(H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\). The backpropagation computation for the above update can then be written as

$$\begin{aligned} \begin{aligned}&\frac{1}{n} \sum \nolimits _{i=1}^{n} \nabla _{\Theta } H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = \frac{1}{n} \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}} \sum \nolimits _{j=1}^{n\mu } \frac{\partial \hat{{{\textbf{w}}}}^{(t)}(\Theta )}{\partial \mathcal {S}_j(\textbf{w}^{(t)}; \Theta )} \, \frac{\partial \mathcal {S}_j(\textbf{w}^{(t)}; \Theta )}{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )} \, \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \\ =&\frac{-\alpha }{n^2\mu } \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}} \sum \nolimits _{j=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \\ =&\frac{- \alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{1}{n} \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned} \end{aligned}$$
(A3)

Let \(G_{ij} = \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}}\) and substitute \(G_{ij}\) into Eq.  (A3), then

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} + \frac{\alpha \psi }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{1}{n} \sum \nolimits _{i=1}^{n} G_{ij} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned}$$
(A4)
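
In implementation terms, the chain of partial derivatives in Eq. (A3) need not be assembled by hand: performing a one-step "virtual" update of \(\textbf{w}\) inside the autograd graph and backpropagating the meta loss to \(\Theta\) corresponds to the update in Eq. (A2). The sketch below (PyTorch-style, with a toy linear classifier; the surrogate, interfaces, and names are illustrative assumptions) shows one such \(\Theta\) update.

import torch
import torch.nn.functional as F

def meta_threshold_update(w, theta_net, labeled, unlabeled, alpha=0.01, psi=1e-3, tau=10.0):
    # w:         [W, b] classifier tensors with requires_grad=True (toy functional model)
    # theta_net: threshold-generating network with parameters Theta (e.g., a small MLP)
    # labeled:   meta-data batch (x_l, y_l); unlabeled: weak/strong augmented views (x_w, x_s)
    x_l, y_l = labeled
    x_w, x_s = unlabeled

    def f(x, params):                       # f(x; w): toy one-layer classifier
        W, b = params
        return x @ W + b

    # surrogate-gated unlabeled loss L_train(w, Theta), cf. Eq. (A1)
    probs_w = f(x_w, w).softmax(dim=-1)
    confidence, pseudo = probs_w.max(dim=-1)
    thresholds = theta_net(probs_w).squeeze(-1)                  # V_i(w, Theta)
    gate = torch.sigmoid(tau * (confidence - thresholds))        # S_i(w, Theta)
    l_train = (gate * F.cross_entropy(f(x_s, w), pseudo, reduction="none")).mean()

    # virtual step hat{w}(Theta) = w - alpha * dL_train/dw, keeping the graph w.r.t. Theta
    grads_w = torch.autograd.grad(l_train, w, create_graph=True)
    w_hat = [wi - alpha * gi for wi, gi in zip(w, grads_w)]

    # meta loss on labeled meta-data at hat{w}(Theta); its Theta-gradient realizes Eq. (A2)
    l_meta = F.cross_entropy(f(x_l, w_hat), y_l)
    grads_theta = torch.autograd.grad(l_meta, list(theta_net.parameters()))
    with torch.no_grad():
        for p, g in zip(theta_net.parameters(), grads_theta):
            p.sub_(psi * g)                                      # Theta^(t+1) = Theta^(t) - psi * grad
    return float(l_meta)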

Proof

The gradient of the meta loss with respect to \(\Theta\) can be formulated as:

$$\begin{aligned} \begin{aligned}&\nabla _{\Theta } H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = -\frac{\alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{\partial H^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned} \end{aligned}$$
(A5)

Let \(\mathcal {V}_j(\Theta ) = \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )\) and recall \(G_{ij}\) as defined before Eq. (A4). Taking the gradient with respect to \(\Theta\) on both sides of Eq. (A5), we obtain

$$\begin{aligned} \begin{aligned} \nabla _{\Theta ^2}^2 H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = -\frac{\alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {[} \frac{\partial }{\partial \Theta } (G_{ij}) \Big |_{\Theta ^{(t)}} \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} + (G_{ij}) \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \bigg {]}. \end{aligned} \end{aligned}$$
(A6)

The first term on the right-hand side of Eq. (A6) can be bounded as

$$\begin{aligned} \begin{aligned}&\left\| \frac{\partial }{\partial \Theta } (G_{ij}) \Big |_{\Theta ^{(t)}} \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \, \delta \left\| \frac{\partial }{\partial \hat{{{\textbf{w}}}}} \bigg {(} \frac{\partial H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ =&\, \delta \left\| \frac{\partial }{\partial \hat{{{\textbf{w}}}}} \bigg {(} \frac{\partial H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}} \Big |_{\hat{{{\textbf{w}}}}^{(t)}} \, \frac{-\alpha }{n\mu } \sum \nolimits _{k=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_k}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_k(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ =&\, \delta \left\| \bigg {(} \frac{\partial ^2 H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}^2} \Big |_{\hat{{{\textbf{w}}}}^{(t)}} \, \frac{-\alpha }{n\mu } \sum \nolimits _{k=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_k}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_k(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ \le&\, \alpha L \delta ^2 \phi ^2 \zeta ^2, \end{aligned} \end{aligned}$$
(A7)

since \({\left\| \frac{\partial H(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \right\| \le \rho , \left\| \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \right\| \le \phi , \left\| \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}} \right\| \le \zeta , \left\| \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \big |_{\Theta ^{(t)}} \right\| \le \delta , \left\| \frac{\partial ^2 H^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}^2} \big |_{\hat{{{\textbf{w}}}}^{(t)}} \right\| \le L, \left\| \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \mathcal {B}}\).

The second term on the right-hand side of Eq. (A6) can be bounded as

$$\begin{aligned} \begin{aligned} \left\| (G_{ij}) \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| = \left\| \frac{\partial H^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}} \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \, \rho \phi \zeta \mathcal {B}. \end{aligned} \end{aligned}$$
(A8)

Combining the results in Eqs. (A7) and (A8), we have \(\left\| \nabla _{\Theta ^2}^2 H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} \right\| \le \phi \zeta (\alpha L \delta ^2 \phi \zeta + \rho \mathcal {B})\). Defining \({{{\hat{L}}}} = \phi \zeta (\alpha L \delta ^2 \phi \zeta + \rho \mathcal {B})\) and applying the Lagrange mean value theorem, we have:

$$\begin{aligned} \left\| \nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _1))} - \nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _2))} \right\| \le {{{\hat{L}}}} \left\| \Theta _1 - \Theta _2 \right\| , \, \text {for all} \, \Theta _1, \Theta _2, \end{aligned}$$
(A9)

where \(\nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _1))} = \nabla _\Theta {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))}\Big |_{\Theta _1}\). \(\square\)

A.2 Proofs of convergence

Proof

The update of the parameters \(\Theta\) at the t-th iteration can be written as \(\Theta ^{(t+1)} = \Theta ^{(t)} - \psi \frac{1}{n} \sum \nolimits _{i=1}^n \nabla _\Theta H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}}.\) Training with a mini-batch of meta-data \(\textrm{B}_t\) drawn uniformly from the dataset, we rewrite the equation above as:

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} - \psi _t \Big [ \sum \nolimits _{i=1}^n \nabla _\Theta H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) + \varepsilon ^{(t)} \Big ], \end{aligned}$$
(A10)

where \(\varepsilon ^{(t)} = \nabla _\Theta H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\textrm{B}_t} - \nabla _\Theta H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\). Note that the expectation of \(\varepsilon ^{(t)}\) satisfies \(\mathbbm {E}[\varepsilon ^{(t)}]=0\) and its variance is finite. Consider the decomposition

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ =&\, \underbrace{H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t +1 )}))}_\textrm{term 1} + \underbrace{H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)}))}_\textrm{term 2}. \end{aligned} \end{aligned}$$
(A11)

For \(\textrm{term 1}\), by the Lipschitz smoothness of the meta loss function with respect to \(\hat{{{\textbf{w}}}}\), we have

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) \\ \le&\, \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})), \hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)}) - \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)}) \right\rangle + \frac{L}{2} \left\| \hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)}) - \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)}) \right\| _2^2. \end{aligned} \end{aligned}$$

According to Eqs. (6), (8), and (A1), we then have

$$\begin{aligned} \left\| H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t +1 )})) \right\| \le \alpha _t \rho ^2 + \frac{1}{2} L \alpha _t^2 \rho ^2 \le \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) \end{aligned}$$
(A12)

since \({ \left\| \frac{\partial H_j(\textbf{w})}{\partial \textbf{w}} \big |_{\hat{{{\textbf{w}}}}^{(t)}}\right\| \le \rho , \left\| \frac{\partial H_i^\textrm{meta}(\textbf{w})}{\partial \hat{{{\textbf{w}}}}} \big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \right\| \le \rho }\).

For \(\textrm{term 2}\), considering the Lipschitz continuity of \(\nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\) established in Lemma 1, we obtain the following:

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ \le&\, \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \Theta ^{(t+1)} - \Theta ^{(t)} \right\rangle + \frac{L}{2} \left\| \Theta ^{(t+1)} - \Theta ^{(t)} \right\| _2^2 \\ =&-(\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 - (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$
(A13)

Summing Eqs. (A12) and (A13), Eq. (A11) can be bounded as

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ \le&\, \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) - (\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 - (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$

Rearranging the terms, we can obtain

$$\begin{aligned} \begin{aligned}&(\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \\ \le&\, H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) + \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 \\&- (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$

Summing up the above inequalities and rearranging the terms, we can obtain

$$\begin{aligned} \begin{aligned}&\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2\\ \le&\, H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) + \\&\quad \quad \quad \sum \nolimits _{t=1}^T \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) - \sum \nolimits _{t=1}^T (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle \, + \, \frac{L}{2} \sum \nolimits _{t=1}^T \left\| \varepsilon ^{(t)} \right\| _2^2 \\ \le&\, H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha \rho ^2 (1 + \frac{\alpha _t L}{2})\\&- \sum \nolimits _{t=1}^T (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle + \frac{L}{2} \sum \nolimits _{t=1}^T \left\| \varepsilon ^{(t)} \right\| _2^2. \end{aligned} \end{aligned}$$
(A14)

We take expectations with respect to \(\varepsilon ^{(N)}\) on both sides of Eq. (A14), and obtain:

$$\begin{aligned} \begin{aligned}&\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \le H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))\\&+ \sum \nolimits _{t=1}^T \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) + \frac{L\sigma ^2}{2} \sum \nolimits _{t=1}^T \psi _t^2, \end{aligned} \end{aligned}$$

since \({\mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle =0}\) and \(\mathbbm {E}\left\| \varepsilon ^{(t)} \right\| _2^2 \le \sigma ^2\), where \(\sigma ^2\) denotes the variance of \(\varepsilon ^{(t)}\). Eventually, taking the step size \(\psi _t = \min \{1/L, {\textrm{c}}/(\sigma \sqrt{T})\}\) used in the final steps below, we deduce that

$$\begin{aligned} \mathop {\min }\nolimits _{t}&\, \mathbbm {E} \Big [ \left\| \nabla H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \Big ] \le \frac{\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\| \nabla H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 }{\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2})} \\ \le&\, \frac{1}{\sum \nolimits _{t=1}^T (2\psi _t - L\psi _t^2)} \Big [ 2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha \rho ^2 (2 + \alpha _t L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{1}{\sum \nolimits _{t=1}^T \psi _t} \Big [ 2H^{\textrm{meta}} (\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha \rho ^2 (2 + \alpha _t L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{1}{T \psi _t} \Big [ 2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \alpha _1 \rho ^2 T (2 + L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{T} \, \frac{1}{\psi _t} + \frac{2 \alpha _1 \rho ^2 (2 + L)}{\psi _t} + L\sigma ^2 \psi _t \\ =&\, \frac{H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{T} \max \{L, \frac{\sigma \sqrt{T}}{\textrm{c}}\} + \min \{1, \frac{k}{T}\}\max \{L, \frac{\sigma \sqrt{T}}{\textrm{c}}\}\rho ^2(2+L) + L\sigma ^2 \min \{\frac{1}{L}, \frac{\textrm{c}}{\sigma \sqrt{T}}\} \\ \le&\, \frac{\sigma H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{{\textrm{c}} \sqrt{T}} + \frac{k\sigma \rho ^2(2+L)}{{\textrm{c}} \sqrt{T}} + \frac{L\sigma {\textrm{c}}}{\sqrt{T}} = \mathcal {O}(\frac{1}{\sqrt{T}}). \end{aligned}$$

Therefore, we can conclude that under some mild conditions, our algorithm can always achieve \(\min _{0 \le t \le T} \mathbbm {E} \Big [ \left\| \nabla H^\textrm{meta}(\Theta ^{(t)}) \right\| _2^2 \Big ] \le \mathcal {O}(\frac{1}{\sqrt{T}})\) in T steps. \(\square\)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Wei, Q., Feng, L., Sun, H. et al. Learning sample-aware threshold for semi-supervised learning. Mach Learn (2024). https://doi.org/10.1007/s10994-023-06425-7
