Learning sample-aware threshold for semi-supervised learning

Abstract

Pseudo-labeling methods are popular in semi-supervised learning (SSL). Their performance relies heavily on a proper threshold for generating hard labels for unlabeled data. To this end, most existing studies resort to a manually pre-specified function to adjust the threshold, which requires prior knowledge and suffers from scalability issues. In this paper, we propose a novel method named Meta-Threshold, which learns a dynamic confidence threshold for each unlabeled instance and requires no extra hyperparameters except a learning rate. Specifically, the instance-level confidence threshold is learned automatically by an extra network in a meta-learning manner. Treating the limited labeled data as meta-data, the overall training objective of the classifier network and the meta-net can be formulated as a nested optimization problem and solved by a bi-level optimization scheme. Furthermore, by replacing the indicator function used in pseudo-labeling with a surrogate function, we theoretically establish the convergence of our training procedure, discuss its training complexity, and propose a strategy to reduce its time cost. Extensive experiments and analyses demonstrate the effectiveness of our method on both typical and imbalanced SSL tasks.
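
As a rough, code-level illustration of the idea (every name and design choice below is an assumption for illustration, not the authors' implementation), the threshold-generating meta-net can be pictured as a small network that maps each unlabeled instance's predicted class distribution to a scalar confidence threshold in (0, 1):

import torch.nn as nn

class ThresholdNet(nn.Module):
    """Illustrative sample-aware threshold generator (architecture assumed, not from the paper)."""
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),          # one threshold in (0, 1) per instance
        )

    def forward(self, probs):      # probs: classifier softmax output, shape (B, C)
        return self.net(probs)     # shape (B, 1): per-instance confidence threshold

During training, an unlabeled sample contributes a pseudo-label loss only when its weak-augmentation confidence exceeds its own learned threshold, and the meta-net itself is updated through the bi-level scheme detailed in Appendix A.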

Availability of data and materials

Not applicable.

Funding

This research was supported by the Natural Science Foundation of China (No. 62106129, 62176139, 62106028), the Natural Science Foundation of Shandong Province (No. ZR2021QF053, ZR2021ZD15), the Chongqing Overseas Chinese Entrepreneurship and Innovation Support Program, and the CAAI-Huawei MindSpore Open Fund.

Author information

Contributions

Conceptualization: W-Q; Methodology: W-Q; Theoretical analysis: F-L; Writing-original draft preparation: W-Q, S-HL; Writing-review and editing: W-R, H-RD; Funding acquisition: S-HL, F-L, Y-YL.

Corresponding author

Correspondence to Qi Wei.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Code availability

Not applicable.

Additional information

Editors: Vu Nguyen, Dani Yogatama.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Theoretical proof of our method

A.1 Proofs of smoothness

Given a small meta dataset of n labeled samples \(\{(\textbf{x}_1^l, \textbf{y}_1^l),...,(\textbf{x}_n^l, \textbf{y}_n^l)\}\) and an unlabeled dataset \(\{\textbf{x}_1,...,\textbf{x}_{\mu n}\}\) of size \(\mu \times n\), and replacing the indicator function with the surrogate function, the meta loss is \(L_\textrm{meta}(\textbf{w}^*({\Theta })) = \frac{1}{n} \sum \nolimits _{i=1}^n H(\textbf{y}_i^l, f(\textbf{x}_i^l; \textbf{w}^*({\Theta })))\) and the training loss is

$$\begin{aligned} L_{train}(\textbf{w},\Theta ) = \frac{1}{n\mu } \sum \nolimits _{i=1}^{n\mu } \mathcal {S}_i(\textbf{w}, \Theta ) \cdot H(\hat{\textbf{y}}_i, f(\mathcal {A}^s(\textbf{x}_i); \textbf{w})), \end{aligned}$$
(A1)

where \(\mathcal {S}_i(\textbf{w}, \Theta ) = \mathcal {S}( \max (f(\mathcal {A}^w(\textbf{x}_i); \textbf{w})) - \mathcal {V}_i(\textbf{w}, \Theta ))\) is the surrogate replacing the indicator \(\mathbbm {1}(\max (f(\mathcal {A}^w(\textbf{x}_i); \textbf{w})) > \mathcal {V}_i(\textbf{w}, \Theta ))\).
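
For concreteness, here is a minimal PyTorch-style sketch of the surrogate-gated training loss in Eq. (A1); the sigmoid form of \(\mathcal {S}\), the temperature tau, and all names are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn.functional as F

def surrogate_unlabeled_loss(logits_weak, logits_strong, thresholds, tau=10.0):
    # logits_weak:   f(A^w(x_i); w), shape (B, C)
    # logits_strong: f(A^s(x_i); w), shape (B, C)
    # thresholds:    per-instance thresholds V_i(w, Theta), shape (B,)
    # tau:           temperature of the sigmoid surrogate S (assumed value)
    probs_weak = logits_weak.softmax(dim=-1)
    confidence, pseudo_labels = probs_weak.max(dim=-1)           # max f(A^w(x_i); w) and hat{y}_i
    gate = torch.sigmoid(tau * (confidence - thresholds))        # S_i(w, Theta): soft indicator
    ce = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")  # H(hat{y}_i, f(A^s(x_i); w))
    return (gate * ce).mean()                                    # L_train of Eq. (A1)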

Firstly, we recall the update equation of the parameters of TGN as follows:

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} - \psi \frac{1}{n} \sum \nolimits _{i=1}^{n} \nabla _{\Theta } H(\textbf{y}_i^l, f(\textbf{x}_i^l; {\hat{{{\textbf{w}}}}}^{(t)}(\Theta ))). \end{aligned}$$
(A2)

For conciseness, we write \(H(\textbf{y}_i^l, f(\textbf{x}_i^l; {\hat{{{\textbf{w}}}}}^{(t)}(\Theta )))\) as \(H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\). The backpropagation computation for the above update can then be written as

$$\begin{aligned} \begin{aligned}&\frac{1}{n} \sum \nolimits _{i=1}^{n} \nabla _{\Theta } H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = \frac{1}{n} \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}} \sum \nolimits _{j=1}^{n\mu } \frac{\partial \hat{{{\textbf{w}}}}^{(t)}(\Theta )}{\partial \mathcal {S}_j(\textbf{w}^{(t)}; \Theta )} \, \frac{\partial \mathcal {S}_j(\textbf{w}^{(t)}; \Theta )}{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )} \, \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \\ =&\frac{-\alpha }{n^2\mu } \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}} \sum \nolimits _{j=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \\ =&\frac{- \alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{1}{n} \sum \nolimits _{i=1}^{n} \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned} \end{aligned}$$
(A3)

Let \(G_{ij} = \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}}\) and substitute \(G_{ij}\) into Eq.  (A3), then

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} + \frac{\alpha \psi }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{1}{n} \sum \nolimits _{i=1}^{n} G_{ij} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned}$$
(A4)
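
In implementation terms, the chain of partial derivatives in Eq. (A3) need not be assembled by hand: performing a one-step "virtual" update of \(\textbf{w}\) inside the autograd graph and backpropagating the meta loss to \(\Theta\) corresponds to the update in Eq. (A2). The sketch below (PyTorch-style, with a toy linear classifier; the surrogate, interfaces, and names are illustrative assumptions) shows one such \(\Theta\) update.

import torch
import torch.nn.functional as F

def meta_threshold_update(w, theta_net, labeled, unlabeled, alpha=0.01, psi=1e-3, tau=10.0):
    # w:         [W, b] classifier tensors with requires_grad=True (toy functional model)
    # theta_net: threshold-generating network with parameters Theta (e.g., a small MLP)
    # labeled:   meta-data batch (x_l, y_l); unlabeled: weak/strong augmented views (x_w, x_s)
    x_l, y_l = labeled
    x_w, x_s = unlabeled

    def f(x, params):                       # f(x; w): toy one-layer classifier
        W, b = params
        return x @ W + b

    # surrogate-gated unlabeled loss L_train(w, Theta), cf. Eq. (A1)
    probs_w = f(x_w, w).softmax(dim=-1)
    confidence, pseudo = probs_w.max(dim=-1)
    thresholds = theta_net(probs_w).squeeze(-1)                  # V_i(w, Theta)
    gate = torch.sigmoid(tau * (confidence - thresholds))        # S_i(w, Theta)
    l_train = (gate * F.cross_entropy(f(x_s, w), pseudo, reduction="none")).mean()

    # virtual step hat{w}(Theta) = w - alpha * dL_train/dw, keeping the graph w.r.t. Theta
    grads_w = torch.autograd.grad(l_train, w, create_graph=True)
    w_hat = [wi - alpha * gi for wi, gi in zip(w, grads_w)]

    # meta loss on labeled meta-data at hat{w}(Theta); its Theta-gradient realizes Eq. (A2)
    l_meta = F.cross_entropy(f(x_l, w_hat), y_l)
    grads_theta = torch.autograd.grad(l_meta, list(theta_net.parameters()))
    with torch.no_grad():
        for p, g in zip(theta_net.parameters(), grads_theta):
            p.sub_(psi * g)                                      # Theta^(t+1) = Theta^(t) - psi * grad
    return float(l_meta)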

Proof

The gradient of the meta loss with respect to \(\Theta\) can be formulated as:

$$\begin{aligned} \begin{aligned}&\nabla _{\Theta } H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = -\frac{\alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {(} \frac{\partial H^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \Big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \bigg {)} \frac{\partial \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}}. \end{aligned} \end{aligned}$$
(A5)

Let \(\mathcal {V}_j(\Theta ) = \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )\) and recall \(G_{ij}\) as defined before Eq. (A4). Taking the gradient with respect to \(\Theta\) on both sides of Eq. (A5), we obtain

$$\begin{aligned} \begin{aligned} \nabla _{\Theta ^2}^2 H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} = -\frac{\alpha }{n\mu } \sum \nolimits _{j=1}^{n\mu } \bigg {[} \frac{\partial }{\partial \Theta } (G_{ij}) \Big |_{\Theta ^{(t)}} \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} + (G_{ij}) \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \bigg {]}. \end{aligned} \end{aligned}$$
(A6)

The first term on the right-hand side of Eq. (A6) can be bounded as

$$\begin{aligned} \begin{aligned}&\left\| \frac{\partial }{\partial \Theta } (G_{ij}) \Big |_{\Theta ^{(t)}} \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \, \delta \left\| \frac{\partial }{\partial \hat{{{\textbf{w}}}}} \bigg {(} \frac{\partial H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ =&\, \delta \left\| \frac{\partial }{\partial \hat{{{\textbf{w}}}}} \bigg {(} \frac{\partial H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}} \Big |_{\hat{{{\textbf{w}}}}^{(t)}} \, \frac{-\alpha }{n\mu } \sum \nolimits _{k=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_k}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_k(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ =&\, \delta \left\| \bigg {(} \frac{\partial ^2 H^\textrm{meta} (\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}^2} \Big |_{\hat{{{\textbf{w}}}}^{(t)}} \, \frac{-\alpha }{n\mu } \sum \nolimits _{k=1}^{n\mu } \frac{\partial \ell _{\textbf{x}_k}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \frac{\partial \mathcal {V}_k(\Theta )}{\partial \Theta } \Big |_{\Theta ^{(t)}} \bigg {)} \Big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \Big |_{\textbf{w}^{(t)}} \right\| \\ \le&\, \alpha L \delta ^2 \phi ^2 \zeta ^2, \end{aligned} \end{aligned}$$
(A7)

since \({\left\| \frac{\partial H(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \right\| \le \rho , \left\| \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \right\| \le \phi , \left\| \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}} \right\| \le \zeta , \left\| \frac{\partial \mathcal {V}_j(\Theta )}{\partial \Theta } \big |_{\Theta ^{(t)}} \right\| \le \delta , \left\| \frac{\partial ^2 H^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial \hat{{{\textbf{w}}}}^2} \big |_{\hat{{{\textbf{w}}}}^{(t)}} \right\| \le L, \left\| \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \mathcal {B}}\).

The second term on the right-hand side of Eq. (A6) can be bounded as

$$\begin{aligned} \begin{aligned} \left\| (G_{ij}) \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| = \left\| \frac{\partial H^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}} \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \, \rho \phi \zeta \mathcal {B}. \end{aligned} \end{aligned}$$
(A8)

Combining the results in Eqs. (A7) and (A8), we have \(\left\| \nabla _{\Theta ^2}^2 H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} \right\| \le \phi \zeta (\alpha L \delta ^2 \phi \zeta + \rho \mathcal {B})\). Defining \({{{\hat{L}}}} = \phi \zeta (\alpha L \delta ^2 \phi \zeta + \rho \mathcal {B})\) and applying the Lagrange mean value theorem, we have:

$$\begin{aligned} \left\| \nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _1))} - \nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _2))} \right\| \le {{{\hat{L}}}} \left\| \Theta _1 - \Theta _2 \right\| , \, \text {for all} \, \Theta _1, \Theta _2, \end{aligned}$$
(A9)

where \(\nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _1))} = \nabla _\Theta {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))}\Big |_{\Theta _1}\). \(\square\)

A.2 Proofs of convergence

Proof

The update of the parameters \(\Theta\) at the t-th iteration can be written as \(\Theta ^{(t+1)} = \Theta ^{(t)} - \psi \frac{1}{n} \sum \nolimits _{i=1}^n \nabla _\Theta H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}}.\) Training with a mini-batch of meta-data \(\textrm{B}_t\) drawn uniformly from the dataset, we rewrite the equation above as:

$$\begin{aligned} \Theta ^{(t+1)} = \Theta ^{(t)} - \psi _t \Big [ \sum \nolimits _{i=1}^n \nabla _\Theta H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) + \varepsilon ^{(t)} \Big ], \end{aligned}$$
(A10)

where \(\varepsilon ^{(t)} = \nabla _\Theta H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\textrm{B}_t} - \nabla _\Theta H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\). Note that the expectation of \(\varepsilon ^{(t)}\) satisfies \(\mathbbm {E}[\varepsilon ^{(t)}]=0\) and its variance is finite. Consider the decomposition

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ =&\, \underbrace{H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t +1 )}))}_\textrm{term 1} + \underbrace{H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)}))}_\textrm{term 2}. \end{aligned} \end{aligned}$$
(A11)

For \(\textrm{term 1}\), by the Lipschitz smoothness of the meta loss function with respect to \(\hat{{{\textbf{w}}}}\), we have

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) \\ \le&\, \left\langle \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})), \hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)}) - \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)}) \right\rangle + \frac{L}{2} \left\| \hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)}) - \hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)}) \right\| _2^2. \end{aligned} \end{aligned}$$

According to Eqs. (6), (8), and (A1), we then have

$$\begin{aligned} \left\| H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t +1 )})) \right\| \le \alpha _t \rho ^2 + \frac{1}{2} L \alpha _t^2 \rho ^2 \le \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) \end{aligned}$$
(A12)

since \({ \left\| \frac{\partial H_j(\textbf{w})}{\partial \textbf{w}} \big |_{\hat{{{\textbf{w}}}}^{(t)}}\right\| \le \rho , \left\| \frac{\partial H_i^\textrm{meta}(\textbf{w})}{\partial \hat{{{\textbf{w}}}}} \big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \right\| \le \rho }\).

For \(\textrm{term 2}\), considering the Lipschitz continuity of \(\nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\) established in Lemma 1, we obtain the following:

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ \le&\, \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \Theta ^{(t+1)} - \Theta ^{(t)} \right\rangle + \frac{L}{2} \left\| \Theta ^{(t+1)} - \Theta ^{(t)} \right\| _2^2 \\ =&-(\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 - (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$
(A13)

Summing Eqs. (A12) and (A13), Eq. (A11) can be bounded as

$$\begin{aligned} \begin{aligned}&H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \\ \le&\, \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) - (\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 - (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$

Rearranging the terms, we can obtain

$$\begin{aligned} \begin{aligned}&(\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \\ \le&\, H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t+1)}(\Theta ^{(t+1)})) + \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) + \frac{L \psi _t^2}{2} \left\| \varepsilon ^{(t)} \right\| _2^2 \\&- (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle . \end{aligned} \end{aligned}$$

Summing up the above inequalities and rearranging the terms, we can obtain

$$\begin{aligned} \begin{aligned}&\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2\\ \le&\, H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) - H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) + \\&\quad \quad \quad \sum \nolimits _{t=1}^T \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) - \sum \nolimits _{t=1}^T (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle \, + \, \frac{L}{2} \sum \nolimits _{t=1}^T \left\| \varepsilon ^{(t)} \right\| _2^2 \\ \le&\, H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha \rho ^2 (1 + \frac{\alpha _t L}{2})\\&- \sum \nolimits _{t=1}^T (\psi _t - L\psi _t^2) \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle + \frac{L}{2} \sum \nolimits _{t=1}^T \left\| \varepsilon ^{(t)} \right\| _2^2. \end{aligned} \end{aligned}$$
(A14)

We take expectations with respect to \(\varepsilon ^{(N)}\) on both sides of Eq. (A14), and obtain:

$$\begin{aligned} \begin{aligned}&\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\| \nabla H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \le H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))\\&+ \sum \nolimits _{t=1}^T \alpha \rho ^2 (1 + \frac{\alpha _t L}{2}) + \frac{L\sigma ^2}{2} \sum \nolimits _{t=1}^T \psi _t^2, \end{aligned} \end{aligned}$$

since \({\mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle =0}\) and \(\mathbbm {E}\left\| \varepsilon ^{(t)} \right\| _2^2 \le \sigma ^2\), where \(\sigma ^2\) denotes the variance of \(\varepsilon ^{(t)}\). Eventually, taking the step size \(\psi _t = \min \{1/L, {\textrm{c}}/(\sigma \sqrt{T})\}\) used in the final steps below, we deduce that

$$\begin{aligned} \mathop {\min }\nolimits _{t}&\, \mathbbm {E} \Big [ \left\| \nabla H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 \Big ] \le \frac{\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2}) \mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\| \nabla H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})) \right\| _2^2 }{\sum \nolimits _{t=1}^T (\psi _t - \frac{L \psi _t^2}{2})} \\ \le&\, \frac{1}{\sum \nolimits _{t=1}^T (2\psi _t - L\psi _t^2)} \Big [ 2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha \rho ^2 (2 + \alpha _t L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{1}{\sum \nolimits _{t=1}^T \psi _t} \Big [ 2H^{\textrm{meta}} (\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \sum \nolimits _{t=1}^T \alpha \rho ^2 (2 + \alpha _t L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{1}{T \psi _t} \Big [ 2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)})) + \alpha _1 \rho ^2 T (2 + L) + L\sigma ^2 \sum \nolimits _{t=1}^T \psi _t^2 \Big ] \\ \le&\, \frac{2H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{T} \, \frac{1}{\psi _t} + \frac{2 \alpha _1 \rho ^2 (2 + L)}{\psi _t} + L\sigma ^2 \psi _t \\ =&\, \frac{H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{T} \max \{L, \frac{\sigma \sqrt{T}}{\textrm{c}}\} + \min \{1, \frac{k}{T}\}\max \{L, \frac{\sigma \sqrt{T}}{\textrm{c}}\}\rho ^2(2+L) + L\sigma ^2 \min \{\frac{1}{L}, \frac{\textrm{c}}{\sigma \sqrt{T}}\} \\ \le&\, \frac{\sigma H^{\textrm{meta}}(\hat{{{\textbf{w}}}}^{(1)}(\Theta ^{(1)}))}{{\textrm{c}} \sqrt{T}} + \frac{k\sigma \rho ^2(2+L)}{{\textrm{c}} \sqrt{T}} + \frac{L\sigma {\textrm{c}}}{\sqrt{T}} = \mathcal {O}(\frac{1}{\sqrt{T}}). \end{aligned}$$

Therefore, we can conclude that under some mild conditions, our algorithm can always achieve \(\min _{0 \le t \le T} \mathbbm {E} \Big [ \left\| \nabla H^\textrm{meta}(\Theta ^{(t)}) \right\| _2^2 \Big ] \le \mathcal {O}(\frac{1}{\sqrt{T}})\) in T steps. \(\square\)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Wei, Q., Feng, L., Sun, H. et al. Learning sample-aware threshold for semi-supervised learning. Mach Learn (2024). https://doi.org/10.1007/s10994-023-06425-7
