Abstract
Pseudo-labeling methods are popular in semi-supervised learning (SSL). Their performance relies heavily on a proper threshold for generating hard labels for unlabeled data. To this end, most existing studies resort to a manually pre-specified function to adjust the threshold, which, however, requires prior knowledge and suffers from scalability issues. In this paper, we propose a novel method named Meta-Threshold, which learns a dynamic confidence threshold for each unlabeled instance and requires no extra hyperparameters except a learning rate. Specifically, the instance-level confidence threshold is learned automatically by an extra network in a meta-learning manner. Treating the limited labeled data as meta-data, the overall training objective of the classifier network and the meta-net can be formulated as a nested optimization problem and solved by a bi-level optimization scheme. Furthermore, by replacing the indicator function in pseudo-labeling with a surrogate function, we theoretically establish the convergence of our training procedure, discuss its training complexity, and propose a strategy to reduce its time cost. Extensive experiments and analyses demonstrate the effectiveness of our method on both typical and imbalanced SSL tasks.
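To make the thresholding mechanism concrete, the following minimal sketch (ours, not the paper's implementation; the `pseudo_label` helper and the fixed example numbers are illustrative) shows how hard pseudo-labels are generated and then masked by per-instance confidence thresholds:

```python
# Illustrative sketch, not the paper's implementation: hard pseudo-labels
# are kept only when the model's confidence clears a threshold. Meta-Threshold
# learns a per-instance threshold with a meta-net; here `thresholds` is
# simply a given list standing in for that output.

def pseudo_label(probs, thresholds):
    """probs: list of per-class softmax rows; thresholds: one cutoff per row.

    Returns (labels, mask): the argmax label of each row and a 0/1 mask
    selecting rows whose max confidence reaches their threshold.
    """
    labels, mask = [], []
    for row, tau in zip(probs, thresholds):
        conf = max(row)
        labels.append(row.index(conf))
        mask.append(1.0 if conf >= tau else 0.0)
    return labels, mask

probs = [[0.97, 0.02, 0.01],   # confident -> kept in the unlabeled loss
         [0.50, 0.30, 0.20]]   # uncertain -> masked out
labels, mask = pseudo_label(probs, [0.95, 0.95])
# labels == [0, 0], mask == [1.0, 0.0]
```

With a fixed global cutoff (as in FixMatch-style methods) every row shares one threshold; the method described here would instead let the meta-net produce a different cutoff per instance.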
Availability of data and materials
Not applicable.
Funding
This research was supported by the Natural Science Foundation of China (Nos. 62106129, 62176139, 62106028), the Natural Science Foundation of Shandong Province (Nos. ZR2021QF053, ZR2021ZD15), the Chongqing Overseas Chinese Entrepreneurship and Innovation Support Program, and the CAAI-Huawei MindSpore Open Fund.
Author information
Authors and Affiliations
Contributions
Conceptualization: W-Q; Methodology: W-Q; Theoretical analysis: F-L; Writing-original draft preparation: W-Q, S-HL; Writing-review and editing: W-R, H-RD; Funding acquisition: S-HL, F-L, Y-YL.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Code availability
Not applicable.
Additional information
Editors: Vu Nguyen, Dani Yogatama.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Theoretical proof of our method
A.1 Proofs of smoothness
Given a small meta dataset of \(n\) labeled samples \(\{(\textbf{x}_1^l, \textbf{y}_1^l),\ldots,(\textbf{x}_n^l, \textbf{y}_n^l)\}\) and an unlabeled dataset \(\{\textbf{x}_1,\ldots,\textbf{x}_{(\mu \times n)}\}\) of size \(\mu \times n\), we replace the indicator function with the approximate function. The meta loss is then \(L_\textrm{meta}(\textbf{w}^*({\Theta })) = \frac{1}{n} \sum \nolimits _{i=1}^n H(\textbf{y}_i^l, f(\textbf{x}_i^l; \textbf{w}^*({\Theta })))\) and the training loss is
where \(\mathcal {S}_i(\textbf{w}, \Theta ) = \mathcal {S}( \max (f(\mathcal {A}^w(\textbf{x}_i; \textbf{w}))) - \mathcal {V}_i(\textbf{w}, \Theta ))\).
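Here \(\mathcal {S}\) acts as a differentiable surrogate for the hard indicator on \(\max f - \mathcal {V}_i\). As a numeric illustration (the sigmoid form and the temperature value are our assumptions for this sketch, not necessarily the paper's exact choice of \(\mathcal {S}\)):

```python
import math

def surrogate_gate(confidence, threshold, temperature=0.05):
    # Smooth stand-in for the indicator 1[confidence > threshold]: a steep
    # sigmoid, so the selection is differentiable in both the model
    # confidence and the learned threshold. The sigmoid and the temperature
    # are illustrative assumptions, not the paper's exact surrogate S.
    return 1.0 / (1.0 + math.exp(-(confidence - threshold) / temperature))

def hard_gate(confidence, threshold):
    # The original non-differentiable indicator.
    return 1.0 if confidence > threshold else 0.0
```

At the threshold the surrogate outputs 0.5, and away from it the output quickly approaches 0 or 1, matching the indicator while keeping gradients available for the bi-level update of the threshold.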
First, we recall the update equation for the parameters of the TGN:
For conciseness, we write \(H(\textbf{y}_i^l, f(\textbf{x}_i^l; {\hat{{{\textbf{w}}}}}^{(t)}(\Theta )))\) as \(H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\). Then the backpropagation computation for the above equation can be written as
Let \(G_{ij} = \frac{\partial H_i^\textrm{meta}(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \, \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}}\) and substitute \(G_{ij}\) into Eq. (A3), then
Proof
The gradient of the meta loss w.r.t. \(\Theta\) can be formulated as:
Let \(\mathcal {V}_j(\Theta ) = \mathcal {V}_j(\textbf{w}^{(t)}; \Theta )\) and introduce \(G_{ij}\) as defined in Eq. (A4). Taking the gradient with respect to \(\Theta\) on both sides of Eq. (A5), we attain
The first term on the right-hand side of Eq. (A6) can be bounded as
since \({\left\| \frac{\partial H(\hat{{{\textbf{w}}}})}{\partial {\hat{{\textbf{w}}}}} \big |_{\hat{{\textbf{w}}}^{(t)}}^T \right\| \le \rho , \left\| \frac{\partial \ell _{\textbf{x}_j}(\mathcal {S}_j(\textbf{w}))}{\partial \mathcal {S}_j(\textbf{w})} \right\| \le \phi , \left\| \frac{\partial \mathcal {S}_j(\textbf{w})}{\partial \textbf{w}} \big |_{\textbf{w}^{(t)}} \right\| \le \zeta , \left\| \frac{\partial ^2 \mathcal {V}_j(\Theta )}{\partial ^2 \Theta } \Big |_{\Theta ^{(t)}} \right\| \le \mathcal {B}}\).
The second term on the right-hand side of Eq. (A6) can be bounded as
Combining the results in Eqs. (A7) and (A8), we have \(\left\| \nabla _{\Theta }^2 H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}} \right\| \le \phi \zeta (\alpha L \delta ^2 \phi \zeta + \rho \mathcal {B}).\) Defining \({{{\hat{L}}}} = \phi \zeta (\alpha L \delta ^2 \phi \zeta + \rho \mathcal {B})\) and applying the Lagrange mean value theorem, we have
$$\left\| \nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _1))} - \nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _2))} \right\| \le {{{\hat{L}}}} \left\| \Theta _1 - \Theta _2 \right\| \quad \text{for all } \Theta _1, \Theta _2,$$
where \(\nabla {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta _1))} = \nabla _\Theta {L_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))}\Big |_{\Theta _1}\). \(\square\)
A.2 Proofs of convergence
Proof
The update of the parameters \(\Theta\) at the \(t\)-th iteration can be written as \(\Theta ^{(t+1)} = \Theta ^{(t)} - \psi \frac{1}{n} \sum \nolimits _{i=1}^n \nabla _\Theta H_i^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\Theta ^{(t)}}.\) Training with a mini-batch of meta-data \(\textrm{B}_t\) drawn uniformly from the meta dataset, we rewrite the equation above as:
where \(\varepsilon ^{(t)} = \nabla _\Theta H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta )) \Big |_{\textrm{B}_t} - \nabla _\Theta H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\). Note that the expectation of \(\varepsilon ^{(t)}\) obeys \(\mathbbm {E}[\varepsilon ^{(t)}]=0\) and its variance is finite. Consider that
For \(\textrm{term 1}\), by Lipschitz smoothness of the meta loss function for \(\Theta\), we have
According to Eqs. (6), (8), and (A1), we have
since \({ \left\| \frac{\partial H_j(\textbf{w})}{\partial \textbf{w}} \big |_{\hat{{{\textbf{w}}}}^{(t)}}\right\| \le \rho , \left\| \frac{\partial H_i^\textrm{meta}(\textbf{w})}{\partial \hat{{{\textbf{w}}}}} \big |_{\hat{{{\textbf{w}}}}^{(t)}}^T \right\| \le \rho }\).
For \(\textrm{term 2}\), considering the Lipschitz continuity of \(\nabla H_\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ))\) established in Lemma 1, we can obtain the following:
Combining Eqs. (A12) and (A13), Eq. (A11) can be bounded as
Rearranging the terms, we can obtain
Summing up the above inequalities and rearranging the terms, we can obtain
Taking expectations w.r.t. \(\varepsilon ^{(N)}\) on both sides of Eq. (A14), we have:
since \({\mathop {\mathbbm {E}}\nolimits _{\varepsilon ^{(N)}} \left\langle H^\textrm{meta}(\hat{{{\textbf{w}}}}^{(t)}(\Theta ^{(t)})), \varepsilon ^{(t)} \right\rangle =0}\) and \(\mathbbm {E} \left\| \varepsilon ^{(t)} \right\| _2^2 \le \sigma ^2\), where \(\sigma ^2\) denotes the variance of \(\varepsilon ^{(t)}\). Eventually, we deduce that
Therefore, we can conclude that under some mild conditions, our algorithm can always achieve \(\min _{0 \le t \le T} \mathbbm {E} \Big [ \left\| \nabla H^\textrm{meta}(\Theta ^{(t)}) \right\| _2^2 \Big ] \le \mathcal {O}(\frac{1}{\sqrt{T}})\) in T steps. \(\square\)
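For completeness, the final step follows the standard non-convex stochastic-gradient argument; in generic form (with placeholder constants \(C_1, C_2\) absorbing the initial gap and the smoothness constant \({{{\hat{L}}}}\), not the paper's exact quantities):

```latex
\min_{0 \le t \le T} \mathbb{E}\!\left[\left\| \nabla H^{\mathrm{meta}}(\Theta^{(t)}) \right\|_2^2\right]
\;\le\; \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\!\left[\left\| \nabla H^{\mathrm{meta}}(\Theta^{(t)}) \right\|_2^2\right]
\;\le\; \frac{C_1}{\psi T} + C_2\,\psi\,\sigma^2
```

so choosing a step size \(\psi \propto 1/\sqrt{T}\) balances the two terms and yields the stated \(\mathcal {O}(1/\sqrt{T})\) rate.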
Cite this article
Wei, Q., Feng, L., Sun, H. et al. Learning sample-aware threshold for semi-supervised learning. Mach Learn (2024). https://doi.org/10.1007/s10994-023-06425-7