
Abstract

We study a problem of best-effort adaptation motivated by several applications and considerations, which consists of determining an accurate predictor for a target domain, for which a moderate number of labeled samples is available, while leveraging information from another domain for which substantially more labeled samples are at one’s disposal. We present a new and general discrepancy-based theoretical analysis of sample reweighting methods, including bounds holding uniformly over the weights. We show how these bounds can guide the design of learning algorithms that we discuss in detail. We further show that our learning guarantees and algorithms provide improved solutions for standard domain adaptation problems, for which few labeled data or none are available from the target domain. We finally report the results of a series of experiments demonstrating the effectiveness of our best-effort adaptation and domain adaptation algorithms, as well as comparisons with several baselines. We also discuss how our analysis can benefit the design of principled solutions for fine-tuning.


Data Availability

The datasets analyzed in this study are all public datasets and are available from the URLs referenced. Our artificial dataset used for a simulation is described in detail and the code generating it can be provided upon request.

References

  1. Aghajanyan, A., Gupta, A., Shrivastava, A., Chen, X., Zettlemoyer, L., Gupta, S.: Muppet: Massive multi-task representations with pre-finetuning. (2021)

  2. Aribandi, V., Tay, Y., Schuster, T., Rao, J., Zheng, H.S., Mehta, S.V., Zhuang, H., Tran, V.Q., Bahri, D., Ni, J., Gupta, J., Hui, K., Ruder, S., Metzler, D.: Ext5: Towards extreme multi-task scaling for transfer learning. (2021)

  3. Balcan, M., Khodak, M., Talwalkar, A.: Provable guarantees for gradient-based meta-learning. In: Proceedings of ICML, vol. 97, pp. 424–433. PMLR (2019)

  4. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3 (2002)

  5. Beck, A.: On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM J. Optim. 25(1), 185–209 (2015)

  6. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of learning from different domains. Mach. Learn. 79(1–2), 151–175 (2010)

  7. Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations for domain adaptation. In: Proceedings of NIPS, pp. 137–144. MIT Press (2006)

  8. Ben-David, S., Lu, T., Luu, T., Pál, D.: Impossibility theorems for domain adaptation. J. Mach. Learn. Res. - Proc. Track 9, 129–136 (2010)

  9. Berlind, C., Urner, R.: Active nearest neighbors in changing environments. In: Proceedings of ICML, vol. 37, pp. 1870–1879. JMLR.org (2015)

  10. Blanchard, G., Lee, G., Scott, C.: Generalizing from several related classification tasks to a new unlabeled sample. In: NIPS, pp. 2178–2186 (2011)

  11. Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Wortman, J.: Learning bounds for domain adaptation. In: Proceedings of NIPS, pp. 129–136 (2008)

  12. Blitzer, J., Dredze, M., Pereira, F.: Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In: Proceedings of ACL, pp. 440–447 (2007)

  13. Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3722–3731 (2017)

  14. Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2014)

  15. Chattopadhyay, R., Fan, W., Davidson, I., Panchanathan, S., Ye, J.: Joint transfer and batch-mode active learning. In: Proceedings of ICML, vol. 28, pp. 253–261. JMLR.org (2013)

  16. Chen, M., Weinberger, K.Q., Blitzer, J.: Co-training for domain adaptation. In: Nips, vol. 24, pp. 2456–2464. Citeseer (2011)

  17. Chen, R.S., Lucier, B., Singer, Y., Syrgkanis, V.: Robust optimization for non-convex objectives. In: Advances in Neural Information Processing Systems, pp. 4705–4714 (2017)

  18. Cortes, C., Greenberg, S., Mohri, M.: Relative deviation learning bounds and generalization with unbounded loss functions. Ann. Math. Artif. Intell. 85(1), 45–70 (2019)

  19. Cortes, C., Mansour, Y., Mohri, M.: Learning bounds for importance weighting. In: Proceedings of NIPS, pp. 442–450. Curran Associates, Inc (2010)

  20. Cortes, C., Mohri, M.: Domain adaptation in regression. In: Proceedings of ALT, pp. 308–323 (2011)

  21. Cortes, C., Mohri, M.: Domain adaptation and sample bias correction theory and algorithm for regression. Theor. Comput. Sci. 519, 103–126 (2014)

  22. Cortes, C., Mohri, M., Muñoz Medina, A.: Adaptation based on generalized discrepancy. J. Mach. Learn. Res. 20, 1:1-1:30 (2019)

  23. Cortes, C., Mohri, M., Theertha Suresh, A., Zhang, N.: A discriminative technique for multiple-source adaptation. In: Meila, M., Zhang, T. (Eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, vol. 139 of Proceedings of Machine Learning Research, pp. 2132–2143. PMLR (2021)

  24. Courty, N., Flamary, R., Tuia, D., Rakotomamonjy, A.: Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 39(9), 1853–1865 (2016)

  25. Courty, N., Flamary, R., Tuia, D., Rakotomamonjy, A.: Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 39(9), 1853–1865 (2017)

  26. Crammer, K., Kearns, M.J., Wortman, J.: Learning from multiple sources. J. Mach. Learn. Res. 9(Aug), 1757–1774 (2008)

  27. Daumé, H., III.: Frustratingly easy domain adaptation. ACL 2007, 256 (2007)

  28. de Mathelin, A., Mougeot, M., Vayatis, N.: Discrepancy-based active learning for domain adaptation. (2021). arXiv:2103.03757

  29. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. (2018). arXiv:1810.04805

  30. Du, S.S., Koushik, J., Singh, A., Póczos, B.: Hypothesis transfer learning via transformation functions. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 574–584 (2017)

  31. Dua, D., Graff, C.: UCI machine learning repository (2017)

  32. Duan, L., Tsang, I.W., Xu, D., Chua, T.: Domain adaptation from multiple sources via auxiliary classifiers. In: ICML, vol. 382, pp. 289–296 (2009)

  33. Duan, L., Xu, D., Tsang, I.W.: Domain adaptation from multiple sources: a domain-dependent regularization approach. IEEE Trans. Neural Netw. Learn. Syst. 23(3), 504–518 (2012)

  34. Fernandes, K.: A proactive intelligent decision support system for predicting the popularity of online news. Springer Science and Business Media LLC (2015)

  35. Fernando, B., Habrard, A., Sebban, M., Tuytelaars, T.: Unsupervised visual domain adaptation using subspace alignment. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2960–2967 (2013)

  36. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, vol. 70 of Proceedings of Machine Learning Research, pp. 1126–1135. PMLR (2017)

  37. Gan, C., Yang, T., Gong, B.: Learning attributes equals multi-source domain generalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 87–97 (2016)

  38. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(1), 2096–2030 (2016)

  39. Garcke, J., Vanck, T.: Importance weighted inductive transfer learning for regression. In: Calders, T., F. Esposito, Hüllermeier, E., Meo, R. (Eds.), Proceedings of ECML, vol. 8724 of Lecture Notes in Computer Science, pp. 466–481. Springer (2014)

  40. Germain, P., Habrard, A., Laviolette, F., Morvant, E.: A PAC-bayesian approach for domain adaptation with specialization to linear classifiers. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, vol. 28 of JMLR Workshop and Conference Proceedings, pp. 738–746. JMLR.org (2013)

  41. Ghifary, M., Balduzzi, D., Kleijn, W.B., Zhang, M.: Scatter component analysis: a unified framework for domain adaptation and domain generalization. IEEE Trans. Pattern Anal. Mach. Intell. 39(7), 1414–1430 (2016)

  42. Ghifary, M., Bastiaan Kleijn, W., Zhang, M., Balduzzi, D.: Domain generalization for object recognition with multi-task autoencoders. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2551–2559 (2015)

  43. Ghifary, M., Kleijn, W.B., Zhang, M., Balduzzi, D., Li, W.: Deep reconstruction-classification networks for unsupervised domain adaptation. In: European Conference on Computer Vision, pp. 597–613. Springer (2016)

  44. Gong, B., Grauman, K., Sha, F.: Connecting the dots with landmarks: discriminatively learning domain-invariant features for unsupervised domain adaptation. In: ICML, vol. 28, pp. 222–230 (2013)

  45. Gong, B., Grauman, K., Sha, F.: Reshaping visual datasets for domain adaptation. In: NIPS, pp. 1286–1294 (2013)

  46. Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: CVPR, pp. 2066–2073 (2012)

  47. Gong, M., Zhang, K., Liu, T., Tao, D., Glymour, C., Schölkopf, B.: Domain adaptation with conditional transferable components. In: Balcan, M., Weinberger, K.Q. (Eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, vol. 48 of JMLR Workshop and Conference Proceedings, pp. 2839–2848. JMLR.org (2016)

  48. Grippo, L., Sciandrone, M.: On the convergence of the block nonlinear gauss-seidel method under convex constraints. Oper. Res. Lett. 26(3), 127–136 (2000)

  49. Guo, Y., Shi, H., Kumar, A., Grauman, K., Rosing, T., Feris, R.: Spottune: transfer learning through adaptive fine-tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  50. Hanneke, S., Kpotufe, S.: On the value of target data in transfer learning. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 9867–9877 (2019)

  51. Haslett, J., Raftery, A.E.: Space-time modeling with long-memory dependence: assessing Ireland's wind-power resource. J. R. Stat. Soc. 38(1) (1989)

  52. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  53. Hedegaard, L., Sheikh-Omar, O.A., Iosifidis, A.: Supervised domain adaptation: a graph embedding perspective and a rectified experimental protocol. IEEE Trans. Image Process. 30, 8619–8631 (2021)

  54. Hoffman, J., Kulis, B., Darrell, T., Saenko, K.: Discovering latent domains for multisource domain adaptation. In: ECCV, vol. 7573, pp. 702–715 (2012)

  55. Hoffman, J., Mohri, M., Zhang, N.: Algorithms and theory for multiple-source adaptation. In: Proceedings of NeurIPS, pp. 8256–8266 (2018)

  56. Hoffman, J., Mohri, M., Zhang, N.: Multiple-source adaptation theory and algorithms. Ann. Math. Artif. Intell. 89(3–4), 237–270 (2021)

  57. Hoffman, J., Mohri, M., Zhang, N.: Multiple-source adaptation theory and algorithms - addendum. Ann. Math. Artif. Intell. 90(6), 569–572 (2022)

  58. Horst, R., Thoai, N.V.: DC programming: overview. J. Optim. Theory Appl. 103(1), 1–43 (1999)

  59. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. (2019). arXiv:1902.00751

  60. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), Melbourne, Australia, pp. 328–339. Association for Computational Linguistics (2018)

  61. Huang, J., Smola, A.J., Gretton, A., Borgwardt, K.M., Schölkopf, B.: Correcting sample selection bias by unlabeled data. In: NIPS 2006, vol. 19, pp. 601–608 (2006)

  62. Huang, X., Rao, Y., Xie, H., Wong, T.L., Wang, F.L.: Cross-domain sentiment classification via topic-related tradaboost. In: Thirty-first AAAI Conference on Artificial Intelligence (2017)

  63. Ikonomovska, E: Airline dataset. Online (2009)

  64. Jhuo, I.H., Liu, D., Lee, D., Chang, S.F.: Robust visual domain adaptation with low-rank reconstruction. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2168–2175. IEEE (2012)

  65. Khosla, A., Zhou, T., Malisiewicz, T., Efros, A.A., Torralba, A.: Undoing the damage of dataset bias. In: ECCV, vol. 7572, pp. 158–171 (2012)

  66. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Nascimento, M.A., Özsu, M.T., Kossmann, D., Miller, R.J., Blakeley, J.A., Schiefer, K.B. (Eds.), (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, VLDB 2004, Toronto, Canada, August 31 - September 3 2004, pp. 180–191. Morgan Kaufmann (2004)

  67. Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.: Skip-thought vectors. In: Advances in neural information processing systems, pp. 3294–3302 (2015)

  68. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D., Goel, S.: Racial disparities in automated speech recognition. Proc. Natl. Acad. Sci. USA 117(14), 7684–7689 (2020)

  69. Koltchinskii, V., Panchenko, D.: Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Stat. 30 (2002)

  70. Konstantinov, N., Lampert, C.: Robust learning from untrusted sources. In: International Conference on Machine Learning, pp. 3488–3498 (2019)

  71. Kpotufe, S., Martinet, G.: Marginal singularity, and the benefits of labels in covariate-shift. In: Bubeck, S., Perchet, V., Rigollet, P. (Eds.), Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, vol. 75 of Proceedings of Machine Learning Research, pp. 1882–1886. PMLR (2018)

  72. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images. Technical report, Toronto University (2009)

  73. Kundu, J.N., Venkat, N., Babu, R.V., et al.: Universal source-free domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4544–4553 (2020)

  74. Kuzborskij, I., Orabona, F.: Stability and hypothesis transfer learning. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, vol. 28 of JMLR Workshop and Conference Proceedings, pp. 942–950. JMLR.org (2013)

  75. Kwon, T.M.: TMC traffic data automation for Mn/DOT’s traffic monitoring program. Univ. of Minnesota Report no. Mn/DOT 2004-29 (2004)

  76. Li, J., Lu, K., Huang, Z., Zhu, L., Shen, H.T.: Transfer independently together: A generalized framework for domain adaptation. IEEE Trans. Cybern. 49(6), 2144–2155 (2018)

  77. Li, Q.: Literature survey: domain adaptation algorithms for natural language processing, pp. 8–10. The City University of New York, Department of Computer Science The Graduate Center (2012)

  78. Li, Q., Zhu, Z., Tang, G.: Alternating minimizations converge to second-order optimal solutions. In: Chaudhuri, K., Salakhutdinov, R. (Eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, vol. 97 of Proceedings of Machine Learning Research, pp. 3935–3943. PMLR (2019)

  79. Liu, H., Shao, M., Fu, Y.: Structure-preserved multi-source domain adaptation. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1059–1064. IEEE (2016)

  80. Long, M., Cao, Y., Wang, J., Jordan, M.I.: Learning transferable features with deep adaptation networks. In: Bach, F.R., Blei, D.M. (Eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, vol. 37 of JMLR Workshop and Conference Proceedings, pp. 97–105. JMLR.org (2015)

  81. Long, M., Zhu, H., Wang, J., Jordan, M.I.: Unsupervised domain adaptation with residual transfer networks. (2016). arXiv:1602.04433

  82. Lu, N., Zhang, T., Fang, T., Teshima, T., Sugiyama, M.: Rethinking importance weighting for transfer learning. (2021). arXiv:2112.10157

  83. Mansour, Y., Mohri, M., Ro, J., Theertha Suresh, A., Wu, K: A theory of multiple-source adaptation with limited target labeled data. In: Banerjee, A., Fukumizu, K. (Eds.), The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, vol. 130 of Proceedings of Machine Learning Research, pp. 2332–2340. PMLR (2021)

  84. Mansour, Y., Mohri, M., Rostamizadeh, A: Domain adaptation: Learning bounds and algorithms. In: COLT 2009 - the 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009 (2009)

  85. Mansour, Y., Mohri, M., Rostamizadeh, A.: Domain adaptation with multiple sources. In: NIPS, pp. 1041–1048 (2009)

  86. Maurer, A.: Bounds for linear multi-task learning. J. Mach. Learn. Res. 7, 117–139 (2006)

  87. Maurer, A., Pontil, M., Romera-Paredes, B.: The benefit of multitask representation learning. J. Mach. Learn. Res. 17, 81:1-81:32 (2016)

  88. Meir, R., Zhang, T.: Generalization error bounds for Bayesian mixture algorithms. J. Mach. Learn. Res. 4, 839–860 (2003)

  89. Mohri, M., Muñoz Medina, A.: New analysis and algorithm for learning with drifting distributions. In: Bshouty, N.H., Stoltz, G., Vayatis, N., Zeugmann, T. (Eds.), Algorithmic Learning Theory - 23rd International Conference, ALT 2012, Lyon, France, October 29-31, 2012. Proceedings, vol. 7568 of Lecture Notes in Computer Science, pp. 124–138. Springer (2012)

  90. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning (Second ed.). MIT Press (2018)

  91. Mohri, M., Sivek, G., Suresh, A.T.: Agnostic federated learning. In: International Conference on Machine Learning, pp. 4615–4625. PMLR (2019)

  92. Motiian, S., Jones, Q., Iranmanesh, S., Doretto, G.: Few-shot adversarial domain adaptation. In: Advances in neural information processing systems, pp. 6670–6680 (2017)

  93. Motiian, S., Piccirilli, M., Adjeroh, D.A., Doretto, G.: Unified deep supervised domain adaptation and generalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5715–5725 (2017)

  94. Muandet, K., Balduzzi, D., Schölkopf, B.: Domain generalization via invariant feature representation. In: ICML, vol. 28, pp. 10–18 (2013)

  95. Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. (2018). arXiv:1803.02999

  96. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009)

  97. Pavlopoulos, J., Sorensen, J., Dixon, L., Thain, N., Androutsopoulos, I.: Toxicity detection: does context really matter? (2020)

  98. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  99. Pei, Z., Cao, Z., Long, M., Wang, J.: Multi-adversarial domain adaptation. In: AAAI, pp. 3934–3941 (2018)

  100. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1406–1415 (2019)

  101. Pentina, A., Ben-David, S.: Multi-task Kernel Learning based on Probabilistic Lipschitzness. In: Janoos, F., Mohri, M., Sridharan, K. (Eds.), Algorithmic Learning Theory, ALT 2018, 7-9 April 2018, Lanzarote, Canary Islands, Spain, vol. 83 of Proceedings of Machine Learning Research, pp. 682–701. PMLR (2018)

  102. Pentina, A., Lampert, C.H.: A PAC-bayesian bound for lifelong learning. In: Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, vol. 32 of JMLR Workshop and Conference Proceedings, pp. 991–999. JMLR.org (2014)

  103. Pentina, A., Lampert, C.H.: Lifelong learning with non-i.i.d. tasks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1540–1548 (2015)

  104. Pentina, A., Lampert, C.H.: Multi-task learning with labeled and unlabeled tasks. In: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, vol. 70 of Proceedings of Machine Learning Research, pp. 2807–2816. PMLR (2017)

  105. Pentina, A., Urner, R.: Lifelong learning with weighted majority votes. In: Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3612–3620 (2016)

  106. Perrot, M., Habrard, A.: A theoretical analysis of metric hypothesis transfer learning. In: Bach, F.R., Blei, D.M. (Eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, vol. 37 of JMLR Workshop and Conference Proceedings, pp. 1708–1717. JMLR.org (2015)

  107. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. Association for Computational Linguistics (2018)

  108. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. (2019). arXiv:1910.10683

  109. Redko, I., Bennani, Y.: Non-negative embedding for fully unsupervised domain adaptation. Pattern Recognit. Lett. 77, 35–41 (2016)

  110. Redko, I., Habrard, A., Sebban, M.: Theoretical analysis of domain adaptation with optimal transport. In: Ceci, M., Hollmén, J., Todorovski, L., Vens, C., Dzeroski, S. (eds.) Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part II. Lecture Notes in Computer Science, vol. 10535, pp. 737–753. Springer (2017)

  111. Rodriguez-Lujan, I., Fonollosa, J., Vergara, A., Homer, M., Huerta, R.: On the calibration of sensor arrays for pattern recognition using the minimal number of experiments. Chemometr. Intell. Lab. Syst. 130, 123–134 (2014). https://doi.org/10.1016/j.chemolab.2013.10.012

  112. Saito, K., Kim, D., Sclaroff, S., Darrell, T., Saenko, K.: Semi-supervised domain adaptation via minimax entropy. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 8050–8058 (2019)

  113. Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723–3732 (2018)

  114. Sener, O., Song, H.O., Saxena, A., Savarese, S.: Learning transferrable representations for unsupervised domain adaptation. In: Advances in Neural Information Processing Systems, pp. 2110–2118 (2016)

  115. Sriperumbudur, B.K., Torres, D.A., Lanckriet, G.R.G.: Sparse eigen methods by D.C. programming. In: ICML, pp. 831–838 (2007)

  116. Subramanian, S., Trischler, A., Bengio, Y., Pal, C.J.: Learning general purpose distributed sentence representations via large scale multi-task learning. (2018). arXiv:1804.00079

  117. Sugiyama, M., Krauledat, M., Müller, K.: Covariate shift adaptation by importance weighted cross validation. J. Mach. Learn. Res. 8, 985–1005 (2007)

  118. Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., Kawanabe, M.: Direct importance estimation with model selection and its application to covariate shift adaptation. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (Eds.), Advances in neural information processing systems 20, Proceedings of the Twenty-first Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pp. 1433–1440. Curran Associates, Inc (2007)

  119. Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30 (2016)

  120. Sun, B., Saenko, K.: Deep coral: Correlation alignment for deep domain adaptation. In: European Conference on Computer Vision, pp. 443–450. Springer (2016)

  121. Sun, Q., Chattopadhyay, R., Panchanathan, S., Ye, J.: A two-stage weighting framework for multi-source domain adaptation. In: Advances in Neural Information Processing Systems, pp. 505–513 (2011)

  122. Tao, P.D., An, L.T.H.: Convex analysis approach to DC programming: theory, algorithms and applications. Acta Math. Vietnam. 22(1), 289–355 (1997)

  123. Tao, P.D., An, L.T.H.: A DC optimization algorithm for solving the trust-region subproblem. SIAM J. Optim. 8(2), 476–505 (1998)

  124. Tuy, H.: Concave programming under linear constraints. Transl. Sov. Math. 5, 1437–1440 (1964)

  125. Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4068–4076 (2015)

  126. Vergara, A., Vembu, S., Ayhan, T., Ryan, M.A., Homer, M.L., Huerta, R.: Chemical gas sensor drift compensation using classifier ensembles. Sens. Actuators B Chem. 166–167, 320–329 (2012). https://doi.org/10.1016/j.snb.2012.01.074

  127. Wang, B., Mendez, J.A., Cai, M., Eaton, E.: Transfer learning via minimizing the performance gap between domains. In: Proceedings of NeurIPS, pp. 10644–10654 (2019)

  128. Wang, C., Mahadevan, S.: Heterogeneous domain adaptation using manifold alignment. In: Twenty-second International Joint Conference on Artificial Intelligence (2011)

  129. Wang, J., Feng, W., Chen, Y., Yu, H., Huang, M., Yu, P.S.: Visual domain adaptation with manifold embedded distribution alignment. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 402–410 (2018)

  130. Wang, M., Deng, W.: Deep visual domain adaptation: a survey. Neurocomputing 312, 135–153 (2018)

  131. Wang, T., Zhang, X., Yuan, L., Feng, J.: Few-shot adaptive faster r-cnn. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7173–7182 (2019)

  132. Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., Le, Q.V.: Finetuned language models are zero-shot learners (2021)

  133. Wen, J., Greiner, R., Schuurmans, D.: Domain aggregation networks for multi-source domain adaptation. In: International Conference on Machine Learning, pp. 10214–10224. PMLR (2020)

  134. Yang, J., Yan, R., Hauptmann, A.G.: Cross-domain video concept detection using adaptive svms. In: ACM multimedia, pp. 188–197 (2007)

  135. Yang, L., Hanneke, S., Carbonell, J.G.: A theory of transfer learning with applications to active learning. Mach. Learn. 90(2), 161–189 (2013)

  136. You, K., Kou, Z., Long, M., Wang, J.: Co-tuning for transfer learning. Adv. Neural Inf. Process. Syst. 33 (2020)

  137. Yuille, A.L., Rangarajan, A.: The concave-convex procedure. Neural Comput. 15(4), 915–936 (2003)

  138. Zhang, K., Schölkopf, B., Muandet, K., Wang,Z.: Domain adaptation under target and conditional shift. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, vol. 28 of JMLR Workshop and Conference Proceedings, pp. 819–827. JMLR.org (2013)

  139. Zhang, T., Yamane, I., Lu, N., Sugiyama, M.: A one-step approach to covariate shift adaptation. In: Proceedings of ACML, vol. 129 of Proceedings of Machine Learning Research, pp. 65–80. PMLR (2020)

  140. Zhang, Y., Liu, T., Long, M., Jordan, M.: Bridging theory and algorithm for domain adaptation. In: Chaudhuri, K., Salakhutdinov, R. (Eds.), Proceedings of the 36th International Conference on Machine Learning, 9–15 June 2019, vol. 97 of Proceedings of Machine Learning Research, pp. 7404–7413. PMLR (2019a)

  141. Zhang, Y., Liu, T., Long, M., Jordan, M.: Bridging theory and algorithm for domain adaptation. In: International Conference on Machine Learning, pp. 7404–7413. PMLR (2019b)

  142. Zhang, Y., Liu, T., Long, M., Jordan, M.I.: Bridging theory and algorithm for domain adaptation. In: K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, vol. 97 of Proceedings of Machine Learning Research, pp. 7404–7413. PMLR (2019c)

  143. Zhang, Y., Long, M., Wang, J., Jordan, M.I.: On localized discrepancy for domain adaptation. (2020). arXiv:2008.06242

  144. Zhao, H., Des Combes, R.T., Zhang, K., Gordon, G.: On learning invariant representations for domain adaptation. In: International Conference on Machine Learning, pp. 7523–7532. PMLR (2019)

  145. Zhao, H., Zhang, S., Wu, G., Moura, J.M., Costeira, J.P., Gordon, G.J.: Adversarial multiple source domain adaptation. Adv. Neural Inf. Process. Syst. 31, 8559–8570 (2018)

  146. Zheng, L., Liu, G., Yan, C., Jiang, C., Zhou, M., Li, M.: Improved tradaboost and its application to transaction fraud detection. IEEE Trans. Comput. Soc. Syst. 7(5), 1304–1316 (2020)

Author information

Corresponding author

Correspondence to Mehryar Mohri.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Best-effort adaptation

1.1 A.1 Theorems and proofs

Below we will work with a notion of discrepancy extended to finite signed measures, as defined in ().

Theorem 1

Fix a vector \(\textsf {q}\) in \([0, 1]^{[m + n]}\). Then, for any \(\delta > 0\), with probability at least \(1 - \delta \) over the draw of an i.i.d. sample S of size m from \({\mathscr {Q}}\) and an i.i.d. sample \(S'\) of size n from \({\mathscr {P}}\), the following holds for all \(h \in {\mathscr {H}}\):

$$\begin{aligned} \mathcal {L}({\mathscr {P}}, h) \le \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \text {dis}\big ({\left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}}\big ) + 2 \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) + \Vert \textsf {q}\Vert _2 \sqrt{\frac{\log \frac{1}{\delta }}{2}}. \end{aligned}$$

Proof

Let \(S = ((x_1, y_1), \ldots , (x_m, y_m))\) be a sample of size m drawn i.i.d. from \({\mathscr {Q}}\) and similarly \(S' = ((x_{m + 1}, y_{m + 1}), \ldots , (x_{m + n}, y_{m + n}))\) a sample of size n drawn i.i.d. from \({\mathscr {P}}\). Let T denote the sample formed by S and \(S'\), \(T = (S, S')\). For any such sample T, define \(\Phi (T)\) as follows:

$$\begin{aligned} \Phi (T) = \sup _{h \in {\mathscr {H}}} \{ \overline{\textsf {q}}\mathcal {L}({\mathscr {Q}}, h) + (\big \Vert {\textsf {q}}\big \Vert _1 - \overline{\textsf {q}}) \mathcal {L}({\mathscr {P}}, h) - \mathcal {L}_T(\textsf {q}, h)\}, \end{aligned}$$

with \(\mathcal {L}_T(\textsf {q}, h) = \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i)\). Changing a sample point \((x_i, y_i)\) to some other point \((x'_i, y'_i)\) affects \(\Phi (T)\) by at most \(\textsf {q}_i\). Thus, by McDiarmid’s inequality, for any \(\delta > 0\), with probability at least \(1 - \delta \), the following holds for all \(h \in {\mathscr {H}}\):

$$\begin{aligned} \overline{\textsf {q}}\mathcal {L}({\mathscr {Q}}, h) + (\big \Vert {\textsf {q}}\big \Vert _1 - \overline{\textsf {q}}) \mathcal {L}({\mathscr {P}}, h) \le \mathcal {L}_T(\textsf {q}, h) + \mathbb E[\Phi (T)] + \Vert \textsf {q}\Vert _2 \sqrt{\frac{\log \frac{1}{\delta }}{2}}. \end{aligned}$$
(A.1)

Now, let \(T' = ((x'_1, y'_1), \ldots , (x'_m, y'_m), (x'_{m + 1}, y'_{m + 1}), \ldots , (x'_{m + n}, y'_{m + n}))\) be a sample drawn according to the same distribution as T. Then, we can write:

$$\begin{aligned} \underset{T'}{\mathbb E}[\mathcal {L}_{T'}(q, h)]&= \sum _{i=1}^m q_i \mathbb E[\ell (h(x'_i), y'_i)] + \sum _{i = m + 1}^{m+n} q_i \mathbb E[\ell (h(x'_i), y'_i)] \nonumber \\&{(\text {linearity of expectation and weights } \textsf {q}_i \text { independent of } T)} \nonumber \\&= \sum _{i = 1}^m q_i \mathcal {L}({\mathscr {Q}}, h) + \sum _{i=m+1}^{m+n} q_i \mathcal {L}({\mathscr {P}}, h) {(\text {i.i.d. sample})} \nonumber \\&= \overline{\textsf {q}}\mathcal {L}({\mathscr {Q}}, h) + (\big \Vert {\textsf {q}}\big \Vert _1 - \overline{\textsf {q}})\mathcal {L}({\mathscr {P}}, h). \end{aligned}$$
(A.2)

In light of that equality, we can analyze the expectation term as follows:

$$\begin{aligned} \mathbb E[\Phi (T)]&= \underset{T}{\mathbb E}\left[ {\sup _{h \in {\mathscr {H}}} \overline{\textsf {q}}\mathcal {L}({\mathscr {Q}}, h) + (\big \Vert {\textsf {q}}\big \Vert _1 - \overline{\textsf {q}}) \mathcal {L}({\mathscr {P}}, h) - \mathcal {L}_T(\textsf {q}, h)}\right] \\&= \underset{T}{\mathbb E}\left[ {\sup _{h \in {\mathscr {H}}} \underset{T'}{\mathbb E}\ \left[ {\mathcal {L}_{T'}(\textsf {q}, h)}\right] - \mathcal {L}_T(\textsf {q}, h)}\right] \\&= \underset{T}{\mathbb E}\left[ {\sup _{h \in {\mathscr {H}}} \underset{T'}{\mathbb E}\ \left[ {\mathcal {L}_{T'}(\textsf {q}, h) - \mathcal {L}_T(\textsf {q}, h)}\right] }\right] \quad \quad \quad \quad \quad \quad {(\mathcal {L}_{T}(\textsf {q}, h) \text { independent of } T')}\\&\le \underset{T, T'}{\mathbb E}\ \left[ {\sup _{h \in {\mathscr {H}}} \mathcal {L}_{T'}(\textsf {q}, h) - \mathcal {L}_T(\textsf {q}, h)}\right] \quad \quad \quad \quad \quad \quad \quad \ \ {(\text {sub-additivity of supremum})}\\&= \underset{T, T'}{\mathbb E}\ \left[ {\sup _{h \in {\mathscr {H}}} \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x'_i), y'_i) - \textsf {q}_i \ell (h(x_i), y_i)}\right] \\&= \underset{T, T', {\varvec{\sigma }}}{\mathbb E}\ \left[ {\sup _{h \in {\mathscr {H}}} \sum _{i = 1}^{m + n} \sigma _i \big ({\textsf {q}_i \ell (h(x'_i), y'_i) - \textsf {q}_i \ell (h(x_i), y_i)}\big )}\right] \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \quad \ \ {(\text {introducing Rademacher variables } \sigma _i)}\\&\le \underset{T', {\varvec{\sigma }}}{\mathbb E}\ \left[ {\sup _{h \in {\mathscr {H}}} \sum _{i = 1}^{m + n} \sigma _i \textsf {q}_i \ell (h(x'_i), y'_i)}\right] + \underset{T, {\varvec{\sigma }}}{\mathbb E}\ \left[ {\sup _{h \in {\mathscr {H}}} \sum _{i = 1}^{m + n} -\sigma _i \textsf {q}_i \ell (h(x_i), y_i)}\right] \\&\qquad \qquad \qquad \qquad \qquad \qquad \quad \ {(\text {sub-additivity of supremum and linearity of expectation})}\\&= 2 \underset{T, {\varvec{\sigma }}}{\mathbb E}\ \left[ {\sup _{h \in {\mathscr {H}}} \sum _{i = 1}^{m + n} \sigma _i \textsf {q}_i \ell (h(x_i), y_i)}\right] \quad \ \ {(-\sigma _i \text { and } \sigma _i \text { follow the same distribution})}\\&= 2 \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}). \end{aligned}$$

Finally, using the upper bound

$$\begin{aligned} \mathcal {L}({\mathscr {P}}, h) - \left[ {\overline{\textsf {q}}\mathcal {L}({\mathscr {Q}}, h) + (\big \Vert {\textsf {q}}\big \Vert _1 - \overline{\textsf {q}}) \mathcal {L}({\mathscr {P}}, h)}\right]&= \left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] \mathcal {L}({\mathscr {P}}, h) - \overline{\textsf {q}}\mathcal {L}({\mathscr {Q}}, h)\\&\le \text {dis}\big ({\left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}}\big ), \end{aligned}$$

inequality (A.1), and the upper bound on \(\mathbb E[\Phi (T)]\), we obtain:

$$\begin{aligned} \mathcal {L}({\mathscr {P}}, h)&\le \left[ {\overline{\textsf {q}}\mathcal {L}({\mathscr {Q}}, h) + (\big \Vert {\textsf {q}}\big \Vert _1 - \overline{\textsf {q}}) \mathcal {L}({\mathscr {P}}, h)}\right] + \text {dis}\big ({\left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}}\big )\\&\le \mathcal {L}_T(\textsf {q}, h) + \text {dis}\big ({\left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}}\big ) + 2 \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) + \Vert \textsf {q}\Vert _2 \sqrt{\frac{\log \frac{1}{\delta }}{2}}, \end{aligned}$$

which completes the proof. \(\square \)

Next, we show that the learning bound just proven is tight in terms of the weighted-discrepancy term.

Theorem 2

Fix a distribution \(\textsf {q}\) in the simplex \(\Delta _{m + n}\). Then, for any \(\epsilon > 0\), there exists \(h \in {\mathscr {H}}\) such that, for any \(\delta > 0\), the following lower bound holds with probability at least \(1 - \delta \) over the draw of an i.i.d. sample S of size m from \({\mathscr {Q}}\) and an i.i.d. sample \(S'\) of size n from \({\mathscr {P}}\):

$$\begin{aligned} \mathcal {L}({\mathscr {P}}, h) \ge \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) - 2 \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) - \Vert \textsf {q}\Vert _2 \sqrt{\frac{\log \frac{1}{\delta }}{2}} -\epsilon . \end{aligned}$$

In particular, for \(\big \Vert {\textsf {q}}\big \Vert _2,\mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) \in O\big ({\frac{1}{\sqrt{m + n}}}\big )\), we have:

$$\begin{aligned} \mathcal {L}({\mathscr {P}}, h) \ge \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) - O\big ({\frac{1}{\sqrt{m + n}}}\big ). \end{aligned}$$

Proof

Let \(\mathcal {L}(\textsf {q}, h)\) denote \(\sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i)\). By definition of discrepancy as a supremum, for any \(\epsilon > 0\), there exists \(h \in {\mathscr {H}}\) such that \(\mathcal {L}({\mathscr {P}}, h) - \mathcal {L}({\mathscr {Q}}, h) \ge \text {dis}({\mathscr {P}}, {\mathscr {Q}}) - \epsilon \). For that h, we have

$$\begin{aligned} \mathcal {L}({\mathscr {P}}, h) - \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) - \mathcal {L}(\textsf {q}, h)&\ge \mathcal {L}({\mathscr {P}}, h) - \overline{\textsf {q}}\big ({\mathcal {L}({\mathscr {P}}, h) - \mathcal {L}({\mathscr {Q}}, h)}\big ) - \mathcal {L}(\textsf {q}, h) - \epsilon \\&= (1 - \overline{\textsf {q}}) \mathcal {L}({\mathscr {P}}, h) + \overline{\textsf {q}}\mathcal {L}({\mathscr {Q}}, h) - \mathcal {L}(\textsf {q}, h) - \epsilon \\&= \mathbb E[\mathcal {L}(\textsf {q}, h)] - \mathcal {L}(\textsf {q}, h) - \epsilon . \end{aligned}$$

By McDiarmid’s inequality, with probability at least \(1 - \delta \), we have \(\mathbb E[\mathcal {L}(\textsf {q}, h)] - \mathcal {L}(\textsf {q}, h) \ge - 2 \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) - \Vert \textsf {q}\Vert _2 \sqrt{\frac{\log \frac{1}{\delta }}{2}}\). Thus, we have:

$$\begin{aligned} \mathcal {L}({\mathscr {P}}, h) - \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) - \mathcal {L}(\textsf {q}, h) \ge - 2 \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) - \Vert \textsf {q}\Vert _2 \sqrt{\frac{\log \frac{1}{\delta }}{2}} - \epsilon . \end{aligned}$$

The last inequality follows directly by using the assumptions and Lemma 10. \(\square \)

Theorem 3

For any \(\delta > 0\), with probability at least \(1 - \delta \) over the draw of an i.i.d. sample S of size m from \({\mathscr {Q}}\) and an i.i.d. sample \(S'\) of size n from \({\mathscr {P}}\), the following holds for all \(h \in {\mathscr {H}}\) and \(\textsf {q}\in \{\textsf {q}:\Vert \textsf {q}- \textsf {p}^0 \Vert _1 < 1\}\):

$$\begin{aligned}&\mathcal {L}({\mathscr {P}}, h) \le \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \text {dis}\big ({\left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}}\big ) + \text {dis}(\textsf {p}^0, \textsf {q})\\&+ 2 \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) + 7 \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1 + \left[ {\big \Vert {\textsf {q}}\big \Vert _2 + 2 \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1}\right] \left[ { \sqrt{\log \log _2 \tfrac{2}{1 - \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1}} + \sqrt{\tfrac{\log \frac{2}{\delta }}{2}} }\right] . \end{aligned}$$

Proof

Consider two sequences \((\epsilon _k)_{k \ge 0}\) and \((\textsf {q}^k)_{k \ge 0}\). By Theorem 1, for any fixed \(k \ge 0\), we have:

$$\begin{aligned} \mathbb {P}\Bigg [ \mathcal {L}({\mathscr {P}}, h) > \sum _{i = 1}^{m + n} \textsf {q}^k_i \ell (h(x_i), y_i) + \text {dis}\left( {\left[ {(1 - \big \Vert {\textsf {q}^k}\big \Vert _1) + \overline{\textsf {q}}^k}\right] {\mathscr {P}}, \overline{\textsf {q}}^k {\mathscr {Q}}}\right) \\ + 2 \mathfrak {R}_{\textsf {q}^k}(\ell \circ {\mathscr {H}}) + \frac{\Vert \textsf {q}^k \Vert _2}{\sqrt{2}} \epsilon _k \Bigg ] \le e^{-\epsilon _k^2}. \end{aligned}$$

Choose \(\epsilon _k = \epsilon + \sqrt{2 \log (k + 1)}\). Then, by the union bound, we can write:

$$\begin{aligned} \mathbb {P}\Bigg [&\exists k \ge 0 :\mathcal {L}({\mathscr {P}}, h) > \sum _{i = 1}^{m + n} \textsf {q}^k_i \ell (h(x_i), y_i) + \text {dis}\big ({\left[ {(1 - \big \Vert {\textsf {q}^k}\big \Vert _1) + \overline{\textsf {q}}^k}\right] {\mathscr {P}}, \overline{\textsf {q}}^k {\mathscr {Q}}}\big ) \\&+ 2 \mathfrak {R}_{\textsf {q}^k}(\ell \circ {\mathscr {H}}) + \frac{\Vert \textsf {q}^k \Vert _2}{\sqrt{2}} \epsilon _k \Bigg ] \nonumber \\&\le \sum _{k = 0}^{+\infty } e^{-\epsilon _k^2} \le \sum _{k = 0}^{+\infty } e^{-\epsilon ^2 - \log ((k + 1)^2)} = e^{-\epsilon ^2} \sum _{k = 1}^{+\infty } \frac{1}{k^2} = \frac{\pi ^2}{6} e^{-\epsilon ^2} \le 2 e^{-\epsilon ^2}.\nonumber \end{aligned}$$
(A.3)

We can choose \(\textsf {q}^k\) such that \(\Vert \textsf {q}^k - \textsf {p}^0 \Vert _1 = 1 - \frac{1}{2^k}\). Then, for any \(\textsf {q}\in \{\textsf {q}:\Vert \textsf {q}- \textsf {p}^0 \Vert _1 < 1\}\), there exists \(k \ge 0\) such that \(\Vert \textsf {q}^k - \textsf {p}^0 \Vert _1 \le \Vert \textsf {q}- \textsf {p}^0 \Vert _1 < \Vert \textsf {q}^{k + 1} - \textsf {p}^0 \Vert _1\) and thus such that

$$\begin{aligned} \sqrt{2 \log (k + 1)} = \sqrt{2 \log \log _2 \frac{1}{1 - \big \Vert {\textsf {q}^{k + 1} - \textsf {p}^0}\big \Vert _1}}&= \sqrt{2 \log \log _2 \frac{2}{1 - \big \Vert {\textsf {q}^k - \textsf {p}^0}\big \Vert _1}}\\&\le \sqrt{2 \log \log _2 \frac{2}{1 - \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1}}. \end{aligned}$$

Furthermore, for that k, the following inequalities hold:

$$\begin{aligned} \sum _{i = 1}^{m + n} \textsf {q}^k_i \ell (h(x_i), y_i)&\le \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \text {dis}(\textsf {q}^k, \textsf {q})\\&\le \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \text {dis}(\textsf {q}^k, \textsf {p}^0) + \text {dis}(\textsf {p}^0, \textsf {q})\\&\le \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \big \Vert {\textsf {q}^k - \textsf {p}^0}\big \Vert _1 + \text {dis}(\textsf {p}^0, \textsf {q})\\&\le \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1 + \text {dis}(\textsf {p}^0, \textsf {q}),\\ \text {dis}\left( {\left[ {(1 - \big \Vert {\textsf {q}^k}\big \Vert _1) + \overline{\textsf {q}}^k}\right] {\mathscr {P}}, \overline{\textsf {q}}^k {\mathscr {Q}}}\right)&\le \text {dis}\left( \left[ {\left( 1 - \big \Vert {\textsf {q}}\big \Vert _1\right) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}\right) \\&\qquad + \big \Vert { \left[ {(\big \Vert {\textsf {q}}\big \Vert _1 - \overline{\textsf {q}}) - (\big \Vert {\textsf {q}^k}\big \Vert _1 - \overline{\textsf {q}}^k)}\right] {\mathscr {P}}\!+\! \left[ {\overline{\textsf {q}}- \overline{\textsf {q}}^k}\right] {\mathscr {Q}}}\big \Vert _1\\&\le \text {dis}\left( \left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}\right) + \big \Vert {\textsf {q}^k - \textsf {q}}\big \Vert _1\\&\le \text {dis}\left( \left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}\right) + 2 \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1,\\ \mathfrak {R}_{\textsf {q}^k}(\ell \circ {\mathscr {H}})&\le \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) + \big \Vert {\textsf {q}^k - \textsf {q}}\big \Vert _1 \le \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) \!+\! 2 \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1,\\ \text {and} \qquad \big \Vert {\textsf {q}^k}\big \Vert _2&\le \big \Vert {\textsf {q}}\big \Vert _2 + \big \Vert {\textsf {q}^k - \textsf {q}}\big \Vert _2\\&\le \big \Vert {\textsf {q}}\big \Vert _2 + \big \Vert {\textsf {q}^k - \textsf {q}}\big \Vert _1 \le \big \Vert {\textsf {q}}\big \Vert _2 + 2 \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1. \end{aligned}$$

Plugging these inequalities into (A.3) concludes the proof. \(\square \)

Corollary 4

For any \(\delta > 0\), with probability at least \(1 - \delta \) over the draw of an i.i.d. sample S of size m from \({\mathscr {Q}}\) and an i.i.d. sample \(S'\) of size n from \({\mathscr {P}}\), the following holds for all \(h \in {\mathscr {H}}\) and \(\textsf {q}\in \{\textsf {q}:\Vert \textsf {q}- \textsf {p}^0 \Vert _1 < 1\}\):

$$\begin{aligned} \mathcal {L}({\mathscr {P}}, h)&\le \sum _{i = 1}^{m + n} \textsf {q}_i \ell (h(x_i), y_i) + \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) + \text {dis}(\textsf {p}^0, \textsf {q}) + 2 \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}})\\&\quad + 8 \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1 + \left[ {\big \Vert {\textsf {q}}\big \Vert _2 + 2 \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1}\right] \left[ { \sqrt{\log \log _2 \tfrac{2}{1 - \big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1}} + \sqrt{\frac{\log \tfrac{2}{\delta }}{2}} }\right] . \end{aligned}$$

Proof

Note that the discrepancy term of the bound of Theorem 3 can be further upper bounded as follows:

$$\begin{aligned}&\text {dis}\big ({\left[ {(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\right] {\mathscr {P}}, \overline{\textsf {q}}{\mathscr {Q}}} \big )\\&= \underset{h \in {\mathscr {H}}}{\sup }\ \left\{ \big [{(1 - \big \Vert {\textsf {q}}\big \Vert _1) + \overline{\textsf {q}}}\big ] \underset{(x, y) \sim {\mathscr {P}}}{\mathbb E}\ [\ell (h(x), y)] - \overline{\textsf {q}}\underset{(x, y) \sim {\mathscr {Q}}}{\mathbb E}\ [\ell (h(x), y)]\right\} \\&\le \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) + \left| {1 - \big \Vert {\textsf {q}}\big \Vert _1}\right| \underset{h \in {\mathscr {H}}}{\sup }\ \underset{(x, y) \sim {\mathscr {P}}}{\mathbb E}\ [\ell (h(x), y)]\\&\le \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) + \left| {1 - \big \Vert {\textsf {q}}\big \Vert _1}\right| \\&= \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) + \left| {\big \Vert {\textsf {p}^0}\big \Vert _1 - \big \Vert {\textsf {q}}\big \Vert _1} \right| \\&\le \overline{\textsf {q}}\text {dis}({\mathscr {P}}, {\mathscr {Q}}) + \big \Vert {\textsf {p}^0 - \textsf {q}}\big \Vert _1. \end{aligned}$$

Plugging this into the right-hand side of the bound of Theorem 3 completes the proof. \(\square \)

Lemma 10

Fix a distribution \(\textsf {q}\) over \([m + n]\). Then, the following holds for the \(\textsf {q}\)-weighted Rademacher complexity:

$$\begin{aligned} \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) \le \Vert \textsf {q}\Vert _\infty (m + n) \, \mathfrak {R}_{m + n}(\ell \circ {\mathscr {H}}). \end{aligned}$$

Proof

Since for any \(i \in [m + n]\), the function \(\varphi _i :x \mapsto \textsf {q}_i x\) is \(\textsf {q}_i\)-Lipschitz and thus \(\big \Vert {\textsf {q}}\big \Vert _\infty \)-Lipschitz, the lemma follows from a result of Meir and Zhang [88, Theorem 7]. \(\square \)

Note that the bound of the lemma is tight: equality holds when \(\textsf {q}\) is chosen to be the uniform distribution. By McDiarmid’s inequality, the \(\textsf {q}\)-weighted Rademacher complexity can be estimated from the empirical quantity

$$\begin{aligned} \widehat{\mathfrak {R}}_{\textsf {q}, S, S'}(\ell \circ {\mathscr {H}}) = \underset{\varvec{\sigma }}{\mathbb E}\ \left[ {\underset{h \in {\mathscr {H}}}{\sup }\sum _{i = 1}^{m + n} \sigma _i \textsf {q}_i \ell (h(x_i), y_i)}\right] , \end{aligned}$$

modulo a term in \(O(\Vert \textsf {q}\Vert _2)\).
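
As an illustration of this empirical quantity, here is a minimal Monte Carlo sketch, assuming a finite pool of candidate hypotheses whose losses have been precomputed; the finite pool, the number of Rademacher draws and the use of NumPy are assumptions made only for this illustration, not part of the analysis above.

```python
# Monte Carlo estimate of the empirical q-weighted Rademacher complexity
# for a finite pool of candidate hypotheses (illustrative assumption).
import numpy as np

def empirical_weighted_rademacher(q, losses, num_draws=1000, seed=0):
    """q: array of shape (m + n,) with the weights q_i;
    losses: array of shape (H, m + n) with precomputed values
    ell(h(x_i), y_i) for each candidate hypothesis h in the pool."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=q.shape[0])  # Rademacher variables
        # sup over the pool of sum_i sigma_i q_i ell(h(x_i), y_i)
        total += np.max(losses @ (sigma * q))
    return total / num_draws
```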

1.2 A.2 Convex optimization solution

In the case of the squared loss with a hypothesis set of linear or kernel-based functions, the optimization problem defining best can be formulated as a convex optimization problem.

We can proceed as follows when \(\ell \) is the squared loss. We introduce new variables \({\textsf {u}}_i = 1/\textsf {q}_i\), \({\textsf {v}}_i = 1/{\textsf {p}^0_i}\) and define the convex set \({\mathscr {U}}= \{ {\textsf {u}}:{\textsf {u}}_i \ge 1 \} \). Using the following four expressions:

$$\begin{aligned} \textsf {q}_i (h(x_i) - y_i)^2 = \frac{(h(x_i) - y_i)^2}{{\textsf {u}}_i},&\big \Vert {\textsf {q}}\big \Vert _2^2 = \sum _i \frac{1}{{\textsf {u}}_i^2},\\ \big \Vert {\textsf {q}}\big \Vert _\infty \big \Vert {h}\big \Vert ^2 = \max _i \frac{\big \Vert {h}\big \Vert ^2}{{\textsf {u}}_i} = \frac{\big \Vert {h}\big \Vert ^2}{{\textsf {u}}_{\min }},&\big \Vert {\textsf {q}- \textsf {p}^0}\big \Vert _1 \le \sum _i \left| {{\textsf {v}}_i - {\textsf {u}}_i}\right| = \big \Vert {{\textsf {u}}- {\textsf {v}}}\big \Vert _1, \end{aligned}$$

leads to the following convex optimization problem with new hyperparameters \(\gamma _\infty , \gamma _1, \gamma _2\):

$$\begin{aligned} \min _{h \in {\mathscr {H}}, {\textsf {u}}\in {\mathscr {U}}}&\sum _{i = 1}^{m + n} \frac{(h(x_i) - y_i)^2 + d_i}{{\textsf {u}}_i} + \text {dis}\big ({\left( { \tfrac{1}{{\textsf {u}}_i}}\right) _i, \left( {\tfrac{1}{{\textsf {v}}_i}}\right) _i}\big ) \\&+ \gamma _\infty \frac{\big \Vert {h}\big \Vert ^2}{{\textsf {u}}_{\min }} + \gamma _1 \big \Vert {{\textsf {u}}- {\textsf {v}}}\big \Vert _1 + \gamma _2 \sum _{i = 1}^{m + n} \frac{1}{{\textsf {u}}_i^2}. \end{aligned}$$

Note that the first term is jointly convex as a sum of quadratic-over-linear or matrix fractional functions [14]. When \({\mathscr {H}}\) is a subset of the reproducing kernel Hilbert space associated with a positive definite kernel K, for a fixed \({\textsf {u}}\), the problem coincides with a standard kernel ridge regression problem. Thus, we can rewrite it in terms of the dual variables \(\varvec{\alpha }\), the kernel matrix K, \(Y = (y_1, \ldots , y_{m + n})^\top \) and the diagonal matrix \(U = \text {diag}(u_1, \ldots , u_{m + n})\) as follows:

$$\begin{aligned} \min _{{\textsf {u}}\in {\mathscr {U}}} \max _{\varvec{\alpha }}&- \varvec{\alpha }^\top \big ({K + \frac{\gamma _\infty }{{\textsf {u}}_{\min }} U}\big ) \varvec{\alpha }+ 2 \varvec{\alpha }^\top Y + \sum _{i = 1}^{m + n} \frac{d_i}{{\textsf {u}}_i}\\&+ \text {dis}\big ({\left( { \tfrac{1}{{\textsf {u}}_i}}\right) _i, \left( {\tfrac{1}{{\textsf {v}}_i}}\right) _i}\big ) + \gamma _1 \big \Vert {{\textsf {u}}- {\textsf {v}}}\big \Vert _1 + \gamma _2 \sum _{i = 1}^{m + n} \frac{1}{{\textsf {u}}_i^2}. \end{aligned}$$

Solving for \(\varvec{\alpha }\) yields the following convex optimization problem:

$$\begin{aligned} \min _{{\textsf {u}}\in {\mathscr {U}}} Y^\top \big ({K + \frac{\gamma _\infty }{{\textsf {u}}_{\min }} U}\big )^{-1} Y + \sum _{i = 1}^{m + n} \frac{d_i}{{\textsf {u}}_i} + \gamma _1 \big \Vert {{\textsf {u}}- {\textsf {v}}}\big \Vert _1 + \gamma _2 \sum _{i = 1}^{m + n} \frac{1}{{\textsf {u}}_i^2}. \end{aligned}$$

Standard descent methods such as SGD can be used to solve this problem. Note that the above can be further simplified using the upper bound \(1/{{\textsf {u}}_{\min }} \le \sum _{i = 1}^{m + n} 1/{u_i}\).
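
For concreteness, here is a minimal sketch of solving the reduced problem over \({\textsf {u}}\) with an off-the-shelf solver. The following choices are assumptions made only for this illustration, not prescribed above: \(U\) is taken to be \(\text{diag}(u_1, \ldots, u_{m+n})\), the term \(1/{\textsf {u}}_{\min }\) is replaced by the upper bound \(\sum _i 1/{\textsf {u}}_i\) just mentioned, the weight-discrepancy term is folded into the \(\ell _1\) penalty, and L-BFGS-B with numerical gradients is used even though the \(\ell _1\) term is nonsmooth.

```python
# Sketch of the reduced convex problem over u (see assumptions above).
import numpy as np
from scipy.optimize import minimize

def solve_for_u(K, Y, d, v, gamma_inf, gamma_1, gamma_2):
    mn = Y.shape[0]
    u0 = np.maximum(v, 1.0)                      # feasible starting point

    def objective(u):
        M = K + gamma_inf * np.sum(1.0 / u) * np.diag(u)
        quad = Y @ np.linalg.solve(M, Y)         # Y^T (K + c U)^{-1} Y
        return (quad
                + np.sum(d / u)                  # sum_i d_i / u_i
                + gamma_1 * np.sum(np.abs(u - v))
                + gamma_2 * np.sum(1.0 / u ** 2))

    res = minimize(objective, u0, method="L-BFGS-B",
                   bounds=[(1.0, None)] * mn)    # constraint u_i >= 1
    return res.x
```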

1.3 A.3 Discrepancy estimation

First, note that if the \({\mathscr {P}}\)-drawn labeled sample at our disposal is sufficiently large, we can reserve a sub-sample of size \(n_1\) to train a relatively accurate model \(h_{\mathscr {P}}\). Thus, we can subsequently reduce \({\mathscr {H}}\) to a ball \({\textsf {B}}(h_{\mathscr {P}}, r)\) of radius \(r \sim \frac{1}{\sqrt{n_1}}\). This helps us work with a finer local labeled discrepancy since the maximum in the definition is then taken over a smaller set.

We do not have access to the discrepancy value \(\text {dis}({\mathscr {P}}, {\mathscr {Q}})\), which defines the \(d_i\)s. Instead, we can use the labeled samples from \({\mathscr {Q}}\) and \({\mathscr {P}}\) to estimate it. Our estimate \(\widehat{d}\) of the discrepancy is given by

$$\begin{aligned} \widehat{d} = \max _{h \in {\mathscr {H}}} \left\{ \frac{1}{n} \sum _{i = m + 1}^{m + n} \ell (h(x_i), y_i) - \frac{1}{m} \sum _{i = 1}^m \ell (h(x_i), y_i) \right\} . \end{aligned}$$
(A.4)

Thus, for a convex loss \(\ell \), the optimization problem for computing \(\widehat{d}\) can be naturally cast as a DC-programming problem, which can be tackled using the DCA algorithm [123] and related methods already discussed for sbest. For the squared loss, the DCA algorithm is guaranteed to converge to a global optimum [123].
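
As a direct illustration of the estimate (A.4), the sketch below simply takes the maximum over a finite pool of candidate hypotheses; the pool, the predict helper and the squared loss are assumptions made only for this example, while the DCA approach described next handles a general convex loss over all of \({\mathscr {H}}\).

```python
# Direct evaluation of (A.4) over a finite pool of candidate hypotheses
# (illustrative assumption; `predict` is a placeholder helper).
import numpy as np

def discrepancy_estimate(hypotheses, predict, X_Q, y_Q, X_P, y_P):
    def avg_sq_loss(h, X, y):
        return np.mean((predict(h, X) - y) ** 2)
    # max over the pool of [loss on the P-sample minus loss on the Q-sample]
    return max(avg_sq_loss(h, X_P, y_P) - avg_sq_loss(h, X_Q, y_Q)
               for h in hypotheses)
```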

Fig. 2 Alternate minimization procedure for best-effort adaptation

By McDiarmid’s inequality, with high probability, \(\left| {\text {dis}({\mathscr {P}}, {\mathscr {Q}}) - \widehat{d}}\right| \) can be bounded by \(O(\sqrt{\frac{m + n}{mn}})\). More refined bounds such as relative deviation bounds or Bernstein-type bounds provide more favorable guarantees when the discrepancy is relatively small. When \({\mathscr {H}}\) is chosen to be a small ball \({\textsf {B}}(h_{\mathscr {P}}, r)\), our estimate of the discrepancy is further refined.

The optimization problem (A.4) can be equivalently solved via the following minimization:

$$\begin{aligned} \widehat{d}&= - \min _{h \in {\mathscr {H}}} \left\{ \frac{1}{m} \sum _{i = 1}^m \ell (h(x_i), y_i) -\frac{1}{n} \sum _{i = m + 1}^{m + n} \ell (h(x_i), y_i)\right\} . \end{aligned}$$

The DCA solution for this problem then consists of solving a sequence of T convex optimization problems, where \(h_1 \in {\mathscr {H}}\) is chosen arbitrarily and where \(h_{t + 1}\), \(t \in [T]\), is obtained as follows:

$$\begin{aligned} h_{t + 1}&\in \underset{h \in {\mathscr {H}}}{\arg \min }\ \left\{ \frac{1}{m} \sum _{i = 1}^m \ell (h(x_i), y_i) -\frac{1}{n} \sum _{i = m + 1}^{m + n} \nabla \ell (h_t(x_i), y_i) \cdot \big ({h(x_i) - h_t(x_i)}\big )\right\} . \end{aligned}$$

The second term of the objective is obtained by linearizing the subtracted (concave) part of the objective around \(h_t\).
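As an illustration, the following is a minimal sketch of this DCA iteration for the squared loss with linear hypotheses \(h(x) = w \cdot x\). The function name `dca_discrepancy` and the ridge penalty `lam`, used here as a simple stand-in for the norm constraint defining \({\mathscr {H}}\), are assumptions made for the sketch.

```python
import numpy as np

def dca_discrepancy(Xs, ys, Xt, yt, lam=1e-3, T=50):
    """DCA iteration for estimating d_hat with the squared loss and linear
    hypotheses h(x) = w . x; `lam` is a ridge penalty standing in for the
    norm constraint on the hypothesis set."""
    m, n, d = len(ys), len(yt), Xs.shape[1]
    w = np.zeros(d)                                    # h_1 chosen arbitrarily
    A = Xs.T @ Xs / m + lam * np.eye(d)                # Hessian of the convex part
    b = Xs.T @ ys / m
    for _ in range(T):
        g = 2.0 / n * Xt.T @ (Xt @ w - yt)             # gradient of the subtracted term at h_t
        # convex sub-problem: (1/m)||Xs w - ys||^2 + lam ||w||^2 - g . w
        w = np.linalg.solve(2 * A, 2 * b + g)
    # estimate of d_hat: target average loss minus source average loss at the final iterate
    return np.mean((Xt @ w - yt) ** 2) - np.mean((Xs @ w - ys) ** 2)
```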

1.4 A.4 Pseudocode of alternate minimization procedure
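Figure 2 above summarizes the procedure. The following is a minimal Python sketch of that loop, alternating a weighted-ERM step in h with a projected-gradient step in \(\textsf {q}\). It is only a skeleton: the helper names, the use of logistic regression for the h-step, and the simplified \(\textsf {q}\)-gradient (which omits the discrepancy and regularization terms of the actual \(\text {sbest}\) objective) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def project_simplex(w):
    # Euclidean projection onto the probability simplex
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(w)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(w + theta, 0)

def alternate_minimization(X, y, T=20, step=0.1):
    """Skeleton of the loop in Fig. 2: alternate a weighted-ERM step in h with a
    projected-gradient step in q.  The q-gradient below only accounts for the
    weighted empirical loss; the discrepancy and regularization terms of the
    actual sbest objective would be added to `grad`."""
    N = len(y)
    q = np.full(N, 1.0 / N)
    for _ in range(T):
        h = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=q * N)
        P = h.predict_proba(X)
        idx = (y == h.classes_[1]).astype(int)
        losses = -np.log(np.clip(P[np.arange(N), idx], 1e-12, 1.0))
        grad = losses                      # d/dq_i of sum_i q_i * loss_i
        q = project_simplex(q - step * grad)
    return h, q
```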

1.5 A.5 \(\alpha \)-reweighting method

Let \(d = \text {dis}({\mathscr {P}}, {\mathscr {Q}})\) and \(\widehat{d} = \text {dis}(\widehat{\mathscr {Q}}, \widehat{\mathscr {P}})\). Consider the following simple, and in general suboptimal, choice of \(\textsf {q}\) as a distribution defined by:

$$\begin{aligned} \overline{\textsf {q}}&= \frac{\alpha m}{m + n} \qquad \textsf {q}_i = {\left\{ \begin{array}{ll} \frac{\overline{\textsf {q}}}{m} = \frac{\alpha }{m + n} &{} \text {if } i \in [m];\\[.25cm] \frac{1 - \overline{\textsf {q}}}{n} = \frac{m (1 - \alpha ) + n}{(m + n) n} &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$

where \(\alpha = \Psi (1 - d)\) for some non-decreasing function \(\Psi \) with \(\Psi (0) = 0\) and \(\Psi (1) = 1\). We will compare the right-hand side of the bound of Theorem 1, which we denote by B, with the corresponding right-hand side \(B_0\) obtained when \(\textsf {q}\) is chosen to be uniform over \(S'\), which corresponds to supervised learning on \(S'\) alone:

$$\begin{aligned} B_0 = \mathcal {L}(\widehat{\mathscr {P}}, h) + 2 \mathfrak {R}_n(\ell \circ {\mathscr {H}}) + \sqrt{\frac{\log \frac{1}{\delta }}{2n}}. \end{aligned}$$

We now show that under some assumptions, we have \(B - B_0 \le 0\). Thus, even for this sub-optimal choice of \(\overline{\textsf {q}}\), under those assumptions, the guarantee of the theorem is then strictly more favorable than the one for training on \(S'\) only, uniformly over \(h \in {\mathscr {H}}\).

By definition of \(\widehat{d}\), we can write:

$$\begin{aligned} \mathcal {L}(\textsf {q}, h)&= \overline{\textsf {q}}\mathcal {L}(\widehat{\mathscr {Q}}, h) + (1 - \overline{\textsf {q}}) \mathcal {L}(\widehat{\mathscr {P}}, h) \le \overline{\textsf {q}}\widehat{d} + \mathcal {L}(\widehat{\mathscr {P}}, h). \end{aligned}$$

By definition of the \(\textsf {q}\)-Rademacher complexity and the sub-additivity of the supremum, the following inequality holds:

$$\begin{aligned} \mathfrak {R}_{\textsf {q}}(\ell \circ {\mathscr {H}}) \le \overline{\textsf {q}}\mathfrak {R}_{m}(\ell \circ {\mathscr {H}}) + (1 - \overline{\textsf {q}}) \mathfrak {R}_{n}(\ell \circ {\mathscr {H}}). \end{aligned}$$

By definition of \(\textsf {q}\), we can write:

$$\begin{aligned} \big \Vert {\textsf {q}}\big \Vert ^2_2 n = n \left[ {m \big ({\frac{\overline{\textsf {q}}}{m}}\big )^2 + n \big ({\frac{1 - \overline{\textsf {q}}}{n}}\big )^2}\right]&= \frac{n}{m} \overline{\textsf {q}}^2 + (1 - \overline{\textsf {q}})^2 \\&= 1 - 2 \overline{\textsf {q}}+ \frac{m + n}{m} \overline{\textsf {q}}^2\\&= 1 - (2 - \alpha ) \overline{\textsf {q}}\le 1 - \overline{\textsf {q}}. \end{aligned}$$

Thus, using the inequality \(\sqrt{1 - x} \le 1 - \frac{x}{2}\), \(x \le 1\), we have:

$$\begin{aligned} B - B_0&\le 2 \overline{\textsf {q}}\left[ {\mathfrak {R}_{m}(\ell \circ {\mathscr {H}}) - \mathfrak {R}_{n}(\ell \circ {\mathscr {H}}) }\right] + \overline{\textsf {q}}(d + \widehat{d}) + \left[ {\sqrt{1 - \overline{\textsf {q}}} - 1}\right] \left[ {\tfrac{\log \frac{1}{\delta }}{2n}}\right] ^{\frac{1}{2}}\\&\le 2 \overline{\textsf {q}}\left[ {\mathfrak {R}_{m}(\ell \circ {\mathscr {H}}) - \mathfrak {R}_{n}(\ell \circ {\mathscr {H}}) }\right] + \overline{\textsf {q}}(d + \widehat{d}) - \overline{\textsf {q}}\left[ {\tfrac{\log \frac{1}{\delta }}{8n}}\right] ^{\frac{1}{2}}. \end{aligned}$$

Suppose we are in the regime of relatively small discrepancies and that, given n, both the discrepancy and its empirical estimate are upper bounded as follows: \(\max \{d, \widehat{d}\} < \sqrt{\frac{\log 1/{\delta }}{32n}}\). Assume also that for \(m \gg n\) (which is the setting we are interested in), we have \(\mathfrak {R}_{m}(\ell \circ {\mathscr {H}}) - \mathfrak {R}_{n}(\ell \circ {\mathscr {H}}) \le 0\). Then, the first term is non-positive and, regardless of the choice of \(\alpha \le 1\), we have \(B - B_0 \le 0\). Thus, even for this suboptimal choice of \(\overline{\textsf {q}}\), under these assumptions, the guarantee of the theorem is more favorable than the one for training on \(S'\) only, uniformly over \(h \in {\mathscr {H}}\).

Note that the assumption about the difference of Rademacher complexities is natural. For example, for a kernel-based hypothesis set \({\mathscr {H}}\) with a normalized kernel such as the Gaussian kernel and the norm of the weight vectors in the reproducing kernel Hilbert space (RKHS) bounded by \(\Lambda \), it is known that the following inequalities hold: \(\frac{1}{\sqrt{2}} \frac{ \Lambda }{\sqrt{m}} \le \mathfrak {R}_m({\mathscr {H}}) \le \frac{ \Lambda }{\sqrt{m}}\) [90]. Thus, for \(m > 2n\), we have \(\mathfrak {R}_m({\mathscr {H}}) - \mathfrak {R}_n({\mathscr {H}}) \le \frac{ \Lambda }{\sqrt{m}} - \frac{ \Lambda }{\sqrt{2n}} < 0\).
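The construction of this \(\alpha \)-reweighted distribution, and the norm identity \(n \big \Vert {\textsf {q}}\big \Vert _2^2 = 1 - (2 - \alpha ) \overline{\textsf {q}}\) used in the comparison above, can be checked directly. A minimal sketch (function name and sample sizes are illustrative):

```python
import numpy as np

def alpha_reweighting(m, n, alpha):
    # q_i = alpha/(m+n) on the m source points, (1 - q_bar)/n on the n target points
    q_bar = alpha * m / (m + n)
    q = np.concatenate([np.full(m, q_bar / m), np.full(n, (1 - q_bar) / n)])
    return q, q_bar

q, q_bar = alpha_reweighting(m=1000, n=100, alpha=0.8)
assert np.isclose(q.sum(), 1.0)
# identity n ||q||_2^2 = 1 - (2 - alpha) q_bar used in the comparison with B_0
assert np.isclose(100 * np.sum(q ** 2), 1 - (2 - 0.8) * q_bar)
```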

Appendix B: Domain adaptation

1.1 B.1 Proof of Lemma 8

Lemma 8

Let \(\ell \) be the squared loss. Then, for any hypothesis \(h_0\) in \({\mathscr {H}}\), the following upper bound holds for the labeled discrepancy:

$$ \text {dis}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}) \le \overline{\text {dis}}_{{\mathscr {H}}\times \{ h_0 \} }(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}) + 2 \delta _{{\mathscr {H}}, h_0}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}). $$

Proof

For any \(h_0\), using the definition of the squared loss, the following inequalities hold:

$$\begin{aligned} \text {dis}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}})&= \sup _{h \in {\mathscr {H}}} \, \left| { \mathbb E_{(x, y) \sim \widehat{\mathscr {P}}}[\ell (h(x), y)] - \mathbb E_{(x, y) \sim \widehat{\mathscr {Q}}}[\ell (h(x), y)] }\right| \\&\le \sup _{h \in {\mathscr {H}}} \, \left| { \mathbb E_{(x, y) \sim \widehat{\mathscr {P}}}[\ell (h(x), h_0(x))] - \mathbb E_{(x, y) \sim \widehat{\mathscr {Q}}}[\ell (h(x), h_0(x))]}\right| \\&+ \sup _{h \in {\mathscr {H}}} \, \bigg | \mathbb E_{(x, y) \sim \widehat{\mathscr {P}}}[\ell (h(x), y)] - \mathbb E_{(x, y) \sim \widehat{\mathscr {P}}}[\ell (h(x), h_0(x))] \\&+ \mathbb E_{(x, y) \sim \widehat{\mathscr {Q}}}[\ell (h(x), h_0(x))] - \mathbb E_{(x, y) \sim \widehat{\mathscr {Q}}}[\ell (h(x), y)] \bigg |\\&= \overline{\text {dis}}_{{\mathscr {H}}\times \{ h_0 \} }(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}})\\&\quad + 2 \sup _{h \in {\mathscr {H}}} \, \left| { \mathbb E_{(x, y) \sim \widehat{\mathscr {P}}} \left[ {h(x) \big ({y - h_0(x)}\big )}\right] - \mathbb E_{(x, y) \sim \widehat{\mathscr {Q}}} \left[ {h(x) \big ({y - h_0(x)}\big )}\right] }\right| \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad {(\text {def. of squared loss})}\\&= \overline{\text {dis}}_{{\mathscr {H}}\times \{ h_0 \} }(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}) + 2 \delta _{{\mathscr {H}}, h_0}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}).\qquad \qquad \qquad \qquad \ \ {(\text {def. of local discrepancy})} \end{aligned}$$

This completes the proof. \(\square \)

1.2 B.2 Proof of Lemma 9

Lemma 9

Let \(\ell \) be a loss function that is \(\mu \)-Lipschitz with respect to its second argument. Then, for any hypothesis \(h_0\) in \({\mathscr {H}}\), the following upper bound holds for the labeled discrepancy:

$$\begin{aligned} \text {dis}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}) \le \overline{\text {dis}}_{{\mathscr {H}}\times \{ h_0 \} }(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}) + \mu \, \eta _{{\mathscr {H}}, h_0}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}). \end{aligned}$$

Proof

When the loss function \(\ell \) is \(\mu \)-Lipschitz with respect to its second argument, we can use the following upper bound:

$$\begin{aligned} \text {dis}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}})&= \sup _{h \in {\mathscr {H}}} \, \left| { \underset{(x, y) \sim \widehat{\mathscr {P}}}{\mathbb E}[\ell (h(x), y)] - \underset{(x, y) \sim \widehat{\mathscr {Q}}}{\mathbb E}[\ell (h(x), y)] }\right| \\&\le \sup _{h \in {\mathscr {H}}} \, \left| { \underset{(x, y) \sim \widehat{\mathscr {P}}}{\mathbb E}[\ell (h(x), h_0(x))] - \underset{(x, y) \sim \widehat{\mathscr {Q}}}{\mathbb E}[\ell (h(x), h_0(x))]}\right| \\&\quad + \sup _{h \in {\mathscr {H}}} \, \bigg | \underset{(x, y) \sim \widehat{\mathscr {P}}}{\mathbb E}[\ell (h(x), y)] - \underset{(x, y) \sim \widehat{\mathscr {P}}}{\mathbb E}[\ell (h(x), h_0(x))] \\&\quad + \underset{(x, y) \sim \widehat{\mathscr {Q}}}{\mathbb E}[\ell (h(x), h_0(x))] - \underset{(x, y) \sim \widehat{\mathscr {Q}}}{\mathbb E}[\ell (h(x), y)] \bigg |\\&\le \overline{\text {dis}}_{{\mathscr {H}}\times \{ h_0 \} }(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}}) + \mu \underset{(x, y) \sim \widehat{\mathscr {P}}}{\mathbb E}\left[ {\left| {y - h_0(x)}\right| }\right] + \mu \underset{(x, y) \sim \widehat{\mathscr {Q}}}{\mathbb E}\left[ {\left| {y - h_0(x)}\right| }\right] . \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \ \quad {(\ell \text { assumed } \mu -\text {Lipschitz})} \end{aligned}$$

This completes the proof. \(\square \)

1.3 B.3 Sub-Gradients and estimation of unlabeled discrepancy terms

Here, we first describe how to compute the sub-gradients of the unlabeled weighted discrepancy term \(\overline{\text {dis}}(\textsf {q}', \textsf {p})\) that appears in the optimization problem for domain adaptation, and similarly \(\overline{\text {dis}}(\textsf {p}^0, (\textsf {q}, \textsf {q}'))\), in the case of the squared loss with linear functions. Next, we show how the same analysis can be used to compute the empirical discrepancy term \(\overline{\text {dis}}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}})\), which provides an accurate estimate of \(\overline{d} = \overline{\text {dis}}({\mathscr {P}}, {\mathscr {Q}})\).

1.3.1 B.3.1 Sub-Gradients of unlabeled weighted discrepancy terms

Let \(\ell \) be the squared loss and let \({\mathscr {H}}\) be the family of linear functions defined by \({\mathscr {H}}= \{x \mapsto {\textbf{w}}\cdot \varvec{\Phi }(x) :\big \Vert {{\textbf{w}}}\big \Vert _2 \le \Lambda \}\), where \(\varvec{\Phi }\) is a feature mapping from \({\mathscr {X}}\) to \(\mathbb {R}^k\).

We can analyze the unlabeled discrepancy term \(\overline{\text {dis}}(\textsf {q}', \textsf {p})\) using an analysis similar to that of [21]. By definition of the unlabeled discrepancy, we can write:

$$\begin{aligned} \overline{\text {dis}}(\textsf {q}', \textsf {p})&= \sup _{h, h' \in {\mathscr {H}}} \left\{ \sum _{i = 1}^{n} \textsf {q}'_i \ell (h(x_{m + i}), h'(x_{m + i})) - \sum _{i = 1}^{m} \textsf {p}_i \ell (h(x_{i}), h'(x_{i}))\right\} \\&= \sup _{\big \Vert {{\textbf{w}}}\big \Vert _2 , \big \Vert {{\textbf{w}}'}\big \Vert _2 \le \Lambda } \left\{ \sum _{i = 1}^{n} \textsf {q}'_i \left[ {({\textbf{w}}- {\textbf{w}}') \cdot \varvec{\Phi }(x_{m + i})}\right] ^2 - \sum _{i = 1}^{m} \textsf {p}_i \left[ {({\textbf{w}}- {\textbf{w}}') \cdot \varvec{\Phi }(x_i)}\right] ^2\right\} \\&= \sup _{\big \Vert {{\textbf{u}}}\big \Vert _2 \le 2\Lambda } \left\{ \sum _{i = 1}^{n} \textsf {q}'_i \left[ {{\textbf{u}}\cdot \varvec{\Phi }(x_{m + i})}\right] ^2 - \sum _{i = 1}^{m} \textsf {p}_i \left[ {{\textbf{u}}\cdot \varvec{\Phi }(x_i)}\right] ^2\right\} \\&= \sup _{\big \Vert {{\textbf{u}}}\big \Vert _2 \le 2\Lambda } \left\{ \sum _{i = 1}^{n} \textsf {q}'_i {\textbf{u}}^\top \varvec{\Phi }(x_{m + i}) \varvec{\Phi }(x_{m + i})^\top {\textbf{u}}- \sum _{i = 1}^{m} \textsf {p}_i {\textbf{u}}^\top \varvec{\Phi }(x_i) \varvec{\Phi }(x_i)^\top {\textbf{u}}\right\} \\&= \sup _{\big \Vert {{\textbf{u}}}\big \Vert _2 \le 2\Lambda } \left\{ {\textbf{u}}^\top \left[ {\sum _{i = 1}^{n} \textsf {q}'_i \varvec{\Phi }(x_{m + i}) \varvec{\Phi }(x_{m + i})^\top - \sum _{i = 1}^{m} \textsf {p}_i \varvec{\Phi }(x_i) \varvec{\Phi }(x_i)^\top }\right] {\textbf{u}}\right\} \\&= 4\Lambda ^2 \sup _{\big \Vert {{\textbf{u}}}\big \Vert _2 \le 1} {\textbf{u}}^\top {\textbf{M}}(\textsf {q}', \textsf {p}) {\textbf{u}}\\&= 4\Lambda ^2 \max \left\{ 0, \sup _{\big \Vert {{\textbf{u}}}\big \Vert _2 = 1} {\textbf{u}}^\top {\textbf{M}}(\textsf {q}', \textsf {p}) {\textbf{u}}\right\} \\&= 4\Lambda ^2 \max \left\{ 0, \lambda _{\max } \big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big )\right\} , \end{aligned}$$

where \({\textbf{M}}(\textsf {q}', \textsf {p}) = \sum _{i = 1}^{n} \textsf {q}'_i \varvec{\Phi }(x_{m + i}) \varvec{\Phi }(x_{m + i})^\top - \sum _{i = 1}^{m} \textsf {p}_i \varvec{\Phi }(x_i) \varvec{\Phi }(x_i)^\top \) and where \(\lambda _{\max } \big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big )\) denotes the maximum eigenvalue of the symmetric matrix \({\textbf{M}}(\textsf {q}', \textsf {p})\). Thus, the unlabeled discrepancy \(\overline{\text {dis}}(\textsf {q}', \textsf {p})\) can be obtained from the maximum eigenvalue of a symmetric matrix that is an affine function of \(\textsf {q}'\) and \(\textsf {p}\). Since \(\lambda _{\max }\) is a convex function and since composition with an affine function preserves convexity, \(\lambda _{\max } \big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big )\) is a convex function of \(\textsf {q}'\) and \(\textsf {p}\). Since the maximum of two convex functions is convex, \(\max \left\{ 0, \lambda _{\max } \big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big )\right\} \) is also convex.

Rewriting \(\lambda _{\max } \big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big )\) as \(\max _{\big \Vert {{\textbf{u}}}\big \Vert _2 = 1} {\textbf{u}}^\top {\textbf{M}}(\textsf {q}', \textsf {p}) {\textbf{u}}\) helps derive the sub-gradient of \(\lambda _{\max } \big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big )\) using the sub-gradient calculation of the maximum of a set of functions:

$$\begin{aligned} \nabla _{(\textsf {q}', \textsf {p})} \lambda _{\max } \big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big ) = \begin{bmatrix} {\textbf{u}}^\top \varvec{\Phi }(x_{m + 1}) \varvec{\Phi }(x_{m + 1})^\top {\textbf{u}}\\ \vdots \\ {\textbf{u}}^\top \varvec{\Phi }(x_{m + n}) \varvec{\Phi }(x_{m + n})^\top {\textbf{u}}\\ -{\textbf{u}}^\top \varvec{\Phi }(x_1) \varvec{\Phi }(x_1)^\top {\textbf{u}}\\ \vdots \\ -{\textbf{u}}^\top \varvec{\Phi }(x_m) \varvec{\Phi }(x_m)^\top {\textbf{u}}\end{bmatrix} = \begin{bmatrix} \big ({\varvec{\Phi }(x_{m + 1}) \cdot {\textbf{u}}}\big )^2\\ \vdots \\ \big ({\varvec{\Phi }(x_{m + n}) \cdot {\textbf{u}}}\big )^2\\ -\big ({\varvec{\Phi }(x_1) \cdot {\textbf{u}}}\big )^2\\ \vdots \\ - \big ({\varvec{\Phi }(x_m) \cdot {\textbf{u}}}\big )^2 \end{bmatrix}, \end{aligned}$$

where \({\textbf{u}}\) is an eigenvector corresponding to the maximum eigenvalue of \({\textbf{M}}(\textsf {q}', \textsf {p})\). Alternatively, we can approximate the maximum eigenvalue via the softmax expression

$$\begin{aligned} f(\textsf {q}', \textsf {p}) = \frac{1}{\mu } \log \left[ {\sum _{j = 1}^k e^{\mu \lambda _j\big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big )}}\right] = \frac{1}{\mu } \log \left[ {{{\,\textrm{Tr}\,}}\big ({ e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})}}\big ) }\right] , \end{aligned}$$

where \(e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})}\) denotes the matrix exponential of \(\mu {\textbf{M}}(\textsf {q}', \textsf {p})\) and \(\lambda _j\big ({{\textbf{M}}(\textsf {q}', \textsf {p})}\big )\) the jth eigenvalue of \({\textbf{M}}(\textsf {q}', \textsf {p})\). The matrix exponential can be computed in \(O(k^3)\) time via an eigendecomposition of the symmetric matrix. We have:

$$\begin{aligned} \lambda _{\max } ({\textbf{M}}(\textsf {q}', \textsf {p})) \le f(\textsf {q}', \textsf {p}) \le \lambda _{\max } ({\textbf{M}}(\textsf {q}', \textsf {p})) + \frac{\log k}{\mu }. \end{aligned}$$

Thus, for \(\mu = \frac{\log k}{\epsilon }\), \(f(\textsf {q}', \textsf {p})\) provides a uniform \(\epsilon \)-approximation of \(\lambda _{\max } ({\textbf{M}}(\textsf {q}', \textsf {p}))\). The gradient of \(f(\textsf {q}', \textsf {p})\) is given for all \(j \in [n]\) and \(i \in [m]\) by

$$\begin{aligned}&\nabla _{\textsf {q}'_j} f(\textsf {q}', \textsf {p}) = \frac{\left\langle {e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})}, \varvec{\Phi }(x_{m + j})\varvec{\Phi }(x_{m + j})^\top }\right\rangle }{{{\,\textrm{Tr}\,}}\big ({ e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})}}\big ) } = \frac{\varvec{\Phi }(x_{m + j})^\top e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})} \varvec{\Phi }(x_{m + j})}{{{\,\textrm{Tr}\,}}\big ({ e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})}}\big ) }\\&\nabla _{\textsf {p}_i} f(\textsf {q}', \textsf {p}) = - \frac{\left\langle {e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})}, \varvec{\Phi }(x_{i})\varvec{\Phi }(x_{i})^\top }\right\rangle }{{{\,\textrm{Tr}\,}}\big ({ e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})}}\big ) } = - \frac{\varvec{\Phi }(x_{i})^\top e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})} \varvec{\Phi }(x_{i})}{{{\,\textrm{Tr}\,}}\big ({ e^{\mu {\textbf{M}}(\textsf {q}', \textsf {p})}}\big ) }. \end{aligned}$$

The sub-gradient of the unlabeled discrepancy term \(\overline{\text {dis}}(\textsf {p}^0, (\textsf {q}, \textsf {q}'))\), or a smooth approximation of it, can be derived in a similar fashion, using the same analysis as above.
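The computations above translate directly into a short numerical sketch. The helper names below are illustrative; `Phi_src` and `Phi_tgt` denote feature matrices whose rows are the vectors \(\varvec{\Phi }(x_i)\) for the source and target points, respectively.

```python
import numpy as np
from scipy.linalg import expm

def M_matrix(Phi_src, Phi_tgt, p, q_prime):
    # M(q', p) = sum_i q'_i Phi(x_{m+i}) Phi(x_{m+i})^T - sum_i p_i Phi(x_i) Phi(x_i)^T
    return (Phi_tgt.T * q_prime) @ Phi_tgt - (Phi_src.T * p) @ Phi_src

def disc_and_subgradient(Phi_src, Phi_tgt, p, q_prime, Lam=1.0):
    # exact value 4 Lam^2 max{0, lambda_max(M)} and a sub-gradient w.r.t. (q', p)
    M = M_matrix(Phi_src, Phi_tgt, p, q_prime)
    vals, vecs = np.linalg.eigh(M)
    lam_max, u = vals[-1], vecs[:, -1]           # top eigenpair of the symmetric matrix
    g_q = (Phi_tgt @ u) ** 2                     # +(Phi(x_{m+j}) . u)^2
    g_p = -(Phi_src @ u) ** 2                    # -(Phi(x_i) . u)^2
    return 4 * Lam ** 2 * max(0.0, lam_max), g_q, g_p

def softmax_approx(Phi_src, Phi_tgt, p, q_prime, mu=100.0):
    # smooth surrogate f(q', p) = (1/mu) log Tr exp(mu M) and its gradient
    M = M_matrix(Phi_src, Phi_tgt, p, q_prime)
    E = expm(mu * M)
    Z = np.trace(E)
    grad_q = np.einsum('ij,jk,ik->i', Phi_tgt, E, Phi_tgt) / Z
    grad_p = -np.einsum('ij,jk,ik->i', Phi_src, E, Phi_src) / Z
    return np.log(Z) / mu, grad_q, grad_p
```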

1.3.2 B.3.2 Estimation of unlabeled discrepancy terms

The unlabeled discrepancy \(\overline{d} = \overline{\text {dis}}({\mathscr {P}}, {\mathscr {Q}})\) can be accurately estimated from its empirical version \(\overline{\text {dis}}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}})\) [84]. In view of the analysis of the previous section, we have

$$\begin{aligned} \overline{\text {dis}}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}})&= 4 \Lambda ^2 \lambda _{\max } \big ({{\textbf{M}}(\widehat{\mathscr {P}}, \widehat{\mathscr {Q}})}\big )\\&= 4 \Lambda ^2 \lambda _{\max } \left( { \frac{1}{n} \sum _{i = 1}^{n} \varvec{\Phi }(x_{m + i}) \varvec{\Phi }(x_{m + i})^\top - \frac{1}{m} \sum _{i = 1}^{m} \varvec{\Phi }(x_i) \varvec{\Phi }(x_i)^\top }\right) . \end{aligned}$$

Thus, this last expression can be used in place of \(\overline{d}\) in the optimization problem for domain adaptation.
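A minimal sketch of this estimate, under the same conventions as the previous snippet (rows of `Phi_src` and `Phi_tgt` are the feature vectors):

```python
import numpy as np

def empirical_unlabeled_discrepancy(Phi_src, Phi_tgt, Lam=1.0):
    # 4 Lam^2 lambda_max( (1/n) sum_target Phi Phi^T - (1/m) sum_source Phi Phi^T )
    m, n = Phi_src.shape[0], Phi_tgt.shape[0]
    M = Phi_tgt.T @ Phi_tgt / n - Phi_src.T @ Phi_src / m
    return 4 * Lam ** 2 * np.linalg.eigvalsh(M)[-1]
```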

Appendix C: Further details about experimental settings

In this section we provide further details on our experimental setup starting with best effort adaptation.

1.1 C.1 Best-effort adaptation

Recall that in this setting we have labeled data from both the source and the target; however, the amount of labeled data from the source is much larger. We start by describing the baselines that we compare our algorithms with. For the best-effort adaptation problem, two natural baselines are to learn a hypothesis solely on the target \({\mathscr {P}}\), or to train solely on the source \({\mathscr {Q}}\). A third baseline that we consider is the \(\alpha \)-reweighted \(\textsf {q}\) discussed in Section 3.2. Note that \(\alpha =1\) corresponds to training on all the available data with a uniform weighting.

1.1.1 C.1.1 Simulated data

We first consider a simulated scenario where n samples from the target distribution \({\mathscr {P}}\) are generated by first drawing the feature vector x i.i.d. from a normal distribution with zero mean and spherical covariance matrix, i.e., \(N(0, I_{d \times d})\). Given x, a binary label \(y \in \{ -1, +1 \} \) is generated as \(\text {sgn}(w_p \cdot x)\) for a randomly chosen unit vector \(w_p \in \mathbb {R}^{d}\). For a fixed \(\eta \in (0, 1)\), \(m=1\mathord {,}000\) i.i.d. samples from the source distribution \({\mathscr {Q}}\) are generated by first drawing \((1-\eta )m\) examples from \(N(0, I_{d \times d})\) and labeling them according to \(\text {sgn}(w_q \cdot x)\), where \(\Vert w_p - w_q\Vert \le \epsilon \) for a small value of \(\epsilon \). Notice that when \(\epsilon \) is small, these \((1-\eta )m\) samples are highly relevant for learning the target \({\mathscr {P}}\). The remaining \(\eta m\) examples from \({\mathscr {Q}}\) are all set to a fixed vector u and are labeled as \(+1\). These examples represent the noise in \({\mathscr {Q}}\), and as \(\eta \) increases, the presence of such examples makes \(\text {dis}({\mathscr {P}}, {\mathscr {Q}})\) larger. In our experiments we set \(d = 20\), \(\epsilon =0.01\), and vary \(\eta \in \{0.05, 0.1, 0.15, 0.2\}\).
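A minimal sketch of this data-generation process (the function name and the random-number handling are illustrative):

```python
import numpy as np

def generate_simulation(n, m=1000, d=20, eps=0.01, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # target: x ~ N(0, I), y = sgn(w_p . x) for a random unit vector w_p
    w_p = rng.normal(size=d); w_p /= np.linalg.norm(w_p)
    X_p = rng.normal(size=(n, d))
    y_p = np.sign(X_p @ w_p)
    # source: (1 - eta) m clean points labeled by a nearby w_q with ||w_p - w_q|| <= eps
    delta = rng.normal(size=d); delta *= eps / np.linalg.norm(delta)
    w_q = w_p + delta
    n_clean = int((1 - eta) * m)
    X_clean = rng.normal(size=(n_clean, d))
    y_clean = np.sign(X_clean @ w_q)
    # remaining eta m noisy points: copies of a fixed vector u, labeled +1
    u = rng.normal(size=d)
    X_noise = np.tile(u, (m - n_clean, 1))
    y_noise = np.ones(m - n_clean)
    X_q = np.vstack([X_clean, X_noise])
    y_q = np.concatenate([y_clean, y_noise])
    return (X_p, y_p), (X_q, y_q)
```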

On the above adaptation problem we evaluate the performance of the previously discussed baselines against our proposed \(\text {sbest}\) algorithm, implemented via alternate minimization (sbest-AM) and via DC-programming (sbest-DC), where the loss function is the logistic loss and the hypothesis set is the set of linear models with zero bias. For each value of \(\eta \), the results are averaged over 50 independent runs using the data generation process described above.

Figure 3 shows the performance of the different algorithms for various values of the noise level \(\eta \) and as the number of examples n from the target increases. As can be seen from the figure, both \(\alpha \)-reweighting and the baseline that trains solely on \({\mathscr {Q}}\) degrade significantly in performance as \(\eta \) increases. This is due to the fact that the \(\alpha \)-reweighting procedure cannot distinguish between non-noisy and noisy data points within the m samples generated from \({\mathscr {Q}}\).

In Fig. 4(Left) we plot the best \(\alpha \) chosen by the \(\alpha \)-reweighting procedure as a function of n. For reference we also plot the amount of mass on the non-noisy points from \({\mathscr {Q}}\), i.e., \((1-\eta ) \cdot m/(m+n)\). As can be seen from the figure, as n increases, the amount of mass placed on the source \({\mathscr {Q}}\) decreases. Furthermore, as expected, this decrease is sharper as the noise level increases. In particular, \(\alpha \)-reweighting is not able to effectively use the non-noisy samples from \({\mathscr {Q}}\).

On the other hand, both sbest-AM and sbest-DC are able to counter the effect of the noise by generating \(\textsf {q}\)-weightings that are predominantly supported on the non-noisy samples. In Fig. 4(Right) we plot the amount of probability mass that the alternate minimization and the DC-programming implementations of \(\text {sbest}\) assign to the noisy data points.

As can be seen from the figure, the total probability mass assigned to the noisy points decreases with n and also decreases with the noise level. These results also demonstrate that our algorithms, by computing a good \(\textsf {q}\)-weighting, perform effective outlier detection, since they lead to solutions that assign much smaller mass to the noisy points.

Fig. 3: Comparison of \(\text {sbest}\) against the baselines on simulated data in the classification setting. As the noise rate, and therefore the discrepancy between \({\mathscr {P}}\) and \({\mathscr {Q}}\), increases, the performance of the baselines degrades. In contrast, both the alternate minimization and the DC-programming algorithms effectively find a good \(\textsf {q}\)-weighting and can adapt to the target

Fig. 4: (Left) Best \(\alpha \) chosen by \(\alpha \)-reweighting as a function of n. (Right) Total probability mass assigned by \(\text {sbest}\) to the noisy points

1.1.2 C.1.2 Real-world data: classification and regression

Classification Next we evaluate our proposed algorithms and baselines on three real-world datasets obtained from the UCI machine learning repository [31]. We first describe the datasets and our choices of the source and target domains in each case. The first dataset we consider is the Adult-Income dataset. This is a classification task where the goal is to predict whether the income of a given individual is greater than or equal to \(\$50\)K. The dataset has 32,561 examples. We form the source domain \({\mathscr {Q}}\) by taking examples where the attribute gender equals ‘Male’ and the target domain \({\mathscr {P}}\) corresponds to examples where the gender is ‘Female’. This leads to 21,790 examples from \({\mathscr {Q}}\) and 10,771 examples from \({\mathscr {P}}\).

The second dataset we consider is the South-German-Credit dataset. This dataset consists of \(1\mathord {,}000\) examples and the goal is to predict whether a given individual has good credit or bad credit. We form the source domain \({\mathscr {Q}}\) by conditioning on the residence attribute and taking all examples where the attribute value is in \(\{3, 4\}\) (indicating that the individual has lived at the current residence for 4 or more years). The target domain is formed by taking examples where the residence attribute value is in \(\{1,2\}\). This split leads to 620 examples from \({\mathscr {Q}}\) and 380 training examples from \({\mathscr {P}}\).

The third dataset we consider is the Speaker-Accent-Recognition dataset. In this dataset the goal is to predict the accent of a speaker given the speech signal. We consider the source \({\mathscr {Q}}\) to be examples where the accent is ’US’ or ’UK’ and the target to be examples where the accent is in {’ES’, ’FR’, ’GE’, ’IT’}. This split leads to 150 training examples from \({\mathscr {Q}}\) and 120 training examples from \({\mathscr {P}}\).

In each case we randomly split the examples from \({\mathscr {P}}\) into a training set of \(70\%\) examples and a test set of \(20\%\) examples. The remaining \(10\%\) of the data is used for cross validation. We provide results averaged over 10 such random splits. For the six tasks from the Newsgroups dataset we follow the same methodology as in [127] to create the tasks.

In each of the above three cases we consider training a logistic regression classifier and compare the performance of \(\text {sbest}\) with the baselines that we previously discussed. The results are shown in Table 1 in the main paper.

Regression Next we consider the following regression datasets from the UCI repository.

The wind dataset [51], where the task is to predict wind speed from the given features. The source consists of data from the months January to November and the target is the data from December. This leads to a total of 5,500 examples from \({\mathscr {Q}}\), 350 examples from \({\mathscr {P}}\) used for training and validation, and 200 examples from \({\mathscr {P}}\) for testing. We create 10 random splits by dividing the 350 examples from \({\mathscr {P}}\) into a train set of size 150 and a validation set of size 200.

The airline dataset is derived from [63]. We create the task of predicting the amount of time a flight is delayed from various features such as the arrival time, distance, whether or not the flight was diverted, and the day of the week. We take a subset of the data for the Chicago O'Hare International Airport (ORD) in 2008. The source and target consist of data from different hours of the day. This leads to 16,000 examples from \({\mathscr {Q}}\), 500 examples from \({\mathscr {P}}\) (200 used for training and 300 for validation), and 300 examples for testing.

The gas dataset [31, 111, 126], where the task is to predict the concentration level from various sensor measurements. The dataset consists of pre-determined batches; we take the first six to be the source and the last batch, of size 360,000, as the target (600 for training, 1000 for validation and 1000 for testing).

The news dataset [31, 34] where the goal is to predict the popularity of an article. Our source data consists of articles from Monday to Saturday and the target consists of articles from Sunday. This leads to 32500 examples from the source and 2737 examples for the target (737 for training, 1000 for validation and 1000 for testing).

The traffic dataset from the Minnesota Department of Transportation [31, 75], where the goal is to predict the traffic volume. We create the source and target by splitting based on the time of the day. This leads to 2200 examples from the source and 1000 examples from the target (200 for training, 400 for validation and 400 for testing).

For each of the datasets above we create 10 random splits by shuffling the training and validation sets and report values averaged over the splits. As baselines we compare with the KMM algorithm [61] and the DM algorithm [21]. Since both algorithms were originally designed for the setting where the target has no labels, we modify them in the following way. We run KMM (DM) on the source vs. target data to get a weight distribution \(\textsf {q}\) over the source data. Finally, we perform weighted loss minimization using the weights in \(\textsf {q}\) for the source and uniform 1/n weights on the target of size n. The results are shown in Table 4. As can be seen, sbest consistently outperforms the baselines.

Table 4 MSE of the sbest algorithm against baselines
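For concreteness, the weighted loss minimization used for these modified baselines can be sketched as follows (the function name, the choice of ridge regression, and the regularization value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

def weighted_baseline(X_src, y_src, X_tgt, y_tgt, q_src, ridge_alpha=1.0):
    # KMM (or DM) weights q_src on the source points, uniform 1/n weights on the n target points
    n = len(y_tgt)
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    w = np.concatenate([q_src, np.full(n, 1.0 / n)])
    return Ridge(alpha=ridge_alpha).fit(X, y, sample_weight=w)
```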

1.2 C.2 Fine-tuning tasks

In this section we demonstrate the effectiveness of our proposed algorithms for the purpose of fine-tuning pre-trained representations. In the standard pre-training/fine-tuning paradigm [108], a model is first pre-trained on a generalist dataset (identified as coming from distribution \({\mathscr {Q}}\)). Once a good representation is learned, the model is then fine-tuned on a task-specific dataset (generated from the target \({\mathscr {P}}\)). Two of the predominantly used fine-tuning approaches in the literature are last-layer fine-tuning [67, 116] and full-model fine-tuning [60]. In the former approach, the representations obtained from the last layer of the pre-trained model are used to train a simple model (often a linear hypothesis) on the data coming from \({\mathscr {P}}\). In our experiments we fix the choice of the simple model to be a multi-class logistic regression model. In the latter approach, the model trained on \({\mathscr {P}}\) is initialized from the pre-trained model and all of its parameters are fine-tuned (via gradient descent) on data from the target distribution \({\mathscr {P}}\). In this section we explore the additional advantages of combining data from both \({\mathscr {P}}\) and \({\mathscr {Q}}\) during the fine-tuning stage via our proposed algorithms. There has been recent interest in carefully combining various tasks/data for the purpose of fine-tuning while avoiding the phenomenon of “negative transfer” [2]. Our proposed theoretical results present a principled approach towards this.

To evaluate the effectiveness of our theory for this purpose, we consider the CIFAR-10 vision dataset [72]. The dataset consists of 50,000 training and 10,000 testing examples belonging to 10 classes. We form a pre-training task on data from \({\mathscr {Q}}\) by combining all the data belonging to the classes {’airplane’, ’automobile’, ’bird’, ’cat’, ’deer’, ’dog’}. The fine-tuning task consists of data belonging to the classes {’frog’, ’horse’, ’ship’, ’truck’}. We consider both last-layer fine-tuning and full-model fine-tuning and compare the standard approach of fine-tuning only using data from \({\mathscr {P}}\) with our proposed algorithms. We use \(60\%\) of the data from the source for pre-training, and the remaining \(40\%\) is used in fine-tuning.

We split the fine-tuning data from \({\mathscr {P}}\) randomly into a \(70\%\) training set to be used in fine-tuning, \(10\%\) for cross validation and the remaining \(20\%\) to be used as a test set. The results are reported over 5 such random splits. We perform pre-training on a standard ResNet-18 architecture [52] by optimizing the cross-entropy loss via the Adam optimizer. As can be seen in Table 2, both gapBoost and \(\text {sbest}\), which combine data from \({\mathscr {P}}\) and \({\mathscr {Q}}\), lead to a classifier with better performance for the downstream task; however, sbest clearly outperforms gapBoost.

The second dataset we consider is the Civil Comments dataset [97]. This dataset consists of text comments in online forums and the goal is to predict whether a given comment is toxic or not. Each data point is also labeled with identity terms that describe which subgroup the text in the comment is related to. We create a subsample of the dataset where the target consists of the data points whose identity term is “asian” and the source is the remaining set of points. This leads to 394,000 points from the source and 20,000 points from the target. We create 5 random splits of the data by randomly partitioning the target data into 10,000 examples for fine-tuning, 2000 for validation and 8000 for testing. We perform pre-training on a BERT-small model [29], starting from the default checkpoint as obtained from the standard TensorFlow implementation of the model.

1.3 C.3 Domain adaptation

In this section we evaluate the effectiveness of our proposed \(\text {best-da}\) objective for adaptation in settings where the target has very little or no labeled data. To do so, we consider the multi-domain sentiment analysis dataset of [12], which has been used in prior work on domain adaptation. The dataset consists of text reviews associated with a star rating from 1 to 5 for various categories such as books, dvd, etc. We specifically consider four categories, namely books, dvd, electronics, and kitchen. Inspired by the methodology adopted in prior work [21, 89], for each category we form a regression task by converting the review text to a 128-dimensional vector and fitting a linear regression model to predict the rating. To obtain the features, we first combine all the data from the four tasks and convert the raw text to a TF-IDF representation using scikit-learn's feature extraction library [98]. Following this, we compute the top 5000 most important features using scikit-learn's feature selection library, which in turn uses a chi-squared test to perform feature selection. Finally, we project the obtained features onto a 128-dimensional space via principal component analysis.
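A minimal sketch of this feature-extraction pipeline (the function name, the default vectorizer settings, and the dense conversion before PCA are illustrative choices):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

def extract_features(texts, ratings, k_chi2=5000, n_components=128):
    # TF-IDF -> chi-squared feature selection -> PCA, as described above
    X = TfidfVectorizer().fit_transform(texts)                 # sparse TF-IDF matrix
    X = SelectKBest(chi2, k=k_chi2).fit_transform(X, ratings)  # top features by chi-squared test
    return PCA(n_components=n_components).fit_transform(X.toarray())
```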

After feature extraction, for each task we fit a ridge regression model in the 128-dimensional space to predict the ratings. The predictions of this model are then defined as the ground-truth regression labels. Following the above pre-processing, we form 12 adaptation problems, one for each ordered pair of distinct tasks (TaskA, TaskB), where TaskA and TaskB are in {books, dvd, electronics, kitchen}. In each case we form the source domain (\({\mathscr {Q}}\)) by taking 500 labeled samples from TaskA and 200 labeled examples from TaskB. The target (\({\mathscr {P}}\)) is formed by taking 300 unlabeled examples from TaskB. To our knowledge, there exists no principled method for cross-validation in fully unsupervised domain adaptation. Thus, in our adaptation experiments, we used a small labeled validation set of size 50 to determine the parameters for all the algorithms. This is consistent with experimental results reported in prior work (e.g., [21]).

We compare our \(\text {best-da}\) algorithm with the discrepancy minimization (DM) algorithm of [21] and the GDM algorithm [22], a state-of-the-art adaptation algorithm for regression problems. We also compare with the popular Kernel Mean Matching (KMM) algorithm [61] for domain adaptation. We report results averaged over 10 independent source and target splits, where we normalize the mean squared error (MSE) of \(\text {best-da}\) to be 1.0 and present the relative MSE achieved by the other methods. The results show that in most adaptation problems, \(\text {best-da}\) outperforms (boldface) or ties with (italics) existing methods.

1.3.1 C.3.1 Domain adaptation – covariate-shift

Here we perform experiments for domain adaptation under covariate shift only and compare the performance of our proposed best-da objective with previous state-of-the-art algorithms. We again consider the multi-domain sentiment analysis dataset [12] from the previous section and, in particular, focus on the books category. We use the same feature representation as before and define the ground truth as \(y = w^* \cdot x + \sigma ^2\), where \(w^*\) is obtained by fitting a ridge regression model. We let the target be the uniform distribution over the entire dataset. We define the source as follows: for a fixed value of \(\epsilon \), we pick a random hyperplane w and consider a mixture distribution with mixture weight 0.99 on the set \(w \cdot x \ge \epsilon \) and mixture weight 0.01 on the set \(w \cdot x < \epsilon \). The performance of best-da as compared to DM and KMM is shown in Table 5. As can be seen, our proposed algorithm either matches or outperforms current algorithms.

Table 5 MSE achieved by \(\text {best-da}\) as compared to DM and KMM on the covariate shift task for various values of \(\epsilon \)
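A minimal sketch of this covariate-shifted source construction (the function name and sampling details are illustrative, and both regions are assumed to be non-empty):

```python
import numpy as np

def sample_covariate_shift_source(X, eps, n_samples, seed=0):
    # mixture with weight 0.99 on {w . x >= eps} and 0.01 on {w . x < eps}
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1]); w /= np.linalg.norm(w)
    scores = X @ w
    above, below = np.where(scores >= eps)[0], np.where(scores < eps)[0]
    n_above = rng.binomial(n_samples, 0.99)
    idx = np.concatenate([rng.choice(above, n_above, replace=True),
                          rng.choice(below, n_samples - n_above, replace=True)])
    return X[idx]
```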

Hyperparameters for the algorithms

For our proposed sbest and sbest-da algorithms, the hyperparameters \(\lambda _\infty , \lambda _1, \lambda _2\) were chosen via cross validation in the range \(\{1e-3, 1e-2, 1e-1\} \cup \{0,1,2,\dots , 10\} \cup \{0, 1000, 2000, 10000, 50000, 100000\}\). The h optimization step of alternate minimization was performed using sklearn's linear regression/logistic regression methods [98]. During full-model fine-tuning on the ResNet/BERT models, we used the Adam optimizer for the h optimization step, with a learning rate of \(1e-3\) for the CIFAR-10 dataset and a learning rate of \(1e-5\) for the BERT-small models.

For the q optimization we used projected gradient descent and the step size was chosen via cross validation in the range \(\{1e-3, 1e-2, 1e-1\}\).

We re-implemented the gapBoost algorithm [127] in Python. Following the prescription of the authors of gapBoost, we set the parameter \(\gamma = 1/n\), where n is the size of the target. We tune the parameters \(\rho _S, \rho _T\) in the range \(\{0.1, 0.2, \ldots ,1\}\) and the number of rounds of boosting in the range \(\{5,10,15,20\}\). We also re-implemented the baselines DM [21] and GDM [22]. The DM algorithm was implemented via gradient descent and the second stage of the GDM algorithm was implemented via alternate minimization. The learning rates in each case were searched in the range \(\{1e-3, 1e-2, 1e-1\}\) and the regularization parameters were searched in the range \(\{1e-3, 1e-2, 1e-1, 0, 10, 100\}\). The radius parameter for GDM was searched in the range [0.01, 1] in steps of 0.01. In line with our proposed algorithms, all baselines were implemented without incorporating a bias term.

To our knowledge, there exists no principled method for cross-validation in fully unsupervised domain adaptation. Thus, in our unsupervised adaptation experiments, we used a small labeled validation set of size 50 to determine the parameters for all the algorithms. This is consistent with experimental results reported in prior work [21, 22].
