On the Effectiveness of Sampled Softmax Loss for Item Recommendation

Published: 22 March 2024

Abstract

The learning objective plays a fundamental role in building a recommender system. Most methods routinely adopt either a pointwise loss (e.g., binary cross-entropy) or a pairwise loss (e.g., BPR) to train the model parameters, while rarely paying attention to softmax loss, which assumes the probabilities of all classes sum up to 1. The reasons are its computational complexity when scaling up to large datasets and its intractability for streaming data, where the complete item space is not always available. The sampled softmax (SSM) loss emerges as an efficient substitute for softmax loss. Its special case, the InfoNCE loss, has been widely used in self-supervised learning and has exhibited remarkable performance in contrastive learning. Nonetheless, limited recommendation work uses the SSM loss as the learning objective, and, to the best of our knowledge, none of it explores its properties thoroughly or answers the questions "Is the SSM loss suitable for item recommendation?" and "What are the conceptual advantages of the SSM loss compared with the prevalent losses?"

In this work, we aim to offer a better understanding of SSM for item recommendation. Specifically, we first theoretically reveal three model-agnostic advantages: (1) mitigating popularity bias, which benefits long-tail recommendation; (2) mining hard negative samples, which offers informative gradients to optimize model parameters; and (3) maximizing the ranking metric, which facilitates top-K performance. However, based on our empirical studies, we recognize that the default choice of the cosine similarity function in SSM limits its ability to learn the magnitudes of representation vectors. As such, combining SSM with models that also fall short in adjusting magnitudes (e.g., matrix factorization) may result in poor representations. Going one step further, we provide a mathematical proof that the message passing scheme in graph convolution networks can adjust representation magnitudes according to node degree, which naturally compensates for this shortcoming of SSM. Extensive experiments on four benchmark datasets justify our analyses, demonstrating the superiority of SSM for item recommendation. Our implementations are available in both TensorFlow and PyTorch.
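To make the loss under discussion concrete, the following is a minimal sketch (not the paper's implementation) of the sampled softmax loss for a single user: temperature-scaled cosine similarity between the user and the positive item, contrasted against a set of sampled negatives, which is the InfoNCE form the abstract refers to. The temperature value 0.2 and embedding dimension are illustrative assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # cosine similarity: magnitudes are normalized away, which is
    # exactly the property the paper identifies as a limitation
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def sampled_softmax_loss(user, pos, negs, tau=0.2):
    """SSM / InfoNCE loss for one (user, positive item) pair against
    a list of sampled negative item embeddings."""
    p = math.exp(cosine(user, pos) / tau)
    n = sum(math.exp(cosine(user, v) / tau) for v in negs)
    return -math.log(p / (p + n))

# Illustrative usage with random embeddings.
random.seed(0)
dim = 8
user = [random.gauss(0, 1) for _ in range(dim)]
pos = [random.gauss(0, 1) for _ in range(dim)]
negs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
loss = sampled_softmax_loss(user, pos, negs)
```

Note that the loss depends only on similarity angles, not vector lengths: a positive item perfectly aligned with the user yields a low loss regardless of either vector's magnitude, illustrating why a model must adjust magnitudes through some other mechanism (e.g., graph convolution, as the paper argues).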


Published in

ACM Transactions on Information Systems, Volume 42, Issue 4 (July 2024), 751 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3613639

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 22 March 2024
• Online AM: 13 December 2023
• Accepted: 27 November 2023
• Revised: 26 September 2023
• Received: 1 October 2022
