Cover-based multiple book genre recognition using an improved multimodal network

Rasheed, Assad; Umar, Arif Iqbal; Shirazi, Syed Hamad; Khan, Zakir; Shahzad, Muhammad

doi:10.1007/s10032-022-00413-8

Original Paper
Published: 20 September 2022

Volume 26, pages 65–88 (2023)

International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Despite the idiom that one should not judge something by its outward appearance, we use deep learning to test whether a book can be judged by its cover or, more precisely, by its text and design. Classification was performed with three strategies: text only, image only, and text and image combined. State-of-the-art convolutional neural network (CNN) models were used to classify books from their cover images. Gram and squeeze-and-excitation (SE) layers were added to these models as attention units to learn the optimal features and identify characteristics of the cover image. The Gram layer enabled more accurate multi-genre classification than the SE layer. Text-based classification was performed using word-based, character-based, and feature-engineering-based models. We designed the EXplicit interActive Network (EXAN), composed of context-relevant layers and multi-level attention layers, to learn features from book titles. For multimodal classification, we designed an improved multimodal fusion architecture that uses an attention mechanism between modalities. The disparity in the convergence speed of the modalities is addressed by pre-training each sub-network independently before end-to-end training of the model. Two book cover datasets were used in this study. The results demonstrate that text-based classifiers are superior to image-based classifiers. The proposed multimodal network outperformed all other models on this task, with highest accuracies of 69.09% and 38.12% on the Latin and Arabic book cover datasets, respectively. Similarly, the proposed EXAN surpassed existing text classification models, achieving the highest prediction rates of 65.20% and 33.8% on the Latin and Arabic book cover datasets, respectively.
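
The abstract names Gram and SE layers as attention units but does not define them. As a rough illustration only, the sketch below computes a channel-wise Gram matrix as it is commonly defined in the style-transfer literature, and a standard squeeze-and-excitation gate, on a toy feature map. The function names, the normalisation by the number of spatial positions, and the weight matrices w1 and w2 are assumptions made for this example; this is not the authors' implementation.

```python
import math


def gram_matrix(feature_map):
    """Channel-by-channel Gram matrix of a CNN feature map.

    feature_map: list of C channels, each a flat list of N spatial activations.
    Returns a C x C nested list of inner products divided by N (assumed scaling).
    """
    n = len(feature_map[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n for ch_j in feature_map]
        for ch_i in feature_map
    ]


def squeeze_excite(feature_map, w1, w2):
    """Rescale each channel by a gate in (0, 1).

    w1: hidden x C weights, w2: C x hidden weights (hypothetical, untrained here).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_map]
    # Excite: dense layer with ReLU, then dense layer with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every activation in a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_map)]


if __name__ == "__main__":
    fmap = [[1.0, 2.0], [3.0, 4.0]]  # toy map: 2 channels, 2 spatial positions
    print(gram_matrix(fmap))
    print(squeeze_excite(fmap, w1=[[0.5, 0.5]], w2=[[0.0], [0.0]]))
```

The Gram matrix records how strongly pairs of channels co-activate across the image, while the SE gate reweights individual channels from their global averages; the two are therefore different kinds of attention signal, which is one plausible reading of why the paper compares them.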

Data availability

The data and code will be made publicly available.

Funding

The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Information Technology, Hazara University, Mansehra, Pakistan
Assad Rasheed, Arif Iqbal Umar, Syed Hamad Shirazi, Zakir Khan & Muhammad Shahzad

Contributions

Conceptualization, Assad Rasheed and Arif Iqbal Umar; methodology, Assad Rasheed and Syed Hamad Shirazi; software, Assad Rasheed and Zakir Khan; validation, Zakir Khan and Shahzad Ahmad; formal analysis, Assad Rasheed; investigation, Shahzad Ahmad; data curation, Assad Rasheed; draft preparation, Arif Iqbal Umar; review and editing, Syed Hamad Shirazi and Zakir Khan; supervision, Arif Iqbal Umar and Syed Hamad Shirazi.

Corresponding author

Correspondence to Assad Rasheed.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rasheed, A., Umar, A.I., Shirazi, S.H. et al. Cover-based multiple book genre recognition using an improved multimodal network. IJDAR 26, 65–88 (2023). https://doi.org/10.1007/s10032-022-00413-8

Received: 29 April 2022
Revised: 11 July 2022
Accepted: 28 August 2022
Published: 20 September 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s10032-022-00413-8
