Comparative Analysis of Machine Learning Algorithms for Author Age and Gender Identification
Abstract
Author profiling is an area of information retrieval that infers characteristics of an author, such as native language, gender, and age, from the text they write. Text analysis techniques are used to extract this information, for example to identify authors on social media and in short text messages (SMS). Author profiling supports identification in security applications and on blogs by capturing authors' writing behavior in messages, posts, comments, blogs, and chat logs. Most work in this area has been done on English and other native languages. Roman Urdu is also receiving attention for the author profiling task, but existing approaches first convert Roman Urdu to English in order to extract important features such as named entities (via Named Entity Recognition, NER) and other linguistic features. This conversion can lose important information, because converting one language to another has inherent limitations. This research explores machine learning techniques that can be applied to any language and so overcome the conversion limitation. The Vector Space Model (VSM) and Query Likelihood (Q.L.) are used to identify the author's age and gender. Experimental results show that Q.L. achieves better accuracy.
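To make the two approaches named in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not give their features, term weighting, or smoothing, so the TF-IDF weighting, the Jelinek-Mercer smoothing with `lam=0.5`, the helper names, and the tiny Roman Urdu training set are all assumptions made for illustration. Each class (here, an age group) is treated as one merged "document", and a new text is assigned to the class that scores highest.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization only: no translation or language-specific tools.
    return text.lower().split()

def build_class_models(labelled_docs):
    """Merge all training texts of a label into one term-count 'document'."""
    models = {}
    for label, text in labelled_docs:
        models.setdefault(label, Counter()).update(tokenize(text))
    return models

def vsm_predict(models, text):
    """Vector Space Model: TF-IDF vectors compared by cosine similarity."""
    n = len(models)
    df = Counter(term for counts in models.values() for term in counts)
    # Smoothed IDF (an assumption; the paper's weighting is not stated).
    idf = {t: math.log((n + 1) / (d + 1)) + 1.0 for t, d in df.items()}

    def vec(counts):
        # Terms never seen in training have no IDF and are dropped.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query = vec(Counter(tokenize(text)))
    return max(models, key=lambda label: cosine(query, vec(models[label])))

def ql_predict(models, text, lam=0.5):
    """Query likelihood with Jelinek-Mercer smoothing (assumed, lam=0.5)."""
    collection = Counter()
    for counts in models.values():
        collection.update(counts)
    total = sum(collection.values())

    def log_score(counts):
        size = sum(counts.values())
        score = 0.0
        for term in tokenize(text):
            p_coll = collection[term] / total
            if p_coll == 0:
                continue  # term unseen in all classes: carries no evidence
            p_doc = counts[term] / size
            score += math.log(lam * p_doc + (1 - lam) * p_coll)
        return score

    return max(models, key=lambda label: log_score(models[label]))

# Invented toy data; the age-group labels are hypothetical stand-ins.
train = [
    ("15-19", "kal school mein exam hai"),
    ("15-19", "exam ki tayari school homework"),
    ("20-25", "office mein meeting hai aaj"),
    ("20-25", "job interview office kal"),
]
models = build_class_models(train)

for sample in ["school ka exam", "office meeting kal"]:
    print(sample, vsm_predict(models, sample), ql_predict(models, sample))
```

Both functions work directly on surface tokens, which is why this style of model needs no conversion of Roman Urdu to English. Smoothing matters for query likelihood because a single term missing from a class would otherwise drive that class's probability to zero.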
Author information
Authors and Affiliations
- City University of Science and Information Technology, Peshawar, Pakistan: Zarah Zainab
- College of Technological Innovation, Zayed University, Abu Dhabi, UAE: Feras Al-Obeidat
- REMIT, IJP, Universidade Portucalense, Porto, Portugal: Fernando Moreira
- IEETA, Universidade de Aveiro, Aveiro, Portugal: Fernando Moreira
- Center for Excellence in Information Technology, Institute of Management Sciences, Peshawar, Pakistan: Haji Gul & Adnan Amin
Corresponding author
Correspondence to Fernando Moreira.
Editor information
Editors and Affiliations
- Institute of Management Sciences, Peshawar, Pakistan: Sajid Anwar
- School of Mathematical and Computer Science, Heriot-Watt University, Dubai, United Arab Emirates: Abrar Ullah
- University of Lisbon, Lisbon, Portugal: Álvaro Rocha
- University Institute of Lisbon (ISCTE), Lisbon, Portugal: Maria José Sousa
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zainab, Z., Al-Obeidat, F., Moreira, F., Gul, H., Amin, A. (2023). Comparative Analysis of Machine Learning Algorithms for Author Age and Gender Identification. In: Anwar, S., Ullah, A., Rocha, Á., Sousa, M.J. (eds) Proceedings of International Conference on Information Technology and Applications. Lecture Notes in Networks and Systems, vol 614. Springer, Singapore. https://doi.org/10.1007/978-981-19-9331-2_11
- DOI: https://doi.org/10.1007/978-981-19-9331-2_11
- Published: 19 May 2023
- Publisher Name: Springer, Singapore
- Print ISBN: 978-981-19-9330-5
- Online ISBN: 978-981-19-9331-2
- eBook Packages: Intelligent Technologies and Robotics (R0)