Vol. 10 (2023)

Using keywords in the automatic classification of language of gender violence

Héctor Castro Mosqueda
Escuela Normal Superior Oficial de Guanajuato
Antonio Rico Sulayes
Universidad de las Américas Puebla
Published March 1, 2023

Keywords:

Corpus Linguistics; Automatic Text Classification; Sexist Language Detection
How to cite
Castro Mosqueda, H., & Rico Sulayes, A. (2023). Using keywords in the automatic classification of language of gender violence. CHIMERA: Revista De Corpus De Lenguas Romances Y Estudios Lingüísticos, 10, 19–43. https://doi.org/10.15366/chimera2023.10.002


This paper employs lexical analysis tools, quantitative processing methods, and natural language processing procedures to analyze language samples and identify lexical items that support automatic topic detection. It discusses how keyword extraction, a technique from corpus linguistics, can be used to obtain features that improve automatic classification; in particular, it is concerned with extracting keywords from a corpus obtained from social networks. The corpus consists of 1,841,385 words and is subdivided into three sub-corpora, categorized according to the topic of the comments in each. The three topics are violence against women, violence against the LGBT community, and violence in general. The corpus was obtained by scraping comments from YouTube videos that address issues such as street harassment, femicide, feminist movements, drug trafficking, forced disappearances, and equal marriage, among others. In the topic detection tasks performed with this corpus, keywords yielded 98% accuracy when classifying the collections of comments from 51 videos into one of the three categories mentioned above, and 92% when classifying almost 7,500 comments individually. When all words, rather than only keywords, were used as features, accuracy dropped by an average of 17%. These results support the argument for keyword relevance in automatic topic detection.
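
As a rough illustration of the kind of keyword extraction described above, the following minimal Python sketch scores the words of a target sub-corpus against a reference sub-corpus and turns the resulting keywords into count features. It assumes log-likelihood as the keyness statistic, which the abstract does not specify; the function names, the `top_n` parameter, and the toy tokens are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness (an assumed statistic, not confirmed by the paper).

    a: frequency of the word in the target corpus
    b: frequency of the word in the reference corpus
    c: total tokens in the target corpus
    d: total tokens in the reference corpus
    """
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_target)
    if b > 0:
        score += b * math.log(b / expected_reference)
    return 2 * score


def extract_keywords(target_tokens, reference_tokens, top_n=10):
    """Return up to top_n words that are unusually frequent in the target corpus."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        # Keep only words that are relatively more frequent in the target corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]


def keyword_features(tokens, keywords):
    """Represent a comment as a vector of keyword counts, in keyword order."""
    counts = Counter(tokens)
    return [counts[k] for k in keywords]


# Invented toy data, for illustration only.
target = ["acoso"] * 5 + ["calle"] * 3 + ["la"] * 10
reference = ["la"] * 10 + ["droga"] * 5 + ["calle"] * 3
keywords = extract_keywords(target, reference)
features = keyword_features(["acoso", "la", "acoso"], keywords)
```

Feature vectors of this kind could then be passed to any standard classifier; which classifier and which keyness threshold the authors used is not stated in the abstract.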



