Corpus Linguistics and Artificial Intelligence

doi:10.1590/1678-460x202541474063

Authors

Pontifícia Universidade Católica de São Paulo - PUC-SP

Keywords:

Corpus Linguistics, Artificial Intelligence, Register, Multidimensional Analysis

Abstract

This article argues that a register-based Multi-Dimensional (MD) description is a suitable route for characterizing AI-generated language in corpus linguistics. The argument is illustrated with two sample studies: a grammar-oriented investigation that applies traditional MD analysis to English-as-a-foreign-language (EFL) textbook texts, and a discourse-oriented investigation that uses lexical MD analysis to explore AI-generated pop music lyrics. In both studies, the results reveal sharp differences between AI-generated and human language. In the EFL study, AI-written texts are more informational, abstract, and impersonal, whereas human texts display interpersonal awareness, stance, and engagement. In the pop lyrics, AI generates moralized empowerment discourses that recast historical conflicts depicted in rap music as generalized virtue narratives. Across the two studies, AI shows signs of register deficit (a limited awareness of register variation due to shallow knowledge of the linguistic constituency of human registers) and register metamorphosis (generation of texts that resemble one register on the surface but are realized linguistically as another).
