Building a large-scale corpus of bilingual child speech and bilingual child-directed speech, annotated and aligned with audio files: challenges, solutions, and implications for research

Authors

  • Alex Lưu, Brandeis University, Michtom School of Computer Science, Waltham, Massachusetts, USA
  • Pavel Koval, University of Connecticut, Department of Linguistics, Storrs, Connecticut
  • Sophia Malamud, Brandeis University, Michtom School of Computer Science, the Linguistics Program, Waltha
  • Irina Dubinina, Brandeis University

Keywords:

Russian speech corpus, Disfluency annotation, Morphological tagging, Syntactic parsing, Bilingual speakers, Heritage speakers

Abstract

The BiRCh project (The Corpus of Bilingual Russian Child Speech) is building a longitudinal corpus of recordings of Russian speech produced by children and their families in Russia, Ukraine, Germany, the USA, and Canada. From these recordings we are constructing a large-scale corpus, the ‘Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)’, with two basic components: (1) one million words of text-searchable transcripts aligned with the audio files, and (2) 500,000 words of morphologically tagged and syntactically parsed transcripts, also aligned with the audio files. We are using the corpus to investigate a range of phenomena in the linguistic input and in the developmental trajectory of heritage speakers, such as the use of case, gender, passive and impersonal constructions, politeness markers, disfluencies, and discourse markers. This article focuses on the challenges and solutions in building BiRCh and on the implications for research based on the richly annotated data that the corpus provides.

Published

2022-10-28

How to Cite

Lưu, A., Koval, P., Malamud, S., & Dubinina, I. (2022). A construção de corpus de larga escala da fala bilíngue de crianças e da fala bilíngue dirigida à criança, anotado e alinhado aos arquivos de áudio: desafios, soluções e implicações para a pesquisa [Building a large-scale corpus of bilingual child speech and bilingual child-directed speech, annotated and aligned with audio files: challenges, solutions, and implications for research]. Bakhtiniana. Revista De Estudos Do Discurso, 17(4), Port. 223–261 / Eng. 229. Retrieved from https://revistas.pucsp.br/index.php/bakhtiniana/article/view/55831

Section

Articles