Creating a Large-Scale Audio-Aligned Parsed Corpus of Bilingual Russian Child and Child-Directed Speech (BiRCh): Challenges, Solutions, and Implications for Research
Keywords:
Spoken Russian corpus, Disfluency annotation, Morphological tagging, Syntactic parsing, Bilingual and heritage speakers
Abstract
The BiRCh Project (The Corpus of Bilingual Russian Child Speech) is collecting a longitudinal audio corpus of Russian spoken by children and their families in Russia, Ukraine, Germany, the U.S., and Canada. From a subset of these data we are building a large-scale corpus, the “Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh),” with two basic components: (1) transcripts totaling 1 million words, time-aligned with the audio speech signal and fully text-searchable, and (2) a 500K-word morphologically annotated and parsed portion of the transcripts, also audio-aligned. We are using this corpus to investigate various phenomena in the linguistic input and the developmental trajectory of heritage bilinguals, e.g., case, gender, passives, impersonals, politeness markers, disfluencies, and discourse markers. This article focuses on the challenges encountered in developing BiRCh, the solutions adopted, and the implications of the corpus's richly annotated data for research.
License
Copyright (c) 2022 Bakhtiniana. Revista de Estudos do Discurso
This work is licensed under a Creative Commons Attribution 4.0 International License.
The authors grant the journal all copyrights relating to the published work. The concepts expressed in signed articles are the sole and exclusive responsibility of their authors.