The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts

Marcio Lima Inácio; Marco Antonio Sobrevilla Cabezudo; Renata Ramisch; Ariani Di Felippo; Thiago Alexandre Salgueiro Pardo

doi:10.1590/1678-460X202339355159

Authors

Marcio Lima Inácio Instituto de Ciências Matemáticas e de Computação. Universidade de São Paulo.
Marco Antonio Sobrevilla Cabezudo Instituto de Ciências Matemáticas e de Computação. Universidade de São Paulo. https://orcid.org/0000-0001-7625-9914
Renata Ramisch https://orcid.org/0000-0003-3372-6150
Ariani Di Felippo Departamento de Linguística. Universidade Federal de São Carlos. https://orcid.org/0000-0002-4566-9352
Thiago Alexandre Salgueiro Pardo Instituto de Ciências Matemáticas e de Computação. Universidade de São Paulo. https://orcid.org/0000-0003-2111-1319

DOI:

https://doi.org/10.1590/1678-460X202339355159

Keywords:

corpus annotation, knowledge representation, semantics

Abstract

Abstract Meaning Representation (AMR) is one of the most popular semantic representation languages in Natural Language Processing (NLP). The formalism encodes the meaning of a single sentence as a directed, rooted graph. For English, a large annotated corpus provides high-quality, reusable data for building new NLP methods and applications or improving existing ones. For other languages, including Brazilian Portuguese, AMR corpora have been built with both automatic and manual strategies. The automatic methods rely essentially on the cross-linguistic alignment of parallel corpora, through which the target-language sentences inherit the AMR annotation. The manual strategies focus on adapting the English AMR guidelines to the target language. Both strategies must deal with a number of challenging linguistic phenomena. This paper explores in detail some characteristics of Portuguese for which the AMR model had to be adapted, and it introduces two annotated corpora: AMRNews, a corpus of 870 annotated sentences from journalistic texts, and OpiSums-PT-AMR, which comprises 404 opinionated sentences annotated in AMR.
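To make the phrase "directed rooted graph" concrete, the following is a minimal Python sketch, not the authors' tooling. It encodes the AMR of the English sentence "The boy wants to go", an example widely used in the AMR literature and not taken from AMRNews or OpiSums-PT-AMR. The triple layout and the helper names (`reachable_from`, `is_rooted`, `to_penman`) are assumptions made for this illustration.

```python
from collections import defaultdict

# AMR for "The boy wants to go" as (source, relation, target) triples.
# Variables (w, b, g) are graph nodes; `instances` maps each node to its concept.
# These names and this data layout are illustrative assumptions.
instances = {"w": "want-01", "b": "boy", "g": "go-01"}
edges = [
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # what is wanted is the going event
    ("g", ":ARG0", "b"),  # re-entrancy: the same boy is the one who goes
]
root = "w"


def reachable_from(start, edges):
    """Return the set of nodes reachable from `start` following edge direction."""
    children = defaultdict(list)
    for source, _relation, target in edges:
        children[source].append(target)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return seen


def is_rooted(start, instances, edges):
    """The graph is rooted at `start` if every node can be reached from it."""
    return reachable_from(start, edges) == set(instances)


def to_penman(node, instances, edges, visited=None):
    """Serialize the graph in PENMAN-style bracket notation, starting at `node`."""
    visited = set() if visited is None else visited
    if node in visited:
        return node  # re-entrant node: refer back to the variable only
    visited.add(node)
    parts = [f"{node} / {instances[node]}"]
    for source, relation, target in edges:
        if source == node:
            parts.append(f"{relation} {to_penman(target, instances, edges, visited)}")
    return "(" + " ".join(parts) + ")"


print(is_rooted(root, instances, edges))
print(to_penman(root, instances, edges))
```

The node `b` has two incoming edges, which is why AMR is a graph rather than a tree: a single entity can fill roles in several events of the same sentence.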

