A Parser for News Downloads

Authors

  • Mike Scott, Aston University - School of Languages & Social Sciences - Birmingham

Keywords:

News corpus, Corpus clean-up, Duplicate texts, Building sub-corpora

Abstract

This paper presents the Download Parser, a tool for handling text downloads from large online databases. Many universities have access to full-text databases that let the user search their holdings and then view, and ideally download, the full text of relevant articles. In practice, however, managing such downloads raises important problems, caused by factors such as duplication, uneven formatting standards and lack of documentation. The tool under discussion was devised to parse downloads, clean them up and standardise them, identify headlines, and insert suitably marked-up headers for corpus analysis.
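
The steps named in the abstract (splitting a download, cleaning it up, removing duplicates, identifying headlines and inserting headers) can be illustrated with a minimal Python sketch. It is not the Download Parser itself and makes several assumptions that do not come from the paper: articles are separated by a hypothetical "Document N of M" line, the first non-empty line of each article is taken to be its headline, only exact duplicates (after whitespace and case normalisation) are dropped, and the `<header>` and `<headline>` tags are invented for illustration.

```python
import hashlib
import re

# Hypothetical article delimiter; each real database uses its own conventions.
SEPARATOR = re.compile(r"^[ \t]*Document \d+ of \d+[ \t]*$", re.MULTILINE)


def normalise(text: str) -> str:
    """Collapse runs of whitespace inside each line and drop blank lines."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def parse_download(raw: str) -> list[str]:
    """Split a download into articles, drop exact duplicates, add headers."""
    seen = set()
    articles = []
    for chunk in SEPARATOR.split(raw):
        body = normalise(chunk)
        if not body:
            continue  # nothing between two delimiters
        # Hash the case-folded text so repeated copies are detected cheaply.
        digest = hashlib.sha1(body.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an article already kept
        seen.add(digest)
        # Assumption: the first non-empty line is the headline.
        headline, _, rest = body.partition("\n")
        articles.append(
            f"<header><headline>{headline}</headline></header>\n{rest}"
        )
    return articles
```

Calling `parse_download` on the raw text of one download returns a list of cleaned articles, each preceded by its header line. Near-duplicates, such as the same story reissued with small edits, would need a fuzzier comparison than the exact hash used here.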

How to Cite

Scott, M. (2018). A Parser for News Downloads. DELTA: Documentação E Estudos Em Linguística Teórica E Aplicada, 34(1). Retrieved from https://revistas.pucsp.br/index.php/delta/article/view/38974

Section

Articles